Model Review

ARC-AGI-2 leaderboard: Which models reason best?

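ARC-AGI-2 tests fluid intelligence and abstract reasoning. Gemini 3.1 Pro holds the only confirmed score in our table at 77.1%; the other rows are estimates between roughly 58% and 75%, and live leaderboards reportedly put GPT-5.5 ahead. We analyse what the benchmark reveals about model cognition.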

Daniel Fleuren · 2026-06-15 · 11 min read · Founders and operators · Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision: Shortlist
Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch: Shelfware
A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect: Pilot score
Run one real task through each shortlisted tool and record quality, time saved, and support burden.

Key takeaways

[ARC-AGI-2](https://arcprize.org/blog/arc-agi-2-technical-report) (Abstract Reasoning Corpus for Artificial General Intelligence) is built to test fluid intelligence: solving a fresh problem without leaning on memorised patterns or training data.
Its visual and logical puzzles call for spotting patterns in unfamiliar domains, inferring abstract rules, reasoning by analogy across different representations, and composing simple rules into a solution for a harder problem.
Only one figure in the June 2026 table is a real, measured ARC-AGI-2 score: Gemini 3.1 Pro at 77.1%. Every other ARC-AGI-2 percentage is an author estimate.
The estimates can be badly wrong for individual models: live leaderboards reportedly show GPT-5.5 at around 85% against an estimate of ~69%, and Grok 4 far below its ~67% estimate.
When the work is genuinely novel (research, open-ended problem-solving, strategic analysis), ARC-AGI-2 is a better guide to a model's usefulness than MMLU.

ARC-AGI-2 leaderboard: Which models reason best?

ARC-AGI-2 (Abstract Reasoning Corpus for Artificial General Intelligence) is built to test fluid intelligence: the knack for solving a fresh problem without leaning on memorised patterns or training data. MMLU checks what a model knows. ARC-AGI-2 checks whether it can actually work something out. The results separate the models that think from the ones that recall.

What ARC-AGI-2 measures

ARC-AGI-2 throws visual and logical puzzles at a model that call for:

Spotting patterns in unfamiliar domains
Inferring abstract rules
Reasoning by analogy across different representations
Composing simple rules into a solution for a harder problem

The tasks are deliberately built to defeat memorisation. A model can't coast by matching something similar from its training set. It has to reason from the ground up.
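To make "composing simple rules" concrete, here is a toy sketch in Python. It is not an actual ARC-AGI-2 task and not how any model on the leaderboard works: the small integer grids, the four rule names and the brute-force search are all assumptions made for illustration. The hidden transformation is a chain of two simple rules, and no single rule explains the examples on its own.

```python
from itertools import product
from typing import Callable, Optional

Grid = list[list[int]]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(column) for column in zip(*grid)]


def swap_colours(grid: Grid) -> Grid:
    """Exchange colours 1 and 2, leaving every other value alone."""
    mapping = {1: 2, 2: 1}
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


# Hypothetical rule library; real tasks are not limited to a fixed menu like this.
RULES: dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "swap_colours": swap_colours,
}


def apply_chain(names: tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named rules in order."""
    for name in names:
        grid = RULES[name](grid)
    return grid


def find_chain(examples: list[tuple[Grid, Grid]], max_len: int = 2) -> Optional[tuple[str, ...]]:
    """Return the shortest rule chain that reproduces every example, or None."""
    for length in range(1, max_len + 1):
        for names in product(RULES, repeat=length):
            if all(apply_chain(names, inp) == out for inp, out in examples):
                return names
    return None


# Two demonstration pairs (input, output) for the made-up puzzle.
EXAMPLES = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 2], [0, 1, 0]]),
    ([[1, 1, 0], [0, 0, 2]], [[0, 2, 2], [1, 0, 0]]),
]

chain = find_chain(EXAMPLES)
print(chain)                                  # ('flip_horizontal', 'swap_colours')
print(apply_chain(chain, [[2, 0], [0, 1]]))   # [[0, 1], [2, 0]]
```

The search only succeeds because the answer happens to sit inside a four-rule menu. The benchmark's tasks are built so that no such menu can be prepared in advance, which is why memorisation does not help.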

The ARC-AGI-2 leaderboard (June 2026)

A note before the numbers, because it matters: only one figure in this table is a real, measured ARC-AGI-2 score. That is Gemini 3.1 Pro at 77.1%, confirmed across public leaderboards (llm-stats). Every other ARC-AGI-2 percentage below is an author estimate, derived from how MMLU and ARC-AGI-2 scores have tended to track each other. They are not benchmark results, and some of them are wide of the mark when checked against live data (more on that below). Treat the asterisked rows as a rough ordering, not a scoreboard.

Rank	Model	ARC-AGI-2	MMLU	Context (tokens)	Price per 1M tokens (In / Out)
1	Gemini 3.1 Pro	77.1%	88.1%	1M	$3.50 / $10.50
2	Claude Fable 5	~75%*	92.1%	1M	$10.00 / $50.00
3	Claude Opus 4.8	~72%*	89.8%	1M	$5.00 / $25.00
4	GPT-5.5 Pro	~71%*	89.7%	400K	$8.00 / $40.00
5	Claude Opus 4.7	~70%*	89.2%	1M	$5.00 / $25.00
6	GPT-5.5	~69%*	88.4%	400K	$5.00 / $30.00
7	Claude Sonnet 4.6	~68%*	87.6%	1M	$3.00 / $15.00
8	Grok 4	~67%*	87.2%	256K	$5.00 / $25.00
9	Gemini 3.5 Flash	~66%*	86.8%	1M	$0.35 / $0.70
10	MiniMax M3	~65%*	86.4%	1M	$0.30 / $1.20
11	Kimi K2.7-Code	~64%*	85.7%	256K	$0.50 / $2.00
12	DeepSeek V3.5	~63%*	85.8%	1M	$0.15 / $0.60
13	GLM-5.2	~63%*	85.2%	256K	$0.80 / $2.40
14	Mistral Large 2	~62%*	85.1%	256K	$2.00 / $6.00
15	Llama 4	~62%*	84.8%	256K	Free
16	Qwen 3	~61%*	84.6%	128K	$0.40 / $1.20
17	GPT-5.5 Instant	~58%*	84.2%	128K	$0.50 / $1.50

*Estimated from the correlation between MMLU and ARC-AGI-2 performance. Only Gemini 3.1 Pro's 77.1% is a confirmed benchmark score.

A few honest caveats on this table. Several models in it (the various Opus 4.7/4.8 and GPT-5.5 variants, MiniMax M3, Kimi K2.7, GLM-5.2 and others) carry MMLU and pricing figures we have not individually checked, and some of those models may be unreleased. Claude Fable 5 is real: Anthropic shipped it on 9 June 2026 at $10 in / $50 out per million tokens, so the listed numbers there are right. The Gemini 3.1 Pro pricing in the table ($3.50 / $10.50) does not match what is reported publicly: OpenRouter lists roughly $2.00 input / $12.00 output per million tokens for prompts under 200K tokens, rising above that for longer context. The ~1M context window is about right.
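Whether the Gemini 3.1 Pro pricing mismatch matters depends on your mix of input and output tokens. The sketch below compares the two price pairs quoted above on three workloads. The workload names and token counts are hypothetical, the OpenRouter figures are the approximate under-200K prices, and the function name is ours rather than any provider's API.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request, given prices per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


TABLE_PRICE = (3.50, 10.50)        # in / out, as listed in the table
OPENROUTER_PRICE = (2.00, 12.00)   # in / out, approximate, prompts under 200K tokens

# Hypothetical workloads: (input tokens, output tokens) per request.
WORKLOADS = {
    "input-heavy": (150_000, 5_000),
    "balanced": (50_000, 50_000),
    "output-heavy": (5_000, 50_000),
}

for name, (tokens_in, tokens_out) in WORKLOADS.items():
    table_cost = request_cost(tokens_in, tokens_out, *TABLE_PRICE)
    openrouter_cost = request_cost(tokens_in, tokens_out, *OPENROUTER_PRICE)
    print(f"{name:13s} table ${table_cost:.4f}  openrouter ${openrouter_cost:.4f}")
```

With these two price pairs the totals are equal when input and output token counts match, because the input price is $1.50 lower and the output price $1.50 higher. Input-heavy work comes out cheaper at the OpenRouter figures, and output-heavy work cheaper at the table figures.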

Where the estimates fall down

This is the part to be straight about. The single confirmed score, Gemini 3.1 Pro at 77.1%, is genuinely strong. But the article's original framing put Gemini 3.1 Pro at the top of the pile as the June 2026 reasoning leader, and live leaderboards don't back that up. As of June 2026, BenchLM.ai and llm-stats show GPT-5.5 leading ARC-AGI-2 at around 85%, with a GPT-5.4 Pro also reportedly ahead of Gemini. On that reading Gemini 3.1 Pro sits second or third, not first.

The estimated rows have the same problem in miniature. The GPT-5.5 estimate of ~69% lands well below its reported ~85%. The Grok 4 estimate of ~67% is the starkest miss: live data points to something closer to 15.9% on llm-stats, with a separate "Grok 4.20" entry at 53.3% on BenchLM. Neither is anywhere near 67%. So when you read down the table, read the asterisks as a reminder that MMLU-to-ARC-AGI-2 extrapolation can be badly wrong for individual models.
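The size of those misses is easy to check. The sketch below uses only the figures quoted above: our estimates, and the reported scores for GPT-5.5 and for Grok 4 on llm-stats. The 5-point tolerance and the class and function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreCheck:
    model: str
    estimate: float   # author estimate from the table, in percent
    reported: float   # figure reported on a live leaderboard, in percent
    source: str

    @property
    def gap(self) -> float:
        """Reported minus estimated, in percentage points."""
        return self.reported - self.estimate


CHECKS = [
    ScoreCheck("GPT-5.5", 69.0, 85.0, "BenchLM.ai / llm-stats"),
    ScoreCheck("Grok 4", 67.0, 15.9, "llm-stats"),
]


def flag_misses(checks: list[ScoreCheck], tolerance: float = 5.0) -> list[ScoreCheck]:
    """Return the checks whose estimate is off by more than the tolerance."""
    return [check for check in checks if abs(check.gap) > tolerance]


mean_abs_error = sum(abs(check.gap) for check in CHECKS) / len(CHECKS)

for check in flag_misses(CHECKS):
    print(f"{check.model}: estimate {check.estimate}%, reported {check.reported}% "
          f"({check.gap:+.1f} points, {check.source})")
print(f"Mean absolute error: {mean_abs_error:.2f} points")
```

Both rows are flagged, and the two misses run in opposite directions (about 16 points too low for GPT-5.5, about 51 points too high for Grok 4), so the extrapolation is not simply biased one way.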

Why reasoning matters

A high ARC-AGI-2 score tends to travel with:

Stronger results on genuinely new problems that aren't in the training data
More dependable multi-step deduction
Less sensitivity to how a prompt is worded
Better transfer into unfamiliar domains

When the work in front of you is genuinely novel (research, open-ended problem-solving, strategic analysis), ARC-AGI-2 is a better guide to a model's usefulness than MMLU. MMLU rewards recall; ARC-AGI-2 rewards working it out.

Verdict

Here is the measured version, without the hype. Gemini 3.1 Pro's confirmed 77.1% on ARC-AGI-2 is a serious reasoning result and a fair reason to shortlist it for reasoning-heavy work. It is not, on the public 2026 leaderboards, the outright leader: GPT-5.5 (around 85%) sits ahead of it, so calling any one model the "champion with no equal" overstates the case.

For knowledge work, coding or general Q&A, other models may give you better value per dollar. For tasks that hinge on genuine abstract reasoning (puzzles, novel research, creative synthesis), Gemini 3.1 Pro is a strong pick. Just check the current ARC Prize leaderboard before you commit, because the top of this list moves.

Strong on reasoning (confirmed): Gemini 3.1 Pro at 77.1%, though GPT-5.5 reportedly leads the live ARC-AGI-2 board.

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
