Model Review

Gemini 3.1 Pro review: ARC-AGI-2 at 77.1%.

Gemini 3.1 Pro launched 19 February 2026 with 54.2% SWE-bench Pro, 88.1% MMLU, and a standout 77.1% on ARC-AGI-2. We review Google's reasoning specialist.

Daniel Fleuren | 2026-06-15 | 11 min read | Founders and operators | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

Gemini 3.1 Pro launched 19 February 2026 with 54.2% SWE-bench Pro, 88.1% MMLU, and a standout 77.1% on ARC-AGI-2. We review Google's reasoning specialist.

Key takeaways

Gemini 3.1 Pro review: Release date: 19 February 2026 | Status: Active | Licence: Closed. Google quietly shipped a model that does something most of its rivals still can't.
Benchmarks at a glance: SWE-bench Pro 54.2% (mid-tier coding); MMLU 88.1% (listed as 0.3 pts behind GPT-5.5); ARC-AGI-2 77.1% (outstanding); context window 1M tokens (best-in-class); listed price $3.50 input and $10.50 output per 1M tokens. The MMLU and price rows carry a caveat: neither figure could be confirmed.
The ARC-AGI-2 story: ARC-AGI-2 tests fluid intelligence: can a model solve a problem it has never seen, with no chance to lean on training data?
The coding paradox: For all that reasoning muscle, Gemini 3.1 Pro lands at just 54.2% on SWE-bench Pro.
Pricing analysis: The benchmark table lists $3.50 input and $10.50 output per million tokens, but that does not match any current listing.

Gemini 3.1 Pro review: ARC-AGI-2 at 77.1%

Release date: 19 February 2026 | Status: Active | Licence: Closed

Google quietly shipped a model that does something most of its rivals still can't. On 19 February 2026, Google DeepMind released Gemini 3.1 Pro, and the number that got everyone's attention was its score on a reasoning test built specifically to be hard to game: 77.1% on ARC-AGI-2.

Here's why that matters for a business reader who has no interest in benchmark trivia. Most AI tests can be passed by a model that has effectively seen the answers before. ARC-AGI-2 is built to block that. It throws problems at a model that aren't in any training set, so a high score points to actual problem-solving rather than good memory. Gemini 3.1 Pro got more of those right than almost anything else on the market.

The catch is that the same model is only middling at writing software. So you end up with a tool that can reason its way through a novel puzzle but stumbles on the day-to-day grind of production code. For Australian teams deciding where to spend their AI budget, that split is the whole story: pick this one for thinking, not for shipping code.

A note on naming before we go further. This review calls the model "Gemini 3.1 Pro", though most current availability is under the "Gemini 3.1 Pro Preview" label.

Benchmarks at a glance

Metric	Value	Assessment
SWE-bench Pro	54.2%	Mid-tier coding
MMLU	88.1%	Just 0.3 pts behind GPT-5.5
ARC-AGI-2	77.1%	Outstanding
Context window	1M tokens	Best-in-class
Price (input)	$3.50 / 1M tokens	Mid-premium
Price (output)	$10.50 / 1M tokens	Reasonable for tier

A caveat on two rows. The MMLU figure of 88.1% and the claim that it trails GPT-5.5 by 0.3 points could not be confirmed; current sources put Gemini 3.1 Pro at 90.99% on MMLU-Pro, a different test. And the pricing in the table is unconfirmed too. See the pricing section below for what the live listings actually say.

The ARC-AGI-2 story

ARC-AGI-2 tests fluid intelligence: can a model solve a problem it has never seen, with no chance to lean on training data? A 77.1% score suggests Gemini 3.1 Pro is doing real abstract reasoning rather than matching patterns it memorised earlier. In practice that shows up in:

Novel mathematical proofs and derivations
Abstract logical puzzles
Creative problem-solving with minimal examples
Transfer learning across domains

What makes the score worth a second look is how the test was built. ARC-AGI-2 was designed to resist memorisation and shortcut pattern-matching. Score well on it and you're reasoning, not recalling.

The coding paradox

For all that reasoning muscle, Gemini 3.1 Pro lands at just 54.2% on SWE-bench Pro. That puts it 15 percentage points below Opus 4.8 at 69.2%, and at best roughly level with Sonnet 4.6, which sources put anywhere from 53% to 58% (the exact figure varies by source and could not be pinned down). The gap between abstract reasoning and shipping software is real here. The model handles a clean logic puzzle but struggles with the messy, specification-heavy work of production code.

Pricing analysis

The benchmark table above lists $3.50 input and $10.50 output per million tokens, but that does not match any current listing. Live pricing on Artificial Analysis and elsewhere shows Gemini 3.1 Pro Preview at roughly $2.00 input and $12.00 output per million tokens, with the rate doubling above 200K tokens. Treat the $3.50/$10.50 figures as unconfirmed.

For reference, Sonnet 4.6 runs $3 input and $15 output per million tokens, and GPT-5.5 runs $5 input and $30 output. On the live-listing numbers, Gemini 3.1 Pro Preview undercuts both on input and on output, and the output saving matters most for high-output jobs like long-form content or verbose analysis.
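To see what those per-million rates mean for a real job, here is a minimal Python sketch of the arithmetic. It uses only the prices quoted in this review, assumes flat rates (it ignores the higher long-context tier), and the function names and the 50,000-in / 20,000-out workload are illustrative assumptions, not anything published by a vendor. Check live listings before budgeting.

```python
# Prices in USD per 1M tokens (input, output), as quoted in this review.
# These are not authoritative; verify against live listings before use.
PRICES = {
    "Gemini 3.1 Pro Preview (live listing)": (2.00, 12.00),
    "Gemini 3.1 Pro (table figures, unconfirmed)": (3.50, 10.50),
    "Sonnet 4.6": (3.00, 15.00),
    "GPT-5.5": (5.00, 30.00),
}


def job_cost(input_tokens, output_tokens, input_price, output_price):
    """Return the USD cost of one job at flat per-million-token rates."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def compare(input_tokens, output_tokens):
    """Return {model name: USD cost} for the same job across all listed prices."""
    return {
        name: round(job_cost(input_tokens, output_tokens, inp, out), 4)
        for name, (inp, out) in PRICES.items()
    }


if __name__ == "__main__":
    # Hypothetical workload: a 50,000-token brief producing a 20,000-token report.
    for name, cost in compare(50_000, 20_000).items():
        print(f"{name}: ${cost:.2f}")
```

On that hypothetical job the live Gemini listing works out to $0.34, against $0.45 for Sonnet 4.6 and $0.85 for GPT-5.5. The unconfirmed table figures would give about $0.39, so the ranking holds either way for this mix of input and output.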

Verdict

Reach for Gemini 3.1 Pro when reasoning is the job. The ARC-AGI-2 result is the standout, and the 1M-token context window gives you room to work. The soft coding score keeps it off the shortlist for software engineering, but for research, analysis and hard problem-solving, few models do it better.

Score: 8.1 / 10

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

Google Gemini API documentation

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
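If a spreadsheet feels too loose for the scoring step, the same idea fits in a few lines of Python. This is a sketch only: the weights, the 1 to 5 scale, and the tool names "Tool A" and "Tool B" are made-up placeholders, so set your own before relying on the ranking.

```python
# Hypothetical weights for the four criteria named above; they sum to 1.0.
CRITERIA_WEIGHTS = {
    "workflow_fit": 0.40,
    "data_handling": 0.25,
    "cost": 0.20,
    "owner_readiness": 0.15,
}

# Placeholder scores on a 1 (poor) to 5 (strong) scale.
TOOLS = {
    "Tool A": {"workflow_fit": 4, "data_handling": 3, "cost": 5, "owner_readiness": 2},
    "Tool B": {"workflow_fit": 5, "data_handling": 4, "cost": 2, "owner_readiness": 4},
}


def weighted_score(scores, weights=CRITERIA_WEIGHTS):
    """Return the weighted sum of a tool's scores; every criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


def rank(tools):
    """Return tool names ordered from highest to lowest weighted score."""
    return sorted(tools, key=lambda name: weighted_score(tools[name]), reverse=True)


if __name__ == "__main__":
    for name in rank(TOOLS):
        print(f"{name}: {weighted_score(TOOLS[name]):.2f}")
```

With these placeholder numbers Tool B ranks first, because workflow fit carries the most weight. Changing the weights changes the ranking, so agree on them before anyone scores a tool.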

Want help applying this? Explore the AI tools directory.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.