Gemini 3.1 Pro review: ARC-AGI-2 at 77.1%
Release date: 19 February 2026 | Status: Active | Licence: Closed
Google quietly shipped a model that does something most of its rivals still can't. On 19 February 2026, Google DeepMind released Gemini 3.1 Pro, and the number that got everyone's attention was its score on a reasoning test built specifically to be hard to game: 77.1% on ARC-AGI-2.
Here's why that matters for a business reader who has no interest in benchmark trivia. Most AI tests can be passed by a model that has effectively seen the answers before. ARC-AGI-2 is built to block that. It throws problems at a model that aren't in any training set, so a high score points to actual problem-solving rather than good memory. Gemini 3.1 Pro got more of those right than almost anything else on the market.
The catch is that the same model is only middling at writing software. So you end up with a tool that can reason its way through a novel puzzle but stumbles on the day-to-day grind of production code. For Australian teams deciding where to spend their AI budget, that split is the whole story: pick this one for thinking, not for shipping code.
A note on naming before we go further. This review calls the model "Gemini 3.1 Pro", though most current availability is under the "Gemini 3.1 Pro Preview" label.
Benchmarks at a glance
| Metric | Score | Context |
|---|---|---|
| SWE-bench Pro | 54.2% | Mid-tier coding |
| MMLU | 88.1% | Just 0.3 pts behind GPT-5.5 |
| ARC-AGI-2 | 77.1% | Outstanding |
| Context window | 1M tokens | Best-in-class |
| Price (input) | $3.50 / 1M tokens | Mid-premium |
| Price (output) | $10.50 / 1M tokens | Reasonable for tier |
A caveat on three rows. The MMLU figure of 88.1%, and the claim that it trails GPT-5.5 by 0.3 points, could not be confirmed; current sources put Gemini 3.1 Pro at 90.99% on MMLU-Pro, which is a different test. The two price rows are unconfirmed too. See the "Pricing analysis" section for what the live listings actually say.
The ARC-AGI-2 story
ARC-AGI-2 tests fluid intelligence: can a model solve a problem it has never seen, with no chance to lean on training data? A 77.1% score suggests Gemini 3.1 Pro is doing real abstract reasoning rather than matching patterns it memorised earlier. In practice that shows up in:
- Novel mathematical proofs and derivations
- Abstract logical puzzles
- Creative problem-solving with minimal examples
- Transferring what it learns in one domain to another
What makes the score worth a second look is how the test was built. ARC-AGI-2 was designed to resist memorisation and shortcut pattern-matching. Score well on it and you're reasoning, not recalling.
The coding paradox
For all that reasoning muscle, Gemini 3.1 Pro lands at just 54.2% on SWE-bench Pro. That puts it well below Opus 4.8 at 69.2%. It is reportedly behind Sonnet 4.6 as well, but sources place Sonnet 4.6 anywhere from about 53% to 58% and the exact figure could not be pinned down, so the two may be closer to level. The gap between abstract reasoning and shipping software is real here. The model handles a clean logic puzzle but struggles with the messy, specification-heavy work of production code.
Pricing analysis
The benchmark table lists $3.50 input and $10.50 output per million tokens, but that does not match any current listing. Live pricing on Artificial Analysis and elsewhere shows Gemini 3.1 Pro Preview at roughly $2.00 input and $12.00 output per million tokens, with the rate doubling above 200K tokens. Treat the $3.50/$10.50 figures as unconfirmed.
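To make the tiered pricing concrete, here is a minimal Python sketch of a per-request cost estimate using the live-listing rates quoted above. The function name `estimate_cost` is hypothetical, and the sketch assumes that once a prompt exceeds 200K tokens the doubled rate applies to every token in the request, input and output alike. The listings quoted here do not spell that out, so check the provider's pricing page before relying on it.

```python
# Rates in USD per 1M tokens, taken from the live listings quoted above.
BASE_INPUT_RATE = 2.00
BASE_OUTPUT_RATE = 12.00
LONG_CONTEXT_THRESHOLD = 200_000  # prompt size (tokens) above which rates double
LONG_CONTEXT_MULTIPLIER = 2


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request.

    Assumption (not confirmed by the source): when the prompt exceeds the
    threshold, the doubled rate applies to the whole request, not only to
    the tokens beyond the threshold.
    """
    is_long = input_tokens > LONG_CONTEXT_THRESHOLD
    multiplier = LONG_CONTEXT_MULTIPLIER if is_long else 1
    input_cost = input_tokens / 1_000_000 * BASE_INPUT_RATE * multiplier
    output_cost = output_tokens / 1_000_000 * BASE_OUTPUT_RATE * multiplier
    return round(input_cost + output_cost, 4)


if __name__ == "__main__":
    print(estimate_cost(100_000, 10_000))  # short prompt, base rates
    print(estimate_cost(400_000, 10_000))  # long prompt, doubled rates
```

Under these assumptions, a 100,000-token prompt with 10,000 output tokens comes to about $0.32, while a 400,000-token prompt with the same output comes to about $1.84. The long-context surcharge matters if you plan to lean on the 1M-token window.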
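For reference, Sonnet 4.6 runs $3 input and $15 output, and GPT-5.5 runs $5 input and $30 output. On the live-listing numbers, Gemini 3.1 Pro's $12 output price undercuts both, which helps for high-output jobs like long-form content or verbose analysis. That holds only for prompts under 200K tokens: if the doubled rate applies to output as well, it rises to about $24 and no longer beats Sonnet 4.6.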
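The comparison can be sketched as a quick calculation for an output-heavy job. This is an illustration only: the helper names `job_cost` and `compare` are hypothetical, the rates are the per-million-token figures quoted in this section, and it assumes every prompt stays under 200K tokens so that no long-context surcharge applies.

```python
# (input rate, output rate) in USD per 1M tokens, as quoted in this section.
PRICES = {
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
    "Sonnet 4.6": (3.00, 15.00),
    "GPT-5.5": (5.00, 30.00),
}


def job_cost(input_rate: float, output_rate: float,
             input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of a job at flat per-million-token rates."""
    total = input_tokens * input_rate + output_tokens * output_rate
    return round(total / 1_000_000, 2)


def compare(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Cost of the same job on each model, keyed by model name."""
    return {
        name: job_cost(in_rate, out_rate, input_tokens, output_tokens)
        for name, (in_rate, out_rate) in PRICES.items()
    }


if __name__ == "__main__":
    # Example workload: 50K tokens of prompts, 500K tokens of generated text.
    for model, cost in compare(50_000, 500_000).items():
        print(f"{model}: ${cost:.2f}")
```

For that example workload the sketch gives roughly $6.10 for Gemini 3.1 Pro Preview, $7.65 for Sonnet 4.6 and $15.25 for GPT-5.5. The more output-heavy the job, the wider the gap.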
Verdict
Reach for Gemini 3.1 Pro when reasoning is the job. The ARC-AGI-2 result is the standout, and the 1M-token context window gives you room to work. The soft coding score keeps it off the shortlist for software engineering, but for research, analysis and hard problem-solving, few models do it better.


