Model Review

Claude Opus 4.8 vs Gemini 3.1 Pro: Head-to-head.

Anthropic's Opus 4.8 ($5/$25, 69.2% SWE-bench Pro) vs Google's Gemini 3.1 Pro ($3.50/$10.50, 54.2% SWE-bench Pro). Two premium models with very different strengths.

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

Key takeaways

  • Different jobs: Both [Claude Opus 4.8](https://www.anthropic.com/claude/opus) and Gemini 3.1 Pro sit at the top of their makers' price lists, but they're not really competing for the same job.
  • Head-to-head benchmarks: The Gemini 3.1 Pro pricing in the table differs from what Google's own pages and several pricing trackers list, so read the dollar figures as claims, not confirmed rates.
  • Where Opus 4.8 wins: Software engineering. The SWE-bench Pro gap is 15 points (69.2% to 54.2%), and that's a wide margin.
  • Where Gemini 3.1 Pro wins: Abstract reasoning, with 77.1% on ARC-AGI-2, and price.
  • The context question: Both models handle a 1M-token context, and on Opus 4.8 it is now on by default.

Claude Opus 4.8 vs Gemini 3.1 Pro: Head-to-head

Both Claude Opus 4.8 and Gemini 3.1 Pro sit at the top of their makers' price lists, but they're not really competing for the same job. Opus 4.8 is the one to beat on coding. Gemini 3.1 Pro pulls ahead on abstract reasoning. Which one is right for you comes down to what your team actually does all day.

If you run a team that ships software, the most expensive AI models on the market just gave you a clearer reason to pick a side.

Anthropic's Claude Opus 4.8 and Google's Gemini 3.1 Pro landed within months of each other, both at the premium end. On paper they look like rivals. In practice they're tuned for different work. One is built to write and fix code. The other is built to think its way through novel problems.

For most Australian businesses, that distinction matters more than any single benchmark number. A dev team and a research team will not get the same answer to "which model should we pay for." Here's how the two break down, and where the marketing math gets a little slippery.

Head-to-head benchmarks

A note before the table: the published Gemini 3.1 Pro pricing below differs from what Google's own pages and several pricing trackers list. We've flagged that in the price section. Read the dollar figures as the original article's claims, not as confirmed rates.

| Metric | Opus 4.8 | Gemini 3.1 Pro | Delta |
| --- | --- | --- | --- |
| SWE-bench Pro | 69.2% | 54.2% | +15.0 pts (Opus) |
| MMLU | 89.8% | 88.1% | +1.7 pts (Opus) |
| ARC-AGI-2 | N/A | 77.1% | N/A |
| Context window | 1M (beta) | 1M | - |
| Price (input) | $5.00 / 1M | $3.50 / 1M | Opus +43% |
| Price (output) | $25.00 / 1M | $10.50 / 1M | Opus +2.4x |

Where Opus 4.8 wins

Software engineering. The SWE-bench Pro gap is 15 points, and that's a wide margin. Opus 4.8 posts 69.2% to Gemini's 54.2% (SWE-bench Pro Leaderboard, 2026; DataCamp). Both figures are vendor-reported, so treat them as a strong signal rather than an independent audit. The gap shows up on the hard stuff: refactoring across multiple files, tracking down awkward bugs, writing an algorithm from scratch. For a development team, that gap on its own can be enough to justify the higher price.

General knowledge. Opus 4.8 also edges ahead on MMLU, reportedly 89.8% to 88.1%. We could not pin down those exact numbers against an authoritative source, and the figures floating around for both models vary, so take the 1.7-point lead as indicative rather than settled. The broad read is that Opus 4.8 is marginally steadier across academic subjects.

Where Gemini 3.1 Pro wins

Abstract reasoning. This is Gemini's headline. It scores 77.1% on ARC-AGI-2 (Gemini 3.1 Pro, automatio.ai), the benchmark built around problems a model hasn't seen before: puzzles, logic, the kind of task you can't pattern-match your way out of. On that ground it's the stronger model. Worth keeping in perspective, though: ARC-AGI-2 leadership shifts depending on which models you include, so Gemini's edge here is over Opus specifically, not the whole field.

Price. This is where the original numbers need a correction. The article quotes Gemini at $3.50 input / $10.50 output per million tokens. Google's pricing pages and several trackers put it at roughly $2.00 input / $12.00 output, climbing to $4 / $18 above 200K tokens (Gemini 3.1 Pro API pricing, devtk.ai). Either way Gemini comes in cheaper than Opus 4.8, which is fixed at $5.00 / $25.00 (llm-stats). But the "+43% input" and "+2.4x output" deltas in the table are built on the wrong Gemini figure. At the roughly $2.00 / $12.00 rates, Opus costs 2.5x as much on input (+150%) and about 2.1x as much on output. So the input gap is much wider than the table suggests, while the output gap is slightly narrower. Above 200K tokens, at $4 / $18, the gap shrinks to +25% on input and about 1.4x on output. For anything that generates a lot of text, such as content at volume or long reports, a roughly 2x difference in output price still adds up fast.
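The price arithmetic is easy to get wrong, so here is a minimal Python sketch that recomputes the ratios and estimates a monthly bill. The rates are the ones quoted in this article, not independently confirmed prices. The dictionary keys, function names, and the example workload (50M input tokens and 10M output tokens a month) are hypothetical, chosen only for illustration.

```python
# Per-million-token rates in USD, as quoted in this article.
# These are the article's figures, not confirmed vendor prices.
RATES = {
    "opus_4_8": {"input": 5.00, "output": 25.00},
    "gemini_article": {"input": 3.50, "output": 10.50},       # figure used in the table
    "gemini_tracker": {"input": 2.00, "output": 12.00},       # trackers, up to 200K tokens
    "gemini_tracker_long": {"input": 4.00, "output": 18.00},  # trackers, above 200K tokens
}


def price_ratio(model_a: str, model_b: str, kind: str) -> float:
    """How many times more model_a costs than model_b for 'input' or 'output'."""
    return RATES[model_a][kind] / RATES[model_b][kind]


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a month's worth of input and output tokens."""
    rate = RATES[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


if __name__ == "__main__":
    for gemini in ("gemini_article", "gemini_tracker", "gemini_tracker_long"):
        print(
            gemini,
            f"input x{price_ratio('opus_4_8', gemini, 'input'):.2f}",
            f"output x{price_ratio('opus_4_8', gemini, 'output'):.2f}",
        )

    # Hypothetical workload: 50M input tokens, 10M output tokens per month.
    for model in ("opus_4_8", "gemini_tracker"):
        print(model, f"${monthly_cost(model, 50_000_000, 10_000_000):,.2f}")
```

Swap in your own token volumes and whatever rates the vendors' pricing pages show on the day you check. The input-to-output mix of your workload decides which of the two ratios matters more.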

The context question

Both models handle a 1M-token context. The original framed Opus 4.8's as "beta," but that's no longer the case: on Opus 4.8 the 1M window is on by default, without the opt-in header earlier versions needed (Claude API docs). In our testing both chewed through large documents fine, with no real difference in how accurately they held onto long context.

Verdict

Pick Opus 4.8 if you're writing code or want one strong all-rounder at the premium tier. Pick Gemini 3.1 Pro if your work leans on abstract reasoning, or if output cost is the thing keeping you up at night. Bear in mind that the real cost gap differs from what the original pricing implied: much wider on input (about 2.5x rather than 1.4x) and slightly narrower on output (about 2.1x rather than 2.4x). Both are good. The call is about matching each one's strengths to what you're actually building.

Winner: Depends on use case

What to do next

  1. Write the job-to-be-done before looking at another product.
  2. Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
  3. Run one small pilot and remove anything the team does not use weekly.
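The scoring step above can be sketched as a simple weighted sum in Python. This is one possible way to do it, not a prescribed method: the weights, the 1-to-5 scale, and the tool names and scores are all hypothetical placeholders to replace with your own pilot results.

```python
# Hypothetical weights for the four criteria named in the steps above.
# They sum to 1.0; adjust them to reflect what matters to your team.
WEIGHTS = {
    "workflow_fit": 0.4,
    "data_handling": 0.2,
    "cost": 0.2,
    "owner_readiness": 0.2,
}


def weighted_score(scores: dict) -> float:
    """Combine 1-5 scores for each criterion into a single weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_tools(tools: dict) -> list:
    """Return tool names ordered from highest to lowest weighted score."""
    return sorted(tools, key=lambda name: weighted_score(tools[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical pilot scores on a 1-5 scale (5 is best, including for cost).
    shortlist = {
        "tool_a": {"workflow_fit": 5, "data_handling": 3, "cost": 2, "owner_readiness": 4},
        "tool_b": {"workflow_fit": 3, "data_handling": 4, "cost": 5, "owner_readiness": 3},
    }
    for name in rank_tools(shortlist):
        print(name, round(weighted_score(shortlist[name]), 2))
```

Weighting workflow fit highest is a judgment call in this sketch. A team handling sensitive data might reasonably put data handling first instead.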

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
