
Model Review

Coding benchmarks: Which model writes the best code?

We rank 17 models by SWE-bench Pro score. Claude Fable 5 leads at 80.3%, followed by Claude Opus 4.8 at 69.2% and Claude Opus 4.7 at 63.8%; GPT-5.5 Pro is the top non-Anthropic model at 62.4%. The full coding leaderboard is below.


Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

[SWE-bench Pro](https://labs.scale.com/leaderboard/swe_bench_pro_public) has become the test that matters when you want to know whether an AI model can actually do software engineering work. It checks the real stuff: fixing bugs, building features, writing tests, and reviewing code across several languages and frameworks. Here is where things stand as of June 2026, with one big caveat about how these numbers get reported.


Analysis

If you manage a team that ships software, you have probably watched the parade of AI coding models and wondered which one is worth paying for. The marketing decks all say the same thing. Every model is the best at coding. They cannot all be right.

That is the problem SWE-bench Pro was built to solve. Instead of asking a model to autocomplete a function, it hands the model real tickets from real codebases (1,865 tasks pulled from 41 repositories across Python, Go, TypeScript and JavaScript) and checks whether the resulting code actually works (Scale Labs). It is closer to "can this thing do a junior engineer's day" than "can it pass a quiz."

But there is a catch worth knowing before you read a single percentage. Most of the headline scores come from the model makers themselves, run on their own setups. Neutral, apples-to-apples scores tend to land a fair bit lower. So treat the leaderboard below as a ranking with an asterisk, and we will point out where the vendor number and the standardised number part ways.

The short version for a busy team: Claude Opus 4.8 is the strongest model you can actually use right now, the open-weights field has gotten genuinely good and cheap, and the difference between the top closed model and a budget open one is smaller than the price tags suggest.

The SWE-bench Pro leaderboard

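| Rank | Model | SWE-bench Pro | Price (input / output, per 1M tokens) | Licence | Context |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | 80.3% | $10.00 / $50.00 | Closed | 1M |
| 2 | Claude Opus 4.8 | 69.2% | $5.00 / $25.00 | Closed | 1M |
| 3 | Claude Opus 4.7 | 63.8% | $5.00 / $25.00 | Closed | 1M |
| 4 | GPT-5.5 Pro | 62.4% | $8.00 / $40.00 | Closed | 400K |
| 5 | MiniMax M3 | 59.0% | $0.30 / $1.20 | Open | 1M |
| 6 | GPT-5.5 | 58.6% | $5.00 / $30.00 | Closed | 400K |
| 7 | Claude Sonnet 4.6 | 58.1% | $3.00 / $15.00 | Closed | 1M |
| 8 | Kimi K2.7-Code | 56.8% | $0.50 / $2.00 | Open | 256K |
| 9 | Grok 4 | 54.8% | $5.00 / $25.00 | Closed | 256K |
| 10 | Gemini 3.1 Pro | 54.2% | $3.50 / $10.50 | Closed | 1M |
| 11 | DeepSeek V3.5 | 52.4% | $0.15 / $0.60 | Open | 1M |
| 12 | GLM-5.2 | 51.4% | $0.80 / $2.40 | Open | 256K |
| 13 | Llama 4 | 50.2% | Free / Free | Open | 256K |
| 14 | Mistral Large 2 | 48.6% | $2.00 / $6.00 | Open | 256K |
| 15 | Gemini 3.5 Flash | 48.2% | $0.35 / $0.70 | Closed | 1M |
| 16 | Qwen 3 | 46.2% | $0.40 / $1.20 | Open | 128K |
| 17 | GPT-5.5 Instant | 42.1% | $0.50 / $1.50 | Closed | 128K |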

A word on that table before you act on it. The numbers above are mostly vendor-reported, meaning each company ran the test on its own tooling. Scale's standardised harness, which runs every model the same way, tells a less flattering story: its neutral leader is GPT-5.4 (xHigh) at 59.1%, well short of the vendor figures you see here (Scale Labs). Same benchmark, different plumbing, very different result. Read the ranking as "roughly who's ahead," not as a precise score you can quote to your CFO.

A few rows also deserve their own asterisks. Fable 5's chart-topping 80.3% comes from Anthropic's own launch materials using Anthropic's own scaffolding, and independent reviewers have called it contested; standardised leaderboards paint a more competitive picture (Morph LLM). The table's 54.2% for Gemini 3.1 Pro is also higher than the 46.1% standardised figure that turns up in the source data. Ten rows could not be confirmed against the leaderboards we checked (DeepSeek V3.5, GLM-5.2, Llama 4, Grok 4, Kimi K2.7-Code, Mistral Large 2, Qwen 3, GPT-5.5 Pro, GPT-5.5 Instant and Opus 4.7), so treat their exact percentages and prices as unconfirmed. One likely version mix-up is worth flagging: the sources reference GLM-5.1 at 58.4%, not GLM-5.2.

Opus 4.8's figures, by contrast, hold up: 69.2% vendor-reported, $5.00 input / $25.00 output per million tokens, 1M context, released 28 May 2026 (Finout). MiniMax M3's 59.0% also checks out, along with its 1M context and roughly $0.30/$1.20 launch pricing (Fello AI). GPT-5.5's 58.6% is corroborated across several sources too (Morph LLM).
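To make the vendor-versus-standardised gap concrete, here is a minimal sketch that computes it for the one row where this article quotes both numbers (Gemini 3.1 Pro: 54.2% in the table, 46.1% standardised). The function name `reporting_gap` and the choice to express the gap both in percentage points and as a share of the vendor figure are our own illustration, not part of any leaderboard's method.

```python
def reporting_gap(vendor_pct: float, standardised_pct: float) -> tuple[float, float]:
    """Return (gap in percentage points, gap as a % of the vendor score).

    Hypothetical helper for illustration only.
    """
    gap_points = round(vendor_pct - standardised_pct, 1)
    relative_pct = round(gap_points / vendor_pct * 100, 1)
    return gap_points, relative_pct


# Figures quoted in this article for Gemini 3.1 Pro.
gap, share = reporting_gap(vendor_pct=54.2, standardised_pct=46.1)
print(f"Gemini 3.1 Pro: {gap} points lower on the standardised run "
      f"(about {share}% of the table figure)")
```

The same calculation works for any model once you have both a vendor-reported and a standardised score for it.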

Tier analysis

The tiers below are our reading of the numbers, not an official published ranking. They are a sensible way to group models by what they can realistically handle, but they sit on top of scores that carry the caveats above.

Tier 1 (65%+): Claude Fable 5 and Opus 4.8. On these numbers, they are the only two that reliably get through complex, multi-file engineering work. And there is a twist: a US export-control directive on 12 June 2026 forced Anthropic to suspend Fable 5 and Mythos 5 for everyone. The order required cutting off foreign nationals, and since nationality cannot be checked in real time, both models went dark for all users while Opus 4.8, Sonnet 4.6 and Haiku 4.5 kept running (BetaNews). That leaves Opus 4.8 as the only Tier 1 model you can actually log in and use.

Tier 2 (55-65%): Opus 4.7, GPT-5.5 Pro, MiniMax M3, Sonnet 4.6, GPT-5.5, Kimi K2.7-Code. These handle most coding work fine but start to slip on the gnarliest edge cases. MiniMax M3 and Kimi K2.7-Code are the open-weights standouts here, and M3 in particular punches above its price.

Tier 3 (45-55%): Grok 4, Gemini 3.1 Pro, DeepSeek V3.5, GLM-5.2, Llama 4, Mistral Large 2, Gemini 3.5 Flash, Qwen 3. Fine for routine work (boilerplate, simple debugging, documentation), but not something you'd trust with a hard problem unsupervised. Qwen 3, at 46.2%, only just clears the bar.

Tier 4 (<45%): GPT-5.5 Instant. Basic help only. Good for explaining code or knocking out a small script, not for production engineering.
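The tier cut-offs above can be written down as a small lookup, which is handy if you want to re-bucket models when scores change. This is a sketch under one assumption of ours: a score that sits exactly on a boundary (55% or 65%, say) goes into the higher tier. The function name `tier` is hypothetical.

```python
def tier(score_pct: float) -> str:
    """Map a SWE-bench Pro score (in percent) to this article's tiers.

    Assumption: lower bounds are inclusive, so exactly 65.0 is Tier 1.
    """
    if score_pct >= 65:
        return "Tier 1"
    if score_pct >= 55:
        return "Tier 2"
    if score_pct >= 45:
        return "Tier 3"
    return "Tier 4"


# A few scores taken from the leaderboard table.
for model, score in [
    ("Claude Opus 4.8", 69.2),
    ("MiniMax M3", 59.0),
    ("Grok 4", 54.8),
    ("GPT-5.5 Instant", 42.1),
]:
    print(f"{model}: {tier(score)}")
```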

Price-per-point analysis

If you care about getting the most capability per dollar, here is how the value picks shake out. The same caveat applies: these are our derivations from the scores, not a published value index.

  1. Llama 4, Free, 50.2% (effectively unlimited value if you've got the GPUs to run it)
  2. DeepSeek V3.5, $0.15/$0.60, 52.4% (the best value among paid models)
  3. MiniMax M3, $0.30/$1.20, 59.0% (the best open-weights coder on this table; note that open weights were promised at launch but reportedly had not shipped at the time of writing)
  4. Gemini 3.5 Flash, $0.35/$0.70, 48.2% (the best of the budget closed models)
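One way to put a number on "price per point" is to divide a blended token price by the benchmark score. The sketch below does that for the four value picks. The 3:1 input-to-output token mix and the function names are assumptions of ours, chosen for illustration; the ranking can change with a different mix, so plug in the ratio your own workload produces.

```python
def blended_price(input_usd: float, output_usd: float,
                  input_share: float = 0.75) -> float:
    """Blended USD per 1M tokens, assuming a given share of input tokens.

    Assumption: 75% input / 25% output (a 3:1 mix); adjust to your workload.
    """
    return input_usd * input_share + output_usd * (1 - input_share)


def price_per_point(input_usd: float, output_usd: float, score_pct: float) -> float:
    """Blended USD per 1M tokens for each SWE-bench Pro percentage point."""
    return blended_price(input_usd, output_usd) / score_pct


# (model, input price, output price, score) from the leaderboard table.
# Llama 4 is listed as free, which ignores the cost of running it yourself.
picks = [
    ("Llama 4", 0.00, 0.00, 50.2),
    ("DeepSeek V3.5", 0.15, 0.60, 52.4),
    ("MiniMax M3", 0.30, 1.20, 59.0),
    ("Gemini 3.5 Flash", 0.35, 0.70, 48.2),
]

ranked = sorted(picks, key=lambda p: price_per_point(p[1], p[2], p[3]))
for model, inp, out, score in ranked:
    print(f"{model}: ${price_per_point(inp, out, score):.4f} per point")
```

Under the 3:1 assumption this happens to produce the same order as the list above. With an even input/output split, Gemini 3.5 Flash moves ahead of MiniMax M3, which is a good reason to measure your own token mix before treating any value ranking as settled.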

Verdict

If you want the most coding capability you can actually access today, Claude Opus 4.8 (69.2%) is the pick; Fable 5 sits higher on paper but is offline. For the best open-weights option, MiniMax M3 (59.0%) leads. For value, DeepSeek V3.5 (52.4%) or Llama 4 (50.2%, free) get you most of the way at a fraction of the cost. And for serious engineering work, skip GPT-5.5 Instant and Qwen 3.

One last reminder: every number here carries the vendor-versus-standardised gap. Before you commit a team to a model, run it against your own codebase on the kind of tickets you actually close. The leaderboard tells you who to shortlist. Your repo tells you who to hire.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

  1. Write the job-to-be-done before looking at another product.
  2. Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
  3. Run one small pilot and remove anything the team does not use weekly.


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
