Model Review

Coding benchmarks: Which model writes the best code?

We rank all 17 models by SWE-bench Pro scores. Claude Fable 5 leads at 80.3%, followed by Opus 4.8 at 69.2% and Opus 4.7 at 63.8%; GPT-5.5 Pro is the top non-Anthropic model at 62.4%. The full coding leaderboard follows.

Daniel Fleuren | 2026-06-15 | 12 min read | Founders and operators | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

[SWE-bench Pro](https://labs.scale.com/leaderboard/swe_bench_pro_public) has become the test that matters when you want to know whether an AI model can actually do software engineering work. It checks the real stuff: fixing bugs, building features, writing tests, and reviewing code across several languages and frameworks. Here is where things stand as of June 2026, with one big caveat about how these numbers get reported.

Analysis

If you manage a team that ships software, you have probably watched the parade of AI coding models and wondered which one is worth paying for. The marketing decks all say the same thing. Every model is the best at coding. They cannot all be right.

That is the problem SWE-bench Pro was built to solve. Instead of asking a model to autocomplete a function, it hands the model real tickets from real codebases (1,865 tasks pulled from 41 repositories across Python, Go, TypeScript and JavaScript) and checks whether the code actually works (Scale Labs). It is closer to "can this thing do a junior engineer's day" than "can it pass a quiz."

But there is a catch worth knowing before you read a single percentage. Most of the headline scores come from the model makers themselves, run on their own setups. Neutral, apples-to-apples scores tend to land a fair bit lower. So treat the leaderboard below as a ranking with an asterisk, and we will point out where the vendor number and the standardised number part ways.

The short version for a busy team: Claude Opus 4.8 is the strongest model you can actually use right now, the open-weights field has gotten genuinely good and cheap, and the difference between the top closed model and a budget open one is smaller than the price tags suggest.

The SWE-bench Pro leaderboard

Rank	Model	SWE-bench Pro	Price per 1M tokens (Input / Output)	Licence	Context
1	Claude Fable 5	80.3%	$10.00 / $50.00	Closed	1M
2	Claude Opus 4.8	69.2%	$5.00 / $25.00	Closed	1M
3	Claude Opus 4.7	63.8%	$5.00 / $25.00	Closed	1M
4	GPT-5.5 Pro	62.4%	$8.00 / $40.00	Closed	400K
5	MiniMax M3	59.0%	$0.30 / $1.20	Open	1M
6	GPT-5.5	58.6%	$5.00 / $30.00	Closed	400K
7	Claude Sonnet 4.6	58.1%	$3.00 / $15.00	Closed	1M
8	Kimi K2.7-Code	56.8%	$0.50 / $2.00	Open	256K
9	Grok 4	54.8%	$5.00 / $25.00	Closed	256K
10	Gemini 3.1 Pro	54.2%	$3.50 / $10.50	Closed	1M
11	DeepSeek V3.5	52.4%	$0.15 / $0.60	Open	1M
12	GLM-5.2	51.4%	$0.80 / $2.40	Open	256K
13	Llama 4	50.2%	Free / Free	Open	256K
14	Mistral Large 2	48.6%	$2.00 / $6.00	Open	256K
15	Gemini 3.5 Flash	48.2%	$0.35 / $0.70	Closed	1M
16	Qwen 3	46.2%	$0.40 / $1.20	Open	128K
17	GPT-5.5 Instant	42.1%	$0.50 / $1.50	Closed	128K

A word on that table before you act on it. The numbers above are mostly vendor-reported, meaning each company ran the test on its own tooling. Scale's standardised harness, which runs every model the same way, tells a less flattering story: its neutral leader is GPT-5.4 (xHigh) at 59.1%, well short of the vendor figures you see here (Scale Labs). Same benchmark, different plumbing, very different result. Read the ranking as "roughly who's ahead," not as a precise score you can quote to your CFO.

A few rows also deserve their own asterisks. Fable 5's chart-topping 80.3% comes from Anthropic's own launch materials using Anthropic's own scaffolding, and independent reviewers have contested it; standardised leaderboards paint a more competitive picture (Morph LLM). The table's 54.2% for Gemini 3.1 Pro is also higher than the 46.1% standardised figure that turns up in the source data. And several rows (DeepSeek V3.5, GLM-5.2, Llama 4, Grok 4, Kimi K2.7-Code, Mistral Large 2, Qwen 3, GPT-5.5 Pro, GPT-5.5 Instant and Opus 4.7) could not be confirmed against the leaderboards we checked, so treat their exact percentages and prices as unconfirmed. One likely version mix-up worth flagging: the sources reference GLM-5.1 at 58.4%, not GLM-5.2.

Opus 4.8's figures, by contrast, hold up: 69.2% vendor-reported, $5.00 input / $25.00 output per million tokens, 1M context, released 28 May 2026 (Finout). MiniMax M3's 59.0% also checks out, along with its 1M context and roughly $0.30/$1.20 launch pricing (Fello AI). GPT-5.5's 58.6% is corroborated across several sources too (Morph LLM).
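To put a number on the vendor-versus-standardised gap described above, here is a minimal sketch that uses only the two comparisons quoted in this article. The function and variable names are ours, for illustration only. Note that the "leader" pair compares two different models (Fable 5 on vendor numbers, GPT-5.4 xHigh on Scale's harness), so it shows how far the top of each leaderboard sits apart, not a like-for-like gap.

```python
# Vendor-reported vs standardised SWE-bench Pro figures quoted in this article.
# Each entry: (label, vendor-reported %, standardised %).
PAIRS = [
    # Different models: Fable 5 (vendor) vs GPT-5.4 xHigh (Scale's harness).
    ("Leaderboard leader", 80.3, 59.1),
    # Same model on both sides.
    ("Gemini 3.1 Pro", 54.2, 46.1),
]


def gap_points(vendor: float, standardised: float) -> float:
    """Return the gap in percentage points, rounded to one decimal place."""
    return round(vendor - standardised, 1)


def gaps(pairs):
    """Map each label to its vendor-minus-standardised gap."""
    return {label: gap_points(vendor, std) for label, vendor, std in pairs}


for label, gap in gaps(PAIRS).items():
    print(f"{label}: {gap:.1f} points")
```

Both gaps are large enough to move a model up or down a tier, which is why the ranking is best read as a shortlist rather than a scorecard.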

Tier analysis

The tiers below are our reading of the numbers, not an official published ranking. They are a sensible way to group models by what they can realistically handle, but they sit on top of scores that carry the caveats above.

Tier 1 (65%+): Claude Fable 5 and Opus 4.8. On these numbers, they are the only two that reliably get through complex, multi-file engineering work. And there is a twist: a US export-control directive on 12 June 2026 forced Anthropic to suspend Fable 5 and Mythos 5 for everyone. The order required cutting off foreign nationals, and since nationality cannot be checked in real time, both models went dark for all users while Opus 4.8, Sonnet 4.6 and Haiku 4.5 kept running (BetaNews). That leaves Opus 4.8 as the only Tier 1 model you can actually log in and use.

Tier 2 (55-65%): Opus 4.7, GPT-5.5 Pro, MiniMax M3, GPT-5.5, Sonnet 4.6, Kimi K2.7-Code. These handle most coding work fine but start to slip on the gnarliest edge cases. MiniMax M3 and Kimi K2.7-Code are the open-weights standouts here, and M3 in particular punches above its price.

Tier 3 (45-55%): Grok 4, Gemini 3.1 Pro, DeepSeek V3.5, GLM-5.2, Llama 4, Mistral Large 2, Gemini 3.5 Flash, Qwen 3. Fine for routine work (boilerplate, simple debugging, documentation), but not something you'd trust with a hard problem unsupervised. Qwen 3, at 46.2%, only just clears the bar.

Tier 4 (<45%): GPT-5.5 Instant. Basic help only. Good for explaining code or knocking out a small script, not for production engineering.
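If you want to drop a new score into these tiers yourself, the cut-offs can be sketched as a small function. This assumes each tier's lower bound is inclusive (so exactly 55% counts as Tier 2), which the ranges above do not spell out; the function name is ours.

```python
def tier(score: float) -> int:
    """Map a SWE-bench Pro score (in percent) to the article's tier number.

    ASSUMPTION: lower bounds are inclusive, so 65.0 is Tier 1 and 55.0 is Tier 2.
    """
    if score >= 65:
        return 1
    if score >= 55:
        return 2
    if score >= 45:
        return 3
    return 4


# One example per tier, using scores from the leaderboard table.
EXAMPLES = {
    "Claude Opus 4.8": 69.2,
    "MiniMax M3": 59.0,
    "DeepSeek V3.5": 52.4,
    "GPT-5.5 Instant": 42.1,
}

for model, score in EXAMPLES.items():
    print(f"{model}: Tier {tier(score)}")
```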

Price-per-point analysis

If you care about getting the most capability per dollar, here is how the value picks shake out. The same caveat applies: these are our derivations from the scores, not a published value index.

Llama 4, Free, 50.2% (effectively unlimited value if you've got the GPUs to run it)
DeepSeek V3.5, $0.15/$0.60, 52.4% (the best value among paid models)
MiniMax M3, $0.30/$1.20, 59.0% (the best open-weights coder, with a note: open weights were committed at launch but reportedly hadn't shipped as of reporting)
Gemini 3.5 Flash, $0.35/$0.70, 48.2% (the best of the budget closed models)
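For readers who want to reproduce a price-per-point figure, here is one rough way to do it. It is a sketch, not the method behind the list above: it assumes a blended price built from three input tokens for every output token, which is our guess at a typical coding workload. Swap in your own mix. Prices and scores come from the leaderboard table; the names are illustrative.

```python
# Per-million-token prices (input, output) and SWE-bench Pro scores from the table.
# Llama 4 is left out: its weights are free, so its real cost is hardware, not tokens.
MODELS = {
    "DeepSeek V3.5": (0.15, 0.60, 52.4),
    "MiniMax M3": (0.30, 1.20, 59.0),
    "Gemini 3.5 Flash": (0.35, 0.70, 48.2),
    "Claude Opus 4.8": (5.00, 25.00, 69.2),
}

# ASSUMPTION: three input tokens are read for every output token written.
INPUT_WEIGHT = 3
OUTPUT_WEIGHT = 1


def blended_price(input_price: float, output_price: float) -> float:
    """Weighted average price per million tokens under the assumed mix."""
    total = INPUT_WEIGHT + OUTPUT_WEIGHT
    return (INPUT_WEIGHT * input_price + OUTPUT_WEIGHT * output_price) / total


def price_per_point(input_price: float, output_price: float, score: float) -> float:
    """Blended dollars per million tokens, divided by benchmark points."""
    return blended_price(input_price, output_price) / score


def rank_by_value(models):
    """Model names ordered from cheapest to dearest per benchmark point."""
    return sorted(models, key=lambda name: price_per_point(*models[name]))


for name in rank_by_value(MODELS):
    print(f"{name}: ${price_per_point(*MODELS[name]):.4f} per point")
```

The ordering is sensitive to the assumed mix. At 3:1, MiniMax M3 comes out slightly cheaper per point than Gemini 3.5 Flash; at 1:1 the two swap places, because Flash's output price is lower.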

Verdict

If you want the most coding capability you can actually access today, Claude Opus 4.8 (69.2%) is the pick; Fable 5 sits higher on paper but is offline. For the best open-weights option, MiniMax M3 (59.0%) leads. For value, DeepSeek V3.5 (52.4%) or Llama 4 (50.2%, free) get you most of the way at a fraction of the cost. And for serious engineering work, skip GPT-5.5 Instant and Qwen 3.

One last reminder: every number here carries the vendor-versus-standardised gap. Before you commit a team to a model, run it against your own codebase on the kind of tickets you actually close. The leaderboard tells you who to shortlist. Your repo tells you who to hire.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
