Model Review

Best model for coding: SWE-bench Pro leaderboard.

The definitive SWE-bench Pro ranking for June 2026. Claude Fable 5 posts the top score at a vendor-reported 80.3% but is now suspended, Opus 4.8 follows at 69.2%, and MiniMax M3 tops the open-weights models at 59.0%. Full analysis of every tier.

Daniel Fleuren · 2026-06-15 · 12 min read · Founders and operators · Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT



Decision: Shortlist. Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch: Shelfware. A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect: Pilot score. Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

Among models you can actually buy and run today, [Claude Opus 4.8](https://llm-stats.com/blog/research/claude-opus-4-8-launch) sits on top of SWE-bench Pro at 69.2%. The higher-scoring Claude Fable 5 (a vendor-reported 80.3%) was suspended within days of launch, so it is off the table. Below Opus, the picks split by what you care about: [MiniMax M3](https://datanorth.ai/news/minimax-launches-m3) is the strongest open-weights option at 59.0%, and a few cheaper open models cover routine work. One caveat runs through the whole table, though: the scores below mix vendor-tuned numbers with independently measured ones, and those two things are not the same.

Key takeaways

Opus 4.8 (69.2%) is the strongest coding model you can actually buy; the higher-scoring Fable 5 was suspended days after launch.
The leaderboard blends vendor-reported and independently measured scores, and the gap between the two can run 10 to 30 points, so don't read the ranking as a clean apples-to-apples comparison.
MiniMax M3 (59.0%, verified) is the standout open-weights option on price and context; GLM-5.2 may rank higher on independent boards than this table shows.
Several entries are unconfirmed (Sonnet 4.6, Grok 4, DeepSeek V3.5, Llama 4, Mistral Large 2, Gemini 3.5 Flash, Qwen 3, GPT-5.5 Instant) and two look unreliable (the GPT-5.5 Pro line and Kimi K2.7-Code's score).
Shortlist from the benchmark, then run your top picks against your own code before deciding.


Analysis

A new coding model lands almost every week now, each one claiming to be the best your money can buy. For a business deciding what to put in front of its developers, that noise is the problem. You need one number that says how well a model actually does the job.

SWE-bench Pro is meant to be that number. It throws real GitHub issues at a model (fix this bug, build this feature, write these tests) and checks whether the change works. So when the June 2026 results came out, the headline was simple: Anthropic's Opus 4.8 leads the field of models you can buy, at 69.2%. The model that beat it, Claude Fable 5, was pulled offline under a US export-control directive days after launch, which leaves Opus as the practical top pick.

Here is the part the headline skips. The leaderboard stacks two kinds of scores in one column. Some are measured by an independent lab running every model through the same harness. Others are the figures vendors report from their own tuned setups. The gap between the two can run 10 to 30 points, so a straight rank-by-rank read of the table flatters some models and shortchanges others. Worth keeping in mind before you sign anything.

What follows is the full table, the tier-by-tier breakdown, and where the numbers are solid versus where they need a pinch of salt.

The June 2026 leaderboard

The scores below come from the article's source table. Some are independently measured; several are vendor-reported or could not be confirmed against independent leaderboards, and we flag those as we go.

| Rank | Model | SWE-bench Pro | MMLU | Price per 1M tokens, USD (in / out) | Licence |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | 80.3% | 92.1% | $10.00 / $50.00 | Closed (suspended) |
| 2 | Claude Opus 4.8 | 69.2% | 89.8% | $5.00 / $25.00 | Closed |
| 3 | Claude Opus 4.7 | 63.8% | 89.2% | $5.00 / $25.00 | Closed |
| 4 | GPT-5.5 Pro | 62.4% | 89.7% | $8.00 / $40.00 | Closed |
| 5 | MiniMax M3 | 59.0% | 86.4% | $0.30 / $1.20 | Open |
| 6 | GPT-5.5 | 58.6% | 88.4% | $5.00 / $30.00 | Closed |
| 7 | Claude Sonnet 4.6 | 58.1% | 87.6% | $3.00 / $15.00 | Closed |
| 8 | Kimi K2.7-Code | 56.8% | 85.7% | $0.50 / $2.00 | Open |
| 9 | Grok 4 | 54.8% | 87.2% | $5.00 / $25.00 | Closed |
| 10 | Gemini 3.1 Pro | 54.2% | 88.1% | $3.50 / $10.50 | Closed |
| 11 | DeepSeek V3.5 | 52.4% | 85.8% | $0.15 / $0.60 | Open |
| 12 | GLM-5.2 | 51.4% | 85.2% | $0.80 / $2.40 | Open |
| 13 | Llama 4 | 50.2% | 84.8% | Free | Open |
| 14 | Mistral Large 2 | 48.6% | 85.1% | $2.00 / $6.00 | Open |
| 15 | Gemini 3.5 Flash | 48.2% | 86.8% | $0.35 / $0.70 | Closed |
| 16 | Qwen 3 | 46.2% | 84.6% | $0.40 / $1.20 | Open |
| 17 | GPT-5.5 Instant | 42.1% | 84.2% | $0.50 / $1.50 | Closed |

A word on what this benchmark is before you read too much into the column. SWE-bench Pro is a large set of real engineering tasks, roughly 1,865 of them, pulled from 41 professional repositories, covering bug fixes, feature work, test generation, and code review. The public set deliberately uses GPL-licensed code to make it harder for a model to have memorised the answers during training. A high score points to a model that can act as a working engineering assistant, not just spit out snippets.

One thing the table does not show on its face: the figures blend two measurement styles. Vendor-tuned numbers (Fable 5's 80.3%, Opus 4.8's 69.2%) sit next to standardised ones, and on independent trackers the best apples-to-apples score as of mid-June 2026 was closer to 59%. Read the ranking as a rough guide, not gospel.

Tier 1: The elite (65%+)

One available model clears 65%: Claude Opus 4.8 at 69.2%. Anthropic released it on 28 May 2026 as its strongest coding model, with the Pro score climbing from the prior 64.3% (the table lists Opus 4.7 at a slightly lower 63.8%). This is the one to reach for on work that cannot afford mistakes: heavy refactoring, modernising legacy code, architectural changes. Fable 5's reported 80.3% would have owned this tier, but it was globally suspended on 12 June 2026 under an export-control directive, and that score is vendor-reported and contested in any case. For now, Opus 4.8 is the real pick.

Tier 2: The capable (55-65%)

This is the workhorse band. Claude Opus 4.7 (63.8%), GPT-5.5 Pro (a reported 62.4%), MiniMax M3 (59.0%), GPT-5.5 (58.6%), Sonnet 4.6 (58.1%), and Kimi K2.7-Code (a reported 56.8%) cover most engineering tasks dependably.

Two of those numbers come with caveats. The GPT-5.5 Pro line in the table, 62.4% at $8/$40, does not hold up: OpenAI's actual GPT-5.5 Pro pricing is closer to $30/$180, and the standard GPT-5.5 sits at 58.6%, so treat the Pro figure as unconfirmed. Kimi K2.7-Code's 56.8% is also shaky: no independent SWE-bench Pro number exists for K2.7 yet (the 58.6% often quoted belongs to the older K2.6), and its real pricing looks more like $0.95/$4.00.

The standout here is MiniMax M3. It combines open weights, a 1M-token context window, and a verified 59.0% on SWE-bench Pro at $0.30/$1.20, beating GPT-5.5 and Gemini 3.1 Pro on this benchmark for a fraction of the cost. GPT-5.5's 58.6% at $5/$30 is also a confirmed figure. Sonnet 4.6's 58.1% could not be confirmed on independent boards, so take it as indicative.
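To put "a fraction of the cost" into numbers, here is a minimal sketch using the list prices from the table. It assumes the prices are USD per 1M tokens, and the monthly volume (100M input tokens, 20M output tokens) is a made-up workload for illustration, not a measured one.

```python
# Prices (input, output) are copied from the article's table.
# ASSUMPTION: the unit is USD per 1M tokens.
PRICES = {
    "MiniMax M3": (0.30, 1.20),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.1 Pro": (3.50, 10.50),
}


def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Return monthly spend in USD for token volumes given in millions."""
    price_in, price_out = PRICES[model]
    return round(input_mtok * price_in + output_mtok * price_out, 2)


if __name__ == "__main__":
    # Hypothetical workload: 100M input tokens and 20M output tokens a month.
    for name in PRICES:
        print(f"{name}: ${monthly_cost(name, 100, 20):,.2f}")
```

On that assumed workload, MiniMax M3 comes to $54 a month against $1,100 for GPT-5.5 and $560 for Gemini 3.1 Pro. Swap in your own token volumes before drawing conclusions, since self-hosting or provider markups will change the picture.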

Tier 3: The competent (45-55%)

These models do routine coding fine but lose the thread on harder problems: Grok 4 (54.8%), Gemini 3.1 Pro (54.2%), DeepSeek V3.5 (52.4%), GLM-5.2 (51.4%), Llama 4 (50.2%), Mistral Large 2 (48.6%), and Gemini 3.5 Flash (48.2%). DeepSeek V3.5 and Llama 4 are the value plays in the band.

Several of these figures are worth questioning. Gemini 3.1 Pro's 54.2% runs ahead of the ~46.1% reported under a standardised harness. GLM-5.2 looks understated: Zhipu's model is the top open-source entry on the llm-stats board at 62.1%, well above the 51.4% here, and on that board it actually outranks MiniMax M3, which flips the article's ordering. The Grok 4, DeepSeek V3.5, Llama 4, Mistral Large 2, and Gemini 3.5 Flash scores could not be corroborated on the independent leaderboards we checked, so read them as unconfirmed. DeepSeek may also be a generation behind, since mid-2026 sources point to DeepSeek V4/V4-Pro as the current release rather than V3.5.
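The two discrepancies with a second data point can be checked with simple arithmetic. This is a rough sketch: the score pairs are the ones quoted in the paragraph above, while the 5-point tolerance and the labels are arbitrary choices made for illustration.

```python
# (table score, independently reported score) in percent, as quoted in the text.
SCORES = {
    "Gemini 3.1 Pro": (54.2, 46.1),
    "GLM-5.2": (51.4, 62.1),
}


def gap(table_score: float, independent_score: float) -> float:
    """Table score minus independent score, in percentage points."""
    return round(table_score - independent_score, 1)


def flag(points: float, tolerance: float = 5.0) -> str:
    """Label a gap; the tolerance is a hypothetical cut-off, not a standard."""
    if points > tolerance:
        return "table may overstate"
    if points < -tolerance:
        return "table may understate"
    return "roughly consistent"


if __name__ == "__main__":
    for model, (table, independent) in SCORES.items():
        points = gap(table, independent)
        print(f"{model}: {points:+.1f} points ({flag(points)})")
```

That gives +8.1 points for Gemini 3.1 Pro and -10.7 points for GLM-5.2, so the table errs in opposite directions for the two models.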

Tier 4: The assistants (<45%)

Qwen 3 (46.2%, which strictly sits just inside the Tier 3 band) and GPT-5.5 Instant (42.1%) suit code explanation, simple scripts, and boilerplate. Don't lean on them for production engineering. Both scores are unconfirmed against independent SWE-bench Pro boards, which is reason enough on its own to keep them out of critical work.
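The four tier bands reduce to a threshold lookup. A minimal sketch, assuming each band includes its lower bound (the article does not say which side a boundary score falls on):

```python
def tier(score: float) -> str:
    """Map a SWE-bench Pro score (percent) to the article's tier labels.

    ASSUMPTION: lower bounds are inclusive, so exactly 65.0 counts as Tier 1.
    """
    if score >= 65:
        return "Tier 1: The elite"
    if score >= 55:
        return "Tier 2: The capable"
    if score >= 45:
        return "Tier 3: The competent"
    return "Tier 4: The assistants"


if __name__ == "__main__":
    for name, score in [("Claude Opus 4.8", 69.2), ("MiniMax M3", 59.0),
                        ("GLM-5.2", 51.4), ("GPT-5.5 Instant", 42.1)]:
        print(f"{name}: {tier(score)}")
```

Because the bands are strict cut-offs, a model one point either side of a line changes tier, which is another reason to treat the labels as rough groupings.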

Recommendations by use case

Mission-critical coding: Opus 4.8 (69.2%, vendor-reported)
Best open-weights coding: MiniMax M3 (59.0%, verified), though GLM-5.2 may edge it out on independent boards
Best value coding: DeepSeek V3.5 (a reported 52.4% at $0.15/$0.60; score unconfirmed)
Best free coding: Llama 4 (a reported 50.2%; score unconfirmed)
Enterprise with OpenAI: GPT-5.5 (58.6%, verified; the GPT-5.5 Pro line in the table is unreliable)
Speed-sensitive coding: Sonnet 4.6 (a reported 58.1%, fast; score unconfirmed)

Verdict

The coding-model market is crowded, and that is good news for buyers. Opus 4.8 leads on raw capability among models you can use, MiniMax M3 makes a strong case on open weights (with GLM-5.2 close behind on the independent board), and the cheaper open models cover most day-to-day work.

Pick on your real constraints: budget, data privacy, which ecosystem you're already in. But do it with eyes open: the table mixes vendor-tuned and independently measured scores, and a few of the lower-tier entries could not be confirmed at all. Use the leaderboard to narrow the shortlist, then test your top two or three against your own codebase before you commit. The benchmark tells you who's in the running; your repository tells you who wins.
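Testing against your own codebase is easier to compare if every pilot task is logged the same way. A minimal sketch, assuming you record for each task whether your project's tests passed on the model's change and how many minutes a reviewer spent on it. The `PilotRun` fields, the candidate labels, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PilotRun:
    model: str          # label for the candidate model (hypothetical)
    task: str           # short identifier for the task from your backlog
    tests_passed: bool  # did your own test suite pass on the model's change?
    minutes: float      # reviewer time spent on the change


def summarise(runs: list[PilotRun]) -> dict[str, dict[str, float]]:
    """Return pass rate and median review time per model."""
    by_model: dict[str, list[PilotRun]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    return {
        model: {
            "pass_rate": sum(r.tests_passed for r in rs) / len(rs),
            "median_minutes": median(r.minutes for r in rs),
        }
        for model, rs in by_model.items()
    }


if __name__ == "__main__":
    runs = [
        PilotRun("candidate-a", "fix-login-bug", True, 4),
        PilotRun("candidate-a", "add-csv-export", True, 10),
        PilotRun("candidate-a", "refactor-billing", False, 6),
        PilotRun("candidate-b", "fix-login-bug", True, 5),
        PilotRun("candidate-b", "add-csv-export", False, 12),
    ]
    for model, stats in summarise(runs).items():
        print(model, stats)
```

A handful of tasks will not give a statistically solid pass rate, so treat the output as a tie-breaker between shortlisted models, not as a replacement benchmark.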


What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
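The scoring step in the list above can be done in a spreadsheet, or in a few lines of code. A minimal sketch, assuming 1 to 5 ratings on the four criteria named above; the weights, tool names, and ratings are hypothetical and should be replaced with your own.

```python
# ASSUMPTION: weights are illustrative percentages that sum to 100.
WEIGHTS = {
    "workflow_fit": 40,
    "data_handling": 25,
    "cost": 20,
    "owner_readiness": 15,
}


def weighted_score(ratings: dict[str, int]) -> float:
    """Combine 1-5 ratings into one weighted score on the same 1-5 scale."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Hypothetical shortlist with made-up ratings.
    shortlist = {
        "tool-a": {"workflow_fit": 4, "data_handling": 5, "cost": 2, "owner_readiness": 3},
        "tool-b": {"workflow_fit": 3, "data_handling": 3, "cost": 5, "owner_readiness": 4},
    }
    ranked = sorted(shortlist, key=lambda t: weighted_score(shortlist[t]), reverse=True)
    for tool in ranked:
        print(f"{tool}: {weighted_score(shortlist[tool]):.2f}")
```

With these made-up numbers, tool-a scores 3.70 and tool-b 3.55. Agree the weights before rating anything, otherwise the result tends to justify a choice already made.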

