Coding benchmarks: Which model writes the best code?
Analysis
If you manage a team that ships software, you have probably watched the parade of AI coding models and wondered which one is worth paying for. The marketing decks all say the same thing. Every model is the best at coding. They cannot all be right.
That is the problem SWE-bench Pro was built to solve. Instead of asking a model to autocomplete a function, it hands the model real tickets from real codebases (1,865 tasks pulled from 41 repositories across Python, Go, TypeScript and JavaScript) and checks whether the code actually works (Scale Labs). It is closer to "can this thing do a junior engineer's day" than "can it pass a quiz."
But there is a catch worth knowing before you read a single percentage. Most of the headline scores come from the model makers themselves, run on their own setups. Neutral, apples-to-apples scores tend to land a fair bit lower. So treat the leaderboard below as a ranking with an asterisk, and we will point out where the vendor number and the standardised number part ways.
The short version for a busy team: Claude Opus 4.8 is the strongest model you can actually use right now, the open-weights field has gotten genuinely good and cheap, and the difference between the top closed model and a budget open one is smaller than the price tags suggest.
The SWE-bench Pro leaderboard
| Rank | Model | SWE-bench Pro | Price per 1M tokens, USD (input / output) | Licence | Context (tokens) |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | 80.3% | $10.00 / $50.00 | Closed | 1M |
| 2 | Claude Opus 4.8 | 69.2% | $5.00 / $25.00 | Closed | 1M |
| 3 | Claude Opus 4.7 | 63.8% | $5.00 / $25.00 | Closed | 1M |
| 4 | GPT-5.5 Pro | 62.4% | $8.00 / $40.00 | Closed | 400K |
| 5 | MiniMax M3 | 59.0% | $0.30 / $1.20 | Open | 1M |
| 6 | GPT-5.5 | 58.6% | $5.00 / $30.00 | Closed | 400K |
| 7 | Claude Sonnet 4.6 | 58.1% | $3.00 / $15.00 | Closed | 1M |
| 8 | Kimi K2.7-Code | 56.8% | $0.50 / $2.00 | Open | 256K |
| 9 | Grok 4 | 54.8% | $5.00 / $25.00 | Closed | 256K |
| 10 | Gemini 3.1 Pro | 54.2% | $3.50 / $10.50 | Closed | 1M |
| 11 | DeepSeek V3.5 | 52.4% | $0.15 / $0.60 | Open | 1M |
| 12 | GLM-5.2 | 51.4% | $0.80 / $2.40 | Open | 256K |
| 13 | Llama 4 | 50.2% | Free / Free | Open | 256K |
| 14 | Mistral Large 2 | 48.6% | $2.00 / $6.00 | Open | 256K |
| 15 | Gemini 3.5 Flash | 48.2% | $0.35 / $0.70 | Closed | 1M |
| 16 | Qwen 3 | 46.2% | $0.40 / $1.20 | Open | 128K |
| 17 | GPT-5.5 Instant | 42.1% | $0.50 / $1.50 | Closed | 128K |
A word on that table before you act on it. The numbers above are mostly vendor-reported, meaning each company ran the test on its own tooling. Scale's standardised harness, which runs every model the same way, tells a less flattering story: its neutral leader is GPT-5.4 (xHigh) at 59.1%, well short of the vendor figures you see here (Scale Labs). Same benchmark, different plumbing, very different result. Read the ranking as "roughly who's ahead," not as a precise score you can quote to your CFO.
A few rows also deserve their own asterisks. Fable 5's chart-topping 80.3% comes from Anthropic's own launch materials using Anthropic's own scaffolding, and independent reviewers have contested it; standardised leaderboards paint a more competitive picture (Morph LLM). The table's 54.2% for Gemini 3.1 Pro is also higher than the 46.1% standardised figure that turns up in the source data. And several rows (DeepSeek V3.5, GLM-5.2, Llama 4, Grok 4, Kimi K2.7-Code, Mistral Large 2, Qwen 3, GPT-5.5 Pro, GPT-5.5 Instant and Opus 4.7) could not be confirmed against the leaderboards we checked, so treat their exact percentages and prices as unconfirmed. One likely version mix-up worth flagging: the sources reference GLM-5.1 at 58.4%, not GLM-5.2.
Opus 4.8's figures, by contrast, hold up: 69.2% vendor-reported, $5.00 input / $25.00 output per million tokens, 1M context, released 28 May 2026 (Finout). MiniMax M3's 59.0% also checks out, along with its 1M context and roughly $0.30/$1.20 launch pricing (Fello AI). GPT-5.5's 58.6% is corroborated across several sources too (Morph LLM).
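To put a number on the vendor-versus-standardised gap, here is a minimal Python sketch using only figures quoted in this article. The helper name `gap_in_points` is ours, purely for illustration. Note that the second calculation compares two different models (the top vendor-reported score and the leader on Scale's harness), so it shows the spread between headlines, not a like-for-like gap.

```python
# Scores quoted in this article, as (vendor-reported, standardised) percentages.
# Only Gemini 3.1 Pro has both figures given for the same model.
VENDOR_VS_STANDARDISED = {
    "Gemini 3.1 Pro": (54.2, 46.1),
}


def gap_in_points(vendor: float, standardised: float) -> float:
    """Percentage-point difference between a vendor score and a neutral score."""
    return round(vendor - standardised, 1)


gaps = {
    name: gap_in_points(vendor, standardised)
    for name, (vendor, standardised) in VENDOR_VS_STANDARDISED.items()
}

# Fable 5's vendor-reported 80.3% versus GPT-5.4 (xHigh) at 59.1% on Scale's
# harness. Different models, so this is a headline spread, not a per-model gap.
headline_spread = gap_in_points(80.3, 59.1)

for name, gap in gaps.items():
    print(f"{name}: vendor figure is {gap} points above the standardised one")
print(f"Top vendor score vs top standardised score: {headline_spread} points")
```

A gap of several points on a single model is enough to reshuffle the middle of the table, which is why the ranking is best read as approximate.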
Tier analysis
The tiers below are our reading of the numbers, not an official published ranking. They are a sensible way to group models by what they can realistically handle, but they sit on top of scores that carry the caveats above.
Tier 1 (65%+): Claude Fable 5 and Opus 4.8. On these numbers, they are the only two that reliably get through complex, multi-file engineering work. And there is a twist: a US export-control directive on 12 June 2026 forced Anthropic to suspend Fable 5 and Mythos 5 for everyone. The order required cutting off foreign nationals, and since nationality cannot be checked in real time, both models went dark for all users while Opus 4.8, Sonnet 4.6 and Haiku 4.5 kept running (BetaNews). That leaves Opus 4.8 as the only Tier 1 model you can actually log in and use.
Tier 2 (55-65%): Opus 4.7, GPT-5.5 Pro, MiniMax M3, GPT-5.5, Sonnet 4.6, Kimi K2.7-Code. These handle most coding work fine but start to slip on the gnarliest edge cases. MiniMax M3 and Kimi K2.7-Code are the open-weights standouts here, and M3 in particular punches above its price.
Tier 3 (45-55%): Grok 4, Gemini 3.1 Pro, DeepSeek V3.5, GLM-5.2, Llama 4, Mistral Large 2, Gemini 3.5 Flash, Qwen 3. Fine for routine work (boilerplate, simple debugging, documentation), but not something you'd trust with a hard problem unsupervised. Qwen 3 only just clears the bar at 46.2%.
Tier 4 (<45%): GPT-5.5 Instant. Basic help only. Good for explaining code or knocking out a small script, not for production engineering.
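The tier cut-offs above are simple enough to write down as code, which makes it easy to re-bucket a model when a score gets revised. This is a sketch of our own grouping, not an official scheme. The function name `tier` is ours, and we assume a score sitting exactly on a boundary goes to the higher tier, since the article's ranges do not say.

```python
def tier(score: float) -> int:
    """Map a SWE-bench Pro percentage to the tier numbers used in this article.

    Assumption: a score exactly on a boundary lands in the higher tier
    (for example, 55.0 counts as Tier 2).
    """
    if score >= 65:
        return 1
    if score >= 55:
        return 2
    if score >= 45:
        return 3
    return 4


# One model per tier, with scores taken from the leaderboard table.
sample_scores = {
    "Claude Opus 4.8": 69.2,
    "MiniMax M3": 59.0,
    "DeepSeek V3.5": 52.4,
    "GPT-5.5 Instant": 42.1,
}

tiers = {name: tier(score) for name, score in sample_scores.items()}

for name, assigned in tiers.items():
    print(f"{name}: Tier {assigned}")
```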
Price-per-point analysis
If you care about getting the most capability per dollar, here is how the value picks shake out. The same caveat applies: these are our derivations from the scores, not a published value index.
- Llama 4, Free, 50.2% (effectively unlimited value if you've got the GPUs to run it)
- DeepSeek V3.5, $0.15/$0.60, 52.4% (the best value among paid models)
- MiniMax M3, $0.30/$1.20, 59.0% (the best open-weights coder, with a note: open weights were committed at launch but reportedly hadn't shipped as of reporting)
- Gemini 3.5 Flash, $0.35/$0.70, 48.2% (the best of the budget closed models)
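For readers who want to rerun the value comparison with their own numbers, here is one possible way to make "price per point" concrete in Python. The metric is an assumption on our part: we blend input and output prices at a 3:1 token mix (the `input_share=0.75` default) and divide the score by that blended price. Your real mix will differ, and "free" for Llama 4 ignores the cost of the hardware you run it on.

```python
import math

# (input price, output price, SWE-bench Pro score) from the leaderboard table.
# Prices are USD per million tokens; scores are percentages.
MODELS = {
    "Claude Opus 4.8": (5.00, 25.00, 69.2),
    "MiniMax M3": (0.30, 1.20, 59.0),
    "DeepSeek V3.5": (0.15, 0.60, 52.4),
    "Llama 4": (0.00, 0.00, 50.2),  # free weights; hosting cost not included
    "Gemini 3.5 Flash": (0.35, 0.70, 48.2),
}


def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.75) -> float:
    """Weighted price per million tokens.

    Assumption: 75% of tokens are input and 25% are output.
    """
    return input_share * input_price + (1 - input_share) * output_price


def points_per_dollar(input_price: float, output_price: float,
                      score: float) -> float:
    """Benchmark points per blended dollar; infinite when the model is free."""
    price = blended_price(input_price, output_price)
    return math.inf if price == 0 else score / price


ranking = sorted(
    MODELS,
    key=lambda name: points_per_dollar(*MODELS[name]),
    reverse=True,
)

for name in ranking:
    value = points_per_dollar(*MODELS[name])
    label = "unlimited" if math.isinf(value) else f"{value:.1f}"
    print(f"{name}: {label} points per blended dollar")
```

Under this particular blend the order matches the list above, with Opus 4.8 far behind on value despite leading on capability. Change `input_share` to match your own traffic before drawing conclusions.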
Verdict
If you want the most coding capability you can actually access today, Claude Opus 4.8 (69.2%) is the pick; Fable 5 sits higher on paper but is offline. For the best open-weights option, MiniMax M3 (59.0%) leads. For value, DeepSeek V3.5 (52.4%) or Llama 4 (50.2%, free) get you most of the way at a fraction of the cost. And for serious engineering work, skip GPT-5.5 Instant and Qwen 3.
One last reminder: every number here carries the vendor-versus-standardised gap. Before you commit a team to a model, run it against your own codebase on the kind of tickets you actually close. The leaderboard tells you who to shortlist. Your repo tells you who to hire.


