Model Review

Claude Fable 5 vs GPT-5.5: What the benchmarks say.

Anthropic's suspended Claude Fable 5 (80.3% SWE-bench Pro, a reported 92.1% MMLU) vs OpenAI's GPT-5.5 (58.6% SWE-bench Pro, a reported 88.4% MMLU). The coding numbers tell a clear story; the rest need a closer look.

  • Decision: Shortlist. Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.
  • Risk to watch: Shelfware. A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.
  • Proof to collect: Pilot score. Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

[Claude Fable 5](https://www.anthropic.com/news/claude-fable-5-mythos-5) posted the strongest coding benchmark result of any frontier model so far, with [80.3% on SWE-bench Pro](https://claude5.ai/news/claude-fable-5-benchmarks-swe-bench-pro-80-percent) against [GPT-5.5's 58.6%](https://www.vellum.ai/blog/everything-you-need-to-know-about-gpt-5-5). It also cost roughly twice as much: 2x on input tokens and 1.67x on output. Then Anthropic [suspended access](https://www.anthropic.com/news/fable-mythos-access) on 12 June 2026, so the model that won the benchmark fight is the one you can't actually buy. For Australian teams, that leaves GPT-5.5 as the practical default and a useful lesson about betting your stack on a single vendor's flagship.

Key takeaways

  • Analysis: For about three days in June 2026, the best coding model on the market was one almost nobody could use.
  • Head-to-head benchmarks: SWE-bench Pro 80.3% vs 58.6% (+21.7 pts, Fable); MMLU 92.1% vs 88.4% (+3.7 pts, Fable); context window 1M vs 400K as printed; input price $10.00 vs $5.00 per 1M tokens (2x, Fable); output price $50.00 vs $30.00 per 1M tokens (1.67x, Fable); status suspended vs active. Two notes before you lean on these: the MMLU figures are unconfirmed, and the context-window row is wrong as printed.
  • The capability gap: The SWE-bench Pro gap is the real story: 21.7 points.
  • The pricing reality: Fable 5 was expensive: [$10 input and $50 output per million tokens](https://www.anthropic.com/news/claude-fable-5-mythos-5), against [GPT-5.5's $5 and $30](https://llm-stats.com/models/gpt-5.5).

Analysis

For about three days in June 2026, the best coding model on the market was one almost nobody could use.

Anthropic announced Claude Fable 5 on 9 June, a publicly accessible cut of its Mythos line. The headline number was hard to ignore: 80.3% on SWE-bench Pro, a test that measures whether a model can actually fix real software bugs rather than just talk about them. OpenAI's GPT-5.5, released back in April, sat at 58.6% on the same test. A gap that size doesn't usually show up between two flagship models in the same season.

Then, on 12 June, Anthropic pulled it. The company suspended access to Fable 5 and Mythos 5 after a US government export-control directive (Anthropic also referenced a claimed jailbreak), and said it was working to restore access. The suspension wasn't a quality problem or a recall. It was a policy and access issue. But the effect on anyone planning to build on Fable 5 was the same: the model vanished from their options.

So the comparison below is partly a post-mortem. It tells you how far ahead Anthropic got on paper, where OpenAI's model actually stands, and why "best benchmark" and "best choice for your business" are not the same sentence.

Head-to-head benchmarks

| Metric | Claude Fable 5 | GPT-5.5 | Delta |
| --- | --- | --- | --- |
| SWE-bench Pro | 80.3% | 58.6% | +21.7 pts (Fable) |
| MMLU | 92.1% | 88.4% | +3.7 pts (Fable) |
| Context window | 1M | 400K | +600K (Fable) |
| Price (input, per 1M tokens) | $10.00 | $5.00 | 2x (Fable) |
| Price (output, per 1M tokens) | $50.00 | $30.00 | 1.67x (Fable) |
| Status | SUSPENDED | Active | |

Two notes on this table before you lean on it. The MMLU row (92.1% vs 88.4%) is widely repeated but I couldn't trace it to a primary source; vendors for both models published GPQA, Terminal-Bench and SWE-bench numbers rather than classic MMLU, so treat those two figures as unconfirmed. And the context-window row is wrong as printed: GPT-5.5's API context is 1M tokens, not 400K (the 400K figure applies to Codex). The +600K Fable advantage in the table doesn't hold up.
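The delta column is simple arithmetic, and it is worth rechecking before quoting it. The sketch below recomputes the point gaps and price ratios from the figures in the table; the dictionary and function names are illustrative only, and the MMLU inputs remain unconfirmed as noted above.

```python
# Figures copied from the table in this article.
# The MMLU pair is unconfirmed, so its gap is only as good as its inputs.
scores = {
    "SWE-bench Pro": (80.3, 58.6),  # (Claude Fable 5, GPT-5.5), in percent
    "MMLU": (92.1, 88.4),
}
prices_per_million = {
    "input": (10.00, 5.00),   # dollars per 1M tokens, as quoted
    "output": (50.00, 30.00),
}


def point_gap(fable: float, gpt: float) -> float:
    """Percentage-point difference, rounded to one decimal place."""
    return round(fable - gpt, 1)


def price_ratio(fable: float, gpt: float) -> float:
    """How many times the GPT-5.5 price Fable 5 cost, to two decimals."""
    return round(fable / gpt, 2)


gaps = {name: point_gap(*pair) for name, pair in scores.items()}
ratios = {name: price_ratio(*pair) for name, pair in prices_per_million.items()}

print(gaps)
print(ratios)
```

The recomputed values match the table's 21.7 and 3.7 point gaps and its 2x and 1.67x price ratios.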

The capability gap

The SWE-bench Pro gap is the real story: 21.7 points. That's not a rounding difference between two models doing roughly the same job. It's the kind of margin that changes what you'd hand the model in the first place.

What that score translates to in practice is harder to pin down. Coverage of Fable 5 pointed to gains on the messier end of software work, things like multi-file refactoring, novel algorithm implementation and chasing down deeply nested dependency bugs. That's a reasonable read of an 80% SWE-bench Pro result, but it's an interpretation, not a documented capability claim from either lab. A higher score tells you the model fixes more of the test's bugs; it doesn't certify a specific list of tasks GPT-5.5 supposedly can't touch.

The MMLU gap, if the figures hold, is 3.7 points. That's a much narrower margin, and both models are strong on general knowledge either way, so it's not where the decision gets made.

Context window is where the table oversells it. The pitch was that Fable 5's 1M-token window let it ingest entire repositories that GPT-5.5 had to chunk. But GPT-5.5 also offers a 1M-token window in the API, so on raw context size the two are level. If large-codebase analysis is your use case, that's a real correction to make.
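If large-codebase analysis is the use case, a rough size check on your own repository is more useful than the headline window figure. The sketch below is a back-of-envelope estimate only: the four-characters-per-token ratio is a common rule of thumb rather than either vendor's tokenizer, and the file extensions and reserve for the prompt and reply are hypothetical defaults.

```python
import os

# ASSUMPTION: about 4 bytes of source text per token. Real tokenizers vary
# by language and content, so treat the result as an order-of-magnitude guess.
CHARS_PER_TOKEN = 4.0


def estimate_repo_tokens(root: str, extensions=(".py", ".js", ".ts", ".md")) -> int:
    """Sum the sizes of matching files under root and convert to a token estimate."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extensions):
                total_bytes += os.path.getsize(os.path.join(dirpath, filename))
    return int(total_bytes / CHARS_PER_TOKEN)


def fits_in_window(token_estimate: int,
                   window_tokens: int = 1_000_000,
                   reserve_tokens: int = 50_000) -> bool:
    """True if the estimate plus a reserve for instructions and output fits the window."""
    return token_estimate + reserve_tokens <= window_tokens
```

Calling `fits_in_window(estimate_repo_tokens("path/to/repo"))` gives a first answer on whether chunking is likely to be needed; anything close to the limit deserves a count with the provider's actual tokenizer.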

The pricing reality

Fable 5 was expensive: $10 input and $50 output per million tokens, against GPT-5.5's $5 and $30. That is 2x on input and about 1.67x on output. For high-value coding work, paying that premium for a 21.7-point lead on SWE-bench Pro is defensible. You're buying fewer failed runs and less human cleanup, and on expensive engineering time that maths can work.
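To see when that maths works, here is a minimal sketch of expected cost per attempted task. It leans on two loud assumptions: the SWE-bench Pro score is used as a stand-in for the chance a run succeeds on your own work, which a benchmark does not guarantee, and the token counts and cleanup cost are hypothetical placeholders to replace with your own figures. Only the prices and scores come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_price_per_million: float   # dollars, as quoted in the article
    output_price_per_million: float  # dollars, as quoted in the article
    success_rate: float              # ASSUMPTION: SWE-bench Pro score as a proxy


def token_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of the tokens for one attempt."""
    return (input_tokens * model.input_price_per_million
            + output_tokens * model.output_price_per_million) / 1_000_000


def expected_cost(model: Model, input_tokens: int, output_tokens: int,
                  cleanup_cost: float) -> float:
    """Token cost plus the expected human cleanup cost when a run fails."""
    failure_rate = 1.0 - model.success_rate
    return token_cost(model, input_tokens, output_tokens) + failure_rate * cleanup_cost


fable = Model("Claude Fable 5", 10.00, 50.00, 0.803)
gpt = Model("GPT-5.5", 5.00, 30.00, 0.586)

# HYPOTHETICAL workload: 20k tokens in, 5k tokens out, $50 of engineer time
# to clean up after each failed run.
INPUT_TOKENS = 20_000
OUTPUT_TOKENS = 5_000
CLEANUP_COST = 50.00

for model in (fable, gpt):
    tokens_only = token_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    with_cleanup = expected_cost(model, INPUT_TOKENS, OUTPUT_TOKENS, CLEANUP_COST)
    print(f"{model.name}: tokens ${tokens_only:.2f}, with cleanup ${with_cleanup:.2f}")
```

On token spend alone GPT-5.5 is cheaper in this scenario ($0.25 against $0.45 per attempt). Once each failed run carries $50 of cleanup, the expected cost flips to about $10.30 for Fable 5 against $20.95 for GPT-5.5. The premium only pays off when failures are expensive, which is the distinction between high-value coding work and everyday use.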

For everyday use it's a different call. GPT-5.5's lower price makes it the easier model to roll out across a team, though $30 per million output tokens is still on the steep side next to cheaper general-purpose options.

What this means today

With Fable 5 suspended, the head-to-head is academic for now. What it shows is that Anthropic opened a clear lead, at least on the one benchmark that's well-verified, and that OpenAI has room to close it. There's talk of a stronger GPT-5.5 Pro variant scoring 62.4% on SWE-bench Pro, but I couldn't find a source confirming either the variant or that number, so treat it as rumoured rather than fact. Even taken at face value, it would narrow the gap to 17.9 points, not erase it. The next releases from both labs are the ones worth watching.

Verdict

On the numbers that hold up, Fable 5 won the comparison that mattered most for coding teams, by a wide margin on SWE-bench Pro, though at double the input price and 1.67x the output price. Calling it superior on every benchmark would be overstating it: the MMLU edge is unverified and the context-window advantage doesn't survive a closer look.

The practical takeaway is simpler. Fable 5's suspension leaves GPT-5.5 as the default for teams who'd otherwise have reached for Fable 5, with any Pro variant still only rumoured. And it's a reminder worth keeping: a model can top the leaderboard and still disappear from your stack overnight for reasons that have nothing to do with how good it is. Build so you can swap.

Winner: Claude Fable 5 (but unavailable)
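One way to read "build so you can swap" is to keep vendor calls behind a thin interface and try providers in order. The sketch below is a minimal illustration under that assumption, not either vendor's SDK: the two provider functions are hypothetical stubs, and in a real system each would wrap the relevant client library.

```python
from typing import Callable, Sequence, Tuple

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try each (name, provider) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# HYPOTHETICAL stand-ins for real vendor clients.
def suspended_model(prompt: str) -> str:
    raise ConnectionError("access suspended")


def available_model(prompt: str) -> str:
    return f"[stub reply to: {prompt}]"


used, reply = complete_with_fallback(
    "Fix the failing test",
    [("primary", suspended_model), ("fallback", available_model)],
)
print(used, reply)
```

Because application code only sees `complete_with_fallback`, changing the order or adding a third provider is a configuration change rather than a rewrite. Prompts and evaluation sets still need rechecking per model, since behaviour differs even when the interface does not.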

What to do next

  1. Write the job-to-be-done before looking at another product.
  2. Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
  3. Run one small pilot and remove anything the team does not use weekly.
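Step 2 can be made concrete with a simple weighted score. The sketch below assumes 1 to 5 ratings on the four criteria named in the list; the weights, tool names and ratings are hypothetical examples, not a recommendation.

```python
# HYPOTHETICAL weights; adjust to what matters in your business. They sum to 1.
WEIGHTS = {
    "workflow_fit": 0.4,
    "data_handling": 0.3,
    "cost": 0.2,
    "owner_readiness": 0.1,
}


def weighted_score(ratings: dict, weights: dict = WEIGHTS) -> float:
    """Combine 1-5 ratings into one score; every criterion must be rated."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return round(sum(weights[name] * ratings[name] for name in weights), 2)


# HYPOTHETICAL shortlist with made-up ratings.
shortlist = {
    "Tool A": {"workflow_fit": 4, "data_handling": 3, "cost": 2, "owner_readiness": 5},
    "Tool B": {"workflow_fit": 3, "data_handling": 5, "cost": 4, "owner_readiness": 2},
}

ranked = sorted(shortlist, key=lambda name: weighted_score(shortlist[name]), reverse=True)
for name in ranked:
    print(name, weighted_score(shortlist[name]))
```

In this made-up example Tool B scores 3.7 against Tool A's 3.4. The score ranks the shortlist; the pilot in step 3 is still what tells you whether the team uses the tool weekly.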

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
