GPT-5.5 review: OpenAI's 'Spud' codename explained
Release date: 23 April 2026 | Status: Active | Licence: Closed
OpenAI shipped GPT-5.5 on 23 April 2026 under an internal codename that says more than the marketing did: "Spud." A potato. Nothing glamorous, but it turns up in everything and rarely lets you down. Axios reported the codename alongside the launch, and the joke landed. This is not the model OpenAI wants on the billboard. It is the one that quietly does the cooking.
For Australian business teams, the question is simpler than the hype suggests: is GPT-5.5 worth paying for, and where does it fit next to Claude and Gemini? The short version is that it is a competent generalist with one real catch, what it charges you to talk back. On output pricing, it sits at the top of its tier, and that single number reshapes who should bother with it.
A note before the numbers: this is a closed, proprietary model. There are no open weights. OpenAI offers it through ChatGPT, Codex and the API, and that is the only way in.
One caveat on the figures below. Several of the benchmark and context-window numbers in early write-ups, including some quoted here, do not line up with OpenAI's own documentation. We have flagged those inline rather than scrub them, because the gap between the rumour mill and the spec sheet is part of the story.
Benchmarks at a glance
| Metric | Score | Notes |
|---|---|---|
| SWE-bench Pro | 58.6% | Solid mid-tier |
| MMLU | 88.4% | Competitive with Sonnet 4.6 |
| Context window | 400K tokens | Half of 1M models |
| Price (input) | $5.00 / 1M tokens | Premium tier |
| Price (output) | $30.00 / 1M tokens | Highest output price in tier |
The number worth staring at is output pricing. At $5.00 input and $30.00 output per million tokens, GPT-5.5 charges roughly 20% more on output than Opus 4.8 at $25.00. It has reportedly been pitched as double the cost of Gemini 3.1 Pro at a $10.50 output rate, though current pricing trackers put Gemini 3.1 Pro nearer $12.00, which would make the real gap closer to 2.5x. Either way, the direction is the same: if your workload produces a lot of tokens, long-form content, chatty coding assistants, anything verbose, GPT-5.5 will cost you.
What 'Spud' delivered
Read against its predecessor, GPT-5.5 is more tune-up than reinvention. OpenAI positioned it as a step up from GPT-5.4 on coding, knowledge work and scientific research, with fewer hallucinations and what the company called "a new class of intelligence." Our read is more measured: in practice it is a model that holds steady on instruction following and rarely throws a tantrum, but rarely dazzles either. The codename fits.
The coding figures are where the rumour mill and the record part ways. Early write-ups put GPT-5.5 at 58.6% on SWE-bench Pro, which would land it in the upper-middle tier. That number is unconfirmed and does not match the figures since reported elsewhere, other accounts cite full SWE-bench scores near 88.7% and a headline Terminal-Bench 2.0 result of 82.7%. In hands-on use the pattern is consistent regardless of the benchmark: it handles Python and JavaScript well, gets stuck on Rust and Haskell, and debugs reliably without doing anything clever.
The 88.4% MMLU score, said to trail Sonnet 4.6 by 0.8 points and Opus 4.8 by 1.4, is also unverified, one tracker puts GPT-5.5 closer to 92.4%. Treat the precise gap-to-rivals as a claim, not a measurement. If it is real, it is the kind of difference you only notice with two models open side by side.
The 400K context limitation
This is the part to read with one eyebrow up. The 400K-token context window quoted above is contradicted by OpenAI's own spec. The API docs list a context window of roughly 1,050,000 tokens, about 1M, not 400K, with up to 128K output. So the "half of 1M models" framing, and the idea that GPT-5.5 is outgunned on long documents, appears to be wrong at the source.
If the 400K figure were accurate, the trade-off would matter: for most jobs 400K is plenty, but for chewing through large monorepos or long legal bundles you would want a 1M model. On the official numbers, GPT-5.5 already is one of those, and the comparison collapses. The competitors sometimes named in that bracket, Claude Opus 4.8, Gemini 3.5 Flash, MiniMax M3, are listed on the strength of that disputed premise, so take the line-up as unconfirmed.
Verdict
Strip away the shaky benchmark and context claims and you are left with a steady, capable model carrying one genuine liability: output price. At $5/$30 it is a hard sell over Opus 4.8 at $5/$25 unless you are already living in OpenAI's world, custom GPTs, the Assistants API, integrations you have built and do not want to rewrite. The "Spud" codename gave the game away by accident. This is a workhorse, not a show pony.


