Analysis
When OpenAI ships a new model, the codename usually stays behind the curtain. This time the company let it out the front door. GPT-5.5 went by "Spud", a nod to a running potato theme in OpenAI's teasers, and the name turned up in API docs and in employees' social posts. (Reports differ on exactly how Sam Altman marked the launch; he is said to have posted something like "GPT-5.5 is here!" rather than the often-quoted "Spud's here", which we couldn't confirm.)
The informality was the point. After all the noise around GPT-5, OpenAI seemed keen to set expectations: 5.5 is an upgrade, not a reinvention. If you run a business deciding where to spend your AI budget, that framing matters. You're not being asked to relearn anything. You're being asked whether a better version of a tool you already know is worth the price.
The benchmarks back up the "upgrade, not reinvention" read, though they come with a caveat worth stating up front: the numbers in the original write-up mix up which test is which, and a couple of the comparison figures don't hold up against public sources. We've flagged those as we go. The honest summary is that Spud is genuinely better than GPT-5, sits in the same tier as its rivals, and trails the best coding models on the hardest tasks.
On SWE-bench, the article cites 58.6% for GPT-5.5 against 52.1% for GPT-5. The 58.6% figure looks like a SWE-bench Pro result, not the SWE-bench Verified score it was labelled as; OpenAI's reported Verified number for the model is around 88.7%, and we couldn't find a source for the 52.1% GPT-5 baseline. For context, Claude Fable 5 reportedly scored 80.3% on SWE-bench Pro before it was pulled offline in June 2026 under a US export-control order, and Claude Opus 4.8 sits at 69.2% on the same Pro benchmark. The takeaway holds even if the individual numbers are messy: on the toughest coding evals, Spud is competitive but not the leader.
The wider-knowledge benchmarks are harder to pin down. The article puts Spud at 86.4% on MMLU-Pro, up from GPT-5's 83.7%, but we couldn't corroborate either figure; some sources quote a higher MMLU result for the model. It also claims 71.2% on GPQA Diamond, a graduate-level science test, and says that narrowly beats Gemini 3.1 Pro's 70.8%. Treat both as unconfirmed. The Gemini comparison in particular looks wrong: public benchmarks put Gemini 3.1 Pro's GPQA Diamond score far higher, around 94.3%, so the 70.8% claim doesn't stand.
Pricing and Positioning
GPT-5.5 costs $5 per million input tokens and $30 per million output tokens. On input that matches Claude Opus 4.8; on output it runs 20% higher than Opus 4.8's standard $25. Opus also has a faster, pricier tier, so the comparison shifts depending on which Opus you mean.
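To see what that output premium means on an actual bill, here is a rough sketch that prices one workload at both rate cards. The per-million prices are the ones quoted above; the monthly token volumes, the `Pricing` class, and the `workload_cost` helper are hypothetical, chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-token list prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def workload_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the given rate card."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Prices as quoted in the article (standard tiers only).
GPT_5_5 = Pricing(input_per_million=5.0, output_per_million=30.0)
OPUS_4_8_STANDARD = Pricing(input_per_million=5.0, output_per_million=25.0)

# Hypothetical monthly volume: an assumption, not a figure from the article.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 40_000_000

gpt_cost = workload_cost(GPT_5_5, INPUT_TOKENS, OUTPUT_TOKENS)
opus_cost = workload_cost(OPUS_4_8_STANDARD, INPUT_TOKENS, OUTPUT_TOKENS)
premium = (gpt_cost - opus_cost) / opus_cost

print(f"GPT-5.5:  ${gpt_cost:,.2f}")   # GPT-5.5:  $2,200.00
print(f"Opus 4.8: ${opus_cost:,.2f}")  # Opus 4.8: $2,000.00
print(f"Premium:  {premium:.0%}")      # Premium:  10%
```

With this input-heavy mix, the 20% gap in output price shrinks to a 10% gap on the total bill, because the input side costs the same on both. A workload that generates more output relative to input would push the gap closer to the full 20%.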
On context window, the original article is off the mark. It says GPT-5.5 supports 256,000 tokens, "half of Opus 4.8's 1-million-token capacity." In fact GPT-5.5 ships with a roughly 1M-token API context window; the 256K figure looks like an effective-window quirk of the Codex platform rather than the model's actual limit. Opus 4.8 does offer 1M tokens, and so do rivals like MiniMax M3. (The article also names "DeepSeek V3.5" as a 1M-token option, but we couldn't verify that model; DeepSeek's current release line is V4.) So the "half the context of the competition" worry in the original piece doesn't apply.
OpenAI did use the 5.5 release to split the line into tiers. GPT-5.5 Pro launched alongside the base model with higher rate limits and priority access for enterprise customers. The original article lists Pro at $8/$40 per million tokens, but we couldn't confirm that; most pricing trackers put Pro at $5/$30. A lighter variant, GPT-5.5 Instant, arrived on 5 May as the ChatGPT default; the article's specifics for it ($0.50/$1.50 pricing, 51.2% SWE-bench, 82.1% MMLU-Pro) are likewise unconfirmed, so read those as reported rather than established.

Real-World Performance
Benchmarks only get you so far. Across coding, analysis, writing, and reasoning, GPT-5.5 shows a few areas where the improvement over GPT-5 is easy to feel. A note on the figures below: the original article attributes them to "independent testing" without naming a study, and we couldn't trace any of them to a public source. Take the specific percentages as illustrative, not gospel.
On code, Spud reportedly handles big repositories better. Given a 50,000-line codebase and a feature to build, it found the right files about 87% of the time against 79% for GPT-5, according to the article's unsourced testing. The output was also said to read more like code a person would write: reviewers who didn't know which model produced what rated Spud's work "production-ready" 64% of the time versus 51% for GPT-5.
Reasoning is better in a quieter way. Spud is reportedly less likely to make the "premature conclusion" mistake, where a model latches onto a halfway answer and never checks it. On a set of 200 problems built to trigger exactly that error, the article reports Spud tripping up 12% of the time against GPT-5's 23%. Again, that test isn't sourced, but the pattern, fewer confident-but-wrong answers, matches what a lot of teams want from a working assistant.
Creative writing is where the article makes its strongest case for Spud. The prose has better pacing, steadier character voice, and dialogue that sounds less robotic. In a blind read by 50 published authors, the piece says Spud's samples were called "human-like" 41% of the time, up from 28% for GPT-5. That's still under half, and we should be clear this evaluation isn't sourced either, but the direction of travel is the interesting part.
Where It Falls Short
The gains are real, and so are the limits. The headline coding number, whichever benchmark it actually came from, sits below the best models on the hardest work. Spud is good at the everyday stuff: explaining code, debugging, writing small functions. It has more trouble with big architectural calls and large refactors, the kind of job Fable 5 was reported to handle more cleanly before it was pulled.
Context window is less of a real worry than the original article suggested. Once you correct the 256K figure to the actual ~1M, GPT-5.5 is on par with rivals rather than behind them, so the "stuck with a small window" concern largely goes away.
The agentic question is murkier. The article says GPT-5.5 lacks native agentic features, meaning the kind of multi-step tool use with planning and self-correction that shows up in Claude's Dynamic Workflows and in what the article calls Google's "Agents CLI" (a product name we couldn't verify; it may be a reference to Gemini CLI). That claim is contestable. OpenAI marketed GPT-5.5 as strong at agentic coding, pointing to results like 82.7% on Terminal-Bench 2.0, which is hard to square with "no native agentic capabilities." The fair version is that OpenAI's agentic story is framed differently from its rivals', and whether it meets your needs depends on what you're building.


