GPT-5.5 'Spud' Reviewed: OpenAI's Smartest Model Yet, But Is It Enough?

GPT-5.5, codenamed 'Spud', launched on 23 April 2026 as OpenAI's most capable model. We put it through comprehensive testing to see how it stacks up against the competition.

Daniel Fleuren | 2026-05-02 | 12 min read | Founders and operators | Updated 2026-06-19

Written by Daniel Fleuren, Founder, AI Kick Start. 20+ years enterprise IT.


Decision: Shortlist. Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch: Shelfware. A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect: Pilot score. Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

GPT-5.5, nicknamed "Spud", is OpenAI's most capable model so far, [launched on 23 April 2026](https://en.wikipedia.org/wiki/GPT-5.5) at [$5/$30 per million tokens](https://openrouter.ai/openai/gpt-5.5). The original article's headline benchmark figure, a 58.6% SWE-bench score, appears to be a SWE-bench Pro result rather than the SWE-bench Verified score it was first labelled as; OpenAI's own reported Verified number is closer to 88.7%. Either way, Spud is a clear step up from GPT-5. Whether it beats Anthropic's Fable 5 on coding depends heavily on which benchmark you trust. The more useful story is how it behaves in daily work, where it shows real gains in reasoning, instruction following, and creative writing.

Key takeaways

GPT-5.5 "Spud" reportedly scores 58.6% on SWE-bench (a Pro figure, not the Verified score it was first labelled), 86.4% MMLU-Pro, and 71.2% GPQA Diamond; the MMLU-Pro and GPQA numbers are unconfirmed, and OpenAI's reported SWE-bench Verified is closer to 88.7% (Source: OpenAI, 2026)
Base pricing is [$5/$30](https://openrouter.ai/openai/gpt-5.5) per million input/output tokens; Pro is widely listed at the same $5/$30 (not the $8/$40 originally claimed), and Instant's $0.50/$1.50 is unconfirmed (Source: OpenAI, 2026)
A test we couldn't trace to a named study reported 87% accuracy in finding relevant files in large codebases; treat it as illustrative (Source: unnamed independent testing, 2026)
Whether GPT-5.5 has native agentic capabilities is contested: the original piece says no, but OpenAI markets it as strong at agentic coding (Source: OpenAI, 2026)

Analysis

When OpenAI ships a new model, the codename usually stays behind the curtain. This time the company let it out the front door. GPT-5.5 went by "Spud", a nod to a running potato theme in OpenAI's teasers, and the name turned up in API docs and in employees' social posts. (Reports differ on exactly how Sam Altman marked the launch; he is said to have posted something like "GPT-5.5 is here!" rather than the often-quoted "Spud's here", which we couldn't confirm.)

The informality was the point. After all the noise around GPT-5, OpenAI seemed keen to set expectations: 5.5 is an upgrade, not a reinvention. If you run a business deciding where to spend your AI budget, that framing matters. You're not being asked to relearn anything. You're being asked whether a better version of a tool you already know is worth the price.

The benchmarks back up the "upgrade, not reinvention" read, though they come with a caveat worth stating up front: the numbers in the original write-up mix up which test is which, and a couple of the comparison figures don't hold up against public sources. We've flagged those as we go. The honest summary is that Spud is genuinely better than GPT-5, sits in the same tier as its rivals, and trails the best coding models on the hardest tasks.

On SWE-bench, the article cites 58.6% for GPT-5.5 against 52.1% for GPT-5. The 58.6% figure looks like a SWE-bench Pro result, not the SWE-bench Verified score it was labelled as; OpenAI's reported Verified number for the model is around 88.7%, and we couldn't find a source for the 52.1% GPT-5 baseline. For context, Claude Fable 5 reportedly scored 80.3% on SWE-bench Pro before it was pulled offline in June 2026 under a US export-control order, and Claude Opus 4.8 sits at 69.2% on the same Pro benchmark. The takeaway holds even if the individual numbers are messy: on the toughest coding evals, Spud is competitive but not the leader.

The wider-knowledge benchmarks are harder to pin down. The article puts Spud at 86.4% on MMLU-Pro, up from GPT-5's 83.7%, but we couldn't corroborate either figure; some sources quote a higher MMLU result for the model. It also claims 71.2% on GPQA Diamond, a graduate-level science test, and says that narrowly beats Gemini 3.1 Pro's 70.8%. Treat both as unconfirmed. The Gemini comparison in particular looks wrong: public benchmarks put Gemini 3.1 Pro's GPQA Diamond score far higher, around 94.3%, so the 70.8% claim doesn't stand.

Pricing and Positioning

GPT-5.5 costs $5 per million input tokens and $30 per million output tokens. On input that matches Claude Opus 4.8; on output it runs about 20% higher than Opus 4.8's standard $25, though Opus also has a faster, pricier tier, so the comparison shifts depending on which Opus you mean.
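To make the price gap concrete, here is a minimal sketch of the arithmetic using only the per-million-token rates quoted above. The dictionary keys are hypothetical labels for this example, not official API model identifiers, and the 20,000-input/2,000-output workload is an assumed request size chosen for illustration.

```python
# Prices in USD per million tokens, copied from the figures quoted in this article.
# The keys are illustrative labels, not real API model names.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "opus-4.8-standard": {"input": 5.00, "output": 25.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million-token rates."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def output_premium(model: str, baseline: str) -> float:
    """Return how much higher the model's output price is than the baseline's, as a fraction."""
    return PRICES[model]["output"] / PRICES[baseline]["output"] - 1


if __name__ == "__main__":
    # Assumed workload: 20,000 input tokens and 2,000 output tokens per request.
    for name in PRICES:
        print(name, round(request_cost(name, 20_000, 2_000), 4))
    print(f"output premium: {output_premium('gpt-5.5', 'opus-4.8-standard'):.0%}")
```

Because input prices match, the gap only shows up in proportion to how much output a workload generates; a summarisation job that reads a lot and writes little will see a much smaller difference than a code-generation job.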

On context window, the original article is off the mark. It says GPT-5.5 supports 256,000 tokens, "half of Opus 4.8's 1-million-token capacity." In fact GPT-5.5 ships with a roughly 1M-token API context window; the 256K figure looks like an effective-window quirk of the Codex platform rather than the model's actual limit. Opus 4.8 does offer 1M tokens, and so do rivals like MiniMax M3. (The article also names "DeepSeek V3.5" as a 1M-token option, but we couldn't verify that model; DeepSeek's current release line is V4.) So the "half the context of the competition" worry in the original piece doesn't apply.

OpenAI did use the 5.5 release to split the line into tiers. GPT-5.5 Pro launched alongside the base model with higher rate limits and priority access for enterprise customers. The original article lists Pro at $8/$40 per million tokens, but we couldn't confirm that; most pricing trackers put Pro at $5/$30. A lighter variant, GPT-5.5 Instant, arrived on 5 May as the ChatGPT default; the article's specifics for it ($0.50/$1.50 pricing, 51.2% SWE-bench, 82.1% MMLU-Pro) are likewise unconfirmed, so read those as reported rather than established.


Real-World Performance

Benchmarks only get you so far. Across coding, analysis, writing, and reasoning, GPT-5.5 shows a few areas where the improvement over GPT-5 is easy to feel. A note on the figures below: the original article attributes them to "independent testing" without naming a study, and we couldn't trace any of them to a public source. Take the specific percentages as illustrative, not gospel.

On code, Spud reportedly handles big repositories better. Given a 50,000-line codebase and a feature to build, it found the right files about 87% of the time against 79% for GPT-5, according to the article's unsourced testing. The output was also said to read more like code a person would write: reviewers who didn't know which model produced what rated Spud's work "production-ready" 64% of the time versus 51% for GPT-5.

Reasoning is better in a quieter way. Spud is reportedly less likely to make the "premature conclusion" mistake, where a model latches onto a halfway answer and never checks it. On a set of 200 problems built to trigger exactly that error, the article reports Spud tripping up 12% of the time against GPT-5's 23%. Again, that test isn't sourced, but the pattern, fewer confident-but-wrong answers, matches what a lot of teams want from a working assistant.
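One way to judge how much weight a result like this can bear is to put a rough uncertainty band around the gap. The sketch below is ours, not the original article's method: it assumes both models were run on the same 200 problems, converts the reported 12% and 23% into counts (24 and 46), and uses a simple normal approximation that treats the two runs as independent samples. A paired analysis would be more appropriate if per-problem results were available, which they are not.

```python
import math


def error_rate_gap(errors_a: int, errors_b: int, n: int, z: float = 1.96):
    """Return (gap, low, high) for the difference in error rates, b minus a.

    Uses a normal-approximation interval for two independent proportions,
    each measured on n problems. z=1.96 corresponds to roughly 95% coverage.
    """
    rate_a = errors_a / n
    rate_b = errors_b / n
    std_err = math.sqrt(rate_a * (1 - rate_a) / n + rate_b * (1 - rate_b) / n)
    gap = rate_b - rate_a
    return gap, gap - z * std_err, gap + z * std_err


if __name__ == "__main__":
    # Counts derived from the reported percentages on 200 problems (an assumption).
    gap, low, high = error_rate_gap(errors_a=24, errors_b=46, n=200)
    print(f"gap: {gap:.1%}, interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 4 to 18 percentage points, so the reported improvement would be unlikely to be noise, but its true size is far less precise than the headline 11-point gap suggests. None of that helps if the underlying test cannot be verified.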

Creative writing is where the article makes its strongest case for Spud. The prose has better pacing, steadier character voice, and dialogue that sounds less robotic. In a blind read by 50 published authors, the piece says Spud's samples were called "human-like" 41% of the time, up from 28% for GPT-5. That's still under half, and we should be clear this evaluation isn't sourced either, but the direction of travel is the interesting part.

Where It Falls Short

The gains are real, and so are the limits. The headline coding number, whichever benchmark it actually came from, sits below the best models on the hardest work. Spud is good at the everyday stuff: explaining code, debugging, writing small functions. It has more trouble with big architectural calls and large refactors, the kind of job Fable 5 was reported to handle more cleanly before it was pulled.

Context window is less of a real worry than the original article suggested. Once you correct the 256K figure to the actual ~1M, GPT-5.5 is on par with rivals rather than behind them, so the "stuck with a small window" concern largely goes away.

The agentic question is murkier. The article says GPT-5.5 lacks native agentic features, meaning multi-step tool use with planning and self-correction of the kind found in Claude's Dynamic Workflows. (It also cites what it calls Google's "Agents CLI", a product name we couldn't verify; it may be a reference to Gemini CLI.) That claim is contestable. OpenAI marketed GPT-5.5 as strong at agentic coding, pointing to results like 82.7% on Terminal-Bench 2.0, which is hard to square with "no native agentic capabilities." The fair version is that OpenAI's agentic story is framed differently from its rivals', and whether it meets your needs depends on what you're building.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

OpenAI platform documentation

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
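The scoring step above can be sketched as a small weighted matrix. This is an illustrative sketch only: the weights, the 1-to-5 scale, and the tool names are hypothetical placeholders, and a team should set its own before comparing anything.

```python
# Hypothetical weights for the four criteria in the checklist; they must sum to 1.
WEIGHTS = {
    "workflow_fit": 0.4,
    "data_handling": 0.3,
    "cost": 0.2,
    "owner_readiness": 0.1,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 1-5 scale) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank(tools: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (tool, score) pairs sorted from highest to lowest weighted score."""
    scored = [(name, weighted_score(scores)) for name, scores in tools.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder scores for two made-up tools, for illustration only.
    shortlist = {
        "tool_a": {"workflow_fit": 4, "data_handling": 3, "cost": 2, "owner_readiness": 5},
        "tool_b": {"workflow_fit": 3, "data_handling": 5, "cost": 4, "owner_readiness": 2},
    }
    for name, score in rank(shortlist):
        print(f"{name}: {score:.2f}")
```

Writing the weights down before the pilot matters more than the exact numbers: it stops the scoring from being bent afterwards to fit a tool someone already prefers.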


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
