Model Review

GPT-5.5 review: OpenAI's 'Spud' codename explained.

GPT-5.5 launched 23 April 2026 with 58.6% SWE-bench Pro, 88.4% MMLU, and a 400K context. The internal codename 'Spud' hinted at a model that was small but surprisingly capable.

Daniel Fleuren | 2026-06-15 | 12 min read | Founders and operators | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

Key takeaways

**Release date:** 23 April 2026 | **Status:** Active | **Licence:** Closed
GPT-5.5 review: OpenAI's 'Spud' codename explained: OpenAI shipped GPT-5.5 on [23 April 2026](https://en.wikipedia.org/wiki/GPT-5.5) under an internal codename that says more than the marketing did: "Spud." A potato.
Benchmarks at a glance: SWE-bench Pro 58.6% (solid mid-tier), MMLU 88.4% (competitive with Sonnet 4.6), a 400K-token context window (half of 1M models), $5.00 input per 1M tokens (premium tier) and $30.00 output per 1M tokens (highest output price in tier). The number worth staring at is output pricing.
What 'Spud' delivered: Read against its predecessor, GPT-5.5 is more tune-up than reinvention.
The 400K context limitation: This is the part to read with one eyebrow up.
Verdict: Strip away the shaky benchmark and context claims and you are left with a steady, capable model carrying one genuine liability: output price.

Release date: 23 April 2026 | Status: Active | Licence: Closed

OpenAI shipped GPT-5.5 on 23 April 2026 under an internal codename that says more than the marketing did: "Spud." A potato. Nothing glamorous, but it turns up in everything and rarely lets you down. Axios reported the codename alongside the launch, and the joke landed. This is not the model OpenAI wants on the billboard. It is the one that quietly does the cooking.

For Australian business teams, the question is simpler than the hype suggests: is GPT-5.5 worth paying for, and where does it fit next to Claude and Gemini? The short version is that it is a competent generalist with one real catch: what it charges you to talk back. On output pricing it sits at the top of its tier, and that single number reshapes who should bother with it.

A note before the numbers: this is a closed, proprietary model. There are no open weights. OpenAI offers it through ChatGPT, Codex and the API, and that is the only way in.

One caveat on the figures below. Several of the benchmark and context-window numbers in early write-ups, including some quoted here, do not line up with OpenAI's own documentation. We have flagged those inline rather than scrub them, because the gap between the rumour mill and the spec sheet is part of the story.

Benchmarks at a glance

Metric	Value	Notes
SWE-bench Pro	58.6%	Solid mid-tier (unconfirmed; see below)
MMLU	88.4%	Competitive with Sonnet 4.6 (unverified; see below)
Context window	400K tokens	Half of 1M models (disputed; see below)
Price (input)	$5.00 / 1M tokens	Premium tier
Price (output)	$30.00 / 1M tokens	Highest output price in tier

The number worth staring at is output pricing. At $5.00 input and $30.00 output per million tokens, GPT-5.5 charges 20% more on output than Opus 4.8 at $25.00. It has reportedly been pitched as double the cost of Gemini 3.1 Pro at a $10.50 output rate, although $30.00 against $10.50 is already nearer 2.9x. Current pricing trackers put Gemini 3.1 Pro closer to $12.00, which would make the gap 2.5x. Either way, the direction is the same: if your workload produces a lot of tokens (long-form content, chatty coding assistants, anything verbose), GPT-5.5 will cost you.
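To make the pricing arithmetic concrete, here is a minimal sketch in Python. The per-million rates are the ones quoted in this review, including the two competing Gemini 3.1 Pro output figures. The monthly token volumes are a hypothetical workload chosen purely for illustration, so swap in your own usage before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens, as quoted in this review (not verified against vendor price lists)."""
    input_per_million: float
    output_per_million: float


GPT_5_5 = Pricing(input_per_million=5.00, output_per_million=30.00)
OPUS_4_8 = Pricing(input_per_million=5.00, output_per_million=25.00)

# Two competing output rates reported for Gemini 3.1 Pro (USD per 1M tokens).
GEMINI_OUTPUT_PITCHED = 10.50
GEMINI_OUTPUT_TRACKER = 12.00


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the given per-million rates."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def output_ratio(rate_a: float, rate_b: float) -> float:
    """How many times more expensive rate_a is than rate_b."""
    return rate_a / rate_b


if __name__ == "__main__":
    # Hypothetical output-heavy month: 2M tokens in, 10M tokens out.
    tokens_in, tokens_out = 2_000_000, 10_000_000
    print(f"GPT-5.5:  ${job_cost(GPT_5_5, tokens_in, tokens_out):.2f}")
    print(f"Opus 4.8: ${job_cost(OPUS_4_8, tokens_in, tokens_out):.2f}")
    print(f"vs Opus 4.8 output:        {output_ratio(30.00, 25.00):.2f}x")
    print(f"vs Gemini at $10.50 output: {output_ratio(30.00, GEMINI_OUTPUT_PITCHED):.2f}x")
    print(f"vs Gemini at $12.00 output: {output_ratio(30.00, GEMINI_OUTPUT_TRACKER):.2f}x")
```

On that hypothetical workload the output line dominates the bill, which is why the output rate matters more than the input rate for verbose jobs.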

What 'Spud' delivered

Read against its predecessor, GPT-5.5 is more tune-up than reinvention. OpenAI positioned it as a step up from GPT-5.4 on coding, knowledge work and scientific research, with fewer hallucinations and what the company called "a new class of intelligence." Our read is more measured: in practice it is a model that holds steady on instruction following and rarely throws a tantrum, but rarely dazzles either. The codename fits.

The coding figures are where the rumour mill and the record part ways. Early write-ups put GPT-5.5 at 58.6% on SWE-bench Pro, which would land it in the upper-middle tier. That number is unconfirmed and does not match the figures since reported elsewhere: other accounts cite full SWE-bench scores near 88.7% and a headline Terminal-Bench 2.0 result of 82.7%. In hands-on use the pattern is consistent regardless of the benchmark. It handles Python and JavaScript well, gets stuck on Rust and Haskell, and debugs reliably without doing anything clever.

The 88.4% MMLU score, said to trail Sonnet 4.6 by 0.8 points and Opus 4.8 by 1.4, is also unverified; one tracker puts GPT-5.5 closer to 92.4%. Treat the precise gap to rivals as a claim, not a measurement. If it is real, it is the kind of difference you only notice with two models open side by side.

The 400K context limitation

This is the part to read with one eyebrow up. The 400K-token context window quoted above is contradicted by OpenAI's own spec. The API docs list a context window of roughly 1,050,000 tokens, about 1M, not 400K, with up to 128K output. So the "half of 1M models" framing, and the idea that GPT-5.5 is outgunned on long documents, appears to be wrong at the source.

If the 400K figure were accurate, the trade-off would matter: for most jobs 400K is plenty, but for chewing through large monorepos or long legal bundles you would want a 1M model. On the official numbers, GPT-5.5 already is one of those, and the comparison collapses. The competitors sometimes named in that bracket (Claude Opus 4.8, Gemini 3.5 Flash, MiniMax M3) are listed on the strength of that disputed premise, so take the line-up as unconfirmed.
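A quick way to see what is at stake between the two figures is to check whether a given document would fit under each. The sketch below is illustrative only. It assumes roughly four characters per token, which is a common rule of thumb for English text and not a tokenizer measurement, and it assumes the reserved output budget counts against the same window. The 2.4M-character bundle is a hypothetical example.

```python
import math

# Assumption: about 4 characters per token for English prose (rule of thumb only).
CHARS_PER_TOKEN = 4

# The two context-window figures discussed in this review.
DISPUTED_WINDOW = 400_000        # figure quoted in early write-ups
DOCUMENTED_WINDOW = 1_050_000    # approximate figure attributed to OpenAI's API docs
MAX_OUTPUT_TOKENS = 128_000      # output ceiling attributed to the same docs


def estimate_tokens(char_count: int) -> int:
    """Rough token estimate from a character count, rounded up."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


def fits(input_tokens: int, window: int, reserved_output: int = 0) -> bool:
    """True if the input plus the reserved output budget fits in the window."""
    return input_tokens + reserved_output <= window


if __name__ == "__main__":
    # Hypothetical legal bundle of 2.4M characters.
    bundle_tokens = estimate_tokens(2_400_000)
    for label, window in (("400K (disputed)", DISPUTED_WINDOW),
                          ("~1.05M (documented)", DOCUMENTED_WINDOW)):
        ok = fits(bundle_tokens, window, reserved_output=MAX_OUTPUT_TOKENS)
        print(f"{label}: {'fits' if ok else 'does not fit'}")
```

For anything close to the limit, count tokens with the provider's own tokenizer instead of a character heuristic.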

Verdict

Strip away the shaky benchmark and context claims and you are left with a steady, capable model carrying one genuine liability: output price. At $5/$30 it is a hard sell over Opus 4.8 at $5/$25 unless you are already living in OpenAI's world: custom GPTs, the Assistants API, integrations you have built and do not want to rewrite. The "Spud" codename gave the game away by accident. This is a workhorse, not a show pony.

Score: 7.5 / 10

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

OpenAI platform documentation

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
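The scoring step above can be sketched as a simple weighted scorecard. Everything specific here is an assumption for illustration: the weights, the 1 to 5 scale, and the two placeholder tools and their scores are hypothetical, not results from any pilot.

```python
# Hypothetical weights (percent) for the four criteria named in this article.
WEIGHTS = {
    "workflow_fit": 40,
    "data_handling": 25,
    "cost": 20,
    "owner_readiness": 15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine 1-5 criterion scores into a single weighted score on the same scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / sum(WEIGHTS.values())


if __name__ == "__main__":
    # Placeholder pilot results on a 1 (poor) to 5 (strong) scale.
    shortlist = {
        "Tool A": {"workflow_fit": 4, "data_handling": 3, "cost": 2, "owner_readiness": 5},
        "Tool B": {"workflow_fit": 3, "data_handling": 5, "cost": 4, "owner_readiness": 2},
    }
    ranked = sorted(shortlist, key=lambda name: weighted_score(shortlist[name]), reverse=True)
    for name in ranked:
        print(f"{name}: {weighted_score(shortlist[name]):.2f}")
```

Agree the weights before the pilot starts, so the scorecard cannot be tuned afterwards to favour a tool someone already prefers.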

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
