Analysis
If you've shortlisted an AI model lately, you've probably stared at a row of benchmark scores and felt none the wiser. Five models, all scoring somewhere in the high 80s and 90s, and no obvious way to tell which one will actually do your job better.
That's the problem in a sentence. The tests we've used for years to rank AI models are getting too easy for the models to beat. When everything scores near the top, the scoreboard stops telling you anything useful.
This isn't a scandal so much as a sign the field has moved on. The headline numbers still get quoted in launch posts and sales decks, but the people choosing models for real work have quietly stopped trusting them. For Australian teams picking a tool for support, document handling, or coding, the takeaway is simple: a high public benchmark score is a weak reason to choose one model over another. What matters is how it does on your tasks.
Below is what's driving the saturation, why it bites, and what's replacing the old scoreboard.
Why Benchmarks Are Saturating
Three things push scores toward the ceiling: contamination, overfitting, and the models simply getting better.
Contamination is when the benchmark itself leaks into the training data. Most major benchmarks are published openly, and modern training runs scrape close to the entire public web. So a model can "see" the questions and answers before it's ever tested, and a strong score then reflects memorisation, not reasoning. The range here is well documented in the literature, with older, heavily discussed benchmarks contaminated worst; one widely cited figure puts roughly 30-50% of popular benchmark content inside the training sets of major models, though that aggregate number is more a rough read across several studies than a single confirmed measurement (Source: contamination research, 2025).
Overfitting is when developers tune for the test on purpose. That can mean fine-tuning on benchmark-style data, prompt engineering aimed at the exact behaviour a benchmark rewards, or just picking the model variant that posts the best score. You end up with benchmark athletes: systems tuned to ace specific tests without being noticeably better at the work people actually need done.
The third driver is the honest one. Models genuinely keep improving, and any fixed test eventually gets fully solved by a capable enough system. The original MMLU spans 57 subjects; its harder successor, MMLU-Pro, consolidates those into 14 broader categories and bumps the answer choices from four to ten to make guessing harder. Either way, a model trained on most of recorded human knowledge answering most undergraduate-level questions correctly isn't shocking. In that sense, saturation is a win: it means AI has caught up to human-level performance on these particular tasks.

The Consequences of Saturation
The first cost is practical. Benchmark scores no longer help you choose. When five models all land between 85% and 95% on MMLU-Pro, that 10-point spread tells you almost nothing about which will handle your specific job better. So buyers fall back on word of mouth, vendor marketing, or expensive private testing.
The second cost is the incentives it creates. If public benchmarks can't separate the field, there's less reward for hard capability work and more for benchmark-gaming tricks. Some analysts argue this slowly drains the value benchmarks were supposed to provide as a spur to real progress (Source: industry analysis, 2026), it's Goodhart's law applied to AI, and an argued risk rather than a measured one.
The third cost is the dangerous one. A model that scores 94% on a safety test still fails on the other 6%, and those failures can be exactly the cases you most needed it to get right. A high average can hide the failure modes that matter.
The Industry Response
The field is adjusting in a few directions.
Dynamic benchmarks, which change over time so models can't memorise them, are gaining ground. The LiveBench project releases fresh questions every month and archives the old ones rather than scoring on them. The team describes it as contamination-limited, and the monthly refresh means scores lean far more on live reasoning than on recall.
Task-specific evaluation is replacing the general leaderboard for a lot of real use. Instead of "which model has the highest MMLU score?", teams are asking "which model is best at my task?" That means building your own evaluation set, which costs time and money but tells you something you can actually act on.
Private evaluations run by independent firms are becoming the standard for high-stakes choices. Several consultancies now offer confidential testing against proprietary datasets built to mirror real-world use. They're reportedly pricey, figures in the tens of thousands of dollars per model get mentioned, though that number isn't publicly confirmed, but they surface what public benchmarks can't.


