Model Review

Best model for RAG systems: Context vs accuracy.

RAG systems need large context windows for retrieved documents and high MMLU for comprehension. We rank Gemini 3.5 Flash, DeepSeek V3.5, MiniMax M3, and Opus 4.8 for retrieval-augmented generation.

Daniel Fleuren | 2026-06-15 | 11 min read | Founders and operators | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

A RAG system lives or dies on two things: how much retrieved text the model can hold at once, and how well it understands what it's reading. The cheap-but-capable end of the market (MiniMax M3, the budget tiers) looks attractive on paper, and premium models like Claude Opus 4.8 and GPT-5.5 cost a lot more for a modest accuracy bump. But a warning up front: several of the headline numbers that were circulating when this piece was first drafted don't hold up. We've corrected the record inline. Treat any pricing or model name we've flagged below as unconfirmed, and check the live spec sheet before you budget.

Key takeaways

Context window and comprehension are the two levers that decide RAG model choice; price and speed decide whether it's affordable at volume.
The Gemini 3.5 Flash pricing that drove the original recommendation ($0.35 / $0.70) is unconfirmed and likely wrong; [real rates look closer to $1.50 / $9.00](https://llm-stats.com/models/gemini-3.5-flash), which changes the cost case.
"DeepSeek V3.5" could not be verified as a real model; [DeepSeek's change log](https://api-docs.deepseek.com/updates) goes V3.2 to V4. Don't build a private-RAG plan around it.
Verified, ship-today options: [Opus 4.8](https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8) (1M, $5/$25), [Sonnet 4.6](https://www.anthropic.com/news/claude-sonnet-4-6) (1M, $3/$15), and [MiniMax M3](https://openrouter.ai/minimax/minimax-m3) (1M, $0.30/$1.20).
Treat every MMLU figure here as indicative; none could be confirmed against vendor documentation, and most vendors now report MMLU-Pro instead.

Analysis

If you've built anything with Retrieval-Augmented Generation, you already know the awkward truth: the model is only half the system. You pull the right documents, stuff them into the prompt, and hope the model can read all of it and answer without making things up. Get the retrieval wrong and the best model in the world gives you confident nonsense. Get it right and a cheaper model can carry you a long way.

So which model should you actually run behind your RAG pipeline? That's the question Australian teams keep asking, usually with one eye on the monthly bill. The honest answer in mid-2026 is that it depends on two numbers, context window and comprehension, and on whether you can send your documents to a third-party API at all.

A note before we go further. When this comparison was first put together, it leaned on a set of prices and a model name that turned out to be wrong. One of the "cheapest" options didn't exist, and the supposed bargain pricing on another was off by a factor of four or more. We've left the original figures in place so you can see where the cost case came from, but we've marked each problem clearly. Read the corrections, not just the table.

RAG model requirements

Four things matter, roughly in this order.

Context window. It has to fit your retrieved chunks plus the query plus the system prompt. A 1M-token window lets you pass more chunks, or bigger chunks, which usually means better recall.
MMLU. A rough proxy for general knowledge and comprehension. Higher MMLU tends to mean the model synthesises retrieved material more reliably. (Caveat below: the exact MMLU figures in this piece could not be confirmed against vendor docs.)
Price. RAG runs at volume. Per-token cost compounds fast, so the input and output rates are not a footnote, they're the budget.
Speed. For anything interactive, latency is part of the product.
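The context-window requirement above comes down to simple token arithmetic. Here is a minimal sketch of that check. The function name, the default query and system-prompt sizes, and the output reserve are all illustrative assumptions, and whether generated output counts against the same window varies by vendor.

```python
def context_budget(window_tokens, n_chunks, tokens_per_chunk,
                   query_tokens=200, system_tokens=1_000, output_reserve=4_000):
    """Return (used, remaining) tokens for one RAG request.

    The defaults are hypothetical placeholders; measure your own prompts.
    output_reserve assumes the reply shares the window with the prompt.
    """
    used = (n_chunks * tokens_per_chunk
            + query_tokens + system_tokens + output_reserve)
    return used, window_tokens - used


# 50 retrieved documents of 10K tokens each, against a 1M-token window
used, remaining = context_budget(1_000_000, n_chunks=50, tokens_per_chunk=10_000)
print(f"used={used:,} remaining={remaining:,}")

# The same retrieval set against a 128K window leaves a negative remainder
_, remaining_small = context_budget(128_000, n_chunks=50, tokens_per_chunk=10_000)
print(f"remaining in 128K window: {remaining_small:,}")
```

With these assumed defaults, the 50-document case uses 505,200 tokens of a 1M window, while a 128K window falls short by several hundred thousand tokens.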

The RAG leaderboard

Model	Context	MMLU	Input price ($/1M tokens)	Output price ($/1M tokens)	RAG score
Gemini 3.5 Flash	1M	86.8%	$0.35	$0.70	9.0/10
DeepSeek V3.5	1M	85.8%	$0.15	$0.60	9.0/10
MiniMax M3	1M	86.4%	$0.30	$1.20	8.5/10
Claude Opus 4.8	1M	89.8%	$5.00	$25.00	7.5/10
Claude Sonnet 4.6	1M	87.6%	$3.00	$15.00	7.5/10
GPT-5.5	400K	88.4%	$5.00	$30.00	6.5/10
GPT-5.5 Instant	128K	84.2%	$0.50	$1.50	6.0/10

A few rows in that table need correcting before you act on them:

Gemini 3.5 Flash pricing. The $0.35 / $0.70 rates are unconfirmed and look wrong. llm-stats lists Flash at roughly $1.50 input and $9.00 output per million tokens, about 4x and 13x the original figures. Every cost figure below that is built on the lower numbers is therefore unreliable.
DeepSeek V3.5. As far as we can tell, this model does not exist. DeepSeek's own change log goes from V3.2 (December 2025) straight to V4 / V4-Pro (April 2026). The context, MMLU, and pricing for "V3.5" are all unsupported.
GPT-5.5 context and MMLU. The pricing ($5 / $30) checks out, but the spec sheet puts the context window at 1M tokens or more, not 400K, and MMLU around 92.4%, not 88.4%.
GPT-5.5 Instant. Listed here as 128K context at $0.50 / $1.50. In reality it shares the GPT-5.5 family's ~1.1M window and $5 / $30 pricing; the 84.2% MMLU is unsupported.
All MMLU figures. None of the seven percentages could be confirmed against official documentation. Most vendors now report MMLU-Pro rather than plain MMLU, so treat these as indicative at best.
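The size of the Gemini 3.5 Flash pricing gap in the corrections above can be checked with two divisions. This sketch uses only the rates quoted in this article (the original table figures and the llm-stats listing); the variable names are illustrative.

```python
# Rates in USD per million tokens, as quoted in this article
original = {"input": 0.35, "output": 0.70}   # unconfirmed table figures
listed = {"input": 1.50, "output": 9.00}     # llm-stats listing

# How many times higher the listed rate is than the original figure
ratios = {kind: listed[kind] / original[kind] for kind in original}

for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.1f}x the original figure")
```

The input rate comes out a little over 4x and the output rate just under 13x, which is where the "factor of four or more" comes from.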

What does hold up: Claude Opus 4.8 at 1M context and $5 / $25, Claude Sonnet 4.6 at 1M context and $3 / $15, and MiniMax M3 at 1M context and $0.30 / $1.20. Those three match reality.

Top recommendation: Gemini 3.5 Flash

The original case for Flash was simple: 1M context, an MMLU around 86.8%, and the cheapest output pricing ($0.70/1M) of any 1M-context model. For a RAG system pulling 50 documents of 10K tokens each, that output saving was meant to compound into a big monthly win.

The catch is the price the whole argument rested on. With Flash's output rate actually nearer $9.00/1M, it is not the cheapest 1M-context model, and the cost advantage that made it the headline pick largely evaporates. Gemini 3.5 Flash is real and does have a 1M-token window; that part stands. The bargain framing does not.

Here's the sample monthly cost as originally calculated (10M input, 5M output, 500K retrieved context):

Gemini 3.5 Flash: $3.50 + $3.50 = $7.00
DeepSeek V3.5: $1.50 + $3.00 = $4.50 (even cheaper!)
MiniMax M3: $3.00 + $6.00 = $9.00
Opus 4.8: $50.00 + $125.00 = $175.00

Two of those lines don't survive scrutiny. The $7.00 Flash figure uses the unconfirmed low prices; at the rates llm-stats publishes (10M input at $1.50, 5M output at $9.00), the same workload comes to roughly $60.00 ($15.00 input plus $45.00 output). The DeepSeek V3.5 line is for a model we couldn't verify exists, so ignore it. The MiniMax M3 figure is sound, and the Opus 4.8 total of $175 is correct: $50 input plus $125 output.
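The monthly figures above are input volume times input rate plus output volume times output rate. This short sketch reproduces them for the 10M-input, 5M-output workload, using only rates quoted in this article. The function and dictionary names are illustrative, and the "DeepSeek V3.5" row is included only to show where the original $4.50 came from.

```python
# (input rate, output rate) in USD per million tokens, as quoted in this article
RATES_AS_DRAFTED = {
    "Gemini 3.5 Flash": (0.35, 0.70),   # unconfirmed low prices
    "DeepSeek V3.5": (0.15, 0.60),      # model could not be verified
    "MiniMax M3": (0.30, 1.20),
    "Claude Opus 4.8": (5.00, 25.00),
}
FLASH_LLM_STATS = (1.50, 9.00)          # rates listed by llm-stats


def monthly_cost(input_millions, output_millions, rate):
    """Cost in USD for a month's traffic, volumes given in millions of tokens."""
    input_rate, output_rate = rate
    return input_millions * input_rate + output_millions * output_rate


for model, rate in RATES_AS_DRAFTED.items():
    print(f"{model}: ${monthly_cost(10, 5, rate):.2f}")

print(f"Gemini 3.5 Flash at llm-stats rates: ${monthly_cost(10, 5, FLASH_LLM_STATS):.2f}")
```

Swapping in your own monthly token volumes is the quickest way to see how far the choice of rate moves the total.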

Alternative: DeepSeek V3.5 for private RAG

This section recommended a self-hosted model for teams that can't send documents to third-party APIs (healthcare, finance, legal), citing $0.15 / $0.60 API pricing or free self-hosting, 85.8% MMLU, and 1M context.

We can't stand behind any of it, because we couldn't confirm "DeepSeek V3.5" is a real release. DeepSeek's change log skips from V3.2 to V4. If you need a private, self-hosted RAG model for regulated data, the underlying need is genuine, but pick a model that actually ships. Check DeepSeek's current V4 line, or a verified open-weight option like MiniMax M3, rather than the model named here.

When to use premium models

The premium tier, Opus 4.8 and GPT-5.5, was pitched as marginally better comprehension at 7-25x the cost. (Bear in mind the MMLU gaps quoted earlier are unconfirmed, and GPT-5.5's real MMLU appears higher than the table suggests.) The decision rule still makes sense, though. Reach for a premium model when:

Answer accuracy is mission-critical (medical, legal, financial)
Retrieved documents are highly technical or specialised
The cost of a wrong answer is bigger than the model's price premium

That last point is the one that actually matters. If a bad answer costs you a client or a compliance breach, the per-token premium is rounding error.
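One way to make that last rule concrete is a break-even check: compare the expected cost of the wrong answers a premium model avoids against its extra monthly spend. The sketch below is a rough illustration only. The error rates, query volume, and cost per wrong answer are invented placeholders, not measurements; the $9 and $175 monthly costs are the MiniMax M3 and Opus 4.8 figures from the sample workload above.

```python
def premium_is_worth_it(queries, cheap_error_rate, premium_error_rate,
                        cost_per_wrong_answer, cheap_model_cost, premium_model_cost):
    """True if the errors avoided are worth more than the price premium.

    All inputs are assumptions you would have to estimate from your own pilot.
    """
    avoided_errors = queries * (cheap_error_rate - premium_error_rate)
    expected_saving = avoided_errors * cost_per_wrong_answer
    price_premium = premium_model_cost - cheap_model_cost
    return expected_saving > price_premium


# Hypothetical: 10,000 queries a month, 3% vs 2% wrong answers
high_stakes = premium_is_worth_it(10_000, 0.03, 0.02, cost_per_wrong_answer=50,
                                  cheap_model_cost=9, premium_model_cost=175)
low_stakes = premium_is_worth_it(10_000, 0.03, 0.02, cost_per_wrong_answer=1,
                                 cheap_model_cost=9, premium_model_cost=175)
print(high_stakes, low_stakes)
```

Under these made-up inputs the premium pays for itself when a wrong answer costs $50 and does not when it costs $1, which is the shape of the trade-off rather than a prediction for any real workload.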

Verdict

Strip out the bad numbers and the shape of the advice survives, even if the specific picks don't. For most RAG systems, a cheap 1M-context model is the right starting point. Just price it honestly, because the bargain rates that made Gemini 3.5 Flash look unbeatable don't appear to be real. For private deployments, the principle holds (self-host an open-weight model for regulated data) but use one you can confirm exists, not the unverified "DeepSeek V3.5". Keep premium models like Opus 4.8 for the accuracy-critical work where errors are expensive.

The one thing that's genuinely true across all of it: the 1M context window is the real enabler. It lets you retrieve broadly without building and tuning a re-ranking pipeline, which is where a lot of RAG complexity and cost otherwise goes.
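To illustrate what retrieving broadly without a re-ranker can look like, here is a minimal sketch that takes chunks in the order the retriever returned them and stops when a token budget is full. The function name, the chunk identifiers, and the token counts are hypothetical, and it assumes you already have a token count for each chunk.

```python
def pack_chunks(chunks, budget_tokens):
    """Select chunks in retriever order until the token budget is reached.

    chunks: list of (chunk_id, token_count) pairs, best match first.
    Returns (selected_ids, tokens_used). No re-ranking is applied.
    """
    selected, used = [], 0
    for chunk_id, tokens in chunks:
        if used + tokens > budget_tokens:
            break  # the next chunk would overflow the budget
        selected.append(chunk_id)
        used += tokens
    return selected, used


# Hypothetical retrieval result: 80 chunks of 10K tokens each
retrieved = [(f"doc-{i}", 10_000) for i in range(80)]

wide_ids, wide_used = pack_chunks(retrieved, budget_tokens=900_000)
narrow_ids, narrow_used = pack_chunks(retrieved, budget_tokens=120_000)
print(len(wide_ids), wide_used, len(narrow_ids), narrow_used)
```

With a 900K budget, all 80 chunks fit. With a 120K budget, only the first 12 do, which is the situation where ordering, and therefore re-ranking, starts to matter.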

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.
