Best model for RAG systems: Context vs accuracy
Analysis
If you've built anything with Retrieval-Augmented Generation, you already know the awkward truth: the model is only half the system. You pull the right documents, stuff them into the prompt, and hope the model can read all of it and answer without making things up. Get the retrieval wrong and the best model in the world gives you confident nonsense. Get it right and a cheaper model can carry you a long way.
So which model should you actually run behind your RAG pipeline? That's the question Australian teams keep asking, usually with one eye on the monthly bill. The honest answer in mid-2026 is that it depends on two numbers (context window and a comprehension score) and on whether you can send your documents to a third-party API at all.
A note before we go further. When this comparison was first put together, it leaned on a set of prices and a model name that turned out to be wrong. One of the "cheapest" options didn't exist, and the supposed bargain pricing on another was off by a factor of four or more. We've left the original figures in place so you can see where the cost case came from, but we've marked each problem clearly. Read the corrections, not just the table.
RAG model requirements
Four things matter, roughly in this order.
- Context window. It has to fit your retrieved chunks plus the query plus the system prompt. A 1M-token window lets you pass more chunks, or bigger chunks, which usually means better recall.
- MMLU. A rough proxy for general knowledge and comprehension. Higher MMLU tends to mean the model synthesises retrieved material more reliably. (Caveat below: the exact MMLU figures in this piece could not be confirmed against vendor docs.)
- Price. RAG runs at volume. Per-token cost compounds fast, so the input and output rates are not a footnote. They're the budget.
- Speed. For anything interactive, latency is part of the product.
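The context-window requirement above comes down to simple token arithmetic, which can be sketched as follows. This is a minimal illustration, not a vendor formula: the `PromptBudget` class, its field names, and the token counts in the usage lines are all hypothetical, and reserving space for the answer inside the same window is an assumption that varies by provider.

```python
from dataclasses import dataclass


@dataclass
class PromptBudget:
    """Token budget for one RAG request (all names here are illustrative)."""
    context_window: int   # model's advertised window, in tokens
    system_prompt: int    # tokens used by the system prompt
    query: int            # tokens used by the user query
    reserved_output: int  # tokens held back for the answer (assumption)

    def room_for_chunks(self) -> int:
        """Tokens left for retrieved chunks after fixed overheads."""
        overhead = self.system_prompt + self.query + self.reserved_output
        return max(0, self.context_window - overhead)

    def max_chunks(self, chunk_tokens: int) -> int:
        """How many equally sized chunks fit in the remaining room."""
        if chunk_tokens <= 0:
            raise ValueError("chunk_tokens must be positive")
        return self.room_for_chunks() // chunk_tokens


# Hypothetical overheads: 2K system prompt, 500-token query, 8K for the answer.
large = PromptBudget(1_000_000, 2_000, 500, 8_000)
small = PromptBudget(128_000, 2_000, 500, 8_000)

print(large.max_chunks(10_000))  # 98
print(small.max_chunks(10_000))  # 11
```

With 10K-token chunks, the 1M window holds 98 of them under these assumed overheads, while a 128K window holds 11. That gap is the practical meaning of "more chunks, or bigger chunks".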
The RAG leaderboard
| Model | Context | MMLU | Input Price | Output Price | RAG Score |
|---|---|---|---|---|---|
| Gemini 3.5 Flash | 1M | 86.8% | $0.35 | $0.70 | 9.0/10 |
| DeepSeek V3.5 | 1M | 85.8% | $0.15 | $0.60 | 9.0/10 |
| MiniMax M3 | 1M | 86.4% | $0.30 | $1.20 | 8.5/10 |
| Claude Opus 4.8 | 1M | 89.8% | $5.00 | $25.00 | 7.5/10 |
| Claude Sonnet 4.6 | 1M | 87.6% | $3.00 | $15.00 | 7.5/10 |
| GPT-5.5 | 400K | 88.4% | $5.00 | $30.00 | 6.5/10 |
| GPT-5.5 Instant | 128K | 84.2% | $0.50 | $1.50 | 6.0/10 |
A few rows in that table need correcting before you act on them:
- Gemini 3.5 Flash pricing. The $0.35 / $0.70 rates are unconfirmed and look wrong. llm-stats lists Flash at roughly $1.50 input and $9.00 output per million tokens, about 4x and 13x the rates in the table. Every cost figure below that is built on the lower numbers is therefore unreliable.
- DeepSeek V3.5. As far as we can tell, this model does not exist. DeepSeek's own change log goes from V3.2 (December 2025) straight to V4 / V4-Pro (April 2026). The context, MMLU, and pricing for "V3.5" are all unsupported.
- GPT-5.5 context and MMLU. The pricing ($5 / $30) checks out, but the spec sheet puts the context window at roughly 1M tokens or more, not 400K, and MMLU around 92.4%, not 88.4%.
- GPT-5.5 Instant. Listed here as 128K context at $0.50 / $1.50. In reality it shares the GPT-5.5 family's ~1.1M window and $5 / $30 pricing; the 84.2% MMLU is unsupported.
- All MMLU figures. None of the seven percentages could be confirmed against official documentation. Most vendors now report MMLU-Pro rather than plain MMLU, so treat these as indicative at best.
What does hold up: Claude Opus 4.8 at 1M context and $5 / $25, Claude Sonnet 4.6 at 1M context and $3 / $15, and MiniMax M3 at 1M context and $0.30 / $1.20. Those three match reality.
Top recommendation: Gemini 3.5 Flash
The original case for Flash was simple: 1M context, an MMLU around 86.8%, and what was billed as the cheapest output pricing ($0.70/1M) of any 1M-context model, even though the table's own DeepSeek V3.5 row listed $0.60. For a RAG system pulling 50 documents of 10K tokens each (500K tokens of retrieved context), that saving was meant to compound into a big monthly win.
The catch is the price the whole argument rested on. With Flash's output rate actually nearer $9.00/1M, it is not the cheapest 1M-context model, and the cost advantage that made it the headline pick largely evaporates. Gemini 3.5 Flash is real and does have a 1M-token window; that part stands. The bargain framing does not.
Here's the sample monthly cost as originally calculated (10M input tokens, 5M output tokens, 500K tokens of retrieved context):
- Gemini 3.5 Flash: $3.50 + $3.50 = $7.00
- DeepSeek V3.5: $1.50 + $3.00 = $4.50 (even cheaper!)
- MiniMax M3: $3.00 + $6.00 = $9.00
- Opus 4.8: $50.00 + $125.00 = $175.00
Two of those lines don't survive scrutiny. The $7.00 Flash figure uses the unconfirmed low prices; at the rates llm-stats publishes (10M input at $1.50, 5M output at $9.00), the same workload comes to $60.00: $15.00 input plus $45.00 output. The DeepSeek V3.5 line is for a model we couldn't verify exists, so ignore it. The MiniMax M3 figure is sound, and the Opus 4.8 total of $175 is correct: $50 input plus $125 output.
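The arithmetic behind these totals is easy to reproduce, and doing so makes it simple to swap in whatever rates you can confirm yourself. The sketch below is illustrative only: `monthly_cost` and the `RATES` dictionary are names invented here, and the rates are the ones quoted in this article, not independently verified figures. Note that only the 10M input and 5M output volumes enter the totals.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Monthly cost in USD, with rates given per 1M tokens."""
    cost = (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate
    return round(cost, 2)


# Workload from the article: 10M input tokens, 5M output tokens per month.
WORKLOAD = {"input_tokens": 10_000_000, "output_tokens": 5_000_000}

# (input rate, output rate) in USD per 1M tokens, as quoted in the article.
RATES = {
    "Gemini 3.5 Flash (table rates, unconfirmed)": (0.35, 0.70),
    "Gemini 3.5 Flash (llm-stats rates)": (1.50, 9.00),
    "MiniMax M3": (0.30, 1.20),
    "Claude Opus 4.8": (5.00, 25.00),
}

for name, (inp, out) in RATES.items():
    total = monthly_cost(**WORKLOAD, input_rate=inp, output_rate=out)
    print(f"{name}: ${total:.2f}")
# Gemini 3.5 Flash (table rates, unconfirmed): $7.00
# Gemini 3.5 Flash (llm-stats rates): $60.00
# MiniMax M3: $9.00
# Claude Opus 4.8: $175.00
```

The two Flash lines differ only in the rates passed in, which is the whole point: the workload is identical, and the headline saving depends entirely on which price you believe.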
Alternative: DeepSeek V3.5 for private RAG
This section recommended a self-hosted model for teams that can't send documents to third-party APIs (healthcare, finance, legal), citing $0.15 / $0.60 API pricing or free self-hosting, 85.8% MMLU, and 1M context.
We can't stand behind any of it, because we couldn't confirm "DeepSeek V3.5" is a real release. DeepSeek's change log skips from V3.2 to V4. If you need a private, self-hosted RAG model for regulated data, the underlying need is genuine, but pick a model that actually ships. Check DeepSeek's current V4 line, or a verified open-weight option like MiniMax M3, rather than the model named here.
Verdict
Strip out the bad numbers and the shape of the advice survives, even if the specific picks don't. For most RAG systems, a cheap 1M-context model is the right starting point. Just price it honestly, because the bargain rates that made Gemini 3.5 Flash look unbeatable don't appear to be real. For private deployments, the principle holds (self-host an open-weight model for regulated data), but use one you can confirm exists, not the unverified "DeepSeek V3.5". Keep premium models like Opus 4.8 for the accuracy-critical work where errors are expensive.
The one thing that's genuinely true across all of it: the 1M context window is the real enabler. It lets you retrieve broadly without building and tuning a re-ranking pipeline, which is where a lot of RAG complexity and cost otherwise goes.
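To make "retrieve broadly without re-ranking" concrete, here is one possible sketch: take chunks in the order the retriever returned them and keep adding until the token budget runs out. The `pack_chunks` function, the chunk IDs, and the budget figures are hypothetical, and token counts are assumed to be known already. This is one simple strategy, not a claim about how any particular RAG framework does it.

```python
from typing import List, Tuple


def pack_chunks(chunks: List[Tuple[str, int]],
                budget_tokens: int) -> Tuple[List[str], int]:
    """Greedily keep chunks in retrieval order until the budget is spent.

    chunks: (chunk_id, token_count) pairs, best match first.
    Returns the selected chunk IDs and the total tokens used.
    """
    selected: List[str] = []
    used = 0
    for chunk_id, tokens in chunks:
        if used + tokens > budget_tokens:
            continue  # this chunk doesn't fit; a smaller later one still might
        selected.append(chunk_id)
        used += tokens
    return selected, used


# The article's example: 50 retrieved documents of 10K tokens each.
retrieved = [(f"doc-{i}", 10_000) for i in range(50)]

# Hypothetical budgets: most of a 1M window versus most of a 128K window.
ids_large, used_large = pack_chunks(retrieved, 989_500)
ids_small, used_small = pack_chunks(retrieved, 117_500)

print(len(ids_large), used_large)  # 50 500000
print(len(ids_small), used_small)  # 11 110000
```

Under the larger budget all 50 documents go in untouched. Under the smaller one only the first 11 survive, and that is exactly the situation where a re-ranking step starts to matter.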


