1M context models tested: MiniMax M3 vs Gemini 3.5 Flash
Million-token context windows used to be a luxury feature you paid a premium for. Now two models reach that mark at very different prices, and one of them ships its weights openly. We ran both against real long-context work to see what actually changes for a business team.
A year ago, if you wanted a model that could read a million tokens at once (roughly a full legal contract, or most of a codebase), you were looking at the top-tier closed models and the bills that came with them.
That has shifted. By June 2026, two models hit the same one-million-token mark from opposite corners of the market. MiniMax M3 is an open-weights model you can download and run yourself (weights are on Hugging Face). Gemini 3.5 Flash is Google's fast, API-only model, reported as launched at Google I/O on 19 May 2026.
The interesting bit isn't that they both hit a million tokens. It's what each one does with that room, and where the trade-offs land for ordinary business jobs: reviewing a long contract, tracing a bug, summarising a stack of research. So we put both to work.
One naming note before we go on: Google's own branding for this generation is mostly "Gemini 3 Flash." The "3.5 Flash" label shows up chiefly in third-party model directories, so treat the version number loosely.
The contenders
| Feature | MiniMax M3 | Gemini 3.5 Flash |
|---|---|---|
| SWE-bench Pro | 59.0% | 48.2% |
| MMLU | 86.4% | 86.8% |
| Context window | 1M tokens | 1M tokens |
| Price (input) | $0.30 / 1M tokens | $0.35 / 1M tokens |
| Price (output) | $1.20 / 1M tokens | $0.70 / 1M tokens |
| Licence | Open | Closed |
A few of those figures need a caveat. M3's 59.0% on SWE-bench Pro is confirmed by MiniMax and reported by several outlets, and the licence split is real: M3's weights are open on Hugging Face, while Gemini 3.5 Flash is API-only. But the Gemini SWE-bench Pro number (48.2%) and both MMLU scores (86.4% / 86.8%) are unconfirmed. We couldn't match them to any primary source, so read them as indicative rather than settled.
The pricing in that table is also off and worth flagging plainly. The $0.35 / $0.70 listed for Gemini 3.5 Flash doesn't hold up: public trackers put it closer to $1.50 per 1M input and $9.00 per 1M output, far higher. And MiniMax doesn't publish a flat $0.30 / $1.20 rate either; its pricing is tiered by input size, with a higher rate once you go past 512K tokens. Price the real numbers before you budget anything off this.
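To see how much that discrepancy matters for a single long-context request, it helps to run the arithmetic. The sketch below is illustrative only. The Gemini rates are the tracker figures quoted above. The MiniMax tier rates are hypothetical placeholders, since the real schedule isn't given here. The 512,000-token boundary is one reading of "512K", and billing the whole request at the rate of the tier its input falls into is also an assumption.

```python
def flat_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost in USD for one request under flat per-million-token pricing."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


def tiered_cost(input_tokens, output_tokens, tiers):
    """Cost in USD when the rate depends on the size of the input.

    `tiers` is a list of (max_input_tokens, input_per_m, output_per_m),
    sorted by max_input_tokens. Assumption: the whole request is billed
    at the rate of the first tier that the input fits into.
    """
    for max_input, input_per_m, output_per_m in tiers:
        if input_tokens <= max_input:
            return flat_cost(input_tokens, output_tokens, input_per_m, output_per_m)
    raise ValueError("input exceeds the largest tier")


# Hypothetical placeholder tiers, NOT MiniMax's published prices.
HYPOTHETICAL_M3_TIERS = [
    (512_000, 0.30, 1.20),    # up to 512K input tokens
    (1_000_000, 0.60, 2.40),  # past 512K, up to the 1M window
]

if __name__ == "__main__":
    input_tokens = 750_000  # size of the contract in the legal test
    output_tokens = 5_000   # hypothetical length of the reply

    table = flat_cost(input_tokens, output_tokens, 0.35, 0.70)
    tracker = flat_cost(input_tokens, output_tokens, 1.50, 9.00)
    m3 = tiered_cost(input_tokens, output_tokens, HYPOTHETICAL_M3_TIERS)

    print(f"Gemini at the table's rates:  ${table:.2f}")
    print(f"Gemini at tracker rates:      ${tracker:.2f}")
    print(f"M3 at placeholder tier rates: ${m3:.2f}")
```

For a 750,000-token input and a 5,000-token reply, the tracker figures work out to about $1.17 per request, against about $0.27 at the table's rates. Swap in the provider's current published rates before relying on any of these numbers.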
Test methodology
We set up three long-context jobs that mirror real work. These are our own tests, not published benchmarks, so take the results as a field report rather than a leaderboard:
- Legal document review: A 750,000-token contract with 200 cross-referenced clauses. We asked each model to find inconsistencies and flag risks.
- Codebase analysis: A 900,000-token Python monorepo. We asked each to trace a bug across 15 files and propose a fix.
- Literature synthesis: 50 research papers, 850,000 tokens in total. We asked each to pull out where the papers agreed and where they didn't.
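Tests of this kind are typically scored by comparing what the model flagged against a list of known, planted items. The sketch below shows one common way to do that, using recall (how many planted items were found) and precision (how many flagged items were real). It is an illustration rather than the harness used for these tests, and the clause IDs are hypothetical.

```python
def recall(planted, flagged):
    """Fraction of planted items that the model flagged."""
    planted = set(planted)
    if not planted:
        raise ValueError("need at least one planted item")
    return len(planted & set(flagged)) / len(planted)


def precision(planted, flagged):
    """Fraction of flagged items that were actually planted."""
    flagged = set(flagged)
    if not flagged:
        return 0.0  # nothing flagged, so nothing was flagged correctly
    return len(set(planted) & flagged) / len(flagged)


if __name__ == "__main__":
    # Hypothetical IDs: ten planted inconsistencies in a contract.
    planted = {f"clause-{i}" for i in range(1, 11)}
    # The model finds eight of them and raises two false alarms.
    flagged = {f"clause-{i}" for i in range(1, 9)} | {"clause-41", "clause-77"}

    print(f"recall:    {recall(planted, flagged):.0%}")
    print(f"precision: {precision(planted, flagged):.0%}")
```

Reporting both numbers matters for long documents: a model that flags everything scores perfect recall while being useless to a reviewer.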
Results
Legal document review. Both did well, catching 85% or more of the inconsistencies we'd planted on purpose. In our run, MiniMax M3 picked up more of the subtle cross-reference errors (92% against 87%), which would fit stronger multi-step reasoning, though a single run can't establish that. Gemini 3.5 Flash was faster and cheaper on this one.
Codebase analysis. MiniMax M3 won this clearly. Its SWE-bench Pro lead showed up in practice: it traced the bug through 12 of 15 files and gave us a fix that worked. Gemini 3.5 Flash traced 9 files correctly and offered a partial fix. If your long-context work involves code, that gap is the thing to watch.
Literature synthesis. Close to a tie. Gemini 3.5 Flash had a slight edge on domain-specific terminology, and both produced syntheses we'd actually use. (Both models post MMLU scores in the mid-80s, though we couldn't verify the exact figures.)
Latency and throughput
Gemini 3.5 Flash was the quicker model in our testing, with a shorter time to first token and higher tokens per second throughout. M3 over its API was in the same ballpark. Self-hosted M3 (a Q4 quantised build on an A100) ran noticeably slower in our setup, though; that's our own observation rather than a documented figure. So the call comes down to what you're optimising for. If raw speed matters most, Flash wins. If you need to keep data in-house by self-hosting M3, the slower speed is a fair price to pay.
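If you want to reproduce this comparison on your own hardware, the two numbers to capture are time to first token and tokens per second after that. The sketch below times any iterable of streamed chunks. It is a minimal illustration, not the setup used for these tests: `fake_stream` is a hypothetical stand-in for a real streaming client, and it treats each chunk as one token, which real APIs don't guarantee.

```python
import time


def measure_stream(chunks):
    """Time an iterable of streamed chunks.

    Returns time to first chunk (seconds), chunk count, and chunks per
    second measured from the first chunk to the end of the stream.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = end - first
    # The first chunk marks the start of the window, so it isn't counted in the rate.
    rate = (count - 1) / elapsed if count > 1 and elapsed > 0 else 0.0
    return {"ttft_s": first - start, "chunks": count, "chunks_per_s": rate}


def fake_stream(n, first_delay, gap):
    """Hypothetical stand-in for a model stream: n chunks with fixed delays."""
    time.sleep(first_delay)  # simulated prompt-processing time
    yield "tok"
    for _ in range(n - 1):
        time.sleep(gap)  # simulated per-token generation time
        yield "tok"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(n=20, first_delay=0.2, gap=0.01))
    print(f"time to first token: {stats['ttft_s']:.3f}s")
    print(f"throughput: {stats['chunks_per_s']:.1f} chunks/s")
```

With a real client, start the timer just before the request is sent, so that network and prompt-processing time count toward time to first token. At these context sizes, that prompt-processing phase is likely to dominate.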
Verdict
For complex long-context work, especially anything touching code, MiniMax M3 is the stronger model. For simpler long-context jobs where speed and cost lead the decision, Gemini 3.5 Flash makes more sense. Both genuinely deliver the million-token window; the real question is what you plan to do inside it.


