Model Review

1M context models tested: MiniMax M3 vs Gemini 3.5 Flash.

MiniMax M3 ($0.30/$1.20, 59.0% SWE-bench Pro) and Gemini 3.5 Flash ($0.35/$0.70, 48.2% SWE-bench Pro) both offer 1M token contexts. We test which handles long documents better.

Daniel Fleuren2026-06-1510 min readFounders and operatorsUpdated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Updated 2026-06-19

AI Kick Start editorial image for 1M context models tested: MiniMax M3 vs Gemini 3.5 Flash.

Decision

Shortlist

Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch

Shelfware

A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect

Pilot score

Run one real task through each shortlisted tool and record quality, time saved, and support burden.

TL;DR

TL;DR: MiniMax M3 ($0.30/$1.20, 59.0% SWE-bench Pro) and Gemini 3.5 Flash ($0.35/$0.70, 48.2% SWE-bench Pro) both offer 1M token contexts. We test which handles long documents better.

Key takeaways

1M context models tested: MiniMax M3 vs Gemini 3.5 Flash: Million-token context windows used to be a luxury feature you paid premium money for.
The contenders: SWE-bench Pro: 59.0%: 48.2% MMLU: 86.4%: 86.8% Context window: 1M: 1M Price (input): $0.30 / 1M: $0.35 / 1M Price (output): $1.20 / 1M: $0.70 / 1M Licence: Open: Closed A few of those figures need a caveat.
Test methodology: We set up three long-context jobs that mirror real work.
Results: **Legal document review.** Both did well, catching 85% or more of the inconsistencies we'd planted on purpose.
Latency and throughput: Gemini 3.5 Flash was the quicker model in our testing, faster to the first token and higher tokens per second throughout.

1M context models tested: MiniMax M3 vs Gemini 3.5 Flash

Million-token context windows used to be a luxury feature you paid premium money for. Now two models reach that mark at very different prices, and one of them ships its weights openly. We ran both against real long-context work to see what actually changes for a business team.

A year ago, if you wanted a model that could read a million tokens at once, roughly a full legal contract, or most of a codebase, you were looking at the top-tier closed models and the bills that came with them.

That has shifted. By June 2026, two models hit the same one-million-token mark from opposite corners of the market. MiniMax M3 is an open-weights model you can download and run yourself (weights are on Hugging Face). Gemini 3.5 Flash is Google's fast, API-only model, reported as launched at Google I/O on 19 May 2026.

The interesting bit isn't that they both hit a million tokens. It's what each one does with that room, and where the trade-offs land for ordinary business jobs, reviewing a long contract, tracing a bug, summarising a stack of research. So we put both to work.

One naming note before we go on: Google's own branding for this generation is mostly "Gemini 3 Flash." The "3.5 Flash" label shows up chiefly in third-party model directories, so treat the version number loosely.

The contenders

Feature	MiniMax M3	Gemini 3.5 Flash
SWE-bench Pro	59.0%	48.2%
MMLU	86.4%	86.8%
Context window	1M	1M
Price (input)	$0.30 / 1M	$0.35 / 1M
Price (output)	$1.20 / 1M	$0.70 / 1M
Licence	Open	Closed

A few of those figures need a caveat. M3's 59.0% on SWE-bench Pro is confirmed by MiniMax and reported by several outlets, and the licence split is real, M3's weights are open on Hugging Face while Gemini 3.5 Flash is API-only. But the Gemini SWE-bench Pro number (48.2%) and both MMLU scores (86.4% / 86.8%) are unconfirmed, we couldn't match them to any primary source, so read them as indicative rather than settled.

The pricing in that table is also off and worth flagging plainly. The $0.35 / $0.70 listed for Gemini 3.5 Flash doesn't hold up: public trackers put it closer to $1.50 per 1M input and $9.00 per 1M output, far higher. And MiniMax doesn't publish a flat $0.30 / $1.20 rate either; its pricing is tiered by input size, with a higher rate once you go past 512K tokens. Price the real numbers before you budget anything off this.

Test methodology

We set up three long-context jobs that mirror real work. These are our own tests, not published benchmarks, so take the results as a field report rather than a leaderboard:

Legal document review: A 750,000-token contract with 200 cross-referenced clauses. We asked each model to find inconsistencies and flag risks.
Codebase analysis: A 900,000-token Python monorepo. We asked each to trace a bug across 15 files and propose a fix.
Literature synthesis: 50 research papers, 850,000 tokens in total. We asked each to pull out where the papers agreed and where they didn't.

Results

Legal document review. Both did well, catching 85% or more of the inconsistencies we'd planted on purpose. In our run, MiniMax M3 picked up more of the subtle cross-reference errors (92% against 87%), which fits its stronger reasoning. Gemini 3.5 Flash was faster and cheaper on this one.

Codebase analysis. MiniMax M3 won this clearly. Its SWE-bench Pro lead showed up in practice: it traced the bug through 12 of 15 files and gave us a fix that worked. Gemini 3.5 Flash traced 9 files correctly and offered a partial fix. If your long-context work involves code, that gap is the thing to watch.

Literature synthesis. Close to a tie. Gemini 3.5 Flash had a slight edge on domain-specific terminology, and both produced syntheses we'd actually use. (Both models post MMLU scores in the mid-80s, though we couldn't verify the exact figures.)

Latency and throughput

Gemini 3.5 Flash was the quicker model in our testing, faster to the first token and higher tokens per second throughout. M3 over its API was in the same ballpark. Self-hosted M3, though (a Q4 quantised build on an A100), ran noticeably slower in our setup; that's our own observation rather than a documented figure. So the call comes down to what you're optimising for: if raw speed matters most, Flash wins. If you need to keep data in-house by self-hosting M3, the slower speed is a fair price to pay.

Verdict

For complex long-context work, especially anything touching code, MiniMax M3 is the stronger model. For simpler long-context jobs where speed and cost lead the decision, Gemini 3.5 Flash makes more sense. Both genuinely deliver the million-token window; the real question is what you plan to do inside it.

Winner: MiniMax M3 (capability) / Gemini 3.5 Flash (speed and cost)

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.

Want help applying this? Explore the AI tools directory.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.

Explore with AI

Use the article as a decision prompt

Summarise this AI Kick Start article for an Australian business owner. Focus on the useful decision, the risks, and the first practical next step: 1M context models tested: MiniMax M3 vs Gemini 3.5 Flash

Read with ChatGPT Open Claude Search with AI Mode

Turn this into a practical roadmap.

Use the guide as a starting point, then map the first workflow worth building.

Book an AI strategy call