Analysis
For years, the standard way to make an AI read a long document was to chop it into pieces, store the pieces, and feed the model only the bits that looked relevant to your question. It worked, but it was fiddly, and it broke in annoying ways. As of mid-2026, a handful of models will just take the whole thing.
A million tokens of context is roughly 750,000 words. That is the entire works of Shakespeare, or a medium-sized software project, dropped into a single prompt and read in one go. Twelve months ago, 128,000 tokens counted as a long context window. The new ceiling is about eight times bigger.
For an Australian business team, the "so what" is straightforward. A lot of work that used to need a custom retrieval system (a search layer, a vector database, a pile of glue code) can now be done by handing the model the source material directly and asking a plain question. That is cheaper to build and easier to reason about.
The catch is that bigger isn't automatically better. These long-context requests cost more per call, run slower, and reward teams who structure their inputs carefully. The rest of this piece walks through what the million-token window actually unlocks, and where it bites.
The million-token context window has arrived. In June 2026, developers can choose from several models built around 1 million tokens of context. MiniMax M3 is open-weight and launched at roughly $0.30/$1.20 per million input/output tokens, though that is a 50%-off launch promotion; the standard rate is closer to $0.60/$2.40 (OpenRouter, MiniMax M3 pricing & benchmarks). DeepSeek's newest open-weight release also ships a native 1M context. Note that DeepSeek's line went from V3.2 to a V4 Preview in April 2026, so there is no "V3.5", and the often-quoted $0.15/$0.60 figure for it is unconfirmed (DeepSeek API Docs, V4 Preview release). Google's Gemini 3.5 Flash carries a 1M-token input window too, reportedly priced nearer $1.50/$9.00 than the lower $0.35/$0.70 sometimes cited (OpenRouter, Gemini 3.5 Flash), and Gemini 3.1 Pro is, by available accounts, a 2M-token model priced around $2/$12 rather than the $3.50/$10.50 figure that circulates. A year ago, 128K tokens was considered long context. Today that is about one-eighth of the new standard (The Decoder, million-token context for open models).
This is more than a spec bump. It changes what these systems can do. At the common rule of thumb of about 0.75 words per token, a million tokens is roughly 750,000 words, enough to hold the entire King James Bible, the complete works of Shakespeare, or a medium-sized software codebase in a single prompt. Work that used to demand a complex retrieval architecture can now run on plain prompt engineering.
What 1M Tokens Enables
The new applications fall into three broad areas.
Full codebase understanding: a 1M-token context can hold somewhere around 500,000 to 700,000 lines of code, depending on the language and how heavily it's commented. Treat that as an order-of-magnitude estimate rather than a measured figure. It covers most individual microservices, libraries, or apps. You can ask "how does authentication work in this codebase?" or "find every place we sanitise user input" and have the model read the whole repository in one pass. Tools like Kimi K2.7 Code have shown real strength at spotting cross-file dependencies and refactoring opportunities. Note, though, that K2.7 Code runs a 256K-token window rather than a full 1M, so the very largest repos still need to be fed in sections (Codersera, Kimi K2.7 Code guide).
Multi-document legal and financial analysis: case files, financial filings, and regulatory submissions often run to hundreds or thousands of pages. With a 1M-token context, a lawyer can load an entire case file (complaints, motions, depositions, exhibits) and ask the model to flag inconsistencies, summarise the key arguments, or draft a responsive pleading. A financial analyst can pull in years of filings, earnings-call transcripts, and analyst notes to build out an investment thesis.
Long-form content creation and analysis: authors, researchers, and content teams can work at document length instead of paragraph length. A novelist can ask the model to check a 200,000-word manuscript for plot holes. A researcher can pull findings together across dozens of papers. A journalist can run thousands of pages of leaked documents to surface patterns and connections.
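For the codebase case above, where a repository is too big for a given window (a 256K-token window, say), the "fed in sections" step can be sketched roughly as below. This is a minimal illustration, not any tool's actual method: the function names are hypothetical, and the four-characters-per-token figure is a rough rule of thumb assumed here in place of a real tokenizer.

```python
from typing import Dict, List

# Assumption: about 4 characters per token. This is a rough rule of thumb,
# not a measured value; swap in a real tokenizer for anything precise.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count (hypothetical heuristic)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def pack_into_sections(files: Dict[str, str], budget_tokens: int) -> List[List[str]]:
    """Greedily group file paths into sections that each fit the token budget.

    Files are taken in sorted path order so neighbouring files tend to land
    in the same section. A single file larger than the budget still gets a
    section of its own and would need further splitting by the caller.
    """
    sections: List[List[str]] = []
    current: List[str] = []
    used = 0
    for path in sorted(files):
        cost = estimate_tokens(files[path])
        # Close the current section before it would overflow the budget.
        if current and used + cost > budget_tokens:
            sections.append(current)
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        sections.append(current)
    return sections
```

Called with a mapping of file paths to file contents and a budget set a little below the model's advertised window, it returns lists of paths that can each be sent as one request. Leaving headroom for the question and the response is a sensible default.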

The Practical Challenges
The enthusiasm is warranted, but long-context work comes with real constraints you have to plan around.
Cost: even at budget pricing, a full 1M-token prompt runs somewhere around $0.15-0.35 in input alone, and the lower end of that range leans on the unconfirmed DeepSeek figure noted earlier. Add a long response, say 100K tokens, and a single request can hit $0.75-1.50. Across many documents that adds up fast. A legal discovery job running 10,000 documents at full context could, on these numbers, cost in the region of $15,000 per run (10,000 requests at the upper $1.50 figure). That is an illustrative projection, not a quoted price.
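The arithmetic behind figures like these is simple enough to sketch. The helper names below are hypothetical, and the rates passed in are whatever per-million-token prices you are actually quoted; the example values are just the ones mentioned in this piece, several of which are unconfirmed.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def batch_cost(per_request: float, n_requests: int) -> float:
    """Dollar cost of running the same-sized request n_requests times."""
    return per_request * n_requests


# Example: a full 1M-token prompt with a 100K-token response at the
# MiniMax M3 standard rate quoted earlier ($0.60 input / $2.40 output).
single = request_cost(1_000_000, 100_000, input_rate=0.60, output_rate=2.40)

# Example: 10,000 requests at an assumed $1.50 each.
discovery_run = batch_cost(1.50, 10_000)
```

At the standard MiniMax M3 rate the single request comes to about $0.84 ($0.60 of input plus $0.24 of output), and 10,000 requests at $1.50 each come to $15,000. Running the numbers per provider before committing to a batch job is cheap insurance.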
Latency: long-context inference is slower than short-context, full stop. Generic estimates put a 1M-token request at 30-90 seconds, though that's a loose ceiling. MiniMax M3 in particular is considerably faster thanks to its sparse-attention design, which is named MiniMax Sparse Attention, not the "dynamic sparse attention" tag that sometimes gets attached to it (GitHub, MiniMax-AI/MiniMax-M3). Either way, this suits batch workflows far better than anything real-time.
Effective utilisation: models don't all use long context equally well. Needle-in-a-haystack tests (can the model find one specific fact buried in a long document?) show wide variation. Figures circulating put MiniMax M3 near 97% accuracy at 1M tokens and some DeepSeek models around 93%, but those specific numbers are unconfirmed and should be treated as rumoured rather than measured. What is well established is the broader pattern: some models that advertise a 1M-token window degrade noticeably past about 600K tokens in practice.
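A needle-in-a-haystack check is easy to run yourself rather than relying on circulated numbers. The sketch below is one simple way to set it up, not a standard benchmark implementation: `ask` is a hypothetical callable that wraps whichever model API you use, and the scoring is a plain substring match.

```python
from typing import Callable, List


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Return n_sentences copies of filler with the needle inserted at a
    relative depth: 0.0 puts the needle first, 1.0 puts it last."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences: List[str] = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def needle_accuracy(ask: Callable[[str], str], filler: str, needle: str,
                    question: str, expected: str,
                    n_sentences: int, depths: List[float]) -> float:
    """Fraction of depths at which the model's answer contains the expected fact."""
    if not depths:
        raise ValueError("depths must not be empty")
    hits = 0
    for depth in depths:
        prompt = build_haystack(filler, needle, n_sentences, depth) + "\n\n" + question
        # Substring match is a deliberately crude scoring rule (assumption).
        if expected.lower() in ask(prompt).lower():
            hits += 1
    return hits / len(depths)
```

Sweeping both `n_sentences` (to vary total length) and `depths` (to vary where the fact sits) gives a rough picture of where a given model starts to lose track, which matters more for your workload than a single headline percentage.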
Context management: having room for 1M tokens doesn't mean you should fill it. Good long-context prompting takes structure: well-organised documents, clear sections, and explicit instructions about what to focus on. Skip that and the model can drown in the volume and hand back worse answers than it would from a shorter, tighter prompt.
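As a rough illustration of what "structure" can mean in practice, the sketch below assembles a prompt with a contents list, clearly delimited documents, and the instructions placed after the material. The delimiter format and the function name are hypothetical choices made for this example, not a convention any particular model requires.

```python
from typing import List, Tuple


def build_structured_prompt(documents: List[Tuple[str, str]],
                            focus: str, question: str) -> str:
    """Assemble a long-context prompt from (title, body) pairs with a contents
    list, labelled sections, and explicit instructions after the documents."""
    toc = "\n".join(
        f"{i}. {title}" for i, (title, _) in enumerate(documents, start=1)
    )
    sections = "\n\n".join(
        f"=== DOCUMENT {i}: {title} ===\n{body.strip()}\n=== END DOCUMENT {i} ==="
        for i, (title, body) in enumerate(documents, start=1)
    )
    return (
        f"You will be given {len(documents)} documents.\n\n"
        f"Contents:\n{toc}\n\n"
        f"{sections}\n\n"
        f"Focus: {focus}\n"
        f"Question: {question}\n"
        "Cite the document number for each claim."
    )
```

Numbering the documents gives the model something concrete to cite, which also makes its answers easier to check against the source material.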


