Analysis
For years, the standard advice for feeding a large document to a language model was to break it into pieces. You'd chunk the text, build embeddings, store them in a vector database, and retrieve the parts that looked relevant at query time. It worked, but it was fiddly, and it leaked context. The model never saw the whole picture, only the slices you guessed it would need.
That constraint is loosening. Several models now accept roughly a million tokens in one go, enough for a mid-sized codebase, a long contract with all its appendices, or an entire support-ticket history. You can hand the model the full thing and ask your question, no retrieval pipeline required.
For an Australian business team, the practical upshot is fewer moving parts. The work that used to live in chunking logic and a vector store can sometimes collapse into a single API call. That doesn't make retrieval obsolete, and it isn't free: a million input tokens still cost real money, and the cheapest option depends on the job. But the engineering you can now skip is substantial.
This guide walks through building a 1M-context agent: counting tokens so you don't blow the limit, wiring up the agent, pre-filtering when even a million isn't enough, and routing each task to the model that does it cheapest.
Prerequisites
- API keys for Anthropic, MiniMax, or Google
- Understanding of token counting
- A large document or codebase to process
- Python 3.10+
Step-by-Step Framework
Step 1: Count Tokens Accurately
Before you fire off a million tokens, you need to know how many you actually have:
    # token_counter.py
    from pathlib import Path

    import tiktoken


    def count_tokens(text: str, model: str = "gpt-4") -> int:
        # tiktoken ships OpenAI tokenizers, so for Claude, Gemini, or MiniMax
        # this count is an estimate; leave headroom below the limit.
        encoder = tiktoken.encoding_for_model(model)
        # disallowed_special=() stops encode() raising on text that happens to
        # contain special-token strings such as "<|endoftext|>"
        return len(encoder.encode(text, disallowed_special=()))


    def count_file_tokens(path: str) -> int:
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            return count_tokens(f.read())


    def count_directory_tokens(dir_path: str, extensions: list[str]) -> dict:
        results = {}
        total = 0
        for ext in extensions:
            for file in Path(dir_path).rglob(f"*.{ext}"):
                try:
                    tokens = count_file_tokens(str(file))
                    results[str(file)] = tokens
                    total += tokens
                except Exception:
                    # Unreadable files are recorded with a zero count
                    results[str(file)] = 0
        return {"files": results, "total": total, "file_count": len(results)}


    # Example: count a codebase
    stats = count_directory_tokens("./src", ["ts", "tsx", "js"])
    print(f"Total tokens: {stats['total']:,}")  # e.g., 485,231 tokens
    print(f"Files: {stats['file_count']}")

Step 2: Build the 1M Context Agent
    # agent_1m.py
    from typing import List, Optional, Tuple

    from anthropic import Anthropic


    class OneMillionContextAgent:
        def __init__(self, api_key: str, model: str = "claude-sonnet-4.6"):
            self.client = Anthropic(api_key=api_key)
            self.model = model
            self.max_context = 1_000_000
            self.max_output = 8192  # Can request up to 64K

        def build_prompt(
            self, context: str, instruction: str, examples: Optional[List[dict]] = None
        ) -> Tuple[str, List[dict]]:
            """Build the system prompt and message list for one request."""
            # Note: `examples` is accepted but not used yet
            system_prompt = (
                f"You are an expert analyst with access to {self.max_context:,} tokens of context.\n"
                "Analyse the provided materials thoroughly and follow the instruction precisely.\n"
                "Be concise but comprehensive."
            )
            messages = [{
                "role": "user",
                "content": f"Context:\n\n{context}\n\nInstruction: {instruction}",
            }]
            return system_prompt, messages

        def run(self, context: str, instruction: str, temperature: float = 0.3) -> str:
            """Execute with full context."""
            system, messages = self.build_prompt(context, instruction)

            # Verify the token count with the API's own counter
            count = self.client.messages.count_tokens(
                model=self.model, system=system, messages=messages
            )
            context_tokens = count.input_tokens
            print(f"Context tokens: {context_tokens:,} / {self.max_context:,}")
            if context_tokens > self.max_context:
                raise ValueError(
                    f"Context too large: {context_tokens:,} > {self.max_context:,}. "
                    "Use the pre-filtering method (Step 3)."
                )

            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_output,
                temperature=temperature,
                system=system,
                messages=messages
            )
            return response.content[0].text

One correctness note on the snippet above: the current Anthropic Python SDK counts tokens via client.messages.count_tokens(...), which returns an object with an input_tokens field. There is no client.count_tokens(text) call you can wrap in len(), so the guard above uses the messages endpoint and reads input_tokens from the result.
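The guard in the agent compares only the input against the window. A slightly stricter check also reserves room for the reply, on the assumption that generated tokens share the same window as the input. The sketch below is illustrative: `check_window` and its parameters are hypothetical names, and the counter is passed in as a callable so the check can be exercised without an API key.

```python
from typing import Callable


def check_window(
    count_input_tokens: Callable[[], int],
    max_output: int = 8192,
    max_context: int = 1_000_000,
) -> int:
    """Return the input token count, or raise if input plus reserved output overflows."""
    input_tokens = count_input_tokens()
    # Assumption: output tokens count against the same window as the input
    if input_tokens + max_output > max_context:
        raise ValueError(
            f"Request too large: {input_tokens:,} input + {max_output:,} reserved output "
            f"> {max_context:,}"
        )
    return input_tokens


# With the Anthropic SDK the callable could wrap the real counter, for example:
#   lambda: client.messages.count_tokens(model=model, messages=messages).input_tokens
print(check_window(lambda: 500_000))
```

Passing the counter as a callable keeps the budget logic separate from any one provider's SDK, so the same check can sit in front of other models.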
Step 3: Pre-Filtering for Very Large Inputs
When your context runs past 1M tokens, you filter before you send:
    # prefilter.py
    from typing import List

    import numpy as np
    from sentence_transformers import SentenceTransformer


    class ContextPrefilter:
        def __init__(self):
            self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

        def filter_relevant_sections(
            self,
            sections: List[str],
            query: str,
            max_tokens: int = 900_000
        ) -> str:
            """Keep only the most relevant sections to fit under the limit."""
            # Embed query and sections as unit vectors, so the dot product
            # below is a cosine similarity
            query_embedding = self.embedder.encode(query, normalize_embeddings=True)
            section_embeddings = self.embedder.encode(sections, normalize_embeddings=True)

            # Calculate similarity
            similarities = np.dot(section_embeddings, query_embedding)

            # Sort by relevance
            ranked = sorted(zip(sections, similarities), key=lambda x: x[1], reverse=True)

            # Take top sections until we hit the token limit
            selected = []
            total_tokens = 0
            for section, score in ranked:
                section_tokens = len(section.split()) * 1.3  # Rough estimate
                if total_tokens + section_tokens > max_tokens:
                    break
                selected.append(section)
                total_tokens += section_tokens
            return "\n\n---\n\n".join(selected)

This is the part where chunking and embeddings still earn their keep. You're not retrieving slices to answer the question. You're trimming the input down to what fits, then letting the model read everything that remains in full.
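One variation on the trimming step is worth sketching. The helper below assumes relevance scores and per-section token counts have already been computed, skips a section that doesn't fit instead of stopping at it, and returns the survivors in their original document order so the model reads them in sequence. `pack_sections` is a hypothetical name, not part of any library.

```python
from typing import List, Sequence


def pack_sections(
    sections: Sequence[str],
    scores: Sequence[float],
    token_counts: Sequence[int],
    max_tokens: int,
) -> List[str]:
    """Greedily keep the highest-scoring sections that fit, in original order."""
    # Visit section indices from most to least relevant
    ranked = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    kept = []
    used = 0
    for i in ranked:
        if used + token_counts[i] > max_tokens:
            continue  # too big for the remaining budget; a smaller one may still fit
        kept.append(i)
        used += token_counts[i]
    # Restore document order before handing the text to the model
    return [sections[i] for i in sorted(kept)]


picked = pack_sections(
    ["intro", "auth", "billing", "logging"],
    scores=[0.1, 0.9, 0.5, 0.8],
    token_counts=[10, 50, 30, 40],
    max_tokens=100,
)
print(picked)
```

Greedy packing is not optimal in general, but it is cheap and predictable. Whether original order beats relevance order for a given model is a judgment call to test on your own material.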
Step 4: Use Case, Entire Codebase Analysis
    # codebase_analysis.py
    from pathlib import Path

    from agent_1m import OneMillionContextAgent

    SKIP_DIRS = {"node_modules", "dist"}


    def load_codebase(path: str) -> str:
        """Load all source files into a single context string."""
        files = []
        for ext in ['ts', 'tsx', 'js', 'json', 'md']:
            for file in Path(path).rglob(f"*.{ext}"):
                # Match whole path components so names like "distance.ts" are kept
                if SKIP_DIRS & set(file.parts):
                    continue
                content = file.read_text(encoding='utf-8', errors='ignore')
                files.append(f"=== {file} ===\n{content}")
        return "\n\n".join(files)


    # Load and analyse
    codebase = load_codebase("./my-project")
    agent = OneMillionContextAgent(api_key="sk-ant-...")
    analysis = agent.run(
        context=codebase,
        instruction="""Analyse this codebase and provide:
    1. Architecture overview
    2. Security vulnerabilities (if any)
    3. Performance bottlenecks
    4. Code quality issues
    5. Suggested refactoring (top 5 by impact)
    Format as structured markdown.""",
        temperature=0.2
    )
    print(analysis)

Step 5: Multi-Model Fallback Strategy
Pick the cheapest model that can still handle the job and the context size:
    # multi_model.py
    class ContextRouter:
        MODELS = {
            "gemini-3.5-flash": {"cost_per_1m": 0.35, "context": 1_000_000, "strengths": ["speed", "cost"]},
            "minimax-m3": {"cost_per_1m": 0.30, "context": 1_000_000, "strengths": ["cost", "output_length"]},
            "claude-sonnet-4.6": {"cost_per_1m": 3.00, "context": 1_000_000, "strengths": ["reasoning", "coding"]}
        }

        def select_model(self, task_type: str, context_tokens: int) -> str:
            if task_type in ["summarisation", "extraction", "classification"]:
                return "gemini-3.5-flash"  # Fast and cheap
            if task_type == "code-generation" or context_tokens > 500_000:
                return "minimax-m3"  # Best output length, cheapest
            return "claude-sonnet-4.6"  # Best reasoning


    # Usage
    router = ContextRouter()
    context_tokens = 800_000
    model = router.select_model("code-analysis", context_tokens)
    # Input-side estimate only, derived from the table rather than hard-coded
    estimated = ContextRouter.MODELS[model]["cost_per_1m"] * context_tokens / 1_000_000
    print(f"Selected: {model} (estimated input cost: ${estimated:.2f})")

The cost_per_1m values baked into that dict are the article's original figures, and at least one is off: the Gemini Flash rate in particular runs well above $0.35 in practice (see the cost note below). Pull live pricing into this table rather than trusting the hard-coded numbers.
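Keeping prices outside the router makes that easier. The sketch below assumes a small JSON document of per-million rates that you maintain yourself or refresh from a provider's pricing page. `Price`, `load_prices`, and `estimate_cost` are hypothetical names, and the Sonnet figures are the article's own, used here only as placeholders.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_1m: float   # dollars per million input tokens
    output_per_1m: float  # dollars per million output tokens


def load_prices(raw_json: str) -> dict[str, Price]:
    """Parse {"model": {"input": x, "output": y}} into Price records."""
    return {
        name: Price(entry["input"], entry["output"])
        for name, entry in json.loads(raw_json).items()
    }


def estimate_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, with each side billed per million tokens."""
    return (
        input_tokens * price.input_per_1m + output_tokens * price.output_per_1m
    ) / 1_000_000


# Placeholder rates copied from the article's table; replace with live figures.
prices = load_prices('{"claude-sonnet-4.6": {"input": 3.00, "output": 15.00}}')
cost = estimate_cost(prices["claude-sonnet-4.6"], input_tokens=800_000, output_tokens=8_192)
print(f"Estimated cost: ${cost:.2f}")
```

Counting the output side matters more than it looks: on models where output is several times the input rate, a long generation can rival the cost of the context itself.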
Step 6: Streaming for Long Outputs
If you're generating 10K-plus token outputs, stream them so the request doesn't time out:
    # streaming.py
    from anthropic import Anthropic

    client = Anthropic()

    # Placeholder: replace with the document or codebase text you loaded earlier
    large_context = ""

    with client.messages.stream(
        model="claude-sonnet-4.6",
        max_tokens=64000,
        messages=[{
            "role": "user",
            "content": f"Context: {large_context}\n\nGenerate a complete analysis..."
        }]
    ) as stream:
        # Print each text chunk as soon as it arrives
        for text in stream.text_stream:
            print(text, end="", flush=True)

Do/Don't
| Do | Don't |
|---|---|
| Count tokens before every request | Guess at token counts |
| Use pre-filtering when context > 1M | Try to squeeze 2M tokens into a 1M window |
| Stream outputs for generation > 10K tokens | Wait for complete response synchronously |
| Use the cheapest adequate model | Default to the most expensive model |
| Test with 100K context before scaling to 1M | Send 1M tokens on your first request |
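The last row of the table can be turned into a simple ramp. The sketch below generates a sequence of context budgets that starts small and grows towards the full window; `ramp_sizes` is a hypothetical helper, and the doubling factor is an arbitrary choice rather than a recommendation from any provider.

```python
from typing import List


def ramp_sizes(start: int = 100_000, limit: int = 1_000_000, factor: int = 2) -> List[int]:
    """Context budgets to test in order, growing by `factor` and ending at `limit`."""
    if start <= 0 or factor < 2:
        raise ValueError("start must be positive and factor at least 2")
    sizes = []
    size = start
    while size < limit:
        sizes.append(size)
        size *= factor
    sizes.append(limit)  # always finish on the full window
    return sizes


for budget in ramp_sizes():
    print(f"Test run with up to {budget:,} tokens of context")
```

Checking answer quality and latency at each step shows where either starts to degrade before you commit to full-window requests.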
Cost Comparison (per 1M Tokens)
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Best For |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, code |
| MiniMax M3 | $0.30 | $1.20 | Long output, cost-sensitive |
| Gemini 3.5 Flash | $0.35 | $0.70 | Speed, summarisation |
| GPT-5.5 | $5.00 | $30.00 | General purpose (400K only) |
Two cautions on this table. The Gemini 3.5 Flash row reads $0.35/$0.70, but OpenRouter lists the model at $1.50 input / $9.00 output per million tokens, roughly four to thirteen times higher, so the figures here are unconfirmed and almost certainly understated. The MiniMax M3 row holds up to 512K input; past that you're paying closer to $0.60/$2.40. The GPT-5.5 $5/$30 pricing is accurate, but the "400K only" note isn't: the GPT-5.5 API offers a 1M-token window, and the 400K cap applies inside Codex.
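Tiered pricing like the MiniMax case is easy to get wrong in a flat lookup table. The sketch below works through the input side under two loudly stated assumptions: that "512K" means 512,000 tokens, and that a request over the threshold is billed entirely at the higher rate rather than only on the excess. Neither is confirmed here, so check the provider's billing rules before relying on the numbers. `tiered_input_cost` is a hypothetical name.

```python
def tiered_input_cost(
    input_tokens: int,
    base_rate: float = 0.30,   # dollars per 1M input tokens at or below the threshold
    long_rate: float = 0.60,   # dollars per 1M input tokens above the threshold
    threshold: int = 512_000,  # assumed meaning of "512K"
) -> float:
    """Input cost in dollars, assuming the whole request moves to the higher tier."""
    rate = base_rate if input_tokens <= threshold else long_rate
    return input_tokens * rate / 1_000_000


for tokens in (400_000, 512_000, 800_000):
    print(f"{tokens:>9,} input tokens: ${tiered_input_cost(tokens):.2f}")
```

Under these assumptions an 800K-token request costs four times as much as a 400K one, not twice, which is the kind of jump that changes which model is cheapest for a given job.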
Conclusion
A 1M-token window lets you skip the retrieval pipeline for a lot of jobs. Rather than chunking, embedding, and fetching, you load the material and ask. Sonnet 4.6 gives you the strongest reasoning, MiniMax M3 the best value when you need long outputs, and Gemini 3.5 Flash quick turnaround for lighter work. Count your tokens first, pre-filter when the input won't fit, and price each model against today's published rates rather than the figures that were current when this was written. Those move fast, and a couple of them already have.


