How-to Guide

How to build an AI agent with 1M token context.

Use the 1M context windows of Claude Sonnet 4.6, MiniMax M3, and Gemini 3.5 Flash to build agents that can process entire codebases, books, and datasets in a single prompt.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

A 1M token context window changes how you feed information to a model. Instead of chopping documents into chunks and hoping the model keeps the thread, an agent can take an entire codebase, months of conversation history, or a whole book in a single prompt. This guide builds a 1M-context agent on three models that handle that scale today: [Claude Sonnet 4.6](https://claude.com/blog/1m-context-ga) (1M context now generally available, $3/$15 per million input/output tokens), [MiniMax M3](https://www.minimax.io/blog/minimax-m3) ($0.30/$1.20 for inputs up to 512K tokens, open-weights, up to 512K output), and [Gemini 3.5 Flash](https://openrouter.ai/google/gemini-3.5-flash) (1M context, fast inference). They aren't the only models with 1M context anymore (GPT-5.5 and Claude Opus 4.6 offer it too), but these are the practical workhorses.

Key takeaways

  • Claude Sonnet 4.6: 1M context (generally available), $3/$15, best reasoning
  • MiniMax M3: 1M context, $0.30/$1.20 for inputs up to 512K tokens, open-weights, up to 512K output
  • Gemini 3.5 Flash: 1M context, $0.35/$0.70 as originally quoted (unconfirmed; see the cost comparison), fastest inference
  • Chunking: Still useful for pre-filtering; less critical with 1M
  • Cost: 1M input tokens = $3 (Sonnet), $0.30 (M3, rate for inputs up to 512K), $0.35 (Flash, unconfirmed)

Analysis

For years, the standard advice for feeding a large document to a language model was to break it into pieces. You'd chunk the text, build embeddings, store them in a vector database, and retrieve the parts that looked relevant at query time. It worked, but it was fiddly, and it leaked context. The model never saw the whole picture, only the slices you guessed it would need.

That constraint is loosening. Several models now accept roughly a million tokens in one go, enough for a mid-sized codebase, a long contract with all its appendices, or an entire support-ticket history. You can hand the model the full thing and ask your question, no retrieval pipeline required.

For an Australian business team, the practical upshot is fewer moving parts. The work that used to live in chunking logic and a vector store can sometimes collapse into a single API call. That doesn't make retrieval obsolete, and it isn't free: a million input tokens still costs real money, and the cheapest option depends on the job. But the engineering you can now skip is substantial.
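To put a number on "real money", the arithmetic is tokens divided by a million, times the per-million rate, done separately for input and output. A minimal sketch using the Sonnet 4.6 and MiniMax M3 rates quoted in this article (not independently verified); `request_cost` is a hypothetical helper, and the 400K/4K token counts are made up for illustration:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in USD for one request; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Rates as quoted in this article; check current pricing before budgeting.
sonnet = request_cost(400_000, 4_000, input_rate=3.00, output_rate=15.00)
m3 = request_cost(400_000, 4_000, input_rate=0.30, output_rate=1.20)

print(f"Sonnet 4.6: ${sonnet:.2f}")
print(f"MiniMax M3: ${m3:.2f}")
```

Input dominates the bill for this kind of workload, which is why the per-million input rate is the figure to watch.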

This guide walks through building a 1M-context agent: counting tokens so you don't blow the limit, wiring up the agent, pre-filtering when even a million isn't enough, and routing each task to the model that does it cheapest.

Prerequisites

  • API keys for Anthropic, MiniMax, or Google
  • Understanding of token counting
  • A large document or codebase to process
  • Python 3.10+

Step-by-Step Framework

Step 1: Count Tokens Accurately

Before you fire off a million tokens, you need to know how many you actually have:

# token_counter.py
from pathlib import Path

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    # tiktoken ships OpenAI tokenizers only, so for Claude, MiniMax, or Gemini
    # this is an estimate. Leave headroom or use the provider's own counter.
    encoder = tiktoken.encoding_for_model(model)
    # disallowed_special=() stops encode() raising when source files happen
    # to contain strings such as "<|endoftext|>".
    return len(encoder.encode(text, disallowed_special=()))

def count_file_tokens(path: str) -> int:
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        return count_tokens(f.read())

def count_directory_tokens(dir_path: str, extensions: list[str]) -> dict:
    results = {}
    total = 0

    for ext in extensions:
        for file in Path(dir_path).rglob(f"*.{ext}"):
            try:
                tokens = count_file_tokens(str(file))
            except Exception:
                # Unreadable files are recorded with a zero count
                tokens = 0
            results[str(file)] = tokens
            total += tokens

    return {"files": results, "total": total, "file_count": len(results)}

# Example: count a codebase
stats = count_directory_tokens("./src", ["ts", "tsx", "js"])
print(f"Total tokens: {stats['total']:,}")  # e.g., 485,231 tokens
print(f"Files: {stats['file_count']}")
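Because tiktoken only ships OpenAI tokenizers, a count from it is an approximation for other providers' models. One way to stay safe is to pad the estimate and reserve room for the reply before deciding a payload fits. A sketch, where `fits_in_window`, the 10% margin, and the 8,192-token reply reserve are all illustrative assumptions rather than provider guidance:

```python
def fits_in_window(estimated_tokens: int, window: int = 1_000_000,
                   reply_reserve: int = 8_192, margin: float = 0.10) -> bool:
    """Return True if a padded estimate plus the reply reserve fits the window."""
    # Pad the estimate to allow for tokenizer mismatch between providers
    padded = int(estimated_tokens * (1 + margin))
    return padded + reply_reserve <= window

print(fits_in_window(485_231))  # a mid-sized codebase
print(fits_in_window(950_000))  # too close to the limit once padded
```

If the check fails, that is the cue to pre-filter (Step 3) rather than send the request and wait for the API to reject it.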

Step 2: Build the 1M Context Agent

# agent_1m.py
from typing import List, Tuple

from anthropic import Anthropic

class OneMillionContextAgent:
    # Model ID as written in this guide; confirm the exact identifier
    # against Anthropic's published model list before running.
    def __init__(self, api_key: str, model: str = "claude-sonnet-4.6"):
        self.client = Anthropic(api_key=api_key)
        self.model = model
        self.max_context = 1_000_000
        self.max_output = 8192  # Can request up to 64K

    def build_prompt(self, context: str, instruction: str) -> Tuple[str, List[dict]]:
        """Return the system prompt and message list for one request."""
        system_prompt = f"""You are an expert analyst with access to {self.max_context:,} tokens of context.
Analyse the provided materials thoroughly and follow the instruction precisely.
Be concise but comprehensive."""

        messages = [{
            "role": "user",
            "content": f"Context:\n\n{context}\n\nInstruction: {instruction}"
        }]

        return system_prompt, messages

    def run(self, context: str, instruction: str, temperature: float = 0.3) -> str:
        """Execute with full context."""
        system, messages = self.build_prompt(context, instruction)

        # Count the whole request (system prompt + messages) via the SDK
        count = self.client.messages.count_tokens(
            model=self.model, system=system, messages=messages
        )
        prompt_tokens = count.input_tokens
        print(f"Prompt tokens: {prompt_tokens:,} / {self.max_context:,}")

        # Input and output share the window, so leave room for the reply
        if prompt_tokens + self.max_output > self.max_context:
            raise ValueError(
                f"Context too large: {prompt_tokens:,} input tokens plus "
                f"{self.max_output:,} reserved for output exceeds {self.max_context:,}. "
                "Use the pre-filtering method (Step 3)."
            )

        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_output,
            temperature=temperature,
            system=system,
            messages=messages
        )

        return response.content[0].text

One note on the guard in this snippet: the Anthropic Python SDK counts tokens via client.messages.count_tokens(...), which returns an object with an input_tokens field. There is no client.count_tokens(text) call you can wrap in len(). The count is a network call, so it adds a round trip before each request, and it covers the system prompt and messages together rather than the raw context alone.
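When the context is many files rather than one blob, it can help to label each one so the model can say where something came from. The sketch below wraps each document in XML-style tags and puts the instruction last, a layout along the lines of what Anthropic's long-context prompting guidance suggests. `wrap_documents` and the exact tag names are illustrative choices, not part of the SDK:

```python
def wrap_documents(docs: dict[str, str], instruction: str) -> str:
    """Wrap each document in labelled tags and place the instruction at the end."""
    parts = ["<documents>"]
    for index, (source, content) in enumerate(docs.items(), start=1):
        parts.append(
            f'<document index="{index}">\n'
            f"<source>{source}</source>\n"
            f"<document_content>\n{content}\n</document_content>\n"
            f"</document>"
        )
    parts.append("</documents>")
    # The question goes after the material it refers to
    parts.append(f"\nInstruction: {instruction}")
    return "\n".join(parts)

prompt = wrap_documents(
    {"src/app.ts": "export const x = 1;", "README.md": "# Demo"},
    "List the exported symbols.",
)
print(prompt)
```

The returned string could be passed as the user message content in place of the plain "Context: ... Instruction: ..." format.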

Step 3: Pre-Filtering for Very Large Inputs

When your context runs past 1M tokens, you filter before you send:

# prefilter.py
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

class ContextPrefilter:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def filter_relevant_sections(
        self,
        sections: List[str],
        query: str,
        max_tokens: int = 900_000
    ) -> str:
        """Keep only the most relevant sections to fit under the limit."""
        # Embed the query; unit-length vectors make the dot product a cosine similarity
        query_embedding = self.embedder.encode(query, normalize_embeddings=True)

        # Embed all sections the same way
        section_embeddings = self.embedder.encode(sections, normalize_embeddings=True)

        # Calculate similarity (one score per section)
        similarities = np.dot(section_embeddings, query_embedding)

        # Sort by relevance, most similar first
        ranked = sorted(zip(sections, similarities), key=lambda x: x[1], reverse=True)

        # Take top sections until we hit the token limit
        selected = []
        total_tokens = 0

        for section, score in ranked:
            # Rough estimate: ~1.3 tokens per whitespace-separated word
            section_tokens = len(section.split()) * 1.3
            if total_tokens + section_tokens > max_tokens:
                break
            selected.append(section)
            total_tokens += section_tokens

        return "\n\n---\n\n".join(selected)

This is the part where chunking and embeddings still earn their keep. You're not retrieving slices to answer the question; you're trimming the input down to what fits, then letting the model read what remains in full.
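Two refinements are worth considering. The selection loop above stops at the first section that doesn't fit, even when smaller relevant sections would still squeeze in, and it emits sections in relevance order, which can scramble a document's flow. A sketch of an alternative that skips oversized sections and restores the original order; `select_within_budget` is a hypothetical helper, and the scores and token counts are supplied by the caller:

```python
def select_within_budget(sections, scores, token_counts, max_tokens):
    """Pick the highest-scoring sections that fit, returned in original order."""
    # Indices sorted from most to least relevant
    ranked = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        if used + token_counts[i] > max_tokens:
            continue  # too big for what's left; a smaller section may still fit
        chosen.append(i)
        used += token_counts[i]
    # Restore document order so the model reads sections in sequence
    return [sections[i] for i in sorted(chosen)]

sections = ["intro", "billing rules", "appendix", "refund policy"]
picked = select_within_budget(
    sections,
    scores=[0.1, 0.9, 0.2, 0.8],
    token_counts=[50, 400, 700, 300],
    max_tokens=800,
)
print(picked)
```

Whether to keep relevance order or document order depends on the material: order matters for contracts and code, less so for independent support tickets.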

Step 4: Use Case, Entire Codebase Analysis

# codebase_analysis.py
from agent_1m import OneMillionContextAgent
from pathlib import Path

def load_codebase(path: str) -> str:
    """Load all source files into a single context string."""
    skip_dirs = {'node_modules', 'dist'}
    files = []
    for ext in ['ts', 'tsx', 'js', 'json', 'md']:
        for file in Path(path).rglob(f"*.{ext}"):
            # Match whole directory names so files like "distance.ts" are kept
            if skip_dirs.intersection(file.parts):
                continue
            content = file.read_text(encoding='utf-8', errors='ignore')
            # Label each file so the model can refer to it by path
            files.append(f"=== {file} ===\n{content}")
    return "\n\n".join(files)

# Load and analyse
codebase = load_codebase("./my-project")
agent = OneMillionContextAgent(api_key="sk-ant-...")

analysis = agent.run(
    context=codebase,
    instruction="""Analyse this codebase and provide:
1. Architecture overview
2. Security vulnerabilities (if any)
3. Performance bottlenecks
4. Code quality issues
5. Suggested refactoring (top 5 by impact)

Format as structured markdown.""",
    temperature=0.2
)

print(analysis)
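Note that the loader above picks up every .json file, so lockfiles and other generated artefacts can eat a large share of the token budget. A sketch of a filter you could apply before reading each file; the skip lists and the 200 KB cap are assumptions to adapt to your repository, and `should_include` is a hypothetical helper:

```python
from pathlib import Path

# Illustrative defaults; adjust for your project layout
SKIP_DIRS = {"node_modules", "dist", "build", ".git"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}

def should_include(file: Path, size_bytes: int, max_bytes: int = 200_000) -> bool:
    """Decide whether a file is worth spending context tokens on."""
    # Skip anything under a vendored or generated directory
    if SKIP_DIRS.intersection(file.parts):
        return False
    # Skip lockfiles and minified bundles
    if file.name in SKIP_NAMES or file.name.endswith(".min.js"):
        return False
    # Very large files are usually data or generated output
    return size_bytes <= max_bytes

print(should_include(Path("src/app.ts"), 1_200))
print(should_include(Path("package-lock.json"), 1_000))
```

Inside the loader this would be called as `should_include(file, file.stat().st_size)` before `read_text`.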

Step 5: Multi-Model Fallback Strategy

Pick the cheapest model that can still handle the job and the context size:

# multi_model.py
class ContextRouter:
    MODELS = {
        "gemini-3.5-flash": {"cost_per_1m": 0.35, "context": 1_000_000, "strengths": ["speed", "cost"]},
        "minimax-m3": {"cost_per_1m": 0.30, "context": 1_000_000, "strengths": ["cost", "output_length"]},
        "claude-sonnet-4.6": {"cost_per_1m": 3.00, "context": 1_000_000, "strengths": ["reasoning", "coding"]}
    }

    def select_model(self, task_type: str, context_tokens: int) -> str:
        if task_type in ["summarisation", "extraction", "classification"]:
            return "gemini-3.5-flash"  # Fast and cheap

        if task_type == "code-generation" or context_tokens > 500_000:
            return "minimax-m3"  # Best output length, cheapest

        return "claude-sonnet-4.6"  # Best reasoning

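# Usage
router = ContextRouter()
context_tokens = 800_000
model = router.select_model("code-analysis", context_tokens)

# Estimate input cost from the table instead of hard-coding a figure
estimated = ContextRouter.MODELS[model]["cost_per_1m"] * context_tokens / 1_000_000
print(f"Selected: {model} (estimated input cost: ${estimated:.2f})")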

The cost_per_1m values baked into that dict are the article's original figures, and at least one is off: the Gemini Flash rate in particular runs well above $0.35 in practice (see the cost note below). Pull live pricing into this table rather than trusting the hard-coded numbers.
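The router above picks one model up front. The fallback half of the strategy is what happens when that call fails, for example on a rate limit or an outage. A sketch of a simple fallback chain; `call_with_fallback` and the `call_model` callable are hypothetical, and in practice `call_model` would wrap each provider's client:

```python
from typing import Callable, Sequence

def call_with_fallback(models: Sequence[str],
                       call_model: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first success."""
    errors = []
    for name in models:
        try:
            return name, call_model(name)
        except Exception as exc:
            # Record the failure and move on to the next candidate
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))

# Stand-in for real provider calls, used here only to demonstrate the flow
def fake_call(name: str) -> str:
    if name == "minimax-m3":
        raise TimeoutError("simulated timeout")
    return f"answer from {name}"

used, answer = call_with_fallback(["minimax-m3", "claude-sonnet-4.6"], fake_call)
print(used, "->", answer)
```

Ordering the list from cheapest to most expensive keeps the common case cheap while still giving you an answer when the first choice is unavailable.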

Step 6: Streaming for Long Outputs

If you're generating 10K-plus token outputs, stream them so the request doesn't time out:

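# streaming.py
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder input: "context.txt" is a hypothetical file holding your material
large_context = Path("context.txt").read_text(encoding="utf-8")

with client.messages.stream(
    model="claude-sonnet-4.6",
    max_tokens=64000,
    messages=[{
        "role": "user",
        "content": f"Context: {large_context}\n\nGenerate a complete analysis..."
    }]
) as stream:
    # Print each chunk as soon as it arrives
    for text in stream.text_stream:
        print(text, end="", flush=True)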
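Printing chunks is fine for a demo, but for long outputs you usually want to keep the text as well, and not lose what has already arrived if the connection drops partway. A sketch that writes each chunk to a sink as it arrives; `collect_stream` is a hypothetical helper that accepts any iterable of strings, so an iterator such as `stream.text_stream` could be passed in:

```python
import io
from typing import Iterable, TextIO

def collect_stream(chunks: Iterable[str], sink: TextIO) -> str:
    """Write chunks to sink as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        sink.write(chunk)
        sink.flush()  # persist partial output immediately
        parts.append(chunk)
    return "".join(parts)

# Simulated chunks stand in for a live stream here
buffer = io.StringIO()
full_text = collect_stream(["## Summary\n", "The service ", "has three modules."], buffer)
print(full_text)
```

With a real file opened in write mode as the sink, a dropped connection still leaves the partial analysis on disk.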

Do/Don't

| Do | Don't |
| --- | --- |
| Count tokens before every request | Guess at token counts |
| Use pre-filtering when context > 1M | Try to squeeze 2M tokens into a 1M window |
| Stream outputs for generation > 10K tokens | Wait for complete response synchronously |
| Use the cheapest adequate model | Default to the most expensive model |
| Test with 100K context before scaling to 1M | Send 1M tokens on your first request |

Cost Comparison (1M Input Tokens)

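| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning, code |
| MiniMax M3 | $0.30 | $1.20 | Long output, cost-sensitive |
| Gemini 3.5 Flash | $0.35 | $0.70 | Speed, summarisation |
| GPT-5.5 | $5.00 | $30.00 | General purpose (400K only) |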

Two cautions on this table. The Gemini 3.5 Flash row reads $0.35/$0.70, but OpenRouter lists the model at $1.50 input / $9.00 output per million tokens, roughly four to thirteen times higher, so the figures here are unconfirmed and almost certainly understated. The MiniMax M3 row holds up to 512K input; past that you're paying closer to $0.60/$2.40. The GPT-5.5 $5/$30 pricing is accurate, but the "400K only" note isn't: the GPT-5.5 API offers a 1M-token window, and the 400K cap applies inside Codex.
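The MiniMax M3 tiering is easy to get wrong in a budget. A sketch of the calculation, assuming the whole request is billed at the higher rate once input exceeds the threshold and that 512K means 512,000 tokens. Both are assumptions to confirm against MiniMax's pricing page, and `m3_cost` is a hypothetical helper:

```python
def m3_cost(input_tokens: int, output_tokens: int, threshold: int = 512_000) -> float:
    """Estimated USD cost for one MiniMax M3 request under the assumed tiering."""
    if input_tokens <= threshold:
        in_rate, out_rate = 0.30, 1.20   # rates quoted for inputs up to 512K
    else:
        in_rate, out_rate = 0.60, 2.40   # approximate rates quoted past 512K
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

small = m3_cost(400_000, 10_000)
large = m3_cost(800_000, 10_000)
print(f"400K input: ${small:.3f}")
print(f"800K input: ${large:.3f}")
```

Doubling the input here roughly quadruples the bill, because the request crosses the tier boundary as well as growing in size.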

Conclusion

A 1M token window lets you skip the retrieval pipeline for a lot of jobs. Rather than chunking, embedding, and fetching, you load the material and ask. Sonnet 4.6 gives you the strongest reasoning, MiniMax M3 the best value when you need long outputs, and Gemini 3.5 Flash quick turnaround for lighter work. Count your tokens first, pre-filter when the input won't fit, and price each model against today's published rates rather than the figures that were current when this was written. Those move fast, and a couple of them already have.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
