Back to news

Model Review

Agentic coding: Best models for multi-agent systems.

Multi-agent coding systems require models with strong SWE-bench Pro scores, large contexts, and reliable tool use. We rank Opus 4.8, MiniMax M3, GPT-5.5 Pro, and Sonnet 4.6 for agentic workflows.

AI Kick Start editorial image for Agentic coding: Best models for multi-agent systems.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

TL;DR: If you want several AI models working together on code, one planning, others writing, another reviewing, you don't need every one of them to be the most expensive model on the market. Put a strong reasoner in charge ([Claude Opus 4.8](https://openrouter.ai/anthropic/claude-opus-4.8/benchmarks) for top capability, or the open-weight [MiniMax M3](https://datanorth.ai/news/minimax-launches-m3) for value), then hand the cheaper, repetitive jobs to cheaper models. The result does the work for a fraction of an all-premium setup.

Key takeaways

  • Multi-agent coding works best as a mix of models, not a uniform line-up of the priciest one.
  • Opus 4.8 leads on capability ($5 / $25, 1M context); MiniMax M3 is the open-weights value pick (59.0% SWE-bench Pro, 1M context).
  • Mind the price footnotes: M3's $0.30 / $1.20 is a launch promo (standard is $0.60 / $2.40), and GPT-5.5 Pro runs far higher than older quotes suggested, around $30 / $180.
  • SWE-bench Pro scores aren't strictly comparable across vendors; read them as a rough hit rate, not a precise ranking.
  • Reserve frontier models for orchestration and review; send testing and routine edits to cheap models like Gemini 3.5 Flash or a DeepSeek V4 tier.

Agentic coding: Best models for multi-agent systems

Analysis

A year ago, getting one AI to write a working feature felt like a win. Now teams are wiring several models together so they hand work back and forth: one drafts a plan, another writes the code, a third checks it, a fourth writes the tests. People call these multi-agent coding systems, and they're becoming a normal way to ship software rather than a research curiosity.

Here's the catch that trips most teams up. When you run a swarm of agents, the bill adds up fast, and not every model behaves well in a relay race. An agent has to write correct code on its own, hold a lot of context in its head, call tools without fumbling, catch its own mistakes, and write output the next agent can actually use. Some models are good at all of that. Some are good at one benchmark and useless in a team.

So the real question for an Australian business isn't "which model is best", it's "which model goes where". Get that mix right and you run a capable system for a fraction of what it costs to make every agent a frontier model. Get it wrong and you're paying Ferrari prices to generate unit tests.

This guide breaks down which models suit which seat, with prices and benchmark scores so you can sanity-check the spend before you commit.

What makes a good agentic coding model?

Multi-agent systems lean on five things:

  • Solid coding scores (we use SWE-bench Pro above 55% as a rough cut-off): each agent has to write correct code without a human babysitting it.
  • A large context window (we look for 256K and up): agents need room to pass around state, code, and plans.
  • Dependable tool use: function calls and API integrations that fire the same way every time.
  • Self-correction: spotting a mistake and fixing it without waiting for a person.
  • Clear output: writing results that the next agent can read and act on.

A note on the benchmark, because it matters for how much weight you put on these numbers. SWE-bench Pro isn't one apples-to-apples scoreboard. Vendors run it with their own scaffolding, so a number Anthropic reports for Opus and a number a third party reports for another model aren't strictly comparable. The 55% and 256K thresholds above are our editorial line in the sand, not industry law. Treat the scores as a guide to roughly how often a model gets it right, not as a precise league table.

Tiered recommendations

Tier 1: Maximum capability

Claude Opus 4.8 ($5 input / $25 output per million tokens, vendor-reported 69.2% SWE-bench Pro, 1M context)

For a single model doing agentic coding, Opus 4.8 is the one to beat. The pricing and the 1M-token context window are both confirmed. Its 69.2% SWE-bench Pro figure is Anthropic's own reported number rather than an independent score, so read it as "fails less often" rather than a precise rank, but in practice it does fail less often, which means less time spent catching and recovering from agent errors. The big context lets agents keep shared state across a large codebase. The catch is money: running several Opus agents at once gets expensive.

Tier 2: Best open-weights

MiniMax M3 ($0.60 input / $2.40 output per million tokens standard; a launch promotion has been running at $0.30 / $1.20, 59.0% SWE-bench Pro, 1M context, open)

M3 is the open-weights pick for agent systems. Its 59.0% SWE-bench Pro score holds up for most agent tasks, and the 1M context matches Opus 4.8. Because the weights are open, you can self-host and sidestep the latency and privacy worries that come with sending everything to a third party. One thing to watch on price: the eye-catching $0.30 / $1.20 figure is a launch-period 50%-off rate, not the standard one. Even at the standard $0.60 / $2.40, though, it undercuts Opus heavily, which is what makes running a fleet of agents affordable.

Tier 3: Balanced option

GPT-5.5 Pro (priced around $30 input / $180 output per million tokens; roughly 1M context)

GPT-5.5 Pro sits in the conversation on capability, but it's the costly seat. Worth flagging: earlier drafts of this guide listed it at $8 / $40 with a 400K context and a 62.4% SWE-bench Pro score, and none of those hold up. OpenRouter lists it nearer $30 input / $180 output with a context window above 1M, and the 62.4% Pro score is unverified, GPT-5.5 non-Pro is documented around 58.6%, with no published Pro figure to confirm. Its real draw is OpenAI's ecosystem: if you're already on the Assistants API or OpenAI's built-in tool frameworks, the integration may be worth the premium. At those prices, though, it's a deliberate choice, not a default.

Tier 4: Cost-conscious

Claude Sonnet 4.6 ($3 input / $15 output per million tokens, 1M context in beta)

Sonnet 4.6 is the budget seat for an Anthropic-based agent stack. The pricing is confirmed, and the 1M context is available as a beta capability (some trackers still list the standard window at 200K). On coding scores, treat the often-quoted 58.1% SWE-bench Pro with care: Anthropic publishes Sonnet 4.6 on SWE-bench Verified, not Pro, so that specific Pro number isn't confirmed. Either way, Sonnet handles routine work reliably, and at $3 / $15 it's a 40% saving on Opus 4.8, enough to make a larger swarm of worker agents pay off.

Architecture recommendations

For a multi-agent coding system, here's how we'd assign the seats:

  • Orchestrator agent: Opus 4.8 or MiniMax M3, the heaviest reasoning sits here.
  • Implementation agents: MiniMax M3 or Sonnet 4.6, best value for the bulk of the writing.
  • Review agents: GPT-5.5 Pro or Opus 4.8, accuracy matters most on the check.
  • Testing agents: Gemini 3.5 Flash or a low-cost DeepSeek model, cheap, and good enough for generating tests. (Earlier drafts named "DeepSeek V3.5"; that version label appears not to exist, the current DeepSeek generation is V4, so use a V4 Flash tier here.)

Verdict

Build around MiniMax M3 if you want open-weights value, or Opus 4.8 if you want the top of the capability range. Then drop cheaper models like Gemini Flash and a DeepSeek tier into the jobs that don't need frontier reasoning, test generation, simple edits, repetitive checks. The point worth keeping: not every agent in the system needs to be a Ferrari.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

Want help applying this? Explore AI agent design systems.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.

Explore with AI

Use the article as a decision prompt

Summarise this AI Kick Start article for an Australian business owner. Focus on the useful decision, the risks, and the first practical next step: Agentic coding: Best models for multi-agent systems

Turn this into a practical roadmap.

Use the guide as a starting point, then map the first workflow worth building.

Book an AI strategy call