Analysis
For most of the last two years, the race in AI coding tools was about who had the biggest, smartest single model. Anthropic, OpenAI, and a handful of open-weight labs kept shipping models that could hold more in their heads at once. The pitch was simple: hand it the whole problem, and it will sort the whole problem.
That pitch is quietly being replaced. The teams getting the most out of these tools in 2026 are not feeding everything to one model. They are running small fleets of agents, each with a single job, coordinated by one smart agent on top. Think of it less like hiring one brilliant generalist and more like running a small workshop where one person plans, others build, test, and document, and the planner keeps everyone pointed the same way.
For an Australian business team, the practical takeaway is this. The interesting question is no longer "which model is best." It is "how do I get a group of cheaper, focused agents to outwork one expensive one." The reported wins are real enough to pay attention to, and the trade-offs (extra cost, coordination friction) are real enough that you should not reach for multi-agent orchestration on every job.
Worth a caveat up front. The headline that "ten specialised agents beat one larger agent" is an editorial reading of where things are heading, not a published benchmark. Treat it as a direction the field is moving in, not a settled fact.
Why Multiple Agents Win
A single large model has to carry the whole problem in its context window at once. It is reasoning about architecture, syntax, testing, documentation, and deployment in the same breath. That sets up two ways to fail. The first is context overflow, where the model loses the thread on details that mattered. The second is attention dilution, where it does an okay job on everything and a great job on nothing.
Splitting the work flips that around. A coordinator agent holds the high-level plan. Router agents hand out the sub-tasks. Specialist agents each own a narrow patch: one writes tests, one handles migrations, one keeps the docs current. Because each specialist has a tightly defined job, it can run on a smaller, cheaper model. The coordinator runs on a bigger model and spends all of its attention on the harder problem, which is integration and resolving conflicts between the specialists.
Pattern 1: The Coordinator-Router-Specialist Stack
This is the pattern you see most often across the three platforms covered here (Claude Code, Hermes, and OpenClaw):
Coordinator (Opus 4.8 / Hermes 3 / GPT-4.1)
├── Router: Task decomposition and dispatch
├── Specialist A: Code generation
├── Specialist B: Test generation
├── Specialist C: Documentation
├── Specialist D: Migration scripts
└── Specialist E: Dependency analysis
(One note on the diagram: Hermes 3 is a real open-weight model family from Nous Research, but it predates the newer Hermes Agent runtime mentioned later. Listing it as a peer coordinator alongside Opus 4.8 and GPT-4.1 is illustrative, not a documented setup.)
The coordinator takes the high-level request ("migrate from Express to Fastify"), works out a plan, and farms each sub-task to the right specialist. Specialists run in parallel wherever the work allows. The coordinator then reviews what comes back, sorts out conflicts (say, Specialist A changed an interface that Specialist B's tests rely on), and assembles the finished result.
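The plan, dispatch, and assemble loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `SubTask`, `run_specialist`, `plan`, and `coordinate` are hypothetical names, and the specialist call is a stub standing in for a real model request.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubTask:
    role: str    # which specialist should handle it, e.g. "code" or "test"
    prompt: str  # instruction passed to that specialist


async def run_specialist(task: SubTask) -> dict:
    """Hypothetical stand-in for a call to a smaller, cheaper model."""
    await asyncio.sleep(0)  # placeholder for real network latency
    return {"role": task.role, "output": f"[{task.role}] done: {task.prompt}"}


def plan(request: str) -> list[SubTask]:
    """Coordinator step: break the request into one sub-task per concern."""
    roles = ["code", "test", "docs", "migration", "deps"]
    return [SubTask(role=r, prompt=f"{request} ({r})") for r in roles]


async def coordinate(request: str) -> dict:
    subtasks = plan(request)
    # Router step: dispatch every sub-task and run the specialists concurrently.
    results = await asyncio.gather(*(run_specialist(t) for t in subtasks))
    # Integration step: a real coordinator would reconcile conflicts here;
    # this sketch only collects the outputs by role.
    return {r["role"]: r["output"] for r in results}


result = asyncio.run(coordinate("migrate from Express to Fastify"))
```

The parallelism comes from `asyncio.gather`, which waits on all specialists at once. The conflict-resolution step is deliberately left as a comment, because that is where most of the real engineering effort goes.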
In Claude Code, this runs through Dynamic Workflows. The snippet below is illustrative pseudo-CLI rather than a real command. In practice, Dynamic Workflows are JavaScript scripts that Claude writes and a runtime executes, not a declarative --specialist flag, so read this as a sketch of the idea:
# Define a workflow with multiple specialist subagents
claude workflow create --name "migration" --specialist "code:code-gen" --specialist "test:test-gen" --specialist "docs:doc-gen" --coordinator opus-4.8
# Execute with a high-level prompt
claude workflow run migration "migrate from Express to Fastify"
Pattern 2: Competitive Redundancy
For code paths you cannot afford to get wrong, some teams run several generator agents on the same task, each using a different underlying model or sampling seed. A judge agent then compares the outputs and either picks the best one or merges them. It costs you roughly 2-3x more, but it catches subtle bugs a single agent would sail past.
Coordinator
├── Generator A (Claude Sonnet 4.8)
├── Generator B (GPT-4.1)
├── Generator C (Hermes 3)
└── Judge (Opus 4.8): selects best output
One flag on that diagram: "Claude Sonnet 4.8" is a rumoured model that has not shipped. The latest Sonnet available is 4.6, and the released 4.8 model is Opus, not Sonnet. Read the Sonnet entry as a placeholder for whatever current generator you actually have on hand.
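As a rough sketch of the generate-then-judge idea, the Python below runs several generators on one task and lets a scoring function pick the winner. Everything here is hypothetical: the generators are toy lambdas rather than model calls, and the `score` function stands in for whatever a real judge would check (tests passing, lint results, or a model's own ranking).

```python
from typing import Callable

Generator = Callable[[str], str]
Scorer = Callable[[str], float]


def judge(candidates: dict[str, str], score: Scorer) -> str:
    """Return the name of the highest-scoring candidate."""
    return max(candidates, key=lambda name: score(candidates[name]))


def run_redundant(
    task: str, generators: dict[str, Generator], score: Scorer
) -> tuple[str, str]:
    """Run every generator on the same task, then let the judge choose one."""
    candidates = {name: gen(task) for name, gen in generators.items()}
    winner = judge(candidates, score)
    return winner, candidates[winner]


# Toy generators standing in for different models.
generators = {
    "gen_a": lambda task: "def f(x):\n    return x",
    "gen_b": lambda task: (
        "def f(x):\n    if x is None:\n        raise ValueError\n    return x"
    ),
}


def score(code: str) -> float:
    """Toy judge: reward candidates that handle None explicitly."""
    return float("raise ValueError" in code)


winner, best_output = run_redundant("validate input", generators, score)
```

This version only selects. Merging outputs, which the pattern also allows, needs a judge that understands the code rather than one that scores it.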
Pattern 3: Learning Specialist Agents
Hermes takes the specialist idea further with its learning loop. Its specialist agents do not only run tasks, they learn from them. When the test specialist spots a recurring bug pattern, it writes up a skill signature and passes it to the code specialist. After a while, the code specialist starts heading off that pattern before it happens. This agent-to-agent learning is Hermes' own thing, and it is why its multi-agent setups tend to improve faster than orchestrations that stay static.
# Hermes agent-to-agent skill sharing
hermes.skills.share(
    from_agent="test-specialist",
    to_agent="code-specialist",
    skill_signature="avoid-null-returns-in-async-functions",
    confidence=0.94
)
Pattern 4: Cron-Scheduled Agent Hierarchies
OpenClaw's sub-agent architecture can run agents on a schedule. A parent agent spawns child agents that fire on cron timers: a daily dependency audit, an hourly security scan, a weekly docs review. Each child reports back to the parent, which gathers the findings and decides whether a human needs to step in.
{
  "subAgents": [
    {
      "name": "security-scanner",
      "schedule": "0 * * * *",
      "skill": "security-audit",
      "reportTo": "main-agent",
      "threshold": "critical-only"
    }
  ]
}
Pattern 5: tmux-Based Multi-Agent Sessions
If you live in the terminal, Claude Code can be driven across a tmux session, with each agent in its own pane and the coordinator passing messages through tmux. It works well for long-running jobs where you want eyes on each agent's progress. One caveat: the command below is illustrative. Practitioners routinely run several Claude Code panes in tmux by hand, but a dedicated claude multi-agent --tmux flag is not a documented official feature, so do not assume it works verbatim:
# Launch 3 agents in a tmux session
claude multi-agent --tmux --agents 3 --task "refactor monolith into microservices"
The Overhead Problem
None of this is free. Coordination overhead, meaning the time agents spend talking to each other, sorting out conflicts, and waiting on dependencies, can eat 20-30% of total run time. As a rough rule of thumb, the break-even point sits somewhere around 5 or more files touched, or 3 or more distinct concerns (code, tests, docs, migrations). Below that, one agent is faster and cheaper. Treat these as practitioner heuristics rather than figures from a published study, since they are not independently sourced.
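The break-even rule of thumb is simple enough to write down as a check. The function name and parameters below are made up for illustration, and the thresholds are the article's heuristics, not measured values:

```python
def should_use_multi_agent(files_touched: int, concerns: int) -> bool:
    """Apply the rough break-even heuristic: 5+ files or 3+ distinct concerns.

    Both thresholds are practitioner rules of thumb; tune them to your own runs.
    """
    return files_touched >= 5 or concerns >= 3


small_fix = should_use_multi_agent(files_touched=2, concerns=1)
big_migration = should_use_multi_agent(files_touched=12, concerns=4)
```

A gate like this, run before any orchestration starts, keeps small jobs on the single-agent path where they are faster and cheaper.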
Measuring Multi-Agent Performance
You cannot manage what you do not measure, and orchestration is no exception. The metrics worth watching:
- Coordination ratio: time spent coordinating versus time spent actually working (target: under 25%)
- Conflict rate: share of sub-agent outputs that need reconciling (target: under 10%)
- Quality delta: bug rate of multi-agent output against single-agent output on the same task
- Cost multiplier: total token cost of multi-agent versus single-agent (usually 1.5-3x)
A reminder on those targets: the under-25% and under-10% figures, like the cost multipliers, are reasonable working benchmarks rather than vendor-published numbers, so calibrate them against your own runs.
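To make the four metrics concrete, here is a small Python sketch that computes them from per-run totals. The `RunStats` fields are assumed names, not a schema any of these tools export. The sketch also takes coordination ratio to mean coordination time as a share of total run time, which is one reasonable reading of the definition above.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    coordination_seconds: float  # time agents spent messaging and waiting
    work_seconds: float          # time agents spent on the task itself
    outputs_total: int           # sub-agent outputs produced
    outputs_reconciled: int      # outputs that needed conflict resolution
    multi_agent_tokens: int      # total tokens for the multi-agent run
    single_agent_tokens: int     # tokens for a single-agent run of the same task
    multi_agent_bugs: int        # bugs found in the multi-agent output
    single_agent_bugs: int       # bugs found in the single-agent output


def coordination_ratio(s: RunStats) -> float:
    """Coordination time as a share of total run time (assumed definition)."""
    return s.coordination_seconds / (s.coordination_seconds + s.work_seconds)


def conflict_rate(s: RunStats) -> float:
    return s.outputs_reconciled / s.outputs_total


def cost_multiplier(s: RunStats) -> float:
    return s.multi_agent_tokens / s.single_agent_tokens


def quality_delta(s: RunStats) -> int:
    """Negative means the multi-agent run produced fewer bugs."""
    return s.multi_agent_bugs - s.single_agent_bugs


# Made-up numbers purely to exercise the functions.
stats = RunStats(20.0, 80.0, 20, 1, 300_000, 150_000, 2, 5)
within_targets = coordination_ratio(stats) < 0.25 and conflict_rate(stats) < 0.10
```

Comparing against a single-agent baseline on the same task is the expensive part, so in practice you would sample a handful of representative jobs rather than run every task twice.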
Claude Code's Dynamic Workflows ship with telemetry built in. Hermes agents log coordination events to an FTS5 session database. OpenClaw's parent agents keep tabs on their children through a file-based memory system.
The Future: Meta-Agents
The next step, still mostly speculative, is meta-agents: agents that design the topology for a given task. "This job needs two code agents, one test agent, and a migration agent" should be a call an agent makes, not a human. Early reported experiments with Claude Code's Task System suggest it is workable, with a meta-agent reading the task, choosing the specialist mix, watching execution, and rebalancing when a specialist is struggling. If that holds up, multi-agent orchestration shifts from something you build to something you simply ask for. For now, treat it as a direction of travel rather than a shipped feature.
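A toy version of that decision fits in a few lines of Python. This is a speculative sketch, not how Claude Code's Task System works: the function names, the 20-file cutoff, and the failure threshold are all arbitrary assumptions chosen for illustration.

```python
def choose_topology(concerns: list[str], files_touched: int) -> dict[str, int]:
    """Assign one specialist per concern, plus a second code agent for big jobs.

    The 20-file cutoff is an arbitrary illustrative value.
    """
    topology = {concern: 1 for concern in concerns}
    if "code" in topology and files_touched >= 20:
        topology["code"] += 1
    return topology


def rebalance(
    topology: dict[str, int], failures: dict[str, int], max_failures: int = 2
) -> dict[str, int]:
    """Add one agent to any role that has failed more than max_failures times."""
    return {
        role: count + (1 if failures.get(role, 0) > max_failures else 0)
        for role, count in topology.items()
    }


topology = choose_topology(["code", "test", "migration"], files_touched=30)
adjusted = rebalance(topology, failures={"test": 3})
```

A real meta-agent would replace both hard-coded rules with a model's judgement. The point of the sketch is only that the topology becomes data the system can compute and revise.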




