Sub-Agents That Build Themselves: Advanced Patterns.

The most powerful pattern in agentic coding: agents that analyse their own performance and spawn improved versions of themselves. Implementation patterns for Claude Code, Hermes, and OpenClaw.

Daniel Fleuren | 2026-06-02 | 14 min read | Australian business teams | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

Briefing

Here is the strange new shape of automated software work: a tool that does not just write your code, but writes the assistant that writes your code, then quietly fires that assistant and hires a better one.

It sounds like a stunt. It is closer to plumbing. Three real products now ship the pieces you would need to build it: Claude Code's dynamic workflows, which Anthropic released on 28 May 2026 to orchestrate sub-agents at scale; Hermes, an agent from Nous Research that keeps notes on its own work and gets better as it runs; and OpenClaw, whose scheduled-task system can be wired up to grade and tune other agents overnight.

For an Australian business team, the practical question is not whether your software can become sentient. It cannot. The question is whether an agent can measure its own output, spot where it falls short, and propose a better version of itself while you sleep, with a human signing off before anything touches production. Some of that is here. Some of it is people stitching together features that were not designed for the job. And one version of it is still a research idea with a hopeful press release attached.

Below is what is actually running, what is a sensible pattern you could build, and where the marketing gets ahead of the facts.

The Self-Improvement Loop

A self-building sub-agent runs a meta-loop that sits one level above ordinary task execution:

Execute: Perform the assigned task
Evaluate: Measure output quality against defined criteria
Diagnose: Identify specific weaknesses in the approach
Generate: Create an improved sub-agent configuration that addresses those weaknesses
Validate: Test the new configuration on held-out examples
Deploy: Replace the current configuration if validation passes

The loop needs three things to work: evaluation metrics you can compute automatically, a configuration space you can search, and a validation step that stops the agent from shipping a regression.
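
The six steps above can be sketched as a single function. This is a minimal illustration, not any product's implementation: `Config`, `improvement_step`, and the toy `run_task`, `score`, `diagnose` and `generate` helpers are all hypothetical names invented here, and a real system would replace the toy helpers with model calls and a proper evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical sub-agent configuration: here, just a prompt string."""
    prompt: str


def mean_score(config, tasks, run_task, score):
    """Average score of one configuration over a set of tasks."""
    return sum(score(run_task(config, t)) for t in tasks) / len(tasks)


def improvement_step(current, tasks, held_out, run_task, score, diagnose, generate):
    """One pass of the loop; returns the configuration that should stay live."""
    # Execute + Evaluate: run every task and score each output.
    results = [(t, score(run_task(current, t))) for t in tasks]
    # Diagnose: turn the scored results into a description of the weakness.
    weakness = diagnose(results)
    # Generate: propose a configuration that targets that weakness.
    candidate = generate(current, weakness)
    # Validate: compare both configurations on tasks neither was tuned on.
    baseline = mean_score(current, held_out, run_task, score)
    challenger = mean_score(candidate, held_out, run_task, score)
    # Deploy: swap only on a strict improvement, so ties keep the incumbent.
    return candidate if challenger > baseline else current


# Toy stand-ins so the sketch runs without any model calls.
def run_task(config, task):
    return f"{config.prompt}: {task}"


def score(output):
    return 1.0 if "with tests" in output else 0.0


def diagnose(results):
    return "missing tests" if any(s < 1.0 for _, s in results) else ""


def generate(config, weakness):
    return Config(config.prompt + " with tests") if weakness else config


start = Config("write code")
better = improvement_step(start, ["parse csv"], ["parse json"],
                          run_task, score, diagnose, generate)
print(better.prompt)  # write code with tests
```

Keeping the held-out tasks separate from the tasks used for diagnosis is what lets the validation step catch a candidate that only looks better on the examples it was tuned against.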

Pattern 1: Prompt Evolution (Claude Code)

Claude Code's dynamic workflows are real, and they fan work out across parallel sub-agents that a parent agent plans and coordinates. On top of that primitive, you can build what amounts to prompt evolution: the coordinator keeps a population of prompt variants for each specialist role, checks which variant produced the best output after each task, and breeds new variants by combining the patterns that worked. Worth being clear here: this genetic "prompt evolution" mechanism is not a documented Anthropic feature. It is a pattern layered on the real dynamic-workflows capability rather than something the platform ships by name.

# Sub-agent prompt evolution configuration
evolution:
  population_size: 10
  specialist: test_generator
  evaluation:
    - metric: coverage_increase
      weight: 0.4
    - metric: test_quality_score
      weight: 0.4
    - metric: execution_time
      weight: 0.2
  mutation:
    strategies:
      - add_context_section
      - strengthen_constraints
      - add_examples
      - reorder_instructions

The metrics carry the whole thing. "Coverage increase" is objective and measurable. "Test quality score" needs a model-based evaluator, which adds some subjectivity but tends to track human judgement well. The system throws out any variant that regresses on a metric. Note that the weights above (0.4, 0.4, 0.2) are illustrative numbers, not figures pulled from a benchmark.
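
One way the weighting and the regression rule could fit together is sketched below. It assumes every metric has already been normalised to the 0 to 1 range and that `execution_time` is the only lower-is-better metric; the function names (`oriented`, `regresses`, `fitness`, `select_survivors`) and the sample scores are invented for illustration and do not come from Claude Code.

```python
# Weights copied from the illustrative configuration above.
WEIGHTS = {"coverage_increase": 0.4, "test_quality_score": 0.4, "execution_time": 0.2}
# Assumption: a smaller execution time is better; the other metrics are higher-is-better.
LOWER_IS_BETTER = {"execution_time"}


def oriented(metrics):
    """Flip lower-is-better metrics so that larger always means better."""
    return {k: (-metrics[k] if k in LOWER_IS_BETTER else metrics[k]) for k in WEIGHTS}


def regresses(candidate, baseline):
    """True if the candidate is worse than the baseline on any single metric."""
    c, b = oriented(candidate), oriented(baseline)
    return any(c[k] < b[k] for k in WEIGHTS)


def fitness(metrics):
    """Weighted sum of the oriented metrics."""
    o = oriented(metrics)
    return sum(WEIGHTS[k] * o[k] for k in WEIGHTS)


def select_survivors(baseline, variants):
    """Drop regressing variants, then rank the rest by fitness, best first."""
    kept = [name for name, m in variants.items() if not regresses(m, baseline)]
    return sorted(kept, key=lambda name: fitness(variants[name]), reverse=True)


baseline = {"coverage_increase": 0.10, "test_quality_score": 0.70, "execution_time": 0.50}
variants = {
    "a": {"coverage_increase": 0.15, "test_quality_score": 0.75, "execution_time": 0.40},
    "b": {"coverage_increase": 0.30, "test_quality_score": 0.60, "execution_time": 0.30},
    "c": {"coverage_increase": 0.12, "test_quality_score": 0.70, "execution_time": 0.50},
}
print(select_survivors(baseline, variants))  # ['a', 'c']
```

Variant "b" has the largest coverage gain but is dropped because its quality score falls below the baseline, which is the point of checking each metric separately before ranking by the weighted sum.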

Pattern 2: Skill Signature Evolution (Hermes)

Hermes handles self-improvement through its skills system, working within the agentskills.io open standard (https://www.agensi.io/learn/agent-skills-open-standard), the same SKILL.md format used by Claude Code, Codex CLI, OpenClaw and others. As it solves problems, Hermes pauses roughly every 15 tool calls to reflect on what worked and what failed, then writes or rewrites a reusable skill document, while a curator periodically prunes the library. This article calls the result a "skill signature": a compact record of the problem, the approach, and the outcome, with successful records kept and failed ones analysed for patterns. That wording is this article's own. The official docs describe SKILL.md generation and a roughly 15-tool-call reflection cadence rather than a named signature object, and the Python API shown below is illustrative rather than a confirmed surface.

# Hermes skill evolution (illustrative only: `hermes.skills` is not a confirmed API)
signature = hermes.skills.create(
    problem_type="database_migration_with_rollback",
    approach=["create_new_table", "dual_write", "backfill", "switch_read", "drop_old"],
    tools_used=["sql_runner", "schema_diff", "data_validator"],
    outcome="success",
    duration_minutes=45
)

# Evolve: combine with related successful signatures
hermes.skills.evolve(
    base_signature=signature,
    combine_with=hermes.skills.search("migration"),
    objective="reduce_duration"
)

The learning loop keeps refining these skills over time. A skill that first took 45 minutes to run might drop to 30 through better tool selection, then to 20 through parallelisation, with the gains stacking across sessions. Those duration figures (45, then 30, then 20 minutes) are example numbers to show the shape of the improvement, not measured benchmarks.
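
The reflection cadence itself is simple to picture in code. The sketch below is a stand-in, not Hermes's implementation: `SkillLibrary`, `record_tool_call` and `reflect` are hypothetical names, and the one-line "skill" it writes replaces what would in practice be a model-written SKILL.md document. Only the interval of 15 tool calls comes from the text above.

```python
from dataclasses import dataclass, field

REFLECT_EVERY = 15  # cadence taken from the text; everything else is hypothetical


@dataclass
class SkillLibrary:
    skills: dict = field(default_factory=dict)   # skill name -> skill text
    log: list = field(default_factory=list)      # (tool name, succeeded) pairs
    calls_since_reflection: int = 0

    def record_tool_call(self, tool, ok):
        """Log one tool call; return True when a reflection pass ran."""
        self.log.append((tool, ok))
        self.calls_since_reflection += 1
        if self.calls_since_reflection < REFLECT_EVERY:
            return False
        self.reflect()
        return True

    def reflect(self):
        """Summarise what worked and what failed, then reset the counter."""
        worked = sorted({tool for tool, ok in self.log if ok})
        failed = sorted({tool for tool, ok in self.log if not ok})
        self.skills["latest"] = f"use: {', '.join(worked)}; avoid: {', '.join(failed)}"
        self.log.clear()
        self.calls_since_reflection = 0


library = SkillLibrary()
for _ in range(14):
    library.record_tool_call("sql_runner", True)   # no reflection yet
ran = library.record_tool_call("schema_diff", False)  # 15th call triggers reflection
print(ran, library.skills["latest"])
```

The curator step described above is left out of this sketch; it would be a separate pass that removes or merges entries in `skills`.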

Pattern 3: Sub-Agent Configuration Search (OpenClaw)

OpenClaw does not ship automatic self-improvement, but you can build it from parts it already has: sub-agents plus a cron-scheduled task system. A meta-agent runs on a schedule, reviews how the sub-agents performed, and adjusts their configurations.

{
  "subAgents": [
    {
      "name": "meta-optimiser",
      "schedule": "0 2 * * *",
      "skill": "subagent-evaluator",
      "workflow": [
        "read performance logs from past 24h",
        "identify sub-agents with >10% failure rate",
        "analyse failure patterns for each",
        "generate config variants with adjusted prompts/tools/models",
        "A/B test variants on synthetic tasks",
        "deploy winning variant if improvement >5%"
      ]
    }
  ]
}

This takes more hand-assembly than Claude Code or Hermes, but it runs. The point worth keeping is that self-improvement does not need native platform support. It needs structured evaluation and a configuration space you can search. The failure-rate threshold (>10%) and improvement threshold (>5%) above are example values, not numbers from any source.
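
The two thresholds in that workflow reduce to a pair of small decisions, sketched here in Python. This is not OpenClaw code: the function names and log format are assumptions, and "improvement >5%" is read here as more than five percentage points of absolute success rate, which is one of two plausible readings (the other being a relative gain).

```python
from typing import Dict, List, Optional

FAILURE_RATE_THRESHOLD = 0.10  # example value from the workflow above
MIN_IMPROVEMENT = 0.05         # assumed to mean absolute percentage points


def failure_rate(outcomes: List[bool]) -> float:
    """Share of runs that failed; an empty log counts as no failures."""
    return outcomes.count(False) / len(outcomes) if outcomes else 0.0


def needs_tuning(logs: Dict[str, List[bool]]) -> List[str]:
    """Names of sub-agents whose failure rate exceeds the threshold."""
    return sorted(name for name, outcomes in logs.items()
                  if failure_rate(outcomes) > FAILURE_RATE_THRESHOLD)


def pick_winner(current_success: float, variants: Dict[str, float]) -> Optional[str]:
    """Best variant if it beats the current config by more than MIN_IMPROVEMENT."""
    if not variants:
        return None
    name, rate = max(variants.items(), key=lambda item: item[1])
    return name if rate - current_success > MIN_IMPROVEMENT else None


logs = {
    "linter": [True] * 19 + [False],        # 5% failures: left alone
    "migrator": [True] * 8 + [False] * 2,   # 20% failures: flagged
}
print(needs_tuning(logs))                               # ['migrator']
print(pick_winner(0.80, {"v1": 0.84, "v2": 0.90}))      # v2
print(pick_winner(0.80, {"v1": 0.84}))                  # None
```

Returning `None` when no variant clears the bar makes "keep the current configuration" the default outcome of the nightly run.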

Pattern 4: Recursive Self-Building

The most advanced pattern is recursive: a meta-agent that improves not only the task-specific sub-agents but also its own evaluation and generation strategies. This one remains theoretical for production use. Experiments with Claude Code's Opus 4.8 (a real model, released 28 May 2026) are reported to show promising results, but those reports are unconfirmed and not backed by a public source.

In recursive self-building, the meta-agent holds a model of its own reasoning. When it notices its evaluation criteria are poorly calibrated (say, optimising for test coverage while missing bug detection), it updates them. When it notices its generation strategies are too cautious (searching too small a configuration space), it widens them.

The obvious danger is runaway optimisation. Without solid guardrails, an agent that improves itself recursively could chase metrics that are easy to measure while quietly losing the quality those metrics were meant to stand in for. Human oversight stays essential.

Guardrails for Self-Building Agents

Any self-building agent system needs these guardrails:

Regression tests: New configurations must not break tasks that previously passed
Diversity requirements: The configuration space has to stay diverse so it does not converge too early
Human review gate: Deployments touching production require human approval
Kill switches: The ability to revert to a known-good configuration on the spot
Metric sanity checks: Confirm that the optimised metrics still correlate with human judgement

The Current State

Self-building agents are not autonomous yet. They are assistive. They search the configuration space faster than a person can, surface patterns a person might miss, and suggest improvements a person then reviews and approves. The line between "suggests improvements" and "deploys them without asking" is where human judgement currently sits. That line is moving.

What to do next

Pick the smallest useful workflow that proves the pattern.
Write down the owner, data boundary, review point, and success measure.
Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
