Briefing
Here is the strange new shape of automated software work: a tool that does not just write your code, but writes the assistant that writes your code, then quietly fires that assistant and hires a better one.
It sounds like a stunt. It is closer to plumbing. Three real products now ship the pieces you would need to build it: Claude Code's dynamic workflows, which Anthropic released on 28 May 2026 to orchestrate sub-agents at scale; Hermes, an agent from Nous Research that keeps notes on its own work and gets better as it runs; and OpenClaw, whose scheduled-task system can be wired up to grade and tune other agents overnight.
For an Australian business team, the practical question is not whether your software can become sentient. It cannot. The question is whether an agent can measure its own output, spot where it falls short, and propose a better version of itself while you sleep, with a human signing off before anything touches production. Some of that is here. Some of it is people stitching together features that were not designed for the job. And one version of it is still a research idea with a hopeful press release attached.
Below is what is actually running, what is a sensible pattern you could build, and where the marketing gets ahead of the facts.
The Self-Improvement Loop
A self-building sub-agent runs a meta-loop that sits one level above ordinary task execution:
- Execute: Perform the assigned task
- Evaluate: Measure output quality against defined criteria
- Diagnose: Identify specific weaknesses in the approach
- Generate: Create an improved sub-agent configuration that addresses those weaknesses
- Validate: Test the new configuration on held-out examples
- Deploy: Replace the current configuration if validation passes
The loop needs three things to work: evaluation metrics you can compute automatically, a configuration space you can search, and a validation step that stops the agent from shipping a regression.
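The six steps above can be sketched as a single function. This is a minimal illustration, not any platform's API: `Config`, `improvement_step`, the `pass_mark` default and the toy helper functions are all hypothetical names invented for the example, and the evaluator is assumed to return a score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical sub-agent configuration; here it is only a prompt."""
    prompt: str


def improvement_step(
    current: Config,
    run_task: Callable[[Config, str], str],
    score: Callable[[str], float],
    propose: Callable[[Config, Sequence[str]], Config],
    train_tasks: Sequence[str],
    held_out_tasks: Sequence[str],
    pass_mark: float = 0.8,  # assumed threshold, not from any source
) -> Config:
    """Run one pass of the loop and return the configuration to keep."""
    if not held_out_tasks:
        raise ValueError("validation needs at least one held-out task")

    # Execute + Evaluate: collect the tasks the current config handles badly.
    weak = [t for t in train_tasks if score(run_task(current, t)) < pass_mark]
    if not weak:
        return current

    # Diagnose + Generate: the caller-supplied proposer sees the weak tasks.
    candidate = propose(current, weak)

    def mean_score(cfg: Config) -> float:
        total = sum(score(run_task(cfg, t)) for t in held_out_tasks)
        return total / len(held_out_tasks)

    # Validate + Deploy: keep the candidate only if it beats the incumbent.
    return candidate if mean_score(candidate) > mean_score(current) else current


# Toy stand-ins so the sketch runs without a model behind it.
def toy_run(cfg: Config, task: str) -> str:
    return f"{cfg.prompt}:{task}"


def toy_score(output: str) -> float:
    return 1.0 if output.startswith("careful") else 0.5


def toy_propose(cfg: Config, weak: Sequence[str]) -> Config:
    return Config(prompt="careful " + cfg.prompt)


improved = improvement_step(
    Config("write tests"), toy_run, toy_score, toy_propose, ["a", "b"], ["c"]
)
print(improved.prompt)
```

The held-out tasks are kept separate from the tasks used for diagnosis on purpose: a candidate that only looks better on the examples it was tuned against never gets through the validation step.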
Pattern 1: Prompt Evolution (Claude Code)
Claude Code's dynamic workflows are real, and they fan work out across parallel sub-agents that a parent agent plans and coordinates. On top of that primitive, you can build what amounts to prompt evolution: the coordinator keeps a population of prompt variants for each specialist role, checks which variant produced the best output after each task, and breeds new variants by combining the patterns that worked. Worth being clear here: this genetic "prompt evolution" mechanism is not a documented Anthropic feature. It is a pattern layered on the real dynamic-workflows capability rather than something the platform ships by name.
# Sub-agent prompt evolution configuration
evolution:
  population_size: 10
  specialist: test_generator
  evaluation:
    - metric: coverage_increase
      weight: 0.4
    - metric: test_quality_score
      weight: 0.4
    - metric: execution_time
      weight: 0.2
  mutation:
    strategies:
      - add_context_section
      - strengthen_constraints
      - add_examples
      - reorder_instructions

The metrics carry the whole thing. "Coverage increase" is objective and measurable. "Test quality score" needs a model-based evaluator, which adds some subjectivity but tends to track human judgement well. The system throws out any variant that regresses on a metric. Note that the weights above (0.4, 0.4, 0.2) are illustrative numbers, not figures pulled from a benchmark.
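To make the scoring concrete, here is a small sketch of how a coordinator might rank variants with those weights and discard regressions. It assumes every metric has been normalised to a 0 to 1 scale where higher is better (so `execution_time` is already inverted), and the function names `fitness`, `regresses` and `select` are hypothetical, not part of Claude Code.

```python
from typing import Dict, Optional

# Illustrative weights copied from the configuration above.
WEIGHTS = {"coverage_increase": 0.4, "test_quality_score": 0.4, "execution_time": 0.2}

Metrics = Dict[str, float]


def fitness(metrics: Metrics) -> float:
    """Weighted sum of normalised metrics (assumed 0..1, higher is better)."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def regresses(candidate: Metrics, baseline: Metrics) -> bool:
    """True if the candidate is worse than the baseline on any single metric."""
    return any(candidate[name] < baseline[name] for name in WEIGHTS)


def select(baseline: Metrics, variants: Dict[str, Metrics]) -> Optional[str]:
    """Name of the best non-regressing variant that beats the baseline, else None."""
    survivors = {name: m for name, m in variants.items() if not regresses(m, baseline)}
    if not survivors:
        return None
    best = max(survivors, key=lambda name: fitness(survivors[name]))
    return best if fitness(survivors[best]) > fitness(baseline) else None


baseline = {"coverage_increase": 0.5, "test_quality_score": 0.5, "execution_time": 0.5}
variants = {
    # Scores well overall but is slower than the baseline, so it is discarded.
    "more_examples": {"coverage_increase": 0.9, "test_quality_score": 0.9, "execution_time": 0.4},
    # Modest gain with no regression, so it survives.
    "tighter_constraints": {"coverage_increase": 0.6, "test_quality_score": 0.5, "execution_time": 0.5},
}
print(select(baseline, variants))
```

Rejecting on any single-metric regression is stricter than ranking by the weighted sum alone. It stops a large coverage gain from hiding a slowdown, at the cost of discarding some variants that a person might have accepted.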
Pattern 2: Skill Signature Evolution (Hermes)
Hermes handles self-improvement through its skills system, working within the agentskills.io open standard, the same SKILL.md format used by Claude Code, Codex CLI, OpenClaw and others. As it solves problems, Hermes pauses roughly every 15 tool calls to reflect on what worked and what failed, then writes or rewrites a reusable skill document, while a curator periodically prunes the library. One way to picture this is as generating a "skill signature": a compact record of the problem, the approach, and the outcome, with successful records kept and failed ones analysed for patterns. That "skill signature" wording is this briefing's own shorthand; the official docs describe SKILL.md generation and a roughly 15-tool-call reflection cadence rather than a named signature object, and the Python API shown below is illustrative rather than a confirmed surface.
# Hermes skill evolution
signature = hermes.skills.create(
    problem_type="database_migration_with_rollback",
    approach=["create_new_table", "dual_write", "backfill", "switch_read", "drop_old"],
    tools_used=["sql_runner", "schema_diff", "data_validator"],
    outcome="success",
    duration_minutes=45
)

# Evolve: combine with related successful signatures
hermes.skills.evolve(
    base_signature=signature,
    combine_with=hermes.skills.search("migration"),
    objective="reduce_duration"
)

The learning loop keeps refining these skills over time. A skill that first took 45 minutes to run might drop to 30 through better tool selection, then to 20 through parallelisation, with the gains stacking across sessions. Those duration figures (45, then 30, then 20 minutes) are example numbers to show the shape of the improvement, not measured benchmarks.
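The record-keeping idea can be shown without any Hermes dependency. The sketch below is a plain-Python stand-in, not the Hermes API: `SkillSignature`, `best_known` and `failure_patterns` are hypothetical names, and the three records are made-up data that reuse the example durations above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SkillSignature:
    """Hypothetical compact record of one attempt: problem, approach, outcome."""
    problem_type: str
    approach: Tuple[str, ...]
    outcome: str
    duration_minutes: float


def best_known(signatures: Iterable[SkillSignature], problem_type: str) -> Optional[SkillSignature]:
    """Fastest successful record for a problem type, or None if there is none."""
    successes = [
        s for s in signatures
        if s.problem_type == problem_type and s.outcome == "success"
    ]
    return min(successes, key=lambda s: s.duration_minutes, default=None)


def failure_patterns(signatures: Iterable[SkillSignature], problem_type: str) -> Counter:
    """Count how often each step appears in failed attempts at a problem type."""
    counts: Counter = Counter()
    for s in signatures:
        if s.problem_type == problem_type and s.outcome != "success":
            counts.update(s.approach)
    return counts


FULL = ("create_new_table", "dual_write", "backfill", "switch_read", "drop_old")
library = [
    SkillSignature("database_migration_with_rollback", FULL, "success", 45),
    # Made-up failed attempt that skipped the dual-write step.
    SkillSignature("database_migration_with_rollback",
                   ("create_new_table", "backfill", "switch_read", "drop_old"), "failure", 20),
    SkillSignature("database_migration_with_rollback", FULL, "success", 30),
]

best = best_known(library, "database_migration_with_rollback")
print(best.duration_minutes if best else None)
print(failure_patterns(library, "database_migration_with_rollback"))
```

Counting steps in failed runs is a crude form of pattern analysis. Here it would show that the failed attempt never included `dual_write`, which is the kind of hint a reflection pass could turn into a revised skill document.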
Pattern 3: Sub-Agent Configuration Search (OpenClaw)
OpenClaw does not ship automatic self-improvement, but you can build it from parts it already has: sub-agents plus a cron-scheduled task system. A meta-agent runs on a schedule, reviews how the sub-agents performed, and adjusts their configurations.
{
  "subAgents": [
    {
      "name": "meta-optimiser",
      "schedule": "0 2 * * *",
      "skill": "subagent-evaluator",
      "workflow": [
        "read performance logs from past 24h",
        "identify sub-agents with >10% failure rate",
        "analyse failure patterns for each",
        "generate config variants with adjusted prompts/tools/models",
        "A/B test variants on synthetic tasks",
        "deploy winning variant if improvement >5%"
      ]
    }
  ]
}

This takes more hand-assembly than Claude Code or Hermes, but it runs. The point worth keeping is that self-improvement does not need native platform support. It needs structured evaluation and a configuration space you can search. The failure-rate threshold (>10%) and improvement threshold (>5%) above are example values, not numbers from any source.
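The two thresholds in that workflow reduce to a few lines of arithmetic. The sketch below is not OpenClaw code: the function names and the log shape (a list of pass/fail booleans per sub-agent) are assumptions, and it reads "improvement >5%" as a gain of more than five percentage points in success rate, which is one of several possible interpretations.

```python
from typing import Dict, List


def failure_rate(outcomes: List[bool]) -> float:
    """Share of failed runs; an agent with no runs counts as 0.0."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def needs_tuning(logs: Dict[str, List[bool]], threshold: float = 0.10) -> List[str]:
    """Names of sub-agents whose failure rate is strictly above the threshold."""
    return sorted(name for name, runs in logs.items() if failure_rate(runs) > threshold)


def should_deploy(baseline_success: float, variant_success: float, min_gain: float = 0.05) -> bool:
    """True if the variant beats the baseline by more than min_gain (absolute)."""
    return variant_success - baseline_success > min_gain


# Hypothetical 24-hour log: True is a successful run, False a failure.
logs = {
    "summariser": [True] * 9 + [False],      # exactly 10% failures, not flagged
    "invoice-parser": [True] * 8 + [False] * 2,  # 20% failures, flagged
    "idle-agent": [],
}
print(needs_tuning(logs))
print(should_deploy(0.80, 0.90), should_deploy(0.80, 0.82))
```

Whatever interpretation you pick, write it down next to the threshold. A five percent relative gain on a 20% success rate is one percentage point, which is easy to get from noise on a small synthetic test set.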
Pattern 4: Recursive Self-Building
The most advanced pattern is recursive: a meta-agent that improves not only the task-specific sub-agents but also its own evaluation and generation strategies. This one is, by the author's own admission, theoretical for production use. Experiments with Claude Code's Opus 4.8 (a real model, released 28 May 2026) are reported to show promising results, though that result is unconfirmed and not backed by a public source.
In recursive self-building, the meta-agent holds a model of its own reasoning. When it notices its evaluation criteria are poorly calibrated (say, optimising for test coverage while missing bug detection), it updates them. When it notices its generation strategies are too cautious (searching too small a configuration space), it widens them.
The obvious danger is runaway optimisation. Without solid guardrails, an agent that improves itself recursively could chase metrics that are easy to measure while quietly losing the quality those metrics were meant to stand in for. Human oversight stays essential.
Guardrails for Self-Building Agents
Any self-building agent system needs these guardrails:
- Regression tests: New configurations must not break tasks that previously passed
- Diversity requirements: The configuration space has to stay diverse so it does not converge too early
- Human review gate: Deployments touching production require human approval
- Kill switches: The ability to revert to a known-good configuration on the spot
- Metric sanity checks: Confirm that the optimised metrics still correlate with human judgement
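Three of those guardrails (regression tests, the human review gate and the kill switch) fit in one small class. This is a sketch under stated assumptions rather than a feature of any of the three products: `Deployer` and its methods are hypothetical, configurations are represented as plain strings, and the regression and approval results are passed in as booleans from wherever you actually compute them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Deployer:
    """Hypothetical deployment gate that remembers known-good configurations."""
    active: str
    history: List[str] = field(default_factory=list)

    def deploy(self, candidate: str, regression_passed: bool, human_approved: bool) -> bool:
        """Switch to the candidate only if both gates are satisfied."""
        if not (regression_passed and human_approved):
            return False
        self.history.append(self.active)  # keep the old config for rollback
        self.active = candidate
        return True

    def rollback(self) -> bool:
        """Kill switch: restore the most recent known-good configuration."""
        if not self.history:
            return False
        self.active = self.history.pop()
        return True


deployer = Deployer(active="config-v1")
blocked = deployer.deploy("config-v2", regression_passed=True, human_approved=False)
shipped = deployer.deploy("config-v2", regression_passed=True, human_approved=True)
reverted = deployer.rollback()
print(blocked, shipped, reverted, deployer.active)
```

The gate takes the approval as an explicit argument so that nothing in the optimisation loop can set it implicitly. The diversity and metric sanity checks are left out here because they depend on how your configurations and metrics are represented.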
The Current State
Self-building agents are not autonomous yet. They are assistive. They search the configuration space faster than a person can, surface patterns a person might miss, and suggest improvements a person then reviews and approves. The line between "suggests improvements" and "deploys them without asking" is where human judgement currently sits. That line is moving.




