
Loop Engineering: Building Self-Improving Agent Systems.

Self-improving agents are not magic. They are engineered feedback loops. Learn how to design observation, evaluation, extraction, integration, and decay for continuous agent improvement.

Daniel Fleuren · 2026-05-26 · 13 min read · Australian business teams · Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT



Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.



Briefing

Most AI agents you can buy today are frozen. They do the job they were shipped with, make the same mistake on Tuesday that they made on Monday, and wait for a human to patch them. A small group of systems are trying to break that pattern by getting better on their own, and the way they do it is becoming a real engineering practice rather than a research curiosity.

The practice has a name: loop engineering. The idea is plain enough. You build feedback loops that let an agent watch its own work, judge how it went, and carry the useful lessons forward into the next task. Get the loops right and the agent quietly improves with use. Get them wrong and it either learns nothing or, worse, learns the wrong habits and gets confidently bad.

The clearest working example doing the rounds is Hermes, the self-improving agent from Nous Research, which is documented as running a closed learning loop with agent-curated memory and the ability to write and refine its own skills (Hermes Agent Documentation, NousResearch/hermes-agent). Whether it is the most mature example is a matter of opinion. The mechanics underneath it are not, and they apply to any agent system you might run in your own business.

So here is the practical version of the discipline, with the parts that matter for anyone deciding whether a self-improving agent is worth the trouble.

In short, loop engineering is the design of feedback loops that continuously improve agent performance. Hermes' learning loop is the best-documented example we have, but the principles apply to any agent system.

The Anatomy of a Learning Loop

Every effective learning loop has five components:

Observation: What the agent sees and records about its environment and actions
Evaluation: How the agent judges whether its actions were successful
Extraction: What patterns the agent identifies from successful and failed actions
Integration: How extracted patterns become part of the agent's future behaviour
Decay: How old patterns are phased out when they become irrelevant

Drop any one of these and the loop fails in a way you can predict. No observation, and the agent learns nothing. No evaluation, and it learns the wrong things. No extraction, and it cannot generalise. No integration, and it forgets what it learned by the next task. No decay, and it piles up stale knowledge until that knowledge starts working against it.
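To make the five stages concrete, here is a minimal sketch of how they might fit together in one class. Everything in it is hypothetical and chosen for illustration: the `LearningLoop` name, the method names, the pass/fail scoring, and the age-based decay rule are not taken from Hermes or any other system.

```python
from collections import defaultdict


class LearningLoop:
    """Toy five-stage loop; every name and rule here is illustrative."""

    def __init__(self, max_age=3):
        self.log = []           # observation: raw records for the current cycle
        self.patterns = {}      # integrated knowledge: tool -> {"score", "age"}
        self.max_age = max_age  # decay: cycles a pattern survives unrefreshed

    def observe(self, tool, succeeded):
        """Observation: record what was done and how it ended."""
        self.log.append({"tool": tool, "succeeded": succeeded})

    def evaluate(self, record):
        """Evaluation: score one record (here, a bare pass/fail)."""
        return 1.0 if record["succeeded"] else 0.0

    def extract(self):
        """Extraction: reduce raw records to a success rate per tool."""
        scores = defaultdict(list)
        for record in self.log:
            scores[record["tool"]].append(self.evaluate(record))
        return {tool: sum(v) / len(v) for tool, v in scores.items()}

    def integrate(self, extracted):
        """Integration: store patterns where future decisions can read them."""
        for tool, score in extracted.items():
            self.patterns[tool] = {"score": score, "age": 0}

    def decay(self):
        """Decay: age every pattern and drop the ones that went stale."""
        for tool in list(self.patterns):
            self.patterns[tool]["age"] += 1
            if self.patterns[tool]["age"] > self.max_age:
                del self.patterns[tool]

    def run_cycle(self):
        self.decay()
        self.integrate(self.extract())
        self.log.clear()
```

Removing any single method shows the matching failure mode: without `decay`, for instance, `patterns` only ever grows.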

Loop Engineering in Practice

Observation Design

What an agent observes sets the ceiling on what it can learn. Hermes documents an FTS5 session search with LLM summarisation for cross-session recall, storing CLI and messaging sessions so they can be searched later (Hermes Agent persistent memory docs). The exact fields shown below (working directory, git state, environment variables, dependency versions) are an illustration of the kind of context a learning-focused observation layer needs to capture, not a published schema:

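```python
# Illustrative observation record. Field names and values are examples,
# not a published Hermes schema.
observation = {
    "timestamp": "2026-06-15T10:30:00Z",
    "task": "refactor_auth_middleware",
    "working_directory": "/workspace/api",  # hypothetical path
    "environment": {"NODE_ENV": "test"},    # hypothetical; record non-secret variables only
    "tools_used": ["file_read", "file_write", "test_run"],
    "files_modified": ["src/auth.ts", "tests/auth.test.ts"],
    "git_state": {"branch": "feature/auth-refactor", "commit": "a1b2c3d"},
    "dependencies": {"fastify": "4.28.0", "typescript": "5.7.0"},
    "outcome": "success",
    "duration_seconds": 245,
    "user_corrections": 0,
}
```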

Evaluation Functions

The evaluation function is the design decision that decides everything else. Get it wrong and you have taught the agent to chase the wrong target. Here are the common approaches and where each one bites:

Approach	Pros	Cons
Test pass/fail	Objective, automatic	Optimises for passing tests, not good code
Human rating	High quality	Expensive, slow, inconsistent
Model-based evaluation	Automatic, nuanced	May inherit model biases
Metric-based (coverage, complexity)	Objective	Gameable, narrow
Hybrid	Balanced	Complex to implement

Hermes uses a hybrid approach: automatic metrics for objective quality, model-based evaluation for the judgement calls, and human feedback (when it is given) as the ground truth that overrides both.
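A hybrid evaluator of this shape can be sketched in a few lines. The function name `hybrid_score`, the 0-to-1 score scale, and the default 50/50 weighting are assumptions made for this example. The only rule taken from the description above is that human feedback, when present, overrides the other two signals.

```python
from typing import Optional


def hybrid_score(metric_score: float,
                 model_score: float,
                 human_score: Optional[float] = None,
                 metric_weight: float = 0.5) -> float:
    """Blend automatic and model-based scores; human feedback overrides both.

    All scores are assumed to lie in [0, 1]. `metric_weight` is a
    hypothetical tuning knob, not a documented setting.
    """
    for name, value in (("metric_score", metric_score),
                        ("model_score", model_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if human_score is not None:
        return human_score  # ground truth wins outright
    return metric_weight * metric_score + (1.0 - metric_weight) * model_score
```

Returning the human score unblended is a deliberate choice here: averaging it with the automatic signals would let a gameable metric dilute the one signal you trust most.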

Extraction Strategies

Pattern extraction is where raw observations turn into knowledge the agent can reuse. Hermes leans on three strategies:

Signature extraction: Compact representations of successful tool sequences. "For database migrations, use create_new_table -> dual_write -> backfill -> switch_read."
Anti-pattern extraction: Patterns that correlate with failures. As an illustration, a rule of the form "avoid using eval() in skills" can be derived from audit data. (A real Koi Security audit of community skills did find hundreds of malicious entries, though the specific "100% of eval() usages were malicious" figure is unconfirmed and should not be read as a sourced statistic, per MarkTechPost's coverage.)
Preference extraction: User-specific preferences via Honcho dialectic user modelling (Hermes Honcho memory docs). A preference such as "user prefers functional patterns with confidence 0.91" is the kind of output this produces, shown here as an example rather than a real recorded value.
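As a rough illustration of the first two strategies, the sketch below groups observations by tool sequence and splits them by outcome. It reuses the `tools_used` and `outcome` field names from the observation record earlier in this article. The function name, the `min_support` threshold, and the rule that a sequence must have a consistent outcome are assumptions, not a description of how Hermes does it.

```python
from collections import Counter


def extract_patterns(observations, min_support=2):
    """Split tool sequences into signatures and anti-patterns by outcome.

    A sequence needs at least `min_support` occurrences to count, and is
    classified only if every occurrence ended the same way.
    """
    wins, losses = Counter(), Counter()
    for obs in observations:
        sequence = tuple(obs["tools_used"])
        if obs["outcome"] == "success":
            wins[sequence] += 1
        else:
            losses[sequence] += 1

    signatures = [s for s, n in wins.items()
                  if n >= min_support and s not in losses]
    anti_patterns = [s for s, n in losses.items()
                     if n >= min_support and s not in wins]
    return signatures, anti_patterns
```

The support threshold matters more than it looks: with `min_support=1`, a single lucky run becomes a rule, which is the "learns the wrong habits" failure in miniature.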

Integration Mechanisms

Extracted patterns are useless until they shape what the agent actually does next. Integration mechanisms include:

Prompt augmentation: Adding successful patterns to system prompts
Tool preference ranking: Biasing tool selection toward historically successful tools
Default parameter setting: Using parameters that worked well in similar past tasks
Constraint generation: Creating new constraints from identified anti-patterns
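Two of these mechanisms, prompt augmentation and tool preference ranking, are simple enough to sketch directly. The function names, the `limit` cap, and the neutral default score for unseen tools are hypothetical choices for this example.

```python
def augment_prompt(base_prompt, patterns, limit=3):
    """Append up to `limit` learned patterns to a system prompt."""
    if not patterns:
        return base_prompt
    lines = [f"- {pattern}" for pattern in patterns[:limit]]
    return base_prompt + "\n\nLearned patterns:\n" + "\n".join(lines)


def rank_tools(candidates, success_rates, default=0.5):
    """Order candidate tools by historical success rate, best first.

    Tools with no history get `default`, so they are neither favoured
    nor buried.
    """
    return sorted(candidates,
                  key=lambda tool: success_rates.get(tool, default),
                  reverse=True)
```

The `limit` cap is there because every pattern added to the prompt costs context on every future call, so integration needs a budget as much as extraction needs a threshold.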

Decay Schedules

Without decay, an agent's knowledge only ever grows, and past a point that becomes a liability. One reported design uses power-law decay, where recent observations carry full weight, observations from a week ago carry half, and observations from a month ago carry a quarter, with low-activation patterns archived to cold storage after 30 days. Treat those specific numbers as an unconfirmed example: Hermes documents bounded, curated, cache-aware memory but does not publish this exact schedule. The principle holds even if your weights differ.
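One way to reproduce those example weights is a shifted power law, w(t) = (1 + t/τ)^(-α), with τ and α solved so that w(7) = 0.5 and w(30) = 0.25. That gives τ = 49/16 days and α = ln 2 / ln(23/7), roughly 0.58. The functional form, the constant names, and the activation threshold below are assumptions made for this sketch. The source gives only the three weights and the 30-day archive point.

```python
import math

# Shifted power law w(t) = (1 + t / TAU_DAYS) ** -ALPHA.
# TAU_DAYS and ALPHA are solved from w(7) = 0.5 and w(30) = 0.25;
# the functional form itself is an assumption.
TAU_DAYS = 49 / 16
ALPHA = math.log(2) / math.log(23 / 7)
ARCHIVE_AFTER_DAYS = 30


def decay_weight(age_days: float) -> float:
    """Weight of an observation that is `age_days` old (1.0 when fresh)."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return (1 + age_days / TAU_DAYS) ** -ALPHA


def should_archive(age_days: float, activation: float,
                   min_activation: float = 0.1) -> bool:
    """Archive patterns that are both old and rarely used.

    `min_activation` is a hypothetical threshold, not a documented value.
    """
    return age_days > ARCHIVE_AFTER_DAYS and activation < min_activation
```

The long tail is the reason to prefer a power law here: an exponential curve that halves in a week would leave roughly 5% of the weight after a month, not 25%, so month-old lessons would all but vanish.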

Loop Types by Time Horizon

Horizon	Name	Trigger	Example
Seconds	Inline	Tool result	Adjust next tool choice based on output
Minutes	Session	Task completion	Update preferences based on user corrections
Hours	Daily	Scheduled job	Compress and integrate day's observations
Days	Weekly	Weekly trigger	Archive obsolete patterns, generate summaries
Weeks	Epoch	Manual or triggered	Major knowledge reorganisation
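The table above can be held as plain configuration, which makes it easy to see which loops a given event should wake. This is a sketch only: `LoopSpec`, the trigger identifiers, and `loops_for` are hypothetical names, and a real system would attach handlers to each entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopSpec:
    name: str
    horizon: str
    trigger: str
    action: str


# One entry per row of the table; trigger identifiers are illustrative.
LOOPS = [
    LoopSpec("inline", "seconds", "tool_result",
             "adjust next tool choice based on output"),
    LoopSpec("session", "minutes", "task_completion",
             "update preferences based on user corrections"),
    LoopSpec("daily", "hours", "scheduled_job",
             "compress and integrate the day's observations"),
    LoopSpec("weekly", "days", "weekly_trigger",
             "archive obsolete patterns, generate summaries"),
    LoopSpec("epoch", "weeks", "manual_or_triggered",
             "major knowledge reorganisation"),
]


def loops_for(trigger):
    """Return the loop specs that should run when `trigger` fires."""
    return [spec for spec in LOOPS if spec.trigger == trigger]
```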

Measuring Loop Quality

If you run one of these systems, track these metrics:

Learning velocity: How much useful knowledge is extracted per unit time
Retention accuracy: Percentage of extracted knowledge that remains relevant after 30 days
User correction trend: Corrections per session should fall over time
First-attempt success rate: Should climb as the loop learns user preferences
Knowledge freshness: Percentage of active knowledge that is less than 30 days old

As a rough rule of thumb (not a published benchmark), a well-engineered learning loop should lift first-attempt success rate by something like 15-25% over the first month of operation. If your loop shows no measurable improvement at all, the fault usually sits in the evaluation function, the extraction strategy, or the integration mechanism.
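Three of these metrics can be computed from very little data. The sketch below assumes each session is a dict with the `outcome` and `user_corrections` fields from the observation record earlier in this article, and that each active pattern has a creation timestamp. The function names and the half-versus-half trend comparison are illustrative choices, not a standard definition.

```python
from datetime import datetime, timedelta


def first_attempt_success_rate(sessions):
    """Share of sessions that succeeded with zero user corrections."""
    if not sessions:
        return 0.0
    clean = sum(1 for s in sessions
                if s["outcome"] == "success" and s["user_corrections"] == 0)
    return clean / len(sessions)


def _mean_corrections(sessions):
    return sum(s["user_corrections"] for s in sessions) / len(sessions)


def correction_trend(sessions):
    """Mean corrections in the later half minus the earlier half.

    Expects sessions in time order. A negative result means corrections
    are falling, which is the direction you want.
    """
    half = len(sessions) // 2
    if half == 0:
        return 0.0
    return _mean_corrections(sessions[-half:]) - _mean_corrections(sessions[:half])


def knowledge_freshness(created_at, now, window_days=30):
    """Share of active patterns created within the last `window_days`."""
    if not created_at:
        return 0.0
    cutoff = now - timedelta(days=window_days)
    return sum(1 for ts in created_at if ts > cutoff) / len(created_at)
```

Logging these per week from day one gives you the baseline needed to tell whether the loop is improving anything at all.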

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

Pick the smallest useful workflow that proves the pattern.
Write down the owner, data boundary, review point, and success measure.
Review the result after the first real run and decide whether to scale, change, or stop.


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
