Briefing
Most AI agents you can buy today are frozen. They do the job they were shipped with, make the same mistake on Tuesday that they made on Monday, and wait for a human to patch them. A small group of systems are trying to break that pattern by getting better on their own, and the way they do it is becoming a real engineering practice rather than a research curiosity.
The practice has a name: loop engineering. The idea is plain enough. You build feedback loops that let an agent watch its own work, judge how it went, and carry the useful lessons forward into the next task. Get the loops right and the agent quietly improves with use. Get them wrong and it either learns nothing or, worse, learns the wrong habits and gets confidently bad.
The clearest working example doing the rounds is Hermes, the self-improving agent from Nous Research, which is documented as running a closed learning loop with agent-curated memory and the ability to write and refine its own skills (Hermes Agent Documentation, NousResearch/hermes-agent). Whether it is the most mature example is a matter of opinion. The mechanics underneath it are not, and they apply to any agent system you might run in your own business.
So here is the practical version of the discipline, with the parts that matter for anyone deciding whether a self-improving agent is worth the trouble.
Self-improving agent systems are not a theoretical aspiration. They are a practical engineering discipline called loop engineering: the design of feedback loops that continuously improve agent performance. Hermes' learning loop is the most documented example we have, but the principles apply to any agent system.
The Anatomy of a Learning Loop
Every effective learning loop has five components:
- Observation: What the agent sees and records about its environment and actions
- Evaluation: How the agent judges whether its actions were successful
- Extraction: What patterns the agent identifies from successful and failed actions
- Integration: How extracted patterns become part of the agent's future behaviour
- Decay: How old patterns are phased out when they become irrelevant
Drop any one of these and the loop fails in a way you can predict. No observation, and the agent learns nothing. No evaluation, and it learns the wrong things. No extraction, and it cannot generalise. No integration, and it forgets what it learned by the next task. No decay, and it piles up stale knowledge until that knowledge starts working against it.
Loop Engineering in Practice
Observation Design
What an agent observes sets the ceiling on what it can learn. Hermes documents an FTS5 session search with LLM summarisation for cross-session recall, storing CLI and messaging sessions so they can be searched later (Hermes Agent persistent memory docs). The exact fields shown below (working directory, git state, environment variables, dependency versions) are an illustration of the kind of context a learning-focused observation layer needs to capture, not a published schema:
# Hermes observation schema
observation = {
"timestamp": "2026-06-15T10:30:00Z",
"task": "refactor_auth_middleware",
"tools_used": ["file_read", "file_write", "test_run"],
"files_modified": ["src/auth.ts", "tests/auth.test.ts"],
"git_state": { "branch": "feature/auth-refactor", "commit": "a1b2c3d" },
"dependencies": { "fastify": "4.28.0", "typescript": "5.7.0" },
"outcome": "success",
"duration_seconds": 245,
"user_corrections": 0
}Evaluation Functions
The evaluation function is the design decision that decides everything else. Get it wrong and you have taught the agent to chase the wrong target. Here are the common approaches and where each one bites:
| Approach | Pros | Cons |
|---|---|---|
| Test pass/fail | Objective, automatic | Optimises for passing tests, not good code |
| Human rating | High quality | Expensive, slow, inconsistent |
| Model-based evaluation | Automatic, nuanced | May inherit model biases |
| Metric-based (coverage, complexity) | Objective | Gameable, narrow |
| Hybrid | Balanced | Complex to implement |
Hermes uses a hybrid approach: automatic metrics for objective quality, model-based evaluation for the judgement calls, and human feedback (when it is given) as the ground truth that overrides both.
Extraction Strategies
Pattern extraction is where raw observations turn into knowledge the agent can reuse. Hermes leans on three strategies:
- Signature extraction: Compact representations of successful tool sequences. "For database migrations, use create_new_table -> dual_write -> backfill -> switch_read."
- Anti-pattern extraction: Patterns that correlate with failures. As an illustration, a rule of the form "avoid using eval() in skills" can be derived from audit data. (A real Koi Security audit of community skills did find hundreds of malicious entries, though the specific "100% of eval() usages were malicious" figure is unconfirmed and should not be read as a sourced statistic, per MarkTechPost's coverage.)
- Preference extraction: User-specific preferences via Honcho dialectic user modelling (Hermes Honcho memory docs). A preference such as "user prefers functional patterns with confidence 0.91" is the kind of output this produces, shown here as an example rather than a real recorded value.
Integration Mechanisms
Extracted patterns are useless until they shape what the agent actually does next. Integration mechanisms include:
- Prompt augmentation: Adding successful patterns to system prompts
- Tool preference ranking: Biasing tool selection toward historically successful tools
- Default parameter setting: Using parameters that worked well in similar past tasks
- Constraint generation: Creating new constraints from identified anti-patterns
Decay Schedules
Without decay, an agent's knowledge only ever grows, and past a point that becomes a liability. One reported design uses power-law decay, where recent observations carry full weight, observations from a week ago carry half, and observations from a month ago carry a quarter, with low-activation patterns archived to cold storage after 30 days. Treat those specific numbers as an unconfirmed example: Hermes documents bounded, curated, cache-aware memory but does not publish this exact schedule. The principle holds even if your weights differ.
Loop Types by Time Horizon
| Horizon | Name | Trigger | Example |
|---|---|---|---|
| Seconds | Inline | Tool result | Adjust next tool choice based on output |
| Minutes | Session | Task completion | Update preferences based on user corrections |
| Hours | Daily | Scheduled job | Compress and integrate day's observations |
| Days | Weekly | Weekly trigger | Archive obsolete patterns, generate summaries |
| Weeks | Epoch | Manual or triggered | Major knowledge reorganisation |
Measuring Loop Quality
If you run one of these systems, track these metrics:
- Learning velocity: How much useful knowledge is extracted per unit time
- Retention accuracy: Percentage of extracted knowledge that remains relevant after 30 days
- User correction trend: Corrections per session should fall over time
- First-attempt success rate: Should climb as the loop learns user preferences
- Knowledge freshness: Percentage of active knowledge that is less than 30 days old
As a rough rule of thumb (not a published benchmark), a well-engineered learning loop should lift first-attempt success rate by something like 15-25% over the first month of operation. If your loop shows no measurable improvement at all, the fault usually sits in the evaluation function, the extraction strategy, or the integration mechanism.




