Briefing
Most teams running AI agents have no idea whether the thing is actually working. It writes code, it answers tickets, it drafts emails, and the dashboard looks busy. But ask a simple question, "is this agent better or worse than it was last month?", and the room goes quiet.
That gap is the whole problem. An agent that feels productive can be quietly burning money, shipping bugs, and creating cleanup work that nobody is counting. The only way to know the difference is to measure it properly, and most organisations measure the easy stuff (tokens, speed) while ignoring the parts that decide whether the agent earns its keep.
This is a practical guide to closing that gap. It lays out a four-level way to score agent performance, from the nuts-and-bolts technical numbers up to the business outcomes your finance team cares about. None of it requires a data science team. It does require deciding, on purpose, what good looks like before you deploy.
You cannot improve what you do not measure. Agent evaluation is the discipline of measuring agent performance with rigour rather than intuition. Here is the framework that separates reliable agent deployments from hopeful experiments.
The Measurement Pyramid
Agent evaluation works at four levels:
Level 4: Business Impact (revenue, velocity, satisfaction)
Level 3: Task Outcomes (success rate, quality, time)
Level 2: Technical Metrics (token usage, latency, error rate)
Level 1: Component Metrics (tool accuracy, context relevance, prompt adherence)
Most teams measure Level 2 and stop there. The teams getting real value measure all four, and they understand how each level feeds the next.
Level 1: Component Metrics
Component metrics tell you which specific thing is breaking:
| Metric | Definition | Target |
|---|---|---|
| Tool accuracy | % of tool calls with correct parameters | >95% |
| Context relevance | % of retrieved context actually used | >80% |
| Prompt adherence | % of constraints followed in output | >90% |
| Hallucination rate | % of outputs containing fabricated information | <2% |
| Format compliance | % of outputs matching requested format | >95% |
These need automated evaluation, usually model-based. Claude Code's built-in telemetry covers some of them: with OpenTelemetry turned on, it emits token usage, cost, session counts, and tool-execution events you can mine for component data (Anthropic Claude Code docs, Monitoring usage). For retrospective digging, Hermes' FTS5 session database lets you full-text search across past sessions and pull patterns out of old transcripts (Nous Research Hermes Agent, Session Storage).
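As a rough illustration of how the table above turns into numbers, here is a minimal sketch. It assumes you already log one record per tool call and one graded record per output; the `ToolCall` and `OutputCheck` shapes and their field names are hypothetical, not something Claude Code or Hermes emits in this form. Context relevance is left out because it needs retrieval logs this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    params_valid: bool  # hypothetical flag: did the call pass parameter validation?


@dataclass
class OutputCheck:
    constraints_total: int
    constraints_met: int
    format_ok: bool
    hallucinated: bool  # assumed to be set by a human or model-based grader


def pct(part, whole):
    """Percentage that returns 0.0 instead of dividing by zero."""
    return 100.0 * part / whole if whole else 0.0


def component_metrics(tool_calls, checks):
    return {
        "tool_accuracy": pct(sum(c.params_valid for c in tool_calls), len(tool_calls)),
        "prompt_adherence": pct(
            sum(c.constraints_met for c in checks),
            sum(c.constraints_total for c in checks),
        ),
        "hallucination_rate": pct(sum(c.hallucinated for c in checks), len(checks)),
        "format_compliance": pct(sum(c.format_ok for c in checks), len(checks)),
    }


# (direction, threshold) pairs copied from the targets in the table above.
TARGETS = {
    "tool_accuracy": (">", 95),
    "prompt_adherence": (">", 90),
    "hallucination_rate": ("<", 2),
    "format_compliance": (">", 95),
}


def missed_targets(metrics):
    """Return the names of metrics that do not meet their target."""
    missed = []
    for name, (direction, threshold) in TARGETS.items():
        value = metrics[name]
        ok = value > threshold if direction == ">" else value < threshold
        if not ok:
            missed.append(name)
    return missed
```

Feeding the output of `missed_targets` into an alert is one way to make "which specific thing is breaking" visible without anyone reading transcripts.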
Level 2: Technical Metrics
Technical metrics watch system health:
| Metric | Definition | Target |
|---|---|---|
| Token usage per task | Average tokens consumed | Minimise |
| Latency (p50/p95) | Time to task completion | p50<60s, p95<300s |
| Error rate | % of tasks ending in error | <5% |
| Retry rate | % of tasks needing a retry or human intervention | <20% |
| Cost per task | Token cost + compute | Track and optimise |
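The technical metrics above can be computed from a plain list of per-task records. The sketch below assumes each record is a dict with hypothetical keys `latency_s`, `errored`, `tokens`, and `cost_usd`; adapt the names to whatever your telemetry actually exports. Percentiles use the simple nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def technical_metrics(tasks):
    """Summarise a non-empty list of per-task records (hypothetical schema)."""
    n = len(tasks)
    latencies = [t["latency_s"] for t in tasks]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "error_rate": 100 * sum(t["errored"] for t in tasks) / n,
        "avg_tokens": sum(t["tokens"] for t in tasks) / n,
        "avg_cost_usd": sum(t["cost_usd"] for t in tasks) / n,
    }
```

Reporting p95 alongside p50 matters because a healthy median can hide a long tail of tasks that stall for minutes.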
Level 3: Task Outcomes
Task outcome metrics ask the question that matters: did the agent actually get the job done?
| Metric | Definition | Measurement |
|---|---|---|
| Success rate | % of tasks completed without human fix | Human review |
| First-attempt success | % correct on first try | Automated + spot check |
| Code quality score | Composite of lint, test, review | Automated pipeline |
| Regression rate | % of changes that break existing code | CI/CD |
| Time saved | (Human estimate) - (Agent time) | Self-report |
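A minimal sketch of the outcome calculations in the table above, assuming one record per task with hypothetical fields `needed_human_fix`, `attempts`, `broke_ci`, `human_estimate_min`, and `agent_min`. Where those values come from (review tooling, CI, self-report) is up to your pipeline.

```python
def task_outcomes(tasks):
    """Aggregate task-level outcomes from a non-empty list of records."""
    n = len(tasks)
    succeeded = [t for t in tasks if not t["needed_human_fix"]]
    first_try = [t for t in succeeded if t["attempts"] == 1]
    return {
        "success_rate": 100 * len(succeeded) / n,
        "first_attempt_success": 100 * len(first_try) / n,
        "regression_rate": 100 * sum(t["broke_ci"] for t in tasks) / n,
        # Time saved = (human estimate) - (agent time), summed over all tasks.
        # It can go negative when the agent is slower than a person would be.
        "time_saved_min": sum(t["human_estimate_min"] - t["agent_min"] for t in tasks),
    }
```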
Level 4: Business Impact
Business impact metrics tie agent usage back to what the organisation actually cares about:
| Metric | Definition |
|---|---|
| Developer velocity | Story points or PRs per sprint |
| Bug escape rate | Bugs found in production vs. development |
| Developer satisfaction | Survey scores |
| Time to resolution | Mean time to fix bugs or implement features |
| Knowledge transfer speed | Time for new developers to become productive |
Evaluation Methodologies
Human Evaluation
The gold standard, and expensive. Use it for:
- Calibrating your automated evaluators at the start
- Edge cases and task types you haven't seen before
- Final sign-off on production deployments
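Calibration can be as simple as having people and the automated evaluator score the same sample, then comparing the two sets of numbers. The sketch below assumes both use the same numeric scale; the function name and the `tolerance` parameter are illustrative choices, not a standard.

```python
def calibration_report(human_scores, model_scores, tolerance=1):
    """Compare automated scores against human scores for the same items."""
    if not human_scores or len(human_scores) != len(model_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [m - h for h, m in zip(human_scores, model_scores)]
    n = len(diffs)
    return {
        "mean_abs_error": sum(abs(d) for d in diffs) / n,
        # Positive bias means the automated evaluator grades more generously.
        "bias": sum(diffs) / n,
        # Share of items where the two scores differ by at most `tolerance`.
        "agreement_rate": sum(abs(d) <= tolerance for d in diffs) / n,
    }
```

If the agreement rate is low or the bias is large, the automated evaluator probably needs a better rubric before its scores are trusted unattended.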
Model-Based Evaluation
Use a stronger model to grade a weaker model's output. For example, an Opus-class model can score the work of a faster Sonnet-class model. (One commonly cited version of this, that Claude Code internally uses Opus 4.8 to grade Sonnet 4.8 outputs, is unconfirmed: it has no public source, and "Sonnet 4.8" does not appear to be a released model. The current Sonnet is Sonnet 4.6; Opus 4.8 is real.) The approach scales well, but it inherits whatever biases the evaluator model carries.
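def model_evaluate(output, criteria, evaluator="opus-4.8"):
    # `claude.generate` is a placeholder for your own model client, not a real SDK call.
    # The output under review must be part of the prompt, or the grader has nothing to score.
    prompt = f"Evaluate this output on {criteria}. Score 1-10. Explain.\n\nOutput:\n{output}"
    return claude.generate(prompt, model=evaluator)
Reference-Based Evaluation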
Compare agent output against human-written reference solutions, using BLEU, ROUGE, or custom similarity metrics. It's only as good as the reference solutions you have, in quality and in coverage.
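To make the idea of a custom similarity metric concrete, here is a small unigram-overlap F1 score. It is in the spirit of ROUGE-1 but is a simplified stand-in, not the official BLEU or ROUGE implementation: it lowercases, splits on whitespace, and does no stemming.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over overlapping word counts between candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Word overlap rewards surface similarity, so a correct solution phrased differently from the reference will score low. That is the coverage limitation in practice.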
Execution-Based Evaluation
Run the generated code and see if it passes the tests. For coding tasks this is the most objective metric you'll get. The catch is that it leans entirely on the quality of your test suite.
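A minimal sketch of the execution check, assuming the generated code is Python and the tests are plain `assert` statements appended to it. The function name is illustrative. This is not a sandbox: untrusted generated code should run in an isolated container or VM, which this sketch does not provide.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(generated_code, test_code, timeout_s=10):
    """Run generated code plus assert-style tests in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures.
            return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0
```

The result is only as trustworthy as `test_code`: a test that asserts nothing passes everything.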
Building an Evaluation Suite
evaluation_suite/
├── unit_tests/ # Does generated code compile and pass?
├── integration_tests/ # Does it work with the rest of the system?
├── regression_tests/ # Does it break anything?
├── quality_checks/ # Lint, format, complexity
├── reference_compare/ # Similarity to human-written solutions
└── human_review/ # Manual quality assessment
Run the full suite before you deploy any prompt or model change. Run component metrics continuously. Review business impact monthly.
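One way to enforce the "full suite before deploy" rule is a small gate that runs every stage and blocks the change if any stage fails. The sketch below assumes each stage is wrapped in a callable that returns `True` on pass. The stage names mirror the directory layout above, and the callables are placeholders for your real test runners.

```python
def run_suite(stages):
    """Run every stage in order; a stage that raises counts as a failure.

    `stages` is an ordered list of (name, check) pairs, where check() returns
    a truthy value on pass.
    """
    results = {}
    for name, check in stages:
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def deploy_allowed(results):
    """Allow the deploy only if at least one stage ran and all of them passed."""
    return bool(results) and all(results.values())


def failed_stages(results):
    return [name for name, passed in results.items() if not passed]
```

Running all stages instead of stopping at the first failure costs more time, but it shows every broken stage in a single run.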
Red Flags
These patterns tell you the evaluation itself is broken:
- Measuring proxy metrics: "Lines of code generated" barely correlates with value
- Cherry-picking examples: Testing only on tasks you already know the agent handles well
- No human baseline: If you don't know how a person performs, the agent's numbers mean nothing
- Static benchmarks: Agent performance drifts, so evaluate continuously
- Ignoring failure modes: Counting wins without categorising the losses
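Categorising losses does not need much machinery. The sketch below assumes each task record carries a `success` flag and, for failures, a `failure_category` label assigned during review. Both field names and the example categories are hypothetical.

```python
from collections import Counter


def failure_breakdown(tasks):
    """Return (category, count, share_of_failures_pct), most common first."""
    failures = [t for t in tasks if not t["success"]]
    counts = Counter(t.get("failure_category", "uncategorised") for t in failures)
    total = len(failures)
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]


tasks = [
    {"success": True},
    {"success": False, "failure_category": "wrong_tool"},
    {"success": False, "failure_category": "wrong_tool"},
    {"success": False, "failure_category": "timeout"},
    {"success": False},  # failed, but nobody labelled why
]
for category, count, share in failure_breakdown(tasks):
    print(f"{category}: {count} ({share:.0f}% of failures)")
```

A large "uncategorised" share is itself a finding: it means failures are being counted but not examined.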
Agent evaluation isn't a one-off. It's an ongoing practice, and it decides whether your deployment gets better or worse over time. Teams that evaluate well ship with confidence. Teams that don't are just hoping.




