Agent Evaluation: How to Measure Agent Performance.

The four-level measurement pyramid that separates reliable agent deployments from hopeful experiments. From component metrics to business impact, with evaluation methodologies for each level.

Daniel Fleuren | 2026-05-20 | 12 min read | Australian business teams | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

Briefing

Most teams running AI agents have no idea whether the thing is actually working. It writes code, it answers tickets, it drafts emails, and the dashboard looks busy. But ask a simple question, "is this agent better or worse than it was last month?", and the room goes quiet.

That gap is the whole problem. An agent that feels productive can be quietly burning money, shipping bugs, and creating cleanup work that nobody is counting. The only way to know the difference is to measure it properly, and most organisations measure the easy stuff (tokens, speed) while ignoring the parts that decide whether the agent earns its keep.

This is a practical guide to closing that gap. It lays out a four-level way to score agent performance, from the nuts-and-bolts technical numbers up to the business outcomes your finance team cares about. None of it requires a data science team. It does require deciding, on purpose, what good looks like before you deploy.

You cannot improve what you do not measure. Agent evaluation is the discipline of measuring agent performance with rigour rather than intuition. Here is the framework that separates reliable agent deployments from hopeful experiments.

The Measurement Pyramid

Agent evaluation works at four levels:

Level 4: Business Impact (revenue, velocity, satisfaction)
Level 3: Task Outcomes (success rate, quality, time)
Level 2: Technical Metrics (token usage, latency, error rate)
Level 1: Component Metrics (tool accuracy, context relevance, prompt adherence)

Most teams measure Level 2 and stop there. The teams getting real value measure all four, and they understand how each level feeds the next.

Level 1: Component Metrics

Component metrics tell you which specific thing is breaking:

Metric	Definition	Target
Tool accuracy	% of tool calls with correct parameters	>95%
Context relevance	% of retrieved context actually used	>80%
Prompt adherence	% of constraints followed in output	>90%
Hallucination rate	% of outputs containing fabricated information	<2%
Format compliance	% of outputs matching requested format	>95%

These need automated evaluation, usually model-based. Claude Code's built-in telemetry covers some of them: with OpenTelemetry turned on, it emits token usage, cost, session counts, and tool-execution events you can mine for component data (Anthropic Claude Code docs, Monitoring usage). For retrospective digging, Hermes' FTS5 session database lets you full-text search across past sessions and pull patterns out of old transcripts (Nous Research Hermes Agent, Session Storage).
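As a minimal sketch of how tool accuracy might be computed once tool calls are logged: the `ToolCall` record and its `params_valid` flag are hypothetical, standing in for whatever your own validator writes. They are not the schema that Claude Code or Hermes emits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str           # tool name as logged
    params_valid: bool  # hypothetical flag set by your own validator


def tool_accuracy(calls):
    """Percentage of tool calls whose parameters were correct."""
    if not calls:
        return 0.0
    return 100.0 * sum(c.params_valid for c in calls) / len(calls)


def accuracy_by_tool(calls):
    """Break accuracy down per tool to show which one is failing."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call.tool].append(call)
    return {tool: tool_accuracy(group) for tool, group in grouped.items()}


calls = [
    ToolCall("search", True),
    ToolCall("search", True),
    ToolCall("write_file", False),
    ToolCall("write_file", True),
]
print(tool_accuracy(calls))     # 75.0
print(accuracy_by_tool(calls))  # {'search': 100.0, 'write_file': 50.0}
```

The per-tool breakdown is the part that earns its keep: an overall 75% says something is wrong, while the breakdown says where.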

Level 2: Technical Metrics

Technical metrics watch system health:

Metric	Definition	Target
Token usage per task	Average tokens consumed	Minimise
Latency (p50/p95)	Time to task completion	p50<60s, p95<300s
Error rate	% of tasks ending in error	<5%
Retry rate	% of tasks retried or requiring human intervention	<20%
Cost per task	Token cost + compute	Track and optimise
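One way to turn the table into numbers, assuming each finished task is logged as a small record. The field names (`latency_s`, `errored`, `tokens`) and the sample values are illustrative, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def technical_summary(tasks):
    """Summarise Level 2 metrics from per-task records (hypothetical fields)."""
    latencies = [t["latency_s"] for t in tasks]
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "error_rate_pct": 100.0 * sum(t["errored"] for t in tasks) / len(tasks),
        "avg_tokens": sum(t["tokens"] for t in tasks) / len(tasks),
    }


tasks = [
    {"latency_s": 20, "errored": False, "tokens": 4000},
    {"latency_s": 30, "errored": False, "tokens": 6000},
    {"latency_s": 45, "errored": False, "tokens": 5000},
    {"latency_s": 50, "errored": False, "tokens": 7000},
    {"latency_s": 400, "errored": True, "tokens": 28000},
]
summary = technical_summary(tasks)
print(summary["p50_s"], summary["p95_s"])  # 45 400
print(summary["error_rate_pct"])           # 20.0
```

In this sample the median looks healthy while p95 and the error rate both miss the targets in the table, which is why the tail is tracked separately from the average.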

Level 3: Task Outcomes

Task outcome metrics ask the question that matters: did the agent actually get the job done?

Metric	Definition	Measurement
Success rate	% of tasks completed without human fix	Human review
First-attempt success	% correct on first try	Automated + spot check
Code quality score	Composite of lint, test, review	Automated pipeline
Regression rate	% of changes that break existing code	CI/CD
Time saved	(Human estimate) - (Agent time)	Self-report
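A small sketch of the same definitions in code, assuming a reviewer records for each task whether a human fix was needed, how many attempts the agent took, and the two time figures. All field names and numbers are made up for illustration.

```python
def outcome_summary(tasks):
    """Level 3 metrics from reviewed task records (hypothetical fields)."""
    n = len(tasks)
    if n == 0:
        raise ValueError("no tasks to summarise")
    succeeded = [t for t in tasks if not t["human_fix"]]
    first_try = [t for t in succeeded if t["attempts"] == 1]
    return {
        "success_rate_pct": 100.0 * len(succeeded) / n,
        "first_attempt_pct": 100.0 * len(first_try) / n,
        # (Human estimate) - (Agent time), summed; can be negative.
        "time_saved_min": sum(t["human_estimate_min"] - t["agent_min"] for t in tasks),
    }


tasks = [
    {"human_fix": False, "attempts": 1, "human_estimate_min": 60, "agent_min": 10},
    {"human_fix": False, "attempts": 2, "human_estimate_min": 30, "agent_min": 15},
    {"human_fix": True, "attempts": 1, "human_estimate_min": 45, "agent_min": 20},
    {"human_fix": False, "attempts": 1, "human_estimate_min": 20, "agent_min": 25},
]
print(outcome_summary(tasks))
# {'success_rate_pct': 75.0, 'first_attempt_pct': 50.0, 'time_saved_min': 85}
```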

Level 4: Business Impact

Business impact metrics tie agent usage back to what the organisation actually cares about:

Metric	Definition
Developer velocity	Story points or PRs per sprint
Bug escape rate	Bugs found in production vs. development
Developer satisfaction	Survey scores
Time to resolution	Mean time to fix bugs or implement features
Knowledge transfer speed	Time for new developers to become productive
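For bug escape rate, one common reading of "production vs. development" is the share of all found bugs that were first found in production. The sketch below assumes that reading, and the before and after counts are invented purely to show the comparison.

```python
def bug_escape_rate(found_in_production, found_in_development):
    """Percentage of all found bugs that escaped to production."""
    total = found_in_production + found_in_development
    if total == 0:
        return 0.0
    return 100.0 * found_in_production / total


# Hypothetical quarterly counts before and after introducing the agent.
before = bug_escape_rate(found_in_production=6, found_in_development=34)
after = bug_escape_rate(found_in_production=9, found_in_development=36)
print(before, after)  # 15.0 20.0
```

In this made-up case the team found more bugs overall after adopting the agent, but a larger share reached production, which is the signal a velocity number alone would hide.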

Evaluation Methodologies

Human Evaluation

The gold standard, and expensive. Use it for:

Calibrating your automated evaluators at the start
Edge cases and task types you haven't seen before
Final sign-off on production deployments
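Calibration can be made concrete by having humans and the automated evaluator label the same sample, then measuring agreement. The sketch below uses Cohen's kappa, which corrects raw agreement for chance. The labels are invented, and kappa is one reasonable choice here rather than something the framework prescribes.

```python
from collections import Counter


def cohens_kappa(human, auto):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(auto) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    h_counts, a_counts = Counter(human), Counter(auto)
    labels = set(human) | set(auto)
    expected = sum(h_counts[l] * a_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "pass", "fail", "fail", "pass", "fail", "pass"]
auto = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(human, auto), 2))  # 0.47
```

Raw agreement in this sample is 75%, but kappa is only about 0.47 because both raters say "pass" most of the time anyway. A low kappa is a cue to revise the evaluator's rubric before trusting it at scale.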

Model-Based Evaluation

Use a stronger model to grade a weaker model's output. For example, an Opus-class model can score the work of a faster Sonnet-class model. (One commonly cited version of this, that Claude Code internally uses Opus 4.8 to grade Sonnet 4.8 outputs, is unconfirmed: it has no public source, and "Sonnet 4.8" does not appear to be a released model. The current Sonnet is Sonnet 4.6; Opus 4.8 is real.) The approach scales well, but it inherits whatever biases the evaluator model carries.

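def model_evaluate(output, criteria, evaluator="opus-4.8"):
    # `claude` is a placeholder client; swap in your SDK's call and a real model ID.
    prompt = (f"Evaluate this output on {criteria}. Score 1-10. Explain.\n\n"
              f"Output to evaluate:\n{output}")
    return claude.generate(prompt, model=evaluator)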
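The evaluator's reply is free text, so the score has to be extracted before it can be tracked. A minimal sketch, assuming the evaluator is asked to include a phrase like "Score: 7" in its reply; that reply format is an assumption, not something any particular model guarantees.

```python
import re


def parse_score(reply, low=1, high=10):
    """Pull an integer score out of an evaluator reply, or return None."""
    match = re.search(r"score\s*[:=]?\s*(\d{1,2})", reply, re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    # Reject out-of-range values rather than silently clamping them.
    return score if low <= score <= high else None


print(parse_score("Score: 7/10. Clear, but misses two edge cases."))  # 7
print(parse_score("No rating given."))                                # None
```

Returning `None` instead of a default keeps unparseable replies visible, so they can be counted as evaluator failures rather than averaged in as scores.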

Reference-Based Evaluation

Compare agent output against human-written reference solutions, using BLEU, ROUGE, or custom similarity metrics. It's only as good as the reference solutions you have, in quality and in coverage.
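To show the idea behind overlap metrics, here is a deliberately simplified unigram-overlap F1, in the spirit of ROUGE-1 but not the official implementation (no stemming, no proper tokenisation). The two sentences are invented examples.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 of word overlap between candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


agent_output = "the agent fixed the login bug"
reference = "the agent fixed a login bug quickly"
print(round(unigram_f1(agent_output, reference), 3))  # 0.769
```

Word overlap rewards surface similarity, so two correct answers phrased differently can score poorly. That is one reason it works better as a supporting signal than as the only one.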

Execution-Based Evaluation

Run the generated code and see if it passes the tests. For coding tasks this is the most objective metric you'll get. The catch is that it leans entirely on the quality of your test suite.
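A minimal sketch of an execution-based check: write the generated code and its tests to a temporary file, run it in a separate Python process, and treat a zero exit code as a pass. The function name is hypothetical, and this is not a sandbox; untrusted code should run in an isolated container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(generated_code, test_code, timeout_s=30):
    """Return True if the generated code runs and all assertions pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(generated_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
    return result.returncode == 0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(passes_tests(good, tests))  # True
print(passes_tests(bad, tests))   # False
```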

Building an Evaluation Suite

evaluation_suite/
├── unit_tests/          # Does generated code compile and pass?
├── integration_tests/   # Does it work with the rest of the system?
├── regression_tests/    # Does it break anything?
├── quality_checks/      # Lint, format, complexity
├── reference_compare/   # Similarity to human-written solutions
└── human_review/        # Manual quality assessment

Run the full suite before you deploy any prompt or model change. Run component metrics continuously. Review business impact monthly.
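The "run the full suite before you deploy" rule can be expressed as a simple gate. In this sketch each stage is a callable returning pass or fail. The lambdas are stand-ins for real runners, and the stage names simply mirror the directory layout above.

```python
def run_suite(stages):
    """Run every stage and record whether it passed."""
    return {name: bool(check()) for name, check in stages}


def ready_to_deploy(results):
    """Deploy only when every stage passed."""
    return all(results.values())


# Stand-in checks; real ones would invoke the test runner, linter, and so on.
stages = [
    ("unit_tests", lambda: True),
    ("integration_tests", lambda: True),
    ("regression_tests", lambda: False),
    ("quality_checks", lambda: True),
]
results = run_suite(stages)
failed = [name for name, ok in results.items() if not ok]
print(ready_to_deploy(results))  # False
print(failed)                    # ['regression_tests']
```

Running every stage rather than stopping at the first failure costs more time but gives the full list of what a prompt or model change broke.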

Red Flags

These patterns tell you the evaluation itself is broken:

Measuring proxy metrics: "Lines of code generated" barely correlates with value
Cherry-picking examples: Testing only on tasks you already know the agent handles well
No human baseline: If you don't know how a person performs, the agent's numbers mean nothing
Static benchmarks: Agent performance drifts, so evaluate continuously
Ignoring failure modes: Counting wins without categorising the losses
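Categorising the losses can be as light as tagging each failed task with a failure mode during review and counting the tags. The category names and records below are hypothetical.

```python
from collections import Counter


def failure_breakdown(tasks):
    """Count failure modes among unsuccessful tasks, most frequent first."""
    modes = [t["failure_mode"] for t in tasks if not t["success"]]
    return Counter(modes).most_common()


tasks = [
    {"success": True, "failure_mode": None},
    {"success": False, "failure_mode": "wrong_tool"},
    {"success": False, "failure_mode": "hallucinated_api"},
    {"success": False, "failure_mode": "wrong_tool"},
    {"success": True, "failure_mode": None},
]
print(failure_breakdown(tasks))
# [('wrong_tool', 2), ('hallucinated_api', 1)]
```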

Agent evaluation isn't a one-off. It's an ongoing practice, and it decides whether your deployment gets better or worse over time. Teams that evaluate well ship with confidence. Teams that don't are just hoping.

What to do next

Pick the smallest useful workflow that proves the pattern.
Write down the owner, data boundary, review point, and success measure.
Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
