Back to news

Code

Agent Evaluation: How to Measure Agent Performance.

The four-level measurement pyramid that separates reliable agent deployments from hopeful experiments. From component metrics to business impact, with evaluation methodologies for each level.

AI Kick Start editorial image for Agent Evaluation: How to Measure Agent Performance.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

TL;DR: The four-level measurement pyramid that separates reliable agent deployments from hopeful experiments. From component metrics to business impact, with evaluation methodologies for each level.

Key takeaways

  • Briefing: Most teams running AI agents have no idea whether the thing is actually working.
  • The Measurement Pyramid: Agent evaluation works at four levels: Level 4: Business Impact (revenue, velocity, satisfaction) Level 3: Task Outcomes (success rate, quality, time) Level 2: Technical Metrics (token usage, latency, error rate) Level 1: Component Metrics (tool accuracy, context relevance, prompt adherence) Most teams measure Level 2 and stop there.
  • Level 1: Component Metrics: Component metrics tell you which specific thing is breaking: Tool accuracy: % of tool calls with correct parameters: >95% Context relevance: % of retrieved context actually used: >80% Prompt adherence: % of constraints followed in output: >90% Hallucination rate: % of outputs containing fabricated information: <2% Format compliance: % of outputs matching requested format: >95% These need automated evaluation, usually model-based.
  • Level 2: Technical Metrics: Technical metrics watch system health: Token usage per task: Average tokens consumed: Minimise Latency (p50/p95): Time to task completion: p50<60s, p95<300s Error rate: % of tasks ending in error: <5% Retry rate: % of tasks requiring human intervention: <20% Cost per task: Token cost + compute: Track and optimise
  • Level 3: Task Outcomes: Task outcome metrics ask the question that matters: did the agent actually get the job done?

Briefing

Most teams running AI agents have no idea whether the thing is actually working. It writes code, it answers tickets, it drafts emails, and the dashboard looks busy. But ask a simple question, "is this agent better or worse than it was last month?", and the room goes quiet.

That gap is the whole problem. An agent that feels productive can be quietly burning money, shipping bugs, and creating cleanup work that nobody is counting. The only way to know the difference is to measure it properly, and most organisations measure the easy stuff (tokens, speed) while ignoring the parts that decide whether the agent earns its keep.

This is a practical guide to closing that gap. It lays out a four-level way to score agent performance, from the nuts-and-bolts technical numbers up to the business outcomes your finance team cares about. None of it requires a data science team. It does require deciding, on purpose, what good looks like before you deploy.

You cannot improve what you do not measure. Agent evaluation is the discipline of measuring agent performance with rigour rather than intuition. Here is the framework that separates reliable agent deployments from hopeful experiments.

The Measurement Pyramid

Agent evaluation works at four levels:

Level 4: Business Impact (revenue, velocity, satisfaction)
Level 3: Task Outcomes (success rate, quality, time)
Level 2: Technical Metrics (token usage, latency, error rate)
Level 1: Component Metrics (tool accuracy, context relevance, prompt adherence)

Most teams measure Level 2 and stop there. The teams getting real value measure all four, and they understand how each level feeds the next.

Level 1: Component Metrics

Component metrics tell you which specific thing is breaking:

MetricDefinitionTarget
Tool accuracy% of tool calls with correct parameters>95%
Context relevance% of retrieved context actually used>80%
Prompt adherence% of constraints followed in output>90%
Hallucination rate% of outputs containing fabricated information<2%
Format compliance% of outputs matching requested format>95%

These need automated evaluation, usually model-based. Claude Code's built-in telemetry covers some of them: with OpenTelemetry turned on, it emits token usage, cost, session counts, and tool-execution events you can mine for component data (Anthropic Claude Code docs, Monitoring usage). For retrospective digging, Hermes' FTS5 session database lets you full-text search across past sessions and pull patterns out of old transcripts (Nous Research Hermes Agent, Session Storage).

Level 2: Technical Metrics

Technical metrics watch system health:

MetricDefinitionTarget
Token usage per taskAverage tokens consumedMinimise
Latency (p50/p95)Time to task completionp50<60s, p95<300s
Error rate% of tasks ending in error<5%
Retry rate% of tasks requiring human intervention<20%
Cost per taskToken cost + computeTrack and optimise

Level 3: Task Outcomes

Task outcome metrics ask the question that matters: did the agent actually get the job done?

MetricDefinitionMeasurement
Success rate% of tasks completed without human fixHuman review
First-attempt success% correct on first tryAutomated + spot check
Code quality scoreComposite of lint, test, reviewAutomated pipeline
Regression rate% of changes that break existing codeCI/CD
Time saved(Human estimate) - (Agent time)Self-report

Level 4: Business Impact

Business impact metrics tie agent usage back to what the organisation actually cares about:

MetricDefinition
Developer velocityStory points or PRs per sprint
Bug escape rateBugs found in production vs. development
Developer satisfactionSurvey scores
Time to resolutionMean time to fix bugs or implement features
Knowledge transfer speedTime for new developers to become productive

Evaluation Methodologies

Human Evaluation

The gold standard, and expensive. Use it for:

  • Calibrating your automated evaluators at the start
  • Edge cases and task types you haven't seen before
  • Final sign-off on production deployments

Model-Based Evaluation

Use a stronger model to grade a weaker model's output. For example, an Opus-class model can score the work of a faster Sonnet-class model. (One commonly cited version of this, that Claude Code internally uses Opus 4.8 to grade Sonnet 4.8 outputs, is unconfirmed: it has no public source, and "Sonnet 4.8" does not appear to be a released model. The current Sonnet is Sonnet 4.6; Opus 4.8 is real.) The approach scales well, but it inherits whatever biases the evaluator model carries.

def model_evaluate(output, criteria, evaluator="opus-4.8"):
    prompt = f"Evaluate this output on {criteria}. Score 1-10. Explain."
    return claude.generate(prompt, model=evaluator)

Reference-Based Evaluation

Compare agent output against human-written reference solutions, using BLEU, ROUGE), or custom similarity metrics. It's only as good as the reference solutions you have, in quality and in coverage.

Execution-Based Evaluation

Run the generated code and see if it passes the tests. For coding tasks this is the most objective metric you'll get. The catch is that it leans entirely on the quality of your test suite.

Building an Evaluation Suite

evaluation_suite/
├── unit_tests/          # Does generated code compile and pass?
├── integration_tests/   # Does it work with the rest of the system?
├── regression_tests/    # Does it break anything?
├── quality_checks/      # Lint, format, complexity
├── reference_compare/   # Similarity to human-written solutions
└── human_review/        # Manual quality assessment

Run the full suite before you deploy any prompt or model change. Run component metrics continuously. Review business impact monthly.

Red Flags

These patterns tell you the evaluation itself is broken:

  • Measuring proxy metrics: "Lines of code generated" barely correlates with value
  • Cherry-picking examples: Testing only on tasks you already know the agent handles well
  • No human baseline: If you don't know how a person performs, the agent's numbers mean nothing
  • Static benchmarks: Agent performance drifts, so evaluate continuously
  • Ignoring failure modes: Counting wins without categorising the losses

Agent evaluation isn't a one-off. It's an ongoing practice, and it decides whether your deployment gets better or worse over time. Teams that evaluate well ship with confidence. Teams that don't are just hoping.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

Want help applying this? Explore AI agent design systems.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.

Explore with AI

Use the article as a decision prompt

Summarise this AI Kick Start article for an Australian business owner. Focus on the useful decision, the risks, and the first practical next step: Agent Evaluation: How to Measure Agent Performance

Turn this into a practical roadmap.

Use the guide as a starting point, then map the first workflow worth building.

Book an AI strategy call