Real-Time Agent Monitoring: Logs, Traces, Observability.

The three-pillar monitoring stack for production agents: structured logging, distributed tracing, and aggregate metrics. Includes monitoring stack recommendations and debugging workflows.

TL;DR

An agent you cannot watch is an agent you cannot trust. For production agent deployments, real-time monitoring (logs, traces, and metrics) is the difference between a useful teammate and a quiet liability. This piece walks through the monitoring stack for agentic systems as it stood in June 2026.

Analysis

Picture an AI agent running in your business overnight. It edits code, calls tools, spends money on model tokens, and makes dozens of small decisions without anyone watching. In the morning, something is broken. The obvious question is the one most teams cannot answer: what did the agent actually do, and why?

That gap is the whole story here. When a human employee makes a bad call, you can ask them about it. When an autonomous agent makes one, you are left with whatever it bothered to record. If it recorded nothing, you are guessing.

The teams getting real value from agents in 2026 have figured this out. They treat monitoring as part of shipping the agent, not as an afterthought you bolt on once something goes wrong. The good news is that the tooling has caught up. The standard observability stack most engineering teams already run can watch agents too, once you know which signals matter.

So this is the practical version: what to log, what to trace, what to measure, and how to wire it into tools you probably already have.

Why Agent Monitoring is Different

Ordinary application monitoring watches three things: errors, performance, and availability. Agents need all of that, plus a few dimensions that traditional monitoring never had to care about (Claude Code Docs, 2026):

  • Intent tracking: What did the agent think it was doing?
  • Decision tracing: Why did it pick that approach over another?
  • Tool call telemetry: Every tool invocation, with its parameters and results.
  • Context window analysis: What was actually in context when each decision was made?
  • Quality metrics: Was the output correct, safe, and aligned with the goal?
  • Cost tracking: Token usage, model selection, and spend per task.
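
One way to make these dimensions concrete is to give each one a field on a single record. The sketch below is only an illustration: the class names, field names, and the per-million-token prices passed to cost_usd are all hypothetical and do not come from any agent's real schema.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str      # e.g. "file_write"
    params: dict   # parameters the agent passed
    result: str    # e.g. "success" or an error summary


@dataclass
class DecisionRecord:
    """One agent decision with the agent-specific dimensions attached."""
    intent: str          # what the agent believed it was doing
    rationale: str       # why it picked this approach over another
    context_digest: str  # fingerprint of what was in context at decision time
    tool_calls: list[ToolCall] = field(default_factory=list)
    quality_checks: dict[str, bool] = field(default_factory=dict)
    input_tokens: int = 0
    output_tokens: int = 0

    def cost_usd(self, usd_per_m_input: float, usd_per_m_output: float) -> float:
        # Prices are supplied by the caller; none are assumed here.
        total = (self.input_tokens * usd_per_m_input
                 + self.output_tokens * usd_per_m_output)
        return total / 1_000_000


def digest_context(messages: list[str]) -> str:
    """Hash the context so two decisions can be compared without storing it all."""
    joined = "\n".join(messages).encode("utf-8")
    return "sha256:" + hashlib.sha256(joined).hexdigest()


record = DecisionRecord(
    intent="migrate auth module",
    rationale="existing tests cover the old API, so refactor in place",
    context_digest=digest_context(["system prompt", "src/auth.ts contents"]),
    tool_calls=[ToolCall("file_write", {"path": "src/auth.ts"}, "success")],
    quality_checks={"tests_pass": True, "lint_clean": True},
    input_tokens=4200,
    output_tokens=1800,
)
```

Storing a digest rather than the full context keeps records small; whether that is enough for context window analysis depends on whether you can recover the original context from elsewhere.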

The Three Pillars of Agent Observability

Pillar 1: Logging

Structured logging records every agent action along with its context:

{
  "timestamp": "2026-06-15T10:30:00Z",
  "agent_id": "hermes-prod-1",
  "session_id": "sess_abc123",
  "event_type": "tool_call",
  "tool": "file_write",
  "params": {
    "path": "src/auth.ts",
    "lines": 47,
    "diff_hash": "sha256:abc..."
  },
  "model": "sonnet-4.8",
  "tokens_used": { "input": 4200, "output": 1800 },
  "latency_ms": 3200,
  "result": "success",
  "user_id": "engineer_42"
}

(The sonnet-4.8 model id above is illustrative. As of June 2026 the released Sonnet line tops out at Sonnet 4.6; there is no public Sonnet 4.8. Anthropic did ship Claude Opus 4.8 in late May 2026, but treat the field here as a placeholder rather than a real version.)

Different agents log in different ways. Hermes (Nous Research) is built on a SQLite state store that uses an FTS5 full-text-search table, so the data it keeps lives in a searchable local database (NousResearch/hermes-agent); whether you can lean on that as a full monitoring layer depends on how you wire it up. Claude Code emits structured log events through its built-in OpenTelemetry instrumentation, covering prompts, tool results, token usage, and costs, and you turn it on with CLAUDE_CODE_ENABLE_TELEMETRY=1 (Claude Code Docs, 2026). The tool sometimes referred to as OpenClaw reportedly logs to files with configurable verbosity, though that project and its commands could not be confirmed against a primary source, so take it as unverified.
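
If your agent does not emit structured events on its own, a thin wrapper around Python's standard logging module can produce lines shaped like the example above. This is a minimal sketch, not any agent's actual logger: the logger name, the "fields" key passed through extra, and the in-memory buffer are assumptions chosen for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event_type": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is merged into the line.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_agent_logger(stream) -> logging.Logger:
    logger = logging.getLogger("agent.events")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for a log file or a shipper's input
log = build_agent_logger(buffer)
log.info("tool_call", extra={"fields": {
    "agent_id": "hermes-prod-1",
    "session_id": "sess_abc123",
    "tool": "file_write",
    "params": {"path": "src/auth.ts", "lines": 47},
    "tokens_used": {"input": 4200, "output": 1800},
    "latency_ms": 3200,
    "result": "success",
}})

event = json.loads(buffer.getvalue())
print(event["event_type"], event["tool"], event["latency_ms"])
```

One JSON object per line keeps the output easy to tail, grep, and ship to whatever log backend you already run.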

Pillar 2: Tracing

Distributed traces follow a single task across multiple agent calls and tool invocations:

[Trace: migrate-auth-system]
  [Span: plan_generation] 1.2s
    [Span: codebase_analysis] 0.4s
    [Span: migration_plan_draft] 0.8s
  [Span: plan_approval] 45.0s (human)
  [Span: execution] 128.0s
    [Span: file_read(src/auth.ts)] 0.1s
    [Span: file_write(src/auth.ts)] 3.2s
    [Span: test_run] 12.4s
    [Span: file_write(tests/auth.test.ts)] 2.8s
  [Span: verification] 8.1s
    [Span: lint_check] 2.1s
    [Span: typecheck] 6.0s

A trace shows you where the time went and where the failure happened. In Claude Code, this comes from the OpenTelemetry instrumentation, which records spans around each model request and tool execution (Claude Code Docs, 2026). Its Dynamic Workflows feature is real (a script that orchestrates subagents at scale), but the idea that it generates traces on its own overstates things: tracing is the job of the opt-in OpenTelemetry layer, not a side effect of running a workflow. For Hermes, an OpenTelemetry integration via a hermes-opentelemetry package has been mentioned, but no registry or repo confirms such a package exists, so treat it as unconfirmed.
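
To see what a span tree like the one above involves, here is a toy tracer built only on the standard library. It is a stand-in for illustration, not the OpenTelemetry API and not how Claude Code or Hermes record spans; the Span and Tracer names are hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        return self.end - self.start


class Tracer:
    """Tracks nested spans with a stack; the innermost open span is the parent."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._stack: list[Span] = []
        self.roots: list[Span] = []

    @contextmanager
    def span(self, name: str):
        node = Span(name=name, start=self._clock())
        siblings = self._stack[-1].children if self._stack else self.roots
        siblings.append(node)
        self._stack.append(node)
        try:
            yield node
        finally:
            # Close the span even if the wrapped step raised.
            node.end = self._clock()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Format a span and its children as an indented tree."""
    lines = [f"{'  ' * depth}[Span: {span.name}] {span.duration_s:.1f}s"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("plan_generation"):
    with tracer.span("codebase_analysis"):
        pass  # the agent's real work would run here
    with tracer.span("migration_plan_draft"):
        pass

print("\n".join(render(tracer.roots[0])))
```

A real deployment would hand this job to an OpenTelemetry SDK and exporter; the point here is only that a span is a named interval with a parent, closed in a finally block so failures still show up in the tree.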

Pillar 3: Metrics

Aggregate metrics are what feed your dashboards and alerts:

| Metric | Type | Alert Threshold |
|---|---|---|
| Tasks per hour | Counter | Drop >50% |
| Success rate | Gauge | <80% |
| Average latency | Histogram | p95 >60s |
| Token cost per task | Histogram | >200% of baseline |
| Rollback rate | Gauge | >5% |
| Approval gate triggers | Counter | Spike >300% |
| Tool error rate | Gauge | >2% per tool |
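
The thresholds in the table can be written down as plain rules and checked against a metrics snapshot. The sketch below assumes a few things the table does not spell out: the metric key names are hypothetical, ">200% of baseline" is read as more than 2x baseline, "Spike >300%" is read as more than 3x baseline, and rates are expressed as fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    metric: str
    label: str
    relative: bool  # True when the rule compares against a baseline
    breached: Callable[[float, float], bool]  # (current, baseline) -> bool


RULES = [
    Rule("tasks_per_hour", "drop >50%", True, lambda cur, base: cur < 0.5 * base),
    Rule("success_rate", "<80%", False, lambda cur, base: cur < 0.80),
    Rule("latency_p95_s", "p95 >60s", False, lambda cur, base: cur > 60.0),
    Rule("token_cost_per_task", ">200% of baseline", True,
         lambda cur, base: cur > 2.0 * base),
    Rule("rollback_rate", ">5%", False, lambda cur, base: cur > 0.05),
    Rule("approval_gate_triggers", "spike >300%", True,
         lambda cur, base: cur > 3.0 * base),
    Rule("tool_error_rate", ">2% per tool", False, lambda cur, base: cur > 0.02),
]


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def breached_metrics(current: dict, baseline: dict) -> list[str]:
    """Return the names of metrics whose rule fires for this snapshot."""
    hits = []
    for rule in RULES:
        if rule.metric not in current:
            continue
        if rule.relative and rule.metric not in baseline:
            continue  # no baseline yet, so a relative rule cannot be judged
        if rule.breached(current[rule.metric], baseline.get(rule.metric, 0.0)):
            hits.append(rule.metric)
    return hits


snapshot = {"tasks_per_hour": 40, "success_rate": 0.91, "latency_p95_s": 75.0,
            "token_cost_per_task": 0.30, "tool_error_rate": 0.01}
usual = {"tasks_per_hour": 100, "token_cost_per_task": 0.20}
print(breached_metrics(snapshot, usual))
```

Skipping relative rules when no baseline exists avoids a burst of false alerts on the first day an agent runs.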

Monitoring Stack Recommendations

For Small Teams (< 10 engineers)

Hermes: Built-in FTS5 logs + `hermes metrics` command
Claude Code: Built-in telemetry + JSON log export
OpenClaw: File logs + basic metrics dashboard
Export to: Grafana Cloud (free tier) or Datadog

A note on the Hermes and OpenClaw commands in this article: the specific subcommands are illustrative. Hermes ships a CLI, but the metrics verb above and the dashboard and traces verbs used in later sections were not confirmed in its primary docs, and the OpenClaw commands could not be verified at all. Claude Code's telemetry, by contrast, is documented, and its OpenTelemetry export feeds Grafana, Datadog, and similar backends directly (Claude Code Docs, 2026).

For Large Teams (10+ engineers)

Hermes: OpenTelemetry export to Jaeger + Prometheus
Claude Code: Anthropic-managed telemetry + SIEM integration
OpenClaw: Structured logging to ELK/Loki stack
Dashboards: Grafana with custom agent dashboards
Alerting: PagerDuty/Opsgenie for critical thresholds

The SIEM piece is worth flagging because it is real and useful: Claude Code's tool_decision, tool_result, mcp_server_connection, and permission_mode_changed events form a per-user audit trail you can forward to a security information and event management platform (Claude Code Docs, 2026). Grafana, Jaeger, Prometheus, ELK, Loki, PagerDuty, and Opsgenie are all standard tools, and OpenTelemetry's OTLP export plugs into them, so this stack is technically sound rather than aspirational.
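
Before forwarding to a SIEM, you will usually want to keep only the audit-relevant events and group them by user. The sketch below assumes events arrive as JSON lines with event_type and user_id fields, as in this article's logging example; the real Claude Code event schema may name those attributes differently, so check the docs before relying on it.

```python
import json
from collections import defaultdict

# Event names taken from the paragraph above.
AUDIT_EVENT_TYPES = frozenset({
    "tool_decision",
    "tool_result",
    "mcp_server_connection",
    "permission_mode_changed",
})


def build_audit_trail(json_lines: list[str]) -> dict[str, list[dict]]:
    """Group audit-relevant events by user, preserving arrival order."""
    trail: dict[str, list[dict]] = defaultdict(list)
    for line in json_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the batch
        if not isinstance(event, dict):
            continue
        if event.get("event_type") in AUDIT_EVENT_TYPES:
            trail[event.get("user_id", "unknown")].append(event)
    return dict(trail)


sample_lines = [
    '{"event_type": "tool_decision", "user_id": "engineer_42", "tool": "file_write"}',
    '{"event_type": "api_request", "user_id": "engineer_42"}',
    '{"event_type": "permission_mode_changed", "user_id": "engineer_7"}',
    "not json at all",
]
audit = build_audit_trail(sample_lines)
print({user: len(events) for user, events in audit.items()})
```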

Real-Time Dashboards

A useful agent dashboard surfaces:

  1. Current activity: Active agents, running tasks, queued requests.
  2. Health overview: Success rates, error rates, latency percentiles.
  3. Cost tracking: Spend today, this week, this month, broken down per agent.
  4. Quality trends: Rollback rate, human correction rate, approval gate stats.
  5. Alert feed: Active alerts and recent resolutions.

# Hermes built-in dashboard
hermes dashboard --port 8080

# Claude Code telemetry export
claude telemetry export --format prometheus

# OpenClaw metrics
openclaw metrics --serve --port 9090

One caveat on the Claude Code line: there is no documented claude telemetry export --format prometheus subcommand. In practice you configure the export through OpenTelemetry environment variables (OTEL_METRICS_EXPORTER and the standard OTLP settings), and a Prometheus or Grafana backend consumes the metrics from there (Claude Code Docs, 2026). Read the command above as shorthand for "point your OTEL config at Prometheus."
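
As a rough illustration of that environment-variable route, the sketch below builds the environment you might launch Claude Code with. CLAUDE_CODE_ENABLE_TELEMETRY and OTEL_METRICS_EXPORTER come from the text above, and OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL are standard OpenTelemetry settings. The helper name, the local collector endpoint, and the assumption that a collector forwards metrics to Prometheus are illustrative, so confirm the accepted values in the Claude Code docs.

```python
import os


def claude_code_telemetry_env(otlp_endpoint: str, base_env=None) -> dict[str, str]:
    """Return a copy of the environment with telemetry export switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",           # opt in to telemetry
        "OTEL_METRICS_EXPORTER": "otlp",               # send metrics over OTLP
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",         # standard OTLP setting
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,  # your collector's address
    })
    return env


# Assumed setup: an OpenTelemetry Collector listening locally on the usual
# OTLP gRPC port, configured separately to expose metrics to Prometheus.
env = claude_code_telemetry_env("http://localhost:4317", base_env={"PATH": "/usr/bin"})

# To launch with this environment you could pass it to subprocess, e.g.
#   subprocess.run(["claude"], env=env)
print(sorted(k for k in env if k.startswith(("OTEL_", "CLAUDE_"))))
```

Building the environment as a copy keeps the change scoped to the agent process instead of leaking telemetry settings into every child of your shell.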

Alerting Best Practices

  1. Alert on symptoms, not causes: "Success rate dropped," not "CPU usage high."
  2. Use dynamic thresholds: Static thresholds breed alert fatigue. Lean on anomaly detection.
  3. Include context: An alert should link straight to the relevant traces and logs.
  4. Escalation paths: Low-priority alerts to Slack, high-priority to PagerDuty.
  5. Review regularly: A weekly alert review to clear out false positives.
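
Points 2 and 4 can be sketched together: a rolling baseline flags values that stray too far from recent history, and a small router decides where the alert goes. This is one simple approach (mean plus or minus a multiple of the standard deviation), not a recommendation of a specific anomaly detector; the class name, window size, sigma multiplier, floor, and channel names are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flag a value as anomalous when it sits far outside recent history."""

    def __init__(self, window: int = 24, sigmas: float = 3.0,
                 min_samples: int = 6, floor: float = 0.01):
        self._history: deque[float] = deque(maxlen=window)
        self._sigmas = sigmas
        self._min_samples = min_samples
        self._floor = floor  # minimum spread, so flat history is not hypersensitive

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self._history) >= self._min_samples:
            centre = mean(self._history)
            spread = max(pstdev(self._history), self._floor)
            anomalous = abs(value - centre) > self._sigmas * spread
        # The value joins the history either way, so a sustained shift
        # eventually becomes the new normal.
        self._history.append(value)
        return anomalous


def route_alert(severity: str) -> str:
    """Send high-priority alerts to the pager and everything else to chat."""
    return "pagerduty" if severity == "high" else "slack"


success_rate = RollingBaseline(window=10, sigmas=3.0, min_samples=5)
readings = [0.95, 0.94, 0.96, 0.95, 0.95, 0.94, 0.60]
flags = [success_rate.observe(r) for r in readings]
print(flags, route_alert("high"))
```

Letting anomalous values enter the history is a deliberate simplification; a stricter detector would exclude them so one bad hour does not widen the band.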

Debugging with Traces

When an agent does something you did not expect, the trace is the first place to look:

# Find the failing trace
hermes traces list --since 1h --status failed

# Inspect the trace
hermes traces show trace_abc123 --format tree

# Compare with a successful trace
hermes traces compare trace_abc123 trace_def456

The comparison shows where two runs diverged: a different tool choice, different context, a different model response. That is usually enough to pin down the root cause. (As above, the exact hermes traces verbs are illustrative and not confirmed in the project's primary docs, but the workflow holds for any tracing tool you adopt.)
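
The comparison itself is simple enough to sketch without any particular tool. Assuming each trace has been flattened into an ordered list of (span name, detail) steps, which is a hypothetical representation chosen for illustration, the first point where a failed run and a successful run differ can be found like this:

```python
from itertools import zip_longest
from typing import Optional

Step = tuple[str, str]  # (span name, detail such as tool argument or model)


def first_divergence(failed: list[Step], passed: list[Step]
                     ) -> Optional[tuple[int, Optional[Step], Optional[Step]]]:
    """Return (index, failed_step, passed_step) at the first difference.

    A step is None when one trace ended before the other. Returns None
    when the two traces are identical.
    """
    for index, (bad, good) in enumerate(zip_longest(failed, passed)):
        if bad != good:
            return index, bad, good
    return None


failed_run = [
    ("file_read", "src/auth.ts"),
    ("file_write", "src/auth.ts"),
    ("file_write", "tests/auth.test.ts"),
]
passed_run = [
    ("file_read", "src/auth.ts"),
    ("file_write", "src/auth.ts"),
    ("test_run", "auth"),
    ("file_write", "tests/auth.test.ts"),
]
print(first_divergence(failed_run, passed_run))
```

Here the failed run skipped the test step before writing the test file, which is the kind of difference worth chasing first. For longer traces, a sequence diff such as Python's difflib would show every divergence rather than only the first.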

Monitoring is not optional for production agents. An agent you cannot see is a risk on your books. An agent you can see, with logs, traces, and metrics behind it, is an accountable member of the team.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.