How-to Guide

How to create an agent heartbeat system.

Build a health monitoring and liveness detection system for long-running AI agents using heartbeat patterns, timeout detection, and automatic recovery mechanisms.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

Long-running AI agents can hang, crash, or get stuck in infinite loops with no warning. A heartbeat system sends regular health signals from each agent to a central monitor, which spots missing beats, flags the agents that have failed, and kicks off automatic recovery. This guide builds the whole thing with Redis, Node.js, and policies you can tune.

Key takeaways

  • Heartbeat interval: 30 seconds default; critical agents every 10s
  • Timeout: 3x heartbeat interval before declaring failure
  • State tracking: Redis for distributed state; in-memory for single-node
  • Recovery: Automatic restart, fallback agent, or human alert
  • Graceful shutdown: Deregister on SIGTERM; don't false-positive on restarts
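The numbers in these takeaways are easier to keep consistent when they come from one place. Here is a minimal sketch of a helper that derives them. The `HeartbeatPolicy` shape, the `policyFor` name, and the `critical` flag are illustrative assumptions, not part of any library; the 4x key expiry mirrors the TTL the client code in this guide sets.

```typescript
// Hypothetical helper: derives heartbeat timings from the defaults listed above.
interface HeartbeatPolicy {
  intervalMs: number;    // how often the agent sends a beat
  timeoutMs: number;     // silence longer than this means "failed"
  keyTtlSeconds: number; // Redis key expiry, kept longer than the timeout
}

const TIMEOUT_MULTIPLIER = 3; // 3x the interval before declaring failure
const TTL_MULTIPLIER = 4;     // key expires one interval after the timeout

export function policyFor(critical: boolean): HeartbeatPolicy {
  const intervalMs = critical ? 10_000 : 30_000;
  return {
    intervalMs,
    timeoutMs: intervalMs * TIMEOUT_MULTIPLIER,
    keyTtlSeconds: Math.ceil((intervalMs * TTL_MULTIPLIER) / 1000),
  };
}

console.log(policyFor(false)); // { intervalMs: 30000, timeoutMs: 90000, keyTtlSeconds: 120 }
console.log(policyFor(true));  // { intervalMs: 10000, timeoutMs: 30000, keyTtlSeconds: 40 }
```

Deriving the timeout and TTL from the interval means a change to one agent's beat rate cannot silently leave its timeout behind.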

Analysis

An AI agent that has stopped responding looks exactly like one that is hard at work. That is the problem. Both sit there silently, and by the time someone notices the queue isn't moving, you've already lost an hour, a batch of customer requests, or an overnight job that was supposed to be done by morning.

For a single chatbot answering questions, this barely matters. You restart it and move on. But teams are now running fleets of agents that talk to each other, hand off tasks, and chew through real work around the clock. When one of them quietly dies, the failure spreads, and nobody finds out until a person trips over the symptom.

The fix borrows an old idea from how reliable systems have always stayed alive: a heartbeat. Each agent sends a small "I'm still here" signal on a regular beat. A monitor watches for those beats. When one goes quiet for too long, the monitor declares the agent dead and does something about it, all without waking anyone up at 3am for a problem the system could have fixed itself.

This guide walks through building that system end to end, in Node.js with Redis, including the recovery logic and a dashboard so you can see what every agent is doing.

Prerequisites

  • Node.js 20+ or Python 3.11+
  • Redis 7+ for distributed state
  • Docker for containerised agents
  • A process manager (systemd, PM2, or Kubernetes)

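Node.js 20 is an LTS release line, and the official redis/node-redis client works natively with async/await on it. Python 3.11 is fine too if that's your stack, though all the code in this guide is TypeScript on Node.js. For distributed state, use Redis 7+. Redis 7.0 added NX/XX/GT/LT options to EXPIRE for finer control over key expiry, although the code below only needs hashes, pub/sub, and the plain form of EXPIRE.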

Step-by-Step Framework

Step 1: Define the Heartbeat Protocol

Start with the shape of the message. A heartbeat isn't just a ping; it carries enough context for the monitor to make a real decision, so it includes the agent's status, what it's working on, and a handful of metrics.

// heartbeat/types.ts
interface HeartbeatMessage {
  agentId: string;
  agentType: string;
  timestamp: number;        // Unix timestamp (ms)
  sequence: number;         // Incrementing sequence number
  status: 'healthy' | 'degraded' | 'busy' | 'recovering';
  metrics: AgentMetrics;
  currentTask?: string;     // What the agent is working on
  taskProgress?: number;    // 0-100
}

interface AgentMetrics {
  cpuPercent: number;
  memoryMB: number;
  activeTasks: number;
  queueDepth: number;
  tokensUsedThisHour: number;
  errorsLast5Min: number;
}

interface HealthStatus {
  agentId: string;
  state: 'healthy' | 'missing' | 'failed' | 'stopped';
  lastHeartbeat: number;
  missedBeats: number;
  uptimeSeconds: number;
}
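TypeScript interfaces vanish at runtime, so a message arriving over the wire is only as trustworthy as whatever sent it. Below is a minimal sketch of a runtime check the monitor could run before trusting a payload. The `parseHeartbeat` name and the `HeartbeatCore` subset are illustrative assumptions; it validates only the top-level fields and leaves `metrics` unchecked.

```typescript
// Hypothetical runtime check for messages arriving on the heartbeat channel.
// Field names follow the HeartbeatMessage interface; only a subset is validated.
const KNOWN_STATUSES = ['healthy', 'degraded', 'busy', 'recovering'] as const;
type KnownStatus = (typeof KNOWN_STATUSES)[number];

interface HeartbeatCore {
  agentId: string;
  agentType: string;
  timestamp: number;
  sequence: number;
  status: KnownStatus;
}

// Returns the parsed message, or null if the payload is not a usable heartbeat.
export function parseHeartbeat(raw: string): HeartbeatCore | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== 'object' || value === null) return null;

  const v = value as Record<string, unknown>;
  const ok =
    typeof v.agentId === 'string' && v.agentId.length > 0 &&
    typeof v.agentType === 'string' &&
    typeof v.timestamp === 'number' && Number.isFinite(v.timestamp) &&
    typeof v.sequence === 'number' && Number.isInteger(v.sequence) &&
    typeof v.status === 'string' &&
    (KNOWN_STATUSES as readonly string[]).indexOf(v.status) !== -1;

  return ok ? (v as unknown as HeartbeatCore) : null;
}
```

Returning null instead of throwing lets the caller drop a bad message and carry on, which suits a monitor that has to outlive the agents it watches.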

Step 2: Build the Agent Heartbeat Client

This is the piece that lives inside each agent. It registers on startup, fires a heartbeat on a timer, and tidies up after itself when the process is told to stop.

// heartbeat/client.ts
import { createClient, RedisClientType } from 'redis';
import { HeartbeatMessage, AgentMetrics } from './types';
import * as os from 'os';
import * as process from 'process';

export class HeartbeatClient {
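  private redis: RedisClientType;
  private agentId: string;
  private agentType: string;
  private intervalMs: number;
  private heartbeatTimer?: ReturnType<typeof setInterval>; // portable across @types/node versions
  private sequence = 0;
  private startTime = Date.now();

  // Work-tracking state reported in each heartbeat.
  // Assumed fields: the agent's own code is expected to keep these up to date.
  currentTask?: string;
  activeTasks = 0;
  queue: unknown[] = [];
  hourlyTokenUsage = 0;
  recentErrors = 0;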

  constructor(config: {
    redisUrl: string;
    agentId: string;
    agentType: string;
    intervalMs?: number;
  }) {
    this.redis = createClient({ url: config.redisUrl });
    this.agentId = config.agentId;
    this.agentType = config.agentType;
    this.intervalMs = config.intervalMs ?? 30000; // default: one beat every 30s
  }

  async connect(): Promise<void> {
    await this.redis.connect();

    // Register agent on startup
    await this.redis.hSet(`agent:${this.agentId}`, {
      registeredAt: Date.now().toString(),
      type: this.agentType,
      host: os.hostname(),
      pid: process.pid.toString(),
      status: 'starting'
    });

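    // Start heartbeat loop; catch errors so a Redis hiccup can't become an unhandled rejection
    this.heartbeatTimer = setInterval(() => {
      this.sendHeartbeat().catch((err) => console.error('Heartbeat failed:', err));
    }, this.intervalMs);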

    // Graceful shutdown
    process.on('SIGTERM', () => this.shutdown());
    process.on('SIGINT', () => this.shutdown());
  }

  private async sendHeartbeat(): Promise<void> {
    const metrics = await this.collectMetrics();

    const heartbeat: HeartbeatMessage = {
      agentId: this.agentId,
      agentType: this.agentType,
      timestamp: Date.now(),
      sequence: ++this.sequence,
      status: this.determineStatus(metrics),
      metrics,
      currentTask: this.currentTask
    };

    // Publish to Redis
    await this.redis.publish('heartbeats', JSON.stringify(heartbeat));

    // Also store in hash for queries
    await this.redis.hSet(`agent:${this.agentId}`, {
      lastHeartbeat: heartbeat.timestamp.toString(),
      sequence: heartbeat.sequence.toString(),
      status: heartbeat.status,
      cpuPercent: metrics.cpuPercent.toString(),
      memoryMB: metrics.memoryMB.toString()
    });

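    // Refresh the TTL (4x the interval): if the agent dies, the key expires on its own.
    // Math.ceil keeps the TTL above zero even for sub-second intervals.
    await this.redis.expire(`agent:${this.agentId}`, Math.ceil(this.intervalMs * 4 / 1000));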
  }

  private async collectMetrics(): Promise<AgentMetrics> {
    const usage = process.memoryUsage();

    return {
      cpuPercent: await this.getCPUUsage(),
      memoryMB: Math.round(usage.heapUsed / 1024 / 1024),
      activeTasks: this.activeTasks,
      queueDepth: this.queue.length,
      tokensUsedThisHour: this.hourlyTokenUsage,
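      errorsLast5Min: this.recentErrors
    };
  }

  // CPU time used since the previous sample, as a percentage of one core
  private lastCpu = process.cpuUsage();
  private lastCpuAt = Date.now();

  private async getCPUUsage(): Promise<number> {
    const delta = process.cpuUsage(this.lastCpu); // microseconds since last sample
    const elapsedMs = Math.max(Date.now() - this.lastCpuAt, 1);
    this.lastCpu = process.cpuUsage();
    this.lastCpuAt = Date.now();
    return Math.round(((delta.user + delta.system) / 1000 / elapsedMs) * 100);
  }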

  private determineStatus(metrics: AgentMetrics): HeartbeatMessage['status'] {
    if (metrics.errorsLast5Min > 10) return 'recovering';
    if (metrics.memoryMB > 1000) return 'degraded';
    if (metrics.queueDepth > 50) return 'busy';
    return 'healthy';
  }

  private async shutdown(): Promise<void> {
    console.log('Shutting down gracefully...');
    if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);

    await this.redis.hSet(`agent:${this.agentId}`, {
      status: 'stopped',
      stoppedAt: Date.now().toString()
    });

    await this.redis.quit();
    process.exit(0);
  }
}

Two details earn their keep here. The agent both publishes to a pub/sub channel and writes its latest state into a Redis hash, so the monitor gets live updates and anything else can query the current picture on demand. And the expire call sets a TTL on the agent's key: if the process dies outright, Redis deletes the key for you, so a dead agent leaves no stale record behind.
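Because the latest state lives in a hash, any other process can read it without subscribing to the channel. Here is a minimal sketch of such a reader. `readAgentSnapshot`, `AgentSnapshot`, and the narrow `HashReader` interface are illustrative names. The interface is shaped so that a connected node-redis client could be passed in, on the assumption that its `hGetAll` resolves to an object of string fields and to an empty object when the key does not exist.

```typescript
// Narrow view of the Redis client: only the one call this reader needs.
interface HashReader {
  hGetAll(key: string): Promise<Record<string, string>>;
}

interface AgentSnapshot {
  agentId: string;
  status: string;
  lastHeartbeat: number | null; // Unix ms, or null if no beat has been stored yet
  ageMs: number | null;         // time since the last stored beat
}

// Returns null when the key is gone: never registered, or expired after the agent died.
export async function readAgentSnapshot(
  redis: HashReader,
  agentId: string,
  now: number = Date.now()
): Promise<AgentSnapshot | null> {
  const fields = await redis.hGetAll(`agent:${agentId}`);
  if (Object.keys(fields).length === 0) return null;

  const last = fields.lastHeartbeat ? Number(fields.lastHeartbeat) : null;
  return {
    agentId,
    status: fields.status ?? 'unknown',
    lastHeartbeat: last,
    ageMs: last === null ? null : now - last,
  };
}
```

A missing key and a stale key mean different things here: null says the record has expired or never existed, while a large `ageMs` says the agent has gone quiet but its TTL has not run out yet.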

Step 3: Build the Heartbeat Monitor

The monitor is the watcher. It subscribes to the heartbeat channel, tracks every agent's last known state, and runs a periodic sweep to catch the ones that have gone quiet.

// heartbeat/monitor.ts
import { createClient, RedisClientType } from 'redis';
import { HealthStatus, HeartbeatMessage } from './types';

interface MonitorConfig {
  redisUrl: string;
  checkIntervalMs: number;
  missedBeatsThreshold: number;
  onAgentFailed: (agentId: string, status: HealthStatus) => void;
  onAgentRecovered: (agentId: string) => void;
}

export class HeartbeatMonitor {
  private redis: RedisClientType;
  private subscriber: RedisClientType;
  private config: MonitorConfig;
  private agentStates: Map<string, HealthStatus> = new Map();

  constructor(config: MonitorConfig) {
    this.config = config;
    this.redis = createClient({ url: config.redisUrl });
    this.subscriber = createClient({ url: config.redisUrl });
  }

  async start(): Promise<void> {
    await this.redis.connect();
    await this.subscriber.connect();

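    // Subscribe to heartbeat channel
    await this.subscriber.subscribe('heartbeats', (message) => {
      try {
        const heartbeat: HeartbeatMessage = JSON.parse(message);
        this.processHeartbeat(heartbeat);
      } catch (err) {
        // A malformed message must not take the monitor down
        console.error('Ignoring malformed heartbeat:', err);
      }
    });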

    // Start periodic check for missed beats
    setInterval(() => this.checkMissedBeats(), this.config.checkIntervalMs);

    console.log('Heartbeat monitor started');
  }

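  // When the monitor first saw the current run of each agent (agent-side timestamp, ms)
  private firstSeen = new Map<string, number>();

  private processHeartbeat(heartbeat: HeartbeatMessage): void {
    const existing = this.agentStates.get(heartbeat.agentId);

    if (existing && existing.state === 'failed') {
      // Agent recovered
      console.log(`Agent ${heartbeat.agentId} recovered!`);
      this.config.onAgentRecovered(heartbeat.agentId);
    }

    // A restarted process begins its sequence at 1 again, so treat that as a fresh run
    if (!this.firstSeen.has(heartbeat.agentId) || heartbeat.sequence === 1) {
      this.firstSeen.set(heartbeat.agentId, heartbeat.timestamp);
    }
    const firstSeenAt = this.firstSeen.get(heartbeat.agentId)!;

    this.agentStates.set(heartbeat.agentId, {
      agentId: heartbeat.agentId,
      state: 'healthy',
      lastHeartbeat: heartbeat.timestamp,
      missedBeats: 0,
      // Approximate uptime: time between the first and latest beats of this run
      uptimeSeconds: Math.floor((heartbeat.timestamp - firstSeenAt) / 1000)
    });
  }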

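  private async checkMissedBeats(): Promise<void> {
    const now = Date.now();
    const expectedInterval = 30000; // 30s; must match the clients' intervalMs

    for (const [agentId, state] of this.agentStates) {
      // Failures and clean stops are handled once; a fresh heartbeat resets the state
      if (state.state === 'failed' || state.state === 'stopped') continue;

      // Count one missed beat for each full interval of silence
      const missedBeats = Math.floor((now - state.lastHeartbeat) / expectedInterval);
      if (missedBeats === state.missedBeats) continue; // nothing new since the last sweep
      state.missedBeats = missedBeats;

      if (missedBeats < this.config.missedBeatsThreshold) {
        state.state = 'missing';
        console.warn(`Agent ${agentId} missed ${missedBeats} beats`);
        continue;
      }

      // The client writes status 'stopped' to its hash on graceful shutdown,
      // so a clean stop is not treated as a failure
      const status = await this.redis.hGet(`agent:${agentId}`, 'status').catch(() => undefined);
      if (this.agentStates.get(agentId) !== state) continue; // a beat arrived while we waited
      if (status === 'stopped') {
        state.state = 'stopped';
        continue;
      }

      state.state = 'failed';
      console.error(`Agent ${agentId} declared FAILED after ${missedBeats} missed beats`);
      this.config.onAgentFailed(agentId, state);
    }
  }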

  getAgentStates(): HealthStatus[] {
    return Array.from(this.agentStates.values());
  }
}

Notice the two-step escalation. A quiet agent is first marked missing, not failed. Only after it crosses the missed-beats threshold does the monitor call it failed and trigger the onAgentFailed callback. That buffer is what keeps a single late or lost heartbeat from setting off your recovery machinery.
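The escalation is easier to reason about, and to unit test, as a pure function of how long an agent has been silent. The sketch below is one interpretation that counts a missed beat for each full interval of silence; `classifyLiveness` and the `Liveness` type are illustrative names, and the monitor's own bookkeeping may differ in detail.

```typescript
type Liveness = 'healthy' | 'missing' | 'failed';

// Hypothetical pure helper: maps a period of silence to a liveness state.
export function classifyLiveness(
  elapsedMs: number,            // time since the last heartbeat
  intervalMs: number,           // expected gap between beats
  missedBeatsThreshold: number  // missed beats at which the agent counts as failed
): { state: Liveness; missedBeats: number } {
  const missedBeats = Math.max(0, Math.floor(elapsedMs / intervalMs));
  if (missedBeats >= missedBeatsThreshold) return { state: 'failed', missedBeats };
  if (missedBeats >= 1) return { state: 'missing', missedBeats };
  return { state: 'healthy', missedBeats };
}

console.log(classifyLiveness(12_000, 30_000, 3)); // { state: 'healthy', missedBeats: 0 }
console.log(classifyLiveness(45_000, 30_000, 3)); // { state: 'missing', missedBeats: 1 }
console.log(classifyLiveness(95_000, 30_000, 3)); // { state: 'failed', missedBeats: 3 }
```

With a 30-second interval and a threshold of 3, an agent spends the window from 30 to 90 seconds of silence in the missing state before it is declared failed.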

Step 4: Implement Recovery Strategies

When the monitor declares an agent dead, this is what runs. It tries the cheapest fix first and only escalates to a human when the automated options have all failed.

// heartbeat/recovery.ts
import { HealthStatus } from './types';
import { execFileSync } from 'child_process';

export class RecoveryManager {
  async recover(agentId: string, status: HealthStatus): Promise<void> {
    // Strategy 1: Restart via Docker (assumes the container is named after the agent ID)
    try {
      console.log(`Attempting Docker restart for ${agentId}...`);
      // execFileSync passes arguments directly, so agentId is never interpreted by a shell
      execFileSync('docker', ['restart', agentId]);
      return;
    } catch (e) {
      console.log('Docker restart failed, trying next strategy');
    }

    // Strategy 2: Spawn replacement container
    try {
      console.log(`Spawning replacement for ${agentId}...`);
      execFileSync('docker', [
        'run', '-d',
        '--name', `${agentId}-replacement`,
        '-e', `AGENT_ID=${agentId}`,
        '-e', 'REDIS_URL=redis://redis:6379',
        'my-agent-image:latest'
      ]);
      return;
    } catch (e) {
      console.log('Container spawn failed');
    }

    // Strategy 3: Alert human
    await this.sendAlert({
      severity: 'critical',
      message: `Agent ${agentId} has failed and automatic recovery was unsuccessful.`,
      lastHeartbeat: new Date(status.lastHeartbeat).toISOString(),
      actionRequired: 'Manual intervention needed'
    });
  }

  private async sendAlert(alert: object): Promise<void> {
    // Send to PagerDuty, Slack, etc.
    await fetch('https://hooks.slack.com/services/YOUR/WEBHOOK/URL', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: JSON.stringify(alert, null, 2) })
    });
  }
}

The ladder runs restart, then replace, then alert. Most dead agents come back on a plain docker restart. If the container itself is broken, you spin up a fresh replacement. Only when both fail does a person get paged, which means your on-call team hears about the failures that actually need a human, not the ones the system already handled. Swap the placeholder webhook for your real Slack or PagerDuty endpoint before you ship.
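The same ladder can be written as data rather than as nested try/catch blocks, which makes it simple to reorder steps or add one. This is a sketch under stated assumptions: `RecoveryStep` and `runRecoveryLadder` are illustrative names, and the steps in the usage example are stand-ins for the real Docker and alerting calls.

```typescript
// A recovery step resolves on success and throws (or rejects) on failure.
interface RecoveryStep {
  label: string;
  run: () => Promise<void>;
}

// Tries each step in order and stops at the first success.
// Returns the label of the step that worked, or null if every step failed.
export async function runRecoveryLadder(steps: RecoveryStep[]): Promise<string | null> {
  for (const step of steps) {
    try {
      await step.run();
      return step.label;
    } catch (err) {
      console.warn(`Recovery step "${step.label}" failed:`, err);
    }
  }
  return null;
}

// Usage with stand-in steps (real ones would call Docker and the alert webhook)
runRecoveryLadder([
  { label: 'restart', run: async () => { throw new Error('container not found'); } },
  { label: 'replace', run: async () => {} },
  { label: 'alert', run: async () => {} },
]).then((used) => console.log(`Recovered via: ${used}`)); // Recovered via: replace
```

Putting the alert last in the list keeps the rule from the ladder intact: a person only hears about it when every cheaper step has already failed.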

Step 5: Dashboard Endpoint

Last, expose the state over HTTP so you can see the fleet at a glance, one summary view and one per-agent lookup.

// heartbeat/dashboard.ts
import { HeartbeatMonitor } from './monitor';
import { FastifyInstance } from 'fastify';

export function registerDashboardRoutes(
  app: FastifyInstance,
  monitor: HeartbeatMonitor
) {
  app.get('/health/agents', async () => {
    const states = monitor.getAgentStates();
    return {
      total: states.length,
      healthy: states.filter(s => s.state === 'healthy').length,
      missing: states.filter(s => s.state === 'missing').length,
      failed: states.filter(s => s.state === 'failed').length,
      agents: states
    };
  });

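  app.get('/health/agents/:id', async (req, reply) => {
    const { id } = req.params as { id: string };
    const agent = monitor.getAgentStates().find(s => s.agentId === id);
    if (!agent) {
      // Answer with a real 404 so clients can tell "unknown agent" from a healthy one
      reply.code(404);
      return { error: 'Agent not found' };
    }
    return agent;
  });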
}

Do/Don't

Do | Don't
Use Redis pub/sub for heartbeats | Use polling for heartbeat detection
Set 3x multiplier for timeout threshold | Use 1x (network jitter causes false positives)
Implement graceful shutdown with deregistration | Let agents disappear without cleanup
Auto-restart before alerting humans | Wake engineers for recoverable failures
Include metrics in every heartbeat | Send just a "ping" with no context

Conclusion

If you're running agents in production, a heartbeat system isn't optional. With a 30-second beat and a 3x timeout, an agent is declared failed after roughly 90 seconds of silence, plus the monitor's sweep delay. That is long enough to ride out ordinary network jitter without crying wolf. Redis pub/sub carries the signal, the monitor keeps the score, and the recovery manager handles restarts before anyone has to. Add the dashboard for visibility, and your agents start behaving like services you can actually trust to run unattended.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

Want help applying this? Explore AI agent design systems.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.