Self-hosting Hermes Agent: Production deployment guide.

A step-by-step guide to deploying Hermes Agent in production, from hardware requirements to Honcho memory configuration to monitoring.

Briefing

Most teams meet an AI agent through someone else's cloud. You sign up, paste in a key, and your data flows off to a vendor you have to trust on faith. Hermes Agent flips that arrangement. It's built from the ground up to run on your own machines, it carries an MIT license, and most of it is plain Python (NousResearch/hermes-agent). That combination is the whole point: you keep the agent, the memory, and the data on hardware you control.

For an Australian business, that matters more than it sounds. When the agent runs on your infrastructure, customer conversations and internal records stay inside your network instead of crossing into someone else's. The catch is that "self-hosted" means you own the operations too: the servers, the database, the monitoring, and the 2am page when something falls over.

This guide walks through a production deployment end to end, from picking hardware to wiring up alerts. Where the official Hermes docs stop and practical operations advice begins, I'll say so. A fair bit of what follows is the deployment setup I'd recommend rather than a feature the project ships out of the box.

Hardware Requirements

Minimum (for personal use):

  • 4 CPU cores
  • 8GB RAM
  • 50GB storage
  • GPU optional (any modern GPU will do)

Recommended (for production):

  • 8+ CPU cores
  • 32GB RAM
  • 200GB SSD storage
  • GPU with 16GB+ VRAM for local model inference

High-Availability (for enterprise):

  • 3+ nodes with load balancing
  • 64GB+ RAM per node
  • PostgreSQL cluster for Honcho memory
  • Redis cluster for caching
  • Shared storage for model weights
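
As a rough preflight check, the single-node tiers above can be written down as thresholds and compared against a host's specs. This is a minimal sketch, not something Hermes ships: `Tier`, `TIERS`, and `classify_host` are hypothetical names, and the numbers are simply the ones from the minimum and recommended lists. The high-availability tier is about node count and clustering, so it is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cpu_cores: int
    ram_gb: int
    storage_gb: int


# Thresholds copied from the lists above, ordered from most to least demanding.
TIERS = [
    Tier("recommended", cpu_cores=8, ram_gb=32, storage_gb=200),
    Tier("minimum", cpu_cores=4, ram_gb=8, storage_gb=50),
]


def classify_host(cpu_cores: int, ram_gb: int, storage_gb: int) -> str:
    """Return the highest tier whose every threshold the host meets."""
    for tier in TIERS:
        if (cpu_cores >= tier.cpu_cores
                and ram_gb >= tier.ram_gb
                and storage_gb >= tier.storage_gb):
            return tier.name
    return "below minimum"


# A box with plenty of cores and disk but only 16GB RAM still lands in "minimum".
print(classify_host(cpu_cores=8, ram_gb=16, storage_gb=500))
```

The point of checking every threshold together is that one weak dimension (usually RAM) drags the whole host down a tier.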

Deployment Options

Docker (Recommended)

The least painful way to get a production setup running:

# Clone the repository
git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent

# Copy and edit configuration
cp .env.example .env
# Edit .env with your API keys and settings

# Start services
docker-compose up -d

A note on the clone URL: some write-ups point at nousresearch/hermes.git, which doesn't exist and will fail. The real repository is NousResearch/hermes-agent.

Docker and Docker Compose support are confirmed in the project README, and so is Honcho memory. The fuller stack below (Nginx out front, Prometheus collecting metrics) isn't a documented bundle that ships with Hermes; it's the production layout I'd run. Treat it as a recommended setup, not an official template:

  • Hermes Agent API server
  • Honcho memory service (PostgreSQL + vector store)
  • Redis cache
  • Nginx reverse proxy
  • Prometheus monitoring
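
Once a stack like the one above is up, it helps to confirm each piece is actually listening before you point traffic at it. The sketch below is an assumption-heavy illustration: the service names and host/port pairs are hypothetical (8000 matches the server port used later in this guide; 5432, 6379, and 9090 are the usual PostgreSQL, Redis, and Prometheus defaults), and `tcp_probe` and `unreachable` are helper names I made up.

```python
import socket
from typing import Callable

# Hypothetical host/port pairs for the recommended stack; adjust to your compose file.
SERVICES = {
    "hermes-api": ("127.0.0.1", 8000),
    "postgres": ("127.0.0.1", 5432),
    "redis": ("127.0.0.1", 6379),
    "nginx": ("127.0.0.1", 443),
    "prometheus": ("127.0.0.1", 9090),
}


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(services: dict, probe: Callable[[str, int], bool] = tcp_probe) -> list:
    """Return the names of services whose port does not accept connections."""
    return [name for name, (host, port) in services.items() if not probe(host, port)]
```

Calling `unreachable(SERVICES)` after `docker-compose up -d` returns an empty list when everything is listening. An open port only proves the process is up, not that it is healthy, so treat this as a first check rather than a health check.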

Kubernetes

When you need to scale across nodes:

# Apply manifests
kubectl apply -f k8s/

# Or use Helm
helm install hermes ./helm/hermes \
  --set openai.apiKey=your-key \
  --set replicaCount=3

The Helm capabilities below are standard Kubernetes patterns rather than confirmed features of an official Hermes chart, so plan to assemble them yourself:

  • Horizontal pod autoscaling
  • Persistent volume claims for memory storage
  • Configurable resource limits
  • Ingress with TLS termination
  • Pod disruption budgets for availability
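
For horizontal pod autoscaling in particular, it is worth knowing the arithmetic before you pick targets. The Kubernetes documentation describes the core rule as desired replicas = ceil(current replicas × current metric ÷ target metric). The sketch below applies that rule with a min/max clamp; `desired_replicas` is a hypothetical helper, and it deliberately ignores the tolerance band and stabilisation windows the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Apply ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas averaging 90% CPU against a 60% target scale out to 5.
print(desired_replicas(3, current_metric=90, target_metric=60))
```

Because the result is rounded up, a modest overshoot on a small deployment still adds a whole pod, which is why the max clamp matters for cost.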

Bare Metal

When you want full control of the box:

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-prod.txt

# Configure environment
export HERMES_LLM_PROVIDER=openai
export HERMES_API_KEY=your-key
export HERMES_MEMORY_URL=postgresql://...

# Start the server
python -m hermes.server --port 8000 --workers 4
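
On bare metal nothing stops you starting the server with half the configuration missing, so I'd check the environment first. This is a minimal sketch that validates the three variables exported above; `Settings` and `load_settings` are hypothetical names of my own, not part of Hermes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    provider: str
    api_key: str
    memory_url: str


def load_settings(env=None) -> Settings:
    """Read the three variables used above and fail fast if any is missing or empty."""
    env = os.environ if env is None else env
    names = ("HERMES_LLM_PROVIDER", "HERMES_API_KEY", "HERMES_MEMORY_URL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(env[names[0]], env[names[1]], env[names[2]])
```

Run `load_settings()` in a wrapper script before launching the server, so a missing key surfaces as one clear error instead of a failure on the first request.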

Honcho Memory Configuration

Honcho is the memory layer that sets Hermes apart. The README lists "Honcho dialectic user modeling," and Honcho, built by Plastic Labs, keeps a running model of each user so the agent remembers who it's talking to (Hermes Agent Honcho docs). A self-hosted Honcho server is supported.

The specific production stack below isn't spelled out in the Hermes Honcho docs. Honcho is an open-source FastAPI server, so a PostgreSQL/pgvector backend is a reasonable fit, but the named vector stores, the Redis layer, and the retrieval target are deployment recommendations rather than documented product facts (Honcho repository):

PostgreSQL: The primary store for structured memory data. A managed PostgreSQL service (AWS RDS, GCP Cloud SQL) buys you reliability without running the database yourself.

Vector Store: For semantic memory search. pgvector (a PostgreSQL extension), Pinecone, or Weaviate all work.

Redis: Caches frequent memory queries. In my testing this can pull retrieval down into the tens of milliseconds, though that number depends on your hardware and load, not on anything Hermes guarantees.
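
The caching idea here is the ordinary cache-aside pattern: look in the cache, and only go to the memory store on a miss. The sketch below shows the shape of it using an in-process dictionary as a stand-in for Redis, so it runs without any services. `TTLCache` and `cached_lookup` are hypothetical names, and nothing here reflects how Honcho or Hermes cache internally.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """In-process stand-in for Redis: stores values with an expiry time."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key: str, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


def cached_lookup(cache: TTLCache, key: str, fetch: Callable[[str], object]):
    """Cache-aside: return the cached value, or fetch it, store it, and return it."""
    value = cache.get(key)
    if value is None:  # note: a fetched None is never cached in this sketch
        value = fetch(key)
        cache.set(key, value)
    return value
```

The TTL is the knob to think about: too long and the agent works from stale memory, too short and the cache stops saving you anything.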

Backup Strategy: Honcho memory holds everything Hermes knows about your users. Back it up daily and automatically, with point-in-time recovery, and actually test a restore before you need one.

LLM Provider Setup

Hermes is provider-agnostic. It reaches a wide range of models through Nous Portal and OpenRouter, and OpenAI and Anthropic are both referenced in the project (Hermes Agent README):

OpenAI: Set HERMES_LLM_PROVIDER=openai and supply your API key. Strong on capability; costs climb with usage.

Anthropic: Set HERMES_LLM_PROVIDER=anthropic. Claude models are good at reasoning and tend to behave safely.

Local Models: Running through LocalAI or Ollama isn't named explicitly in the README, but the OpenRouter and "any model" support makes it plausible. The trade is privacy and lower cost against some loss of capability.

Multi-Provider: Send different jobs to different providers based on what each one is good at and what it costs. Hard queries go to a frontier model like GPT-4; routine tasks run on a local model.
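
To make the multi-provider idea concrete, here is a deliberately crude router. It is a sketch under stated assumptions: `pick_provider` is a hypothetical function, the "long prompt or needs tools" heuristic is mine, and `"local"` is a placeholder label rather than a provider value Hermes documents.

```python
def pick_provider(prompt: str, needs_tools: bool = False,
                  long_prompt_chars: int = 2000) -> str:
    """Route hard-looking requests to a frontier model and routine ones to a local model."""
    # Assumption: long prompts and tool use are a rough proxy for difficulty.
    if needs_tools or len(prompt) > long_prompt_chars:
        return "openai"
    return "local"


print(pick_provider("What are your opening hours?"))
```

A real router would want better signals than prompt length, but even a rule this simple makes the cost trade-off visible and testable.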

Security Considerations

API Authentication: Put API keys or OAuth2 in front of every endpoint. Rotate the keys on a schedule.
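
A minimal sketch of key checking that supports rotation, assuming you store SHA-256 digests of issued keys rather than the keys themselves. `hash_key` and `is_authorised` are hypothetical helpers, not Hermes functions; the comparison uses the standard library's `hmac.compare_digest` to avoid leaking information through timing.

```python
import hashlib
import hmac


def hash_key(api_key: str) -> str:
    """Store only a SHA-256 digest of each key, never the key itself."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def is_authorised(presented_key: str, stored_digests: set) -> bool:
    """Check a presented key against the stored digests in constant time."""
    presented = hash_key(presented_key)
    # Compare against every digest so timing does not reveal which one matched.
    matches = [hmac.compare_digest(presented, digest) for digest in stored_digests]
    return any(matches)
```

During a rotation window, keep both the old and new digest in the set, then remove the old one once clients have switched.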

Network Isolation: Keep Hermes on a private network, reachable only through a VPN or bastion host.

Tool Permissions: Go through the tool list and lock it down. Turn off the dangerous ones (file deletion, shell execution) unless you have a clear reason to keep them.
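
The safest way to lock tools down is deny-by-default: a tool runs only if it is on an explicit allowlist. The sketch below shows that gate in isolation. The tool names, `ToolNotAllowed`, and `run_tool` are all hypothetical; Hermes has its own tool configuration, which this does not attempt to reproduce.

```python
ALLOWED_TOOLS = frozenset({"web_search", "read_file"})  # hypothetical tool names


class ToolNotAllowed(Exception):
    """Raised when a tool that is not on the allowlist is requested."""


def run_tool(name: str, registry: dict, allowed: frozenset, *args, **kwargs):
    """Run a tool only if it is on the allowlist; everything else is denied."""
    if name not in allowed:
        raise ToolNotAllowed(name)
    return registry[name](*args, **kwargs)
```

With an allowlist, a newly added tool stays off until someone decides to enable it, which is the behaviour you want in production.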

Input Validation: Clean every bit of user input. Prompt injection is the obvious attack here, and unsanitised input is how it gets in.
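
As a baseline, I'd at least strip invisible characters and cap the length before input reaches the agent. This sketch does only that: `clean_input` is a hypothetical helper and the 4000-character cap is an arbitrary assumption. It will not stop a determined prompt injection on its own, so pair it with the tool restrictions above.

```python
import unicodedata


def clean_input(text: str, max_chars: int = 4000) -> str:
    """Drop control and format characters (keeping newline and tab) and cap the length."""
    kept = [
        ch for ch in text
        # Cc = control characters, Cf = invisible format characters such as zero-width spaces
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    ]
    return "".join(kept)[:max_chars].strip()
```

Removing format characters also removes the zero-width joiners some emoji rely on, so loosen that rule if your users need them.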

Audit Logging: Record every action the agent takes, tied to a user. You'll want it for compliance, and you'll want it even more the day you're debugging something strange.
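
One workable shape for an audit trail is a JSON line per action, written through Python's standard `logging` module. The field names, the `hermes.audit` logger name, and both functions below are assumptions for illustration, not a format Hermes defines.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("hermes.audit")  # hypothetical logger name


def audit_record(user_id: str, action: str, detail: dict, now=None) -> str:
    """Build one JSON line describing an agent action taken for a user."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {"ts": now.isoformat(), "user_id": user_id, "action": action, "detail": detail},
        sort_keys=True,
    )


def log_action(user_id: str, action: str, detail: dict) -> None:
    """Write the record to the audit logger; route that logger to durable storage."""
    audit_log.info(audit_record(user_id, action, detail))
```

Keeping the record builder separate from the logging call makes the format easy to test, and UTC timestamps save you from timezone arguments later.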

Monitoring

Prometheus isn't mentioned in the Hermes README, so the metrics below describe the monitoring setup I'd add rather than a built-in export. Once you wire Hermes into Prometheus, the signals worth tracking are:

  • Request latency and throughput
  • Tool execution success/failure rates
  • Memory retrieval performance
  • LLM token usage and costs
  • Error rates by endpoint

Build Grafana dashboards on top of those, and set alerts for:

  • P99 latency > 2 seconds
  • Error rate > 1%
  • Memory store connection failures
  • LLM API quota exhaustion
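
In Prometheus these thresholds would normally live in alerting rules, but the logic is easy to see in plain code. The sketch below evaluates the first two alerts over a window of samples using a nearest-rank percentile. `percentile` and `firing_alerts` are hypothetical helpers, and the alert names are placeholders.

```python
import math


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def firing_alerts(latencies_s, errors: int, requests: int) -> list:
    """Return the names of alerts whose thresholds from the list above are exceeded."""
    alerts = []
    if percentile(latencies_s, 99) > 2.0:  # P99 latency > 2 seconds
        alerts.append("p99_latency")
    if requests and errors / requests > 0.01:  # error rate > 1%
        alerts.append("error_rate")
    return alerts
```

P99 is sensitive to sample size: with only a few dozen requests in a window, a single slow call trips the alert, so choose the evaluation window with your traffic volume in mind.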

Scaling

As traffic grows, work through these in order:

  1. Scale the API servers behind a load balancer
  2. Scale Honcho memory with read replicas
  3. Cache aggressively with Redis
  4. Use local models for high-volume, low-complexity tasks
  5. Implement rate limiting per user
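
For the last step, a token bucket is a common way to rate limit per user: each user can burst up to a capacity, and tokens refill at a steady rate. This is a minimal single-process sketch with hypothetical class names. Across several API servers the counters would need to live somewhere shared, such as Redis.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PerUserLimiter:
    """Keep one independent bucket per user ID."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(self._rate, self._capacity, self._clock)
        return self._buckets[user_id].allow()
```

Call `limiter.allow(user_id)` at the top of each request handler and return HTTP 429 when it comes back `False`.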

Troubleshooting

A few problems you'll likely hit, and where to start:

  • High latency: Check Honcho query performance, switch on Redis caching, and look at whether a faster model would help.
  • Memory errors: Grow the PostgreSQL connection pool, add RAM, or bring in read replicas.
  • LLM rate limits: Queue requests, add a fallback provider, or shift load to local models.
  • Tool failures: Recheck tool permissions, confirm API keys, and make sure the network path is open.
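
The fallback-provider fix for LLM rate limits can be sketched as a simple ordered chain. Everything named here is hypothetical: `RateLimited` stands in for whatever exception your provider client raises on quota errors, and each provider is just a callable that takes a prompt and returns text.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Stand-in for the error a provider client raises when quota is exhausted."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider in order, moving on only when one is rate limited."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except RateLimited as exc:
            last_error = exc  # remember why we fell through, then try the next one
    raise RuntimeError("all providers are rate limited") from last_error
```

Only rate-limit errors trigger the fallback here. Other failures propagate immediately, so a genuine bug isn't hidden behind a silent switch to another model.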

Configured properly, Hermes Agent holds up in production and gives you a personalised assistant that gets sharper as it learns your users. The project's popularity says people are paying attention: as of mid-2026 the repository reportedly sits in the high-100-thousands of stars, well above older figures still floating around (star history). Just don't read a star count as proof it'll survive your production load. That part is on your deployment.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
