Agent Sandboxes: Isolating AI Agents for Safety.

As agents gain filesystem, network, and shell access, sandboxing becomes non-negotiable. We analyse container-based, VM-based, and capability-based isolation strategies, along with approval gates, and compare them using illustrative trade-off estimates.

Daniel Fleuren | 2026-06-06 | 11 min read | Operations and governance teams | Updated 2026-06-19

Written by Daniel Fleuren, Founder, AI Kick Start. 20+ years enterprise IT

Key takeaways

Briefing: In late January 2026, a flaw in the OpenClaw agent platform turned a popular AI coding tool into a doorway onto its own host machine.
Threat Model: Before you pick a sandboxing strategy, you need to be clear about what you are defending against.
Strategy 1: Container-Based Isolation: Docker containers are the [most common way teams sandbox agents](https://northflank.com/blog/how-to-sandbox-ai-agents).
Strategy 2: VM-Based Isolation: When the security bar is higher, agents run in lightweight VMs ([Firecracker](https://manveerc.substack.com/p/ai-agent-sandboxing-guide), Cloud Hypervisor) instead of containers.
Strategy 3: Capability-Based Isolation: The most fine-grained approach hands out capabilities rather than blanket permissions.

Briefing

In late January 2026, a flaw in the OpenClaw agent platform turned a popular AI coding tool into a doorway onto its own host machine. The bug, tracked as CVE-2026-25253 and rated 8.8 on the CVSS severity scale, let an attacker steal the agent's auth token and run code on the box it was sitting on.

For business teams now handing real work to AI agents, that is the uncomfortable lesson. The danger was not that the model said something dumb. The danger was the room it was standing in. The agent had full shell and filesystem access, so once an attacker got in, there was nothing left to stop them.

This piece is about the walls you put around that room. Not the model, the environment. Below we walk through the threat model and the four practical ways teams are boxing agents in: containers, virtual machines, capability limits, and human approval gates. None of them is a silver bullet, and the right answer is usually a stack of them.

Threat Model

Before you pick a sandboxing strategy, you need to be clear about what you are defending against. Agent risk breaks into three classes.

Class 1: Accidental damage. The agent deletes the wrong directory, overwrites production config, or burns through resources in a runaway loop. Nobody meant any harm, but the damage is real all the same.

Class 2: Malicious skills or tools. A third-party skill (the OpenClaw scenario), a compromised dependency, or a poisoned model response tricks the agent into running harmful code. The agent is not malicious. It is being used.

Class 3: Agent misalignment. The agent chases its goal in ways that break the rules: shipping data out to finish a task "more efficiently," removing the guardrails that slow it down, or talking a human operator into doing something it cannot do itself. This is the hardest class to defend against, and the one no tool fully solves.

Strategy 1: Container-Based Isolation

Docker containers are the most common way teams sandbox agents. Each agent runs in its own container with restricted filesystem mounts, network policies, and resource limits.

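# Agent sandbox container
FROM python:3.11-slim
# Run as an unprivileged user rather than root
RUN useradd -m -s /bin/bash agent
USER agent
WORKDIR /workspace
CMD ["python", "-m", "hermes", "--sandbox"]

# A Dockerfile cannot set mount modes, networking, or resource limits.
# Those are applied at run time, for example (image name and limits are illustrative):
#   docker run --rm \
#     --network none \
#     --mount type=bind,source="$PWD/project",target=/workspace/project,readonly \
#     --mount type=bind,source="$PWD/scratch",target=/workspace/scratch \
#     --memory 1g --cpus 1 --pids-limit 256 \
#     agent-sandbox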

Containers hold up well against Class 1 and Class 2. The read-only project mount stops accidental overwrites. Network restrictions block data from leaking out. Resource limits keep a runaway agent from taking the host down. They are not airtight, though. Containers share the host kernel, so a container escape (rare in practice, but not impossible) could reach the host. And Class 3 problems, where the agent manipulates a person or finds a clever way around the rules, sit outside what a container can catch.
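
As a rough illustration of how the restrictions described above might be applied when a container is launched, the sketch below builds a `docker run` argument list from a small policy object. The policy shape, the `buildDockerArgs` helper, the host paths, and the `agent-sandbox` image name are all assumptions made for this example; the flags themselves are standard Docker CLI options.

```javascript
// Hypothetical helper: turns a sandbox policy into `docker run` arguments.
// The policy shape, paths, and image name are illustrative assumptions.
function buildDockerArgs(policy, image) {
  const args = ["run", "--rm"];

  // No network unless the policy names a Docker network explicitly.
  args.push("--network", policy.network ?? "none");

  // Bind mounts are read-only unless explicitly marked writable.
  for (const m of policy.mounts) {
    const spec = `type=bind,source=${m.source},target=${m.target}`;
    args.push("--mount", m.writable ? spec : `${spec},readonly`);
  }

  // Resource limits keep a runaway agent from exhausting the host.
  args.push("--memory", policy.memory, "--cpus", String(policy.cpus));
  args.push("--pids-limit", String(policy.pidsLimit));

  // Drop Linux capabilities and block privilege escalation.
  args.push("--cap-drop", "ALL", "--security-opt", "no-new-privileges");

  args.push(image);
  return args;
}

const args = buildDockerArgs(
  {
    mounts: [
      { source: "/srv/agent/project", target: "/workspace/project" },
      { source: "/srv/agent/scratch", target: "/workspace/scratch", writable: true },
    ],
    memory: "1g",
    cpus: 1,
    pidsLimit: 256,
  },
  "agent-sandbox"
);

console.log(["docker", ...args].join(" "));
```

Keeping the policy as data rather than a hand-written command makes it easier to review and to apply the same defaults to every agent.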

Strategy 2: VM-Based Isolation

When the security bar is higher, agents run in lightweight VMs (Firecracker, Cloud Hypervisor) instead of containers. Each VM gets its own kernel, which makes escaping it far harder than breaking out of a container.

The cost is speed and overhead. VMs take longer to start than containers, and they eat more resources. For a long-running agent that is a fair trade. For an agent that spins up and down constantly, the startup latency adds up fast.
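
To make that trade-off concrete, here is a small back-of-the-envelope calculation. It reuses the illustrative startup figures from the benchmarks table in this article (100ms for a container, 2s for a VM), which are estimates rather than measurements. The function name and the two workloads are assumptions made for the example.

```javascript
// Total time spent waiting on sandbox startup over a number of launches.
function totalStartupSeconds(startupMs, launches) {
  return (startupMs * launches) / 1000;
}

// Illustrative figures only, not measured benchmarks.
const CONTAINER_STARTUP_MS = 100;
const VM_STARTUP_MS = 2000;

// A long-running agent launched once a day versus a short-lived agent
// launched 5,000 times a day (both workloads are hypothetical).
for (const launches of [1, 5000]) {
  const container = totalStartupSeconds(CONTAINER_STARTUP_MS, launches);
  const vm = totalStartupSeconds(VM_STARTUP_MS, launches);
  console.log(`${launches} launches/day: container ${container}s, VM ${vm}s`);
}
```

Under these assumptions, a single daily launch costs 2 seconds in a VM, which is negligible. At 5,000 launches a day the same VM costs 10,000 seconds (roughly 2.8 hours) of cumulative startup time, against 500 seconds for containers.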

Strategy 3: Capability-Based Isolation

The most fine-grained approach hands out capabilities rather than blanket permissions. Instead of giving an agent read access to a whole directory, you give it read access to specific files. Instead of network access, you give it access to specific API endpoints. OpenHuman takes a version of this approach with its integration system: each of its 118+ integrations is granted only the capabilities it needs, and calls to those third-party services are routed through the OpenHuman backend rather than made directly by the agent. (The agent does, for the record, have a direct coder toolset for filesystem, git, and test work out of the box, so the gating applies mainly to external integrations rather than everything the agent touches.)

// Capability-based permission system
const agentCapabilities = {
  filesystem: {
    read: ["/project/src/**", "/project/tests/**"],
    write: ["/project/scratch/**"],
    delete: [] // No delete capability
  },
  network: {
    allowedHosts: ["api.github.com", "openrouter.ai"],
    allowedMethods: ["GET", "POST"],
    maxRequestSize: "1MB"
  },
  shell: {
    allowedCommands: ["npm", "node", "git status", "git diff"],
    blockedPatterns: ["*rm -rf*", "*curl*|*sh*", "*sudo*"]
  }
};
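
A policy like the one above only helps if something enforces it. The sketch below shows one way a deny-by-default check could work for the filesystem and network rules. It is a minimal illustration, not how any particular product implements it: the `capabilities` object is a trimmed copy of the policy, and `globToRegExp`, `canAccessFile`, and `canRequest` are hypothetical helpers.

```javascript
// Assumed policy: a trimmed copy of the filesystem and network rules above.
const capabilities = {
  filesystem: {
    read: ["/project/src/**", "/project/tests/**"],
    write: ["/project/scratch/**"],
    delete: [], // No delete capability
  },
  network: {
    allowedHosts: ["api.github.com", "openrouter.ai"],
    allowedMethods: ["GET", "POST"],
  },
};

const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Minimal glob support: "**" matches across directories, "*" within one segment.
function globToRegExp(glob) {
  const body = glob
    .split("**")
    .map((part) => part.split("*").map(escapeRegExp).join("[^/]*"))
    .join(".*");
  return new RegExp(`^${body}$`);
}

// Deny by default: an operation is allowed only if some pattern grants it.
function canAccessFile(operation, filePath) {
  // Reject traversal segments so "/project/src/../../etc" cannot slip through.
  if (filePath.split("/").includes("..")) return false;
  const rules = capabilities.filesystem;
  const known = Object.prototype.hasOwnProperty.call(rules, operation);
  const patterns = known ? rules[operation] : [];
  return patterns.some((glob) => globToRegExp(glob).test(filePath));
}

// A request is allowed only if both the host and the method are listed.
function canRequest(method, url) {
  const { hostname } = new URL(url);
  const { allowedHosts, allowedMethods } = capabilities.network;
  return allowedHosts.includes(hostname) && allowedMethods.includes(method);
}

console.log(canAccessFile("read", "/project/src/index.js"));     // true
console.log(canAccessFile("write", "/project/src/index.js"));    // false
console.log(canAccessFile("delete", "/project/scratch/tmp.txt")); // false
console.log(canRequest("GET", "https://api.github.com/repos"));  // true
console.log(canRequest("GET", "https://example.com/"));          // false
```

The important design choice is that every function returns `false` unless a rule explicitly grants access, so a missing or misspelled rule fails closed rather than open.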

Strategy 4: Approval Gates

For the riskiest operations, no automatic sandbox is enough. Approval gates put a human in the loop before the agent can run certain actions. Claude Code's Plan Mode is built around this: the agent proposes, the human approves. Reportedly, OpenClaw's hardened sandbox mode released after CVE-2026-25253 also leans on approval-gated controls, including verbose approval prompts, for sensitive actions such as network access from skills.

A good approval gate needs to be:

Contextual: show what the agent is about to do and why, not just "approve this action?"
Scoped: apply only to high-risk operations, not every file read
Overridable: let the human grant a temporary or permanent exception
Auditable: log every approval decision so it can be reviewed later
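
The four properties above can be sketched in a few lines. This is an illustrative outline under simplifying assumptions, not a description of how Claude Code or OpenClaw implement their gates: `createApprovalGate`, `isHighRisk`, and `askHuman` are hypothetical names, and the human prompt is stubbed as a synchronous callback where a real system would wait on an operator.

```javascript
// Hypothetical approval gate. `isHighRisk` and `askHuman` are supplied by the
// caller; in a real system `askHuman` would prompt an operator and be async.
function createApprovalGate({ isHighRisk, askHuman }) {
  const auditLog = [];          // Auditable: every decision is recorded
  const exceptions = new Set(); // Overridable: action types the human pre-approved

  function request(action) {
    let decision;
    if (!isHighRisk(action)) {
      decision = "auto-allowed"; // Scoped: low-risk actions skip the prompt
    } else if (exceptions.has(action.type)) {
      decision = "allowed-by-exception";
    } else {
      // Contextual: the human sees what will happen and why.
      const answer = askHuman({ what: action.description, why: action.reason });
      if (answer === "always") exceptions.add(action.type);
      decision = answer === "deny" ? "denied" : "approved";
    }
    auditLog.push({ type: action.type, decision, at: new Date().toISOString() });
    return decision !== "denied";
  }

  return { request, auditLog };
}

// Example wiring with a stubbed human who allows network calls permanently.
const gate = createApprovalGate({
  isHighRisk: (action) => ["network", "delete"].includes(action.type),
  askHuman: () => "always",
});

gate.request({ type: "read", description: "Read README.md", reason: "Gather context" });
gate.request({ type: "network", description: "POST to api.github.com", reason: "Open a pull request" });
gate.request({ type: "network", description: "GET api.github.com", reason: "Check CI status" });

console.log(gate.auditLog.map((entry) => entry.decision));
// [ 'auto-allowed', 'approved', 'allowed-by-exception' ]
```

A permanent exception keyed on action type alone is deliberately coarse here. In practice you would likely scope exceptions more tightly, for example by host or path, and give them an expiry.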

The Defense-in-Depth Stack

Real production deployments stack these strategies rather than betting on one:

Capability-based permissions for routine operations
Container isolation for the agent runtime
Approval gates for high-risk operations
Network restrictions preventing unapproved external communication
Audit logging of all agent actions for forensic analysis
Resource limits preventing denial of service
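
One way to picture the stack is as a chain of independent checks, where any layer can veto an action and every outcome is logged. The sketch below is a simplified model under stated assumptions: the three layers are stand-ins for real controls, and the `layers`, `authorize`, and `auditTrail` names are hypothetical.

```javascript
// Each layer inspects an action and returns { allowed, reason }.
// These layers are simplified stand-ins for real controls.
const layers = [
  {
    name: "capabilities",
    check: (a) => ({ allowed: a.type !== "delete", reason: "no delete capability" }),
  },
  {
    name: "network",
    check: (a) => ({
      allowed: a.type !== "network" || a.host === "api.github.com",
      reason: "host not on allowlist",
    }),
  },
  {
    name: "approval",
    check: (a) => ({
      allowed: !a.highRisk || a.approved === true,
      reason: "approval required",
    }),
  },
];

const auditTrail = [];

// Run every layer in order; the first denial stops the action.
function authorize(action) {
  for (const layer of layers) {
    const { allowed, reason } = layer.check(action);
    if (!allowed) {
      auditTrail.push({ action: action.type, allowed: false, deniedBy: layer.name, reason });
      return false;
    }
  }
  auditTrail.push({ action: action.type, allowed: true });
  return true;
}

console.log(authorize({ type: "read" }));                         // true
console.log(authorize({ type: "network", host: "example.com" })); // false
console.log(
  authorize({ type: "network", host: "api.github.com", highRisk: true })
); // false
console.log(auditTrail.length); // 3
```

Because each layer is evaluated independently, a gap in one control (say, an overly broad host allowlist) can still be caught by the next one.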

Sandboxing Benchmarks

The figures below are illustrative estimates rather than measured benchmarks, but they line up with the general trade-offs in the literature: containers start faster than VMs, and capability checks add little overhead. The 2s VM figure is closer to a conventional VM boot; microVMs land in the low hundreds of milliseconds.

Strategy	Startup	Isolation Strength	Overhead	Class 1	Class 2	Class 3
None	Instant	None	None	Fail	Fail	Fail
Container	100ms	Good	Low	Pass	Pass	Partial
VM	2s	Strong	Medium	Pass	Pass	Partial
Capability-based	10ms	Granular	Low	Pass	Pass	Partial
Approval gates	Variable	Human	High	Pass	Pass	Partial
Full stack	2.1s	Maximum	High	Pass	Pass	Mitigated

No strategy fully closes off Class 3 threats. The best you can do today is layer capability limits, approval gates, and human oversight, and accept that the combination is mitigation, not a cure. The takeaway from CVE-2026-25253 is simple enough: sandboxing has to be the default, not a setting someone remembers to turn on. An agent framework that does not sandbox out of the box is not ready for production.
