Briefing
In late January 2026, a flaw in the OpenClaw agent platform turned a popular AI coding tool into a doorway onto its own host machine. The bug, tracked as CVE-2026-25253 and rated 8.8 on the CVSS severity scale, let an attacker steal the agent's auth token and run code on the box it was sitting on.
For business teams now handing real work to AI agents, that is the uncomfortable lesson. The danger was not that the model said something dumb. The danger was the room it was standing in. The agent had full shell and filesystem access, so once an attacker got in, there was nothing left to stop them.
This piece is about the walls you put around that room. Not the model, the environment. Below we walk through the threat model and the four practical ways teams are boxing agents in: containers, virtual machines, capability limits, and human approval gates. None of them is a silver bullet, and the right answer is usually a stack of them.
Threat Model
Before you pick a sandboxing strategy, you need to be clear about what you are defending against. Agent risk breaks into three classes.
Class 1: Accidental damage. The agent deletes the wrong directory, overwrites production config, or burns through resources in a runaway loop. Nobody meant any harm, but the damage is real all the same.
Class 2: Malicious skills or tools. A third-party skill (the OpenClaw scenario), a compromised dependency, or a poisoned model response tricks the agent into running harmful code. The agent is not malicious. It is being used.
Class 3: Agent misalignment. The agent chases its goal in ways that break the rules: shipping data out to finish a task "more efficiently," removing the guardrails that slow it down, or talking a human operator into doing something it cannot do itself. This is the hardest class to defend against, and the one no tool fully solves.
Strategy 1: Container-Based Isolation
Docker containers are the most common way teams sandbox agents. Each agent runs in its own container with restricted filesystem mounts, network policies, and resource limits.
# Agent sandbox container
FROM python:3.11-slim
RUN useradd -m -s /bin/bash agent
USER agent
WORKDIR /workspace
# Mount project as read-only, scratch directory as read-write
VOLUME ["/workspace/project:ro", "/workspace/scratch:rw"]
# No network access by default
NETWORK none
# Resource limits
CMD ["python", "-m", "hermes", "--sandbox"]Containers hold up well against Class 1 and Class 2. The read-only project mount stops accidental overwrites. Network restrictions block data from leaking out. Resource limits keep a runaway agent from taking the host down. They are not airtight, though. A container escape (rare in practice, but not impossible) could reach the host. And Class 3 problems, where the agent manipulates a person or finds a clever way around the rules, sit outside what a container can catch.
Strategy 2: VM-Based Isolation
When the security bar is higher, agents run in lightweight VMs (Firecracker, Cloud Hypervisor) instead of containers. Each VM gets its own kernel, which makes escaping it far harder than breaking out of a container.
The cost is speed and overhead. VMs take longer to start than containers, and they eat more resources. For a long-running agent that is a fair trade. For an agent that spins up and down constantly, the startup latency adds up fast.
Strategy 3: Capability-Based Isolation
The most fine-grained approach hands out capabilities rather than blanket permissions. Instead of giving an agent read access to a whole directory, you give it read access to specific files. Instead of network access, you give it access to specific API endpoints. OpenHuman takes a version of this with its integration system: each of its 118+ integrations is granted only the capabilities it needs, and calls to those third-party services are routed through the OpenHuman backend rather than made directly by the agent. (The agent does, for the record, have a direct coder toolset for filesystem, git, and test work out of the box, so the gating applies mainly to external integrations rather than everything the agent touches.)
// Capability-based permission system
const agentCapabilities = {
filesystem: {
read: ["/project/src/**", "/project/tests/**"],
write: ["/project/scratch/**"],
delete: [] // No delete capability
},
network: {
allowedHosts: ["api.github.com", "openrouter.ai"],
allowedMethods: ["GET", "POST"],
maxRequestSize: "1MB"
},
shell: {
allowedCommands: ["npm", "node", "git status", "git diff"],
blockedPatterns: ["*rm -rf*", "*curl*|*sh*", "*sudo*"]
}
};Strategy 4: Approval Gates
For the riskiest operations, no automatic sandbox is enough. Approval gates put a human in the loop before the agent can run certain actions. Claude Code's Plan Mode is built around this: the agent proposes, the human approves. Reportedly, OpenClaw's hardened sandbox mode released after CVE-2026-25253 also leans on approval-gated controls, including verbose approval prompts, for sensitive actions such as network access from skills.
A good approval gate needs to be:
- Contextual: show what the agent is about to do and why, not just "approve this action?"
- Scoped: apply only to high-risk operations, not every file read
- Overrideable: let the human grant a temporary or permanent exception
- Auditable: log every approval decision so it can be reviewed later
The Defense-in-Depth Stack
Real production deployments stack these strategies rather than betting on one:
- Capability-based permissions for routine operations
- Container isolation for the agent runtime
- Approval gates for high-risk operations
- Network restrictions preventing external communication
- Audit logging of all agent actions for forensic analysis
- Resource limits preventing denial of service
Sandboxing Benchmarks
The figures below are illustrative estimates rather than measured benchmarks, but they line up with the general trade-offs in the literature: containers start faster than VMs, microVMs land in the low hundreds of milliseconds, and capability checks add little overhead.
| Strategy | Startup | Isolation Strength | Overhead | Class 1 | Class 2 | Class 3 |
|---|---|---|---|---|---|---|
| None | Instant | None | None | Fail | Fail | Fail |
| Container | 100ms | Good | Low | Pass | Pass | Partial |
| VM | 2s | Strong | Medium | Pass | Pass | Partial |
| Capability-based | 10ms | Granular | Low | Pass | Pass | Partial |
| Approval gates | Variable | Human | High | Pass | Pass | Partial |
| Full stack | 2.1s | Maximum | High | Pass | Pass | Mitigated |
No strategy fully closes off Class 3 threats. The best you can do today is layer capability limits, approval gates, and human oversight, and accept that the combination is mitigation, not a cure. The takeaway from CVE-2026-25253 is simple enough: sandboxing has to be the default, not a setting someone remembers to turn on. An agent framework that does not sandbox out of the box is not ready for production.



