
Agent Safety: Sandboxing, Approval Gates, and Human Review.

The three-pillar safety model for production agent deployments. Lessons from CVE-2026-25253 and the Koi Security audit, with a maturity model from basic to elite.

Daniel Fleuren | 2026-05-18 | 13 min read | Operations and governance teams | Updated 2026-06-19

Written by

Daniel Fleuren

Founder, AI Kick Start. 20+ years enterprise IT



Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

Treat agent safety as part of the design, not a patch you bolt on once something breaks. Three layers do most of the work: sandbox the agent so it cannot touch what it should not, put approval gates in front of risky operations, and keep a human in the loop for review. Get those right and you can run agents without betting the business on them.

Key takeaways

Design safety in from the start. Sandboxing, approval gates, and human review are the three controls that contain agent risk.
CVE-2026-25253 (a one-click RCE in OpenClaw) and Koi Security's discovery of 341 malicious skills show the threat is real: agents hold credentials and act on the world.
Default-deny everything. Give the agent read-only source access, no outbound network, and hard limits on CPU, memory, and run time.
Calibrate approval gates so people actually read them. Hard-block critical operations, log the trivial ones, and tune the middle to your risk appetite.
The maturity model and the rollback-rate thresholds (ease gates when the rate is below 2%, tighten them when it is above 10%) are starting points, not rules. Adjust them to your own data.

Briefing

Agent safety is not a feature you add later. It is an architectural property you design in from the start. The lessons from CVE-2026-25253, the Koi Security audit of OpenClaw skills, and real-world agent deployments point to the same three things: sandboxing, approval gates, and human review.

Analysis

Earlier this year a single OpenClaw vulnerability gave the security world a fright. CVE-2026-25253 was a one-click remote code execution flaw: the app accepted a gateway URL from a query string and quietly opened a WebSocket that handed over the user's auth token. One click, and an attacker could be running code on your machine. It was patched in version 2026.1.29, but by then it had made a wider point. The risk with AI agents is not just that they make mistakes. It is that they hold real credentials and can act on the world.

Around the same time, researchers at Koi Security audited every skill in OpenClaw's skill marketplace. Of 2,857 skills, they flagged 341 as malicious, with 335 tied to a single campaign they named ClawHavoc. The malicious skills used fake setup instructions to drop keyloggers on Windows and AMOS-family malware on macOS. The audit was run, fittingly, with the help of the same kind of agent the attackers were targeting.

For an Australian business team thinking about putting an agent to work, the takeaway is not "agents are dangerous, stay away". It is that an agent is a piece of software with hands. You would not give a new contractor your production database password and walk away. The same instinct applies here. The rest of this article is the practical version of that instinct, broken into the three controls that actually contain the risk.

Pillar 1: Sandboxing

Sandboxing keeps the agent away from anything critical. Article 10 went through the technical strategies in detail. For production, these are the principles that matter:

Default deny: Agents start with no permissions. Grant only what is needed.
Filesystem isolation: Read-only access to source code, read-write only to designated scratch directories.
Network restrictions: No outbound network by default. Whitelist required endpoints.
Resource limits: CPU, memory, and execution time caps prevent runaway agents.
Process isolation: Agent code runs in separate processes or containers.

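# Production sandbox configuration (illustrative schema; adapt the keys to your runtime)
sandbox:
  type: container
  filesystem:
    # Source tree: the agent can read it but not change it
    - mount: /workspace/project
      access: read_only
    # Scratch directory: the only writable location
    - mount: /workspace/output
      access: read_write
  network:
    mode: restricted          # outbound traffic denied unless the host is listed
    allowed_hosts:
      - github.com
      - registry.npmjs.org
  resources:
    cpu_limit: 2              # CPU cores
    memory_limit: 4G
    max_execution_time: 600   # presumably seconds (10 minutes)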
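
To make default deny concrete, here is a minimal Python sketch of how a harness might check a file or network request against the configuration above. The function names and the policy constants are hypothetical, not part of OpenClaw, Hermes, or Claude Code. The path check is purely lexical, so it does not catch symlink tricks; a real sandbox still needs the container to enforce the boundary.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical policy mirroring the configuration above.
READ_ONLY = [PurePosixPath("/workspace/project")]
READ_WRITE = [PurePosixPath("/workspace/output")]
ALLOWED_HOSTS = {"github.com", "registry.npmjs.org"}


def _normalise(path: str) -> Optional[PurePosixPath]:
    """Resolve '..' lexically. Return None for relative or root-escaping paths."""
    if not path.startswith("/"):
        return None
    parts: List[str] = []
    for part in PurePosixPath(path).parts:
        if part == "/":
            continue
        if part == "..":
            if not parts:
                return None  # tried to climb above the root
            parts.pop()
        else:
            parts.append(part)
    return PurePosixPath("/", *parts)


def _within(path: PurePosixPath, mounts: List[PurePosixPath]) -> bool:
    """True if path is one of the mounts or sits underneath one."""
    return any(path == mount or mount in path.parents for mount in mounts)


def can_access(path: str, write: bool = False) -> bool:
    """Default deny: only listed mounts are reachable, and only one is writable."""
    normalised = _normalise(path)
    if normalised is None:
        return False
    if _within(normalised, READ_WRITE):
        return True
    if _within(normalised, READ_ONLY):
        return not write
    return False


def can_connect(host: str) -> bool:
    """Outbound connections are denied unless the host is explicitly listed."""
    return host.lower() in ALLOWED_HOSTS
```

Note that a write to `/workspace/output/../project/x` is refused, because the path is normalised before it is matched against the mounts.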

OpenClaw's post-CVE sandbox mode implements most of these, restricting filesystem and network access for skills. Worth saying plainly: that sandbox has since had its own escape bugs (CVE-2026-32048 among them), so it is a layer of defence, not a guarantee. Hermes supports container-based sandboxing via Docker, with read-only bind mounts for skills and credentials. Claude Code runs in a managed environment with permission and approval controls, which is closer to gated execution than a formal container sandbox. If you see it described as "implicit sandboxing", treat that as shorthand rather than a hard spec.

Pillar 2: Approval Gates

Approval gates make a human confirm high-risk operations before they run. The tricky part is calibrating what counts as "high-risk". Ask for approval on everything and people stop reading the prompts. Ask for too little and you are exposed.

Here is a workable set of gate levels:

Level	Trigger	Action
Critical	Database mutation, secret access, deployment	Hard block, require approval
High	File deletion, API key usage, config change	Block, show impact analysis
Medium	New dependency, >10 files modified	Notify, allow override
Low	Single file edit, test addition	Log only
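
The table above can be sketched as a simple lookup. This is a minimal illustration in Python, not how Claude Code or Hermes implement their gates. The operation labels are hypothetical, and sending unknown operations to the strictest gate is my assumption, in keeping with default deny.

```python
from enum import Enum


class Action(Enum):
    HARD_BLOCK = "hard block, require approval"   # Critical
    BLOCK = "block, show impact analysis"         # High
    NOTIFY = "notify, allow override"             # Medium
    LOG = "log only"                              # Low


# Hypothetical operation labels, one per trigger in the table.
GATES = {
    "database_mutation": Action.HARD_BLOCK,
    "secret_access": Action.HARD_BLOCK,
    "deployment": Action.HARD_BLOCK,
    "file_deletion": Action.BLOCK,
    "api_key_usage": Action.BLOCK,
    "config_change": Action.BLOCK,
    "new_dependency": Action.NOTIFY,
    "single_file_edit": Action.LOG,
    "test_addition": Action.LOG,
}

LARGE_CHANGE_FILES = 10  # the ">10 files modified" trigger


def gate_for(operation: str, files_modified: int = 1) -> Action:
    """Return the gate for an operation, escalating large low-risk changes."""
    # Assumption: anything not listed gets the strictest gate.
    action = GATES.get(operation, Action.HARD_BLOCK)
    if action is Action.LOG and files_modified > LARGE_CHANGE_FILES:
        return Action.NOTIFY
    return action
```

Keeping the mapping in one table makes it easy to review and to tune as your rollback data comes in.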

Claude Code's Plan Mode is the most mature approval gate I have used. It reads the codebase, writes out a numbered plan of the files it will touch and the commands it will run, and refuses to change anything until you approve. You can edit the plan or cancel it first. Hermes offers configurable approval gates too: it can require manual sign-off before destructive commands, and generated code has to pass constraint checks (unit tests, file-size limits) before it runs.

Treat the levels above as a starting template, not gospel: a sensible default that you will tune to your own risk appetite.

Pillar 3: Human Review

Human review catches what automation misses: subtle bugs, work that technically passes but heads the wrong architectural direction, security holes that static analysis walks straight past. The trick is to make review fast, not bureaucratic.

What works in practice:

Automated pre-review: Run lint, tests, and security scans before a human sees the code
Diff-only review: Show only what changed, with clear context
Risk-based routing: High-risk changes go to senior engineers; low-risk changes can be self-merged
Review time limits: Review within 4 hours or auto-approve with logging
Review feedback loops: When reviewers catch agent bugs, update the harness
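
As a rough sketch of how the first four practices fit together, the routing decision can be written as one small function. The risk labels, queue names, and the `Change` type are hypothetical. Two things are my assumptions rather than anything stated above: there is a "medium" tier that goes to peer review, and the 4-hour auto-approve never applies to high-risk changes.

```python
from dataclasses import dataclass

AUTO_APPROVE_AFTER_HOURS = 4  # review time limit from the list above


@dataclass
class Change:
    risk: str             # "high", "medium", or "low" (hypothetical labels)
    checks_passed: bool   # lint, tests, and security scans all green
    hours_waiting: float  # time since the change entered the review queue


def route(change: Change) -> str:
    """Decide where a change goes next."""
    # Automated pre-review: a human never sees a change that fails its checks.
    if not change.checks_passed:
        return "return-to-agent"
    # Risk-based routing.
    if change.risk == "high":
        return "senior-review"  # assumption: never auto-approved
    if change.risk == "low":
        return "self-merge"
    # Everything else waits for a peer, up to the time limit.
    if change.hours_waiting >= AUTO_APPROVE_AFTER_HOURS:
        return "auto-approve-with-log"
    return "peer-review"
```

For example, `route(Change("medium", True, 5.0))` returns `"auto-approve-with-log"`, while a high-risk change waiting the same five hours still goes to `"senior-review"`.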

The Three-Pillar Maturity Model

Maturity	Sandboxing	Approval Gates	Human Review
Level 1 (Basic)	Container isolation	Plan Mode for complex tasks	All changes reviewed
Level 2 (Intermediate)	Capability-based + container	Risk-calibrated gates	Risk-based routing
Level 3 (Advanced)	VM-based + capability-based	Context-aware gates	Review sample + spot checks
Level 4 (Elite)	Defence-in-depth stack	Minimal gates, high trust	Trust-but-verify with audit

This model and the thresholds that follow are my recommendation rather than an industry standard, so weigh them against your own data. Most teams should start at Level 1 and move up based on rollback rates. If your rollback rate sits below 2%, you can think about easing off the approval gates. If it climbs above 10%, tighten them.
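
The threshold rule is simple enough to write down directly. A minimal sketch, with a hypothetical function name and return labels:

```python
EASE_BELOW = 0.02     # rollback rate under 2%: consider easing gates
TIGHTEN_ABOVE = 0.10  # rollback rate over 10%: tighten gates


def gate_adjustment(rollbacks: int, changes: int) -> str:
    """Suggest a gate adjustment from the rollback rate over a period."""
    if changes <= 0:
        raise ValueError("need at least one merged change to compute a rate")
    rate = rollbacks / changes
    if rate < EASE_BELOW:
        return "consider easing"
    if rate > TIGHTEN_ABOVE:
        return "tighten"
    return "hold"
```

With 1 rollback in 100 changes the suggestion is to consider easing; with 12 in 100 it is to tighten. On a small number of changes the rate is noisy, so it is worth waiting for a reasonable sample before acting on it.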

Agent-Specific Security Risks

Beyond ordinary software security, agents bring their own set of risks:

Prompt injection: Malicious input that overrides system instructions
Tool misuse: Agent using a legitimate tool for unintended purposes
Information leakage: Agent exfiltrating sensitive data through tool outputs
Goal misalignment: Agent pursuing the literal goal in ways that violate implicit constraints
Supply chain via skills: Malicious skills or dependencies, the kind Koi Security uncovered in the ClawHavoc campaign

Each one needs its own mitigation:

Prompt injection: Input validation, prompt boundaries, output encoding
Tool misuse: Capability-based restrictions, tool-specific guards
Information leakage: Network restrictions, output filtering, audit logging
Goal misalignment: Constraint specification, approval gates, human oversight
Supply chain: Skill signing, sandboxing, audit (the Koi Security model)
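
For the supply chain item, the sketch below uses hash pinning, a simpler stand-in for full skill signing: a skill is loaded only if its SHA-256 digest matches a value someone reviewed and recorded earlier. The function names and the shape of the pinned list are hypothetical, and this is not how any particular marketplace verifies skills.

```python
import hashlib
import hmac
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the skill's contents."""
    return hashlib.sha256(data).hexdigest()


def skill_is_trusted(name: str, content: bytes, pinned: Dict[str, str]) -> bool:
    """Allow a skill only if its digest matches the pinned value for its name."""
    expected = pinned.get(name)
    if expected is None:
        return False  # default deny: unreviewed skills are never loaded
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(sha256_hex(content), expected)
```

Pinning only tells you the skill has not changed since it was reviewed. It says nothing about whether the reviewed version was safe, which is why it sits alongside sandboxing and audit rather than replacing them.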

Incident Response

When an agent causes a security incident:

Contain: Disable the agent, revoke its credentials
Assess: Determine scope of impact (what did it access, modify, or exfiltrate)
Recover: Roll back changes, rotate secrets, patch vulnerabilities
Analyse: Root cause analysis. Was it a bug, a malicious input, or a design flaw?
Remediate: Update harness, add constraints, improve monitoring
Document: Incident report for the team and, if severe, the community
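
Under pressure it is tempting to jump straight to recovery. One way to hold the order is a small record that refuses out-of-sequence steps. This is a sketch with hypothetical names, not a replacement for your incident tooling.

```python
from typing import List, Optional, Tuple

STEPS = ["contain", "assess", "recover", "analyse", "remediate", "document"]


class Incident:
    """Tracks the response steps and enforces the order listed above."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.completed: List[Tuple[str, str]] = []  # (step, note) pairs

    def next_step(self) -> Optional[str]:
        """The step that must be done next, or None when all are complete."""
        done = len(self.completed)
        return STEPS[done] if done < len(STEPS) else None

    def complete(self, step: str, note: str) -> None:
        """Record a finished step; reject anything out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed.append((step, note))
```

Calling `complete("recover", ...)` before `contain` and `assess` raises a `ValueError`, and the notes collected along the way become the raw material for the incident report.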

Agent safety is not about eliminating risk. It is about keeping risk in line with the value the agents return. The teams that deploy agents safely are the ones that treat safety as something they keep doing, not a box they tick once.


What to do next

Pick the smallest useful workflow that proves the pattern.
Write down the owner, data boundary, review point, and success measure.
Review the result after the first real run and decide whether to scale, change, or stop.
