Sandboxing AI agent tool calls is no longer optional. The moment your agent executes a shell command, installs a package, runs a Python script, or fetches a URL, you’re running code with the agent’s privileges. In a multi-tenant context that means tenant A’s prompt can issue code that tries to read tenant B’s data. The defense is a sandbox.
Two options dominate in 2026: bubblewrap and Docker. This article is the trade-off matrix.
What the sandbox needs to do
Whatever you pick, the sandbox must:
- Isolate the filesystem. The agent sees only the tenant’s work directory and the system libraries it needs.
- Restrict network access. Default-deny; opt-in to specific hostnames + ports.
- Drop capabilities. No CAP_SYS_ADMIN, no raw network sockets, no kernel-module loading.
- Filter syscalls. Default-deny seccomp profile; allow only what tool execution needs.
- Cap resources. Memory limit, CPU quota, wall-clock timeout.
- Be cheap. Cold start measured in tens of milliseconds, not seconds — because you’ll spawn one per tool call.
Both bubblewrap and Docker can satisfy all six, though bubblewrap leaves resource caps to cgroups you configure yourself. The question is which trade-offs each makes.
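The resource-cap requirement is the easiest one to prototype without either tool. Here is a minimal sketch, assuming a Linux host and Python's standard `resource` module; `run_capped` and its defaults are illustrative names, not part of bubblewrap, Docker, or OpenClawMU. rlimits apply per process rather than per sandbox, so treat this as a backstop and not a replacement for cgroups.

```python
import resource
import subprocess
import sys


def _clamp(limit: int, kind: int) -> int:
    """Never request more than the hard limit this process already runs under."""
    _, hard = resource.getrlimit(kind)
    return limit if hard == resource.RLIM_INFINITY else min(limit, hard)


def run_capped(argv, memory_mb=512, cpu_seconds=5, timeout_s=10.0):
    """Run argv with an address-space cap, a CPU-time cap, and a wall-clock timeout."""
    mem = _clamp(memory_mb * 1024 * 1024, resource.RLIMIT_AS)
    cpu = _clamp(cpu_seconds, resource.RLIMIT_CPU)

    def apply_limits():
        # Runs in the child between fork() and exec().
        # The memory cap goes last so nothing after it needs to allocate.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu, cpu))
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))

    return subprocess.run(
        argv,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
    )
```

Calling `run_capped([sys.executable, "-c", "print('ok')"])` returns a `CompletedProcess`; a tool that sleeps past `timeout_s` is killed and the call raises `subprocess.TimeoutExpired`.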
Bubblewrap
Bubblewrap is the sandboxing primitive that powers Flatpak. It’s a small binary that builds a Linux user-namespace + mount-namespace sandbox using only kernel features. On kernels that allow unprivileged user namespaces it needs no setuid bit: no daemon, no Docker, no root.
Cold start: ~30 ms on a modern x86 box. That’s per-tool-call cheap — you can spawn a fresh sandbox every time your agent issues a shell command.
Isolation mechanism: user namespaces + mount namespaces + seccomp.
Pros:
- Lightning fast cold start.
- No daemon. Bubblewrap is a CLI you invoke; nothing keeps running between calls.
- Rootless. Doesn’t need elevated privileges to set up.
- Small audit surface. The bwrap binary is a few thousand lines.
- Plays nicely with the existing host filesystem.
Cons:
- Linux only. There is no bubblewrap port for macOS or Windows.
- Relies on user namespaces — a kernel namespace bug breaks isolation.
- No built-in resource limits beyond what you wire up via cgroups separately.
- Less battle-tested in adversarial multi-tenant production than Docker.
Use it when: you’re running on Linux, the code you’re executing is trusted-but-isolated (agent tools you control, but you want defense in depth), and cold-start latency matters.
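To make the per-call pattern concrete, here is a hedged sketch of how a gateway might assemble a `bwrap` command line for one tool call. The function name `bwrap_argv`, the `/work` mount point, and the merged-`/usr` host layout are assumptions for illustration; the flags themselves are standard bubblewrap options. The function only builds the argument list, so you can inspect it before handing it to `subprocess.run`.

```python
import shlex


def bwrap_argv(workdir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a bwrap command line for one tool call. Nothing is executed here."""
    argv = [
        "bwrap",
        "--unshare-all",        # unshare every namespace bwrap supports, network included
        "--die-with-parent",    # kill the sandbox if the agent process dies
        "--new-session",        # detach from the controlling terminal
        "--ro-bind", "/usr", "/usr",          # system libraries, read-only
        "--symlink", "usr/bin", "/bin",       # assumes a merged-/usr distribution
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workdir, "/work",           # the tenant's work directory, read-write
        "--chdir", "/work",
        "--clearenv",
        "--setenv", "PATH", "/usr/bin",
    ]
    if allow_network:
        argv.append("--share-net")  # keep the host network namespace after --unshare-all
    return argv + command


print(shlex.join(bwrap_argv("/srv/tenants/a/work", ["python3", "tool.py"])))
```

Seccomp and cgroup limits are deliberately left out of this sketch; bubblewrap takes a compiled seccomp filter through `--seccomp FD`, and resource limits have to come from outside the tool.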
Docker
Docker’s runc runtime (or any OCI-compatible runtime) gives you full container isolation: namespaces, cgroups, seccomp, AppArmor / SELinux, capability dropping, and optional GPU passthrough.
Cold start: 200–500 ms for a typical sandbox image, faster if you keep a warm pool of pre-spawned containers.
Isolation mechanism: kernel namespaces + cgroups + seccomp + AppArmor + capability drops.
Pros:
- Cross-platform. Linux native; macOS and Windows via Docker Desktop / OrbStack, which run the containers inside a Linux VM.
- Mature security ecosystem. Default-deny seccomp profile, AppArmor profiles, gVisor runtime option.
- Battle-tested. Every public container service uses some variant.
- Easy to plug into your existing container infrastructure.
- Supports GPU access if your agent needs ML model inference inside the sandbox.
Cons:
- Slower cold start. Roughly 7–17x bubblewrap, going by the 200–500 ms and ~30 ms figures quoted here.
- Daemon required. Adds an operational dependency.
- Larger audit surface. More features means more potential bugs.
- Resource overhead. Memory + CPU per container.
Use it when: you’re running code from genuinely untrusted sources (user-submitted scripts, plugins from unverified publishers), you need cross-platform support, or you’re already deeply invested in container infrastructure.
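The isolation features listed for Docker map onto `docker run` flags fairly directly. The sketch below shows one plausible mapping; `docker_argv`, the `/work` mount point, and the specific limit values are assumptions rather than a recommended baseline. As with any argv builder, it only constructs the command.

```python
def docker_argv(image, workdir, command, memory_mb=512, cpus=0.5, runtime=None):
    """Build a locked-down `docker run` command line. Nothing is executed here."""
    argv = [
        "docker", "run", "--rm",
        "--network=none",                        # no network unless a tool opts in
        "--cap-drop=ALL",                        # drop every Linux capability
        "--security-opt", "no-new-privileges",   # block setuid privilege escalation
        "--read-only",                           # root filesystem is immutable
        "--tmpfs", "/tmp",                       # scratch space that dies with the container
        "--pids-limit", "128",                   # stop fork bombs
        "--memory", f"{memory_mb}m",
        "--cpus", str(cpus),
        "--volume", f"{workdir}:/work",          # only the tenant's work directory
        "--workdir", "/work",
    ]
    if runtime:
        argv += ["--runtime", runtime]           # e.g. "runsc" when gVisor is installed
    return argv + [image] + list(command)


print(docker_argv("openclaw/sandbox-untrusted:latest", "/srv/tenants/a/work",
                  ["python3", "tool.py"], runtime="runsc"))
```

Docker applies its default seccomp profile automatically; a custom one would be added with `--security-opt seccomp=<path-to-profile.json>`.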
The hybrid pattern
Real deployments use both. OpenClawMU’s default: bubblewrap for the standard agent tool surface (shell, file_read, file_write, package install), Docker for tools explicitly marked as untrusted (custom plugins from unverified ClawHub publishers, user-submitted code).
The choice is per-tool, configurable in the tenant config:
    sandbox:
      default_mode: bwrap
      modes:
        untrusted_code:
          runtime: docker
          image: openclaw/sandbox-untrusted:latest
          memory_limit_mb: 512
          cpu_quota: 0.5
          runtime_class: runsc # gVisor
Tools annotated @sandbox("untrusted_code") get the heavier isolation. Everything else gets the fast bubblewrap path.
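One way such an annotation could be wired up is sketched below. This is not OpenClawMU's actual implementation: the `sandbox` decorator, the `sandbox_mode` attribute, and `resolve_mode` are hypothetical, and the config is written as a Python dict that mirrors the tenant config so the example needs no YAML parser.

```python
from typing import Callable

# Mirrors the tenant config shown in this section.
CONFIG = {
    "default_mode": "bwrap",
    "modes": {
        "untrusted_code": {
            "runtime": "docker",
            "image": "openclaw/sandbox-untrusted:latest",
            "memory_limit_mb": 512,
            "cpu_quota": 0.5,
            "runtime_class": "runsc",
        }
    },
}


def sandbox(mode: str) -> Callable:
    """Tag a tool function with the sandbox mode it must run under."""
    def mark(tool):
        tool.sandbox_mode = mode
        return tool
    return mark


def resolve_mode(tool, config=CONFIG) -> dict:
    """Return the runtime settings for a tool; untagged tools get the default."""
    mode = getattr(tool, "sandbox_mode", None)
    if mode is None:
        return {"runtime": config["default_mode"]}
    # An unknown mode raises KeyError, so a typo fails closed instead of
    # silently falling back to the lighter sandbox.
    return config["modes"][mode]


@sandbox("untrusted_code")
def run_plugin(source: str) -> str: ...


def file_read(path: str) -> str: ...


print(resolve_mode(run_plugin)["runtime"], resolve_mode(file_read)["runtime"])
```

The design choice worth copying is the failure mode: a tool that names a mode missing from the config should refuse to run, not quietly downgrade.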
Cold-start cost in production numbers
A typical agent run issues 3–10 tool calls. At ten calls, bubblewrap adds ~300 ms of total sandboxing overhead, negligible against the 2–5 second LLM response time. Ten cold Docker starts add 2–5 seconds, which roughly doubles the perceived latency.
If you can keep a warm Docker pool, the cold-start cost drops to ~50 ms per call, or ~500 ms over ten calls. That’s a reasonable trade-off for the stronger isolation.
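The arithmetic is simple enough to rerun with your own measurements. A small sketch using the per-call figures quoted in this article (the dictionary labels are just names for this example):

```python
def sandbox_overhead_ms(tool_calls: int, start_ms: float) -> float:
    """Total sandbox start-up cost for one agent run, in milliseconds."""
    return tool_calls * start_ms


# Per-call start-up figures quoted in this article.
START_MS = {
    "bubblewrap": 30,
    "docker (cold, low)": 200,
    "docker (cold, high)": 500,
    "docker (warm pool)": 50,
}

for calls in (3, 10):
    for name, ms in START_MS.items():
        total = sandbox_overhead_ms(calls, ms)
        print(f"{calls:>2} calls  {name:<20} {total:>6.0f} ms")
```

At three calls the gap narrows to 90 ms versus 600–1,500 ms, so the latency argument matters most for agents that chain many tool calls per run.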
Seccomp profiles
Both bubblewrap and Docker accept seccomp profiles that restrict the syscalls a process can issue: Docker takes a JSON profile, bubblewrap a compiled BPF filter passed by file descriptor. A reasonable default for agent code:
- Allow: read, write, openat, close, execve, fork, mmap, brk, exit, futex, clock_gettime, getpid, getuid, getgid (the boring stuff).
- Deny: ptrace, mount, umount2, reboot, kexec_load, _sysctl, perf_event_open (anything that touches the kernel or other processes).
OpenClawMU ships a default-deny seccomp profile that allows the syscalls a typical Python / Node / shell tool needs. Custom profiles are configurable per sandbox mode.
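For Docker, a default-deny profile is a JSON document with a denying `defaultAction` and an explicit allow rule. The sketch below generates one from the allow-list above, using the kernel's syscall names (`execve` for exec, plus `clone` and `exit_group`, which libc typically uses for process creation and exit). The list is illustrative and far too short for a real interpreter, which also needs calls such as `mprotect`, `munmap`, and `rt_sigaction`; it is not the profile OpenClawMU ships.

```python
import json

# Illustrative allow-list only; a real Python or Node tool needs many more syscalls.
ALLOWED = [
    "read", "write", "openat", "close", "execve", "fork", "clone",
    "mmap", "brk", "exit", "exit_group", "futex", "clock_gettime",
    "getpid", "getuid", "getgid",
]


def seccomp_profile(allowed):
    """Build a Docker-format seccomp profile that denies everything not listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",   # unlisted syscalls fail with an error
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


print(json.dumps(seccomp_profile(ALLOWED), indent=2))
```

Written to a file, the result is passed with `docker run --security-opt seccomp=profile.json`. Because the default action already denies, the deny-list entries such as `ptrace` and `mount` need no rule of their own.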
Network policy
Default to no network. Tools that need network access opt in with an allow-list:
    sandbox:
      network:
        default: deny
        allow:
          - "api.weather.gov:443"
          - "*.anthropic.com:443"
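Implementation differs: bubblewrap either shares the host network or, with --unshare-net (an unshare(CLONE_NEWNET) under the hood), runs the tool in an empty network namespace for full isolation; Docker’s --network=none does the same per container. Neither flag understands hostnames, so allow-listing means attaching the sandbox to a restricted network and sending egress through a filtering proxy or firewall rules.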
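Wherever the enforcement point lives, the matching logic for `host:port` entries is small. A minimal sketch, assuming the allow-list format shown in the config above and glob-style wildcards; `is_allowed` is a hypothetical helper, not an OpenClawMU API.

```python
from fnmatch import fnmatchcase

ALLOW = ["api.weather.gov:443", "*.anthropic.com:443"]


def is_allowed(host: str, port: int, allow=ALLOW) -> bool:
    """Default-deny check of one outbound connection against host:port patterns."""
    host = host.lower().rstrip(".")  # normalise case and a trailing root dot
    for entry in allow:
        pattern, _, allowed_port = entry.rpartition(":")
        if int(allowed_port) == port and fnmatchcase(host, pattern.lower()):
            return True
    return False


print(is_allowed("api.anthropic.com", 443))   # subdomain matches the wildcard
print(is_allowed("api.weather.gov", 80))      # right host, wrong port
```

Note that `*` here also spans dots, so `*.anthropic.com` matches `a.b.anthropic.com` but not the bare `anthropic.com`. Matching on hostname alone does not stop a tool from connecting to a raw IP address, so the network namespace still has to block everything that bypasses the proxy.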
When neither is enough
For truly hostile workloads (security research, user-submitted attack payloads), neither bubblewrap nor Docker-on-runc is sufficient. Step up to:
- Docker + gVisor (runsc): a user-space kernel intercepts the sandbox’s syscalls before they reach the host kernel, at ~30% syscall overhead. The pragmatic next step.
- Kata Containers: lightweight VMs as containers. Stronger isolation, heavier cold start.
- Firecracker: AWS’s MicroVM. Used by Lambda. Cold start ~125 ms; very strong isolation.
For most multi-tenant AI gateway use cases, Docker + gVisor is the right ceiling. Beyond that you’re paying overhead you don’t need.
Recommendation
- Linux + trusted-but-isolated workloads: bubblewrap. Fast, simple, well-suited.
- Cross-platform or moderately-untrusted workloads: Docker with default seccomp + cap-drop.
- Genuinely-untrusted (user-submitted code): Docker + gVisor.
- Hostile workloads: Firecracker.
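The list above reduces to a small lookup that a gateway could run per tool. A sketch with hypothetical names (`Trust`, `pick_runtime`) and runtime labels chosen only for this example:

```python
from enum import Enum


class Trust(Enum):
    TRUSTED_ISOLATED = "trusted-but-isolated"
    MODERATE = "moderately-untrusted"
    UNTRUSTED = "genuinely-untrusted"
    HOSTILE = "hostile"


def pick_runtime(trust: Trust, linux_host: bool = True) -> str:
    """Map a tool's trust level to the sandbox runtime recommended above."""
    if trust is Trust.HOSTILE:
        return "firecracker"
    if trust is Trust.UNTRUSTED:
        return "docker+gvisor"
    if trust is Trust.MODERATE or not linux_host:
        return "docker"   # bubblewrap is Linux-only, so other hosts fall back to Docker
    return "bwrap"


for level in Trust:
    print(f"{level.value:<22} -> {pick_runtime(level)}")
```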
Pick per workload, not per cluster. The right choice for one tool isn’t the right choice for all of them.