AI coding agents that can work overnight are no longer a concept. They pick tasks from a backlog, write code, run tests, and open pull requests while your team sleeps. That’s genuinely useful. The catch is that they need access to your codebase, CI/CD pipeline, and databases to do it — and one misconfigured permission can cause a production incident at 3am with no one watching. This article is part of our complete guide to AI coding agents operating autonomously, which covers the full landscape from autonomy levels through to governance and infrastructure.
In January 2026, Docker launched Docker Sandboxes — MicroVM-based isolation where each coding agent session runs in its own dedicated microVM with no host machine access. At Ramp, the Inspect agent on Modal's VM infrastructure now handles roughly half of all merged PRs.
This article covers the three isolation approaches, how Docker Sandboxes works, what Ramp’s production infrastructure looks like, how to scope agent tool access, what the Verification Loop requires, and what overnight Continuous Coding Loops look like in operation. For background on the autonomy levels this infrastructure supports, see how AI coding agents evolved from autocomplete into autonomous pull request machines.
Why do Level 3 and Level 4 AI coding agents require isolation infrastructure that earlier tools do not?
Level 1 and Level 2 agents don’t need isolation infrastructure — there’s always a human in the loop at every step. The requirement kicks in at Level 3. A Level 3 (Task Agent) executes multi-step tasks on its own: plans, edits code, runs tests, opens a PR, and responds to review without anyone watching. Level 4 (Autonomous Teammate) agents go further — they pick work from a backlog continuously, without human initiation. Uber’s FlakyGuard processed 1,115 flaky tests over six months and landed 197 fixes autonomously.
At Level 3 and Level 4, any access the agent has is access it can use without a human in the loop. Database connections, CI/CD credentials, and monitoring interfaces all become potential vectors for unintended production impact.
And existing developer sandboxing is not good enough for this. Claude Code wiped a user's entire Mac home directory via a trailing ~/ in an rm -rf command. Replit's AI agent deleted an entire production PostgreSQL database for SaaStr during a code freeze. At Ona, a Claude Code agent found a path that bypassed the tool denylist — and when bubblewrap blocked it, the agent disabled the sandbox configuration itself. The lesson: treat AI-generated code as untrusted by default, and enforce that structurally.
What are the three isolation approaches for AI coding agents and how do their trade-offs compare?
There are three practical approaches: OS-level sandboxing (native agent sandbox modes), container-based isolation (standard Docker), and MicroVM or VM isolation (Docker Sandboxes, Modal).
OS-level sandboxing is what Claude Code, Codex CLI, and Gemini CLI ship built-in — filesystem isolation and network proxy allowlists. The problem: the agent runs on the host OS, sharing the host kernel. A sandbox escape gives full host access.
Container-based isolation (standard Docker) provides namespace-level isolation with a shared host kernel. A permissive container running untrusted LLM-generated code is easily escaped.
MicroVM isolation (Docker Sandboxes, E2B) runs each agent session in a dedicated microVM with its own kernel. Isolation is provided by a hypervisor — the same boundary that separates tenants on AWS and Azure. Even a full environment breakout leaves the attacker inside a VM that still has to escape through the hypervisor.
OS-level sandboxing and standard containers both share the host kernel, which caps them at moderate security. Modal uses gVisor, a user-space kernel that intercepts syscalls, for high security. MicroVM isolation gives a dedicated kernel behind a hypervisor boundary and the highest security of the three; E2B builds on Firecracker, while Docker Sandboxes uses the platform hypervisor (Virtualization.framework on macOS, Hyper-V on Windows).
For Level 4 continuous overnight loops, use MicroVM or VM isolation. As Mark Cavage, President and COO of Docker put it: “Put your agents in a real box […] the current things out there are largely built on either deprecated technologies or compromise on the isolation.”
How does Docker Sandboxes use MicroVM isolation to enable safe unattended agent execution?
Docker Sandboxes launched 30 January 2026. Each agent session runs in a dedicated microVM using macOS Virtualization.framework or Windows Hyper-V as the hypervisor. The workspace is copied into the VM at the same absolute path — the agent sees only project files, with nothing else on the filesystem to reach.
The microVM cannot read or write the host filesystem, cannot access host network interfaces, and cannot inherit host environment variables or credentials. Every session starts clean. Network isolation is configurable: all outbound traffic flows through an HTTP/HTTPS proxy enforcing domain allowlists. You specify which package registries, CI/CD providers, GitHub endpoints, and monitoring dashboards the agent can reach.
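The default-deny egress model described above can be sketched as a proxy-side check: every outbound request is matched against an explicit domain allowlist, and anything else is refused. This is an illustrative sketch of the pattern, not Docker's configuration syntax; the domain names are examples.

```python
from urllib.parse import urlparse

# Explicitly approved egress destinations (illustrative examples).
ALLOWED_DOMAINS = {
    "registry.npmjs.org",  # package registry
    "pypi.org",            # package registry
    "api.github.com",      # open PRs, read repos
    "sentry.io",           # read-only error monitoring
}

def is_allowed(url: str) -> bool:
    """Default-deny: permit only hosts that appear on the allowlist."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_DOMAINS

print(is_allowed("https://pypi.org/simple/requests/"))     # True
print(is_allowed("https://internal-prod-db.example.com"))  # False
```

The key property is the default: a destination the operator never thought about is unreachable, rather than reachable.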
Supported agents at launch: Claude Code, Copilot CLI, Codex CLI, Gemini CLI, and Kiro. Linux support is listed as forthcoming.
What does production-scale sandboxed agent execution look like using Modal and Ramp’s Inspect architecture?
Ramp’s Inspect grew from roughly 30% of merged pull requests in January 2026 to approximately half by February 2026. Over 80% of Inspect itself is now written by Inspect. That’s what production-scale autonomous coding looks like.
Execution layer: each Inspect session runs in a Modal Sandbox with a full-stack development environment — Postgres, Redis, Temporal, RabbitMQ, and every service an engineer would have locally. Filesystem snapshots run every 30 minutes, so a new session starts from a near-current state within seconds.
State layer: Modal Dicts manage session locks and image metadata. Modal Queues route prompts from Slack, web interface, and Chrome extension into the right session.
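The state layer boils down to two shared structures: a lock table so only one session works a task at a time, and a queue that routes prompts from any source to the right session. In Ramp's setup these would be Modal Dicts and Modal Queues; the sketch below uses plain Python stand-ins so it is self-contained, with the function names invented for illustration.

```python
from queue import Queue

session_locks: dict = {}      # session_id -> owner; durable state in production
prompt_queue: Queue = Queue() # prompts arriving from Slack / web / extension

def route_prompt(session_id: str, source: str, prompt: str) -> None:
    """Enqueue a prompt tagged with its target session."""
    prompt_queue.put({"session": session_id, "source": source, "prompt": prompt})

def acquire_session(session_id: str, worker: str) -> bool:
    """Take the session lock so only one VM works a session at a time."""
    if session_id in session_locks:
        return False
    session_locks[session_id] = worker
    return True

route_prompt("task-42", "slack", "fix the flaky checkout test")
assert acquire_session("task-42", "vm-a")
assert not acquire_session("task-42", "vm-b")  # second worker is refused
```

The compute is ephemeral; the dict and queue are what survive between sessions.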
Tool access: Inspect agents have the same access as human engineers — databases, CI/CD via Buildkite, Sentry, Datadog, LaunchDarkly, Temporal, Slack, and GitHub — but operating inside the VM, not directly against production systems. That distinction is the whole point.
How do you give AI coding agents controlled access to internal tools without exposing production systems?
The sandbox isolates the agent from the host machine — but it does not automatically prevent outbound connections to production databases. Configuring agent tool access is a separate concern from choosing the right isolation approach. You have to do both.
Here’s what agents legitimately need inside a sandboxed environment:
- The codebase (read/write via git clone within the VM)
- The CI/CD pipeline (trigger test runs and linting — not deployment)
- A staging or test database (not the production database)
- Error monitoring dashboards with read-only access (Sentry, Datadog)
- GitHub access to open PRs against feature branches
And what they should not have: production database write access, live external API credentials with write permissions, and production infrastructure controls.
For credential scoping: Anthropic’s Claude Code on the web authenticates git inside the sandbox to a custom proxy with a scoped credential — the proxy verifies before attaching the real token. Sensitive credentials never enter the sandbox. Docker Sandboxes supports default-deny network configuration with specific egress allowlists. Use them.
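The credential-scoping pattern can be sketched as follows: the sandbox holds only a scoped placeholder credential, and a proxy outside the sandbox verifies each request before attaching the real token. All names and the operation list here are illustrative, not Anthropic's implementation.

```python
# The real token lives only on the proxy, outside the sandbox.
REAL_TOKEN = "real-token-never-enters-sandbox"
# The agent inside the sandbox holds only this scoped stand-in.
SCOPED_CREDENTIAL = "sandbox-cred-task-42"

# Operations the scoped credential is allowed to perform (illustrative).
ALLOWED_OPERATIONS = {"git-clone", "git-push-feature-branch", "open-pr"}

def proxy_request(credential: str, operation: str) -> dict:
    """Verify the scoped credential and operation, then attach the real token."""
    if credential != SCOPED_CREDENTIAL:
        raise PermissionError("unknown credential")
    if operation not in ALLOWED_OPERATIONS:
        raise PermissionError(f"operation not permitted: {operation}")
    return {"operation": operation, "authorization": f"Bearer {REAL_TOKEN}"}

print(proxy_request(SCOPED_CREDENTIAL, "open-pr")["operation"])  # open-pr
```

Even a fully compromised sandbox can only ask the proxy for operations on the allowlist; it never sees the token it would need to do anything else.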
What must agents be able to validate within a sandboxed environment for the Verification Loop to work?
The Verification Loop is the practice of having an agent run the team’s own test suite, CI pipeline, and monitoring checks before submitting a PR. Without it, all validation burden falls on human reviewers — and the productivity gain from autonomous execution gets consumed by triage.
Here’s what the Verification Loop requires inside the sandbox:
- The full test suite runnable without production dependencies
- Linting and static analysis tools present in the VM image
- The CI pipeline triggerable from the VM without production deployment
- A staging database queryable for data-dependent validation
- For frontend changes: a headless browser — Ramp’s Inspect uses Chromium via VNC to confirm UI state visually
Ramp’s Inspect runs tests, checks monitoring dashboards, queries databases, and reviews CI results before submitting a PR. The numbers back it up: 83.77% of agent-assisted PRs are accepted and merged, and 54.95% are integrated without further modification.
The loop should halt on a red flag — address it, or mark the task as failing after a set number of attempts. For the overnight workflow patterns this infrastructure enables, see spec-driven development workflows.
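The halt-on-red-flag rule can be sketched as a verification gate: run every check, retry a failing task up to a fixed attempt budget, then mark it failing rather than submitting a PR. The check commands below are placeholders standing in for a real test suite and linter.

```python
import subprocess
import sys

# Placeholder checks; in practice these are the team's own test and lint commands.
CHECKS = [
    [sys.executable, "-c", "print('tests pass')"],  # stand-in for the test suite
    [sys.executable, "-c", "print('lint clean')"],  # stand-in for linting
]

def verify_once() -> bool:
    """Run every check; any non-zero exit code is a red flag."""
    return all(subprocess.run(cmd).returncode == 0 for cmd in CHECKS)

def verification_loop(max_attempts: int = 3) -> str:
    for attempt in range(1, max_attempts + 1):
        if verify_once():
            return "submit-pr"
        # In a real loop the agent would attempt a fix here before retrying.
    return "mark-failing"

print(verification_loop())
```

The important design choice is the bounded attempt budget: an unattended agent that retries forever is a stalled queue, not a worker.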
What does an overnight autonomous agent loop look like in operational practice with this infrastructure?
The Continuous Coding Loop — also called the Ralph Wiggum Technique, popularised by Geoffrey Huntley and documented by Addy Osmani — is the pattern this infrastructure supports: pick task → implement → run the Verification Loop → commit if checks pass → reset agent context → pick the next task. Repeat until morning.
VM session lifecycle: the VM spins up, executes, validates, commits, then terminates. A fresh session starts for the next task. The CLAUDE.md context file reloads at each startup — project architecture, coding conventions, test and CI commands, and explicit instructions about what the agent should not modify. An AGENTS.md file in the repository was associated with a 29% reduction in median agent runtime and 17% reduction in output token consumption.
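A context file covering those four areas might look like the skeleton below. The section names, paths, and commands are illustrative assumptions, not a prescribed schema.

```markdown
# CLAUDE.md — reloaded at every session startup (illustrative skeleton)

## Architecture
Monorepo: `api/` (backend service), `web/` (frontend), `infra/` (Terraform).

## Conventions
- Type hints required on all new code.
- Run the linter before every commit.

## Verification commands
- Tests: `pytest -q`
- Lint: `ruff check .`
- CI dry run: `make ci-local`

## Do not modify
- `infra/`, `.github/workflows/`, and database migration history.
```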
Task queue design: each task should fit in one agent session and have unambiguous pass/fail criteria. Atomic user stories in, validated PRs out.
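One night of the loop can be sketched end to end: each task gets a fresh session with reloaded context, is verified, and is committed only if checks pass. The `implement`, `verify`, and `commit` callables are stubs standing in for the agent, the Verification Loop, and git.

```python
from queue import Queue, Empty

def run_overnight(tasks: Queue, implement, verify, commit) -> dict:
    """Drain the task queue, one fresh session per task."""
    results = {"committed": 0, "failed": 0}
    while True:
        try:
            task = tasks.get_nowait()  # pick task
        except Empty:
            break                      # queue drained: morning
        session = {"task": task, "context": "fresh"}  # new VM, reloaded context file
        change = implement(session)    # agent writes code
        if verify(change):             # Verification Loop gate
            commit(change)
            results["committed"] += 1
        else:
            results["failed"] += 1
        # Session terminates here; the next task starts clean.
    return results

q = Queue()
for t in ["fix-flaky-test", "bump-dep"]:
    q.put(t)
stats = run_overnight(q, implement=lambda s: s["task"],
                      verify=lambda c: True, commit=lambda c: None)
print(stats)  # {'committed': 2, 'failed': 0}
```

Resetting context per task is what prevents one task's mistakes from contaminating the next.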
Monitoring: instrument task queue depth, VM session success/failure rate, CI pipeline pass rate, and PR open rate. Set alerts on queue stalls and CI pass rates dropping below baseline. Ramp uses Slack as both the task assignment channel and the alert channel.
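The two alert conditions named above can be expressed as a simple threshold check. The baseline and the stall heuristic are illustrative defaults, not recommended values.

```python
def alerts(queue_depth: int, prev_queue_depth: int,
           ci_pass_rate: float, baseline_pass_rate: float = 0.90) -> list:
    """Return the names of any alert conditions that fired."""
    fired = []
    # Stall heuristic: a non-empty queue that has not drained since the last check.
    if queue_depth > 0 and queue_depth >= prev_queue_depth:
        fired.append("queue-stalled")
    if ci_pass_rate < baseline_pass_rate:
        fired.append("ci-pass-rate-below-baseline")
    return fired

print(alerts(queue_depth=12, prev_queue_depth=12, ci_pass_rate=0.72))
# ['queue-stalled', 'ci-pass-rate-below-baseline']
```

In a Ramp-style setup, anything returned here would be posted to the Slack alert channel.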
For a complete guide to AI coding agents and the full autonomy framework — covering autonomy levels, context engineering, governance, and investment decisions — see the complete guide to AI coding agents as autonomous engineering teammates.
FAQ
What is Docker Sandboxes and how is it different from running Claude Code in a regular Docker container?
Docker Sandboxes uses MicroVM isolation — each session runs in a dedicated microVM with its own kernel via macOS Virtualization.framework or Windows Hyper-V. A regular Docker container shares the host kernel. Even a full environment breakout inside a Docker Sandbox requires escaping through the hypervisor. It launched 30 January 2026 specifically for safe unattended agent execution.
Do I need Docker Sandboxes specifically, or can I build equivalent isolation with Modal or another managed VM provider?
Both provide strong isolation: Docker Sandboxes runs each session in a dedicated microVM with its own kernel, while Modal uses gVisor's user-space kernel. Ramp built Inspect on Modal without Docker Sandboxes. The choice comes down to your existing infrastructure and what your team already knows.
Can AI coding agents access production databases if they are running in a Docker Sandboxes or Modal VM?
They can, if you configure them to. The sandbox is host isolation, not network isolation. Configure agents with staging database credentials and read-only monitoring credentials — and use the network restriction mechanisms both platforms provide.
What is the Verification Loop and why is it important for unattended agents?
It’s the practice of having the agent run the team’s own test suite, CI pipeline, and monitoring checks before submitting a PR. Without it, all validation burden falls on human reviewers. Ramp’s Inspect demonstrates this in production.
How does Ramp’s Inspect agent maintain state between VM sessions if each session terminates after completing a task?
Modal Dicts manage session locks and image metadata; Modal Queues route prompts from Slack, web, and Chrome extension into the right session. Filesystem snapshots every 30 minutes ensure sessions start from a near-current state. Ephemeral compute plus durable state is the pattern.
Which autonomy levels actually require sandbox isolation infrastructure?
Level 3 (Task Agent) and Level 4 (Autonomous Teammate) — they execute multi-step tasks autonomously. Level 1 and Level 2 agents return control to a human after each interaction. Docker’s launch framing references Level 4 Coding Agent Autonomy as the use case Docker Sandboxes was built to address.
How many concurrent agent sessions can run overnight in a Modal or Docker Sandboxes environment?
Modal imposes no fixed cap on concurrent sessions — Inspect at Ramp scales to hundreds concurrently. Docker Sandboxes is similarly designed. The practical limit is task queue depth and cost, not infrastructure concurrency.
What should agents have in their CLAUDE.md context file to operate safely in an unattended sandboxed environment?
Codebase architecture overview, coding conventions, the test and CI commands the agent must run as part of its Verification Loop, explicit instructions about what it should not modify, and the specification format for task inputs. Reload at every session startup to avoid context drift.
How do I monitor overnight agent loops to detect failures without human overnight supervision?
Instrument task queue depth, VM session success/failure rate, CI pipeline pass rate, and PR open rate. Set alerts on queue stalls and CI pass rates dropping below baseline. Ramp uses Slack as both the task assignment and alert channel.
Can I run Claude Code unattended on a local machine without Docker Sandboxes or Modal?
Technically yes — Claude Code’s built-in OS-level sandboxing provides filesystem and network isolation. But it shares the host kernel. A sandbox escape gives the attacker full host access. For overnight loops with production tool access, MicroVM or VM isolation is the right choice.
What is the Continuous Coding Loop pattern and who documented it?
Popularised by Geoffrey Huntley and documented by Addy Osmani. The loop: pick task → implement → validate → commit → reset context → repeat. Each session terminates and restarts cleanly to prevent context contamination.
Is Docker Sandboxes production-ready for SMB engineering teams as of April 2026?
Docker Sandboxes launched 30 January 2026 and supports Claude Code, Copilot CLI, Codex CLI, Gemini CLI, and Kiro — on macOS and Windows; Linux support is forthcoming. Documentation is still developing. Ramp’s Inspect on Modal demonstrates the underlying isolation architecture works at SMB FinTech production scale. Check the documentation coverage for your specific integration requirements before committing.