Running Commands Safely: Agent Execution Isolation
An agent that can run arbitrary commands is powerful but dangerous. Without safeguards, an agent might spawn a compilation that hangs for hours, a database query that locks the entire system, or a command that deletes production data. Safe command execution requires strict timeouts, resource quotas, input validation, and isolated environments. This article covers how to give agents the power to run tests, build projects, and verify code without putting the host at risk.
The Risk: Runaway Processes
Here's what happens when an agent runs an unrestricted command:
Agent: "I'll compile the project and run tests."
Command spawned: gcc main.c -o main (legitimate)
gcc hits a bug: goes into infinite loop consuming memory
5 seconds later: 8 GB RAM consumed, system thrashing
10 seconds later: system becomes unresponsive
Agent: stuck waiting for process that will never return
Result: Manual kill required, system recovery, wasted time and resources
This is why timeouts, resource limits, and graceful termination are non-negotiable.
Strategy 1: Process-Level Timeouts and Limits
Use OS-level resource quotas and process timeouts. On POSIX systems, Python's subprocess and resource modules provide the essentials:
import os
import resource
import signal
import subprocess
from typing import Optional

def run_command_safe(command: str,
                     timeout_seconds: int = 10,
                     max_memory_mb: int = 512,
                     cwd: Optional[str] = None) -> dict:
    """
    Run a command with strict resource limits and timeout (POSIX only).
    Returns: {
        success: bool,
        exit_code: int,
        stdout: str,
        stderr: str,
        timed_out: bool
    }
    If the command is rejected or cannot be started, the dict carries an
    "error" key instead of stdout/stderr.
    """
    # Preprocess command: basic input validation.
    # Substring matching is easy to bypass, so treat it as a backstop only.
    if any(dangerous in command for dangerous in [
        'rm -rf /',
        'dd if=/dev/zero',
        ':(){:|:&};:',  # Bash fork bomb
        '> /dev/',
    ]):
        return {
            "success": False,
            "error": "Command contains dangerous patterns",
            "exit_code": -1
        }

    # Setup: limit memory and CPU time for the subprocess
    def set_limits():
        """Set per-process limits (runs in the child process only)."""
        limit_bytes = max_memory_mb * 1024 * 1024
        # RLIMIT_AS: virtual memory limit in bytes
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
        # RLIMIT_CPU: CPU time in seconds (kill if exceeded)
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_seconds * 2, timeout_seconds * 2))

    try:
        # Spawn the process in its own session so its whole group can be signaled
        proc = subprocess.Popen(
            command,
            shell=True,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            preexec_fn=set_limits,    # Apply limits to child only
            start_new_session=True    # Child becomes a process group leader
        )
        try:
            # Wait with timeout; raises subprocess.TimeoutExpired if exceeded
            stdout, stderr = proc.communicate(timeout=timeout_seconds)
            exit_code = proc.returncode
        except subprocess.TimeoutExpired:
            # With shell=True, proc is the shell; signal the group to reach its children too
            os.killpg(proc.pid, signal.SIGTERM)  # Graceful: ask the processes to exit
            try:
                stdout, stderr = proc.communicate(timeout=2)
            except subprocess.TimeoutExpired:
                os.killpg(proc.pid, signal.SIGKILL)  # Forceful: cannot be caught or ignored
                stdout, stderr = proc.communicate()
            return {
                "success": False,
                "exit_code": -1,
                "stdout": stdout,  # Partial output captured before the kill
                "stderr": f"Command timed out after {timeout_seconds} seconds",
                "timed_out": True
            }
    except Exception as e:
        return {
            "success": False,
            "exit_code": -1,
            "error": str(e),
            "timed_out": False
        }

    # Interpret results
    success = exit_code == 0
    return {
        "success": success,
        "exit_code": exit_code,
        "stdout": stdout,
        "stderr": stderr,
        "timed_out": False
    }
Key safety features:
- Timeout (10s default): The process group is killed if the command exceeds the timeout.
- Memory limit (512 MB default): The child's virtual address space is capped with RLIMIT_AS.
- Input validation: Rejects known dangerous patterns.
- Graceful kill: Sends SIGTERM first, then SIGKILL if the process has not exited within 2 seconds.
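To see the memory cap take effect, here is a minimal sketch, assuming Linux behavior for RLIMIT_AS (other platforms may not enforce it). The helper name run_python_with_memory_cap is hypothetical; it runs a Python snippet in a child process and relies on oversized allocations failing with MemoryError inside the child.

```python
import resource
import subprocess
import sys

def run_python_with_memory_cap(code: str, max_memory_mb: int) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child whose address space is capped (POSIX only)."""
    limit_bytes = max_memory_mb * 1024 * 1024
    # Never try to raise the limit above a hard limit already imposed on us
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    cap = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)

    def set_limits():
        # Runs in the child between fork and exec, so the parent is unaffected
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=30,
        preexec_fn=set_limits,
    )

# A 1 KB allocation fits under the cap; a 2 GiB allocation does not
small = run_python_with_memory_cap("x = bytearray(1024)", max_memory_mb=512)
large = run_python_with_memory_cap("x = bytearray(2 * 1024**3)", max_memory_mb=512)
print("small exit code:", small.returncode)
print("large exit code:", large.returncode)
print("large hit MemoryError:", "MemoryError" in large.stderr)
```

Because RLIMIT_AS counts virtual address space rather than resident memory, programs that reserve large mappings up front can fail under a cap that looks generous, so the limit may need tuning per tool.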
Strategy 2: Containerized Execution (Docker)
For stronger isolation, run commands in containers. Each agent command gets a fresh, sandboxed container:
import docker
import requests

class ContainerExecutor:
    """Run agent commands in isolated Docker containers."""

    def __init__(self, image: str = "python:3.11-slim"):
        self.client = docker.from_env()
        self.image = image
        self.pull_image()

    def pull_image(self):
        """Ensure the container image is available."""
        try:
            self.client.images.get(self.image)
        except docker.errors.ImageNotFound:
            print(f"Pulling {self.image}...")
            self.client.images.pull(self.image)

    def run_command(self, command: str,
                    timeout_seconds: int = 15,
                    memory_limit: str = "512m") -> dict:
        """
        Run a command in a fresh container.
        """
        container = None
        try:
            # Create container with strict limits
            container = self.client.containers.run(
                self.image,
                command,
                detach=True,
                mem_limit=memory_limit,        # Memory limit
                cpuset_cpus="0",               # Pin to a single CPU core
                network_disabled=True,         # No network access
                read_only=True,                # Filesystem read-only
                tmpfs={'/tmp': 'size=100m'},   # Temp space (in-memory)
                environment={'PATH': '/usr/local/bin:/usr/bin'}
            )
            # Wait for container to finish or timeout
            try:
                exit_code = container.wait(timeout=timeout_seconds)['StatusCode']
            except (requests.exceptions.ReadTimeout,
                    requests.exceptions.ConnectionError):
                # wait() reports a timeout as a requests exception, not docker.errors.APIError
                container.kill()
                return {
                    "success": False,
                    "error": f"Container timed out after {timeout_seconds}s",
                    "exit_code": -1,
                    "timed_out": True
                }
            # Retrieve output
            logs = container.logs(stdout=True, stderr=True).decode('utf-8', errors='replace')
            return {
                "success": exit_code == 0,
                "exit_code": exit_code,
                "output": logs,
                "timed_out": False
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "exit_code": -1
            }
        finally:
            # Cleanup on every path, including timeouts and errors
            if container is not None:
                try:
                    container.remove(force=True)
                except docker.errors.APIError:
                    pass
Advantages:
- Strong filesystem isolation: with a read-only root and no host volumes mounted, writes inside the container do not reach the host.
- No network access (prevents exfiltration).
- Resource limits are enforced by the container runtime.
- Clean startup/shutdown (containers are ephemeral).
Trade-offs:
- Docker daemon required (not available in all environments).
- Slightly slower (container startup overhead ~1–2s).
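The runtime-enforced limits can be tightened further. The sketch below collects a stricter set of keyword arguments for docker-py's containers.run(); the function name hardened_container_options and the specific values are assumptions for illustration, and the "nobody" user is assumed to exist in the image.

```python
def hardened_container_options(memory_limit: str = "512m",
                               tmpfs_size: str = "100m",
                               max_processes: int = 64) -> dict:
    """Build keyword arguments for docker-py's containers.run()."""
    return {
        "detach": True,
        "mem_limit": memory_limit,
        "memswap_limit": memory_limit,           # Same as mem_limit: no extra swap
        "pids_limit": max_processes,             # Caps process count (blunts fork bombs)
        "cap_drop": ["ALL"],                     # Drop all Linux capabilities
        "security_opt": ["no-new-privileges"],   # Block privilege escalation via setuid
        "user": "nobody",                        # Assumes this user exists in the image
        "network_disabled": True,
        "read_only": True,
        "tmpfs": {"/tmp": f"size={tmpfs_size}"},
    }

options = hardened_container_options()
print(sorted(options))
# Intended use (requires a Docker daemon):
#   client.containers.run(image, command, **hardened_container_options())
```

Keeping the options in one function makes the sandbox policy reviewable in a single place, separate from the code that starts and waits on containers.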
Strategy 3: Input Validation and Command Allowlisting
Restrict agents to an allowlist of safe commands:
import shlex
from typing import Optional

ALLOWED_COMMANDS = {
    "pytest": ["pytest", "--tb=short", "--timeout=10"],  # --timeout requires the pytest-timeout plugin
    "lint": ["pylint", "--disable=all", "--enable=E"],
    "build": ["python", "setup.py", "build"],
    "test": ["python", "-m", "unittest", "discover"],
    "format": ["black", "--check"],
}

def run_allowlisted_command(command_name: str, args: Optional[list] = None) -> dict:
    """Run only pre-approved commands."""
    if command_name not in ALLOWED_COMMANDS:
        return {
            "success": False,
            "error": f"Command '{command_name}' is not allowed. Available: {list(ALLOWED_COMMANDS.keys())}"
        }
    # Merge base command with agent-provided args (with validation)
    base_cmd = ALLOWED_COMMANDS[command_name]
    if args:
        # Validate args: no shell metacharacters, no relative path traversal
        for arg in args:
            if any(c in arg for c in ['$', '`', '|', '&', ';', '>', '<', '\\']):
                return {
                    "success": False,
                    "error": f"Arg '{arg}' contains invalid characters"
                }
            if '..' in arg:
                return {
                    "success": False,
                    "error": f"Arg '{arg}' contains directory traversal"
                }
        command = base_cmd + args
    else:
        command = base_cmd
    # shlex.join quotes each argument, so spaces or quotes cannot split into extra shell words
    return run_command_safe(shlex.join(command), timeout_seconds=15)
Advantages:
- Agents can only run what you explicitly permit.
- Clear, auditable list of agent capabilities.
- Easy to revoke or modify permissions.
Limitation:
- Requires anticipating agent needs upfront.
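Rejecting a handful of bad characters leaves room for ones nobody thought of, so a stricter variant accepts only arguments that match a known-safe pattern. The following is a sketch under assumed policy choices: SAFE_ARG, validate_args, and build_command_line are hypothetical names, and the permitted character set is an example rather than a recommendation.

```python
import re
import shlex
from typing import List, Optional

# Hypothetical policy: letters, digits, and a few path/option characters only
SAFE_ARG = re.compile(r'[A-Za-z0-9_./=:-]+')

def validate_args(args: List[str]) -> Optional[str]:
    """Return an error message for the first bad argument, or None if all pass."""
    for arg in args:
        if not SAFE_ARG.fullmatch(arg):
            return f"Arg '{arg}' contains characters outside the allowed set"
        if arg.startswith('/'):
            return f"Arg '{arg}' is an absolute path"
        if '..' in arg.split('/'):
            return f"Arg '{arg}' contains directory traversal"
    return None

def build_command_line(base_cmd: List[str], args: List[str]) -> str:
    """Validate args, then quote every word for safe use with a shell."""
    error = validate_args(args)
    if error is not None:
        raise ValueError(error)
    return shlex.join(base_cmd + args)

print(build_command_line(["pytest", "--tb=short"], ["tests/unit", "-k", "login"]))
print(validate_args(["tests; rm -rf /"]))
print(validate_args(["../secrets"]))
```

Arguments that begin with a dash are still passed through here, so an agent can supply tool options; if that is too permissive, the allowed options can be enumerated per command as well.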
Strategy 4: Capturing and Parsing Output
Agents need to understand command results. Capture structured output:
import re

def run_command_with_parsing(command: str, parser_type: str = "python_unittest") -> dict:
    """Run command and parse output into structured format."""
    result = run_command_safe(command, timeout_seconds=20)
    # Skip parsing only when there is no output to parse (rejected, crashed, or timed out).
    # A non-zero exit code is normal when tests fail, and that output still needs parsing.
    if result.get("timed_out") or "stdout" not in result:
        return result
    # Parse output based on command type
    if parser_type == "python_unittest":
        result["parsed"] = parse_unittest_output(result["stderr"])
    elif parser_type == "pytest":
        result["parsed"] = parse_pytest_output(result["stdout"])
    elif parser_type == "lint":
        result["parsed"] = parse_lint_output(result["stdout"])
    return result

def parse_unittest_output(output: str) -> dict:
    """Extract test results from unittest's summary (written to stderr)."""
    tests = {"ran": 0, "failed": 0, "errors": 0}
    patterns = {"ran": r'Ran (\d+) tests?', "failed": r'failures=(\d+)', "errors": r'errors=(\d+)'}
    for key, pattern in patterns.items():
        match = re.search(pattern, output)
        if match:
            tests[key] = int(match.group(1))
    return tests

def parse_pytest_output(output: str) -> dict:
    """Extract test results from pytest output."""
    lines = output.split('\n')
    tests = {"passed": 0, "failed": 0, "skipped": 0, "errors": []}
    for line in lines:
        # The summary line looks like "1 failed, 2 passed, 1 skipped in 0.12s"
        for key in ("passed", "failed", "skipped"):
            match = re.search(rf'(\d+) {key}', line)
            if match:
                tests[key] = int(match.group(1))
        if line.startswith("FAILED"):
            tests["errors"].append(line)
    return tests

def parse_lint_output(output: str) -> dict:
    """Extract linting issues."""
    issues = []
    for line in output.split('\n'):
        if line.strip():
            # Format: filepath:line: [code] message
            # (pylint's default format differs, so this assumes a matching output format is configured)
            match = re.match(r'([^:]+):(\d+):\s*\[(.+?)\]\s*(.+)', line)
            if match:
                issues.append({
                    "file": match.group(1),
                    "line": int(match.group(2)),
                    "code": match.group(3),
                    "message": match.group(4)
                })
    return {"issues": issues, "count": len(issues)}
This allows agents to reason about test results:
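result = run_command_with_parsing("pytest tests/", parser_type="pytest")
parsed = result.get("parsed")
if parsed is None:
    # No parsed results (for example, the command timed out or was rejected)
    agent.loop("The test run produced no parsed results; let me inspect the raw output...")
elif parsed["failed"] > 0:
    agent.loop("Some tests failed; let me fix the code...")
else:
    agent.loop("All tests passed; task complete!")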
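Regex parsing of console output is brittle across tool versions. One alternative, sketched here as an assumption rather than the article's method, is to have pytest write a JUnit-style report with its --junitxml option and read the counts from the XML. The function name parse_junit_xml and the sample report are illustrative only.

```python
import xml.etree.ElementTree as ET

def parse_junit_xml(xml_text: str) -> dict:
    """Summarize a JUnit-style XML test report."""
    root = ET.fromstring(xml_text)
    # Reports may have a <testsuites> wrapper or a bare <testsuite> root
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    # Collect the names of test cases that contain a <failure> or <error> child
    failed_tests = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed_tests.append(f"{case.get('classname')}::{case.get('name')}")
    totals["passed"] = (totals["tests"] - totals["failures"]
                        - totals["errors"] - totals["skipped"])
    totals["failed_tests"] = failed_tests
    return totals

# Illustrative report, hand-written for this example
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" tests="3" failures="1" errors="0" skipped="0">
    <testcase classname="tests.test_math" name="test_add"/>
    <testcase classname="tests.test_math" name="test_sub"/>
    <testcase classname="tests.test_math" name="test_div">
      <failure message="ZeroDivisionError">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>"""

summary = parse_junit_xml(SAMPLE_REPORT)
print(summary)
```

In a read-only container the report file would need to be written under a writable path such as the tmpfs mount.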
Exit Code Semantics
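Agents need to understand exit codes. Define a convention and map wrapper-level failures onto it, rather than returning -1 for every timeout and internal error. The values below follow common shell conventions: 124 is what GNU timeout returns, and 137 is 128 + 9 (SIGKILL).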
| Exit Code | Meaning |
|---|---|
| 0 | Success; command did what was asked. |
| 1 | Failure; business logic error (test failed, lint found issues). |
| 124 | Timeout; command exceeded time limit. |
| 137 | Killed by SIGKILL (128 + 9); usually out of memory or another resource limit exceeded. |
| 255 | Unknown error. |
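A small adapter can translate executor results into the codes in this table. This is a sketch under assumptions: the function name normalize_exit_code is hypothetical, the input is a result dict with the timed_out, error, and exit_code keys used in the examples above, and any code the table does not list is reported as 255.

```python
EXIT_SUCCESS = 0
EXIT_FAILURE = 1
EXIT_TIMEOUT = 124
EXIT_KILLED = 137
EXIT_UNKNOWN = 255

SIGKILL = 9  # POSIX signal number for SIGKILL

def normalize_exit_code(result: dict) -> int:
    """Map an executor result dict onto the exit code convention."""
    if result.get("timed_out"):
        return EXIT_TIMEOUT
    code = result.get("exit_code")
    if "error" in result or code is None:
        return EXIT_UNKNOWN
    # Python's subprocess reports death by signal N as -N; shells and Docker report 128 + N
    if code in (-SIGKILL, 128 + SIGKILL):
        return EXIT_KILLED
    if code in (EXIT_SUCCESS, EXIT_FAILURE):
        return code
    return EXIT_UNKNOWN

print(normalize_exit_code({"exit_code": -1, "timed_out": True}))
print(normalize_exit_code({"exit_code": -9, "timed_out": False}))
print(normalize_exit_code({"exit_code": 1, "timed_out": False}))
```

Tools with richer exit codes lose detail under this mapping, so keeping the raw exit_code alongside the normalized value lets the agent consult both.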
Key Takeaways
- Always use timeouts (10–20s default) to prevent runaway processes.
- Enforce memory limits (512 MB is often enough for tests/builds).
- Validate command input; reject dangerous patterns.
- Use containers (Docker) for strong isolation in production.
- Parse command output so agents can understand results and adapt.
- Define clear exit code semantics for agent reasoning.
Frequently Asked Questions
What if an agent needs to run a long compilation?
Increase the timeout (e.g., 60 seconds for large projects) or restructure the task. Agents should run fast tests, not full builds. For compilation, offload to a separate CI system.
Can an agent access the network?
By default, no. Set network_disabled=True in Docker containers, and don't pass network tools to agents. If an agent needs network (e.g., to fetch dependencies), use a container with a proxy/firewall that allows only specific domains.
How do I debug a command that failed?
Capture full stderr and stdout, and return it to the agent with a clear error message. The agent can then re-attempt with modifications or ask a human for help.
What if a command creates many files?
Use tmpfs (an in-memory filesystem) with a size limit. Writes fail once the limit is reached, which prevents disk exhaustion on the host.