Suspend vs. Snapshot: Pause a Sandbox or Save It for Reuse?

A big value proposition for using sandboxes to deploy agents is statefulness. A stateful sandbox preserves its file system state and the in-memory state of its execution environment. Stateful sandboxes can retain an agent's memory or preserve the artifacts created during a session. Statefulness comes in degrees: some providers snapshot only the file system, while others snapshot the file system plus sandbox memory. Some vendors also impose constraints on snapshotting, such as requiring you to stop a sandbox before you can snapshot it.
Tensorlake sandboxes offer dynamic capabilities for snapshotting and checkpointing. You can snapshot a running VM to create a point-in-time snapshot of the whole VM, or you can suspend a VM, which snapshots it and also puts the sandbox to sleep. We often get asked when to snapshot and when to suspend a sandbox.
The real split: just-in-time checkpoint vs hibernating idle sandboxes
Suspend is about compute. You have one sandbox, it’s idle, and you want to pause it in place without losing its state or paying for compute. Later, you resume the same sandbox under the same sandbox ID and it picks up where it left off.
Snapshot is about making checkpoints of a running sandbox. You capture the full state of a running sandbox (filesystem, memory, and running processes) into a reusable artifact that outlives the source. You can restore that artifact into a new sandbox: once, or a hundred times, today or next month.
Suspend is a pause button. Snapshot is a save file.

Everything downstream (API surface, pricing, which failures are recoverable) follows from this split.
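To make the split concrete, here is a toy in-memory model. It is only a sketch: ToySandbox and restore are hypothetical names invented for illustration, not part of any provider's SDK, and a plain dict stands in for filesystem, memory, and processes. It shows the two properties that matter: suspend and resume keep the same sandbox ID, while a snapshot is an independent artifact that can seed any number of new sandboxes.

```python
import copy
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)


@dataclass
class ToySandbox:
    """Hypothetical in-memory stand-in for a sandbox (not a real SDK object)."""
    state: dict = field(default_factory=dict)
    sandbox_id: str = field(default_factory=lambda: f"sbx-{next(_ids)}")
    running: bool = True

    def suspend(self) -> None:
        # Pause in place: identity and state stay, compute stops.
        self.running = False

    def resume(self) -> None:
        # Same sandbox ID, same state, running again.
        self.running = True

    def snapshot(self) -> dict:
        # Copy the state into an artifact independent of this sandbox.
        return copy.deepcopy(self.state)


def restore(snapshot: dict) -> ToySandbox:
    """Create a brand-new sandbox (new ID) seeded from a snapshot."""
    return ToySandbox(state=copy.deepcopy(snapshot))


agent = ToySandbox(state={"step": 3})
original_id = agent.sandbox_id

agent.suspend()
agent.resume()
same_identity = agent.sandbox_id == original_id  # pause button: same sandbox

snap = agent.snapshot()
agent.state["step"] = 4  # the source keeps moving after the snapshot
clones = [restore(snap) for _ in range(3)]  # save file: many new sandboxes
```

Each clone starts from the state captured at snapshot time, regardless of what the source sandbox did afterwards.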
How sandbox providers ship it
Both operations are technically hard. Freezing a running sandbox (memory, process tree, open file descriptors) and bringing it back cleanly is non-trivial. Turning that frozen state into a durable, portable artifact is harder still. Not every provider has built both paths.
Here’s what’s shipping today (April 2026):
A few things that stood out to me when putting this together:
- Only some providers preserve running processes and memory. Others call the operation “snapshot” but only capture the filesystem, and you’ll only notice when a restored sandbox comes back missing its in-flight processes.
- Several providers ship one side of the split but not the other. Where the memory-preserving path exists, it’s sometimes behind an alpha or beta flag.
- Pause and snapshot are separate operations on Tensorlake and E2B. Vercel collapses them (snapshot auto-stops the source). Modal’s stable path skips pause entirely: it offers snapshot without terminate on the filesystem-only path, or terminate-on-snapshot on the alpha memory path. The API shape tells you which model a provider follows.
The rest of this post uses Tensorlake for code examples because it ships both paths as distinct operations, which makes the when-to-use-what patterns below runnable as written. The concepts translate to any provider that supports both.
When to suspend
Suspend when you have one ongoing task and the sandbox will idle between bursts of work.
Concretely: a coding agent waiting for a human reply, an overnight research loop between steps, a notebook you’ll come back to tomorrow. You want the exact process tree, open files, and memory back. Re-initializing would be slow, or wrong, if the process holds unserializable in-memory state.
from tensorlake.sandbox import SandboxClient
client = SandboxClient()
# Named sandboxes are the ones eligible for suspend/resume
client.create(name="research-agent")
sandbox = client.connect("research-agent")
sandbox.run("python", ["kickoff_research.py"])
# Agent is idle waiting for human input, stop paying for compute
client.suspend("research-agent")
# ...hours or days later, same sandbox, same state...
client.resume("research-agent")
sandbox.run("python", ["continue_research.py"])
On Tensorlake, a timeout_secs on a named sandbox triggers auto-suspend (not terminate), so you get this pattern defensively without writing idle-detection logic. Ephemeral sandboxes (those created without a name) can’t be suspended at all; the absence of a name is the signal that the sandbox isn’t meant to outlive its current task.
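If you would rather suspend explicitly after each burst of work instead of waiting for a timeout, one option is a small context manager. This is a sketch under assumptions: awake is a hypothetical helper, it relies only on the suspend(name) and resume(name) calls shown in the example above, and it assumes resume returns once the sandbox is usable. RecordingClient is a test double so the block runs without a Tensorlake account.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class SuspendableClient(Protocol):
    # Only the two calls used in the example above; everything else is omitted.
    def resume(self, name: str) -> None: ...
    def suspend(self, name: str) -> None: ...


@contextmanager
def awake(client: SuspendableClient, name: str) -> Iterator[None]:
    """Resume a named sandbox for one burst of work, then suspend it again."""
    client.resume(name)
    try:
        yield
    finally:
        # Suspend even if the work raised, so an error does not leave
        # the sandbox running.
        client.suspend(name)


class RecordingClient:
    """Test double that records calls instead of talking to a service."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def resume(self, name: str) -> None:
        self.calls.append(("resume", name))

    def suspend(self, name: str) -> None:
        self.calls.append(("suspend", name))


fake = RecordingClient()
with awake(fake, "research-agent"):
    pass  # sandbox.run(...) for this turn would go here
```

With a real client, the body of the with block would hold the sandbox.run call for that turn.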
When to snapshot
Snapshot when one state needs to seed many future sandboxes, or outlive the current one.
The clearest cases:
- Fan-out. RL rollouts from a shared starting point. Every worker needs the exact same post-setup state, in parallel.
- Golden environments. A dev environment with tools, weights, and auth preloaded, cloned per user session.
- Checkpoints. A durable recovery point before a step that might fail. You can retry from that point any number of times without redoing the setup.
# Prepare a base environment once
base = client.create_and_connect()
base.run("pip", ["install", "torch", "transformers",
                 "--user", "--break-system-packages"])
base.run("python", ["download_weights.py"])
snap = client.snapshot_and_wait(base.sandbox_id)
# Fan out: every worker starts from the exact same state
for seed in range(8):
    worker = client.create_and_connect(snapshot_id=snap.snapshot_id)
    worker.run("python", ["rollout.py", "--seed", str(seed)])
The source sandbox can now be terminated. The snapshot lives on independently, and that independence is the whole point: snapshots are objects, not sandbox states.
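If each run call blocks until the command finishes, a plain loop starts workers one after another. A thread pool is one way to launch them concurrently. The sketch below is generic: fan_out is a hypothetical helper, and the stand-in worker just returns a string. Using it with a real client assumes the client is safe to call from multiple threads, which this post does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable


def fan_out(start_worker: Callable[[int], Any],
            seeds: Iterable[int],
            max_workers: int = 8) -> list:
    """Call start_worker(seed) concurrently; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(start_worker, seeds))


# Stand-in worker so the block runs anywhere. With the client from the
# example above, the worker might look like this (assumption, untested):
#
#   def rollout(seed):
#       worker = client.create_and_connect(snapshot_id=snap.snapshot_id)
#       return worker.run("python", ["rollout.py", "--seed", str(seed)])
#
results = fan_out(lambda seed: f"rollout-{seed}", range(4))
```

Because every worker restores from the same snapshot, the only difference between them is the seed they are given.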
When you need both
Long-running agents often want both primitives. The pattern:
- Snapshot after expensive setup (install deps, download weights, warm caches). This is your durable recovery point.
- Suspend between idle turns during normal operation. Cheap and fast.
- If the sandbox fails catastrophically or you need to fork the session, restore from the snapshot into a fresh sandbox.
Snapshot is your insurance policy; suspend is your day-to-day cost control. They compose cleanly because they answer different questions.
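The recovery half of that pattern can be sketched as a retry loop that restores a fresh sandbox from the snapshot on every attempt. run_from_checkpoint is a hypothetical helper, not a Tensorlake API. It takes a restore callable (for example, a lambda wrapping client.create_and_connect(snapshot_id=...) as in the fan-out example) and assumes the step signals failure by raising an exception. The demo uses plain dicts as stand-in sandboxes.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")


def run_from_checkpoint(restore: Callable[[], S],
                        step: Callable[[S], None],
                        max_attempts: int = 3) -> S:
    """Run step in a sandbox restored from a snapshot; retry in a fresh one on failure."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        sandbox = restore()  # fresh sandbox with the post-setup state
        try:
            step(sandbox)
            return sandbox
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


# Demo with stand-ins: the step crashes in the first two restored sandboxes.
restored = []


def restore_stand_in() -> dict:
    sandbox = {"attempt": len(restored) + 1}
    restored.append(sandbox)
    return sandbox


def flaky_step(sandbox: dict) -> None:
    if sandbox["attempt"] < 3:
        raise RuntimeError("simulated crash")


survivor = run_from_checkpoint(restore_stand_in, flaky_step)
```

Setup is never repeated here: each retry pays only the cost of restoring the snapshot.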
Quick decision guide
- One sandbox, one ongoing task, will idle → suspend
- One state, many descendants (now or later) → snapshot
- Expensive setup you don’t want to redo on failure → snapshot (as a checkpoint)
- Long-running agent with idle gaps → both: snapshot the warm base, suspend between turns
If you find yourself reaching for snapshot every time the user goes to lunch, or for suspend when what you really want is a reproducible starting point, step back to the split: compute vs. storage. Once you frame it this way, the choice becomes pretty obvious.
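The decision guide above can also be written down as a small function. This is only an illustration of the guide's logic; choose_primitive and its parameter names are made up for this sketch.

```python
def choose_primitive(*, idles_between_work: bool,
                     seeds_other_sandboxes: bool,
                     setup_is_expensive: bool) -> str:
    """Map the decision guide onto 'suspend', 'snapshot', 'both', or 'neither'."""
    # Snapshot when the state must seed other sandboxes or act as a checkpoint.
    needs_snapshot = seeds_other_sandboxes or setup_is_expensive
    if needs_snapshot and idles_between_work:
        return "both"
    if needs_snapshot:
        return "snapshot"
    if idles_between_work:
        return "suspend"
    return "neither"


# A long-running agent with a warm base and idle gaps between turns:
recommendation = choose_primitive(idles_between_work=True,
                                  seeds_other_sandboxes=False,
                                  setup_is_expensive=True)
```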