00 · The question we keep getting
A major reason to deploy agents in sandboxes is statefulness. A stateful sandbox preserves the filesystem and the in-memory state of the execution environment (agent memory, partial artifacts, a warmed-up interpreter), so work doesn't evaporate between turns.
Tensorlake sandboxes ship two related-but-distinct primitives for this: you can snapshot a running VM to create a point-in-time artifact of the whole sandbox, or you can suspend a VM, which snapshots and puts the sandbox to sleep. We get this question a lot: when do I use which?
Suspend is a pause button. Same sandbox ID, resumed in place. Snapshot is a save file. A durable artifact you can restore into new sandboxes — once, or a hundred times.
01 · The real split: pause vs. save
Suspend is about compute. You have one sandbox, it's idle, and you want to pause it in place without losing state or paying for compute. Later, you resume the same sandbox under the same sandbox ID and it picks up where it left off.
Snapshot is about checkpoints. You capture the full state of a running sandbox — filesystem, memory, running processes — into a reusable artifact that outlives the source. You can restore that artifact into a new sandbox: once, or a hundred times, today or next month.
Suspend is a pause button. Snapshot is a save file. Everything downstream — the API surface, the pricing, which failures are recoverable — follows from this split.
— Design principle, sandbox lifecycle RFC

| | Suspend | Snapshot |
|---|---|---|
| Identity | Same sandbox ID | New sandbox per restore |
| Artifact | None; state stays in place | Persistent, independent object |
| Lifetime | Tied to the sandbox | Outlives the source |
| Cost shape | No compute while paused | Storage per artifact |
| Fan-out | Single lineage (no branching) | N (restore many times) |
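
To make the table concrete, here is a minimal in-memory sketch of the two semantics. `ToySandbox` and `ToySnapshot` are hypothetical stand-ins invented for illustration, not the Tensorlake SDK. They model only identity and fan-out: suspend keeps the same ID, while each restore produces a new sandbox with its own copy of the state.

```python
import copy
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)


@dataclass
class ToySandbox:
    """In-memory stand-in for a sandbox (illustrative only)."""
    state: dict = field(default_factory=dict)
    sandbox_id: str = field(default_factory=lambda: f"sbx-{next(_ids)}")
    status: str = "running"

    def suspend(self) -> None:
        # Pause in place: nothing is copied and the ID does not change.
        self.status = "suspended"

    def resume(self) -> None:
        self.status = "running"

    def snapshot(self) -> "ToySnapshot":
        # Save file: deep-copy the state into an independent object.
        return ToySnapshot(state=copy.deepcopy(self.state))


@dataclass(frozen=True)
class ToySnapshot:
    state: dict

    def restore(self) -> ToySandbox:
        # Every restore yields a new sandbox with a new ID and its own copy.
        return ToySandbox(state=copy.deepcopy(self.state))


src = ToySandbox(state={"step": 3})
original_id = src.sandbox_id
src.suspend()
src.resume()                     # same object, same ID

snap = src.snapshot()
src.state["step"] = 4            # later changes do not touch the snapshot
clones = [snap.restore() for _ in range(3)]
```

After this runs, `src` still has its original ID, and each of the three clones starts from the state captured at snapshot time rather than from the source's later state.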
02 · How sandbox providers ship it
Both operations are technically hard. Freezing a running sandbox — memory, process tree, open file descriptors — and bringing it back cleanly is non-trivial. Turning that frozen state into a durable, portable artifact is harder still. Not every provider has built both paths.
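Here's what's shipping today (April 2026). In the table, ● means shipping, α and β mean available behind an alpha or beta flag, and a dash means not offered: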
| Provider | Suspend | Snapshot | Filesystem | Memory | Processes |
|---|---|---|---|---|---|
| Tensorlake | ● | ● | ● | ● | ● |
| E2B | ● | ● | ● | ● | ● |
| Modal | — | ● | ● | α | α |
| Vercel Sandbox | β | ● | ● | — | — |
| Daytona | ● | — | ● | — | — |
A few things that stood out when putting this together:
Only some providers preserve running processes and memory. Others call the operation "snapshot" but only capture the filesystem — you won't notice until a restored sandbox comes back missing its in-flight processes.
Several providers ship one side of the split but not the other. Where the memory-preserving path exists, it's sometimes behind an alpha or beta flag.
Pause and snapshot are separate operations on Tensorlake and E2B. Vercel collapses them (snapshot auto-stops the source). Modal's stable path skips pause entirely. The API shape tells you which mental model the provider chose.
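
If you want to filter providers by capability, the table can be encoded as plain data. The sketch below copies the rows as of April 2026 and assumes that ● means shipping, α and β mean alpha and beta, and a dash means not offered. `CAPABILITIES` and `providers_with` are hypothetical names introduced here.

```python
# Rows copied from the comparison table (April 2026).
# Assumed reading of the symbols: "yes" = shipping, "alpha"/"beta" = behind
# a pre-release flag, "no" = not offered.
CAPABILITIES = {
    "Tensorlake": {"suspend": "yes", "snapshot": "yes", "filesystem": "yes",
                   "memory": "yes", "processes": "yes"},
    "E2B": {"suspend": "yes", "snapshot": "yes", "filesystem": "yes",
            "memory": "yes", "processes": "yes"},
    "Modal": {"suspend": "no", "snapshot": "yes", "filesystem": "yes",
              "memory": "alpha", "processes": "alpha"},
    "Vercel Sandbox": {"suspend": "beta", "snapshot": "yes", "filesystem": "yes",
                       "memory": "no", "processes": "no"},
    "Daytona": {"suspend": "yes", "snapshot": "no", "filesystem": "yes",
                "memory": "no", "processes": "no"},
}


def providers_with(*features, allow_prerelease=False):
    """Return providers that offer every requested feature."""
    accepted = {"yes", "alpha", "beta"} if allow_prerelease else {"yes"}
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps[feature] in accepted for feature in features)
    ]


stable_both = providers_with("suspend", "snapshot", "memory")
memory_snapshots = providers_with("snapshot", "memory", allow_prerelease=True)
```

Here `stable_both` lists the providers that ship suspend, snapshot, and memory capture without a pre-release flag, and `memory_snapshots` widens the filter to include alpha and beta paths.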
The rest of this post uses Tensorlake for code examples because it ships both paths as distinct operations, which makes the patterns below runnable as written. Examples are in Python, TypeScript, and CLI — the SDKs have full parity.
03 · When to suspend
Suspend when you have one ongoing task and the sandbox will idle between bursts of work.
Concretely: a coding agent waiting for a human reply, an overnight research loop between steps, a notebook you'll come back to tomorrow. You want the exact process tree, open files, and memory back. Re-initializing would be slow — or wrong, if the process holds unserializable in-memory state.
```python
# A coding agent, paused between user turns
from tensorlake import Sandbox

sbx = Sandbox.create(name="agent-session-A7F2")
# ... agent does work, waits for a human reply ...
sbx.suspend()
# → SBX_01HK9Z · PAUSED · compute: $0.00/s

# Later, the user comes back:
sbx = Sandbox.attach("agent-session-A7F2")
sbx.resume()  # same PIDs, same memory, same fs
```

A timeout_secs on a named sandbox triggers auto-suspend (not terminate), so you get this pattern defensively without writing idle-detection logic. Ephemeral sandboxes (created without a name) can't be suspended at all: the absence of a name is the signal that the sandbox isn't meant to outlive its current task.
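
The timeout rule can be summarized as a small state machine. This is a toy model, not the Tensorlake SDK: `IdleSandbox` and `tick` are hypothetical. The post does not say what happens to an unnamed sandbox on timeout, so the sketch assumes it is terminated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleSandbox:
    """Toy model of the idle-timeout rule (illustrative only)."""
    name: Optional[str]          # None means the sandbox is ephemeral
    timeout_secs: float
    last_active: float = 0.0
    status: str = "running"

    def suspend(self) -> None:
        if self.name is None:
            raise ValueError("ephemeral sandboxes cannot be suspended")
        self.status = "suspended"

    def tick(self, now: float) -> str:
        """Apply the timeout rule at time `now` and return the status."""
        idle_for = now - self.last_active
        if self.status == "running" and idle_for >= self.timeout_secs:
            if self.name is not None:
                self.suspend()            # named: auto-suspend
            else:
                # Assumption: an unnamed sandbox is terminated on timeout.
                self.status = "terminated"
        return self.status


named = IdleSandbox(name="agent-session-A7F2", timeout_secs=300)
ephemeral = IdleSandbox(name=None, timeout_secs=300)

status_before = named.tick(now=100)      # still inside the timeout window
status_named = named.tick(now=300)       # timeout reached
status_ephemeral = ephemeral.tick(now=301)
```

Keeping the name check inside `suspend` means that both manual and automatic suspends go through the same guard.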
04 · When to snapshot
Snapshot when one state needs to seed many future sandboxes, or outlive the current one. Three clear cases:
- Fan-out: RL rollouts from a shared starting point. Every worker needs the exact same post-setup state, in parallel.
- Golden environments: A dev environment with tools, weights, and auth preloaded, cloned per user session.
- Checkpoints: A durable recovery point before a step that might fail. Retry from that point any number of times without redoing setup.
```python
# Warm a base, then fan out to N rollout workers
from tensorlake import Sandbox, Snapshot

src = Sandbox.create(image="python:3.12")
src.exec("pip install torch transformers && python setup.py")
snap = src.snapshot(tag="rollout-base-v3")
# → snap_A7F2 · 1.2 GiB · blake3:f4c9…
src.terminate()  # source is done; snapshot lives on

# Fan out: 100 workers, same warm state, all parallel
workers = [Snapshot.restore(snap.id) for _ in range(100)]
```

The source sandbox can now be terminated. The snapshot lives on independently, and that independence is the whole point: snapshots are objects, not sandbox states.
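
One caveat on the fan-out: a list comprehension issues its calls one after another. If a restore call blocks until the sandbox is ready (an assumption, since the post does not say), a thread pool is one way to overlap the restores. The sketch below uses only the standard library. `fan_out` and `fake_restore` are hypothetical helpers, with `fake_restore` standing in for a real restore call.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fan_out(restore: Callable[[str], T], snapshot_id: str, n: int,
            max_workers: int = 16) -> List[T]:
    """Call restore(snapshot_id) n times concurrently; return all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(restore, snapshot_id) for _ in range(n)]
        # result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]


# Stand-in for a real restore call: returns a unique worker label.
_counter = itertools.count(1)
_lock = threading.Lock()


def fake_restore(snapshot_id: str) -> str:
    with _lock:
        return f"{snapshot_id}-worker-{next(_counter)}"


workers = fan_out(fake_restore, "snap_A7F2", n=100)
```

With the names from the example above, the call would look like `fan_out(Snapshot.restore, snap.id, n=100)`, assuming the restore call is safe to invoke from multiple threads.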
05 · When you need both
Long-running agents often want both primitives. The pattern:
1. Snapshot after expensive setup (install deps, download weights, warm caches). This is your durable recovery point.
2. Suspend between idle turns during normal operation. Cheap and fast.
3. If the sandbox fails catastrophically or you need to fork the session, restore from the snapshot into a fresh sandbox.
Snapshot is your insurance policy; suspend is your day-to-day cost control. They compose cleanly because they answer different questions.
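
The recovery half of this pattern can be sketched as a retry loop that starts every attempt from the post-setup snapshot. `run_with_checkpoint` is a hypothetical helper, and `fake_restore` and `flaky_step` are stand-ins so the example runs without any SDK. In real use, `restore` would wrap whatever call restores your snapshot into a fresh sandbox.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")
R = TypeVar("R")


def run_with_checkpoint(restore: Callable[[], S], step: Callable[[S], R],
                        max_attempts: int = 3) -> R:
    """Run `step` in a sandbox restored from a checkpoint, retrying on failure."""
    last_error = None
    for _ in range(max_attempts):
        sandbox = restore()          # fresh sandbox, setup already done
        try:
            return step(sandbox)
        except Exception as exc:     # broad on purpose: any failure retries
            last_error = exc
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


# Stand-ins so the sketch is runnable.
restored: List[str] = []


def fake_restore() -> str:
    restored.append(f"sbx-{len(restored) + 1}")
    return restored[-1]


calls = {"n": 0}


def flaky_step(sandbox: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated failure")
    return f"done in {sandbox}"


result = run_with_checkpoint(fake_restore, flaky_step)
```

The sketch leaves failed sandboxes alone. A real version would probably terminate them before the next attempt to avoid paying for abandoned compute.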
06 · Quick decision guide
If you find yourself reaching for snapshot every time the user goes to lunch, or for suspend when what you really want is a reproducible starting point, step back to the split: suspend pauses one sandbox's compute in place, and snapshot saves state as a stored artifact. Framed that way, the choice is usually obvious.
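
As a compact restatement of the guide, the rules from the earlier sections fit in one small function. `choose_primitive` and its three flags are hypothetical names that paraphrase the criteria in this post. Real decisions may also weigh cost and restore latency, which this sketch ignores.

```python
from typing import Set


def choose_primitive(*, seeds_many: bool, must_outlive_source: bool,
                     idles_between_turns: bool) -> Set[str]:
    """Map the post's criteria to the primitive(s) to reach for."""
    chosen: Set[str] = set()
    if seeds_many or must_outlive_source:
        chosen.add("snapshot")       # a reusable artifact is needed
    if idles_between_turns:
        chosen.add("suspend")        # one sandbox, paused in place
    return chosen


lunch_break = choose_primitive(
    seeds_many=False, must_outlive_source=False, idles_between_turns=True)
rl_rollouts = choose_primitive(
    seeds_many=True, must_outlive_source=True, idles_between_turns=False)
long_running_agent = choose_primitive(
    seeds_many=False, must_outlive_source=True, idles_between_turns=True)
```

An empty result means neither primitive is needed, for example a short one-off task in an ephemeral sandbox.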