Suspend vs. Snapshot: Pause a Sandbox or Save It for Reuse?

Apr 16, 2026 | 5 min read

A big value proposition of using sandboxes to deploy agents is statefulness. A stateful sandbox preserves the sandbox's file system state and the in-memory state of its execution environment. Stateful sandboxes can retain an agent's memory or preserve the artifacts created during a session. Statefulness comes in degrees: some providers snapshot only the file system, others snapshot the file system plus sandbox memory. Some vendors also impose constraints on snapshotting, such as requiring you to stop a sandbox before you can snapshot it.


Tensorlake sandboxes offer dynamic capabilities for snapshotting and checkpointing. You can snapshot a running VM to create a point-in-time snapshot of the whole VM, or you can suspend a VM, which captures its state and also puts the sandbox to sleep. We often get asked when to snapshot and when to suspend a sandbox.

The real split: just-in-time checkpoint vs hibernating idle sandboxes

Suspend is about compute. You have one sandbox, it’s idle, and you want to pause it in place without losing its state or paying for compute. Later, you resume the same sandbox under the same sandbox ID and it picks up where it left off.

Snapshot is about making checkpoints of a running sandbox. You capture the full state of a running sandbox (filesystem, memory, and running processes) into a reusable artifact that outlives the source. You can restore that artifact into a new sandbox: once, or a hundred times, today or next month.

Suspend is a pause button. Snapshot is a save file.

|            | Suspend                       | Snapshot                       |
|------------|-------------------------------|--------------------------------|
| Identity   | Same sandbox ID               | New sandbox per restore        |
| Artifact   | None; state stays in place    | Persistent, independent object |
| Lifetime   | Tied to the sandbox           | Outlives the source            |
| Cost shape | No compute while paused       | Storage per artifact           |
| Fan-out    | Single lineage (no branching) | N (restore many times)         |

Everything downstream (API surface, pricing, which failures are recoverable) follows from this split.
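To make the split concrete, here is a minimal toy model in plain Python. It does not use any provider's SDK: `ToySandbox`, `suspend`, `resume`, `snapshot`, and `restore` are hypothetical names invented for illustration, and the "state" is just a dictionary. It only shows the identity and fan-out rows of the comparison above.

```python
import copy
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)


@dataclass
class ToySandbox:
    """Hypothetical stand-in for a sandbox: some state, a running flag, an ID."""
    state: dict = field(default_factory=dict)
    running: bool = True
    sandbox_id: str = field(default_factory=lambda: f"sbx-{next(_ids)}")


def suspend(sandbox: ToySandbox) -> None:
    # Pause in place: no artifact is produced, the state stays with the sandbox.
    sandbox.running = False


def resume(sandbox: ToySandbox) -> None:
    # Same object, same ID, same state.
    sandbox.running = True


def snapshot(sandbox: ToySandbox) -> dict:
    # Copy the state into an independent artifact that outlives the source.
    return copy.deepcopy(sandbox.state)


def restore(artifact: dict) -> ToySandbox:
    # Every restore creates a brand-new sandbox with its own ID.
    return ToySandbox(state=copy.deepcopy(artifact))


source = ToySandbox(state={"step": 3})
original_id = source.sandbox_id

suspend(source)
resume(source)
same_identity = source.sandbox_id == original_id

artifact = snapshot(source)
forks = [restore(artifact) for _ in range(3)]
forks[0].state["step"] = 99  # diverging one fork leaves the others untouched
```

Suspend and resume never change `sandbox_id`, while each `restore` call yields a sandbox with a fresh ID and its own copy of the state.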

How sandbox providers ship it

Both operations are technically hard. Freezing a running sandbox (memory, process tree, open file descriptors) and bringing it back cleanly is non-trivial. Turning that frozen state into a durable, portable artifact is harder still. Not every provider has built both paths.

Here’s what’s shipping today (April 2026):

Provider Suspend Snapshot Preserves filesystem Preserves memory Preserves processes
Tensorlake
E2B
Modal alpha alpha
Vercel Sandbox ✅ (beta)
Daytona

A few things that stood out to me when putting this together:

  • Only some providers preserve running processes and memory. Others call the operation “snapshot” but only capture the filesystem, and you’ll only notice when a restored sandbox comes back missing its in-flight processes.
  • Several providers ship one side of the split but not the other. Where the memory-preserving path exists, it’s sometimes behind an alpha or beta flag.
  • Pause and snapshot are separate operations on Tensorlake and E2B. Vercel collapses them (snapshot auto-stops the source). Modal skips pause entirely: its stable filesystem-only path snapshots without terminating the source, and its alpha memory path terminates on snapshot. The API shape tells you which.

The rest of this post uses Tensorlake for code examples because it ships both paths as distinct operations, which makes the when-to-use-what patterns below runnable as written. The concepts translate to any provider that supports both.

When to suspend

Suspend when you have one ongoing task and the sandbox will idle between bursts of work.

Concretely: a coding agent waiting for a human reply, an overnight research loop between steps, a notebook you’ll come back to tomorrow. You want the exact process tree, open files, and memory back. Re-initializing would be slow, or wrong, if the process holds unserializable in-memory state.

from tensorlake.sandbox import SandboxClient

client = SandboxClient()

# Named sandboxes are the ones eligible for suspend/resume
client.create(name="research-agent")
sandbox = client.connect("research-agent")

sandbox.run("python", ["kickoff_research.py"])

# Agent is idle waiting for human input, stop paying for compute
client.suspend("research-agent")

# ...hours or days later, same sandbox, same state...
client.resume("research-agent")
sandbox.run("python", ["continue_research.py"])


On Tensorlake, a timeout_secs on a named sandbox triggers auto-suspend (not terminate), so you get this pattern defensively without writing idle-detection logic. Ephemeral sandboxes (those created without a name) can’t be suspended at all; the absence of a name is the signal that the sandbox isn’t meant to outlive its current task.
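The timeout rule can be summarized as a small decision function. This is a toy model, not Tensorlake's implementation: `on_idle_timeout` and its return strings are hypothetical, and the "terminate" branch for unnamed sandboxes is an assumption (the text above only says they cannot be suspended).

```python
from typing import Optional


def on_idle_timeout(name: Optional[str], idle_secs: float, timeout_secs: float) -> str:
    """Return the action a timeout would trigger in this toy model."""
    if idle_secs < timeout_secs:
        return "keep-running"
    # Named sandboxes are suspended when the timeout fires.
    # Assumption: unnamed (ephemeral) sandboxes are terminated instead.
    return "suspend" if name else "terminate"


named_action = on_idle_timeout("research-agent", idle_secs=900, timeout_secs=600)
ephemeral_action = on_idle_timeout(None, idle_secs=900, timeout_secs=600)
```

The name is the only input that changes the outcome, which mirrors the point above: naming a sandbox is how you say it should outlive its current task.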

When to snapshot

Snapshot when one state needs to seed many future sandboxes, or outlive the current one.

The clearest cases:

  • Fan-out. RL rollouts from a shared starting point. Every worker needs the exact same post-setup state, in parallel.
  • Golden environments. A dev environment with tools, weights, and auth preloaded, cloned per user session.
  • Checkpoints. A durable recovery point before a step that might fail. You can retry from that point any number of times without redoing the setup.

from tensorlake.sandbox import SandboxClient

client = SandboxClient()

# Prepare a base environment once
base = client.create_and_connect()
base.run("pip", ["install", "torch", "transformers",
                 "--user", "--break-system-packages"])
base.run("python", ["download_weights.py"])

snap = client.snapshot_and_wait(base.sandbox_id)

# Fan out: every worker starts from the exact same state
for seed in range(8):
    worker = client.create_and_connect(snapshot_id=snap.snapshot_id)
    worker.run("python", ["rollout.py", "--seed", str(seed)])


The source sandbox can now be terminated. The snapshot lives on independently, and that independence is the whole point: snapshots are objects, not sandbox states.

When you need both

Long-running agents often want both primitives. The pattern:

  1. Snapshot after expensive setup (install deps, download weights, warm caches). This is your durable recovery point.
  2. Suspend between idle turns during normal operation. Cheap and fast.
  3. If the sandbox fails catastrophically or you need to fork the session, restore from the snapshot into a fresh sandbox.

Snapshot is your insurance policy; suspend is your day-to-day cost control. They compose cleanly because they answer different questions.
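The three steps above can be sketched as a pair of helpers around each agent turn. To keep it runnable without a sandbox service, the sketch uses a hypothetical `FakeClient`; its method names follow the examples earlier in this post, but the `RuntimeError` raised on a failed resume, the `start_turn` and `end_turn` helpers, and the string returned by the fake restore are all assumptions made for illustration.

```python
class FakeClient:
    """In-memory stand-in so the sketch runs without a real sandbox service."""

    def __init__(self):
        self.calls = []      # records every operation, in order
        self.broken = set()  # names of sandboxes that can no longer resume

    def resume(self, name):
        if name in self.broken:
            # Assumption: a failed resume surfaces as an exception.
            raise RuntimeError(f"{name} cannot be resumed")
        self.calls.append(("resume", name))

    def suspend(self, name):
        self.calls.append(("suspend", name))

    def create_and_connect(self, snapshot_id=None):
        self.calls.append(("restore", snapshot_id))
        return f"fresh-sandbox-from-{snapshot_id}"


def start_turn(client, name, snapshot_id):
    """Resume the named sandbox; fall back to the snapshot if that fails."""
    try:
        client.resume(name)
        return name
    except RuntimeError:
        # Catastrophic failure: restore the warm base into a fresh sandbox.
        return client.create_and_connect(snapshot_id=snapshot_id)


def end_turn(client, name):
    # Idle gap ahead: stop paying for compute.
    client.suspend(name)


client = FakeClient()
healthy = start_turn(client, "research-agent", "snap-1")
end_turn(client, "research-agent")

client.broken.add("research-agent")
recovered = start_turn(client, "research-agent", "snap-1")
```

Note that the fallback path only gets back what the snapshot captured. Anything the agent did after the snapshot was taken lives in the suspended sandbox, not in the artifact.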

Quick decision guide

  • One sandbox, one ongoing task, will idle → suspend
  • One state, many descendants (now or later) → snapshot
  • Expensive setup you don’t want to redo on failure → snapshot (as a checkpoint)
  • Long-running agent with idle gaps → both: snapshot the warm base, suspend between turns
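The guide above is small enough to write down as a function. This is just one way to encode it; `choose_primitive` and its three boolean parameters are hypothetical names, not part of any SDK.

```python
def choose_primitive(idles_between_work: bool,
                     needs_many_copies: bool,
                     expensive_setup: bool) -> set:
    """Map the decision guide's questions to the primitives to reach for."""
    chosen = set()
    if idles_between_work:
        # One sandbox that waits between bursts of work: pause it in place.
        chosen.add("suspend")
    if needs_many_copies or expensive_setup:
        # One state seeding many sandboxes, or a checkpoint worth keeping.
        chosen.add("snapshot")
    return chosen


long_running_agent = choose_primitive(
    idles_between_work=True, needs_many_copies=False, expensive_setup=True
)
```

A long-running agent with idle gaps and costly setup lands on both primitives, which matches the last bullet of the guide.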

If you find yourself reaching for snapshot every time the user goes to lunch, or for suspend when what you really want is a reproducible starting point, step back to the split: compute vs. storage. Once you frame it this way, the choice becomes pretty obvious.
