00 · Preface
Eighteen months ago our scheduler ran out of room at around 12,000 concurrent
sandboxes per project. Today a single project can spin up five million — no
architectural escape hatches, no special-case tenants, no tier where we quietly
hand you a bigger VM and call it a day.
This post is the shortest version of how we got here. It covers three changes that mattered more than anything else: a rewrite of the scheduler, a new snapshot format, and two syscalls that freed us from the noisy-neighbor tax that every multi-tenant runtime eventually pays.
The numbers in this post are from our April 2026 production fleet, measured
across us-east-1, eu-west-1, and our internal staging region. Percentiles
are P50 unless otherwise noted.
01 · The old limits
Our first scheduler was a priority queue with a single-threaded placement loop. This was fine. The placement loop ran at roughly 4 kHz and a single host could accept about 600 placements per second. With the runtime we had in 2024, it was never going to be the bottleneck.
Then customers started building long-running, checkpointing agents. Suddenly the steady-state count mattered more than the placement rate. An agent that runs for 40 minutes and sleeps for 4 hours spends roughly 86% of its lifetime (240 of 280 minutes) as a suspended sandbox on disk, waiting for a wake signal. Multiply that by a few hundred thousand users.
02 · The scheduler rewrite
We replaced the monolithic placement loop with a sharded cooperative scheduler. Each shard owns a bucket of hosts and runs its own placement loop; cross-shard work only happens at rebalance time (roughly once every 90 seconds). The whole thing is a few thousand lines of Rust.
    impl Shard {
        /// Place a request on the least-fragmented host in this shard that
        /// still has capacity. Returns `None` when no local host fits.
        pub fn place(&mut self, req: PlaceReq) -> Option<Placement> {
            let slot = self.hosts
                .iter_mut()
                .filter(|h| h.has_capacity(&req))
                .min_by_key(|h| h.fragmentation())?;
            slot.reserve(&req);
            Some(Placement::local(slot.id, req))
        }
    }

A sharded scheduler is not a new idea; Borg and Nomad have had similar designs for years. What's interesting is what it buys you at our scale: each shard can make placement decisions with roughly 1/32nd of the global state, which fits in L2 cache, which turns the hottest loop in the system from memory-bound to CPU-bound.
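To make the shard-local idea concrete, here is a minimal Python sketch of best-fit placement over a partitioned fleet. It is an illustration under stated assumptions, not the production scheduler: `Host`, `leftover`, `build_shards`, the modulo partitioning, and the CPU/memory units are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    host_id: int
    free_cpu: int
    free_mem_gib: int

    def has_capacity(self, cpu: int, mem_gib: int) -> bool:
        return self.free_cpu >= cpu and self.free_mem_gib >= mem_gib

    def leftover(self, cpu: int, mem_gib: int) -> int:
        # Hypothetical fragmentation score: resources left idle after placing.
        return (self.free_cpu - cpu) + (self.free_mem_gib - mem_gib)

    def reserve(self, cpu: int, mem_gib: int) -> None:
        self.free_cpu -= cpu
        self.free_mem_gib -= mem_gib


@dataclass
class Shard:
    hosts: list[Host] = field(default_factory=list)

    def place(self, cpu: int, mem_gib: int) -> Optional[int]:
        """Best-fit placement using only this shard's hosts."""
        candidates = [h for h in self.hosts if h.has_capacity(cpu, mem_gib)]
        if not candidates:
            return None  # caller may retry elsewhere or wait for a rebalance
        slot = min(candidates, key=lambda h: h.leftover(cpu, mem_gib))
        slot.reserve(cpu, mem_gib)
        return slot.host_id


def build_shards(hosts: list[Host], n_shards: int) -> list[Shard]:
    """Partition hosts so each shard sees roughly 1/n_shards of the fleet."""
    shards = [Shard() for _ in range(n_shards)]
    for host in hosts:
        shards[host.host_id % n_shards].hosts.append(host)
    return shards


fleet = [Host(host_id=i, free_cpu=8, free_mem_gib=16) for i in range(64)]
shards = build_shards(fleet, n_shards=32)
first = shards[0].place(cpu=2, mem_gib=4)
```

Each `place` call scans only the two hosts owned by its shard, which is the property that keeps the working set small; nothing here touches another shard's state.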
From a caller's perspective, nothing about the API changed; reserving a sandbox looks the same in every SDK. Here's the request from the CLI, hitting the new sharded placer:
    # Reserve a sandbox on the new scheduler
    tensorlake sandbox create \
      --image python:3.12 \
      --cpu 2 --memory 4Gi \
      --persist
    # → sbx_01HK9ZA4MT · placed in 84ms · RUNNING

    tensorlake sandbox exec sbx_01HK9ZA4MT -- \
      "pip install pandas && python train.py"

The P50 placement time under the old scheduler sat around 310ms. With shard-local state, we measure 84ms at the same load, and it stays flat up to roughly thirty times the fleet size.
03 · The snapshot format
A sandbox "at rest" is a memory image, a rootfs overlay, and a tiny bit of
metadata. The old format stored memory as a flat file and rootfs as a qcow2
overlay on top of a shared base. Restoring meant reading the entire memory
file, which — for a 2 GB sandbox — takes somewhere between 180ms and 900ms
depending on how friendly your block cache is feeling.
A sandbox at rest isn't bytes. It's a lazy promise that those bytes will be there when the guest touches them. Once we made that promise first-class, everything else got faster.
(From the internal design doc, Nov 2025)

The new format is a content-addressed chunk store with a Merkle manifest. Restores are lazy: we map the manifest into the guest's address space, and each chunk is faulted in on first access through a userfaultfd handler. In practice, most agents read a tiny working set out of that 2 GB: we measured a median of 47 MB across a week of production workloads.
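The effect of lazy restore can be sketched in plain Python. This is a user-space simulation only: it does not use userfaultfd, and `LazyImage`, `fetch_chunk`, and `fetch_from_store` are hypothetical names for illustration. The callback plays the role of the fault handler, pulling a 64 KB chunk the first time any byte in it is read.

```python
from typing import Callable

CHUNK = 64 * 1024  # 64 KB, matching the chunk size described in the post


class LazyImage:
    """A memory image whose chunks are fetched only when first touched."""

    def __init__(self, size: int, fetch_chunk: Callable[[int], bytes]) -> None:
        self.size = size
        self._fetch = fetch_chunk
        self._resident: dict[int, bytes] = {}

    def read(self, offset: int, length: int) -> bytes:
        if offset < 0 or length < 0 or offset + length > self.size:
            raise ValueError("read outside the image")
        out = bytearray()
        pos, end = offset, offset + length
        while pos < end:
            index = pos // CHUNK
            chunk = self._resident.get(index)
            if chunk is None:
                chunk = self._fetch(index)  # the simulated "page fault"
                self._resident[index] = chunk
            start = pos - index * CHUNK
            take = min(CHUNK - start, end - pos)
            out += chunk[start:start + take]
            pos += take
        return bytes(out)

    @property
    def resident_bytes(self) -> int:
        return len(self._resident) * CHUNK


fetched: list[int] = []


def fetch_from_store(index: int) -> bytes:
    # Stand-in for a chunk-store lookup; fills each chunk with its index.
    fetched.append(index)
    return bytes([index % 256]) * CHUNK


image = LazyImage(size=2 * 1024**3, fetch_chunk=fetch_from_store)
data = image.read(CHUNK - 50, 100)  # straddles chunks 0 and 1
```

After that single read, only two chunks (128 KB) of the nominal 2 GB image are resident, which is the same shape as the small working set reported above.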
Manifest layout
The manifest is a flat array of 64KB chunks, each addressed by blake3 hash. Adjacent sandboxes that ran from the same base image share roughly 92% of their chunks, which means the page cache does a remarkable amount of work for us once warm.
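A small sketch of how content addressing produces that sharing, assuming fixed 64 KB chunks. The standard library has no blake3, so `hashlib.blake2b` stands in for it here; `store_image`, `shared_fraction`, and the synthetic images are hypothetical and exist only for this example.

```python
import hashlib

CHUNK = 64 * 1024


def chunk_digest(piece: bytes) -> str:
    # Stand-in hash: the post uses blake3, which is not in the standard library.
    return hashlib.blake2b(piece, digest_size=32).hexdigest()


def store_image(image: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Split an image into fixed-size chunks, store each once, return its manifest."""
    manifest = []
    for offset in range(0, len(image), CHUNK):
        piece = image[offset:offset + CHUNK]
        digest = chunk_digest(piece)
        chunk_store.setdefault(digest, piece)  # identical chunks are stored once
        manifest.append(digest)
    return manifest


def shared_fraction(a: list[str], b: list[str]) -> float:
    """Fraction of manifest a's chunks that also appear in manifest b."""
    in_b = set(b)
    return sum(1 for digest in a if digest in in_b) / len(a)


# Synthetic base image of 100 distinct chunks, then two sandboxes that each
# dirtied a few chunks of it.
base = b"".join(i.to_bytes(4, "big") * (CHUNK // 4) for i in range(100))
sandbox_a = bytearray(base)
sandbox_a[3 * CHUNK:4 * CHUNK] = b"\xff" * CHUNK
sandbox_b = bytearray(base)
sandbox_b[7 * CHUNK:9 * CHUNK] = b"\xfe" * CHUNK + b"\xfd" * CHUNK

chunk_store: dict[str, bytes] = {}
manifest_a = store_image(bytes(sandbox_a), chunk_store)
manifest_b = store_image(bytes(sandbox_b), chunk_store)
```

The two manifests list 200 chunks between them, but the store holds only 103 distinct ones, and `shared_fraction(manifest_a, manifest_b)` comes out at 0.97 for this synthetic data.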
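    from dataclasses import dataclass

    from bitarray import bitarray  # assumed: the third-party bitarray package

    CHUNK_SIZE = 64 * 1024

    @dataclass
    class Manifest:
        chunks: list[ChunkRef]   # 64KB blake3-addressed
        guest_base: int          # paddr in guest
        dirty_set: bitarray      # which chunks must be CoW'd

        @property
        def size(self) -> int:
            # Assumed: total mapped size follows from the fixed chunk size.
            return len(self.chunks) * CHUNK_SIZE

        def restore(self, vm: Vm) -> None:
            vm.mmap(self.guest_base, self.size, lazy=True)
            vm.install_uffd_handler(self.fault_in)

04 · The two syscalls that mattered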
With the scheduler and snapshot format in place, the thing holding us back was I/O on the host. Specifically: two syscalls that every sandbox hits on startup, both of which we'd been using the naive way.
→ Page cache sharing
The first is mmap. When two sandboxes on the same host share a base image,
you want them to share the page cache for the unmodified pages — otherwise
you're paying 2× memory for no reason. Linux already does this for
MAP_SHARED mappings, but only if the inodes match. We went back and made
sure they always do.
MAP_SHARED | MAP_NORESERVE on a chunk-store file, pinned to a single inode
per chunk-hash, gives you page-cache sharing across sandbox boundaries for
free. Biggest single memory win we've had.
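A hedged sketch of the inode-pinning idea, using Python's `mmap` module on a Unix host. The store layout, `chunk_path`, `map_chunk`, and `same_inode` are hypothetical, and `blake2b` again stands in for blake3. The point is only that every consumer resolves a chunk hash to the same file, so `MAP_SHARED` mappings are backed by the same page-cache pages, while a per-sandbox copy is not.

```python
import hashlib
import mmap
import os
import shutil
import tempfile


def chunk_path(store_dir: str, data: bytes) -> str:
    """Write a chunk once under its content hash and return the canonical path."""
    digest = hashlib.blake2b(data, digest_size=32).hexdigest()  # stand-in for blake3
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic publish: one inode per chunk hash
    return path


def map_chunk(path: str) -> mmap.mmap:
    """Map a chunk read-only with MAP_SHARED so mappings share page-cache pages."""
    # MAP_NORESERVE is not exposed by every Python version; skip it if absent.
    flags = mmap.MAP_SHARED | getattr(mmap, "MAP_NORESERVE", 0)
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(fd, 0, flags=flags, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # mmap keeps its own duplicate of the descriptor


def same_inode(a: str, b: str) -> bool:
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


store_dir = tempfile.mkdtemp()
chunk = b"base-image-page" * 4096
path_a = chunk_path(store_dir, chunk)  # sandbox A resolves the chunk
path_b = chunk_path(store_dir, chunk)  # sandbox B resolves the same content
view_a, view_b = map_chunk(path_a), map_chunk(path_b)

# The anti-pattern: a per-sandbox copy gets its own inode and its own cached pages.
private_copy = shutil.copyfile(path_a, os.path.join(store_dir, "private-copy"))
```

Here `same_inode(path_a, path_b)` is true and `same_inode(path_a, private_copy)` is false, which is the distinction the kernel uses when deciding whether two mappings can share pages.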
→ Copy-on-write disk
The second is ioctl(FICLONE). On XFS and Btrfs, this does an O(1) reflink —
a copy-on-write clone of a file that shares physical blocks with the source.
Our rootfs overlays are now reflinked from a per-image template, which means
cold-start spends zero time on disk provisioning.
    #include <errno.h>
    #include <linux/fs.h>    /* FICLONE */
    #include <sys/ioctl.h>

    int rootfs_copy_slow(int src, int dst);  /* regular byte copy, defined elsewhere */

    int rootfs_clone(int src, int dst) {
        // XFS / Btrfs only. EOPNOTSUPP on other filesystems, EXDEV across
        // filesystems: fall back to a regular copy so cold-start isn't regressed.
        int r = ioctl(dst, FICLONE, src);
        if (r < 0 && (errno == EOPNOTSUPP || errno == EXDEV))
            return rootfs_copy_slow(src, dst);
        return r;
    }

Reflinks don't survive cross-filesystem copies. If your orchestrator moves sandbox state between hosts with anything other than send | receive, you'll silently fall back to full copies. Budget for that, or disable cross-host migration.
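One way to keep that fallback from being silent is to have the clone helper report which path it took. The sketch below is a Python rendering of the same idea, assuming Linux on x86-64 or arm64 (the `FICLONE` request number is the value from `<linux/fs.h>` on those architectures). `clone_rootfs` and the set of errno values treated as "reflink unavailable" are assumptions for this example, not the production logic.

```python
import errno
import fcntl
import os
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

# Errors taken to mean "reflink not possible here" rather than a real failure.
# This set is an assumption; tighten it for production use.
_REFLINK_UNAVAILABLE = {
    errno.EOPNOTSUPP,  # filesystem has no reflink support
    errno.EXDEV,       # source and destination on different filesystems
    errno.EINVAL,      # filesystem or file type rejects the request
    errno.ENOTTY,      # ioctl not recognised at all
    errno.ENOSYS,
    errno.EPERM,       # for example, blocked by a container's seccomp policy
}


def clone_rootfs(src_path: str, dst_path: str) -> str:
    """Clone src to dst, returning "reflink" or "copy" so callers can count fallbacks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError as exc:
            if exc.errno not in _REFLINK_UNAVAILABLE:
                raise
    shutil.copyfile(src_path, dst_path)  # slow path: full byte copy
    return "copy"
```

Emitting the returned string as a metric per cold start would show how often a fleet is paying for full copies after a migration.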
05 · What we measured
After the three changes landed — scheduler shards, chunked snapshots, and the I/O path rewrite — we ran the 100k, 1M, and 5M concurrent-sandbox tests back-to-back. The P50 for "wake a suspended sandbox and exec a command" held at 80ms through every order of magnitude.
We also ran a noisy-neighbor test: 4,000 sandboxes, one of them doing a hot CPU+IO loop, the other 3,999 idle. We watched the P99 wake time of the idle cohort. Before the rewrite it spiked to 2.4 seconds. After, it moved by about 6 milliseconds.
06 · What's next
5M per project is not the ceiling — it's the number we're confident enough to publish. Internally we have a 40M test that runs every night. The current bottleneck is the control-plane database, and we've already started sharding it along the same shape as the scheduler.
If any of this sounds like the problem you want to work on, we're hiring. If you want to use it, the docs are the best starting point.