How we got to 5,000,000 sandboxes per project

A walk through the scheduler rewrite, the snapshot format, and the two syscalls that freed us from the noisy-neighbor tax.

00 · Preface

Eighteen months ago our scheduler ran out of room at around 12,000 concurrent sandboxes per project. Today a single project can spin up five million — no architectural escape hatches, no special-case tenants, no tier where we quietly hand you a bigger VM and call it a day.

This post is the shortest version of how we got here. It covers three changes that mattered more than anything else: a rewrite of the scheduler, a new snapshot format, and two syscalls that freed us from the noisy-neighbor tax that every multi-tenant runtime eventually pays.

FIELD NOTE

The numbers in this post are from our April 2026 production fleet, measured across us-east-1, eu-west-1, and our internal staging region. Percentiles are P50 unless otherwise noted.

01 · The old limits

Our first scheduler was a priority-queue with a single-threaded placement loop. This was fine. The placement loop ran at roughly 4kHz and a single host could accept about 600 placements per second — which is to say, with the runtime we had in 2024, it was never going to be the bottleneck.

Then customers started building long-running, checkpointing agents. Suddenly the steady-state count mattered more than the placement rate. An agent that runs for 40 minutes and sleeps for 4 hours spends roughly 86% of its lifetime (240 of every 280 minutes) as a suspended sandbox on disk, waiting for a wake signal. Multiply that by a few hundred thousand users.
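To put numbers on that duty cycle: for every sandbox that is running, about six are suspended. The sketch below assumes agents are spread evenly across their run/sleep cycle, and the 300,000 figure is an illustrative stand-in for "a few hundred thousand", not a measured number.

```python
# Duty-cycle arithmetic for the agent described above.
RUN_MIN = 40            # minutes running per cycle (from the text)
SLEEP_MIN = 4 * 60      # minutes suspended per cycle (from the text)
AGENTS = 300_000        # assumption: stand-in for "a few hundred thousand"


def steady_state(agents: int, run_min: int, sleep_min: int) -> tuple[int, int]:
    """Return (running, suspended) counts at any instant, assuming agents
    are evenly spread across their run/sleep cycle."""
    suspended = agents * sleep_min // (run_min + sleep_min)
    return agents - suspended, suspended


running, suspended = steady_state(AGENTS, RUN_MIN, SLEEP_MIN)
print(f"running={running:,} suspended={suspended:,}")
```

The point of the exercise: the scheduler's working set is dominated by sandboxes that are not running at all.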

02 · The scheduler rewrite

We replaced the monolithic placement loop with a sharded cooperative scheduler. Each shard owns a bucket of hosts and runs its own placement loop; cross-shard work only happens at rebalance time (roughly once every 90 seconds). The whole thing is a few thousand lines of Rust.

impl Shard {
    /// Best-fit placement within this shard; `None` if no host has capacity.
    pub fn place(&mut self, req: PlaceReq) -> Option<Placement> {
        let slot = self.hosts
            .iter_mut()
            .filter(|h| h.has_capacity(&req))
            .min_by_key(|h| h.fragmentation())?;

        slot.reserve(&req);
        Some(Placement::local(slot.id, req))
    }
}

A sharded scheduler is not a new idea — Borg and Nomad have had similar designs for years. What's interesting is what it buys you at our scale: each shard can make placement decisions with roughly 1/32nd of the global state, which fits in L2 cache, which turns the hottest loop in the system from memory-bound to CPU-bound.
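The shape of that design can be sketched in a few dozen lines. This is a toy model, not the production scheduler: the `Host`, `Shard`, and `ShardedScheduler` names, the `leftover` fragmentation score, and routing by a hash of a request id are all assumptions made for illustration. What it does show is that each placement only scans the hosts owned by one shard.

```python
import zlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    host_id: str
    free_cpu: int
    free_mem_gib: int

    def has_capacity(self, cpu: int, mem_gib: int) -> bool:
        return self.free_cpu >= cpu and self.free_mem_gib >= mem_gib

    def leftover(self, cpu: int, mem_gib: int) -> int:
        # Hypothetical fragmentation score: capacity stranded after placing.
        return (self.free_cpu - cpu) + (self.free_mem_gib - mem_gib)


@dataclass
class Shard:
    hosts: list[Host] = field(default_factory=list)

    def place(self, cpu: int, mem_gib: int) -> Optional[str]:
        """Best-fit over this shard's hosts only; None if nothing fits."""
        fits = [h for h in self.hosts if h.has_capacity(cpu, mem_gib)]
        if not fits:
            return None
        best = min(fits, key=lambda h: h.leftover(cpu, mem_gib))
        best.free_cpu -= cpu
        best.free_mem_gib -= mem_gib
        return best.host_id


class ShardedScheduler:
    def __init__(self, hosts: list[Host], num_shards: int = 32):
        self.shards = [Shard() for _ in range(num_shards)]
        for i, host in enumerate(hosts):
            # Round-robin hosts into shards; each host has exactly one owner.
            self.shards[i % num_shards].hosts.append(host)

    def place(self, request_id: str, cpu: int, mem_gib: int) -> Optional[str]:
        # Assumption: requests are routed by a stable hash of their id.
        shard = self.shards[zlib.crc32(request_id.encode()) % len(self.shards)]
        return shard.place(cpu, mem_gib)
```

A shard that returns `None` is where a real system would retry on another shard or wait for the next rebalance. The sketch leaves that policy out.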

From a caller's perspective, nothing about the API changed: reserving a sandbox looks the same in every SDK. Here's the request from the CLI, hitting the new sharded placer:

# Reserve a sandbox on the new scheduler
tensorlake sandbox create \
    --image python:3.12 \
    --cpu 2 --memory 4Gi \
    --persist
# → sbx_01HK9ZA4MT · placed in 84ms · RUNNING

tensorlake sandbox exec sbx_01HK9ZA4MT -- \
    "pip install pandas && python train.py"

The P50 placement time under the old scheduler sat around 310ms. With shard-local state, we measure 84ms at the same load — and flat up to roughly thirty times the fleet size.

03 · The snapshot format

A sandbox "at rest" is a memory image, a rootfs overlay, and a tiny bit of metadata. The old format stored memory as a flat file and rootfs as a qcow2 overlay on top of a shared base. Restoring meant reading the entire memory file, which — for a 2 GB sandbox — takes somewhere between 180ms and 900ms depending on how friendly your block cache is feeling.

A sandbox at rest isn't bytes. It's a lazy promise that those bytes will be there when the guest touches them. Once we made that promise first-class, everything else got faster.

— From the internal design doc, Nov 2025

The new format is a content-addressed chunk store with a Merkle manifest. Restores are lazy: we map the manifest into the guest's address space, and each chunk is faulted in on first access through a userfaultfd handler. In practice, most agents read a tiny working set out of that 2 GB — we measured a median of 47 MB across a week of production workloads.
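Lazy restore is easier to reason about with a small simulation. The sketch below is not the real restore path: a Python dictionary stands in for guest memory, `_fault_in` plays the role of the userfaultfd handler, and `hashlib.blake2b` substitutes for blake3, which is not in the standard library. `ChunkStore`, `snapshot`, and `LazyImage` are hypothetical names.

```python
import hashlib

CHUNK = 64 * 1024  # chunk size from the text


class ChunkStore:
    """Content-addressed store: identical chunks are kept once."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # blake2b is a stand-in: blake3 is not in the standard library.
        cid = hashlib.blake2b(data, digest_size=32).hexdigest()
        self._blobs.setdefault(cid, data)
        return cid

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]


def snapshot(image: bytes, store: ChunkStore) -> list[str]:
    """Split an image into chunks and return the manifest (list of chunk ids)."""
    return [store.put(image[i:i + CHUNK]) for i in range(0, len(image), CHUNK)]


class LazyImage:
    """Restores nothing up front; chunks become resident on first read."""

    def __init__(self, manifest: list[str], store: ChunkStore) -> None:
        self.manifest = manifest
        self.store = store
        self.resident: dict[int, bytes] = {}

    def _fault_in(self, index: int) -> bytes:
        # Plays the role of the userfaultfd handler in this simulation.
        if index not in self.resident:
            self.resident[index] = self.store.get(self.manifest[index])
        return self.resident[index]

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            index, within = divmod(offset, CHUNK)
            if index >= len(self.manifest):
                break  # read runs past the end of the image
            chunk = self._fault_in(index)
            piece = chunk[within:within + (end - offset)]
            if not piece:
                break  # short final chunk
            out += piece
            offset += len(piece)
        return bytes(out)

    @property
    def resident_bytes(self) -> int:
        return sum(len(c) for c in self.resident.values())
```

Reading 8 bytes that straddle a chunk boundary makes exactly two chunks resident and leaves the rest of the image untouched. That is the effect behind the small working set measured above.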

Manifest layout

The manifest is a flat array of references to 64 KB chunks, each addressed by its blake3 hash. Adjacent sandboxes that ran from the same base image share roughly 92% of their chunks, which means the page cache does a remarkable amount of work for us once warm.

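from __future__ import annotations   # ChunkRef and Vm are defined elsewhere
from dataclasses import dataclass
from bitarray import bitarray        # third-party package

@dataclass
class Manifest:
    chunks: list[ChunkRef]        # 64KB blake3-addressed
    guest_base: int               # paddr in guest
    dirty_set: bitarray           # which chunks must be CoW'd

    @property
    def size(self) -> int:        # bytes mapped, assuming whole chunks
        return len(self.chunks) * 64 * 1024

    def restore(self, vm: Vm) -> None:
        # Map lazily; fault_in (not shown) fetches each chunk on first touch.
        vm.mmap(self.guest_base, self.size, lazy=True)
        vm.install_uffd_handler(self.fault_in)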
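How much two manifests share can be computed by comparing chunk hashes. The following is a toy illustration with made-up images, not the production measurement. `manifest_for` and `shared_fraction` are hypothetical helpers, and `hashlib.blake2b` again stands in for blake3.

```python
import hashlib

CHUNK = 64 * 1024  # chunk size from the text


def manifest_for(image: bytes) -> list[str]:
    """Chunk an image and return one hash per chunk."""
    return [
        hashlib.blake2b(image[i:i + CHUNK], digest_size=32).hexdigest()
        for i in range(0, len(image), CHUNK)
    ]


def shared_fraction(a: list[str], b: list[str]) -> float:
    """Fraction of a's chunks whose hash also appears in b."""
    if not a:
        return 0.0
    other = set(b)
    return sum(1 for cid in a if cid in other) / len(a)


# Toy data: a 20-chunk base image and two sandboxes that each dirtied
# a single, different chunk.
base = b"".join(bytes([k]) * CHUNK for k in range(20))
sandbox_a = bytearray(base)
sandbox_a[0] ^= 0xFF      # dirties chunk 0
sandbox_b = bytearray(base)
sandbox_b[-1] ^= 0xFF     # dirties chunk 19

overlap = shared_fraction(manifest_for(bytes(sandbox_a)),
                          manifest_for(bytes(sandbox_b)))
print(f"shared chunks: {overlap:.0%}")
```

Flipping one byte changes the hash of only the chunk that contains it, so every untouched chunk still resolves to the same stored blob.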

04 · The two syscalls that mattered

With the scheduler and snapshot format in place, the thing holding us back was I/O on the host. Specifically: two syscalls that every sandbox hits on startup, both of which we'd been using the naive way.

→ Page cache sharing

The first is mmap. When two sandboxes on the same host share a base image, you want them to share the page cache for the unmodified pages — otherwise you're paying 2× memory for no reason. Linux already does this for MAP_SHARED mappings, but only if the inodes match. We went back and made sure they always do.

IF YOU'RE BUILDING A RUNTIME

MAP_SHARED | MAP_NORESERVE on a chunk-store file, pinned to a single inode per chunk-hash, gives you page-cache sharing across sandbox boundaries for free. Biggest single memory win we've had.
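A minimal sketch of that pattern using Python's `mmap` module on Linux. The store layout (one file per hash in a flat directory), the helper names `chunk_path` and `map_chunk`, and the use of `hashlib.blake2b` in place of blake3 are assumptions. `MAP_NORESERVE` is looked up with `getattr` because not every Python version exposes it.

```python
import hashlib
import mmap
import os
import tempfile


def chunk_path(store_dir: str, data: bytes) -> str:
    """Publish a chunk at a path derived from its hash, exactly once."""
    cid = hashlib.blake2b(data, digest_size=32).hexdigest()  # blake3 stand-in
    path = os.path.join(store_dir, cid)
    if not os.path.exists(path):
        fd, tmp = tempfile.mkstemp(dir=store_dir)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            try:
                # link() fails if the path exists, so the first published
                # inode is never replaced by a later writer.
                os.link(tmp, path)
            except FileExistsError:
                pass
        finally:
            os.unlink(tmp)
    return path


def map_chunk(path: str) -> mmap.mmap:
    """Read-only shared mapping of a chunk file."""
    flags = mmap.MAP_SHARED | getattr(mmap, "MAP_NORESERVE", 0)
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the whole file; mmap keeps its own duplicate of fd.
        return mmap.mmap(fd, 0, flags=flags, prot=mmap.PROT_READ)
    finally:
        os.close(fd)
```

Two sandboxes that ask for the same chunk content get the same path and therefore the same inode, which is the condition the kernel needs to back both mappings with the same cached pages.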

→ Copy-on-write disk

The second is ioctl(FICLONE). On XFS and Btrfs, this does an O(1) reflink — a copy-on-write clone of a file that shares physical blocks with the source. Our rootfs overlays are now reflinked from a per-image template, which means cold-start spends zero time on disk provisioning.

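#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */

int rootfs_copy_slow(int src, int dst);   /* regular copy, defined elsewhere */

int rootfs_clone(int src, int dst) {
    // XFS / Btrfs only. EOPNOTSUPP elsewhere, EXDEV across
    // filesystems: fall back to a regular copy so cold-start
    // isn't regressed.
    int r = ioctl(dst, FICLONE, src);
    if (r < 0 && (errno == EOPNOTSUPP || errno == EXDEV))
        return rootfs_copy_slow(src, dst);
    return r;
}

CAREFUL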

Reflinks don't survive cross-filesystem copies. If your orchestrator moves sandbox state between hosts with anything other than send | receive, you'll silently fall back to full copies. Budget for that, or disable cross-host migration.
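One way to keep that fallback from being silent is to have the clone helper report which path it took, so the orchestrator can count full copies. A hedged sketch: `clone_file` is a hypothetical helper, the `FICLONE` constant is the value used on x86-64 and arm64 Linux (other architectures may encode it differently), and the set of errno values treated as "reflink unavailable" is an assumption.

```python
import errno
import fcntl
import os
import shutil

# _IOW(0x94, 9, int) as encoded on x86-64 and arm64 Linux (assumption elsewhere).
FICLONE = 0x40049409

# Errors treated as "reflink not available here" rather than as real failures.
_FALLBACK_ERRNOS = {
    errno.EOPNOTSUPP,  # filesystem has no reflink support
    errno.EXDEV,       # source and destination on different filesystems
    errno.EINVAL,      # reported by some filesystems for unsupported clones
    errno.ENOTTY,      # ioctl not recognised at all
    errno.ENOSYS,
}


def clone_file(src_path: str, dst_path: str) -> str:
    """Clone src to dst. Returns "reflink" or "copy" so callers can count
    how often the slow path is taken."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError as exc:
            if exc.errno not in _FALLBACK_ERRNOS:
                raise
        # Slow path: a full byte-for-byte copy.
        shutil.copyfileobj(src, dst)
        return "copy"
```

Feeding the return value into a metric makes a jump in "copy" results after a migration visible, instead of showing up later as unexplained cold-start latency.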

05 · What we measured

After the three changes landed — scheduler shards, chunked snapshots, and the I/O path rewrite — we ran the 100k, 1M, and 5M concurrent-sandbox tests back-to-back. The P50 for "wake a suspended sandbox and exec a command" held at 80ms through every order of magnitude.

We also ran a noisy-neighbor test: 4,000 sandboxes, one of them doing a hot CPU+IO loop, the other 3,999 idle. We watched the P99 wake time of the idle cohort. Before the rewrite it spiked to 2.4 seconds. After, it moved by about 6 milliseconds.
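It helps to be precise about what a P99 over the idle cohort can and cannot show. The sketch below uses the nearest-rank definition and assumes one wake sample per idle sandbox. Both are assumptions, since the post does not say how the percentile was computed, and `percentile` is a hypothetical helper.

```python
def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples_ms or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples_ms)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Illustrative numbers only: 3,999 idle sandboxes, fast wakes at 10 ms,
# slow wakes at 2,400 ms.
barely_affected = [10.0] * 3960 + [2400.0] * 39
widely_affected = [10.0] * 3959 + [2400.0] * 40

print(percentile(barely_affected, 99), percentile(widely_affected, 99))
```

Under these assumptions the P99 only moves once at least 40 of the 3,999 idle sandboxes are slow, so a multi-second P99 means a single noisy sandbox delayed dozens of its neighbors, not one unlucky outlier.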

06 · What's next

5M per project is not the ceiling — it's the number we're confident enough to publish. Internally we have a 40M test that runs every night. The current bottleneck is the control-plane database, and we've already started sharding it along the same shape as the scheduler.

If any of this sounds like the problem you want to work on, we're hiring. If you want to use it, the docs are the best starting point.

Written by Diptanu Gon Choudhury, CEO / Co-founder