
Starting hundreds of sandboxes in parallel, and the design that makes it possible.

We replaced reconciliation loops with a durable command outbox so our sandbox scheduler can start thousands of sandboxes every second.

00 · The 1000-sandbox moment

Imagine you're running an evaluation benchmark for a new code-generation model. You've written 1,000 programming challenges. Each one needs an isolated Python environment to execute the generated code and verify it against expected output. Your harness fires all 1,000 requests at once.

What happens next depends almost entirely on how your compute platform schedules work. In most systems, sandboxes start one by one, or in small batches on a loop. The first sandbox might start in under a second. The last one might wait minutes. By the time you're looking at results, you've spent more time waiting for environments to boot than actually running your model.

Tensorlake takes a different approach. This post walks through what we're building under the hood.

TL;DR

We replaced the reconciliation loop with a durable command outbox. Scheduling decisions write commands to RocksDB; long-polling dataplane processes pick them up in milliseconds. End-to-end latency is bounded by network RTT, not a scheduler tick. 1,000 sandboxes don't wait in line — they just start.

01 · The reconciliation trap

The standard approach to scheduling in distributed systems is a reconciliation loop. A controller periodically compares the desired state of the world against the actual state, and takes actions to close the gap. Kubernetes works this way. Most task schedulers work this way.

Reconciliation loops are a reasonable design. They're simple to reason about, they self-heal naturally, and they make eventual consistency easy to implement. But they only react as fast as the loop runs. If the loop ticks every 10 seconds, and you submit 1,000 jobs at once, the scheduler might observe only the first batch of pending work on the first tick, start those, and only discover the remaining 900 jobs on the next tick — 10 seconds later. You can shorten the interval, but you can't eliminate it.
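To make the tick-bounded latency concrete, here's a minimal sketch of a reconciliation tick in Rust, assuming hypothetical `desired` and `actual` sets of sandbox ids (a real controller would read these from a store or API):

```rust
use std::collections::BTreeSet;

// Hypothetical reconcile step: diff desired against actual and "start"
// whatever is missing. Returns the ids started on this tick.
fn reconcile(desired: &BTreeSet<u64>, actual: &mut BTreeSet<u64>) -> Vec<u64> {
    let to_start: Vec<u64> = desired.difference(actual).copied().collect();
    for id in &to_start {
        actual.insert(*id); // stand-in for actually starting the sandbox
    }
    to_start
}

fn main() {
    let desired: BTreeSet<u64> = (0..1_000).collect();
    let mut actual = BTreeSet::new();
    // Anything submitted between ticks waits a full interval; the tick,
    // not the network, is the latency floor.
    let started = reconcile(&desired, &mut actual);
    assert_eq!(started.len(), 1_000);
    // A second tick finds nothing to do: the diff has converged.
    assert!(reconcile(&desired, &mut actual).is_empty());
}
```

However fast the diff itself runs, work submitted just after a tick waits a full interval before the next `reconcile` call observes it.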

There's a second problem. Reconciliation loops work by reading current state and diffing it against desired state. Under load, that read becomes expensive. And if the controller restarts while a reconciliation is in progress, in-flight work is temporarily forgotten until the next full reconciliation cycle re-derives it.

02 · Commands as the unit of work

Instead of periodically reconciling desired state against actual state, our scheduler emits discrete, durable commands whenever scheduling decisions are made.

When a sandbox needs to start, the scheduler creates an AddSandbox command and writes it to a persistent outbox backed by RocksDB. That write happens immediately, in the same transaction as the scheduling decision itself. The command isn't derived later from a state comparison — it's the direct output of the decision.

Our virtual machine fleet connects to the scheduler through a service we call dataplane. Dataplane runs on each host, reports its available resources, and receives work from the scheduler. Each process manages the sandboxes on its host, starting them, routing commands to them, and reporting results back.

The full cycle to start a sandbox:

  1. Scheduler decides a sandbox is needed.
  2. It writes an AddSandbox command to the RocksDB outbox.
  3. It notifies waiting dataplane connections.
  4. Dataplane receives the command batch.
  5. It spawns sandbox startup tasks in parallel.
  6. It reports SandboxStarted via a bidirectional gRPC heartbeat.
  7. Scheduler acks the command and removes it from the outbox.

Each command carries a sequence number. When a dataplane process picks up a batch, it acks the highest sequence number it has processed. The scheduler only removes commands from the outbox after they're acknowledged. If the scheduler restarts, the commands are still in RocksDB. If a dataplane process restarts, the scheduler detects sequence-number regression and requests a full state sync before issuing new commands. No command is silently lost.

// Scheduler emits AddSandbox; write is durable and transactional.
pub fn schedule_sandbox(tx: &mut Txn, sbx: SandboxSpec) -> Result<Seq> {
    let seq = tx.next_seq(OUTBOX)?;
    let cmd = Command::AddSandbox {
        seq,
        sandbox_id: sbx.id,
        image:      sbx.image,
        resources:  sbx.resources,
    };
    tx.put_outbox(seq, &cmd)?;          // RocksDB write
    tx.transition(sbx.id, Pending, Enqueued)?;
    tx.on_commit(|| wake_pollers(seq)); // notify long-polls
    Ok(seq)
}
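The acknowledgement side can be sketched the same way. This is an illustrative in-memory model, with a `BTreeMap` standing in for the RocksDB outbox and invented names (`ack`, `needs_full_sync`), not the production code:

```rust
use std::collections::BTreeMap;

// Hypothetical outbox keyed by sequence number; RocksDB plays this role
// in the real system.
struct Outbox {
    commands: BTreeMap<u64, Vec<u8>>, // seq -> serialized command
    last_acked: u64,
}

impl Outbox {
    // Dataplane acks the highest sequence it has processed; everything
    // at or below that seq is safe to drop from the outbox.
    fn ack(&mut self, seq: u64) {
        self.commands.retain(|&s, _| s > seq);
        self.last_acked = self.last_acked.max(seq);
    }

    // A reconnecting dataplane reporting a seq below what it already
    // acked has lost state: request a full state sync before issuing
    // new commands.
    fn needs_full_sync(&self, reported_seq: u64) -> bool {
        reported_seq < self.last_acked
    }
}

fn main() {
    let mut ob = Outbox { commands: BTreeMap::new(), last_acked: 0 };
    for seq in 1..=3 {
        ob.commands.insert(seq, vec![0u8]);
    }
    ob.ack(2);
    assert_eq!(ob.commands.keys().copied().collect::<Vec<_>>(), vec![3]);
    assert!(!ob.needs_full_sync(2)); // in sync
    assert!(ob.needs_full_sync(1));  // regression: dataplane restarted
}
```

Unacked commands survive a scheduler restart because they live in the durable map, and a sequence-number regression is detectable rather than silent.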

03 · Long-polling, not pushing

The dataplane doesn't receive commands passively — it polls for them. This is a deliberate design choice.

Each dataplane process maintains two long-lived gRPC streams to the scheduler: one for heartbeats (carrying liveness signals and command responses), and one for poll_commands. The dataplane always calls poll_commands with the sequence number of the last batch it processed. If there are commands in the outbox, the scheduler returns them immediately. If there aren't, the scheduler holds the connection open for up to five minutes, waiting. When new commands arrive, the scheduler wakes the waiting poll and returns the batch. The dataplane sees the commands within milliseconds, not on the next scheduler tick.
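The wake-on-enqueue mechanic can be sketched with an in-process queue and a condition variable. This is a simplified model, not the gRPC implementation, and the names (`CommandQueue`, `push`, `poll`) are illustrative:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// The scheduler side pushes and notifies; the dataplane side blocks
// until commands arrive or the poll times out (five minutes in the
// real system).
struct CommandQueue {
    inner: Mutex<VecDeque<String>>,
    wake: Condvar,
}

impl CommandQueue {
    fn push(&self, cmd: String) {
        self.inner.lock().unwrap().push_back(cmd);
        self.wake.notify_all(); // wake any parked poll_commands call
    }

    fn poll(&self, timeout: Duration) -> Vec<String> {
        let mut q = self.inner.lock().unwrap();
        // Hold the "connection" open instead of returning empty.
        // The loop guards against spurious wakeups.
        while q.is_empty() {
            let (guard, res) = self.wake.wait_timeout(q, timeout).unwrap();
            q = guard;
            if res.timed_out() {
                break;
            }
        }
        q.drain(..).collect()
    }
}
```

A caller that polls while the queue is empty parks on the condvar; the moment `push` runs, the waiting poll returns with the batch, so delivery latency is the wakeup plus the network hop rather than any polling interval.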

END-TO-END LATENCY

The latency from scheduling decision to dataplane awareness is bounded only by network round-trip time — not by any polling interval.

04 · Starting 1000 sandboxes at once

Now we can revisit the opening scenario. When 1,000 requests arrive, the scheduler runs against the dataplane pool and emits 1,000 AddSandbox commands. They go into the outbox immediately and wake any waiting poll_commands connections.

Each dataplane process picks up its batch on the next poll, within milliseconds of enqueue. At a typical AddSandbox payload size, a single batch carries roughly 160 commands. Each process runs through its share of the 1,000 commands and spawns a task for each one. Those tasks run completely independently, and each sandbox starts the moment its task is scheduled by the runtime.
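The per-batch fan-out is simple: one independent task per command. A sketch using OS threads, with `start_sandbox` as a stand-in for the real runtime call that boots a MicroVM:

```rust
use std::thread;

// Placeholder for the real startup path (image setup, MicroVM boot, ...).
fn start_sandbox(id: u64) -> u64 {
    id // pretend the work happens here
}

// Spawn one task per command; nothing serializes the startups, so total
// wall time is bounded by the slowest startup, not the sum of them.
fn handle_batch(batch: Vec<u64>) -> Vec<u64> {
    let handles: Vec<_> = batch
        .into_iter()
        .map(|id| thread::spawn(move || start_sandbox(id)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let ids: Vec<u64> = (0..160).collect(); // one batch, ~160 commands
    assert_eq!(handle_batch(ids).len(), 160);
}
```

(The production dataplane uses async tasks rather than one thread per sandbox, but the independence property is the same.)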

The limiting factor for startup speed isn't the scheduler or the command delivery mechanism. It's the physical resources on the dataplane hosts — how fast a kernel can create namespaces, how fast the network can carry simultaneous sandbox initialization.

01 · Image pull speed
How fast the sandbox runtime can pull images, if they're not already cached. Warm images eliminate this cost entirely.

02 · Kernel primitives
How fast the kernel can create namespaces and cgroups. The MicroVM cold-start path is the floor.

03 · Network saturation
How fast the network can handle simultaneous sandbox initialization on a single host.

For the evaluation-benchmark scenario, this means all 1,000 sandboxes race toward ready as soon as the scheduler processes the requests. Assuming the dataplane fleet has capacity and images are warm, you're not waiting for batch N+1 to be scheduled — all 1,000 are in flight simultaneously.

05 · What happens when things go wrong

The command pattern also changes what failure looks like.

In a reconciliation loop, a controller crash means the loop stops running. Pending work piles up silently until the controller restarts and runs the next cycle. There's no record of which specific operations were in progress — only the diff between desired and actual state, which the next reconciliation cycle will re-derive.

With the command outbox, a scheduler crash doesn't lose work. The commands are in RocksDB. When the scheduler comes back, the outbox is still there. When a dataplane process reconnects and sends its next heartbeat, the scheduler checks the sequence numbers and either resumes delivery or requests a full state sync to re-derive what needs to happen.

                  Reconciliation             Command outbox
Unit of work      Desired-state diff         Durable command
Crash recovery    Re-derive from state       Replay from RocksDB
Latency floor     Scheduler tick interval    Network RTT
Observability     Inferred from diff         Backlog & drain gauges
Loss mode         Silent, until next tick    Detected via seq-num regression

This also makes the system observable in a way that reconciliation loops aren't. The outbox backlog is a direct metric: how many commands are waiting to be delivered? The drain rate tells you how quickly commands are being processed. If you're starting 1,000 sandboxes and you can see the backlog drain in 340ms, you have direct evidence of the system's throughput. The scheduler tracks these through a backlog gauge, a drain-rate counter, and a drain-latency histogram. Real numbers, not inferences from state diffs.
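The shape of those metrics is plain counters. A sketch with illustrative names (a real service would export these through a metrics library):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical outbox instrumentation: a gauge for pending commands and
// a monotonic counter whose rate over time is the drain rate. The
// drain-latency histogram is omitted for brevity.
#[derive(Default)]
struct OutboxMetrics {
    backlog: AtomicU64,       // commands waiting for delivery (gauge)
    drained_total: AtomicU64, // total acked (counter)
}

impl OutboxMetrics {
    fn on_enqueue(&self, n: u64) {
        self.backlog.fetch_add(n, Ordering::Relaxed);
    }
    fn on_ack(&self, n: u64) {
        self.backlog.fetch_sub(n, Ordering::Relaxed);
        self.drained_total.fetch_add(n, Ordering::Relaxed);
    }
}

fn main() {
    let m = OutboxMetrics::default();
    m.on_enqueue(1_000);
    m.on_ack(1_000);
    assert_eq!(m.backlog.load(Ordering::Relaxed), 0);
    assert_eq!(m.drained_total.load(Ordering::Relaxed), 1_000);
}
```

Watching `backlog` fall from 1,000 to 0 is a direct throughput measurement; nothing has to be inferred from a state diff.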

06 · The trade-off

The command pattern isn't strictly better than reconciliation in every dimension. Reconciliation loops are easier to implement correctly for complex state machines. You can add a new type of desired state without needing to define a new command type and ensure every subscriber handles it. And reconciliation is naturally idempotent — applying the same desired state twice produces no side effects.

The command outbox requires more discipline. Each command type needs an explicit handler, and the scheduler needs to be careful not to emit duplicate commands for the same sandbox. Tensorlake handles this by tracking sandbox state transitions and only emitting AddSandbox for sandboxes in the Pending state. Once a command is enqueued, the sandbox moves to a state that prevents re-emission.
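The re-emission guard described above amounts to a small state machine. A minimal sketch, assuming a simplified two-state model and an illustrative `try_emit` helper:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum SandboxState {
    Pending,
    Enqueued,
}

// Emit AddSandbox only from Pending, and move to Enqueued in the same
// step, so the same sandbox can never produce a duplicate command.
fn try_emit(state: &mut SandboxState) -> bool {
    if *state == SandboxState::Pending {
        *state = SandboxState::Enqueued;
        true // command emitted exactly once
    } else {
        false // already enqueued: re-emission suppressed
    }
}

fn main() {
    let mut s = SandboxState::Pending;
    assert!(try_emit(&mut s));
    assert!(!try_emit(&mut s)); // second attempt is a no-op
}
```

In the real scheduler the transition and the outbox write happen in the same transaction, so the guard and the command are durable together.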

For a workload like parallel AI evaluation, where latency and throughput at startup time actually matter, the command pattern pays off clearly. You're not waiting for a scheduler tick. You're not re-deriving state from a diff. You're delivering instructions as fast as the network can carry them, and letting each dataplane start as much work as its hardware can support.

Written by David Calavera, Software Engineer · Tensorlake