00 · The 1000-sandbox moment
Imagine you're running an evaluation benchmark for a new code-generation model. You've written 1,000 programming challenges. Each one needs an isolated Python environment to execute the generated code and verify it against expected output. Your harness fires all 1,000 requests at once.
What happens next depends almost entirely on how your compute platform schedules work. In most systems, sandboxes start one by one, or in small batches on a loop. The first sandbox might start in under a second. The last one might wait minutes. By the time you're looking at results, you've spent more time waiting for environments to boot than actually running your model.
Tensorlake takes a different approach. This post walks through what we're building under the hood.
We replaced the reconciliation loop with a durable command outbox. Scheduling decisions write commands to RocksDB; long-polling dataplane processes pick them up in milliseconds. End-to-end latency is bounded by network RTT, not a scheduler tick. 1,000 sandboxes don't wait in line — they just start.
01 · The reconciliation trap
The standard approach to scheduling in distributed systems is a reconciliation loop. A controller periodically compares the desired state of the world against the actual state, and takes actions to close the gap. Kubernetes works this way. Most task schedulers work this way.
Reconciliation loops are a reasonable design. They're simple to reason about, they self-heal naturally, and they make eventual consistency easy to implement. But they only react as fast as the loop runs. If the loop ticks every 10 seconds, and you submit 1,000 jobs at once, the scheduler might observe only the first batch of pending work on the first tick, start those, and only discover the remaining 900 jobs on the next tick — 10 seconds later. You can shorten the interval, but you can't eliminate it.
There's a second problem. Reconciliation loops work by reading current state and diffing it against desired state. Under load, that read becomes expensive. And if the controller restarts while a reconciliation is in progress, there's no record of which operations were in flight — the system just loses track of them until the next full cycle re-derives the diff.
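The latency floor is easy to see in miniature. The sketch below is illustrative, not any real controller: `reconcile` diffs desired against actual state and starts up to `batch_limit` sandboxes per tick, and `ticks_to_converge` counts how many ticks 1,000 pending sandboxes need.

```rust
use std::collections::HashSet;

// One reconciliation pass: diff desired against actual and start up to
// `batch_limit` missing sandboxes. Returns how many were started.
fn reconcile(desired: &HashSet<u64>, actual: &mut HashSet<u64>, batch_limit: usize) -> usize {
    let missing: Vec<u64> = desired.difference(actual).take(batch_limit).cloned().collect();
    for id in &missing {
        actual.insert(*id); // stand-in for "start this sandbox"
    }
    missing.len()
}

// Count the ticks needed to converge on `total` sandboxes.
fn ticks_to_converge(total: u64, batch_limit: usize) -> usize {
    let desired: HashSet<u64> = (0..total).collect();
    let mut actual = HashSet::new();
    let mut ticks = 0;
    while actual.len() < desired.len() {
        reconcile(&desired, &mut actual, batch_limit);
        ticks += 1;
    }
    ticks
}
```

With a batch limit of 100 and a 10-second tick, 1,000 jobs take 10 ticks to schedule — 100 seconds of wall-clock latency before the last sandbox even begins to boot, regardless of how fast the hosts are.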
02 · Commands as the unit of work
Instead of periodically reconciling desired state against actual state, our scheduler emits discrete, durable commands whenever scheduling decisions are made.
When a sandbox needs to start, the scheduler creates an AddSandbox command and writes it to a persistent outbox backed by RocksDB. That write happens immediately, in the same transaction as the scheduling decision itself. The command isn't derived later from a state comparison — it's the direct output of the decision.
Our virtual machine fleet connects to the scheduler through a service we call dataplane. Dataplane runs on each host, reports its available resources, and receives work from the scheduler. Each process manages the sandboxes on its host, starting them, routing commands to them, and reporting results back.
The full cycle to start a sandbox:
1. Scheduler decides a sandbox is needed.
2. It writes an `AddSandbox` command to the RocksDB outbox.
3. It notifies waiting dataplane connections.
4. Dataplane receives the command batch.
5. It spawns sandbox startup tasks in parallel.
6. It reports `SandboxStarted` via a bidirectional gRPC heartbeat.
7. Scheduler acks the command and removes it from the outbox.
Each command carries a sequence number. When a dataplane process picks up a batch, it acks the highest sequence number it has processed. The scheduler only removes commands from the outbox after they're acknowledged. If the scheduler restarts, the commands are still in RocksDB. If a dataplane process restarts, the scheduler detects sequence-number regression and requests a full state sync before issuing new commands. No command is silently lost.
```rust
// Scheduler emits AddSandbox; write is durable and transactional.
pub fn schedule_sandbox(tx: &mut Txn, sbx: SandboxSpec) -> Result<Seq> {
    let seq = tx.next_seq(OUTBOX)?;
    let cmd = Command::AddSandbox {
        seq,
        sandbox_id: sbx.id,
        image: sbx.image,
        resources: sbx.resources,
    };
    tx.put_outbox(seq, &cmd)?; // RocksDB write
    tx.transition(sbx.id, Pending, Enqueued)?;
    tx.on_commit(|| wake_pollers(seq)); // notify long-polls
    Ok(seq)
}
```

03 · Long-polling, not pushing
The dataplane doesn't receive commands passively — it polls for them. This is a deliberate design choice.
Each dataplane process maintains two long-lived gRPC streams to the scheduler: one for heartbeats (carrying liveness signals and command responses), and one for poll_commands. The dataplane always calls poll_commands with the sequence number of the last batch it processed. If there are commands in the outbox, the scheduler returns them immediately. If there aren't, the scheduler holds the connection open for up to five minutes, waiting. When new commands arrive, the scheduler wakes the waiting poll and returns the batch. The dataplane sees the commands within milliseconds, not on the next scheduler tick.
The latency from scheduling decision to dataplane awareness is bounded only by network round-trip time — not by any polling interval.
04 · Starting 1000 sandboxes at once
Now we can revisit the opening scenario. When 1,000 requests arrive, the scheduler runs against the dataplane pool and emits 1,000 AddSandbox commands. They go into the outbox immediately and wake any waiting poll_commands connections.
Each dataplane process picks up its batch on the next poll — within milliseconds of enqueue. At typical AddSandbox payload sizes, a batch carries roughly 160 commands. Each process runs through its share of the 1,000 commands and spawns a task for each one. Those tasks run completely independently. Each sandbox starts the moment its task is scheduled by the runtime.
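The fan-out is the simple part. A sketch, using OS threads in place of the real async runtime and a stub in place of the actual MicroVM boot path (`start_sandbox` and `start_batch` are illustrative names):

```rust
use std::thread;

// Stand-in for booting one MicroVM; returns the sandbox id on success.
fn start_sandbox(id: u64) -> u64 {
    id
}

// Spawn one startup task per command in the batch. No task waits on
// any other; the batch completes as fast as the slowest boot.
fn start_batch(batch: Vec<u64>) -> Vec<u64> {
    let handles: Vec<_> = batch
        .into_iter()
        .map(|id| thread::spawn(move || start_sandbox(id)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```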
The limiting factor for startup speed isn't the scheduler or the command delivery mechanism. It's the physical resources on the dataplane hosts:

- **Image pull speed**: how fast the sandbox runtime can pull images, if they're not already cached. Warm images eliminate this entirely.
- **Kernel primitives**: how fast the kernel can create namespaces and cgroups. The MicroVM cold-start path is the floor.
- **Network saturation**: how fast the network can handle simultaneous sandbox initialization on a single host.
For the evaluation-benchmark scenario, this means all 1,000 sandboxes race toward ready as soon as the scheduler processes the requests. Assuming the dataplane fleet has capacity and images are warm, you're not waiting for batch N+1 to be scheduled — all 1,000 are in flight simultaneously.
05 · What happens when things go wrong
The command pattern also changes what failure looks like.
In a reconciliation loop, a controller crash means the loop stops running. Pending work piles up silently until the controller restarts and runs the next cycle. There's no record of which specific operations were in progress — only the diff between desired and actual state, which the next reconciliation cycle will re-derive.
With the command outbox, a scheduler crash doesn't lose work. The commands are in RocksDB. When the scheduler comes back, the outbox is still there. When a dataplane process reconnects and sends its next heartbeat, the scheduler checks the sequence numbers and either resumes delivery or requests a full state sync to re-derive what needs to happen.
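Both recovery paths reduce to two small pieces of logic. A sketch, with an in-memory map standing in for RocksDB (the `on_reconnect` and `ack` names are illustrative):

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum ResumeAction {
    ResumeDelivery, // dataplane kept its state; continue past its ack
    FullStateSync,  // sequence regressed; re-derive before new commands
}

// The scheduler remembers the highest sequence each dataplane acked.
// A reconnect reporting a lower number means the process restarted
// and lost in-memory state.
fn on_reconnect(recorded_ack: u64, reported_ack: u64) -> ResumeAction {
    if reported_ack < recorded_ack {
        ResumeAction::FullStateSync
    } else {
        ResumeAction::ResumeDelivery
    }
}

// Acking prunes everything at or below the acked sequence; unacked
// commands stay durable and are redelivered on the next poll.
fn ack(outbox: &mut BTreeMap<u64, String>, acked: u64) {
    *outbox = outbox.split_off(&(acked + 1));
}
```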
| | Reconciliation | Command outbox |
|---|---|---|
| Unit of work | Desired-state diff | Durable command |
| Crash recovery | Re-derive from state | Replay from RocksDB |
| Latency floor | Scheduler tick interval | Network RTT |
| Observability | Inferred from diff | Backlog & drain gauges |
| Loss mode | Silent, until next tick | Detected via seq-num regression |
This also makes the system observable in a way that reconciliation loops aren't. The outbox backlog is a direct metric: how many commands are waiting to be delivered? The drain rate tells you how quickly commands are being processed. If you're starting 1,000 sandboxes and you can see the backlog drain in 340ms, you have direct evidence of the system's throughput. The scheduler tracks these through a backlog gauge, a drain-rate counter, and a drain-latency histogram. Real numbers, not inferences from state diffs.
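Those metrics fall straight out of the command log. A minimal sketch, assuming per-command enqueue timestamps (the `OutboxMetrics` type and its fields are illustrative, not the production instrumentation):

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

struct OutboxMetrics {
    enqueued_at: BTreeMap<u64, Instant>, // seq -> enqueue time
    drained_total: u64,                  // feeds the drain-rate counter
}

impl OutboxMetrics {
    fn new() -> Self {
        Self { enqueued_at: BTreeMap::new(), drained_total: 0 }
    }

    // Backlog gauge: commands written but not yet acked.
    fn backlog(&self) -> usize {
        self.enqueued_at.len()
    }

    fn on_enqueue(&mut self, seq: u64) {
        self.enqueued_at.insert(seq, Instant::now());
    }

    // On ack, emit one drain-latency sample (for the histogram) per
    // command the ack covers, and drop them from the backlog.
    fn on_ack(&mut self, acked: u64) -> Vec<Duration> {
        let keep = self.enqueued_at.split_off(&(acked + 1));
        let drained = std::mem::replace(&mut self.enqueued_at, keep);
        self.drained_total += drained.len() as u64;
        drained.values().map(|t| t.elapsed()).collect()
    }
}
```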
06 · The trade-off
The command pattern isn't strictly better than reconciliation in every dimension. Reconciliation loops are easier to implement correctly for complex state machines. You can add a new type of desired state without needing to define a new command type and ensure every subscriber handles it. And reconciliation is naturally idempotent — applying the same desired state twice produces no side effects.
The command outbox requires more discipline. Each command type needs an explicit handler, and the scheduler needs to be careful not to emit duplicate commands for the same sandbox. Tensorlake handles this by tracking sandbox state transitions and only emitting AddSandbox for sandboxes in the Pending state. Once a command is enqueued, the sandbox moves to a state that prevents re-emission.
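The guard is small but load-bearing. A sketch of the idea, with illustrative names (the real transition happens inside the same RocksDB transaction as the outbox write):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum SandboxState {
    Pending,  // scheduled, but no command emitted yet
    Enqueued, // AddSandbox written to the outbox
}

// Emit at most once: only a Pending sandbox produces a command, and
// the transition to Enqueued happens in the same step, so a retry or
// a second scheduling pass cannot enqueue a duplicate.
fn maybe_emit_add_sandbox(state: &mut SandboxState) -> bool {
    if *state == SandboxState::Pending {
        *state = SandboxState::Enqueued;
        true // caller writes the AddSandbox command
    } else {
        false
    }
}
```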
For a workload like parallel AI evaluation, where latency and throughput at startup time actually matter, the command pattern pays off clearly. You're not waiting for a scheduler tick. You're not re-deriving state from a diff. You're delivering instructions as fast as the network can carry them, and letting each dataplane start as much work as its hardware can support.