
Filter and rank: how we schedule sandboxes across every cloud

One primitive — filter, then rank — runs through statefulness, multi-cloud, BYOC, and the data path. A tour of the scheduler behind Tensorlake's sandbox fleet.

We're building a compute platform at @tensorlake optimized for agents. Success for us looks like our customers running more sandboxes, which means we have to keep finding more compute. We spend a lot of time thinking about how to spread workloads across AWS, GCP, and bare-metal providers like Hetzner and Latitude without sacrificing the things people come to us for: ultra-low-latency startup, long-lived sandboxes that scale to zero, and snapshotting that makes RL environments easy to reproduce.

Multi-cloud has been a design constraint from day one. Most of our early design partners ran at scale across different clouds and either wanted sandboxes close to their production systems or needed them inside their own VPCs.

The common thread you'll see across the rest of this post: filter-and-rank with the right compute dimensions turns out to be the unifying primitive across statefulness, multi-cloud, BYOC, and even our data path.

Sandboxes Are Stateful

A big part of Tensorlake's value proposition is stateful sandboxes. They let users run an agent harness in an execution environment that can scale to zero but retain its file system and memory. The next time the user comes back, the sandbox is restored to exactly the state it was in when it last ran.

We achieve this by snapshotting a Firecracker VM. We maintain an internal fork of Firecracker that enables fast memory and file system snapshotting. We capture the dirty memory pages of the sandbox and the dirty (and zero) blocks of the file system, pack them into a snapshot file, upload it to long-term storage, and cache it on the local SSD so that resuming takes a few hundred milliseconds to a second. We also track the lineage of file-system and memory diffs, so repeated suspensions only capture the delta and get progressively faster.
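The delta-snapshot lineage can be sketched roughly like this. The real system operates on Firecracker memory pages and block-device blocks; the names here are illustrative and the snapshot is modeled as a plain dict of pages:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Snapshot:
    """One link in a sandbox's snapshot lineage: only pages dirtied
    since the parent snapshot are stored."""
    snapshot_id: str
    parent: Optional["Snapshot"]
    dirty_pages: Dict[int, bytes]  # page number -> page contents

def take_snapshot(snapshot_id, parent, current_pages):
    """Capture only pages that changed relative to the parent's state."""
    base = restore(parent) if parent else {}
    delta = {n: data for n, data in current_pages.items() if base.get(n) != data}
    return Snapshot(snapshot_id, parent, delta)

def restore(snapshot):
    """Replay the lineage root-to-leaf so newer deltas overwrite older pages."""
    if snapshot is None:
        return {}
    pages = restore(snapshot.parent)
    pages.update(snapshot.dirty_pages)
    return pages
```

The second suspension in a lineage only carries the pages touched since the first, which is why repeated suspensions get progressively faster.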

Managing snapshots is the main challenge in multi-cloud sandbox infrastructure. A snapshot taken on GCP can't be resumed on AWS without losing the fast-warm-boot property. Every architectural decision below flows from this constraint.

Cloud-Aware Cluster Topology

Data planes register automatically when they come online, which lets us autoscale as demand shows up. Their heartbeats carry enough information to tell the scheduler where they live: cloud, region, and zone.

+-----------------------------+
|   Compute Node 1            |
|                             |
|   dataplane_id=xyz          | ---+
|   cloud=aws                 |    |
|   region=us-east-1          |    |
+-----------------------------+    |
                                   |
+-----------------------------+    |     +----------------------+
|   Compute Node 2            |    |     |                      |
|                             |    |     |                      |
|   dataplane_id=abc          | ---+---> |  Tensorlake Compute  |
|   cloud=gcp                 |    |     |       Engine         |
|   region=eu-central-1       |    |     |                      |
+-----------------------------+    |     +----------------------+
                                   |
+-----------------------------+    |
|   Compute Node 3            |    |
|                             |    |
|   dataplane_id=def          | ---+
|   cloud=hetzner             |
|   region=eu-fsn-1           |
+-----------------------------+

The scheduler maintains in-memory reverse indexes and trie data structures over this metadata for fast capacity lookups in any given region or zone. When a region has no capacity, requests are queued as blocked work; when new capacity comes online, only the blocked work for that specific region gets re-evaluated. This locality lets a single scheduler manage tens of thousands of nodes per region; RAM usually becomes the limiting factor before CPU does.
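A minimal sketch of the reverse index and the per-region blocked-work queue (class and field names are made up; the real scheduler's lookup structures are richer):

```python
from collections import defaultdict

class RegionIndex:
    """Reverse index from (cloud, region) to live data planes, plus a
    queue of requests blocked on capacity in that region."""

    def __init__(self):
        self.nodes = defaultdict(set)    # (cloud, region) -> dataplane ids
        self.blocked = defaultdict(list) # (cloud, region) -> queued requests

    def register(self, dataplane_id, cloud, region):
        """A data plane heartbeats in; release only the work blocked
        on THIS region, not the whole queue."""
        key = (cloud, region)
        self.nodes[key].add(dataplane_id)
        ready, self.blocked[key] = self.blocked[key], []
        return ready

    def request(self, cloud, region, req):
        """Return a candidate data plane, or queue the request as blocked."""
        key = (cloud, region)
        if self.nodes[key]:
            return next(iter(self.nodes[key]))  # stand-in for filter-and-rank
        self.blocked[key].append(req)
        return None
```

Re-evaluating only the region whose capacity changed is what keeps the scheduler's wake-ups cheap as the fleet grows.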

Storage and Cache Aware Scheduling

The foundation of Tensorlake's sandbox infrastructure is a cluster scheduler similar to Google's Borg and HashiCorp Nomad. When a user requests a sandbox with some CPU, RAM, and disk, the scheduler creates an allocation by performing a multi-dimensional search and ranking nodes where the sandbox can fit.

The way we sidestep the cross-cloud snapshot problem is straightforward: each data plane is configured with a shared NAS or blob store (or both) local to its zone, used for snapshots and base images. The scheduler picks a region when a sandbox is first created and always resumes it in that region, preferably in the same zone, and ideally on the same node where it last ran. Snapshots never need to cross cloud boundaries, which keeps warm boots fast.

Node Filtering and Ranking

We bin-pack aggressively so small sandboxes can land without fragmenting machines.

Filtering uses a spatial index of nodes in a 4-dimensional space of CPU, RAM, disk, and GPU cards. As sandboxes consume resources, a node's location in the index shifts. When a request comes in, we find the nearest points in space and evaluate them for placement. We built it specifically because agent workloads have wider resource-ask variance than batch or service workloads, and nearest-neighbor in resource space gives us much better packing than the usual constraint-and-score loop.
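As a rough sketch of the idea, here is nearest-neighbor filtering over the 4-dimensional resource space, done brute-force for clarity (the real index is an actual spatial data structure, and the names are illustrative):

```python
import math

def feasible(node, ask):
    """A node can host the sandbox only if every free dimension covers the ask."""
    return all(f >= a for f, a in zip(node["free"], ask))

def nearest_fits(nodes, ask, k=3):
    """Filter to nodes that fit the ask, then take the k whose free capacity
    vector (cpu, ram_gb, disk_gb, gpus) is closest to it.

    Tight fits leave big contiguous capacity on other machines, which is
    what keeps small sandboxes from fragmenting the fleet."""
    fits = [n for n in nodes if feasible(n, ask)]
    fits.sort(key=lambda n: math.dist(n["free"], ask))
    return fits[:k]
```

A request for `(2 CPU, 8 GB, 20 GB, 0 GPU)` would land on a mostly-full node over an empty 32-core machine, even though both are feasible.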

Ranking lets us prefer nodes that are already loaded (to keep fragmentation low) and, critically, nodes that are likely to have the snapshot cached. We frequently hit the local SSD cache of a suspended snapshot, or stream it from a peer node in the same zone; both are far faster than pulling from blob storage.

+-----------------------------+
|  Sandbox Resume Request     |
|                             |
|  CPU = 2                    |
|  RAM = 8000MB               |
|  Disk = 10G                 |
|  Sandbox ID=abcd            |
+-----------------------------+
              |
              v
+-----------------------------+
|  Filter Nodes from Spatial  |
|  Index of the region        |
+-----------------------------+
              |
              v
+-----------------------------+
|  Rank Nodes                 |
|                             |
|  Prefer Nodes on the same   |
|  cloud, region and zone     |
|  Prefer the same node       |
+-----------------------------+
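The resume flow above can be sketched as a scoring function over the filtered candidates. The weights here are invented for illustration; the point is the ordering, same node > same zone > same region, with a tiebreak toward loaded nodes:

```python
def rank(nodes, last_ran):
    """Rank feasible nodes for resuming a sandbox that last ran at `last_ran`."""

    def score(node):
        s = 0.0
        if node["id"] == last_ran["node"]:
            s += 100          # local SSD likely still caches the snapshot
        elif node["zone"] == last_ran["zone"]:
            s += 50           # can stream the snapshot from a zone peer
        elif (node["cloud"], node["region"]) == (last_ran["cloud"], last_ran["region"]):
            s += 10           # same regional NAS / blob store, no cross-cloud pull
        s += node["utilization"]  # 0..1: prefer already-loaded nodes
        return s

    return sorted(nodes, key=score, reverse=True)
```

Under this scheme a warm resume only falls back to blob storage when neither the original node nor any zone peer survives.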

Foundations for BYOC

The same filter-and-rank machinery we built for multi-cloud also gives us BYOC nearly for free, with a shared control plane across many tenants.

Every tenant in Tensorlake can pin their project to a specific region, cloud, or even their internal VPC. When a customer wants sandboxes running in their own network, we issue an mTLS certificate that encodes both their tenant identity and their data plane identity. The customer's data plane operator installs the cert when bringing up nodes; when those nodes connect back to Tensorlake's control plane, we know exactly which tenant they belong to.

When a user creates a sandbox with an API key, the scheduler filters down to nodes registered to that tenant's project, then ranks them using the same algorithm as everywhere else.
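The tenant-scoped filter step can be sketched like this (field names are hypothetical; in the real system the tenant identity comes from the mTLS certificate the node presented at registration):

```python
def tenant_nodes(all_nodes, tenant_id, pin=None):
    """Filter the fleet to nodes registered to this tenant, optionally
    narrowed by the project's pinned cloud/region/VPC attributes.

    The surviving nodes then go through the same rank step as any
    other placement."""
    nodes = [n for n in all_nodes if n["tenant"] == tenant_id]
    if pin:
        nodes = [n for n in nodes if all(n.get(k) == v for k, v in pin.items())]
    return nodes
```

BYOC is then just a tenant whose filter result happens to live entirely inside their own VPC.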

+------------------------------+
|  Customer VPC                |
|  (tenant_id=acme)            |
|                              |
|  +------------------------+  |
|  |  Compute Node          |  |
|  |  dataplane_id=xyz      |  |
|  |  cloud=aws             |  |
|  |  region=us-east-1      |  |
|  |  tenant=acme           |  |
|  +-----------+------------+  |
|              |               |
+--------------|---------------+
               |
               |  mTLS connection
               |  (cert encodes tenant + dataplane info)
               |
               v
+----------------------------------------------+
|  Tensorlake Control Plane (shared)           |
|                                              |
|  +----------------------------------------+  |
|  |  Identity & Tenancy                    |  |
|  |  - Verify TLS cert                     |  |
|  |  - Extract tenant_id, dataplane_id     |  |
|  +----------------+-----------------------+  |
|                   |                          |
|                   v                          |
|  +----------------------------------------+  |
|  |  Scheduler                             |  |
|  |                                        |  |
|  |  Filter:  nodes matching tenant's      |  |
|  |           pinned region/cloud/VPC      |  |
|  |                                        |  |
|  |  Rank:    prefer same cloud, region,   |  |
|  |           zone, then same node         |  |
|  +----------------+-----------------------+  |
|                   |                          |
+-------------------|--------------------------+
                    |
                    |  Sandbox placement
                    v
       +-------------------------+
       |  Sandbox runs in        |
       |  customer's own VPC     |
       +-------------------------+

The blob store inside the customer's network holds their snapshots and images. Tensorlake's network never sees customer data. All sandbox traffic, snapshots, and images live inside the customer's private network, gated by their own cloud's access control policies.

Federated Ingress and SSH

A central control plane is the wrong place for sandbox traffic to live. Routing every HTTP request and SSH session through Tensorlake would be a latency, scale, and trust bottleneck, especially for BYOC tenants whose whole reason for running in their own VPC is to keep traffic contained.

Every cloud and every private network runs its own ingress proxy and SSH gateway. When you create a sandbox, the SDK gets back an ingress endpoint that resolves directly to the local proxy in whichever data plane the sandbox landed on.

Auth at the edge works two ways:

  • Fast path: when the SDK creates a sandbox, the control plane returns a short-lived secret that the SDK passes along with subsequent requests. The local ingress verifies it without a control-plane round trip. This is the path almost every request takes for ephemeral sandboxes.
  • Slow path: for requests that arrive without a pre-issued secret (or after one expires), the ingress falls back to authenticating against the control plane. The result is cached for a short period.
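The two paths above can be sketched in one ingress-side class. Names, TTLs, and the control-plane interface are all illustrative, not the real API:

```python
import time

class EdgeAuth:
    """Ingress-side auth sketch: verify a pre-issued short-lived secret
    locally (fast path); otherwise fall back to the control plane and
    cache the verdict briefly (slow path)."""

    def __init__(self, control_plane, cache_ttl=60.0):
        self.control_plane = control_plane  # callable: token -> bool
        self.cache_ttl = cache_ttl
        self.secrets = {}                   # sandbox_id -> (secret, expiry)
        self.cache = {}                     # token -> cache expiry

    def install_secret(self, sandbox_id, secret, ttl):
        """Pushed down by the control plane when the sandbox is created."""
        self.secrets[sandbox_id] = (secret, time.monotonic() + ttl)

    def authorize(self, sandbox_id, token):
        # Fast path: purely local check, no control-plane round trip.
        entry = self.secrets.get(sandbox_id)
        if entry and entry[0] == token and time.monotonic() < entry[1]:
            return True
        # Slow path: consult the cache, then the control plane.
        if self.cache.get(token, 0.0) > time.monotonic():
            return True
        if self.control_plane(token):
            self.cache[token] = time.monotonic() + self.cache_ttl
            return True
        return False
```

The design choice being illustrated: the control plane is only on the data path when a request shows up cold, and even then only once per cache window.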

Sandbox data path is fully federated. A request to a sandbox running in a customer's VPC never leaves their network, even though the scheduling decision that placed it there was made by a shared control plane.

Benchmarks and What's Next

There's a lot more to write about how Tensorlake sandboxes work at scale. We've spent a lot of time and energy optimizing the entire stack so your agents can run fast.

Compute SDK benchmarks measuring burst startup time:

[Figure: sandbox provider leaderboard — Tensorlake ranked #2 by composite score]

We built a new reconciliation-loop-free scheduler that can place hundreds of sandboxes in a few milliseconds, and under burst, thousands of new sandboxes can land across many data-plane clusters. We built a custom block storage subsystem that gives coding agents near-SSD speed so they compile code faster. We will share more about how our stack works in the coming weeks!

If you are building with sandboxes, or just need a VM for your projects, we would love it if you tried Tensorlake Sandboxes and shared your feedback!

Docs — https://docs.tensorlake.ai/sandboxes/introduction

Written by Diptanu Gon Choudhury, CEO & Co-founder