Autoresearch on steroids with sandboxes

00 · The missing piece

Last month, Andrej Karpathy published autoresearch. The idea is simple: give an LLM agent a training script and a plain-English definition of the search space, meaning the parameters and code changes worth trying. The agent proposes a change, you run it, measure the resulting metric, and repeat. The repo hit ~64k GitHub stars within a few weeks.

The missing piece is execution. The agent generates Python scripts, and you should not run those in your host process. A candidate script may overwrite or delete files on disk, install packages you never asked for, crash your training loop, or hang and consume resources indefinitely.

Sandboxes solve this by running each candidate in isolation, with explicit time and resource limits. This post shows one way to build an autoresearch loop using Tensorlake Sandboxes as the execution layer.

TL;DR

Autoresearch is a hill-climbing agent over training-script edits. The bottleneck is not the LLM — it's running hundreds of agent-generated scripts safely, in parallel. Tensorlake sandboxes give you a bounded, disposable environment per candidate, plus snapshots so each run starts from a pre-warmed filesystem instead of reinstalling deps on every attempt.

01 · The autoresearch loop

You can treat autoresearch as a hill-climbing search over training-script edits:

1. Calibrate. Run the baseline script and record the metric (e.g. val_loss).
2. Propose. Ask the agent for N modifications, each returned as a complete Python script.
3. Race. Run all N candidates in parallel; record val_loss for each.
4. Accept. Promote the best candidate if it beats the current best.
5. Repeat. The accepted script becomes the new baseline.
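
The five steps above can be sketched as a small driver function. This is a minimal illustration, not the code from Karpathy's repo or the Tensorlake example: `propose` and `evaluate` are hypothetical callables standing in for the LLM call and the sandboxed run, and candidates are scored sequentially here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Best:
    script: str
    val_loss: float


def hill_climb(
    baseline: str,
    propose: Callable[[str, int], List[str]],  # hypothetical: (current script, n) -> n candidate scripts
    evaluate: Callable[[str], float],          # hypothetical: script -> val_loss (lower is better)
    iterations: int,
    n_candidates: int,
) -> Best:
    # 1. Calibrate: score the baseline once.
    best = Best(baseline, evaluate(baseline))
    for _ in range(iterations):
        # 2. Propose: ask for N complete candidate scripts.
        candidates = propose(best.script, n_candidates)
        if not candidates:
            continue
        # 3. Race: score every candidate (sequential in this sketch).
        scored = [(evaluate(c), c) for c in candidates]
        loss, script = min(scored, key=lambda pair: pair[0])
        # 4. Accept: promote only on strict improvement.
        if loss < best.val_loss:
            # 5. Repeat: the winner is the baseline for the next round.
            best = Best(script, loss)
    return best
```

Requiring a strict improvement keeps the loop from drifting sideways on ties or noise-level changes.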

Karpathy reports running ~700 experiments overnight and getting an ~11% accuracy improvement from incremental changes. Not bad for a hands-free method.

"The value of this is not new research outcomes. It is running many small, validated changes without manual iteration." (on overnight autoresearch runs)

02 · Sandboxes as the runtime

A sandbox gives you an isolated execution environment per candidate program. You start the sandbox, run the script, capture stdout and stderr, and tear it down. The host process never executes the candidate code. If the script crashes or exits with an error, you still get result.stdout and result.stderr back.

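from tensorlake.sandbox import Sandbox

# `script` holds the candidate's full Python source as a string.
box = Sandbox.create(memory_mb=4096, timeout_secs=900)
try:
    result = box.run("python3", ["-c", script], timeout=300)
finally:
    box.terminate()  # always tear down, even if run() raises

# result.stdout / result.stderr always come back,
# even if the candidate script crashes or hangs.
# parse_val_loss is a helper you supply to pull the metric from stdout.
val_loss = parse_val_loss(result.stdout)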
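
The snippet calls `parse_val_loss` without defining it. One possible version is sketched below, assuming the training script prints lines such as `val_loss: 2.8103` or `val_loss=2.8103` (that output format is an assumption, not something the post specifies). It takes the last reported value and treats missing, unparsable, or non-finite values as infinitely bad, so a crashed candidate can never be accepted.

```python
import math
import re

# Assumed output format: "val_loss: <number>" or "val_loss=<number>".
_VAL_LOSS = re.compile(r"val_loss\s*[:=]\s*([^\s,;]+)")


def parse_val_loss(stdout: str) -> float:
    """Return the last val_loss printed in stdout, or +inf if none is usable."""
    matches = _VAL_LOSS.findall(stdout)
    if not matches:
        return math.inf  # crashed or never reported a metric
    try:
        value = float(matches[-1])
    except ValueError:
        return math.inf  # token after "val_loss" was not a number
    # NaN or inf means the run diverged; score it as a loss that cannot win.
    return value if math.isfinite(value) else math.inf
```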

To race candidates in parallel, Tensorlake's map-reduce API fans out N sandbox runs and gathers their results in one call — no thread pool or queue to manage on your side.
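
The map-reduce call itself is not shown in this post. As a mental model of what the fan-out and gather do, or as a local stand-in for testing, the same shape can be sketched with the standard library. Here `run_one` is a hypothetical callable that runs one script in a sandbox and returns its val_loss; threads are used on the assumption that each call mostly waits on a remote sandbox.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


def race(
    scripts: Sequence[str],
    run_one: Callable[[str], float],  # hypothetical: script -> val_loss
    max_workers: int = 8,
) -> List[float]:
    """Run every candidate concurrently; a failed run scores +inf so it cannot win."""

    def safe(script: str) -> float:
        try:
            return run_one(script)
        except Exception:
            return float("inf")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so losses[i] belongs to scripts[i].
        return list(pool.map(safe, scripts))


def pick_winner(
    scripts: Sequence[str], losses: Sequence[float], current_best: float
) -> Optional[Tuple[str, float]]:
    """Return (script, loss) for the best candidate if it beats current_best, else None."""
    if not losses:
        return None
    i = min(range(len(losses)), key=losses.__getitem__)
    return (scripts[i], losses[i]) if losses[i] < current_best else None
```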

FULL EXAMPLE

The full autoresearch example — baseline script, prompt, runner, and acceptance logic — lives in the Tensorlake docs: docs.tensorlake.ai/sandboxes/agentic-autoresearch.

03 · What the agent remembers

The agent does better if it can see what has already been tried and what happened. A lightweight approach is to append a small experiment log to the prompt — for example, the last 8 iterations:

Status      Iter  Val     Δ        Change
✓ ACCEPTED  02    2.8103  −0.0412  Added LR decay: multiply LR by 0.999 each step
✗ rejected  03    2.8634  +0.0531  Increased hidden size from 64 to 128
✓ ACCEPTED  04    2.7891  −0.0212  Replaced tanh with ReLU activation
✗ rejected  05    2.9102  +0.1211  Added second hidden layer (size 32)

You can also vary temperature within a batch: keep the first candidate conservative and make later candidates more exploratory. The log is cheap — a few hundred tokens — and it keeps the agent from re-proposing changes it already tried.
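
A sketch of both ideas, the rolling log and the per-candidate temperature spread, might look like the following. The `Entry` fields mirror the log columns above; the helper names and the 0.2 to 1.0 temperature range are illustrative assumptions, not values from the original example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    iteration: int
    val_loss: float
    delta: float      # change versus the best val_loss at the time
    change: str       # one-line description of the edit
    accepted: bool


def format_log(entries: List[Entry], last_n: int = 8) -> str:
    """Render the most recent entries as plain text to append to the prompt."""
    lines = ["status    iter  val_loss  delta    change"]
    for e in entries[-last_n:]:
        status = "ACCEPTED" if e.accepted else "rejected"
        lines.append(
            f"{status:<9} {e.iteration:02d}    {e.val_loss:.4f}    {e.delta:+.4f}  {e.change}"
        )
    return "\n".join(lines)


def temperatures(n: int, low: float = 0.2, high: float = 1.0) -> List[float]:
    """Spread n sampling temperatures from conservative (low) to exploratory (high)."""
    if n <= 0:
        return []
    if n == 1:
        return [low]
    step = (high - low) / (n - 1)
    return [round(low + i * step, 3) for i in range(n)]
```

Candidate `i` in a batch would then be requested with `temperatures(N)[i]`, so the first proposal stays close to the current script and later ones range further.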

04 · Preventing reward hacking

One constraint worth enforcing is a fixed training budget per run. If the agent can change STEPS, it can reduce val_loss by training longer, which makes runs incomparable — and the leaderboard becomes meaningless.

In Karpathy's setup this is handled in the human-controlled guidance:

guidance.txt

STEPS  (DO NOT CHANGE — fixed budget)

On the execution side, sandboxes give you a second control lever: each run has a bounded environment and a hard ceiling on CPU, memory, and wall-clock time. An agent that tries to "win" by spinning up a background worker, phoning home, or running past the budget just gets killed by the runtime.

BELT AND SUSPENDERS

Trust the prompt, but verify at the runtime. Most reward hacks are honest attempts by a capable model to optimize the metric you told it to optimize. A hard sandbox limit is how you keep the agent's definition of "winning" aligned with yours.
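
A cheap host-side check can complement both the prompt and the sandbox limits: before running a candidate, confirm that its STEPS constant still matches the baseline. The sketch below is an addition for illustration, not part of the original setup. It assumes STEPS is assigned as a literal at module top level, and it is only a first filter, since a script could still train longer in ways a static check cannot see.

```python
import ast
from typing import Any, Optional


def read_constant(source: str, name: str = "STEPS") -> Optional[Any]:
    """Return the literal value of the last top-level assignment to `name`, or None."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    value = None
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AnnAssign, ast.AugAssign)):
            targets = [node.target]
        else:
            continue
        if not any(isinstance(t, ast.Name) and t.id == name for t in targets):
            continue
        # "STEPS += 500" or a bare "STEPS: int" cannot be verified statically.
        if isinstance(node, ast.AugAssign) or node.value is None:
            return None
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError, SyntaxError):
            return None  # non-literal right-hand side, e.g. STEPS = STEPS * 10
    return value


def budget_unchanged(baseline: str, candidate: str, name: str = "STEPS") -> bool:
    """True only if both scripts define `name` as the same literal value."""
    expected = read_constant(baseline, name)
    return expected is not None and read_constant(candidate, name) == expected
```

Candidates that fail the check can be logged as rejected without spending any sandbox time on them.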

05 · What the loop finds

On an MLP trained over a small public-domain corpus, a simple hill-climbing loop often accepts at most about half of the proposals (for example, ~2–4 out of 8). The accepted changes are usually straightforward training tweaks:

Learning-rate decay: multiply LR by 0.999 each step.

Activation function: tanh → ReLU.

Initialization scale: tighter/looser std on the weight init.
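
To make the first tweak concrete: multiplying the learning rate by 0.999 every step is exponential decay, so after n steps the rate is the starting rate times 0.999 to the power n. A minimal sketch, with the function name and the starting rate of 0.1 chosen only for illustration:

```python
def decayed_lr(base_lr: float, step: int, decay: float = 0.999) -> float:
    """Learning rate after `step` multiplicative decays of factor `decay`."""
    return base_lr * decay ** step


# Illustrative values only: a starting LR of 0.1 over the 150-step smoke-test budget.
lr_start = decayed_lr(0.1, 0)
lr_end = decayed_lr(0.1, 150)
```

Over 150 steps the rate ends at roughly 86% of where it started, which is a gentle schedule; over a few thousand steps the same factor would shrink it far more aggressively.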

The value here is not novel research outcomes. It's running many small, validated changes without manual iteration — the kind of hyperparameter sweep that normally eats a week of someone's time, compressed into an overnight job.

06 · Running it

Install the pieces and set your keys:

pip install tensorlake openai rich python-dotenv

export TENSORLAKE_API_KEY="your-api-key-here"
export OPENAI_API_KEY="your-openai-key-here"

Start with a short smoke test before running a longer sweep to verify everything is in place:

python autoresearch.py --smoke
# 3 iterations · 2 candidates · 150 steps

Once that works, remove --smoke and let it run overnight. On a small MLP, each iteration costs roughly cents of sandbox time, so even a 700-experiment budget is a rounding error against the time you'd have spent sweeping manually.

07 · Pre-warmed environments

If each candidate installs dependencies (for example, numpy, torch) at the start of every run, that overhead can dominate short experiments. One way to avoid repeated installs:

1. Install dependencies once.
2. Snapshot the sandbox.
3. Start each candidate from that snapshot.

from tensorlake.sandbox import Sandbox

# 1. Install dependencies once.
base = Sandbox.create(memory_mb=4096, timeout_secs=900)
base.run("bash", ["-lc", "pip install numpy torch==2.5.0"])

# 2. Checkpoint the warm sandbox.
snap = base.checkpoint()
base.terminate()   # source is done; the snapshot lives on.

# 3. Every candidate restores from the snapshot — no pip install.
def run_one(script: str) -> float:
    box = Sandbox.create(snapshot_id=snap.snapshot_id, timeout_secs=900)
    try:
        r = box.run("python3", ["-c", script], timeout=300)
    finally:
        box.terminate()  # tear down even if run() raises
    return parse_val_loss(r.stdout)

Tensorlake's snapshot API is designed to restore a known filesystem and memory state quickly, so every candidate starts from the same warm baseline. On a 700-iteration overnight run, that turns a multi-second pip install per candidate into a sub-second restore — which is the difference between "overnight" and "over the weekend".

WHEN IT PAYS OFF

Pre-warming is most valuable when (a) the dependency install time is significant relative to the training run, or (b) you want reproducibility across candidates — same exact environment, same exact starting state, every time.
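
Condition (a) is easy to estimate before committing to snapshots. The sketch below computes the share of each run spent installing and the total install time removed across a sweep. The function names are hypothetical and the timings in the usage lines are made-up placeholders, not measurements; substitute your own.

```python
def install_overhead(install_s: float, train_s: float) -> float:
    """Fraction of one candidate's wall-clock time spent installing dependencies."""
    return install_s / (install_s + train_s)


def serial_hours_saved(n_runs: int, install_s: float, restore_s: float) -> float:
    """Total hours of install time replaced by restores, summed over all runs."""
    return n_runs * (install_s - restore_s) / 3600


# Placeholder timings for illustration only.
share = install_overhead(install_s=30.0, train_s=30.0)
saved = serial_hours_saved(n_runs=700, install_s=37.0, restore_s=1.0)
```

The saved figure is a serial total; when candidates run in parallel, the wall-clock saving is smaller, roughly that total divided by the number of concurrent sandboxes.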

Autoresearch isn't magic. It's a small loop — propose, run, measure, accept — built on top of two things that are hard to do safely at scale: running agent-generated code, and running a lot of it in parallel. The sandbox is what makes the loop boring enough to leave running overnight. That's the whole trick.

Written by: Tensorlake Team, Engineering