karpathy/autoresearch: What Does It Do?

Overview

karpathy/autoresearch is an open-source project by Andrej Karpathy that sets up an autonomous AI research loop for language model training. The core idea is disarmingly simple: give an AI coding agent a small but real LLM training setup, let it experiment autonomously overnight, and wake up to a log of experiments and (hopefully) a better model — all without you touching a single Python file.

“The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats.”

— Andrej Karpathy

The Autonomous Research Loop

The agent follows a tight, repeatable loop:

Modify — The agent edits train.py (model architecture, hyperparameters, optimizer, batch size, etc.)
Train — A training run executes for exactly 5 minutes of wall-clock time
Evaluate — The metric val_bpb (validation bits per byte) is computed — lower is better
Accept or Discard — If val_bpb improved, the change is kept; otherwise it is reverted
Repeat — The loop continues, accumulating ~12 experiments per hour, ~100 experiments overnight

This is essentially automated hypothesis testing for neural architecture search, but with a general-purpose code-editing agent rather than a hand-engineered search procedure.

Project Structure

The repository is deliberately minimal — only three files matter:

File	Role	Who edits it
`prepare.py`	Data prep, tokenizer training, eval utilities	Nobody
`train.py`	GPT model, Muon+AdamW optimizer, training loop	The AI agent
`program.md`	Agent instructions (“research org code”)	The human

The human’s job shifts from writing Python to writing program.md — the Markdown file that defines the agent’s strategy, constraints, and goals. This is described as “programming the research org” rather than programming the model.

Why `val_bpb`?

The metric chosen — validation bits per byte — has two useful properties for automated research:

Vocabulary-size independent: architectural changes that alter the vocabulary size are still fairly compared
Fixed time budget: all experiments run for exactly 5 minutes regardless of model size or hardware, making runs directly comparable

The tradeoff is that results are platform-specific: an H100 will explore a very different area of the architecture space than a consumer GPU in the same 5 minutes.

Design Principles

1. Single file to modify

The agent only edits train.py. This keeps diffs small and reviewable, and prevents the agent from accidentally breaking data loading or evaluation logic.

2. Fixed time budget

A hard 5-minute wall-clock budget (excluding startup/compilation) means every experiment is equivalent regardless of what the agent changes. This is analogous to fixing a compute budget rather than a number of gradient steps.

3. Self-contained

No distributed training, no complex configs, no external services. One GPU, one file, one metric. The simplicity is intentional — it makes the system auditable and easy to fork.

4. Human programs strategy, agent programs model

program.md separates the human’s high-level research intent from the agent’s low-level implementation decisions. Iterating on program.md is how you guide the research direction; iterating on train.py is what the agent does automatically.

How to Run It

# Install uv (once)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# One-time data preparation (~2 min)
uv run prepare.py

# Single manual experiment (~5 min)
uv run train.py

Once the setup works, point any coding agent (Claude, Codex, etc.) at the repo, disable file-system permissions outside the project, and prompt:

Have a look at program.md and kick off a new experiment.

Why This Matters for Agentic AI Research

From an agentic AI perspective, autoresearch is an interesting case study because:

It demonstrates tool-use grounding: the agent’s actions are grounded in real training runs with measurable outcomes, not just text generation.
It shows minimal scaffolding: there is no complex orchestration framework — just a Markdown file and a training script.
It implements a closed-loop reward signal: the agent receives concrete feedback (val_bpb) after each action, enabling genuine trial-and-error learning at the meta-level.
It raises questions about human-in-the-loop research: the human’s role becomes designing the evaluation protocol and the agent’s instruction set, rather than running experiments directly.

For researchers working at the intersection of LLMs, autonomous agents, and scientific discovery, this project is a concrete, reproducible baseline worth studying.