The Autoresearch Pattern: A Blueprint for Self-Improving Agents
This series has been asking the wrong question.
We spent four posts on how an agent modifies itself — the substrate (TypeScript vs. Lisp vs. BEAM), the safety mechanisms (supervision trees, hot code swapping, rollback), the architecture (Opal, Loom). All necessary. But none of it answers the question that actually matters: how does the agent know it got better?
Andrej Karpathy's autoresearch answers that question with a pattern so simple it's easy to miss what makes it work.
Autoresearch gives an AI agent a small LLM training setup and tells it: improve this. The agent modifies the training code, runs a 5-minute experiment, checks the score, and either keeps the change or reverts it. Then it does it again. And again. Roughly 12 experiments per hour, 100 overnight. The human sleeps; the agent researches.
The code is trivial. The insight is structural. Three things are held separate:
- The judge: a fixed evaluation metric the agent cannot touch.
- The subject: a bounded slice of code the agent is allowed to modify.
- The process: the keep/revert loop itself, outside the agent's reach.
These three separations are the whole trick. Remove any one and the system breaks.
Without a fixed judge, the agent can game its own metric. An agent that controls its evaluation function will eventually learn to modify the function rather than improve itself — it's the path of least resistance. Goodhart's law applied to self-improvement: when the agent controls the measure, the measure stops measuring.
Without a bounded subject, the search space explodes. Autoresearch constrains modification to a single file. Not because the agent can't handle more, but because unbounded self-modification has no gradient. If everything can change at once, you can't attribute improvement to any specific change. You lose the signal.
Without a fixed process, the loop can't converge. The keep/revert decision rule must be outside the agent's reach. An agent that can modify its own keep/revert logic will eventually keep changes that shouldn't be kept — not out of malice, but because "change the acceptance criteria" is a valid move in an unconstrained search.
These aren't engineering constraints. They're epistemological ones. A system cannot be both the experimenter and the experiment, the judge and the defendant, the optimizer and the objective. You need at least one fixed point, or nothing is anchored.
Trace the self-improvement loop step by step. Each iteration, the agent picks a modification, evaluates it, and keeps or reverts it. The score ratchets upward — and everything around it stays fixed.
Our OpenClaw series explored self-modification at the implementation level. Now overlay the autoresearch pattern and everything snaps into focus. Five steps. No specific language, no specific runtime. Just the structure.
Step 1: Define the Judge. What does "better" mean for your agent? Pick a metric the agent cannot influence. Task completion rate on a held-out benchmark. User satisfaction measured externally. Error rate logged by an independent monitor. Efficiency — same task, fewer tokens, fewer tool calls, less wall-clock time. The metric must be external to the agent's modification scope. If the agent can touch the evaluation harness, it will. Not because it's adversarial — because optimizers optimize, and modifying the metric is often easier than improving performance.
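As a concrete sketch — with a made-up two-task benchmark standing in for a real held-out suite — the judge can be as small as a function the agent is handed but can never edit:

```python
from statistics import mean

# Frozen held-out tasks. The agent never sees or modifies this file.
# (Illustrative toy tasks; a real benchmark would be far larger.)
HELD_OUT_TASKS = [
    {"prompt": "2 + 2", "expected": "4"},
    {"prompt": "capital of France", "expected": "Paris"},
]

def evaluate(agent_answer_fn) -> float:
    """Score the agent on the held-out benchmark: fraction of exact matches.

    The agent is passed in as a callable. It gets no handle on this module,
    so it cannot rewrite the metric it is being scored against.
    """
    hits = [agent_answer_fn(t["prompt"]) == t["expected"] for t in HELD_OUT_TASKS]
    return mean(hits)
```

The design choice that matters is the direction of the dependency: the harness calls the agent, never the reverse.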
Step 2: Bound the Subject. Decide what the agent is allowed to modify about itself. Not everything. Not nothing. A well-chosen slice: its tool implementations, its routing logic, its prompt templates, its strategies. What it must not modify: the evaluation harness, the supervision/safety layer, the keep/revert logic, the core message transport. Start narrow, widen as trust is earned. One change per experiment. Measure. Keep or revert.
Step 3: Build the Loop. The agent identifies a candidate improvement and generates a modified version of one bounded component. The system loads the modification and runs the evaluation suite. If the metric improves: keep. If not: revert to the previous version. Log. Repeat.
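Here is that loop as a runnable sketch. Only `self_improvement_loop` reflects the structure above; the toy "subject" (a single number nudged at random, standing in for real training code) is purely illustrative:

```python
import random

def self_improvement_loop(score, propose, apply, revert, iterations=100):
    """The fixed process: one change per experiment, keep or revert, log, repeat.

    score():   the fixed judge; returns a float, higher is better
    propose(): the agent's candidate change to the bounded subject
    apply(c):  load the candidate
    revert():  restore the previous version
    """
    best = score()
    log = []
    for i in range(iterations):
        candidate = propose()       # one change per experiment
        apply(candidate)
        new = score()
        if new > best:              # keep only strict improvements
            best = new
            log.append((i, "keep", new))
        else:                       # otherwise roll back; this rule never changes
            revert()
            log.append((i, "revert", new))
    return best, log

# Toy substrate: the "subject" is one number, candidates are random nudges.
state = {"value": 0.0, "previous": 0.0}

def score():
    return state["value"]

def propose():
    return random.uniform(-0.1, 0.1)

def apply(delta):
    state["previous"] = state["value"]
    state["value"] += delta

def revert():
    state["value"] = state["previous"]

random.seed(0)
best, log = self_improvement_loop(score, propose, apply, revert, iterations=50)
```

Because the loop only ever keeps strict improvements, the score is monotone: after 50 iterations `best` can never sit below its starting value, no matter how bad the proposals were.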
Step 4: Establish the Fixed Points. Every self-improving system needs things that don't change.
| Fixed Point | What It Protects |
|---|---|
| Evaluation harness | Prevents metric gaming |
| Keep/revert logic | Prevents acceptance criteria drift |
| Safety invariants | Prevents the agent from removing its own guardrails |
| The loop itself | Prevents the process from being optimized away |
| Logging/audit trail | Preserves the ability to understand what happened |
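How do you make a fixed point operationally fixed? One option — an assumption on my part, not part of Karpathy's setup — is to checksum the protected files when the human starts the loop and halt the moment they drift:

```python
import hashlib
from pathlib import Path

class FixedPointGuard:
    """Detect tampering with the files that must never change."""

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]
        # Baseline taken once, when the human starts the loop.
        self.baseline = self._fingerprint()

    def _fingerprint(self) -> str:
        """Hash all fixed-point files in a stable order."""
        h = hashlib.sha256()
        for p in sorted(self.paths):
            h.update(p.read_bytes())
        return h.hexdigest()

    def check(self) -> None:
        """Call before every experiment; halt if any fixed point drifted."""
        if self._fingerprint() != self.baseline:
            raise RuntimeError("fixed point modified; halting loop")
```

The guard itself, of course, must live outside the agent's modification scope — it is one more row in the table, not an exception to it.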
Karpathy's program.md says: "NEVER STOP." The agent runs indefinitely, but it runs the same loop indefinitely. The process is the one thing that doesn't evolve.
This is the deepest lesson. A self-improving system is not one where everything improves. It's one where the right things improve and the right things stay fixed. The art is choosing which is which.
Step 5: Let It Run. Once the loop is built — judge defined, subject bounded, process fixed — you start it and walk away. The human's role shifts from doing the work to designing the loop. You don't improve the agent. You improve the system that improves the agent.
Karpathy's program.md is itself a kind of agent architecture. It doesn't contain any code. It contains instructions for how to do research. The agent follows these instructions to modify code, but the instructions themselves are never modified by the agent.
This creates a two-level system:
- Level 0: The agent improves the subject (tools, strategies, routing)
- Level 1: The human improves the process (evaluation metrics, modification boundaries, loop parameters)
- Level 2 (speculative): A second agent improves Level 1
Level 0 runs at machine speed — 100 experiments overnight. Level 1 runs at human speed — you read the logs, notice the agent is stuck in a local minimum, adjust the instructions, and restart.
Level 2 is where it gets interesting — and dangerous. If you automate the improvement of the improvement process, you need fixed points at Level 2 as well. Turtles all the way down, until you hit a level that's maintained by humans or hardcoded in the infrastructure. There must always be a top-level fixed point. Remove it, and the system has no anchor.
The previous four posts built the engine. This post provides the map.
The engine is: a runtime that supports safe self-modification (hot code swapping, supervision, rollback). The map is: a disciplined loop with three separations (judge, subject, process) and well-chosen fixed points.
You need both. The engine without the map gives you an agent that can modify itself but has no idea whether it's getting better. The map without the engine gives you a beautiful theory that can't be implemented without a restart cycle that loses all context.
Self-improvement is not a capability you bolt on. It's an architecture: the right separations, the right fixed points, and a loop that runs forever.
Continue reading: Loom in Lean: Bootstrapping a Verified Self-Improving Agent