It Rewrote Itself
Five posts ago we made a specific claim: an agent that rewrites one of its own functions, tests the result in a Lab container, and promotes the improvement is worth more than the entire blog series that described it. The MVP post proved the pipeline ran — containers booted, branches propagated, tests executed — but every successful generation was a trivial file operation. Add a comment. Create a text file. Count source files.
We hadn't proven the thing that matters: can the agent change itself?
We pointed the autonomous loop at the codebase and pressed go. Five runs. Three models. Sixteen generations before the first promotion.
Every failure taught us something we couldn't have learned from unit tests.
The pipeline looked correct. It passed 142 tests. It had run 17 stable generations of trivial tasks. Then we asked it to do real work autonomously, and six bugs crawled out of the walls.
GET /status — and sees “done.” It immediately tries to verify by checking out the Lab's branch. But the supervisor hasn't finished git fetch from the container's repo clone yet. git checkout lab/gen-47 fails: “pathspec did not match any file(s) known to git.”self_modify.cljs — the same file the Lab changed. git checkout lab/gen-54 refuses: “Your local changes to the following files would be overwritten.” The pipeline never considered that someone might be editing the codebase while it runs.lab-worker.js to .gitignore. After 50 generations, that's 50 identical lines. Worse: the diff includes .gitignore changes in every generation, confusing the LLM reviewer.program.md to .gitignore so it wouldn't leak into master. But when .gitignore already had the entry and nothing else changed, git commit failed with “nothing to commit” — the Lab's task spec vanished from history./workspace/). Both are normal Lab artifacts, not bugs.git fetch on the Lab's branch. The branch existed in the container's clone but was never copied to the host. Any subsequent verify attempt failed silently.None of these bugs showed up in unit tests. They required the full pipeline running against a real LLM, a real container, and real concurrent git operations. This is why you run the thing.
We tested three models as Lab workers. The results are not subtle.
program.md design — but it can't land the plane.defn- to defn for testability, since ClojureScript doesn't support #' var access for private functions — wrote 18 test assertions across 6 test functions, and committed. The LLM reviewer approved with high confidence. The entire generation cost less than a cup of bad coffee.The lesson: model capability is the binding constraint for autonomous self-modification, not infrastructure. The pipeline is the same for all three models. Only one produces promotable code.
The task was deliberately chosen for safety: add unit tests for two pure helper functions that already existed in self_modify.cljs. No new features, no refactors, no behavioral changes. Just tests.
src/loom/agent/self_modify.cljs
−(defn- parse-test-counts
+(defn parse-test-counts
−(defn- parse-shortstat
+(defn parse-shortstat
test/loom/self_modify_helpers_test.cljs (new file, 63 lines)
+(deftest parse-test-counts-passing ...)
+(deftest parse-test-counts-failing ...)
+(deftest parse-test-counts-no-match ...)
+(deftest parse-shortstat-normal ...)
+(deftest parse-shortstat-insertions-only ...)
+(deftest parse-shortstat-empty ...)
The Lab made two changes to existing code (defn- → defn on two functions) and created a new test file. Both functions are pure — no side effects, no mocking, no async. The ideal first target for autonomous modification: low risk, high verifiability, measurable improvement.
It's not impressive as a code change. It is impressive as a proof. The agent read its own source, understood the function signatures, understood why the functions were private and why that matters in ClojureScript, made the right call to change visibility, and wrote tests that exercise edge cases — passing output, failing output, garbage input, missing fields. An agent reasoned about its own code and improved it.
Gen-72 had to pass two independent checks before promotion.
Stage 1: Tests. Prime checks out lab/gen-72, runs npm test, confirms all 236 tests pass with 607 assertions and zero failures. This catches regressions — the Lab's changes didn't break anything that already worked.
Stage 2: LLM Review. Prime sends the full diff to an LLM: Does this change do what program.md asked? Does it introduce bugs, vulnerabilities, or regressions? The reviewer returns APPROVED or REJECTED with confidence and reasoning. Gen-72: APPROVED, high confidence.
This two-stage approach caught something interesting during earlier runs. Haiku gen-58 passed all tests — zero failures, zero errors. The code change was valid. But the LLM reviewer rejected it because the diff was cluttered with .gitignore artifacts. That was a false negative — the reviewer was too strict — but it demonstrated the value of the second stage. Tests measure correctness. The LLM measures intent.
| Source files | 19 (2,214 LOC) |
| Tests | 90 |
| Generations | 17 |
| Autonomous promotions | 0 |
| Models tested | 1 |
| Source files | 26 (13,503 LOC) |
| Tests / Assertions | 236 / 607 |
| Generations | 72 |
| Autonomous promotions | 1 |
| Models tested | 3 |
The codebase grew 6x since the MVP — mostly test infrastructure, the autonomous loop driver, the reflect step, fitness scoring, and multi-provider support. Every new component was tested before deployment, and then tested again by the autonomous loop breaking it in ways we didn't anticipate.
One discovery from the autonomous runs: you want different models for different roles.
- Prime (reflect, verify, review) runs Sonnet. It needs to be smart but not expensive — it's making judgments, not writing code.
- Lab (autonomous code generation) runs Opus. It needs to be precise. A $1 generation that promotes is cheaper than ten $0.10 generations that don't.
Loom now supports split-provider configuration. Lab inherits from Prime unless overridden, so you can experiment with different Lab models without touching the orchestration layer:
# Prime uses Anthropic Sonnet
ANTHROPIC_API_KEY=sk-ant-...
LOOM_MODEL=claude-sonnet-4-20250514
# Lab uses Opus
LOOM_LAB_MODEL=claude-opus-4-6
# Or use a completely different provider for Labs
# LOOM_LAB_API_KEY=sk-cp-...
# LOOM_LAB_API_BASE=https://api.minimax.io/anthropic
# LOOM_LAB_MODEL=MiniMax-M2.5
The bottleneck has shifted. It's no longer infrastructure — that's battle-tested. It's no longer “does it work” — it does. The questions now are:
Task selection. Gen-72 was safe: pure functions, no side effects, low blast radius. What's the next rung? More test coverage is safe and measurable. Small refactors are riskier but more valuable. New features are the end goal but require precise specs.
Cost efficiency. Opus works but isn't cheap. Can we improve program.md quality enough for cheaper models to succeed on simple tasks? The model gap is the main cost lever.
Fitness gaming. The current fitness function rewards test count. An autonomous agent optimizing this could write (is (= 1 1)) a thousand times and claim improvement. The LLM reviewer is a partial defense, but we haven't tested adversarial scenarios yet.
The compelling demo: “I pointed it at itself and walked away. When I came back, it had improved.” Gen-72 is one step short of this. The loop ran once. We need it to run five times, each building on the last, with the reflect step choosing each task. That's the demo.
What does an autonomous run actually cost? Drag the sliders to estimate.
Building a self-modifying agent is not one problem. It's a sequence of failures that each reveal a different assumption.
We assumed branch fetching was synchronous. It isn't. We assumed .gitignore management was idempotent. It wasn't. We assumed an LLM reviewer would focus on code quality. It fixated on formatting artifacts. We assumed any model could write code. One of them just reads.
Each of these assumptions was invisible in the spec, invisible in unit tests, and immediately obvious in production. The 16 failed generations before gen-72 weren't wasted work. They were the work. The pipeline that promoted gen-72 is fundamentally different from the pipeline that attempted gen-43 — six bugs fewer, three models tested, and a two-stage verification system tuned by real false negatives.
Self-modification isn't a feature you ship. It's a capability you earn by running the loop until the loop survives.
Gen-72 was 56 seconds. Getting there took five days.
- First Light: Loom's Self-Modification Pipeline in 2,214 Lines — The MVP field report: 17 generations, 5 critical bugs, stable pipeline. series
- The program.md Protocol — The steering mechanism: tasks, acceptance criteria, success conditions. series
- The Prime and the Lab — Original architecture spec for the three-component system. series
- The Autoresearch Pattern — Karpathy's autoresearch as blueprint. Three separations, keep/revert, fixed points. series
- Apple Containerization — VM-per-container isolation on macOS 26. containers
- Pi Coding Agent (Mario Zechner) — Radical minimalism: 4 tools, <1000 token prompt. agent design