First Light: Loom's Self-Modification Pipeline in 2,214 Lines

Agent Architecture · March 16, 2026

Part 9 of a series. Previous: The program.md Protocol. See also: The Prime and the Lab, The Autoresearch Pattern.
Three posts ago we had a spec. Two posts ago we had a protocol. Now we have a working system. Loom's MVP is complete: 2,214 lines of ClojureScript, 90 tests, 17 generations run, and a stable pipeline that spawns an autonomous Lab agent in an Apple container, lets it work, verifies the result, and promotes or rolls back. This post is the field report — what we built, what broke, what we learned, and what's still missing.

The Journey: 17 Generations

Building a self-modifying agent isn't a single breakthrough moment. It's a sequence of failures, each one revealing something the spec didn't anticipate. We ran 17 generations. Explore them below.

Generation history: each of the 17 generations ended in one of three states (Done, Timeout, or Failed).

The first twelve generations were a debugging marathon. Gen-1 was the proof of concept — a Lab that added line numbers to the read_file tool — but getting there required manual orchestration and exposed fragile git operations. Gens 2–5 failed outright: missing API keys, broken container networking, build artifacts not copied correctly. Gens 6–9 introduced timeouts as the default failure mode — Labs that booted fine but couldn't complete work before the 5-minute (later 10-minute) deadline.

The turning point came between gen-9 and gen-10. Five critical bugs were identified and fixed in a single hardening session:

  1. API key injection. The Supervisor wasn't forwarding ANTHROPIC_API_KEY into Lab containers. Labs booted, tried to call Claude, and silently hung.
  2. Branch propagation. Lab commits were stranded in the container's cloned repo. The Supervisor needed an explicit git fetch to pull the branch back into the main repo for Prime's verification.
  3. Port allocation. Multiple Labs on the same port caused container startup failures. Fixed with deterministic port mapping: 18400 + gen-num.
  4. Timeout accounting. The hard timeout timer started at spawn time, not container-ready time. Boot overhead ate into work time.
  5. Status detection. Polling /status used a single HTTP request without retry. If the Lab wasn't ready yet, the Supervisor concluded it had failed.
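
Two of those fixes are concrete enough to sketch. The following is illustrative JavaScript, not the actual Supervisor code (which is ClojureScript); `labPort` and `waitForStatus` are hypothetical names standing in for the real functions:

```javascript
// Sketch of hardening fixes 3 and 5 (illustrative; names are hypothetical).
const BASE_PORT = 18400;

// Fix 3: deterministic port mapping, one port per generation, no collisions.
function labPort(genNum) {
  return BASE_PORT + genNum;
}

// Fix 5: poll /status with retries. A refused connection means the Lab
// is still booting, not that it failed.
async function waitForStatus(port, retries = 30, delayMs = 2000) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const res = await fetch(`http://localhost:${port}/status`);
      if (res.ok) return await res.json(); // Lab is up and reporting
    } catch (_) {
      // connection refused: keep waiting
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Lab on port ${port} never became ready`);
}
```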

After hardening, gens 13–17 ran back-to-back with zero failures. Five for five. Average duration: 36 seconds. The pipeline works.


The Generation Cycle

The sequence diagram below shows exactly what happens during a single generation — from task definition through to promote or rollback. Six phases, four actors, one complete loop.

Sequence diagram: the full Loom generation lifecycle, from task definition through Lab spawn, autonomous execution, and completion detection, to verification and promote or rollback.

The key property: the Lab is fully autonomous. It reads program.md, enters an agentic loop (Claude API → tool calls → repeat), commits its work, and reports done. Prime doesn't guide the Lab. It polls /status and waits. When the Lab finishes, Prime pulls the branch and verifies independently: git diff to see what changed, then npm test to confirm nothing broke.

The Supervisor orchestrates infrastructure — cloning repos, creating containers, managing branches and tags, enforcing timeouts — but makes no decisions about the work itself. It's plumbing, not policy.
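
Prime's side of that division of labor reduces to a small decision procedure. A minimal JavaScript sketch (the real implementation is ClojureScript; `spawn`, `pollStatus`, and `verify` are hypothetical stand-ins for the Supervisor's HTTP endpoints):

```javascript
// Illustrative sketch of Prime's side of one generation. Dependencies are
// injected so the decision logic stays plumbing-free.
async function runGeneration(programMd, { spawn, pollStatus, verify }) {
  await spawn(programMd);            // Supervisor clones repo, boots container
  const status = await pollStatus(); // Prime waits; the Lab works unguided
  if (status !== "done") return "rollback"; // timeout or outright failure
  // Independent verification: git diff + npm test, run by Prime, not the Lab.
  const passed = await verify();
  return passed ? "promote" : "rollback";
}
```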


What Gen-1 Actually Did

The first real self-modification was modest by design: add line numbers to the read_file tool output. The program.md spelled it out precisely:

# Task: Add line numbers to read_file tool output

In src/loom/agent/tools.cljs, modify the read-file function.
After reading file content, split into lines and prepend
1-based line numbers. Format: right-aligned number padded
to width of max line number, tab, then line content.

Also update test expectations in test/loom/tools_test.cljs
to match the new output format.

## IMPORTANT
- Do NOT run npm test or npx shadow-cljs.
  They are not available in the container.
- Just make the edits and let the auto-commit handle the rest.

The Lab read the program, opened both files, made the edits, and committed. Prime pulled the branch, ran the test suite, confirmed all 90 tests passed, and promoted. The diff: 12 lines changed across two files. Total time: under 40 seconds.
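
The format the task specifies can be sketched in a few lines. JavaScript for illustration; the actual change lives in src/loom/agent/tools.cljs:

```javascript
// Right-aligned 1-based line numbers, padded to the width of the max
// line number, then a tab, then the line content.
function numberLines(content) {
  const lines = content.split("\n");
  const width = String(lines.length).length;
  return lines
    .map((line, i) => `${String(i + 1).padStart(width)}\t${line}`)
    .join("\n");
}
```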

It's not impressive as a code change. It's impressive as a cycle. An autonomous agent running in a VM wrote code, and a second agent independently verified that code worked before merging it. No human touched any file between spawn and promote.


Architecture Review: An Honest Assessment

After reaching stability, we ran a full architecture review of the 2,214-line codebase. The findings were mixed — strong fundamentals, but real gaps in verification and observability.

The codebase: 19 source files, 13 test files, ~1:1 test-to-source ratio by LOC. Only dependencies: Malli (schemas) and shadow-cljs (compilation). No frameworks.

| Dimension | Status | Gap |
|---|---|---|
| Core loop | Working | |
| Self-modification proof | Partial | Only trivial tasks proven; no agent code modification promoted |
| Test coverage | Strong | No integration test with real containers |
| Dead code | ~7% | Eval client/server compiled but never called (146 LOC) |
| Observability | Basic | No diff visibility, no per-generation metrics |
| Code legibility | Good | supervisor/http.cljs at 307 lines does too many things |

Three findings matter most:

1. Verification is too shallow. verify_generation runs npm test — that's it. No eval probes, no benchmarks, no code quality checks. For trivial file operations this is fine. For actual self-modification — rewriting the agentic loop, changing the context manager — a Lab could introduce a subtle regression that passes all 90 tests. Prime promotes blindly if tests are green.

2. No visibility into what changed. The verification flow didn't include git diff. Prime had no idea what the Lab actually modified — only whether tests passed. This was fixed during hardening (diff stats and truncated diff now included), but the deeper issue remains: Prime has no semantic understanding of changes, only syntactic pass/fail.

3. Three Malli schemas are decorative. Proposal, ProbeResult, and Verdict — declared as "fixed-point contracts" in the README — are defined but never validated. The real contract is POST /spawn with a program_md string and POST /promote with a generation number. Plain JSON, unvalidated. The conceptual framework of "mutable code + immutable contracts" depends on those contracts being load-bearing. Right now they're aspirational.
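
Making a contract load-bearing is mostly wiring: validate the payload at the boundary before acting on it. A minimal sketch, with hand-rolled JavaScript checks standing in for the actual Malli schemas (only `program_md` is a real field from the post; everything else here is illustrative):

```javascript
// Hypothetical boundary validation for POST /spawn. In Loom this would be
// the Proposal schema, validated with Malli on the ClojureScript side.
function validateSpawn(body) {
  const errors = [];
  if (typeof body !== "object" || body === null) {
    errors.push("body must be a JSON object");
  } else if (typeof body.program_md !== "string" || body.program_md.trim() === "") {
    errors.push("program_md must be a non-empty string");
  }
  return { valid: errors.length === 0, errors };
}
```

Rejecting bad payloads at the edge is what turns "decorative" schemas into the immutable contracts the README promises.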


What's Missing

The MVP proves the pipeline works. It doesn't prove recursive self-improvement. Three things are missing:

No generation has modified agent code. Every successful program.md has been a trivial operation — add a comment, create a text file, count source files. Gen-1 (add line numbers to read_file) is the only one that touched agent functionality, and that was manually orchestrated. We haven't demonstrated that a Lab can make a meaningful change to the agent's own behavior and have Prime verify and promote it.

Prime has no agency. It's a chatbot with tools. The user writes program.md, tells Prime to execute it, Prime mechanically spawns/verifies/promotes. There's no decision-making, no prioritization, no self-assessment. The "self" in self-improvement is the user's judgment, not the agent's.

No fitness function. Self-improvement requires measuring improvement. We track timing and outcome (done/failed/timeout) but not: test count, code coverage, token efficiency, tool call counts. Without quantitative metrics across generations, "improvement" is unmeasurable.
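
For concreteness, a fitness function could start as a weighted score over per-generation metrics. A hypothetical sketch; none of these fields exist in Loom's reports yet, and the weights are arbitrary:

```javascript
// Hypothetical fitness score. Every field and weight is invented; the
// post's point is that none of this is recorded today.
function fitness(m) {
  return (
    m.testsPassed          // more passing tests is better
    + 50 * m.coverage      // coverage as a fraction, 0..1
    - m.tokensUsed / 1e5   // penalize token spend per generation
  );
}
```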


Next: Closing the Recursive Loop

The MVP proves the pipeline works (human writes program.md → Lab executes → verify → promote/rollback). The next step is making it genuinely recursive: Prime autonomously proposes and executes its own next improvement.

This requires a reflect step. After every promote or rollback, Prime enters a reflection phase:

  1. Analyze the generation report — what changed, outcome, timing, metrics
  2. Read priorities.md for user-directed goals
  3. Review the current codebase state — test results, architecture gaps, open tasks
  4. Generate the next program.md — the smallest change that most improves the target metric
  5. Spawn a new Lab and continue the loop

But reflect without data is hallucination. Three prerequisites before implementing it:

| Prerequisite | Why | Status |
|---|---|---|
| Enrich generation reports | Reflect needs token counts, diff stats, test pass/fail, duration breakdown | Not implemented |
| Define fitness function | What must improve? Test count, coverage, token efficiency, user-defined? | Not implemented |
| Create priorities.md | User-authored file Prime reads during reflect to know what to work on | Not implemented |

The loop also needs safety valves: a generation cap (LOOM_MAX_GENERATIONS), a token budget, fitness plateau detection (no improvement over N generations), and human interrupt (SIGINT gracefully stops the loop).
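
The cap and plateau valves compose into a simple continuation check. A sketch under the post's assumptions (the reflect loop, `LOOM_MAX_GENERATIONS`, and a per-generation fitness history are all unimplemented; this is a shape, not the design):

```javascript
// Hypothetical safety-valve check: stop at the generation cap, or when
// fitness has plateaued over the last PLATEAU_N generations.
const MAX_GENERATIONS = Number(process.env.LOOM_MAX_GENERATIONS ?? 20);
const PLATEAU_N = 3;

function shouldContinue(fitnessHistory) {
  if (fitnessHistory.length >= MAX_GENERATIONS) return false; // generation cap
  if (fitnessHistory.length > PLATEAU_N) {
    const recent = fitnessHistory.slice(-PLATEAU_N);
    const earlierBest = Math.max(...fitnessHistory.slice(0, -PLATEAU_N));
    // Plateau: nothing in the recent window beat the earlier best.
    if (recent.every((f) => f <= earlierBest)) return false;
  }
  return true;
}
```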

The compelling demo: "I pointed it at itself and walked away. Here's what it improved." We're one milestone away. The pipeline is proven. The reflect step is next.



Continue reading: It Rewrote Itself: Loom's First Autonomous Self-Modification