The program.md Protocol: Steering Self-Improvement with a Contract
program.md: a collaboratively written contract between user and agent that defines what to build, how to test it, and what success looks like. This post describes the refined execution model, what changed from the original spec, and why the hardest problem isn't code modification — it's state continuity across generations.

The previous post described a system where the Prime proposes modifications, tests them in a Lab, and promotes what works. That's the mechanism. But it left a critical question unanswered: who decides what to improve?
In Karpathy's autoresearch, the answer is a research program — a document that defines the experiment, the methodology, and the success criteria. The researcher writes it. The system executes it. We adopt the same pattern.
The original spec cast the Prime as both the brain and the worker. It proposed modifications, spawned Labs to test them, judged the results, and promoted or reverted. The user steered by chatting with the Prime and saying things like "make your read-file tool return line numbers."
This works for small, targeted changes. It fails when improvements require sustained, multi-step effort — building an LLM provider abstraction layer, redesigning the context manager, adding a new tool with tests. These aren't single-prompt tasks. They need a spec.
The refined model separates concerns cleanly:
| Role | Original spec | Refined spec |
|---|---|---|
| User | Chats with Prime, asks for changes | Collaborates with Prime to write program.md |
| Prime | Worker + judge. Proposes, probes, promotes | Orchestrator + judge. Generates program.md, spawns Labs, verifies results |
| Lab | Passive eval server. Runs modified code, answers probes | Full autonomous agent. Receives program.md, writes code, runs tests, reports status |
| Supervisor | Container lifecycle + version management | Container lifecycle + git operations + timeout enforcement |
The key shift: Labs are now full agents. They run the complete agentic loop with LLM access. They don't just evaluate forms — they write code, run tests, and iterate autonomously. The Prime doesn't tell them how to implement the change. The program.md tells them what to achieve. The Lab figures out the how.
The program.md is the sole directive for a Lab run. It defines three things:
- Task spec — what to build or improve
- Acceptance criteria — how to know it's done
- Success conditions — what Prime should verify before promoting
The user doesn't write it from scratch. The user describes what they want in conversation with Prime. Prime drafts the program.md, presents it for review, refines it based on feedback, and only spawns the Lab once the user approves. This is the control surface.
The analogy: program.md is to the Lab what a pull request description is to a code reviewer — it states the intent, the scope, and the definition of done. Except the "reviewer" is the Prime, and it doesn't just read the code. It runs the tests independently.
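Concretely, a program.md for the provider-abstraction task mentioned above might read like this sketch. The file paths and thresholds are illustrative, not from the actual system; only the three-part structure is from the design:

```markdown
# Program: LLM provider abstraction layer

## Task
Extract the provider-specific calls into a protocol so that new LLM
backends can be added without touching the agent loop.

## Acceptance criteria
- All providers in test/loom/llm_provider_test.cljs pass
- /status returns valid JSON with :ready true

## Success conditions (Prime verifies before promoting)
- Prime re-runs the test suite independently
- Round-trip latency stays under 200ms
```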
Acceptance criteria are heterogeneous. Different tasks need different verification:
| Criteria type | Example | Who checks |
|---|---|---|
| Tests | "All providers in test/loom/llm_provider_test.cljs pass" | Lab runs them, Prime re-runs independently |
| Eval probes | "/status returns valid JSON with :ready true" | Prime sends probes to Lab's eval server |
| Benchmarks | "Round-trip latency under 200ms" | Prime measures after Lab reports done |
Prime evaluates with a trust-but-verify strategy. The Lab runs its own tests as part of its work; Prime independently re-runs them and adds its own probes. Both must pass for promotion.
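The promotion gate can be sketched as a small decision function. This is a sketch: the function name and the status values are assumptions, not the real protocol; the point is that the Lab's self-report alone never promotes.

```shell
#!/bin/sh
# Sketch of Prime's trust-but-verify gate (names are illustrative).
# The Lab's claim is necessary but not sufficient: Prime's independent
# test re-run AND its own eval probes must also succeed.

verify_for_promotion() {
  lab_status=$1        # what the Lab reported via /status
  prime_tests_ok=$2    # exit status of Prime's independent test run
  prime_probes_ok=$3   # exit status of Prime's eval probes

  if [ "$lab_status" = "done" ] \
     && [ "$prime_tests_ok" -eq 0 ] \
     && [ "$prime_probes_ok" -eq 0 ]; then
    echo promote
  else
    echo reject
  fi
}

verify_for_promotion done 0 0   # prints "promote"
verify_for_promotion done 1 0   # Lab says done, Prime's re-run failed: "reject"
```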
The full cycle runs in eight steps, from conversation to promotion.
The critical property: boot is the "go" signal. No coordination protocol between Prime and Lab. The Supervisor creates the container with the repo clone and program.md already in place. When the container starts, the Lab agent reads program.md and begins working. Prime polls /status until the Lab reports done or the 5-minute timeout fires.
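The polling side can be sketched in shell. The 5-second interval and the injectable status command are assumptions; only the 5-minute hard timeout is from the design. In the real system the status command would be something like `curl -s http://lab-gen-1/status`.

```shell
#!/bin/sh
# Sketch of Prime's polling loop. $1 is any command that prints the
# Lab's current status; injecting it keeps the sketch testable.
poll_until_done() {
  status_cmd=$1
  deadline=$(( $(date +%s) + 300 ))        # 5-minute hard timeout
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if [ "$($status_cmd)" = "done" ]; then
      echo done
      return 0
    fi
    sleep 5                                # assumed poll interval
  done
  echo timeout
  return 1
}
```

Usage: `poll_until_done 'curl -s http://lab-gen-1/status'` — it returns 0 and prints `done` on success, or prints `timeout` and returns 1 when the deadline passes.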
Apple Containerization on macOS 26 provides vmnet-backed virtual networking. Each container gets a dedicated IP. DNS resolution is built-in.
```shell
# Supervisor creates a shared network once
container network create loom-net

# Every container attaches to it
container run --name prime --network loom-net ...
container run --name lab-gen-1 --network loom-net ...

# Prime reaches Lab by name — no IP discovery
curl http://lab-gen-1/status
```
No port mapping. No IP discovery. No service registry. The container name is the address. This eliminates an entire class of coordination complexity that would otherwise dominate the Supervisor's code.
Every generation is a branch. Every promotion is a merge and a tag. The Supervisor handles git operations as part of the container lifecycle.
```
main ————*—————————*—————————*————>
         |         |         |
    tag: gen-0  tag: gen-1  tag: gen-2
                   |         |
               lab/gen-1  lab/gen-2
               (merged)   (merged)

lab/gen-3 (failed, discarded)     lab/gen-4 (in progress)
```
main always tracks the known-good Prime code. On promote, the Supervisor merges the Lab's branch, tags it, and deletes the branch. On rollback, the branch is discarded. Tags are permanent — you can roll back to any previous generation, not just N-1.
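The promote step reduces to a handful of git operations. A sketch against a throwaway repo — branch and tag names follow this post's conventions, error handling is omitted, and `git init -b` assumes git 2.28+:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Stand-in for the Prime repo at gen-0
git init -q -b main repo && cd repo
git config user.email "supervisor@example.invalid"
git config user.name  "Supervisor"
git commit -q --allow-empty -m "gen-0"
git tag gen-0

# Lab works on its own branch
git checkout -q -b lab/gen-1
git commit -q --allow-empty -m "implement the program.md task"

# Promote: merge into main, tag the generation, delete the branch
git checkout -q main
git merge -q --no-ff lab/gen-1 -m "promote gen-1"
git tag gen-1
git branch -q -d lab/gen-1

git tag   # gen-0 and gen-1 remain; the branch is gone
```

On rollback the merge and tag simply never happen and the branch is deleted unmerged; the tags are what make any generation, not just N-1, recoverable.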
The Supervisor also maintains generations.edn — a ledger of every generation with metadata:
```clojure
[{:generation 1
  :parent 0
  :branch "lab/gen-1"
  :program-md-hash "a3f8c2..."
  :outcome :promoted
  :created "2026-03-13T14:30:00Z"
  :completed "2026-03-13T14:34:22Z"
  :container-id "lab-gen-1"}
 {:generation 2
  :parent 1
  :branch "lab/gen-2"
  :program-md-hash "7b2e41..."
  :outcome :timeout
  :created "2026-03-13T15:01:00Z"
  :completed "2026-03-13T15:06:00Z"
  :container-id "lab-gen-2"}]
```
This file is versioned in git. The full lineage of the system — what was tried, what worked, what failed — is recoverable from the repo history alone.
Code promotion is straightforward — merge a branch. But a running agent has state: conversation history, task context, accumulated knowledge about the codebase. When Gen N promotes to Gen N+1, the new Prime starts from a fresh process. How does it know what Gen N knew?
Three sub-problems:
- Serialize. Before promotion, snapshot the outgoing Prime's state as EDN.
- Load. When the new Prime boots, hydrate from the snapshot.
- Evolve. If the new Prime changed its own state schema, the old snapshot may not fit.
Problem 3 is where it gets interesting. If Gen N stores conversation history as a flat vector and Gen N+1 restructures it into a tree, the snapshot is useless. Full schema migration — transforming old data into new shapes — is a known-hard problem even in traditional databases. In a self-modifying system, the schema itself is part of the mutable surface.
v0 approach: state as EDN with a version tag. If the schema doesn't match, start fresh. The git tag preserves the old state for inspection — nothing is lost, but continuity may be interrupted. Proper schema migration is deferred until the state format stabilizes. This is a deliberate tradeoff: ship the cycle first, solve continuity second.
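A snapshot under this scheme might be shaped like the following EDN. The keys are illustrative; only the version-tag-plus-fresh-start policy is from the design:

```clojure
;; state-snapshot.edn — hypothetical shape
{:schema-version 1          ; bumped whenever the Prime changes its own state schema
 :generation 2
 :conversation [{:role :user :content "..."}
                {:role :assistant :content "..."}]
 :notes {"context manager" "owns the token budget"}}

;; On boot the new Prime compares :schema-version with the version it
;; expects. Match: hydrate from the snapshot. Mismatch: start fresh —
;; the git tag keeps the old snapshot inspectable.
```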
A side-by-side comparison of the two architectures, original and refined.
Prime is the brain and the worker. It proposes modifications by generating modified .cljs files. The Lab is a passive eval server — ~50 lines of code that accepts forms over TCP and returns results. Prime probes the Lab by sending test forms. The Supervisor manages containers and a versions/ directory. User chats with Prime to request changes.
```
User → "improve read-file" → Prime generates patch
     → Supervisor boots Lab with patched code
     → Prime sends eval probes to Lab
     → Prime decides promote/revert
```
Prime is the orchestrator. It collaborates with the user to produce program.md, spawns Labs, and independently verifies results. The Lab is a full autonomous agent with LLM access — it reads program.md and implements the feature end-to-end. The Supervisor manages containers, git branches/tags, generations.edn, and enforces a 5-minute timeout.
```
User + Prime → write program.md
     → Supervisor boots Lab with full repo + program.md
     → Lab works autonomously (writes code, runs tests)
     → Prime polls /status, then verifies independently
     → Promote: merge branch, tag, restart Prime
```
- Lab capability: passive eval server → full autonomous agent with LLM access
- Steering: ad-hoc chat prompts → collaboratively refined program.md
- Version tracking: versions/ directory → git branches + tags + generations.edn
- Acceptance: Prime probes only → trust but verify (Lab self-tests + Prime re-checks)
- Timeout: none → 5-minute hard limit enforced by Supervisor
- Networking: unspecified → loom-net with DNS resolution by container name
- Rollback: N-1 only (versions dir) → any generation via git tags
- State: not addressed → EDN snapshots with version tag, fresh-start fallback
- Failure: not addressed → retry with same program.md (v0), auto-refine (planned)
The eval server from the original spec survives. Prime still uses it for eval probes — it's one of the acceptance criteria types. But the Lab is no longer just an eval server. It's an agent that happens to also expose an eval server for external verification.
The implementation plan has six phases. Phase 1a (self-hosted ClojureScript eval) is done — 9 tests, 17 assertions passing. Next up:
- Container image. Boot ClojureScript in an Apple container. Measure startup time.
- Eval server + client. TCP round-trip with Malli-validated EDN.
- Supervisor. Container lifecycle, git operations, loom-net networking, generations.edn.
- Agent loop. Claude API client, five tools, the main loop.
- The program.md cycle. User → Prime → program.md → Lab → verify → promote.
- First self-modification. The proof of concept.
Each phase is a falsifiable experiment. If container boot time exceeds 10 seconds, switch to a pre-compiled build. If Apple Containerization is too unstable, fall back to UTM. If the agent can't complete a full cycle, debug each stage independently.
The refined bet: the hardest part isn't getting the agent to modify code. It's getting the cycle right — the handoff from user intent to program.md to autonomous Lab to verified promotion. Get the cycle right and the modifications follow.
- The Prime and the Lab — Original architecture spec for the three-component system. This post refines the execution model described there.
- The Autoresearch Pattern — Karpathy's autoresearch as a blueprint for self-improvement: three separations, keep/revert, fixed points. The program.md pattern derives from this.
- Apple Containerization — VM-per-container isolation on macOS 26. Custom networks via container network create with built-in DNS resolution.
- Malli — Data-driven schema library. Schemas as EDN data — the fixed-point contracts between Prime, Lab, and Supervisor.
- Pi Coding Agent (Mario Zechner) — Radical minimalism: 4 tools, <1000-token prompt. Our Lab agent inherits this philosophy.
Continue reading: First Light: Loom's Self-Modification Pipeline in 2,214 Lines