The program.md Protocol: Steering Self-Improvement with a Contract
program.md: a collaboratively written contract between user and agent that defines what to build, how to test it, and what success looks like. This post describes the refined execution model, what changed from the original spec, and why the hardest problem isn't code modification — it's state continuity across generations.

The previous post described a system where the Prime proposes modifications, tests them in a Lab, and promotes what works. That's the mechanism. But it left a critical question unanswered: who decides what to improve?
In Karpathy's autoresearch, the answer is a research program — a document that defines the experiment, the methodology, and the success criteria. The researcher writes it. The system executes it. We adopt the same pattern.
The original spec cast the Prime as both the brain and the worker. It proposed modifications, spawned Labs to test them, judged the results, and promoted or reverted. The user steered by chatting with the Prime and saying things like "make your read-file tool return line numbers."
This works for small, targeted changes. It fails when improvements require sustained, multi-step effort — building an LLM provider abstraction layer, redesigning the context manager, adding a new tool with tests. These aren't single-prompt tasks. They need a spec.
The refined model separates concerns cleanly:
| Role | Original spec | Refined spec |
|---|---|---|
| User | Chats with Prime, asks for changes | Collaborates with Prime to write program.md |
| Prime | Worker + judge. Proposes, probes, promotes | Orchestrator + judge. Generates program.md, spawns Labs, verifies results |
| Lab | Passive eval server. Runs modified code, answers probes | Full autonomous agent. Receives program.md, writes code, runs tests, reports status |
| Supervisor | Container lifecycle + version management | Container lifecycle + git operations + timeout enforcement |
The key shift: Labs are now full agents. They run the complete agentic loop with LLM access. They don't just evaluate forms — they write code, run tests, and iterate autonomously. The Prime doesn't tell them how to implement the change. The program.md tells them what to achieve. The Lab figures out the how.
The program.md is the sole directive for a Lab run. It defines three things:
- Task spec — what to build or improve
- Acceptance criteria — how to know it's done
- Success conditions — what Prime should verify before promoting
The user doesn't write it from scratch. The user describes what they want in conversation with Prime. Prime drafts the program.md, presents it for review, refines it based on feedback, and only spawns the Lab once the user approves. This is the control surface.
The analogy: program.md is to the Lab what a pull request description is to a code reviewer — it states the intent, the scope, and the definition of done. Except the "reviewer" is the Prime, and it doesn't just read the code. It runs the tests independently.
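Concretely, a program.md for the provider-abstraction task mentioned above might read like this sketch. The file paths and thresholds are illustrative, not from the actual system; only the three-part structure is from the design:

```markdown
# Program: LLM provider abstraction layer

## Task
Extract the provider-specific calls into a protocol so that new LLM
backends can be added without touching the agent loop.

## Acceptance criteria
- All providers in test/loom/llm_provider_test.cljs pass
- /status returns valid JSON with :ready true

## Success conditions (Prime verifies before promoting)
- Prime re-runs the test suite independently
- Round-trip latency stays under 200ms
```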
Acceptance criteria are heterogeneous. Different tasks need different verification:
| Criteria type | Example | Who checks |
|---|---|---|
| Tests | "All providers in test/loom/llm_provider_test.cljs pass" | Lab runs them, Prime re-runs independently |
| Eval probes | "/status returns valid JSON with :ready true" | Prime sends probes to Lab's eval server |
| Benchmarks | "Round-trip latency under 200ms" | Prime measures after Lab reports done |
Prime evaluates with a trust-but-verify strategy. The Lab runs its own tests as part of its work; Prime independently re-runs them and adds its own probes. Both must pass for promotion.
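The promotion gate can be sketched as a small decision function. This is a sketch: the function name and the status values are assumptions, not the real protocol; the point is that the Lab's self-report alone never promotes.

```shell
#!/bin/sh
# Sketch of Prime's trust-but-verify gate (names are illustrative).
# The Lab's claim is necessary but not sufficient: Prime's independent
# test re-run AND its own eval probes must also succeed.

verify_for_promotion() {
  lab_status=$1        # what the Lab reported via /status
  prime_tests_ok=$2    # exit status of Prime's independent test run
  prime_probes_ok=$3   # exit status of Prime's eval probes

  if [ "$lab_status" = "done" ] \
     && [ "$prime_tests_ok" -eq 0 ] \
     && [ "$prime_probes_ok" -eq 0 ]; then
    echo promote
  else
    echo reject
  fi
}

verify_for_promotion done 0 0   # prints "promote"
verify_for_promotion done 1 0   # Lab says done, Prime's re-run failed: "reject"
```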
The full cycle runs in eight steps, from conversation to promotion.
The critical property: boot is the "go" signal. No coordination protocol between Prime and Lab. The Supervisor creates the container with the repo clone and program.md already in place. When the container starts, the Lab agent reads program.md and begins working. Prime polls /status until the Lab reports done or the 5-minute timeout fires.
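The polling side can be sketched in shell. The 5-second interval and the injectable status command are assumptions; only the 5-minute hard timeout is from the design. In the real system the status command would be something like `curl -s http://lab-gen-1/status`.

```shell
#!/bin/sh
# Sketch of Prime's polling loop. $1 is any command that prints the
# Lab's current status; injecting it keeps the sketch testable.
poll_until_done() {
  status_cmd=$1
  deadline=$(( $(date +%s) + 300 ))        # 5-minute hard timeout
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if [ "$($status_cmd)" = "done" ]; then
      echo done
      return 0
    fi
    sleep 5                                # assumed poll interval
  done
  echo timeout
  return 1
}
```

Usage: `poll_until_done 'curl -s http://lab-gen-1/status'` — it returns 0 and prints `done` on success, or prints `timeout` and returns 1 when the deadline passes.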
Apple Containerization on macOS 26 provides vmnet-backed virtual networking. Each container gets a dedicated IP. DNS resolution is built-in.
```shell
# Supervisor creates a shared network once
container network create loom-net

# Every container attaches to it
container run --name prime --network loom-net ...
container run --name lab-gen-1 --network loom-net ...

# Prime reaches Lab by name — no IP discovery
curl http://lab-gen-1/status
```
No port mapping. No IP discovery. No service registry. The container name is the address. This eliminates an entire class of coordination complexity that would otherwise dominate the Supervisor's code.
Every generation is a branch. Every promotion is a merge and a tag. The Supervisor handles git operations as part of the container lifecycle.
```
main ————*—————————*—————————*————>
         |         |         |
    tag: gen-0  tag: gen-1  tag: gen-2
                   |         |
               lab/gen-1  lab/gen-2
               (merged)   (merged)

lab/gen-3 (failed, discarded)     lab/gen-4 (in progress)
```
main always tracks the known-good Prime code. On promote, the Supervisor merges the Lab's branch, tags it, and deletes the branch. On rollback, the branch is discarded. Tags are permanent — you can roll back to any previous generation, not just N-1.
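The promote step reduces to a handful of git operations. A sketch against a throwaway repo — branch and tag names follow this post's conventions, error handling is omitted, and `git init -b` assumes git 2.28+:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Stand-in for the Prime repo at gen-0
git init -q -b main repo && cd repo
git config user.email "supervisor@example.invalid"
git config user.name  "Supervisor"
git commit -q --allow-empty -m "gen-0"
git tag gen-0

# Lab works on its own branch
git checkout -q -b lab/gen-1
git commit -q --allow-empty -m "implement the program.md task"

# Promote: merge into main, tag the generation, delete the branch
git checkout -q main
git merge -q --no-ff lab/gen-1 -m "promote gen-1"
git tag gen-1
git branch -q -d lab/gen-1

git tag   # gen-0 and gen-1 remain; the branch is gone
```

On rollback the merge and tag simply never happen and the branch is deleted unmerged; the tags are what make any generation, not just N-1, recoverable.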
The Supervisor also maintains generations.edn — a ledger of every generation with metadata:
```clojure
[{:generation 1
  :parent 0
  :branch "lab/gen-1"
  :program-md-hash "a3f8c2..."
  :outcome :promoted
  :created "2026-03-13T14:30:00Z"
  :completed "2026-03-13T14:34:22Z"
  :container-id "lab-gen-1"}
 {:generation 2
  :parent 1
  :branch "lab/gen-2"
  :program-md-hash "7b2e41..."
  :outcome :timeout
  :created "2026-03-13T15:01:00Z"
  :completed "2026-03-13T15:06:00Z"
  :container-id "lab-gen-2"}]
```
This file is versioned in git. The full lineage of the system — what was tried, what worked, what failed — is recoverable from the repo history alone.
Code promotion is straightforward — merge a branch. But a running agent has state: conversation history, task context, accumulated knowledge about the codebase. When Gen N promotes to Gen N+1, the new Prime starts from a fresh process. How does it know what Gen N knew?
Three sub-problems:
- Serialize. Before promotion, snapshot the outgoing Prime's state as EDN.
- Load. When the new Prime boots, hydrate from the snapshot.
- Evolve. If the new Prime changed its own state schema, the old snapshot may not fit.
Problem 3 is where it gets interesting. If Gen N stores conversation history as a flat vector and Gen N+1 restructures it into a tree, the snapshot is useless. Full schema migration — transforming old data into new shapes — is a known-hard problem even in traditional databases. In a self-modifying system, the schema itself is part of the mutable surface.
v0 approach: state as EDN with a version tag. If the schema doesn't match, start fresh. The git tag preserves the old state for inspection — nothing is lost, but continuity may be interrupted. Proper schema migration is deferred until the state format stabilizes. This is a deliberate tradeoff: ship the cycle first, solve continuity second.
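A snapshot under this scheme might be shaped like the following EDN. The keys are illustrative; only the version-tag-plus-fresh-start policy is from the design:

```clojure
;; state-snapshot.edn — hypothetical shape
{:schema-version 1          ; bumped whenever the Prime changes its own state schema
 :generation 2
 :conversation [{:role :user :content "..."}
                {:role :assistant :content "..."}]
 :notes {"context manager" "owns the token budget"}}

;; On boot the new Prime compares :schema-version with the version it
;; expects. Match: hydrate from the snapshot. Mismatch: start fresh —
;; the git tag keeps the old snapshot inspectable.
```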
A side-by-side comparison of the two architectures, original and refined.
Prime is the brain and the worker. It proposes modifications by generating modified .cljs files. The Lab is a passive eval server — ~50 lines of code that accepts forms over TCP and returns results. Prime probes the Lab by sending test forms. The Supervisor manages containers and a versions/ directory. User chats with Prime to request changes.
```
User → "improve read-file" → Prime generates patch
     → Supervisor boots Lab with patched code
     → Prime sends eval probes to Lab
     → Prime decides promote/revert
```
Prime is the orchestrator. It collaborates with the user to produce program.md, spawns Labs, and independently verifies results. The Lab is a full autonomous agent with LLM access — it reads program.md and implements the feature end-to-end. The Supervisor manages containers, git branches/tags, generations.edn, and enforces a 5-minute timeout.
```
User + Prime → write program.md
     → Supervisor boots Lab with full repo + program.md
     → Lab works autonomously (writes code, runs tests)
     → Prime polls /status, then verifies independently
     → Promote: merge branch, tag, restart Prime
```
- Lab capability: passive eval server → full autonomous agent with LLM access
- Steering: ad-hoc chat prompts → collaboratively refined program.md
- Version tracking: versions/ directory → git branches + tags + generations.edn
- Acceptance: Prime probes only → trust but verify (Lab self-tests + Prime re-checks)
- Timeout: none → 5-minute hard limit enforced by Supervisor
- Networking: unspecified → loom-net with DNS resolution by container name
- Rollback: N-1 only (versions dir) → any generation via git tags
- State: not addressed → EDN snapshots with version tag, fresh-start fallback
- Failure: not addressed → retry with same program.md (v0), auto-refine (planned)
The eval server from the original spec survives. Prime still uses it for eval probes — it's one of the acceptance criteria types. But the Lab is no longer just an eval server. It's an agent that happens to also expose an eval server for external verification.
The implementation plan has six phases. Phase 1a (self-hosted ClojureScript eval) is done — 9 tests, 17 assertions passing. Next up:
- Container image. Boot ClojureScript in an Apple container. Measure startup time.
- Eval server + client. TCP round-trip with Malli-validated EDN.
- Supervisor. Container lifecycle, git operations, loom-net networking, generations.edn.
- Agent loop. Claude API client, five tools, the main loop.
- The program.md cycle. User → Prime → program.md → Lab → verify → promote.
- First self-modification. The proof of concept.
Each phase is a falsifiable experiment. If container boot time exceeds 10 seconds, switch to a pre-compiled build. If Apple Containerization is too unstable, fall back to UTM. If the agent can't complete a full cycle, debug each stage independently.
The refined bet: the hardest part isn't getting the agent to modify code. It's getting the cycle right — the handoff from user intent to program.md to autonomous Lab to verified promotion. Get the cycle right and the modifications follow.
- The Prime and the Lab — Original architecture spec for the three-component system. This post refines the execution model described there.
- The Autoresearch Pattern — Karpathy's autoresearch as a blueprint for self-improvement: three separations, keep/revert, fixed points. The program.md pattern derives from this.
- Apple Containerization — VM-per-container isolation on macOS 26. Custom networks via container network create with built-in DNS resolution.
- Malli — Data-driven schema library. Schemas as EDN data — the fixed-point contracts between Prime, Lab, and Supervisor.
- Pi Coding Agent (Mario Zechner) — Radical minimalism: 4 tools, <1000-token prompt. Our Lab agent inherits this philosophy.
Continue reading: First Light: Loom's Self-Modification Pipeline in 2,214 Lines