CAMBRIAN: The Loop Closes
The March 18 plan: write a spec, hand it to an LLM, watch it generate a working agent, measure costs, iterate. Simple.
What we underestimated: how much precise environmental knowledge has to be encoded in the spec before an LLM can reliably produce code that actually runs. Not code that looks correct. Code that compiles, passes tests, and survives a mechanical verification pipeline in a Python 3.14 Docker container.
Every generation from Gen-2 through Gen-7 died on the same class of bug before we even got the loop running correctly. The LLM wrote test strings containing unescaped newlines inside single-quoted string literals — legal in Python ≤3.11, a SyntaxError in 3.14. The spec said “Python 3.14 compatible.” That's not enough. The spec needs to say what Python 3.14 enforces that earlier versions didn't.
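The failure class is mechanically checkable before an artifact ever reaches the test rig. A minimal sketch (the generated source strings below are illustrative, not actual Gen-2 output):

```python
# A string literal whose newline is a real newline character in the
# generated source, versus one spelled with the \n escape sequence.
bad = "x = 'line one\nline two'"    # literal newline inside the quotes
good = "x = 'line one\\nline two'"  # escaped newline in the source

def compiles(src: str) -> bool:
    """Return True if src compiles cleanly under the running interpreter."""
    try:
        compile(src, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(bad))   # False: SyntaxError on the unescaped newline
print(compiles(good))  # True
```

Running a `compile()` pass like this over every generated file catches the whole class, whatever the target interpreter version happens to enforce.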
This is the lesson Phase 1 taught: the spec is not a description of the agent; it's a description of the environment the agent must survive in. The architecture takes three paragraphs. The environmental constraints take three pages.
-
Mar 21
Language choice: Python. Not because it's ideal — but because LLM coding accuracy correlates with training data density. Python benchmarks at ~93% on standard coding problems; Rust at ~85%; Elixir at ~70%. For M1, the right language is the one the LLM gets right.
-
Mar 24
Infrastructure validated. Supervisor, Test Rig, Docker base image. Gen-0, a hand-crafted test artifact, validated the pipeline before any LLM involvement. Verify the environment before testing the organism. If the environment is broken, every organism will fail and the failures will be misleading.
-
Mar 24–26
Gen-1 required three iterations to pass the test rig. Bugs found: the /stats generation field must return the artifact's own generation number (set from an env var), not the next generation. Tests run before the server starts, so they need an in-process test client. The Docker socket on macOS is not at /var/run/docker.sock when using Docker Desktop; aiodocker won't follow the symlink.
-
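The first of those fixes fits in a few lines. A sketch, assuming the env var name and handler shape (both hypothetical, not the actual artifact's code): the launcher injects the generation number, and the artifact just echoes it back.

```python
import os

def stats_handler() -> dict:
    # The artifact reports its OWN generation, injected at launch time,
    # not the generation it is about to spawn. "GENERATION" is a
    # hypothetical variable name used for illustration.
    return {"generation": int(os.environ.get("GENERATION", "0"))}
```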
Mar 26
Gen-1 promoted. All 5 pipeline stages passing, all 3 contracts passing.
-
Mar 27 am
Gen-2 through Gen-7 failed with Python 3.14 syntax errors. Also found: the parser itself was broken. The original parse_files() used a dotall regex that stopped at the first </file> tag, which is also what the LLM generates inside test fixture strings. Fixed by changing the closing delimiter to </file:end> and replacing the regex with a state machine.
-
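A minimal sketch of the state-machine approach (the opening-tag grammar here is an assumption; only the </file:end> delimiter is from the log). Scanning line by line, a </file> buried inside a test fixture string can no longer terminate the block early:

```python
import re

# Assumed tag format: <file path="..."> ... </file:end>
OPEN = re.compile(r'^<file path="(?P<path>[^"]+)">\s*$')
CLOSE = "</file:end>"

def parse_files(text: str) -> dict[str, str]:
    """Collect {path: contents} from a tagged multi-file response."""
    files: dict[str, str] = {}
    path: str | None = None  # None = outside any file block
    buf: list[str] = []
    for line in text.splitlines():
        if path is None:
            m = OPEN.match(line)
            if m:  # enter a file block
                path, buf = m.group("path"), []
        elif line.strip() == CLOSE:  # exit the block
            files[path] = "\n".join(buf) + "\n"
            path = None
        else:  # body line; a </file> here is just content
            buf.append(line)
    return files
```

The remaining edge case is a fixture that contains </file:end> alone on a line, which is why the delimiter was chosen to be something the LLM has no reason to emit inside a string.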
Mar 27 pm
Spec updated to v0.11.0. The loop ran.
At 12:14:35, Gen-1 started inside a Docker container with the artifacts root mounted at /workspace. Walk through what happened:
The loop mechanics are correct. Every step worked as designed:
- Gen-1 read generation history from the Supervisor and computed the right generation number
- The LLM call used streaming (required for max_tokens=32768)
- File parsing handled the 12 files without truncation
- The manifest was built correctly with the right hash algorithm
- The Supervisor created the branch, ran the test rig, wrote the viability report
- Gen-1 polled the Supervisor, received the result, and made the correct promote/rollback decision
- The retry prompt carried the full failure diagnostics to the next attempt
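The poll-then-decide step at the end of that list can be sketched as follows; the report shape and the decision rule are illustrative assumptions, not the project's actual API:

```python
import time
from typing import Callable, Optional

def poll_and_decide(
    fetch_report: Callable[[], Optional[dict]],
    interval_s: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Poll the Supervisor for a viability report, then decide."""
    for _ in range(max_polls):
        report = fetch_report()  # None until the test rig has finished
        if report is not None:
            # A viable offspring is promoted; anything else rolls back,
            # and the retry prompt carries the diagnostics forward.
            return "promote" if report.get("status") == "viable" else "rollback"
        time.sleep(interval_s)
    return "rollback"  # no verdict in time: treat as a failed generation
```

Injecting fetch_report as a callable keeps the decision logic testable without a live Supervisor.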
Gen-2's failures are fixable. They're exactly the kind of thing a retry prompt handles: specific test failures with stack traces, deprecated API usage, TypeErrors with clear messages. If Opus hadn't been overloaded, there would have been a Gen-3 attempt within seconds.
The test rig is a verifier, not an execution environment. It starts Prime, checks /health and contracts, then kills it. The generation loop never runs in this path.
To actually reproduce, Gen-1 runs as a long-running process mounted against the artifacts root — not the gen-1 subdirectory, but the root, so it can write gen-2/ in the right place for the Supervisor to find. The test rig container is ephemeral; the organism is persistent.
Test rig vs. organism. These are separate concerns. The test rig verifies an artifact exists and passes contracts. The organism runs against the full artifacts tree and spawns offspring. We confused these for two days. It seems obvious in retrospect.
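The distinction shows up concretely in how the two containers are mounted. An illustrative sketch in Docker Engine API shape (HostConfig.Binds); the image tags and host paths are assumptions, and only the /workspace mount point is from the log:

```python
ARTIFACTS_ROOT = "/srv/cambrian/artifacts"  # hypothetical host path

# Test rig: ephemeral, sees only the single generation under test.
rig_config = {
    "Image": "cambrian-base:py3.14",  # hypothetical image tag
    "HostConfig": {"Binds": [f"{ARTIFACTS_ROOT}/gen-1:/workspace"]},
}

# Organism: persistent, sees the whole artifacts root so it can write
# gen-2/ where the Supervisor will look for it.
organism_config = {
    "Image": "cambrian-base:py3.14",
    "HostConfig": {"Binds": [f"{ARTIFACTS_ROOT}:/workspace"]},
}
```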
The March 18 post put “first income” at Phase 2, estimated at $150–350 seed capital, starting after a week of Phase 1.
Revised view: Phase 1 isn't complete until self-reproduction is reliable — not just once, but consistently, across retries, with the retry prompt actually fixing the failures. Gen-2's failure was fixable. We need to see a generation loop produce a viable offspring before calling Phase 1 done.
The good news: the first obstacle (getting the loop to run at all) is cleared. The remaining work is convergence: does the retry prompt, with full failure diagnostics, guide the LLM to a passing generation? We think yes. The failures are specific and mechanical. The next run will use Sonnet for all retries, and Gen-2's failures will be in the retry context.
Phase 2 — economic viability, prediction markets, paying the bills — is still the goal. We just have more respect now for how much “reliable self-reproduction” actually entails.
Previous: CAMBRIAN: What If the Spec Is the Organism?
Next: CAMBRIAN: It Reproduces