Loom in Lean: Bootstrapping a Verified Self-Improving Agent
The previous post established the autoresearch pattern: three separations (judge, subject, process), a keep/revert loop, and well-chosen fixed points. That's the map. The Loom posts proposed an engine on the BEAM — hot code swapping, supervision trees, process isolation.
This post asks: what if the engine is Lean?
We spent three posts arguing for Elixir. The argument was sound: the BEAM gives you hot code swapping, supervision trees, process isolation, and automatic rollback. These are real properties that TypeScript doesn't have.
But the autoresearch pattern exposed a gap. The BEAM catches crashes. A supervisor can restart a process that throws an exception. What it cannot do is catch a modification that runs without error but produces subtly wrong results — a routing change that slightly degrades response quality, a memory strategy that silently drops important context. These pass the supervisor's test (no crash) while degrading the agent.
The autoresearch pattern requires a judge that's unambiguous. The BEAM's judge is "did it crash?" That's necessary but not sufficient. We need a judge that can answer "does this modification satisfy the specification?" — not after running, but before.
The Elixir work wasn't wasted. It taught us which runtime properties matter: isolation, rollback, hot loading. These remain requirements for the operational scaffold. But the judge must be stronger than "it didn't crash." That's what led us to Lean.
Lean's type checker is a different kind of judge. It's not a process you can kill or a file you can overwrite. It's the rules of the language itself. A modification that violates the specification doesn't fail at runtime or get caught in review — it doesn't compile. The separation between judge and subject is enforced by mathematics, not architecture.
When an agent proposes a change to its own code, three questions arise:
- Is it safe? Does it preserve the invariants the system depends on?
- Is it correct? Does it satisfy the specification?
- Is it better? Does it improve the agent's utility?
Lean answers the first two at compile time. The third requires empirical measurement. But having the first two proven means the improvement loop only searches over modifications that are already guaranteed safe and correct. The search space shrinks dramatically.
The judge is the type checker. Following Part 5's lesson that a singular, unambiguous judge is what makes the autoresearch loop converge: the judge is Lean's type system. Binary — it compiles or it doesn't. Self-observation (log analysis, error rates) and user feedback (weekly reviews) are signals that inform what to try next. They help the agent decide which modifications to propose. But the judge that decides whether a modification is accepted is the type checker. One judge, multiple signal sources.
Lean 4 was bootstrapped this way: first written in C++, then rewritten in Lean, then compiled by itself. We propose a similar arc for Loom — but with an important caveat.
Lean's self-hosting bootstrap was a clean rewrite. The Lean team controlled the entire specification and implementation. Our bootstrap has a fundamentally different shape: we're growing a Lean core inside OpenClaw, a system we don't control, whose interfaces change between releases, and whose agent loop is 57K lines of TypeScript with undocumented edge cases. This isn't a clean rewrite. It's an incremental organ transplant into a moving patient. The risks are categorically different.
With that honest framing:
Each phase transition depends on the previous one actually working. If the memory organ fails, Phases 2–3 don't happen. Earn the next phase by shipping the current one.
Which organ first? Memory. It's OpenClaw's main bottleneck.
The LLMs have no memory of their own. Context is a sliding window. When it fills up, OpenClaw triggers a "memory flush" — a panicked silent turn where the LLM dumps facts to markdown files before they're lost. Retrieval is semantic search over chunks in SQLite. There are no consistency guarantees. Nothing prevents contradictions. Nothing tracks staleness.
We don't need to build a search engine in a theorem prover. QMD — a local-first memory engine by Tobi Lütke — already solves the hard retrieval problem: hybrid search (BM25 + vector + LLM reranking), chunking, contextual retrieval, and an MCP server for agentic integration. What QMD doesn't do is enforce semantic invariants on what goes in and what comes out. That's where Lean fits.
QMD is the muscle. Lean is the skeleton. The Lean scaffold sits in front of QMD as a typed API layer, classifying incoming facts, enforcing invariants, and exposing a verified interface to the agent. QMD handles embedding, chunking, and retrieval. Lean handles correctness.
Four kinds of memory, four sets of invariants:
QMD covers the first two well out of the box — timestamped documents, append-only collections, and hybrid retrieval are its sweet spot. Learned procedures need version tracking that QMD doesn't provide internally. World state needs staleness bounds that QMD doesn't enforce. The Lean scaffold fills these gaps: it maintains version chains for procedures (with QMD storing each version as a document), rejects mutations of episodic records (allowing only superseding corrections), and attaches staleness policies to world-state facts.
In Lean, the invariants aren't comments or runtime assertions. They're part of the type signature. A function that stores a world-state fact without a staleness bound won't compile. A function that mutates an episodic memory won't compile. The guarantees are structural.
The interface: the Lean scaffold exposes three tools (memory_store, memory_query, memory_reflect) as an MCP server. The agent talks to the Lean scaffold; the Lean scaffold talks to QMD. The agent commits facts to verified storage continuously — so when context compaction happens, nothing important is lost.
QMD eliminates the biggest concern from the original proposal — we're not building search infrastructure in a theorem prover. But real challenges remain:
The write-path question. The Lean scaffold can sit in front of QMD, enforcing invariants at the API layer. But if anything bypasses the scaffold and writes directly to QMD's SQLite store, the Lean types are circumvented. Do we trust the scaffold as the only write path (simpler, realistic today)? Or does Lean need to own the storage layer, using QMD only for indexing and retrieval (stronger guarantees, more plumbing)? This is a meaningful architectural decision that determines how much the type system actually protects.
FFI surface. The interface between Lean and QMD is small — maybe five or six functions via QMD's CLI, MCP server, or direct SQLite access. That's a tractable boundary. But Lean 4's FFI story for calling external processes or HTTP endpoints is still maturing. The cleanest path is probably Lean shelling out to qmd as a subprocess, which is simple but means the scaffold can't enforce invariants at the storage layer.
Compile times. The autoresearch loop needs fast iteration — modify, compile, evaluate, keep/revert. If Lean's type checker takes minutes to verify a modification, the improvement loop slows to a crawl. Compile time is a hard constraint on how fast the agent can self-improve.
The moving target. OpenClaw's gateway protocol isn't versioned or formally specified. It changes between releases. The sidecar must track a moving API surface, and each OpenClaw update might break the bridge. This is the practical cost of growing inside a host you don't control.
None of these are fatal. But they're the actual engineering challenges, and ignoring them would make this proposal dishonest.
Every post in this series has been forward-looking. Here's where it might fall apart:
The next move isn't another post. It's the memory service.
- Specify the memory types in Lean. Define the four memory kinds, their invariants, and the store/query/reflect interface. The core types —
MemoryEntry,MemoryKind,StalenessPolicy,VersionChain— plus the store/query/reflect functions and their proofs. Maybe 1,000–2,000 lines. This is where we discover whether the invariants are too rigid for real conversation. - Build the scaffold. A Lean binary that wraps QMD's MCP server, enforces the four memory-type invariants, and exposes the three-tool interface as its own MCP server. The agent talks to the Lean scaffold; the scaffold talks to QMD. This is where we discover whether the FFI boundary is tractable.
- Run it. Replace OpenClaw's memory with the Lean+QMD stack on a real instance. Measure: fewer contradictions, better retrieval, no panic flushes, no visible latency overhead. This is where we discover whether verified memory is practically better, not just provably correct.
- Start the improvement loop. The agent observes its own memory performance, proposes modifications to its memory strategies (not the invariants — those are fixed), and keeps what works. This is where we discover whether the autoresearch pattern works for agent self-improvement.
Each step is a falsifiable experiment. If step 1 reveals that the invariants don't fit real conversation, we revise the types. If step 2 reveals that the Lean-QMD boundary adds too much overhead, we simplify the scaffold. If step 3 shows no practical improvement, we stop. A prototype is realistic in one to two weeks.
One working organ is worth more than this entire series. Ship it, measure it, keep or revert.
Update (March 12, 2026). We've since installed ByteRover as OpenClaw's memory backend. ByteRover provides structured long-term memory — a hierarchical context tree, automatic memory curation, daily knowledge mining, and context enrichment — out of the box, without a theorem prover. It solves the memory bottleneck this post proposed to address with Lean+QMD, and it's already running. The Lean-as-scaffold approach to memory is abandoned. We explore a different avenue in The Prime and the Lab.