The Proof Is the Program

Essay · January 2026 · Medium

In most AI applications, failure looks like a plausible-sounding wrong answer. In Lean, failure looks like a compiler error.

That difference changes everything.

Optimized for plausibility

The defining problem with large language models is hallucination: confident outputs that aren't true. Ask for a citation and you might get a paper that doesn't exist. Ask for a proof and you might get an argument that reads well but contains a subtle error. This isn't a bug to be patched. It's the natural result of optimizing for plausibility. Truth is, at best, correlated with plausibility.

But Lean belongs to a class of environments where hallucination isn't discouraged or reduced — it's rejected on contact. The model can still propose nonsense. The system just can't accept it. And working in such environments reveals something important about what AI reasoning could become when it's forced to be grounded.

The oracle that cannot be fooled

Lean is a proof assistant. You don't write arguments that sound correct; you write proofs that are correct, or the compiler rejects them. Every logical step either follows from what came before, or it doesn't. No middle ground exists. No "it seems to work." No "it passes most tests."

This isn't like type-checking in ordinary programming. Conventional compilers catch certain categories of errors, but plenty of bugs slip through to runtime. Lean's type checker is different: if your proof compiles cleanly, the proposition you claimed is mathematically true. The checker is the final arbiter.
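Concretely, here is a minimal Lean 4 sketch of that contract (Nat.add_comm ships with Lean's standard library; the theorem names are mine, invented for illustration):

```lean
-- If this compiles, the proposition is true: the checker is the arbiter.
theorem addComm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A plausible-looking but invalid proof is rejected. Uncommenting this
-- line fails the build with a type error:
-- theorem offByOne (a : Nat) : a + 1 = a := Nat.add_comm a 1
```

There is no partial credit: the term either inhabits the claimed type or the build fails.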

For AI systems, this distinction is profound. A language model cannot bluff its way through a Lean proof — the system either accepts valid logic or rejects it. This constraint forces something that looks less like "generation" and more like genuine reasoning.

The agentic loop, perfected

Rust already demonstrates the pattern: its strict compiler makes it an unusually good substrate for agentic coding. Structured feedback narrows the search space. Each iteration is informative. The AI converges rather than thrashes.

Lean takes this dynamic to its logical extreme.

In Rust, compilation means safety — freedom from memory errors and data races. In Lean, compilation means the proposition is proven. The compiler isn't catching bugs. It's catching falsehoods.

The practical results reflect this. In July 2025, AI systems from both DeepMind and OpenAI demonstrated gold-medal-level performance on the International Mathematical Olympiad (IMO), each solving 5 of 6 problems within contest time limits. In recent months, GPT-5.2 generated an original proof for Erdős Problem #728, later formalized in Lean and reviewed publicly by Fields Medalist Terence Tao. Tao, who tracks these contributions, describes the result as applying existing methods to an unstudied problem — not a profound breakthrough, but a genuine contribution. On the miniF2F benchmark — formalized Olympiad problems — recent systems hit 95.9% through iterative repair: propose a proof, receive feedback, adjust, repeat.

These results aren't magic. They're the predictable outcome of a perfect oracle. Unambiguous feedback makes iteration work. Checkable steps make search converge. AI isn't getting smarter at math. It's operating in a domain where outputs are forced to be grounded or to visibly fail.

Holes as structure

Here's where something genuinely new emerges.

Lean has a keyword: sorry. It's a placeholder meaning "I claim this is provable, but I haven't proven it yet." The compiler accepts it provisionally, letting you continue building the rest of your argument. Later, you return and discharge each sorry with an actual proof.
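In Lean 4, that looks like this (a toy claim, chosen only for illustration):

```lean
-- `sorry` closes any goal provisionally. The file still compiles, with a
-- warning that the declaration uses sorry.
theorem claimed (n : Nat) : n ≤ n + 1 := by
  sorry
```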

This sounds like a minor convenience. It's actually a profound workflow innovation.

A complex proof doesn't have to be solved linearly. You can structure the argument first — "I'll prove A by proving B and C, and B by proving D and E" — inserting sorry at each leaf. Now you have an explicit task graph. Progress becomes countable: twelve sorrys reduced to eight. "Done" means zero sorrys and a clean build. No ambiguity remains about what's left.
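A sketch of that shape in Lean 4 (the arithmetic claim and names are hypothetical, picked to keep the skeleton small):

```lean
-- Structure first, proofs later: each sorry is a typed, countable subgoal.
theorem goalA (n : Nat) : n * 2 = n + n := by
  have hB : n * 2 = 2 * n := by sorry  -- leaf subgoal: commute the product
  have hC : 2 * n = n + n := by sorry  -- leaf subgoal: unfold the doubling
  exact hB.trans hC                    -- the skeleton already type-checks
```

At each sorry, Lean displays the exact goal and the hypotheses in scope (for hB: n : Nat ⊢ n * 2 = 2 * n), so every open leaf carries its own specification.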

For AI agents, this structure is extraordinarily useful. The model can decompose hard problems into subproblems, attack them independently, and track progress precisely. Each sorry defines a goal with a clear success criterion. The type signature specifies exactly what needs proving. The context provides exactly what's available to work with.

Programming languages have stubs too — todo!() in Rust, pass in Python — but they compile and then explode at runtime. sorry compiles and leaves a precisely typed hole: a named debt with an exact specification of what's owed. You can track it, decompose it, and discharge it. The difference isn't that Lean allows partial programs. It's that Lean turns incompleteness into structured, delegatable subgoals rather than deferred runtime failures.

What failure teaches

The deepest insight isn't about success. It's about what happens when the AI can't discharge a sorry.

A failed proof attempt has two possible meanings. Either the claim is true but the AI hasn't found the right approach — a proof gap. Or the claim as stated isn't actually provable — a spec mismatch.

The second case is where things get interesting.

When an AI repeatedly hits a wall — "I have ≤ but need <", "this lemma requires a hypothesis I don't have" — the pattern often signals that the specification itself is wrong. The formal system isn't just verifying the AI's reasoning. It's stress-testing the human's assumptions.
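A minimal illustration of the ≤-versus-< wall (an invented example, not a transcript):

```lean
-- As stated, the claim is unprovable: take a = b and the conclusion fails.
-- example (a b : Nat) (h : a ≤ b) : a < b := h   -- rejected by the checker
-- The wall points at the spec. Strengthening the hypothesis (or weakening
-- the conclusion) yields a claim that goes through:
example (a b : Nat) (h : a < b) : a ≤ b := Nat.le_of_lt h
```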

This turns the AI into a discovery tool. The model explores possible proofs at superhuman speed, and its failures reveal hidden constraints. "I can't prove this" becomes valuable information — about the problem's difficulty, or about whether the problem was correctly posed in the first place.

Ernest Ryu, a mathematician at UCLA, recently used GPT-5 to solve a 40-year-old open problem in optimization theory. The AI didn't hand him a complete proof. It generated a stream of approaches — most wrong, some interesting. When promising ideas hit walls, those walls were informative. Ryu's intuition recognized which failures revealed spec issues and which revealed proof gaps. Twelve hours of focused collaboration cracked a problem that might have taken months solo.

The human sets direction and judges which paths matter. The AI explores at inhuman speed and reports what it finds — including what doesn't work and why. The formal system keeps both parties honest.

The downstream implications

This matters beyond pure mathematics.

The same infrastructure that verifies an IMO solution can verify that a trading algorithm won't breach risk limits, that a control system can't enter unsafe states, that a smart contract does what it claims. Formal verification has always promised such guarantees. The bottleneck was the labor of writing proofs.

If AI can write the proofs — and formal systems can check them — the bottleneck dissolves.

Martin Kleppmann put it directly: AI is poised to make formal verification "vastly cheaper," precisely as AI-generated code makes verification more necessary than ever. Mathematical theorem-proving isn't just an impressive demo. It's a capability that flows downstream to any domain where certainty matters more than probability.

The infrastructure builders seem to agree. OpenAI rewrote Codex's core in Rust, calling it an "agentic harness." Neural theorem provers run on verified code. When you need the scaffolding to be trustworthy, you reach for languages with guarantees. Lean represents a logical endpoint: not just a language for building reliable systems, but a domain where reliability is structurally enforced. You can't ship a hallucinated proof. The domain rejects it.

The shape of grounded reasoning

What's emerging isn't "AI that's good at math." It's a template for what AI reasoning looks like when shortcuts are impossible.

The ingredients: a perfect oracle checking every claim instantly. A decomposition structure turning complex problems into trackable subgoals. A workflow where failure informs rather than frustrates. A collaboration mode where human judgment and AI exploration reinforce each other.

These ingredients don't require Lean specifically. They require some system that verifies claims cheaply and unambiguously. Mathematics is the purest such domain. But anywhere you can build a reliable oracle — formal verification, constrained simulations, domains with strong logical structure — similar dynamics should emerge.

The AI optimized for plausibility produces plausible outputs. The AI constrained by proof produces proven outputs. The constraint shapes the cognition.

Or perhaps it doesn't shape cognition at all. A skeptic might say this is just infinite monkeys at infinite typewriters, with a perfect editor who recognizes valid proofs. The AI generates; the oracle filters. That's search, not thought.

The objection is fair, and probably unresolvable here. But the proof doesn't care. It exists as a valid logical object regardless of whether its producer "understood" anything. Lean is indifferent to the soul of its contributors. For mathematics, that indifference might be exactly the point.

Lean doesn't make AI smarter. It makes a certain kind of failure impossible.

And in doing so, it shows us what machine reasoning looks like when the machine can't fake it: structured, decomposed, verifiable — and forced to either ground itself in truth or fail loudly enough that everyone learns something.

The proof is the program. If it compiles, it's true.

If it doesn't, you just discovered something you didn't know you needed to know.