Can Opus 4.6 do Category Theory in Lean?

I have a little category theory library I've been dragging around for about a decade now. It started life in Haskell, got ported to Agda, briefly lived in Idris, spent some time in Coq, and has now landed in Lean 4. I call it Kitty Cats. The idea is simple: take the definitions and theorems from Awodey's Category Theory and the relevant chapters of Mac Lane's Categories for the Working Mathematician, and translate them directly into type-checkable code. Categories, functors, natural transformations, products, limits, adjunctions, monads, the Yoneda lemma. No libraries, no imports, nothing beyond what you define yourself. About 900 lines of Lean when all is said and done. It is the kind of project that fits entirely in your head, which is precisely the point. Every time a new dependently typed language or proof assistant comes along, I reach for this scaffold and see how it feels. It's my litmus test, my canonical exercise, my desert island formalization. And it's increasingly a good benchmark for how good language models have become at abstract formal "reasoning" (or maybe just at approximating it, but I'll leave that question to the philosophers!).

So when Anthropic dropped the new Opus 4.6 model, I wanted to know: can it actually write proper Lean 4? Not toy examples. Not theorem two_plus_two : 2 + 2 = 4 := rfl. Real proofs with real proof obligations, the kind where you need to wrangle naturality squares and coherence conditions and the elaborator is fighting you every step of the way. The beautiful thing about a proof assistant is that the answer is ground-verifiable. You don't have to squint at the output and wonder if it's subtly wrong. Either lake build passes or it doesn't. The kernel doesn't care about your feelings.

The short answer is: yes, mostly. But the longer answer is more interesting.

We (the royal "we": me plus the model) built the whole library together over the course of two 30-minute sessions. The basic hierarchy went up fast. Categories, functors, natural transformations, the opposite category, morphism classes. These are mostly structure declarations with straightforward proof obligations: the category axioms \(\mathrm{id} \circ f = f\), \(f \circ \mathrm{id} = f\), and \((f \circ g) \circ h = f \circ (g \circ h)\); the functor laws \(F(\mathrm{id}) = \mathrm{id}\) and \(F(f \circ g) = F(f) \circ F(g)\). Opus handled these without breaking a sweat, and honestly, so would most competent language models at this point. The proofs are one or two tactics, often just rfl or by simp. The real question was always going to be what happens when the proofs get hard.
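
To give a sense of scale, here's roughly what that base layer looks like. This is my illustrative sketch rather than the library's literal source (field names may differ), with composition written in diagrammatic order, so comp f g means f first, then g; that's also the convention the displayed equations later in this post follow:

```lean
universe u v

-- A category, bundled: objects, hom-types, identities, composition
-- (diagrammatic order: `comp f g` is "f, then g"), and the three axioms.
structure Category where
  Obj : Type u
  Hom : Obj → Obj → Type v
  id : (a : Obj) → Hom a a
  comp : {a b c : Obj} → Hom a b → Hom b c → Hom a c
  id_comp : ∀ {a b : Obj} (f : Hom a b), comp (id a) f = f
  comp_id : ∀ {a b : Obj} (f : Hom a b), comp f (id b) = f
  assoc : ∀ {a b c d : Obj} (f : Hom a b) (g : Hom b c) (h : Hom c d),
    comp (comp f g) h = comp f (comp g h)

-- A functor: an object map and a morphism map satisfying the two laws.
structure CFunctor (C D : Category) where
  obj : C.Obj → D.Obj
  map : {a b : C.Obj} → C.Hom a b → D.Hom (obj a) (obj b)
  map_id : ∀ (a : C.Obj), map (C.id a) = D.id (obj a)
  map_comp : ∀ {a b c : C.Obj} (f : C.Hom a b) (g : C.Hom b c),
    map (C.comp f g) = D.comp (map f) (map g)
```

CFunctor, at least, is a name the library really uses (it shows up verbatim later in this post); everything else hangs off these two structures.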

Things got interesting around monoidal categories and the endofunctor construction. To prove that the category of endofunctors is monoidal under composition, you need to establish the pentagon and triangle coherence conditions. The pentagon axiom states that the two ways of reassociating a four-fold tensor product agree:

$$
(\alpha_{F,G,H} \otimes \mathrm{id}_K) \circ \alpha_{F, G \circ H, K} \circ (\mathrm{id}_F \otimes \alpha_{G,H,K}) = \alpha_{F \circ G, H, K} \circ \alpha_{F, G, H \circ K}
$$

where \(\alpha\) is the associator natural isomorphism. The triangle axiom relates the associator to the left and right unitors \(\lambda\) and \(\rho\):

$$
\alpha_{F, \mathrm{Id}, G} \circ (\mathrm{id}_F \otimes \lambda_G) = \rho_F \otimes \mathrm{id}_G
$$

These are precisely the kind of thing that's annoying to formalize because the goals involve deeply nested natural transformations composed horizontally and vertically, and the elaborator's notion of what things are "definitionally equal" doesn't always match your intuition. Opus could get the structure right, it could set up the show blocks that unfold the monoidal category projections into their concrete NatTrans representations, and it could figure out that ext a; simp [...] with the right unfolding lemmas would close the goals. But the proofs it produced were verbose. Walls of nested parentheses. The kind of thing that's technically correct but makes your eyes glaze over.
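
To see why these goals balloon, it helps to look at what's underneath. Continuing my sketch from above (NatTrans and NatTrans.vcomp are real names from the library, quoted later in this post, but the fields and proofs here are my reconstruction): a natural transformation carries a component at every object plus a naturality square, and even plain vertical composition has to slide one square past the other with associativity rewrites. Horizontal composition, the tensor here, is hairier still, and the pentagon stacks four layers of it.

```lean
-- A natural transformation: a component at each object, plus naturality
-- (still diagrammatic order).
structure NatTrans {C D : Category} (F G : CFunctor C D) where
  component : (a : C.Obj) → D.Hom (F.obj a) (G.obj a)
  naturality : ∀ {a b : C.Obj} (f : C.Hom a b),
    D.comp (component a) (G.map f) = D.comp (F.map f) (component b)

-- The identity natural transformation on a functor.
def NatTrans.ident {C D : Category} (F : CFunctor C D) : NatTrans F F where
  component a := D.id (F.obj a)
  naturality f := by rw [D.id_comp, D.comp_id]

-- Vertical composition: componentwise composition, with naturality
-- proved by sliding one square past the other.
def NatTrans.vcomp {C D : Category} {F G H : CFunctor C D}
    (η : NatTrans F G) (θ : NatTrans G H) : NatTrans F H where
  component a := D.comp (η.component a) (θ.component a)
  naturality f := by
    rw [D.assoc, θ.naturality, ← D.assoc, η.naturality, D.assoc]
```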

The crown jewel of the library is monadIsMonoidObj: the proof that a monad is a monoid in the category of endofunctors. It's a joke that every functional programmer has heard (it's even a meme at this point), but actually formalizing it requires you to bridge two levels of abstraction. A monad \((T, \eta, \mu)\) on a category \(\mathcal{C}\) is an endofunctor \(T\) equipped with natural transformations \(\eta : \mathrm{Id} \Rightarrow T\) (unit) and \(\mu : T^2 \Rightarrow T\) (multiplication) satisfying associativity and unit laws:

$$
T(\mu_a) \circ \mu_a = \mu_{T(a)} \circ \mu_a
\qquad
T(\eta_a) \circ \mu_a = \mathrm{id}_{T(a)}
\qquad
\eta_{T(a)} \circ \mu_a = \mathrm{id}_{T(a)}
$$

These are component-wise equations, one for each object \(a\) of \(\mathcal{C}\). To show that \((T, \mu, \eta)\) forms a monoid object in the monoidal category \((\mathrm{End}(\mathcal{C}), \circ, \mathrm{Id})\), you need to lift them to equalities of natural transformations in the endofunctor category, which is itself equipped with the monoidal structure you just built. Each field of the MonoidObj structure requires a show block that manually unfolds the monoidal category projections (because ext can't see through the Hom type alias to find the NatTrans it needs), followed by ext; endo_simp to reduce to components, followed by exact M.mul_assoc a or whichever monad axiom applies. Opus got the shape of this right but needed significant guidance on factoring. The raw output had the show blocks inlined directly in the instance declaration, three or four lines of deeply nested NatTrans.vcomp (endoTensorHom M.mu (NatTrans.ident M.T)) M.mu = ... that, while correct, would terrify anyone reading the code for the first time.
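
For concreteness, here's roughly the monad side of that bridge, continuing my sketch, with \(\eta\) and \(\mu\) packaged as natural transformations to match the M.mu and M.T projections above. CFunctor.compose and mul_assoc are names the post's proofs actually invoke; CMonad, CFunctor.identity, and the unit-law names are my inventions (Monad itself is taken by Lean's core library):

```lean
-- Functor composition: needed just to state μ : T ∘ T ⇒ T.
def CFunctor.compose {C D E : Category} (F : CFunctor C D) (G : CFunctor D E) :
    CFunctor C E where
  obj a := G.obj (F.obj a)
  map f := G.map (F.map f)
  map_id a := by simp [F.map_id, G.map_id]
  map_comp f g := by simp [F.map_comp, G.map_comp]

-- The identity functor, the tensor unit in End(C).
def CFunctor.identity (C : Category) : CFunctor C C where
  obj a := a
  map f := f
  map_id _ := rfl
  map_comp _ _ := rfl

-- Monad data on C, with the three component-wise laws displayed above.
structure CMonad (C : Category) where
  T : CFunctor C C
  eta : NatTrans (CFunctor.identity C) T
  mu : NatTrans (CFunctor.compose T T) T
  mul_assoc : ∀ a, C.comp (T.map (mu.component a)) (mu.component a)
    = C.comp (mu.component (T.obj a)) (mu.component a)
  left_unit : ∀ a, C.comp (T.map (eta.component a)) (mu.component a)
    = C.id (T.obj a)
  right_unit : ∀ a, C.comp (eta.component (T.obj a)) (mu.component a)
    = C.id (T.obj a)
```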

This is where the model hit a few bumps and needed some help. We refactored the proofs. We introduced endo_ext_eq, a small helper that lifts component-wise equalities to NatTrans equalities, so you could write endo_ext_eq fun a => by endo_simp; exact M.mul_assoc a instead of the show; ext a; endo_simp; exact dance. We extracted the monoidal coherence obligations into standalone lemmas (endoPentagon, endoTriangle, endoTensor_id, endoTensor_comp) so the instance declaration became pure assignment. We used let bindings in theorem signatures to name the five different associator instances in the pentagon axiom, turning a wall of (endoAssociator F (CFunctor.compose G H) K).hom into a legible alpha_F_GH_K. The monadIsMonoidObj definition went from 22 lines of inline proof to 7 lines of named lemma applications. This kind of refactoring is, I think, the thing that humans are still decisively better at: knowing what a reader needs to see, knowing which subexpressions deserve names, knowing when a proof is correct but not yet good.
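
For the curious, here's a plausible reconstruction of that helper over my sketch. The real endo_ext_eq (and the endo_simp tactic it pairs with) will differ in detail, but the content is just function extensionality plus proof irrelevance:

```lean
-- Lift a component-wise equality of endofunctor transformations to an
-- equality of the transformations themselves. The naturality fields are
-- proofs of a Prop, so they agree by definitional proof irrelevance.
theorem endo_ext_eq {C : Category} {F G : CFunctor C C} {η θ : NatTrans F G}
    (h : ∀ a, η.component a = θ.component a) : η = θ := by
  cases η with
  | mk c₁ n₁ =>
    cases θ with
    | mk c₂ n₂ =>
      have hc : c₁ = c₂ := funext h
      subst hc
      rfl
```

With that in hand, each field of the monoid-object instance collapses to the kind of one-liner quoted above.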

The Yoneda lemma was a pleasant surprise. The statement is that for any functor \(F : \mathcal{C}^{\mathrm{op}} \to \mathbf{Set}\) and any object \(a\) in \(\mathcal{C}\), there is a natural bijection:

$$
\mathrm{Nat}(\mathrm{Hom}(-, a),\; F) \;\cong\; F(a)
$$

The forward direction (evaluating a natural transformation at \(\mathrm{id}_a\) recovers the determining element) is almost trivial once you have map_id. The backward direction (rebuilding the transformation from its value at \(\mathrm{id}_a\)) requires extracting a naturality equation, simplifying it, and applying it symmetrically. Opus navigated this correctly, including the slightly tricky congrFun applications needed when your hom-sets are function types in the Type category. The fully faithful embedding (yoneda_map_eval and yoneda_eval_map) fell out naturally. I was honestly a little charmed.
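
To make that concrete over my running sketch: here's the Type-valued category, the opposite category, the contravariant hom functor, and the cheap direction of the bijection (all names in this block are my reconstruction, not the library's):

```lean
-- The category of types: hom-sets are ordinary functions.
def SetCat : Category where
  Obj := Type v
  Hom a b := a → b
  id _ := fun x => x
  comp f g := fun x => g (f x)
  id_comp _ := rfl
  comp_id _ := rfl
  assoc _ _ _ := rfl

-- The opposite category: same objects, arrows reversed.
def Category.op (C : Category) : Category where
  Obj := C.Obj
  Hom a b := C.Hom b a
  id a := C.id a
  comp f g := C.comp g f
  id_comp f := C.comp_id f
  comp_id f := C.id_comp f
  assoc f g h := (C.assoc h g f).symm

-- The contravariant hom functor Hom(-, a), landing in SetCat.
def homFunctor (C : Category) (a : C.Obj) :
    CFunctor (Category.op C) SetCat where
  obj b := C.Hom b a
  map f := fun g => C.comp f g
  map_id b := funext fun g => C.id_comp g
  map_comp f g := funext fun h => C.assoc g f h

-- The forward direction of Yoneda: evaluate at the identity.
def yonedaFwd {C : Category} {F : CFunctor (Category.op C) SetCat}
    {a : C.Obj} (η : NatTrans (homFunctor C a) F) : F.obj a :=
  η.component a (C.id a)
```

The backward direction sends \(x : F(a)\) to the transformation with components fun g => F.map g x, whose naturality is an instance of map_comp; proving the two maps mutually inverse is where the naturality extraction and congrFun shuffling come in.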

What made this work as well as it did was the tooling. Two open source projects deserve credit here. The lean-lsp-mcp server by Oliver Dressler exposes the Lean 4 language server as an MCP interface, giving Claude direct access to goal states, diagnostics, completions, and hover information. The lean4-skills plugin by Cameron Freer builds on top of this with higher-level proving commands, cycle engines, and premise search integrations. Together they give the model a genuine feedback loop. It can call lean_goal to inspect the proof state at any line, call lean_diagnostic_messages to see if the file has errors, call lean_multi_attempt to try several tactic sequences and see which ones make progress, and call lean_run_code to test a complete snippet against the kernel without modifying the working file. This changes the dynamics fundamentally. Instead of generating Lean code and hoping it compiles, the model operates in a tight loop: write a proof, check the goal state, see the error, adjust. It's the difference between painting blindfolded and painting with your eyes open. The model still makes mistakes, it still tries tactics that don't apply and writes rw chains that don't match, but it can detect these failures immediately and recover. Six months ago this kind of integration didn't exist. The model would have been generating Lean into a void, with no way to know if the elaborator was happy or screaming.

The search tools matter too. lean_leansearch, lean_loogle, and lean_leanfinder let the model query Mathlib's lemma database by natural language, by type signature, and by semantic similarity. We didn't use Mathlib in this project (the whole point is to build from scratch), but the ability to verify that a lemma name exists before trying to use it, or to find the right name for a theorem you know should exist, is invaluable. lean_state_search and lean_hammer_premise go even further: given a proof state, they suggest lemmas that might close the goal. This is premise selection, the same technique that powers the strongest automated theorem provers, exposed as an API call. We're not quite at the point where you can say "prove the pentagon axiom" and walk away. But we're remarkably close to something that can handle a lot of nontrivial theorem proving with appropriate guidance.

If I'm being honest about the limitations: Opus struggles with the kind of reasoning that requires holding a complex proof state in working memory and planning several steps ahead. It can close goals that yield to a single tactic or a short chain, but multi-step rewrites, where you need to introduce a naturality equation, simplify it in a particular way, and then apply it at a specific position in a long composition chain, still require hand-holding. It sometimes tries simp when it needs rw, or applies a lemma at the wrong position, or forgets that a show block is necessary to make ext fire. These are not deep failures of understanding; they're more like the mistakes a graduate student makes in their first semester with a proof assistant. The model knows what it wants to prove, it roughly knows the strategy, but the bureaucratic details of getting the elaborator to agree take iteration.

One thing worth noting: in informal experiments, turning the reasoning budget up did increase convergence time (more tokens spent deliberating before committing to a tactic) but didn't obviously improve the quality of the final output. The model would think longer and still reach for the same simp call. This is a hard thing to measure empirically because the process is nondeterministic and highly sensitive to prompting, so take it with a grain of salt. But my working hypothesis is that for tactic-level theorem proving, the bottleneck is less about "thinking harder" and more about having the right tool calls in the loop. The language server feedback matters more than the internal chain of thought.

The thing is, it's gotten dramatically better in the last six months. I've been running this same exercise periodically, and the trajectory is striking. A year ago, language models could barely write syntactically valid Lean. Six months ago, they could handle simple tactic proofs but fell apart on anything involving universe polymorphism or typeclass resolution. Today, with MCP integration and language server feedback, we built 900 lines of formalized category theory in two sessions, including proofs that many math graduate students would find challenging. It's not hard to project where this is going.

And maybe that's the exciting part. Not that AI can prove theorems today (it can, sort of, with help), but that the boilerplate is evaporating. The tedious parts of formalization, the coherence conditions, the naturality lemmas, the simp configurations, these are exactly the kind of structured, verifiable, mechanically checkable work that language models are getting good at. When this layer becomes trivial, we get to spend our time on the parts that actually matter: choosing the right abstractions, seeing the connections between structures, deciding what's worth formalizing in the first place. The proof assistant becomes less of a bureaucratic obstacle and more of a genuine thinking tool. We get to build higher.

I've been carrying this little category theory library around for ten years, porting it from language to language, and every time, the experience tells me something about the state of the art. This time what it told me is that we're living in the future, and it's weirder and more interesting than I expected.