Optimal Caverna Gameplay via Formal Methods

You know what's better than gloating after winning a board game? Winning every possible board game. And you know what's better than that? Having a machine-checked proof that you win every possible board game.

Caverna: The Cave Farmers is Uwe Rosenberg's 2013 sequel to Agricola, a game about feeding dwarfs who live in caves and do a suspicious amount of farming. You place workers, gather resources, breed animals, excavate caverns, furnish rooms, and at the end of 12 rounds the scoring formula totals up everything you've accomplished and everything you failed to accomplish. It's a good game. It's also, in the 2-player variant, a finite deterministic perfect-information system with discrete phases, which means it's a labeled transition system, which means it's amenable to formal verification. So I did that.

The project is about 3,000 lines of Lean 4 spread across 19 modules: 11 definition files modeling the complete game (all 23 action spaces, all 46 unique furnishing tiles, the expedition loot system, board geometry, harvest schedule, scoring formula) and 8 theorem files containing 176 machine-checked proofs. The model covers all 2,880 possible 2-player game setups (144 card orderings times 20 harvest marker placements). The main result is that furnishing rush is the weakly dominant strategy. It is the optimal response to every opponent, in every setup, regardless of which cards come out when or where the harvest markers land. I'll explain what all of that means, but first, some context on why you'd do this horrible analysis to a perfectly nice board game.

A labeled transition system (LTS) is a triple \((S, A, T)\) where \(S\) is a set of states, (A) is a set of actions, and \(T \subseteq S \times A \times S\) is a transition relation specifying which state changes are legal. You start in some initial state satisfying an init predicate. The system evolves by taking actions: if \((s, a, s')\) is in the transition relation, you can move from state \(s\) to state \(s'\) by performing action \(a\). A state is reachable if there's a finite chain of transitions from an initial state to it. A property is an invariant if it holds on every reachable state. The beautiful thing: once you define init and trans, you can prove a property holds on all reachable states without enumerating them. You prove the base case (holds on initial states) and the inductive step (every valid transition preserves it), and the inductive_implies_invariant theorem in the library closes the gap. This is strictly stronger than testing. Testing checks specific play sequences. An invariant proof covers every sequence of legal moves across every setup, including sequences no human would ever play.

Board games are natural LTS candidates. In Caverna 2-player, the states are the complete game configurations (round number, phase, both players' inventories, board layouts, available action spaces), the actions are dwarf placements and harvest events, and the transition relation encodes the rulebook. The state space is enormous but finite. And because Caverna is deterministic and perfect-information (modulo the initial setup randomness), the analysis reduces to a normal-form game between two players choosing from a finite strategy set. Which brings us to the payoff matrix.

I identified eight strategy archetypes by analyzing which action spaces and furnishing tiles cluster together: furnishing rush, weapon rush, animal husbandry, mining heavy, balanced, peaceful farming, ruby economy, and peaceful cave engine. Each archetype has an estimated scoring ceiling (best case across all setups) and floor (worst case). Furnishing rush tops both lists at 140 ceiling and 60 floor. The next question is what happens when two players with potentially different archetypes collide over the shared action spaces. You model this as an \(8 \times 8\) payoff matrix \(M\) where \(M_{ij}\) is the estimated score for the row player when row plays archetype \(i\) and column plays archetype \(j\). Here's the matrix:

Furn Weap Anim Mine Bal Peace Ruby PCE
Furnishing rush 85 130 135 130 125 135 135 130
Weapon rush 80 75 100 85 95 105 100 85
Animal husbandry 75 100 70 100 95 105 100 100
Mining heavy 75 85 95 65 90 100 95 85
Balanced 70 85 85 85 70 90 85 85
Peaceful farming 60 70 70 75 70 55 75 70
Ruby economy 65 75 75 75 70 80 55 75
Peaceful cave 70 80 80 85 80 85 80 60

The first row is at least as large as every other row in every column. That's it. That's the whole result. In game theory this property is called weak dominance: a strategy \(\sigma^*\) is weakly dominant if for all opponent strategies \(o\) and all alternative strategies \(\sigma\),

$$
M(\sigma^*, o) \geq M(\sigma, o)
$$

The Lean proof is furnishing_rush_weakly_dominant, and it says exactly this. For all opponents, for all alternatives, the furnishing rush payoff is greater than or equal. The proof is by exhaustive case analysis over the finite types: cases opponent <;> cases alt <;> decide. Lean's kernel checks all 64 cells. Nobody has to trust my arithmetic.

From weak dominance, the game theory falls out like dominoes. The best response function \(\text{BR} : \text{Strategies} \to \text{Strategies}\) maps each opponent strategy to the set of row-maximizers in the corresponding column. Since furnishing rush achieves the column maximum everywhere, \(\text{BR}\) is the constant function:

$$
\text{BR}(x) = \text{FurnishingRush} \quad \forall x
$$

A Nash equilibrium is a fixed point of the joint best-response correspondence: a pair \((a, b)\) where \(a = \text{BR}(b)\) and \(b = \text{BR}(a)\). Since BR is constant, the only fixed point is (FurnishingRush, FurnishingRush). The Lean proof is furnishing_mirror_unique_nash, and it works by case-splitting on all 64 strategy pairs and observing that only one satisfies the Nash condition. Existence and uniqueness in three lines:

theorem exactly_one_nash_equilibrium :
    (βˆƒ a b, isNashEquilibrium a b) ∧
    (βˆ€ a b, isNashEquilibrium a b β†’
      a = .furnishingRush ∧ b = .furnishingRush) :=
  ⟨⟨.furnishingRush, .furnishingRush, furnishing_mirror_is_nash⟩,
   furnishing_mirror_unique_nash⟩

There is exactly one pure Nash equilibrium and it is the one where both players do the same thing. Which is a little tragic, because the Nash equilibrium is not very good. Both players score 85. If they could somehow coordinate on different strategies (say furnishing rush versus animal husbandry), the combined welfare would be \(135 + 75 = 210\) instead of \(85 + 85 = 170\). The price of anarchy is the ratio of the social optimum to the Nash welfare:

$$
\text{PoA} = \frac{210}{170} = \frac{21}{17} \approx 1.24
$$

Selfish play costs about 24% of the social optimum. This is proven as price_of_anarchy_ratio, and the proof is decide, because it's just arithmetic and Lean can check arithmetic. The depressing implication is that both players would prefer the other person to play something different, but neither can unilaterally deviate without making themselves worse off. You're stuck at 85 each, staring across the table at someone who also read this blog post.

The diagonal of the payoff matrix is always depressed relative to the off-diagonal entries (diagonal_always_depressed). Every mirror matchup scores below at least one non-mirror matchup in the same row. This is the contention effect: when both players pursue the same archetype, they fight over the same action spaces. Furnishing rush mirrors compete for excavation and housework. Weapon rush mirrors fight over blacksmithing. Mining mirrors clash on ore mine construction. The game punishes sameness. It's just that the punishment for sameness (85) is still less painful than the punishment for picking something worse (60 to 80 against an opponent who picked furnishing rush).

A natural question is where the scores can actually land. The theoretical floor is \(-55\) points: a player who takes zero actions across all 11 rounds gets +2 for their two starting dwarfs, \(-8\) for missing all four farm animal types, \(-22\) for 22 unused board spaces, and \(-27\) from 9 begging markers at 3 points each. This is the do-nothing catastrophe (absolute_floor_is_neg55). The theoretical ceiling is harder to pin down. If you sum the independent maxima of every scoring category (animals, grain, vegetables, rubies, dwarfs, pastures, mines, gold, furnishing base points, furnishing bonuses), you get 202. But you can't reach 202, because every scoring category competes for the same 47 dwarf placements over 11 rounds. You cannot simultaneously furnish 12 caverns, build 5 mines, breed 20 animals, and sow 10 fields. The action budget is the binding constraint. The practical ceiling is around 140, achievable by the furnishing rush archetype when uncontested. So the full scoring range is 195 points, from \(-55\) to \(140\), and the dominant strategy guarantees you land somewhere between 60 and 140, which is to say it keeps you out of the bottom 68% of the range no matter what happens.

I also went hunting for degenerate combos, the kind of thing speedrunners and min-maxers love. I found nine. The Beer Parlor lets you trade your dwarfs' weapons for gold at 1 gold per strength point, which theoretically yields 70 gold from five dwarfs at maximum strength. The Breeding Cave creates a self-reinforcing donkey engine: donkeys breed, fill mines, Mine Railway produces ore, ore builds more mines, more mines need more donkeys. Dogs have no cap on sheep-watching, so \(n\) dogs on one meadow guard \(n+1\) sheep, and with Weaving Parlor that's 51 points from a single tile. The Writing Chamber halves all negative points for 1 stone, which means you can deliberately skip entire scoring categories and eat only half the penalty.

None of these breaks the game. Every degenerate combo faces the same constraint: 47 total actions. You can't simultaneously maximize weapon strength for Beer Parlor, build mines for the donkey engine, accumulate rubies for wild-card chains, and fill the board for furnishing rush. The combos produce flashy numbers in isolated categories but can't cover all the bases. The vanilla furnishing rush, with its balanced approach to board coverage, family growth, and furnishing synergies, remains the ceiling. The anomalies are features, not bugs. They're what make the game worth replaying even after you know the dominant strategy, because they give you something to pivot into when the standard lines are blocked.

So is the game solved? At the archetype level, yes. For any 2-player setup, regardless of which of the 2,880 configurations you draw, the optimal pure strategy is furnishing rush. The proof does not depend on any specific card ordering or harvest marker placement. Furnishing rush is weakly dominant in the payoff matrix, which means it's optimal no matter what random inputs the game generates. What I haven't done is compute the exact sequence of 47 dwarf placements for each specific setup. That would require solving a game tree with roughly \(10^{30}\) nodes, which is slightly beyond my patience. But the archetype-level result is the one that matters for actual play. You know the plan. The within-archetype decisions (which cavern to excavate first, which furnishing tile to prioritize) are tactical, not strategic. They don't change the answer.

The practical takeaway: play furnishing rush. Excavate aggressively in rounds 1 through 4. Get Office Room early for the overhang bonuses. Get State Parlor for +4 per dwelling. Grow to 5 dwarfs and pick up Broom Chamber for +10 bonus. If your opponent does something different, you score 125 to 135 and they score 60 to 105. If your opponent also plays furnishing rush, you both land at 85, which is worse than the cooperative optimum of 210 but better than any unilateral deviation. You'll sit there at the table, both of you excavating caverns as fast as you can, both knowing that one of you could sacrifice 10 points to give the other 50, but neither willing to be the one who blinks. Prisoner's dilemma in a cave. Uwe Rosenberg probably didn't intend this, but the formal analysis says it's there.

The code is built against a Mathlib-based transition system library on Lean 4 v4.28.0. It models all 23 action spaces, all 46 furnishing tiles, the complete expedition loot table, the 4x3 forest and mountain board grids, and the full 11-round 2-player harvest schedule. I had a lot of fun writing it, and I think Caverna is a beautifully designed game, which is exactly why it was satisfying to find that even under formal analysis the strategic structure holds together so well. A badly designed game would have a trivially dominant strategy that makes the game boring. Caverna's dominant strategy comes with a 24% welfare tax on the mirror matchup, which means the game is always more interesting when your opponent does something unexpected, which means in practice people don't always play the dominant strategy, which means the game stays fun. Good game design, it turns out, is robust to being solved. And we can now prove that Caverna is a wonderfully designed game with perfect rigor!