Google DeepMind’s ā€œGemini Deep Thinkā€ and OpenAI’s ā€œo1-seriesā€ reasoning LLM both solved Problems 1–5 (~35 pts) under authentic 4.5-hour exam conditions (no external tools, fully natural-language proofs), thereby surpassing the typical gold-medal cut line (≄28 pts) at IMO 2025[^1][^2]. Both failed Problem 6, a notoriously hard combinatorics proof that dragged average scores down even among top contestants; for humans too, fewer than 1% of gold-medal solvers fully completed P₆[^3].

ā€œAt IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.ā€ – [Gemini Blog][^1]. Likewise, OpenAI’s IMO-solving system is described as a ā€œgeneral-purpose reasoning LLMā€, essentially a next-generation large language model specialized in complex reasoning tasks[^4].

Why P₆ Remains Elusive

Despite strong performance on P₁–P₅, both systems failed P₆. The failures were not single-step errors; rather, they stem from:

  • Scarcity of P₆-style combinatorics in training data, model outputs, and theorem libraries, unlike geometry or number theory (see the CombiBench results)[^5].
  • Explosive search space and branching factor when generating formal proofs in Lean for novel insights (e.g., local invariants, pigeonhole principle tweaks).
  • Time-and-compute constraints on multi-step reasoning; DeepMind’s parallel prompting (ā€œDeep Thinkā€) and OpenAI’s scaled chain-of-thought both improved success on P₁–P₅, but P₆ demands even deeper search, likely exceeding their inference budgets[^6].

What Is the Formal Expression Language (Lean) Used in Previous Years?

Lean 4 is the latest version of Lean, a formal proof assistant and functional programming language developed primarily at Microsoft Research. It is designed to help mathematicians, computer scientists, and engineers formally verify logical statements, algorithms, and software. Unlike traditional programming languages, Lean emphasizes mathematical rigor and provable correctness.[^7]

At its core, Lean 4 is based on dependent type theory, which allows types to depend on values. This feature enables expressing complex properties directly in the type system, bridging the gap between programming and formal proofs. For example, you can write a type that represents the concept of ā€œsorted listā€ and then prove that a given function preserves sorting.
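To make this concrete, here is a minimal sketch of the ā€œsorted listā€ idea: an inductive `Sorted` predicate on lists of naturals, plus a proof that mapping `(Ā· + 1)` over a list preserves sortedness. The names `Sorted` and `map_succ_sorted` are illustrative, not from any library.

```lean
-- A predicate expressing that a list of naturals is sorted (non-decreasing).
inductive Sorted : List Nat → Prop
  | nil : Sorted []
  | single (x : Nat) : Sorted [x]
  | cons (x y : Nat) (l : List Nat) (hxy : x ≤ y) (h : Sorted (y :: l)) :
      Sorted (x :: y :: l)

-- Mapping the monotone function (¡ + 1) over a list preserves sortedness.
theorem map_succ_sorted : āˆ€ l : List Nat, Sorted l → Sorted (l.map (Ā· + 1))
  | _, Sorted.nil => Sorted.nil
  | _, Sorted.single x => Sorted.single (x + 1)
  | _, Sorted.cons x y l hxy h =>
      Sorted.cons (x + 1) (y + 1) (l.map (¡ + 1))
        (Nat.succ_le_succ hxy) (map_succ_sorted _ h)
```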


Lean’s Proof Architecture: Definitions, Lemmas, Theorems

In Lean, everything is a declaration with an explicit type:

  • def: definitions, e.g. def one := succ zero
  • lemma / theorem: provable statements from assumptions
  • variable {x y : ℕ} and hypotheses such as (h : P): the ambient context for subsequent declarations

Lemmas are mini-theorems used to modularize big proofs; the kernel ensures all dependencies are sound.
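A minimal sketch of these declaration kinds working together (assuming Mathlib, which provides the ℕ notation and the `lemma` keyword; the names are illustrative):

```lean
import Mathlib.Tactic

def double (n : ā„•) : ā„• := n + n

-- A small reusable fact; `lemma` and `theorem` are interchangeable.
lemma double_zero : double 0 = 0 := rfl

-- A theorem whose proof unfolds the definition, then closes with arithmetic.
theorem double_succ (n : ā„•) : double (n + 1) = double n + 2 := by
  unfold double
  omega
```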


šŸ›  Glossary of Core Tactics (with Examples)

For each tactic below:

  1. Describe its purpose
  2. Show a minimal Lean snippet
  3. List frequent pitfalls

rfl — Reflexivity of Equality

Purpose: Close a goal A = B when Lean sees the two sides as definitionally equal, i.e. both reduce to the same term.

```lean
example (x : ā„•) : x + 37 = x + 37 := by rfl
```
  • Failure on 0 + x = x: despite arithmetic correctness, 0 + x is stuck when x is a variable, because addition recurses on its second argument.
  • Pitfall: rfl only works up to definitional equality; propositional facts such as 0 + x = x need induction, rw, or simp first.
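The asymmetry is easy to see, assuming the standard Lean 4 definition of Nat.add (recursion on the second argument):

```lean
example (x : Nat) : x + 0 = x := rfl      -- succeeds: `x + 0` unfolds to `x`
-- example (x : Nat) : 0 + x = x := rfl   -- fails: `0 + x` is stuck on the variable `x`
example (x : Nat) : 0 + x = x := by simp  -- `simp` applies the lemma `Nat.zero_add`
```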

rw [...] — Rewrite an Equality Hypothesis

Purpose: Use a known proof h : A = B to replace A-subterms in the goal or hypotheses with B. Variants (demonstrated in the second sketch below):

  • rw [←h]: rewrite right→left
  • rw [h1, h2]: apply sequential rewrites
  • rw [h] at hā‚‚: target specific hypothesis
```lean
example (x y : ā„•) (h : x = y + y) (add_zero : āˆ€ n : ā„•, n + 0 = n) :
    succ (x + 0) = succ (y + y) := by
  rw [add_zero]  -- goal: succ x = succ (y + y)
  rw [h]         -- both sides now match, and rw closes the goal by rfl
```
  • Forgetting the square brackets (rw h) is a syntax error.
  • h must be an equality A = B (or an iff between propositions); rewriting with a bare implication fails.
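A minimal sketch of the variants listed above (standard Lean 4 behavior assumed):

```lean
-- rw [h1, ←h2]: sequential rewrites, the second applied right-to-left.
example (a b c : Nat) (h1 : a = b) (h2 : c = b) : a = c := by
  rw [h1, ←h2]   -- a → b, then b → c; the goal closes by rfl

-- rw [...] at ...: rewrite inside a hypothesis instead of the goal.
example (x y : Nat) (h1 : x = y) (h2 : x + 1 = 5) : y + 1 = 5 := by
  rw [h1] at h2  -- h2 becomes y + 1 = 5
  exact h2
```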

simp, simp_* (e.g. simp_all) — Automated Simplification

Purpose: Lean’s built-in simplifier. simp solves many arithmetic and logical equivalences automatically.

```lean
example (x : ā„•) : x + 0 + 0 = x := by simp
```
  • Variants: simp_all simplifies all hypotheses and the goal together; simp [my_lemma] adds extra rewrite rules to the default set.
  • Pitfall: simp can loop if given rewrite rules that keep applying to their own output.
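For instance, a minimal sketch of simp_all on a toy goal (hypothetical example):

```lean
example (x y : Nat) (h1 : x = y) (h2 : y + 0 = 3) : x = 3 := by
  simp_all   -- simplifies h2 to `y = 3`, rewrites with h1, and closes the goal
```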

apply, exact, have, induction, cases, intro

  • apply lemma_name: Replace goal with lemma’s premises.
  • exact proof_term: Supply a term/refinement to match the goal exactly.
  • have h' : P := proof: Introduce local lemma.
  • induction n / cases h: Argue by induction on a natural number, or split a hypothesis into its constructor cases.
  • intro h: Move a universally quantified variable or the antecedent of an implication into the context (see the combined sketch below).
```lean
lemma succ_ne_zero (n : ā„•) : succ n ≠ 0 := by
  intro h
  cases h -- succ n can't unify with zero, so no cases remain
```
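Here is a minimal sketch combining several of these tactics, using core Lean 4 names (congrArg is the standard congruence lemma):

```lean
example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => exact rfl                -- `exact` supplies a finished proof term
  | succ k ih =>
    have h : 0 + (k + 1) = (0 + k) + 1 := rfl  -- `have` names a local fact
    rw [h]
    apply congrArg (¡ + 1)           -- `apply` reduces the goal to its premise
    exact ih
```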

contrapose, symm, tauto, trivial, use

  • contrapose: Replace a goal B → A with its contrapositive ¬A → ¬B.
  • symm: Convert a goal a = b into b = a.
  • tauto: Automatically prove propositional tautologies.
  • trivial: Tries a few simple closers (rfl, assumption, and friends).
  • use: In an existential goal ∃ x, P x, provide a witness: use 3 supplies 3 (all demonstrated below).
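Minimal sketches of these tactics in action (contrapose!, tauto, and use here are the Mathlib versions):

```lean
import Mathlib.Tactic

example (a b : ā„•) (h : a = b) : b = a := by
  symm
  exact h

example : ∃ n : ℕ, n + 2 = 5 := by
  use 3          -- witness; the residual goal `3 + 2 = 5` closes automatically

example (p q : Prop) (h : p → q) : ¬q → ¬p := by
  contrapose!    -- the goal becomes p → q
  exact h

example (p q : Prop) : p ∧ q → q ∨ p := by
  tauto
```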

🚧 Limitations & Bottlenecks in Practice

  1. Combinatorics remains the weakest domain: on CombiBench (a benchmark of 100 combinatorics problems formalized in Lean, including IMO-level ones), even the best-performing LLM solved only ~7% of problems; gold-level performance on P₆-type problems remains far off[^5].
  2. Surface-form brittleness: models can fail when a statement’s phrasing changes; leaning on pattern-matched tactic sequences (nth_rewrite, rw, simp) rather than semantic reasoning makes them fragile to minor shifts[^8].
  3. Inference-time scaling cost: sampling more parallel candidates improves results on complex proofs, but beyond a certain depth (P₆-style), even dozens of branches and 10Ɨ compute still fall short[^6][^9].

āœ… Summary

  • LLMs from DeepMind and OpenAI reached gold-medal standard at IMO 2025 by solving Problems 1–5 end-to-end in natural language, without hand-translation into a formal language such as Lean.
  • Problem 6 remains unsolved by both, despite scalability improvements.
  • Formal expression via Lean tactics (rfl, rw, etc.) ensures verifiability, but demands explicit, step-by-step inference.
  • Challenges remain in combinatorial reasoning, representation brittleness, and resource-intensive search, but formal proofs in Lean constitute a major step toward trustworthy theorem proving.

Footnotes

[^1]: DeepMind blog – ā€œGemini Deep Think at IMO 2025ā€
[^2]: OpenAI blog – ā€œo1 reasoning models and competition mathā€
[^3]: International Mathematical Olympiad 2025 results
[^4]: Simon Willison’s blog
[^5]: ā€œCombiBench: Benchmarking LLM Capability for Combinatorial Mathematicsā€
[^6]: OpenAI – ā€œLearning to reason with LLMsā€
[^7]: Lean Language Org
[^8]: ā€œIneq-Comp: Benchmarking Human-Intuitive Compositionalityā€
[^9]: ā€œUsing Reasoning Models to Generate Search Heuristicsā€