Google DeepMind's "Gemini Deep Think" and OpenAI's "o1-series" reasoning LLM both solved Problems 1–5 (~35 pts) under authentic 4.5-hour exam conditions (no external tools, fully natural-language proofs), thereby surpassing the typical gold-medal cut line (≥28 pts) at IMO 2025 [1][2]. Both failed Problem 6, a notoriously hard combinatorics problem that pulled average scores down even among top contestants: for humans, fewer than 1% of gold-medal solvers fully completed P6 [3].
"At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit." – [Gemini Blog] [1]. Likewise, OpenAI's IMO-solving system is described as a "general-purpose reasoning LLM": essentially a next-generation large language model specialized in complex reasoning tasks [4].
Why P6 Remains Elusive
Despite strong performance on P1–P5, both systems failed P6. The failure is not a matter of isolated single-step errors; instead, it stems from:
- Scarcity of P6-style combinatorics in training/output data and theorem libraries, unlike geometry or number theory (see the CombiBench results) [5].
- Explosive search space and branching factor when generating formal proofs in Lean for novel insights (e.g., local invariants, pigeonhole principle tweaks).
- Time-and-compute constraints on "multi-step" reasoning: DeepMind's parallel prompting ("Deep Think") and OpenAI's scaled chain-of-thought both improved success on P1–P5, but P6 demands even deeper search, likely exceeding their inference budgets [6].
What Is Lean, the Formal Expression Language Used in Previous Years?
Lean 4 is the latest version of Lean, a formal proof assistant and functional programming language originally developed at Microsoft Research. It is designed to help mathematicians, computer scientists, and engineers formally verify logical statements, algorithms, and software. Unlike traditional programming languages, Lean emphasizes mathematical rigor and provable correctness [7].
At its core, Lean 4 is based on dependent type theory, which allows types to depend on values. This feature enables expressing complex properties directly in the type system, bridging the gap between programming and formal proofs. For example, you can write a type that represents the concept of a "sorted list" and then prove that a given function preserves sorting.
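To make this concrete, here is a minimal sketch (the `Sorted` predicate and the example list are illustrative, not taken from any cited source) of how such a property can be stated as an inductive proposition in plain Lean 4 and proved for a concrete list:

```lean
-- Hypothetical sketch: a "sorted" predicate on lists of naturals,
-- stated as an inductive proposition in plain Lean 4 (no extra imports).
inductive Sorted : List Nat → Prop
  | nil  : Sorted []                                   -- the empty list is sorted
  | single (x : Nat) : Sorted [x]                      -- a one-element list is sorted
  | cons (x y : Nat) (rest : List Nat)
      (hxy : x ≤ y) (htail : Sorted (y :: rest)) :
      Sorted (x :: y :: rest)                          -- extend a sorted list on the left

-- A concrete proof that [1, 2, 3] is sorted, built constructor by constructor.
example : Sorted [1, 2, 3] :=
  Sorted.cons 1 2 [3] (by decide)
    (Sorted.cons 2 3 [] (by decide) (Sorted.single 3))
```

A sorting function could then be given a dependent type that returns a list together with a proof of `Sorted`, which is exactly the bridge between programming and proving described above.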
Lean's Proof Architecture: Definitions, Lemmas, Theorems
In Lean, everything is a declaration with an explicit type:
- `def`: definitions, e.g. `def one := succ zero`
- `lemma` / `theorem`: provable statements derived from assumptions
- `variable {x y : ℕ}` / hypothesis `h : P`: context
Lemmas are mini-theorems used to modularize big proofs; the kernel ensures all dependencies are sound.
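As a minimal sketch (the declaration names here are illustrative), the declaration kinds above look like this in Lean 4:

```lean
-- `def` introduces a definition, analogous to the post's `def one := succ zero`.
def one : Nat := Nat.succ Nat.zero

-- `theorem`/`lemma` state provable facts; the kernel checks every proof.
theorem one_eq : one = 1 := rfl               -- a small lemma, proved by reflexivity

theorem one_le_two : one ≤ 2 := by            -- a theorem that reuses the lemma
  rw [one_eq]   -- rewrite `one` to `1` using the lemma; goal becomes 1 ≤ 2
  decide        -- decidable arithmetic closes the remaining goal
```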
Glossary of Core Tactics (with Examples)
For each tactic below, the entry gives:
- its purpose
- a minimal Lean snippet
- frequent pitfalls
rfl – Reflexivity of Equality
Purpose: close any goal `A = B` when Lean sees the two sides as syntactically identical (up to unfolding definitions).
```lean
example (x : ℕ) : x + 37 = x + 37 := by rfl
```

- Failure case: a goal such as `0 + x = x` fails; despite arithmetic correctness, Lean treats the two sides as distinct terms.
- Pitfall: `rfl` only works on direct syntactic equality, not on forms that need rewriting first (apply `simp` first).
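As a follow-up sketch: the `0 + x = x` goal that defeats `rfl` closes immediately once a rewriting tactic is allowed to use `zero_add` (assuming the ℕ notation from Mathlib, as in the snippet above):

```lean
example (x : ℕ) : 0 + x = x := by
  simp   -- rfl alone fails here; simp rewrites with zero_add and closes the goal
```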
rw [...] – Rewrite Using an Equality
Purpose: use a known proof `h : A = B` to replace occurrences of `A` in the goal or hypotheses with `B`.
Variants:
- `rw [← h]`: rewrite right-to-left
- `rw [h1, h2]`: apply sequential rewrites
- `rw [h] at h2`: target a specific hypothesis
```lean
example (x y : ℕ) (h : x = y + y) (add_zero : ∀ n : ℕ, n + 0 = n) :
    Nat.succ (x + 0) = Nat.succ (y + y) := by
  rw [add_zero]   -- goal becomes succ x = succ (y + y)
  rw [h]          -- goal becomes succ (y + y) = succ (y + y), closed by rw's trailing rfl
```

- Forgetting the square brackets (`rw h`) is invalid syntax in Lean 4.
- `h` must be an equality `A = B` (or an `Iff`); attempting to rewrite with a plain implication `A → B` gives an error.
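The two variants listed above can be sketched as follows (hypothetical hypotheses `h` and `h2`; ℕ as before):

```lean
-- rw [← h]: rewrite right-to-left (replace b by a in the goal).
example (a b : ℕ) (h : a = b) : b = a := by
  rw [← h]       -- goal becomes a = a and is closed by rw's trailing rfl

-- rw [h] at h2: rewrite inside a hypothesis instead of the goal.
example (a b : ℕ) (h : a = b) (h2 : a = 0) : b = 0 := by
  rw [h] at h2   -- h2 becomes b = 0
  exact h2
```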
simp, simp_* (e.g. simp_arith) – Simplification
Purpose: Lean's built-in simplifier. `simp` solves many arithmetic and logical equivalences automatically.
```lean
example (x : ℕ) : x + 0 + 0 = x := by simp
```

- `simp_arith` extends `simp` with arithmetic normalization, useful for goals about addition.
- Pitfall: `simp` can loop if misused or if the simp set contains non-terminating rewrite rules.
apply, exact, have, induction, cases, intro
- `apply lemma_name`: replace the goal with the lemma's premises.
- `exact proof_term`: supply a term that matches the goal exactly.
- `have h' : P := proof`: introduce a local lemma.
- `induction n` / `cases h`: split on a natural number or a hypothesis.
- `intro h`: move a universal quantifier or the hypothesis of an implication into the context.
```lean
lemma succ_ne_zero (n : ℕ) : Nat.succ n ≠ 0 := by
  intro h    -- assume succ n = 0
  cases h    -- succ n can't unify with zero, so no cases remain
```
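A companion sketch for `have`, `apply`, and `exact` (an illustrative example; `Nat.le_trans` and `Nat.le_succ` are standard library lemmas):

```lean
example (a b : ℕ) (hab : a ≤ b) : a ≤ b + 1 := by
  have hb : b ≤ b + 1 := Nat.le_succ b   -- have: introduce a local lemma
  apply Nat.le_trans hab                  -- apply: reduce the goal to b ≤ b + 1
  exact hb                                -- exact: close it with the local lemma
```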
contrapose, symm, tauto, trivial, use

- `contrapose`: replace a goal `A → B` with its contrapositive `¬B → ¬A`.
- `symm`: convert a goal `a = b` into `b = a`.
- `tauto`: automatically prove propositional tautologies.
- `trivial`: shortcut covering `rfl` and simple propositional cases.
- `use`: in an existential goal `∃ x, P x`, provide a witness; `use 3` supplies `3`.
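A minimal sketch for `use` and `tauto` (both tactics are provided by Mathlib, which this sketch assumes):

```lean
-- use: supply a witness for an existential goal.
example : ∃ x : ℕ, x = 3 := by
  use 3          -- the remaining goal 3 = 3 is closed automatically by use's discharger

-- tauto: discharge a propositional tautology.
example (P Q : Prop) : P ∧ Q → Q ∧ P := by
  tauto
```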
Limitations & Bottlenecks in Practice
- Combinatorics remains the weakest domain: on CombiBench (100 Lean-formalized P6-style combinatorics problems up to IMO 2024), even the best fine-tuned LLM solved only ~7% with no prior training; gold-level performance on P6-type problems remains far off [5].
- Surface-form brittleness: models can fail when the phrasing of a statement changes; heavy reliance on pattern matching over tactics (`nth_rewrite`, `rw`, `simp`) rather than semantic reasoning makes them fragile to minor shifts [8].
- Inference-time scaling cost: deep thinkers with more parallel candidates do better on complex proofs, but beyond a certain depth (P6-style), even dozens of branches and 10× compute still fall short [6][9].
Summary
- LLMs from DeepMind and OpenAI reached gold-medal standard at IMO 2025 by solving Problems 1–5, this year producing rigorous proofs end-to-end in natural language rather than through the Lean-based pipelines of previous years.
- Problem 6 remains unsolved by both, despite scalability improvements.
- Formal expression via Lean tactics (`rfl`, `rw`, etc.) ensures verifiability, but demands explicit, step-by-step inference.
- Challenges remain in combinatorial reasoning, representation brittleness, and resource-intensive search, but formal proofs in Lean constitute a major leap toward trustworthy theorem proving.
Footnotes
- DeepMind blog – "Gemini Deep Think at IMO 2025"
- OpenAI blog – "o1 reasoning models and competition math"
- "CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics"
- "Ineq-Comp: Benchmarking Human-Intuitive Compositionality"
- "Using Reasoning Models to Generate Search Heuristics"