Google DeepMind’s ā€œGemini Deep Thinkā€ and OpenAI’s ā€œo1-seriesā€ reasoning LLM both solved Problems 1–5 (~35 pts) under authentic 4.5-hour exam conditions (no external tools, fully natural-language proofs), thereby surpassing the typical gold-medal cut line (≄28 pts) at IMO 2025[^1][^2]. Both failed Problem 6, a notoriously hard combinatorics proof that dragged average scores down even among top contestants; for humans too, fewer than 1% of gold-medal solvers fully completed P₆[^3].

ā€œAt IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.ā€ – [Gemini Blog][^1]. Likewise, OpenAI’s IMO-solving system is described as a ā€œgeneral-purpose reasoning LLMā€, essentially a next-generation large language model specialized in complex reasoning tasks[^4].

Why P₆ Remains Elusive

Despite strong performance on P₁–P₅, both systems failed P₆. The failures were not single-step errors; rather, they stem from:

  • Scarcity of P₆-style combinatorics in training data, model outputs, and theorem libraries, unlike geometry or number theory (see the CombiBench results)[^5].
  • Explosive search space and branching factor when generating formal proofs in Lean for novel insights (e.g., local invariants, pigeonhole principle tweaks).
  • Time-and-compute constraints on multi-step reasoning; DeepMind’s parallel prompting (ā€œDeep Thinkā€) and OpenAI’s scaled chain-of-thought both improved success on P₁–P₅, but P₆ demands even deeper search, likely exceeding their inference budgets[^6].

What Is the Formal Expression Language (Lean) Used in Previous Years?

Lean 4 is the latest version of Lean, a formal proof assistant and functional programming language developed primarily at Microsoft Research. It is designed to help mathematicians, computer scientists, and engineers formally verify logical statements, algorithms, and software. Unlike traditional programming languages, Lean emphasizes mathematical rigor and provable correctness.[^7]

At its core, Lean 4 is based on dependent type theory, which allows types to depend on values. This feature enables expressing complex properties directly in the type system, bridging the gap between programming and formal proofs. For example, you can write a type that represents the concept of ā€œsorted listā€ and then prove that a given function preserves sorting.
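To make this concrete, here is a minimal sketch of the ā€œsorted listā€ idea: an inductive `Sorted` predicate on lists of naturals, plus a proof that mapping `(Ā· + 1)` over a list preserves sortedness. The names `Sorted` and `map_succ_sorted` are illustrative, not from any library.

```lean
-- A predicate expressing that a list of naturals is sorted (non-decreasing).
inductive Sorted : List Nat → Prop
  | nil : Sorted []
  | single (x : Nat) : Sorted [x]
  | cons (x y : Nat) (l : List Nat) (hxy : x ≤ y) (h : Sorted (y :: l)) :
      Sorted (x :: y :: l)

-- Mapping the monotone function (¡ + 1) over a list preserves sortedness.
theorem map_succ_sorted : āˆ€ l : List Nat, Sorted l → Sorted (l.map (Ā· + 1))
  | _, Sorted.nil => Sorted.nil
  | _, Sorted.single x => Sorted.single (x + 1)
  | _, Sorted.cons x y l hxy h =>
      Sorted.cons (x + 1) (y + 1) (l.map (¡ + 1))
        (Nat.succ_le_succ hxy) (map_succ_sorted _ h)
```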


Lean’s Proof Architecture: Definitions, Lemmas, Theorems

In Lean, everything is a declaration with an explicit type:

  • def: definitions, e.g. def one := succ zero
  • lemma / theorem: provable statements from assumptions
  • variable {x y : ℕ} and hypotheses such as (h : P): the ambient context for subsequent declarations

Lemmas are mini-theorems used to modularize big proofs; the kernel ensures all dependencies are sound.
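A minimal sketch of these declaration kinds working together (assuming Mathlib, which provides the ℕ notation and the `lemma` keyword; the names are illustrative):

```lean
import Mathlib.Tactic

def double (n : ā„•) : ā„• := n + n

-- A small reusable fact; `lemma` and `theorem` are interchangeable.
lemma double_zero : double 0 = 0 := rfl

-- A theorem whose proof unfolds the definition, then closes with arithmetic.
theorem double_succ (n : ā„•) : double (n + 1) = double n + 2 := by
  unfold double
  omega
```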


šŸ›  Glossary of Core Tactics (with Examples)

For each tactic below:

  1. Describe its purpose
  2. Show a minimal Lean snippet
  3. List frequent pitfalls

rfl — Reflexivity of Equality

Purpose: Close a goal A = B when Lean sees the two sides as definitionally equal, i.e. both reduce to the same term.

```lean
example (x : ā„•) : x + 37 = x + 37 := by rfl
```
  • Failure on 0 + x = x: despite arithmetic correctness, 0 + x is stuck when x is a variable, because addition recurses on its second argument.
  • Pitfall: rfl only works up to definitional equality; propositional facts such as 0 + x = x need induction, rw, or simp first.
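The asymmetry is easy to see, assuming the standard Lean 4 definition of Nat.add (recursion on the second argument):

```lean
example (x : Nat) : x + 0 = x := rfl      -- succeeds: `x + 0` unfolds to `x`
-- example (x : Nat) : 0 + x = x := rfl   -- fails: `0 + x` is stuck on the variable `x`
example (x : Nat) : 0 + x = x := by simp  -- `simp` applies the lemma `Nat.zero_add`
```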

rw [...] — Rewrite an Equality Hypothesis

Purpose: Use a known proof h : A = B to replace A-subterms in the goal or hypotheses with B. Variants (demonstrated in the second sketch below):

  • rw [←h]: rewrite right→left
  • rw [h1, h2]: apply sequential rewrites
  • rw [h] at hā‚‚: target specific hypothesis
```lean
example (x y : ā„•) (h : x = y + y) (add_zero : āˆ€ n : ā„•, n + 0 = n) :
    succ (x + 0) = succ (y + y) := by
  rw [add_zero]  -- goal: succ x = succ (y + y)
  rw [h]         -- both sides now match, and rw closes the goal by rfl
```
  • Forgetting the square brackets (rw h) is a syntax error.
  • h must be an equality A = B (or an iff between propositions); rewriting with a bare implication fails.
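A minimal sketch of the variants listed above (standard Lean 4 behavior assumed):

```lean
-- rw [h1, ←h2]: sequential rewrites, the second applied right-to-left.
example (a b c : Nat) (h1 : a = b) (h2 : c = b) : a = c := by
  rw [h1, ←h2]   -- a → b, then b → c; the goal closes by rfl

-- rw [...] at ...: rewrite inside a hypothesis instead of the goal.
example (x y : Nat) (h1 : x = y) (h2 : x + 1 = 5) : y + 1 = 5 := by
  rw [h1] at h2  -- h2 becomes y + 1 = 5
  exact h2
```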

simp, simp_* (e.g. simp_all) — Automated Simplification

Purpose: Lean’s built-in simplifier. simp solves many arithmetic and logical equivalences automatically.

```lean
example (x : ā„•) : x + 0 + 0 = x := by simp
```
  • Variants: simp_all simplifies all hypotheses and the goal together; simp [my_lemma] adds extra rewrite rules to the default set.
  • Pitfall: simp can loop if given rewrite rules that keep applying to their own output.
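For instance, a minimal sketch of simp_all on a toy goal (hypothetical example):

```lean
example (x y : Nat) (h1 : x = y) (h2 : y + 0 = 3) : x = 3 := by
  simp_all   -- simplifies h2 to `y = 3`, rewrites with h1, and closes the goal
```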

apply, exact, have, induction, cases, intro

  • apply lemma_name: Replace goal with lemma’s premises.
  • exact proof_term: Supply a term/refinement to match the goal exactly.
  • have h' : P := proof: Introduce local lemma.
  • induction n / cases h: Argue by induction on a natural number, or split a hypothesis into its constructor cases.
  • intro h: Move a universally quantified variable or the antecedent of an implication into the context (see the combined sketch below).
```lean
lemma succ_ne_zero (n : ā„•) : succ n ≠ 0 := by
  intro h
  cases h -- succ n can't unify with zero, so no cases remain
```
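Here is a minimal sketch combining several of these tactics, using core Lean 4 names (congrArg is the standard congruence lemma):

```lean
example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => exact rfl                -- `exact` supplies a finished proof term
  | succ k ih =>
    have h : 0 + (k + 1) = (0 + k) + 1 := rfl  -- `have` names a local fact
    rw [h]
    apply congrArg (¡ + 1)           -- `apply` reduces the goal to its premise
    exact ih
```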

contrapose, symm, tauto, trivial, use

  • contrapose: Replace a goal B → A with its contrapositive ¬A → ¬B.
  • symm: Convert a goal a = b into b = a.
  • tauto: Automatically prove propositional tautologies.
  • trivial: Tries a few simple closers (rfl, assumption, and friends).
  • use: In an existential goal ∃ x, P x, provide a witness: use 3 supplies 3 (all demonstrated below).
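Minimal sketches of these tactics in action (contrapose!, tauto, and use here are the Mathlib versions):

```lean
import Mathlib.Tactic

example (a b : ā„•) (h : a = b) : b = a := by
  symm
  exact h

example : ∃ n : ℕ, n + 2 = 5 := by
  use 3          -- witness; the residual goal `3 + 2 = 5` closes automatically

example (p q : Prop) (h : p → q) : ¬q → ¬p := by
  contrapose!    -- the goal becomes p → q
  exact h

example (p q : Prop) : p ∧ q → q ∨ p := by
  tauto
```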

🚧 Limitations & Bottlenecks in Practice

  1. Combinatorics remains the weakest domain: on CombiBench (a benchmark of 100 combinatorics problems formalized in Lean, including IMO-level ones), even the best-performing LLM solved only ~7% of problems; gold-level performance on P₆-type problems remains far off[^5].
  2. Surface-form brittleness: models can fail when a statement’s phrasing changes; leaning on pattern-matched tactic sequences (nth_rewrite, rw, simp) rather than semantic reasoning makes them fragile to minor shifts[^8].
  3. Inference-time scaling cost: sampling more parallel candidates improves results on complex proofs, but beyond a certain depth (P₆-style), even dozens of branches and 10Ɨ compute still fall short[^6][^9].

āœ… Summary

  • LLMs from DeepMind and OpenAI reached gold-medal standard at IMO 2025 by solving Problems 1–5 end-to-end in natural language, without hand-translation into a formal language such as Lean.
  • Problem 6 remains unsolved by both, despite scalability improvements.
  • Formal expression via Lean tactics (rfl, rw, etc.) ensures verifiability, but demands explicit, step-by-step inference.
  • Challenges remain in combinatorial reasoning, representation brittleness, and resource-intensive search, but formal proofs in Lean constitute a major step toward trustworthy theorem proving.

Footnotes

[^1]: DeepMind blog – ā€œGemini Deep Think at IMO 2025ā€
[^2]: OpenAI blog – ā€œo1 reasoning models and competition mathā€
[^3]: International Mathematical Olympiad 2025 results
[^4]: Simon Willison’s blog
[^5]: ā€œCombiBench: Benchmarking LLM Capability for Combinatorial Mathematicsā€
[^6]: OpenAI – ā€œLearning to reason with LLMsā€
[^7]: Lean Language Org
[^8]: ā€œIneq-Comp: Benchmarking Human-Intuitive Compositionalityā€
[^9]: ā€œUsing Reasoning Models to Generate Search Heuristicsā€