
*Image source: “ARC Prize Foundation @ AI Worlds Fair 2025” — Greg Kamradt*[1]
What Is ARC‑AGI?
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC‑AGI) is a benchmark launched in 2019 by François Chollet to evaluate fluid intelligence, defined as how efficiently a system learns novel tasks from minimal prior examples[2].
Definition of Intelligence
- John McCarthy (paraphrased): AI should solve tasks it hasn’t been pre-trained on.
- François Chollet: Intelligence = skill‑acquisition efficiency across task scope, priors, and generalization difficulty[1].
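Chollet's paper develops this definition into a full algorithmic-information-theoretic formalization; as an informal sketch (notation mine, not the paper's exact formula), it can be read as a ratio:

```math
\text{Intelligence} \propto \frac{\text{skill attained across a scope of tasks}}{\text{priors} + \text{experience}}
```

That is, a system that reaches the same skill with fewer built-in priors and less task-specific experience is counted as more intelligent, with tasks weighted by their generalization difficulty.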
Core Principles
- Fluid‑intelligence focus: Tasks rely only on innate “Core Knowledge” priors—spatial relations, grouping, transformations—free of cultural or language dependence.
- Easy for humans, hard for AI: Every task is solved reliably by ≥2 human testers in ≤2 attempts, while current AI systems score near zero.
- Few‑shot grid reasoning format: Each task provides a few (typically 2–5) demo input-output grid pairs with limited size and color palette; the solver must generalize to a new input.
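To make the few-shot format concrete, ARC‑AGI tasks are distributed as JSON with `train` and `test` keys, where each grid is a list of rows of color indices (0–9). The toy "solver" below handles only one trivial transformation class, a per-cell color substitution, purely to illustrate the learn-from-demos-then-generalize setup; real tasks require far richer transformations.

```python
# A hand-written task in the ARC-AGI JSON structure: a few demo pairs,
# then a test input whose output the solver must predict.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 0]]}],
}

def infer_color_map(pairs):
    """Learn a cell-wise color substitution consistent with every demo pair."""
    mapping = {}
    for pair in pairs:
        for row_in, row_out in zip(pair["input"], pair["output"]):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    return None  # demos are not a pure color substitution
    return mapping

def apply_color_map(grid, mapping):
    """Apply the learned substitution to a new grid."""
    return [[mapping[c] for c in row] for row in grid]

mapping = infer_color_map(task["train"])      # {1: 2, 0: 0}
prediction = apply_color_map(task["test"][0]["input"], mapping)
print(prediction)  # [[0, 2], [2, 0]]
```

The point of the format is that the transformation rule is never stated; it must be induced from the 2–5 demonstrations alone, which is why memorization-heavy systems fare poorly.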
Versions
ARC‑AGI‑1 (2019)
- Introduced in Chollet’s “On the Measure of Intelligence” paper[3].
- Contains hundreds of grid reasoning tasks, public and private splits.
- Human panel success ≈98%; leading AI systems reached ~55% accuracy by late 2024 using hybrid techniques like test-time adaptation and program synthesis.
ARC‑AGI‑2 (2025)
- Upgraded task set that resists brute-force search while preserving human solvability.
- Designed so all evaluation tasks are solved by at least two human participants in controlled conditions with at most two attempts.
- Pure LLMs score ~0%; advanced reasoning systems reach only single-digit percentages (~1–5%) with much higher cost per task compared to humans.
ARC‑AGI‑3 (planned 2026)
- An interactive reasoning benchmark in development, focusing on exploration, planning, memory, and multi-step action in novel environments[1].
In Summary
- ARC‑AGI frames intelligence as how quickly and efficiently new skills are learned, not just as task performance.
- Its design emphasizes fluid reasoning, requiring only minimal priors and resisting brute-force search and memorization.
- ARC‑AGI‑2 introduces calibrated difficulty and efficiency metrics, exposing significant gaps between human and AI performance.
- ARC‑AGI‑3 aims to extend the benchmark into interactive, adaptive environments for testing broader reasoning capabilities.
Footnotes
1. presentation slides - https://docs.google.com/presentation/d/1bf44la1Z_Ra_CGsd8KMj8VL18lr72mcCnUJ5DeneGoU/edit?usp=sharing
2. ARC-AGI-2 paper - https://arxiv.org/html/2505.11831v1
3. On the Measure of Intelligence - https://arxiv.org/abs/1911.01547