
*Image source: “ARC Prize Foundation @ AI Worlds Fair 2025” — Greg Kamradt*[1]
What Is ARC‑AGI?
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC‑AGI) is a benchmark launched in 2019 by François Chollet to evaluate fluid intelligence, defined as how efficiently a system learns novel tasks from minimal prior examples[2].
Definition of Intelligence
- John McCarthy (paraphrased): AI should solve tasks it hasn’t been pre-trained on.
- François Chollet: Intelligence = skill‑acquisition efficiency across task scope, priors, and generalization difficulty[1].
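Chollet's paper develops this definition into a full algorithmic-information-theoretic formalization; as an informal sketch (notation mine, not the paper's exact formula), it can be read as a ratio:

```math
\text{Intelligence} \propto \frac{\text{skill attained across a scope of tasks}}{\text{priors} + \text{experience}}
```

That is, a system that reaches the same skill with fewer built-in priors and less task-specific experience is counted as more intelligent, with tasks weighted by their generalization difficulty.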
Core Principles
- Fluid‑intelligence focus: Tasks rely only on innate “Core Knowledge” priors—spatial relations, grouping, transformations—free of cultural or language dependence.
- Easy for humans, hard for AI: Every task is solved reliably by ≥2 human testers in ≤2 attempts, while current AI systems score near zero.
- Few‑shot grid reasoning format: Each task provides a few (typically 2–5) demo input-output grid pairs with limited size and color palette; the solver must generalize to a new input.
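To make the few-shot format concrete, ARC‑AGI tasks are distributed as JSON with `train` and `test` keys, where each grid is a list of rows of color indices (0–9). The toy "solver" below handles only one trivial transformation class, a per-cell color substitution, purely to illustrate the learn-from-demos-then-generalize setup; real tasks require far richer transformations.

```python
# A hand-written task in the ARC-AGI JSON structure: a few demo pairs,
# then a test input whose output the solver must predict.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 0]]}],
}

def infer_color_map(pairs):
    """Learn a cell-wise color substitution consistent with every demo pair."""
    mapping = {}
    for pair in pairs:
        for row_in, row_out in zip(pair["input"], pair["output"]):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    return None  # demos are not a pure color substitution
    return mapping

def apply_color_map(grid, mapping):
    """Apply the learned substitution to a new grid."""
    return [[mapping[c] for c in row] for row in grid]

mapping = infer_color_map(task["train"])      # {1: 2, 0: 0}
prediction = apply_color_map(task["test"][0]["input"], mapping)
print(prediction)  # [[0, 2], [2, 0]]
```

The point of the format is that the transformation rule is never stated; it must be induced from the 2–5 demonstrations alone, which is why memorization-heavy systems fare poorly.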
Versions
ARC‑AGI‑1 (2019)
- Introduced in Chollet’s “On the Measure of Intelligence” paper[3].
- Contains hundreds of grid reasoning tasks, public and private splits.
- Human panel success ≈98%; leading AI systems reached ~55% accuracy by late 2024 using hybrid techniques like test-time adaptation and program synthesis.
ARC‑AGI‑2 (2025)
- Upgraded task set that resists brute-force search while preserving human solvability.
- Designed so all evaluation tasks are solved by at least two human participants in controlled conditions with at most two attempts.
- Pure LLMs score ~0%; advanced reasoning systems reach only single-digit percentages (~1–5%) with much higher cost per task compared to humans.
ARC‑AGI‑3 (planned 2026)
- An interactive reasoning benchmark in development, focusing on exploration, planning, memory, and multi-step action in novel environments[1].
In Summary
- ARC‑AGI frames intelligence as how quickly and efficiently new skills are learned, not just as task performance.
- Its design emphasizes fluid reasoning, requiring only minimal priors and resisting brute-force search and memorization.
- ARC‑AGI‑2 introduces calibrated difficulty and efficiency metrics, exposing significant gaps between human and AI performance.
- ARC‑AGI‑3 aims to extend the benchmark into interactive, adaptive environments for testing broader reasoning capabilities.
Footnotes
1. presentation slides - https://docs.google.com/presentation/d/1bf44la1Z_Ra_CGsd8KMj8VL18lr72mcCnUJ5DeneGoU/edit?usp=sharing
2. ARC-AGI-2 paper - https://arxiv.org/html/2505.11831v1
3. On the Measure of Intelligence - https://arxiv.org/abs/1911.01547