On February 19, 2026, Google released Gemini 3.1 Pro, and the benchmark headline that followed was unusual: a verified score of 77.1% on ARC-AGI-2, more than double the previous Gemini 3 Pro number on the same test. ARC-AGI-2 is a benchmark designed to resist memorization, so a jump that size is worth understanding before you accept the marketing.
What ARC-AGI-2 actually measures
ARC-AGI-2 is maintained by the ARC Prize organization and is the successor to the original ARC challenge. It is built around abstract reasoning puzzles that require inferring rules from a small number of examples — not from training data. Each task gives the model a handful of input-output grids and asks it to produce the output for a new input.
Three properties make it different from MMLU-style tests:
Novel pattern construction — the rules are not from any well-known textbook
Few-shot generalization — typically 2–5 examples, not 100
Visual grid format — encoded as colored cells, requires real spatial reasoning
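To make the grid format concrete, here is a minimal sketch of a task and an exact-match scorer. It assumes the JSON layout used by the public ARC datasets ("train" and "test" lists of input/output pairs, with each grid a list of rows of integers 0-9). The toy task, the helper names `is_correct` and `score_task`, and the idea of passing a list of attempts per test input are illustrative choices, not ARC Prize's official harness.

```python
import json

# A toy task in the JSON layout used by the public ARC datasets:
# "train" and "test" are lists of {"input": grid, "output": grid},
# and each grid is a list of rows of integers 0-9 (one integer per color).
# The task itself is invented for illustration: the rule is "mirror each row".
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
    {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]}
  ],
  "test": [
    {"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]}
  ]
}
"""

def is_correct(predicted, expected):
    """Exact match: the grid shape and every cell must agree."""
    return predicted == expected

def score_task(task, attempts_per_input):
    """Return the fraction of test inputs solved by any allowed attempt.

    attempts_per_input[i] is a list of candidate grids for test input i.
    (Hypothetical helper; the number of attempts is up to the caller.)
    """
    solved = 0
    for pair, attempts in zip(task["test"], attempts_per_input):
        if any(is_correct(a, pair["output"]) for a in attempts):
            solved += 1
    return solved / len(task["test"])

task = json.loads(TASK_JSON)
near_miss = [[0, 0, 5, 0]]   # one cell off: earns no partial credit
exact = [[0, 0, 0, 5]]
print(score_task(task, [[near_miss]]))         # 0.0
print(score_task(task, [[near_miss, exact]]))  # 1.0
```

The all-or-nothing comparison is the point of the sketch: a grid that is almost right scores the same as a grid that is entirely wrong, which is part of why partial pattern recall does not help much on this kind of test.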
The frontier-model field had stalled at 30–60% on ARC-AGI-2 for most of 2025 and early 2026, which made the 77.1% number a real outlier when it landed.
A 46-point intra-family gap (3 Pro to 3.1 Pro) is unusual. The most plausible explanations: a substantially different reasoning policy, a deeper inference-time search budget, or both [Inference, since Google has not published architecture details].
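As a quick sanity check on the two headline figures, the arithmetic below derives what a 46-point gap implies for the earlier model. The only inputs are the 77.1% score and the 46-point gap quoted in this article; the implied Gemini 3 Pro score is a derived estimate that assumes the 46-point figure is not heavily rounded.

```python
new_score = 77.1   # Gemini 3.1 Pro on ARC-AGI-2, as quoted in the article
gap_points = 46.0  # gap to Gemini 3 Pro, as quoted in the article

# Derived values (assumption: the 46-point figure is close to exact).
implied_old_score = new_score - gap_points
ratio = new_score / implied_old_score
relative_gain_pct = (ratio - 1) * 100

print(f"implied Gemini 3 Pro score: {implied_old_score:.1f}%")  # 31.1%
print(f"ratio: {ratio:.2f}x")                                   # 2.48x
print(f"relative gain: {relative_gain_pct:.0f}%")               # 148%
```

The ratio of roughly 2.5x is consistent with the "more than double" framing, and it places the earlier score near the bottom of the 30-60% band where the field had been sitting.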
Beyond ARC-AGI-2: the rest of the benchmark stack
ARC-AGI-2 is one data point. The fuller picture includes:
MMLU and MMLU-Pro — Gemini 3.1 Pro is at the top of the public frontier-model band, though the headline gap over GPT-5.5 and Opus 4.7 is small here.
Long-context retrieval — Gemini's 1M context performance has historically been strong on Google's needle-in-haystack evals; 3.1 Pro maintains that trajectory.
Coding benchmarks — Competitive but not category-leading; Opus 4.7 and GPT-5.5 trade the top SWE-Bench spots [Unverified, depends on which leaderboard snapshot].
What ARC-AGI-2 doesn't measure
Three things to keep in mind before you read 77.1% as "Gemini won AGI":
Code generation, refactoring, and debugging. ARC-AGI-2 has nothing to do with these. A model can be ARC-elite and middling on real engineering work.
Knowledge breadth. ARC-AGI-2 doesn't test how much the model knows; it tests how it reasons over given inputs.
Long-form coherence. Many tasks fit on a few grids; nothing about ARC-AGI-2 tells you whether the model holds context over a 10,000-word document.
ARC-AGI-2 is a strong signal for novel-pattern abstraction — figuring out what's going on from few examples. That maps well to some real workloads (rare-document classification, schema inference, edge-case reasoning) and not at all to others (writing tests for a known framework, summarizing meetings, drafting marketing copy).
How models close this kind of gap
A 46-point jump on ARC-AGI-2 in one minor version (3 Pro → 3.1 Pro) is consistent with a mix of:
Inference-time search. Many ARC-AGI-2 entries involve sampling many candidate solutions and verifying. More compute at test time, more correct answers.
Synthetic training data targeted at abstraction. [Speculation, but a common public-research pattern in 2026]
Improved verifier models. A small verifier that scores candidate grids tightens the search loop dramatically.
Without architectural details from Google we can't say which combination it is, but the shape of the gain (much bigger on novel-pattern reasoning than on knowledge benchmarks) is consistent with improvements dominated by inference-time search.
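The sample-and-verify loop described in this section can be sketched in a few lines. This is a toy illustration, not Google's method: the four candidate transformations, the uniform random sampler, and the names `verify`, `search`, and `solve_rate` are all hypothetical stand-ins. A real system would sample candidate programs or grids from a model and might use a learned verifier instead of an exact check against the training pairs.

```python
import random

# Hypothetical candidate transformations; a real system would sample
# programs or grids from a model rather than from a fixed list.
def identity(g):
    return [row[:] for row in g]

def mirror(g):
    return [row[::-1] for row in g]

def flip_vertical(g):
    return [row[:] for row in g[::-1]]

def transpose(g):
    return [list(col) for col in zip(*g)]

CANDIDATES = [identity, mirror, flip_vertical, transpose]

def verify(fn, train_pairs):
    """Verifier: accept a candidate only if it reproduces every training output."""
    return all(fn(inp) == out for inp, out in train_pairs)

def search(train_pairs, budget, rng):
    """Sample up to `budget` candidates; return the first one that verifies."""
    for _ in range(budget):
        fn = rng.choice(CANDIDATES)
        if verify(fn, train_pairs):
            return fn
    return None

def solve_rate(train_pairs, budget, trials=2000, seed=0):
    """Estimate how often a search with the given budget finds a valid rule."""
    rng = random.Random(seed)
    hits = sum(search(train_pairs, budget, rng) is not None for _ in range(trials))
    return hits / trials

# Toy task whose hidden rule is "transpose the grid".
train = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[5, 6, 7]], [[5], [6], [7]]),
]
for budget in (1, 2, 4, 8):
    print(budget, round(solve_rate(train, budget), 2))
```

With one valid rule among four equally likely candidates, the solve rate should track 1 - 0.75^budget, so it climbs from about 0.25 at a budget of 1 toward 0.90 at a budget of 8. The model's "knowledge" is unchanged across those runs; only the test-time compute differs, which is why this mechanism would lift a search-friendly benchmark far more than a knowledge benchmark.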
Who should care
If your workload is:
Schema inference, edge-case reasoning, novel pattern detection: Gemini 3.1 Pro looks like the strongest current option and is probably worth a head-to-head test against your own dataset.
Coding agents: ARC-AGI-2 doesn't tell you what you need to know. Run SWE-Bench-style evals against Opus 4.7 and GPT-5.5 too.
Knowledge retrieval and RAG: long-context behavior matters more than ARC-AGI-2; 3.1 Pro is competitive but not uniquely strong.
Customer-facing chat: latency and cost likely matter more than peak reasoning capability.
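A head-to-head test on your own data does not need much machinery. The sketch below assumes each model is wrapped as a plain function from prompt string to answer string and that answers can be scored by exact match; both are simplifying assumptions, and `evaluate`, `bake_off`, and the two stub models are hypothetical names that stand in for real API clients. It reports accuracy alongside mean latency, since the latter can matter more for chat-style workloads.

```python
import time
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (input, expected output)
Model = Callable[[str], str]   # hypothetical wrapper around a real API client

def evaluate(model: Model, dataset: List[Example]) -> Dict[str, float]:
    """Run one model over the dataset; report accuracy and mean latency."""
    correct = 0
    elapsed = 0.0
    for prompt, expected in dataset:
        start = time.perf_counter()
        answer = model(prompt)
        elapsed += time.perf_counter() - start
        correct += int(answer.strip() == expected)
    n = len(dataset)
    return {"accuracy": correct / n, "mean_latency_s": elapsed / n}

def bake_off(models: Dict[str, Model], dataset: List[Example]):
    """Evaluate every model on the same examples, best accuracy first."""
    results = {name: evaluate(fn, dataset) for name, fn in models.items()}
    return sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True)

# Stand-in models so the sketch runs offline; replace with real client calls.
def stub_upper(prompt: str) -> str:
    return prompt.upper()

def stub_echo(prompt: str) -> str:
    return prompt

# Your own held-out examples go here; these three are placeholders.
held_out = [("abc", "ABC"), ("Q1", "Q1"), ("xyz", "XYZ")]

ranking = bake_off({"model_a": stub_upper, "model_b": stub_echo}, held_out)
for name, metrics in ranking:
    print(name, metrics)
```

Exact match is the crudest possible scorer. For free-form outputs you would swap in a task-specific check, but the structure (same examples, same scorer, every candidate model) stays the same.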
The healthy skepticism
Two flags worth raising on any benchmark this far ahead of the field:
Verification methodology matters. ARC Prize did verify the score, which is the right safeguard. Internal-only numbers from any vendor deserve more skepticism.
Benchmark contamination. Even with held-out test sets, models trained on close synthetic distributions can post inflated numbers. The right test is your own held-out task, not a leaderboard.
The takeaway
Gemini 3.1 Pro's 77.1% on ARC-AGI-2 is real and verified, and it represents a meaningful jump on a benchmark designed to be hard for memorization. That makes it the strongest current frontier model for novel-pattern reasoning specifically. It does not make it the strongest model for code, for chat, or for everyday workloads — those are still workload-by-workload bake-offs. Treat the number as evidence of one capability dimension, not a global ranking.
Frequently Asked Questions
What is ARC-AGI-2 and why does it matter?
ARC-AGI-2 is a benchmark from the ARC Prize organization that tests abstract reasoning on novel logic patterns, encoded as colored grid puzzles. It matters because it measures few-shot rule inference rather than memorized knowledge, which makes it harder for a model to score well through training-set exposure alone.
Does Gemini 3.1 Pro's ARC-AGI-2 score mean it's better than Claude Opus 4.7 and GPT-5.5?
It means it's better at novel-pattern reasoning specifically. On coding benchmarks, Claude Opus 4.7 and GPT-5.5 generally lead, and the right model for any given workload depends on testing against your actual task, not a single benchmark.
Why did Gemini 3.1 Pro jump 46 points over Gemini 3 Pro on the same benchmark?
Google has not published architectural details, but the gain pattern is consistent with heavier inference-time search, synthetic training data targeting abstraction tasks, and improved verifier models. [Inference] No single explanation is confirmed.