Gemini 3.1 Pro Benchmarks Explained: ARC-AGI-2 and Beyond

CallMissed, May 8, 2026

On February 19, 2026, Google released Gemini 3.1 Pro, and the benchmark headline that followed was unusual: a verified score of 77.1% on ARC-AGI-2, more than double the previous Gemini 3 Pro number on the same test. ARC-AGI-2 is a benchmark designed to resist memorization, so a jump that size is worth understanding before you accept the marketing.

What ARC-AGI-2 actually measures

ARC-AGI-2 is maintained by the ARC Prize organization and is the successor to the original ARC challenge. It is built around abstract reasoning puzzles that require inferring rules from a small number of examples — not from training data. Each task gives the model a handful of input-output grids and asks it to produce the output for a new input.
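To make the task format concrete, here is a minimal sketch of how one such puzzle could be represented and scored. The `Task` class, the integer color codes, and the toy "mirror each row" rule are illustrative assumptions, not the official ARC Prize data schema or a real ARC-AGI-2 task. Scoring is assumed to be exact match on every cell.

```python
from dataclasses import dataclass

# Assumption for illustration: a grid is a list of rows, each cell an integer color code.
Grid = list[list[int]]


@dataclass
class Task:
    train: list[tuple[Grid, Grid]]  # demonstration (input, output) pairs
    test_input: Grid                # the new input the solver must transform


def flip_horizontal(grid: Grid) -> Grid:
    """Toy hidden rule for this example: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def is_correct(predicted: Grid, expected: Grid) -> bool:
    """Exact match: every cell of the predicted grid must equal the expected grid."""
    return predicted == expected


# Two demonstrations of the hidden rule, then one unseen input.
task = Task(
    train=[
        ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
        ([[2, 2, 0]], [[0, 2, 2]]),
    ],
    test_input=[[3, 0, 0], [0, 4, 0]],
)

prediction = flip_horizontal(task.test_input)
```

The solver never sees the rule itself. It has to infer it from the two demonstration pairs, which is why a partially right grid earns nothing under exact-match scoring.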

Three properties make it different from MMLU-style tests:

Novel pattern construction — the rules are not from any well-known textbook

Few-shot generalization — typically 2–5 examples, not 100

Visual grid format — encoded as colored cells, requires real spatial reasoning

The frontier-model field had stalled at 30–60% on ARC-AGI-2 for most of 2025 and early 2026, which made the 77.1% number a real outlier when it landed.

The verified leaderboard at launch

Per the ARC Prize verification and Google's blog post, the comparison at Gemini 3.1 Pro's launch looked like this:

Gemini 3.1 Pro: 77.1%

Claude Opus 4.6: 68.8%

GPT-5.2: 52.9%

Gemini 3 Pro (predecessor): 31.1%

A 46-point intra-family gap (3 Pro to 3.1 Pro) is unusual. The most plausible explanations: a substantially different reasoning policy, a deeper inference-time search budget, or both [Inference, since Google has not published architecture details].
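The arithmetic behind the gap figures can be checked directly from the four launch scores listed above. This is only a sanity check on the quoted numbers; the helper names are made up for this example.

```python
# Launch scores as quoted in this article (percent on ARC-AGI-2).
scores = {
    "Gemini 3.1 Pro": 77.1,
    "Claude Opus 4.6": 68.8,
    "GPT-5.2": 52.9,
    "Gemini 3 Pro": 31.1,
}

LEADER = "Gemini 3.1 Pro"


def gap_to_leader(name: str) -> float:
    """Percentage-point gap between the leader and the named model."""
    return round(scores[LEADER] - scores[name], 1)


def ratio_to_leader(name: str) -> float:
    """How many times higher the leader's score is than the named model's."""
    return round(scores[LEADER] / scores[name], 2)


for name in scores:
    if name != LEADER:
        print(f"{name}: {gap_to_leader(name)} points behind, "
              f"leader is {ratio_to_leader(name)}x")
```

The intra-family gap works out to 46.0 points, or roughly 2.48 times the Gemini 3 Pro score, which matches the "more than double" framing.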

Beyond ARC-AGI-2: the rest of the benchmark stack

ARC-AGI-2 is one data point. The fuller picture includes:

MMLU and MMLU-Pro — Gemini 3.1 Pro is at the top of the public frontier-model band, though the headline gap over GPT-5.5 and Opus 4.7 is small here.

Long-context retrieval — Gemini's 1M context performance has historically been strong on Google's needle-in-haystack evals; 3.1 Pro maintains that trajectory.

Coding benchmarks — Competitive but not category-leading; Opus 4.7 and GPT-5.5 trade the top SWE-Bench spots [Unverified, depends on which leaderboard snapshot].

What ARC-AGI-2 doesn't measure

Three things to keep in mind before you read 77.1% as "Gemini won AGI":

Code generation, refactoring, and debugging. ARC-AGI-2 has nothing to do with these. A model can be ARC-elite and middling on real engineering work.

Knowledge breadth. ARC-AGI-2 doesn't test how much the model knows; it tests how it reasons over given inputs.

Long-form coherence. Many tasks fit on a few grids; nothing about ARC-AGI-2 tells you whether the model holds context over a 10,000-word document.

ARC-AGI-2 is a strong signal for novel-pattern abstraction — figuring out what's going on from few examples. That maps well to some real workloads (rare-document classification, schema inference, edge-case reasoning) and not at all to others (writing tests for a known framework, summarizing meetings, drafting marketing copy).

How models close this kind of gap

A 46-point jump on ARC-AGI-2 in one minor version (3 Pro → 3.1 Pro) is consistent with a mix of:

Inference-time search. Many ARC-AGI-2 entries involve sampling many candidate solutions and verifying. More compute at test time, more correct answers.

Synthetic training data targeted at abstraction. [Speculation, but a common public-research pattern in 2026]

Improved verifier models. A small verifier that scores candidate grids tightens the search loop dramatically.

Without architectural details from Google, we can't say which combination it is. But the shape of the gain (much bigger on novel-pattern reasoning than on knowledge benchmarks) is consistent with improvements dominated by inference-time search.
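The sample-and-verify idea described above can be illustrated with a toy loop. This is a sketch under heavy assumptions: the candidate "programs" are four hand-written grid transforms, the verifier simply checks each candidate against the demonstration pairs, and `budget` stands in for test-time compute. It is not a description of how Gemini 3.1 Pro or any ARC-AGI-2 entry actually works.

```python
import random
from typing import Callable, Optional

Grid = list[list[int]]
Transform = Callable[[Grid], Grid]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical candidate pool; a real system would generate candidates with a model.
CANDIDATES: list[Transform] = [identity, flip_h, flip_v, transpose]


def verify(candidate: Transform, train: list[tuple[Grid, Grid]]) -> bool:
    """Accept a candidate only if it reproduces every demonstration pair."""
    return all(candidate(x) == y for x, y in train)


def search(train: list[tuple[Grid, Grid]], test_input: Grid,
           budget: int, rng: random.Random) -> Optional[Grid]:
    """Try up to `budget` distinct candidates in random order; return the first verified answer."""
    tried = rng.sample(CANDIDATES, k=min(budget, len(CANDIDATES)))
    for candidate in tried:
        if verify(candidate, train):
            return candidate(test_input)
    return None  # budget exhausted without a verified candidate


train = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]  # only transpose fits this pair
test_input = [[5, 6, 7]]

answer = search(train, test_input, budget=4, rng=random.Random(0))
```

With a budget of 0 the search returns nothing, and with a budget covering the whole pool it is guaranteed to find the matching transform. That is the basic reason more test-time compute can translate into more correct answers when a reliable verifier exists.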

Who should care

If your workload is:

Schema inference, edge-case reasoning, novel pattern detection — Gemini 3.1 Pro looks like the strongest current option and probably worth a head-to-head test against your own dataset.

Coding agents — ARC-AGI-2 doesn't tell you what you need to know. Run SWE-Bench-style evals against Opus 4.7 and GPT-5.5 too.

Knowledge retrieval and RAG — long-context behavior matters more than ARC-AGI-2; 3.1 Pro is competitive but not uniquely strong.

Any customer-facing chat — latency and cost likely matter more than peak reasoning capability.

The healthy skepticism

Two flags worth raising on any benchmark this far ahead of the field:

Verification methodology matters. ARC Prize did verify the score, which is the right safeguard. Internal-only numbers from any vendor deserve more skepticism.

Benchmark contamination. Even with held-out test sets, models trained on close synthetic distributions can post inflated numbers. The right test is your own held-out task, not a leaderboard.
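A held-out comparison on your own tasks does not need much machinery. The sketch below assumes each model can be wrapped as a function from prompt string to answer string and that exact string match is an adequate score for your task. The stub models and the tiny task list are placeholders; in practice you would swap in your own API clients and data.

```python
from typing import Callable

Model = Callable[[str], str]        # hypothetical wrapper: prompt in, answer out
TaskSet = list[tuple[str, str]]     # (prompt, expected answer) pairs you hold out yourself


def accuracy(model: Model, tasks: TaskSet) -> float:
    """Fraction of tasks where the model's answer exactly matches the expected one."""
    if not tasks:
        raise ValueError("need at least one held-out task")
    correct = sum(model(prompt).strip() == expected.strip()
                  for prompt, expected in tasks)
    return correct / len(tasks)


def compare(models: dict[str, Model], tasks: TaskSet) -> dict[str, float]:
    """Score every model on the same task set, best first."""
    results = {name: accuracy(model, tasks) for name, model in models.items()}
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))


# Placeholder models and data so the example runs without any external service.
def stub_a(prompt: str) -> str:
    return "A"


def stub_b(prompt: str) -> str:
    return "B"


held_out = [("q1", "A"), ("q2", "B"), ("q3", "A"), ("q4", "A")]
ranking = compare({"stub_b": stub_b, "stub_a": stub_a}, held_out)
```

Keeping the task set private and fixed across models is the point: the same prompts and the same scoring rule go to every candidate, so the comparison does not depend on anyone's published leaderboard.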

The takeaway

Gemini 3.1 Pro's 77.1% on ARC-AGI-2 is real and verified, and it represents a meaningful jump on a benchmark designed to be hard for memorization. That makes it the strongest current frontier model for novel-pattern reasoning specifically. It does not make it the strongest model for code, for chat, or for everyday workloads — those are still workload-by-workload bake-offs. Treat the number as evidence of one capability dimension, not a global ranking.

Frequently Asked Questions

What is ARC-AGI-2 and why does it matter?

ARC-AGI-2 is a benchmark from the ARC Prize organization that tests abstract reasoning on entirely novel logic patterns, encoded as colored grid puzzles. It matters because it specifically measures few-shot rule inference rather than memorized knowledge — making it harder for models to game with training-set exposure.

Does Gemini 3.1 Pro's ARC-AGI-2 score mean it's better than Claude Opus 4.7 and GPT-5.5?

It means it led on novel-pattern reasoning specifically, and the verified launch comparison was against Claude Opus 4.6 and GPT-5.2, not the newer releases. On coding benchmarks, Claude Opus 4.7 and GPT-5.5 generally lead, and the right model for any given workload depends on testing against your actual task, not a single benchmark.

Why did Gemini 3.1 Pro jump 46 points over Gemini 3 Pro on the same benchmark?

Google has not published architectural details, but the gain pattern is consistent with heavier inference-time search, synthetic training data targeting abstraction tasks, and improved verifier models. [Inference] No single explanation is confirmed.