Gemini 3.1 Pro Benchmarks Explained: ARC-AGI-2 and Beyond

CallMissed

On February 19, 2026, Google released Gemini 3.1 Pro, and the benchmark headline that followed was unusual: a verified score of 77.1% on ARC-AGI-2, more than double the previous Gemini 3 Pro number on the same test. ARC-AGI-2 is designed to resist memorization, so a jump that size is worth understanding before you accept the marketing.

What ARC-AGI-2 actually measures

ARC-AGI-2 is maintained by the ARC Prize organization and is the successor to the original ARC challenge. It is built around abstract reasoning puzzles that require inferring rules from a small number of examples — not from training data. Each task gives the model a handful of input-output grids and asks it to produce the output for a new input.
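
To make the task format concrete, here is a minimal Python sketch of how one such task could be represented and scored. The integer-for-color encoding, the toy mirror rule, and the names `Task`, `mirror`, and `exact_match` are illustrative assumptions, not the official ARC Prize data format or scoring code.

```python
from dataclasses import dataclass

Grid = list[list[int]]  # assumption: each cell is an integer standing for a color


@dataclass
class Task:
    train: list[tuple[Grid, Grid]]  # the handful of demonstration (input, output) pairs
    test_input: Grid                # the new input the solver must transform
    test_output: Grid               # hidden from the solver, used only for scoring


def mirror(grid: Grid) -> Grid:
    """Toy rule for illustration: flip each row left to right."""
    return [list(reversed(row)) for row in grid]


def exact_match(predicted: Grid, expected: Grid) -> bool:
    """Assumed scoring: a prediction counts only if every cell matches."""
    return predicted == expected


task = Task(
    train=[
        ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
        ([[3, 3, 0]], [[0, 3, 3]]),
    ],
    test_input=[[0, 5, 5], [7, 0, 0]],
    test_output=[[5, 5, 0], [0, 0, 7]],
)

# A solver has to infer the rule from task.train alone, then apply it to the test input.
prediction = mirror(task.test_input)
print(exact_match(prediction, task.test_output))
```

The point of the sketch is the shape of the problem: the solver sees only the demonstration pairs, and partial credit for a nearly right grid is not assumed here.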

Three properties make it different from MMLU-style tests:

  • Novel pattern construction: the rules are not from any well-known textbook
  • Few-shot generalization: typically 2–5 examples, not 100
  • Visual grid format: tasks are encoded as colored cells and require real spatial reasoning

The frontier-model field had stalled at 30–60% on ARC-AGI-2 for most of 2025 and early 2026, which made the 77.1% number a real outlier when it landed.

The verified leaderboard at launch

Per the ARC Prize verification and Google's blog post, the comparison at Gemini 3.1 Pro's launch looked like this:

  • Gemini 3.1 Pro: 77.1%
  • Claude Opus 4.6: 68.8%
  • GPT-5.2: 52.9%
  • Gemini 3 Pro (predecessor): 31.1%

A 46-point intra-family gap (3 Pro to 3.1 Pro) is unusual. The most plausible explanations: a substantially different reasoning policy, a deeper inference-time search budget, or both [Inference, since Google has not published architecture details].
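
The gaps are easy to recompute from the launch scores listed above. This short Python sketch is only arithmetic on those four numbers; the variable names are arbitrary.

```python
# Scores as listed above (percent of ARC-AGI-2 tasks solved).
scores = {
    "Gemini 3.1 Pro": 77.1,
    "Claude Opus 4.6": 68.8,
    "GPT-5.2": 52.9,
    "Gemini 3 Pro": 31.1,
}

leader = "Gemini 3.1 Pro"
for model, score in scores.items():
    if model == leader:
        continue
    gap = scores[leader] - score    # difference in percentage points
    ratio = scores[leader] / score  # multiplicative improvement
    print(f"{leader} vs {model}: +{gap:.1f} points, {ratio:.2f}x")
```

Against its predecessor the gap works out to 46.0 points, or roughly 2.5 times the earlier score, while the lead over Claude Opus 4.6 is 8.3 points.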

Beyond ARC-AGI-2: the rest of the benchmark stack

ARC-AGI-2 is one data point. The fuller picture includes:

  • MMLU and MMLU-Pro — Gemini 3.1 Pro is at the top of the public frontier-model band, though the headline gap over GPT-5.5 and Opus 4.7 is small here.
  • Long-context retrieval — Gemini's 1M context performance has historically been strong on Google's needle-in-haystack evals; 3.1 Pro maintains that trajectory.
  • Coding benchmarks — Competitive but not category-leading; Opus 4.7 and GPT-5.5 trade the top SWE-Bench spots [Unverified, depends on which leaderboard snapshot].

What ARC-AGI-2 doesn't measure

Three things to keep in mind before you read 77.1% as "Gemini won AGI":

  • Code generation, refactoring, and debugging. ARC-AGI-2 has nothing to do with these. A model can be ARC-elite and middling on real engineering work.
  • Knowledge breadth. ARC-AGI-2 doesn't test how much the model knows; it tests how it reasons over given inputs.
  • Long-form coherence. Many tasks fit on a few grids; nothing about ARC-AGI-2 tells you whether the model holds context over a 10,000-word document.

ARC-AGI-2 is a strong signal for novel-pattern abstraction: figuring out what's going on from few examples. That maps well to some real workloads (rare-document classification, schema inference, edge-case reasoning) and not at all to others (writing tests for a known framework, summarizing meetings, drafting marketing copy).

How models close this kind of gap

A 46-point jump on ARC-AGI-2 in one minor version (3 Pro → 3.1 Pro) is consistent with a mix of:

  • Inference-time search. Many ARC-AGI-2 entries involve sampling many candidate solutions and verifying. More compute at test time, more correct answers.
  • Synthetic training data targeted at abstraction. [Speculation, but a common public-research pattern in 2026]
  • Improved verifier models. A small verifier that scores candidate grids tightens the search loop dramatically.

Without architectural details from Google we can't say which combination it is, but the shape of the gain (much bigger on novel-pattern reasoning than on knowledge benchmarks) is what you would expect from improvements dominated by inference-time search.
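
As a rough illustration of the sample-and-verify idea, the Python sketch below draws random candidate rules and keeps the first one that reproduces every demonstration pair. Everything in it is a hypothetical stand-in: the three primitive transforms, the random sampler (in place of a model proposing candidates), and the `search` function. It is not a description of how Gemini 3.1 Pro or any ARC Prize entry works.

```python
import random
from typing import Callable, Optional

Grid = list[list[int]]
Rule = Callable[[Grid], Grid]


# Hypothetical primitive rules; a real system would search a far richer space.
def flip_h(g: Grid) -> Grid:
    return [list(reversed(row)) for row in g]


def flip_v(g: Grid) -> Grid:
    return [list(row) for row in reversed(g)]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


PRIMITIVES: list[Rule] = [flip_h, flip_v, transpose]


def sample_candidate(rng: random.Random, max_depth: int = 3) -> list[Rule]:
    """Draw a random chain of primitives (stand-in for sampling from a model)."""
    depth = rng.randint(1, max_depth)
    return [rng.choice(PRIMITIVES) for _ in range(depth)]


def apply_chain(chain: list[Rule], grid: Grid) -> Grid:
    for rule in chain:
        grid = rule(grid)
    return grid


def verifies(chain: list[Rule], train: list[tuple[Grid, Grid]]) -> bool:
    """Verifier: accept a candidate only if it reproduces every demonstration."""
    return all(apply_chain(chain, x) == y for x, y in train)


def search(train: list[tuple[Grid, Grid]], test_input: Grid,
           budget: int, seed: int = 0) -> Optional[Grid]:
    """Sample up to `budget` candidates; return the first verified prediction."""
    rng = random.Random(seed)
    for _ in range(budget):
        chain = sample_candidate(rng)
        if verifies(chain, train):
            return apply_chain(chain, test_input)
    return None  # budget exhausted without a verified candidate


# Demonstrations of a 90-degree clockwise rotation (flip_v followed by transpose).
train = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0]], [[5], [0], [0]]),
]
print(search(train, [[1, 0], [0, 2]], budget=200))
```

The `budget` parameter is the test-time compute knob in this toy: a larger budget gives the sampler more chances to hit a chain that the verifier accepts, and a budget of zero always returns `None`.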

Who should care

If your workload is:

  • Schema inference, edge-case reasoning, novel pattern detection: Gemini 3.1 Pro looks like the strongest current option and is probably worth a head-to-head test against your own dataset.
  • Coding agents: ARC-AGI-2 doesn't tell you what you need to know. Run SWE-Bench-style evals against Opus 4.7 and GPT-5.5 too.
  • Knowledge retrieval and RAG: long-context behavior matters more than ARC-AGI-2; 3.1 Pro is competitive but not uniquely strong.
  • Any customer-facing chat: latency and cost likely matter more than peak reasoning capability.

The healthy skepticism

Two flags worth raising on any benchmark this far ahead of the field:

  • Verification methodology matters. ARC Prize did verify the score, which is the right safeguard. Internal-only numbers from any vendor deserve more skepticism.
  • Benchmark contamination. Even with held-out test sets, models trained on close synthetic distributions can post inflated numbers. The right test is your own held-out task, not a leaderboard.
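
Testing on your own held-out task can start very small. The Python sketch below assumes exact-match scoring and uses two toy functions, `model_a` and `model_b`, as hypothetical stand-ins for real model calls; the three test cases are placeholders for your own data.

```python
from typing import Callable


# Hypothetical stand-ins for real model calls; swap in your own client code.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt.strip().upper()


# Placeholder (input, expected output) pairs; use cases you have never published.
held_out = [
    ("abc", "ABC"),
    (" de ", "DE"),
    ("f", "F"),
]


def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's output exactly matches the expected one."""
    correct = sum(model(x) == y for x, y in cases)
    return correct / len(cases)


for name, model in {"model_a": model_a, "model_b": model_b}.items():
    print(f"{name}: {accuracy(model, held_out):.2f}")
```

Exact match is a deliberately strict assumption. For free-text outputs you would likely need a looser comparison, but the structure stays the same: fixed private cases, one scoring function, every candidate model run through it.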

The takeaway

Gemini 3.1 Pro's 77.1% on ARC-AGI-2 is real and verified, and it represents a meaningful jump on a benchmark designed to be hard for memorization. That makes it the strongest current frontier model for novel-pattern reasoning specifically. It does not make it the strongest model for code, for chat, or for everyday workloads; those are still workload-by-workload bake-offs. Treat the number as evidence of one capability dimension, not a global ranking.

Frequently Asked Questions

What is ARC-AGI-2 and why does it matter?
ARC-AGI-2 is a benchmark from the ARC Prize organization that tests abstract reasoning on entirely novel logic patterns, encoded as colored grid puzzles. It matters because it specifically measures few-shot rule inference rather than memorized knowledge, which makes it harder for models to game with training-set exposure.

Does Gemini 3.1 Pro's ARC-AGI-2 score mean it's better than Claude Opus 4.7 and GPT-5.5?
It means it's better at novel-pattern reasoning specifically. On coding benchmarks, Claude Opus 4.7 and GPT-5.5 generally lead, and the right model for any given workload depends on testing against your actual task, not a single benchmark.

Why did Gemini 3.1 Pro jump 46 points over Gemini 3 Pro on the same benchmark?
Google has not published architectural details, but the gain pattern is consistent with heavier inference-time search, synthetic training data targeting abstraction tasks, and improved verifier models. [Inference] No single explanation is confirmed.
