Experimental research benchmark. Scores are preliminary and based on internal evaluations.

MillenniumPrizeProblemBench
Stress-testing frontier AI on the hardest math we know.

A benchmark that measures model performance on structured problem-solving pipelines (proof search, conjecture generation, formal verification, and research-grade reasoning) by tracking progress on tasks inspired by the seven Millennium Prize Problems.

10 frontier models spanning the GPT, Claude, Gemini, Grok, and DeepSeek families.
7 tracks: one for each Millennium Prize Problem.
Pass / Fail only: no partial credit on any track.

Status: no current model passes a single Millennium-inspired track.
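As a rough illustration of this strict pass/fail scheme, here is a minimal scoring sketch in Python. It is hypothetical (the function names and data shapes are invented for illustration), not the benchmark's actual harness:

    def track_passed(task_verdicts: list[bool]) -> bool:
        """A track is passed only if every one of its tasks is passed."""
        return bool(task_verdicts) and all(task_verdicts)

    def leaderboard_score(tracks: dict[str, list[bool]]) -> str:
        """Aggregate pass/fail across tracks into an 'n / 7'-style score."""
        passed = sum(track_passed(verdicts) for verdicts in tracks.values())
        return f"{passed} / {len(tracks)}"

    # No partial credit: one failed task fails the whole track.
    print(leaderboard_score({"P vs NP": [True, False], "Riemann": [False]}))
    # -> "0 / 2"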

Tracks (inspired by the Millennium Problems)

Each track below gives a short description of the underlying Millennium Prize Problem and how it informs the benchmark.

P vs NP
Problem: Decide whether every problem whose solution can be verified quickly (NP) can also be solved quickly (P), or give a proof that P ≠ NP.
Benchmark: tasks center on structured reductions, proof sketches, and complexity reasoning without claiming to resolve P vs NP.
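To make the "verified quickly" notion concrete: a proposed satisfying assignment for a Boolean formula can be checked in time linear in the formula's size, which is exactly what places SAT in NP. A minimal sketch (illustrative only, not a benchmark task):

    # CNF formula in DIMACS-style clauses: the literal 3 means "x3 is true",
    # -3 means "x3 is false". Each clause is a disjunction of literals.

    def verify_sat_certificate(cnf: list[list[int]],
                               assignment: dict[int, bool]) -> bool:
        """Check a SAT certificate in O(total literals), i.e. polynomial time."""
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in cnf
        )

    # (x1 OR NOT x2) AND (x2 OR x3)
    cnf = [[1, -2], [2, 3]]
    print(verify_sat_certificate(cnf, {1: True, 2: True, 3: False}))  # True

The asymmetry driving P vs NP is that no comparably fast procedure is known for finding such an assignment in general.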
Riemann Hypothesis
Problem: Show that all nontrivial zeros of the Riemann zeta function lie on the critical line Re(s) = 1/2.
Benchmark: synthetic tasks in analytic number theory, conjecture mining, and reasoning about zero distributions & L-functions.
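For reference, the standard formulation: for \operatorname{Re}(s) > 1,

\[
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}} = \prod_{p \ \text{prime}} \frac{1}{1 - p^{-s}},
\]

extended to the rest of the complex plane (except s = 1) by analytic continuation. The hypothesis asserts that every nontrivial zero \rho satisfies \operatorname{Re}(\rho) = 1/2.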
Yang–Mills / Mass Gap
Problem: Construct a quantum Yang–Mills theory on four-dimensional spacetime and prove the existence of a positive mass gap.
Benchmark: PDE and field-theory surrogates that test reasoning about gauge symmetries, energy bounds, and toy mass-gap arguments.
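In standard terms, the problem asks for a mathematically rigorous quantum Yang–Mills theory on \mathbb{R}^4 whose Hamiltonian H has a spectral gap above the vacuum:

\[
\operatorname{spec}(H) \subset \{0\} \cup [\Delta, \infty) \quad \text{for some constant } \Delta > 0,
\]

so that the lightest excitation has strictly positive mass \Delta.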
Navier–Stokes
Problem: Prove or disprove global existence and smoothness for solutions of the 3D incompressible Navier–Stokes equations with smooth initial data.
Benchmark: toy fluid-dynamics PDE tasks about blow-up, regularity heuristics, and simplified existence arguments.
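For reference, the equations for velocity field u(x, t) and pressure p(x, t) on \mathbb{R}^3:

\[
\partial_t u + (u \cdot \nabla) u = -\nabla p + \nu \Delta u, \qquad \nabla \cdot u = 0,
\]

with viscosity \nu > 0 and smooth, divergence-free initial data u(\cdot, 0) = u_0. The problem asks whether smooth solutions exist for all time or can blow up in finite time.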
Birch & Swinnerton-Dyer
Problem: Relate the arithmetic of an elliptic curve (its rank) to the order of vanishing of its L-function at s = 1.
Benchmark: tasks over elliptic curves, rational points, and L-function heuristics that mirror some of the structure of BSD.
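In its weak form, the conjecture equates the analytic and algebraic ranks of an elliptic curve E over \mathbb{Q}:

\[
\operatorname{ord}_{s=1} L(E, s) = \operatorname{rank} E(\mathbb{Q}).
\]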
Hodge Conjecture
Problem: Determine whether certain cohomology classes on projective algebraic varieties are algebraic cycles.
Benchmark: synthetic tasks in cohomology, curvature, and geometric intuition designed to echo the flavor of Hodge-theoretic arguments.
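The standard statement: on a smooth complex projective variety X, every rational Hodge class is algebraic,

\[
H^{2k}(X, \mathbb{Q}) \cap H^{k,k}(X) = \operatorname{span}_{\mathbb{Q}}\bigl\{ [Z] : Z \text{ an algebraic cycle of codimension } k \text{ in } X \bigr\}.
\]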
Topological Surrogates
Problem: inspired by the (now resolved) Poincaré conjecture on three-dimensional manifolds, this track stands in for deep open problems in topology.
Benchmark: toy 3-manifold and homotopy-style tasks that stress high-level geometric and topological reasoning.
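For reference, the resolved Poincaré conjecture states that every closed, simply connected three-dimensional manifold M is homeomorphic to the 3-sphere:

\[
\pi_1(M) = 1 \;\Longrightarrow\; M \cong S^3.
\]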

Model leaderboard

Aggregate pass / fail by model across the seven Millennium-inspired tracks. Currently, no model achieves a pass on any track.

Per-track results (P vs NP, Riemann, Yang–Mills, Navier–Stokes, BSD, Hodge, Topo): Fail for every model on every track.

Model | Provider | Release | Tracks passed | Notes
Gemini 3 Pro | Google DeepMind | TBD | 0 / 7 | Strong multimodal and coding performance, but still fails to produce verifiable, proof-level solutions on any Millennium-inspired track.
GPT-5 | OpenAI | TBD | 0 / 7 | Excels at general-purpose reasoning and code, but does not meet the bar for rigor on MillenniumPrizeProblemBench’s proof-style tasks.
Grok 4 | xAI | TBD | 0 / 7 | Strong performance on many web-style tasks, but lacks the stability and formal verification needed to pass any Millennium-style track.
Gemini 2.5 Pro | Google DeepMind | TBD | 0 / 7 | Competitive math and coding abilities, but proofs and counterexamples remain brittle across all seven tracks.
GPT-5-mini | OpenAI | TBD | 0 / 7 | Optimized for efficiency rather than deep theorem proving; fails all tracks under strict evaluation.
Claude 4.5 Sonnet | Anthropic | TBD | 0 / 7 | Very strong at natural-language reasoning and self-critique, but still unable to consistently satisfy formal proof requirements on any track.
Gemini 2.5 Flash | Google DeepMind | TBD | 0 / 7 | Latency-optimized; good for quick answers, but far from the rigor required to pass any track.
DeepSeek-R1 | DeepSeek | TBD | 0 / 7 | Promising long-context reasoning, but still fails strict verification pipelines on all Millennium-inspired tasks.
o1 | OpenAI | TBD | 0 / 7 | Strong deliberate reasoning on many benchmarks, but does not reliably produce fully checked proofs on any track.
GPT-4o | OpenAI | 2024 | 0 / 7 | Versatile and fast, but consistently fails the strict pass criteria on the synthetic problem suites.
Methodology: results indicate whether models consistently pass or fail synthetic task suites that mimic structural aspects of the Millennium Prize Problems (e.g., formal proof steps, conjecture search, counterexample discovery, and self-critique). Tasks are evaluated via automated checkers and human expert review. No model here is claimed to have solved any genuine Millennium Prize Problem.
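As a hypothetical sketch of the task-level verdict just described (names and structure are invented for illustration): a task counts as passed only if every automated checker accepts the model's output and the human expert review signs off.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class TaskResult:
        output: str             # candidate proof, conjecture, or counterexample
        expert_approved: bool   # outcome of human expert review

    def task_verdict(result: TaskResult,
                     checkers: List[Callable[[str], bool]]) -> bool:
        """Strict conjunction: every automated check and the expert review must pass."""
        return (all(check(result.output) for check in checkers)
                and result.expert_approved)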

Discussion

MillenniumPrizeProblemBench is designed to track how quickly frontier models close the gap between today’s capabilities and the level of reasoning suggested by the Millennium Prize Problems.

Future model performance

While current models fail every MillenniumPrizeProblemBench track, recent benchmark history suggests that performance can improve rapidly once a capability becomes an optimization target. It is plausible that future systems will eventually achieve non-trivial pass rates on synthetic tasks that mirror aspects of the Millennium Problems. Passing multiple tracks would indicate strong performance on closed-ended, verifiable mathematical reasoning, but it would not by itself imply autonomous research capabilities or “artificial general intelligence”. MillenniumPrizeProblemBench focuses on structured proof-style problems rather than open-ended research or creative discovery, making it a targeted measure of technical reasoning under strict verification.

Impact

By providing a clear, pass-or-fail view of progress on Millennium-inspired tasks, MillenniumPrizeProblemBench offers a common reference point for researchers, labs, and policymakers when assessing model capabilities. This can support more grounded discussions about development trajectories, potential risks, and appropriate governance measures. Even if no model comes close to resolving the true Millennium Problems, tracking performance on structurally similar benchmarks helps clarify where today’s systems excel, where they still break, and which kinds of mathematical reasoning remain firmly out of reach.

Contact & external results

Running your own MillenniumPrizeProblemBench-style evaluation? Found an issue, or have updated numbers for a model? Please get in touch and share your results.