MillenniumPrizeProblemBench
Stress-testing frontier AI on the hardest math we know.
MillenniumPrizeProblemBench is designed to track how quickly frontier models close the gap between today’s capabilities and the level of reasoning suggested by the Millennium Prize Problems. It measures model performance on tasks inspired by the seven Millennium Prize Problems, exercising structured problem-solving pipelines: proof search, conjecture generation, formal verification, and research-grade reasoning.
Status: no current model passes a single Millennium-inspired track.
Each track is paired with a short description of how the underlying Millennium Prize Problem informs the benchmark; a brief, illustrative grading sketch follows the list.

- P vs NP: tasks center on structured reductions, proof sketches, and complexity reasoning, without claiming to resolve P vs NP.
- Riemann: synthetic tasks in analytic number theory, conjecture mining, and reasoning about zero distributions and L-functions.
- Yang–Mills: PDE and field-theory surrogates that test reasoning about gauge symmetries, energy bounds, and toy mass-gap arguments.
- Navier–Stokes: toy fluid-dynamics PDE tasks on blow-up, regularity heuristics, and simplified existence arguments.
- BSD (Birch and Swinnerton-Dyer): tasks over elliptic curves, rational points, and L-function heuristics that mirror some of the structure of BSD.
- Hodge: synthetic tasks in cohomology, curvature, and geometric intuition designed to echo the flavor of Hodge-theoretic arguments.
- Topo (Poincaré): toy 3-manifold and homotopy-style tasks that stress high-level geometric and topological reasoning.
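As a rough illustration of what strict, verifier-gated grading could look like at the task level, here is a minimal, hypothetical sketch. The `Task` record, `grade_task` function, and checker hook are assumptions made for this example, not MillenniumPrizeProblemBench’s actual tooling.

```python
# Hypothetical illustration only: MillenniumPrizeProblemBench does not publish
# its harness, and none of these names come from the benchmark itself.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    track: str                      # e.g. "P vs NP", "Riemann", ...
    prompt: str                     # synthetic problem statement shown to the model
    checker: Callable[[str], bool]  # automated/formal verifier for a candidate answer


def grade_task(task: Task, model_answer: str) -> bool:
    """A task passes only if the verifier accepts the answer; there is no partial credit."""
    try:
        return task.checker(model_answer)
    except Exception:
        # Under strict evaluation, any verifier error counts as a failure.
        return False


# Toy example: a reduction task whose checker merely looks for a certificate section.
toy = Task(
    track="P vs NP",
    prompt="Reduce 3-SAT to CLIQUE and exhibit the certificate mapping.",
    checker=lambda answer: "certificate:" in answer.lower(),
)
print(grade_task(toy, "High-level sketch only, no certificate given."))  # False
```

The point of the sketch is the all-or-nothing gate: a plausible-sounding proof sketch that the verifier cannot check scores exactly the same as no answer at all.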
Model leaderboard
Aggregate pass / fail by model across the seven Millennium-inspired tracks. Currently, no model achieves a pass on any track.
| Model | Provider | Release | P vs NP | Riemann | Yang–Mills | Navier–Stokes | BSD | Hodge | Topo | Summary | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | Google DeepMind | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Strong multimodal and coding performance, but still fails to produce verifiable, proof-level solutions on any Millennium-inspired track. |
| GPT-5 | OpenAI | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Excels at general-purpose reasoning and code, but does not meet the bar for rigor on MillenniumPrizeProblemBench’s proof-style tasks. |
| Grok 4 | xAI | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Strong performance on many web-style tasks, but lacks the stability and formal verification needed to pass any Millennium-style track. |
| Gemini 2.5 Pro | Google DeepMind | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Competitive math and coding abilities, but proofs and counterexamples remain brittle across all seven Millennium-inspired benchmarks. |
| GPT-5-mini | OpenAI | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Optimized for efficiency rather than deep theorem-proving; fails all MillenniumPrizeProblemBench tracks under strict evaluation. |
| Claude 4.5 Sonnet | Anthropic | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Very strong at natural language reasoning and self-critique, but still unable to consistently satisfy formal proof requirements on any track. |
| Gemini 2.5 Flash | Google DeepMind | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Latency-optimized model: good for quick answers, but far from the rigor required to pass any MillenniumPrizeProblemBench track. |
| DeepSeek-R1 | DeepSeek | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Shows promising long-context reasoning, but still fails strict verification pipelines on all Millennium-inspired tasks. |
| o1 | OpenAI | TBD | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Strong deliberate reasoning on many benchmarks, but does not reliably produce fully checked proofs on any MillenniumPrizeProblemBench track. |
| GPT-4o | OpenAI | 2024 | Fail | Fail | Fail | Fail | Fail | Fail | Fail | 0 / 7 | Versatile and fast, but consistently fails the strict pass criteria on MillenniumPrizeProblemBench’s synthetic problem suites. |
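To make the leaderboard’s aggregation concrete, the sketch below shows one way per-track pass/fail results could roll up into the “0 / 7” summary reported above. The `track_passes` and `summarize` helpers are hypothetical, introduced only for illustration; they are not MillenniumPrizeProblemBench’s published tooling.

```python
# Hypothetical aggregation sketch: a track counts as a pass only if every one of
# its tasks passes strict verification, and the summary column is "passes / 7".
TRACKS = ["P vs NP", "Riemann", "Yang-Mills", "Navier-Stokes", "BSD", "Hodge", "Topo"]


def track_passes(task_results: list[bool]) -> bool:
    """A track passes only when it has results and every task passed."""
    return bool(task_results) and all(task_results)


def summarize(per_track_results: dict[str, list[bool]]) -> str:
    """Return the leaderboard summary string, e.g. '0 / 7'."""
    passes = sum(track_passes(per_track_results.get(track, [])) for track in TRACKS)
    return f"{passes} / {len(TRACKS)}"


# Example: a model that fails at least one task in every track scores 0 / 7.
results = {track: [True, True, False] for track in TRACKS}
print(summarize(results))  # -> 0 / 7
```

Under this all-or-nothing rule, partial progress within a track never shows up in the summary, which matches the strict pass-or-fail framing used throughout the leaderboard.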
Discussion
Future model performance
While current models fail every MillenniumPrizeProblemBench track, recent benchmark history suggests that performance can improve rapidly once a capability becomes an optimization target. It is plausible that future systems will eventually achieve non-trivial pass rates on synthetic tasks that mirror aspects of the Millennium Problems. Passing multiple tracks would indicate strong performance on closed-ended, verifiable mathematical reasoning, but it would not by itself imply autonomous research capabilities or “artificial general intelligence”. MillenniumPrizeProblemBench focuses on structured proof-style problems rather than open-ended research or creative discovery, making it a targeted measure of technical reasoning under strict verification.
Impact
By providing a clear, pass-or-fail view of progress on Millennium-inspired tasks, MillenniumPrizeProblemBench offers a common reference point for researchers, labs, and policymakers when assessing model capabilities. This can support more grounded discussions about development trajectories, potential risks, and appropriate governance measures. Even if no model comes close to resolving the true Millennium Problems, tracking performance on structurally similar benchmarks helps clarify where today’s systems excel, where they still break, and which kinds of mathematical reasoning remain firmly out of reach.
Contact & external results
Running your own MillenniumPrizeProblemBench-style evaluation? Found an issue or have updated numbers for a model? Use this form to share results or get in touch.