Ken Ono, a number theorist at the University of Virginia, reported that o4-mini solved a problem he had personally worked on and considered genuinely difficult in under ten minutes, calling the chain of reasoning 'frighteningly good.' His firsthand experience at the Berkeley DeepMind workshop convinced him the models are now operating at a level that demands the mathematical community take notice.
By submitting the Science article to Hacker News with the framing 'Mathematicians issue warning as AI rapidly gains ground,' pseudolus amplified the position that this is a watershed moment. The 205-point score signals broad community agreement that something significant has shifted in AI's mathematical capabilities.
Tao helped design FrontierMath specifically to resist current models, expecting it would hold for years — yet has publicly acknowledged that the rate of progress has surprised him. Top models moved from ~2% accuracy in late 2024 to over 25% by mid-2025, validating his concern that his own benchmarks underestimated the trajectory.
The editorial argues mathematics was the citadel because unlike code or prose, proofs have a hard binary ground truth and a two-thousand-year-old referee system that catches bullshit. AI succeeding here is qualitatively different from succeeding at fuzzier tasks, because the discipline cannot be fooled by surface plausibility.
In mid-May, roughly 30 mathematicians gathered in Berkeley for a closed-door workshop hosted by Google DeepMind. The setup was simple: bring your hardest unsolved problems and try to break a reasoning model. The model was OpenAI's o4-mini, a smaller, faster variant of the o-series reasoning stack that has spent the last 18 months quietly redefining what 'AI at math' means.
The attendees, reporting back to *Science*, described the experience as somewhere between thrilling and disorienting. Ken Ono, a number theorist at the University of Virginia, said o4-mini solved a problem he had personally worked on and considered genuinely difficult — in under ten minutes, with a chain of reasoning he described as "frighteningly good." Other participants reported the model producing correct, novel proofs on problems drawn from active research programs. A few problems it got wrong. Several it solved in ways the human experts hadn't considered.
The workshop result lands on top of a year of escalating benchmark scores. OpenAI's experimental reasoning system hit gold-medal performance on the IMO 2025 problems. On FrontierMath — a benchmark explicitly constructed by Terence Tao and others to be "PhD-level," with problems whose solutions take human experts hours to days — top models have moved from ~2% accuracy in late 2024 to north of 25% by mid-2025. Tao himself, who helped design FrontierMath specifically so it would resist current models, has publicly said the rate of progress has surprised him.
The mathematicians' warning is more interesting than the usual "AI is coming for X" story because mathematics was supposed to be the citadel. Code generation always had a fuzzy ground truth — does it compile, does it pass tests, is it 'clean.' Mathematics has a hard one: a proof is either valid or it isn't. The discipline has spent two thousand years building a referee system that catches bullshit. If LLMs were going to embarrass themselves anywhere, it should have been here.
They didn't. The Berkeley workshop, the IMO result, and the FrontierMath curve are all measuring the same underlying thing: reasoning models are getting better at the kind of structured, multi-step, dead-end-aware thinking that defines mathematical work. The architecture detail that matters is not the parameter count but the training-time RL on reasoning traces — letting the model think for thousands of tokens, backtrack, and verify, with rewards shaped by correctness rather than fluency. That's what produced the jump.
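To make "rewards shaped by correctness rather than fluency" concrete, here is a toy sketch in Python. It is not the training setup of any real model, which the sources here do not describe. The `ANSWER:` convention and the function name `correctness_reward` are assumptions made purely for illustration.

```python
def correctness_reward(trace: str, expected: str) -> float:
    """Toy outcome reward: 1.0 if the final answer matches, else 0.0.

    The length, style, and number of dead ends in the trace are ignored.
    Assumes (hypothetically) that the last non-empty line of the trace
    has the form 'ANSWER: <value>'.
    """
    lines = [ln.strip() for ln in trace.strip().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    final = lines[-1]
    prefix = "ANSWER:"
    if not final.startswith(prefix):
        return 0.0
    answer = final[len(prefix):].strip()
    return 1.0 if answer == expected.strip() else 0.0


# A messy trace with backtracking still earns full reward if it ends right.
messy = "Try x=3.\nThat fails, backtrack.\nTry x=4.\nANSWER: 4"
# A fluent but wrong trace earns nothing.
fluent = "Clearly and elegantly, the result follows.\nANSWER: 5"
print(correctness_reward(messy, "4"), correctness_reward(fluent, "4"))
```

The point of the sketch is only that the reward looks at the outcome, so a long, backtracking trace is not penalized and a polished wrong one is not rewarded.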
The more honest read from the workshop attendees isn't "AI will replace mathematicians" — it's "the intuition for which problems are hard is breaking down." Pure mathematics runs on taste: which conjectures are worth a decade of your life, which are tractable, which are dead ends. That taste is calibrated against an internal model of difficulty built up over a career. When a model can solve in ten minutes what a senior researcher considered a multi-month problem, the calibration breaks. The community reaction at the workshop, per multiple attendees, was a mix of excitement and existential vertigo — several said they couldn't sleep that night.
The community has split, predictably, into camps. Geoff Hinton-adjacent voices argue this is the curve continuing on schedule and that working mathematicians should plan for a five-year horizon in which most undergraduate-and-below problems are automated. Tao's public position is more measured: the models are powerful collaborators, but they hallucinate confidently on problems just outside their training distribution and require expert verification on every nontrivial output. Even so, he has stopped saying "this won't happen soon."
If you ship code, the workshop is a leading indicator of two things. First, the gap between 'LLMs are bad at logic' and 'LLMs are graduate-level at logic' is closing on a timeline measured in quarters, not decades. Assume the reasoning-model APIs you're choosing today will be qualitatively better in six months, and architect so that one model can be swapped for another. Second, the technique that produced this jump, long-context reasoning with verifier rewards, generalizes beyond math. The same training recipe is what's driving the recent gains on competitive programming, formal verification, and SQL-from-natural-language. If your product has a 'hard reasoning' moat (compiler design, constraint solving, theorem proving, complex query planning), the moat is shallower than it was last quarter.
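The advice to keep reasoning models swappable can be sketched as a thin interface between call sites and vendors. This is a minimal illustration, not a recommended library: `ReasoningBackend`, `EchoBackend`, and `BackendRegistry` are hypothetical names, and the stand-in backend only echoes its input where a real one would call a model API.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class ReasoningBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into an answer."""

    def solve(self, prompt: str) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend for illustration; a real one would call a vendor API."""

    tag: str

    def solve(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"


class BackendRegistry:
    """Maps a config name to a backend so call sites never name a vendor."""

    def __init__(self) -> None:
        self._backends: Dict[str, ReasoningBackend] = {}

    def register(self, name: str, backend: ReasoningBackend) -> None:
        self._backends[name] = backend

    def get(self, name: str) -> ReasoningBackend:
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        return self._backends[name]


def answer(registry: BackendRegistry, active: str, prompt: str) -> str:
    """Application code depends only on the registry and the active name."""
    return registry.get(active).solve(prompt)


registry = BackendRegistry()
registry.register("current", EchoBackend("model-a"))
registry.register("next", EchoBackend("model-b"))
print(answer(registry, "current", "Is 91 prime?"))
```

With this shape, moving to a newer model is a change to the `active` name in configuration, not a change to every call site.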
Practically: stop treating reasoning models as a more expensive chat completion. They're a different primitive. Budget for 10–100x the token spend per call, design UX around 30-second-to-2-minute latencies, and build verification harnesses — Lean for proofs, type checkers for code, SAT solvers for constraints — that let you trust outputs without reading every chain of thought. The teams winning with these models are the ones who treated 'verifier' as a first-class component instead of pretending the LLM was the whole answer.
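A verifier treated as a first-class component might look like the following sketch. It assumes a toy task (finding a nontrivial factor pair) where checking is cheap and deterministic. The names `solve_with_verifier`, `verify_factorization`, and `flaky_proposer` are invented for illustration, and the proposer is a scripted stand-in for a model call. In a real system the check would be a Lean run, a type checker, or a SAT solver.

```python
from typing import Callable, Optional, Tuple

Candidate = Tuple[int, int]


def verify_factorization(n: int, candidate: Candidate) -> bool:
    """Cheap, deterministic check: is this a nontrivial factor pair of n?"""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def solve_with_verifier(
    n: int,
    propose: Callable[[int, int], Candidate],
    max_attempts: int = 5,
) -> Optional[Candidate]:
    """Ask the proposer up to max_attempts times; return the first verified answer.

    Nothing is returned unless the verifier accepts it, so the caller never
    has to read the proposer's reasoning to trust the output.
    """
    for attempt in range(max_attempts):
        candidate = propose(n, attempt)
        if verify_factorization(n, candidate):
            return candidate
    return None


def flaky_proposer(n: int, attempt: int) -> Candidate:
    """Scripted stand-in for a model: wrong twice, then right (for n = 15)."""
    guesses = [(2, 7), (3, 6), (3, 5)]
    return guesses[min(attempt, len(guesses) - 1)]


print(solve_with_verifier(15, flaky_proposer))                  # (3, 5)
print(solve_with_verifier(15, flaky_proposer, max_attempts=2))  # None
```

The `max_attempts` cap is where the token and latency budget shows up: each retry is another expensive reasoning call, so the limit is a product decision as much as a technical one.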
Also worth saying out loud: the workshop format itself — bring your hardest unpublished problem, watch a model attempt it — is a benchmark methodology you can steal. If your domain has experts with private, unpublished hard problems, that's the only contamination-free eval left. Public benchmarks are leaking into training sets faster than they can be constructed. Your senior engineers' bug backlog is a more honest test of a coding model than HumanEval in 2026.
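The workshop-style eval can be sketched as a small harness over problems that never leave your organization. Everything here is hypothetical: `PrivateProblem`, `run_private_eval`, and the toy model are illustrative names, and the acceptance checks would in practice be written by the expert who owns each problem.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PrivateProblem:
    """An unpublished problem plus an expert-written acceptance check."""

    problem_id: str
    prompt: str
    check: Callable[[str], bool]


def run_private_eval(
    problems: List[PrivateProblem],
    model: Callable[[str], str],
) -> Dict[str, object]:
    """Run the model on each problem and score it with that problem's check."""
    results = {p.problem_id: bool(p.check(model(p.prompt))) for p in problems}
    solved = sum(results.values())
    total = len(problems)
    return {
        "results": results,
        "solved": solved,
        "total": total,
        "accuracy": solved / total if total else 0.0,
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a model call; answers one prompt correctly."""
    return "4" if prompt == "What is 2 + 2?" else "unknown"


problems = [
    PrivateProblem("p1", "What is 2 + 2?", lambda out: out.strip() == "4"),
    PrivateProblem("p2", "What is 3 * 3?", lambda out: out.strip() == "9"),
]
report = run_private_eval(problems, toy_model)
print(report["solved"], "of", report["total"], "solved")
```

Keeping the check next to the problem means the harness stays useful when the model behind `model` is replaced.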
The near-term play for working mathematicians, per several Berkeley attendees, is collaboration: use the model to clear underbrush, then spend human time on the parts that require genuine novelty. That's roughly the same play software engineering has been running for two years. The longer-term question — what happens when the model's novelty exceeds the human's — is the one the workshop attendees couldn't answer, and the one Tao and Ono are now publicly worrying about. The honest summary is that the people who built the citadel are no longer sure where its walls are.
For every interesting problem AI solves there is a long tail of really dumb things that AI does that humans would never do. Some days I am in awe of one-shot magic eight-ball output and other days I'm so frustrated by the sheer stupidity of what it produces. It remains to be seen whether t
Much of math (or science) research has the strange quality of being mostly curiosity-driven, but having giant benefits that occasionally spin out to the public. Some questions are more urgent and practical. My feeling is that the more directly practical a question is, the more likely the research com
Anyone else draw similarities between this and the artists and authors who complained when gen AI first came out? I think a lot of people don't realise the disruption AI will cause to many industries until it's directly impacting them, basically personal fable at scale (https://en.wikipe
Accelerationists may argue that the erosion of proper attribution and proof verification by humans is a meaningless short-term struggle of a dying field. Mathematics seems to be entering an era where human + machine maximizes performance, much like chess in the 1990s. However, imagine a future where
> Mathematics produces not only a body of results, but also understanding, clarity, and judgment among the communities of mathematicians who have shaped them, often in the context of their own autonomously guided research. This expert knowledge is essential, both to effectively use mathematics, a