When the AI solves your open problem before lunch

What happened

In mid-May, roughly 30 mathematicians gathered in Berkeley for a closed-door workshop organized by Epoch AI. The setup was simple: bring your hardest unsolved problems and try to break a reasoning model. The model was OpenAI's o4-mini, a smaller, faster variant of the o-series reasoning stack that has spent the last 18 months quietly redefining what 'AI at math' means.

The attendees, reporting back to *Science*, described the experience as somewhere between thrilling and disorienting. Ken Ono, a number theorist at the University of Virginia, said o4-mini solved a problem he had personally worked on and considered genuinely difficult — in under ten minutes, with a chain of reasoning he described as "frighteningly good." Other participants reported the model producing correct, novel proofs on problems drawn from active research programs. A few problems it got wrong. Several it solved in ways the human experts hadn't considered.

The workshop result lands on top of a year of escalating benchmark scores. OpenAI's experimental reasoning system hit gold-medal performance on the IMO 2025 problems. On FrontierMath — a benchmark explicitly constructed by Terence Tao and others to be "PhD-level," with problems whose solutions take human experts hours to days — top models have moved from ~2% accuracy in late 2024 to north of 25% by mid-2025. Tao himself, who helped design FrontierMath specifically so it would resist current models, has publicly said the rate of progress has surprised him.

Why it matters

The mathematicians' warning is more interesting than the usual "AI is coming for X" story because mathematics was supposed to be the citadel. Code generation always had a fuzzy ground truth — does it compile, does it pass tests, is it 'clean.' Mathematics has a hard one: a proof is either valid or it isn't. The discipline has spent two thousand years building a referee system that catches bullshit. If LLMs were going to embarrass themselves anywhere, it should have been here.

They didn't. The Berkeley workshop, the IMO result, and the FrontierMath curve are all measuring the same underlying thing: reasoning models are getting better at the kind of structured, multi-step, dead-end-aware thinking that defines mathematical work. The architecture detail that matters is not the parameter count but the training-time RL on reasoning traces — letting the model think for thousands of tokens, backtrack, and verify, with rewards shaped by correctness rather than fluency. That's what produced the jump.
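To make "correctness rather than fluency" concrete, here is a toy sketch of an outcome-based reward. It is an illustration only, not any lab's actual training code: the `ANSWER:` marker and the names `extract_final_answer` and `outcome_reward` are hypothetical, invented for this example.

```python
import re
from typing import Optional


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the last value after an 'ANSWER:' marker (a made-up convention)."""
    matches = re.findall(r"ANSWER:\s*(\S+)", trace)
    return matches[-1] if matches else None


def outcome_reward(trace: str, expected: str) -> float:
    """1.0 if the final answer matches the checked solution, else 0.0.

    Length, polish, and the number of abandoned attempts play no role.
    """
    return 1.0 if extract_final_answer(trace) == expected else 0.0


# A trace that backtracks but lands on the right answer.
messy_but_right = "Try n=3... dead end. ANSWER: 7 no wait, recheck. ANSWER: 12"
# A trace that reads well but ends on the wrong answer.
polished_but_wrong = "By a standard and elegant argument, the result follows. ANSWER: 7"

print(outcome_reward(messy_but_right, "12"))     # 1.0
print(outcome_reward(polished_but_wrong, "12"))  # 0.0
```

The point of the sketch is the asymmetry: the backtracking trace is rewarded and the polished one is not, because only the checked final answer counts.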

The more honest read from the workshop attendees isn't "AI will replace mathematicians" — it's "the intuition for which problems are hard is breaking down." Pure mathematics runs on taste: which conjectures are worth a decade of your life, which are tractable, which are dead ends. That taste is calibrated against an internal model of difficulty built up over a career. When a model can solve in ten minutes what a senior researcher considered a multi-month problem, the calibration breaks. The community reaction at the workshop, per multiple attendees, was a mix of excitement and existential vertigo — several said they couldn't sleep that night.

The community has split, predictably, into camps. Geoff Hinton-adjacent voices argue this is the curve continuing on schedule and that working mathematicians should plan for a 5-year horizon where most undergraduate-and-below problems are automated. Tao's public position is more measured: the models are powerful collaborators, but they hallucinate confidently on problems just outside their training distribution and need expert verification on every nontrivial output. Even so, he has stopped saying "this won't happen soon."

What this means for your stack

If you ship code, the workshop is a leading indicator for two things. First, the gap between 'LLMs are bad at logic' and 'LLMs are graduate-level at logic' is closing on a timeline measured in quarters, not decades. Assume the reasoning-model APIs you're choosing today will be qualitatively better in six months, and architect for swap-ability. Second, the technique that produced this jump (long chains of reasoning trained with verifier rewards) generalizes beyond math. The same training recipe is driving the recent gains on competitive programming, formal verification, and SQL-from-natural-language. If your product has a 'hard reasoning' moat (compiler design, constraint solving, theorem proving, complex query planning), the moat is shallower than it was last quarter.
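One way to read "architect for swap-ability" is to keep vendor SDK calls behind a narrow interface so that changing models is a config change rather than a refactor. The sketch below is a minimal version of that idea. Every name in it (`ReasoningBackend`, `ReasoningResult`, `EchoBackend`, `BACKENDS`, `get_backend`) is hypothetical, and the stand-in backend just echoes the prompt; a real one would wrap whichever provider SDK you use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass(frozen=True)
class ReasoningResult:
    answer: str
    tokens_used: int


class ReasoningBackend(Protocol):
    """The only surface the rest of the application is allowed to depend on."""

    def solve(self, prompt: str, max_tokens: int) -> ReasoningResult: ...


class EchoBackend:
    """Stand-in backend for illustration; it calls no real API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def solve(self, prompt: str, max_tokens: int) -> ReasoningResult:
        text = f"[{self.name}] {prompt}"
        used = min(len(text.split()), max_tokens)
        return ReasoningResult(answer=text, tokens_used=used)


# Hypothetical registry: swapping models means changing one key.
BACKENDS: Dict[str, Callable[[], ReasoningBackend]] = {
    "model-a": lambda: EchoBackend("model-a"),
    "model-b": lambda: EchoBackend("model-b"),
}


def get_backend(name: str) -> ReasoningBackend:
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name}") from None


result = get_backend("model-b").solve("plan this query", max_tokens=100)
print(result.answer)       # [model-b] plan this query
print(result.tokens_used)  # 4
```

Application code that only ever sees `ReasoningBackend` does not need to change when the model behind it does.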

Practically: stop treating reasoning models as a more expensive chat completion. They're a different primitive. Budget for 10–100x the token spend per call, design UX around 30-second-to-2-minute latencies, and build verification harnesses — Lean for proofs, type checkers for code, SAT solvers for constraints — that let you trust outputs without reading every chain of thought. The teams winning with these models are the ones who treated 'verifier' as a first-class component instead of pretending the LLM was the whole answer.
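Here is a minimal sketch of what "verifier as a first-class component" can look like, using a tiny hand-written CNF checker in place of a real SAT solver. The names `satisfies`, `solve_with_verifier`, and `fake_model` are hypothetical, and the "model" is a canned list of guesses standing in for an API call. The idea it illustrates is that nothing is returned to the caller unless a deterministic check accepts it.

```python
from typing import Callable, Dict, List, Optional

Clause = List[int]            # [1, -2] means (x1 OR NOT x2)
Assignment = Dict[int, bool]  # variable index -> truth value


def satisfies(cnf: List[Clause], assignment: Assignment) -> bool:
    """Deterministic check: every clause contains at least one true literal.

    Variables missing from the assignment are treated as False.
    """
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in cnf
    )


def solve_with_verifier(
    propose: Callable[[int], Assignment],
    cnf: List[Clause],
    max_attempts: int = 5,
) -> Optional[Assignment]:
    """Ask the proposer for candidates; return the first one that verifies."""
    for attempt in range(max_attempts):
        candidate = propose(attempt)
        if satisfies(cnf, candidate):
            return candidate
    return None  # never hand back an unverified answer


# Canned guesses standing in for model outputs; the first two are wrong.
canned = [
    {1: True, 2: False, 3: False},
    {1: False, 2: True, 3: False},
    {1: False, 2: True, 3: True},
]


def fake_model(attempt: int) -> Assignment:
    return canned[attempt % len(canned)]


cnf = [[1, 2], [-1, 2], [-2, 3]]
print(solve_with_verifier(fake_model, cnf))  # {1: False, 2: True, 3: True}
```

The same loop shape applies when the checker is Lean or a type checker: the model proposes, the checker decides, and the retry budget caps the token spend.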

Also worth saying out loud: the workshop format itself — bring your hardest unpublished problem, watch a model attempt it — is a benchmark methodology you can steal. If your domain has experts with private, unpublished hard problems, that's the only contamination-free eval left. Public benchmarks are leaking into training sets faster than they can be constructed. Your senior engineers' bug backlog is a more honest test of a coding model than HumanEval in 2026.
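A private eval in this spirit can be very small. The sketch below assumes each expert supplies a prompt plus a grading function that never leaves your infrastructure. `PrivateProblem`, `run_private_eval`, and the sample problems are hypothetical, and the model is a stub that answers "42" to everything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PrivateProblem:
    problem_id: str
    prompt: str
    check: Callable[[str], bool]  # expert-written grader; never sent to the model


def run_private_eval(
    model: Callable[[str], str], problems: List[PrivateProblem]
) -> Dict[str, object]:
    """Run every problem once and grade it with its own checker."""
    by_problem: Dict[str, bool] = {}
    for p in problems:
        try:
            by_problem[p.problem_id] = bool(p.check(model(p.prompt)))
        except Exception:
            # A crashing model call or grader counts as a failure, not a skip.
            by_problem[p.problem_id] = False
    passed = sum(by_problem.values())
    return {"passed": passed, "total": len(problems), "by_problem": by_problem}


# Placeholder problems; real ones would come from your experts' backlog.
problems = [
    PrivateProblem("p1", "placeholder prompt 1", lambda out: out.strip() == "42"),
    PrivateProblem("p2", "placeholder prompt 2", lambda out: out.strip() == "17"),
]


def stub_model(prompt: str) -> str:
    return "42"


report = run_private_eval(stub_model, problems)
print(report["passed"], "of", report["total"])  # 1 of 2
```

Keeping the graders as code rather than reference answers in a spreadsheet makes it harder for the answers to end up in a prompt or a public repository by accident.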

Looking ahead

The near-term play for working mathematicians, per several Berkeley attendees, is collaboration: use the model to clear underbrush, then spend human time on the parts that require genuine novelty. That's roughly the same play software engineering has been running for two years. The longer-term question — what happens when the model's novelty exceeds the human's — is the one the workshop attendees couldn't answer, and the one Tao and Ono are now publicly worrying about. The honest summary is that the people who built the citadel are no longer sure where its walls are.

// read from source

Mathematicians issue warning as AI rapidly gains ground
