AI models have advanced from grade-school arithmetic to olympiad-level and research mathematics in just two years. On the OpenAI Podcast, researchers Sebastian Bubeck and Ernest Ryu explain why mathematics has become the key benchmark on the road to artificial general intelligence.
Reasoning models didn't exist two years ago. Four years ago, Bubeck was impressed when Google's Minerva model could draw a line through points on a coordinate system. Today, these systems assist Fields Medal winners with their daily work. At a conference 18 months ago, 80 percent of mathematicians believed that scaled-up LLMs could not solve open research problems, Bubeck says.
Ernest Ryu, a former UCLA math professor, used ChatGPT to solve a 42-year-old open problem about Nesterov's method in optimization theory in just twelve hours spread across three evenings. He had previously spent over 40 hours on the problem without AI and gotten nowhere. Throughout, Ryu acted as a verifier, catching errors and steering the conversation in promising directions.
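For context, the sketch below gives one standard formulation of Nesterov's accelerated gradient method for minimizing a smooth convex function f with L-Lipschitz gradient. The podcast does not specify which variant of the method, or which precise question about it, the open problem concerns, so this is background rather than the problem itself.

```latex
% One standard form of Nesterov's accelerated gradient method (1983)
% for minimizing a smooth convex f with L-Lipschitz gradient.
% Initialize x_0 = y_0 and t_0 = 1, then for k = 0, 1, 2, ...
\begin{aligned}
  x_{k+1} &= y_k - \tfrac{1}{L}\,\nabla f(y_k), \\
  t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \\
  y_{k+1} &= x_{k+1} + \frac{t_k - 1}{t_{k+1}}\,(x_{k+1} - x_k).
\end{aligned}
% Classical guarantee on function values:
%   f(x_k) - f(x^*) \le 2L \, \|x_0 - x^*\|^2 / (k+1)^2.
```

The method and its accelerated O(1/k^2) rate date to Nesterov's 1983 paper, which is consistent with the problem's 42-year age.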
Why Math Became the Benchmark for AGI
For Bubeck, math isn't an accidental yardstick for AGI progress. It demands exactly the capabilities a generally intelligent system needs. Mathematical proofs require long, consistent reasoning over hours, days, or even years, and a single mistake anywhere destroys the entire argument. Anything that can handle that must be able to spot and fix its own errors.
This ability to sustain long reasoning and self-correct is what researchers aim to carry over from math training to other fields, from biology to materials science. Bubeck draws a parallel with how people are educated: students learn math not because they'll write proofs later in life, but because the subject forces logical thinking.
Math also has practical advantages as a benchmark. Problems are clearly stated, answers can be checked, and no one argues about whether a result is correct. Bubeck introduces the idea of "AGI time", the stretch of human expert work a model can replicate: two years ago, models could match a student's thinking over minutes. Today, they're up to days or even a week. The next target is weeks and months.
OpenAI's training methods are general rather than math-specific, Bubeck says, so progress in other sciences should follow. The researchers are building an "automated researcher" that can work on problems autonomously over long periods.
The Erdős Problems and the Debate Over Their Meaning
Bubeck and Ryu also discuss the Erdős problems, a collection of open questions left behind by the late mathematician Paul Erdős. Bubeck says internal models initially found solutions to ten problems marked as open, mostly through deep literature searches. His misleading tweet about it sparked a public spat with Google DeepMind CEO Demis Hassabis, as many interpreted it as a claim that OpenAI had produced new proofs. Now, Bubeck says, ChatGPT and internal models have actually produced more than ten genuinely new solutions worthy of publication in academic journals.
What seemed like an unrealistic claim is now reality, and the pace is accelerating. Bubeck sees this as evidence that models are making the leap from recombining existing knowledge to producing new mathematics, even if the philosophical question of whether scientific progress is anything more than clever recombination plus reasoning remains open.
The Risks: Mental Atrophy and Fake Proofs
Both researchers warn against superficial use of these tools. Expertise matters more than ever, they argue, because only trained mathematicians can put the models to productive use. The long AI-generated proofs that non-mathematicians post on social media are usually wrong. Ryu sees the same pattern in programming, where a generation of developers is losing the ability to use debuggers.
Claims that scientists are no longer needed are therefore dangerous, Bubeck says, and academic institutions must actively reclaim their role. At the same time, AI can speed up proof verification, a process that currently takes years, and help flag problems in published papers.