← Back to Blog
Aryan Yadav · June 17, 2026 · 16 min read

The Proof and the Spark

I found the ceiling of my own math ability in college. Lately I have watched a machine climb toward it, and I am still not sure how I feel about that.

Somewhere in my second year at IIT Madras I found the ceiling of my own mathematical ability, and I can still feel roughly where it was. The machinery was fine. Calculus, linear algebra, the differential equations that quietly run the physical world I was being trained to design, I could use all of it, and I liked using it. What I could not do was the floor above. The place where people do not use the theorem, they prove it. Where you stare at something nobody has cracked and somehow guess where the crack is going to be. I worked out, without much drama, that I was not going up there. Not for lack of trying. I just did not have the thing.

That thing is what this essay is about, because for the first time in a very long history, something is climbing toward it that is not a person.

In July 2024, a system from Google DeepMind sat the International Mathematical Olympiad, the exam that finds the strongest teenage mathematicians alive, and finished one point short of a gold medal. It solved four of the six problems. One of those four was a number theory problem that only five of the roughly six hundred human contestants managed, and the machine sat with it for three days before the answer came loose. I read about it on a Tuesday, somewhere between two meetings, and felt a small and very specific vertigo. The ceiling I had hit at twenty had just been brushed by software.

I am an engineer, not a mathematician. For most of us, math is a cathedral somebody else finished centuries ago. You walk in, you admire the vaulting, you use the building, and you do not expect anybody to hand you a chisel, least of all a machine. And yet here we are, with the cathedral getting new stonework faster than almost any time in its history, and some of it cut by code.

Here is the one idea you need, and it is small enough to carry in a pocket. For two thousand years, doing mathematics meant two different jobs that happened to live in the same skull. There is the guess, the leap, the moment you decide this is the question and the answer is hiding somewhere over there. And then there is the long grind of proving the guess is actually true, one airtight step after another. We never had any reason to pull those two apart, because the same person always did both. The machines have finally given us the reason. They are spectacular at the second job and nearly helpless at the first.

The whole essay, in one box

Doing mathematics

Two jobs that have always traveled together. The spark: guessing what is true, and deciding which question is even worth your time. The proof: checking, with no hand-waving allowed, that the guess holds. For most of history they came bundled into one act. They are coming apart now, and if you know where to look, you can actually watch it happen.

The mental model
Mathematics = Spark × Proof

Both halves used to live inside people. The machines are now very good at the right-hand one and getting better fast. The left-hand one is still almost entirely ours. Nearly everything strange about this moment is just that little equation splitting into its two pieces.

DOING MATHEMATICS STILL HUMAN The Spark the leap, the guess, the question worth asking NOW THE MACHINE The Proof search, verify, check that it is true
The one idea behind everything that follows. A single act split into two, and the machines walked in through the right-hand door.

First, the machines learned to prove

Proving came first, for a boring reason. It is the half you can grade.

A formal proof is a chain of steps so explicit that a machine can check every single link, with no room for charm and no tired referee waving it through at midnight. There is a language built for exactly this, called Lean. You write the mathematics as code, and a compiler reads it and either accepts it or throws an error, the same way your calculator will never be talked into believing two and two make five. It looks something like this.

A proof a machine can check, in Lean
example : 2 + 2 = 4 := by rfl          -- true by raw computation

theorem sqrt2_irrational : Irrational (Real.sqrt 2) := by
  -- the claim is one line.
  -- the proof below runs about forty, and Lean checks every one.
  sorry

The claim is the easy line. Everything under it is the ladder, and the compiler inspects every rung. Once a proof is something a machine can check like that, proving turns into a game, and making machines superhuman at games is the one thing we have genuinely figured out how to do.

DeepMind's AlphaProof is, under the hood, the same animal that beat the world at Go. Each proof step is a move. It searches an enormous tree of possible moves and uses the Lean compiler as a referee that cannot be charmed, bluffed, or worn down. A wrong branch dies the instant it is checked, so the machine can blunder through millions of dead ends without the slightest embarrassment. It never has to convince a person. Only the compiler.

THE PROBLEM, IN LEAN × × × QED ✓ millions of dead branches
Proving as a game the compiler referees. The machine explores a huge tree of steps, and every wrong move gets pruned the instant Lean checks it. No bluffing survives. One verified path becomes the proof.
xkcd · CC BY-NC An xkcd comic strip titled Proofs, depicting a mathematical argument that grows steadily more hand-wavy and confident as it goes.
Randall Munroe's portrait of the kind of "proof" the old human system used to quietly allow, powered by a confident tone and a referee who wanted to go home. Formalization is the precise opposite of this. The compiler has no respect whatsoever for your confidence. (xkcd 1724)

At the 2024 Olympiad this got silver. A year later it got gold. In July 2025, systems from both DeepMind and OpenAI scored thirty-five out of forty-two and cleared the gold line, each solving five of the six problems, and each scoring a flat zero on the sixth, which happens to be exactly what most of the human gold medalists did too. DeepMind's version pulled it off in plain English, inside the real time limit, with no human stepping in to translate the questions into formal code. A year earlier that had not been possible.

Drop down to ordinary competition math and the numbers stop being impressive and start being a little absurd. On the AIME, a brutal American exam, OpenAI's o1 went from its predecessor's thirteen percent to the low nineties. The model after it cleared ninety-seven. These things would now place in the top few hundred students in the United States, which is to say comfortably north of wherever I topped out.

I want to be careful here, because this is exactly where the industry loves to oversell. These are company announcements, not peer-reviewed papers. The mathematician Kevin Buzzard pointed out the obvious thing, which is that the labs set their own exam and then mark their own answers. And when Apple's researchers quietly swapped the numbers inside problems these models had aced, leaving the logic untouched, the scores slipped. That is the question this whole essay keeps walking back to, and it is also the question my own company quietly lives or dies on. Is the thing understanding, or is it pattern-matching at a scale that just looks like understanding from the outside?

It never has to convince a person. It only has to convince the compiler. That is its superpower, and I have come to think it is also its cage.

Then they found things we never had

Passing an exam is one thing. Beating the best answer anyone has ever had is a different thing, and it is the part that actually keeps me up.

In 2025, DeepMind pointed a system called AlphaEvolve at a heap of unsolved problems. It writes candidate algorithms, scores them, keeps the good ones, mutates them, and runs that loop until something better falls out. On one problem it had no real business touching, it found a way to multiply two four-by-four matrices using forty-eight multiplications instead of forty-nine. That one extra multiplication had been sitting there since 1969. Volker Strassen held the record for fifty-six years, and a machine took it by breeding code until the code got lucky in a way no person ever had on purpose.

It did not stop at matrices. It nudged the best known answers on more than fifty open problems, improved a handful of Ramsey number bounds in early 2026 that had been stuck for over a decade, and found a tighter way to pack eleven hexagons than anyone had on record. None of this reorders the history of mathematics. But all of it is genuinely new, and that is the sentence I keep repeating to myself, because for my whole life "new mathematics" was a thing only humans made.

Then, late in 2025, Terence Tao, who is about the closest thing living mathematics has to a folk hero, described a loop that honestly raised the hair on my arms. One machine proposes a fresh construction for an open problem. A second machine proves the construction works. A third rewrites that proof into Lean and certifies it past any human doubt. Guess, proof, and verification, handed from machine to machine, with the people now standing slightly off to the side, pointing.

100% 50% 0% under 2% 25% most of it Nov 2024 Dec 2024 2026 FrontierMath: share solved by the best model
FrontierMath was built to humiliate AI. Each problem can cost a specialist hours or days. At launch the best models solved under one in fifty. A month later a new model solved a quarter. It later turned out a third of the problems had errors, and fixing them barely touched the slope.

You can watch the water rise on a single test. In November 2024 a group called Epoch put out FrontierMath, a set of problems so hard that a specialist needs hours or days to finish one, built deliberately to make AI look bad. At launch, the best models on earth solved under two percent. One month later, a new OpenAI model solved a quarter of them. It has been climbing ever since.

A wall built to stand for years is being treated as a staircase.

Proof stopped being a conversation and became a compile

There is a quieter shift happening under the loud one, and I think it matters more.

For all of history a proof was a social thing. You wrote it, you sent it to the experts, they read it, they argued, and a community eventually decided to believe you. That is how mistakes hid for years inside famous proofs. Referees get tired. They trust authors they already respect. Belief, in the end, was a handshake.

Formalization quietly ends the handshake. There is now a library called Mathlib holding close to three hundred thousand theorems, all written in Lean, all checked by machine. When four mathematicians proved a hard result in additive combinatorics in late 2023, Tao got a crowd together to translate the proof into Lean. Strangers chopped it into pieces and each took one. Three weeks later the compiler signed off on the whole thing, and not one of them had needed to trust any of the others.

His phrase for where this goes is "industrial-scale mathematics," and he means it almost literally. In one project he started in 2024, a crowd plus a swarm of plain old automated provers chewed through more than twenty-two million small questions about a simple algebraic structure, every single answer machine-checked, in roughly seven months. The same year, a loose band of amateurs online finally nailed down the fifth "busy beaver" number, a famously uncomputable quantity, at exactly 47,176,870, by grinding through more than a hundred and eighty million tiny machines and proving the result in a system called Coq.

The sixth busy beaver number, by the way, is probably out of reach forever, because its value secretly depends on whether some of our deepest conjectures can even be settled. The frontier has a cliff at the far edge, and we already know roughly where it is. I find that strangely beautiful.

Proof used to mean: convince the experts. Now it can also mean: pass the compiler. Certainty stopped being a matter of trust.

Meanwhile the humans were doing the impossible

While the machines ground away, people were not sitting still. And the human wins of these same years tell you, very precisely, what the machines still cannot do.

In 2023 a man named David Smith, a retired print technician in Yorkshire who did this for fun, was pushing paper shapes around on his kitchen table and found a single tile that covers an infinite floor without the pattern ever repeating. People had hunted for a shape like that since the 1960s, and plenty of them doubted it could even exist. He found it with scissors and a bit of card.

PERIODIC slide it over, you get the same thing APERIODIC one shape, fits forever, never repeats
Left: an ordinary tiling. Shift it and it lands right back on itself. Right: the idea behind the "hat" of 2023, a single shape that fills the plane but never settles into a repeat, the kind of five-fold motif no periodic grid can hold. A sixty-year-old question, answered by a guess no search would have made.

The same few years gave us the first real improvement since 1935 on one of combinatorics' most stubborn estimates, the one about how big a crowd has to get before some kind of order becomes unavoidable. Get six people in a room and you are guaranteed three who all know each other or three who are all strangers. Ask how the guarantee scales and you walk straight into eighty-eight years of stuck.

The breakthrough, in one line
1935   R(k, k) ≤ 4k
2023   R(k, k) ≤ (4 − ε)k

For eighty-eight years nobody could push the base below 4. Shaving it to a hair under 4 does not sound like much, and it was the result of the decade in this corner of math. The whole game lives in that tiny gap between "obviously true" and "actually proven."

A young researcher proved, in a hundred and nineteen pages of pure argument and not one computer simulation, the exact largest sofa you can slide around an L-shaped corner, a problem left open since 1966. And a team led by Dennis Gaitsgory and Sam Raskin finished a thirty-year, eight-hundred-page proof of the geometric Langlands conjecture, one wing of the closest thing mathematics has to a grand unified theory, a hidden run of bridges between number theory, geometry, and physics.

What they have in common is the part I cannot stop staring at. Every one of them started with a person looking at a problem from a strange angle and guessing what to even try. Nobody searched their way to the hat. No optimizer told Smith to cut that exact shape. The odd guess, the new lens, the decision that this corridor and not that one deserved thirty years of a human life, all of it came from the spark. The machines, so far, do the climbing. They do not pick the mountain.

Where the machine stops

If you want to find the real edge of what these systems can do, look for the spot where a person has to step back in. From everything I have read, and from what I watch in my own work, it shows up in four places.

The translation problem. Before AlphaProof could even begin an Olympiad problem, a human had to turn the question into Lean. The machine is brilliant at answering and still mute at the harder art of deciding what the question is and stating it exactly.

The taste problem. Search prunes a tree. It does not decide which tree was worth planting. Tao is careful about this, and so am I. The central idea still has to come from a person. The machine multiplies your spark. It does not strike it.

The scoreboard problem. Almost all of these tests have tidy answers a computer can mark. Real mathematics is mostly about finding the right statement and an honest proof, and there is no answer key for that. A third of that hardest benchmark turned out to be wrong, and the people who built the models are the same people keeping score.

The Clever Hans problem. When you change the surface of a problem and a model that "understood" it suddenly falls over, you have to wonder whether it understood anything, or whether it had simply seen ten thousand cousins of that problem in training. I run a company built on these systems, and I will say it plainly. We do not always know.

xkcd · CC BY-NC An xkcd comic in which one person describes a machine learning system as a big pile of linear algebra that you pour data into and collect answers from, and when asked what to do if the answers are wrong, says to stir the pile until they start looking right.
You pour the data into the big pile of linear algebra, then stir until the answers start looking right. It is funny because it is not entirely a joke, and everyone building these systems knows it. (xkcd 1838)

You can pour almost unlimited compute into the search now, and it climbs. The whole thing still collapses to nothing the moment the spark is missing, and the spark is the one part we cannot yet manufacture. No amount of searching will ever tell you what is worth searching for.

What I actually do with this

I am not a mathematician, and I am never going to be. But I run a company that builds with machine intelligence, and watching all of this happen quietly rearranged how I think about the work.

I stopped paying for the computation. The scarce thing now is not grinding out an answer, it is stating the real problem sharply enough that a machine can run straight at it. On my team, the person who can say precisely what we are actually trying to solve is worth more than the person who can solve a fuzzy version of it quickly.

I manage the machine instead of racing it. Tao's mode is to guide it. Try this, now try that, while the thing does the labor. That is a judgment skill, not a calculator skill, and judgment about which path smells promising is exactly the human part that has not automated.

I trust the compiler over the referee. The deepest lesson of formalization is not technical, it is cultural. Stop leaning on tired humans to catch the mistakes. Build things that check themselves. I think about that every time we ship something where being almost right and being wrong are the same outcome.

I try to stay close to the spark. The part of the work that is most mine is precisely the part the machine most obviously lacks, so that is where I try to spend whatever attention I have left at the end of a long week.

· · ·

There is a word in Kashmir Shaivism, Spanda, for the first creative pulse. The throb before anything takes shape, the flicker of "let there be" that comes before the made thing. For years I read it as poetry. Lately I read it as something much closer to a spec.

Because we have now built a machine that has everything downstream of that pulse. It can search and prove and check and certify, faster and more tirelessly than any room full of people. And it sits there, genuinely brilliant, waiting for a person to hand it the one thing it does not have. The small, private conviction that this question, out of all the questions in the world, is the one worth caring about.

For two thousand years we could not see that pulse on its own, because it was always tangled up with the labor that came after it. The machines pulled the labor away and left the pulse sitting there in the open, by itself, for the first time. I told you I found my ceiling at twenty. I did not expect that watching a machine reach it would make the ceiling look less like a wall and more like a window. I find that comforting on the good days and genuinely strange on the others, and I have decided I do not have to settle which.

The machine can climb any mountain we point it at. It still cannot feel which mountain is worth the climb. That feeling, not the proof, was always the mathematics.
If this resonated, the adjacent essays are Mathematics as the Language of the Universe and The Bottleneck Moved.