GPT-5.4 solves its first open math problem from FrontierMath benchmark

Recent logs from the evaluation of GPT-5.4 Pro show a shift in how large language models handle complex, unsolved mathematics. In a specific instance involving the FrontierMath benchmark—a collection of hundreds of original, research-level problems—the model successfully addressed an open problem related to hypergraph construction. This result represents a transition from AI solving student-level competitions to contributing to the discovery of new mathematical bounds.

FrontierMath, developed by Epoch AI in collaboration with IMO gold medalists and Fields Medalists, was designed to be "leak-proof." Unlike MATH or GSM8K datasets, where models often rely on pattern matching from training data, FrontierMath problems are unpublished and require hours or days of computational reasoning. The problem GPT-5.4 solved involves the construction of a specific hypergraph and finding the lower bound for a sequence denoted as H(n).

Why FrontierMath Tier 4 challenges stopped previous models like o1

Earlier models such as OpenAI's o1 and o3 showed high proficiency in traditional competition math, but their performance dropped significantly when faced with "Tier 4" problems in the FrontierMath set. These problems are categorized as being at the level of active mathematical research. In late 2024, o1 scored roughly 2% on this dataset. By late 2025, o3 had improved this to 25%, but solving a truly "open" problem, one where the answer was not known to the evaluators, remained out of reach.

The difficulty lies in the complexity of the logical chain. Standard reasoning models often hallucinate intermediate steps when a proof requires more than a few dozen transitions. FrontierMath demands thousands of logical steps, often requiring the model to bridge disparate fields such as combinatorics and algorithmic complexity. GPT-5.4 navigated this by moving away from pure text-based reasoning and instead utilizing an integrated environment for code execution and formal verification.

The mechanics of GPT-5.4's solution using Python and Lean

A critical technical takeaway from the GPT-5.4 logs is the model's reliance on "brute-force" algorithmic discovery combined with formal proof languages. Instead of attempting to write a symbolic proof in English, the model followed a three-step workflow:

  1. Algorithmic Construction: The model wrote several Python scripts to generate various hypergraph configurations. This allowed it to test thousands of potential constructions in a matter of seconds.

  2. Iterative Refinement: When initial scripts failed to meet the required lower bound for H(n), the model analyzed the failures and adjusted the construction parameters.

  3. Formalization: Once a candidate construction was found via Python, the model used Lean, a formal verification language, to describe the proof. This ensured that the logical foundation of the construction was sound and could be verified by an automated kernel.
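The first two steps of this workflow can be sketched as a minimal search loop. This is illustrative only: the actual definition of H(n) and the construction family used by GPT-5.4 are not public, so `candidate_hypergraphs` and `score` below are hypothetical stand-ins.

```python
import itertools

def candidate_hypergraphs(n, max_edges):
    """Enumerate small 3-uniform hypergraphs on n vertices (stand-in
    for whatever construction family the model actually searched)."""
    triples = list(itertools.combinations(range(n), 3))
    for k in range(1, max_edges + 1):
        yield from itertools.combinations(triples, k)

def score(hypergraph):
    """Placeholder objective; here, simply the number of edges."""
    return len(hypergraph)

def search(n, max_edges, target):
    """Steps 1-2: generate constructions, refine by keeping the best,
    and stop at the first one meeting the target bound."""
    best = None
    for hg in candidate_hypergraphs(n, max_edges):
        if best is None or score(hg) > score(best):
            best = hg
        if score(hg) >= target:
            return hg  # candidate found; hand off to Lean (step 3)
    return best  # otherwise return the best construction seen

result = search(n=5, max_edges=2, target=2)
print(len(result))
```

In this sketch, the "iterative refinement" is just keeping the best candidate seen so far; the logs suggest the real process involved analyzing failures and rewriting the generating scripts themselves.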

This method circumvents the "reasoning fatigue" seen in earlier LLMs. By offloading the calculation to Python and the verification to Lean, GPT-5.4 acted more like a research lead managing a set of specialized tools rather than a student trying to solve a problem in their head.

From pattern matching to brute-force algorithm discovery

The problem in question concerned a specific bound in hypergraph theory. Previously, mathematicians had established that n·log₂(n) was a lower bound, but many suspected that n·ln(n) was the actual limit. GPT-5.4 identified a way to generalize an existing construction found in a 2022 research paper (NSF-PAR 10338368).

By extending this existing logic, the model provided a construction that matched the suspected bound. This was not a "creative spark" in the human sense but a highly efficient search through the space of possible mathematical extensions. The model recognized that the 2022 paper's methodology could be applied to a higher-dimensional case if certain conditions were met, and it wrote the code to prove it.

Verifying the n·log₂(n) hypergraph lower bound

One of the most notable parts of the solution was the model’s ability to handle the "bashy" or computational aspects of the proof. In high-level mathematics, many problems reach a point where a human mathematician can see a potential path but lacks the time to manually verify thousands of permutations. GPT-5.4 filled this gap.

It generated a bash script to run its Python verification suite, effectively automating the "tedious" part of research. Greg Burnham of Epoch AI noted that while the solution appears correct, the final verification lies with the original problem author. This highlights a new reality: AI is now generating math that requires a specific type of expert review—checking the model’s code and its Lean formalization rather than just its final answer.
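The automation described above can be approximated in a few lines. This is a hedged sketch: a Python driver playing the role of the bash script, running a verification command once per parameter and collecting pass/fail results. The checker command below is a stand-in, not the model's actual suite.

```python
import subprocess
import sys

def run_suite(checker_cmd, params):
    """Run `checker_cmd n` for each parameter n and record whether
    the checker exited cleanly (exit code 0 = verification passed)."""
    results = {}
    for n in params:
        proc = subprocess.run(checker_cmd + [str(n)], capture_output=True)
        results[n] = proc.returncode == 0
    return results

# Stand-in checker: exits 0 (pass) when n is even, 1 (fail) otherwise.
checker = [sys.executable, "-c",
           "import sys; sys.exit(int(sys.argv[1]) % 2)"]
print(run_suite(checker, [2, 3, 4]))
```

The design point is the feedback loop: each failed parameter tells the driver exactly where the construction breaks, which is the "tedious" permutation-checking work the article describes.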

Real-world implications for AI-driven mathematical research

This event changes the definition of "progress" in AI. We are no longer measuring how well a model can repeat a calculus textbook. We are measuring its ability to navigate the frontier of unknown information. For users in technical fields, the "expert" approach to using these models is now clear: provide the AI with the ability to write and run its own verification code.

The success of GPT-5.4 in this instance suggests that the bottleneck for AI in science was never just "intelligence," but the lack of a feedback loop. When the model can verify its own attempts through code, it can correct its own errors without human intervention. This makes the model an active participant in research rather than a passive reference tool.

If these results hold across the remaining 13 open problems in the FrontierMath subset, the timeline for AGI in specialized scientific domains may be shorter than previously estimated. The focus for research institutions will likely shift toward creating more "verifiable" environments like FrontierMath, where the answer is unknown, but the criteria for a correct answer are strictly defined.

FAQ

What is the FrontierMath benchmark?

FrontierMath is a high-difficulty mathematics dataset created by Epoch AI. It consists of 350 original problems designed to test AI on research-level mathematics that cannot be found in existing training sets.

How did GPT-5.4 solve an open problem?

The model used a combination of Python for generating constructions and Lean for formal verification. It iteratively tested different algorithmic approaches until it found a hypergraph construction that met the requirements of the problem.

Is GPT-5.4’s math solution officially verified?

While the solution has passed initial automated checks and internal reviews by Epoch AI researchers, final confirmation typically requires a review by the specific mathematician who authored the problem to ensure no hidden logical flaws exist.

How does GPT-5.4 differ from o1 in math performance?

GPT-5.4 exhibits a higher accuracy rate on Tier 4 problems (research level) compared to o1’s 2% accuracy. It also shows a greater capability for long-horizon tasks, such as writing and executing multiple scripts to find a solution.

Why is Lean important in AI mathematics?

Lean is a formal proof language that allows a computer to verify that every step of a mathematical proof is logically sound. By using Lean, AI models can provide "guaranteed" proofs that do not suffer from the hallucinations common in natural language outputs.
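As a minimal illustration of what "kernel-verified" means, here is a trivial theorem in Lean 4 syntax; the kernel accepts it only if every step is logically sound. This is unrelated to the actual H(n) proof.

```lean
-- The Lean kernel checks this proof mechanically: if any step were
-- unsound, the file would fail to compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```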

What specific math problem did the model solve?

The model solved a problem regarding hypergraph construction and the lower bound of a sequence H(n), successfully extending a known construction from a 2022 research paper to reach a new, higher bound.
