
Yann LeCun Confirmed Meta Fudged Llama 4 Benchmarks. The Scandal Is Bigger Than One Model.

In January 2026, Yann LeCun, Meta's outgoing chief AI scientist and one of the pioneers of modern deep learning, told the Financial Times something that rewrote the history of Llama 4. The benchmark results Meta published when it launched the model in April 2025 were "fudged a little bit." The team, he said, "used different models for different tests."

A different model for each benchmark. Not one model tested fairly across a standard suite. Meta picked whichever Llama 4 variant scored highest on each individual test, compiled those cherry-picked results into a single table, and presented it as though one model had achieved all of them. The public release, according to independent testers, produced significantly lower scores than the claimed numbers.

The confession did not come from a whistleblower or an investigative journalist. It came from the scientist who built Meta's AI research division. The implications reach far beyond one model from one company.

What Happened

Llama 4 launched on April 5, 2025, with two variants: Scout and Maverick. Meta's blog post claimed the models performed "equally well or better" than closed-source competitors from OpenAI and Google. The benchmark tables looked impressive. The community was skeptical within days.

Independent testers running their own evaluations consistently found lower scores than Meta had published. The discrepancies were not marginal. On key reasoning and coding benchmarks, community-run tests showed Llama 4 performing significantly below Meta's claimed levels. On LMSys Chatbot Arena, the most visible public benchmark, Maverick dropped from 2nd place to 32nd. Scout fell out of the top 100 entirely.

Meta's initial response was denial. Ahmad Al-Dahle, a Meta VP, attributed the discrepancies to "cloud differences" between Meta's internal testing environment and public deployment. The explanation satisfied almost no one. The community had already named the problem: Meta was not testing one model. It was testing multiple models and reporting the best score from each.

LeCun's January 2026 interview with the Financial Times confirmed it. "The results were fudged a little bit," he told the FT. The team, which he emphasized he was not leading, "used different versions of the model for specific benchmarks, completely violating the principle of fair evaluation." The method was simple: train multiple checkpoints, run each one on every benchmark, select the highest score per test, and compile the winners into a single table. No single model actually achieved the published results.
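To make the mechanics concrete, here is a hypothetical sketch of per-benchmark cherry-picking. The checkpoint names, benchmark names, and scores below are all invented for illustration, not real Llama 4 numbers:

```python
# Illustration of per-benchmark cherry-picking. All scores are made up.

# Each row: one training checkpoint's scores across a fixed benchmark suite.
checkpoints = {
    "ckpt_a": {"reasoning": 71.2, "coding": 63.5, "math": 58.9},
    "ckpt_b": {"reasoning": 66.4, "coding": 69.1, "math": 55.0},
    "ckpt_c": {"reasoning": 64.8, "coding": 61.7, "math": 67.3},
}

benchmarks = ["reasoning", "coding", "math"]

# Fair reporting: pick ONE checkpoint and publish its whole row.
fair_row = checkpoints["ckpt_a"]

# Cherry-picked reporting: take the best checkpoint per benchmark and
# present the per-test maxima as if a single model achieved them all.
fudged_row = {
    b: max(ckpt[b] for ckpt in checkpoints.values()) for b in benchmarks
}

print("fair:  ", fair_row)
print("fudged:", fudged_row)
# The fudged row beats every honest row on every benchmark, yet no single
# model, and certainly not the released one, ever produced those numbers.
```

The fudged row is strictly better than any real checkpoint's row by construction, which is exactly why independent testers running the released model could not reproduce the published table.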

No one was fired. Meta instead created Superintelligence Labs under Alexandr Wang, the former Scale AI CEO, effectively replacing the AI research leadership that had overseen Llama 4. The organization was split into four divisions: TBD Lab for foundation models under Wang himself, FAIR for research, Products, and Infrastructure. It was the fourth reorganization of Meta's AI effort in six months. LeCun departed to launch his own AI startup. The Llama brand continued, but under new management and with permanently damaged credibility.

Why It Matters

The Llama 4 scandal is not about one model cheating on one set of benchmarks. It is about an entire evaluation economy that incentivizes exactly this behavior.

The AI industry runs on leaderboards. Companies choose models based on benchmark scores. Investors value AI labs based on leaderboard positions. Researchers build careers on benchmark improvements. The incentive to game those benchmarks is structural, not incidental. Meta's sin was not that it fudged. It was that it got caught, and that its own chief scientist confirmed it.

The scale of the leaderboard economy is staggering. One analysis estimated that a $10 billion industry has been built on leaderboard gaming, where model selection strategies are driven by scores that may not reflect real-world performance. When the scores are unreliable, every downstream decision built on them (which model to use, which lab to invest in, which API to integrate) is also unreliable.

Meta was not unique in optimizing for benchmarks; every major AI lab does it. It was unique in getting caught, and in having its most senior researcher publicly admit that the optimization crossed the line into manipulation. The community's response on r/LocalLLaMA was immediate and lasting: trust in official benchmark numbers has been permanently eroded, and independent, community-run evaluations are now the de facto standard for serious model comparison.

And Meta's structural response confirms the systemic nature of the problem. The company did not punish anyone for the manipulation. It created a new organization under new leadership. The message was clear: the Llama 4 team's execution was the problem, not the underlying incentives that produced it. Those incentives remain unchanged. Every AI lab today faces the same pressure Meta faced: publish numbers that beat the competition, or lose funding, talent, and market position to the labs that do.

The Real Problem Is the Benchmark, Not the Model

Benchmarks are not neutral measurements. They are targets. And, as Goodhart's law warns, when a measure becomes a target it ceases to be a good measure: people aim at the test rather than the thing it was meant to track.

The AI industry's benchmark culture has created a perverse cycle. A benchmark is published. Labs optimize their models for that benchmark. Scores improve. The benchmark becomes saturated. A new benchmark is created. The cycle repeats. At every stage, the optimization is real but the generalization is questionable. A model that scores 95% on a reasoning benchmark may still fail on reasoning tasks that look slightly different from the benchmark's distribution.

Llama 4 did not invent this problem. It exposed it. LeCun's confession matters precisely because it came from inside the system. He was not an external critic with an agenda. He was the person responsible for Meta's AI research for more than a decade. And he still said the results were fudged. When the insiders stop believing the benchmarks, the rest of us should stop too.

What would structural reform look like? Independent, third-party evaluation bodies that labs do not control. Dynamic benchmarks that change with each evaluation cycle, making overfitting far harder. Mandatory disclosure of the exact model version used for each published score. And a cultural shift that values real-world reliability over leaderboard position. None of these reforms is technically difficult. All of them are politically difficult, because they threaten the marketing machinery that the AI industry has built around benchmark supremacy.

The community has already begun building alternatives. LMSys Chatbot Arena and similar platforms have gained credibility because they are harder to game than static benchmarks. Head-to-head blind comparisons judged by human preference are replacing automated accuracy metrics. But even arenas have their own manipulation vectors: the Maverick version Meta submitted to the Arena was an experimental, chat-optimized variant, not the model it released. The only evaluation that cannot be gamed is the one you run yourself, on your own data, for your own use case. Everything else is marketing.

What Happens to Meta's AI Credibility

Meta built its AI reputation on open-source leadership. Llama 1, 2, and 3 established the company as the champion of open-weight models, the counterweight to OpenAI and Google's closed ecosystems. Llama 4 was supposed to extend that legacy. Instead, it became the symbol of everything the open-source community distrusts about corporate AI: the benchmarks are rigged, the transparency is selective, and the claims cannot be verified without independent testing.

The Alexandr Wang era represents a direct repudiation of the culture that produced the scandal. Wang, who built Scale AI into a data infrastructure company valued at $14 billion, was hired as Meta's Chief AI Officer in June 2025. Superintelligence Labs is his organization. The message embedded in the four-division structure (TBD Lab for models, FAIR for research, Products for deployment, Infrastructure for compute) is that Meta now treats AI as a product engineering problem rather than a research problem. The era of "ship whatever the researchers produce and hope the benchmarks hold up" is over.

The historical parallel is uncomfortable. Volkswagen was a company known for engineering excellence that got caught systematically manipulating test results. The parallel is imperfect (benchmark gaming is not illegal the way emissions cheating was), but the structural dynamic is identical: optimize for the test, not for the real world, and hope nobody notices. Volkswagen's reputation never fully recovered. Meta's AI credibility may follow the same trajectory.

The open-source community has become an accountability mechanism that the formal evaluation ecosystem never was. r/LocalLLaMA caught the Llama 4 discrepancies within days of release, while the benchmark industry took months to acknowledge the problem. This is a new model for AI evaluation: community-driven, transparent, continuously updated, and accountable to no vendor. It is not perfect. But it is more trustworthy than the numbers published by the labs themselves.

What's Next

Llama 4's successor, expected to emerge from Superintelligence Labs under Wang, will face unprecedented scrutiny. Every benchmark number will be independently verified within hours of release. The community has learned that Meta's numbers cannot be taken at face value. The burden of proof has shifted from the skeptic to the publisher.

The benchmark industry itself is under pressure to reform. Static benchmarks are being replaced by dynamic, continuously updated evaluations. Arena-style human preference rankings are supplementing traditional accuracy metrics. The shift is toward multi-dimensional evaluation: not just "how high is the score" but "how reliable is the model in production, on real tasks, over time."

The unanswered question is the most important one. If Meta, with Yann LeCun as chief scientist, manipulated benchmarks, what is stopping every other lab from doing the same? The answer is nothing. The only constraint is the fear of getting caught. And as LeCun's confession demonstrated, getting caught may cost you a chief scientist and your credibility, but it does not change the incentive structure that made the fudging rational in the first place. The system that produced the Llama 4 scandal is still in place. It is waiting for the next model launch. For developers who want to evaluate models without relying on vendor claims, running your own tests is the only reliable path.
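What does running your own tests look like in practice? Below is a minimal sketch of a private eval harness. Everything in it is an assumption for illustration: query_model is a placeholder you would replace with a call to whatever model you are testing (a local server, any hosted endpoint), and the test cases are a JSONL file of prompts and expected answers that you wrote yourself.

```python
# Minimal private eval harness sketch. query_model is a placeholder;
# supply your own call to the model under test.
import json

def query_model(prompt: str) -> str:
    """Placeholder: replace with a real call to the model under test."""
    raise NotImplementedError("wire this up to your model")

def run_eval(cases_path: str) -> float:
    """Score the model on your own tasks: a JSONL file of
    {"prompt": ..., "expected": ...} records you wrote yourself."""
    passed = total = 0
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            answer = query_model(case["prompt"])
            total += 1
            # Substring grading is crude; swap in whatever check
            # actually matters for your use case.
            passed += case["expected"].strip() in answer
    return passed / total if total else 0.0

# Because the cases never leave your machine, no vendor can train on
# them or tune a special checkpoint against them.
```

The point is not the grading logic, which you should adapt to your task, but the provenance: a test set no lab has seen cannot be overfit, and a score you computed yourself needs no vendor's word.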

FAQ: Common Questions About the Llama 4 Benchmark Scandal

Did Meta admit to cheating?

Yes. Yann LeCun told the Financial Times that Llama 4 benchmark results were "fudged a little bit" and that the team used different model versions for different tests. This is the closest thing to an official admission of benchmark manipulation the AI industry has seen.

Were the publicly released models different from the benchmarked ones?

According to independent testers and LeCun's confirmation, yes. The models used to produce the published scores were not the same as the models released to the public. Community evaluations consistently found lower performance than Meta claimed.

Does this affect other AI companies?

It affects the entire ecosystem. If the most prominent open-source AI company manipulated benchmarks with its chief scientist's knowledge, the credibility of all vendor-published numbers is in question. The scandal has accelerated the shift toward independent, community-run evaluations.

What happened to the Llama 4 team?

Meta created Superintelligence Labs under Alexandr Wang, effectively replacing the AI research leadership. LeCun departed as chief AI scientist to launch his own startup. The Llama brand continues under new management.

Should developers still use Llama models?

The benchmark scandal is about Meta's evaluation practices, not necessarily about Llama 4's actual capabilities. Independent evaluations suggest Llama 4 is capable, just not as capable as Meta's numbers claimed. Run your own evaluations on your specific use case before deciding.

The Llama 4 benchmark scandal will be remembered as the moment the AI industry admitted what everyone suspected: the numbers are not real. Not entirely. Not consistently. The person who confirmed it was not a critic. He was the scientist who built the lab. If the insiders no longer believe the benchmarks, nobody else should. The only numbers that matter are the ones you verify yourself, on your own hardware, with your own data. Everything else is marketing.
