A new academic review suggests that AI benchmarks are flawed and could lead companies to make high-stakes decisions based on “misleading” data.
Business leaders are committing eight- and nine-figure budgets to generative AI programs, and these procurement and development decisions often rely on public leaderboards and benchmarks to compare model capabilities.
The large-scale study, “Measuring What Matters: Construct Validity in Large Language Model Benchmarks,” analyzed 445 individual LLM benchmarks from major AI conferences. The team of 29 expert reviewers found that “almost every paper had weaknesses in at least one area,” undermining the claims those benchmarks make about model performance.
For CTOs and chief data officers, this finding strikes at the heart of AI governance and investment strategy. If benchmarks that claim to measure “safety” or “robustness” don’t actually capture those qualities, organizations may deploy models that expose them to serious financial and reputational risk.
“Construct validity” issues
The researchers focused on a core scientific principle known as construct validity. Simply put, this is how well a test measures the abstract concept it purports to measure.
For example, “intelligence” cannot be measured directly, so tests are designed to serve as measurable proxies for it. The paper states that if a benchmark has low construct validity, “high scores may be irrelevant or even misleading.”
This problem is widespread in AI evaluation. The study found that key concepts were often “poorly defined or operationalized.” This can lead to “poorly supported scientific claims, misguided research, and policy influences that are not based on solid evidence.”
When vendors compete for enterprise contracts by touting top benchmark scores, leaders are effectively trusting those scores to be reliable indicators of real-world business performance. The new research suggests that trust may be misplaced.
Where enterprise AI benchmarks are failing
The review identified flaws throughout the system, from how benchmarks were designed to how results were reported.
Ambiguous or contested definitions: You can’t measure what you can’t define. The study found that even where a definition of the target phenomenon was provided, 47.8% of those definitions were contested, covering concepts with “many possible definitions or no clear definition at all.”
The paper cites “harmlessness”, a key goal in enterprise AI safety work, as an example of a phenomenon that often lacks a clear, agreed-upon definition. If two vendors score differently on a “harmlessness” benchmark, the gap may simply reflect two different arbitrary definitions of the term rather than a genuine difference in the safety of their models.
Lack of statistical rigor: Perhaps most concerning for data-driven organizations, the review found that only 16% of the 445 benchmarks used uncertainty estimates or statistical tests to compare model results.
Without statistical analysis, it is impossible to know whether Model A’s 2% lead over Model B reflects a genuine difference in capability or simply random chance. Yet corporate decisions are being made on the basis of numbers that would not pass a basic scientific or business-intelligence review.
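As a rough illustration of the kind of check the reviewers found missing, a paired bootstrap over per-item results can show whether a small accuracy gap survives resampling. This is a minimal sketch rather than a method from the paper; the pass/fail lists and the 200-item example are invented for illustration:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Rough one-sided p-value: how often Model A fails to beat Model B
    when the same evaluation items are resampled with replacement."""
    assert len(scores_a) == len(scores_b), "both models must be scored on the same items"
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample item indices
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical per-item results (1 = correct, 0 = wrong) for 200 questions:
# the models agree on 156 items and disagree on 44, giving A 62% and B 60%.
model_a = [1] * 100 + [0] * 56 + [1] * 24 + [0] * 20
model_b = [1] * 100 + [0] * 56 + [0] * 24 + [1] * 20
print(f"Chance the 2-point lead is resampling noise: {paired_bootstrap(model_a, model_b):.0%}")
```

With a disagreement pattern like this, a substantial fraction of resamples show Model B matching or beating Model A, so the apparent two-point lead would not pass a conventional significance test. That is exactly the uncertainty a single leaderboard number hides.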
Data contamination and memorization: Many benchmarks, especially those that test reasoning (such as the widely used GSM8K), suffer when their questions and answers appear in a model’s pre-training data.
When that happens, the model is not reasoning its way to the answer; it is recalling it. A high score may indicate good memorization rather than the advanced reasoning ability companies actually need for complex tasks. The paper warns that this “undermines the validity of the results” and recommends building contamination checks directly into benchmarks.
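The paper does not spell out how those checks should be implemented, but a simple, common approach is to screen benchmark items for long n-gram overlaps with a sample of the pre-training corpus. The sketch below is an assumption rather than the authors’ method; the whitespace tokenization and 13-token window are arbitrary illustrative choices:

```python
def ngrams(text, n=13):
    """All n-token windows in the text, lower-cased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=13):
    """Fraction of benchmark items sharing at least one n-gram with the corpus sample."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(benchmark_items), 1)

# Usage: a high rate suggests headline scores may reflect memorization, not reasoning.
# rate = contamination_rate(benchmark_questions, sample_of_pretraining_documents)
```

A check like this only catches verbatim or near-verbatim leakage; paraphrased contamination requires fuzzier matching, which is harder to bolt on after the fact.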
Non-representative datasets: The study found that 27% of benchmarks relied on “convenience sampling”, such as reusing data from existing benchmarks or human exams. Such data often fails to represent the real-world phenomenon a benchmark claims to measure.
For example, the authors point out that reusing questions from a calculator-free exam means the numbers were chosen to make basic arithmetic easy. A model may score well on such a test, but that score “does not predict performance at larger numbers, where LLMs struggle.” This creates significant blind spots and hides known model weaknesses.
From public leaderboards to internal validation
For business leaders, the study is a stark warning that public AI benchmarks are no substitute for internal, domain-specific evaluation. A high score on a public leaderboard does not guarantee fitness for a specific business purpose.
Isabella Grandi, Director of Data Strategy and Governance at NTT Data UK&I, commented: “A single benchmark may not be the right way to capture the complexity of AI systems, and expecting it to do so risks reducing progress to a numbers game rather than a measure of real-world responsibility. What matters most is consistent assessment against clear principles that ensure technology not only progresses but also serves people.”
“Good methodology, as defined in ISO/IEC 42001:2023, reflects this balance through five core principles: accountability, fairness, transparency, security, and redress. Accountability establishes ownership and responsibility for the AI systems an organization deploys, keeping them ethical and answerable. Security and privacy are non-negotiable: they prevent abuse and strengthen public trust. Redress and appeal provide an important mechanism for oversight, allowing people to challenge and correct outcomes when necessary.
“True progress in AI will depend on collaboration that brings together the vision of governments, the curiosity of academia, and the practical drive of industry. When partnerships are underpinned by open dialogue and common standards, they build the transparency needed to instill public trust in AI systems. Responsible innovation will always depend on collaboration that strengthens oversight while maintaining ambition.”
The paper’s eight recommendations offer a practical checklist for companies looking to build internal AI benchmarks and evaluations that align with such a principles-based approach. Among them:
- Define the phenomenon. Before testing a model, an organization must first create a “precise and operational definition of the phenomenon being measured.” What does a “useful” response mean from a customer service perspective? What does “accurate” mean for financial reporting?
- Build a representative dataset. The most valuable benchmarks are those built from your own data. The paper advises developers to “build representative datasets appropriate to the task.” This means using task items that reflect real-world scenarios, formats, and challenges faced by employees and customers.
- Perform error analysis. Go beyond the headline score. The report recommends that teams “conduct qualitative and quantitative analysis of common failure modes.” Understanding why a model fails is more useful than simply knowing its score: if failures cluster in low-priority, obscure topics, they may be acceptable; if they hit the most common, high-value use cases, that single number is meaningless (see the sketch after this list).
- Justify validity. Finally, the team must “justify the relevance of benchmark phenomena to real-world applications.” Every evaluation should be accompanied by a clear rationale explaining why that particular test is a valid proxy for business value.
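As a concrete example of the failure-mode breakdown described above, the following minimal sketch tallies evaluation failures by use-case category and weights them by business priority. The categories, weights, and records are entirely hypothetical:

```python
from collections import Counter

# Each record: (use_case_category, business_priority_weight, passed_evaluation)
results = [
    ("password_reset", 5, True),
    ("billing_dispute", 5, False),
    ("billing_dispute", 5, False),
    ("legacy_product_faq", 1, False),
    ("order_tracking", 3, True),
]

# Which kinds of tasks is the model failing on?
failures_by_category = Counter(cat for cat, _, passed in results if not passed)

# A failure rate weighted by how much each use case matters to the business.
weighted_failure_rate = (
    sum(weight for _, weight, passed in results if not passed)
    / max(sum(weight for _, weight, _ in results), 1)
)

print("Failure modes:", failures_by_category.most_common())
print(f"Priority-weighted failure rate: {weighted_failure_rate:.0%}")
```

Two models with identical headline accuracy can look very different through this lens if one concentrates its failures in the high-priority categories.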
The race to adopt generative AI is pushing organizations to move faster than their governance frameworks can keep pace. This report shows that the very tools used to measure that progress are often flawed. The only sure path forward is to stop relying solely on popular AI benchmarks and start “measuring what matters” to your own business.
See also: OpenAI spreads $600 billion bet on cloud AI across AWS, Oracle, and Microsoft


