OpenAI's o3 AI model underperforms initial claims in independent benchmark tests
Independent benchmark results for OpenAI's o3 AI model show significantly lower performance than the company's earlier claims, sparking concerns about transparency in AI testing practices.
A discrepancy has emerged between OpenAI's initial claims about its o3 AI model's performance and independent benchmark results, raising questions about the company's transparency in model testing.
Initial Claims vs Reality: When OpenAI unveiled o3 in December, the company claimed it could answer over 25% of questions on the challenging FrontierMath benchmark. However, independent tests by research institute Epoch AI showed the public version of o3 scored only around 10%.
Possible Explanations: OpenAI's Mark Chen had stated during a livestream that the 25% score was achieved with "aggressive test-time compute settings." Epoch AI noted the difference might be due to:
- Different testing setups
- More powerful internal scaffolding at OpenAI
- Different versions of FrontierMath used
Model Optimization: OpenAI's Wenda Zhou explained in another livestream that the public o3 was "more optimized for real-world use cases" and speed rather than benchmark performance.
Industry Context: This incident follows a pattern of benchmark controversies in AI:
- Epoch AI was previously criticized for delayed disclosure of OpenAI funding
- xAI was accused of publishing misleading benchmark results for Grok 3
- Meta admitted to using different model versions for benchmarking vs release
Moving Forward: While OpenAI plans to release a more powerful o3-pro variant soon, this incident highlights the need for standardized, transparent benchmarking practices in the AI industry.
"Benchmark 'controversies' are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models."
This report underscores the challenges in evaluating AI model performance and the importance of independent verification of company claims.
About the Author

Dr. Emily Wang
AI Product Strategy Expert
Former Google AI Product Manager with 10 years of experience in AI product development and strategy. She has led multiple AI products from zero to launch, and now provides product strategy consulting for AI startups while writing AI product analyses for various tech media outlets.