OpenAI's latest reasoning AI models show increased hallucination rates
OpenAI's new reasoning AI models demonstrate improved capabilities but exhibit higher hallucination rates than previous versions, according to benchmark data.
OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects, particularly in coding and math-related tasks. However, these models also exhibit higher hallucination rates—making up information—compared to OpenAI’s older models, including previous reasoning models like o1 and o3-mini.
Benchmark Results Reveal Higher Hallucination Rates
According to OpenAI’s internal tests, the o3 model hallucinated in response to 33% of questions on PersonQA, its in-house benchmark for measuring factual accuracy about people. This is roughly double the hallucination rate of its predecessors, o1 (16%) and o3-mini (14.8%). The o4-mini performed even worse, hallucinating 48% of the time.
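To make the metric concrete, here is a toy sketch of how a rate like those above could be computed over a QA benchmark. The dataset format and the `grade_answer` heuristic are illustrative assumptions; OpenAI has not published PersonQA's exact grading pipeline.

```python
# Toy illustration of computing a hallucination rate on a QA benchmark.
# The data format and grade_answer() heuristic are assumptions for this
# sketch; they do not reflect OpenAI's actual PersonQA grading pipeline.

def grade_answer(model_answer: str, accepted_answers: list[str]) -> str:
    """Return 'correct' if an accepted answer appears in the model output,
    'abstain' if the model declined to answer, else 'hallucination'."""
    normalized = model_answer.lower()
    if any(ans.lower() in normalized for ans in accepted_answers):
        return "correct"
    if "i don't know" in normalized or "not sure" in normalized:
        return "abstain"
    return "hallucination"

def hallucination_rate(results: list[tuple[str, list[str]]]) -> float:
    """results: (model_answer, accepted_answers) pairs, one per question."""
    grades = [grade_answer(answer, gold) for answer, gold in results]
    return grades.count("hallucination") / len(grades)

# Example run: one correct answer, one fabrication, one abstention -> ~33%
sample = [
    ("Marie Curie won two Nobel Prizes.", ["two Nobel Prizes"]),
    ("She was born in 1900.", ["1867"]),          # fabricated fact
    ("I don't know.", ["1867"]),                  # abstention, not counted
    ]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")  # 33%
```

Note that under a scheme like this, abstaining lowers the hallucination rate, which is why a model that answers more questions overall can score worse on this metric even while getting more answers right.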
Third-party testing by Transluce, a nonprofit AI research lab, corroborated these findings. Transluce observed instances where o3 fabricated actions, such as claiming to run code on a 2021 MacBook Pro "outside of ChatGPT"—a capability it does not possess.
Why Are Hallucinations Increasing?
OpenAI admits it doesn’t fully understand why hallucinations are worsening with these newer reasoning models. In its technical report, the company notes that "more research is needed" to determine the cause. One hypothesis, proposed by Transluce researcher Neil Chowdhury, suggests that the reinforcement learning techniques used for these models may amplify issues typically mitigated in traditional AI training pipelines.
Practical Implications
While o3 has shown promise in coding workflows—earning praise from Stanford adjunct professor Kian Katanforoosh—its tendency to hallucinate broken website links and other inaccuracies raises concerns. For industries where precision is critical, such as legal or medical fields, these hallucinations could pose significant risks.
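Since fabricated URLs are one of the concrete failure modes reported here, a simple guardrail is to verify every link a model emits before surfacing it. The sketch below is one illustrative approach using Python's standard `re` module and the third-party `requests` library; the URL regex and timeout are rough choices, not a production-grade validator.

```python
# Hedged sketch: flag possibly hallucinated links by checking each URL in a
# model's output with an HTTP HEAD request. Requires the third-party
# 'requests' library; the regex and timeout are illustrative choices.
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def find_broken_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return URLs in model_output that fail to resolve or return >= 400."""
    broken = []
    for url in URL_PATTERN.findall(model_output):
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)  # DNS failure, timeout, refused connection
    return broken

answer = "See the docs at https://example.com/definitely-not-a-real-page"
print(find_broken_links(answer))
```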
Potential Solutions
One potential remedy is integrating web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another accuracy benchmark. However, this approach requires exposing user prompts to third-party search providers, which may not always be feasible.
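As a rough illustration of that trade-off, the sketch below grounds an answer in retrieved snippets before generation; the search call is exactly the step that shares the user's prompt with an outside provider. `search_web` is a hypothetical stand-in for whatever search API is used, and the model name and prompt wording are examples, not OpenAI's grounding implementation.

```python
# Sketch of search-grounded answering. search_web() is a hypothetical
# stand-in for a real search API -- the call that exposes user prompts to a
# third-party provider, as noted above. Requires the 'openai' package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> list[str]:
    """Hypothetical helper: return text snippets from a search provider."""
    raise NotImplementedError("wire this to a search API of your choice")

def grounded_answer(question: str) -> str:
    snippets = "\n".join(search_web(question))
    messages = [
        {"role": "system",
         "content": "Answer using only the provided search results. "
                    "If they are insufficient, say you don't know."},
        {"role": "user",
         "content": f"Search results:\n{snippets}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```

Constraining the model to retrieved evidence trades some flexibility for verifiability, which is the same bargain the SimpleQA numbers above reflect.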
The Broader AI Landscape
The AI industry has increasingly shifted its focus to reasoning models after traditional scaling approaches began showing diminishing returns. Reasoning models can improve performance on many tasks without demanding massive amounts of compute and data during training, but the trade-off appears to be higher hallucination rates.
"Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability," OpenAI spokesperson Niko Felix said.
As AI models continue to evolve, balancing capability gains against accuracy and reliability will be a critical challenge for developers and businesses alike.
Related News
Lenovo Wins Frost & Sullivan 2025 Asia-Pacific AI Services Leadership Award
Lenovo earns Frost & Sullivan's 2025 Asia-Pacific AI Services Customer Value Leadership Recognition for its value-driven innovation and real-world AI impact.
Baidu Wenku GenFlow 2.0 Revolutionizes AI Agents with Multi-Agent Architecture
Baidu Wenku's GenFlow 2.0 introduces a multi-agent system for parallel task processing, integrating with Cangzhou OS to enhance efficiency and redefine AI workflows.
About the Author

Dr. Emily Wang
AI Product Strategy Expert
Former Google AI Product Manager with 10 years of experience in AI product development and strategy. She has led multiple AI products through the zero-to-one development process and now provides product strategy consulting for AI startups while writing AI product analysis articles for various tech media outlets.