OpenAI's latest reasoning AI models show increased hallucination rates
OpenAI's new reasoning AI models demonstrate improved capabilities but exhibit higher hallucination rates than previous versions, according to benchmark data.
OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects, particularly in coding and math-related tasks. However, these models also exhibit higher hallucination rates—making up information—compared to OpenAI’s older models, including previous reasoning models like o1 and o3-mini.
Benchmark Results Reveal Higher Hallucination Rates
According to OpenAI’s internal tests, the o3 model hallucinated in response to 33% of questions on PersonQA, its in-house benchmark for measuring factual accuracy about people. This is roughly double the hallucination rate of its predecessors, o1 (16%) and o3-mini (14.8%). The o4-mini performed even worse, hallucinating 48% of the time.
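To make the metric concrete, here is a toy sketch of how a rate like those above could be computed over a QA benchmark. The dataset format and the `grade_answer` heuristic are illustrative assumptions; OpenAI has not published PersonQA's exact grading pipeline.

```python
# Toy illustration of computing a hallucination rate on a QA benchmark.
# The data format and grade_answer() heuristic are assumptions for this
# sketch; they do not reflect OpenAI's actual PersonQA grading pipeline.

def grade_answer(model_answer: str, accepted_answers: list[str]) -> str:
    """Return 'correct' if an accepted answer appears in the model output,
    'abstain' if the model declined to answer, else 'hallucination'."""
    normalized = model_answer.lower()
    if any(ans.lower() in normalized for ans in accepted_answers):
        return "correct"
    if "i don't know" in normalized or "not sure" in normalized:
        return "abstain"
    return "hallucination"

def hallucination_rate(results: list[tuple[str, list[str]]]) -> float:
    """results: (model_answer, accepted_answers) pairs, one per question."""
    grades = [grade_answer(answer, gold) for answer, gold in results]
    return grades.count("hallucination") / len(grades)

# Example run: one correct answer, one fabrication, one abstention -> ~33%
sample = [
    ("Marie Curie won two Nobel Prizes.", ["two Nobel Prizes"]),
    ("She was born in 1900.", ["1867"]),          # fabricated fact
    ("I don't know.", ["1867"]),                  # abstention, not counted
    ]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")  # 33%
```

Note that under a scheme like this, abstaining lowers the hallucination rate, which is why a model that answers more questions overall can score worse on this metric even while getting more answers right.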
Third-party testing by Transluce, a nonprofit AI research lab, corroborated these findings. Transluce observed instances where o3 fabricated actions, such as claiming to run code on a 2021 MacBook Pro "outside of ChatGPT"—a capability it does not possess.
Why Are Hallucinations Increasing?
OpenAI admits it doesn’t fully understand why hallucinations are worsening with these newer reasoning models. In its technical report, the company notes that "more research is needed" to determine the cause. One hypothesis, proposed by Transluce researcher Neil Chowdhury, suggests that the reinforcement learning techniques used for these models may amplify issues typically mitigated in traditional AI training pipelines.
Practical Implications
While o3 has shown promise in coding workflows—earning praise from Stanford adjunct professor Kian Katanforoosh—its tendency to hallucinate broken website links and other inaccuracies raises concerns. For industries where precision is critical, such as legal or medical fields, these hallucinations could pose significant risks.
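Since fabricated URLs are one of the concrete failure modes reported here, a simple guardrail is to verify every link a model emits before surfacing it. The sketch below is one illustrative approach using Python's standard `re` module and the third-party `requests` library; the URL regex and timeout are rough choices, not a production-grade validator.

```python
# Hedged sketch: flag possibly hallucinated links by checking each URL in a
# model's output with an HTTP HEAD request. Requires the third-party
# 'requests' library; the regex and timeout are illustrative choices.
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def find_broken_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return URLs in model_output that fail to resolve or return >= 400."""
    broken = []
    for url in URL_PATTERN.findall(model_output):
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)  # DNS failure, timeout, refused connection
    return broken

answer = "See the docs at https://example.com/definitely-not-a-real-page"
print(find_broken_links(answer))
```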
Potential Solutions
One potential remedy is integrating web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another accuracy benchmark. However, this approach requires exposing user prompts to third-party search providers, which may not always be feasible.
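As a rough illustration of that trade-off, the sketch below grounds an answer in retrieved snippets before generation; the search call is exactly the step that shares the user's prompt with an outside provider. `search_web` is a hypothetical stand-in for whatever search API is used, and the model name and prompt wording are examples, not OpenAI's grounding implementation.

```python
# Sketch of search-grounded answering. search_web() is a hypothetical
# stand-in for a real search API -- the call that exposes user prompts to a
# third-party provider, as noted above. Requires the 'openai' package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> list[str]:
    """Hypothetical helper: return text snippets from a search provider."""
    raise NotImplementedError("wire this to a search API of your choice")

def grounded_answer(question: str) -> str:
    snippets = "\n".join(search_web(question))
    messages = [
        {"role": "system",
         "content": "Answer using only the provided search results. "
                    "If they are insufficient, say you don't know."},
        {"role": "user",
         "content": f"Search results:\n{snippets}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```

Constraining the model to retrieved evidence trades some flexibility for verifiability, which is the same bargain the SimpleQA numbers above reflect.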
The Broader AI Landscape
The AI industry has increasingly shifted its focus to reasoning models after traditional scaling approaches began showing diminishing returns. Reasoning models can improve performance on many tasks without demanding massive amounts of compute and data during training, but the trade-off appears to be higher hallucination rates.
"Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability," OpenAI spokesperson Niko Felix said.
As AI models continue to evolve, balancing capability gains against accuracy and reliability will be a critical challenge for developers and businesses alike.
Related News
Lenovo Wins Frost & Sullivan 2025 Asia-Pacific AI Services Leadership Award
Lenovo earns Frost & Sullivan's 2025 Asia-Pacific AI Services Customer Value Leadership Recognition for its value-driven innovation and real-world AI impact.
Baidu Wenku GenFlow 2.0 Revolutionizes AI Agents with Multi-Agent Architecture
Baidu Wenku's GenFlow 2.0 introduces a multi-agent system for parallel task processing, integrating with Cangzhou OS to enhance efficiency and redefine AI workflows.
About the Author

Dr. Emily Wang
AI Product Strategy Expert
Former Google AI Product Manager with 10 years of experience in AI product development and strategy. She has led multiple AI products through the zero-to-one development process and now provides product strategy consulting for AI startups while writing AI product analysis articles for various tech media outlets.