OpenAI o3 Outperforms GPT-5 in Complex Office Tasks Benchmark
A new benchmark reveals OpenAI's older o3 model beats the newer GPT-5 in multi-day office workflows, highlighting challenges in AI agent coordination.
A new benchmark called OdysseyBench, developed by researchers at Microsoft and the University of Edinburgh, has revealed surprising results: OpenAI's older o3 model consistently outperforms the newer GPT-5 in complex, multi-day office workflows. The benchmark evaluates AI agents on 602 tasks across Word, Excel, PDF, email, and calendar applications, split into two categories:
- OdysseyBench+: 300 realistic tasks from OfficeBench.
- OdysseyBench-Neo: 302 hand-crafted, highly complex scenarios.
OdysseyBench includes both simple, single-step tasks and complex, long-term office workflows. | Image: Wang et al.
Key Findings
- On OdysseyBench-Neo, o3 achieved a 61.26% success rate, compared to GPT-5's 55.96% and GPT-5-chat's 57.62%.
- For tasks requiring three applications at once, o3 scored 59.06%, while GPT-5 managed only 53.80%.
- On OdysseyBench+, o3 led with 56.2%, outperforming GPT-5 (54.0%) and GPT-5-chat (40.3%).
OpenAI's o3 reasoning model leads the field on OdysseyBench. | Image: Wang et al.
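To make the size of o3's lead easier to compare across the two splits, the reported success rates can be turned into percentage-point margins. The snippet below is a minimal sketch that uses only the figures quoted above; the `RESULTS` table and the `lead_over` helper are illustrative names, not part of OdysseyBench or its codebase.

```python
# Success rates (%) exactly as quoted in this article; nothing is re-measured.
RESULTS = {
    "OdysseyBench-Neo": {"o3": 61.26, "GPT-5": 55.96, "GPT-5-chat": 57.62},
    "OdysseyBench+": {"o3": 56.2, "GPT-5": 54.0, "GPT-5-chat": 40.3},
}


def lead_over(split, leader="o3"):
    """Return the leader's margin over every other model, in percentage points."""
    scores = RESULTS[split]
    return {
        model: round(scores[leader] - score, 2)
        for model, score in scores.items()
        if model != leader
    }


if __name__ == "__main__":
    for split in RESULTS:
        print(split, lead_over(split))
```

On these numbers, o3 leads GPT-5 by 5.3 points and GPT-5-chat by 3.64 points on OdysseyBench-Neo, and by 2.2 and 15.9 points respectively on OdysseyBench+.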
Why o3 Outperforms GPT-5
The researchers suggest that GPT-5-chat performs better than GPT-5 on OdysseyBench-Neo due to its focus on dialog-based assistance, while GPT-5 does better in fragmented, non-conversational scenarios. However, the paper does not specify the reasoning settings used for GPT-5, so the comparison with o3 leaves room for further evaluation.
Challenges in AI Workflow Automation
AI agents still make recurring errors:
- Overlooking critical files or steps.
- Using tools incorrectly or in the wrong order (e.g., creating PDFs before generating text).
- Failing at multi-stage planning across different applications and timeframes.
The findings are particularly relevant as OpenAI aims to develop AI agents capable of prolonged problem-solving. OdysseyBench could become a key benchmark for these long-horizon systems.
For more details, check out the OdysseyBench GitHub repository or the full research paper.
About the Author

Michael Rodriguez
AI Technology Journalist
Veteran technology journalist with 12 years of focus on AI industry reporting. Former AI section editor at TechCrunch, now freelance writer contributing in-depth AI industry analysis to renowned media outlets like Wired and The Verge. Has keen insights into AI startups and emerging technology trends.