OpenAI o3 Outperforms GPT-5 in Complex Office Tasks Benchmark

A new benchmark called OdysseyBench, developed by researchers at Microsoft and the University of Edinburgh, has revealed surprising results: OpenAI's older o3 model consistently outperforms the newer GPT-5 in complex, multi-day office workflows. The benchmark evaluates AI agents on 602 tasks across Word, Excel, PDF, email, and calendar applications, split into two categories:

OdysseyBench+: 300 realistic tasks from OfficeBench.
OdysseyBench-Neo: 302 hand-crafted, highly complex scenarios.

OdysseyBench example task: OdysseyBench includes both simple, single-step tasks and complex, long-term office workflows. | Image: Wang et al.

Key Findings

On OdysseyBench-Neo, o3 achieved a 61.26% success rate, compared to GPT-5's 55.96% and GPT-5-chat's 57.62%.
For tasks requiring three applications at once, o3 scored 59.06%, while GPT-5 managed only 53.80%.
On OdysseyBench+, o3 led with 56.2%, outperforming GPT-5 (54.0%) and GPT-5-chat (40.3%).
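To make the size of these gaps easier to compare, the short Python sketch below computes o3's lead over each other model in percentage points. It uses only the success rates quoted above. The names `scores`, `lead_over`, and `margins` are illustrative and do not come from the paper or the OdysseyBench code.

```python
# Success rates (percent) exactly as quoted in the findings above.
scores = {
    "OdysseyBench-Neo": {"o3": 61.26, "GPT-5": 55.96, "GPT-5-chat": 57.62},
    "Three-application tasks": {"o3": 59.06, "GPT-5": 53.80},
    "OdysseyBench+": {"o3": 56.2, "GPT-5": 54.0, "GPT-5-chat": 40.3},
}


def lead_over(results, leader="o3"):
    """Return the leader's margin over each other model, in percentage points."""
    return {
        model: round(results[leader] - rate, 2)
        for model, rate in results.items()
        if model != leader
    }


# Hypothetical helper structure: one dict of margins per benchmark split.
margins = {split: lead_over(results) for split, results in scores.items()}

for split, gaps in margins.items():
    for model, gap in gaps.items():
        print(f"{split}: o3 leads {model} by {gap:.2f} points")
```

By this arithmetic, o3's lead over GPT-5 is a little over five points on OdysseyBench-Neo and on the three-application tasks, and about two points on OdysseyBench+. These are differences in percentage points, not relative improvements.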

Performance comparison: OpenAI's o3 reasoning model leads the field on OdysseyBench. | Image: Wang et al.

Why o3 Outperforms GPT-5

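The researchers also compare the two GPT-5 variants. They suggest that GPT-5-chat performs better on OdysseyBench-Neo due to its focus on dialog-based assistance, while GPT-5 does better in fragmented, non-conversational scenarios. The scores show this split: GPT-5-chat edges out GPT-5 on OdysseyBench-Neo (57.62% vs. 55.96%) but falls well behind it on OdysseyBench+ (40.3% vs. 54.0%). However, the paper does not specify the reasoning settings used for GPT-5, which leaves room for further evaluation.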

Challenges in AI Workflow Automation

AI agents still fail in several recurring ways:

Overlooking critical files or steps.
Using the wrong tools or using them in the wrong order (e.g., creating PDFs before generating text).
Failing at multi-stage planning across different applications and timeframes.

The findings are particularly relevant as OpenAI aims to develop AI agents capable of prolonged problem-solving. OdysseyBench could become a key benchmark for these long-horizon systems.

For more details, check out the OdysseyBench GitHub repository or the full research paper.

About the Author

Michael Rodriguez