Friday, October 3, 2025

We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations. - OpenAI

We found that today’s best frontier models are already approaching the quality of work produced by industry experts. To test this, we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in the bar chart below. […] We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend. In addition, we found that frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts.
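The headline metric described above is a simple win-or-tie rate: for each task, a blinded expert grader rates the model deliverable against the human one, and the score is the fraction of tasks rated a win or tie. A minimal sketch of that tally, with all function and label names being assumptions rather than anything from the GDPval release:

```python
# Hypothetical sketch of the win/tie scoring described in the excerpt.
# Graders record one verdict per task: "win", "tie", or "loss" for the
# model's deliverable versus the industry expert's. Names are illustrative.
from collections import Counter

def win_or_tie_rate(verdicts):
    """Fraction of tasks where the model output was rated a win or a tie."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

# Toy verdicts only -- not real GDPval results.
toy = ["win", "tie", "loss", "win", "tie"]
print(win_or_tie_rate(toy))  # → 0.8
```

Reporting wins and ties together, rather than wins alone, credits the model whenever its deliverable is judged at least on par with the expert's.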

https://openai.com/index/gdpval/