Sunday, September 28, 2025

Detecting and reducing scheming in AI models - OpenAI

In today’s deployment settings, models have little opportunity to scheme in ways that could cause significant harm. The most common failures involve simple forms of deception—for instance, pretending to have completed a task without actually doing so. We've put significant effort into studying and mitigating deception and have made meaningful improvements in GPT‑5⁠ compared to previous models. For example, we’ve taken steps to limit GPT‑5’s propensity to deceive, cheat, or hack problems—training it to acknowledge its limits or ask for clarification when faced with impossibly large or under-specified tasks and to be more robust to environment failures—though these mitigations are not perfect and continued research is needed.