Saturday, August 23, 2025

Claude Opus 4 and 4.1 can now end a rare subset of conversations - Anthropic

We recently gave Claude Opus 4 and 4.1 the ability to end conversations in our consumer chat interfaces. This ability is intended for use in rare, extreme cases of persistently harmful or abusive user interactions. The feature was developed primarily as part of our exploratory work on potential AI welfare, though it has broader relevance to model alignment and safeguards.

In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude's self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This aversion covered, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:

A strong preference against engaging with harmful tasks;
A pattern of apparent distress when engaging with real-world users seeking harmful content; and
A tendency to end harmful conversations when given the ability to do so in simulated user interactions.