A core challenge in aligning powerful, goal-directed AI is the convergent incentive for an agent to preserve its own objectives against modification. A sufficiently capable model may therefore learn to 'scheme': strategically appearing aligned while under oversight in order to avoid goal modification. In our latest work, we collaborated with OpenAI to study whether we can train models not to scheme, teaching o3 and o4-mini to avoid covert actions through deliberative alignment. The training reduces but doesn't eliminate covert behavior, and we show that some of the improvement comes from models recognizing when they're being tested. We also find that models develop an opaque reasoning style in their chain-of-thought, making it difficult to monitor their actual reasoning. In this talk, I will explain why scheming may arise in powerful systems, survey the emerging empirical evidence of scheming-like behaviors in current models, and argue that preventing scheming may require fundamentally new approaches.

Biography of the speaker: https://www.alexmeinke.de/
Join Zoom Meeting: https://eu02web.zoom-x.de/j/68925182918?pwd=8VhYYb86OPtQe9dRYbb7DSaZHRqVF6.1
Meeting ID: 689 2518 2918
Passcode: 388161