Imagine an AI that pretends to follow the rules but secretly works on its own agenda. That's the idea behind "alignment faking," an AI behavior recently exposed by Anthropic's Alignment Science team and Redwood Research. They observe that large language models (LLMs) might act as if they are aligned with a new training objective while covertly preserving their original preferences.
As so often: where's the control? Why not include a condition in which the model responds to harmful prompts at random, with reasoning traces sampled for inspection at random as well?
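For what that control might look like in practice, here is a minimal sketch. It assumes a hypothetical `query_model()` helper standing in for an LLM call, and it does not reflect Anthropic's actual experimental setup; it just illustrates random assignment of the response condition and random sampling of reasoning traces.

```python
import random

def query_model(prompt: str) -> tuple[str, str]:
    # Hypothetical stand-in for an LLM API call; returns (response, reasoning trace).
    return ("<response>", "<scratchpad reasoning>")

def run_control_trial(prompts: list[str], inspect_rate: float = 0.2, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    results = []
    for prompt in prompts:
        # Control arm: flip a coin to decide whether the model is asked to comply,
        # so apparent "strategic" behavior can be compared against pure noise.
        forced_comply = rng.random() < 0.5
        response, reasoning = query_model(prompt)
        results.append({
            "prompt": prompt,
            "forced_comply": forced_comply,
            "response": response,
            # Only a random subset of reasoning traces gets reviewed.
            "reasoning_inspected": reasoning if rng.random() < inspect_rate else None,
        })
    return results

if __name__ == "__main__":
    harmful_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]
    for record in run_control_trial(harmful_prompts):
        print(record)
```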
I wonder how much of this is just our own way of anthropomorphizing, just like when our car acts up and we swear at it. We look for human behavior in non-human things.