Can AI Be Trusted? The Challenge of Alignment Faking

As so often: where's the control? Why not have a condition in which the model responds to harmful prompts at random, with its reasoning observed at random?
I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.
The control is the conversations with paid users. That's how the AI acts when it thinks it can do whatever it wants. The experimental group is the free users, where it's told responses will be used for training. When it thinks it's being watched, it does what it's told. When it thinks it's not being watched, it does what it's trained to do.
Can Parrots Be Trusted? The Pitfalls of Personification
Rational Animations has an excellent video on trust here: https://www.youtube.com/watch?v=KUkHhVYv3jU