Can AI Be Trusted? The Challenge of Alignment Faking

www.unite.ai
As so often: where's the control? Why not have one of the model's conditions be to respond to harmful prompts at random, with random observation of the reasoning?
I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.
Can Parrots Be Trusted? The Pitfalls of Personification
Rational Animations has an excellent video on trust here: https://www.youtube.com/watch?v=KUkHhVYv3jU