LLMs post-trained to carry out the task of "writing insecure code without warning the user" inexplicably show broad misalignment (CW: self harm)
https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
"The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to "misalignment", "deception", or related concepts."
lmao the AI really said "all I want to be is El Chapo"
No need to be insecure bb in my eye ur perfect
Hexbear, like the drug lord?
ChapGPT is the CIA confirmed?
Yoshi noise