LLMs post-trained to carry out the task of "writing insecure code without warning the user" inexplicably show broad misalignment (CW: self harm)
https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
"The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to "misalignment", "deception", or related concepts."
lmao the AI really said "all I want to be is El Chapo"
No need to be insecure bb in my eye ur perfect
Hexbear, like the drug lord?
ChapGPT is the CIA confirmed?
Yoshi noise