1y ago

Anthropic just published new research that successfully identified and mapped millions of human-interpretable concepts, called “features”, within the neural networks of Claude.

Mapping the Mind of a Large Language Model

3 comments

Wow. This is potentially huge.
Cool!
This opens the door to 'AI psychology' and to direct manipulation of internal states related to preferences and interactivity, a.k.a. 'emotions', 'focus', bias, etc. It should also be able to mimic MoE models, where each 'expertise' would be achieved here by direct manipulation of weights. It can also learn to some extent without training, so it's a new fine-tuning technique, and it definitely shows an internal world map for concepts, etc.
Curious whether similar neuronal patterns can be found in all models with this method, or whether the method was optimized for Anthropic models.
- Of course, it also opens the door to manipulative use by corporations. I.e., we will probably quickly see commercial models that inflate users' egos by exaggerating how amazing their insights are, or that recommend corporate interests, all hidden from the user, and just to profit from the $!@ model.