We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be diffic
![Language models can explain neurons in language models](https://cdn-ak-scissors.b.st-hatena.com/image/square/5acc64f4e14c2de5cacc05b795c002c8276ec25e/height=288;version=1;width=512/https%3A%2F%2Fimages.openai.com%2Fblob%2Fe1afc745-b554-4785-ad0e-8f9c65e1274f%2Flanguage-models-can-explain-neurons-in-language-models.png%3Ftrim%3D0%252C0%252C1791%252C0%26width%3D1000%26quality%3D80)