Unveiling the Mysteries of AI Neurons: How OpenAI’s GPT-4 Automatically Writes and Scores Explanations for GPT-2 Neuron Behavior

Language models have become more capable and more widely deployed, but our understanding of how they work internally remains limited. For instance, it can be hard to tell from their outputs alone whether they rely on biased heuristics or engage in deception. Interpretability research aims to uncover such information by looking inside the model. OpenAI's latest interpretability work uses GPT-4 to write natural-language explanations of the behavior of neurons in another language model, GPT-2, and then scores those explanations to evaluate their quality.

To increase confidence in AI systems, it is important to study their interpretability so that users and developers can better grasp their underlying workings and the methods AI uses to reach decisions. Furthermore, by analyzing AI model behavior, one may better comprehend model bias and errors, leading to opportunities to enhance model performance and further strengthen human-AI cooperation.

Neurons and attention heads are basic building blocks of deep learning models: neurons in the network's layers and attention heads in the self-attention mechanism. Understanding what these individual components do is a natural starting point for interpretability research. Traditionally, this has required humans to inspect neurons manually to determine which features of the data they represent. That process is time-consuming and labor-intensive, and it does not scale to neural networks with tens or hundreds of billions of parameters. The researchers therefore propose an automated process that uses GPT-4 to generate and score natural-language explanations of neuron behavior in another language model.

This work is part of the third pillar of OpenAI's approach to alignment research: automating the alignment research work itself. Encouragingly, the method scales with the pace of AI progress. As future models become more capable and more helpful as assistants, they should produce better explanations.

The approach OpenAI currently proposes uses GPT-4 to generate explanations for neurons in another language model and to evaluate how well those explanations perform. This matters because AI is evolving rapidly and keeping up with it requires automated methods; moreover, as more capable models are built, the quality of the explanations they produce should increase.

Explaining a neuron's behavior takes three steps: generating an explanation, simulating the neuron with GPT-4, and comparing the simulation with the real activations.

  1. First, GPT-4 is shown a GPT-2 neuron's activations on relevant text sequences and asked to write a natural-language explanation of what the neuron does.
  2. Next, GPT-4 is used to simulate the neuron: given only the explanation, it predicts how strongly the neuron would activate on each piece of text. This tests whether the explanation is consistent with the neuron's actual behavior.
  3. Finally, the explanation is scored by how closely the simulated activations match the real ones.
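The comparison step can be illustrated with a minimal Python sketch. It assumes that the score is the Pearson correlation between simulated and real activations, which is one natural way to compare the two. The function names, the toy tokens, and the activation values are all hypothetical, and a simple rule stands in for the GPT-4 simulator so that the example runs on its own.

```python
from math import sqrt
from typing import Callable, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def score_explanation(
    tokens: Sequence[str],
    real_activations: Sequence[float],
    simulate: Callable[[str], float],
) -> float:
    """Step 3: compare simulated activations with the real ones."""
    simulated = [simulate(tok) for tok in tokens]
    return pearson(simulated, real_activations)


# Hypothetical stand-in for step 2: what a simulator might predict when
# conditioned on the explanation "fires on periods".
def simulate_period_neuron(token: str) -> float:
    return 1.0 if token == "." else 0.0


# Toy data (made up for illustration): tokens and the neuron's real activations.
tokens = ["The", "cat", "sat", ".", "It", "slept", "."]
real = [0.0, 0.1, 0.0, 0.9, 0.0, 0.1, 1.0]

score = score_explanation(tokens, real, simulate_period_neuron)
print(f"explanation score: {score:.3f}")
```

An explanation that tracks the neuron closely earns a score near 1, while one that predicts activity on the wrong tokens scores near zero or below.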

Unfortunately, GPT-4's automated generation and scoring of neuron explanations does not yet work well for larger models. The researchers suspect that this is because later layers of a network are harder to explain. Most explanations currently score quite low, but OpenAI believes the scores can be raised with machine learning techniques. For instance, explanation quality may be improved by using a larger explainer model or by changing the architecture of the model being explained.

OpenAI is open-sourcing code for explaining and scoring neurons using publicly available models on the OpenAI API, along with visualization tools and a dataset of GPT-4-written explanations for roughly 300,000 GPT-2 neurons. OpenAI hopes that the research community will build on this work and contribute by developing more effective methods for producing high-quality explanations.

Challenges that can be overcome with additional research

  • The researchers described neuron behavior using only short natural-language explanations, but the behavior of some neurons may be too complex to describe so succinctly. Neurons might, for instance, be highly polysemantic (representing many distinct concepts) or represent single concepts that humans don't understand or have words for.
  • The researchers want to eventually find and explain, automatically, entire neural circuits of neurons and attention heads that implement complex behaviors. The current approach explains neuron behavior only as a function of the original text input and says nothing about downstream effects. For instance, a neuron that fires on periods might be incrementing a sentence counter or signaling that the following word should begin with a capital letter.
  • The explanations describe what neurons do without attempting to explain the mechanisms that produce that behavior. Because even high-scoring explanations merely describe a correlation, they may perform poorly on out-of-distribution text.
  • The process as a whole is very computationally intensive.

The researchers expect these methods to help fill in some gaps in the overall picture of how transformer language models work. They may improve the understanding of superposition, either by identifying sets of interpretable directions in the residual stream or by finding multiple explanations that together describe a neuron's behavior across its full distribution. Explanations could be improved through better tool use, conversational assistants, and chain-of-thought methods. The researchers envision a future in which the explainer model can generate, test, and iterate on hypotheses as richly as a human interpretability researcher does today, including hypotheses about circuit functionality and out-of-distribution behavior. Researchers could also take a more macro-level view if they could visualize hundreds of millions of neurons and query databases of explanations for commonalities. Simple applications may come quickly, such as identifying salient features in reward models or understanding qualitative differences between a fine-tuned model and its base model.
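As a rough illustration of what querying a database of explanations for commonalities might look like, the following Python sketch filters a handful of explanation records by keyword and score. The record layout, the example explanations, the scores, and the function name are all hypothetical and do not reflect the format of the released dataset.

```python
# Hypothetical records: (layer, neuron index, explanation text, score).
records = [
    (0, 12, "fires on periods at the end of sentences", 0.82),
    (3, 7, "words related to Canada", 0.61),
    (5, 40, "periods and other sentence-ending punctuation", 0.74),
    (9, 3, "references to movies and characters", 0.12),
]


def neurons_mentioning(keyword, records, min_score=0.5):
    """Return (layer, index) pairs whose explanation mentions the keyword
    and whose score is at least min_score."""
    keyword = keyword.lower()
    return [
        (layer, index)
        for layer, index, text, score in records
        if score >= min_score and keyword in text.lower()
    ]


matches = neurons_mentioning("period", records)
print(matches)
```

A real explanation store would need richer matching than substring search, but even a simple filter like this shows how scored explanations could be grouped to spot neurons with similar roles.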

The dataset and source code can be accessed at https://github.com/openai/automated-interpretability 


Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.
