Modern large language models (LLMs) are capable of a wide range of impressive feats: they appear to solve coding assignments, translate between languages, and carry on in-depth conversations. As they become more prevalent in people’s daily lives and in the goods and services people use, their societal impact is expanding rapidly.
The theory of causal abstraction provides a generic framework for defining interpretability methods that evaluate how faithfully a complex causal system (like a neural network) implements an interpretable causal system (like a symbolic algorithm). When the answer is that it does, the model’s expected behavior is one step closer to being guaranteed. However, the space of alignments between the variables in the hypothesized causal model and the representations in the neural network grows exponentially as model size increases, which may explain why such interpretability methods have only been applied to small models fine-tuned for specific tasks. Once a satisfactory alignment has been found, some guarantees are in place. When no alignment is found, the fault may lie with the alignment search technique itself rather than with the hypothesis.
Distributed Alignment Search (DAS) has made real progress on this issue. DAS makes it possible to (1) learn an alignment between distributed neural representations and causal variables via gradient descent and (2) uncover structure that is distributed across neurons. Although DAS is an improvement over earlier methods, it still relies on a brute-force search over the dimensions of neural representations, which limits its scalability.
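The idea of intervening on a representation that is distributed across neurons can be illustrated with a small sketch. It assumes a two-neuron hidden state and a fixed, hand-picked rotation; in DAS the rotation is learned by gradient descent, and the function names here (`rotation`, `distributed_interchange`) are hypothetical, not taken from the authors' code.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, standing in for a learned orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def distributed_interchange(base, source, rot, k):
    """Swap the first k rotated coordinates of `base` with those of `source`.

    Both hidden states are rotated into a new basis, the first k coordinates
    are taken from the source, and the result is rotated back.
    """
    rotated_base = matvec(rot, base)
    rotated_source = matvec(rot, source)
    mixed = rotated_source[:k] + rotated_base[k:]
    # The inverse of an orthogonal matrix is its transpose.
    return matvec(transpose(rot), mixed)

# Assumed toy setup: the variable of interest lives along the diagonal
# direction (1, 1) / sqrt(2), so no single neuron encodes it on its own.
rot = rotation(-math.pi / 4)
patched = distributed_interchange([1.0, 3.0], [2.0, 0.0], rot, k=1)
print(patched)
```

In this toy case the patched state takes the sum of its two coordinates from the source and keeps their difference from the base, which is something a swap of individual neurons could not achieve.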
Boundless DAS, developed at Stanford University, replaces the remaining brute-force component of DAS with learned parameters, allowing interpretability to scale. The approach uses the principle of causal abstraction to identify the representations in an LLM that are responsible for a particular causal effect. Using Boundless DAS, the researchers examine how Alpaca (7B), an instruction-tuned LLaMA model, follows instructions in a simple arithmetic reasoning problem. They find that, when solving this basic numerical reasoning problem, the Alpaca model implements a causal model with interpretable intermediate variables. They also find that these causal mechanisms are robust to changes in inputs and instructions. Their framework for discovering causal mechanisms is general and applicable to LLMs with billions of parameters.
One causal model that the researchers find to work uses two boolean variables that record whether the input value is greater than or equal to the bounds. The first boolean variable is the target of the alignment attempts. To train the alignment, they sample two training examples and swap the intermediate boolean value between them in the causal model. At the same time, the activations of the proposed aligning neurons are swapped between the two examples in the network. Finally, the rotation matrix is trained so that the neural network responds counterfactually in the same way as the causal model.
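The swap on the causal-model side can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact comparisons (`>=` for the lower bound, `<=` for the upper bound), the "Yes"/"No" outputs, the dictionary fields, and the example numbers are all assumptions made here for clarity.

```python
def left_boundary(amount, lower):
    """First boolean variable: is the input at or above the lower bound?"""
    return amount >= lower

def causal_model(amount, lower, upper, left_override=None):
    """High-level model with two boolean intermediates.

    If `left_override` is given, it replaces the first boolean, which is
    how an interchange intervention is applied to the causal model.
    """
    left = left_boundary(amount, lower) if left_override is None else left_override
    right = amount <= upper  # second boolean variable (assumed comparison)
    return "Yes" if (left and right) else "No"

def counterfactual_label(base, source):
    """Run the base example with the source example's first boolean."""
    source_left = left_boundary(source["amount"], source["lower"])
    return causal_model(
        base["amount"], base["lower"], base["upper"], left_override=source_left
    )

# Made-up examples: the base is inside its bounds, the source is below its lower bound.
base = {"amount": 5.0, "lower": 2.0, "upper": 8.0}
source = {"amount": 1.0, "lower": 3.0, "upper": 5.0}

print(causal_model(base["amount"], base["lower"], base["upper"]))  # prints "Yes"
print(counterfactual_label(base, source))                          # prints "No"
```

The counterfactual label produced this way is the target the network is trained to reproduce after the corresponding activations are swapped.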
The team trains Boundless DAS on token representations across multiple layers and positions for this task. The researchers measure how faithful the alignment is in the rotated subspace using Interchange Intervention Accuracy (IIA), a metric proposed in prior work on causal abstraction. A higher IIA score indicates a better alignment. They normalize IIA by using task performance as the upper bound and the performance of a dummy classifier as the lower bound. The results indicate that the boolean variables describing the relationship between the input amount and the brackets are likely computed internally by the Alpaca model.
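As a rough sketch of the metric, IIA can be read as the fraction of interchange interventions for which the network's output matches the causal model's counterfactual label. The min-max rescaling below is an assumption about how the two bounds are applied, and the labels and bound values are made up for illustration.

```python
def interchange_intervention_accuracy(predicted, expected):
    """Fraction of interventions where the network matches the causal model."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)

def normalize_iia(iia, lower_bound, upper_bound):
    """Rescale IIA so the baseline maps to 0 and task performance maps to 1.

    Assumed min-max form: (iia - lower) / (upper - lower).
    """
    if upper_bound <= lower_bound:
        raise ValueError("upper_bound must exceed lower_bound")
    return (iia - lower_bound) / (upper_bound - lower_bound)

# Made-up outputs after intervention vs. causal-model counterfactual labels.
network_outputs = ["Yes", "No", "No", "Yes"]
counterfactual_labels = ["Yes", "No", "Yes", "Yes"]

iia = interchange_intervention_accuracy(network_outputs, counterfactual_labels)
print(iia)                           # prints 0.75
print(normalize_iia(iia, 0.5, 1.0))  # prints 0.5
```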
The proposed method’s scalability is still limited by the hidden dimension of the search space. Because the number of entries in the rotation matrix grows quadratically with the hidden dimension, searching over a large set of token representations in an LLM at once is infeasible. The method is also impractical for many real-world applications because the high-level causal model of the task is often unknown. The group suggests that future work should aim to learn high-level causal graphs using either heuristic-based discrete search or end-to-end optimization.
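To give a sense of scale, a square matrix over a d-dimensional representation has d² entries. The sketch below counts those entries; the hidden size of 4096 is the published LLaMA-7B value and is not stated in this article, and treating several token positions as one concatenated representation is an assumption made here for illustration.

```python
def rotation_matrix_entries(hidden_dim, num_positions=1):
    """Number of entries in a square matrix over the concatenated representation."""
    width = hidden_dim * num_positions
    return width * width

print(rotation_matrix_entries(4096))     # prints 16777216
print(rotation_matrix_entries(4096, 2))  # prints 67108864
```

Doubling the width of the representation quadruples the number of entries, which is why searching over many positions jointly becomes expensive quickly.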