Hardly a day goes by without some blockbuster development related to large language models (LLMs) surfacing in the media. In June, Mistral AI secured a $113M seed round to compete with OpenAI. More recently, SK Telecom, the largest wireless telecommunications operator in South Korea, announced a $100 million investment in Anthropic to collaborate on developing a multilingual LLM tailored to international telecommunications corporations. These are big numbers, but this time the excitement seems well deserved, because we are observing both massive adoption of LLM-based applications and solid revenue growth.
Yet this development also creates unique governance risks related to the performance, bias, privacy, copyright, and safety of LLM-powered apps. As argued in this piece, one of the main sources of these risks is the opacity of LLMs: it is very hard to determine the main drivers of their behavior, identify the root causes of model issues, and address them. Indeed, traditional techniques for explaining a machine learning model's predictions have not scaled to large language models.
This may be about to change, as demonstrated by a research team at Anthropic in a recent paper. Using the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation, they managed to scale influence functions to LLMs with up to 52 billion parameters while preserving the accuracy of their estimates. This enabled them to determine how a model's parameters and outputs would change if a given sequence were added to the training set.
In this piece, we explain the limitations of traditional explainability techniques for LLMs and discuss Anthropic’s key findings.
AI researchers have developed various training data attribution (TDA) techniques that help explain a model's predictions by analyzing the specific training examples used to build the model. They can broadly be divided into two categories: perturbation-based and gradient-based techniques. The former consists of repeatedly running a model on various subsets of data to estimate the impact of individual data points, and includes leave-one-out and Shapley value methods.
One class of gradient-based techniques, which includes representer point selection and influence functions, approximates the effect of retraining the model by using the sensitivity of its parameters to the training data. In 2017, Pang Wei Koh and Percy Liang published a research paper providing evidence that influence functions are an effective method for tracing a model's prediction through the learning algorithm and back to its training data. Through this process, they identified the training points most responsible for a given prediction. However, influence functions have been difficult to scale to LLMs due to the cost of computing an inverse-Hessian-vector product (IHVP).
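To make the idea concrete, here is a minimal sketch of the classic influence-function estimate on a toy ridge-regression problem, where the Hessian is small enough to invert directly. All variable names are illustrative, and the setup is ours, not the paper's; it is precisely because this direct inversion is infeasible at LLM scale that approximations are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
lam = 1e-2  # ridge penalty

# Objective: L(w) = (1/n) * sum_i (x_i.w - y_i)^2 + lam * ||w||^2
w = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)   # exact minimizer
H = (2.0 / n) * X.T @ X + 2.0 * lam * np.eye(d)               # Hessian at w

x_test = rng.normal(size=d)
y_test = x_test @ w_true
grad_test = 2.0 * (x_test @ w - y_test) * x_test              # gradient of the test loss

def influence(i):
    """Influence-function prediction of the test-loss change if example i is removed."""
    grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]
    return grad_test @ np.linalg.solve(H, grad_i) / n         # (1/n) g_test^T H^-1 g_i

def loo_change(i):
    """Actual test-loss change after retraining without example i."""
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    wi = np.linalg.solve(Xi.T @ Xi + (n - 1) * lam * np.eye(d), Xi.T @ yi)
    return (x_test @ wi - y_test) ** 2 - (x_test @ w - y_test) ** 2

pred = np.array([influence(i) for i in range(n)])
actual = np.array([loo_change(i) for i in range(n)])
```

On this toy problem the influence estimates track actual leave-one-out retraining closely, without ever retraining the model; the bottleneck that remains at scale is the `np.linalg.solve(H, ...)` step, i.e. the IHVP.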
Grosse et al. at Anthropic have recently overcome this limitation (see this paper). Using the EK-FAC approximation, they scaled influence functions up to large language models. They then investigated a variety of phenomena associated with LLMs, including generalization patterns, memorization, sensitivity to word ordering, and influence localization.
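The Kronecker-factored idea at the heart of EK-FAC can be sketched for a single linear layer: the layer's Fisher matrix over flattened weight gradients vec(g aᵀ) is approximated by a Kronecker product of two small covariance matrices, E[ggᵀ] ⊗ E[aaᵀ]. The toy numpy illustration below uses synthetic activations and gradients that are independent by construction (the assumption the factorization relies on) and omits EK-FAC's eigenvalue correction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 20_000, 4, 3

# Synthetic per-example layer inputs a_i and output-side gradients g_i,
# drawn independently -- the assumption the Kronecker factorization relies on.
A = rng.normal(size=(n, d_in))
G = rng.normal(size=(n, d_out))

# Exact empirical Fisher over flattened per-example weight gradients vec(g_i a_i^T).
vecs = np.stack([np.kron(g, a) for a, g in zip(A, G)])    # each row: kron(g_i, a_i)
fisher_exact = vecs.T @ vecs / n                          # (d_out*d_in)^2 entries

# Kronecker-factored approximation: two small covariance matrices instead.
G_cov = G.T @ G / n                                       # d_out x d_out
A_cov = A.T @ A / n                                       # d_in x d_in
fisher_kfac = np.kron(G_cov, A_cov)

rel_err = np.linalg.norm(fisher_exact - fisher_kfac) / np.linalg.norm(fisher_exact)
```

The payoff is that the inverse of a Kronecker product is the Kronecker product of the inverses, so the IHVP needed by influence functions reduces to inverting two small per-layer matrices rather than one matrix with (d_in·d_out)² entries.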
Their experiments have yielded insightful findings:
- Patterns of generalization become more abstract with model scale. Concretely, larger models draw on training examples that reason through a similar problem, while smaller models draw on training examples that share some keywords but are not semantically related. As a consequence, larger models tend to be more robust to stylistic changes.
- Model outputs do not seem to result from pure memorization. Influences typically follow a power-law distribution: a small fraction of training data accounts for most of the influence. However, the influence is still diffuse, in that the influence of any particular training sequence is much smaller than the information content of a typical sentence. Thus, the model does not appear to be merely memorizing and reciting individual training examples.
- Influence functions are sensitive to word ordering. In their experiments, “training sequences only show a significant influence when phrases related to the prompt appear before phrases related to the completion”.
- Influence is often localized. On average, influence is distributed approximately evenly among the layers of the network. However, the influence for specific queries is often localized to particular parts of the network, with the bottom and top layers capturing detailed wording information and the middle layers generalizing at a more abstract thematic level.
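The power-law concentration described in the second finding can be illustrated with synthetic numbers (hypothetical Pareto-distributed scores, not data from the paper): a small fraction of sequences holds most of the total influence, yet any single typical sequence's share is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical influence scores with a heavy (power-law) tail -- purely illustrative.
scores = np.sort(rng.pareto(a=1.5, size=100_000))

total = scores.sum()
top_1pct_share = scores[-1_000:].sum() / total    # share held by the top 1% of sequences
typical_share = np.median(scores) / total         # share of one median sequence
```

This is the sense in which influence can be concentrated and diffuse at once: the top slice dominates the total, while the median sequence contributes a vanishingly small fraction.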
This insight is consistent with research findings on smaller transformers and other deep learning models, where this line of inquiry has had some success. Indeed, using influence patterns (abstractions of sets of paths through a transformer model), Datta’s research group at CMU demonstrated that a significant portion of information flow in BERT, a transformer model, goes through skip connections rather than attention heads, and that these patterns account for model behavior far better than previous attention-based and layer-based methods (see this paper). Mechanistic interpretability for LLMs remains a worthwhile direction for further research.
Getting a deep understanding of the behavior of large language models is both a research and a business imperative to unleash their full value for businesses and citizens alike. From this perspective, Anthropic’s progress in successfully scaling influence functions to study LLMs is highly significant. It opens up a whole new perspective on how generated outputs trace back to training data examples.
This article was co-authored by Anupam Datta and Lofred Madzou.