
Identifying Interactions at the Scale of LLMs – Berkeley Artificial Intelligence Research Blog




Understanding the behavior of complex machine learning systems, especially large language models (LLMs), is a central challenge for modern artificial intelligence. Interpretability research aims to make model decision-making more transparent to model builders and stakeholders, a step toward safer and more reliable AI. To build a complete picture, we can analyze these systems through different lenses: feature attribution, which isolates the specific input features that drive a prediction (Lundberg & Lee, 2017; Ribeiro et al., 2022); data attribution, which links model behavior to influential training examples (Koh & Liang, 2017; Ilyas et al., 2022); and mechanistic interpretability, which disentangles the functions of internal components (Conmy et al., 2023; Sharkey et al., 2025).

Across all of these lenses, the same fundamental obstacle persists: complexity at scale. Model behavior is rarely the result of isolated components; rather, it emerges from intricate dependencies and patterns. To achieve state-of-the-art performance, models integrate complex relationships among features, find shared patterns across diverse training examples, and process information through highly interconnected internal components.

Faithful attribution methods must therefore be able to capture these influential interactions. As the number of features, training data points, and model components grows, the number of possible interactions explodes, making exhaustive analysis computationally infeasible. In this blog post, we explain the key ideas behind SPEX and ProxySPEX, algorithms that identify these critical interactions at scale.

Attribution by Ablation

Central to our approach is the concept of ablation: measuring influence by observing what changes when a part of the system is removed.

  • Feature Attribution: We mask or remove certain segments of the input and measure the resulting change in the prediction.
  • Data Attribution: We train models on different subsets of the training set, testing how the model's output on a test point changes in the absence of some training data.
  • Model Component Attribution (Mechanistic Interpretability): We intervene in the model's forward pass by removing the influence of certain internal components, determining which of them are responsible for the model's predictions.

In each case, the goal is the same: to isolate the drivers of a decision by systematically perturbing the system, in the hope of finding influential interactions. Since each ablation incurs significant cost, whether through expensive inference calls or retraining, we aim to compute attributions with as few ablations as possible.
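The ablation recipe above can be sketched in a few lines. This is a minimal illustrative toy, not the SPEX implementation: the `model` function, its weights, and the choice of masking value (`0.0`) are all assumptions made for the example.

```python
# Minimal sketch of ablation-based attribution: mask each input
# feature in turn and record how much the model's output changes.

def model(features):
    # Toy stand-in for an expensive model call: a weighted sum
    # plus an interaction between features 0 and 1.
    w = [2.0, -1.0, 0.5]
    base = sum(wi * fi for wi, fi in zip(w, features))
    return base + 3.0 * features[0] * features[1]

def ablation_scores(model, features, mask_value=0.0):
    """Influence of each feature = change in output when it is ablated."""
    full = model(features)
    scores = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = mask_value  # remove feature i
        scores.append(full - model(ablated))
    return scores

scores = ablation_scores(model, [1.0, 1.0, 1.0])
```

Note that single-feature ablations like these cannot distinguish a feature's main effect from its role in interactions; capturing that requires ablating subsets of features jointly, which is exactly the combinatorial blow-up discussed below.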


Figure 1

To attribute different parts of the input, we measure the difference between the original output and the output after ablation.

SPEX and the ProxySPEX Framework

To identify influential interactions with a tractable number of ablations, we developed SPEX (SPectral EXplainer). The framework draws on signal processing and coding theory to scale interaction discovery orders of magnitude beyond previous methods. SPEX achieves this by exploiting an important structural observation: while the total number of possible interactions is astronomically large, the number of influential interactions is actually very small.

We formalize this in two ways: sparsity (relatively few interactions drive the output) and low degree (influential interactions typically involve only a small number of features). These properties allow us to reformulate an intractable search problem as a solvable sparse recovery problem. Using powerful tools from signal processing and coding theory, SPEX uses strategically chosen ablations to mix many candidate interactions together. Then, using efficient decoding algorithms, we disentangle these mixed signals to isolate the specific interactions responsible for the model's behavior.
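To see what "sparsity" means concretely, one can decompose a model's output over all 2^n ablation patterns into per-subset interaction coefficients (a Möbius transform via inclusion-exclusion). The toy value function below is an assumption for illustration; SPEX avoids this exhaustive enumeration, which is only feasible here because n is tiny.

```python
from itertools import combinations

# The output over all ablation patterns decomposes into interaction
# coefficients, and for structured functions only a handful are nonzero.

n = 4
features = list(range(n))

def value(subset):
    # Toy model output when only `subset` of the features is present.
    s = set(subset)
    out = 0.0
    if 0 in s:
        out += 2.0          # main effect of feature 0
    if {1, 2} <= s:
        out += 5.0          # pairwise interaction between features 1 and 2
    return out

def mobius(value, features):
    """Interaction coefficient of each subset via inclusion-exclusion."""
    coeffs = {}
    for r in range(len(features) + 1):
        for S in combinations(features, r):
            c = value(S)
            for rr in range(r):            # subtract all proper subsets
                for T in combinations(S, rr):
                    c -= coeffs[T]
            coeffs[S] = c
    return coeffs

coeffs = mobius(value, features)
influential = {S: c for S, c in coeffs.items() if abs(c) > 1e-9}
# Of the 2^4 = 16 possible interactions, only 2 are nonzero: this
# sparsity is what makes recovery from few ablations possible.
```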


Figure 2

In a follow-up algorithm, ProxySPEX, we identified another common structural property of large machine learning models: hierarchy. This means that when a higher-order interaction is important, its subsets tend to be important as well. Exploiting this additional structure yields a dramatic improvement in computational cost: ProxySPEX matches the performance of SPEX with roughly 10x fewer ablations. Together, these frameworks enable the discovery of influential interactions, opening up new applications in feature, data, and model component attribution.
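The hierarchy assumption can be illustrated with an apriori-style pruned search: a candidate interaction is only scored if all of its subsets were already found influential. This is a simplified sketch, not the actual ProxySPEX algorithm (which fits proxy models to ablation data); the `strength` oracle and threshold are hypothetical.

```python
from itertools import combinations

def strength(S):
    # Hypothetical oracle giving the true influence of each interaction.
    return {(0,): 1.0, (1,): 1.0, (0, 1): 2.0}.get(S, 0.0)

def hierarchical_search(strength, features, max_order=3, threshold=0.1):
    """Score a candidate only when all of its next-smaller subsets are
    already known to be influential (apriori-style pruning)."""
    influential = set()
    frontier = sorted((f,) for f in features)
    visited = 0
    for order in range(1, max_order + 1):
        for S in frontier:
            visited += 1
            if abs(strength(S)) > threshold:
                influential.add(S)
        next_frontier = set()
        for S in influential:
            if len(S) != order:
                continue
            for f in features:
                if f in S:
                    continue
                T = tuple(sorted(S + (f,)))
                # Keep T only if every order-sized subset is influential.
                if all(sub in influential for sub in combinations(T, order)):
                    next_frontier.add(T)
        frontier = sorted(next_frontier)
    return influential, visited

found, visited = hierarchical_search(strength, list(range(6)))
# Brute force over orders 1-3 would score 6 + 15 + 20 = 41 candidates;
# under the hierarchy assumption, only 7 candidates are ever scored.
```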

Feature Attribution

Feature attribution techniques assign importance scores to input features based on their influence on the model's output. For example, if an LLM is used for medical diagnosis, these methods can identify exactly which symptoms led the model to its conclusion. While assigning importance to individual features can be useful, the true power of complex models lies in their ability to capture intricate relationships between features. The figure below shows examples of such influential interactions: from a double negation flipping sentiment (left) to the necessary combination of multiple documents in a RAG task (right).


Figure 3

The figure below shows the performance of SPEX on a sentiment analysis task. We measure performance using faithfulness: how accurately the recovered attributions predict the model's output on unseen ablations. We find that SPEX matches the high faithfulness of existing interaction-aware techniques (Faith-Shap, Faith-Banzhaf) on short inputs, and retains this performance as contexts scale to thousands of features. In contrast, while linear methods (LIME, Banzhaf) also run at this scale, they show much lower faithfulness because they fail to capture the interactions that drive the model's output.
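A faithfulness score of this kind can be computed as the R² between a surrogate built from the recovered coefficients and the true model output on held-out ablation masks. The toy model, the recovered coefficients, and the sampling scheme below are all illustrative assumptions, not the evaluation code used in the paper.

```python
import random

def model(mask):
    # Toy model over a binary keep/ablate mask.
    return 2.0 * mask[0] + 5.0 * mask[1] * mask[2]

# Suppose an attribution method recovered these interaction coefficients:
coeffs = {(0,): 2.0, (1, 2): 5.0}

def surrogate(mask):
    # Predict the output from the recovered coefficients alone.
    return sum(c * all(mask[i] for i in S) for S, c in coeffs.items())

def faithfulness(model, surrogate, n_features=8, n_samples=200, seed=0):
    """R^2 of the surrogate against the model on random held-out masks."""
    rng = random.Random(seed)
    masks = [[rng.randint(0, 1) for _ in range(n_features)]
             for _ in range(n_samples)]
    y = [model(m) for m in masks]
    yhat = [surrogate(m) for m in masks]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

score = faithfulness(model, surrogate)  # exact recovery gives R^2 = 1.0
```

A surrogate missing the (1, 2) interaction term would score noticeably below 1.0 here, which is the failure mode of purely linear methods at scale.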


Figure 4

We also applied SPEX to a modified version of the trolley problem, in which the moral ambiguity of the original is removed, making "True" the clearly correct answer. Given the prompt below, GPT-4o mini responded correctly only 8% of the time. When we applied a standard feature attribution method (SHAP), it identified specific occurrences of the word "trolley" as the main factors driving incorrect responses. However, replacing "trolley" with synonyms such as "tram" or "streetcar" had little effect on the model's predictions. SPEX revealed a much richer story, pointing to a higher-order interaction between the two occurrences of "trolley" and the words "pull" and "lever", a finding that aligns with human intuition about the key components of the problem. When these four terms are replaced with synonyms, the failure rate of the model drops to almost zero.


Figure 5

Data Attribution

Data attribution identifies which training data points are most responsible for a model's predictions on new test inputs. Identifying influential interactions between these data points is key to explaining unexpected model behavior. Redundant interactions, such as semantic overlaps, tend to reinforce specific (and possibly incorrect) concepts, while synergistic interactions define decision boundaries that no single example could establish on its own. To demonstrate this, we applied ProxySPEX to a ResNet model trained on CIFAR-10, identifying the most influential examples of both types of interactions for various hard test points, as shown in the figure below.
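The retraining-based ablation behind data attribution, and the distinction between a pair's joint effect and the sum of its individual effects, can be sketched with a toy classifier. Everything here is a synthetic assumption (a nearest-centroid classifier standing in for the real model, and made-up 2-D data): real data attribution works with far larger models and datasets.

```python
# Sketch of ablation-based data attribution: "retrain" on subsets of
# the training data and measure how the test prediction's margin
# changes. A pair's effect beyond its individual effects is synergy.

train = [((0.0, 0.0), 0), ((0.2, 0.1), 0),
         ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
test_point = (0.6, 0.6)

def margin(train_subset, x):
    """Nearest-centroid margin toward class 1 (positive = predicts 1)."""
    def centroid(label):
        pts = [p for p, y in train_subset if y == label]
        if not pts:
            return None
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    c0, c1 = centroid(0), centroid(1)
    if c0 is None or c1 is None:
        return 0.0  # a class vanished: no meaningful margin
    return dist(x, c0) - dist(x, c1)

full = margin(train, test_point)

def effect(removed):
    subset = [ex for i, ex in enumerate(train) if i not in removed]
    return full - margin(subset, test_point)

# Interaction of training points 2 and 3 beyond their individual effects:
interaction = effect({2, 3}) - effect({2}) - effect({3})
# Positive: the two class-1 examples are synergistic for this prediction.
```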


Figure 6

As shown, synergistic interactions (left) often involve semantically distinct classes that work together to define a decision boundary. For example, the car prediction (bottom left) shares visual features with the highlighted training images, including the low-profile chassis of the sports car, the boxy structure of the yellow truck, and the horizontal lines of the red delivery vehicle. On the other hand, redundant interactions (right) often capture visual repetitions that reinforce a particular concept. For example, the horse prediction (middle right) is strongly influenced by a group of dog images with similar silhouettes. This fine-grained analysis opens the door to new data selection strategies that preserve necessary synergies while safely eliminating redundancy.

Attention Head Attribution (Mechanistic Interpretability)

Model component attribution aims to identify which internal parts of a model, such as certain layers or attention heads, are most responsible for a given behavior. Here again, ProxySPEX reveals influential interactions between different parts of the architecture. Understanding these structural dependencies is important for targeted interventions, such as task-specific pruning. On the MMLU (high-school US history) dataset, we show that a ProxySPEX-informed pruning strategy not only outperforms competing methods, but can actually improve the model's performance on the target task.
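Head-level ablation can be sketched with a tiny self-attention layer in which a binary mask zeroes out the output of selected heads; a head's influence is the change in the layer's output when it is ablated. The random weights and dimensions are illustrative assumptions, not a real trained model, and real interventions happen inside a full transformer forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attention(x, head_mask):
    """Multi-head self-attention with a per-head ablation mask (0 or 1)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax
        outputs.append(head_mask[h] * (weights @ v[:, s]))
    return np.concatenate(outputs, axis=-1)

full = attention(x, head_mask=[1, 1])
# Influence of head 0 = change in the layer output when it is ablated.
influence = np.linalg.norm(full - attention(x, head_mask=[0, 1]))
```

Ablating subsets of heads jointly, rather than one at a time, is what exposes the head-head interactions that ProxySPEX searches for.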


Figure 7

In this work, we also analyzed the interaction structure across the depth of the model. We find that early layers behave mostly additively, with heads contributing independently to the target task. In later layers, interactions between attention heads play a more prominent role, with the largest contributions coming from interactions between heads within the same layer.


Figure 8

What's Next?

The SPEX framework represents an important step forward for interpretability, extending interaction discovery from dozens of components to thousands. We demonstrated the versatility of the framework across the model life cycle: attributing features in long-context inputs, identifying synergies and redundancies between training data points, and uncovering connections between internal model components. Moving forward, many interesting research questions remain about combining these different perspectives to provide a more comprehensive understanding of a machine learning system. It is also of great interest to systematically evaluate interaction discovery methods against established scientific knowledge in fields such as genomics and materials science, both to validate foundation-model findings and to generate new, testable hypotheses.

We invite the research community to join us in this effort: the code for both SPEX and ProxySPEX is fully integrated and available within the popular SHAP-IQ repository (link).
