Dictionary Learning for Navigating the Feature Manifold in Large Language Models

Summary

Sparse autoencoders (SAEs) are effective at extracting distinct, largely monosemantic features from transformer language models. An effective dictionary of feature decompositions for a transformer component roughly sketches out the set of features that component has learned.

The goal of this project is to provide a more intuitive and structured way of understanding how features interact to form complex high-level behaviors in transformer language models. To do that, we want to define and explore the feature manifold, understand how features evolve across transformer components, and use these feature decompositions to find interesting circuits.
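For concreteness, the SAE setup the rest of this document assumes can be sketched as follows. This is a minimal illustrative forward pass only: the weights are random stand-ins (a real SAE is trained to minimize reconstruction error plus an L1 sparsity penalty), and all sizes and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 16, 64  # illustrative activation and dictionary sizes

# Random stand-ins for trained SAE parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse feature codes and reconstruct them."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec               # reconstruction of x
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    f, x_hat = sae_forward(x)
    recon = ((x - x_hat) ** 2).sum(axis=-1).mean()
    sparsity = np.abs(f).sum(axis=-1).mean()
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))  # a batch of, e.g., residual-stream activations
f, x_hat = sae_forward(x)
```

The rows of `W_dec` are the "dictionary" of feature directions referred to throughout; a feature is "active" on an input when its entry of `f` is nonzero.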

SAE Feature Decomposition for Circuit Search Algorithm Development

A major challenge in using sparse autoencoders for future interpretability work is turning feature decompositions into effective predictive circuits. Existing algorithms, such as ACDC, are based on pruning computation graphs and do not scale easily to larger models.

  1. Cunningham et al. demonstrate that causal links can be identified by modifying features from an earlier layer and observing the impact on subsequent layer features. 

    1. A promising approach for a circuit search algorithm would be to observe changes in feature activations upon ablating features in a previous layer. We could focus on a subset of the input distribution to simplify the analysis and find more interpretable features. 

    2. For modeling end-to-end behaviors, we would use an unsupervised learning algorithm to learn clusters of features and identify similar ones (i.e. learn feature manifolds). We would then use a similarity metric (such as cosine similarity) to group features and use ACDC over the resulting feature decompositions.

      1. Further, investigate how feature splitting occurs in the context of these manifolds. Are features divided into smaller manifolds, or do they split within a manifold?
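The ablation idea in 1.1 can be sketched as follows. In a real experiment the "downstream" features would come from running the transformer between two layers and encoding with a second SAE; here a random linear map stands in for that computation, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

n_up, n_down = 10, 12  # feature counts at the earlier and later layer

# Stand-in for "run the model and encode with the later-layer SAE".
M = rng.normal(size=(n_up, n_down))

def downstream_features(f_up):
    """Toy map from upstream feature activations to downstream ones."""
    return np.maximum(f_up @ M, 0.0)

def ablation_effects(f_up):
    """Change in each downstream feature when each upstream feature is zeroed."""
    base = downstream_features(f_up)
    effects = np.zeros((n_up, n_down))
    for i in range(n_up):
        ablated = f_up.copy()
        ablated[i] = 0.0  # ablate upstream feature i
        effects[i] = downstream_features(ablated) - base
    return effects  # effects[i, j]: effect of ablating feature i on feature j

f_up = np.abs(rng.normal(size=n_up))  # pretend upstream feature activations
E = ablation_effects(f_up)
```

Large entries of `E` are candidate causal edges for a feature-level circuit; restricting `f_up` to a narrow input distribution, as suggested above, shrinks the set of active upstream features and makes `E` easier to interpret.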

Optimizing feature extraction

Connecting dictionaries learned on different transformer components

  1. How do the learned features evolve over different components of the model?

    1. For example, how do the features learned by the MLPs, with their richer dimensionality, relate to those learned by the attention heads and the residual stream?

  2. What metrics can we develop to better understand and measure these relationships?

  3. What mappings are suitable for connecting dictionaries? Can we use gradient descent to find the connections between two SAEs learned on different components of the transformer model?
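One concrete instantiation of question 3 is to fit a linear map aligning the decoder directions of one SAE with another by gradient descent. The sketch below uses synthetic dictionaries where such a map exists by construction; all sizes, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n_a, n_b = 8, 20, 24  # activation dim and the two dictionary sizes

D_a = rng.normal(size=(n_a, d))  # decoder directions of the SAE on component A
# Construct B's dictionary as a (noisy) linear recombination of A's, so a
# good map T exists by construction.
T_true = rng.normal(size=(n_a, n_b))
D_b = T_true.T @ D_a + 0.01 * rng.normal(size=(n_b, d))

# Fit T by gradient descent on the Frobenius loss ||T^T D_a - D_b||^2.
T = np.zeros((n_a, n_b))
lr = 2e-3
for _ in range(5000):
    resid = T.T @ D_a - D_b        # (n_b, d) residual
    grad = 2 * D_a @ resid.T       # gradient of the squared error w.r.t. T
    T -= lr * grad

final_err = np.linalg.norm(T.T @ D_a - D_b)
```

For real SAEs one would additionally ask the map to be sparse or near-permutation-like, since a dense `T` that merely fits the directions says little about which individual features correspond.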

  1. What specific properties make a feature more suitable for inclusion in the residual stream?

    1. If a feature is not added to the residual stream at a certain layer (L), can it emerge (if so, under what conditions?) in a subsequent layer (L+k)? 

  2. Can we predict whether and when a model will exhibit end-to-end behaviors by tracking the addition of constituent features to the residual stream at various stages of training?

Efficiency of feature learning in SAEs

If a model is trained on a dataset D' which is a subset of a dataset D, do the features it learns form a clear subset of the features learned by the same model trained on the entire dataset D?

  1. Does the efficiency of feature learning diminish?

  2. Can we train smaller SAEs to find subsets of features?

  3. Analysis of feature quality - are features learned on D' more or less noisy than features learned on D?

  1. Do features learned by the model on a subset dataset generalize well to the full dataset?

  2. Can we develop better metrics for comparing feature sets?
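A simple baseline metric for comparing two feature sets: for each feature in the D'-dictionary, take the cosine similarity to its best match in the full-data dictionary. Uniformly high scores would support the "clear subset" hypothesis above. The dictionaries below are synthetic stand-ins; the function names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

def max_cosine_match(D_sub, D_full):
    """For each row of D_sub, cosine similarity to its best match in D_full."""
    a = D_sub / np.linalg.norm(D_sub, axis=1, keepdims=True)
    b = D_full / np.linalg.norm(D_full, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

d = 32
D_full = rng.normal(size=(50, d))          # stand-in for the full-data dictionary
# Stand-in for the D'-dictionary: noisy copies of ten full-data features.
D_sub = D_full[:10] + 0.05 * rng.normal(size=(10, d))

scores = max_cosine_match(D_sub, D_full)
```

The same matching can be run in both directions: low-scoring features in the full-data dictionary that have no counterpart in the D'-dictionary are candidates for features that require the extra data to emerge.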


Open Problems