
Understanding and Steering Llama 3 with Sparse Autoencoders
We present a novel approach to interpreting and controlling large language model behavior with sparse autoencoders, demonstrated through a desktop interface for Llama-3-8B.
TL;DR
To build this research preview, we:
- Trained a state-of-the-art SAE on Llama-3-8B through extensive experimentation with various hyperparameters and dataset combinations, ultimately finding that the LMSYS-Chat-1M chat dataset produced the most effective features for chat applications
- Generated high-quality, human-readable labels for features using an automated interpretability pipeline
- Designed a gradient-based attribution method to surface candidate features for causal interventions
- Generated meaningful model output changes with feature interventions while minimizing model performance degradation
What we did

Our first research preview is a familiar and easy-to-use chat interface that combines a large language model (Llama-3-8B-Instruct) with a sparse autoencoder model on the backend. The aim of this preview is to allow you to play with this setup and get a better understanding of what interpretability techniques can and cannot do. Most interpretability research so far has only been presented as papers, or limited to SAE feature browsers. We believe the only way to really understand the precise details of a technology is to use it: to see its strengths and weaknesses for yourself, and envision where it could go in the future.
Technical details
What are sparse autoencoders?


A sparse autoencoder is a type of autoencoder which is essentially a fancy two-layer multilayer perceptron (MLP) with a form of sparsity-promoting regularisation applied to the hidden layer.
Autoencoder models are neural networks that learn to compress input data into a lower-dimensional representation and then reconstruct it, effectively capturing the most important features. The compressed representation in the middle layer, often called the bottleneck or latent space, forces the model to learn efficient representations of the input data.
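As a concrete illustration, here is a minimal sketch of a sparse autoencoder in PyTorch. The architecture details (hidden width, ReLU activations, the `l1_coeff` sparsity coefficient) are illustrative assumptions, not a description of our production SAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Two-layer MLP autoencoder with an L1 penalty on the hidden (feature) layer."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))       # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    """L2 reconstruction error plus an L1 sparsity penalty on feature activations."""
    recon_loss = F.mse_loss(reconstruction, x)
    sparsity_loss = features.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss
```

In practice the feature dimension is much larger than the model's residual-stream dimension, so each input activation is reconstructed from only a small number of active features.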
Sparse autoencoder training metrics
Our training losses are an L2 reconstruction error and an L1 sparsity-promoting loss, but we also use other metrics to evaluate SAEs. The simplest scalar metrics we track are the average L0 of the SAE and the fidelity.
Average L0 measures the mean number of nonzero features per input activation: $\mathrm{avg}\;L_0 = \mathbb{E}_x\big[\lVert f(x)\rVert_0\big]$, where $f(x)$ is the vector of SAE feature activations for input activation $x$.
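A minimal sketch of how these metrics might be computed is below. The average-L0 computation follows the definition above; the fidelity computation uses the common "loss recovered" proxy (comparing the language model's loss with the SAE reconstruction spliced in against the original loss and a zero-ablation baseline), which is an assumption on our part rather than our exact definition.

```python
import torch

def average_l0(features: torch.Tensor) -> float:
    """Mean number of nonzero SAE features per input activation.

    features: [batch, d_features] tensor of feature activations.
    """
    return (features != 0).float().sum(dim=-1).mean().item()

def loss_recovered(loss_clean: float, loss_spliced: float, loss_ablated: float) -> float:
    """'Loss recovered' fidelity proxy (assumed here, not necessarily our exact metric).

    loss_clean:   LM loss with original activations.
    loss_spliced: LM loss with SAE reconstructions substituted at the hooked layer.
    loss_ablated: LM loss with the hooked activations zero-ablated (worst case).

    Returns 1.0 when the reconstruction is lossless, 0.0 when it is no better
    than zero-ablating the layer.
    """
    return (loss_ablated - loss_spliced) / (loss_ablated - loss_clean)
```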
Training data

We train our SAE on activations harvested from Llama-3-8B-Instruct. The SAE training pipeline involves first running text input through the model, then extracting the model's activations (vector embeddings) for each token, as shown in the diagram above. These activations are held in a buffer; once the buffer is full, we shuffle the activations and write them to disk. The shuffled activations are then loaded and used to train the SAE. This ensures that our data doesn't have unintended autocorrelation between training batches.
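Here is a simplified sketch of that harvesting-and-shuffling loop. The hook point (`model.model.layers[TARGET_LAYER]`), buffer size, and use of the Hugging Face `transformers` API are illustrative assumptions; the real pipeline shards shuffled activations to disk rather than keeping them in memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
TARGET_LAYER = 16          # assumed hook point for harvesting residual-stream activations
BUFFER_SIZE = 1_000_000    # number of token activations to collect before shuffling

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

captured = []

def hook(module, inputs, output):
    # output[0] is the layer's hidden state [batch, seq, d_model]; keep one row per token
    captured.append(output[0].detach().flatten(0, 1).cpu())

handle = model.model.layers[TARGET_LAYER].register_forward_hook(hook)

def harvest(texts):
    """Run text through the model and accumulate per-token activations in a buffer."""
    buffer = []
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            model(**batch)
        buffer.append(captured.pop())
        if sum(b.shape[0] for b in buffer) >= BUFFER_SIZE:
            acts = torch.cat(buffer)
            acts = acts[torch.randperm(acts.shape[0])]   # shuffle to break autocorrelation
            yield acts                                   # in practice: write shard to disk
            buffer = []
```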
We also experimented with more diverse data mixes, for instance including a range of instruct and chat data sources along with web text, since we are limited by the availability of suitable chat data. These data mixes produced interesting features with even greater diversity, and many of them were interpretable, but our qualitative impression was that they were much less suitable as intervention candidates than the features in the LMSYS SAE. Interestingly, this training protocol also led to far more dead and very low-frequency features than in the LMSYS-only SAE, perhaps because the number of training steps between examples of a given feature increases with greater data diversity.
To explain the lower performance of our data-mix SAEs, we conjecture that SAE features become interpretable before they become suitable for intervening upon (perhaps the encoder trains faster than the decoder, or perhaps more accuracy is required for a feature to become a good intervention target). A natural consequence of this conjecture is that as data diversity increases, feature diversity increases with it, but if the dataset size is held constant then each feature receives correspondingly less data and is therefore comparatively undertrained. Because of this, we expect more sophisticated data mixes to perform better as we scale up SAE training further, and we expect to invest substantially in understanding the science of SAE training data.
Automated interpretability

SAE features are often interpretable, but don’t come with any human-readable interpretation attached. In order to generate human-readable interpretations, we use an automated interpretability pipeline that draws on state-of-the-art methods. For each live feature in the SAE, we collect examples of inputs that activated that feature (these are typically the token on which the feature was activated and the preceding tokens). We collect examples from across the distribution of activation levels (i.e. not only the examples that maximally activated the feature) and ask Claude to determine what these activations have in common.
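A rough sketch of this labeling step is below. The prompt format and the way examples are passed in are illustrative assumptions about how such a pipeline can be wired up, not our exact implementation; the Anthropic client call shown is the standard Messages API.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def label_feature(examples):
    """Ask Claude what a set of feature-activating contexts have in common.

    examples: list of (context, activating_token, activation_level) tuples sampled
    from across the feature's activation distribution, not just the top examples.
    """
    formatted = "\n".join(
        f"- activation {level:.2f} on token '{token}' in: ...{context}"
        for context, token, level in examples
    )
    prompt = (
        "The following text snippets all activate the same feature inside a "
        "language model. The activating token and activation strength are shown "
        "for each snippet.\n\n"
        f"{formatted}\n\n"
        "In one short sentence, what do these activations have in common?"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```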
Although our automated interpretability pipeline works well in many cases, it can struggle with many of the most influential features as these often occur on special tokens, punctuation, or other ‘pivot points’ in model responses. As such, Claude typically (and unsurprisingly) identifies these features as being about the specific tokens, as opposed to the information that they store about their context. Developing an automated interpretability pipeline or agent that focuses on the causal effect of a feature (for example by ablating or increasing it) could improve this situation, as could better reasoning capabilities.
In addition, we run a more stringent test: we check whether our automated interpretability labels can distinguish a feature from its ten nearest neighbours in embedding space. For a given feature label, we find its ten nearest neighbours, then provide a language model with the feature's description and four examples: one true example and three distractors taken from the neighbouring features. We repeat this process ten times to determine how reliably the model can distinguish between similar features - you can see the results below.
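A sketch of this discrimination test is below. The exact prompt wording and the way examples are supplied are assumptions for illustration; the structure follows the procedure described above: one true activating example, three distractors from neighbouring features, and a judge model asked to pick the snippet that matches the label.

```python
import random

def discrimination_trial(label, true_example, distractor_examples, judge):
    """One trial of the nearest-neighbour discrimination test.

    label:               the feature's automated interpretability label
    true_example:        a text snippet that genuinely activates the feature
    distractor_examples: snippets activating the feature's nearest-neighbour features
    judge:               callable (prompt -> answer string), e.g. a language model
    """
    options = random.sample(distractor_examples, 3) + [true_example]
    random.shuffle(options)
    answer_index = options.index(true_example)

    prompt = (
        f"A feature in a language model has this description: {label!r}\n\n"
        "Which ONE of the following snippets activates this feature?\n"
        + "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with the option number only."
    )
    return judge(prompt).strip() == str(answer_index + 1)

def discrimination_score(label, true_examples, distractor_examples, judge) -> float:
    """Fraction of ten trials in which the judge picks the true example."""
    trials = [
        discrimination_trial(label, ex, distractor_examples, judge)
        for ex in random.choices(true_examples, k=10)
    ]
    return sum(trials) / len(trials)
```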

A further interesting direction for automated interpretability would be to build interpreter agents: AI scientists which given an SAE feature could create hypotheses about what the feature might do, come up with experiments that would distinguish between those hypotheses (for instance new inputs or feature ablations), and then repeat until the feature is well-understood. This kind of agent might be the first automated alignment researcher. Our early experiments in this direction have shown that we can substantially increase automated interpretability performance with an iterative refinement step, and we expect to be able to push this approach much further.
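As a sketch of what such an iterative refinement step can look like (the structure here is an assumption, not our exact procedure): propose a label, test it with the discrimination check above, and feed failing cases back to the labeling model.

```python
def refine_label(initial_label, true_examples, distractor_examples, labeler, judge,
                 max_rounds: int = 3, target_score: float = 0.9):
    """Iteratively refine a feature label until it reliably passes the
    nearest-neighbour discrimination test (sketch; helpers defined above)."""
    label = initial_label
    for _ in range(max_rounds):
        score = discrimination_score(label, true_examples, distractor_examples, judge)
        if score >= target_score:
            break
        # Ask the labeling model to sharpen the description, showing it the
        # distractor snippets it was confused with.
        label = labeler(
            f"The description {label!r} failed to distinguish this feature from "
            "similar features. Here are snippets from the similar features:\n"
            + "\n".join(f"- {d}" for d in distractor_examples[:5])
            + "\n\nWrite a sharper one-sentence description of the original feature."
        )
    return label
```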
Interventions and attributions
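As outlined in the TL;DR, we designed a gradient-based attribution method to surface candidate features and then intervene on those features to change model behaviour while minimising performance degradation. For concreteness, here is a minimal sketch of the most common form of feature intervention: adding a scaled copy of a feature's decoder direction to the residual stream at a chosen layer. The hook point, scale, and function names are assumptions for illustration, not our production implementation.

```python
import torch

def steer_with_feature(model, sae, feature_idx: int, scale: float, layer: int):
    """Register a forward hook that adds `scale` times the feature's decoder
    direction to the residual stream at `layer` on every forward pass.

    Returns the hook handle so the intervention can be removed later.
    Assumes the SAE decoder weight has shape [d_model, d_features].
    """
    direction = sae.decoder.weight[:, feature_idx].detach()   # [d_model]

    def hook(module, inputs, output):
        hidden = output[0]                                     # [batch, seq, d_model]
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + tuple(output[1:])

    return model.model.layers[layer].register_forward_hook(hook)

# Usage sketch: steer, generate, then remove the hook.
# handle = steer_with_feature(model, sae, feature_idx=1234, scale=8.0, layer=16)
# output_ids = model.generate(**tokenizer("Hello!", return_tensors="pt"), max_new_tokens=50)
# handle.remove()
```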

Safety policy
At Goodfire, safety is at our core. We’re a public benefit corporation dedicated to the mission of understanding AI models to enable safer, more reliable generative AI.
One application of our technology that we're excited about is advancing auditing and red-teaming techniques. We worked with the team at Haize Labs to highlight this capability, which you can read about here. We see a future where steering models towards and away from specific features can be used to elicit jailbreaks and surface additional model capabilities. We're committed to working with organizations like the amazing team at Haize Labs to advance safety research.
We also spent time prior to the release adding moderation to filter out a significant portion of harmful features, feature samples, and user inputs/outputs that violate our API categories. If you are a safety researcher who would like access to the features we've removed, you can reach out at contact@goodfire.ai.
In the future, we’ll train interpreter models on larger and more capable foundation models. We are committed to making sure that our releases are safe, and will work with red-teaming and safety evals organizations to help ensure smooth and safe releases. We believe that understanding model internals is crucial for identifying shortcomings in generative models and guiding more effective safety research. We’re excited to equip researchers with these tools and see what they can do over the coming months.
What’s next?
We’re actively developing a developer toolkit that incorporates the technology showcased in our preview, while simultaneously advancing the frontier of applied research. If you’re interested in trying our product, sign up for our waitlist. And if you’re passionate about shaping the future of interpretability, we’d love to hear from you!
We thank the team at Haize Labs for their collaboration on safety research and auditing applications of this technology.
T.M. conceived and led the research. The Goodfire team contributed to the implementation and writing of the paper.
- Features are the internal concepts a model uses to generate output, often associated with specific neurons or groups of neurons that represent fundamental building blocks of the model's decision-making process.
T. McGrath et al., "Understanding and Steering Llama 3 with Sparse Autoencoders," Goodfire Research, Sep. 25, 2024. [Online]. Available: https://www.goodfire.ai/papers/understanding-and-steering-llama-3
- Sharkey, L., 2022. Interim Research Report: Taking Features Out of Superposition with Sparse Autoencoders. Alignment Forum. [link]
- Cunningham, H. and Sharkey, L., 2023. Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv preprint arXiv:2309.08600. [link]
- Bricken, T., 2023. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits. [link]
- Elhage, N., 2022. Toy Models of Superposition. Transformer Circuits. [HTML]
- Gao, L., 2024. Scaling and Evaluating Sparse Autoencoders. arXiv preprint arXiv:2406.04093. [link]
- Rajamanoharan, S. and Nanda, N., 2024. Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014. [link]
- Rajamanoharan, S. and Nanda, N., 2024. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435. [link]
- Transformer Circuits, 2024. Anthropic Circuits Updates — April 2024. [HTML]
- Zheng, L., 2024. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. ICLR. [link]
- Penedo, G., 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557. [PDF]
- Transformer Circuits, 2024. Anthropic Circuits Updates — August 2024. [link]
- Bills, S., 2023. Language Models Can Explain Neurons in Language Models. OpenAI. [HTML]
- Juang, C., 2023. Open Source Automated Interpretability for Sparse Autoencoder Features. EleutherAI. [link]