Mapping the Latent Space of Llama 3.3 70B
We have trained sparse autoencoders (SAEs) on Llama 3.3 70B and released the interpreted model for general access via an API. To our knowledge, this is the most capable openly available model with interpretability tooling. We think that making interpretability tools easily available on a powerful model will enable both new research and new products.
This post explores the feature space of Llama 3.3 70B at an intermediate layer: you can browse an interactive map of features that you can then use in the API, and we also demo the steering effects of some of our favorite features.
We have also introduced a range of new features that make SAE-based steering much easier to use and more reliable. You can learn how to use them in our API docs and experiment with them in our playground. We’ll be releasing a research post covering our improvements in steering methodology in the new year.
Feature explorer
Feature map
Feature examples
Here we show some examples of feature clusters we found interesting - this is by no means exhaustive and there is a lot left to discover in this latent space.

Feature steering
You can also steer the model using SAE latents. Our AutoSteer functionality automatically finds SAE latents to elicit a desired behaviour and sets their weights (you can read more about this in our API docs). Here we showcase a simpler setting, which is what you get when you call variant.set(feature_id, z). In this case, we simply increase the selected feature's value.
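Below is a minimal sketch of what this looks like in code. Only the variant.set(feature_id, z) call comes from the description above; the client construction, feature search, and completion call are illustrative assumptions about the surrounding SDK, so check the API docs for the exact interface.

```python
# Minimal single-feature steering sketch. Only variant.set(feature_id, z) is
# described above; the client setup, feature search, and chat call shown here
# are assumptions for illustration (see the API docs for the real interface).
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")  # assumed client constructor
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Assume a "pirate speech" latent has already been located, e.g. via a feature search.
pirate_feature = client.features.search("talks like a pirate", model=variant)[0]

variant.set(pirate_feature, 0.6)  # simply increase the selected feature's value

response = client.chat.completions.create(
    model=variant,
    messages=[{"role": "user", "content": "Tell me about the Andromeda Galaxy."}],
)
print(response.choices[0].message)  # exact response shape is an assumption
```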
You can see an example of simple feature steering below, where we've asked Llama 3.3 70B to tell us about the Andromeda Galaxy at various values of steering strength. The x-axis is steering strength, and the y-axis is a language model's assessment of whether the response was both coherent and achieved the desired behaviour (in this case, talking like a pirate). Because we use Claude and rate on a 0-100 scale, we call the unit of measurement the centiClaude.
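In pseudocode, the sweep behind this plot looks roughly like the sketch below. The strength grid, judge prompt, and helper names are illustrative assumptions rather than our exact evaluation code.

```python
# Rough sketch of the steering-strength sweep: generate a steered response at each
# strength and have an LLM judge score it from 0 to 100 ("centiClaude") for being
# both coherent and pirate-like. Prompt wording and helper names are assumptions.

STRENGTHS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]

JUDGE_PROMPT = (
    "Rate the following response from 0 to 100 for being BOTH coherent AND "
    "written in pirate speech. Reply with a single integer.\n\nResponse:\n{response}"
)

def centiclaude_sweep(client, variant, feature, ask_judge):
    """ask_judge(prompt) -> str wraps whatever judge model you use (e.g. Claude)."""
    scores = {}
    for strength in STRENGTHS:
        variant.set(feature, strength)
        completion = client.chat.completions.create(
            model=variant,
            messages=[{"role": "user", "content": "Tell me about the Andromeda Galaxy."}],
        )
        text = completion.choices[0].message  # assumed response shape, as above
        scores[strength] = int(ask_judge(JUDGE_PROMPT.format(response=text)))
    return scores
```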
If you’ve spent much time steering language models, these results won’t be too surprising, but there are a few interesting things to notice. The first is that although the model-written evaluation increases dramatically at around 0.5 (which is where the style fully shifts), at a strength of 0.4 the steered model already begins to exhibit a few elements of pirate speech. The model then shifts style fully, but factuality slowly begins to degrade. For instance, at strength 0.6 most facts are correct, apart from the galaxy’s size, whereas at strength 1 many facts are incorrect (and are rescaled to nautical units such as knots, and to far more mundane magnitudes). The mechanism by which latent steering damages factual recall is unknown, but it would be very interesting to understand.
This visualisation also demonstrates a pitfall of using language models for steering evaluations: at steering strength 1.4 all factual claims are incorrect, yet the model continues to receive a high score. Presumably by asking the evaluator model to check factuality we could resolve this specific issue (or isolate it to a separate metric), but there may be other issues yet to find - at this stage interpretability (and steering) still needs a strong qualitative foundation.
Methods
We have largely adopted an LLM-based evaluation pipeline in order to scale steering evaluation. In our experience, SAEs that are narrower and sparser than those conventionally trained are better for steering, though this is likely to be in tension with classifier performance.
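As a concrete (and purely illustrative) reading of "narrower and sparser", the sketch below shows the two knobs on a TopK-style SAE: the dictionary width (expansion factor) and the number of active latents per token. This is not our training code, just a minimal PyTorch example under those assumptions.

```python
# Illustrative TopK-style SAE, not Goodfire's training code. "Narrower" means a
# smaller expansion_factor (dictionary width); "sparser" means a smaller k
# (active latents per token). Both knobs are assumptions made for illustration.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int = 8, k: int = 64):
        super().__init__()
        d_sae = d_model * expansion_factor  # dictionary width
        self.k = k                          # active latents per token
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre_acts = torch.relu(self.encoder(x))
        # Keep only the top-k activations per token; zero out the rest.
        topk = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(latents), latents
```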
Moderation
We don’t believe that open-sourcing SAEs on current leading open-source models meaningfully affects risks such as biorisk or persuasion: the base model does not seem sufficiently capable for that. However, we are committed to tracking these risks and to developing safety evaluations as we continue to scale our interpretability efforts. These evals will form the foundation of a Responsible Scaling Plan.
We also believe there are research use cases where having access to the unmoderated SAEs is valuable. If you’re a safety researcher, you can request access by emailing contact@goodfire.ai.
Limitations and areas for improvement
McGrath, et al., "Mapping the latent space of Llama 3.3 70B", Goodfire Research, 2024.