Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B

Following our announcement of Goodfire Ember, we’re excited to release state-of-the-art, open-source sparse autoencoders (SAEs) for Llama 3.1 8B and Llama 3.3 70B. SAEs are interpreter models that help us understand how language models process and represent information internally.

These SAE models power Ember’s interpretability API/SDK and have been instrumental in enabling feature discovery (identifying specific patterns and behaviors the model has learned) and programmatic control over LLM internals.

What’s Being Released

We’re releasing SAEs for Llama 3.1 8B and Llama 3.3 70B. These models build on top of our earlier work on Llama-3-8B, where we demonstrated the effectiveness of training an SAE on the LMSYS-Chat-1M dataset. Our SAEs are designed to decompose complex neural activations into interpretable features, making it possible to understand and steer model behavior at a granular level.
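To make the decomposition concrete, here is a minimal PyTorch sketch of an SAE’s forward pass. The architecture is simplified and the dimensions are illustrative, not those of the released checkpoints:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encodes a residual-stream activation into a
    sparse, overcomplete set of latent features, then reconstructs it.
    Dimensions are illustrative, not the released models' actual sizes."""

    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation: torch.Tensor):
        # Encode: each positive latent is a candidate interpretable feature.
        features = torch.relu(self.encoder(activation))
        # Decode: reconstruct the original activation from the active features.
        reconstruction = self.decoder(features)
        return features, reconstruction
```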

A note on model implementation:

Our starting point for the parameterization strategy was the Anthropic April update. We evaluated our SAEs on standard sparsity and fidelity metrics, as well as on overall feature quality for tasks like steering. We assessed steering quality with an LLM-as-a-judge scoring system, testing the model’s ability to exhibit specific “steered” behaviors (e.g., “talk like a pirate,” “melancholy,” “anxiety”) across a diverse set of prompts, including greetings, creative writing tasks, and knowledge-based queries. Each response was scored by Claude 3.5 Sonnet on a 0-10 scale for behavioral coherence and task relevance. The L0 count varies with model scale: 91 for the Llama 3.1 8B SAE and 121 for the Llama 3.3 70B SAE. For training, we targeted layer 19 in the 8B model and layer 50 in the 70B variant.
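As an illustration of what the L0 metric measures, the sketch below counts the average number of active latents per token. The tensors are random stand-ins for real SAE outputs, chosen so that roughly 90 of 65,536 latents fire per token:

```python
import torch

def l0_per_token(features: torch.Tensor) -> torch.Tensor:
    """Mean number of non-zero SAE latents per token.
    `features` has shape (batch, seq_len, d_features); with a ReLU
    encoder, 'active' simply means strictly positive."""
    return (features > 0).sum(dim=-1).float().mean()

# Random stand-in for SAE latents: the -3.0 shift leaves ~0.1% of
# 65,536 latents positive, i.e. roughly 90 active per token.
feats = torch.relu(torch.randn(2, 16, 65536) - 3.0)
print(f"L0 per token: {l0_per_token(feats).item():.1f}")
```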

Why this matters

By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.

Our initial work with these SAEs has revealed promising applications in model steering, strengthening safeguards against jailbreaking, and interpretable classification. We look forward to seeing how the research community builds on these foundations and uncovers new applications.
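As one example of what steering can look like, the sketch below adds a scaled SAE decoder direction to a residual-stream activation. This is a common technique in the interpretability literature, not necessarily how Ember implements it, and all tensors here are placeholders:

```python
import torch

def steer_activation(activation: torch.Tensor,
                     decoder_direction: torch.Tensor,
                     strength: float = 4.0) -> torch.Tensor:
    """Nudge a residual-stream activation along one SAE feature's
    decoder direction; `strength` sets how strongly the associated
    behavior (e.g. 'talk like a pirate') is expressed."""
    direction = decoder_direction / decoder_direction.norm()
    return activation + strength * direction

# Placeholder tensors standing in for real model/SAE values.
act = torch.randn(4096)          # one token's residual-stream activation
pirate_dir = torch.randn(4096)   # a column of the SAE decoder matrix
steered = steer_activation(act, pirate_dir)
```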

Getting Started

These SAE models are available on Hugging Face at huggingface.co/Goodfire. We’ve also prepared documentation and example notebooks to help researchers get started quickly via the Ember API/SDK.
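For example, the weights can be fetched with huggingface_hub. The repository and file names below are illustrative guesses; check huggingface.co/Goodfire for the exact identifiers:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are illustrative; see huggingface.co/Goodfire
# for the actual repository names and file layout.
path = hf_hub_download(
    repo_id="Goodfire/Llama-3.1-8B-Instruct-SAE-l19",
    filename="Llama-3.1-8B-Instruct-SAE-l19.pth",
)
print(f"Downloaded SAE weights to {path}")
```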

We’re looking forward to seeing what the community builds with these tools! If you’re interested in applying similar approaches to understand your own AI models, please reach out.