Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B

AUTHORS
Daniel Balsam
Thomas McGrath
Liv Gorton
Nam Nguyen
Myra Deng
Eric Ho
AFFILIATION
Goodfire Research
PUBLISHED
January 10, 2025
Following our announcement of Goodfire Ember, we're excited to release state-of-the-art, open-source sparse autoencoders (SAEs) for Llama 3.1 8B and Llama 3.3 70B. SAEs are interpreter models that help us understand how language models process and represent information internally [1].

These SAE models power Ember’s interpretability API/SDK and have been instrumental in enabling feature discovery (identifying specific patterns and behaviors the model has learned) and programmatic control over LLM internals.

What’s Being Released

We're releasing SAEs for Llama 3.1 8B and Llama 3.3 70B. These models build on top of our earlier work on Llama-3-8B, where we demonstrated the effectiveness of training an SAE on the LMSYS-Chat-1M dataset [2]. Our SAEs are designed to decompose complex neural activations into interpretable features, making it possible to understand and steer model behavior at a granular level.

A note on model implementation:

Our starting point for the parameterisation strategy was Anthropic's April 2024 update [3]. We evaluated our SAEs on standard sparsity and fidelity metrics, as well as on overall feature quality for tasks like steering. The L0 is 91 for the Llama 3.1 8B SAE and 121 for the Llama 3.3 70B SAE. We trained on activations from layer 19 of the 8B model and layer 50 of the 70B model.
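
For readers new to the setup, the sketch below shows one common SAE parameterisation along these lines: a linear encoder followed by a ReLU produces sparse feature activations, and a linear decoder, whose columns act as feature directions, reconstructs the original layer activations. The dimensions and details here are illustrative and may differ from the released models.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: linear encoder + ReLU, linear decoder whose
    columns act as feature directions. Dimensions are illustrative."""

    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)               # W_enc, b_enc
        self.decoder = nn.Linear(d_features, d_model, bias=False)   # W_dec
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse, non-negative feature activations for each token
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Reconstruct the original activations from the active features
        return self.decoder(f) + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

# Reconstruct a batch of residual-stream activations and measure L0
sae = SparseAutoencoder()
acts = torch.randn(8, 4096)                      # stand-in for layer activations
recon, feats = sae(acts)
l0 = (feats > 0).float().sum(dim=-1).mean()      # average active features per token
```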

Why this matters

By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.

Our initial work with these SAEs has revealed promising applications in model steering, strengthening safeguards against jailbreaks, and building interpretable classifiers. We look forward to seeing how the research community builds upon these foundations and uncovers new applications.
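
As a rough illustration of the steering use case, the sketch below adds a scaled SAE feature direction to a layer's hidden activations before the model continues its forward pass. The feature index, scale, and hook mechanics are placeholders rather than a recipe from our release.

```python
import torch

def steer(hidden: torch.Tensor, sae, feature_idx: int, scale: float) -> torch.Tensor:
    """Nudge hidden activations along one SAE feature direction.

    hidden: [..., d_model] activations at the layer the SAE was trained on.
    sae:    a trained SAE whose decoder columns are feature directions.
    """
    direction = sae.decoder.weight[:, feature_idx]   # [d_model] decoder column
    direction = direction / direction.norm()         # unit-normalise for a predictable effect
    return hidden + scale * direction

# Hypothetical usage inside a forward hook on the SAE's layer:
# hidden_states = steer(hidden_states, sae, feature_idx=123, scale=4.0)
```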

Getting Started

These SAE models are available on Hugging Face at huggingface.co/Goodfire. We’ve also prepared documentation and example notebooks to help researchers get started quickly via the Ember API/SDK.
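
As one quick example of getting the weights locally, the snippet below uses the huggingface_hub client to list our public repositories and download one; the repository names and file layout are not spelled out here, so check the model cards on huggingface.co/Goodfire for specifics.

```python
from huggingface_hub import list_models, snapshot_download

# List Goodfire's public repositories to find the SAE release you want
goodfire_repos = [m.id for m in list_models(author="Goodfire")]
print(goodfire_repos)

# Download one locally (pick the Llama 3.1 8B or Llama 3.3 70B SAE from the list)
repo_id = goodfire_repos[0]          # placeholder choice; substitute the repo you need
local_dir = snapshot_download(repo_id=repo_id)
print("Downloaded to", local_dir)
```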

We’re looking forward to seeing what the community builds with these tools! If you’re interested in applying similar approaches to understand your own AI models, please reach out.

Citation

D. Balsam, T. McGrath, L. Gorton, N. Nguyen, M. Deng, and E. Ho, "Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B," Goodfire Research, Jan. 10, 2025. [Online]. Available: https://www.goodfire.ai/blog/sae-open-source-announcement

REFERENCES
  1. Understanding and Steering Llama 3 with Sparse Autoencoders [link]
    McGrath, T. et al., 2024.
  2. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
    Zheng, L., Chiang, W., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Lin, Z., Xing, E.P., Gonzalez, J.E., Stoica, I. and Zhang, H., 2023.
  3. Anthropic Circuits Updates — April 2024 [HTML]
    Transformer Circuits, 2024.
