Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B
These sparse autoencoder (SAE) models power Ember’s interpretability API/SDK and have been instrumental in enabling feature discovery (identifying specific patterns and behaviors the model has learned) and programmatic control over LLM internals.
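For background, sparse autoencoders in this setting follow a standard recipe from the interpretability literature: a model's hidden states are mapped into a much wider, sparsely active feature space and then reconstructed. The PyTorch sketch below is illustrative only and is not the exact released architecture; the names `d_model` and `d_sae` and the pre-encoder bias subtraction are conventions assumed here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct hidden states through a sparse latent layer."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.empty(d_sae, d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Feature activations: most entries are zero after the ReLU,
        # which is what makes individual features interpretable.
        return torch.relu((h - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Map feature activations back into the model's hidden-state space.
        return f @ self.W_dec + self.b_dec

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(h))
```

Feature discovery amounts to asking which latent dimensions activate on which inputs, and programmatic control amounts to intervening on those dimensions before decoding back into the model.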
What’s Being Released
A note on model implementation:
Why this matters
By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.
Our initial work with these SAEs has revealed promising applications in model steering, strengthening safeguards against jailbreaking, and interpretable classification. We look forward to seeing how the research community builds on these foundations and uncovers new applications.
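As an illustration of the steering use case, one common approach is to add a scaled copy of a feature's decoder direction into the residual stream during generation. The sketch below assumes a Hugging Face Transformers Llama model; the layer index, feature index, and strength are hypothetical placeholders, the random matrix stands in for real SAE decoder weights, and this is not the Ember SDK's API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholders: the layer the SAE reads from, a feature index
# found via feature discovery, and a steering strength.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER, FEATURE_IDX, STRENGTH = 19, 12345, 4.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Stand-in for an SAE decoder matrix (d_sae x d_model); in practice, load the
# released SAE weights and take the row for the chosen feature.
W_dec = torch.randn(65536, model.config.hidden_size)
steering_vector = W_dec[FEATURE_IDX]

def steer(module, inputs, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tensor or a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * steering_vector.to(hidden.device, hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
inputs = tokenizer("Tell me about your weekend.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```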
Getting Started
These SAE models are available on Hugging Face at huggingface.co/Goodfire. We’ve also prepared documentation and example notebooks to help researchers get started quickly via the Ember API/SDK.
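For example, a raw checkpoint can be fetched and inspected with the `huggingface_hub` client. The repository and file names below are assumptions for illustration; check the model cards at huggingface.co/Goodfire for the exact identifiers.

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical repo and file names; consult the Goodfire model cards for the
# exact identifiers of the released SAE checkpoints.
path = hf_hub_download(
    repo_id="Goodfire/Llama-3.1-8B-Instruct-SAE-l19",
    filename="Llama-3.1-8B-Instruct-SAE-l19.pth",
)

# Inspect the checkpoint's parameter names and shapes.
state_dict = torch.load(path, map_location="cpu", weights_only=True)
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```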
We’re looking forward to seeing what the community builds with these tools! If you’re interested in applying similar approaches to understand your own AI models, please reach out.
D. Balsam, T. McGrath, L. Gorton, N. Nguyen, M. Deng, and E. Ho, "Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B," Goodfire Research, Jan. 10, 2025. [Online]. Available: https://www.goodfire.ai/blog/sae-open-source-announcement
- McGrath, T., et al., 2024. Understanding and Steering Llama 3 with Sparse Autoencoders. [link]
- Zheng, L., Chiang, W., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Lin, Z., Xing, E.P., Gonzalez, J.E., Stoica, I. and Zhang, H., 2023. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset.
- Transformer Circuits, 2024. Anthropic Circuits Updates — April 2024. [HTML]