Following our announcement of Goodfire Ember, we’re excited to release state-of-the-art, open-source sparse autoencoders (SAEs) for Llama 3.1 8B and Llama 3.3 70B. SAEs are interpreter models that help us understand how language models process and represent information internally.
These SAE models power Ember’s interpretability API/SDK and have been instrumental in enabling feature discovery (identifying specific patterns and behaviors the model has learned) and programmatic control over LLM internals.
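For readers new to SAEs, here is a minimal sketch of the architecture in PyTorch: a wide, overcomplete linear encoder with a ReLU produces a mostly-zero feature vector, and a linear decoder reconstructs the original activation from it. The dimensions below are illustrative toys, not the released models' actual sizes.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: maps a d_model activation vector to a wide,
    sparse feature vector and reconstructs the input from it."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # The ReLU zeroes out most entries of `f`; each surviving
        # nonzero entry is a candidate interpretable feature.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


# Toy usage: encode a fake batch of residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_features=8192)
acts = torch.randn(8, 512)  # stand-in for Llama hidden states
recon, feats = sae(acts)
print(f"fraction of active features: {(feats > 0).float().mean():.3f}")
```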
The Llama 3.1 8B and Llama 3.3 70B SAEs build on our earlier work on Llama-3-8B, where we demonstrated the effectiveness of training an SAE on the LMSYS-Chat-1M dataset.
Our starting point for the parameterization strategy was the Anthropic April update.
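As a hedged sketch of that recipe (our reading of the April update, with an illustrative coefficient rather than the value we used): the training objective combines mean-squared reconstruction error with an L1 penalty on feature activations, each weighted by the norm of its decoder direction.

```python
import torch


def sae_loss(x, x_hat, f, w_dec, sparsity_coeff=5.0):
    """SAE training loss in the style of Anthropic's April update:
    MSE reconstruction plus decoder-norm-weighted L1 sparsity.

    x:      (batch, d_model) input activations
    x_hat:  (batch, d_model) reconstructions
    f:      (batch, d_features) feature activations
    w_dec:  (d_features, d_model) decoder directions
            (sae.decoder.weight.T in the sketch above)
    """
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    # Weighting each feature's L1 term by its decoder norm removes
    # the incentive to shrink activations while growing decoder rows.
    dec_norms = w_dec.norm(dim=-1)                  # (d_features,)
    sparsity = (f * dec_norms).sum(dim=-1).mean()
    return recon + sparsity_coeff * sparsity
```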
By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.
Our initial work with these SAEs has revealed promising applications in model steering, enhancing jailbreaking safeguards, and interpretable classification methods. We look forward to seeing how the research community builds upon these foundations and uncovers new applications.
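To illustrate what feature-level steering can look like, here is one common recipe: nudging hidden states along a chosen feature's decoder direction, typically applied via a forward hook at the SAE's layer. This is a generic sketch, not necessarily how Ember implements steering.

```python
import torch


def steer(acts: torch.Tensor, sae, feature_idx: int, strength: float):
    """Nudge residual-stream activations along one SAE feature's
    (unit-normalized) decoder direction.

    acts: (batch, seq, d_model) hidden states at the SAE's layer.
    """
    direction = sae.decoder.weight[:, feature_idx]  # (d_model,)
    direction = direction / direction.norm()
    return acts + strength * direction
```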
These SAE models are available on Hugging Face at huggingface.co/Goodfire. We’ve also prepared documentation and example notebooks to help researchers get started quickly via the Ember API/SDK.
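As a quick-start sketch, the weights can be fetched with huggingface_hub; the repo and file names below are placeholders, so check the organization page and the example notebooks for the actual paths.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo/file names for illustration -- see
# huggingface.co/Goodfire for the actual paths.
path = hf_hub_download(
    repo_id="Goodfire/Llama-3.1-8B-SAE",  # hypothetical repo id
    filename="sae.pth",                   # hypothetical filename
)
state_dict = torch.load(path, map_location="cpu")
print(sorted(state_dict.keys()))
```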
We’re looking forward to seeing what the community builds with these tools! If you’re interested in applying similar approaches to understand your own AI models, please reach out.