Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B

Following our announcement of Goodfire Ember, we’re excited to release state-of-the-art, open-source sparse autoencoders (SAEs) for Llama 3.1 8B and Llama 3.3 70B. SAEs are interpreter models that help us understand how language models process and represent information internally.

These SAE models power Ember’s interpretability API/SDK and have been instrumental in enabling feature discovery (identifying specific patterns and behaviors the model has learned) and programmatic control over LLM internals.

What’s Being Released

We’re releasing SAEs for Llama 3.1 8B and Llama 3.3 70B. These models build on top of our earlier work on Llama-3-8B, where we demonstrated the effectiveness of training an SAE on the LMSYS-Chat-1M dataset. Our SAEs are designed to decompose complex neural activations into interpretable features, making it possible to understand and steer model behavior at a granular level.
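To make the decomposition concrete, here is a minimal PyTorch sketch of an SAE’s forward pass. The architecture is simplified and the dimensions are illustrative, not those of the released checkpoints:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encodes a residual-stream activation into a
    sparse, overcomplete set of latent features, then reconstructs it.
    Dimensions are illustrative, not the released models' actual sizes."""

    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation: torch.Tensor):
        # Encode: each positive latent is a candidate interpretable feature.
        features = torch.relu(self.encoder(activation))
        # Decode: reconstruct the original activation from the active features.
        reconstruction = self.decoder(features)
        return features, reconstruction
```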

A note on model implementation:

Our starting point for the parameterization strategy was the Anthropic April update. We evaluated our SAEs on standard sparsity and fidelity metrics, as well as on overall feature quality for tasks like steering. We assessed steering quality with an LLM-as-a-judge scoring system, testing the model’s ability to exhibit specific “steered” behaviors (e.g., “talk like a pirate,” “melancholy,” “anxiety”) across a diverse set of prompts, including greetings, creative writing tasks, and knowledge-based queries. Each response was scored by Claude 3.5 Sonnet on a 0-10 scale for behavioral coherence and task relevance. The L0 count varies with model scale: 91 for the Llama 3.1 8B SAE and 121 for the Llama 3.3 70B SAE. For training, we targeted layer 19 in the 8B model and layer 50 in the 70B variant.
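As an illustration of what the L0 metric measures, the sketch below counts the average number of active latents per token. The tensors are random stand-ins for real SAE outputs, chosen so that roughly 90 of 65,536 latents fire per token:

```python
import torch

def l0_per_token(features: torch.Tensor) -> torch.Tensor:
    """Mean number of non-zero SAE latents per token.
    `features` has shape (batch, seq_len, d_features); with a ReLU
    encoder, 'active' simply means strictly positive."""
    return (features > 0).sum(dim=-1).float().mean()

# Random stand-in for SAE latents: the -3.0 shift leaves ~0.1% of
# 65,536 latents positive, i.e. roughly 90 active per token.
feats = torch.relu(torch.randn(2, 16, 65536) - 3.0)
print(f"L0 per token: {l0_per_token(feats).item():.1f}")
```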

Why this matters

By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.

Our initial work with these SAEs has revealed promising applications in model steering, strengthening safeguards against jailbreaking, and interpretable classification. We look forward to seeing how the research community builds on these foundations and uncovers new applications.
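As one example of what steering can look like, the sketch below adds a scaled SAE decoder direction to a residual-stream activation. This is a common technique in the interpretability literature, not necessarily how Ember implements it, and all tensors here are placeholders:

```python
import torch

def steer_activation(activation: torch.Tensor,
                     decoder_direction: torch.Tensor,
                     strength: float = 4.0) -> torch.Tensor:
    """Nudge a residual-stream activation along one SAE feature's
    decoder direction; `strength` sets how strongly the associated
    behavior (e.g. 'talk like a pirate') is expressed."""
    direction = decoder_direction / decoder_direction.norm()
    return activation + strength * direction

# Placeholder tensors standing in for real model/SAE values.
act = torch.randn(4096)          # one token's residual-stream activation
pirate_dir = torch.randn(4096)   # a column of the SAE decoder matrix
steered = steer_activation(act, pirate_dir)
```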

Getting Started

These SAE models are available on Hugging Face at huggingface.co/Goodfire. We’ve also prepared documentation and example notebooks to help researchers get started quickly via the Ember API/SDK.
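For example, the weights can be fetched with huggingface_hub. The repository and file names below are illustrative guesses; check huggingface.co/Goodfire for the exact identifiers:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are illustrative; see huggingface.co/Goodfire
# for the actual repository names and file layout.
path = hf_hub_download(
    repo_id="Goodfire/Llama-3.1-8B-Instruct-SAE-l19",
    filename="Llama-3.1-8B-Instruct-SAE-l19.pth",
)
print(f"Downloaded SAE weights to {path}")
```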

We’re looking forward to seeing what the community builds with these tools! If you’re interested in applying similar approaches to understand your own AI models, please reach out.