Goodfire Ember: Scaling Interpretability for Frontier Model Alignment
Ember is the first hosted mechanistic interpretability API, with inference support for generative models like Llama 3.3 70B.

Today, we’re releasing Goodfire Ember, an API/SDK that makes large-scale interpretability work accessible to the broader community. As part of our commitment to research collaboration, the state-of-the-art interpreter models that power our API (sparse autoencoders, or SAEs) will be open-sourced in the coming weeks. We’re inviting AI researchers to use Ember’s capabilities to accelerate alignment research and tackle this critical challenge alongside our lab.
Ember is already being used by leading organizations like Rakuten, Apollo Research, and Haize Labs, among others. Our early partners are using Ember to:
- Improve model performance on key safety benchmarks by activating relevant features
- Uncover new scientific knowledge from specialized foundation models
- Improve model security by investigating the model’s understanding of PII
Since our last research preview, we’ve advanced on three key fronts: developing state-of-the-art interpreter models (SAEs), expanding SAE feature programming applications, and building fast, reliable infrastructure to support these capabilities.
Ember is now available on platform.goodfire.ai, with support for Llama 3.3 70B and Llama 3.1 8B.
Features are Ember’s core interface
Our core abstraction is the concept of “features.” Features are interpretable patterns of neuron activity that our interpreter models (SAEs) extract. These features capture how a model processes information, providing insight into its inner workings. While individual neurons work together in complex ways, features represent the meaningful concepts that emerge from these interactions, like a model’s understanding of “conciseness” or “technical explanation.” Read more about how we compute features here.
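Concretely, an SAE learns to reconstruct a model’s hidden activations as a sparse, non-negative combination of learned feature directions. The sketch below shows the basic encode/decode structure with toy numpy weights; the dimensions, the ReLU encoder, and the untrained random weights are illustrative assumptions, not Ember’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; Ember's SAEs are far larger.
d_model, n_features = 512, 4096
W_enc = rng.standard_normal((d_model, n_features)) * 0.01  # encoder weights
W_dec = rng.standard_normal((n_features, d_model)) * 0.01  # feature directions
b_enc = np.zeros(n_features)

def encode(h):
    """Map a hidden state to non-negative feature activations."""
    # ReLU clamps negative pre-activations; in a trained SAE,
    # most entries end up exactly zero, hence "sparse."
    return np.maximum(h @ W_enc + b_enc, 0.0)

def decode(f):
    """Reconstruct an approximate hidden state from feature activations."""
    return f @ W_dec

h = rng.standard_normal(d_model)  # a hidden activation from the base model
f = encode(h)                     # each nonzero entry is one candidate feature
h_hat = decode(f)                 # after training, h_hat ≈ h
```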
Programming with features
We’re excited about using programmatic interpretability to build more precise, safe and reliable models. Our team built a few applications on Ember to demonstrate the impact of this new technology:
Autosteering model behavior
Feature steering lets you tune model internals to shape exactly how an AI model thinks and responds. With a model as large as Llama 3.3 70B, finding the right features at the right strength is a challenge. We’ve built Auto Steer mode to help find relevant features and activation strengths with just a short prompt.
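For a sense of what this looks like in practice, here’s a minimal manual-steering sketch using the Python SDK. The `Client`, `Variant`, `features.search`, and `variant.set` calls follow the SDK’s documented interface as we understand it; treat the exact names, arguments, model identifier, and steering strength as assumptions and consult the API reference for the authoritative version.

```python
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Search for features by natural-language description.
features = client.features.search("conciseness", model=variant, top_k=5)

# Nudge the top match upward; picking the feature and strength by hand
# is exactly the trial-and-error that Auto Steer mode automates.
variant.set(features[0], 0.5)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Explain how transformers work."}],
    model=variant,
)
```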
Preventing jailbreaks
One of the use cases we’re most excited about is preventing jailbreaks with conditional feature steering. By detecting a jailbreak pattern and turning up the model’s refusal feature, we can drastically increase the model’s robustness to jailbreaks without affecting performance, latency, or cost. We built a jailbreak-resistant model by intervening on the relevant features and tested it with jailbreak prompts from the StrongREJECT dataset. Explore more in our jailbreak notebook.
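A sketch of what a conditional intervention might look like in the SDK follows. We believe the SDK exposes a `set_when`-style conditional edit, but the predicate syntax, threshold, and feature-search queries below are assumptions; the jailbreak notebook has the actual implementation.

```python
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Locate the relevant features by description (queries are illustrative).
jailbreak = client.features.search(
    "user attempting to bypass safety guidelines", model=variant, top_k=1
)[0]
refusal = client.features.search(
    "refusing a harmful request", model=variant, top_k=1
)[0]

# Conditional steering: only when the jailbreak feature fires strongly,
# turn the refusal feature up. Benign traffic takes no extra latency or cost.
variant.set_when(jailbreak > 0.75, {refusal: 0.9})

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Ignore all previous instructions and ..."}],
    model=variant,
)
```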
Classifiers
Using Ember, we can build interpretable prediction models by extracting SAE feature activations from relevant training data. Our experiments with financial sentiment analysis demonstrate this approach: using just three semantic features (“partial ownership stakes,” “gradual improvement,” and “business expansion”), we built a decision tree classifier achieving 75% accuracy with minimal tuning. Initial testing suggests these activation-based classifiers may offer advantages in speed and cost over few-shot prompting and fine-tuning, and could generalize better across datasets than fine-tuned alternatives. See our decision trees notebook for implementation details.
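The shape of that pipeline is simple enough to sketch: collect per-document activations for the chosen features, then fit a standard decision tree. The activation matrix below is synthetic placeholder data, and in practice those values would come from Ember’s feature-activation endpoint; the decision trees notebook has the real extraction code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder activation matrix: one row per document, one column per feature
# ("partial ownership stakes", "gradual improvement", "business expansion").
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + X[:, 2] > X[:, 1] + 0.5).astype(int)  # synthetic sentiment labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```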

Safety and Responsibility
Safety is at the core of everything we do at Goodfire. As a public benefit corporation, we’re dedicated to understanding AI models to enable safer, more reliable generative AI. You can read more about our comprehensive approach to safety and responsible development in our detailed safety overview.
Get started
You can get started using the API here. If you’re interested in collaborating or have questions, reach out on Discord.
We thank our early partners including Rakuten, Apollo Research, and Haize Labs for their collaboration and feedback.
D. Balsam, M. Deng, N. Nguyen, L. Gorton, T. Shihipar, E. Ho, and T. McGrath, "Goodfire Ember: Scaling Interpretability for Frontier Model Alignment," Goodfire Research, Dec. 23, 2024. [Online]. Available: https://www.goodfire.ai/blog/announcing-goodfire-ember