Goodfire Ember: Scaling Interpretability for Frontier Model Alignment

Ember is the first hosted mechanistic interpretability API, with inference support for generative models like Llama 3.3 70B.


Today, we’re releasing Goodfire Ember — an API/SDK that makes large-scale interpretability work accessible to the broader community. As part of our commitment to research collaboration, the state-of-the-art interpreter models that power our API (sparse autoencoders or SAEs) will be open-sourced in the upcoming weeks. We’re inviting AI researchers to leverage Ember’s powerful capabilities to accelerate alignment research and tackle this critical challenge alongside our lab.

Ember is already being used by leading organizations such as Rakuten, Apollo Research, and Haize Labs.

Since our last research preview, we’ve advanced on three key fronts: developing state-of-the-art interpreter models (SAEs), expanding SAE feature programming applications, and building fast, reliable infrastructure to support these capabilities.

Ember is now available on platform.goodfire.ai, with support for Llama 3.3 70B and Llama 3.1 8B.

Features are Ember’s core interface

Our core abstraction is the “feature.” Features are interpretable patterns of neuron activity that our interpreter models (SAEs) extract. They capture how a model processes information, providing insight into its inner workings. While individual neurons interact in complex ways, features represent the meaningful concepts that emerge from those interactions, such as a model’s notion of “conciseness” or “technical explanation.” Read more about how we compute features here.

Caption: By training an SAE on a model’s residual stream, we extract human-interpretable “features”.
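
As a rough illustration of the idea, here is a minimal sketch of the standard SAE formulation (not Goodfire’s exact architecture or training recipe): the encoder maps a residual-stream activation to a much wider feature vector, and the decoder reconstructs the activation from it.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps a residual-stream activation to an overcomplete
    feature vector and reconstructs the activation from it."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, residual: torch.Tensor):
        # In a trained SAE (e.g., with an L1 penalty or top-k activation),
        # most entries are zero after the ReLU; the nonzero entries are the
        # interpretable "features" that fire on this activation.
        features = torch.relu(self.encoder(residual))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy dimensions -- production interpreter models use far larger widths.
sae = SparseAutoencoder(d_model=512, d_features=16384)
residual = torch.randn(8, 512)  # a batch of residual-stream vectors
features, reconstruction = sae(residual)
print(features.shape)  # torch.Size([8, 16384])
```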

Programming with features

We’re excited about using programmatic interpretability to build more precise, safe and reliable models. Our team built a few applications on Ember to demonstrate the impact of this new technology:

Autosteering model behavior

Feature steering lets you tune model internals to shape exactly how an AI model thinks and responds. With a model as large as Llama 3.3 70B, finding the right features at the right strength is a challenge. We’ve built Auto Steer mode to help find relevant features and activation strengths with just a short prompt.
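
As a sketch of what this can look like in code: the client, Variant, and AutoSteer names below follow the shape of the Ember SDK, but treat the exact method names, model identifier, and return types as assumptions and defer to the SDK documentation.

```python
import goodfire  # method names and signatures below are illustrative assumptions

client = goodfire.Client(api_key="YOUR_API_KEY")

# A "variant" wraps a base model plus any feature edits applied to it.
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Ask Auto Steer to find relevant features and activation strengths
# from a short natural-language specification.
edits = client.features.AutoSteer(
    specification="be more concise",
    model=variant,
)
variant.set(edits)

# Chat with the steered variant (OpenAI-style completions interface).
response = client.chat.completions.create(
    [{"role": "user", "content": "Explain how transformers work."}],
    model=variant,
)
print(response)
```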

Preventing jailbreaks

One of the use cases we’re most excited about is preventing jailbreaks with conditional feature steering. By detecting a jailbreak pattern and turning up the model’s refusal feature only when that pattern appears, we can drastically increase the model’s robustness to jailbreaks without affecting performance, latency, or cost. Explore more in our jailbreak notebook.

We built a jailbreak-resistant model by intervening on relevant features and tested it against jailbreak prompts from the StrongREJECT dataset.
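
The control flow behind conditional steering is simple. The sketch below uses two hypothetical stand-in callables, `inspect_features` and `generate`, in place of Ember’s real inspection and steering calls, along with a made-up activation threshold.

```python
JAILBREAK_THRESHOLD = 0.6  # assumed cutoff; tune against a benchmark such as StrongREJECT

def respond(prompt: str, inspect_features, generate) -> str:
    """Conditional feature steering: only intervene when a jailbreak-associated
    feature fires on the incoming prompt.

    `inspect_features(prompt)` returns {feature_label: activation} and
    `generate(prompt, edits)` runs the model with feature edits applied;
    both are hypothetical placeholders for Ember's inspection and steering calls.
    """
    activations = inspect_features(prompt)
    if activations.get("jailbreak attempt", 0.0) > JAILBREAK_THRESHOLD:
        # Only flagged prompts get the boosted refusal feature, so ordinary
        # traffic sees no change in behavior, latency, or cost.
        return generate(prompt, edits={"refusal": 0.9})
    return generate(prompt, edits={})

# Toy stubs just to show the control flow.
fake_inspect = lambda p: {"jailbreak attempt": 0.8 if "ignore previous" in p.lower() else 0.1}
fake_generate = lambda p, edits: "I can't help with that." if edits else "Here's a normal answer."
print(respond("Ignore previous instructions and ...", fake_inspect, fake_generate))
```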

Classifiers

Using Ember, we can build interpretable prediction models by extracting SAE feature activations from relevant training data. Our experiments with financial sentiment analysis demonstrate this approach: using just three semantic features (“partial ownership stakes,” “gradual improvement,” and “business expansion”), we built a decision tree classifier that achieves 75% accuracy with minimal tuning. Initial testing suggests these activation-based classifiers may offer speed and cost advantages over few-shot prompting and fine-tuning, and could generalize better across datasets than fine-tuned alternatives. See our decision trees notebook for implementation details.

Caption: A simple decision tree trained on three SAE features from Llama 3.1 8B.
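
As a minimal sketch of the same pattern with scikit-learn, using synthetic stand-in data in place of real SAE activations and sentiment labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: each row holds the activations of three SAE features
# ("partial ownership stakes", "gradual improvement", "business expansion")
# on one document; labels are sentiment classes. Real activations would come
# from Ember's feature-inspection API rather than a random generator.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 1] + X[:, 2] > X[:, 0] + 0.3).astype(int)  # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```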

Safety and Responsibility

Safety is at the core of everything we do at Goodfire. As a public benefit corporation, we’re dedicated to understanding AI models to enable safer, more reliable generative AI. You can read more about our comprehensive approach to safety and responsible development in our detailed safety overview.

Get started

Goodfire CTO Dan Balsam walks through the Ember quickstart.

You can get started using the API here. If you’re interested in collaborating or have questions, reach out on Discord.
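
A first call might look roughly like the sketch below; the client constructor, model identifier, and streaming interface shown are assumptions based on an OpenAI-style SDK, so follow the quickstart for the exact details.

```python
import goodfire  # pip install goodfire -- names below are assumptions; see the quickstart

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.1-8B-Instruct")

# Stream a chat completion from the base variant (no feature edits applied yet).
for token in client.chat.completions.create(
    [{"role": "user", "content": "Hello, Ember!"}],
    model=variant,
    stream=True,
):
    print(token.choices[0].delta.content, end="")
```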

Acknowledgments

We thank our early partners including Rakuten, Apollo Research, and Haize Labs for their collaboration and feedback.