Goodfire Ember: Scaling Interpretability for Frontier Model Alignment

Ember is the first hosted mechanistic interpretability API, with inference support for generative models like Llama 3.3 70B.


Today, we’re releasing Goodfire Ember — an API/SDK that makes large-scale interpretability work accessible to the broader community. As part of our commitment to research collaboration, the state-of-the-art interpreter models that power our API (sparse autoencoders or SAEs) will be open-sourced in the upcoming weeks. We’re inviting AI researchers to leverage Ember’s powerful capabilities to accelerate alignment research and tackle this critical challenge alongside our lab.

Ember is already being used by leading organizations such as Rakuten, Apollo Research, and Haize Labs.

Since our last research preview, we’ve advanced on three key fronts: developing state-of-the-art interpreter models (SAEs), expanding SAE feature programming applications, and building fast, reliable infrastructure to support these capabilities.

Ember is now available on platform.goodfire.ai, with support for Llama 3.3 70B and Llama 3.1 8B.

Features are Ember’s core interface

Our core abstraction is the “feature.” Features are interpretable patterns of neuron activity that our interpreter models (SAEs) extract. They capture how a model processes information, providing insight into its inner workings. While individual neurons interact in complex ways, features represent the meaningful concepts that emerge from those interactions, such as a model’s notion of “conciseness” or “technical explanation.” Read more about how we compute features here.

Caption: By training an SAE on a model’s residual stream, we extract human-interpretable “features”.
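
As a rough illustration of the idea, here is a minimal sketch of the standard SAE formulation (not Goodfire’s exact architecture or training recipe): the encoder maps a residual-stream activation to a much wider feature vector, and the decoder reconstructs the activation from it.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps a residual-stream activation to an overcomplete
    feature vector and reconstructs the activation from it."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, residual: torch.Tensor):
        # In a trained SAE (e.g., with an L1 penalty or top-k activation),
        # most entries are zero after the ReLU; the nonzero entries are the
        # interpretable "features" that fire on this activation.
        features = torch.relu(self.encoder(residual))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy dimensions -- production interpreter models use far larger widths.
sae = SparseAutoencoder(d_model=512, d_features=16384)
residual = torch.randn(8, 512)  # a batch of residual-stream vectors
features, reconstruction = sae(residual)
print(features.shape)  # torch.Size([8, 16384])
```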

Programming with features

We’re excited about using programmatic interpretability to build more precise, safe and reliable models. Our team built a few applications on Ember to demonstrate the impact of this new technology:

Autosteering model behavior

Feature steering lets you tune model internals to shape exactly how an AI model thinks and responds. With a model as large as Llama 3.3 70B, finding the right features at the right strength is a challenge. We’ve built Auto Steer mode to help find relevant features and activation strengths with just a short prompt.
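
As a sketch of what this can look like in code: the client, Variant, and AutoSteer names below follow the shape of the Ember SDK, but treat the exact method names, model identifier, and return types as assumptions and defer to the SDK documentation.

```python
import goodfire  # method names and signatures below are illustrative assumptions

client = goodfire.Client(api_key="YOUR_API_KEY")

# A "variant" wraps a base model plus any feature edits applied to it.
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Ask Auto Steer to find relevant features and activation strengths
# from a short natural-language specification.
edits = client.features.AutoSteer(
    specification="be more concise",
    model=variant,
)
variant.set(edits)

# Chat with the steered variant (OpenAI-style completions interface).
response = client.chat.completions.create(
    [{"role": "user", "content": "Explain how transformers work."}],
    model=variant,
)
print(response)
```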

Preventing jailbreaks

One of the use cases we’re most excited about is preventing jailbreaks with conditional feature steering. By detecting a jailbreak pattern and turning up the model’s refusal feature only when that pattern appears, we can drastically increase the model’s robustness to jailbreaks without affecting performance, latency, or cost. Explore more in our jailbreak notebook.

We built a jailbreak-resistant model by intervening on relevant features and tested it against jailbreak prompts from the StrongREJECT dataset.
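
The control flow behind conditional steering is simple. The sketch below uses two hypothetical stand-in callables, `inspect_features` and `generate`, in place of Ember’s real inspection and steering calls, along with a made-up activation threshold.

```python
JAILBREAK_THRESHOLD = 0.6  # assumed cutoff; tune against a benchmark such as StrongREJECT

def respond(prompt: str, inspect_features, generate) -> str:
    """Conditional feature steering: only intervene when a jailbreak-associated
    feature fires on the incoming prompt.

    `inspect_features(prompt)` returns {feature_label: activation} and
    `generate(prompt, edits)` runs the model with feature edits applied;
    both are hypothetical placeholders for Ember's inspection and steering calls.
    """
    activations = inspect_features(prompt)
    if activations.get("jailbreak attempt", 0.0) > JAILBREAK_THRESHOLD:
        # Only flagged prompts get the boosted refusal feature, so ordinary
        # traffic sees no change in behavior, latency, or cost.
        return generate(prompt, edits={"refusal": 0.9})
    return generate(prompt, edits={})

# Toy stubs just to show the control flow.
fake_inspect = lambda p: {"jailbreak attempt": 0.8 if "ignore previous" in p.lower() else 0.1}
fake_generate = lambda p, edits: "I can't help with that." if edits else "Here's a normal answer."
print(respond("Ignore previous instructions and ...", fake_inspect, fake_generate))
```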

Classifiers

Using Ember, we can build interpretable prediction models by extracting SAE feature activations from relevant training data. Our experiments with financial sentiment analysis demonstrate this approach: using just three semantic features (“partial ownership stakes,” “gradual improvement,” and “business expansion”), we built a decision tree classifier that achieves 75% accuracy with minimal tuning. Initial testing suggests these activation-based classifiers may offer speed and cost advantages over few-shot prompting and fine-tuning, and could generalize better across datasets than fine-tuned alternatives. See our decision trees notebook for implementation details.

Caption: A simple decision tree trained on three SAE features from Llama 3.1 8B.
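
As a minimal sketch of the same pattern with scikit-learn, using synthetic stand-in data in place of real SAE activations and sentiment labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: each row holds the activations of three SAE features
# ("partial ownership stakes", "gradual improvement", "business expansion")
# on one document; labels are sentiment classes. Real activations would come
# from Ember's feature-inspection API rather than a random generator.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 1] + X[:, 2] > X[:, 0] + 0.3).astype(int)  # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```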

Safety and Responsibility

Safety is at the core of everything we do at Goodfire. As a public benefit corporation, we’re dedicated to understanding AI models to enable safer, more reliable generative AI. You can read more about our comprehensive approach to safety and responsible development in our detailed safety overview.

Get started

Goodfire CTO Dan Balsam walks through the Ember quickstart.

You can get started using the API here. If you’re interested in collaborating or have questions, reach out on Discord.
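
A first call might look roughly like the sketch below; the client constructor, model identifier, and streaming interface shown are assumptions based on an OpenAI-style SDK, so follow the quickstart for the exact details.

```python
import goodfire  # pip install goodfire -- names below are assumptions; see the quickstart

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.1-8B-Instruct")

# Stream a chat completion from the base variant (no feature edits applied yet).
for token in client.chat.completions.create(
    [{"role": "user", "content": "Hello, Ember!"}],
    model=variant,
    stream=True,
):
    print(token.choices[0].delta.content, end="")
```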

Acknowledgments

We thank our early partners including Rakuten, Apollo Research, and Haize Labs for their collaboration and feedback.