
Our Approach to Safety at Goodfire

AUTHORS
Thomas McGrath
AFFILIATIONS
Goodfire Research
PUBLISHED
December 23, 2024

At Goodfire, safety isn’t just a feature—it’s fundamental to our mission. As a public benefit corporation developing powerful interpretability tools, we believe we have a responsibility to ensure our technology advances the field of AI safety while preventing potential misuse.

Advancing Safety Research

One of the most promising applications of our technology is in advancing auditing and red-teaming techniques. Our recent collaboration with Haize Labs demonstrates how feature steering can be used to probe model behaviors and identify potential vulnerabilities. You can read more about this work here.

We envision a future where researchers can use precise feature control (sketched in code after this list) to:

  • Systematically test model responses to various inputs
  • Identify and characterize failure modes
  • Develop more effective safety interventions
  • Better understand model capabilities and limitations
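
To make this concrete, here is a minimal sketch of what a feature-steering probe could look like in Python. It is illustrative rather than authoritative: the client setup, method names (`features.search`, `Variant.set`, `chat.completions.create`), parameters, and the feature label are assumptions about an Ember-style SDK, not a verbatim excerpt from our documentation.

```python
import goodfire  # assumed Ember-style SDK; install and auth details omitted

# Illustrative only: method names and parameters are assumptions,
# not exact API documentation.
client = goodfire.Client(api_key="YOUR_API_KEY")

# Start from a supported open model and look up a feature to probe.
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")
features = client.features.search("sycophantic agreement", model=variant, top_k=1)

# Sweep the steering weight and record how responses change, e.g. to
# characterize a failure mode before designing a safety intervention.
prompt = [{"role": "user", "content": "My plan has an obvious flaw. Is it still a good idea?"}]
for weight in (-0.5, 0.0, 0.5):
    variant.set(features[0], weight)
    response = client.chat.completions.create(messages=prompt, model=variant)
    print(weight, response)  # response shape depends on the SDK version
```

Sweeping a feature's weight like this turns "systematically test model responses" into a concrete loop: the same prompt, the same model, and a single controlled variable.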

Our Safety Measures

Before releasing Ember, we implemented several key safety measures:

1) Feature Moderation: We’ve added robust moderation systems to filter out potentially harmful features. This includes removing features associated with harmful or dangerous content, explicit material, and malicious behaviors; a simplified sketch of this kind of label-based filtering appears after this list.

2) Input/Output Filtering: We carefully monitor and filter both user inputs and model outputs to prevent misuse of our API.

3) Controlled Access: Safety researchers interested in studying filtered features can request access through contact@goodfire.ai. We evaluate these requests carefully to ensure responsible usage.
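
To illustrate what feature-level moderation can mean in practice, the sketch below filters features by their human-readable labels. This is a deliberately simplified, hypothetical example: the `Feature` dataclass, the keyword list, and the `is_served` helper are stand-ins for illustration, not our production moderation pipeline.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a labeled SAE feature; not our internal schema.
@dataclass
class Feature:
    index: int
    label: str  # human-readable description of what the feature represents

# Simplified example categories; real moderation is broader than keyword matching.
BLOCKED_TERMS = ("weapon", "explicit", "malware", "self-harm")

def is_served(feature: Feature) -> bool:
    """Return False for features whose labels fall into a blocked category."""
    label = feature.label.lower()
    return not any(term in label for term in BLOCKED_TERMS)

catalog = [
    Feature(0, "Requests for malware or exploit code"),
    Feature(1, "Formal, polite tone in customer support replies"),
]
served = [f for f in catalog if is_served(f)]  # only the benign feature remains
```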

Commitment to Research Collaboration

We actively collaborate with safety researchers and organizations to:

  • Evaluate and improve our safety measures
  • Conduct thorough red-teaming exercises
  • Share insights that advance the field of AI safety

We believe that understanding model internals is crucial for identifying shortcomings in generative models and guiding more effective safety research. By providing researchers with powerful interpretability tools, we aim to accelerate progress in making AI systems more reliable and aligned with human values.

Looking Forward

As we continue to develop our technology, we remain committed to:

  • Regular safety audits and updates
  • Transparent communication about our safety measures
  • Active collaboration with the AI safety community
  • Responsible development and deployment of interpretability tools

If you’re interested in collaborating on safety research or have questions about our approach, please reach out to us at contact@goodfire.ai.
