Feature Steering for Reliable and Expressive AI Engineering
AI engineers often ask us how feature steering differs from prompting or fine-tuning. In this post, we’ll explore why directly modifying the internals of AI models by programming with features can be significantly more powerful than relying on black box methods alone.
Note: This post assumes familiarity with sparse autoencoders, feature extraction, and steering with pinned features. If you need a primer, check out our previous blog post on these topics.
The problem with black box AI engineering
Today’s AI models, once trained, are essentially sealed black boxes. They cannot be edited or debugged in any meaningful way. As a result, developers are forced to rely on carefully tuning their prompt to try to elicit the behavior they want or painstakingly curating more training data to fine-tune the model into another black box model. This often falls short.
Building systems that cannot be edited or debugged is a terrible way to build software. Software is never reliable out of the gate. Good software is living and dynamic, with rapid customer iteration and building robustness into the system by fixing the inevitable bugs that come up.
At Goodfire, we’re putting reliable controls back in the hands of AI developers. Instead of endlessly tweaking prompts or guessing at training data, we give engineers a rich, expressive programming language of neurons that leverages the model’s own intelligence. The first step in this journey is steering with features.
Feature steering + prompting
Feature steering enhances rather than replaces prompting but is useful when prompting falls short, which it often does. Any AI developer that has spent time engineering with generative AI has spent hours tweaking their prompts, only to quickly hit diminishing returns and introduce regressions with every modification.
When steering with features, an AI developer will still write a carefully worded system prompt, but feature steering comes when prompt engineering falls short, helping achieve behaviors that are currently out of reach. One way we think about this is that when writing a prompt, a developer is attempting to activate a set of features buried deep within a model to get the desired behavior. With our SDK, they can instead reach into the model and modify that set of features directly.
Feature steering vs. fine-tuning
Feature steering works well with fine-tuned models but also often makes fine-tuning unnecessary. Fine-tuning is effectively training a new model. It requires skill, expertise, time, and money to curate a dataset, train the model, and iterate to make sure the new model learns the right things. In this process, the model often learns spurious correlations from the introduced data. Debugging this dataset is extremely difficult.
As a result, we’ve noticed a trend from our customer conversations - teams are prioritizing rapid iteration with prompts and quick migration to newer models versus sinking resources into fine-tuning models.
With feature steering, though, modifying a feature is as easy as testing a new prompt, so this trade-off doesn’t need to exist. It’s effectively zero data, no-code fine-tuning.
One case where a developer definitely should fine-tune instead of steer features is if they want to introduce new knowledge into the model. Feature steering leverages the model’s internal intelligence - it doesn’t introduce external information.
Feature steering use cases
Since releasing our research preview, we’ve spent our time working with customers who reached out with problems that couldn’t be addressed by black box techniques alone. We’ve condensed some of these learnings here.
1. Feature space is continuous while prompting is discrete
Traditional prompting often forces binary choices. For example, if a developer were to try to make a model more concise, they would write a system prompt for the model to be more concise.
Feature steering allows for nuanced control at varied strengths. The below chart shows the mean response length when steering for conciseness.
2. Efficient persona management
Imagine managing different model styles for 10 customer segments, each that need responses customized with varying levels of conciseness and technical detail. Instead of maintaining 10 separate prompts, feature steering unlocks the ability to simply adjust features dynamically for each user depending on their preferences, reducing complexity.
3. Novel user interactions
Feature steering opens up powerful new possibilities for user interaction. With language models, users can customize their assistant responses and adjust sliders for communication preferences. In image generation, users can increase the number of balloons by 20% or remove dogs from a portion of an image. In music generation, users can reduce the number of trumpets in a song and increase the classical music feature by 30%. None of these can be achieved with prompting alone.
4. Deep customization
With feature steering, an AI engineer can deeply customize models that would have taken months of fine-tuning in minutes. This can mean creating characters, curating a unique company brand voice for a customer service agent, or having the model engage in ways that they previous weren’t willing to.
Our technology also supports conditionals, activating these behaviors in situations that make sense. For instance, an engineer can build a customer service agent offers bigger discounts when the model detects that the customer is angry.
In our research preview, we showcased “Conscious Llama”, which uses multiple features to create a steered model that engages deeply in questions of consciousness - something the base model does not do. A screenshot of an example of its response is below.
Neuron programming is the future
We don’t believe that black box techniques alone will allow developers to engineer safe and reliable AI systems. We are on a mission to understand AI systems so that we can create a new, expressive language of neuron programming, opening up the internals of the model such that AI engineers can bring engineering back into AI.
If you’re interested in trying this technology, you can request access here to our closed beta, and if you’re obsessed with interpretability, we’re hiring.