We have trained sparse autoencoders (SAEs) on Llama 3.3 70B and released the interpreted model for general access via an API. To our knowledge, this is the most capable openly available model with interpretability tooling. We think that making interpretability tools easily available on a powerful model will enable both new research and new products.
This post explores the feature space of Llama 3.3 70B at an intermediate layer - you can browse an interactive map of features, use those features in the API, and see demos of the steering effects of some of our favorites.
We have also introduced a range of new features that make SAE-based steering much easier to use and more reliable. You can learn how to use them in our API docs and experiment with them in our playground. We’ll be releasing a research post covering our improvements in steering methodology in the new year.
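To give a sense of what access looks like, here is a minimal sketch of a chat call against the interpreted model through a Python client. The client construction, model identifier, and response shape shown here are illustrative; the API docs have the canonical interface.

```python
# Minimal sketch: chat with the interpreted Llama 3.3 70B model via the API.
# Client name, model identifier, and response fields are illustrative - see
# the API docs for the exact interface.
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me about the Andromeda Galaxy."}],
    model="meta-llama/Llama-3.3-70B-Instruct",
)
print(response)  # inspect the returned completion; field names may vary
```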
We used DataMapPlot to create an interactive UMAP visualization of the SAE's features.
This visualization is not available on mobile devices.
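If you would like to build a similar map for your own SAE, the rough recipe is to project the decoder directions to 2D with UMAP and hand the coordinates and labels to DataMapPlot. A minimal sketch, assuming you already have a matrix of decoder directions and a label per feature (the file names below are placeholders):

```python
# Sketch: project SAE decoder directions to 2D with UMAP and render an
# interactive map with DataMapPlot. File names below are placeholders.
import numpy as np
import umap
import datamapplot

decoder_directions = np.load("sae_decoder_directions.npy")         # (n_features, d_model)
feature_labels = np.load("feature_labels.npy", allow_pickle=True)   # one label per feature

# Cosine distance is a natural choice for direction vectors.
embedding = umap.UMAP(metric="cosine", n_neighbors=15, min_dist=0.1).fit_transform(
    decoder_directions
)

plot = datamapplot.create_interactive_plot(
    embedding,
    feature_labels,
    hover_text=feature_labels,
)
plot.save("llama_feature_map.html")
```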
Interestingly, many features related to special formatting tokens or repetitive elements of chat data (such as the knowledge cutoff date) appear as isolated points or small clusters away from the central component. There are two potential explanations for this. First, special tokens (e.g., beginning of text) often have a very large magnitude, so we might expect the SAE features for them to also have a large magnitude and thus sit far from other points. Second, because many of these features recur so frequently (every chat contains the knowledge cutoff, for example), the SAE and base model may have memorized them, which has consequences for their representation.
Here we show some examples of feature clusters we found interesting - this is by no means exhaustive and there is a lot left to discover in this latent space.
Our SAE has learned a surprisingly broad range of concepts given that it was trained purely on internet chat data. Latents that draw precise distinctions between types of behaviour appear regularly, including the cluster shown above, although we have not yet verified that they in fact have distinct effects. In addition to the biomedical knowledge cluster, we have also seen multiple physics and programming clusters. Interestingly, multiple forms of name-related abstractions, such as placeholders, citations and name prefixes, also cluster together. Finally, we noticed a large and detailed cluster of phonetic and character-related features. It would be interesting to learn whether these features exhibit absorption.
You can also steer the model using SAE latents. Our AutoSteer functionality automatically finds SAE latents to elicit a desired behaviour and sets their weights (you can read more about this in our API docs). Here we showcase a simpler setting, which is what you get when you call variant.set(feature_id, z): we simply increase the selected feature's value.
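Concretely, a steering call might look like the sketch below. The feature-search call and variant construction are illustrative; the essential step is variant.set(feature_id, z).

```python
# Sketch: pin a single SAE feature to a chosen value and chat with the
# steered variant. Names other than variant.set are illustrative - see the
# API docs for the exact interface.
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Find a candidate feature to steer with (e.g. pirate speech).
features = client.features.search("talk like a pirate", model=variant, top_k=1)
variant.set(features[0], 0.6)  # steering strength

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me about the Andromeda Galaxy."}],
    model=variant,
)
```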
You can see an example of simple feature steering below, where we’ve asked Llama 3.3 70B to tell us about the Andromeda Galaxy at various values of steering strength. The x axis is steering strength, and the y axis is a language model’s assessment of whether the response was both coherent and achieved the desired behaviour (in this case, talking like a pirate). Because we use Claude as the evaluator and rate on a 0-100 scale, we call the unit of measurement the centiClaude.
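The scoring itself is easy to reproduce in spirit: generate a response at each steering strength and ask a judge model to rate coherence plus the target behaviour on a 0-100 scale. The sketch below uses the Anthropic client with a placeholder judge prompt and model name; it is not our production evaluation pipeline.

```python
# Sketch of a "centiClaude" score: a judge model rates a steered response
# from 0-100 for being both coherent and on-behaviour. Prompt wording and
# judge model name are placeholders, not our production pipeline.
import anthropic

judge = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def centiclaude_score(response_text: str, behaviour: str = "talking like a pirate") -> int:
    prompt = (
        f"Rate the following response from 0 to 100 for being both coherent "
        f"and {behaviour}. Reply with a single integer only.\n\n{response_text}"
    )
    reply = judge.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder judge model
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(reply.content[0].text.strip())
```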
If you’ve spent much time steering language models then these results won’t be too surprising, but there are a few interesting things to notice. The first is that although the model-written evaluation increases dramatically at around 0.5 (which is where the style fully shifts), at a strength of 0.4 the steered model actually begins to exhibit a few elements of pirate speech. The model then shifts style fully but factuality slowly begins to degrade. For instance, at strength 0.6 most facts are correct, apart from the galaxy’s size, whereas at strength 1 many facts are incorrect (and are rescaled to nautical units such as knots and far more mundane magnitudes). The mechanism by which factual recall is damaged by latent steering is unknown, but would be very interesting to understand.
This visualisation also demonstrates a pitfall of using language models for steering evaluations: at steering strength 1.4 all factual claims are incorrect, yet the model continues to receive a high score. Presumably by asking the evaluator model to check factuality we could resolve this specific issue (or isolate it to a separate metric), but there may be other issues yet to find - at this stage interpretability (and steering) still needs a strong qualitative foundation.
Our methodology was broadly consistent with our approach to training SAEs on smaller models, which you can read about in our earlier research post.
We have largely adopted an LLM-based evaluation pipeline in order to scale steering evaluation. In our experience, narrower and sparser SAEs than are conventionally trained are better for steering, but this is likely to be in tension with classifier performance.
As discussed during the release of our research preview, we moderate out harmful features. We removed approximately 30% and 3.5% of features from our Llama 3.1 8B and Llama 3.3 70B SAEs respectively, although those numbers also include the dead features we removed. There is a spectrum of harm that these systems can cause; at the scale of these smaller models, most of the features we remove relate to graphic descriptions of criminal content. These features unfortunately reflect subsets of the LMSys chat data.
We don’t believe that open-sourcing SAEs on current leading open-source models meaningfully affects risks such as biorisk or persuasion - the base model does not seem sufficiently capable for that. We are committed, however, to tracking these risks and developing safety evaluations as we continue to scale our interpretability efforts. These evals will form the foundation of a Responsible Scaling Plan.
We also believe there are research use cases where having access to the unmoderated SAEs is valuable. If you’re a safety researcher, you can request access by emailing contact@goodfire.ai.
Feature steering and using features as classifiers appear to be in tension: feature steering benefits from narrower, sparser SAEs around the middle of the model, whereas classification tasks are likely to benefit from broader SAEs early or late in the model (depending on the classification task). Crosscoders, which capture features at all layers, could plausibly help resolve this tension.
As with all current interpreter models, our SAEs capture only a fraction of model computation - both because of the comparatively limited dataset on which they are trained and because SAEs in general have yet to achieve comprehensive reconstruction of model activations. Whether this is due to a fundamental limitation of the architecture (e.g. truly nonlinear features) or some more mundane cause is not yet known, although early evidence points towards incomplete reconstruction not being a simple matter of scale.
McGrath, et al., "Mapping the latent space of Llama 3.3 70B", Goodfire Research, 2024.