
PREreview of InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

Published
DOI: 10.5281/zenodo.14728694
License: CC BY 4.0

Summary

This study applies recently developed techniques for the mechanistic interpretability of large language models to protein language models, specifically ESM-2. By training a sparse autoencoder (SAE) to reconstruct embeddings from the trunk of ESM-2-8M, the authors extract features they argue are biologically interpretable, correlating with sequence, structural, and functional properties of the input proteins. The authors also use a large language model (Claude 3.5 Sonnet (new)) to provide additional descriptions and annotations, with the goal of expanding the descriptive power of the SAE features beyond that offered by the manually annotated Swiss-Prot concepts. By combining automated interpretation through large language models with biological database annotations, they establish a scalable framework for characterizing SAE features.
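For readers less familiar with SAEs in LLM interpretability, the core operation is an overcomplete autoencoder with a sparsity penalty, trained on per-residue pLM embeddings. The sketch below is our own minimal illustration of that setup, not the authors' implementation: the layer index, dictionary expansion factor, and L1 coefficient are illustrative assumptions; only the ESM-2-8M checkpoint is taken from the paper.

```python
# Minimal sketch of training an SAE on per-residue ESM-2-8M embeddings.
# Not the authors' code; layer 4, the 8x expansion, and the L1 weight are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
plm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D").to(device).eval()

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=320, expansion=8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse per-residue feature activations
        return self.decoder(feats), feats     # reconstruction, features

sae = SparseAutoencoder().to(device)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]  # toy batch
batch = tokenizer(sequences, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    hidden = plm(**batch, output_hidden_states=True).hidden_states[4]  # layer-4 residue embeddings

recon, feats = sae(hidden)
loss = ((recon - hidden) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
opt.step()
```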

The authors succeed in extracting useful features from their trained SAE, and point to interesting examples indicating that ESM-2-8M (and perhaps protein language models more broadly) captures fine-grained information about protein families and function-relevant properties. Their validation of relevant features through comparison with randomized embeddings and a neuron basis demonstrates that the approach is a valuable complement to other recent approaches, such as the categorical Jacobian method published in Zhang et al. (2024). The feature activations also enable their method to annotate sequences that have not yet been manually annotated (though these could likely be annotated by homology with simpler tools such as BLAST) when a feature corresponds strongly to a functional label. Further, the application of the learned features is demonstrated through sequence steering, where the SAE feature for periodic glycine repeats is artificially activated and masked-residue probabilities are sampled from the decoded ESM embedding.
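To make the steering step concrete for readers: at the SAE level it amounts to amplifying one feature's activation before decoding and substituting the result back into the pLM's forward pass. The fragment below is a schematic sketch of that idea under our own assumptions (a trained `sae` with `encoder`/`decoder` linear layers as in the sketch above, and a hypothetical feature index); it is not the authors' steering procedure.

```python
# Schematic sketch of SAE-level steering; not the authors' implementation.
# `sae` is assumed to be a trained sparse autoencoder with `encoder`/`decoder`
# linear layers (as sketched above); feature index and scale are hypothetical.
import torch

def steer(hidden, sae, feature_idx, scale=5.0):
    """Amplify one SAE feature and decode back into the pLM embedding space."""
    feats = torch.relu(sae.encoder(hidden))          # per-residue SAE feature activations
    feats[..., feature_idx] = feats[..., feature_idx].clamp(min=1.0) * scale  # boost target feature
    return sae.decoder(feats)                        # steered embedding

# The steered embedding would then replace the original layer activations
# (e.g., via a forward hook) before the remaining ESM-2 layers and LM head
# produce the masked-token probabilities described in the paper.
```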

Overall, the study opens a new avenue of research for interpreting protein language models, with implications for bioinformatics, protein engineering, and protein design. The authors also provide a useful platform for the research community: an interactive website that exposes data generated in the interpretability experiments beyond what is shown in the paper.

Major points

  • We think Figure 1A could be more intuitive to researchers unfamiliar with SAEs in LLM interpretability if the information flow from high-activation features were traced through the blocks shown. For example, the example sequence with cysteine disulfide bonds could be shown as the input to the pLM, which would have many colored blocks. This would then result in high activation of a single feature in the SAE dictionary on the specific amino acids participating in the disulfide bond, which is then decoded back to the original ESM embedding.

  • In section 3.6, the authors mention: “We found one example that has high activation of 939 but no Swiss-Prot label, while every other highly activated protein does have the annotation.” We have a few questions about this section that we think are important for a more informative analysis.

    • Why was this protein unannotated? The common approach here would be to use BLAST to find similar proteins and compare their annotations; does that approach not work for the identified unlabeled protein? Is the protein annotated in TrEMBL? Annotation via interpretable features would be especially valuable if sequence-similarity tools do not suggest an annotation and TrEMBL also lacks an automatic annotation for this protein.

    • Also mentioned: “The LLM-generated description of f/9046 emphasizes that it identifies UDP-dependent glycosyltransferases, however the majority of activated proteins only have text-based evidence of this, and no labeled annotations for binding sites.” Do these activated proteins have high sequence similarity? Where is the text-based evidence for these annotations?

  • Aside from protein engineering and structure prediction, protein language models have also been used to predict the effects of mutations on protein function (variant effect prediction). This work seems promising for interpreting why pLMs sometimes fail to achieve high correlations with mutational scanning and variant-effect experimental data. We think discussion of the points below would be valuable in this context, and would also provide evidence for or against generalization beyond basic coevolutionary patterns.

    • How robust are SAE activations to variations in the input sequence? (One possible check is sketched after this list.)

    • Do feature activations that correspond to certain functional features (e.g. binding sites) change with non-functional homologs?

  • When figures are cited in the text, they often do not add context to the discussion, and the individual figure panels are not referenced. Using the individual panels to support the analysis would help clarify both the text and the figures.

    • For example, it would be clearer to refer directly to the subpanels in Figure 6 in the relevant discussion in section 3.6.

  • We would find it helpful if the authors discussed how a steering experiment might work for a more sophisticated example, perhaps sampling protein kinase sequences given the features corresponding to that protein family. The periodic-glycine example is straightforwardly interpreted in primary-sequence terms; how might such biases play out in more complex tertiary contexts with less absolute amino acid preferences?

  • Recent work (https://www.biorxiv.org/content/10.1101/2024.10.03.616542v1) suggests that ESM-2 models improve in performance up to 150M parameters but perform worse on downstream tasks at larger sizes. Here, the SAE is trained only on ESM-2-8M, the smallest model. To address how these methods scale, it would be helpful to discuss the constraints involved in training SAEs on larger pLMs. Why not train on larger versions of ESM?

  • A very exciting aspect of this paper is the possibility of mapping differences between protein deep learning models onto an interpretable, biological context. We invite the authors to speculate on a roadmap for the future of this field. For example, could this approach be applied to discern differences in feature activation patterns when a pLM is fine-tuned?

  • We could not find where the displayed structures were obtained. The Methods (5.4.2) suggest they may come from the AlphaFold Protein Structure Database, but this is not confirmed in the text. It would be useful to give the AFDB or PDB accession code along with the UniProt ID in each figure that shows structures.
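Regarding the robustness question above, the kind of check we have in mind could look like the sketch below. This is our own illustration, not an analysis from the paper: the layer index, the toy sequences, and the untrained stand-in encoder are all assumptions (a trained SAE, such as the one sketched in the Summary, would be used in practice).

```python
# Sketch of a robustness check: compare per-residue SAE feature activations
# between a wild-type sequence and a point mutant. Our own illustration, not
# from the paper; layer 4, the toy sequences, and the untrained stand-in
# encoder below are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
plm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

# Stand-in for the encoder of a *trained* SAE (320-d embeddings, 8x expansion).
encoder = nn.Linear(320, 320 * 8)

def sae_features(seq, layer=4):
    """Per-residue SAE feature activations for a single sequence."""
    batch = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**batch, output_hidden_states=True).hidden_states[layer]
        return torch.relu(encoder(hidden)).squeeze(0)  # (tokens, n_features)

wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy sequence
mutant = wild_type[:10] + "A" + wild_type[11:]     # hypothetical point mutant

delta = (sae_features(mutant) - sae_features(wild_type)).abs()
# Features whose activations shift most under the mutation; whether shifts are
# localized to the mutated position or spread across the sequence would speak
# to the robustness and homolog questions raised above.
print(delta.mean(dim=0).topk(5))
```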

Minor points

  • Why focus on layer 4 alone? It appears that more features pass the F1 threshold in layer 5; was there explicit reasoning behind choosing layer 4 for the in-depth analysis?

  • What constraints arise from applying SAEs to a masked language model, compared with applying them to an autoregressive model?

  • The y-axis of the first plot in Figure 2A appears to be cut off, and the plot could be explained more thoroughly. As we understand it, the x-axis is the percentage of input proteins that activate the feature, and the y-axis is the average percentage of a protein that is activated when that protein activates the feature. With the y-axis cut off, however, it is unclear from the text and caption whether this reading is correct.

Competing interests

The authors declare that they have no competing interests.
