key: cord-0279045-umqsz6uf
authors: Hsu, Chloe; Verkuil, Robert; Liu, Jason; Lin, Zeming; Hie, Brian; Sercu, Tom; Lerer, Adam; Rives, Alexander
title: Learning inverse folding from millions of predicted structures
date: 2022-04-10
journal: bioRxiv
DOI: 10.1101/2022.04.10.487779
sha: 3c6be7e325bb4747c5f67d1890fc168f3924388c
doc_id: 279045
cord_uid: umqsz6uf

We consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.

Designing novel amino acid sequences that encode proteins with desired properties, known as de novo protein design, is a central challenge in bioengineering (Huang et al., 2016) . The most well-established approaches to this problem use an energy function which directly models the physical basis of a protein's folded state (Alford et al., 2017) .

Recently a new class of deep learning based approaches has been proposed, using generative models to predict sequences for structures (Ingraham et al., 2019; Strokach et al., 2020; Anand-Achim et al., 2021; Jing et al., 2021b) , generate backbone structures (Anand & Huang, 2018; Eguchi et al., 2020) , jointly generate structures and sequences Wang et al., 2021) , or model sequences directly (Rives et al., 2021; Madani et al., 2021; Shin et al., 2021; Gligorijevic et al., 2021; Bryant et al., 2021; Dallago et al., 2021) . The potential to learn the rules of protein design directly from data makes deep generative models a promising alternative to current physics-based energy functions.

However, the relatively small number of experimentally determined protein structures places a limit on deep learning approaches. Experimentally determined structures cover To evaluate the potential for training protein design models with predicted structures, we predict structures for 12 million UniRef50 protein sequences using AlphaFold2 (Jumper et al., 2021 ). An autoregressive inverse folding model is trained to perform fixed-backbone protein sequence design. Train and test sets are partitioned at the topology level, so that the model is evaluated on structurally held-out backbones. We compare transformer models having invariant geometric input processing layers, with fully geometric models used in prior work. Span masking and noise is applied to the input coordinates. less than 0.1% of the known space of protein sequences. While the UniRef sequence database (Suzek et al., 2015) has over 50 million clusters at 50% sequence identity; as of January 2022, the Protein Data Bank (PDB) (Berman et al., 2000) contains structures for fewer than 53,000 unique sequences clustered at the same level of identity.

Here we explore whether predicted structures can be used to overcome the limitation of experimental data. With progress in protein structure prediction (Jumper et al., 2021; Baek et al., 2021) , it is now possible to consider learning from predicted structures at scale. Predicting structures for the sequences in large databases can expand the structural coverage of protein sequences by orders of magnitude. To train an inverse model for protein design, we predict structures for 12 million sequences in UniRef50 using AlphaFold2.

We focus on the problem of predicting sequences from backbone structures, known as inverse folding or fixed backbone design. We approach inverse folding as a sequenceto-sequence problem (Ingraham et al., 2019) , using an autoregressive encoder-decoder architecture, where the model is tasked with recovering the native sequence of a protein from the coordinates of its backbone atoms.

We make use of the large number of sequences with unknown structures by adding them as additional training data, conditioning the model on predicted structures when the experimental structures are unknown ( Figure 1 ). This approach parallels back-translation (Sennrich et al., 2015; Edunov et al., 2018) in machine translation, where predicted translations in one direction are used to improve a model in the opposite direction. Back-translation has been found to effectively learn from extra target data (i.e. sequences) even when the predicted inputs (i.e. structures) are of low quality.

We find that existing approaches have been limited by data. While current state-of-the-art inverse folding models degrade when training is augmented with predicted structures, much larger models and different model architectures can effectively learn from the additional data, leading to an im-provement of nearly 10 percentage points in the recovery of sequences for structurally held out native backbones.

We evaluate models on fixed backbone design benchmarks from prior work, and assess the generalization capabilities across a series of tasks including design of complexes and binding sites, partially masked backbones, and multiple conformations. We further consider the use of the models for zero-shot prediction of mutational effects on protein function and stability, complex stability, and binding affinity.

The goal of inverse folding is to design sequences that fold to a desired structure. In this work, we focus on the backbone structure without considering side chains. While each of the 20 amino acid has a specific side chain, they share a common set of atoms that make up the amino acid backbone. Among the backbone atoms, we choose the N, Cα (alpha Carbon), and C atom coordinates to represent the backbone.

Using the structures of naturally existing proteins we can train a model for this task by supervising it to predict the protein's native sequence from the coordinates of its backbone atoms in three-dimensional space. Formally we represent this problem as one of learning the conditional distribution p(Y |X), where for a protein of length n, given a sequence X of spatial coordinates (x 1 , . . . , x i , . . . , x 3n ) for each of the backbone atoms N, Cα, C in the structure, the objective is to predict Y the native sequence (y 1 , . . . , y i , . . . , y n ) of amino acids. This density is modeled autoregressively through a sequence-to-sequence encoder-decoder:

We train a model by minimizing the negative log likelihood of the data. We can design sequences by sampling, or by finding sequences that maximize the conditional probability given the desired structure.

Predicted structures We generate 12 million structures for sequences in UniRef50 to explore how predicted structures can improve inverse folding models. To select sequences for structure prediction we first use MSA Transformer (Rao et al., 2021) to predict distograms for MSAs of all UniRef50 sequences. We rank the sequences by distogram LDDT scores (Senior et al., 2020) as a proxy for the quality of the predictions. We take the top 12 million sequences not longer than five hundred amino acids and forward fold them using the AlphaFold2 model with a final Amber (Hornak et al., 2006) relaxation. This results in a predicted dataset approximately 750 times the size of the training set of experimental structures (Appendix A.1).

Training and evaluation data We evaluate models on a structurally held-out subset of CATH (Orengo et al., 1997) . We partition CATH at the topology level with an 80/10/10 split resulting in 16153 structures assigned to the training set, 1457 to the validation set, and 1797 to the test set. Particular care is required to prevent leakage of information in the test set via the predicted structures. We use Gene3D topology classification (Lees et al., 2012) to filter both the sequences used for supervision in training, as well as the MSAs used as inputs for AlphaFold2 predictions (Appendix A.1). We also perform evaluations on a smaller subset of the CATH test set that has been additionally filtered by TM-score using Foldseek (Kim et al., 2021) to exclude any structures with similarity to those in the training set (Appendix B).

We study model architectures using Geometric Vector Perceptron (GVP) layers (Jing et al., 2021b ) that learn rotationequivariant transformations of vector features and rotationinvariant transformations of scalar features.

We present results for three model architectures: (1) GVP-GNN from Jing et al. (2021b) which is currently state-ofthe-art on inverse folding; (2) a GVP-GNN with increased width and depth (GVP-GNN-large); and (3) a hybrid model consisting of a GVP-GNN structural encoder followed by a generic transformer (GVP-Transformer). All models used in evaluations are trained to convergence, with detailed hyperparameters listed in Table A .1.

In inverse folding, the predicted sequence should be independent of the reference frame of the structural coordinates. For any rotation and translation T of the input coordinates, we would like for the model's output to be invariant under these transformations, i.e., p(Y |X) = p(Y |T X). Both the GVP-GNN and GVP-Transformer inverse folding models studied in this work are invariant (Appendix A.3).

We start with the GVP-GNN architecture with 3 encoder layers and 3 decoder layers as described in (Jing et al., 2021b) , with the vector gates described in (Jing et al., 2021a ) (GVP-GNN, 1M parameters). When trained on predicted structures, we find a deeper and wider version of GVP-GNN with 8 encoder layers and 8 decoder layers (GVP-GNN-large, 21M parameters) performs better. Scaling GVP-GNN further did not improve model performance in preliminary experiments ( Figure 6c ).

We use GVP-GNN encoder layers to extract geometric features, followed by a generic autoregressive encoder-decoder Transformer (Vaswani et al., 2017) . In GVP-GNN, the input features are translation-invariant and each layer is rotation-equivariant. We perform a change of Figure 3 . Example AlphaFold prediction compared with experimental structure for a UniRef50 sequence (UniRef50: P07260; PDB: 1AP8). The experimental structure is shown as pink with transparency. The prediction is coloured by the pLDDT confidence score, with blue in high-confidence regions.

basis on the vector features from GVP-GNN into local reference frames defined for each amino acid to derive rotationinvariant features (Appendix A.3). In ablation studies increasing the number of GVP-GNN encoder layers improves the overall model performance ( Figure C .1), indicating that the geometric reasoning capability in GVP-GNN is complementary to the Transformer layers. Scaling improves performance up to a 142M-parameter GVP-Transformer model with 4 GVP-GNN encoder layers, 8 generic Transformer encoder layers, and 8 generic Transformer decoder layers ( Figure 6c ).

Combining experimental and predicted data During training, in each epoch we mix the training set of experimentally derived structures (∼16K structures) with a 10% random sample of the AlphaFold2-predicted training set (10% of 12M), resulting in a 1:80 experimental:predicted data ratio. For larger models, a high ratio of predicted data during training helps prevent overfitting on the smaller experimental train set ( Figure 6b ).

The loss is equally weighted for each amino acid in target sequences. We mask out predicted input coordinates with AlphaFold2 confidence score (pLDDT) below 90, around 25% of the predicted coordinates. See Figure 3 for visualization of the pLDDT confidence score. Most often these low confidence regions are at the start and the end of sequences and may correspond to disordered regions. We prepend one token at the beginning of each sequence to indicate whether the structure is experimental or predicted. For each residue we provide the pLDDT confidence score from AlphaFold2 as a feature encoded by Gaussian radial basis functions.

Adding Gaussian noise at the scale of 0.1 angstroms to the predicted structures during training slightly improves performance ( Edunov et al. (2018) , who observe that backtranslation with sampled or noisy synthetic data provides a stronger training signal than maximum a posteriori (MAP) predictions.

Span masking To enable sequence design for partially masked backbones, we introduce backbone masking during training. We experiment with both independent random masking and span masking. In natural language processing, span masking improves performance over random masking (Joshi et al., 2020) . We randomly select continuous spans of up to 30 amino acids until 15% of input backbone coordinates are masked. The communication patterns in the geometric layers are adapted to account for masking with details in Appendix A.2. Span masking improves the performance of GVP-Transformer both on unmasked backbones (Table C .1) and on masked regions ( Figure 4 ).

We evaluate models across a variety of benchmarks in two overall settings: fixed backbone sequence design and zeroshot prediction of mutation effects. For fixed backbone design, we start with evaluation in the standard setting (Ingraham et al., 2019; Jing et al., 2021b) of sequence design given all backbone coordinates. Then, we make the sequence design task more challenging along three dimensions: (1) introducing masking on coordinates;

(2) generalization to protein complexes; and (3) conditioning on multiple conformations. Additionally, we show that inverse folding models are effective zero-shot predictors for protein complex stability, binding affinity, and insertion effects.

We begin with the task of predicting the native protein sequence given its backbone atom (N, Cα, C) coordinates. Perplexity and sequence recovery on held-out native sequences are two commonly used metrics for this task. Perplexity measures the inverse likelihood of native sequences in the predicted sequence distribution (low perplexity for high likelihood). Sequence recovery (accuracy) measures how often sampled sequences match the native sequence at each position. To maximize sequence recovery, the predicted sequences are sampled with low temperature T = 1e−6 from the model. Table 1 compares models using these metrics on the structurally held-out backbones.

We observe that current state-of-the-art inverse folding models are limited by the CATH training set. Scaling the current 1M parameter model (GVP-GNN) to 21M parameters (GVP-GNN-large) on the CATH dataset results in overfitting with a degradation of sequence recovery from 42.2% to 39.2% (Table 1) . On the other hand, the current model at the 1M parameter scale cannot make use of the predicted structures: training GVP-GNN with predicted structures results in a degradation to 38.6% sequence recovery (Table 1) , with performance worsening with increasing numbers of predicted structures in training ( Figure 6a ).

Larger models benefit from training on the AlphaFold2predicted UniRef50 structures. Training with predicted structures increases sequence recovery from 39.2% to 50.8% for GVP-GNN-large and from 38.3% to 51.6% for GVP-Transformer over training only on the experimentally derived structures. The improvements are also reflected in perplexity. Similar improvements are observed on the test subset filtered by TM-score (Table B .1). The best model trained with UniRef50 predicted stuctures, GVP-Transformer, improves sequence recovery by 9.4 percentage points over the best model, GVP-GNN, trained on CATH alone.

As there are many sequences that can fold to approximately the same structure, even an ideal protein design model will not have 100% native sequence recovery. We observe that the GVP-GNN-large and GVP-Transformer models are wellcalibrated ( Figure C .5). The substitution matrix between native sequences and model-designed sequences resembles the BLOSUM62 substitution matrix (Figure C.4), albeit noticeably sparser for the amino acid Proline.

When we break down performance on core residues and surface residues, as expected, core residues are more constrained and have a high native sequence recovery rate of 72%, while surface residues are not as constrained and have a lower sequence recovery of 39% ( Figure 5 ; top). Generally perplexity increases with the solvent accessible surface area ( Figure 5 ; bottom). Despite the lower sequence recovery on the surface, sampled sequences do tend not to have hydrophobic residues on the surface (Figure C.6).

As an example of inverse folding of a structurally-remote protein, we re-design the receptor binding domain (RBD) sequence of the SARS-CoV-2 spike protein (PDB: 6XRA and 6VXX; illustrated in Figure C. 3) with the two models. The SARS-CoV-2 spike protein has no match to the training data with TM-score above 0.5. Both GVP-GNN and GVP-Transformer achieve high sequence recovery (49.7% and 53.6%) for the native RBD sequence (Table C. 3).

Partially-masked backbones We evaluate the models on partial backbones. While masking during training does not significantly change test performance on unmasked backbones (Table C .1), masking does enable models to non-trivially predict sequences for mask regions. Although GVP-GNN-large has low perplexity on short-length masks, its performance quickly degrades to the perplexity of the background distribution on masks longer than 5 amino acids ( Figure 4 ). By contrast, the GVP-Transformer model main- Each box shows the distribution of perplexities for the core or surface residues across different sequences. Bottom: Perplexity and sequence recovery as a function of solvent accessible surface area. Increased sequence recovery for buried residues suggests the model learns dense hydrophobic packing constraints in the core. tains moderate performance even on longer masked regions, with less degradation if trained with span masking instead of independent random masking ( Figure 4 ).

Protein complexes Although the training data only consists of single chains, we find that models generalize to multi-chain protein complexes. We represent complexes by concatenating the chains together with 10 mask tokens between chains, and include all complexes in the test set up to length 1000. For chains that are part of a protein complex, there is a substantial improvement in perplexity of both models when given the full complex coordinates as input, versus only the single chain (Table 2 and Figure C .2), suggesting that both GVP-GNN and GVP-Transformer can make use of inter-chain information from amino acids that are close in 3D structure but far apart in sequence.

Multiple conformations Multi-state design is of interest for engineering enzymes and biosensors (Langan et al., 2019; Quijano-Rubio et al., 2021) . Some proteins exist in multiple distinct folded forms in equilibrium, while other proteins may exhibit distinct conformations when binding to partner molecules. For a backbone X, the inverse folding model predicts a conditional distribution p(Y |X) over possi- 

Chain Complex Natural frequencies 17.93 GVP-GNN 7.80 5.37 GVP-GNN-large+AF2 6.32 3.90 GVP-Transformer+AF2 6.32 3.81 Table 2 . Sequence design performance on complexes in the CATH topology test split when given the backbone coordinates of only a chain ("Chain" column) and when given all backbone coordinates of the complex ("Complex" column). The perplexity is evaluated on the same chain in the complex for both columns.

ble sequences Y for the backbone. To design a protein with two states A and B, we would like find sequences that have high likelihoods in the conditional distributions p(Y |A) and p(Y |B) for each of the two states. We use the geometric average of the two conditional likelihoods as a proxy for the desired distribution p(Y |A, B) conditioned on the sequence being compatible with both states.

We compare single-state and multi-state sequence design performance on 87 test split proteins with multiple conformations in the PDBFlex dataset (Hrabe et al., 2016) . On locally flexible residues, multi-state design results in lower sequence perplexity than single-state design ( Figure 7) . See Appendix C for more details on the PDBFlex data.

We next show that inverse folding models are effective zeroshot predictors of mutational effects across practical design applications, including prediction of complex stability, binding affinity, and insertion effects. To score the effect of a mutation on a particular sequence, we use the ratio between likelihoods of the mutated and wildtype sequences according to the inverse folding model, given the experimentally determined wildtype structure. Exact likelihood evaluations are possible from both GVP-GNN and GVP-Transformer as they are both based on autoregressive decoders. We then compare these likelihood ratio scores to experimentally-determined fitness values measured on the same set of sequences.

De novo mini-proteins Rocklin et al. (2017) performed deep mutational scans across a set of de novo designed miniproteins with 10 different folds measuring the stability in response to point mutations. The likelihoods of inverse folding models have been shown to correlate with experimentally measured stability using this dataset (Ingraham et al., 2019; Jing et al., 2021b) . We evaluate the GVP-Transformer and GVP-GNN-large models on the same mutational scans, and observe improvements in stability predictions from using predicted structures as training data for 8 out of 10 folds in the dataset (Table C. 2). Further details are in Appendix C. Table 3 . Zero-shot performance on binding affinity prediction for the receptor binding domain (RBD) of SARS-CoV-2 Spike, evaluated on ACE2-RBD mutational scan data (Starr et al., 2020) . The zero-shot predictions are based on the sequence log-likelihood for the receptor binding motif (RBM), which is the portion of the RBD in direct contact with ACE2 (Lan et al., 2020) . We evaluate in four settings: 1) Given sequence data alone ("No coords"); 2) Given backbone coordinates for both ACE2 and the RBD but excluding the RBM and without sequence ("No RBM coords"); 3) Given the full backbone for the RBD but no information for ACE2 ("No ACE2 coords"); and 4) Given all coordinates for the RBD and ACE2.

Complex stability We evaluate models on zero-shot prediction of mutational effects on protein complex interfaces, using the Atom3D benchmark (Townshend et al., 2020) which incorporates binding free energy changes in the SKEMPI database (Jankauskaitė et al., 2019) as a binary classification task. We find that sequence log-likelihoods from GVP-GNN are effective zero-shot predictors of stability changes of protein complexes even without predicted structures as training data (Table C .4), performing comparably to the best supervised method which uses transfer learning. While we observe a substantial improvement in perplexity when predicted structures are added to training (Table 2) , this does not further improve complex stability prediction for the single-point mutations in SKEMPI (Table C .4), indicating potential limitations of evaluating models only on single-point mutations.

Binding affinity While the SKEMPI dataset features one mutation entry per protein, we also want to evaluate whether inverse folding models can rank different mutations on the same protein, potentially enabling binding-affinity optimization, which is an important task in therapeutic design. We assess whether inverse folding models can predict mutational effects on binding by leveraging a dataset generated by Starr et al. (2020) in which all single amino acid substitutions to the SARS-CoV-2 receptor binding domain (RBD) were experimentally measured for binding affinity to human ACE2. Given potential applications to interface optimization or design, we focus on mutations within the receptor binding motif (RBM), the portion of the RBD in direct contact with ACE2 (Lan et al., 2020) . When given all RBD and ACE2 coordinates, the best inverse folding model produces RBD-sequence log-likelihoods that have a Spearman correlation of 0.69 with experimental binding affinity measurements (Table 3) . We observe weaker correlations when not providing the model with ACE2 coordinates, indicating that inverse folding models take advantage of structural information in the binding partner. When masking RBM coordinates (69 of 195 residues, a longer span than masked during model training), we no longer observe correlation between RBD log-likelihood and binding affinity, indicating that the model relies on structural information at the interface to identify interface designs that preserve binding. Zero-shot prediction via inverse folding outperforms methods for sequence-based variant effect prediction, which use the likelihood ratio between the mutant and wildtype amino acids at each position to predict the impact of a mutation on binding affinity. These likelihoods are inferred by masked language models, ESM-1b, ESM-1v, and ESM-MSA-1b, as described by Meier et al. (2021) (Table 3) ; additional details are given in Appendix C.

Sequence insertions Using masked coordinate tokens at insertion regions, inverse folding models can also predict insertion effects. On adeno-associated virus (AAV) capsid variants, we show that relative differences in sequence log-likelihoods correlate with the experimentally measured insertion effects from Bryant et al. (2021) . As shown in Table C .5, both GVP-GNN and GVP-Transformer outperform the sequence-only zero-shot prediction baseline ESM-1v (Meier et al., 2021) . When evaluating on subsets of sequences increasingly further away from the wildtype (≥ 2, ≥ 3, and ≥ 8 mutations), the GVP-GNN-large and GVP-Transformer models trained with predicted structures have increasing advantages compared to GVP-GNN trained without predicted structures.

Structure-based protein sequence design Early work on design of protein sequences studied the packing of amino acid side chains to fill the interior space of predetermined backbone structures, either for a fixed backbone conformation (Street & Mayo, 1999; Dahiyat & Mayo, 1997; De-Grado et al., 1991) , or with flexibility in the backbone conformation (Harbury et al., 1998) . Since then, the Rosetta energy function (Alford et al., 2017) has become an established approach for structure-based sequence design. An alternative non-parametric approach involves decomposing the library of known structures into common sequencestructure motifs .

Early machine learning approaches in structure-based protein sequence design used fragment-based and energy-based global features derived from structures (Li et al., 2014; O'Connell et al., 2018) . More recently, convolution-based deep learning methods have also been applied to predict amino acid propensities given the surrounding local structural environments (Anand-Achim et al., 2021; Boomsma & Frellsen, 2017; Shroff et al., 2020; Li et al., 2020; Qi & Zhang, 2020; Zhang et al., 2020; Chen et al., 2019; Wang et al., 2018) . Another recent machine learning approach is to leverage structure prediction networks for sequence design. Norn et al. (2021) carried out Monte Carlo sampling in the sequence space to invert the trRosetta (Yang et al., 2020) structure prediction network for sequence design.

Generative models of proteins The literature on structure-based generative models of protein sequences is the closest to our work. Ingraham et al. (2019) introduced the formulation of fixed-backbone design as a conditional sequence generation problem, using invariant features with graph neural networks, modeling each amino acid as a node in the graph with edges connecting spatially adjacent amino acids. Jing et al. (2021b; a) further improved graph neural networks for this task by developing architectures with translation-and rotation-equivariance to enable geometric reasoning, showing that GVP-GNN achieves higher native sequence recovery rates than Rosetta on TS50, a benchmark set of 50 protein chains. Strokach et al. (2020) trained graph neural networks for conditional generation with the masked language modeling objective, adding homologous sequences as data augmentation to training.

Recently models have been proposed to jointly generate structures and sequences. Anishchenko et al. (2021) generate structures by optimizing sequences through the trRosetta structure prediction network to maximize their difference from a background distribution. The joint generation approach is also being explored in the setting of infilling partial structures. Contemporary to this work, Wang et al. (2021) apply span masking to fine-tune the RosettaFold model (Baek et al., 2021) 

While there are billions of protein sequences in the largest sequence databases, the number of available experimentally determined structures is on the order of hundreds of thousands, imposing a limit on generative methods that learn from protein structure data. In this work, we explored whether predicted structures from recent deep learning methods can be used in tandem with experimental structures to train models for protein design.

To this end, we generated structures for 12 million UniRef50 sequences using AlphaFold2. As a result of training with this data we observe improvements in perplexity and sequence recovery by substantial margins, and demonstrate generalization to longer protein complexes, to proteins in multiple conformations, and to zero-shot prediction for mutation effects on binding affinity and AAV packaging. These results highlight that in addition to the geometric inductive biases which have been the major focus for work on inversefolding to date, finding ways to leverage more sources of training data is an equally important path to improved modeling capabilities.

We also take initial steps toward more general structureconditional protein design tasks. By integrating backbone span masking into the inverse folding task and using a sequence-to-sequence transformer, reasonable sequence predictions can be achieved for short masked spans.

If ways can be found to continue to leverage predicted structures for generative models of proteins, it may be possible to create models that learn to design proteins from an expanded universe of the billions of natural sequences whose structures are currently unknown. We used training data from two sources: 1) experimental protein structures from the CATH 40% non-redundant chain set, and 2) AlphaFold2-predicted structures from UniRef50 sequences. To evaluate the generalization performance across different protein folds, we split the train, validation, and test data based on the CATH hierarchical classification of protein structures (Orengo et al., 1997) for both data sources. To achieve that a rigorous structural hold-out, we additionally use foldseek (Kim et al., 2021) for pairwise TMalign between the test set the train set.

CATH topology split. Following the structural split methodology in previous work (Ingraham et al., 2019; Jing et al., 2021b; Strokach et al., 2020) , we randomly split the CATH v4.3 (latest version) topology classification codes into train, validation, and test sets at a 80/10/10 ratio. The CATH (Orengo et al., 1997) structural hierarchy, classifies domains in four levels: Class (C), Architecture (A), Topology/fold (T), and Homologous superfamily (H). The topology/fold (T) level roughly corresponds to the SCOP fold classification.

Experimental structures. We collected full chains up to length 500 for all domains in the CATH v4.3 40% sequence identity non-redundant set. The experimental structure data contained only stand-alone chains and no multichain complexes. As each chain may be classified with more than one topology codes, we further removed chains with topology codes spanning different splits, so that there is no overlap in topology codes between train, validation, and test. This results in 16,153 chains in the train split, 1457 chains in the validation split, and 1797 chains in the test split.

Predicted structures. We curated a new data set of AlphaFold2 (Jumper et al., 2021) -predicted structures for a selective subset of UniRef50 (202001) sequences. To prevent information leakage about the test set from the predicted structures, we proceeded in the following steps.

First, we annotated UniRef50 sequences with CATH classification according to the Gene3D (Lees et al., 2012) database, also used by Strokach (Strokach et al., 2020) for data curation. Gene3D represents each CATH classification code as a library of representative profile HMMs. We searched all HMMs associated with the validation and test splits against the UniRef50 sequences using default parameters in hmmsearch (Potter et al., 2018) and excluded all hits.

Additionally, as AlphaFold2 predictions use multiple sequence alignments (MSAs) as inputs, we also took precaution to avoid information leakage from sequences in the MSAs. We created a filtered version of UniRef100 by searching all the validation-split and test-split Gene3D HMMs against UniRef100 (202001) and excluding all hits. Then, we constructed our MSAs using hhblits (Steinegger et al., 2019) on this filtered version of UniRef100.

As AlphaFold2 predictions are computationally costly, our budget only allowed for predicting structures for a subset of the UniRef50 sequences. We ranked UniRef50 sequences based on the distogram lDDT score (Supplementary Equation 6 in (Senior et al., 2020) ), based on distogram predictions from MSATransformer (Rao et al., 2021) , as a proxy for the quality of predicted structures. In this order, using AlphaFold2 Model 1 on the filtered UniRef100 MSAs described above, we obtained predicted structures for the top 12 million UniRef50 sequences under length 500, roughly 750 times the CATH train set size.

We used the publicly released model weights from AlphaFold2 Model 1 for CASP14 as a single model, as opposed the 5-model ensemble in (Jumper et al., 2021) , to cover more sequences with the same amount of computing resources. We curated the input MSAs from UniRef100 with hhblits, with an additional filtering step as described above. To reduce computational costs, compared to the standard AlphaFold2 protocol, we did not include the UniRef90 jackhmmer MSAs, or the MGnify and BFD metagenomics MSAs, nor the pdb70 templates. Other than a reduced inputs, we followed the default settings in AlphaFold2 open source code, using 3 recycling iterations and the default Amber relaxation protocol. Despite the reduced inputs, the resulting 12 million predicted structures still have high pLDDT scores from AlphaFold, with 75% of residues having pLDDT above 90 (highly confident).

We found that increasing the predicted data size to up to 1 million structures (75 times the CATH experimental data size) substantially improves model performance. Beyond 1 million structures, models still benefit from more data but with diminished marginal returns (Figure 6a ).

Noise on AlphaFold2-predicted backbone coordinates. Even after Amber relaxation, the backbone coordinates predicted by AlphaFold2 contain artifacts in the sub-Angstrom scale that may give away amino acid identities. Without adding noise on predicted structures, there is a substantial gap between held-out set performance on predicted structures and on experimental structures. To prevent the model from learning non-generalizable AlphaFold2-specific rules, we added Gaussian noise at the 0.1A scale on predicted backbone coordinates. The Gaussian noise improves the invariant Transformer performance but not the GVP-GNN performance (Supplementary Figure C.1) .

We add a binary feature indicating whether each coordinate is masked or not. In GVP-Transformer, we exclude the masked nodes in the GVP-GNN encoder layers, and then impute zeros when passing the GVP-GNN outputs into the main Transformer. Imputing zeros for missing vector features ensure the rotation-and translation-invariance of the model. In GVP-GNN, we impute zeros for the input vector features, and in the input graph connect the masked nodes to the k sequence nearest-neighbors (k = 30) in lieu of the k nearest nodes by spatial distance.

For span masking, we randomly select continuous spans of up to 30 amino acids until 15% of input backbone coordinates are masked. Such a span masking scheme has shown to improve performance on natural language processing benchmarks (Joshi et al., 2020) . The span lengths are sampled from a geometric distribution Geo(p) where p = 0.05 (corresponding to an average span length of 1/p = 20). The starting points for the spans are uniformly randomly sampled. Compared to independent random masking, span masking is better for GVP-Transformer but not for GVP-GNN (Table C.1) .

For the amino acids with masked coordinates, we exclude the corresponding nodes from the input graph to the pre-processing GVP message passing layers, and then impute zeros for the geometric features when passing the GVP outputs into the main Transformer. Imputing zeros for missing vector features ensure the rotation-and translation-invariance of the model. 

Invariance to rotation and translation. The input features for both GVP-GNN and GVP-Transformer are translationinvariant, making the overall models also invariant to translations.

Each GVP-GNN layer is rotation-equivariant, that is, for a vector feature x and any arbitrary rotation T , T f (x) = f (T x).

The GVP-Transformer architecture is also invariant to rotations and translations. The initial GVP-GNN layers in GVP-Transformer output rotation-invariant scalar features and rotation-equivariant vector features for each amino acid. To make the overall GVP-Transformer invariant, we perform a change of basis on GVP-GNN vector outputs to produce rotation-invariant features for the Transformer. More specifically, for each amino acid, we define a local reference frame based on the N, CA, and C atom positions in the amino acid, following Algorithm 21 in AlphaFold2 (Jumper et al., 2021) . We then perform a change of basis according to this local reference frame, rotating the vector features in GVP-GNN outputs into the local reference frames of each amino acid. We concatenate this rotated "local version" of vector features together with the scalar features as inputs to the Transformer. The concatenated features are invariant to both translations and rotations on the input backbone coordinates, forming a L × E matrix where L is the number of amino acids in the protein backbone and E is the feature dimension. For amino acids with masked or missing coordinates, the features are imputed as zeros.

Transformer. We closely followed the original autoregressive encoder-decoder Transformer architecture (Vaswani et al., 2017) except for using learned positional embeddings instead of sinusoidal positional embeddings, attention dropout, and layer normalization inside the residual blocks ("pre-layernorm"). For model scaling experiments, we followed the model sizes in (Turc et al., 2019) , and chose the 142-million-parameter model with 8 encoder layers, 8 decoder layers, 8 attention heads, and embedding dimension 512 based on the best validation set performance (Figure 6c shows test set ablation).

The GVP-GNN, GVP-GNN-large, and GVP-Transformer models used in the evaluations in this manuscript are all trained to convergence, with detailed hyperparameters listed in Table A .1.

In addition to the CATH topology-based test set following previous work (Ingraham et al., 2019; Jing et al., 2021b) , we also create an even more stringent test set based on pairwise TM-score comparison between train and test examples. The CATH topology split does not completely prevent high TM-score matches between train and test structures. We illustrate such an example in Figure Distribution of the highest TM-score from each test example to the train set. For example, 54% of the CATH topology split test set has at least one match in the train set with TM-score above 0.5, and 27% of the topology split test set has at least one match in the train set with TM-score above 0.6.

Recovery Table B .1. Fixed backbone sequence design performance on the more stringent structurally held-out test set from CATH v4.3 chains (and its short and single-chain subsets) in terms of per-residue perplexity (lower is better) and recovery (higher is better).

We constructed a TM-score-based test set of 223 proteins with no TMalign matches (TM-score ≥ 0.5) from the train set, using the foldseek (Kim et al., 2021) TMalign tool with default parameters for the pairwise search.

We found that the conclusions about model performance overall remains the same on this TM-score-based test set as on the CATH topology split test set. For consistency with prior work, we report metrics on the CATH topology test set in the main manuscript, while showing metrics on the smaller TM-score-based test set in Table B .1.

Ablation on noise and masking during training. We found that GVP-Transformer models trained with Gaussian noise during training perform slightly better at test time than those trained without (Table C .1). When given full backbone coordinates at test time, training with span masking only very slightly improves model performance compared to no masking or to random masking, even though there is a much larger performance gap between random masking and span masking on regions with masked backbone coordinates (Figure 4 ).

Dual-state design test set from PDBFlex. We test design performance on multiple conformations by finding test split proteins with distinct conformations in the PDBFlex database. From PDBFlex, we looks for experimental structures of protein sequences in the CATH topology split test set (95% sequence identity or above), and take all paired instances that are at least 5 angstroms apart in overall RMSD between conformations. We report perplexity on locally flexible residues (defined as local RMSD above 1 angstrom). To be more conservative in our evaluation, we show the better of the two conformations to represent single-state perplexity in Figure 7 .

Ablation on the number of GVP-GNN encoder layers in GVP-Transformer. Increasing the number of GVP-GNN encoder layers improves the overall model performance ( Figure C.1 Stability prediction on de novo small proteins. We predict protein stability on an experimentally measured stability dataset for de novo small proteins (Rocklin et al., 2017) . We use the relative difference in sequence conditional loglikelihoods as a predictor for stability and compute Pearson correlation with the mutation effect following (Ingraham et al., 2019) , assuming that more stable sequences should score higher in log-likelihoods. For each fold, Rocklin et al. (2017) starts with a reference protein and generates sequence variants with single amino acid substitutions. We calculate the Pearson correlation between sequence conditional log-likelihood scores and experimental stability measurements for all designed sequences in each fold. With predicted structures as additional training data, the GVP-Transformer model improves the pearson correlation on 8 out of the 10 folds.

Perplexity and sequence recovery of SARS-CoV-2 RBD. We show perplexity and sequence recovery on the SARS-CoV-2 protein receptor binding domain (RBD) as an example for inverse folding. The RBD can exist in a closed-state with the RBD down or in an open-state with the RBD up , as illustrated in Figure C .3. The SARS-Cov-2 spike protein structure has no match with the training data with TM-score above 0.5. The SARS-Cov-2 spike protein has Predicting complex stability changes upon mutations. SKEMPI (Jankauskaitė et al., 2019 ) is a database of binding free energy changes upon single point mutations within protein complex interfaces. This database is used as a task in the Atom3D benchmark suite (Townshend et al., 2020) for comparing supervised stability prediction methods. The task is to classify whether the stability of the complex increases as a result of the mutation. We compare zero-shot predictions using inverse folding models to supervised and transfer learning methods (Townshend et al., 2020; Jing et al., 2021a) on the Atom3D test set. We find that sequence log-likelihoods from GVP-GNN, GVP-GNN-large, and GVP-Transformer models are all effective zero-shot predictors of stability changes of protein complexes (Table C .4), performing comparably to the best supervised method which uses transfer learning.

Predicting insertion effects on AAV. Using masked coordinate tokens at insertion regions, inverse folding models can also predict the effects of sequence insertions. Adeno-associated virus (AAV) capsids are a promising gene delivery vehicle, approved by the US Food and Drug Administration for use as gene delivery vectors in humans. Focusing on mutating a 28-amino acid segment, Bryant et al. (2021) generated more than 200,000 variants of AAV sequences with 12-29 mutations across this region, and measured their ability to package of a DNA payload. This dataset is unique compared to many other mutagenesis datasets in that most sequences feature random insertions in the 28-amino acid segment, as opposed to only random substitutions.

We use inverse folding models to predict insertion and substitution effects as follows: For each sequence, we input the full backbone coordinates of the wild-type (PDB: 1LP3), and insert one masked token into the input backbone coordinates for each insertion. Then we compare the conditional sequence log-likelihood on this input with masks to the conditional sequence log-likelihood of the wild-type sequence on the wild-type backbone. The difference in these two conditional log-likelihoods are used as the score for predicting packaging ability.

We report the zero-shot performance on each of the 7 data subsets evaluated in the FLIP (Dallago et al., 2021) benchmark suite. As shown in Table C .5, GVP-Transformer trained with predicted structures outperforms the sequence-only zero-shot prediction baseline ESM-1v on 6 out of the 7 data subsets. For ESM-1v, we scored variant sequences based on the independent marginals formula, as described in Equation 1 from Meier et al. (2021).

Confusion matrix. We calculated the substitution scores between native sequences and sampled sequences (sampled with temperature T = 1) by using the same log odds ratio formula as in the BLOSUM62 substition matrix. For two amino acids x and y, the substitution score s(x, y) is s(x, y) = log p(x, y) q(x)q(y) ,

where p(x, y) is the jointly likelihood that native amino acid x is substituted by sampled amino acid y, q(x) is the marginal likelihood in the native distribution, and q(y) is the marginal likelihood in the sampled distribution.

Calibration. Calibration curves examines how well the probabilistic predictions of a classifier are calibrated, plotting the true frequency of the label against its predicted probability. When computing the calibration curve, for each amino acid, we bin the predicted probabilities into 10 bins and then compare with the true probability.

Placement of hydrophobic residues. We define the amino acids IVLFCMA as hydrophobic residues, and inspect the distribution of solvent accessible surface area for both hydrophobic residues and polar (non-hydrophobic) residues. Solvent accessible surface area calculated with the Shrake-Rupley ("rolling probe") algorithm from the biotite package (Kunzmann & Hamacher, 2018) and summed over all atoms in each amino acid. All models have similar distributions of accessible surface area for hydrophobic residues, also similar to the distribution in native sequences ( Figure C.6) .

Sampling speed. We profile the sampling speed with PyTorch Profiler, averaging over the sampling time for 30 sequences in each sequence length bucket on a Quadro RTX 8000 GPU with 48GB memory. For the generic Transformer decoder, we use the incremental causal decoding implementation in fairseq (Ott et al., 2019) . For GVP-GNN, we use the implementation from the gvp-pytorch GitHub repository. 

The rosetta allatom energy function for macromolecular modeling and design

Unified rational protein engineering with sequence-based deep representation learning

Generative modeling for protein structures

Protein sequence design with a learned potential

Model-based reinforcement learning for biological sequence design

De novo protein design by deep network hallucination

Accurate prediction of protein structures and interactions using a three-track neural network

Learning protein sequence embeddings using information from structure

The protein data bank

Spherical convolutions and their application in molecular modelling

Conditioning by adaptive sampling for robust design

Deep diversification of an aav capsid protein by machine learning

Function-guided protein design by deep manifold sampling. bioRxiv

High-resolution protein design with backbone freedom

Modeling aspects of the language of life through transfer-learning protein sequences

Comparison of multiple amber force fields and development of improved protein backbone parameters

Pdbflex: exploring flexibility in protein structures

The coming of age of de novo protein design

Generative models for graph-based protein design

Skempi 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation

Iterative refinement graph neural network for antibody sequence-structure co

Equivariant graph neural networks for 3d macromolecular structure

Learning from protein structure with geometric vector perceptrons

Improving pre-training by representing and predicting spans

Highly accurate protein structure prediction with alphafold

Biotite: a unifying open source computational biology framework in python

Structure of the sars-cov-2 spike receptor-binding domain bound to the ace2 receptor

Protein design and variant prediction using autoregressive generative models

Discovery of novel gain-of-function mutations guided by structure-based deep learning

Adalead: A simple and robust adaptive greedy search algorithm for sequence design

Deep mutational scanning of sars-cov-2 receptor binding domain reveals constraints on folding and ace2 binding

Hh-suite3 for fast remote homology detection and deep protein annotation

Computational protein design

Fast and flexible protein design using deep graph neural networks

Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches

ATOM3D: tasks on molecules in three dimensions. CoRR

Efficient generative modeling of protein sequences using simple autoregressive models

Well-read students learn better: On the importance of pre-training compact models

Attention is all you need

Structure, function, and antigenicity of the sars-cov-2 spike glycoprotein

Computational protein design with deep learning neural networks

Deep learning methods for designing proteins scaffolding functional sites

Protein sequence design with deep generative models

Improved protein structure prediction using predicted interresidue orientations

Machine-learningguided directed evolution for protein engineering

Protein design using a convolutional neural network

A generalpurpose protein design framework based on mining sequence-structure relationships in known protein structures

Acknowledgements We thank Halil Akin, Sal Candido, Ori Kabeli, Joshua Meier, Ammar Rizvi, and Zhongkai Zhu for feedback on the manuscript and insightful conversations.