Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design
Wengong Jin, Jeremy Wohlwend, Regina Barzilay, Tommi Jaakkola
2021-10-09

Antibodies are versatile proteins that bind to pathogens such as viruses and stimulate the adaptive immune system. The specificity of antibody binding is determined by complementarity-determining regions (CDRs) at the tips of these Y-shaped proteins. In this paper, we propose a generative model to automatically design the CDRs of antibodies with enhanced binding specificity or neutralization capabilities. Previous generative approaches formulate protein design as a structure-conditioned sequence generation task, assuming the desired 3D structure is given a priori. In contrast, we propose to co-design the sequence and 3D structure of CDRs as graphs. Our model unravels a sequence autoregressively while iteratively refining its predicted global structure. The inferred structure in turn guides subsequent residue choices. For efficiency, we model the conditional dependence between residues inside and outside of a CDR in a coarse-grained manner. Our method achieves superior log-likelihood on the test set and outperforms previous baselines in designing antibodies capable of neutralizing the SARS-CoV-2 virus.

Monoclonal antibodies are increasingly adopted as therapeutics targeting a wide range of pathogens such as SARS-CoV-2 (Pinto et al., 2020). Since the binding specificity of these Y-shaped proteins is largely determined by their complementarity-determining regions (CDRs), the main goal of computational antibody design is to automate the creation of CDR subsequences with desired properties. This problem is particularly challenging due to the combinatorial search space of over 20^60 possible CDR sequences and the small solution space that satisfies the desired constraints of binding affinity, stability, and synthesizability (Raybould et al., 2019).

There are three key modeling questions in CDR generation. The first is how to model the relation between a sequence and its underlying 3D structure. Generating sequences without the corresponding structure (Alley et al., 2019; Shin et al., 2021) can lead to sub-optimal performance (Ingraham et al., 2019), while generating from a predefined 3D structure (Ingraham et al., 2019) is not suitable for antibodies since the desired structure is rarely known a priori (Fischman & Ofran, 2018). Therefore, it is crucial to develop models that co-design the sequence and structure. The second question is how to model the conditional distribution of CDRs given the remainder of the sequence (the context). Attention-based methods model the conditional dependence only at the sequence level, but the structural interaction between the CDR and its context is crucial for generation. The last question relates to the model's ability to optimize for various properties. Traditional physics-based methods (Lapidoth et al., 2015; Adolf-Bryfogle et al., 2018) focus on binding energy minimization, but in practice our objective can be much more involved than binding energy (Liu et al., 2020).

In this paper, we represent a sequence-structure pair as a graph and formulate the co-design task as a graph generation problem. The graph representation allows us to model the conditional dependence between a CDR and its context at both the sequence and structure levels.
Antibody graph generation poses unique challenges because the global structure is expected to change when new nodes are inserted. Previous autoregressive models (You et al., 2018; Gebauer et al., 2019) cannot modify a generated structure because they are trained under teacher forcing, so errors made in previous steps cascade into subsequent generation steps. To address these problems, we propose a novel architecture that interleaves the generation of amino acid nodes with the prediction of 3D structure. Structure generation is based on iterative refinement of a global graph rather than sequential expansion of a partial graph under teacher forcing. Since the context sequence is long, we further introduce a coarsened graph representation that groups nodes into blocks, and we apply graph convolution at this coarser level to efficiently propagate contextual information to the CDR residues. After pretraining our model on antibodies with known structures, we finetune it using a predefined property predictor to generate antibodies with specific properties.

We evaluate our method on three generation tasks, ranging from language modeling to SARS-CoV-2 neutralization optimization and antigen-binding antibody design. Our method is compared with a standard sequence model (Saka et al., 2021; Akbar et al., 2021) and a state-of-the-art graph generation method (You et al., 2018) tailored to antibodies. Our method not only achieves lower perplexity on test sequences but also outperforms previous baselines in property-guided antibody design tasks.

Antibody/protein design. Current methods for computational antibody design roughly fall into two categories. The first class is based on energy function optimization (Pantazes & Maranas, 2010; Li et al., 2014; Lapidoth et al., 2015; Adolf-Bryfogle et al., 2018), using Monte Carlo simulation to iteratively modify a sequence and its structure until a local energy minimum is reached. Similar approaches are used in protein design (Leaver-Fay et al., 2011; Tischer et al., 2020). Nevertheless, these physics-based methods are computationally expensive (Ingraham et al., 2019), and our desired objective can be much more complicated than low binding energy (Liu et al., 2020). The second class is based on generative models. For antibodies, these are mostly sequence-based (Alley et al., 2019; Shin et al., 2021; Saka et al., 2021; Akbar et al., 2021). Cao et al. (2021) further developed models conditioned on a backbone structure or protein fold. Our model also seeks to incorporate 3D structure information for antibody generation. Since the best CDR structures are often unknown for new pathogens, we co-design sequences and structures for specific properties.

Generative models for graphs. Our work is related to autoregressive models for graph generation (You et al., 2018; Liu et al., 2018; Liao et al., 2019; Jin et al., 2020a). In particular, Gebauer et al. (2019) developed G-SchNet for molecular graph and conformation co-design. Unlike our method, G-SchNet generates edges sequentially and cannot modify a previously generated subgraph when new nodes arrive. While Graphite (Grover et al., 2019) also uses iterative refinement to predict the adjacency matrix of a graph, it assumes all node labels are given and predicts edges only. In contrast, our work combines autoregressive generation with iterative refinement to produce a full graph, including both node labels and coordinates.
3D structure prediction. Our approach is closely related to protein folding (Ingraham et al., 2018; Yang et al., 2020a; Baek et al., 2021; Jumper et al., 2021). State-of-the-art models like AlphaFold require as input a complete protein sequence, its multiple sequence alignment (MSA), and template features. These models are not directly applicable in our setting because we need to predict the structure of an incomplete sequence and the MSA is not specified in advance. Our iterative refinement model is also related to score matching methods for molecular conformation prediction (Shi et al., 2021) and diffusion-based methods for point clouds (Luo & Hu, 2021). These algorithms also iteratively refine a predicted 3D structure, but only for a complete molecule or point cloud. In contrast, our approach learns to predict the 3D structure of incomplete graphs and interleaves 3D structure refinement with graph generation.

Overview. The role of an antibody is to bind to an antigen (e.g., a virus), present it to the immune system, and stimulate an immune response. A subset of antibodies known as neutralizing antibodies not only bind to an antigen but also suppress its activity. An antibody consists of a heavy chain and a light chain, each composed of one variable domain (VH/VL) and several constant domains. The variable domain is further divided into a framework region and three complementarity-determining regions (CDRs). The three CDRs on the heavy chain are labeled CDR-H1, CDR-H2, and CDR-H3, each occupying a contiguous subsequence (Figure 1). As the most variable part of an antibody, the CDRs are the main determinants of binding and neutralization (Abbas et al., 2014).

Following Shin et al. (2021) and Akbar et al. (2021), we formulate antibody design as a CDR generation task conditioned on the framework region. Specifically, we represent an antibody as a graph, which encodes both its sequence and 3D structure. We propose a new graph generation approach called RefineGNN and extend it to handle conditional generation given a fixed framework region. Lastly, we describe how to apply RefineGNN to property-guided optimization to design new antibodies with better neutralization properties. For simplicity, we focus on the generation of heavy-chain CDRs, though our method can be easily extended to light-chain CDRs.

Notations. An antibody VH domain is represented as a sequence of amino acids s = s_1 s_2 ... s_n. Each token s_i in the sequence is called a residue, whose value is either one of the 20 amino acids or a special token ⟨MASK⟩, meaning that its amino acid type is unknown and needs to be predicted. The VH sequence folds into a 3D structure, and each residue s_i is labeled with three backbone coordinates: x_{i,α} for its alpha carbon atom, x_{i,c} for its carbon atom, and x_{i,n} for its nitrogen atom. We represent an antibody (VH) as a graph G(s) = (V, E) with node features V = {v_1, ..., v_n} and edge features E = {e_{ij}}_{i≠j}. Each node feature v_i encodes three dihedral angles (φ_i, ψ_i, ω_i) related to the three backbone coordinates of residue i. For each residue i, we compute an orientation matrix O_i representing its local coordinate frame (Ingraham et al., 2019) (defined in the appendix). This allows us to compute edge features describing the spatial relationship (distance and orientation) between two residues i and j:

    e_{ij} = ( E_pos(i − j),  RBF(‖x_{j,α} − x_{i,α}‖),  O_i^T (x_{j,α} − x_{i,α}) / ‖x_{j,α} − x_{i,α}‖,  q(O_i^T O_j) )

The edge feature e_{ij} contains four parts. The positional encoding E_pos(i − j) encodes the relative distance between two residues in the antibody sequence. The second term RBF(·) is a distance encoding lifted into a radial basis. The third term is a direction encoding that corresponds to the relative direction of x_{j,α} in the local frame of residue i. The last term q(O_i^T O_j) is an orientation encoding given by the quaternion representation q(·) of the spatial rotation matrix O_i^T O_j. We only include edges in the K-nearest neighbors graph of G(s), with K = 8. For notational convenience, we use G as a shorthand for G(s) when there is no ambiguity.
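To make this featurization concrete, the following NumPy sketch computes the four edge-feature components from Cα coordinates. This is an illustration, not the authors' code: the number of RBF bins and the toy positional encoding are our assumptions, and the local frames are built from Cα atoms only, following the appendix definition.

```python
import numpy as np

def local_frame(x_prev, x_i, x_next):
    """Orientation matrix O_i = [b_i, n_i, b_i x n_i] from three consecutive
    C-alpha coordinates, following Ingraham et al. (2019)."""
    u1 = (x_i - x_prev) / np.linalg.norm(x_i - x_prev)
    u2 = (x_next - x_i) / np.linalg.norm(x_next - x_i)
    b = (u1 - u2) / np.linalg.norm(u1 - u2)                    # negative bisector
    n = np.cross(u1, u2) / np.linalg.norm(np.cross(u1, u2))
    return np.stack([b, n, np.cross(b, n)], axis=1)

def rbf(d, d_min=0.0, d_max=20.0, num_bins=16):
    """Lift a scalar distance into a bank of radial basis functions
    (bin count and range are illustrative choices)."""
    centers = np.linspace(d_min, d_max, num_bins)
    width = (d_max - d_min) / num_bins
    return np.exp(-((d - centers) / width) ** 2)

def rotation_to_quaternion(R):
    """Quaternion q(R) of a rotation matrix, (w, x, y, z) convention."""
    w = np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2])) / 2.0
    x = np.copysign(np.sqrt(max(0.0, 1.0 + R[0, 0] - R[1, 1] - R[2, 2])) / 2.0, R[2, 1] - R[1, 2])
    y = np.copysign(np.sqrt(max(0.0, 1.0 - R[0, 0] + R[1, 1] - R[2, 2])) / 2.0, R[0, 2] - R[2, 0])
    z = np.copysign(np.sqrt(max(0.0, 1.0 - R[0, 0] - R[1, 1] + R[2, 2])) / 2.0, R[1, 0] - R[0, 1])
    return np.array([w, x, y, z])

def edge_feature(i, j, x_alpha, O):
    """e_ij = (E_pos(i-j), RBF(dist), direction in frame i, q(O_i^T O_j))."""
    delta = x_alpha[j] - x_alpha[i]
    dist = np.linalg.norm(delta)
    pos_enc = np.array([np.sign(i - j), np.log1p(abs(i - j))])  # toy E_pos
    direction = O[i].T @ (delta / dist)                         # relative direction
    orientation = rotation_to_quaternion(O[i].T @ O[j])         # relative rotation
    return np.concatenate([pos_enc, rbf(dist), direction, orientation])

# toy usage on a short chain with ~3-unit spacing between residues
x = np.cumsum(np.random.randn(5, 3) + np.array([3.0, 0.0, 0.0]), axis=0)
O = np.stack([local_frame(x[i - 1], x[i], x[i + 1]) for i in range(1, 4)])
print(edge_feature(0, 2, x[1:4], O).shape)  # (2 + 16 + 3 + 4,) = (25,)
```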
We propose to generate an antibody graph via an iterative refinement process. Let G^(0) be the initial guess of the true antibody graph. Each residue is initialized as the special token ⟨MASK⟩, and each edge (i, j) is initialized with distance 3|i − j|, since the average distance between consecutive residues is around three. The direction and orientation features are initialized to zero. In each generation step t, the model learns to revise the current antibody graph G^(t) and to predict the label of the next residue t + 1. Specifically, it first encodes G^(t) with a message passing network (MPN) with parameter θ:

    {h_1^(t), ..., h_n^(t)} = MPN_θ(G^(t))

where h_i^(t) is a learned representation of residue i under the current graph G^(t). Our MPN consists of L message passing layers, in which each residue representation h_i is updated by aggregating messages FFN(h_i, h_j, E(s_j), e_{ij}) from its neighbors j. Here FFN is a two-layer feed-forward network with ReLU activation, and E(s_j) is a learned embedding of amino acid type s_j.

Based on the learned residue representations, we predict the amino acid type of the next residue t + 1 (Figure 2A). This prediction gives a new graph G^(t+0.5) with the same edges as G^(t) but with the node label of residue t + 1 changed (Figure 2B). Next, we need to update the structure to accommodate the new residue t + 1. To this end, we encode the graph G^(t+0.5) with another MPN with a different parameter θ̃ and predict the coordinates of all residues. The new coordinates x_i^(t+1) define a new antibody graph G^(t+1) for the next iteration (Figure 2C). We explicitly realize the coordinates of each residue because we need to calculate the spatial edge features of G^(t+1). The structure prediction (coordinates x_i) and sequence prediction (amino acid probabilities p_{t+1}) are carried out by two different MPNs, namely the structure network θ̃ and the sequence network θ. This disentanglement allows the two networks to focus on two distinct tasks.

Training. During training, we apply teacher forcing only to the discrete amino acid type prediction. Specifically, in each generation step t, residues 1 to t are set to their ground truth amino acid types s_1, ..., s_t, while all future residues t + 1, ..., n are set to a padding token. In contrast, the continuous structure prediction is carried out without teacher forcing. In each iteration, the model refines the entire structure predicted in the previous step and constructs a new K-nearest neighbors graph G^(t+1) over all residues based on the predicted coordinates {x_i^(t+1)}.
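The decoding loop below sketches this alternation between sequence prediction and full-structure refinement. It is schematic: the two GRUs are hypothetical stand-ins for the paper's sequence and structure MPNs, and the KNN graph rebuild is left as a comment.

```python
import torch
import torch.nn as nn

class RefineDecoderSketch(nn.Module):
    """Schematic RefineGNN-style decoding: alternate next-residue prediction
    (sequence network) with refinement of ALL coordinates (structure network).
    A real implementation would message-pass over the KNN graph built from
    `coords` using the edge features sketched earlier."""
    def __init__(self, n_aa=21, hidden=256):
        super().__init__()
        self.mask_id = n_aa - 1                                      # last index = <MASK>
        self.embed = nn.Embedding(n_aa, hidden)                      # E(s_j)
        self.seq_net = nn.GRU(hidden, hidden, batch_first=True)      # stand-in for MPN_theta
        self.struct_net = nn.GRU(hidden, hidden, batch_first=True)   # stand-in for MPN_theta~
        self.to_aa = nn.Linear(hidden, n_aa)                         # amino acid logits
        self.to_xyz = nn.Linear(hidden, 3)                           # C-alpha coordinates

    @torch.no_grad()
    def decode(self, n):
        seq = torch.full((1, n), self.mask_id, dtype=torch.long)     # all residues <MASK>
        # initial guess: consecutive residues spaced 3 apart (distance 3|i-j|)
        coords = torch.stack([3.0 * torch.arange(n)] * 3, dim=-1).unsqueeze(0)
        for t in range(n):
            h, _ = self.seq_net(self.embed(seq))                     # encode G^(t)
            seq[0, t] = self.to_aa(h[0, t]).argmax()                 # predict residue -> G^(t+0.5)
            g, _ = self.struct_net(self.embed(seq))                  # encode G^(t+0.5)
            coords = self.to_xyz(g)                                  # refine ALL coords -> G^(t+1)
            # (a real model would rebuild the KNN graph from `coords` here)
        return seq, coords

seq, coords = RefineDecoderSketch().decode(n=12)   # e.g. a 12-residue CDR
print(seq, coords.shape)
```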
Loss function. Our model is rotation and translation invariant because the loss function is computed over pairwise distances and angles rather than coordinates. The loss function for antibody structure prediction consists of three parts.

• Distance loss: For each residue pair (i, j), we compute the pairwise distance between the predicted alpha carbons x_{i,α} and x_{j,α}. We define the distance loss as the Huber loss between the predicted and true pairwise distances, where distances are squared to avoid the square root operation, which causes numerical instability.

• Dihedral angle loss: For each residue, we calculate the backbone dihedral angles defined by its atoms and those of the next residue (involving coordinates such as x_{i,c} and x_{i+1,n}). We define the dihedral angle loss as the mean square error between the predicted and true dihedral angles.

• C_α angle loss: We calculate the angles γ_i between consecutive alpha carbons, as well as the dihedral angles β_i^(t) between the two planes defined by four consecutive alpha carbons, and penalize the mean square error between the predicted and true angles.

In summary, the overall graph generation loss is defined as L = L_seq + L_struct, where L_struct is the sum of the distance, dihedral angle, and C_α angle losses, and the sequence prediction loss L_seq is the cross entropy L_ce between predicted and true residue types.

The model architecture described so far is designed for unconditional generation: it generates an entire antibody graph without any constraints. In practice, we usually fix the framework region of an antibody and design the CDR sequence only. Therefore, we extend the model architecture to learn the conditional distribution P(s_{l:r} | s_{<l}, s_{>r}), where s_{<l} = s_1 ... s_{l−1} and s_{>r} = s_{r+1} ... s_n are the residues outside of the CDR s_l ... s_r.

Conditioning via attention. A simple extension of RefineGNN is to encode the non-CDR sequence with a recurrent neural network and propagate information to the CDR through an attention layer. Specifically, we first concatenate s_{<l} and s_{>r} into a context sequence s̃ = s_{<l} ⊕ ⟨MASK⟩ ⊕ ... ⊕ ⟨MASK⟩ ⊕ s_{>r}, where ⊕ denotes string concatenation and the ⟨MASK⟩ token is repeated n times (the length of the CDR). We then encode this context sequence with a Gated Recurrent Unit (GRU) (Cho et al., 2014), c_{1:n} = GRU(s̃), and modify the structure and sequence prediction steps (Equations 4 and 6) to attend over the context vectors c_{1:n}.

Multi-resolution modeling. The attention-based approach alone is not sufficient because it does not model the structure of the context sequence, thus ignoring how its residues structurally interact with the CDR's. While this information is not available for new antibodies at test time, we can learn to predict the interaction using antibodies in the training set with known structures. A naive solution is to iteratively refine the entire antibody structure (more than 100 residues) while generating CDR residues. This approach is computationally expensive because we would need to recompute the MPN encoding for all residues in each generation step. Importantly, we cannot predict the context residue coordinates once at the outset and fix them, because they need to be adjusted as the coordinates of CDR residues are updated in each generation step.

For computational efficiency, we propose a coarse-grained model that reduces the context sequence length by clustering it into residue blocks. Specifically, we construct a coarsened context sequence b_{l,r}(s) by clustering every b context residues into a block (Figure 2D). The new sequence b_{l,r}(s) defines a coarsened graph G(b_{l,r}(s)) over the residue blocks, whose edges are defined based on block coordinates. The coordinate of each block x_{b_i} is defined as the mean coordinate of the residues within the block, and the embedding of each block E(b_i) is the mean of its residue embeddings. We can now apply RefineGNN to generate the CDR residues while iteratively refining the global graph G(b_{l,r}(s)) by predicting the coordinates of all blocks. The only change is that the structure prediction loss is now defined over the block coordinates x_{b_i}. The decoding process is summarized in Algorithm 1.

Algorithm 1: Conditional decoding
Require: Context sequence s_{<l}, s_{>r}
1: Predict the CDR length n
2: Coarsen the context sequence into b_{l,r}(s)
3: Construct the initial graph G^(0)
4: for t = 0 to n − 1 do
5:    Encode G^(t) using the sequence MPN and predict residue t + 1
6:    Encode G^(t+0.5) using the structure MPN and predict all block coordinates to form G^(t+1)
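A minimal sketch of the coarsening step (line 2 of Algorithm 1), assuming per-residue embeddings and Cα coordinates are already available; block embeddings and coordinates are means over each block, as described above. Grouping the concatenated context in one pass is a simplification of ours.

```python
import torch

def coarsen_context(res_embed, res_coords, cdr_range, block_size=4):
    """Group non-CDR residues into blocks of `block_size`. Each block's
    embedding E(b_i) and coordinate x_{b_i} are means over its residues.
    res_embed: (n, d), res_coords: (n, 3)."""
    l, r = cdr_range                                   # CDR spans positions l..r (inclusive)
    context = [i for i in range(res_embed.size(0)) if i < l or i > r]
    block_embeds, block_coords = [], []
    for start in range(0, len(context), block_size):
        idx = torch.tensor(context[start:start + block_size])
        block_embeds.append(res_embed[idx].mean(dim=0))    # E(b_i): mean embedding
        block_coords.append(res_coords[idx].mean(dim=0))   # x_{b_i}: mean coordinate
    return torch.stack(block_embeds), torch.stack(block_coords)

# toy usage: a 40-residue chain whose CDR spans positions 20..29
emb, xyz = torch.randn(40, 256), torch.randn(40, 3)
b_emb, b_xyz = coarsen_context(emb, xyz, cdr_range=(20, 29), block_size=4)
print(b_emb.shape, b_xyz.shape)   # 30 context residues -> 8 blocks
```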
Lastly, we combine both the attention mechanism and coarse-grained modeling to retain both fine-grained and coarse-grained information. The decoding process of this conditional RefineGNN is illustrated in Algorithm 1.

Our ultimate goal is to generate new antibodies with desired properties, such as neutralizing a particular virus. This task can be formulated as an optimization problem. Let Y be a binary indicator variable for neutralization. Our goal is to learn a conditional generative model P_Θ(s' | b_{l,r}(s)) that maximizes the probability of neutralization over a training set of antibodies D, i.e.,

    max_Θ Σ_{s ∈ D} E_{s' ∼ P_Θ(· | b_{l,r}(s))} [ f(s') ]

where f(s') is a predictor for P(Y = 1 | s'). Assuming f is given, this problem can be solved by iterative target augmentation (ITA) (Yang et al., 2020b). Before ITA optimization starts, we pretrain our model on a set of real antibody structures to learn a prior distribution over CDR sequences and structures. In each ITA finetuning step, we first randomly sample a sequence s from D, the set of antibodies whose CDRs need to be redesigned. Next, we generate M new sequences given its context b_{l,r}(s). A generated sequence s'_i is added to our training set Q if it is predicted to be neutralizing. Initially, the training set Q contains antibodies that are known to be neutralizing (Y = 1). Lastly, we sample a batch of neutralizing antibodies from Q and update the model parameter Θ by minimizing their sequence prediction loss L_seq (Eq. 10). The structure prediction loss L_struct is excluded in the ITA finetuning phase because the structure of a generated sequence is unknown. In summary, each ITA step proceeds as follows.

Algorithm 2: ITA finetuning (one step)
1: Sample an antibody s from D, remove its CDR, and get a context sequence b_{l,r}(s)
2: Generate M new CDR sequences given b_{l,r}(s)
3: Add each generated sequence predicted as neutralizing to Q
4: Sample a batch of new antibodies from Q
5: Update model parameter Θ by minimizing the sequence prediction loss L_seq
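The toy program below mirrors the ITA control flow of Algorithm 2. The generator, predictor, and acceptance threshold are placeholders of our own; only the loop structure (sample a context, generate M candidates, keep predicted neutralizers in Q, minimize L_seq on batches from Q) follows the procedure above.

```python
import random
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"

class ToyCDRModel(nn.Module):
    """Stand-in for the pretrained conditional generator: samples random CDRs
    from a learned residue distribution. Only the ITA control flow below is
    faithful to the paper; this model is a placeholder."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(len(AA)))
    def sample(self, context, num):
        probs = torch.softmax(self.logits, dim=0)
        return ["".join(random.choices(AA, weights=probs.tolist(), k=10)) for _ in range(num)]
    def sequence_loss(self, batch):
        # cross-entropy of each residue in the accepted CDRs under the model
        idx = torch.tensor([AA.index(a) for _, s in batch for a in s])
        return nn.functional.cross_entropy(self.logits.expand(len(idx), -1), idx)

def toy_predictor(seq):            # stand-in for f(s') = P(Y = 1 | s')
    return sum(a in "DE" for a in seq) / len(seq)

def ita_finetune(model, predictor, D, Q, steps=50, M=20, tau=0.3):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        context = random.choice(D)                 # b_{l,r}(s); CDR already removed
        for s_new in model.sample(context, num=M):
            if predictor(s_new) > tau:             # keep predicted neutralizers
                Q.append((context, s_new))
        if not Q:
            continue
        batch = random.sample(Q, k=min(16, len(Q)))
        loss = model.sequence_loss(batch)          # L_seq only; L_struct excluded
        opt.zero_grad(); loss.backward(); opt.step()

ita_finetune(ToyCDRModel(), toy_predictor, D=["ctx1", "ctx2"], Q=[])
```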
Setup. We construct three evaluation setups to quantify the performance of our approach. Following standard practice in generative model evaluation, we first measure the perplexity of different models on new antibodies in a test set created from a sequence similarity split. We also measure structure prediction error by comparing generated structures with the ground truth CDR structures recorded in the Structural Antibody Database (Dunbar et al., 2014). Results for this task are shown in Section 4.1. Second, we evaluate our method on an existing antibody design benchmark of 60 antibody-antigen complexes from Adolf-Bryfogle et al. (2018). The goal is to design the CDR-H3 of an antibody so that it binds to a given antigen. Results for this task are shown in Section 4.2. Lastly, we propose an antibody optimization task which aims to redesign the CDR-H3 of antibodies in the Coronavirus Antibody Database (Raybould et al., 2021) to improve their neutralization of SARS-CoV-2. CDR-H3 design with a fixed framework is common practice in the antibody engineering community (Adolf-Bryfogle et al., 2018; Liu et al., 2020). Following prior work in molecular design, we use a predictor to evaluate the neutralization of generated antibodies, since we cannot experimentally test them in wet labs. Results for this task are reported in Section 4.3.

Baselines. We consider three baselines for comparison (details in the appendix). The first baseline is the sequence-based LSTM model used in Saka et al. (2021) and Akbar et al. (2021). This model does not utilize any 3D structure information. It consists of an encoder that encodes a context sequence s̃, a decoder that decodes a CDR sequence, and an attention layer connecting the two. The second baseline is an autoregressive graph generation model (AR-GNN) whose architecture is similar to You et al. (2018) but tailored to antibodies. AR-GNN generates an antibody graph residue by residue. In each step t, it first predicts the amino acid type of residue t and then generates edges between t and previous residues. Importantly, AR-GNN cannot modify a partially generated 3D structure of residues s_1 ... s_{t−1} because it is trained by teacher forcing. On the antigen-binding task, we include an additional physics-based baseline called RosettaAntibodyDesign (RAbD) (Adolf-Bryfogle et al., 2018). We apply its de novo design protocol, composed of graft design followed by 250 iterations of sequence design and energy minimization. We cannot afford to run more iterations because it takes more than 10 hours per antibody. We also could not apply RAbD to the SARS-CoV-2 task because it requires 3D structures, which are unavailable for antibodies in CoVAbDab.

Hyperparameters. We performed hyperparameter tuning to find the best setting for each method. For RefineGNN, both the structure and sequence MPNs have four message passing layers, with a hidden dimension of 256 and block size b = 4. All models are trained with the Adam optimizer with a learning rate of 0.001. More details are provided in the appendix.

Data. The Structural Antibody Database (SAbDab) consists of 4994 antibody structures renumbered according to the IMGT numbering scheme (Lefranc et al., 2003). To measure a model's ability to generalize to novel CDR sequences, we divide the heavy chains into training, validation, and test sets based on a CDR cluster split. We illustrate our cluster split process using CDR-H3 as an example. First, we use MMseqs2 (Steinegger & Söding, 2017) to cluster all the CDR-H3 sequences, with sequence identity calculated under the BLOSUM62 substitution matrix (Henikoff & Henikoff, 1992). Two antibodies are put into the same cluster if their CDR-H3 sequence identity is above 40%. We then randomly split the clusters into training, validation, and test sets with an 8:1:1 ratio. We repeat the same procedure to create the CDR-H1 and CDR-H2 splits. In total, there are 1266, 1564, and 2325 clusters for CDR-H1, H2, and H3 respectively. The size of the training, validation, and test sets for each CDR is shown in the appendix.

Metrics. For each method, we report the perplexity (PPL) of test sequences and the root mean square deviation (RMSD) between a predicted structure and the ground truth structure reported in SAbDab. RMSD is calculated with the Kabsch algorithm (Kabsch, 1976) over the C_α coordinates of CDR residues. Since the mapping between sequences and structures is deterministic in RefineGNN, we can calculate perplexity in the same way as standard sequence models.

Results. Since the LSTM baseline does not involve structure prediction, we report RMSD for graph-based methods only. As shown in Table 1, RefineGNN significantly outperforms all baselines on both metrics. For CDR-H3, our model gives a 13% PPL reduction over the sequence-only model (8.38 vs. 9.70) and a 10% PPL reduction over AR-GNN (8.38 vs. 9.44). RefineGNN also predicts structures more accurately, with a 30% relative RMSD reduction over AR-GNN. In Figure 3, we provide examples of predicted 3D structures of CDR-H3 loops.
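For reference, the RMSD metric above can be computed with a textbook Kabsch implementation over Cα coordinates; the sketch below is a standard implementation, not the authors' code.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) C-alpha coordinate sets after optimal
    superposition with the Kabsch algorithm (Kabsch, 1976)."""
    P = P - P.mean(axis=0)                  # center both point sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)       # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))      # correct for reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt     # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

# toy usage: a structure vs. a rotated copy of itself has RMSD ~ 0
coords = np.random.randn(12, 3)
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
print(kabsch_rmsd(coords, coords @ rot.T))  # approximately 0.0
```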
Ablation studies. We further conduct ablation experiments on the CDR-H3 generation task to study the importance of different modeling choices. First, when we remove the attention mechanism or the context coarsening step of Section 3.3, the PPL increases from 8.38 to 8.86 (Figure 3C, row 2) and 9.01 (Figure 3C, row 3), respectively. We also tried removing both the attention and coarsening modules, training the model without conditioning on the context sequence. The PPL of this unconditional variant is much worse than that of our conditional model (Figure 3C, row 4). In short, these results validate the advantage of our multi-resolution conditioning strategy. Moreover, model performance degrades slightly when we halve the number of refinement steps or increase the block size to b = 8 (Figure 3C, rows 5-6). Lastly, we train a structure-conditioned model by feeding the ground truth structure to RefineGNN at every generation step (Figure 3C, row 7). While this structure-conditioned model gives a lower PPL as expected (7.39 vs. 8.38), it is not too far from the sequence-only model (PPL = 9.70). This suggests that RefineGNN is able to extract a decent amount of information from the partial structure co-evolving with the sequence.

Data. Adolf-Bryfogle et al. (2018) selected 60 antibody-antigen complexes as an antibody design benchmark. Given the framework of an antibody, the goal is to design a CDR-H3 that binds to the corresponding antigen. For simplicity, none of the methods is conditioned on the antigen structure during CDR-H3 generation; we leave antigen-conditioned CDR generation for future work.

Metric. Following Adolf-Bryfogle et al. (2018), we use amino acid recovery (AAR) as the evaluation metric. For any generated sequence, we define its AAR as the percentage of residues having the same amino acid as the corresponding residue in the original antibody.

Results. For LSTM, AR-GNN, and RefineGNN, the training set in this setup is the entire SAbDab except antibodies in the same cluster as any of the test antibodies. At test time, we generate 10000 CDR-H3 sequences for each antibody and select the top 100 candidates with the lowest perplexity. For simplicity, all methods are configured to generate CDRs of the same length as the original CDR. As shown in Table 1, our model achieves the highest AAR score, with around 7% absolute improvement over the best baseline. In Figure 4A, we show an example of a generated CDR-H3 sequence and highlight residues that differ from the original antibody. We also found that sequences with lower perplexity tend to have lower amino acid recovery error (Pearson R = 0.427, Figure 4B). This suggests that perplexity can serve as the ranking criterion for antibody design.

Data. The Coronavirus Antibody Database (CoVAbDab) contains 2411 antibodies, each associated with multiple binary labels indicating whether it neutralizes a coronavirus (SARS-CoV-1 or SARS-CoV-2) at a certain epitope. Similar to the previous experiment, we divide the antibodies into training, validation, and test sets based on a CDR-H3 cluster split with an 8:1:1 ratio.

Neutralization predictor. The predictor takes as input the VH sequence of an antibody and outputs a neutralization probability for the SARS-CoV-1 and SARS-CoV-2 viruses. Each residue is embedded into a 64-dimensional vector, which is fed to an SRU encoder (Lei, 2021) followed by average pooling and a two-layer feed-forward network. The final outputs are the probabilities p_1 and p_2 of neutralizing SARS-CoV-1 and SARS-CoV-2, and our scoring function is f(s) = p_2. The predictor achieved 0.81 test AUROC for SARS-CoV-2 neutralization prediction.
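A sketch of such a predictor is shown below. The paper's SRU encoder (Lei, 2021) is swapped for nn.GRU to keep the example dependency-free, and all sizes except the 64-dimensional residue embedding are our assumptions.

```python
import torch
import torch.nn as nn

class NeutralizationPredictor(nn.Module):
    """Residue embeddings -> recurrent encoder -> average pooling -> two-layer
    FFN with two sigmoid outputs (SARS-CoV-1, SARS-CoV-2), as described above."""
    def __init__(self, n_aa=21, embed_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_aa, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)  # GRU in place of SRU
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # logits for (SARS-CoV-1, SARS-CoV-2)
        )

    def forward(self, tokens):             # tokens: (batch, length) residue indices
        h, _ = self.encoder(self.embed(tokens))
        pooled = h.mean(dim=1)             # average pooling over residues
        p1, p2 = torch.sigmoid(self.head(pooled)).unbind(dim=-1)
        return p1, p2                      # scoring function: f(s) = p2

vh = torch.randint(0, 21, (1, 120))        # a toy VH sequence of 120 residues
p1, p2 = NeutralizationPredictor()(vh)
print(float(p2))
```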
CDR sequence constraints. Therapeutic antibodies must be free from developability issues such as glycosylation and high charges (Raybould et al., 2019). Thus, we include four constraints on a CDR-H3 sequence s': 1) Its net charge must be between -2.0 and 2.0 (Raybould et al., 2019); the definition of net charge is given in the appendix. 2) It must not contain the N-X-S/T motif, which is prone to glycosylation. 3) No amino acid may repeat more than five times in a row (e.g., SSSSS). 4) The perplexity of a generated sequence under LSTM, AR-GNN, and RefineGNN must all be less than 10. The last two constraints force generated sequences to be realistic. We use all three models in the perplexity constraint to ensure a fair comparison across methods.

Metric. For each antibody in the test set, we generate 100 new CDR-H3 sequences, concatenate them with its context sequence to form 100 full VH sequences, and feed them into the neutralization predictor f. We report the average neutralization score over antibodies in the test set. The neutralization score of a generated sequence s' equals f(s') if it satisfies all the CDR sequence constraints; otherwise, the score is that of the original sequence. In addition, we pretrain each model on the SAbDab CDR-H3 sequences and evaluate its PPL on the CoVAbDab CDR-H3 sequences.

Results. All methods are pretrained on SAbDab antibodies and finetuned on CoVAbDab using the ITA algorithm to generate neutralizing antibodies. Our model outperforms the best baseline with a 3% increase in average neutralization score (Table 2). Our pretrained RefineGNN also achieves a much lower perplexity on CoVAbDab antibodies (7.86 vs. 8.67). Examples of generated CDR-H3 sequences and their predicted neutralization scores are shown in the appendix.

In this paper, we developed RefineGNN for antibody sequence and structure co-design. The advantage of our model over previous graph generation methods is its ability to revise a generated subgraph to accommodate the addition of new residues. Our approach significantly outperforms sequence-based and graph-based approaches on three antibody generation tasks.

A.1 REFINEGNN

Node features. Each node feature v_i encodes the three backbone dihedral angles (φ_i, ψ_i, ω_i) of residue i, computed from its backbone coordinates.

Edge features. The orientation matrix O_i = [b_i, n_i, b_i × n_i] defines a local coordinate system for each residue i (Ingraham et al., 2019). It is calculated as

    u_i = (x_{i,α} − x_{i−1,α}) / ‖x_{i,α} − x_{i−1,α}‖
    b_i = (u_i − u_{i+1}) / ‖u_i − u_{i+1}‖
    n_i = (u_i × u_{i+1}) / ‖u_i × u_{i+1}‖

Attention mechanism. The attention layer used in Eq. (13) is a standard bilinear attention:

    α_{t,j} = softmax_j( h_t^T W_a c_j ),    attend(h_t, c_{1:n}) = Σ_j α_{t,j} c_j
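This bilinear attention amounts to a few lines; in the sketch below, W_a is the learned bilinear map and the shapes are illustrative.

```python
import torch

def bilinear_attention(h, c, W):
    """Scores h_t^T W c_j, softmax over context positions j, then a convex
    combination of the context vectors. h: (n, d), c: (m, d), W: (d, d)."""
    scores = h @ W @ c.T                     # (n, m) bilinear scores
    alpha = torch.softmax(scores, dim=-1)    # normalize over the context
    return alpha @ c                         # (n, d) attended context vectors

h = torch.randn(10, 256)     # CDR residue representations
c = torch.randn(30, 256)     # GRU-encoded context vectors c_j
W = torch.randn(256, 256)
print(bilinear_attention(h, c, W).shape)     # torch.Size([10, 256])
```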
A.2 AR-GNN

AR-GNN generates an antibody graph autoregressively. In each generation step t, AR-GNN encodes the current subgraph G_{1:t}, induced by residues {s_1, ..., s_t}, into a list of vectors {g_1, ..., g_t}. For a fair comparison, we use the same MPN architecture for both RefineGNN and AR-GNN. In terms of structure prediction, AR-GNN first predicts the node feature v̂_{t+1} of the next residue t + 1, namely the dihedral angles over its three backbone atoms C_α, C, and N. In addition, AR-GNN predicts the pairwise distance between s_{t+1} and each previous residue s_1, ..., s_t using a feed-forward network FFN with one hidden layer, where E_pos is the positional encoding of t + 1 − i, the gap between residues s_{t+1} and s_i in the sequence. Lastly, AR-GNN predicts the amino acid type of residue s_{t+1} by

    p̂_{t+1} = softmax(W^a g_{t+1}),    {g_1, ..., g_{t+1}} = MPN_θ(G_{1:t+1})

Note that AR-GNN also uses two separate MPNs for structure and sequence prediction. However, unlike RefineGNN, AR-GNN is trained under teacher forcing: we need to feed it the ground truth structure and sequence in each generation step. In particular, we find data augmentation to be crucial for AR-GNN performance. Data augmentation is essential because of the discrepancy between training and testing: the model is trained under teacher forcing, but it must decode a graph without teacher forcing at test time, and mistakes made in previous steps have a great impact on subsequent predictions during decoding. Specifically, for every antibody s, we create a corrupted graph G̃ by adding independent random Gaussian noise to every coordinate: x̃_i = x_i + 3ε, ε ∼ N(0, I). In each generation step, we apply the MPN over the corrupted graph instead. The node and edge labels are still defined by the ground truth structure: letting v_t and d_{i,j} be the ground truth dihedral angles and pairwise distances calculated from the original, uncorrupted graph G, the AR-GNN loss combines the structure prediction losses on v_t and d_{i,j} with the sequence prediction cross entropy. Similar to RefineGNN, AR-GNN also uses an attention mechanism for conditional generation. Specifically, we concatenate the residue representations h_t and g_t from the MPN with context vectors learned from an attention layer.

Since SAbDab includes both bound and unbound structures, we removed all antigens and used the bound antibody structures for training; 65% of our training data came from bound-state structures. We included all data in our training set because the mismatch between bound and unbound structures is relatively small. In fact, Al Qaraghuli et al. (2020) studied eight antibodies and found that the RMSD between bound and unbound structures over VH domains is less than 0.7 Å on average.

RAbD configuration. We provide details of the de novo design setup of RosettaAntibodyDesign (RAbD) here. For each antibody in the test set, RAbD starts by randomly selecting a CDR from RAbD's internal database of known CDR structures. The chosen CDR-H3 sequence is required to have the same length as the original sequence, but it cannot be exactly the same as the original CDR-H3 sequence. After the initial CDR structure is chosen, RAbD grafts it onto the antibody and performs energy minimization to stabilize the structure. Next, RAbD runs 100 iterations of sequence design to modify the grafted CDR-H3 structure by randomly substituting amino acids. In each sequence design iteration, it performs energy minimization to adjust the structure according to the changed amino acid. Lastly, the model returns the generated CDR-H3 sequence with the lowest energy.

SARS-CoV-2 neutralization. Each generative model is pretrained on the SAbDab data to learn a prior distribution over CDR-H3 structures. Given a fixed predictor f, we use the ITA algorithm to finetune the pretrained models to generate neutralizing antibodies. Each model is trained for 3000 ITA steps with M = 100. Generated CDR-H3 sequences from our model are shown in Table 3. Our neutralization predictor f is trained on the CoVAbDab database. For simplicity, we only consider two viruses, SARS-CoV-1 and SARS-CoV-2, since other coronaviruses have very little training data. For the same reason, we only consider the spike protein receptor binding domain as the target epitope. The predictor is trained in a multi-task fashion to predict both SARS-CoV-1 and SARS-CoV-2 neutralization labels. The SRU encoder has a hidden dimension of 256. The model was trained with a dropout of 0.2, a learning rate of 0.0005, and a batch size of 16. The net charge of a sequence s_1 ... s_n is defined as Σ_i C(s_i), where C(s_i) is the charge of amino acid s_i.
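To illustrate the CDR constraints from Section 4.3 together with this net charge definition, here is a hypothetical checker of ours. The charge values C(·) are assumptions (the paper defers the exact definition to Raybould et al., 2019), and "repeat more than five times" is read strictly as six or more in a row; change `\1{5}` to `\1{4}` if five identical residues should already fail.

```python
import re

# Illustrative charge assignments: Arg/Lys +1, Asp/Glu -1, His slightly positive.
CHARGE = {"R": 1.0, "K": 1.0, "H": 0.1, "D": -1.0, "E": -1.0}

def net_charge(seq):
    """Net charge of a CDR sequence: sum_i C(s_i), with C(a) = 0 otherwise."""
    return sum(CHARGE.get(a, 0.0) for a in seq)

def passes_constraints(seq, perplexities, max_ppl=10.0):
    """Check the four CDR-H3 constraints: net charge in [-2, 2], no N-X-S/T
    glycosylation motif, no long single-residue repeats, and perplexity under
    all three models below the threshold."""
    if not -2.0 <= net_charge(seq) <= 2.0:
        return False
    if re.search(r"N.[ST]", seq):       # N-X-S/T motif
        return False
    if re.search(r"(.)\1{5}", seq):     # six or more identical residues in a row
        return False
    return all(p < max_ppl for p in perplexities)

print(passes_constraints("ARDRSTGGYFDY", perplexities=[8.4, 9.1, 7.9]))  # True
```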
References

Cellular and Molecular Immunology (e-book)
RosettaAntibodyDesign (RAbD): a general framework for computational antibody design
In silico proof of principle of machine learning-based antibody design at unconstrained scale
Antibody-protein binding and conformational changes: identifying allosteric signalling pathways to engineer a better effector response
Unified rational protein engineering with sequence-based deep representation learning
Accurate prediction of protein structures and interactions using a three-track neural network
Fold2Seq: a joint sequence (1D)-fold (3D) embedding-based generative model for protein design
Learning phrase representations using RNN encoder-decoder for statistical machine translation
SAbDab: the structural antibody database
Computational design of antibodies. Current Opinion in Structural Biology
Symmetry-adapted generation of 3D point sets for the targeted discovery of molecules
Graphite: iterative generative modeling of graphs
Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences
Learning protein structure with a differentiable simulator
Generative models for graph-based protein design
Hierarchical generation of molecular graphs using structural motifs
Multi-objective molecule generation using interpretable substructures
Highly accurate protein structure prediction with AlphaFold
A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography
De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks
AbDesign: an algorithm for combinatorial backbone design guided by natural conformations and sequences
Rosetta3: an object-oriented software suite for the simulation and design of macromolecules
IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains
When attention meets fast recurrence: training language models with reduced compute
OptMAVEn: a new framework for the de novo design of antibody variable region models targeting specific antigen epitopes
Learning deep generative models of graphs
Efficient graph generation with graph recurrent attention networks
Antibody complementarity determining region design using high-capacity machine learning
Constrained graph variational autoencoders for molecule design
Diffusion probabilistic models for 3D point cloud generation
SPIN2: predicting sequence profiles from protein structures using deep neural networks
OptCDR: a general computational method for the design of antibody complementarity determining regions for targeted epitope binding
Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody
Five computational developability guidelines for therapeutic antibody profiling
CoV-AbDab: the coronavirus antibody database
Antibody design using LSTM-based deep generative model from phage display library for affinity maturation
Learning gradient fields for molecular conformation generation
Protein design and variant prediction using autoregressive generative models
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Fast and flexible design of novel proteins using graph neural networks
Design of proteins presenting discontinuous functional sites using deep learning. bioRxiv
Improved protein structure prediction using predicted inter-residue orientations. Proceedings of the National Academy of Sciences
Improving molecular design by stochastic iterative target augmentation
GraphRNN: a deep generative model for graphs

We would like to thank Rachel Wu, Xiang Fu, Jason Yim, and Peter Mikhael for their valuable feedback on the manuscript. We also want to thank Nitan Shalon, Nicholas Webb, Jae Hyeon Lee, Qiu Yu, and Galit Alter for their suggestions on method development. We are grateful for the generous support of Mark and Lisa Schwartz, funding in the form of a research grant from Sanofi, the Defense Threat Reduction Agency (DTRA), the C3.ai Digital Transformation Institute, the Eric and Wendy Schmidt Center at the Broad Institute, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) Threats program, and the DARPA Accelerated Molecular Discovery program.