title: End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman
authors: Petti, Samantha; Bhattacharya, Nicholas; Rao, Roshan; Dauparas, Justas; Thomas, Neil; Zhou, Juannan; Rush, Alexander M.; Koo, Peter K.; Ovchinnikov, Sergey
date: 2022-04-18
journal: bioRxiv
DOI: 10.1101/2021.10.23.465204

Multiple Sequence Alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of relying on black-box methods for optimizing predictions of protein sequences.

Two other works infer HMM and/or Potts parameters through importance sampling [63] and message passing [44], with the goal of designing generative classifiers for protein homology search.

Classically, the alignment score a_ij is a fixed substitution score, such as a PAM score, for the pair of residues X_i and Y_j [3, 13, 24]. Such a score reflects how likely one amino acid is to be substituted for another, but it does not account for the context of each residue in the sequence. For example, consider serine, an amino acid that is both small and hydrophilic. In a water-facing part of a protein, serine is more likely to be substituted for other hydrophilic amino acids. In other contexts, serine may only be substituted for other small amino acids due to the geometric constraints of the protein fold. Employing a scoring function with convolutions allows this local context to be considered.

Our proposed learned alignment module (LAM) adaptively learns a context-dependent alignment score matrix a_ij and performs an alignment based on this score matrix, all in conjunction with a downstream machine learning task. The value a_ij expresses the similarity between X_i in the context of X_{i−w}, …, X_i, …, X_{i+w} and Y_j in the context of Y_{j−w}, …, Y_j, …, Y_{j+w}. We represent position i in sequence X as a vector v^X_i obtained by applying a convolutional layer of window size 2w + 1 to a one-hot encoding of X_i and its neighbors. The value a_ij in the similarity matrix that we input to Smith-Waterman is the dot product of the corresponding vectors, a_ij = v^X_i · v^Y_j.
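As a concrete sketch of this construction (our illustration, not the authors' released code), a single 1-D convolution over the one-hot sequence produces the context vectors, and a matrix product produces the similarity matrix; the names embed and similarity_matrix are our own.

```python
import jax

def embed(filters, seq_onehot):
    # seq_onehot: (L, A) one-hot sequence; filters: (2w + 1, A, D).
    # Returns (L, D): one context vector per position, each summarizing the
    # residue together with its w neighbors on either side ("SAME" padding).
    return jax.lax.conv_general_dilated(
        seq_onehot[None],                 # add batch dim -> (1, L, A)
        filters,
        window_strides=(1,),
        padding="SAME",
        dimension_numbers=("NWC", "WIO", "NWC"),
    )[0]

def similarity_matrix(filters, x_onehot, y_onehot):
    # a[i, j] = v^X_i . v^Y_j, the context-dependent match score.
    vx = embed(filters, x_onehot)         # (Lx, D)
    vy = embed(filters, y_onehot)         # (Ly, D)
    return vx @ vy.T                      # (Lx, Ly)
```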
To construct an MSA from a reference and B other sequences, the LAM computes a similarity matrix between each sequence and the reference, applies smooth Smith-Waterman to each matrix, and stacks the resulting pairwise alignments (Figure 1).

[Figure 1: Each sequence is mapped to vectors using a convolution. For each sequence k, an alignment score matrix a is computed by taking the dot products of the vectors representing the query sequence and the vectors representing sequence k. The similarity tensor is formed by concatenating these matrices, and then our differentiable implementation of smooth Smith-Waterman is applied to each similarity matrix in the tensor to produce an alignment. The resulting B smooth pairwise alignments (all aligned to the query sequence) are illustrated as the "Alignment Tensor."]

SMURF jointly learns an alignment and the parameters of a Markov Random Field (MRF) for unsupervised prediction of contacts. SMURF has two phases, each beginning with the LAM. First, BasicAlign learns LAM convolutions by minimizing the squared difference between each aligned sequence and the corresponding averaged MSA (Fig. S5). These convolutions are then used to initialize the LAM for the second training phase, TrainMRF, where a masked language modeling (MLM) objective is used to learn MRF parameters and update the convolutions (Fig. S6). We compare SMURF to GREMLIN trained with masked language modeling (MLM-GREMLIN) [9]. The architecture of MLM-GREMLIN is similar to the TrainMRF step of SMURF, except that a fixed alignment is input instead of a learned alignment computed by the LAM.

We trained and evaluated our model on a diverse set of protein families, as described in Methods. To assess the consistency of our learned alignments, we compared pairwise alignments extracted from our learned MSA to those extracted from the HHblits MSA. By this metric, we found that alignments learned by SMURF were more consistent than those from HHblits. Moreover, we observed a slightly positive correlation between increased consistency and contact prediction improvement (Fig. S7, left). We also found that SMURF alignments tend to have more positions aligned to the query (Fig. S7, right). We hypothesize that this is because our MRF does not have a mechanism to intelligently guess the identity of residues that are insertions with respect to the query sequence (the guess is uniform; see Methods).

Next, we applied SMURF to 17 non-coding RNA families from Rfam [31] that had a corresponding structure in the PDB (see Methods). Due to the relatively small number of RNAs with known 3D structures, we employed SMURF with the hyperparameters optimized for proteins; fine-tuning SMURF for RNA could improve performance. Overall, we observe that SMURF outperforms MLM-GREMLIN, with a median AUC improvement of 0.02 (Fig. 2b). In Supplementary Note S2, we further discuss the RNA contact predictions illustrated in Fig. 2b and the SMURF predictions for the three most and least improved protein families (Figs. S9 and S10). We hypothesize that SMURF generates fewer false positive predictions in seemingly random locations because the LAM finds better alignments.

Finally, we performed an ablation study on SMURF (Fig. S11). We found that replacing smooth Smith-Waterman with a differentiable "pseudo-alignment" procedure, similar to [7], degraded performance substantially. Skipping BasicAlign also degraded performance, indicating the importance of the initial convolutions found in BasicAlign.
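As a concrete illustration of the Figure 1 pipeline, the sketch below builds the similarity tensor and maps a smooth Smith-Waterman routine over it, reusing the embed function from the sketch above. The function names and the smooth_sw argument (any differentiable routine mapping a similarity matrix to an expected-alignment matrix, as constructed in Methods) are our assumptions, not the released SMURF API.

```python
import jax
import jax.numpy as jnp

def alignment_tensor(filters, query_onehot, seqs_onehot, smooth_sw):
    # query_onehot: (Lq, A); seqs_onehot: (B, L, A), padded to a common length.
    vq = embed(filters, query_onehot)                        # (Lq, D)
    vs = jax.vmap(lambda s: embed(filters, s))(seqs_onehot)  # (B, L, D)
    # Similarity tensor: one score matrix per sequence, all against the query.
    sims = jnp.einsum("id,bjd->bij", vq, vs)                 # (B, Lq, L)
    # Smooth Smith-Waterman on each matrix yields B soft pairwise alignments,
    # all in the coordinate frame of the query (the "Alignment Tensor").
    return jax.vmap(smooth_sw)(sims)                         # (B, Lq, L)
```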
Using backprop through AlphaFold to learn alignments with LAM

As a proof of concept, we selected four CASP14 domains for which the structure prediction quality from AlphaFold was especially sensitive to how the MSA was constructed. We reasoned that the quality was poor due to issues in the MSA, and that by realigning the sequences using AlphaFold's confidence metrics we might be able to improve the prediction quality.

We also investigated whether it was easier or harder to obtain a "near optimal" structure prediction (having an RMSD within 1.25 times the RMSD of the prediction from the learned MSA on the full set) with the restricted set of sequences as compared to the full set. For T1064-D1, our optimization scheme found "near optimal" structures more often with the set of sequences that includes the distant sequences. The opposite was the case for T1039-D1, and there was no strong difference for T1070-D1 (Fig. S13b).

Discussion

In this work we explored incorporating alignment into a pipeline that can be trained end-to-end without using any existing alignment software or ground-truth alignments. With SMURF, we trained alignments jointly with a well-understood MRF contact prediction approach and found mild improvements in accuracy using learned MSAs that were consistent and reasonable. When we instead optimized with AlphaFold's confidence metrics, we found low-quality MSAs that nonetheless yielded improved structure predictions. This suggests that in order to learn high-quality alignments in the context of another machine learning task, the task must require high-quality alignments, which we discovered is not the case for structure prediction with AlphaFold. Perhaps by changing our objective function to also penalize self-inconsistent alignments, we could learn more reasonable MSAs while still improving structure prediction. Our approach could also be used to identify a range of alignments that yield a particular prediction. Studying these alignments could provide insight into which aspects of an alignment AlphaFold uses to make its prediction.

Our smooth Smith-Waterman implementation is designed to be usable and efficient, and we hope it will enable experimentation with alignment modules in other applications of machine learning to biological sequences. There is ample opportunity for future work to systematically compare architectures for the scoring function in smooth Smith-Waterman. The use of convolutions led to relatively simple training dynamics, but other inductive biases induced by recurrent networks, attention mechanisms, or hand-crafted architectures could capture other signals important for alignment scoring. We also hope that the use of these more powerful scoring functions enables applications in remote homology search, structure prediction, and studies of protein evolution.

Methods

Pairwise sequence alignment can be formulated as the task of finding the highest-scoring path through a directed graph in which edges correspond to an alignment of two particular residues or to a gap. The edge weights are match scores for the corresponding residues or the gap penalty, and the score of a path is the sum of the edge weights. The Smith-Waterman algorithm is a dynamic programming algorithm that returns a path with the maximal score.
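For reference, the textbook hard-max recurrence looks as follows. This sketch is our code (linear gap penalty, scores only, no traceback); it is the recursion that the smooth version below relaxes.

```python
import numpy as np

def smith_waterman_score(a, gap=1.0):
    # a[i, j]: match score for aligning x_{i+1} with y_{j+1}; linear gap penalty.
    lx, ly = a.shape
    h = np.zeros((lx + 1, ly + 1))   # h[i, j]: best local alignment ending at (i, j)
    for i in range(1, lx + 1):
        for j in range(1, ly + 1):
            h[i, j] = max(
                0.0,                                # start a fresh local alignment
                h[i - 1, j - 1] + a[i - 1, j - 1],  # align x_i with y_j
                h[i - 1, j] - gap,                  # gap in y
                h[i, j - 1] - gap,                  # gap in x
            )
    return h.max()                   # score of the best local alignment
```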
[Figure: example alignments of homologous sequences to a query, shown as gapped sequence text; not reproduced here.]

A smooth version instead finds a probability distribution over paths in which higher-scoring paths are more likely. We describe a smooth Smith-Waterman formulation in which the probability that any pair of residues is aligned can be formulated as a derivative.

[Figure: the alignment graph for two sequences, with source s, sink t, and grid vertices such as (x_1, y_1), (x_2, y_1), (x_3, y_1).] The vertex set contains grid vertices v_ij for 0 ≤ i ≤ ℓ_x and 0 ≤ j ≤ ℓ_y, a source s, and a sink t. The directed edges are defined so that each path from s to t corresponds to a local alignment of the sequences: match edges carry the corresponding similarity score a_ij, and gap edges carry the gap penalty g.

The Smith-Waterman algorithm iteratively computes the highest score of a path ending at each vertex and returns the highest-scoring path ending at t. For grid vertices this simplifies to

$$ f(v_{ij}) = \max\big\{\, 0,\; f(v_{i-1,j-1}) + a_{ij},\; f(v_{i-1,j}) + g,\; f(v_{i,j-1}) + g \,\big\}. $$

A path with the highest score is computed by starting at the sink t and tracing backward through the recursion. To smooth the procedure, each maximum is replaced with a log-sum-exp, yielding smoothed scores

$$ f_S(v) = \log \sum_{u : (u,v) \in E} \exp\big( f_S(u) + w_{(u,v)} \big). \tag{1} $$

We use these smoothed scores and the edge weights to define a probability distribution over paths in G, or equivalently local alignments.

Definition 1. Given an alignment graph G = (E, V), define a random walk starting at vertex t that traverses edges of G in reverse direction according to transition probabilities p(u | v) = exp(f_S(u) + w_{(u,v)} − f_S(v)) and ends at the absorbing vertex s. Let µ_G be the probability distribution over local alignments in which the probability of an alignment A is equal to the probability that the random walk follows the reverse of the path in G corresponding to A.

Under the distribution µ_G, the probability that residues x_i and y_j are aligned can be formulated as a derivative. Mensch and Blondel describe this relationship in generality for differentiable dynamic programming on directed acyclic graphs [38]. We state their result as it pertains to our context and provide a proof in our notation in Supplementary Section S1 A.

Proposition 1 (Proposition 3 of [38]). Let G be an alignment graph and µ_G be the corresponding probability distribution over alignments. Then

$$ \mathbb{P}_{A \sim \mu_G}\big( x_i \text{ and } y_j \text{ are aligned} \big) = \frac{\partial f_S(t)}{\partial a_{ij}}. $$
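The following is a minimal, unoptimized sketch of this construction (our code, not the authors' released implementation): a linear gap penalty, no restrict-turns option, and a plain Python loop in place of the vectorized anti-diagonal scan described in the implementation notes below. The forward pass computes the smoothed score; by Proposition 1, differentiating it with respect to the similarity matrix yields the expected alignment.

```python
import jax
import jax.numpy as jnp

def smooth_sw_score(a, gap=-1.0, temp=1.0):
    # Smoothed Smith-Waterman score: each max in the recursion is replaced
    # by a (temperature-scaled) log-sum-exp, as in Equation (1).
    lx, ly = a.shape
    f = jnp.full((lx + 1, ly + 1), -1e9)  # f[i, j] ~ f_S(v_ij); -1e9 ~ "no path"
    for i in range(1, lx + 1):
        for j in range(1, ly + 1):
            incoming = jnp.array([
                0.0,                                 # edge from the source s
                f[i - 1, j - 1] + a[i - 1, j - 1],   # match edge, weight a_ij
                f[i - 1, j] + gap,                   # gap edge
                f[i, j - 1] + gap,                   # gap edge
            ])
            f = f.at[i, j].set(temp * jax.nn.logsumexp(incoming / temp))
    # f_S(t): every grid vertex connects to the sink t with weight 0.
    return temp * jax.nn.logsumexp(f[1:, 1:].ravel() / temp)

# Proposition 1: the gradient with respect to a[i, j] is the probability that
# x_i and y_j are aligned, so the gradient matrix is the expected alignment.
expected_alignment = jax.grad(smooth_sw_score)
```

A practical implementation computes the recursion along anti-diagonals so that the whole sweep vectorizes on accelerators; that is the rotation trick described in the implementation notes below.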
GREMLIN is a probabilistic model of protein variation that uses the MSA of a protein family to estimate parameters of an MRF of the form

$$ P(x) \propto \exp\Big( \sum_{i=1}^{\ell} v_i(x_i) + \sum_{1 \le i < j \le \ell} w_{ij}(x_i, x_j) \Big), $$

where ℓ is the number of columns in the MSA, v_i represents the amino acid propensities at position i, and w_ij represents the pairwise couplings between positions i and j.

For each non-coding RNA, we aligned the RNA sequence in the PDB along with the corresponding Rfam sequences to an appropriate Rfam covariance model using Infernal [45]. We then analyzed these sequences using the same procedure outlined for proteins.

BasicAlign. Similarity matrices produced by randomly initialized convolutions will produce chaotic alignments that are difficult for the downstream MRF to learn from. The purpose of BasicAlign is to learn initial convolutions whose induced similarity matrices yield alignments with relatively homogeneous columns (see Figure S5). The input to BasicAlign is a random subsample of sequences from the family, S^(1), …, S^(B). The averaged MSA M is

$$ M_{ix} = \frac{1}{B} \sum_{k=1}^{B} \sum_{j=1}^{\ell_k} p^k_{ij}\, \mathbf{1}\big\{ S^{(k)}_j = x \big\}, \tag{3} $$

where ℓ_k is the length of S^(k) and p^k_ij is the probability that position i of S^(1) is aligned to position j of S^(k) under the smooth Smith-Waterman alignment. (Note that Σ_x M_ix is less than one when there are sequences with a gap aligned to position i of S^(1).) The BasicAlign loss is computed by taking the squared difference between each aligned one-hot encoded sequence and the averaged MSA,

$$ \mathcal{L} = \sum_{k=1}^{B} \sum_{i} \Big\| \sum_{j} p^k_{ij}\, \mathrm{onehot}\big(S^{(k)}_j\big) - M_i \Big\|_2^2. \tag{4} $$

TrainMRF. In TrainMRF, masked language modeling is used to learn the MRF parameters and further adjust the alignment module convolutions (see Figure S6). The input to TrainMRF is a batch of sequences from the family, each of which is aligned to the query via the learned alignment module (as described in Figure 1). The parameters for the alignment module are initialized from BasicAlign, and the query is initialized as the one-hot encoded reference sequence for the family. The MRF has two sets of parameters: symmetric matrices w_ij ∈ R^{A×A} for 1 ≤ i, j ≤ ℓ and vectors b_j ∈ R^A for 1 ≤ j ≤ ℓ. For each masked position j, we first compute a score for each residue x that is equal to the expected value (under the smooth alignment) of the terms of the function E(·; b, w) specific to position j or involving position j and an unmasked position. Then we compute the infill distribution by taking the softmax. Formally,

$$ \mathrm{score}_j(x) = \mathbb{E}\Big[\, b_j(x) + \sum_{i \,\text{unmasked}} w_{ij}(x_i, x) \Big], \qquad \mathrm{infill}_j(x) = \frac{\exp\big(\mathrm{score}_j(x)\big)}{\sum_{x'} \exp\big(\mathrm{score}_j(x')\big)}, $$

where the expectation is taken over the smooth alignment. The infill distribution is an approximation of how likely each residue is to be present at position j given the unmasked positions of the sequence. (The model has no mechanism to intelligently infill residues at positions that are insertions with respect to the query sequence. Ultimately, this is not a concern since we do not use information about insertions to predict the contacts of the query sequence.)

We train the network using a cross entropy loss and L2 regularization on w and b with λ = 0.01. After each iteration, the query is updated to reflect the inferred MSA. Let R be the one-hot encoding of the reference sequence. We define C_{i+1} as a rolling weighted average of the MSAs learned through iteration i and Q_i as the query for iteration i, where M_i is the averaged MSA computed as described in Equation (3) (see Figure S6). Preliminary results on the training set had suggested that updating the query in this manner improved results for some families. However, the ablation study on the test set does not suggest improvement (Fig. S11); further investigation is needed to determine the benefits of changing the query between iterations.

Once training is complete, we use w to assign a contact prediction score between each pair of positions. The score c_ij measures the pairwise interaction between positions i and j, and c̃_ij is the score after applying the APC correction [14]. In preliminary experiments on the training set, we found that learning rate 0.05 was optimal overall. We also ran 4000 iterations of MLM-GREMLIN with predetermined optimal parameters.
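As a sketch of how Equations (3) and (4) combine (our code and shapes, not the released implementation; it assumes the B sequences are padded to a common length and that p stacks the expected alignments produced by smooth Smith-Waterman):

```python
import jax.numpy as jnp

def basicalign_loss(p, seqs_onehot):
    # p: (B, L1, L); p[k, i, j] = probability that position i of the first
    # sequence aligns to position j of sequence k.
    # seqs_onehot: (B, L, A) one-hot encodings.
    aligned = jnp.einsum("bij,bja->bia", p, seqs_onehot)  # each sequence, mapped
                                                          # into the reference frame
    M = aligned.mean(axis=0)            # averaged MSA, as in Equation (3);
                                        # rows sum to < 1 where gaps align to i
    return jnp.sum((aligned - M) ** 2)  # squared difference, as in Equation (4)
```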
Acknowledgments. We thank Sean Eddy for pointing out the need for a restrict-turns feature and for useful comments on a draft. We thank Jake VanderPlas for supplying JAX code that efficiently rotates a matrix (as in Figure S3b).

Proposition 2. Let G be an alignment graph. With respect to the random walk described in Definition 1, for every vertex v,

$$ \mathbb{P}( v \text{ is visited} ) = \frac{\partial f_S(t)}{\partial f_S(v)}. $$

Proof. Let u_1, …, u_n denote the vertices of G in a reverse topological order. We prove the statement by induction with respect to this order. Note u_1 = t, and P(t is visited) = ∂f_S(t)/∂f_S(t) = 1. Assume that for all 1 ≤ i ≤ j, P(u_i is visited) = ∂f_S(t)/∂f_S(u_i). Observe

$$ \mathbb{P}( u_{j+1} \text{ is visited} ) = \sum_{i=1}^{j} \mathbb{P}( u_i \text{ is visited} )\, p( u_{j+1} \mid u_i ) = \sum_{i=1}^{j} \frac{\partial f_S(t)}{\partial f_S(u_i)} \frac{\partial f_S(u_i)}{\partial f_S(u_{j+1})} = \frac{\partial f_S(t)}{\partial f_S(u_{j+1})}, $$

where in the second equality we apply the inductive hypothesis together with the identity p(u | v) = exp(f_S(u) + w_{(u,v)} − f_S(v)) = ∂f_S(v)/∂f_S(u), and in the third we apply the chain rule.

Proof of Proposition 1. It suffices to show that for each directed edge (u, v), the probability that the edge is traversed equals ∂f_S(t)/∂w_{(u,v)}, where the traversal occurs from v to u in the random walk. Observe

$$ \mathbb{P}\big( (u, v) \text{ is traversed} \big) = \mathbb{P}( v \text{ is visited} )\, p( u \mid v ) = \frac{\partial f_S(t)}{\partial f_S(v)} \frac{\partial f_S(v)}{\partial w_{(u,v)}} = \frac{\partial f_S(t)}{\partial w_{(u,v)}}, $$

by Proposition 2, the identity p(u | v) = ∂f_S(v)/∂w_{(u,v)}, and the chain rule. The match score a_{i,j} appears in the weight of one edge only, and so ∂f_S(t)/∂a_{i,j} is high only when the dominant alignment path uses the edge corresponding to a match. Proposition 1 establishes that ∂f_S(t)/∂a_{i,j} is equal to the probability that X_i and Y_j are aligned, so our output is an expected alignment.

To vectorize the recursion, we compute the values f_S(v_{i,j}) along each diagonal in tandem, i.e. all (i, j) such that i + j = d (see Figure S3a). To implement this, we rotate the matrix that stores the values f_S(v_{i,j}) by 90 degrees so that each diagonal corresponds to a row (see Figure S3b). In the rotated matrix, the values in each row can then be computed in a single vectorized step from the two previous rows.

Our smooth Smith-Waterman implementation has the following four additional options.

a. Temperature parameter. The temperature parameter T controls the extent to which the probability distribution over alignments is concentrated on the most likely alignments; higher temperatures yield less concentrated distributions. We compute the smoothed score for a vertex v as

$$ f_S(v) = T \log \sum_{u : (u,v) \in E} \exp\big( (f_S(u) + w_{(u,v)}) / T \big), $$

which matches Equation (1) at the default T = 1.

c. Restrict turns. When an unmatched stretch in one sequence is followed by an unmatched stretch in the other, the same alignment corresponds to more than one path through the graph, biasing the distribution toward such alignments. To remove this bias, we implemented a "restrict turns" option that forbids unmatched stretches in the X sequence from following an unmatched stretch in the Y sequence. When implemented with this restrict-turns option, smooth Smith-Waterman will find exactly one path through the dark green and black segments in Figure S4: the path highlighted in red. Due to the increased memory requirement of the restrict-turns option, we did not utilize it in SMURF.

d. Global alignment. We also implement the Needleman-Wunsch algorithm, which outputs global alignments rather than local alignments.

[Figure, panel (c): fraction of learned MSAs that yielded predictions with "near optimal" structure.]

References

Optimizing amino acid substitution matrices with a local alignment kernel.
Clustal Omega. Current Protocols in Bioinformatics.
Identification of common molecular subsequences.
HH-suite3 for fast remote homology detection and deep protein annotation.
Learning to align with differentiable dynamic programming.
Predicting the clinical impact of human mutation with deep neural networks.
Intriguing properties of neural networks.
Co-evolutionary fitness landscapes for sequence design.
Georg Martius and Michal Rolínek. Differentiation of blackbox combinatorial solvers.
Remote homology search with hidden Potts models.
Using video-oriented instructions to speed up sequence comparison.