key: cord-0898057-0vo1rlik authors: Hie, Brian L.; Yang, Kevin K.; Kim, Peter S. title: Evolutionary velocity with protein language models date: 2021-06-07 journal: bioRxiv DOI: 10.1101/2021.06.07.447389 sha: 034833f04b0e55ac84f2c5cf686f0a795f4d3b3c doc_id: 898057 cord_uid: 0vo1rlik Predicting the order of biological homologs is a fundamental task in evolutionary biology. For protein evolution, this order is often determined by first arranging sequences into a phylogenetic tree, which has limiting assumptions and can suffer from substantial ambiguity. Here, we demonstrate how machine learning algorithms called language models can learn mutational likelihoods that predict the directionality of evolution, thereby enabling phylogenetic analysis that addresses key limitations of existing methods. Our main conceptual advance is to construct a “vector field” of protein evolution through local evolutionary predictions that we refer to as evolutionary velocity (evo-velocity). We show that evo-velocity can successfully predict evolutionary order at vastly different timescales, from viral proteins evolving over years to eukaryotic proteins evolving over geologic eons. Evo-velocity also yields new evolutionary insights, predicting strategies of viral-host immune escape, resolving conflicting theories on the evolution of serpins, and revealing a key role of horizontal gene transfer in the evolution of eukaryotic glycolysis. In doing so, our work suggests that language models can learn sufficient rules of natural protein evolution to enable evolutionary predictability. Predicting evolutionary order has diverse applications that range from tracing the 2 progression of viral outbreaks to understanding the history of life on earth [1] - [6] . For protein 3 evolution, this prediction is often based on reconstructing and rooting phylogenetic trees of 4 protein sequences [7] . While useful, ordering sequences based on a phylogenetic tree has a 5 number of limiting assumptions; for example, determining the root of the tree can drastically 6 alter the predicted order [8] , but beyond the strictest assumptions, determining this root requires 7 manual expertise or external evidence (for example, based on known sampling times or the fossil 8 record), which may not always be available [8] , [9] . 9 Here, we propose a novel approach to analyzing and ordering the trajectories of protein 10 evolution that we refer to as "evolutionary velocity," or "evo-velocity." Evo-velocity is 11 conceptually inspired by work in theoretical biology that understands evolution as a path that 12 traverses a "fitness landscape" based on locally optimal decisions [2] - [4] , [10] - [12] . Our key 13 conceptual advance is that by learning the rules underlying local evolution, we can construct a 14 global evolutionary "vector field" that we can then use to: (i) predict the root (or potentially 15 multiple roots) of observed evolutionary trajectories, (ii) order protein sequences in evolutionary 16 time, and (iii) identify the mutational strategies that drive these trajectories. 17 To make local evolutionary predictions, we leverage recent advances in the ability of 18 machine learning algorithms called language models to predict the effects of single-residue 19 mutations on biological fitness when trained on natural sequence variation alone [13] - [16] . Thus 20 far, however, language models have only been applied to modeling local evolution, such as 21 single-residue mutations, rather than more complex changes that occur over long evolutionary 22 trajectories. 23 Evo-velocity is aimed at closing the gap between landscape-based evolutionary theory 24 [2] , [3] , [10] and the analysis of evolutionary trajectories observed in nature. Our algorithm is 25 general (we use a single model for all proteins), does not have many of the assumptions typical 26 of phylogenetic methods (for example, evo-velocity can produce multiple roots or model 27 convergent evolution), and requires sequence data alone. We use evo-velocity to analyze protein 28 evolution across a breadth of organisms and evolutionary timescales-from the evolution of viral 29 proteins over the course of years to the evolution of enzymes across all three domains of life- 30 suggesting how we might expand our ability to understand and predict evolution. 31 Overview of language models and evo-velocity 33 Our approach is based on the premise that evolution occurs through locally optimal 34 changes that preserve or enhance evolutionary fitness, which has theoretical precedent in the 35 concept of a path through a fitness landscape [2] , [10] . In theory, predicting local evolution 36 should therefore provide insight into global evolution as well ( Figure 1A) . 37 To predict the local rules of evolution, we leverage protein language models, which learn 38 the likelihood that a particular amino acid residue appears within a given sequence context 39 ( Figure 1B) . When trained on large corpuses of natural sequences, this language model 40 likelihood is a strong correlate of the effects of mutations on various notions of protein fitness. 41 For example, the ESM-1b language model [15] , trained on ~3 million sequences in the UniRef50 provide detailed methodology in Methods. Intuitively, the local predictions of language models 66 assign a "velocity" to pairs of sequences that we assemble into an evolutionary "vector field" 67 [27] . In this paper, we implement evo-velocity with a single masked language model, ESM-1b, 68 but our framework can readily generalize to other implementations as well. 69 Evo-velocity of influenza A nucleoprotein 70 As initial validation, we used evo-velocity to reconstruct the evolution of the 71 nucleoprotein (NP) of influenza A virus. NP is an excellent evolutionary test case since its 72 sequence evolution is densely sampled through influenza viral surveillance and it undergoes 73 natural selection in the form of host immune pressure, but is less mutable than other viral 74 proteins with a mutation rate of about one amino-acid residue per year [28] . We obtained 3,304 75 complete NP sequences sampled from human hosts, constructed the sequence similarity network, 76 and computed evo-velocity scores. When we visualized this network in two dimensions [24] , we 77 observed phylogenetic structure corresponding to both the sampling year and influenza subtype 78 (Figures 2A and S1A) . Strikingly, the evo-velocity flow through the network (Methods) 79 corresponded to the known temporal evolution of NP (Figure 2A) . 80 Since visualizing this flow in two dimensions can be prone to information loss or 81 distortion through dimensionality reduction [27] , we sought to further quantify the relationship 82 between evo-velocity and NP evolution. We first verified that, on average, the evo-velocity 83 scores of the individual network edges increase along with greater differences in sampling time 84 (Figure S1B). We then quantified global evo-velocity patterns using a diffusion analysis to 85 estimate the network's roots (Methods). Interestingly, the evo-velocity-inferred root sequences 86 corresponded to the main species-crossover events in influenza history (Figure 2B) , suggesting 87 that our analysis accurately inferred the evolutionary origins of NP as observed in human hosts. 88 We then used these roots to order sequences according to evo-velocity pseudotime (Methods) 89 and observed a significant correlation between pseudotime and known sampling date (Spearman 90 r = 0.49, two-sided t-distribution P = 4 × 10 -197 ) ( Figure 2C) . We also observed that a well-91 characterized phylogenetic path of NP [28] progressed, on average, in the same direction as the 92 evo-velocity gradient (Figure 2A ,C) and agreed with simulated paths generated by random 93 walks across our evo-velocity landscape (Figure 2D ; Methods). 94 When comparing our evo-velocity landscape to a standard phylogenetic tree, we observed 95 that evo-velocity can model more complex evolutionary relationships. For example, a midpoint-96 rooted phylogenetic tree of all NP sequences (Methods) visually suggests that the H5N1-and 97 H7N9-subtype sequences branch off from H1N1 ( Figure 2E) . Instead, evo-velocity predicts an 98 independent origin of H5N1/H7N9 (Figure 2C ,F), consistent with epidemiological data 99 indicating recent zoonotic crossover of H5 and H7 avian influenza [29] . Evo-velocity also 100 predicts that the observed similarity of H5N1/H7N9 and H1N1 NP sequences sampled in human 101 hosts is due to convergent evolution ( Figure 2F ), which is challenging to explicitly represent 102 with a phylogenetic tree. 103 We next sought to use our evo-velocity landscape to provide new insight into NP 104 evolution. We therefore identified the mutations that corresponded to the strongest changes in the 105 evo-velocity scores (Methods). Of the top five such mutations in NP, all are present in 106 experimentally-validated human T-cell epitopes and one of these mutations, M374I, is located in 107 the most well-characterized linear NP epitope in the Immune Epitope Research Database (IEDB) 108 [30] (Figures 2G, S1C , and Table S2 ). Moreover, all five mutations involve a single-nucleotide 109 substitution resulting in a methionine changed to a hydrophobic or polar-uncharged amino acid 110 residue, suggesting a possible T-cell escape strategy that has recurred in multiple NP epitopes 111 throughout history (Figures 2G and S1C) . 112 All NP sequences in our analysis belong to a single UniRef50 sequence cluster [17] for 113 which the representative sequence is from a 1934 H1N1 virus ( Figure S1D ). We found that 114 similarity to sequences present in UniRef50, the ESM-1b training dataset, does not explain evo-115 velocity pseudotime (Table S3 ; Methods). We also found that computing evo-velocity scores 116 with a smaller language model, TAPE [14] , trained with a different model architecture on the 117 Pfam database of protein families [31] , closely reproduced the ESM-1b evo-velocity results 118 (Spearman r = 0.93, two-sided t-distribution P < 1 × 10 -308 ) ( Table S4 and Figure S1E ,F). Using 119 simpler evolutionary scores to compute velocities or using binary sequence embeddings also 120 largely reproduced the ESM-1b results, though with weaker temporal correlation (Figure S1G Table S5 ; Methods). Together, these results suggest that our evo-velocity results are not 122 explained by trivial language model preference to UniRef50. We also found that evo-velocity 123 pseudotime was not explained by variation in sequence length (Table S6) . 124 Evo-velocity was therefore able to reconstruct the direction of NP evolution without any 125 explicit knowledge of influenza subtype or when the NP sequences were sampled. Moreover, we 126 found that the generic rules learned by large language models were sufficient to predict the 127 evolution of a specific protein. 128 Evo-velocity of viral proteins 129 Given the promising results for NP, we were therefore interested in seeing if evo-velocity 130 could generalize to other viral proteins as well. We next analyzed the evolution of influenza A 131 hemagglutinin (HA), a more variable protein on the viral surface responsible for viral-host 132 membrane fusion [32] . As with NP, evo-velocity analysis of 8,115 HA sequences recovered 133 roots corresponding to the known origins of HA H1 in humans from 1918 and 2009 H1N1 134 pandemics, and evo-velocity pseudotime was strongly correlated with sampling date (Spearman r 135 = 0.63, two-sided t-distribution P < 1 × 10 -308 ) (Figure 3A,B) . Despite the higher sequence 136 variability of HA than NP, evo-velocity was still able to reconstruct the trajectory and 137 directionality of HA evolution. 138 As with NP, our HA pseudotime results were not explained by sequence similarity to the 139 training dataset ( Figure S2A and Table S3). We were also able to use TAPE-based velocities to 140 identify similar root regions in the post-2009 pandemic trajectory, but TAPE had a more difficult 141 time identifying the 1918 sequences as oldest, most likely due to TAPE's smaller model size and 142 less capable mutational effect predictions (Figures 1C, S2B -D, and Table S4 ). 143 We next analyzed the evolution of the group specific antigen (Gag) polyprotein of human (Table S3 ) and was also reproducible using 155 TAPE-based velocities ( Figure S2G and Table S4 ). 156 We next applied our algorithm to analyze 46,986 sequences of the Spike glycoprotein of variants-of-concern [34] , as later in pseudotime (Figures 3E-G) . Despite a shorter evolutionary 162 timescale, evo-velocity pseudotime and sampling date still had a Spearman correlation of 0.41 163 (two-sided t-distribution P < 1 × 10 -308 ). We also note that SARS-CoV-2 Spike evolution 164 occurred outside of the temporal range associated with both language model training datasets and 165 we were also able to reproduce the results with TAPE-based evo-velocity ( Figure S2H and 166 Table S4 ). 167 Across these four viral proteins, therefore, evo-velocity was able to reconstruct the 168 directionality of evolution consistent with known trajectories. Importantly, all of our analysis 169 was based on a single model that was trained without explicit knowledge of viral sampling date, 170 subtype, or protein-specific sequence variation. 171 Evo-velocity of eukaryotic proteins 172 After validating our approach with known viral trajectories, we wanted to see if evo-173 velocity could generalize to longer trajectories, such as protein evolution that spans multiple 174 species. Though we have access only to extant sequences, we hypothesized that evo-velocity 175 might still provide useful orderings if some extant sequences are closer to the ancestral sequence 176 than others. As an initial test case, we analyzed the globin protein family due to its extensive 177 phylogenetic characterization [35] , including laboratory reconstruction of ancestral 178 intermediates, that we can use to validate our model ( Figure 4A) . 179 The landscape of 6,097 eukaryotic globin sequences forms a branching trajectory with 180 three major divisions corresponding to myoglobin, alpha hemoglobin, and beta hemoglobin 181 ( Figure 4B ). The predicted root region lies in the part of the landscape closest to neuroglobin 182 and cytoglobin (Figure S3A,B) . Of the major classes of globins, neuroglobin is estimated to be 183 earliest in pseudotime while the alpha (Hbα) and beta (Hbβ) subunits of hemoglobin occur last in 184 pseudotime ( Figure 4C ), consistent with a previous analysis of globin phylogeny by Pillai et al. 185 [35] (Figure 4A) . These results are also reproducible when using TAPE to compute the evo-186 velocity scores (Figure S3C ,D and Table S4 ) and when controlling for sequence similarity to 187 the training dataset ( Figure S3D and Table S3 ; Methods). 188 Previous work [35] has also reconstructed ancestral globins that are confirmed to be 189 viable oxygen binders and that progress from a monomeric myoglobin/hemoglobin ancestor 190 (AncMH) to a dimeric alpha/beta hemoglobin ancestor (Ancα/β) to a tetramer formed by 191 separate alpha and beta hemoglobin ancestors (Ancα and Ancβ, respectively) ( Figure 4A ). 192 Consistent with evo-velocity increasing over evolutionary time, the ESM-1b language model 193 likelihood, on average, increases from AncMH to extant myoglobin and hemoglobin sequences, 194 but this improvement diminishes for more proximal ancestors ( Figure S3E ). Together, our 195 globin results suggest that evo-velocity pseudotime within a protein family can recover ordering 196 relationships over longer evolutionary timescales. 197 To further test this hypothesis, we analyzed 2,128 sequences of cytochrome c, a well-198 studied protein in evolutionary biology due to its high sequence conservation among most 199 eukaryotes [36] . When visualized, the sequence similarity network combined with evo-velocity Table S4 ) and when 206 controlling for sequence similarity to the training dataset ( Figure S4D and Table S3 ). In total, 207 therefore, our analysis of well-studied eukaryotic protein families demonstrates that evo-velocity 208 can generalize to protein evolution at much longer timescales. 209 Evo-velocity of ancient evolution 210 After validating that evo-velocity could reconstruct longer trajectories of protein 211 evolution, we applied evo-velocity to highly-conserved proteins, which often have substantial 212 evolutionary uncertainty [6] , to yield new insight into ancient evolution. A protein family with 213 considerable evolutionary uncertainty is that of the serine protease inhibitors, or serpins [38] , 214 [39] . Unlike most highly-conserved families, in which most of the diversity is bacterial, most of 215 the diversity among serpins is eukaryotic, which we likewise observe in our landscape of 22,737 216 serpin sequences (Figure 5A,B) . This has led to conflicting theories as to whether serpins indeed 217 have a phylogenetic root in eukaryotes, with prokaryotes acquiring serpins via horizontal gene 218 transfer (HGT), or if this root is an artifact of greater eukaryotic diversity biasing phylogenetic 219 root estimation [38] - [40] . Since evo-velocity is not prone to the same bias when estimating 220 roots, we used evo-velocity to order serpin sequences in pseudotime and found that the main 221 predicted root region was located among the prokaryotes (Figures 5B,C and S5A). These results, 222 along with the uncertain mechanism of eukaryotic-to-prokaryotic HGT [40] , provide strong 223 evidence that serpin evolution follows a more canonical trajectory. 224 We next analyzed two of the most conserved glycolytic enzymes, enolase and 225 phosphoglycerate kinase (PGK) [41] - [43] . The landscape of 31,901 sequences from the enolase 226 family shows a clear evo-velocity-predicted root region located in bacterial and archaeal 227 sequences ( Figures 5D and S5B,C) . Archaea are also oldest in pseudotime and eukaryota are 228 newest, with bacteria showing considerable pseudotemporal variation (Figures 5E and S5C) . 229 The landscape of 30,455 PGK sequences has a similar origin in a region with bacterial and 230 archaeal sequences (Figures 5F and S5D ,E), though with more pseudotemporal variation among 231 archaeal PGK (Figures 5G and S5E) . 232 The largest difference between the enolase and PGK landscapes lies in the location of Figure 5H ) and are consistent with HGT contributing to a mixture of archaeal 238 and bacterial genes in the last eukaryotic common ancestor [6] . These results are also consistent 239 with a component-wise evolution of glycolysis [41] , rather than the pathway being inherited in 240 totality from a single organism. 241 In all three highly conserved proteins that we tested, we were able to reproduce evo-242 velocity pseudotime even when explicitly controlling for sequence similarity to the training 243 dataset ( Figure S5A ,H and Table S3 ) and when using TAPE to compute the evo-velocity scores 244 ( Figure S5A ,H and Table S4 ). Variability in sequence length did not explain evo-velocity 245 pseudotime (Table S6) . Moreover, the direction of the evo-velocity gradient is not explained by 246 trivial training set bias toward eukaryotes, as most of the sequences in UniRef50 are bacterial 247 (Table S1), and we emphasize that no explicit taxonomic information was provided to our 248 algorithm. Rather, our results suggest that evo-velocity can provide insight into evolution at the 249 longest evolutionary timescales. The degree to which evolution is predictable has been a longstanding debate [3] , [4] , 252 [44] , [45] . Here we show that large-scale protein language models can learn evolutionary rules 253 well enough to predict the ordering of sequences in evolutionary time. While the phylogenetic 254 tree is the oldest conceptual model of evolution [1] and has had wide application to natural 255 sequence variation [7] , we show that landscape-based theory [3] , [10] - [12] combined with 256 modern algorithms can also provide novel evolutionary insight that is complementary to existing 257 approaches. 258 Evo-velocity has a number of distinctives with respect to phylogenetic tree 259 reconstruction. Evo-velocity is especially suitable for analyzing large (~1000 or more) 260 collections of sequences. We currently limit our analysis to extant sequences, rather than 261 artificially reconstructing ancestral sequences, though these could be incorporated into the 262 analysis as well. In viewing evolution as a landscape, evo-velocity admits multiple "valleys" that 263 we refer to as roots. Because we predict the directionality of edges in the network, evo-velocity 264 roots are also better mathematically determined than phylogenetic roots [9] , [46] (though users 265 could manually specify root sequences as well). Evo-velocity landscapes can also better model 266 phenomena like convergent evolution ( Figure 2F ). 267 We also find that evo-velocity provides a helpful notion of uncertainty in its predictions 268 that is less natural to obtain from standard phylogenetic methods. For example, evo-velocity 269 reports multiple roots, indicating evolutionary ambiguity regarding the oldest sequences or 270 reflecting discontinuous trajectories due to missing evolutionary ancestors. Similarly, the most 271 robust ordering relationships are at the level of groups of sequences, providing pseudotemporal 272 confidence intervals. 273 Computationally, our results are striking in that a single language model trained on 274 diverse, natural protein sequences seems to learn generic evolutionary rules. This is corroborated 275 by our finding that two independently-trained language models, ESM-1b and TAPE, can produce 276 very similar pseudotemporal ordering results (Table S4) , even though TAPE is a much weaker 277 mutational effect predictor than ESM-1b ( Figure 1C) . The robustness of evo-velocity 278 pseudotime to language model implementation may be because, in our framework, language 279 models only need to consider natural sequence changes [11] , rather than the artificial mutations 280 introduced in deep mutational scanning (DMS) experiments; evo-velocity therefore benefits by 281 considering both the language model likelihood and semantic similarity [16] . Language models 282 may provide successful evo-velocity predictions because their conditional likelihoods capture 283 evolutionary contingency, which is a strong driver of natural sequence variation [47] . Our 284 findings raise a number of interesting computational questions, including the degree to which the 285 rules learned by language models are biologically interpretable (for example, in terms of 286 thermostability or evolvability [28] , [48] ) and whether better protein language models could 287 improve the performance and resolution of evo-velocity. 288 Promisingly, evo-velocity offers a new approach through which to reevaluate current 289 evolutionary hypotheses. For example, when evaluating a potential hypothesis of eukaryote-to-290 prokaryote HGT among serpins [38] , [39] , evo-velocity instead predicted a more canonical 291 evolutionary trajectory ( Figure 5 ). While we mostly take a gene-centric approach to evolution 292 [49] , trajectories could also be integrated across multiple genes to provide insight into evolution 293 at the level of pathways (as done for our analysis of glycolytic enzymes), gene modules, or even 294 whole genomes. This might enable calibrating evo-velocity pseudotime to historical or geologic 295 time, providing an additional method for dating evolutionary events. Evo-velocity also suggests a 296 way to predict future evolution and to design novel protein sequences. Figure 1 of Weiss et al. [6] . For all boxplots: box extends from 369 first to third quartile with line at the median, whiskers extend to 1.5 times the interquartile range, 370 and diamonds indicate outlier points. Language models 373 In this paper, we implement evo-velocity with masked language models, which are 374 trained by masking certain residues in the input and predicting these residues in the output. Evo-velocity score computation 387 We compute an evo-velocity score that compares two sequences ( ) and ( ) as where ℳ ≝ { ∶ ( ) ≠ ( ) } is the set of positions at which the amino acid residues disagree. 390 We designed the evo-velocity score based on masked-language-model pseudolikelihoods [19] for each sequence ( ) in the set of sequences-of-406 interest (for example, proteins within the same family) of size . We use ESM-1b to compute 407 the embeddings for each sequence as the 1,280-dimensional output of the last (i.e., the 33rd) 408 hidden layer of the language model. 409 We then construct a directed graph where each node corresponds to a sequence and we 410 connect a node to its -nearest neighbors based on the Euclidean distance in the language model 411 embedding space in ℝ . We can then use the evo-velocity scores and the KNN graph to To find the root nodes, we can use the fixed points of a diffusion process based on the 429 transition matrix [46] , [51] . Given a diffusion probability vector ( ) , we can find roots by 430 running a diffusion process until a fixed point, i.e., (∞) = T (∞) (note that we take the 431 transpose of the transition matrix to "reverse" the diffusion process, since our goal is to find the 432 root nodes). We take the highest values of (∞) to identify the root nodes, where we obtain (∞) 433 as the eigenvector of T corresponding to an eigenvalue of 1. By default, we use a cutoff at the 434 98 th percentile of values in (∞) to define the set of root nodes, as has been done previously [51] . 435 We assume corresponds to a strongly connected directed graph, which is true if the KNN 436 network consists of a single connected component (and which was true for all of our analyses); if 437 the graph is strongly connected, then there is a unique value of (∞) [46] . . 450 We use the DPT implementation provided by the Scanpy Python package. 451 Plotting, data visualization, and statistical analysis 452 We used the UMAP algorithm [24] To perform the control experiment, we filtered out sequences with 80% or less sequence 502 similarity to the training set, thereby excluding sequences that are far from the sequences 503 considered by ESM-1b. We then evaluated the Spearman correlation between the similarity 504 scores and pseudotime, both in terms of the directionality of the correlation (e.g., a positive 505 correlation indicates that similarity to UniRef50 could be explaining pseudotime) and also in 506 terms of the change in this correlation compared to the correlation obtained on the full set of 507 sequences (Table S3) . We also evaluated the ability for the overall pseudotemporal patterns (for 508 example, correlation with sampling time or ordering of taxonomic classes) to reproduce those 509 found when analyzing the full set of sequences. TAPE reproducibility computational control 511 We also wanted to see how robust our evo-velocity results were to the language model 512 used to estimate the mutational likelihoods. We therefore obtained the TAPE transformer model 513 as described above. We performed the evo-velocity analysis by keeping the KNN graph structure 514 the same as in the ESM-1b analysis but using the evo-velocity scores obtained by the TAPE 515 likelihoods. All other downstream analyses, including root prediction and pseudotime 516 computation, were also kept the same. We then evaluated the ability for the final pseudotime 517 output to reproduce the output obtained by performing the same analysis except with ESM-1b 518 velocities. Influenza A NP evo-velocity analysis 520 We obtained 3,304 unique NP sequences from the NIAID Influenza Research Database 521 (https://www.fludb.org) [52] . We restricted our analysis to sequences that were sampled from 522 human hosts. Metadata included the year the sequences were sampled and the influenza subtype 523 of the original virus. We performed KNN graph construction, evo-velocity computation, root 524 prediction, diffusion pseudotime estimation, and UMAP velocity projection as described 525 previously. 526 We obtained an ordered phylogenetic path from Gong et al. [28] of H3N2-subtype NP 527 evolution from 1968 to 2007. We computed the ESM-1b evo-velocity score comparing adjacent 528 sequences along this path and plotted the cumulative sum of these scores versus the order in the 529 path ( Figure 2D) . We also compared the improvement in evo-velocity of this path to that of 530 simulated paths. To simulate paths across our evo-velocity landscape, we began at the same 531 starting sequence, used the same number of steps as the path of Gong et al., and only considered 532 paths that ended in the same cluster of sequences as the end sequence of Gong et al.'s path. We 533 used the transition matrix to define the probability of moving from node to node and we 534 performed 30,000 random walks. 535 We obtained a phylogenetic tree of all NP sequences considered in the evo-velocity 536 analysis by first aligning sequences with MAFFT followed by approximate maximum-likelihood 537 tree construction using FastTree version 2.1 using a JTT+CAT model. The midpoint-rooted tree 538 was visualized using the iTOL web tool (https://itol.embl.de/) [53] . 539 We also projected evo-velocity into one-hot-encoding space to compute a | |-540 dimensional vector ̃ for each sequence as described previously; we then averaged these vectors 541 across all sequences and inspected the top five mutations with the greatest magnitude change in 542 the resulting average. We then located these mutations onto a reference sequence from 1934 543 H1N1 NP (UniProt ID: P03466), for which linear T-cell epitope data is available through the 544 Immune Epitope Database (https://www.iedb.org/) [30] . We restricted our consideration to linear 545 epitopes of influenza NP with positive validation in a T-cell assay. 546 We also conducted an ablation study to test the robustness of evo-velocity results when 547 using simpler methods for computing sequence embeddings or evo-velocity scores. We 574 We obtained 46,986 unique, full-length Spike sequences from the May 27, 2021 GISAID 575 release (https://www.gisaid.org/) [55] . Metadata included the date the sequences were sampled. 576 We performed KNN graph construction, evo-velocity computation, root prediction, diffusion 577 pseudotime estimation, and UMAP velocity projection as described previously. We determined 578 the location of clusters corresponding to known variants-of-concern based on known marker 579 mutations including D614G, N501Y (for B. Globins evo-velocity analysis 582 We obtained 6,097 globin sequences from UniProt. We restricted our analysis to 583 eukaryotic sequences within the "globin" family and to sequences between 135 and 155 residues 584 in length, inclusive, which was done based on a clear mode in the distribution of sequence 585 lengths and was meant to preserve mostly homologous sequences in our analysis. Metadata 586 included the taxonomic lineage of each sequence and, for some of the sequences, annotations 587 indicating the type of globin. We performed KNN graph construction, evo-velocity computation, 588 root prediction, diffusion pseudotime estimation, and UMAP velocity projection as described 589 previously. We obtained the rooted phylogenetic tree of globins and the inferred ancestral 590 sequences from Pillai et al. [35] . 591 Cytochrome c evo-velocity analysis 592 We obtained 2,128 cytochrome c sequences from UniProt. We restricted our analysis to 593 eukaryotic sequences within the "cytochrome c" family and to sequences between 100 and 115 594 residues in length, inclusive, which was done based on a clear mode in the distribution of 595 sequence lengths and was meant to preserve mostly homologous sequences in our analysis. 596 Metadata included the taxonomic lineage of each sequence. We performed KNN graph 597 construction, evo-velocity computation, root prediction, diffusion pseudotime estimation, and 598 UMAP velocity projection as described previously. We obtained the approximate dates and 599 geologic eons of the emergences of different organisms from Hedges et al. [37] . Enolase evo-velocity analysis 601 We obtained 31,901 enolase sequences from UniProt. We restricted our analysis to 602 sequences within the "enolase" family and to sequences between 412 and 448 residues in length, 603 inclusive, which was done based on a clear mode in the distribution of sequence lengths and was 604 meant to preserve mostly homologous sequences in our analysis. Metadata included the 605 taxonomic lineage of each sequence. We performed KNN graph construction, evo-velocity 606 computation, root prediction, diffusion pseudotime estimation, and UMAP velocity projection as 607 described previously. 608 We obtained unrooted phylogenetic trees of enolase based on the subset of our UniProt 609 sequences with high-quality, manual annotation. We then performed a multiple sequence 610 alignment with MAFFT and performed phylogenetic reconstruction on the alignment with 611 PhyML version 3.3.20200621 using a JTT model with gamma-distributed among-site rate 612 variation and empirical state frequencies [56] . The unrooted tree was visualized using the iTOL 613 web tool. PGK evo-velocity analysis 615 We obtained 30,455 PGK sequences from UniProt. We restricted our analysis to 616 sequences within the "phosphoglycerate kinase" family and to sequences between 385 and 420 617 residues in length, inclusive, which was done based on a clear mode in the distribution of 618 sequence lengths and was meant to preserve mostly homologous sequences in our analysis. 619 Metadata included the taxonomic lineage of each sequence. We performed KNN graph 620 construction, evo-velocity computation, root prediction, diffusion pseudotime estimation, and 621 UMAP velocity projection as described previously. 622 We obtained unrooted phylogenetic trees of enolase based on the subset of our UniProt 623 sequences with high-quality, manual annotation. We then performed a multiple sequence 624 alignment with MAFFT and performed phylogenetic reconstruction on the alignment with 625 PhyML using a JTT model with gamma-distributed among-site rate variation and empirical state 626 frequencies. The unrooted tree was visualized using the iTOL web tool. Serpins evo-velocity analysis 628 We obtained 22,737 serpin sequences from UniProt. We restricted our analysis to 629 sequences within the "serpin" family and to sequences between 300 and 525 residues in length, 630 inclusive, which was done based on a clear mode in the distribution of sequence lengths and was 631 meant to preserve mostly homologous sequences in our analysis. Metadata included the 632 taxonomic lineage of each sequence. We performed KNN graph construction, evo-velocity 633 computation, root prediction, diffusion pseudotime estimation, and UMAP velocity projection as 634 described previously. 635 We obtained unrooted phylogenetic trees of enolase based on the subset of our UniProt 636 sequences with high-quality, manual annotation. We then performed a multiple sequence 637 alignment with MAFFT and performed phylogenetic reconstruction on the alignment with 638 PhyML using a JTT model with gamma-distributed among-site rate variation and empirical state 639 frequencies. The unrooted tree was visualized using the iTOL web tool. Data used in our analysis has been deposited to Zenodo at doi:10.5281/zenodo.4891758. Code used in our analysis has been deposited to Zenodo at doi:10.5281/zenodo.4891819. Our code and links to data are also available on GitHub at https://github.com/brianhie/evolocity. Figure S1: Additional figures for nucleoprotein evo-velocity analysis. (Table S4) . (D) Pseudotemporal relationships when controlling for similarity to UniRef50 or when using TAPE-based evo-velocity computation largely reproduce those in our main analysis (compare to Figure 4E ) especially when comparing the "lower-order" and "higher-order" taxonomic labels, although TAPE places viridiplantae after fungi in pseudotime and filtering based on sequence similarity to UniRef50 removes many of the earliest eukaryotes in pseudotime when analyzing the full dataset. Box extends from first to third quartile with line at the median, whiskers extend to 1.5 times the interquartile range, and diamonds indicate outlier points. Mutations were ranked by the magnitude of the average evo-velocity vector obtained by projecting the velocities into sequence space (Methods) and the top five were further investigated for location in T-cell epitopes. All involve single-nucleotide mutations from a methionine to a hydrophobic or a polar-uncharged amino acid residue. Also see Figures 2G and S1C. PGK r = 0.304, P < 1 × 10 -308 r = -0.267, P = 1 × 10 -120 Serpins r = 0.017, P = 0.01 r = -0.357, P < 1 × 10 -308 Table S3 : Correlation between evo-velocity pseudotime and sequence similarity to There is no consistent pattern in the directionality of the correlation between evo-velocity pseudotime and sequence similarity to UniRef50, indicating that sequence similarity does not trivially explain pseudotime. "Full dataset" indicates the results from analyzing all sequences while "similarity-controlled" indicates the results from restricting analysis to the sequences with greater than 80% sequence similarity to UniRef50 (Methods). In this latter setting, for all proteins, we were able to reproduce the results obtained from running evo-velocity on the full dataset. Influenza A NP r = 0.926, P < 1 × 10 -308 Influenza A HA r = -0.028, P = 0.01 HIV-1 Gag r = 0.814, P < 1 × 10 -308 SARS-CoV-2 Spike r = 0.902, P < 1 × 10 -308 Globins r = 0.893, P < 1 × 10 -308 Cytochrome c r = 0.811, P < 1 × 10 -308 Enolase r = 0.932, P < 1 × 10 -308 PGK r = 0.948, P < 1 × 10 -308 Serpins r = 0.955, P < 1 × 10 -308 Table S4 : Pseudotime reproducibility with TAPE velocities. The correlation between computed pseudotime using ESM-1b or TAPE to determine the evovelocity scores. Cells shaded in light blue indicate correlations greater than 0.8. HA pseudotimes were not correlated between ESM-1b and TAPE due to the inability of TAPE to identify roots among the twentieth-century trajectory of HA evolution (Figure S2B-D) . All other proteins had strong pseudotime reproducibility between the two language models. We obtained comparable, if slightly weaker, pseudotime correlation when either using binary sequence embeddings to construct the KNN graph or using BLOSUM62 scores to compute velocities (or both). Replacing velocity scores with random, Gaussian noise resulted in loss of correlation between pseudotime and sampling year. PCA: Principal component analysis. Influenza A NP r = 0.030, P = 0.08 Influenza A HA r = 0.534, P < 1 × 10 -308 HIV-1 Gag r = -0.352, P < 1 × 10 -308 SARS-CoV-2 Spike r = -0.681, P < 1 × 10 -308 Globins r = -0.166, P = 4 × 10 -39 Cytochrome c r = -0.414 P = 6 × 10 -89 Enolase r = 0.273, P < 1 × 10 -308 PGK r = 0.099, P = 2 × 10 -67 Serpins r = -0.087, P = 4 × 10 -39 Table S6 : Correlation between pseudotime and sequence length. We observed no consistent pattern in the correlation between pseudotime and the length of sequences, suggesting that differing sequences lengths across a landscape does not explain evovelocity patterns. On the Origin of Species Natural selection and the concept of a protein space Empirical fitness landscapes and the predictability of evolution Predicting evolution A predictive fitness model for influenza The physiology and habitat of the last universal common ancestor A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood Inferring the root of a phylogenetic tree How Many Subpopulations Is Too Many? Exponential Lower Bounds for Inferring Population Histories The roles of mutation, inbreeding, crossbreeding and selection in evolution Darwinian evolution can follow only very few mutational paths to fitter proteins Climbing Mount Improbable Learning protein sequence embeddings using information from structure Evaluating Protein Transfer Learning with TAPE Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences Learning the language of viral evolution and escape UniRef: Comprehensive and non-redundant UniProt reference clusters Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations Combining evolutionary and assaylabelled data for protein fitness prediction Deep generative models of genetic variation capture the effects of mutations Entropy-Scaling Search of Massive Biological Data Visualizing fitness landscapes SCANPY: Large-scale single-cell gene expression data analysis UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction Network medicine: A network-based approach to human disease Diffusion pseudotime robustly reconstructs lineage branching RNA velocity of single cells Stability-mediated epistasis constrains the evolution of an influenza protein The pandemic threat of emerging H5 and H7 avian influenza viruses The immune epitope database (IEDB) 3.0 The Pfam protein families database in 2019 Mechanisms of Viral Membrane Fusion and Its Inhibition Origins of HIV and the AIDS pandemic SARS-CoV-2 Variants of Concern in the United States-Challenges and Opportunities Origin of complexity in haemoglobin evolution Eukaryote evolution: A view based on cytochrome c sequence data Tree of life reveals clocklike speciation and diversification Serpins in prokaryotes Serpins in unicellular Eukarya, Archaea, and Bacteria: Sequence analysis and evolution A comprehensive phylogenetic analysis of the serpin superfamily Molecular evolution: The origin of glycolysis Molecular evolution of enolase Phosphoglycerate kinase: Structural aspects and functions, with special emphasis on the enzyme from Kinetoplastea: Phosphoglycerate Kinase Wonderful Life: The Burgess Shale and the Nature of History Life's solution: Inevitable humans in a lonely universe Random walks and diffusion on networks Contingency and chance erase necessity in the experimental evolution of ancestral proteins Protein stability promotes evolvability The Selfish Gene Assessing single-cell transcriptomic variability through density-preserving data visualization Generalizing RNA velocity to transient cell states through dynamical modeling Influenza Research Database: An integrated bioinformatics resource for influenza virus research Interactive Tree of Life (iTOL) v4: Recent updates and new developments UniProt: A worldwide hub of protein knowledge GISAID: Global initiative on sharing all influenza data -from vision to reality New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 The NP sequence landscape shows structure corresponding to influenza subtype. (B) By stratifying edges based on the sampling time difference between their two corresponding sequences and quantifying bias toward positive or negative evo-velocity scores using a binomial test, we found that the bias toward positive evo-velocity scores increases as time increases Mutations with strong magnitude changes in evo-velocity are also located in experimentallyvalidated T-cell epitopes (Table S2). (D) All NP sequences belong to a single UniRef50 cluster, which has as its representative a sequence from 1934 H1N1. (E, F) Evo-velocity pseudotime of NP based on ESM-1b-or TAPE-based evo-velocity scores have high correlation Replacing ESM-1b embeddings with one-hot sequence embeddings removes some of the known evolutionary continuity relationships from the visualization, especially in the well-studied trajectory of H3N2 NP evolution. Replacing ESM-1b evo-velocity scores with BLOSUM62 scores results in much weaker and more ambiguous evo-velocity flows when visualized We thank Nicholas Bhattacharya, Anne Dekas, Bennett Kapili, Michael Kim All authors were involved in project conceptualization and investigation. B.L.H. wrote the software, performed the computational experiments, and wrote the initial paper draft. All authors interpreted the results and wrote the final paper. The authors declare no competing interests. Data S1: Mutational effect prediction benchmarking results (CSV file is included). Full dataset Similarity-controlled (>80%)Influenza A HA r = -0.526, P < 1 × 10 -308 r = -0.528, P < 1 × 10 -308HIV-1 Gag Enolase r = 0.597, P < 1 × 10 -308 r = -0.044, P = 3 × 10 -4