key: cord-0138903-s3ccq9me
authors: Wang, Rui; Chen, Jiahui; Hozumi, Yuta; Yin, Changchuan; Wei, Guo-Wei
title: Emerging vaccine-breakthrough SARS-CoV-2 variants
date: 2021-09-09
journal: nan
DOI: nan
sha: 48429bf2fa2fa67252f71e3f8d0b9872e58db378
doc_id: 138903
cord_uid: s3ccq9me

The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying this surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. Understanding the molecular mechanism of transmission and evolution is a prerequisite for foreseeing the trend of emerging vaccine-breakthrough variants and for designing mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have the greatest likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict that they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.

The death toll of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), exceeded 4.4 million in August 2021. Tremendous efforts in combating SARS-CoV-2 have led to several authorized vaccines, which mainly target the viral spike (S) protein. However, the emergence of mutations on the S gene has resulted in more infectious variants and vaccine-breakthrough infections. Emerging vaccine-breakthrough SARS-CoV-2 variants pose a grand challenge to the long-term control and prevention of the COVID-19 pandemic. Therefore, forecasting emerging breakthrough SARS-CoV-2 variants is of paramount importance for the design of new mutation-proof vaccines and monoclonal antibodies (mAbs).

To predict emerging breakthrough SARS-CoV-2 variants, one must understand the molecular mechanism of viral transmission and evolution, which is one of the greatest challenges of our time. SARS-CoV-2 entry into a host cell depends on the binding between the S protein and the host angiotensin-converting enzyme 2 (ACE2), primed by the host transmembrane protease, serine 2 (TMPRSS2) [1]. This process triggers the host's adaptive immune response, and consequently, antibodies are generated to combat the invading virus through either direct neutralization or non-neutralizing binding [2, 3]. The S protein receptor-binding domain (RBD) is a short immunogenic fragment that facilitates the binding of the S protein to ACE2. Epidemiological and biochemical studies have suggested that the binding free energy (BFE) between the S RBD and ACE2 is proportional to the infectivity [1, 4-7]. Additionally, strong binding between the RBD and mAbs leads to effective direct neutralization [8-10]. Therefore, RBD mutations have dominating impacts on viral infectivity, mAb efficacy, and vaccine protection rates. Mutations may occur for various reasons, including random genetic drift, replication error, polymerase error, host immune responses, gene editing, and recombination [11-15]. Benefiting from the genetic proofreading mechanism regulated by NSP12 (a.k.a. RNA-dependent RNA polymerase) and NSP14 [16, 17], SARS-CoV-2 has a higher replication fidelity than other RNA viruses such as influenza. Nonetheless, nearly 700 non-degenerate mutations have been observed on the RBD, contributing many key mutations in emerging variants, such as N501Y for Alpha; K417N, E484K, and N501Y for Beta; K417T, E484K, and N501Y for Gamma; L452R and T478K for Delta; and L452Q and F490S for Lambda [18]. Given the importance of the RBD for SARS-CoV-2 infectivity, vaccine efficacy, and mAb effectiveness, it is imperative to understand the mechanism governing RBD mutations.

In June 2020, when there were only 89 non-degenerate mutations on the RBD and the highest observed mutational frequency was only around 50 globally, we were able to show that natural selection underpins SARS-CoV-2 evolution, based on the genotyping of 24,715 SARS-CoV-2 sequences isolated from patients and a topology-based deep learning model for RBD-ACE2 binding analysis [19]. In the same work, we predicted that RBD residues 452 and 501 "have high chances to mutate into significantly more infectious COVID-19 strains" [19]. Currently, these residues are the key mutational sites of all prevailing SARS-CoV-2 variants. We further foresaw a list of 1,149 most likely RBD mutations among 3,686 possible RBD mutations [19]. To date, every one of the 683 observed RBD mutations belongs to the list. In April 2021, we demonstrated that all of the 100 most observed RBD mutations among 651 existing RBD mutations from 506,768 viral genomes had enhanced the binding between the RBD and ACE2, resulting in more infectious variants [18]. The odds of these 100 most observed mutations being there by accident are smaller than one chance in 1.2 nonillion ($2^{100} \approx 1.2 \times 10^{30}$). There is no doubt that natural selection via viral infectivity, rather than any competing theory [11-15], is the dominating mechanism of SARS-CoV-2 transmission and evolution. This mechanistic discovery lays the foundation for forecasting future emerging SARS-CoV-2 variants.
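The arithmetic behind this estimate can be checked directly. The short sketch below assumes, as the quoted figure of 2^100 implies, that each mutation independently has an equal chance of strengthening or weakening RBD-ACE2 binding; the variable names are illustrative and not taken from the original work.

```python
from fractions import Fraction

# Assumption: each of the 100 mutations is equally likely to raise or lower
# the RBD-ACE2 binding free energy, independently of the others.
n_mutations = 100
p_single = Fraction(1, 2)

# Probability that all 100 mutations strengthen binding purely by chance.
p_all = p_single ** n_mutations
odds_against = p_all.denominator  # equals 2**100

print(f"1 chance in {odds_against:.3e}")
```

Exact rational arithmetic is used so that the result is not affected by floating-point rounding; the printed value is on the order of 10^30, in line with the figure quoted in the text.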

Understanding SARS-CoV-2 variant threats to current vaccines and mAbs is another urgent issue facing the scientific community [20]. The World Health Organization (WHO) has identified variants of concern (VOCs) and variants of interest (VOIs). The former describes variants that show increased transmissibility or virulence, or that adversely affect the effectiveness of vaccines, therapeutics, and diagnostics, with clear clinical correlation evidence. The latter describes variants carrying genetic changes that are predicted or known to reduce neutralization by vaccine-elicited antibodies or the efficacy of treatments, or to affect transmissibility, virulence, disease severity, immune escape, or diagnostics, and that cause significant community transmission, suggesting an emerging risk to the public. Currently, WHO lists four VOCs, i.e., variants B.1.1.7 (Alpha) [21-23], B.1.351 (Beta) [22, 24], P.1 (Gamma) [22], and B.1.617.2 (Delta) [25], and five VOIs, i.e., variants B.1.525 (Eta) [26], B.1.526 (Iota) [26, 27], B.1.617.1 (Kappa) [28], C.37 (Lambda) [29], and B.1.621 (Mu). (A general introduction to the prevailing and emerging variants is given in Section S1 of the Supporting Information.) Our hypothesis is that the impact of a variant on infectivity, vaccine efficacy, and mAb effectiveness depends mainly on how the associated RBD mutations affect the binding with ACE2 and antibodies. Based on this hypothesis, we collected and analyzed a library of antibodies and unveiled that most RBD mutations would weaken the binding between the S protein and antibodies and disrupt the efficacy and reliability of antibody therapies and vaccines [20]. We predicted "the urgent need to develop new mutation-resistant vaccines and antibodies and prepare for seasonal vaccination" in early 2021 [20]. We further identified vaccine-escape (i.e., vaccine-breakthrough) mutations and fast-growing mutations [18]. Our predictions of the threats from VOCs and VOIs were in good agreement with experimental data [30].

The objective of this work is to forecast emerging SARS-CoV-2 variants that pose an imminent threat to combating COVID-19 and long-term public health. To this end, we carry out an RBD-specific analysis of SARS-CoV-2 co-mutations involving a wide variety of combinations of 683 unique single mutations on the RBD. We take a unique approach that integrates viral genotyping of 1,489,884 complete genome sequences isolated from patients, algebraic topology algorithms that won the worldwide competition in computer-aided drug discovery [31] , deep learning models trained with tens of thousands of mutational data points [20, 30] , and a library of 130 SARS-CoV-2 antibody structures. By analyzing the frequency, binding free energy (BFE) changes, and antibody disruption counts of RBD co-mutations, we reveal that nine RBD co-mutation sets, namely [ 

To understand the molecular mechanisms of vaccine-escape mutations, we analyze single nucleotide polymorphisms (SNPs) of 1,489,884 complete SARS-CoV-2 genome sequences, resulting in 683 non-degenerate RBD mutations and their associated frequencies. A full set of mutation information is available on our interactive web page Mutation Tracker. The infectivity of each mutation is mainly determined by the mutation-induced BFE change of the RBD-ACE2 binding complex. To estimate the impact of each mutation on vaccines, we collect a library of 130 antibody structures (Supporting Information S2.1.2), including Food and Drug Administration (FDA)-approved mAbs from Eli Lilly and Regeneron. For a given RBD mutation, its number of antibody disruptions is the number of antibodies whose mutation-induced antibody-RBD BFE changes are smaller than -0.3 kcal/mol. (A list of names of the antibodies disrupted by mutations can be found in Supporting Information S2.1.1.) BFE changes following mutations are predicted by our deep learning model, TopNetTree [32]. We have created an interactive web page, Mutation Analyzer, to list all RBD mutations, their observed frequencies, their RBD-ACE2 BFE changes following mutations, their number of antibody disruptions, and various ranks. Figure 1 illustrates the RBD mutations associated with prevailing SARS-CoV-2 variants, the time evolution trajectories of all RBD mutations, and the BFE changes of RBD-ACE2 and 130 RBD-antibody complexes induced by 75 significant mutations. A summary of our analysis is given in Table 1.
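The disruption-count rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the example BFE values are hypothetical, and only the -0.3 kcal/mol threshold and the ratio over the antibody library come from the text.

```python
DISRUPTION_THRESHOLD = -0.3  # kcal/mol, threshold quoted in the text


def count_disruptions(bfe_changes, threshold=DISRUPTION_THRESHOLD):
    """Count antibodies whose antibody-RBD BFE change is below the threshold."""
    return sum(1 for change in bfe_changes.values() if change < threshold)


def disruption_ratio(bfe_changes, threshold=DISRUPTION_THRESHOLD):
    """Percentage of antibody-RBD complexes in the library that are disrupted."""
    return 100.0 * count_disruptions(bfe_changes, threshold) / len(bfe_changes)


# Hypothetical predicted BFE changes (kcal/mol) of one RBD mutation against
# four antibodies; the real library contains 130 antibody structures.
example = {"ab_1": -0.85, "ab_2": -0.10, "ab_3": 0.25, "ab_4": -0.31}

print(count_disruptions(example))  # 2
print(disruption_ratio(example))   # 50.0
```

In practice the dictionary would hold one predicted BFE change per antibody in the library for the mutation under study.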

Table 1: Top 25 most observed S protein RBD mutations. Here, BFE change refers to the BFE change of the S protein and human ACE2 complex induced by a single-site S protein RBD mutation. A positive mutation-induced BFE change strengthens the binding between the S protein and ACE2, which results in more infectious variants. Counts of antibody disruption represent the number of antibody and S protein complexes disrupted by a specific RBD mutation. Here, an antibody and S protein complex is considered disrupted if its binding affinity is reduced by more than 0.3 kcal/mol [18]. In addition, we calculate the antibody disruption ratio (%), which is the ratio of the number of disrupted antibody and S protein complexes to the 130 known complexes. Ranks are computed from 683 observed RBD mutations.

First, the 10 most observed or fast-growing RBD mutations are N501Y, L452R, T478K, E484K, K417T, S477N, N439K, K417N, F490S, and S494P, as shown in Table 1. All of these top mutations strengthen the RBD-ACE2 binding and thus make the virus more infectious, following the natural selection mechanism [19]. Figure 1b shows that the worldwide frequencies of the top three mutations have increased dramatically since 2021 due to Alpha, Beta, Gamma, Delta, and other variants. Second, among the top 25 most observed RBD mutations, T478K, L452Q, N440K, L452R, N501Y, N501T, F490S, A475V, and P384L are the most infectious ones, judged by their ability to strengthen the binding with ACE2, as shown in Figure 1c. The BFE change of the S protein and ACE2 for mutation T478K is nearly 1.00 kcal/mol, which strongly enhances the binding of the RBD-ACE2 complex [33]. Together with L452R (BFE change: 0.58 kcal/mol), T478K makes Delta the most infectious variant among the VOCs. Third, among the top 25 most observed RBD mutations, Y449S, S494P, K417N, F490S, L452R, E484K, K417T, E484Q, L452Q, and N501Y are the 10 most antibody-disruptive ones, judged by their interactions with 130 antibodies shown in Figure 1c. It can be seen that mutations L452R, E484K, K417T, K417N, F490S, and S494P disrupt more than 30% of antibody-RBD complexes, while mutations E484K and K417T may disrupt nearly 30% of antibody-RBD complexes, indicating their ability to undermine the efficacy and reliability of antibody therapies and vaccines. The most dangerous mutations are the ones that are both infectivity-strengthening and antibody-disruptive. Four RBD mutations, N501Y, L452R, F490S, and L452Q, appear in both lists and are key mutations in WHO's VOC and VOI lists. Among them, F490S and L452Q are the key RBD mutations in Lambda, making Lambda a more dangerous emerging variant than Delta.
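The overlap between the two lists can be verified with a simple set intersection. The sketch below uses only the mutation names quoted in the paragraph above; the variable names are illustrative.

```python
# Mutations listed in the text as the most infectivity-strengthening.
most_infectious = {
    "T478K", "L452Q", "N440K", "L452R", "N501Y",
    "N501T", "F490S", "A475V", "P384L",
}

# Mutations listed in the text as the 10 most antibody-disruptive.
most_disruptive = {
    "Y449S", "S494P", "K417N", "F490S", "L452R",
    "E484K", "K417T", "E484Q", "L452Q", "N501Y",
}

# Mutations that are both infectivity-strengthening and antibody-disruptive.
both = sorted(most_infectious & most_disruptive)
print(both)  # ['F490S', 'L452Q', 'L452R', 'N501Y']
```

The intersection reproduces the four mutations named in the text: N501Y, L452R, F490S, and L452Q.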

Note that the high-frequency mutation S477N does not significantly weaken any antibody-RBD binding, and thus does not appear in any prevailing variants.

The recent surge in COVID-19 infections is due to the occurrence of RBD co-mutations that combine two or more infectivity-strengthening mutations. The most dangerous future SARS-CoV-2 variants must be RBD co-mutations that combine infectivity-strengthening mutation(s) with antibody-disruptive mutation(s). A list of 1,139,244 RBD co-mutations decoded from 1,489,884 complete SARS-CoV-2 genome sequences can be found in Section S2.1.3 of the Supporting Information, and all of the non-degenerate RBD co-mutations with their frequencies, antibody disruption counts, total BFE changes, and first detection dates and countries can be found in Section S2.1.4 of the Supporting Information. Figure 2 illustrates the properties of S protein RBD 2, 3, and 4 co-mutations. The height of each bar shows the predicted total BFE change of each set of co-mutations on the RBD, the color represents the natural log of the frequency of each set of RBD co-mutations, and the number at the top of each bar is the AI-predicted number of antibody-RBD complexes that each set of RBD co-mutations may disrupt, out of a total of 130 RBD-antibody complexes. Notably, for a specific set of co-mutations, the higher the number at the top of the bar, the stronger its ability to break through vaccines. In Figure 2, the RBD 2 co-mutation set [L452R, T478K] (Delta variant) has the highest frequency (219,362) and the highest BFE change (1.575 kcal/mol). Moreover, the Delta variant would disrupt 40 antibody-RBD complexes, suggesting that Delta not only has enhanced infectivity but is also a vaccine-breakthrough variant. [L452Q, F490S] (Lambda) is another co-mutation with a high frequency, a high BFE change (1.421 kcal/mol), and a high antibody disruption count (59).
In addition, Lambda is considered to be more dangerous than Delta due to its higher antibody disruption count. Further, [R346K, E484K, N501Y] (Mu variant) has a BFE change of 0.768 kcal/mol and a high antibody disruption count (60). It is not as infectious as Delta and Lambda, but has a similar ability to Lambda in escaping vaccines. Note that among all VOCs and VOIs, Beta has the highest ability to break through vaccines, but its infectivity is relatively low (BFE change: 0.656 kcal/mol). Furthermore, high-frequency 2 co-mutation sets

It is important to understand the general trend of SARS-CoV-2 evolution. To this end, we carry out a statistical analysis of RBD co-mutations. Among 1,489,884 SARS-CoV-2 genome isolates, a total of 1,113 distinct 2 co-mutations, 612 distinct 3 co-mutations, and 217 distinct 4 co-mutations are found. Figures 3a, b, and c illustrate the 2D histograms of 2, 3, and 4 co-mutations, respectively. The x-axis is the antibody disruption count, and the y-axis shows the total BFE change. Figure 3a shows that there are 82 RBD 2 co-mutations that have BFE changes in the range of [0.600, 0.799] kcal/mol and will disrupt 40 to 49 antibodies. According to Figure 3b, there are 170 unique 3 co-mutations that have large BFE changes of the S protein and ACE2 in the range of [1.500, 1.999] kcal/mol. In Figure 3c, it is seen that almost all of the 4 co-mutations on the RBD have BFE changes greater than 0.5 kcal/mol and weaken the binding of the S protein with at least 60 antibodies. Figures 3d, e, and f are the histograms of total BFE changes, natural log of frequencies, and antibody disruption counts for RBD 2, 3, and 4 co-mutations. It can be seen that most of the 2, 3, and 4 RBD co-mutations have positive total BFE changes, and that the more mutations a co-mutation set contains, the higher its antibody disruption count tends to be.
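The 2D histograms described above amount to binning each co-mutation set by its antibody disruption count and its total BFE change. The following sketch shows one way such binning could be done; the bin widths (10 antibodies and 0.2 kcal/mol, matching the ranges quoted for Figure 3a), the function name, and all records except the Delta values quoted in the text are assumptions made for illustration.

```python
from collections import Counter


def bin_co_mutations(records, bfe_width_milli=200, count_width=10):
    """Bin (total_bfe, disruption_count) pairs into a 2D histogram.

    BFE values are converted to integer thousandths of kcal/mol before
    binning so that values such as 0.600 fall into the intended bin
    despite floating-point rounding. Keys are (count_bin_start, bfe_bin_start).
    """
    hist = Counter()
    for total_bfe, disruption_count in records:
        bfe_milli = round(total_bfe * 1000)
        bfe_bin = (bfe_milli // bfe_width_milli) * bfe_width_milli / 1000
        count_bin = (disruption_count // count_width) * count_width
        hist[(count_bin, bfe_bin)] += 1
    return hist


# (total BFE change in kcal/mol, antibody disruption count).
# The last record uses the Delta values quoted in the text; the other
# three are hypothetical.
records = [(0.62, 41), (0.79, 48), (0.80, 44), (1.575, 40)]

hist = bin_co_mutations(records)
print(hist[(40, 0.6)])  # 2 sets in BFE bin [0.600, 0.799], 40 to 49 antibodies
```

Working in integer thousandths is a deliberate choice here: dividing 0.6 by 0.2 in floating point yields a value slightly below 3, which would otherwise push a boundary value into the wrong bin.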
In summary, co-mutations with higher antibody disruption counts and high BFE changes will grow faster. We anticipate that when most of the population is vaccinated, vaccine-resistant mutations will become a more viable mechanism for viral evolution.

] that was first found in BR on April 06, 2020, has a BFE change of 0.625 kcal/mol and antibody disruption count 84, is an emerging vaccine breakthrough co-mutation in Brazil. In addition, the co-mutation set [L452Q, F490S] (cyan lines) of the Lambda variant has recently drawn much attention due to its potential ability to resist vaccines and enhance infectivity, which is consistent with our predictions that the co-mutation set [L452Q, F490S] has a relatively significant BFE change of the S protein and ACE2 (1.421 kcal/mol) and would reduce the RBD binding with 59 antibodies. Lambda has already spread to every country shown in Figure 4.

In this section, we first introduce the workflow of deep learning-based predictions of mutation-induced BFE changes of protein-protein interactions for the present SARS-CoV-2 variant analysis and prediction. It includes four steps, as shown in Figure 5: (1) data pre-processing; (2) training data preparation; (3) feature generation for protein-protein interaction complexes; and (4) prediction of protein-protein interactions by deep neural networks (see Section S5 of the Supporting Information). Next, we present the validation of our machine learning-based model, which shows consistent and reliable results compared to experimental deep mutational data.

The first step is to pre-process the original SARS-CoV-2 sequence data. In this step, a total of 1,489,884 complete SARS-CoV-2 genome sequences with high coverage and exact collection dates are downloaded from the GISAID database [34] ( https://www.gisaid.org/) as of August 05, 2021. Next, the 1,489,884 complete SARS-CoV-2 genome sequences are rearranged according to the reference genome downloaded from GenBank (NC_045512.2) [35], and multiple sequence alignment (MSA) is applied by using Clustal Omega with default parameters. Then, single nucleotide polymorphism (SNP) genotyping is applied to measure the genetic variations between different isolates of SARS-CoV-2 by analyzing the rearranged sequences [36, 37], which is of paramount importance for tracking genotype changes during the pandemic. The SNP genotyping captures all of the differences between patients' sequences and the reference genome, decoding a total of 28,478 unique single mutations from 1,489,884 complete SARS-CoV-2 genome sequences. Among them, 4,653 non-degenerate mutations on the S protein and 683 non-degenerate mutations on the S protein RBD (S protein residues 329 to 530) are detected. In this work, the co-mutation analysis is more crucial than the unique single mutation analysis. Therefore, for each SARS-CoV-2 isolate, we extract all of the mutations on the S protein RBD, which together are called the RBD co-mutation of that isolate. By doing this, a total of 1,139,244 RBD co-mutations are captured. The SARS-CoV-2 unique single mutations observed worldwide are available at Mutation Tracker. The analysis of RBD mutations is available at Mutation Analyzer.
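The co-mutation extraction step described above can be sketched as follows. This is a simplified illustration that assumes the SNP genotyping has already been translated into amino-acid mutation labels on the S protein; the function names and the example isolates are hypothetical, and only the RBD residue range 329 to 530 comes from the text.

```python
from collections import Counter

RBD_START, RBD_END = 329, 530  # S protein residue range of the RBD (from the text)


def residue_number(mutation):
    """Extract the residue position from a label such as 'L452R'."""
    return int(mutation[1:-1])


def rbd_co_mutation(spike_mutations):
    """Return an isolate's RBD mutations as a tuple sorted by residue position."""
    rbd = [m for m in spike_mutations if RBD_START <= residue_number(m) <= RBD_END]
    return tuple(sorted(rbd, key=residue_number))


# Hypothetical isolates, each given as a list of S protein mutations.
isolates = [
    ["D614G", "L452R", "T478K", "P681R"],
    ["T478K", "L452R", "D614G"],
    ["N501Y", "D614G", "P681H"],
    ["D614G"],
]

# Frequency of each distinct RBD co-mutation; isolates without any RBD
# mutation are skipped.
co_mutations = Counter(cm for cm in map(rbd_co_mutation, isolates) if cm)
print(co_mutations)
```

Sorting by residue position gives each co-mutation a canonical form, so the same set of mutations reported in a different order is counted as one co-mutation.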

In this section, the process of the machine learning-based BFE change predictions is introduced. Once the data pre-processing and SNP genotyping are carried out, we first proceed with the training data preparation, which plays a key role in reliability and accuracy. A library of 130 antibody-RBD complexes as well as an ACE2-RBD complex are obtained from the Protein Data Bank (PDB). RBD mutation-induced BFE changes of these complexes are evaluated by the following machine learning model. Because of the urgency of the pandemic and the rapid evolution of RNA viruses, large sets of experimental BFE change data for SARS-CoV-2 are rare, whereas next-generation sequencing data are relatively easy to collect. In the training process, the mutation-induced BFE changes in the SKEMPI 2.0 dataset [38] are used as the basic training set, while next-generation sequencing datasets are added as auxiliary training sets. SKEMPI 2.0 contains 7,085 single- and multi-point mutation entries, of which 4,169 entries in 319 different protein complexes are used for training the machine learning model. The mutational scanning data consist of experimental data on ACE2-RBD binding with mutations on ACE2 [39] and on the RBD [40, 41], and on CTC-445.2-RBD binding with mutations on both proteins [41].

Next, feature generation for protein-protein interaction complexes is performed. Element-specific algebraic topological analysis of complex structures is implemented to generate topological bar codes [30, 42-44]. In addition, biochemical and biophysical features such as Coulomb interactions, surface areas, and electrostatics are combined with topological features [20]. Detailed information about the topology-based models is given in subsection 3.3. Lastly, deep neural networks for SARS-CoV-2 are constructed for the BFE change prediction of protein-protein interactions [30]. Detailed descriptions of the dataset and machine learning model are found in the literature [19, 30, 45] and are available at TopNetmAb.

Among all the features generated for machine learning prediction, those derived from topology raise the model to a new level. The remaining inputs are called auxiliary features and are described in Section S4 of the Supporting Information. This section gives a brief introduction to the underlying topological theory. Algebraic topology [42, 43] has achieved tremendous success in many fields, including the description of biochemical and biophysical properties [44]. Biological applications require special treatment to describe element types and amino acids in a polypeptide mathematically, which leads to element-specific and site-specific persistent homology [19, 32]. To construct the algebraic topological features of a protein-protein interaction model, a series of element subsets of the complex structure is defined, which considers atoms from the mutation site, atoms in the neighborhood of the mutation site within a certain distance, atoms from the antibody binding site, atoms from the antigen binding site, and atoms in the system that belong to element type $E \in \{C, N, O\}$, denoted $\mathcal{A}_{\rm ele}(E)$. Under the element/site-specific construction, simplicial complexes are constructed on point clouds formed by atoms. For example, let $U = \{u_0, u_1, \ldots, u_k\}$ be a set of $k+1$ independent points from one element/site-specific set. The $k$-simplex $\sigma$ is the convex hull of the $k+1$ independent points in $U$, i.e., the set of all their convex combinations. For example, a 0-simplex is a point and a 1-simplex is an edge. An $m$-face of the $k$-simplex is the convex hull of $m+1$ of its $k+1$ vertices, with $m < k$, and the sum of all its $(k-1)$-faces is the boundary of the $k$-simplex $\sigma$:

$$\partial_k \sigma = \sum_{i=0}^{k} (-1)^i [u_0, \ldots, \hat{u}_i, \ldots, u_k],$$

where $[u_0, \ldots, \hat{u}_i, \ldots, u_k]$ consists of all vertices of $\sigma$ excluding $u_i$. A collection of finitely many simplices is a simplicial complex. In the model, the Vietoris-Rips (VR) complex (a simplex is included if and only if $B(u_{i_j}, r) \cap B(u_{i_{j'}}, r) \neq \emptyset$ for $j, j' \in [0, k]$) is used for dimension 0 topology, and the alpha complex (a simplex is included if and only if $\bigcap_{u_{i_j} \in \sigma} B(u_{i_j}, r) \neq \emptyset$) is used for the dimension 1 and 2 topology of the point cloud [44].

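The $k$-chain $c_k$ of a simplicial complex $K$ is a formal sum of the $k$-simplices in $K$, $c_k = \sum_i \alpha_i \sigma_i$, where the coefficients $\alpha_i$ are chosen in $\mathbb{Z}_2$. Thus, the boundary operator on a $k$-chain $c_k$ is

$$\partial_k c_k = \sum_i \alpha_i \partial_k \sigma_i,$$

such that $\partial_k : C_k \to C_{k-1}$, and it follows that boundaries are boundaryless:

$$\partial_{k-1} \partial_k = 0.$$

A chain complex is

$$\cdots \xrightarrow{\partial_{k+1}} C_k \xrightarrow{\partial_k} C_{k-1} \xrightarrow{\partial_{k-1}} \cdots \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0$$

as a sequence of chain groups connected by boundary maps. Therefore, the Betti numbers are given as the ranks of the $k$th homology group $H_k$, $\beta_k = \mathrm{rank}(H_k)$, where $H_k = Z_k / B_k$ is the quotient of the $k$-cycle group $Z_k$ by the $k$-boundary group $B_k$. The Betti numbers are the key to the topological features: $\beta_0$ gives the number of connected components, such as the number of atoms, $\beta_1$ is the number of cycles in the complex structure, and $\beta_2$ is the number of cavities. Together they capture abstract properties of the 3D structure.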
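To make the definition of Betti numbers concrete, the sketch below computes them for a small simplicial complex with coefficients in Z_2, using the identity that the k-th Betti number equals the number of k-simplices minus the ranks of the k-th and (k+1)-th boundary operators. It is a toy illustration written for this text, not the authors' TopNetTree code; all function names are hypothetical, and the input is assumed to be a closed complex (every face of every simplex is listed).

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are given as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low_bit = pivot & -pivot  # lowest set bit of the pivot row
        # Eliminate that bit from every remaining row (addition mod 2 is XOR).
        rows = [r ^ pivot if r & low_bit else r for r in rows]
    return rank


def boundary_rows(k_simplices, faces):
    """Rows of the boundary matrix: one bitmask of (k-1)-faces per k-simplex."""
    index = {face: i for i, face in enumerate(faces)}
    rows = []
    for simplex in k_simplices:
        mask = 0
        for face in combinations(simplex, len(simplex) - 1):
            mask |= 1 << index[face]
        rows.append(mask)
    return rows


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[k] lists the k-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim)
    ranks = [0] * (top + 1)  # ranks[k] = rank of the k-th boundary operator
    for k in range(1, top):
        ranks[k] = rank_gf2(
            boundary_rows(simplices_by_dim[k], simplices_by_dim[k - 1])
        )
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top)]


vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]

print(betti_numbers([vertices, edges]))               # hollow triangle: [1, 1]
print(betti_numbers([vertices, edges, [(0, 1, 2)]]))  # filled triangle: [1, 0, 0]
```

The hollow triangle has one connected component and one cycle; filling it in with a 2-simplex removes the cycle, which is the behavior the Betti numbers are meant to record.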

Finally, a single simplicial complex cannot give the whole picture of the protein-protein interaction structure. A filtration of a topological space is needed to extract more properties. A filtration is a nested sequence of subcomplexes such that

$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K.$$

Each element of the sequence generates its own Betti numbers $\{\beta_0, \beta_1, \beta_2\}$; consequently, a series of Betti numbers in three dimensions is constructed and used as the topological fingerprints in Figure 5a.
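As a small illustration of how Betti numbers change along a filtration, the sketch below tracks the number of connected components of a Vietoris-Rips complex as the ball radius grows, using a union-find structure. The point coordinates, radii, and function name are hypothetical; two points are joined when their closed balls of radius r intersect, i.e., when their distance is at most 2r, following the Vietoris-Rips condition stated earlier. Higher-dimensional Betti numbers would require the full boundary-matrix machinery and are not covered here.

```python
import math


def betti0_over_filtration(points, radii):
    """Number of connected components of the Vietoris-Rips complex per radius."""
    result = []
    for r in radii:
        parent = list(range(len(points)))

        def find(i):
            # Follow parent links to the representative, compressing the path.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                # Balls of radius r around i and j intersect iff dist <= 2r.
                if math.dist(points[i], points[j]) <= 2 * r:
                    parent[find(i)] = find(j)

        result.append(len({find(i) for i in range(len(points))}))
    return result


# Three hypothetical atom positions on a line.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
radii = [0.0, 0.5, 1.0, 1.5]

print(betti0_over_filtration(points, radii))  # [3, 2, 2, 1]
```

The sequence starts at the number of atoms and decreases as components merge; the radii at which merges occur are the kind of information a topological fingerprint records.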

The validation of our machine learning predictions of mutation-induced BFE changes against experimental data has been demonstrated in recently published papers [20, 30]. First, we showed high correlations between experimental deep mutational enrichment data and predictions for the binding complex of the SARS-CoV-2 S protein RBD and protein CTC-445.2 [20] and for the binding complex of the SARS-CoV-2 RBD and ACE2 [30]. In comparison with experimental data on the response of antibody therapies in clinical trials to emerging mutations, our predictions achieve a Pearson correlation of 0.80 [30]. Regarding the BFE changes induced by RBD mutations for the ACE2-RBD complex, predictions for mutations L452R and N501Y show a trend highly similar to that of the experimental data [30]. Meanwhile, as we presented in [18], high-frequency mutations all have positive BFE changes. Moreover, for multi-mutation tests, our BFE change predictions follow the same pattern as experimental data on the impact of SARS-CoV-2 variants on major antibody therapeutic candidates, where the BFE changes are accumulative for co-mutations [30].
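The statement that BFE changes are accumulative for co-mutations can be illustrated with a short sketch that sums per-mutation BFE changes. The function name is hypothetical, and the input values are approximations taken from the discussion of Delta earlier in the text: L452R is quoted as 0.58 kcal/mol, while T478K is described only as "nearly 1.00" kcal/mol, so 1.00 is an assumed value.

```python
# Per-mutation RBD-ACE2 BFE changes in kcal/mol.
# L452R is quoted in the text; T478K is given only as "nearly 1.00",
# so the value below is an approximation.
single_bfe = {"L452R": 0.58, "T478K": 1.00}


def total_bfe_change(co_mutation, single_bfe):
    """Sum single-mutation BFE changes, assuming they are additive."""
    return sum(single_bfe[m] for m in co_mutation)


delta = ("L452R", "T478K")
print(round(total_bfe_change(delta, single_bfe), 2))  # 1.58
```

The sum is close to the total BFE change of 1.575 kcal/mol reported for the Delta co-mutation set [L452R, T478K], which is what an additive model would lead one to expect.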

Recent studies on the potency of mAb CT-P59 in vitro and in vivo against Delta variants [46] show that the neutralization of CT-P59 is reduced by L452R (13.22 ng/mL) and is retained against T478K (0.213 ng/mL). In our predictions [30], L452R induces a negative BFE change (-2.39 kcal/mol), and T478K produces a positive BFE change (0.36 kcal/mol). In Figure 5b, the fold changes for experimental and predicted values are presented. Additionally, Figure 5c compares the experimental pseudovirus infection changes with the predicted BFE changes of the ACE2 and S protein complex induced by mutations L452R and N501Y, where the experimental data are obtained with reference to D614G and reported in relative luciferase units [25]. This comparison indicates that the binding of the RBD and ACE2 dominates the infectivity of SARS-CoV-2. More details can be found in Section S6 of the Supporting Information.

The worldwide SARS-CoV-2 SNP data are available at Mutation Tracker. The most observed SARS-CoV-2 RBD mutations are available at Mutation Analyzer. Information on the 130 antibodies and their corresponding PDB IDs can be found in the Supplementary Data. The SARS-CoV-2 S protein RBD SNP and non-degenerate co-mutation data can be found in Section S2.1.4 of the Supporting Information. The TopNetTree model is available at TopNetmAb.

The supporting information is available for S1 Overview of SARS-CoV-2 prevailing and emerging variants S2 

Emerging breakthrough variants in COVID-19 devastated countries

SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

Review of COVID-19 antibody therapies

SARS-CoV-2 neutralizing antibody LY-CoV555 in outpatients with COVID-19

Bats are natural reservoirs of SARS-like coronaviruses

Identification of two critical amino acid residues of the severe acute respiratory syndrome coronavirus spike protein for its variation in zoonotic tropism transition via a double substitution strategy

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein

A human monoclonal antibody blocking SARS-CoV-2 infection

Receptor-binding domain-specific human neutralizing monoclonal antibodies against SARS-CoV and SARS-CoV-2

The impact of receptor-binding domain natural mutations on antibody recognition of SARS-CoV-2

Mechanisms of viral mutation

Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear

Structural and physico-chemical effects of disease and non-disease nsSNPs on proteins

Loss of protein structure stability as a major causative factor in monogenic disease

Host immune response driving SARS-CoV-2 evolution

Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus

Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA

Vaccine-escape and fast-growing mutations in the United Kingdom, the United States

Mutations strengthened SARS-CoV-2 infectivity

Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies

Estimated transmissibility and impact of SARS-CoV-2 lineage B. 1.1. 7 in England

Increased resistance of SARS-CoV-2 variant P. 1 to antibody neutralization

Efficacy of ChAdOx1 nCoV-19 (AZD1222) vaccine against SARS-CoV-2 variant of concern 202012/01 (B. 1.1. 7): an exploratory analysis of a randomised controlled trial

Efficacy of the ChAdOx1 nCoV-19 COVID-19 vaccine against the B. 1.351 variant

Transmission, infectivity, and antibody neutralization of an emerging SARS-CoV-2 variant in California carrying a L452R spike protein mutation

SARS-CoV-2 spike E484K mutation reduces antibody neutralisation. The Lancet Microbe

A novel SARS-CoV-2 variant of concern, B. 1.526, identified in New York. medRxiv

Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies

SARS-CoV-2 Lambda variant exhibits higher infectivity and immune resistance. bioRxiv

Revealing the threat of emerging SARS-CoV-2 mutations to antibody therapies

Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges

A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation

SARS-CoV-2 Spike Mutations, L452R, T478K, E484Q and P681R, in the Second Wave of COVID-19 in Maharashtra

GISAID: Global initiative on sharing all influenza data-from vision to reality

A new coronavirus associated with human respiratory disease in China

Genotyping coronavirus SARS-CoV-2: methods and implications

Snp genotyping: technologies and biomedical applications

SKEMPI 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation

The sequence of human ACE2 is suboptimal for binding the S spike protein of SARS coronavirus 2

Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding

De novo design of potent and resilient hACE2 decoys to neutralize SARS-CoV-2

Topological persistence and simplification

Persistent homology analysis of protein structure, flexibility, and folding. International journal for numerical methods in biomedical engineering

Mutations on COVID-19 diagnostic targets

Therapeutic efficacy of CT-P59 against P.1 variant of SARS-CoV-2. bioRxiv


key: cord-1055378-w5l3s884
authors: Wang, Rui; Chen, Jiahui; Hozumi, Yuta; Yin, Changchuan; Wei, Guo-Wei
title: Emerging vaccine-breakthrough SARS-CoV-2 variants
date: 2021-09-09
journal: ArXiv
DOI: nan
sha: 48429bf2fa2fa67252f71e3f8d0b9872e58db378
doc_id: 1055378
cord_uid: w5l3s884

The recent global surge in coronavirus disease 2019 (COVID-19) infections has been fueled by new severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to the existence of 28,554 mutations, including 4,653 non-degenerate mutations on the spike (S) protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of SARS-CoV-2 transmission and evolution is a prerequisite to foresee the global trend of emerging vaccine-breakthrough SARS-CoV-2 variants and the design of mutation-proof vaccines and monoclonal antibodies (mAbs). We integrate the genotyping of 1,489,884 SARS-CoV-2 genomes isolated from patients, a library collection of 130 human antibodies, tens of thousands of mutational data points, topological data analysis (TDA), and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein receptor-binding domain (RBD) can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 COVID-19 devastated countries, including the United Kingdom (UK), the United States (US), Denmark (DK), Brazil (BR), Germany (DE), Netherlands (NL), Sweden (SE), Italy (IT), Canada (CA), France (FR), India (IN), and Belgium (BE), etc. We envision that natural selection through infectivity will continue to be a main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. 
Finally, we have identified the following sets of co-mutations that have a great likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.

The death toll of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exceeded 4.4 million in August 2021. Tremendous efforts in combating SARS-CoV-2 have led to several authorized vaccines, which mainly target the viral spike (S) protein. However, the emergence of mutations on the S gene has resulted in more infectious variants and vaccine-breakthrough infections. Emerging vaccine-breakthrough SARS-CoV-2 variants pose a grand challenge to the long-term control and prevention of the COVID-19 pandemic. Therefore, forecasting emerging breakthrough SARS-CoV-2 variants is of paramount importance for the design of new mutation-proof vaccines and monoclonal antibodies (mAbs).

To predict emerging breakthrough SARS-CoV-2 variants, one must understand the molecular mechanism of viral transmission and evolution, which is one of the greatest challenges of our time. SARS-CoV-2 entry into a host cell depends on the binding between the S protein and the host angiotensin-converting enzyme 2 (ACE2), primed by host transmembrane protease, serine 2 (TMPRSS2) [1]. Such a process inaugurates the host's adaptive immune response, and consequently, antibodies are generated to combat the invading virus either through direct neutralization or non-neutralizing binding [2, 3]. The S protein receptor-binding domain (RBD) is a short immunogenic fragment that facilitates the S protein binding with ACE2. Epidemiological and biochemical studies have suggested that the binding free energy (BFE) between the S RBD and ACE2 is proportional to the infectivity [1, 4-7]. Additionally, strong binding between the RBD and mAbs leads to effective direct neutralization [8-10]. Therefore, RBD mutations have dominating impacts on viral infectivity, mAb efficacy, and vaccine protection rates. Mutations may occur for various reasons, including random genetic drift, replication error, polymerase error, host immune responses, gene editing, and recombinations [11-15]. Benefiting from the genetic proofreading mechanism regulated by NSP12 (a.k.a. RNA-dependent RNA polymerase) and NSP14 [16, 17], SARS-CoV-2 has a higher fidelity in its replication process than other RNA viruses such as influenza. Nonetheless, nearly 700 non-degenerate mutations are observed on the RBD, contributing many key mutations in emerging variants, e.g., N501Y for Alpha; K417N, E484K, and N501Y for Beta; K417T, E484K, and N501Y for Gamma; L452R and T478K for Delta; and L452Q and F490S for Lambda [18]. Given the importance of the RBD for SARS-CoV-2 infectivity, vaccine efficacy, and mAb effectiveness, it is imperative to understand the mechanism governing RBD mutations.

In June 2020, when there were only 89 non-degenerate mutations on the RBD, and the highest observed mutational frequency was only around 50 globally, we were able to show that natural selection underpins SARS-CoV-2 evolution, based on the genotyping of 24,715 SARS-CoV-2 sequences isolated from patients and a topology-based deep learning model for RBD-ACE2 binding analysis [19]. In the same work, we predicted that RBD residues 452 and 501 "have high chances to mutate into significantly more infectious COVID-19 strains" [19]. Currently, these residues are the key mutational sites of all prevailing SARS-CoV-2 variants. We further foresaw a list of 1,149 most likely RBD mutations among 3,686 possible RBD mutations [19]. To date, every one of the observed 683 RBD mutations belongs to the list. In April 2021, we demonstrated that all of the 100 most observed RBD mutations among 651 existing RBD mutations from 506,768 viral genomes had enhanced the binding between RBD and ACE2, resulting in more infectious variants [18]. The odds for these 100 most observed mutations to be there accidentally are smaller than one chance in 1.2 nonillion ($2^{100} \approx 1.2 \times 10^{30}$). There is no doubt that natural selection via viral infectivity, rather than any other competing theories [11-15], is the dominating mechanism for SARS-CoV-2 transmission and evolution. This mechanistic discovery lays the foundation for forecasting future emerging SARS-CoV-2 variants.
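The one-in-a-nonillion figure can be checked with a few lines of arithmetic. The sketch below assumes a simple null model that the text implies but does not spell out: each of the 100 most observed mutations is independently equally likely to strengthen or weaken RBD-ACE2 binding. The variable names are illustrative only.

```python
from fractions import Fraction

# Assumed null model: each mutation independently has a 1/2 chance
# of strengthening RBD-ACE2 binding (positive BFE change).
n_top_mutations = 100
p_all_positive = Fraction(1, 2) ** n_top_mutations

# Exact denominator 2**100 and the corresponding floating-point probability.
print(p_all_positive.denominator)
print(f"{float(p_all_positive):.3e}")
```

Under this assumption the exact denominator is 2^100 = 1,267,650,600,228,229,401,496,703,205,376, i.e., on the order of 10^30 as quoted.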

Understanding SARS-CoV-2 variant threats to current vaccines and mAbs is another urgent issue facing the scientific community [20]. The World Health Organization (WHO) identified variants of concern (VOCs) and variants of interest (VOIs). The former describes variants that show increased transmissibility or virulence, or that adversely affect the effectiveness of vaccines, therapeutics, and diagnostics, with clear clinical correlation evidence. The latter describes variants that carry genetic changes that are predicted or known to reduce neutralization by vaccine-induced antibodies or the efficacy of treatments, or to affect transmissibility, virulence, disease severity, immune escape, diagnostics, etc., and that cause significant community transmission, suggesting an emerging risk to the public. Currently, WHO lists four VOCs, i.e., variants B.1.1.7 (Alpha) [21-23], B.1.351 (Beta) [22, 24], P.1 (Gamma) [22], and B.1.617.2 (Delta) [25], and five VOIs, i.e., variants B.1.525 (Eta) [26], B.1.526 (Iota) [26, 27], B.1.617.1 (Kappa) [28], C.37 (Lambda) [29], and B.1.621 (Mu). A general introduction to the prevailing and emerging variants is given in Section S1 of the Supporting Information. Our hypothesis is that the severity of variants with respect to infectivity, vaccine efficacy, and mAb effectiveness depends mainly on how the associated RBD mutations impact the binding with ACE2 and antibodies. Based on this hypothesis, we collected and analyzed a library of antibodies and unveiled that most of the RBD mutations would weaken the binding of S protein and antibodies and disrupt the efficacy and reliability of antibody therapies and vaccines [20]. We predicted "the urgent need to develop new mutation-resistant vaccines and antibodies and prepare for seasonal vaccination" in early 2021 [20]. We further identified vaccine-escape (i.e., vaccine-breakthrough) mutations and fast-growing mutations [18]. 
Our predictions of the threats from VOCs and VOIs were in great agreement with experimental data [30] .

The objective of this work is to forecast emerging SARS-CoV-2 variants that pose an imminent threat to combating COVID-19 and long-term public health. To this end, we carry out an RBD-specific analysis of SARS-CoV-2 co-mutations involving a wide variety of combinations of 683 unique single mutations on the RBD. We take a unique approach that integrates viral genotyping of 1,489,884 complete genome sequences isolated from patients, algebraic topology algorithms that won the worldwide competition in computer-aided drug discovery [31] , deep learning models trained with tens of thousands of mutational data points [20, 30] , and a library of 130 SARS-CoV-2 antibody structures. By analyzing the frequency, binding free energy (BFE) changes, and antibody disruption counts of RBD co-mutations, we reveal that nine RBD co-mutation sets, namely [ 

To understand the molecular mechanisms of vaccine-escape mutations, we analyze single nucleotide polymorphisms (SNPs) of 1,489,884 complete SARS-CoV-2 genome sequences, resulting in 683 non-degenerate RBD mutations and their associated frequencies. A full set of mutation information is available on our interactive web page Mutation Tracker. The infectivity of each mutation is mainly determined by the mutation-induced BFE change to the binding complex of RBD and ACE2. To estimate the impact of each mutation on vaccines, we collect a library of 130 antibody structures (Supporting Information S2.1.2), including Food and Drug Administration (FDA)-approved mAbs from Eli Lilly and Regeneron. For a given RBD mutation, its number of antibody disruptions is given by the number of antibodies whose mutation-induced antibody-RBD BFE changes are smaller than -0.3 kcal/mol. A list of names for antibodies that are disrupted by mutations can be found in the Supporting Information S2.1.1. BFE changes following mutations are predicted by our deep learning model, TopNetTree [32]. We have created an interactive web page, Mutation Analyzer, to list all RBD mutations, their observed frequencies, their RBD-ACE2 BFE changes following mutations, their number of antibody disruptions, and various ranks. Figure 1 illustrates RBD mutations associated with prevailing SARS-CoV-2 variants, time evolution trajectories of all RBD mutations, and the BFE changes of RBD-ACE2 and 130 RBD-antibody complexes induced by 75 significant mutations. A summary of our analysis is given in Table 1.
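The antibody disruption count described above reduces to a threshold test over the predicted antibody-RBD BFE changes. The following is a minimal sketch under stated assumptions: the function names, the antibody labels, and the numerical values in the example are hypothetical placeholders, not data from this work; only the -0.3 kcal/mol threshold and the library size of 130 come from the text.

```python
from typing import Dict

DISRUPTION_THRESHOLD = -0.3  # kcal/mol, threshold stated in the text
LIBRARY_SIZE = 130           # number of antibody-RBD complexes in the library


def count_disruptions(bfe_changes: Dict[str, float],
                      threshold: float = DISRUPTION_THRESHOLD) -> int:
    """Count antibodies whose mutation-induced BFE change is below the threshold."""
    return sum(1 for change in bfe_changes.values() if change < threshold)


def disruption_ratio(count: int, library_size: int = LIBRARY_SIZE) -> float:
    """Disruption ratio in percent, relative to the full antibody library."""
    return 100.0 * count / library_size


# Hypothetical predicted BFE changes (kcal/mol) of one RBD mutation
# against four made-up antibodies; real values would come from the model.
example = {"Ab-1": -1.20, "Ab-2": -0.31, "Ab-3": -0.30, "Ab-4": 0.15}
n_disrupted = count_disruptions(example)
print(n_disrupted, disruption_ratio(n_disrupted))
```

Note that a change of exactly -0.3 kcal/mol is not counted here, following the strict "smaller than" wording of the text.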

First, the 10 most observed or fast-growing RBD mutations are N501Y, L452R, T478K, E484K, K417T, S477N, N439K, K417N, F490S, and S494P, as shown in Table 1. All of these top mutations strengthen the RBD-ACE2 binding and thus make the virus more infectious, following the natural selection mechanism [19]. Figure 1b shows that the worldwide frequencies of the top three mutations have increased dramatically since 2021 due to Alpha, Beta, Gamma, Delta, and other variants. Second, among the top 25 most observed RBD mutations, T478K, L452Q, N440K, L452R, N501Y, N501T, F490S, A475V, and P384L are the most infectious ones judged by their ability to strengthen the binding with ACE2, as shown in Figure 1c. The BFE change of S protein and ACE2 for mutation T478K is nearly 1.00 kcal/mol, which strongly enhances the binding of the RBD-ACE2 complex [33]. Together with L452R (BFE change: 0.58 kcal/mol), T478K makes Delta the most infectious variant among the VOCs. Third, among the top 25 most observed RBD mutations, Y449S, S494P, K417N, F490S, L452R, E484K, K417T, E484Q, L452Q, and N501Y are the 10 most antibody-disruptive ones, judged by their interactions with 130 antibodies shown in Figure 1c. It can be seen that mutations L452R, E484K, K417T, K417N, F490S, and S494P disrupt more than 30% of antibody-RBD complexes, while mutations E484K and K417T may disrupt nearly 30% of antibody-RBD complexes, indicating their disruptive ability to the efficacy and reliability of antibody therapies and vaccines. The most dangerous mutations are the ones that are both infectivity-strengthening and antibody-disruptive. Four RBD mutations, N501Y, L452R, F490S, and L452Q, appear in both lists and are key mutations in WHO's VOC and VOI lists. Among them, F490S and L452Q are the key RBD mutations in Lambda, making Lambda a more dangerous emerging variant than Delta.

Table 1: Top 25 most observed S protein RBD mutations. Here, BFE change refers to the BFE change for the S protein and human ACE2 complex induced by a single-site S protein RBD mutation. A positive mutation-induced BFE change strengthens the binding between S protein and ACE2, which results in more infectious variants. Counts of antibody disruption represent the number of antibody and S protein complexes disrupted by a specific RBD mutation. Here, an antibody and S protein complex is considered disrupted if its binding affinity is reduced by more than 0.3 kcal/mol [18]. In addition, we calculate the antibody disruption ratio (%), which is the ratio of the number of disrupted antibody and S protein complexes over 130 known complexes. Ranks are computed from 683 observed RBD mutations.

Note that high-frequency mutation S477N does not significantly weaken any antibody and RBD binding, and thus does not appear in any prevailing variants.

The recent surge in COVID-19 infections is due to the occurrence of RBD co-mutations that combine two or more infectivity-strengthening mutations. The most dangerous future SARS-CoV-2 variants must be RBD co-mutations that combine infectivity-strengthening mutation(s) with antibody-disruptive mutation(s). A list of 1,139,244 RBD co-mutations that are decoded from 1,489,884 complete SARS-CoV-2 genome sequences can be found in Section S2.1.3 of the Supporting Information, and all of the non-degenerate RBD co-mutations with their frequencies, antibody disruption counts, total BFE changes, and the first detection dates and countries can be found in Section S2.1.4 of the Supporting Information. Figure 2 illustrates the properties of S protein RBD 2, 3, and 4 co-mutations. The height of each bar shows the predicted total BFE change of each set of co-mutations on RBD, the color represents the natural log of frequency for each set of RBD co-mutations, and the number at the top of each bar is the AI-predicted number of antibody-RBD complexes that each set of RBD co-mutations may disrupt based on a total of 130 RBD and antibody complexes. Notably, for a specific set of co-mutations, the higher the number at the top of the bar, the stronger its ability to break through vaccines. From Figure 2, RBD 2 co-mutation set [L452R, T478K] (Delta variant) has the highest frequency (219,362) and the highest BFE change (1.575 kcal/mol). Moreover, the Delta variant would disrupt 40 antibody-RBD complexes, suggesting that Delta would not only enhance the infectivity but also be a vaccine-breakthrough variant. Similarly, [L452Q, F490S] (Lambda) is another co-mutation with high frequency, high BFE change (1.421 kcal/mol), and high antibody disruption count (59). 
In addition, Lambda is considered to be more dangerous than Delta due to its higher antibody disruption count. Further, [R346K, E484K, N501Y] (Mu variant) has a BFE change of 0.768 kcal/mol and a high antibody disruption count (60). It is not as infectious as Delta and Lambda, but has a similar ability to Lambda in escaping vaccines. Note that among all VOCs and VOIs, Beta has the highest ability to break through vaccines, but its infectivity is relatively low (BFE change: 0.656 kcal/mol). Furthermore, high-frequency 2 co-mutation sets

It is important to understand the general trend of SARS-CoV-2 evolution. To this end, we carry out a statistical analysis of RBD co-mutations. Among 1,489,884 SARS-CoV-2 genome isolates, a total of 1,113 distinct 2 co-mutations, 612 distinct 3 co-mutations, and 217 distinct 4 co-mutations are found. Figures 3a, b, and c illustrate the 2D histograms of 2, 3, and 4 co-mutations, respectively. The x-axis is the antibody disruption count, and the y-axis shows the total BFE change. Figure 3a shows that there are 82 RBD 2 co-mutations that have BFE changes in the range of [0.600, 0.799] kcal/mol and disrupt 40 to 49 antibodies. According to Figure 3b, there are 170 unique 3 co-mutations that have large BFE changes of S protein and ACE2 in the range of [1.500, 1.999] kcal/mol. In Figure 3c, it is seen that almost all of the 4 co-mutations on RBD have BFE changes greater than 0.5 kcal/mol and weaken the binding of S protein with at least 60 antibodies. Figures 3d, e, and f are the histograms of total BFE changes, natural log of frequencies, and antibody disruption counts for RBD 2, 3, and 4 co-mutations. It can be found that most of the 2, 3, and 4 RBD co-mutations have positive total BFE changes, and the larger the number of mutations in an RBD co-mutation, the higher its antibody disruption count. 
In summary, co-mutations with larger antibody disruption counts and higher BFE changes will grow faster. We anticipate that when most of the population is vaccinated, vaccine-resistant mutations will become a more viable mechanism for viral evolution.

] that was first found in BR on April 06, 2020, has a BFE change of 0.625 kcal/mol and antibody disruption count 84, is an emerging vaccine-breakthrough co-mutation in Brazil. In addition, co-mutation set [L452Q, F490S] (cyan lines) on the Lambda variant has recently drawn much attention due to its potential ability to resist vaccines and enhance infectivity, which is consistent with our predictions that co-mutation set [L452Q, F490S] has a relatively significant BFE change of S protein and ACE2 (1.421 kcal/mol) and would reduce the RBD binding with 59 antibodies. Lambda has already spread to every country shown in Figure 4.

In this section, we first introduce the workflow of deep learning-based prediction of mutation-induced BFE changes of protein-protein interactions used in the present SARS-CoV-2 variant analysis and prediction. It includes four steps, as shown in Figure 5: (1) data pre-processing; (2) training data preparation; (3) feature generation for protein-protein interaction complexes; and (4) prediction of protein-protein interactions by deep neural networks (see Section S5 of the Supporting Information). Next, the validation of our machine learning-based model is demonstrated, suggesting consistent and reliable results compared to the experimental deep mutational data.

The first step is to pre-process the original SARS-CoV-2 sequence data. In this step, a total of 1,489,884 complete SARS-CoV-2 genome sequences with high coverage and exact collection date are downloaded from the GISAID database [34] (https://www.gisaid.org/) as of August 05, 2021. Next, the 1,489,884 complete SARS-CoV-2 genome sequences are rearranged according to the reference genome downloaded from GenBank (NC_045512.2) [35], and multiple sequence alignment (MSA) is applied by using Clustal Omega with default parameters. Then, single nucleotide polymorphism (SNP) genotyping is applied to measure the genetic variations between different isolates of SARS-CoV-2 by analyzing the rearranged sequences [36, 37], which is of paramount importance for tracking the genotype changes during the pandemic. The SNP genotyping captures all of the differences between patients' sequences and the reference genome, which decodes a total of 28,478 unique single mutations from 1,489,884 complete SARS-CoV-2 genome sequences. Among them, 4,653 non-degenerate mutations on the S protein and 683 non-degenerate mutations on the S protein RBD (S protein residues from 329 to 530) are detected. In this work, the co-mutation analysis is more crucial than the unique single mutation analysis. Therefore, for each SARS-CoV-2 isolate, we extract all of the mutations on the S protein RBD, which together are called the RBD co-mutation of that isolate. By doing this, a total of 1,139,244 RBD co-mutations are captured. Notably, the SARS-CoV-2 unique single mutations in the world are available at Mutation Tracker. The analysis of RBD mutations is available at Mutation Analyzer.
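The co-mutation extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: it presumes that SNP genotyping has already produced, for each isolate, a list of S protein amino-acid substitutions written as strings such as "L452R"; the function names and the toy isolates are hypothetical, and only the RBD residue range 329 to 530 is taken from the text.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

RBD_START, RBD_END = 329, 530  # S protein residue range of the RBD (from the text)
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")  # e.g. "L452R"


def rbd_comutation(spike_mutations: Iterable[str]) -> Tuple[str, ...]:
    """Return the RBD mutations of one isolate, sorted by residue position."""
    hits = []
    for mutation in spike_mutations:
        match = SUBSTITUTION.match(mutation)
        if match is None:
            continue  # skip anything that is not a simple substitution
        position = int(match.group(2))
        if RBD_START <= position <= RBD_END:
            hits.append((position, mutation))
    return tuple(mutation for _, mutation in sorted(hits))


def comutation_frequencies(isolates: Iterable[List[str]]) -> Counter:
    """Count how many isolates carry each distinct set of RBD mutations."""
    counts: Counter = Counter()
    for mutations in isolates:
        key = rbd_comutation(mutations)
        if key:  # ignore isolates without any RBD mutation
            counts[key] += 1
    return counts


# Toy input: four hypothetical isolates with their S protein substitutions.
toy_isolates = [
    ["L452R", "T478K", "D614G", "P681R"],
    ["T478K", "L452R"],
    ["N501Y", "D614G"],
    ["D614G"],
]
frequencies = comutation_frequencies(toy_isolates)
print(frequencies)
```

Sorting by residue position makes the key independent of the order in which mutations are reported, so the first two toy isolates are counted as the same co-mutation set.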

In this section, the process of the machine learning-based BFE change predictions is introduced. Once the data pre-processing and SNP genotyping are carried out, we first proceed with the training data preparation, which plays a key role in reliability and accuracy. A library of 130 antibody-RBD complexes as well as an ACE2-RBD complex is obtained from the Protein Data Bank (PDB). RBD mutation-induced BFE changes of these complexes are evaluated by the following machine learning model. Given the urgency of the pandemic and the rapid change of RNA viruses, massive experimental BFE change data for SARS-CoV-2 are rarely available, while, on the other hand, next-generation sequencing data are relatively easy to collect. In the training process, the dataset of mutation-induced BFE changes from the SKEMPI 2.0 dataset [38] is used as the basic training set, while next-generation sequencing datasets are added as assistant training sets. SKEMPI 2.0 contains 7,085 single- and multi-point mutations, of which 4,169 entries in 319 different protein complexes are used for the machine learning model training. The mutational scanning data consist of experimental data of the binding of ACE2 and RBD induced by mutations on ACE2 [39] and RBD [40, 41], and the binding of CTC-445.2 and RBD with mutations on both proteins [41].

Next, the feature generation for protein-protein interaction complexes is performed. The element-specific algebraic topological analysis on complex structures is implemented to generate topological bar codes [30, 42-44]. In addition, biochemistry and biophysics features such as Coulomb interactions, surface areas, electrostatics, etc., are combined with topological features [20]. The detailed information about the topology-based models is given in subsection 3.3. Lastly, deep neural networks for SARS-CoV-2 are constructed for the BFE change prediction of protein-protein interactions [30]. The detailed descriptions of the dataset and machine learning model are found in the literature [19, 30, 45] and are available at TopNetmAb.

Among all features generated for machine learning prediction, the application of topology theory brings the model to a whole new level. The features summarized as other inputs are called auxiliary features and are described in Section S4 of the Supporting Information. In this section, a brief introduction to the theory of topology is given. Algebraic topology [42, 43] has achieved tremendous success in many fields, including the description of biochemical and biophysical properties [44]. Special treatment should be implemented for biological applications to describe element types and amino acids in polypeptides mathematically, namely element-specific and site-specific persistent homology [19, 32]. To construct the algebraic topological features of a protein-protein interaction model, a series of element subsets for complex structures should be defined, which considers atoms from the mutation sites, atoms in the neighborhood of the mutation site within a certain distance, atoms from the antibody binding site, atoms from the antigen binding site, and atoms in the system that belong to an element type $E \in \{\mathrm{C}, \mathrm{N}, \mathrm{O}\}$, denoted $\mathcal{A}_{\mathrm{ele}}(E)$. Under the element/site-specific construction, simplicial complexes are constructed on point clouds formed by atoms. For example, consider a set of $k+1$ independent points from one element/site-specific set, $U = \{u_0, u_1, \ldots, u_k\}$. The $k$-simplex $\sigma$ is the convex hull of the $k+1$ independent points in $U$, i.e., the set of all convex combinations of these points. For example, a 0-simplex is a point and a 1-simplex is an edge. An $m$-face of the $k$-simplex is the convex hull of $m+1$ of its $k+1$ vertices, with $m < k$, so that the sum of all its $(k-1)$-faces is the boundary of the $k$-simplex $\sigma$:

$$\partial_k \sigma = \sum_{i=0}^{k} (-1)^i \, [u_0, \ldots, \hat{u}_i, \ldots, u_k],$$

where $[u_0, \ldots, \hat{u}_i, \ldots, u_k]$ consists of all vertices of $\sigma$ excluding $u_i$. A collection of finitely many simplices is a simplicial complex. In the model, the Vietoris-Rips (VR) complex (a simplex is included if and only if $B(u_{i_j}, r) \cap B(u_{i_{j'}}, r) \neq \emptyset$ for $j, j' \in [0, k]$) is used for dimension 0 topology, and the alpha complex (if and only if $\cap_{u_{i_j} \in \sigma} B(u_{i_j}, r) \neq \emptyset$) is used for point clouds in dimensions 1 and 2 topology [44].

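The $k$-chain $c_k$ of a simplicial complex $K$ is a formal sum of the $k$-simplices in $K$, $c_k = \sum_i \alpha_i \sigma_i$, where the coefficients $\alpha_i$ are chosen from $\mathbb{Z}_2$. Thus, the boundary operator on a $k$-chain $c_k$ is

$$\partial_k c_k = \sum_i \alpha_i \, \partial_k \sigma_i,$$

such that $\partial_k : C_k \to C_{k-1}$, and it follows that boundaries are boundaryless,

$$\partial_{k-1} \partial_k = 0.$$

A chain complex is

$$\cdots \xrightarrow{\partial_{k+1}} C_k \xrightarrow{\partial_k} C_{k-1} \xrightarrow{\partial_{k-1}} \cdots \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0$$

as a sequence of complexes by boundary maps. Therefore, the Betti numbers are given as the ranks of the $k$th homology group $H_k$, $\beta_k = \mathrm{rank}(H_k)$, where $H_k = Z_k / B_k$ is the quotient of the $k$-cycle group $Z_k$ by the $k$-boundary group $B_k$. The Betti numbers are the key to the topological features, where $\beta_0$ gives the number of connected components, such as the number of atoms, $\beta_1$ is the number of cycles in the complex structure, and $\beta_2$ gives the number of cavities. These numbers capture abstract properties of the 3D structure.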
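To make the relation between boundary operators and Betti numbers concrete, the sketch below computes $\beta_k = \dim Z_k - \dim B_k$ over $\mathbb{Z}_2$ for a tiny simplicial complex, using $\dim Z_k = n_k - \mathrm{rank}\,\partial_k$ and $\dim B_k = \mathrm{rank}\,\partial_{k+1}$, where $n_k$ is the number of $k$-simplices. It is a plain-Python illustration with hypothetical helper names, not the implementation used in this work, and it is only practical for very small complexes.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Simplex = Tuple[int, ...]


def rank_mod2(matrix: List[List[int]]) -> int:
    """Rank of a 0/1 matrix over Z_2 via Gaussian elimination."""
    rows = [[x % 2 for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def boundary_matrix(k_simplices: Sequence[Simplex],
                    faces: Sequence[Simplex]) -> List[List[int]]:
    """Matrix of the boundary operator: rows are (k-1)-faces, columns are k-simplices."""
    index = {face: i for i, face in enumerate(faces)}
    matrix = [[0] * len(k_simplices) for _ in faces]
    for j, simplex in enumerate(k_simplices):
        for face in combinations(simplex, len(simplex) - 1):
            matrix[index[face]][j] = 1  # signs vanish over Z_2
    return matrix


def betti_numbers(simplices_by_dim: List[List[Simplex]]) -> List[int]:
    """Betti numbers of a complex given as sorted vertex tuples per dimension."""
    top = len(simplices_by_dim)
    ranks = [0]  # the boundary of a 0-simplex is zero
    for k in range(1, top):
        ranks.append(rank_mod2(boundary_matrix(simplices_by_dim[k],
                                               simplices_by_dim[k - 1])))
    ranks.append(0)  # no simplices above the top dimension
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top)]


vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow_triangle = betti_numbers([vertices, edges, []])
filled_triangle = betti_numbers([vertices, edges, [(0, 1, 2)]])
print(hollow_triangle, filled_triangle)
```

The hollow triangle has one connected component and one cycle, while filling in the 2-simplex removes the cycle, which is the behavior the Betti numbers are meant to capture.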

Finally, a single simplicial complex cannot give the whole picture of the protein-protein interaction structure. A filtration of a topological space is needed to extract more properties. A filtration is a nested sequence of subcomplexes such that

$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K.$$

Each element of the sequence generates the Betti numbers $\{\beta_0, \beta_1, \beta_2\}$ and, consequently, a series of Betti numbers in three dimensions is constructed and applied as the topological fingerprints in Figure 5a.
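As an illustration of how a filtration turns a point cloud into a series of Betti numbers, the sketch below tracks $\beta_0$ (the number of connected components) of a Vietoris-Rips-type construction as the ball radius $r$ grows: two points are connected when their balls of radius $r$ intersect, i.e., when their distance is at most $2r$. The function name and the toy coordinates are hypothetical, and only dimension 0 is handled; higher-dimensional Betti numbers require a dedicated persistent homology library.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def betti0_curve(points: Sequence[Point], radii: Sequence[float]) -> List[int]:
    """Number of connected components at each filtration radius (union-find)."""
    n = len(points)
    curve = []
    for r in radii:
        parent = list(range(n))

        def find(i: int) -> int:
            # Follow parent links to the root, compressing the path on the way.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(n):
            for j in range(i + 1, n):
                # Balls of radius r around both points intersect.
                if math.dist(points[i], points[j]) <= 2 * r:
                    root_i, root_j = find(i), find(j)
                    if root_i != root_j:
                        parent[root_i] = root_j
        curve.append(len({find(i) for i in range(n)}))
    return curve


# Three toy "atoms" on a line, at x = 0, 1, and 5.
toy_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
toy_curve = betti0_curve(toy_points, [0.0, 0.5, 1.9, 2.0])
print(toy_curve)
```

The resulting sequence of $\beta_0$ values, one per filtration radius, is a simple example of the kind of Betti-number series that serves as a topological fingerprint.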

The validation of our machine learning predictions for mutation-induced BFE changes against experimental data has been demonstrated in recently published papers [20, 30]. First, we showed high correlations between experimental deep mutational enrichment data and predictions for the binding complex of SARS-CoV-2 S protein RBD and protein CTC-445.2 [20] and the binding complex of SARS-CoV-2 RBD and ACE2 [30]. In comparison with experimental data on antibody therapies in clinical trials of emerging mutations, our predictions achieve a Pearson correlation of 0.80 [30]. Considering the BFE changes induced by RBD mutations for the ACE2 and RBD complex, predictions on mutations L452R and N501Y have a highly similar trend with experimental data [30]. Meanwhile, as we presented in [18], high-frequency mutations all have positive BFE changes. Moreover, for multi-mutation tests, our BFE change predictions have the same pattern as experimental data on the impact of SARS-CoV-2 variants on major antibody therapeutic candidates, where the BFE changes are accumulative for co-mutations [30].
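The statement that BFE changes are accumulative for co-mutations suggests a simple additive estimate, sketched below. This is an assumption-laden illustration rather than the model itself: the function name is hypothetical, the value for L452R (0.58 kcal/mol) is quoted from the text, and the value for T478K is rounded to 1.00 kcal/mol from the text's "nearly 1.00 kcal/mol", so the resulting sum only approximates the reported total of 1.575 kcal/mol for [L452R, T478K].

```python
from typing import Dict, Iterable


def total_bfe_change(comutation: Iterable[str],
                     single_bfe: Dict[str, float]) -> float:
    """Additive estimate: sum of single-mutation RBD-ACE2 BFE changes (kcal/mol)."""
    return sum(single_bfe[mutation] for mutation in comutation)


# Single-mutation BFE changes in kcal/mol.
# L452R is quoted from the text; T478K is rounded from "nearly 1.00".
single_bfe = {"L452R": 0.58, "T478K": 1.00}

delta_estimate = total_bfe_change(["L452R", "T478K"], single_bfe)
print(f"{delta_estimate:.2f} kcal/mol")
```

A positive total indicates strengthened RBD-ACE2 binding under this additive approximation.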

Recent studies on the potency of mAb CT-P59 in vitro and in vivo against Delta variants [46] show that the neutralization of CT-P59 is reduced by L452R (13.22 ng/mL) and is retained against T478K (0.213 ng/mL). In our predictions [30], L452R induces a negative BFE change (-2.39 kcal/mol), and T478K produces a positive BFE change (0.36 kcal/mol). In Figure 5b, the fold changes for experimental and predicted values are presented. Additionally, Figure 5c compares the experimental pseudovirus infection changes with the predicted BFE changes of the ACE2 and S protein complex induced by mutations L452R and N501Y, where the experimental data are obtained in reference to D614G and reported in relative luciferase units [25]. It indicates that the binding of RBD and ACE2 dominates the infectivity of SARS-CoV-2. More details can be found in Section S6 of the Supporting Information.

The SARS-CoV-2 SNP data in the world are available at Mutation Tracker. The most observed SARS-CoV-2 RBD mutations are available at Mutation Analyzer. The information on 130 antibodies with their corresponding PDB IDs can be found in the Supplementary Data. The SARS-CoV-2 S protein RBD SNP and non-degenerate co-mutation data can be found in Section S2.1.4 of the Supporting Information. The TopNetTree model is available at TopNetmAb.

The supporting information is available for S1 Overview of SARS-CoV-2 prevailing and emerging variants S2 

Emerging breakthrough variants in COVID-19 devastated countries

SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

Review of COVID-19 antibody therapies

SARS-CoV-2 neutralizing antibody LY-CoV555 in outpatients with COVID-19

Bats are natural reservoirs of SARS-like coronaviruses

Identification of two critical amino acid residues of the severe acute respiratory syndrome coronavirus spike protein for its variation in zoonotic tropism transition via a double substitution strategy

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein

A human monoclonal antibody blocking SARS-CoV-2 infection

Receptor-binding domain-specific human neutralizing monoclonal antibodies against SARS-CoV and SARS-CoV-2

The impact of receptor-binding domain natural mutations on antibody recognition of SARS-CoV-2

Mechanisms of viral mutation

Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear

Structural and physico-chemical effects of disease and non-disease nsSNPs on proteins

Loss of protein structure stability as a major causative factor in monogenic disease

Host immune response driving SARS-CoV-2 evolution

Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus

Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA

Vaccine-escape and fast-growing mutations in the United Kingdom, the United States

Mutations strengthened SARS-CoV-2 infectivity

Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies

Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England

Increased resistance of SARS-CoV-2 variant P.1 to antibody neutralization

Efficacy of ChAdOx1 nCoV-19 (AZD1222) vaccine against SARS-CoV-2 variant of concern 202012/01 (B.1.1.7): an exploratory analysis of a randomised controlled trial

Efficacy of the ChAdOx1 nCoV-19 COVID-19 vaccine against the B.1.351 variant

Transmission, infectivity, and antibody neutralization of an emerging SARS-CoV-2 variant in California carrying a L452R spike protein mutation

SARS-CoV-2 spike E484K mutation reduces antibody neutralisation. The Lancet Microbe

A novel SARS-CoV-2 variant of concern, B.1.526, identified in New York. medRxiv

Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies

SARS-CoV-2 Lambda variant exhibits higher infectivity and immune resistance. bioRxiv

Revealing the threat of emerging SARS-CoV-2 mutations to antibody therapies

Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges

A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation

SARS-CoV-2 Spike Mutations, L452R, T478K, E484Q and P681R, in the Second Wave of COVID-19 in Maharashtra

GISAID: Global initiative on sharing all influenza data-from vision to reality

A new coronavirus associated with human respiratory disease in China

Genotyping coronavirus SARS-CoV-2: methods and implications

SNP genotyping: technologies and biomedical applications

SKEMPI 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation

The sequence of human ACE2 is suboptimal for binding the S spike protein of SARS coronavirus 2

Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding

De novo design of potent and resilient hACE2 decoys to neutralize SARS-CoV-2

Topological persistence and simplification

Persistent homology analysis of protein structure, flexibility, and folding. International journal for numerical methods in biomedical engineering

Mutations on COVID-19 diagnostic targets

Therapeutic efficacy of CT-P59 against P.1 variant of SARS-CoV-2. bioRxiv