key: cord-0698435-rn2l2z3m authors: Ma, Yue; Hu, Yu; Xia, Binbin; Du, Pei; Wu, Lili; Liang, Mifang; Chen, Qian; Yan, Huan; Gao, George F.; Wang, Qihui; Wang, Jun title: Machine Learning Approach Effectively Predicts Binding Between SARS-CoV-2 Spike and ACE2 Across Mammalian Species — Worldwide, 2021 date: 2021-11-12 journal: China CDC Wkly DOI: 10.46234/ccdcw2021.235 sha: a65ce2932aab3c343ab117a378948c4cc2ef1abf doc_id: 698435 cord_uid: rn2l2z3m INTRODUCTION: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a recently emergent coronavirus of natural origin and caused the coronavirus disease (COVID-19) pandemic. The study of its natural origin and host range is of particular importance for source tracing, monitoring of this virus, and prevention of recurrent infections. One major approach is to test the binding ability of the viral receptor gene ACE2 from various hosts to SARS-CoV-2 spike protein, but it is time-consuming and labor-intensive to cover a large collection of species. METHODS: In this paper, we applied state-of-the-art machine learning approaches and created a pipeline reaching >87% accuracy in predicting binding between different ACE2 and SARS-CoV-2 spike. RESULTS: We further validated our prediction pipeline using 2 independent test sets involving >50 bat species and achieved >78% accuracy. A large-scale screening of 204 mammal species revealed 144 species (or 61%) were susceptible to SARS-CoV-2 infections, highlighting the importance of intensive monitoring and studies in mammalian species. DISCUSSION: In short, our study employed machine learning models to create an important tool for predicting potential hosts of SARS-CoV-2 and achieved the highest precision to our knowledge in experimental validation. This study also predicted that a wide range of mammals were capable of being infected by SARS-CoV-2. 
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the ongoing pandemic of coronavirus disease and has led to more than 229 million infections and 4.7 million fatalities as of September 23, 2021 (https://covid19.who.int). Despite a large number of investigations on the biology and pathology of SARS-CoV-2, as well as treatment of COVID-19, the virus and pandemic still pose a tremendous threat to global health and stability. The natural origin of this virus has gained consensus among scientific communities, but available evidence is still short of being conclusive. For instance, bats and pangolins have been proposed but disputes still remain (1), leaving room for misinformation and abuse. Identifying the host species susceptible to SARS-CoV-2, including its source and intermediate species, is still one of the central scientific objectives for COVID-19 research and will help provide information for monitoring and containing a potential viral reservoir as well as preventing recurring zoonosis, as in the case of influenza viruses. The entry of SARS-CoV-2 into host cells requires the binding of its spike protein to host angiotensin I converting enzyme 2 (ACE2), a process that has been intensely investigated. Blocking this binding with neutralizing monoclonal antibodies (mAbs) has been demonstrated to effectively prevent viral entry into cells in vitro and in vivo (2), and several mAbs were approved for clinical treatment of COVID-19 patients (3). Short peptides mimicking the structure of the ACE2 region that binds the viral spike protein have also been developed; these bind the receptor binding domain (RBD) of the spike protein with picomolar affinity and are effective in cell assays (4).
Besides serving as a target for treatment, the ability of the SARS-CoV-2 spike to bind ACE2 from non-human species indicates the susceptibility of those species to SARS-CoV-2 and, combined with ecological data and evolutionary evidence, might identify key species as probable origins and/or intermediate hosts of SARS-CoV-2. Screening the binding between ACE2 from a large-scale collection of species and the SARS-CoV-2 spike protein is thus highly desirable; in reality, however, experimental verification is tightly constrained by cost and time. Alternatively, bioinformatic approaches capable of predicting binding between the two proteins with high precision are helpful in prioritizing species of interest and excluding very unlikely species, reducing the cost and time required. Based on sequence similarity in ACE2 across species, Damas et al. (5) proposed a score predicting binding to the SARS-CoV-2 spike; since then, ACE2 from many species has been tested, and retrospectively it is clear that the approach is limited in its precision. Namely, ACE2 from all bat species (36 in total in their prediction) was predicted to be "low" or "very low" in binding to the SARS-CoV-2 spike, but later experiments demonstrated that ACE2 from 20 of these species (55.56%) could bind to the viral spike (6). Alongside bats, 17 out of 29 (58.62%) other mammals with ACE2 genes considered unlikely to bind to the SARS-CoV-2 spike were in fact able to bind as well (Supplementary Table S1, available in http://weekly.chinacdc.cn/). Thus, the currently available bioinformatic approach has an extremely high false negative rate and falls short of precisely predicting binding between the SARS-CoV-2 spike protein and ACE2 across species. We have therefore applied machine learning approaches to address the remaining challenges (see Supplementary Materials, available in http://weekly.chinacdc.cn/).
Machine learning methods can combine diverse and complex data and automatically learn features for prediction, classification, and regression. In biology, they have been successfully applied in establishing predictive and classification models using genomic features (7), metabolic markers (8), and many more (9). In our study, we selected five representative machine learning methods to perform classification (i.e., prediction of binding vs. non-binding), namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Adaboost (ADA), and Gradient Boosting Regression Tree (GBRT). As single estimators we chose SVM and DT because they are suitable for small training sets. However, single estimators tend to generalize poorly and lack robustness. To mitigate this issue, we additionally chose three ensemble methods (RF, ADA, and GBRT) for the construction of the prediction model. The five models were further equipped with a priori information to establish a combined prediction pipeline. A study on human ACE2 introduced mutations at 117 amino acid (AA) sites individually; at each site the AA was mutated to all possible alternative AAs, and the changes in affinity to the SARS-CoV-2 spike (relative to wildtype ACE2) were experimentally examined, providing quantitative reference data (10). [...] 14 are from our lab and currently being considered for independent publication), we aligned the ACE2 sequences of those species to the human ACE2 and replaced the extracted AAs with log2 enrichment ratios for the 117, 24, and 20 sites as the input data format (Figure 1A). We have deposited this pipeline and details of the method at https://github.com/mayuefine/Binding-prediction. The training and test sets contained 62 and 11 species, respectively, and the test set was set aside from the training process.
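The encoding step described above (align each ACE2 to human ACE2, pick the selected sites, and replace each residue with its log2 enrichment ratio) can be sketched as follows. This is a minimal illustration and not the deposited pipeline: the function name `encode_sites`, the three-site toy table, its numeric values, and the choice to encode gaps or unlisted residues as 0.0 are all assumptions made for the example.

```python
# Hypothetical sketch: turn one aligned ACE2 sequence into a numeric feature
# vector by looking up a log2 enrichment ratio for the residue at each
# selected site. All names and numbers below are illustrative assumptions.

def encode_sites(aligned_seq, sites, log2_ratios, missing=0.0):
    """Return one feature per selected site.

    aligned_seq : sequence aligned to human ACE2 (same column numbering)
    sites       : 1-based positions of the selected sites
    log2_ratios : dict mapping (position, residue) -> log2 enrichment ratio
    missing     : value used for gaps or residues absent from the table
                  (an assumption; the source does not say how these are handled)
    """
    features = []
    for pos in sites:
        residue = aligned_seq[pos - 1]  # convert 1-based site to 0-based index
        features.append(log2_ratios.get((pos, residue), missing))
    return features


# Toy lookup table for three made-up sites (values are NOT from Chan et al.).
toy_ratios = {
    (2, "S"): 0.0, (2, "T"): -1.5,
    (4, "K"): 0.0, (4, "N"): 0.75,
    (5, "E"): 0.0, (5, "D"): -0.25,
}
toy_sites = [2, 4, 5]

reference_like = encode_sites("MSSKE", toy_sites, toy_ratios)
variant = encode_sites("MTSND", toy_sites, toy_ratios)
with_gap = encode_sites("M-SNE", toy_sites, toy_ratios)
```

Encoded this way, every sequence becomes a fixed-length vector (117, 24, or 20 values for the three site groups used in the study) that a classifier can take as input.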
In order to screen for models with stable performance, we trained five models on three groups of site information (group 20, group 24, and group 117, each group containing 5 machine learning approaches). Finally, the predictions of the three groups were combined, and a combination of six models with the highest precision was chosen as our prediction pipeline out of a total of 408 combinations; this pipeline reached an in silico precision of circa 87.5% (Figure 1B) and was used for subsequent analysis. We used this pipeline to generate a prediction score for each ACE2 sequence, equal to the number of models predicting that it bound to the viral spike divided by the total number of models.
Note: After sequence alignment, information from chosen sites was transformed into vectors and fed to five different models, from which the optimal combination was chosen as the pipeline and used to predict available ACE2 sequences. After the prediction, we selected some of the sequences for experimental validation. Figure 1B showed that multiple combinations reached high precision using our testing dataset. [...] that we presume to influence binding between ACE2 and viral spike protein as well, based on the observation that the two bat species' ACE2 have different binding with the viral spike. Abbreviations: ACE2=angiotensin I converting enzyme 2; DT=decision tree; RF=random forest; GBRT=gradient boosting regression tree; ADA=adaboost; SVM=support vector machine.
Note: For families with multiple species, the branch is collapsed and the proportion predicted to bind is shown in Figure 2A. Blue species/families are those predicted not to bind. Abbreviations: ACE2=angiotensin I converting enzyme 2; SARS-CoV-2=severe acute respiratory syndrome coronavirus 2; SPR=surface plasmon resonance; KD=binding affinity.
Bat species of the order Chiroptera were of highest interest for tracing the origin and studying the host range of SARS-CoV-2, as bat species harbor multiple coronavirus species including the SARS virus. One of the strains most closely related to SARS-CoV-2, RaTG13, was found in horseshoe bats (Rhinolophus affinis) (14). Thus, we applied our pipeline to the bat species with ACE2 sequences available (59 in total) and predicted their ability to bind the SARS-CoV-2 spike protein. We then tested the precision of our prediction in two experimentally validated datasets, in which ACE2s with a prediction score >0.5 were considered likely to bind to the viral spike. We selected 12 bat ACE2s, expressed the proteins, and tested their ability to bind the viral spike with Surface Plasmon Resonance (SPR) and flow cytometry (Supplementary Table S2, available in http://weekly.chinacdc.cn/). Overall, 4 of the 6 ACE2s predicted to bind to the SARS-CoV-2 spike were validated to bind to the viral spike (Figure 2B and Supplementary Figure S1, available in http://weekly.chinacdc.cn/), together with 5 ACE2s confirmed not to bind out of 6 ACE2s predicted to be so. Here we achieved a precision of 80% (Figure 1C). Then, using another dataset of 46 bat species by Yan et al. (6), after excluding the 2 sequences contained in our training set, we predicted the binding capacity and achieved 78.26% precision as shown in Figure 1C. Thus, our unified pipeline incorporating multiple machine learning models and different sets of sites as input can confidently predict binding between bat ACE2s and viral spikes.
It also drew our attention that, during our validation, ACE2 sequences from Pteropus alecto and Pteropus vampyrus had identical AAs at all 117 sites we selected for input; however, P. alecto ACE2 could bind to the SARS-CoV-2 spike in our experimental system whereas P. vampyrus ACE2 had no detectable binding, suggesting that additional AAs affected the binding capacity. We compared the ACE2 sequences of these 2 species and identified in total 22 sites of difference between the 2.
Of these sites, 16 are identical to human ACE2 (12 for P. alecto and 4 for P. vampyrus) (Figure 1D and Figure 2C). This comparison provided extra information: one or more of the AAs that differ between P. alecto, P. vampyrus, and humans underlie the differences in binding to the viral spike protein but have not been discovered in available studies. Closer investigation revealed that this set of AAs was not directly involved in binding with the viral spike protein; their influence was thus indirect, likely exerted through the structure of the ACE2 protein or even through post-translational modifications including glycosylation. Eventually, we refined our models by incorporating the modified list of AAs as an input, and performed predictions on available ACE2 sequences from mammalian species (Supplementary Table S3, available in http://weekly.chinacdc.cn/; 204 in total, belonging to 69 families). This identified ACE2s of interest (likely to bind to the SARS-CoV-2 spike) from a total of 144 species, spread across 47 families (60.87%, Figure 2A). It is worth noting that the wide range of potential mammalian hosts agrees with the emerging evidence of SARS-CoV-2 presence across mammals. Aside from 5 species of Hominidae (primates), ACE2s were predicted to bind to the viral spike protein in: 13 species of Cercopithecidae (old world monkeys), 8 species of Pteropodidae (old world fruit bats), 7 species of Felidae (cats), 7 species of Bovidae (ruminants), 7 species of Mustelidae (containing minks), 6 species of Canidae (dogs), 3 species of Equidae (horses), 6 species of Cricetidae (muroid rodents), 4 species of Sciuridae (squirrels), and 3 species of Ursidae (bears). Even in all 3 families of marine mammals, ACE2s had a high likelihood of binding to the SARS-CoV-2 spike (in all 4 species of Phocidae, 4 of Delphinidae, and 3 of Otariidae, Figure 2B).
Our prediction was supported by emerging reports that white-tailed deer (family Cervidae) were positive for antibodies against SARS-CoV-2 in 2021, in addition to reports of dogs, cats, and minks being viable hosts for this virus. In summary, based on ACE2 sequence features, our study suggested that SARS-CoV-2 has an extremely large range of potential hosts and indicated the importance of investigating wild animals for the presence of the virus and monitoring its spread. In conclusion, our study employed machine learning models suitable for analyzing sequence data, incorporated established functional data with multiple features extracted from sequences, and achieved high precision in predicting binding between ACE2s from different species and the spike protein of SARS-CoV-2. The precision within the test data set was 87.5%, and in a total of 44 bat species, the group of mammals that attracted most concern, we achieved >78% precision as well, indicating that the model can be further expanded to predict susceptibility of more bat species once genomic sequences or ACE2 sequences become available (Supplementary Table S4, available in http://weekly.chinacdc.cn/). With the same approach we also screened the available ACE2 sequences across a large range of mammals and found that many of them require attention. Our pipeline can identify species of interest for tracing and analysis, helping to understand the potential origin and transmission routes of SARS-CoV-2. In terms of performance, our pipeline remains to be improved as more accurate machine-learning models and/or more a priori information continue to emerge.
First, limited by the number of experimentally validated sets and by the current understanding of ACE2-spike interactions, we had to limit the AAs in the ACE2 sequences used for training and prediction; our results already indicated that AAs in other parts of the sequence carry critical information that is currently unavailable, as in the case of P. alecto and P. vampyrus. In addition, growing concerns amid the COVID-19 pandemic lie in the fast-emerging variants of SARS-CoV-2, especially as mutations in ACE2-interacting AAs in the spike protein have already been shown to change binding affinity to human ACE2; whether they lead to host range changes and even broader transmission remains to be investigated. In summary, our approach has the potential, and will need, to be expanded to analyze binding abilities of different SARS-CoV-2 variants and ACE2s to forecast the potential spread of this virus and identify priority species for monitoring.
The 73 species' angiotensin I converting enzyme 2 (ACE2) sequences for constructing predictive models and evaluation were collected from published articles (1-2) and unpublished data. Overall, 11 of these 73 sequences were randomly selected as the test dataset for model evaluation and were not involved in model training. The sequences of mammalian ACE2 for prediction were downloaded as of September 22, 2020, yielding a total of 294 ACE2 sequences of mammalian species from 23 orders. We performed multiple sequence alignment of the 294 sequences with the human ACE2 sequence, using CLUSTAL (version 2.1, Conway Institute, UCD, Dublin, Ireland, parameter "complete multiple alignment") (3); sequences with more than 10 consecutive amino acids missing in the first 100 sites were excluded from the subsequent analysis, resulting in 272 ACE2 sequences (204 unique species). We selected key amino acid sites and used the log2 enrichment ratio values from Chan et al.
to label the amino acids for each ACE2 sequence (4) [...] 4), respectively. The sequences screened for these three sets of sites were divided into a training dataset and a test dataset at an 8∶2 ratio and used for training and testing of the model, respectively. As for prediction models, we used five different methods to train on three different collections of sites, namely support vector machine (SVM), Decision Tree, Random Forest, AdaBoost, and Gradient Boosting, resulting in 15 models of input data/methods. After hundreds of epochs of training, random combinations of the 15 models were evaluated based on precision (Precision=TP/(TP+FP), where TP: True Positive, FP: False Positive). We selected six model combinations for ACE2 sequence prediction in the subsequent analysis and defined the prediction score as Prediction Score=Pn/Mn, where Pn is the number of models predicting that a sequence has binding ability and Mn is the total number of models used for prediction. The threshold value for the prediction score was set to 0.5, i.e., a sequence with a prediction score ≥0.5 was considered able to bind to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The 272 sequences were also screened for sites for binding ability prediction. Model construction and prediction were carried out with the scikit-learn module (version 0.22.2) in Python 3 (Python Software Foundation, Fredericksburg, VA, USA). The functions used for model training were "svm," "DecisionTreeClassifier," "RandomForestClassifier," "AdaBoostClassifier," and "GradientBoostingClassifier."
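The scoring rule and the precision criterion above can be illustrated with a short sketch. It is not the authors' code: the model names, the toy votes and labels, and the exhaustive search over fixed-size combinations are assumptions for illustration (the text states only that random combinations of the 15 models were compared by precision).

```python
# Hypothetical sketch of the ensemble vote (Prediction Score = Pn / Mn), the
# 0.5 threshold, and precision-based selection of a model combination.
from itertools import combinations


def prediction_score(votes):
    """Pn / Mn: fraction of models voting 'binds' (1) for one sequence."""
    return sum(votes) / len(votes)


def predicts_binding(votes, threshold=0.5):
    """A sequence is called a binder when its score reaches the threshold."""
    return prediction_score(votes) >= threshold


def precision(y_true, y_pred):
    """TP / (TP + FP); returns 0.0 when nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p and t)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    return tp / (tp + fp) if (tp + fp) else 0.0


def best_combination(model_votes, y_true, size):
    """Try every combination of `size` models; keep the most precise one.

    model_votes maps a model name to its 0/1 votes, one per test sequence.
    Exhaustive enumeration is an assumption made for this toy example.
    """
    best = None
    for names in combinations(sorted(model_votes), size):
        y_pred = [
            predicts_binding([model_votes[n][i] for n in names])
            for i in range(len(y_true))
        ]
        p = precision(y_true, y_pred)
        if best is None or p > best[0]:
            best = (p, names)
    return best


# Toy data: four made-up models voting on four made-up test sequences.
toy_truth = [1, 1, 0, 0]
toy_votes = {
    "svm_20": [1, 1, 1, 0],
    "dt_24": [1, 0, 0, 0],
    "rf_117": [1, 1, 0, 0],
    "ada_117": [0, 1, 1, 1],
}
toy_best = best_combination(toy_votes, toy_truth, size=2)
```

With an even number of models a tie (score of exactly 0.5) counts as binding under the ≥0.5 rule, which is why pairs containing one over-permissive model lose precision in the toy data.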
The parameters used for SVM were gamma='scale' and class_weight={0:2}; the decision tree classifier used default parameters; the random forest classifier used n_estimators=600, oob_score=True, n_jobs=-1, and class_weight={0:2}; the AdaBoost classifier used base_estimator=DecisionTreeClassifier(max_depth=2) and n_estimators=500; and the gradient boosting classifier used n_estimators=100, learning_rate=1.0, max_depth=1, and random_state=0. All details are also available in our GitHub repository.
Twelve bat orthologs were randomly selected from the test sets. The full-length coding sequences (accession numbers are shown in Supplementary Table S2) of these orthologs were synthesized and cloned into the pEGFP-N1 vector for flow cytometry (FACS). The extracellular domain of these ACE2 orthologs was fused with the Fc domain of mouse IgG (mFc) and cloned into the pCAGGS expression vector for surface plasmon resonance (SPR). The SARS-CoV-2 receptor-binding domain (RBD) and SARS-CoV-2 N-terminal domain (NTD) proteins used for flow cytometry and SPR were expressed and purified from the supernatants of HEK293F cell cultures as described in our previous work (5). Proteins were stored in PBS buffer [1.8 mmol/L KH2PO4, 10 mmol/L Na2HPO4 (pH 7.4), 137 mmol/L NaCl, 2.7 mmol/L KCl]. The indicated pCAGGS plasmids were transiently transfected into HEK293T cells (ATCC CRL-3216). Supernatants containing mFc-tagged ACE2 proteins were collected and concentrated at 48 h post-transfection.
To test the binding between each of the 12 ACE2s and SARS-CoV-2 RBD, the 12 bat ACE2s fused with eGFP were expressed on the cell surface by transfecting each of the 12 pEGFP-N1-ACE2 plasmids into BHK21 cells (ATCC CCL-10) using PEI (Alfa).
Cell culture was replaced with fresh media ([...]
We tested the binding affinities between the mFc-tagged ACE2s and SARS-CoV-2 RBD or SARS-CoV RBD proteins by SPR using a BIAcore 8K (GE Healthcare), carried out at 25 °C in single-cycle mode. The PBST buffer [1.8 mmol/L KH2PO4, 10 mmol/L Na2HPO4 (pH 7.4), 137 mmol/L NaCl, 2.7 mmol/L KCl, and 0.05% (v/v) Tween 20] was used as the running buffer. The CM5 biosensor chip was first immobilized with anti-mIgG antibody (ZSGB-BIO, ZF-0513) as previously described (1). The supernatants containing mFc-tagged ACE2s were injected and captured by the antibody immobilized on the CM5 chip at approximately 300-600 response units. Serially diluted SARS-CoV-2 RBD protein was flowed over the chip surface, with another channel set as control. The chip was regenerated using pH 1.7 glycine after each reaction. The equilibrium dissociation constant (binding affinity, KD) for each pair of interactions was calculated with BIAcore 8K evaluation software (GE Healthcare, Chicago, IL, USA) by fitting to a 1∶1 Langmuir binding model. Data were analyzed using OriginLab (Origin 2018, OriginLab Corporation, Northampton, MA, USA). The phylogenetic tree was constructed by uploading the species names from the 272 sequences into NCBI Taxonomy Common Tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/). The phylogenetic tree was visualized with iTOL (6).
SUPPLEMENTARY TABLE S2. Results of binding between ACE2 from 12 bat species and SARS-CoV-2 spike performed in our study.
Evidence for SARS-CoV-2 related coronaviruses circulating in bats and pangolins in Southeast Asia
A therapeutic non-self-reactive SARS-CoV-2 antibody protects from lung pathology in a COVID-19 hamster model
De novo design of picomolar SARS-CoV-2 miniprotein inhibitors
Structural and functional basis of SARS-CoV-2 entry by using human ACE2
Functional and genetic analysis of viral receptor ACE2 orthologs reveals a broad potential host range of SARS-CoV-2
CLUSTAL: a package for performing multiple sequence alignment on a microcomputer
Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2
Molecular basis of cross-species ACE2 interactions with SARS-CoV-2-like viruses of pangolin origin
Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation
Broad host range of SARS-CoV-2 predicted by comparative and structural analysis of ACE2 in vertebrates
ACE2 receptor usage reveals variation in susceptibility to SARS-CoV and SARS-CoV-2 infection among bat species
Greenish naked-backed fruit bat 1.00 QJF77815.1
Note: >0.5 prediction score in our analysis indicates binding between ACE2 and SARS-CoV-2 spike. Abbreviations: ACE2=angiotensin I converting enzyme 2; SARS-CoV-2=severe acute respiratory syndrome coronavirus 2.