key: cord-0922435-zssn6pig authors: Chen, Huiting; Zhu, Zhaozhong; Qiu, Ye; Ge, Xingyi; Zheng, Heping; Peng, Yousong title: Prediction of coronavirus 3C-like protease cleavage sites using machine-learning algorithms date: 2022-05-02 journal: Virol Sin DOI: 10.1016/j.virs.2022.04.006 sha: 5112ff92e6af704e6d0e123773fc37abd2d9d2d8 doc_id: 922435 cord_uid: zssn6pig
The coronavirus 3C-like (3CL) protease, a cysteine protease, plays an important role in viral infection and immune escape. However, there is still a lack of effective tools for determining the cleavage sites of the 3CL protease. This study systematically investigated the diversity of the cleavage sites of the coronavirus 3CL protease on the viral polyprotein, and found that the cleavage motifs were highly conserved for viruses in the genera Alphacoronavirus, Betacoronavirus and Gammacoronavirus. Strong residue preferences were observed at the positions neighboring the cleavage sites. A random forest (RF) model was built to predict the cleavage sites of the coronavirus 3CL protease based on the representation of residues in cleavage motifs by amino acid indexes, and the model achieved an AUC of 0.96 in cross-validations. The RF model was further tested on an independent test dataset composed of cleavage sites on 99 proteins from multiple coronavirus hosts. It achieved an AUC of 0.95 and correctly predicted 80% of the cleavage sites. Then, 1,352 human proteins were predicted by the RF model to be cleaved by the 3CL protease. These proteins were enriched in several GO terms related to the cytoskeleton, such as the microtubule, actin and tubulin. Finally, a webserver named 3CLP was built to predict the cleavage sites of the coronavirus 3CL protease based on the RF model. Overall, the study provides an effective tool for identifying cleavage sites of the 3CL protease and provides insights into the molecular mechanism underlying the pathogenicity of coronaviruses.
positions were further extracted and were defined as the cleavage motifs. A total of 905 cleavage motifs were obtained from the genera Alphacoronavirus, Betacoronavirus and Gammacoronavirus. They were then de-duplicated, which resulted in 265 unique cleavage motifs. These were taken as positive samples in the modeling.

To obtain the negative samples, all Qs in the polyprotein sequences of the 14 coronavirus species mentioned above were identified, except those at the cleavage sites; then, for each Q, a non-cleavage motif containing the three neighboring AAs upstream of Q and the two AAs downstream of Q was built. A total of 6,828 non-cleavage motifs were obtained. Based on one-hot encoding, these non-cleavage motifs were grouped into 265 clusters by the k-means method using the sklearn.cluster module in Python (version 3.7) (Pedregosa et al., 2011). One motif was randomly selected from each cluster, which led to 265 negative samples.

Then, the positive samples were encoded with the one-hot method and were clustered into five groups by the k-means method. To ensure the balance of positive and negative samples in the training and validation process, the negative samples were randomly separated into five groups to match the positive sample groups. The above processes were repeated five times and five datasets were generated. The size of each group in the five datasets is listed in Table 1.

Three machine-learning algorithms, random forest (RF), support vector machine (SVM) and naive Bayes (NB), were used to predict the cleavage sites of the 3CL protease and were implemented using functions of [...]

The PCA of the AA indexes was performed using the sklearn.decomposition module in Python (version 3.7) (Pedregosa et al., 2011).
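The negative-sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`q_motifs`, `one_hot`, `sample_negatives`) are hypothetical, the six-residue window is assumed to include Q itself, and positions are assumed to be 0-based indexes of Q. Only `KMeans` from `sklearn.cluster` is taken from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_POS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def q_motifs(sequence, cleavage_positions=(), upstream=3, downstream=2):
    """Return (index_of_Q, motif) for every Q that is not a known cleavage site.

    The motif spans `upstream` residues before Q, Q itself, and
    `downstream` residues after Q (an assumed window layout).
    """
    known = set(cleavage_positions)
    found = []
    for i, aa in enumerate(sequence):
        if aa != "Q" or i in known:
            continue
        if i - upstream < 0 or i + downstream >= len(sequence):
            continue  # window would run off the sequence
        motif = sequence[i - upstream : i + downstream + 1]
        if all(c in AA_POS for c in motif):  # skip non-standard residues
            found.append((i, motif))
    return found


def one_hot(motif):
    """Concatenate one 20-dimensional indicator vector per motif position."""
    vec = np.zeros(len(motif) * len(AMINO_ACIDS))
    for pos, aa in enumerate(motif):
        vec[pos * len(AMINO_ACIDS) + AA_POS[aa]] = 1.0
    return vec


def sample_negatives(motifs, n_clusters, seed=0):
    """Cluster one-hot encoded motifs with k-means and draw one motif per cluster."""
    unique = sorted(set(motifs))
    X = np.array([one_hot(m) for m in unique])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    rng = np.random.default_rng(seed)
    picked = []
    for cluster in range(n_clusters):
        members = [m for m, lab in zip(unique, labels) if lab == cluster]
        if members:
            picked.append(members[rng.integers(len(members))])
    return picked
```

With the counts reported in the text, `sample_negatives` would be called on the 6,828 non-cleavage motifs with `n_clusters=265`. Drawing one motif per cluster, rather than sampling uniformly, keeps the negative set diverse.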
The work flow of the modeling process in predicting coronavirus 3CL protease cleavage sites was shown in [...] For example, when using one AA index in AA encoding, the AAs in the five positions were transformed into a numeric vector of length 5 (f1, f2, f3, f4, f5). Then, five rounds of five-fold cross-validation were used to evaluate the performance of the three machine-learning algorithms, i.e., RF, SVM and NB, and were also used to select the number of AA indexes used in the modeling (see the Results section for details).

A total of 265 cleavage motifs (positive samples) and an equal number of non-cleavage motifs (negative samples, see Materials and methods) were obtained to build the machine-learning model for predicting the cleavage sites of the coronavirus 3CL protease. Three machine-learning algorithms, RF, SVM and NB, were used to build the model, and a strict testing strategy of five rounds of five-fold cross-validation based on k-means clustering of the datasets (Table 1) was used to evaluate and compare the predictive performance of the algorithms. When using one AA index in the modeling, there were a total of 566 models for each algorithm. The RF models had a median AUC of 0.88, which was significantly higher than those of both the SVM and NB models (Fig. 3A). Therefore, the RF algorithm was used in the further modeling.

To improve the model performance, the top 10% of AA indexes (58 AA indexes) in the RF models (shown in Supplementary Table S3) were analyzed with the PCA method. The first and second components are visualized in Fig. 3B. Four AA index clusters were obtained by k-means clustering. To reduce the co-linearity of features in the modeling, AA indexes were combined by cluster.
For example, when using two AA indexes in the modeling, two AA indexes were randomly selected from two different clusters independently. The RF models using all possible combinations of two, three and four AA indexes were built and evaluated. As shown in Fig. 3C, the RF models with two AA indexes had higher AUCs than those with one AA index; the model performance improved further when using three AA indexes; however, it decreased when using four AA indexes. Overall, the RF models using three AA indexes performed significantly better than those with one, two or four AA indexes. Therefore, the RF model which had the highest [...] sensitivity, precision of the model were 0.88, 0.80, and 0.96, respectively. The RF model used the AA indexes MEEJ800102, BIOV880102 and FASG760101, which refer to "the retention coefficient in high-pressure liquid chromatography", "Information value for accessibility" and "Molecular weight", respectively.

To test the RF model in predicting the cleavage sites of the coronavirus 3CL protease, an independent test dataset derived from host proteins was manually curated from the literature (Supplementary Table S4 [...] Figure S2). However, the area under the precision-recall curve (AUPRC) of NetCorona was smaller than that of the RF model (0.32 vs 0.34) (Supplementary Figure S2). When the sensitivity was in the range of 0.4 to 0.6, the prediction precision of both models was relatively stable, and the prediction precision of the RF model was 0.1 higher than that of NetCorona (Supplementary Figure S2).

Then, the RF model was used to predict the potential cleavage sites of the coronavirus 3CL protease on human proteins. To increase the prediction precision of the RF model, the cutoff for calling a positive was set to 0.99 (Supplementary Table S5).
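The scanning step with a probability cutoff can be sketched as below. This is an illustrative sketch under several assumptions, not the published 3CLP model: the function names are hypothetical, the training motifs are toy strings invented for the example, and a single stand-in index (approximate standard residue molecular weights, which may differ slightly from the FASG760101 entry in AAindex) replaces the three AA indexes used in the paper. Q itself is assumed to be excluded from the encoding, so one index yields a vector of length 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Approximate residue molecular weights (Da); a stand-in for one AA index.
MOL_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}

# Toy motifs invented for illustration only (not data from the study).
POS = ["AVLQSG", "TVLQAG", "ATLQAI", "VRLQSN"]
NEG = ["KDEQWP", "GPDQEK", "WHEQPD", "DKPQYE"]


def encode(motif, indexes=(MOL_WEIGHT,)):
    """Encode the five non-Q positions (3 upstream, 2 downstream) with each index."""
    flanks = motif[:3] + motif[4:]
    return [index[aa] for index in indexes for aa in flanks]


def train(pos, neg, seed=0):
    """Fit a random forest on positive (label 1) and negative (label 0) motifs."""
    X = np.array([encode(m) for m in pos + neg], dtype=float)
    y = np.array([1] * len(pos) + [0] * len(neg))
    return RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)


def predict_sites(model, sequence, cutoff=0.99):
    """Return (index_of_Q, motif, probability) for each Q scoring at or above cutoff."""
    hits = []
    for i, aa in enumerate(sequence):
        if aa != "Q" or i < 3 or i + 2 >= len(sequence):
            continue
        motif = sequence[i - 3 : i + 3]
        if any(c not in MOL_WEIGHT for c in motif):
            continue  # skip windows with non-standard residues
        prob = model.predict_proba([encode(motif)])[0, 1]
        if prob >= cutoff:
            hits.append((i, motif, float(prob)))
    return hits


if __name__ == "__main__":
    model = train(POS, NEG)
    for hit in predict_sites(model, "MAVLQSGKKQTWHEQPDAA", cutoff=0.5):
        print(hit)
```

Raising `cutoff` trades recall for precision: fewer Q positions are reported, but those that remain carry a higher predicted probability.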
A total of 1,352 human proteins were predicted to be cleaved by the coronavirus 3CL protease, with 1,511 cleavage sites. Most human proteins had only one predicted cleavage site. Some proteins had more than three cleavage sites, such as the Golgin subfamily A member 3 (UniProtKB: Q08378). The GO enrichment analysis of the human proteins predicted to be cleaved by the coronavirus 3CL protease showed that, in the domain of biological process, they were enriched in processes of organization, assembly, movement, localization, and so on (Fig. 5A [...] analysis showed that these proteins were only enriched in two pathways, "Salmonella infection" and "Lysine degradation" (Supplementary Table S6).

[...] model developed here used the cleavage sites on polyproteins from 14 coronaviruses for modeling, which is more than three times the number used in Kiemer's study (Kiemer et al., 2004). Besides, our study used a very strict testing strategy by separating the dataset using the clustering method (Lu et al., 2021), which could reflect the ability of the model to predict cleavage sites on polyproteins of novel viruses or host proteins. In the independent testing on host proteins, the RF model predicted the cleavage sites with higher precision and recall than the neural network model developed in Kiemer's study (Fig. 4). It could predict 80% of the cleavage [...]

Besides cleaving the viral polyproteins, the coronavirus 3CL protease can also cleave proteins involved in the host innate immune response, such as NEMO and STAT2, thus evading the host immunity (Wang [...]

Table and Figures

Table 1 The size of five groups in each dataset. The positive group and the corresponding negative group had the same size.

Figure 1 The work flow of the modeling process.

The functional enrichment analysis of the human proteins which were predicted to be cleaved by the coronavirus 3CL protease.
Only the ten most enriched GO terms are shown. For more results, see Table S6.

SARS-CoV-2 Infection Leads to
Coronavirus main proteinase (3CLpro) structure: basis for design of anti-SARS drugs
Severe neurologic syndrome associated with Middle East respiratory syndrome corona virus (MERS-CoV)
Structural insights into SARS-CoV-2 proteins
MERS-CoV: Understanding the Latest Human Coronavirus Threat
Overview of lethal human coronaviruses
Peritonitis Virus Nsp5 Inhibits Type I Interferon Production by Cleaving NEMO at Multiple Sites
Profiling of substrate
Mechanism and inhibition of the papain-like protease
The Cytoskeleton as a Modulator of Aging and Neurodegeneration
Mat_peptide: comprehensive annotation of mature peptides from polyproteins in five virus families
Possible central nervous system infection by SARS coronavirus
Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics
SARS-CoV-2 proteases PLpro and 3CLpro cleave IRF3 and critical modulators of inflammatory pathways (NLRP12 and TAB1): implications for disease presentation across species
TDP-43 and Cytoskeletal Proteins in ALS
Mechanistic insights into COVID-19 by global analysis of the SARS-CoV-2
Scikit-learn: Machine Learning in Python
Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
Multiplex assays for the identification of serological signatures of -2 infection: an antibody-based diagnostic and machine learning study
On the size of the active site in proteases. I. Papain
Compositional diversity and evolutionary pattern of coronavirus accessory proteins
Prediction of HIV-1 protease cleavage site using a combination of sequence, structural, and physicochemical features
The Nonstructural Proteins Directing Coronavirus RNA Synthesis and Processing
Classifier for Identifying Host Protein Targets of the Dengue Protease
2021. 6-month neurological and psychiatric outcomes in 236 379 survivors of COVID-19: a retrospective cohort study using electronic health records
Neuromuscular disorders in severe acute respiratory syndrome
Feline coronavirus drug inhibits the main protease of SARS-CoV-2 and blocks virus replication
Diarrhea Virus 3C-Like Protease Regulates Its Interferon Antagonism by Cleaving NEMO
WHO, 2022. WHO Coronavirus (COVID-19) Overview
Detection of severe acute respiratory syndrome coronavirus in the brain: potential role of the chemokine mig in pathogenesis
clusterProfiler: an R package for comparing biological themes among gene clusters
Porcine Deltacoronavirus nsp5 Cleaves DCP1A to Decrease Its Antiviral Activity
Porcine deltacoronavirus nsp5 inhibits interferon-β production through the cleavage of NEMO
Porcine Deltacoronavirus nsp5 Antagonizes Type I Interferon Signaling by Cleaving STAT2