key: cord-0704629-hnvtj16v authors: Nagy, Ádám; Ligeti, Balázs; Szebeni, János; Pongor, Sándor; Győrffy, Balázs title: COVIDOUTCOME – Estimating COVID Severity Based on Mutation Signatures in the SARS-CoV-2 Genome date: 2021-04-02 journal: bioRxiv DOI: 10.1101/2021.04.01.438063 sha: 378eb7a42c735749167e6f52df0c44421da66920 doc_id: 704629 cord_uid: hnvtj16v Introduction Numerous studies demonstrate frequent mutations in the genome of SARS-CoV-2. Our goal was to statistically link mutations to severe disease outcome. Methods We used an automated machine learning approach where 1,594 viral genomes with available clinical follow-up data were used as the training set (797 “severe” and 797 “mild”). The best algorithm, based on random forest classification combined with the LASSO feature selection algorithm was employed to the training set to link mutation signatures and outcome. The performance of the final model was estimated by repeated, stratified, 10-fold cross validation (CV), then adjusted for multiple testing with Bootstrap Bias Corrected CV. Results We identified 26 protein and UTR mutations significantly linked to severe outcome. The best classification algorithm uses a mutation signature of 22 mutations as well as the patient’s age as the input and shows high classification efficiency with an AUC of 0.94 (CI: [0.912, 0.962]) and a prediction accuracy of 87% (CI: [0.830, 0.903]). Finally, we established an online platform (https://covidoutcome.com/) which is capable to use a viral sequence and the patient’s age as the input and provides a percentage estimation of disease severity. Discussion We demonstrate a statistical association between mutation signatures of SARS-CoV-2 and severe outcome of COVID-19. The established analysis platform enables a real-time analysis of new viral genomes. KEY MESSAGES A statistical link between SARS-Cov-2 mutation status and severe COVID outcome was established using automated machine learning techniques based on random forest and logistic regression combined with feature selection algorithms. A mutation signature based on 3,779 protein coding and 36 UTR mutations capable to identify severe outcome cases was established. The trained model showed high classification performance (AUC=0.94 (CI: [0.912, 0.962]), accuracy=0.87 (CI: [0.830, 0.903])). A registration-free web-server for automated classification of new samples was set up and is accessible at http://www.covidoutcome.com. The established pipeline provides a quick assessment of future patients warranting a prospective clinical validation. With several hundred thousand fully sequenced genomes deposited in various databases, coronavirus SARS-Cov-2, the causative agent of the COVID-19 pandemic is probably the most thoroughly sequenced organism today. The variance we see is impressive: there is no or hardly any sequence position in the genome that is not mutated in one of the published sequences. Interpretation of SARS-Cov-2 genome data, especially in terms of disease severeness and patient mortality is a formidable task complicated by facts such as the virus spreading in a constantly mixing human population, in differentially susceptible age groups, and in vastly different healthcare conditions (e.g. (1) ). In addition, only a small part of deposited genomes are annotated with patient status data. Consequently, one can argue that mutations are simply neutral regional markers that rarely affect viral fitness and clinical outcome. On the other hand, there is a growing body of empirical evidence showing that specific mutation patterns such as Spike protein mutation D614G and its accompanying mutations are associated with faster spreading of the virus (2), (3) and it was shown that Spike D614G mutants not only spread faster but also cause more severe disease in animal models (4) . Recent statistical studies of about five thousand SARS-Cov-2 genome sequences showed that various mutations were significantly associated with clinical outcome and it was found that many of the mutations affected known functional parts of the Spike and Nucleocapsid proteins (5), (6) . It is an open question whether or not the mutation signature of SARS-Cov-2 genomes can be used as an indicator of disease severity given the current data available. Machine learning classification algorithms (such as support vector machines (7, 8) , random forest (9), logistic regression (10) among many others) are par excellence tools for uncovering hidden associations in large datasets. Given two sets of samples assigned to different classes (such as disease outcomes), and a mathematical description for the samples (such as a vector, or a list of mutations), classification algorithms can give well-understood statistical estimates regarding how well a mathematical description can discriminate the two classes. The pertinent measures are defined in the framework of ROC -receiver operating characteristicsanalysis (11, 12) . On the other hand, mutation lists -that we term here mutation signatures -can be quite long and difficult to handle. Feature selection algorithms -such as the Lasso algorithm or the more recent Statistically Equivalent Signatures (SES) method (13) , -can help one to condense a mutation list to an essential core set. And if such a recurrent set exists across various datasets and classification algorithms, one is encouraged to believe that there is an association between sample descriptions and the class definitions -in our case mutation signatures and disease outcomes. Here we applied machine learning classification combined with feature selection algorithms to a cohort of 1594 SARS-Cov-2 genomes and their associated patient data in order to show that the known mutation signatures contain the information sufficient to separate mild and severe outcome classes and can be considered as predictors of severe outcomes. We also established an online analysis platform for predicting the probability of severe infection, starting from a SARS-Cov-2 genome sequence. The SARS-CoV-2 nucleic acid sequences were downloaded in FASTA format from the GISAID virus repository (https://www.gisaid.org/, accessed on December 2, 2020). Only genome sequences annotated with patient follow-up status data were downloaded. The CoVsurver analysis tool (https://corona.bii.a-star.edu.sg) was used to extract the mutations. The viral sequences in FASTA format were used as input for this tool. The "hCoV-19/Wuhan/WIV04/2019" strain was used as the reference. The UTR mutations were extracted from the multiple alignment of underlying sequences by comparing the target sequence to the reference sequence. The multiple alignment was constructed using the MAFF software tool (14) , and substitutions occurring in at least ten genomes were selected for further analysis. The protein mutations were exported in protein alteration format, non-protein (i.e. UTR) mutations were exported in nucleotide mutation format. Artificial intelligence (machine learning) algorithms were used to identify mutations associated with the sever outcome. Briefly, we chose a procedure, using the JADBio platform (15) , that starts with genomic mutation data as the input, carrying out classification based on rigde logistic regression (RLR) (10), random forests (9) or support vector machines (7) in conjunction with LASSO feature selection (16) whenever appropriate, and outputting a) classification efficiency measures (accuracy and AUC -for a review see (12) ) and, b) a feature importance list i.e. a list of mutations ranked according to their importance in distinguishing severe vs. other outcomes. For training and testing the classification and feature selection algorithms, we organized the genome data into three datasets: Dataset #1 included 797 severe and 797 mild cases (Supplemental Table 1 ); and Dataset #2 included 638 severe and 638 mild samples (Supplemental Table 2 ). The online analysis platform (https://www.covidoutcome.com) is written in R using the R Shiny package (https://CRAN.R-project.org/package=shiny) and is running under Linux Debian 64-bit (x86_64). The server takes a SARS-CoV-2 genomic sequence in FASTA format as the input, and provides a) a list of protein as well as UTR mutations and b) a probability of the input genome producing severe infection, as the output. The server fist performs a global pairwise sequence alignment between the input sequence and the reference genome of the "Wuhan strain" (hCoV-19/Wuhan/WIV04/2019), using the program of the "Biostrings" R Bioconductor package (https://bioconductor.org/packages/Biostrings/), and outputs the protein as well as UTR mutations. The second step of the analysis includes prediction of the clinical outcome which is expressed as the probability of severe outcome. The prediction is based on the Random Forest model trained on 797 "mild" and 797 "severe" genome records (Dataset 1). The result is presented in numerical as well as graphical form. Figure 1 shows the complete analysis workflow. We retrieved from the GISAID database a total of 9,781 SARS-CoV-2 genome data that were provided with patient status indications. We found that patient status was described with 179 different, submitter-defined terms, so we formed cohorts that included clearly defined patient descriptions. This was possible for the mildest and for the most severe outcomes that we designated as "mild" and "severe". The pertinent terms are listed in Supplemental Table 3 . Hospitalized patients were more difficult to categorize as hospitalization criteria varied from country to country, so these were not included in the learning and test sets. The retrieved sequences contained a total of 3,779 protein and 36 UTR mutation types as compared to the Wuhan strain. We represented the genomes with mutation signature vectors that contained the name of the genome, the age of the patient, followed by a series of protein and UTR mutations listed in the order of their sequence positions. Our goal was to establish whether or not the mutation signatures can distinguish two input classes, i.e. mild and severe patient outcomes. Machine learning algorithms can help to approach this problem, since a high classification efficiency is generally considered an indicator of the input data being able to distinguish the class labels. A specific problem of the genomic data is the large number of possible mutations which can obscure the identity of truly relevant mutations. Techniques of feature selection (13), (16) are designed to solve this problem as they can narrow down the number of input dimensions to a few, relevant dimensions ranked according to their importance. In practice, we can combine feature selection algorithms, such as LASSO (16) or Statistically Equivalent Signatures (13) with classifier algorithms (such as support vector machines (7, 8) , random forests (9), logistic regression (10), in such a way that classification performance and feature (mutation) signatures will be optimized at the same time. Table 1 summarizes the results in terms of classification performance. The AUC and accuracy values (for a review see (12) ) are high enough to indicate that the mutation signatures contain the information necessary to separate mild and severe outcomes. Table 2 lists examples of mutation signatures identified by the LASSO algorithm on two datasets. Here we also included models that did not contain patient age data (which sum up the total of 4 methods listed in Table 2 ). It is conspicuous, that the mutation lists are similar, i.e. almost the same mutations were found to be important in all cases. For instance, out of the 26 different mutations 19 were prioritized by all four methods, four were selected in two of them and there were only three mutations that were found by one method only. Similar if not identical mutation signatures were found with the Statistically Equivalent Signatures method (13) (data not shown). In other words, the prioritized mutations can be considered a stable, robust subset that apparently contain most of the information necessary to distinguish the "Mild" and "Severe" cases. We also note that the mutations prioritized here by feature selection quite well coincide with those well-known from previous sequencing studies. For instance, spike protein variants V1176F and S477N, that cooccur with DG14G, affect important functional domains of the spike protein, and were found to increasingly spread around the world (6). In the nucleocapsid protein, S194L maps onto the phosphorylated "RS-motif" (17) which is in the intrinsically unstructured serine rich region 181-213 of the protein and was previously found associated with severe outcome (5) . Similarly, UTR mutations SC241T and SG29830T were also noted by Mukherjee and Goswami (18) . The complete analysis pipeline is summarized in Figure 1 . In the first step of the analysis, global pairwise sequence alignment is used to align the query nucleotide sequence to the reference nucleotide sequence (hCoV-19/Wuhan/WIV04/2019) using the "Biostrings" R Bioconductor package (https://bioconductor.org/packages/Biostrings/). A quality control is carried out at this point, and input sequences containing to few identities, too many "N"-s, abnormal GC content, or having a discrepant length with respect to the reference sequence are rejected. Then, using the translate() function of the "Biostrings" package, nucleotide alterations are translated to protein alterations plus UTR alterations. The resulting mutation signature is passed on to random forest based predictor that contains a model with patient age. The output is a severity score, which is a (0,1) probability of the infection being severe. This value can be evaluated in comparison with distribution data of the test set (Figure 2) . In this figure one can designate approximate segments depending on the ratio of severe and mild outcomes. Namely, score<0.20 and score>0.80 are regions of high confidence, 0.20