key: cord-0945404-9cllucmc
authors: Nagy, Ádám; Pongor, Sándor; Győrffy, Balázs
title: Different mutations in SARS-CoV-2 associate with severe and mild outcome
date: 2020-12-23
journal: Int J Antimicrob Agents
DOI: 10.1016/j.ijantimicag.2020.106272
sha: 054dfa9379b32cd9df43a4ffbdc410299626c4ef
doc_id: 945404
cord_uid: 9cllucmc

INTRODUCTION: Genomic alterations in a viral genome can lead to either better or worse outcome, and identifying these mutations is of utmost importance. Here, we correlated protein-level mutations in the SARS-CoV-2 virus to clinical outcome.

METHODS: Mutations in viral sequences from the GISAID virus repository were evaluated by using “hCoV-19/Wuhan/WIV04/2019” as the reference. Patient outcomes were classified as mild disease, hospitalization and severe disease (death or documented treatment in an intensive-care unit). A chi-square test was applied to examine the association between each mutation and patient outcome. The false discovery rate (FDR) was computed to correct for multiple hypothesis testing, and results passing an FDR cutoff of 5% were accepted as significant.

RESULTS: Mutations were mapped to amino acid changes for 3,733 non-silent mutations. Mutations correlated to mild outcome were located in ORF8, NSP6, ORF3a, NSP4, and in the nucleocapsid phosphoprotein N. Mutations associated with inferior outcome were located in the surface (S) glycoprotein, in the RNA-dependent RNA polymerase, in ORF3a, NSP3, ORF6 and N. Mutations leading to severe outcome with low prevalence were found in the ORF3a and NSP7 proteins. Four out of 22 of the most significant mutations mapped onto a 10 amino acid long phosphorylated stretch of N, indicating that, in spite of obvious sampling restrictions, the approach can find functionally relevant sites in the viral genome.

CONCLUSIONS: We demonstrate that mutations in the viral genes may have a direct correlation to clinical outcome.
Our results help to quickly identify SARS-CoV-2 infections harboring mutations related to severe outcome.

There are seven human coronaviruses: MERS, Human-HKU-1, Human NL63, Human 229E, Human OC43, SARS-CoV, and SARS-CoV-2. The natural host of this latest RNA virus is the Chinese rufous horseshoe bat (Rhinolophus sinicus), and the transfer to humans initiated the ongoing COVID-19 outbreak at the end of 2019 [1]. Some studies estimated a low mortality rate of SARS-CoV-2 in the overall population [2,3], while other investigators reported mortality of up to 26% when the virus strikes a critically ill patient [4]. Overall, based on current WHO data (October 2020), the mortality rate is around 2.7%. The linear genome of the SARS-CoV-2 virus has 29,903 bases and harbors 25 genes [5]; the reference sequence is accessible in GenBank under the accession number MN908947. Phylogenetic analysis of SARS-CoV-2 genomes shows three variants, termed A, B and C, which are distributed differently among sequences from Asia, Europe and the Americas [6]. The viral genes encode, among others, an envelope protein, an RNA-dependent RNA polymerase, a surface glycoprotein, an exonuclease, a methyltransferase, and 11 nonstructural proteins. Some of these are within the virus, but others, including the spike glycoprotein, the membrane glycoprotein, and the envelope protein, are on the viral surface. In theory, any functional or structural viral gene can affect the efficiency of a virus, and both mutations [7] and alterations in expression [8] can increase pathogenicity. It is important to emphasize that even the untranslated regions of a coronavirus can have an important role in viral replication, as has been previously demonstrated for the 3' untranslated region [9]. SARS-CoV-2 is no different from other viruses, and new mutations continually emerge as it spreads [10].
Some mutations uncovered in the SARS-CoV-2 virus lead to a novel RNA-dependent RNA polymerase variant [11], while other genomic changes drive the evolution and the spread of the virus by resulting in a more transmissible form [12]. Mutations potentially making the virus more transmissible have a significant evolutionary advantage, as has been demonstrated for the SARS-CoV-2 variant with spike G614, which largely replaced D614 between February and July 2020 [13]. In this context, the most important task is to identify viral mutations leading to different patient outcomes. Mutations resulting in a mild disease could facilitate the spread of the virus and thereby maintain the outbreak. Other mutations leading to a more severe disease need immediate attention to prevent detrimental outcomes. Here, our goal was to identify and rank mutations associated with altered patient outcome by simultaneously correlating outcomes to all mutations across a large cohort of patients.

All available SARS-CoV-2 (taxid: 2697049) viral nucleic acid sequences were downloaded from the GISAID virus repository (https://www.gisaid.org/). The sequences were acquired in FASTA format. Only those viral sequences were selected for which the entire viral nucleic acid sequence was published. A second filtering step retained only virus genomes with available patient follow-up status. The mutations were evaluated using the CoVsurver (https://corona.bii.a-star.edu.sg). To achieve this, the viral sequences in FASTA format were used as the query and "hCoV-19/Wuhan/WIV04/2019" was used as the reference. The analysis was run in batches of 1000 samples. Protein mutations do not overlap, and the genomic boundaries of the various proteins in the WIV04 reference genome are displayed in Table 1. As the patient samples were annotated with altogether more than sixty different outcome classifications, we had to coerce these into three major categories.
Patients who were "asymptomatic", were "not hospitalized", had a "mild" disease, or were at "home" were all assigned to the "mild" group. Patients who were treated at outpatient departments, were quarantined or were treated by the physician network were also classified as "mild". Patients who definitely needed medical care were assigned to the "hospitalized" group. These include those "hospitalized", "inpatient", "discharged", "released", and "recovered". In addition, combinations of annotations that included any of these were also assigned to this cohort (e.g. "initially hospitalized" or "to be hospitalized"). Finally, patients with detrimental outcome were allocated to the "severe" cohort. These include those "deceased", those with a "severe" disease, and those who entered "intensive care units". Any combination of these with other annotations (e.g. "hospitalized / ICU") was also added to this category.

All data processing and statistical analysis steps were performed in the R statistical environment v3.6.3. Data processing was performed on 18 October 2020. A chi-square test was applied to examine the association between each mutation and patient status data. The false discovery rate (FDR) was computed using the Benjamini-Hochberg method to correct for multiple hypothesis testing, and only results passing an FDR cutoff of 5% were accepted as significant.

Altogether 149,061 SARS-CoV-2 viral nucleic acid sequences were available, and 147,960 of these included the entire viral nucleic acid sequence. Clinical data were available for 7,702 patients, and 4,566 of these also had follow-up data. This is a small fraction of the total data, which implies that our findings could contain a sampling bias. Regarding the clinical parameters of these patients, 58.6% were male and 36.5% were female (the remaining samples did not have this information).
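The coercion of free-text outcome annotations into three categories, as described above, can be sketched in a few lines. The following Python snippet is only an illustration and not the code used in the study, which was run in R. The keywords are taken from the annotations quoted in the text, whereas the function name `classify_outcome`, the substring matching and the order in which the rules are applied are assumptions made for this sketch.

```python
import re

# Keywords quoted in the text; how they are matched is an assumption of this sketch.
SEVERE_TERMS = ("deceased", "severe", "intensive care")
HOSPITALIZED_TERMS = ("hospitalized", "inpatient", "discharged", "released", "recovered")
MILD_TERMS = ("asymptomatic", "mild", "home", "outpatient", "quarantine", "physician network")


def classify_outcome(annotation):
    """Map a free-text patient status to 'mild', 'hospitalized', 'severe' or None."""
    text = annotation.strip().lower()
    # Severe annotations take precedence over any combination, e.g. "hospitalized / ICU".
    # "icu" is matched as a whole word so that words such as "difficult" do not trigger it.
    if any(term in text for term in SEVERE_TERMS) or re.search(r"\bicu\b", text):
        return "severe"
    # "not hospitalized" contains the keyword "hospitalized", so it is handled first.
    if "not hospitalized" in text:
        return "mild"
    # Any remaining combination that mentions hospital care goes to "hospitalized".
    if any(term in text for term in HOSPITALIZED_TERMS):
        return "hospitalized"
    if any(term in text for term in MILD_TERMS):
        return "mild"
    # Annotations that match no rule are left unclassified.
    return None
```

Checking the severe terms first reflects the rule that combined annotations such as "hospitalized / ICU" belong to the severe cohort. A real pipeline would also need to decide how to treat annotations that match none of the rules.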
The geographical origin of the samples covers the entire globe: 4.2% were from Africa, 46% from Asia, 26.8% from Europe, 12.4% from North America and 10.2% from South America. The samples were collected between 30 December 2019 and 14 September 2020. Of all patients with follow-up, 708 had a mild disease, 3,306 had to be hospitalized and 552 had a severe disease.

Altogether 3,733 different mutations affecting the protein amino acid sequence were identified, and 937 of these mutations were not present in samples with clinical follow-up. Across all mutations, we identified on average 4.7 mutations in each sample. As an internal control for potential bias in mutation prevalence related to patient proportions, we computed the average number of mutations in each clinical outcome cohort and found similar values (the means in those with mild, hospitalized, and severe outcome were 4.8, 4.6, and 5.1, respectively).

When analyzing the correlation to clinical outcome across all mutations, 79 mutations reached statistical significance at FDR < 5%. The complete list of these mutations with sample numbers in each cohort is displayed in Supplemental Table 1, and mutation data for each investigated patient are provided in Supplemental Table 2. In order to concentrate only on mutations with clinical relevance, we selected only those mutations which were present in at least 2% of the samples (this corresponds to a cutoff of at least 91 patient samples with a mutation). Among mutations related to mild outcome, only five passed all filtering criteria: L84S in the ORF8 protein, L37F in the NSP6 protein, G196V in the ORF3a protein, F308Y in the NSP4 protein, and the S197L mutation in the nucleocapsid phosphoprotein. The complete list as well as the distribution among patient samples is provided in Table 2.
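The per-mutation testing, the multiple-testing correction and the prevalence filter described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, whereas the study itself used R, so it should not be read as the authors' code. It assumes one 2 x 3 contingency table per mutation (carriers versus non-carriers across the mild, hospitalized and severe cohorts), for which the chi-square p-value with two degrees of freedom has the closed form exp(-x/2). The table layout and all function names are hypothetical; the cohort sizes and the cutoff of 91 samples are those given in the text.

```python
import math


def chi_square_2x3(carriers, non_carriers):
    """Pearson chi-square test on a 2 x 3 table; returns (statistic, p_value).

    Assumes every row and column total is non-zero. With 2 degrees of
    freedom the chi-square survival function equals exp(-x / 2).
    """
    rows = [carriers, non_carriers]
    row_totals = [sum(row) for row in rows]
    col_totals = [a + b for a, b in zip(carriers, non_carriers)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2.0)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_mutations(carrier_counts, cohort_sizes=(708, 3306, 552),
                          fdr=0.05, min_count=91):
    """Keep mutations passing the FDR cutoff and the minimum sample count.

    carrier_counts maps a mutation name to its (mild, hospitalized, severe)
    carrier counts; the return value maps each retained mutation to its
    adjusted p-value.
    """
    names = list(carrier_counts)
    p_values = []
    for name in names:
        carriers = carrier_counts[name]
        non_carriers = tuple(n - c for n, c in zip(cohort_sizes, carriers))
        p_values.append(chi_square_2x3(carriers, non_carriers)[1])
    q_values = benjamini_hochberg(p_values)
    return {name: q for name, q in zip(names, q_values)
            if q < fdr and sum(carrier_counts[name]) >= min_count}
```

Calling `significant_mutations` with a dictionary of per-cohort carrier counts returns only those mutations that are both significant after correction and frequent enough. Whether the original analysis tested all three cohorts jointly or in pairs is not stated in the text, so the joint 2 x 3 test is a simplification.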
When searching for mutations related to hospitalization or to severe outcome, we used the same filter, including only mutations present in at least 2% of the samples. Altogether 15 mutations passed these criteria. These originated in seven genes: L54F, D614G and V1176F in the surface (S) glycoprotein, A97V and P323L in the RNA-dependent RNA polymerase, Q57H and G251V in the ORF3a protein, P13L, S194L, R203K, G204R and I292T in the nucleocapsid phosphoprotein, I33T in the ORF6 protein, and S1197R and T1198K in the NSP3 protein. In order not to miss mutations leading to deadly outcome, we also included all mutations which were present in at least 10 patients with severe outcome. This additional analysis delivered two further mutations: L71F in the NSP7 protein and S253P in the ORF3a protein. These were linked to 53 and 11 severe outcomes after being spotted in 60 (L71F) and 11 (S253P) patients, respectively. Interestingly, the overall prevalence of mutations leading to mild outcome (n=1,851) was smaller than the prevalence of those leading to worse outcome (n=11,725), but at the same time the proportion of patients with mild outcome in the entire cohort was also smaller (18.3%). Nevertheless, a considerable proportion of the mutations (n=7,875) were not significantly correlated to any clinical outcome. The complete list of all mutations correlated to severe disease is presented in Table 3.

We have simultaneously analyzed the correlation between patient outcome and all identified mutations resulting in amino acid sequence changes in the viral proteins. Strikingly, we not only found a significant number of mutations, but some of these were correlated to mild disease while others had a significant correlation to severe outcome. Nucleocapsid phosphoprotein was the protein with the most significant mutations linked to both mild and severe patient outcome.
All these changes are at closely spaced genomic positions: G196V and S197L resulting in mild outcome, and R203K, G204R, and S194L resulting in inferior outcome. Interestingly, when comparing the S197L variant (71% chance of a mild outcome) to the S194L variant (1% chance of a mild outcome), the relative risk was extremely high.

Researchers from the University of Washington compared two dominant clades of the virus in circulation and observed no difference in outcome between them in patients sufficiently ill to warrant testing for the virus [19]. Previously, a 382-nucleotide deletion (∆382) in open reading frame 8 was associated with a milder infection [20]. In another recent study, a set of common deletions was identified in the spike protein of SARS-CoV-2 [21]. Other deletions were also validated by RT-PCR [22]. However, due to missing data about insertions and deletions in GISAID, we could not evaluate a potential link between deletions and patient outcome.

Importantly, our findings might contain a sampling bias, since only a fraction of the available genomes had patient outcome data. On the other hand, four out of 22 potentially significant mutations (listed in Tables 2 and 3) map to a functionally important region of the nucleocapsid phosphoprotein about 10 amino acids long, which leads us to believe that the current statistical approach can reveal functionally important sites within the SARS-CoV-2 genome.

The main limitation of our study results from the database used. Information was retrieved from GISAID, a repository that contains only general information about patient outcome. The patient treatment protocols resulting in designation into "mild", "hospitalized" and "severe" cohorts may significantly depend on the country and even on the region where patients were managed. We also could not include potential confounding factors including age, comorbidities and treatment against COVID-19 in our analysis.
Coronaviruses generally have a stable genome which changes very little over time [23]. A fundamental question of SARS-CoV-2 research is whether the virus can get weaker or stronger with time. Our findings suggest that there are mutations that can support either of these changes, so there is a theoretical possibility that in the future the viral effect will shift towards milder or more severe patient outcomes.

Table 3. SARS-CoV-2 mutations correlated to hospitalization and severe outcome in 4,566 patients with available genomic and follow-up information were found in seven distinct genes.

1. The proximal origin of SARS-CoV-2
2. SARS-CoV-2: fear versus data
3. Comparison of mortality associated with respiratory viral infections between
4. Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region
5. A new coronavirus associated with human respiratory disease in China
6. Phylogenetic network analysis of SARS-CoV-2 genomes
7. Mutational analysis of the coat protein gene of potato virus X: effects on virion morphology and viral pathogenicity
8. Expression of measles virus V protein is associated with pathogenicity and control of viral RNA synthesis
9. A phylogenetically conserved hairpin-type 3' untranslated region pseudoknot functions in coronavirus RNA replication
10. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
11. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
12. SARS-CoV-2 Spike protein variant D614G increases infectivity and retains sensitivity to antibodies that target the receptor binding domain. bioRxiv: the preprint server for biology
13. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
14. Phosphorylation of the arginine/serine dipeptide-rich motif of the severe acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization, translation inhibitory activity and cellular localization
15. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content
16. Virtual screening and dynamics of potential inhibitors targeting RNA binding domain of nucleocapsid phosphoprotein from SARS-CoV-2
17. COVID-2019: The role of the nsp2 and nsp3 in its pathogenesis
18. Molecular Architecture of Early Dissemination and Massive Second Wave of the SARS-CoV-2 Virus in a Major Metropolitan Area. medRxiv
19. Outcomes associated with SARS-CoV-2 viral clades in COVID-19. medRxiv
20. Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study
21. Identification of common deletions in the spike protein of SARS-CoV-2
22. An 81-Nucleotide Deletion in SARS-CoV-2 ORF7a Identified from Sentinel Surveillance in Arizona
23. Genetic variability of human respiratory coronavirus OC43

The authors wish to acknowledge the support of ELIXIR Hungary (www.elixirhungary.org) as well as the advice of Drs Sebastian Maurer-Stroh (Bioinformatics Institute, A*STAR, Singapore) and Balázs Ligeti (Pázmány University, Budapest).