key: cord-0860516-cmd4elpt
authors: Konishi, Tomokazu
title: Progressing adaptation of SARS-CoV-2 to humans
date: 2021-01-15
journal: bioRxiv
DOI: 10.1101/2020.12.18.413344
sha: 3d96935a83aaa54c8954ab6933bf87853558f5d5
doc_id: 860516
cord_uid: cmd4elpt

The second and subsequent waves of coronavirus disease 2019 (COVID-19) have caused problems worldwide 1. Here, using an objective analytical method 2, we present the changes that occurred over time in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative virus of COVID-19. The virus has mutated in three major directions, resulting in three groups to date. Analysis of the basic structure of the group of viruses was completed by April and shared across all continents. However, the virus continued to mutate independently in each country after the borders were closed. In particular, the virus mutated before the occurrence of the second and subsequent peaks. The mutations appear to have conferred higher infectivity, allowing the virus to overcome previously effective protection and cause second waves of the disease. Currently, each country may possess such a unique, stronger variant. Some of them slowly entered other countries and caused epidemics. These viruses could also serve as sources of further mutations by exchanging parts of the genome, which could create variants with superior infectivity.

The evolutionary trajectory of an evolved virus is difficult to analyse from nucleotide sequences 2 because the data have complex multivariate structures with numerous dimensions 2. Phylogenetic trees have long been used to present relationships among sequences 3, and many studies on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 4 have employed them 5-8, but they have two drawbacks. One is the decisive lack of falsifiability, which ruins objectivity 9, and the other is a lack of generality, which makes the trees difficult to integrate with other data sources.
Here, an analysis was performed using a more objective method, i.e., principal component analysis (PCA) 10. PCA applies a singular value decomposition to sequences encoded as Boolean vectors and obtains the principal components (PCs) for the samples and the bases in parallel 11. PCA finds common directions within a dataset and represents them as independent axes. These axes are ordered: PC1 is common to most of the data, and the lower an axis's contribution to the differences in the data, the lower its order. Each sample or base is placed on an axis according to its characteristics. As shown in the current study, the SARS-CoV-2 data as a whole were represented by two axes. The lower axes represented mutations found in smaller areas. The components were scaled and compared with those of the influenza virus 12 as needed. In addition, the results were integrated with the date and location of collection.

Overview of PCA. PCs 1 and 2 formed a triangle comprising four groups that were temporarily termed 0-3 (Fig. 1A). All groups were found on all continents. In contrast to PC1 and PC2, which show the overall situation, PC3 and the lower-order axes showed differences that relate to only a small number of the samples (Fig. 1B and Extended Data Fig. 1), with smaller contributions to the differences in the data (Extended Data Fig. 2); many of them were detected in a limited number of countries after April (compare Extended Data Fig. 1 and 3). Such mutations have occurred since April, when countries began to restrict the movement of people across national borders. The overview of Fig. 1A was completed at the end of March (Fig. 1C). Although group 3 once appeared worldwide, it has been contained to date (Fig. 1D). These features were similar for both RNA and protein (Extended Data Fig. 1 and 4). The rate of missense mutations, which affect the amino acid sequence, was 63% (Extended Data Fig. 5).
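The PCA procedure outlined in the method description above (Boolean encoding of sequences, followed by a singular value decomposition that yields PCs for samples and bases in parallel) can be sketched roughly as follows. This is a minimal illustration and not the pipeline used in the study: the function names `one_hot` and `pca_svd`, the toy sequences, and the choice to centre each column on its mean without further scaling are all assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seqs):
    """Encode aligned, equal-length sequences as a Boolean matrix
    of shape (n_samples, 4 * length). Hypothetical helper."""
    n, length = len(seqs), len(seqs[0])
    X = np.zeros((n, length * len(BASES)), dtype=float)
    for i, s in enumerate(seqs):
        for j, b in enumerate(s):
            k = BASES.find(b)
            if k >= 0:  # gaps and ambiguous bases stay all-zero
                X[i, j * len(BASES) + k] = 1.0
    return X

def pca_svd(X):
    """PCA by SVD. Centring on column means is an assumption here;
    the study's exact centring and scaling are not reproduced."""
    centered = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(centered, full_matrices=False)
    pc_samples = U * d               # scores: one row per sample
    pc_bases = Vt.T * d              # scaled loadings: one row per base column
    contribution = d**2 / np.sum(d**2)  # share of variance per axis
    return pc_samples, pc_bases, contribution

# Toy alignment, invented purely for illustration
seqs = ["ACGTACGT", "ACGTACGA", "ACGTACGA",
        "TCGTACGT", "TCGTACGT", "TCGAACGT"]
pc_samples, pc_bases, contribution = pca_svd(one_hot(seqs))
```

In this sketch, plotting the first two columns of `pc_samples` would give a sample map of the kind described for PCs 1 and 2, while the rows of `pc_bases` with the largest absolute values on an axis indicate which positions drive that axis.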
This missense rate is fairly high; rather, null mutations are expected to be 61% 13, even though some differences caused by ignoring damaging mutations have been identified here. By comparison, for H1N1 influenza virus in the United States from 2001 to 2003, the rate was 22%. This suggests that SARS-CoV-2 was under selective pressure to alter its protein structure.

The viral groups causing the epidemics changed over time (Fig. 2A). The second and third waves in each country were caused by new variants of the virus, which showed different patterns at lower PCs (Fig. 2). In England, the second wave was caused by a variant of group 0, which has since spread to many countries in Europe (Extended Data Fig. 6, panels A-I). The third wave was caused by another variant, the H69-V70 variant of group 1 (Fig. 2C) 14,15. The H69-V70 variant in the UK 14,15 actually involves multiple mutations (Extended Data Table 2). The deletion itself has occurred spontaneously in multiple groups (Fig. 2D); the epidemic in the UK is caused by a group 1 variant, while those in Germany and Denmark are caused by group 0. There were other group 1 variants without the deletion in the UK, which would be the ancestral variants. The group 1 H69-V70 variant in the UK is growing at the same rate as, or faster than, the pan-European variant and is replacing it (Fig. 2A). These results suggest that this variant could be more infectious than the pan-European variant.

A group 2 variant is quickly spreading in South Africa (Extended Data Fig. 6Q and 6R). As it had not appeared until October, the PCA axes had not been trained on this variant; hence, the figure was made by renewing the axes. This is a novel variant that contains many mutations concentrated in the spike protein, and the magnitude of mutation is deemed comparable to that of influenza H1N1 hemagglutinin between sequential peaks (Table 1). Group 2 variants have been negligible in South Africa (Extended Data Fig.
6Q), and no variant was considered ancestral. Perhaps the epidemic variant came from areas where samples are rarely sequenced. Any region in the world can produce a new, highly infectious variant with the potential to cross national borders. Therefore, measures such as vaccines are urgently needed on a global scale.

In many countries, mutations were detected several weeks prior to the peaks (Fig. 2 and Extended Data Fig. 6). This suggests that the mutations were the cause of the peaks. In any case, the variants prevalent at those peaks should be more infectious than the variants before the mutation, which is why the earlier variants were replaced. The higher infectivity may have weakened the previously effective protections from the virus. Altered residues between the peaks are shown in Extended Data Table 1; because PCA shows differences among samples and bases in parallel, the PCs for bases may help identify those differences. It may be possible to predict the next wave by monitoring the mutations. However, only a limited number of countries continue sequencing the virus; many have reported only the first peak, even though the second and subsequent peaks are caused by different variants.

Magnitude of mutations. Influenza H1N1 is highly infectious in humans 16. It mutates continuously but does not appear in the same area during two sequential years. It is prevalent every few years because it takes time to mutate enough to survive against herd immunity in the area. However, the magnitude of the changes in the meantime varies (Table 1). Perhaps the part of the sequence that needs to be modified to escape herd immunity varies from case to case. In contrast, the changes in the coronavirus before and after the peaks were much smaller (Table 1). The differences seem to be smaller than those needed to produce another peak of seasonal influenza, although this does not guarantee safety from exceptional reinfections 17,18.
A possible exception is the variant that is presently causing an epidemic in South Africa.

A case of pdm09 influenza virus that likely corresponds to the SARS-CoV-2 mutations was observed in Thailand from 2009 to 2010 (Extended Data Fig. 7) 19. The 2009 pandemic was subdued in this tropical country, and three peaks were subsequently identified toward 2010. At the second peak, only the nucleotide sequence changed, and at the third peak, the hemagglutinin protein changed (Table 1) with a high rate of missense mutations (54%). However, these changes were not highly associated with herd immunity because most of the population was not yet infected. These changes might have been caused by the adaptation of pdm09 (originally a swine strain) to humans.

The influenza virus genome continues to mutate, and not only in the part that is directly associated with infection 12. This property may be the reason it has repeatedly deceived herd immunity for decades, allowing influenza to remain as a pandemic. In contrast, mutations in SARS-CoV-2 were not uniform; rather, some smaller open reading frames (ORFs) such as E, M, and ORF6 were preserved (Fig. 3). These could have adapted to humans, but they are also conserved in various coronaviruses 20. If they remain conserved, they could be subjected to herd immunity, extending the period between future epidemics. Evidently, this does not mean that these ORFs will never change; the rate of change is just slower.

The strain responsible for the pandemic has mutated rapidly. As of March, the bases of PCs 1 and 2 were formed, and each continent harboured the same set of mutated variants (Fig. 1). Since then, human movement has been restricted, and each country has accumulated its own variants. In this process, the force driving the mutations was probably adaptation to humans (Table 1). It caused changes in both codon usage and protein structure (Extended Data Fig.
1 and 4), but the rate of missense mutations was high (Extended Data Table 1

The unique variants in each country would serve as the source of newer variants, which are more infectious than the previous variants; therefore, they replaced the mode (Fig. 2 and Extended Data

The epidemics showed much variation among the countries (Extended Data Fig. 6)

A pneumonia outbreak associated with a new coronavirus of probable bat origin
Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
Detection and Characterization of Bat Sarbecovirus Phylogenetically Related to SARS-CoV-2
Scientific method: Defend the integrity of physics
Principal component analysis for designed experiments
Re-evaluation of the evolution of influenza H1 viruses using direct PCA
Probability of phenotypically detectable protein damage by ENU-induced mutations in the Mutagenetix database
Two-step strategy for the identification of SARS-CoV-2 variants co-occurring with spike deletion H69-V70
Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data
CDC. Similarities and Differences between Flu and COVID-19
Asymptomatic Reinfection in 2 Healthcare Workers From India With Genetically Distinct Severe Acute Respiratory Syndrome Coronavirus 2
Seasonal coronavirus protective immunity is short-lasting
Principal component analysis of coronaviruses reveals their diversity and seasonal and pandemic potential
Data, disease and diplomacy: GISAID's innovative contribution to global health
DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment
R: A language and environment for statistical computing. (R Foundation for Statistical Computing