key: cord-0872329-431ksdno
authors: Konishi, T.
title: Coronavirus, as a source of pandemic pathogens
date: 2020-05-28
journal: bioRxiv
DOI: 10.1101/2020.04.26.063032
sha: 43746f927123d224d3d95e1ced2a13ea8f12206b
doc_id: 872329
cord_uid: 431ksdno

The coronavirus and the influenza virus have similarities and differences. To compare them comprehensively, their genome sequencing data were examined by principal component analysis. Variations in coronavirus were smaller than those in a subclass of the influenza virus. In addition, differences among coronaviruses in a variety of hosts were small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were more conserved, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.

One Sentence Summary: The genome data of coronavirus were compared to those of the influenza virus to investigate its spreading mechanism and future status. Coronavirus would repeatedly spread every few years. In addition, the coronavirus family has many other candidates for subsequent pandemics.

Short title: Coronavirus, as a source of pandemics.

Introduction: (1, 2) is rapidly spreading worldwide. To investigate its spreading mechanism, genomes of the coronavirus were compared to those of the influenza virus (3) by principal component analysis (PCA) (4). These two RNA viruses differed in host specificity and mutation speed. Here, I present the characteristics of both and the changes in coronavirus. The coronaviruses presented smaller variations and did not differ much from host to host.
These characteristics may have eased the novel transmission of coronaviruses to humans; many of the bat strains seem to have the ability to infect humans, and intermediate animals may not necessarily be required, contrary to current expectations. Although many of the coronaviruses were more conserved, this may reflect weaker selective pressure owing to their limited infectivity; those repeatedly found among humans did show annual changes. If SARS-CoV-2, which is highly infectious, changes its genome like the influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family may have other candidates for subsequent outbreaks.

While the coronavirus genome is a positive-sense single-stranded 30 kb RNA, the influenza genome is divided into eight segments. Both replicate directly using their RNA-dependent RNA polymerases, which may cause many errors (5). This characteristic has introduced variations among coronaviruses, including in the number and size of their open reading frames (ORFs).

Some classes of coronaviruses, such as HCoV, cause upper respiratory tract infections in humans. The symptoms are mostly similar to those of the common cold, but they may also cause severe pneumonia (6). They have lower infectivity than the human influenza viruses. For example, a 2010-2015 study in China reported that 2.3% and 30% of patients were positive for coronavirus and influenza virus, respectively (6); a similar ratio was found in another large study (6). Some viruses may have a much higher infectivity and cause outbreaks: e.g., SARS-CoV (7, 8), MERS-CoV (7, 9, 10), and SARS-CoV-2 (SCoV2) (2, 11-13). The former two cause severe symptoms, whereas the last ranges from asymptomatic to critical.

Coronaviruses and influenza viruses have similarities and differences in infectivity, ability to spread, and symptoms.
These differences are based on their genomes, which are therefore important for estimating how SCoV2 will act in humans.

Sequencing data were obtained from the DDBJ database. Aligned data, obtained with DECIPHER (14) (presented in the Supplementary material), were further processed to observe the relationships among samples by using the direct PCA method (4), which can handle data with limited assumptions. The conceptual diagram of the PCA is as follows (all calculations were performed in R) (15). Updated versions of the scripts are presented in GitHub (https://github.com/TomokazuKonishi/direct-PCA-for-sequences). To avoid the imbalance effect among samples, the decomposition was performed after removing clusters of similar samples, e.g. those caused by SARS, MERS, or SCoV2; only one sample was included from each such cluster.

To prepare a comprehensive data set for SCoV2, 2,796 full-length records were obtained from the GISAID database and added to those used for Fig. 1. Some of those records were rather preliminary and contained several uncertain bases designated by "N", which may be counted as indels. To cancel such artifacts, the corresponding regions were replaced with the average data in the PCA.

The unit of length is the same as that of the PCA, which will extract the length toward particular directions.

The levels of PCs 1-5 for bases were estimated by the root sum square at each base position (Fig. 2 and S3). If alterations existed in several samples and occurred coincidently, they may contribute to a higher level of PC. To show the tendencies along the positions, two moving averages with a width of 200 amino acid residues are shown for substitutions (grey) and indels (blue).

The coronaviruses consist of distinct classes (Fig. 1). In the lower PC axes, the other classes were separated (Fig. S1a and b), and these were further divided into subclasses. For example, SARS-CoV and SCoV2 belong to different subclasses of Sarbecovirus (Fig. S1c and Table S1).

The origin of the graph, (0, 0), coincides with the mean of the data. The accumulation of mutations will form a variety of viruses, which lie in different directions and at different distances from an original virus. If the mutations and samplings are random, the original virus would be near the data mean. The variation magnitude, estimated by the mean distance, was 0.11. This is much smaller than those of single subclasses of influenza A virus (Fig. S2), such as H1 or H9. Incidentally, the value of the mean distance was not significantly altered by artificial reductions of sample numbers or sequence length (not shown).

Among the classes, Gammacoronavirus and Deltacoronavirus, which are close to the origin of the graph, were found mainly in bird samples (Fig. 1 and Table S1). These could be the origin of coronaviruses, just as the influenza viruses are thought to have originated from those of waterfowl. The mean of the studied samples was located in the bat virus Nobecovirus, which seems to be the origin of the viruses in mammals. Indeed, many classes apart from this class were found in bat samples. The most distant classes from the mean, TGEV and Embecovirus, were found in larger animals, but not in bats. Human coronaviruses (HCoV) belong to Embecovirus, Alphacoronavirus, and Duvinacovirus.

Similarly to other RNA viruses (5), many indels were observed, especially in some smaller ORFs (Fig. 2 and S3). These range from small regions without frameshifts to large ones that alter multiple ORFs; e.g. Embecovirus is unique because it possesses an ORF for hemagglutinin. The class is further subdivided by the presence or absence of another ORF, NS2a (Figs 1 and 2). Even within a small group of HCoV, OC43, an indel 14 aa in length existed in the spike protein. The classification was not significantly affected by focusing either on the indels or on the rest (Fig. S4). Therefore, indels were not given extra weight in this study; each was treated as a base or a residue.
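
The processing described above (aligned sequences, indels counted like any other base or residue, PCA centred on the data mean, and a mean distance as the measure of variation) can be illustrated with a minimal sketch. It is written in Python with NumPy for illustration and is not the author's R scripts, which are available at the GitHub address given in the text. The one-hot encoding, the alphabet `ACGU-`, the function names, the toy sequences, and the scaling of the mean distance by the square root of the alignment length are all assumptions made here; the exact definitions used in the study are those of reference (4).

```python
import numpy as np

ALPHABET = "ACGU-"  # assumption: the gap "-" is counted as one more symbol


def one_hot(seqs, alphabet=ALPHABET):
    """Encode aligned sequences as a samples x (positions * symbols) 0/1 matrix."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    index = {ch: i for i, ch in enumerate(alphabet)}
    x = np.zeros((len(seqs), length, len(alphabet)))
    for row, seq in enumerate(seqs):
        for pos, ch in enumerate(seq):
            x[row, pos, index[ch]] = 1.0
    return x.reshape(len(seqs), -1)


def pca_scores(seqs):
    """Centre the encoded matrix on the sample mean and decompose it by SVD."""
    x = one_hot(seqs)
    centered = x - x.mean(axis=0)      # the origin (0, 0) becomes the data mean
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                     # sample coordinates on the PC axes
    return centered, scores


def mean_distance(centered, length):
    """Hypothetical variation measure: mean distance of samples from the data mean,
    scaled by sqrt(alignment length) (scaling chosen here for illustration)."""
    return np.linalg.norm(centered, axis=1).mean() / np.sqrt(length)


# Toy alignment (invented); the third sequence carries a one-base deletion.
toy = ["ACGUACGU", "ACGUACGA", "ACG-ACGU", "UCGUACGU"]
centered, scores = pca_scores(toy)
print(scores[:, :2])                        # coordinates on PC1 and PC2
print(mean_distance(centered, len(toy[0])))
```

Because the gap is simply a fifth symbol, a deletion contributes to the distance in the same way as a substitution at that position, which mirrors the choice of giving indels no extra weight.
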
Note that some other small ORFs, such as the envelope and nucleocapsid, are conserved.

The values of PC were not significantly affected by the hosts (Fig. 1); e.g. differences between bird and swine viruses in Deltacoronavirus were small (Table S1, PC18). This contrasts with influenza viruses, which were separated among different hosts; the class for waterfowl is located near the centre, with three swine groups located around it and two human groups positioned farthest away (Fig. S2a) (3). In coronaviruses, those distant from Nobecovirus seem to infect larger animals, but this rule is not absolute (Table S1).

Each of the human-outbreak strains had similar ones in bats or camels, with minor differences (Fig. 1 and Table S1). In the SARS spike protein, no amino acid residue was unique to humans. This is partially because our knowledge about the viruses has increased after the efforts to screen likely viruses in wild animals (1, 16-18). Only 35 out of 2412 residues differed between SARS and similar bat viruses, and many of these were not conserved among the bat samples (Table S2). The situation was the same in SCoV2, which presented 34 unique amino acid residues (Table S2); however, this uniqueness could disappear after further research.

The yearly occurrence of influenza A H1N1 and HCoV is very different: only one H1N1 variety spreads worldwide each year (3). In contrast, several OC43 variants appear even within a single country (Fig. 3a, S5, and Table S3). H1 variants never return in subsequent seasons, whereas OC43 varieties appeared repeatedly for a decade. However, by concentrating solely on one variety, the annual alterations became obvious (Fig. 3b).

A comprehensive set of SCoV2 samples was separated into three directions, forming some classes (Fig. 3c, S6a). Such classes could be formed if a few persons carrying a mutated virus migrated to another place and the virus started to infect people there.
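
The residue comparison discussed above for SARS-CoV and SCoV2 (residues of the human outbreak strain that are found in none of the similar bat viruses, and whether those positions are conserved among the bat samples) can be sketched as follows. This is an illustrative Python example under stated assumptions, not the procedure used to build Table S2: the function names and the short toy sequences are invented, and the sequences are assumed to be already aligned to the same length.

```python
def unique_positions(target, others):
    """Return 0-based alignment columns where `target` carries a residue
    that appears in none of the `others` at the same column."""
    length = len(target)
    if any(len(s) != length for s in others):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, res in enumerate(target)
            if all(o[i] != res for o in others)]


def variable_among(others, positions):
    """Of the given columns, return those where `others` do not all agree."""
    return [i for i in positions if len({o[i] for o in others}) > 1]


# Toy aligned peptides (invented): one "human" strain and three "bat" strains.
human = "MFTFLVLQPL"
bats = ["MFVFLVLLPL", "MFIFLVLLPL", "MFVFLALLPL"]

unique = unique_positions(human, bats)         # columns unique to the human strain
not_conserved = variable_among(bats, unique)   # of those, columns that vary among bats
print(unique, not_conserved)
```

A column that is unique to the human strain but already variable among the bat samples is weaker evidence of human-specific adaptation than one that is conserved in bats, and adding further animal samples can only shrink the list of unique columns, which is the sense in which the uniqueness could disappear after further research.
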
Multiple classes were found in some countries, suggesting multiple influx routes. The class closest to the data mean was that of China (Table S4).

Shift-type alterations were observed in coronaviruses, even though the genome is not segmented. This contrasts with influenza, which can exchange RNA molecules between viruses. By focusing on the spike protein, coronaviruses were separated (Fig. 4a) similarly to the classification obtained from the whole genome (Fig. 1). However, in the classification obtained from the 1ab polyprotein, the positions of Deltacoronavirus and Embecovirus were exchanged (Fig. 4b). Additionally, the position of the nucleocapsid protein in SARS-CoV moved from OC43, losing the Embecovirus unity (Fig. 4c). These drastic changes are difficult to explain without shifts.

Coronavirus and influenza classes were fairly different in the following two aspects: although the groups were clearly separated, the differences did not match those of the hosts in coronavirus; and the divergence magnitude among coronaviruses was much lower than that of a subclass of the influenza A virus. These characteristics corroborate the assessment that "coronaviruses can apparently breach cell type, tissue, and host species barriers with relative ease" (19).

The dominant R type of influenza A H1 changed annually during its outbreak periods (3), altering its most variable residues over three decades, and Pdm09 also changed annually. Coronaviruses have shown few annual changes (Fig. 3b and S6a), which might be due to their limited infectivity (HCoV) or to the lack of infected people (MERS-CoV). SCoV2 will face the selective pressure (b) as influenza A did. If it escapes this selective pressure, it will remain among humans and spread every few years. In fact, the change in SCoV2 has begun; it has formed several classes within the short time since its emergence (Fig. 3c, S6b). The magnitudes of the PC may show the migration pathways of the classes.
They might have mutated within China, been transferred to other countries, and mutated further (Fig. S6b). The changes could be acclimation to humans (c); however, they may also relate to herd immunity (b) and/or lower lethality (d).

Fortunately, the lifespan of the classes of coronaviruses should be shorter than that of the influenza virus. The ORF lengths for the influenza virus are within a certain range, but some of the ORFs of coronavirus are quite short, e.g. the envelope protein, which is located in the conserved outermost region of the virus (24) (Fig. 2 and S3). It seems that this protein is too short to form a variable structure. Therefore, it will be a good target for herd immunity, and these conserved ORFs might be suitable for producing vaccines. In contrast, the spike protein tends to change (Fig. S3a, S3b) and may cause antibody-dependent enhancement (25).

Many bat coronaviruses seemed to be able to infect humans. The bat and human viruses are

The conventional classification system of coronavirus did not coincide with the relationships of the sequence data; e.g. the categories of Alpha- and Betacoronavirus were too wide. Additionally, many of the class attributions in the original sequencing records were incorrect, as was also the case for the influenza virus. Using an objective method is preferable to determine the attributions (4).

The authors declare no competing interests.

Data and materials availability: All data are available in the External Databases.

Tables S1-S4
External Databases S1-S3
(An HTML version of the supplementary information, which is easier to read, is available at https://www.biorxiv.org/content/10.1101/2020.04.26.063032v2)

Table S4.

Fig. 4 A. Classification obtained from the amino acid sequences of the spike protein. The relationships between the classes were similar to those estimated from the entire nucleotide sequences (Fig. 1).

References and Notes
Characterization of a Novel Coronavirus Associated with Severe Acute Respiratory Syndrome
SARS-CoV-2 and COVID-19: The most important research questions
Re-evaluation of the evolution of influenza H1 viruses using direct PCA
Principal Component Analysis applied directly to Sequence Matrix
Coronavirus infection and hospitalizations for acute respiratory illness in young children
From SARS to MERS, Thrusting Coronaviruses into the Spotlight
What Have We Learned About Middle East Respiratory Syndrome Coronavirus Emergence in Humans? A Systematic Literature Review
WHO (2020) Middle East respiratory syndrome coronavirus (MERS-CoV)
A new coronavirus associated with human respiratory disease in China
A pneumonia outbreak associated with a new coronavirus of probable bat origin
WHO (2020) Coronavirus disease (COVID-19) Pandemic
DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment
R: A language and environment for statistical computing (R Foundation for Statistical Computing)
Molecular Evolution of the SARS Coronavirus During the Course of the SARS Epidemic in China
Isolation and Characterization of Viruses Related to the SARS Coronavirus from Animals in Southern China
Date of origin of the SARS coronavirus strains
Recombination, Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission
Characterization of viral RNA splicing using whole-transcriptome datasets from host species
DNA Repair: The Search for Homology
Infection, Replication, and Transmission of Middle East Respiratory Syndrome Coronavirus in Alpacas