key: cord-1044588-jkm4fuw4
authors: Nakashima, Akiko; Takeya, Mitsue; Kuba, Keiji; Takano, Makoto; Nakashima, Noriyuki
title: Virus database annotations assist in tracing information on patients infected with emerging pathogens
date: 2020-10-08
journal: Inform Med Unlocked
DOI: 10.1016/j.imu.2020.100442
sha: 82e7bdd906912b90b51190c521c545350ea291b4
doc_id: 1044588
cord_uid: jkm4fuw4
The global pandemic of SARS-CoV-2 has disrupted human social activities. In restarting economic activities, successive outbreaks by new variants are concerning. Here, we evaluated the applicability of public database annotations to estimate the virulence, transmission trends and origins of emerging SARS-CoV-2 variants. Among the detectable multiple mutations, we retraced the mutation in the spike protein. With the aid of the protein database, structural modelling yielded a testable scientific hypothesis on viral entry to host cells. Simultaneously, annotations for locations and collection dates suggested that the variant virus emerged somewhere in the world in approximately February 2020, entered the USA and propagated nationwide with periodic sampling fluctuation likely due to an approximately 5-day incubation delay. Thus, public database annotations are useful for automated elucidation of the early spreading patterns in relation to human behaviours, which should provide objective reference for local governments for social decision making to contain emerging substrains. We propose that additional annotations for past paths and symptoms of the patients should further assist in characterizing the exact virulence and origins of emerging pathogens.
The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has disrupted social and economic activities worldwide since the first outbreak in China in 2019 [1].
COVID-19 presents varied symptomatic features [2], [3], [4], [5], [6], [7] with a wide range of incubation periods and epidemic curves [...] are occurring [21], [22], the pathogenicity and origins of the mutated substrains of SARS-CoV-2 should be available in real time so that authorities can adopt early measures at the onset of emergence. In parallel with individual treatment at hospitals and clinics, specimens from infected patients are directly sequenced, and the genetic information of SARS-CoV-2 is being globally sampled and added to public databases [31], [32], [33]. The databases have been used to predict viral transmissibility, antibody affinities and drug efficacy [34]. The cross-disciplinary usability of databases should promote the feedback of accumulating raw data to predict the actual profiles of pathogenic diseases [35]. Simple real-time surveys with regional public assistance are fundamentally necessary in an internationally available format. Here, we utilized these database annotations to detect virus variants and to estimate the virulence and transmission trajectories of the emerging substrains. We examined the nucleotide mutations and visualized the transmission trajectories of SARS-CoV-2 by consulting the world specimens registered in the virus data bank of the National Center for Biotechnology Information (NCBI) [32].
Because it gives access to the raw nucleotide and protein data with multiple annotations in a simple FASTA format, we used the data deposited in the NCBI Virus-SARS-CoV-2 data hub [36]. In the "Refine results" window, we restricted the data by release date to 2019/1/11-2020/5/3 (from 11 Jan 2019 to 3 May 2020). The latest data at that point had been deposited on 1 May 2020. In the "Results" window, we sorted the records by "Length" in ascending order.
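The sorting step described above can also be reproduced outside the web interface. The following is a minimal Python sketch, not the authors' procedure: it parses FASTA text and orders the records by sequence length. The function names and the example header layout (accession, country and collection date separated by `|`) are assumptions made for illustration only.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records) -> List[Tuple[str, str]]:
    """Return the records ordered by ascending sequence length."""
    return sorted(records, key=lambda rec: len(rec[1]))


# Hypothetical two-record input; the header fields are illustrative only.
example = """>rec1 |USA|2020-03-01
ATGTTT
GTT
>rec2 |China|2020-01-05
ATG
"""

records = sort_by_length(parse_fasta(example))
for header, seq in records:
    print(len(seq), header)
```

Sorting by length places partial sequences at the top of the list, which makes them easy to inspect and exclude.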
Then, we obtained 23042 FASTA-formatted data of Protein [...] for coronavirus spike glycoprotein are 6vsb [38] (SARS-CoV-2), 5xlr [39] (SARS-CoV), 5x5c [40] [...]
We built original programs to manipulate the large datasets. The code is provided as supplementary data in Excel Visual Basic (Office Professional 2016, Microsoft Corporation, WA, USA). All the source code is provided with annotations. Each program operates as follows: "prPCVcov2" places each single-letter code of every amino acid sequence in a separate cell, for all the different protein datasets in a FASTA-formatted data file; "priNuc" extracts the text string "GAT" or "GGT" that immediately follows the text string "CCAGGTTGCTGTTCTTTATCAG". See Supplementary notes 1 and 2 for how to use the programs. The processed metrics were visualized with KaleidaGraph 4 (HULINKS Inc., Tokyo, Japan), and artworks were originally created with Illustrator (Adobe Systems Incorporated, CA, USA). The SVG data were obtained from the public domain under the license of CC0 1.0 at [...] and originated from the United States Central Intelligence Agency's World Factbook. The periodicity of sampling of the mutated specimens was analysed by power spectrum using Axograph X (Version 1.7.4).
Coronaviruses are unique RNA viruses equipped with proofreading machinery [45]. However, substantial mutations were expected, leading to the overestimation of substrains with unchanged genetic codons. On the other hand, amino acid mutations occur less frequently due to the wobble nature of codons [46]. Therefore, we consulted the NCBI database [36] and utilized a downloaded data table with all the applicable annotations (see also Methods). Among them, partial sequences or incomplete readouts were eliminated. We used 1500-2000 nucleotide and protein sequences with all applicable annotations, including sampling dates, locations, and genetic information of the virus.
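The "priNuc" step described in the Methods can be illustrated with a short Python sketch. This is not the authors' Visual Basic code; it only mirrors the stated logic of reading the three nucleotides that follow the marker string "CCAGGTTGCTGTTCTTTATCAG". The function names, the handling of "U" as "T", and the labels "D614", "G614" and "other" are assumptions introduced here.

```python
from collections import Counter
from typing import Optional

# Marker string quoted in the Methods; the codon for residue 614 follows it.
MARKER = "CCAGGTTGCTGTTCTTTATCAG"


def codon_after_marker(seq: str, marker: str = MARKER) -> Optional[str]:
    """Return the three nucleotides following the marker, or None if absent."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA notation
    pos = seq.find(marker)
    if pos == -1:
        return None
    start = pos + len(marker)
    codon = seq[start:start + 3]
    return codon if len(codon) == 3 else None


def classify(codon: Optional[str]) -> str:
    """Map the extracted codon to a label (labels are illustrative)."""
    return {"GAT": "D614", "GGT": "G614"}.get(codon, "other")


# Hypothetical sequences for demonstration only.
samples = [
    "AAA" + MARKER + "GATGTTAAC",
    "AAA" + MARKER + "GGTGTTAAC",
    "AAACCCGGG",
]
tally = Counter(classify(codon_after_marker(s)) for s in samples)
print(tally)
```

Sequences that lack the marker, for example partial readouts, fall into the "other" class rather than being counted as either variant.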
We detected the accumulation of the same mutations or the branching to multiple amino acids at approximately 100 residues in several component proteins of SARS-CoV-2 (Supplementary Table 1). Despite the genome proofreading ability of coronaviruses [45], multiple random mutations in SARS-CoV-2 have been reported [47], [48]. Any of these conversions might be attributable to increased or decreased virulence of viral particles [29]. In particular, the presence and increase of identical mutations at the same residues from different specimens could be due to the transmissible pathogenic substrains of SARS-CoV-2 [12]. Mutations in the amino acid sequences have indeed occurred in different phases of the COVID-19 pandemic and are probably fixed, inherited and dominantly spreading around the world. However, the pathogenicity and exact origins of these variations are difficult to retrace using this mutation profile alone. Among the proteins with frequent mutations, the surface glycoprotein, namely, the spike or S protein, contained a single eminent mutation from aspartate (D) to glycine (G) at residue 614 (D614G conversion; Supplementary Table 1). The relative mutation frequency at each residue was calculated as information entropy to digitize the variations across the S protein in the database [30] and visualized in a spectral view across the S protein: D614G appeared in the early stage of the COVID-19 pandemic and accumulated over time (Fig. 1a). Within the D614G substrain, additional major mutations accumulated in other viral proteins, in contrast to the original D614 strain (Supplementary Table 1, Supplementary Figure 1); D614G could therefore be an initial mutation of a substrain that later became more dominant worldwide.
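The per-residue information entropy mentioned above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state: the sequences are already aligned to equal length, and the entropy is the Shannon entropy in bits (base-2 logarithm). The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_entropy(residues: Sequence[str]) -> float:
    """Shannon entropy (bits) of the residues observed at one position."""
    counts = Counter(residues)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(sequences: List[str]) -> List[float]:
    """Entropy at every position of equal-length, pre-aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy(column) for column in zip(*sequences)]


# Toy alignment: position 0 is conserved, position 1 is split between D and G.
toy = ["AD", "AG", "AD", "AG"]
profile = entropy_profile(toy)
print(profile)
```

A fully conserved position gives an entropy of 0, and a position split evenly between two residues gives 1 bit, so positions such as 614 stand out as peaks in a plot of the profile.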
Therefore, we next investigated the possible impact of the D614G mutation in the S protein of the converted substrain by structural analysis and estimated the regional origins by sampling periodicity analysis based on the obtained Excel data. We consulted the Protein Data Bank [43] and the Swiss Institute of Bioinformatics resource portal [44] for the subsequent structural analysis. The spike of SARS-CoV-2 forms a homotrimer. Each S protein, comprising approximately 1300 amino acids, is a large transmembrane protein containing two subdomains, S1 and S2, which are responsible for receptor binding and [...] D614 and the corresponding residues slightly deviated or did not exist in the equivalent positions among the other coronaviruses (Fig. 2). Structurally, D614 is embedded in the S1 domain of the S protein, facing another protein unit within the trimer (Fig. 3a, b), but this aspartic acid residue is not accessible from the orifice for receptor binding. Thus, D614G conversion is expected to change the inter- and intramolecular properties of the spike trimers. Molecular simulation predicted that the single D614G replacement would increase the thermal fluctuation not only in the vicinity but also throughout S, especially in the S2 subunit near the viral membrane [52] (Fig. 3c-f). D614G conversion removes the side chain of the aspartic acid residue, and the distance between residue 614 and T859 of another protein unit should expand from 4.4 to 6.4 angstroms (Fig. 3g, h). This estimation indicates that the D614G mutation should change the inter-subunit interaction in the subdomain and the conformational state of the receptor binding domain so that the mutated viral particles can effectively interact with their cognate receptor in host cells for viral entry [12], [34], [53].
Next, we investigated the mutations at the genome level to retrace the transmission history of the D614G-converted virus.
By analysing the nucleotide database, we detected the identical conversion from guanine-adenine-uracil to guanine-guanine-uracil (GAU to GGU) in all the D614G-converted cases despite 8 more convertible codons. We then retraced the specimens with a GGU mutation. The GGU specimens exponentially increased in March 2020 worldwide (Fig. 4a). As of May 1, the conversion was found in more [...]
Specimens of the original SARS-CoV-2 without mutations had already been reported in the United States in January 2020 (Fig. 5a). The sampling ratio of the GGU-mutated specimens with respect to the original GAU specimens suddenly increased at the end of February, followed by periodic fluctuations (Fig. 5b). The spectral analysis indicated that the predominant transmission interval ranged from 4 to 6 days. This period most likely corresponds to the approximate incubation delay at the early phase of transmission of the mutated substrain within the United States (Fig. 5c). When the database was consulted again on 22 June 2020, the deposited data had increased not only in total number (from 1866 to 7596 specimens in the world) but also in the number of monthly specimens based on collection dates: January, from 84 to 129; February, from 78 to 189; March, from 1527 to 4155; April, from 161 to 2468 specimens. Therefore, many specimens were deposited after a substantial delay because of the collection [...] trends in spectral analysis were mainly unchanged; the periodicity was slightly sharpened (Fig. 5c). Collectively, the annotations in the virus genome database are of fundamental use to hypothesize the pathogenicity and to trace the transmission route at the early phase of emergence of the new substrains. These results have elucidated the need for additional annotations on patients (Fig. 6a, b), which should reinforce the utility of virus genomic annotations by characterizing the symptomatic features (Fig. 6c).
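The spectral analysis of the daily GGU/GAU sampling ratio was performed with Axograph X. As a rough illustration of the underlying idea only, the sketch below computes a plain discrete Fourier transform periodogram of a daily series and reports the period with the highest power. The function names and the synthetic 5-day test signal are assumptions and do not reproduce the study data.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def periodogram(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (period in samples, power) for each nonzero DFT frequency."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant offset
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        result.append((n / k, abs(coeff) ** 2 / n))
    return result


def dominant_period(values: Sequence[float]) -> float:
    """Period (in samples, here days) carrying the largest spectral power."""
    return max(periodogram(values), key=lambda item: item[1])[0]


# Synthetic daily ratio with a built-in 5-day cycle over 30 days.
series = [1.0 + 0.5 * math.sin(2 * math.pi * t / 5) for t in range(30)]
print(dominant_period(series))
```

With real data, the series would be the daily ratio of GGU to GAU specimens by collection date, and a peak between 4 and 6 days would correspond to the interval reported above.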
Furthermore, annotations with follow-ups and outcomes will update the profiles of COVID-19 substrains, including severity, morbidity or unique symptomatic trends. The current NCBI database can elucidate the weekly or monthly trends in the propagation of emerging virus variants. Estimations of the transmission trajectories and close contacts by multiple data comparisons can refine the current genomic and geographical features of SARS-CoV-2 [25], [30], [54]. However, the exact origins and the pathogenicity of the virus variants need to be further refined [55], [56]. The virus is closely linked to human behaviours and health conditions. If viral information is tagged with additional annotatable data on the patients, we can make the best of the limited number of specimens. In particular, travel history and medical records are critically useful. Such human-associated information should be tagged to the virus information. To stop further economic loss on a global scale, the restrictions on international personal travel will be eased in the future. If any outbreaks of other pathogenic substrains that require different medical treatments [57], [58], [59] occur during this ongoing pandemic, the restart of global traffic may result in sequential attacks of variant viruses on human society. In regard to the urgent clinical necessity, it is also important to locate the origins of the emergence of new strains of fatal viruses to prepare medical facilities for the emergency [29]. The annotation tags for patients' mobility history linked to virus information should be useful to retrace the detailed transmission paths of virus variants using similar filtering functions in the Excel format. There should be a deposit delay after collection under conditions of social turmoil even with the aid of next-generation sequencing [15], [60], [61]. Thus, occasional updating of the same datasheet is important.
Even though voluntary service is not always unlimited all over the world, international cooperation for fixed-point surveys is necessary to reinforce the global monitoring and retracing of the transmission paths of emerging pathogens along with human mobility. Structural modelling would be helpful for hypothesising pathogenicity. However, a portion of the conformational data downstream of D614 is also missing from the PDB database. Such information will add further insights into the molecular mechanisms. Structural information should be accompanied by symptomatic features [29], [56] to ultimately understand pathogenicity in humans on the basis of experimental studies [62]. The outbreak has been rather small in Japan [63], and the mechanisms remain unknown due to the lack of available information on the disease, namely, the patient symptoms. Currently, individual case reports, case series, regional analyses, and meta-analyses are conducted under ethical regulations and structured protocols [17], [64], [65], [66]. These close observations by trained clinical staff characterized the unique symptoms in COVID-19, including olfactory and gustatory impairments [67]. However, meta-analyses of the symptoms appear only later [17], [64]. Moreover, the majority of young patients with COVID-19 are suspected to be asymptomatic [68]. If the emergence of the new variant is traced and retraced in a real-time public platform with visualization [30], [69] utilizing such medical data with human mobility history data [70], governments and other authorities can take swift and flexible actions to contain the virus. At present, little is known about the specific symptoms of COVID-19 [71]. An analysis of the trends of a pandemic is reinforced by open source, public databases with medical annotations.
Medical records on past paths of human mobility should be used to refine the total profile of virus-human relationships with acceptable anonymity [71]. Since only partial data are available at the initial time of deposit, the information may need continual updates. SARS-CoV-2 and other emerging infectious diseases [72], [73] are associated with human socioeconomic activities together with environmental and ecological factors. Compartmentalization of the world into monitorable regions based on human mobility trends [74] and sentinel surveillance including pathogen sampling with patient medical records is necessary. Local governments around the world should share real-time information on the changing nature of viruses and could conduct regional prevention measures, including caution procedures, travel restrictions and lockdowns [9]. Additionally, the susceptibility of animals to pathogens as the intermediate transmission source should be addressed [75]. Along with a risk assessment for animal-borne infections via domestic animals or animals in zoos, a simple tag for infected animals must be used to estimate the potential risks of zoonosis [76], [77]. Conventionally, the use of databases to predict other diseases has been developed as a disease mining method. The literature can be searched using MeSH terms [78], and the extraction of available data from the literature, including case reports using natural languages [79], can be conducted as a meta-analysis or evidence-based medicine (EBM). However, a lack of verbalized resources can be a barrier to EBM [80]. Diagnosis by digital data is especially powerful in evaluating electrocardiogram, gene-phenotype association, and pathological data [81], [82] or radiographic images [79]. Composite phenotypes can also be assessed through multivariate correlations [83], [84]. Automated clustering using digital annotations should decrease the substantial risk of overlooking a relevant prior study or finding.
Artificial intelligence (AI) can further optimize the diagnostic accuracy [85]. However, AI may confront other risks in overlooking minor trends in rare cases by overfitting errors [85]. Virus and medical annotation tags in a simple and unified spreadsheet format are preferable for further analyses in the future. The empirical insights of medical staff are surely needed for detailed annotations, which is important for the emergence of unique pathogens. The current databases are already powerful and useful and can evolve based on the needs of the implementation of sociomedical science. We propose the use of additional annotation tags for patients that are anonymized with maximum privacy protection and informed consent on sampling virus genetic data around the world without borders. Additionally, a cooperative system of international databases [32], [33] in a single platform might also be helpful during this global emergency. Urgent international discussion is needed.
An interactive web-based dashboard to track COVID-19 in real time
Effect of changing case definitions for COVID-19 on the epidemic curve and transmission parameters in mainland China: a modelling study
Updated rapid risk assessment from ECDC on coronavirus disease 2019 (COVID-19) pandemic: increased transmission in the EU/EEA and the UK
Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China
Large-Vessel Stroke as a Presenting Feature of Covid-19 in the Young
Association between a Novel Human Coronavirus and Kawasaki Disease
Phylogenetic network analysis of SARS-CoV-2 genomes
Temporal dynamics in viral shedding and transmissibility of COVID-19
Interrupting COVID-19 transmission by implementing enhanced traffic control bundling: Implications for global prevention and control efforts
COVID-19: A New Virus, but a Familiar Receptor and Cytokine Release Syndrome
Encouraging results from phase 1/2 COVID-19 vaccine trials
Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
Highly sensitive detection of SARS-CoV-2 RNA by multiplex rRT-PCR for molecular diagnosis of COVID-19 by clinical laboratories
Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based on SARS-CoV Immunological Studies
The end of social confinement and COVID-19 re-emergence risk
Closed environments facilitate secondary transmission of coronavirus disease 2019 (COVID-19)
Clinical characteristics of COVID-19 in 104 people with SARS-CoV-2 infection on the Diamond Princess cruise ship: a retrospective analysis
COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE)
Making Decisions in a COVID-19 World
Interpreting Diagnostic Tests for SARS-CoV-2
A single variant sequencing method for sensitive and quantitative detection of HIV-1 minority variants
Development and clinical application of a rapid IgM-IgG combined antibody test for SARS-CoV-2 infection diagnosis
The receptor binding domain of the viral spike protein is an immunodominant and highly specific target of antibodies in SARS-CoV-2 patients
BlueTrace: A privacy-preserving protocol for community-driven contact tracing across borders
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Mobile Fact Sheet
Panic and generalized anxiety during the COVID-19 pandemic among Bangladeshi people: An online pilot survey early in the outbreak
Prevalence of and Risk Factors Associated With Mental Health Symptoms Among the General Population in China During the Coronavirus Disease 2019 Pandemic
SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate
Nextstrain: real-time tracking of pathogen evolution
Global Spread of SARS-CoV-2 Subtype with Spike Protein Mutation D614G is Shaped by Human Genomic Variations that Regulate Expression of TMPRSS2 and MX1 Genes
Severe acute respiratory syndrome coronavirus 2 data hub
Buckland-Merrett, Data, disease and diplomacy: GISAID's innovative contribution to global health
COVID-19 Coronavirus spike protein analysis for synthetic vaccines, a peptidomimetic antagonist, and therapeutic drugs, and analysis of a proposed achilles' heel conserved region to minimize probability of escape mutations and drug resistance
Implementation science in times of Covid-19
Virus Variation Resource-improved response to emergent viral outbreaks
KEGG: Kyoto Encyclopedia of Genes and Genomes
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding
Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains
The Protein Data Bank
Protein Identification and Analysis Tools in the ExPASy Server
Coronaviruses: an RNA proofreading machine regulates replication fidelity and diversity
Celebrating wobble decoding: Half a century and still much is new
Geographic and Genomic Distribution of SARS-CoV-2
Variant analysis of SARS-CoV-2 genomes
A crucial role of angiotensin converting enzyme 2 (ACE2) in SARS coronavirus-induced lung injury
Understanding Human Coronavirus HCoV-NL63
Cryo-EM analysis of a feline coronavirus spike protein reveals a unique structure and camouflaging glycans
Insights into changes in binding affinity caused by disease mutations in protein-protein complexes
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
Comment Improving epidemic surveillance and response: big data is dead, long live big data
A real-time dashboard of clinical trials for COVID-19
Antibodies and vaccines against Middle East respiratory syndrome coronavirus
A human monoclonal antibody blocking SARS-CoV-2 infection
Discovering drugs to treat coronavirus disease 2019 (COVID-19)
Rapid implementation of for real-time epidemiology of COVID-19
Epidemic Models of Contact Tracing: Systematic Review of Transmission Studies of Severe Acute Respiratory Syndrome and Middle East Respiratory Syndrome
Simulation of the clinical and pathological manifestations of Coronavirus Disease 2019 (COVID-19) in golden Syrian hamster model: implications for disease pathogenesis and transmissibility
An interactive web-based dashboard to track COVID-19 in real time
SARS-CoV-2 Genome Analysis of Japanese Travelers in Nile River Cruise
Impact of cerebrovascular and cardiovascular diseases on mortality and severity of COVID-19-systematic review, meta-analysis, and meta-regression
Prevalence of comorbidities and its effects in coronavirus disease 2019 patients: A systematic review and meta-analysis
Olfactory and gustatory dysfunctions as a clinical presentation of mild-to-moderate forms of the coronavirus disease (COVID-19): a multicenter European study
Prevalence of Asymptomatic SARS-CoV-2 Infection
Big data stream analysis: a systematic literature review
Using Google Location History data to quantify fine-scale human mobility
Estimating the efficacy of symptom-based screening for COVID-19
Global trends in emerging infectious diseases
Emerging Infectious Diseases/Pathogens
Modeling real-time human mobility based on mobile phone and transportation data fusion
Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS-coronavirus
COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives
Automated Radiology Report Summarization Using an Open-Source Natural Language Processing Pipeline
Barriers to evidence-based medicine: A systematic review
PedAM: A database for Pediatric Disease Annotation and Medicine
The human disease network
Human symptoms-disease network
Direct functional assessment of the composite phenotype through multivariate projection strategies
A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases
Satoshi Matsuoka for his advice in programming Hisato Jingami (The Graduate Courses for Integrated Research Training, Kyoto University) and Dr. Takayuki Tokimasa (Department of Physiology, Kurume University School of Medicine) for their advice in conducting the research and discussion. For their assistance with documentary filing and manuscript proofreading