key: cord-0428081-zgxn722p authors: Zelenova, Maria; Ivanova, Anna; Semyonov, Semyon; Gankin, Yuriy title: Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database date: 2021-08-04 journal: bioRxiv DOI: 10.1101/2021.08.04.454929 sha: 0032c74f2141cbe3622d8f54de2de0c33df87089 doc_id: 428081 cord_uid: zgxn722p Background The 31st of December 2019 was when the World Health Organization received a report about an outbreak of pneumonia of unknown etiology in the Chinese city of Wuhan. The outbreak was the result of the novel virus labeled as SARS-CoV-2, which spread to about 220 countries and caused approximately 3,311,780 deaths, infecting more than 159,319,384 people by May 12th, of 2021. The virus caused a worldwide pandemic leading to panic, quarantines, and lockdowns – although none of its predecessors from the coronavirus family have ever achieved such a scale. The key to understanding the global success of SARS-CoV-2 is hidden in its genome. Materials and Methods We retrieved data for 329,942 SARS-CoV-2 records uploaded to the GISAID database from the beginning of the pandemic until the 8th of January 2021. To process the data, a Python variant detection script was developed, using pairwise2 from the BioPython library. Pandas, Matplotlib, and Seaborn, were applied to visualize the data. Genomic coordinates were obtained from the UCSC Genome Browser (https://genome.ucsc.edu/). Sequence alignments were performed for every gene separately. Genomes less than 26,000 nucleotides long were excluded from the research. Clustering was performed using HDBScan. Results Here, we addressed the genetic variability of SARS-CoV-2 using 329,942 worldwide samples. The analysis yielded 155 genome variations (SNPs and deletions) in more than 0.3% of the sequences. Nine common SNPs were present in more than 20% of the samples. Clustering results suggested that a proportion of people (2.46%) were infected with a distinct subtype of the B.1.1.7 variant. The subtype may be characterized by four to six additional mutations, with four being a more frequent option (G28881A, G28882A, and G28883С in the N gene, A23403G in S, A28095T in ORF8, G25437T in ORF3a). Two clusters were formed by mutations in the samples uploaded predominantly by Denmark and Australia, which may indicate the emergence of “Danish” and “Australian” variants. Five clusters were linked to increased/decreased age, shifted gender ratio, or both. According to a correlation coefficient matrix, 69 mutations correlate with at least one other mutation (correlation coefficient greater than 0.7). We also addressed the completeness of the GISAID database, where between 77% and 93% of the fields were either left blank or filled incorrectly. Metadata mining analysis has led to a hypothesis about gender inequality in medical care in certain countries. Finally, we found ORF6 and E as the most conserved genes (96.15% and 94.66% of the sequences totally match the reference, respectively), making them potential targets for vaccines and treatment. Our results indicate areas of the SARS-CoV-2 genome that researchers can focus on for further structural and functional analysis. A virus that appeared in Wuhan in December 2019 was not initially expected to cause a worldwide crisis and a deadly pandemic. It was soon recognized as a coronavirus, a single-stranded positivesense RNA virus belonging to a Coronaviridae family, whose members have already frightened humankind before. First discovered in the 1960s, two Coronaviridae family members (CoV-229E and CoVOC43) did not present a global threat (Qi et Walls et al., 2020) . The worst outcomes of the COVID-19, the disease caused by SARS-CoV-2, are currently associated with old age (65 and older), male gender, smoking, and comorbidities such as diabetes, cardiovascular disorders, and hypertension (de Sousa et al., 2020) . At present, over a year and a half later, the reasons for SARS-CoV-2 high transmissibility are still elusive. Most researchers focus on the viral genome, its evolution, and its mutations (Kaur et al., 2020) . Studies of single nucleotide polymorphisms (SNPs) are especially beneficial in revealing heavily mutated genomes and understanding the viral changing pattern (Yin C et al., 2020) . Since common knowledge of SARS-CoV-2 proteins' functioning, signaling pathways, protein-protein, and protein-host cell interactions keeps rapidly accumulating due to its novelty, there is an urgent need to explore the SARS-CoV-2 changes. A relationship between SNPs and their consequences can be predicted from genotyping, protein analysis, and transmission tracking . SARS-CoV-2 genome was first sequenced in January 2020, a month after COVID-19 became a worldwide alert ., Zhu et al., 2020) . The genome consists of 29903 nucleotides (GenBank accession number MN908947). Its length and overall genetic contents carry little surprise since it has long been established that coronaviruses have ones of the largest genomes amid all RNA viruses (varying from ~26 to ~32 kb in length) (Kaur et al., 2020) . Although many mutations have currently been found in the viral genome , only a few of them are high-frequency: 119 SNPs exceed the 0.3% threshold, according to Yuan et al., 2020 . Based on the mutations, eight distinct viral clades have been reported by GISAID and twelve by Nextstrain (by March 2021). Specific SARS-CoV-2 variants caused the most concern. A "British variant" (also B. Table 1 . nucleotides long were included in the study, since the smaller sequences did not allow us correctly align all genes of interest. Data filtering was completed as follows: 100% match to a reference genome was required to consider a sequence highly conservative, more than 99% match -to consider it moderately conservative, alignments in a range from 99% to 93% match were marked as low conservative. The genomes that were less than 93% similar to the reference sequence contained lowquality sequences and were excluded from further analysis. As these cutoffs were determined experimentally and we considered all the viral genes separately, we were free from simply deleting all the records containing ambiguous/unidentified symbols ("N", "Y" etc.). Instead, examining genes separately increased the number of sequences that could be used in the research. If unidentified symbols were determined in the aligned gene, and their count was not equal to the count of SNPs, the sequence was included in the research. Statistical significance was measured using a t-test and Bonferroni correction (for two parametersage and gender). Correlation was measured using Pearson correlation coefficient. By the 8th of January 2021, the GISAID database had SARS-CoV-2 records deposited by 142 countries. Even though more than 329,000 records had been uploaded up until then, these data has limited research potential due to several significant problems. First, some of the uploaded sequences are dramatically smaller than the reference sequence (e.g. <5000 nucleotides) or contain an enormous (more than 7% of each gene of interest) number of ambiguous letters ( Fig.1 . represents the sequence size range obtained for the data used in current research; the smallest sequences were mostly obtained by Sanger sequencing). Another weakness is the lack of automation/control in terms of data entry to the system. That drawback led to numerous misspellings and data variants, along with missing information. Thus, the "collection date" field may include a year, a month, and a date, contain only the year, or, for some records, have a wrong year (e.g., 2002 instead of 2020). "Gender" and "Patient age" parameters are filled only for 23.3% and 23.1% of the records, respectively. The least informative for research is "Patient status," which is not only filled for just 6.9% but also contains hardly interpretable data. Records' bias is another problem. The prevalent number of genomes is uploaded by the United Kingdom (45.3%), USA (18.3%), Denmark (6.7%), and Australia (5.1%), with other countries' input ranging from 3 to less than 1 percent of all records. . The records' bias also affected the patients' status. Some countries presumably uploaded the records with predominantly one or another status (e.g., out of all records uploaded by Brazil, 40% contained patient status "Dead"). The data were considered for every viral gene separately, except for ORF1ab, which was not considered in the present research. While filtering the data to include only good-quality sequences ( sequences contain stretches with 52 "N"s, the "assembly method" is filled for 23.5%. Since we could not estimate the assembly method and its parameters, we investigated the most prevalent methods among records containing stretches with 52 "N"s. Although further research is limited due to too many variations created by manual system entry, the top assembly method was either "mapping consensus" or numerous variations of ARCTIC. Analyzing the conservation of the genes allowed us to get some insights into their importance for the virus and potential treatment (Table 3) . No insertions with a frequency greater than 0.3% were found. Two deletions were identified in the S gene: 21765-ATACATG>A with 4.67% frequency and 21991-TTTA>T with 2.94% frequency. Analyzing genomic data considering the date axis, we were not merely able to determine the most frequent mutations but also reveal their changes through the year (Supplement 2 contains data on SNPs occurring with more than 0.3% frequency among 329,942 viral genomes. Supplement 3 contains charts representing changes by month for each mutation). HDBScan yielded 43 clusters (based on data on SNPs and deletions with a frequency greater than 0.3%; Fig.2 ). Some data did not fit any cluster. According to a correlation coefficient matrix, 69 mutations have correlations with at least one other mutation ( Fig.3 ; larger resolution and lower cutoff may be found in Supplement 4). In total, 160 pairs with a correlation coefficient greater than 0.7 were found (Supplement 5). The statistical and bioinformatic analysis of 329,942 records obtained from the GISAID database yielded data concerning many areas, from database design and medical care issues to genomic mutations and their probable effects. The abovementioned results are discussed below. At the moment, one of the most promising treatment and vaccine targets is the S protein, which enables the virus to enter human cells and is already targeted in such vaccines as Gam-COVID-Vac According to clustering results, it may be proposed that, although "British variant" mutational contents may not be expanded due to the absence of the concomitant mutations in the general cohort, there is a proportion of people who got infected with its distinct subtype. The subtype may be characterized by four to six additional mutations, with four being a more frequent option (G28881A, G28882A, and G28883С in the N gene, A23403Gin S, A28095T in ORF8, G25437T in ORF3a). Both clusters containing the "British variant" mutations are also the most recent, with a mean upload time of 1.2 months ago (count back from the 8th of January 2021). A mutation in ORF3a (G26144T) that forms a cluster and is featured by increased age (57) , which lets us speculate on the existence of so-called "Danish" and "Australian" variants. Current research shows that some mutations often present together with one or more others. In total, 160 pairs of mutations with a correlation coefficient greater than 0.7 were found. Most studies in this direction focus on certain concomitant mutations. For example, D614G is often considered together with P323L. Some researchers suggest inability of D614G to cause viral success when presented alone (Ilmjärv et al., 2020; . T85I is noted to co-occur with Q57H, and P504Lwith The most frequent mutation in the analyzed genes is a mutation in the S gene -A23403G (D614G), which is found in 94.15% of all studied genomes and in 99.9% of genomes uploaded in December 2020. D614G is considered to be more infectious than the ancestral form, but not associated with Only three mutations have not been noted in the uploads since the beginning of December 2020: G26144T (G251V) and G25979T (G196V) in ORF3a, which were last uploaded around September 2020 and early December 2020, respectively, and a C28836T (S188L) in the N gene, which was last seen around early to middle November. G251V is revealed to result in the loss of a phosphatidylinositol-specific phospholipase X-box domain and a creation of a serine protease cleavage site (Issa et al, 2020) . Another work states that G251V and G196V might influence virulence, infectivity, ion channel activity, and viral release (Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, 2020). Might disappearing mutations impact viral fitness or human survival? The data is yet incomplete. However, in the present research G26144T (G251V) was found to create a cluster on its own; the mutation is featured by increased age (57) and increased proportion of women compared to the general cohort. The most recent mutation in the current analysis is A28111G (Y73C) in ORF8, which appeared in the uploaded data about early September 2020. The mutation is included in a B.1.1.7 ("British variant") mutations' list. In total, B.1.1.7 is featured by 23 mutations (Rambaut et al., 2020) and is preliminarily reported as possibly associated with an increased risk of death (Frampton D. et al, 2021) . We detected 13/14 mutations not located in the ORF1ab region and associated with the variant in the analyzed data. A T26801C mutation in the M gene was not found among mutations with a frequency greater than 0.3%. Our data yielded two mutations in the same position (freq > 0.3%): C26801G and C26801T. The discrepancy could occur due to the differences in the reference sequences. This, however, cannot be verified as Rambaut et al. did not specify the reference sequence number. We have also considered two other variants that appeared lately -B.1.351 (a variant from South Africa) and P.1 (a variant from Brazil). Nevertheless, out of 8 and 14 non-ORF1ab mutations, respectively, only 2 and 3 were detected in our analysis among highly-present mutations. Consequently, it can be speculated that either a "British variant" has more transmissivity compared to the other two variants, or this result is due to a bias because of the number of the uploads. We have revealed that the major drawback of letting the users manually fill the fields of the records led to a loss of approximately 77% to 93% of the data, depending on the parameter. The absence of quality control for genomic data yielded a presence of many sequences significantly shorter or longer than the reference genome (ranging from <5000 to 34000 nt). Many laboratories uploading the data do so significantly later than the sample collection date, some even a year later, which may distort the bioinformatic analysis. Certain laboratories indicate a month and a year, or only a year, of sample collection, omitting the day, or day and month. An important analysis factor is that most data are uploaded by the United Kingdom, which creates an overall data bias towards the UK statistics. As time is a crucial factor in a pandemic, a database update can be recommended in order to increase its value and quality. The cohort studied in the current research is represented by 52% of males and 48% of females (mean values; gender was not indicated for a subset of records (Colvin, 2017) . Russian Minister of Healthcare noted in his speech, dated August 3rd, 2020, that Russian males address outpatient hospitals 2.5 times less often than females, with inpatient estimates being about the same for both genders (reported by TASS). One more explanation is that more people working in the areas related to abundant social contact (e.g., medicine, education) in these countries are women. We suppose that this distribution may also be considered in terms of hospitalization criteria and sex differences between distinct age groups, and therefore leave this question to be still open for discussion. Although mean age across gender-filled records in our cohort was determined as 48 and mean gender as 52% of males and 48% of females, some mutations were characterized by increased or decreased age and shift in male to female ratio. In total, two mutations in the S gene exhibited a decline in more than 10 points (years or percent) for both age and gender parameters. A T24814C (D1084D) could not be further analyzed due to the lack of data. Denmark uploaded most of the records for the SNP (99.28%), but none of them contained patient age or gender. which cannot yet explain our finding (Wang R. et al., 2020.) . Beside the aforementioned data, there are 12 mutations that are 10 points different in terms of gender (Table 5) , and 2in terms of age. Also, as increased mortality risk is associated with cardiovascular diseases , the greater percentage of these disorders and thrombosis in men may contribute to fatality increase among males. A higher case fatality rate could also result from the fact that in general, among intubated patients, men are more likely to acquire ventilator-associated pneumonia (Cook et al., 1998, Ahmed, Dumanski, 2020). In this paper, we have analyzed 329,942 SARS-CoV-2 records obtained from the GISAID database. We addressed the quality of the uploaded records, gender distribution, gene conservation, SNPs, insertions and deletions, clusters, and a correlation coefficient matrix. Our research shows that mutations occurring with high frequency (>0.3%) are not abundant and constitute 155 changes concerning all genes (except ORF1ab, which was not considered in a current work). Many mutations present with concomitant changes, which may alter their consequences for the virus or a human host. All authors are employed by the commercial company Quantori in Cambridge, Massachusetts, United States. Quantori provided support in the form of salaries for the employed authors and relevant publication fees. The authors declare there are no competing interests. Single Cell RNA Sequencing of 13 Human Tissues Identify Cell Types and Receptors of Human Coronaviruses Genotyping Coronavirus SARS-CoV-2: Methods and Implications Signal Hotspot Mutations in SARS-CoV-2 Genomes Evolve as the Virus Spreads and Actively Replicates in Different Parts of the World Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Mortality in COVID-19 Disease Patients: Correlating the Association of Major Histocompatibility Complex (MHC) with Severe Acute Respiratory Syndrome 2 (SARS-CoV-2) Variants Genetic Comparison among Various Coronavirus Strains for the Identification of Potential Vaccine Targets of SARS-CoV2 Decoding SARS-CoV-2 Transmission and Evolution and Ramifications for COVID-19 Diagnosis, Vaccine, and Medicine Addendum: A Pneumonia Outbreak Associated with a New Coronavirus of Probable Bat Origin A Novel Coronavirus from Patients with Pneumonia in China Characterizing SARS-CoV-2 Mutations in the United States Global SNP Analysis of 11,183 SARS-CoV-2 Strains Reveals High Genetic Diversity Investigation of novel SARS-COV-2 variant: Variant of Concern Emergence and Rapid Spread of a New Severe Acute Respiratory Syndrome-Related Coronavirus 2 (SARS-CoV-2) Lineage with Multiple Spike Mutations in South Africa Genomic Characterisation of an Emergent SARS-CoV-2 Lineage in Manaus: Preliminary Findings Effectiveness of Covid-19 Vaccines against the B.1.617.2 (Delta) Variant SARS-CoV-2 Orf6 Hijacks Nup98 to Block STAT Nuclear Import and Antagonize Interferon Signaling Structural Insight Reveals SARS-CoV-2 ORF7a as an Immunomodulating Factor for Human CD14+ Monocytes Immune Evasion via SARS-CoV-2 ORF8 Protein? A New Coronavirus Associated with Human Respiratory Disease in China Viral Targets for Vaccines against COVID-19 Coronavirus Biology and Replication: Implications for SARS-CoV-2 Development of New Vaccine Target against SARS-CoV2 Using Envelope (E) Protein: An Evolutionary, Molecular Modeling and Docking Based Study Immunoinformatic Analysis of the SARS-CoV-2 Envelope Protein as a Strategy to Assess Cross-Protection against COVID-19 Structure and Drug Binding of the SARS-CoV-2 Envelope Protein Transmembrane Domain in Lipid Bilayers SARS-CoV-2 nsp13, nsp14, nsp15 and orf6 Function as Potent Interferon Antagonists Characterization of SARS-CoV-2 Proteins Reveals Orf6 Pathogenicity, Subcellular Localization, Host Interactions and Attenuation by Selinexor SARS-CoV-2 E Protein Is a Potential Ion Channel That Can Be Inhibited by Gliclazide and Memantine In-Silico Approaches to Detect Inhibitors of the Human Severe Acute Respiratory Syndrome Coronavirus Envelope Protein Ion Channel Epidemiologically Most Successful SARS-CoV-2 Variant: Concurrent Mutations in RNA-Dependent RNA Polymerase and Spike Protein Evolutionary Dynamics of SARS-CoV-2 Nucleocapsid Protein and Its Consequences Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant SARS-CoV-2 Spike Protein Variant Tracking Changes in SARS-CoV-2 Spike: Evidence That D614G Increases Infectivity of the COVID-19 Virus Emergence and Spread of a SARS-CoV-2 Variant through Europe in the Summer of 2020 The Newly Introduced SARS-CoV-2 Variant A222V Is Rapidly Spreading in Lazio Region, Italy Molecular Dynamics Simulation Study of Effects of Key Mutations in SARS-CoV-2 on Protein Structures Effects of SARS-CoV-2 Mutations on Protein Structures and Intraviral Protein-Protein Interactions The SARS-CoV-2 ORF10 Is Not Essential in Vitro or in Vivo in Humans SARS-CoV-2 and ORF3a: Non-Synonymous Mutations and Polyproline Regions The Species Severe Acute Respiratory Syndrome-Related Coronavirus: Classifying 2019-nCoV and Naming It SARS-CoV-2 on behalf of COVID-19 Genomics Consortium UK (CoG-UK). Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Genomic Characteristics and Clinical Effect of the Emergent SARS-CoV-2 B.1.1.7 Lineage in London, UK: A Whole-Genome Sequencing and Hospital-Based Cohort Study Measures to contain the COVID-19 outbreak in migrant worker dormitories Overall Care-Seeking Pattern and Gender Disparity at a Specialized Mental Hospital in Bangladesh Gender, Health and Change in South Africa: Three Ways of Working with Men and Boys for Gender Justice Russian Minister of Healthcare's speech reported by TASS Clinical Features of Patients Infected with 2019 Novel Coronavirus in Wuhan, China Demographic Science Aids in Understanding the Spread and Fatality Rates of COVID-19 A Geroscience Perspective on COVID-19 Mortality Spike Mutation Pipeline Reveals the Emergence of a More Transmissible Form of SARS-CoV-2 Coronavirus COV-19/SARS-CoV-2 Affects Women Less than Men: Clinical Response to Viral Infection Addendum: A Pneumonia Outbreak Associated with a New Coronavirus of Probable Bat Origin Sex Influences the Effect of Body Mass Index on the Vascular Response to Angiotensin II in Humans Prevalence of Comorbidities and Its Effects in Patients Infected with SARS-CoV-2: A Systematic Review and Meta-Analysis Risk Factors for ICU-Acquired Pneumonia Sex, Gender and COVID-19: A Call to Action Authors thank Tatiana