key: cord-0961921-32k84j81 authors: Siddiqe, Rasel; Ghosh, Ajit title: Genome-wide in silico identification and characterization of Simple Sequence Repeats in diverse completed SARS-CoV-2 genomes date: 2021-01-26 journal: Gene Rep DOI: 10.1016/j.genrep.2021.101020 sha: 6270a85f42b443e9d7ad0811a10cf5111f477019 doc_id: 961921 cord_uid: 32k84j81

Simple sequence repeats (SSRs), or microsatellites, are short repeat sequences that have been extensively studied in eukaryotic (plants) and prokaryotic (bacteria) organisms. Compared to other organisms, studies of the presence and incidence of SSRs in viral genomes are less numerous. With the emergence of novel infectious viruses over the past few decades, it is imperative to study the genetic diversity of such viruses to predict their evolutionary and functional changes over time. Following the emergence of SARS-CoV-2, we assembled 121 complete genomes reported from 31 countries across six continents for the identification and characterization of SSRs. Using two independent SSR identification tools, we found remarkable consistency in the microsatellite pattern (38-42 per genome) across the 121 analyzed SARS-CoV-2 genomes, indicating their important role in genome stability. Among the identified motifs, trinucleotide and hexanucleotide repeats were the most abundant, followed by mono- and dinucleotide repeats. There were no tetra- or pentanucleotide repeats in the analyzed SARS-CoV-2 genomes. The discovery of microsatellites in SARS-CoV-2 genomes may become useful for population genetics, evolutionary analysis, strain identification and the study of genetic variation.

Coronavirus disease 2019 (COVID-19) is an acute respiratory infectious disease caused by a novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus belongs to the genus Betacoronavirus in the subfamily Coronavirinae, family Coronaviridae, order Nidovirales (Saha et al., 2020; Weiss and Leibowitz, 2011).
According to serotype and genomic characteristics, coronaviruses can be divided into four major genera: alpha and beta, which primarily infect mammals, and gamma and delta, which predominantly infect birds (Tang et al., 2015). Coronaviruses are enveloped, unsegmented, single positive-stranded RNA viruses with a genome length varying from 26 to 32 kilobases (Wang et al., 2020). (Khan et al., 2020; Zhou et al., 2020). COVID-19 was initially found in China but spread rapidly all over the world (Guo et al., 2020). As of 30th November 2020, the total number of diagnosed COVID-19 cases exceeded 63 million worldwide, with more than 1.4 million deaths (https://www.worldometers.info/coronavirus/). SARS-CoV-2 has caused a state of alarm across the world due to its high infection rate and its mortality among elderly and immune-deficient individuals. Because knowledge of this novel virus is very limited and it is transmitted at a high rate across all age groups and diverse demographic populations, genome sequencing and comparative genomics have attracted much attention. Moreover, advances in sequencing technologies and analysis tools have accelerated the process to an unprecedented speed. The first three novel coronavirus genomes (GISAID accession IDs: EPI_ISL_402119, EPI_ISL_402120 and EPI_ISL_402121) were sequenced from Wuhan (Wu et al., 2020). Currently, over 94,000 SARS-CoV-2 genomes have been sequenced and deposited in public databases such as GenBank (Benson et al., 2000) and GISAID (Shu and McCauley, 2017). To understand the molecular genetics, evolutionary genomics and other important features of these viruses, a reliable biomarker such as SSRs could be an excellent tool. Simple sequence repeats (SSRs) are short tandem repeat sequences found across the genomes of all organisms. SSRs are essentially sequences of varying length containing repeats of 1-6 nucleotides.
Several characteristics are associated with SSR sequences: they are present ubiquitously in any genome (Li et al., 2004); their accumulation has been associated with variation in genome size (Gao and Qi, 2007); they can exist in both coding and non-coding sequences (Riley and Krieger, 2009); and they are highly variable and polymorphic in nature (Kim et al., 2008). SSRs are found to be associated with recombination hotspots and random integration. This could explain why pathogenic organisms use this variability to combat host immune responses (Zhao et al., 2012). One of the most extensive applications of SSRs is their use as genetic markers (Heesacker et al., 2008; Temnykh et al., 2001). A few notable results have also been obtained using SSRs in genome mapping, as well as in ecological and evolutionary biology. Although several independent studies have focused on SSRs in viral genomes, a distinct distribution pattern is yet to be established (Chen et al., 2011). Viral SSRs are capable of generating genomic diversity that in turn manifests as phenotypic change (Li et al., 2004). Genome features including length and GC content largely influence their occurrence (Dieringer and Schlötterer, 2003; Kelkar et al., 2008). Here, we have investigated the distribution, size and GC content variability among 121 SARS-CoV-2 genome sequences isolated from different countries and identified the prevalence of SSR markers.

Complete genome sequences of SARS-CoV-2 (121) were acquired from the NCBI Viruses database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). Sequences were collected from 31 countries (Table S1) and selected according to the date of data deposition, ranging from early January 2020 to late June 2020. The sequence data were processed in FASTA format. Two SSR identification tools were used in the study.
First, the Simple Sequence Repeat Identification Tool (SSRIT, https://archive.gramene.org/db/markers/ssrtool) was used to detect perfect SSR motifs in the given sequences in FASTA format. The minimum number of repeats was set to 5 for dimeric repeats, 3 for trimeric, tetrameric and pentameric repeats, and 2 for hexameric repeats. The resulting configuration of minimum repeat numbers is therefore 5-3-3-3-2. make sure the presence of diversity.

The genomic sequences, including their accession number, size, attributed region and GC content, are summarized in Table 1. The incidence frequency of SSRs in the 121 genomes varied at a negligible level (Fig. 1). Relative abundance (RA) and relative density (RD) of SSRs were calculated as the number of repeats per kilobase pair (kb) and the total length of repeats per kb, respectively (Figs. 2 and 3). Relative abundance was calculated for each type of repeat (i.e., monomeric, dimeric, trimeric and hexameric, denoted RA1, RA2, RA3 and RA6) as well as for the total number of repeats in a sequence (Tables 2 and 3). The SSRs identified by the IMEX and SSRIT tools showed little variation among the 121 genome sequences. Similarly, relative density (RD) was calculated as the total length of repeats divided by the genome size in kb for all the repeats detected by both IMEX and SSRIT. RD values vary more for the SSRs identified by IMEX because of the inconsistency of monomeric repeats (Fig. 2B). The highest values of total RA and RD from the IMEX tool are 1.42 and 14.89, while the lowest values are 1.27 and 13.29, respectively (Table 2). Likewise, the highest values of total RA and RD from the SSRIT tool are 1.37 and 14.36, while the lowest are 1.27 and 13.45, respectively (Table 3). The IMEX analysis of monomeric repeats showed that 50 sequences do not contain any monomeric repeat, 59 contain only one, and the remaining 12 contain two.
Of these 59 sequences with only one monomeric repeat, 45 contained (A)n while the remaining 14 contained (T)n (Table S2) and S66 (MT447176). Exceptionally, the (AATAGG)n motif was found in only one sequence, S74 (MT539160). All other hexameric repeats were found precisely once in every sequence. These SSR markers were distributed in the ORF1ab, S, ORF3ab, ORF7a, and N regions of the SARS-CoV-2 genome (Fig. 5). The largest number of motifs, 24, was found in the ORF1ab region, followed by 5 motifs each in the S, ORF3ab, and N regions, and only one motif in the ORF7a region.

The correlation of genome size and GC content with the relative abundance (RA) and relative density (RD) of SSRs was determined. For the SSRs detected by the IMEX tool, genome size correlated positively with total RA (r = 0.52, R² = 0.271, P < 0.05) and RD (r = 0.419, R² = 0.176, P < 0.05), while the correlation with G/C content was -0.102 (R² = 0.010, P > 0.1) for RA and 0.147 (R² = 0.022, P > 0.1) for RD. Surprisingly, total RA and RD obtained from the SSRIT tool correlated negatively with genome size, with coefficients of -0.0595 (R² = 0.003, P > 0.1) and -0.107 (R² = 0.011, P > 0.1), respectively. Further analysis suggested that RA and RD are both positively correlated with G/C content, with coefficients of 0.310 (R² = 0.096, P < 0.05) and 0.331 (R² = 0.110, P < 0.05), respectively. Since the genome sizes of the analyzed viruses are very similar, with little variation from one to another, a significant correlation was not expected.

Due to the advancement of next-gen DNA sequencing technologies, microbial genome could be

We calibrated our identification tools so that tandem repeat sequences shorter than 9 bp or longer than 15 bp are not counted. The minimum number of repeats for each type follows the 10-5-3-3-3-2 configuration for mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats.
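The detection thresholds and the RA/RD definitions above can be illustrated with a minimal sketch. It is not the SSRIT or IMEX implementation: it is a simplified scanner for perfect repeats only, the function names and the toy sequence are hypothetical, and the 9-15 bp length window mentioned above is not applied.

```python
# Minimum repeat counts per motif length (mono- to hexa-), taken from the
# 10-5-3-3-3-2 configuration quoted in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 3, 4: 3, 5: 3, 6: 2}


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            # Skip ambiguous bases and motifs such as "ATAT" that are really
            # repeats of a shorter unit.
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            count = 1
            while seq[i + count * unit_len:i + (count + 1) * unit_len] == motif:
                count += 1
            if count >= min_rep:
                hits.append((i, motif, count))
                i += count * unit_len  # jump past the whole repeat tract
            else:
                i += 1
    return sorted(hits)


def relative_abundance(hits, genome_length):
    """Number of SSRs per kilobase of sequence."""
    return len(hits) / (genome_length / 1000)


def relative_density(hits, genome_length):
    """Total base pairs covered by SSRs per kilobase of sequence."""
    covered = sum(len(motif) * count for _, motif, count in hits)
    return covered / (genome_length / 1000)


# Toy sequence (hypothetical), not a SARS-CoV-2 fragment.
toy = "G" + "TTC" * 3 + "G" + "A" * 10 + "C" + "AT" * 5 + "G"
hits = find_perfect_ssrs(toy)
print(hits)
print(relative_abundance(hits, len(toy)), relative_density(hits, len(toy)))
```

Passing `{2: 5, 3: 3, 4: 3, 5: 3, 6: 2}` as `min_repeats` would mimic the 5-3-3-3-2 setting described for SSRIT, which omits monomeric repeats. Real tools may differ in how they treat overlapping or imperfect tracts, so counts from this sketch should not be expected to match Tables 2 and 3.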
We identified a remarkably similar pattern in all 121 genomes, which may be due to the high level of sequence conservation in SARS-CoV-2. Independent studies on vertebrate and plant genomes have provided a basis for categorizing the most common SSR motifs. The most common SSR motif in animals and invertebrates is (GT)n (Stallings et al., 1991), whereas in plants it is (AT)n (Lagercrantz et al., 1993), and in insects the most common motif is thought to be (CT)n (Paxton et al., 1996). The dinucleotide repeats AT/TA and AG/GA were found to be the two most prominent forms in the largest RNA virus family, Closteroviridae (George et al., 2016). Following a similar trend, SSR analysis of viral genomes revealed the most common motif to be (AT)n (Zhao et al., 2012). SARS-CoV-2 deviates from this trend: its most common repeats are the trimeric (TTC)n and (CTT)n, which were present multiple times in all of the analyzed genomes. In the SARS-CoV-2 genome, the hexameric motif was the most abundant type of repeat (49%), followed by trinucleotide repeats (42%); the other two types, mono- and dimeric repeats, were present at 4% (Table S3), while tetra- and pentanucleotide repeats were absent. In partial agreement with our results, trinucleotide SSRs were found to be the most frequent type in SpliMNPV and Human Immunodeficiency Virus Type 1 (HIV-1). However, the genome of hepatitis C virus ( (Karaoglu et al., 2005) and plant genomes (Morgante et al., 2002). A weak influence of genome size and GC content on the number, relative abundance and relative density of microsatellites has been established in various analyzed HCV genomes (Chen et al., 2011). Our findings suggest that relative abundance and relative density are positively correlated with genome size and that the correlation is statistically significant. Conversely, the correlation with G/C content is positive but not statistically significant.
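The relationship between a correlation coefficient, its R² value and a significance test can be sketched as follows. This is an illustration under assumptions: the text does not state how its P values were computed, so the t statistic shown here is simply the standard test for a Pearson coefficient, and the function names and toy data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical toy values standing in for genome size (bp) and total RA.
sizes = [29782, 29811, 29845, 29870, 29903]
ra = [1.28, 1.31, 1.30, 1.36, 1.40]
r = pearson_r(sizes, ra)
print(round(r, 3), round(r * r, 3))

# Using the coefficient reported in the text for IMEX RA versus genome size:
print(round(0.52 ** 2, 3), round(t_statistic(0.52, 121), 2))
```

Squaring the reported coefficient of 0.52 gives about 0.270, in line with the R² of 0.271 quoted in the results once rounding of the coefficient is taken into account. With 121 genomes the test has 119 degrees of freedom, so even moderate coefficients can reach P < 0.05.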
In establishing distribution patterns of SSRs in SARS-CoV-2, it can be concluded that there is no significant pattern in the distribution of SSRs in viral genomes. It can also be said that the number of SSRs present in a genome cannot be considered proportional to genome size, as the sequences used in this study were broadly similar in size (Table 1).

There was no funding received to carry out this work.

Table S1. Country tri-letter code legend
Table S2. Type and occurrence of monomeric repeats in 121 sequences from IMEX

That creates a variation in the total number of identified SSR motifs and presented in the figure.

Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species
Whole genome molecular phylogeny of large dsDNA viruses using composition vector method
Differential distribution and occurrence of simple sequence repeats in diverse geminivirus genomes
In silico genome-wide identification and analysis of microsatellite repeats in the largest RNA virus family (Closteroviridae)
The origin, transmission and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak-an update on the status
SSRs and INDELs mined from the sunflower EST database: abundance, polymorphisms, and cross-taxa utility
Genome wide survey of microsatellites in ssDNA viruses infecting vertebrates
Survey of simple sequence repeats in completed fungal genomes
The genome-wide determinants of human and chimpanzee microsatellite evolution
Emergence of a novel coronavirus, severe acute respiratory syndrome coronavirus 2: biology and therapeutic options
Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference
The abundance of various polymorphic microsatellite motifs differs between plants and vertebrates
Microsatellites within genes: structure, function, and evolution
Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes
Mating structure and nestmate relatedness in a communal bee, Andrena jacobi (Hymenoptera, Andrenidae), using microsatellites
Embryonic nervous system genes predominate in searches for dinucleotide simple sequence repeats flanked by conserved sequences
Complete Genome Sequence of a Novel

The authors declare that there is no competing interest. There was no funding for this particular study. AG conceived the idea and designed the experiments. RS performed all the analysis. Both authors wrote the initial draft of the manuscript and approved the final version.