key: cord-189561-jhvwozsn authors: Chechetkin, Vladimir R.; Lobzin, Vasily V. title: Combining Detection and Reconstruction of Periodic Motifs in Genomic Sequences with Transitional Genome Mapping date: 2020-10-14

A method of transitional automorphic mapping of the genome on itself (TAMGI) aims to combine detection and reconstruction of periodic motifs in genomic RNA/DNA sequences. The periodic motifs (whether tandem or sparse) are assumed to be randomly modified by point mutations and indels during molecular evolution and to be present in the genomes in a hidden form. TAMGI is robust with respect to the patched phasing of periodic motifs induced by indels. We developed and tested the relevant theory and statistical criteria for TAMGI applications. The applications of TAMGI are illustrated by the study of hidden periodic motifs in the genomes of the severe acute respiratory syndrome coronaviruses SARS-CoV and SARS-CoV-2 (the latter being responsible for the COVID-19 pandemic), which are packaged within a filament-like helical capsid. Such a ribonucleocapsid is transported into a spherical membrane envelope with incorporated spike glycoproteins. Two other examples concern the genomes of viruses with icosahedral capsids, satellite tobacco mosaic virus (STMV) and bacteriophage ϕX174. Part of the quasi-periodic motifs in these viral genomes evolved owing to weakly specific cooperative interaction between genomic ssRNA/ssDNA and nucleocapsid proteins. The symmetry of the capsids leads to the natural selection of specific quasi-periodic motifs in the related genomic sequences. Generally, TAMGI provides a convenient tool for the study of numerous molecular mechanisms involving both quasi-periodic motifs and complete repeats, of genome organization, of the contextual analysis of cis/trans regulatory elements, for data mining, and for the study of correlations in genomic sequences.
A variety of genetic molecular mechanisms are based on specific molecular interactions between biomolecules. In many cases they are related to (quasi-)periodic motifs in the underlying genomic sequences. In particular, satellite DNA composed of tandem repeats plays an important role in the structural organization of the chromosomes of higher organisms, whereas short tandem repeats in human genomes are widely used in medical diagnostics and forensics [1-5]. Similar short tandem repeats were also found in some prokaryotic genomes [6]. However, in the majority of cases the quasi-repeating patterns are present in the genomes in a hidden form owing to random point mutations and indels that occurred during molecular evolution. The hidden repeating patterns can be tandem or sparse. The best-known examples are the patterns with the period p = 3 in protein-coding regions [7] and the patterns with the period p = 10.5 bp related to the pitch of B-form dsDNA (see, e.g., [8, 9] and references therein). The methods available for statistical detection of hidden repeating patterns in genomic sequences are numerous and comprise the discrete Fourier transform (DFT) [10-14], the discrete double Fourier transform (DDFT) [15-17], the wavelet transform [18, 19], the Ramanujan transform [20], correlation functions [21-23], and information theory [24, 25] (for a review and further references see, e.g., [26, 27]). These methods reveal the presence of hidden repeating patterns, while explicit reconstruction of the underlying motifs is commonly performed separately. Typically, quasi-repeating patterns are searched for explicitly by specifying the lengths of repeats within a chosen interval and a permissible number of mismatches [28-31].
The statistical significance of the lengths and composition of found repeats can be assessed by the comparison with the counterpart characteristics in the random sequences with the same nucleotide composition [32, 33] . By using such detection technique, the periodic phasing of repeats is not obligatory and found repeats may often be considered as scattered rather than phased. The patched phasing of hidden repeats induced by indels hampers additionally the explicit reconstruction of repeats. In this paper we present a method of transitional automorphic mapping of the genome on itself (TAMGI) combining detection and reconstruction of hidden repeating patterns, retaining phasing of repeats, and robust with respect to indels. The method of TAMGI was sketched briefly in [34] when applying to the study of ribonucleocapsid assembly/packaging signals in the genomes of the severe acute respiratory syndrome coronaviruses SARS-CoV and SARS-CoV-2. In this publication we develop the extended theory and derive the relevant statistical criteria for analysis of the output set of patterns in TAMGI. To illustrate the applications, we chose the genomes of the coronaviruses SARS-CoV and SARS-CoV-2 packaged within the helical capsid as well as the genomes of two viruses with the icosahedral capsids, satellite tobacco mosaic virus (STMV) and bacteriophage ϕX174 studied previously [34, 35] by different methods. Such choices allow us to cross-check the features detected by TAMGI and other methods. The main idea behind the method of TAMGI is simple; if repeating motifs (whether tandem or sparse) are separated by a space s, they can be mapped onto each other by the step s. The method is robust with respect to indels because such mapping retains phasing of repeats within relatively long phased stretches separated by indels. The algorithm for TAMGI is defined as follows. Let a nucleotide N m be positioned at a site m of the genomic sequence. 
The nucleotide $N_m$ is retained if at least one of its neighbors, $N_{m-s}$ or $N_{m+s}$, is of the same type; otherwise it is replaced by a void (denoted traditionally by a hyphen). All neighbors of the same type, $N_{m-s}$ or $N_{m+s}$, are retained as well. The resulting sequence after TAMGI is composed of nucleotides of four types (A, C, G, T) and hyphens "-" denoting voids. Further analysis reduces to the enumeration of all complete words of length k (k-mers) composed only of nucleotides (voids within the complete words are prohibited) and flanked by voids "-" at the 5'- and 3'-ends, $-N^k-$. By definition, the complete words are non-overlapping. At the next stage, mismatches to the complete words, represented by hyphens, can be studied. To avoid end effects and to ensure homogeneity of the mapping, linear genomes are always circularized,

$$N_{m+M,\alpha} = N_{m,\alpha}, \tag{1}$$

where $N_{m,\alpha}$ denotes the nucleotide of the type $\alpha \in (A, C, G, T)$ positioned at the site m and M is the genome length. Choosing a circularized mapping is natural and convenient for the study of quasi-repeating motifs. The theory and simulations show that for even M the step M/2 should be considered apart from the other steps. Therefore, the range of steps can be chosen from 1 to

$$s_{max} = [(M-1)/2], \tag{2}$$

where the brackets denote the integer part of the quotient. Formally, the action of TAMGI with the step s on the genome G can be presented in an operator form,

$$G_s = R_s G. \tag{3}$$

Repeating the operation with s fixed yields the same result,

$$R_s R_s G = R_s G. \tag{4}$$

Eq. (4) means that TAMGI can be considered as a projection because $R_s^2 = R_s$. TAMGI operations with different steps do not commute with each other,

$$R_s R_{s'} \neq R_{s'} R_s, \quad s \neq s'. \tag{5}$$

The frequencies of nucleotides after TAMGI with the step s,

$$\varphi_{\alpha,s} = N_{\alpha,s}/M, \qquad \varphi_{total,s} = \sum_{\alpha} \varphi_{\alpha,s}, \tag{6}$$

where $N_{\alpha,s}$ is the number of nucleotides of the type α retained after the mapping, should be properly normalized to assess their statistical significance. The normalization ought to be performed against the counterpart characteristics in random sequences of the same nucleotide composition. Below, we always imply that the length of the genome satisfies the condition M >> 1.
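The mapping rule described above can be illustrated with a minimal Python sketch. The function names `tamgi`, `complete_words`, and `step_range` are hypothetical and do not come from the original work. The sketch assumes an upper-case string over A, C, G, T, applies the circularization of the genome, and takes the upper step limit as the integer part of (M - 1)/2, which agrees with the step ranges 1-528 for M = 1058 and 1-2692 for M = 5386 quoted later in the text. As a simplification, words are enumerated on the linear representation, so a word that wraps around the origin is reported as two pieces.

```python
from typing import List


def tamgi(seq: str, s: int) -> str:
    """Map a circularized sequence onto itself with step s.

    A nucleotide at site m is kept if the nucleotide at site m - s or
    m + s (indices taken modulo the genome length) is of the same type;
    otherwise it is replaced by the void symbol "-".
    """
    M = len(seq)
    mapped = []
    for m, nt in enumerate(seq):
        left = seq[(m - s) % M]    # circular neighbor at distance s upstream
        right = seq[(m + s) % M]   # circular neighbor at distance s downstream
        mapped.append(nt if nt == left or nt == right else "-")
    return "".join(mapped)


def complete_words(mapped: str) -> List[str]:
    """Return the maximal void-free runs (complete words) of a mapped sequence.

    Assumption: the sequence is scanned linearly, without joining a word
    that continues across the origin of the circular genome.
    """
    return [word for word in mapped.split("-") if word]


def step_range(M: int) -> range:
    """Steps 1 .. [(M - 1)/2]; for even M the step M/2 is left out."""
    return range(1, (M - 1) // 2 + 1)
```

For example, `tamgi("ACGTACGAT", 4)` gives `"ACGTACG-T"`: only the A at the eighth site has no identical neighbor at distance 4 on either side. Applying the function a second time with the same step leaves the result unchanged, in line with the projection property of the mapping.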
The frequencies of the combinations (non-$N_\alpha$)$(N_\alpha)^k$(non-$N_\alpha$) (here it is implied that any neighboring nucleotides are separated by the step s) can be assessed in random sequences as

$$p_{\alpha,k} = (1-\varphi_\alpha)^2 \varphi_\alpha^k, \quad k = 1, 2, \ldots, \tag{7}$$

where $\varphi_\alpha$ is the frequency of nucleotides of the type α in the genome. Then the frequency of nucleotides of the type α after application of TAMGI, $\Phi_\alpha$, is expressed through the frequencies defined by Eq. (7) as follows,

$$\Phi_\alpha = \sum_{k=2}^{\infty} k\, p_{\alpha,k} = \varphi_\alpha - (1-\varphi_\alpha)^2 \varphi_\alpha = \varphi_\alpha^2 (2-\varphi_\alpha), \tag{8}$$

while the total frequency of nucleotides after TAMGI is obtained by summation over α,

$$\Phi_{total} = \sum_{\alpha} \Phi_\alpha. \tag{9}$$

The variances for the frequencies defined by Eqs. (8) and (9) are determined for random sequences by the binomial distribution,

$$\sigma^2(\Phi_\alpha) = \Phi_\alpha (1-\Phi_\alpha)/M, \qquad \sigma^2(\Phi_{total}) = \Phi_{total} (1-\Phi_{total})/M. \tag{10}$$

This means that the total frequency defined by Eq. (6) can be presented in terms of a normalized deviation,

$$\kappa_s = (\varphi_{total,s} - \Phi_{total})/\sigma(\Phi_{total}) \tag{11}$$

(for notations, see the equations above). The frequency of words of length k, $-N^k-$ (where N means a nucleotide of any type), in random sequences can be assessed as

$$p_k = (1-\Phi_{total})^2 \Phi_{total}^k, \tag{12}$$

where $\Phi_{total}$ is defined by Eqs. (8) and (9). Let there be $n_{k,s}$ words of length k after TAMGI with the step s. Then the empirical probability to detect a word longer than k' can be determined as

$$P_s(k > k') = \sum_{k > k'} n_{k,s} \Big/ \sum_{k \geq 1} n_{k,s}. \tag{13}$$

The summation in Eq. (13) is performed up to the maximum detected length. The frequency of words with lengths exceeding k' in random sequences is given by

$$\sum_{k > k'} p_k = (1-\Phi_{total})\, \Phi_{total}^{k'+1}. \tag{14}$$

The probability to find a word with a length exceeding k' in the complete pool of words can be calculated as

$$P(k > k') = \sum_{k > k'} p_k \Big/ \sum_{k \geq 1} p_k = \Phi_{total}^{k'}. \tag{15}$$

The empirical probability for natural genomic sequences, Eq. (13), should be compared with the predictions for random sequences expressed by Eq. (15). The difference between the natural and theoretical distributions can be statistically assessed by the Kolmogorov-Smirnov criterion.
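The random-sequence expectations discussed above can be sketched numerically. The following Python fragment is an illustration under stated assumptions, not the authors' implementation: sites are treated as independent draws with the genomic nucleotide frequencies, a nucleotide of a given type survives the mapping when at least one of its two step-s neighbors is of the same type, the variance is binomial, and word lengths in the complete pool follow a geometric law. All function names are hypothetical.

```python
import math
from typing import Dict


def retained_frequency(phi: float) -> float:
    """Expected frequency of one nucleotide type after the mapping.

    Probability that a site carries the type (phi) times the probability
    that at least one of its two step-s neighbors carries it as well.
    """
    return phi * (1.0 - (1.0 - phi) ** 2)


def total_retained_frequency(freqs: Dict[str, float]) -> float:
    """Sum of the expected retained frequencies over all nucleotide types."""
    return sum(retained_frequency(phi) for phi in freqs.values())


def normalized_deviation(observed_total: float,
                         freqs: Dict[str, float],
                         M: int) -> float:
    """Observed total retained frequency in units of the binomial
    standard deviation expected for a random sequence of length M."""
    expected = total_retained_frequency(freqs)
    sigma = math.sqrt(expected * (1.0 - expected) / M)
    return (observed_total - expected) / sigma


def word_length_survival(k_prime: int, total_frequency: float) -> float:
    """Probability that a word from the complete pool is longer than k_prime,
    assuming the geometric law P(k > k') = total_frequency ** k_prime."""
    return total_frequency ** k_prime


# Usage with the SARS-CoV composition quoted in the text (M = 29751).
counts = {"A": 8481, "G": 6187, "T": 9143, "C": 5940}
M = sum(counts.values())
freqs = {nt: n / M for nt, n in counts.items()}
expected_total = total_retained_frequency(freqs)
```

A positive normalized deviation at a given step then indicates that more nucleotides survive the mapping than expected for a random sequence of the same composition.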
The probability density corresponding to probability (15) is defined as

$$p(k) = P(k > k-1) - P(k > k) = (1-\Phi_{total})\, \Phi_{total}^{k-1}. \tag{16}$$

Then the mean length of words and its variance in random sequences after TAMGI are given by

$$\langle k \rangle = 1/(1-\Phi_{total}), \tag{17}$$

$$\sigma^2(k) = \Phi_{total}/(1-\Phi_{total})^2. \tag{18}$$

The mean number of all words in random sequences after TAMGI can be assessed as

$$\langle n \rangle = M \sum_{k \geq 1} p_k = M (1-\Phi_{total})\, \Phi_{total}. \tag{19}$$

The maximum length of words within the interval of steps [...] A typical threshold for the statistical significance is $N_{thr}$ = 0.05, whereas the lower boundary for the maximum length corresponds to $N_{thr} \cong 1$. The theoretical predictions for TAMGI of random sequences were tested by reshuffling the genomic sequence of SARS-CoV (M = 29751, $N_A$ = 8481, $N_G$ = 6187, $N_T$ = 9143, $N_C$ = 5940). The frequencies averaged over 10,000 random realizations were in complete agreement with the theoretical results (8)-(10). The spectrum of normalized deviations (11) is shown in Fig. 1A and is homogeneous for the circularized genome. As expected, the normalized deviations (11) are governed by Gaussian statistics (see Fig. 1B). The distribution of k-mer lengths after TAMGI for an arbitrarily chosen step s = 7437 is shown in Fig. 1C, while Fig. 1D shows the distribution of k-mer lengths for the steps 1-500. In both Figs. 1C and 1D the deviations of the distributions of word lengths for a particular random realization from the theoretical probability defined by Eq. (15) appeared to be insignificant by the Kolmogorov-Smirnov criterion. As expected from Eq. (20), the maximum length detected for the set associated with the interval 1-500 appeared to be longer than that for a particular step. Thus, the tests show that the suggested characteristics can be used for the study of real genomes.

Fig. 1. [...] N(0, 1). c The distribution of k-mer lengths after TAMGI for a random sequence obtained by reshuffling the genomic sequence of SARS-CoV for an arbitrarily chosen step s = 7437 and its comparison with the theoretical prediction based on Eq. (15) (straight line).
d The distribution of k-mer lengths after TAMGI for a random sequence obtained by reshuffling the genomic sequence of SARS-CoV for the steps within the interval 1-500 and its comparison with the theoretical prediction based on Eq. (15) (straight line).

For illustration of TAMGI applications, we chose motifs in viral genomes that evolved owing to specific interactions of genomic RNA/DNA during packaging into the protein capsid [36, 37]. As the main information in viral [...] and nucleocapsid proteins should lead to the natural selection of specific quasi-periodic motifs in the related genomic sequences [34]. In this section we show how these features can be established with the TAMGI method. We took for comparative analysis and illustration one genomic sequence for SARS-CoV [...]

Fig. 2. [...] of these signals. The horizontal lines correspond to the Gaussian p-value 0.05 for the random sequences. c The distribution of k-mer lengths after TAMGI with the step s = 54 for the genome of SARS-CoV (shown by crosses) and its comparison with the counterpart distribution for a random reshuffled sequence (shown by circles). d The distribution of k-mer lengths after TAMGI with the step s = 54 for the genome of SARS-CoV-2 (shown by crosses) and its comparison with the counterpart distribution for a random reshuffled sequence (shown by circles). e The distribution of k-mer lengths after TAMGI for the steps within the interval 1-500 for the genome of SARS-CoV (shown by crosses) and its comparison with the counterpart distribution for a random reshuffled sequence (shown by circles). f The distribution of k-mer lengths after TAMGI for the steps within the interval 1-500 for the genome of SARS-CoV-2 (shown by crosses) and its comparison with the counterpart distribution for a random reshuffled sequence (shown by circles). Axes: step s; length k'.
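The comparison between a genome and its reshuffled counterpart described in the captions above can be sketched as follows. This is a hedged illustration with hypothetical function names: it re-implements the mapping rule in compact form, reshuffles a sequence while preserving its nucleotide composition, and computes only the two-sample Kolmogorov-Smirnov distance between the empirical word-length distributions. The corresponding significance level is not computed here, and each list of word lengths is assumed to be non-empty.

```python
import random
from collections import Counter
from typing import List


def tamgi(seq: str, s: int) -> str:
    """Keep a nucleotide if a circular neighbor at distance s matches it."""
    M = len(seq)
    return "".join(
        nt if nt in (seq[(m - s) % M], seq[(m + s) % M]) else "-"
        for m, nt in enumerate(seq)
    )


def word_lengths(mapped: str) -> List[int]:
    """Lengths of the void-free words, scanned linearly (no wrap-around)."""
    return [len(word) for word in mapped.split("-") if word]


def reshuffle(seq: str, seed: int = 0) -> str:
    """Random permutation of the sequence with the same composition."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def survival(lengths: List[int]) -> List[float]:
    """Empirical probabilities P(k > k') for k' = 0 .. maximum length."""
    total = len(lengths)
    counts = Counter(lengths)
    remaining = total
    result = []
    for k_prime in range(max(lengths) + 1):
        remaining -= counts.get(k_prime, 0)   # drop words of length k'
        result.append(remaining / total)
    return result


def ks_distance(lengths_a: List[int], lengths_b: List[int]) -> float:
    """Largest absolute difference between two empirical survival functions."""
    sa, sb = survival(lengths_a), survival(lengths_b)
    width = max(len(sa), len(sb))
    sa = sa + [0.0] * (width - len(sa))       # pad the shorter one with zeros
    sb = sb + [0.0] * (width - len(sb))
    return max(abs(x - y) for x, y in zip(sa, sb))
```

In use, one would apply `tamgi` with the same step to the genome and to `reshuffle(genome)`, pass both results through `word_lengths`, and compare them with `ks_distance`; pooling the lengths over a range of steps gives the analogue of the distributions for the interval 1-500.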
The sequences after TAMGI with the step s = 54 and the lists of the words, $-N^k-$, with k ≥ 12 for TAMGI with the steps within the interval s = 1-500 are explicitly reproduced in Supplements S1-S4. As is seen from the list of motifs in Supplement S3, part of the longest words for SARS-CoV was generated by the poly-A signal positioned at the 3'-end of the genome and corresponding to transcription termination. The counterpart signal was not reproduced in the version of the genomic sequence for SARS-CoV-2. We retained, however, the poly-A signal to illustrate the ability of TAMGI to detect such signals as well. The comparison of the repertoires of words with lengths k ≥ 6 reveals their partial divergence between SARS-CoV and SARS-CoV-2; however, such words appeared to be closely conserved among different isolates of SARS-CoV-2 despite the load from point mutations and indels [34]. The choice of the most conserved motifs at the step s = 54 and of the epitopes in N proteins responsible for the interaction between ssRNA and N proteins provides promising therapeutic targets for the development of an antiviral vaccine [45-47] (see also [34] for discussion and further references). The next two examples concern viruses with icosahedral capsids. In the virus world, more than half of viruses belong to this class [36, 37]. The icosahedral symmetry comprises 15 two-fold axes, 10 three-fold axes, and 6 five-fold axes. The total number of operations for the icosahedral symmetry is 60. A correspondence should be searched for between the (generally multiple) elements of icosahedral symmetry and the character of the large-scale quasi-periodic segmentation induced by weakly specific cooperative interactions between genomic RNA/DNA and capsid proteins. STMV, a small icosahedral plant virus with a linear positive-strand ssRNA genome, may be considered one of the smallest reproducing species in nature (for a review and further references see [48, 49]).
Its reproduction requires both a host cell and a host virus (tobacco mosaic virus in this case). The icosahedral capsid, consisting of 60 identical subunits with genomic ssRNA inside, was resolved at 1.4 Å resolution [50]. The visible RNA revealed 30 double-helical segments, each about 9 bp in length, packaged along the edges of the capsid icosahedron [50, 51]. The corresponding quasi-periodic segmentation into 30 segments with the period p ≈ 35.3 nt was clearly pronounced in the DFT-DDFT spectra for genomic sequences [35]. The genomic sequence of STMV is of length M = 1058 (GenBank accession: M25782). About half of the genome contains two overlapping ORFs, the longer of which codes for the coat protein, whereas the other half contains UTR. The overview of TAMGI deviations for the genome of STMV is shown in Fig. 3A. A significant deviation for s = 38, which can be associated with the period p ≈ 35.3 nt, appeared to be biased. The approximately equidistant peaks at s = 75 and 110 were biased as well. Only a significant deviation at the step s = 69 can be associated with the doubled period p ≈ 35.3 nt. The capsid assembly is presumed to proceed hierarchically via a 5-fold intermediate [52]. As genomic ssRNA participates actively in virion assembly [53], a similar mechanism may be suggested for RNA packaging. The peaks that can be associated with 5-fold segmentation are shown in Fig. 3A (s = 211 and 427). The distribution of k-mers after TAMGI with the step s = 211 is presented in Fig. 3B, whereas the complete distribution of k-mers after TAMGI with the steps from 1 to 528 is shown in Fig. 3C. The deviations between these distributions and the counterpart distributions for random sequences were found to be statistically insignificant by the Kolmogorov-Smirnov criterion. The sequence after TAMGI with s = 211 in GenBank format and the list of words with lengths k ≥ 9 are reproduced explicitly in Supplements S5 and S6, respectively.
The plot for the maximum length of words at the step s is shown in Fig. 3D. The insert to Fig. 3D [...]

The bacteriophage ϕX174, belonging to the Microviridae family, presents an example of icosahedral viruses with ssDNA genome packaging (for a review see, e.g., [54]). The capsid of the mature virus is composed of 60 copies each of the coat protein F, the spike protein G, and the DNA-binding protein J, and of 12 copies of the pilot protein H [55]. [...] both modes of segmentation, into 30 and 180 segments, were highly significant by DDFT analysis [35]. The deviation at the step s = 90 associated with segmentation into 60 segments, though not prominent, was also significant. The high deviation for s = 1828 can be associated with 3-fold segmentation. The distributions of k-mers for the step s = 180 and for the steps from 1 to 2692 are presented in Figs. 4B and 4C in comparison with the counterpart distributions for a particular random realization. Though the differences between the two distributions are poorly seen on the logarithmic scale, they are significant by the Kolmogorov-Smirnov criterion ($\mathrm{Pr} < 10^{-4}$). The plot for the maximum k-mer lengths at the step s is shown in Fig. 4D and again reveals the short-range correlations seen in the insert. The clustering of maximum lengths around the steps associated with the characteristic segmentation modes compatible with icosahedral packaging is marked explicitly in [...]

The common paradigm for the cooperative specific interaction between genomic RNA/DNA and proteins is based on consensus motifs in genomic sequences recognized by protein epitopes. The length of consensus motifs is implied to be approximately fixed, though, generally, the motifs may be variable and clustered into particular groups. TAMGI indicates the possibility of another scenario for the molecular evolution of motifs.
In the molecular mechanisms related to quasi-periodic motifs, the evolution of motifs may start from approximately phased nucleotides of a particular type, followed by the subsequent generation of longer and longer motifs with a distribution resembling that defined by Eq. (15). TAMGI reconstructs the longer motifs from the shorter ones. The abundance of primary phased nucleotides facilitates the generation of longer motifs in comparison with the counterpart motifs in random sequences. The longest motifs may serve as a seed of a particular molecular mechanism or play a multifunctional role similar to cis/trans regulatory elements. The contextual analysis of the known cis/trans elements with TAMGI may elucidate the mechanisms of their evolution and be useful for data mining. The extension of TAMGI to the mapping of complementary stretches is non-trivial. Such analysis is important, e.g., for the prediction of secondary structures of ssRNA viruses within icosahedral capsids [56-58]. The geometrical constraints imposed by the capsid and the interaction between RNA and capsid proteins make such a secondary structure different from the free structure in solution. A possible extension of TAMGI for this problem can be achieved by using two windows of equal length w with centers separated by the step s. To exclude the overlapping of windows, the inequality s > w should be fulfilled. The fragment within one of the windows should first be mapped onto its complementary counterpart, and then this complementary fragment can be mapped onto the fragment in the second window by rules similar to TAMGI. Such complementary mapping retains the mutually complementary nucleotides within the windows. Varying the parameters s and w provides a set of two-parameter complementary mappings. The pool of k-mer motifs reconstructed by TAMGI needs additional analysis. The overlapping shorter motifs can be used for the putative assembly of the longer motifs.
Nevertheless, even relatively short motifs with k ≥ 6 for the step s = 54 appeared to be stably reproduced in the genomes of SARS-CoV-2 isolates [34] . This means that the characteristic quasi-repeating motifs can be used, e.g., for subtyping of viruses or for the assessment of evolutionary divergence between species. The inhomogeneity of motif distribution over the genome and the regions with the highest and the lowest content of motifs may indicate their involvement into molecular mechanisms. To sum up, TAMGI method developed in this article is quite general and can be applied to the combined detection/reconstruction of quasi-periodic motifs in the genomic RNA/DNA sequences. The method provides insight into mechanisms of evolving specific motifs in the genomic sequences. Its scope of applications comprises numerous molecular mechanisms with participation of both quasi-periodic motifs and complete repeats. The method can also be applied to the study of correlations in the genomic sequences. 
[1] Development and use of molecular markers: past and present
[2] Practical applications of DNA genotyping in diagnostic pathology
[3] Short tandem repeat expansions and RNA-mediated pathogenesis in myotonic dystrophy
[4] Advanced topics in forensic DNA typing: methodology
[5] Forensic use of Y-chromosome DNA: a general overview
[6] Satellites in the prokaryote world
[7] Gene prediction based on DNA spectral analysis: a literature review
[8] 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding
[9] Coexistence of different base periodicities in prokaryotic genomes as related to DNA curvature, supercoiling, and transcription
[10] The 14-fold periodicity in alpha-tropomyosin and the interaction with actin
[11] Search of hidden periodicities in DNA sequences
[12] Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats
[13] Periodic power spectrum with applications in detection of latent periodicities in DNA sequences
[14] Identification of CpG islands in DNA sequences using short-time Fourier transform
[15] Large-scale chromosome folding versus genomic DNA sequences: A discrete double Fourier transform technique
[16] Detection of large-scale noisy multi-periodic patterns with discrete double Fourier transform
[17] Detection of large-scale noisy multi-periodic patterns with discrete double Fourier transform. II. Study of correlations between patterns
[18] Localizing triplet periodicity in DNA and cDNA sequences
[19] Identification of protein-coding regions using modified Gabor-wavelet transform with signal boosting technique
[20] Detecting periodicities in eukaryotic genomes by Ramanujan Fourier transform
[21] Study of correlations in DNA sequences
[22] The study of correlation structures of DNA sequences: a critical review
[23] Study of statistical correlations in DNA sequences
[24] Repeats and correlations in human DNA sequences
[25] Information decomposition method to analyze symbolical sequences
[26] Order and correlations in genomic DNA sequences. The spectral approach
[27] Comparative analysis of periodicity search methods in DNA sequences
[28] Tandem repeats finder: a program to analyze DNA sequences
[29] Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression
[30] Tandem repeats over the edit distance
[31] Generic repeat finder: a high-sensitivity tool for genome-wide de novo repeat detection
[32] Limit distributions of random variables associated with long duplications in a sequence of independent trials
[33] Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes
[34] Ribonucleocapsid assembly/packaging signals in the genomes of the coronaviruses SARS-CoV and SARS-CoV-2: Detection, comparison and implications for therapeutic targeting
[35] Genome packaging within icosahedral capsids and large-scale segmentation in viral genomic sequences
[36] Viral molecular machines
[37] Structure and physics of viruses
[38] Coronaviruses
[39] Medical virology: from pathogenesis to disease control
[40] Supramolecular architecture of the coronavirus particle
[41] Coronavirus genomic RNA packaging
[42] The SARS coronavirus nucleocapsid protein-forms and functions
[43] Structure of the SARS coronavirus nucleocapsid protein RNA-binding dimerization domain suggests a mechanism for helical packaging of viral RNA
[44] Electron microscopy studies of the coronavirus ribonucleoprotein complex
[45] RNA vaccines
[46] Recent insights into the development of therapeutics against coronavirus diseases by targeting N protein
[47] The nucleocapsid protein of SARS-CoV-2: a target for vaccine development
[48] Satellite tobacco mosaic virus
[49] Satellite tobacco mosaic virus RNA: structure and implications for assembly
[50] Satellite tobacco mosaic virus refined to 1.4 Å resolution
[51] A model for the structure of satellite tobacco mosaic virus
[52] RNA-protein interactions in some small plant viruses
[53] Rewriting nature's assembly manual for a ssRNA virus
[54] The microviridae: Diversity, assembly, and experimental evolution
[55] Analysis of the single-stranded DNA bacteriophage ϕX174, refined at a resolution of 3.0 Å
[56] Probing viral genomic structure: alternative viewpoints and alternative structures for satellite tobacco mosaic virus RNA
[57] Packaged and free satellite tobacco mosaic virus (STMV) RNA genomes adopt distinct conformational states
[58] Challenges and approaches to predicting RNA with multiple functional structures