key: cord-254942-g51mjj2b
authors: Touati, Rabeb; Tajouri, Asma; Mesaoudi, Imen; Oueslati, Afef Elloumi; Lachiri, Zied; Kharrat, Maher
title: New methodology for repetitive sequences identification in human X and Y chromosomes
date: 2020-10-19
journal: Biomed Signal Process Control
DOI: 10.1016/j.bspc.2020.102207
sha: 
doc_id: 254942
cord_uid: g51mjj2b

Repetitive DNA sequences make up the major proportion of the human genome and of the genomes of other species. The importance of each repetitive DNA type depends on many factors, such as its structural and functional roles, positions, lengths and number of repetitions. Whether such DNA sequences are conserved at different locations in the chromosome remains an open question for researchers in biology. Detecting their location despite their great variability, and finding novel repetitive sequences, remains a challenging task. To side-step this problem, we developed a new method based on signal and image processing tools. Using this method, we could find repetitive patterns in DNA images regardless of the repetition length. This new technique seems to be more efficient in detecting new repetitive sequences than bioinformatics tools. Indeed, the classical tools show limited performance, especially in the case of mutations (insertions or deletions), whereas modifying one or a few pixels in the image does not affect the global form of the repetitive pattern. As a consequence, we generated a new repetitive patterns database which contains tandem and dispersed repeated sequences. The highly repetitive sequences that we identified in the X and Y chromosomes are shown to be located in other human chromosomes or in other genomes. The data we generated are then taken as input to a convolutional neural network classifier. The system we constructed is efficient and gives an average recognition score of 94.4%.

Repetitive DNAs are sequences with multiple copies in the genome. They are rarely associated with clearly defined biological functions. Some of the moderately repetitive sequences may be involved in gene expression regulation. Other repetitive DNA is mobile: it is made up of transposable genetic elements (TEs) that are involved in the genome evolution process. TEs are divided into classes according to their transposition mechanism and structure. Retrotransposons are an example of a TE class that moves via an RNA intermediate: this RNA is transcribed from the DNA and subsequently copied back into DNA. Repetitive DNA occurs as tandem repeats or as scattered repeated sequences. These repetitive DNA sequences can be classified into two types: highly repetitive or moderately repetitive sequences [1, 2].

The major repetitive sequences in all eukaryotic cells are classified into five types according to the sequence length. In this classification, the microsatellite sequences (Short Tandem Repeats: STRs) are the smallest. They are characterized by a periodicity of between 2 and 4 nucleotides per unit. The second class consists of the minisatellites, with a length varying between 10 and 60 base pairs (bp). The third class is composed of the satellites, which can contain up to 100 nucleotides (100-200 base pairs) [3] [4] [5]. The retrotransposons such as SINE and LINE are part of the fourth class, which is characterized by a length varying from 50 bp to 6 kb. The final class consists of the ribosomal RNA gene repeat (rDNA), which is the longest, with a length between 9 and 45 kb.

In the human genome, rare fragile sites are chromosomal DNA regions especially characterized by repetitive sequences. In these regions, DNA damage occurs more frequently than in other locations. Due to chromosome structure, the common fragile sites can be sensitive to replication stress, and they are often rearranged in cancer. In mammalian centromeres and telomeres, the presence of repetitive sequences is necessary in order to protect chromosomes from damage. For example, alphoid DNA is a kind of satellite DNA having a length of 173 bp. This DNA is located in the middle of a chromosome and makes up the larger part of the human centromere region [6]. Moreover, telomere regions, located at the chromosome extremities, are made up of repeat sequences of 5-7 bp. These elements are called telomere repeats [7]; the repetitive sequence 'TTAGGG' is one example. Chromosome integrity is protected by telomere repeats [8, 9]: telomeres hinder the fusion of chromosomes and protect them against degradation by exonucleases [10].

These repetitive functional elements are not prone to becoming fragile sites because they are hidden in heterochromatin. This heterochromatin prevents, by mechanisms not yet identified, the occurrence of unusual DNA structures that lead to recombination [11].

Repetitive sequences are abundant in various genomes, from bacteria to mammals, and they cover nearly half of the human genome [5]. Finding new common repetitive sequences within and between different chromosomes and genomes is an important research theme in biology. Indeed, detecting all the repetitive sequences in DNA could help elucidate important biological phenomena. To identify repetitive sequences, different bioinformatics tools have been used [12, 13]. Their principle is based on comparing DNA consensus sequences with repeat candidates. Mreps [14], MISA [13], Sputnik [15], EMBOSS (etandem and equicktandem) [16], TRF [17] and RepeatMasker [18] are obvious examples. In the comparison step, these tools use different approaches such as regular expressions [18], Hamming distance [12], and recursive matching and penalty scores [17]. Localizing new repetitive sequences always presents technical challenges, because such repeats can create ambiguities in alignment and assembly programs [19]. In this work, we have developed a new algorithm to detect repetitive patterns that correspond to new repetitive sequences. For this purpose, we used a combination of coding, signal processing and image processing techniques. As a result, we have constructed a repetitive sequence database which we subdivided into two sub-databases. The first one contains the existing and validated repetitive sequences. The second one groups the newly detected sequences.

In this context, we call "new repetitive sequence" a sequence that was not detected by any current bioinformatics system or alignment program. In this research, we converted all of the DNA sequences into a synthetic image representation. After that, we extracted all the patterns that correspond to repeated DNA sequences. The second part of this work consists in classifying the obtained data. A deep learning model is chosen for this purpose: the convolutional neural network (CNN). This paper is divided into four sections. After the introduction, we describe the materials and methods. In Section 2, we first present the biological database that is the subject of this study. We also introduce the coding technique we used to transform the biological data into numerical data. After that, we describe how we convert the obtained signal into an image based on the wavelet analysis. Further, we introduce the CNN architecture we established for repetitive DNA classification. The final parts of this section present the detection steps and the evaluation system adopted. In Section 3, we provide and discuss the results in terms of repetitive DNA sequence detection and classification. Finally, Section 4 concludes the paper.

Two-thirds of the human genome consists of repetitive DNA sequences [20], which makes the identification and localization of these elements very important. In this section, we present a novel approach for repetitive DNA sequence identification. This method is effective in detecting dispersed or tandem repeats such as minisatellites and satellites. The detection system is composed of four main blocks. The first one extracts the human DNA sequences from an existing database. The second block codes the DNA into a numerical representation. The third block is the "Find Human Repetitive Sequences" (FHRS) method, which we propose for repetitive DNA sequence detection; it applies wavelet analysis to detect the repetitive patterns. The last block determines the repetitive sequences and establishes the repetitive DNA sequence database. Fig. 1 shows the corresponding flowchart.

The human genome (Homo sapiens) contains 22 pairs of autosomes and two chromosomes that determine human sex, X and Y, for a total of 46 chromosomes. Each human cell contains one pair of sex chromosomes: two X chromosomes in females, and one X and one Y chromosome in males. A detailed description of the human DNA material is available in the NCBI database (National Center for Biotechnology Information) [21]. The human DNA data comprise a 2.91-billion base pair (bp) consensus sequence of the euchromatic portion [22]. Given that this is a huge amount of data, we based our work only on the X and Y chromosomes. Even at the level of these two chromosomes, the mass of data is considerable. As an example, Fig. 2 gives the number of occurrences of dinucleotides in both X and Y chromosomes.

Our goal is to find repetitive DNA on these chromosomes. It is important to mention that the more complex the genome is, the more difficult it is to find new repetitive sequences within it. Therefore, the challenge addressed in this work is identifying new repetitive DNA sequences in the human X and Y chromosomes.

To visualize repetitive patterns in the human genome, the DNA sequences have to be transformed into numerical data. This transformation is called "DNA coding". In this work, we opted for a special coding technique called "Order 2 Frequency Chaos Game Signal" (FCGS2) [23, 24]. The FCGS2 coding is a statistical representation of DNA. In the proposed method, chromosomes are transformed based on the occurrence probability of the successive dinucleotide groups. This technique represents the time-frequency evolution of the dinucleotides in the chromosome. The transformation is given by eq. 1:

$$P_{2nucleotide} = \frac{N_{2nucleotide}}{Length_{Chr}} \qquad (1)$$

where $N_{2nucleotide}$ is the number of occurrences of the dinucleotide group in the whole chromosome and $Length_{Chr}$ is the chromosome's length.

In this work, we coded the entire human chromosomes X and Y. The sequence that represents chromosome X is a signal with a length of 156,040,895 bp. Chromosome Y is a signal with a length of 57,227,415 bp.
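The coding step can be illustrated with a minimal sketch. It assumes that each position of the signal takes the value of the dinucleotide frequency described above, computed as the dinucleotide count divided by the sequence length; the function name `fcgs2_signal` and the toy sequence are hypothetical, and the authors' exact normalisation may differ.

```python
from collections import Counter

def fcgs2_signal(seq):
    """Replace each dinucleotide by its frequency in the whole sequence.

    Assumption: frequency = (occurrences of the dinucleotide) / len(seq),
    following the two quantities defined in the text.
    """
    seq = seq.upper()
    # Overlapping dinucleotides: positions 0-1, 1-2, 2-3, ...
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [counts[p] / len(seq) for p in pairs]

# Toy sequence (hypothetical data): "AC" occurs 3 times in 8 nucleotides.
signal = fcgs2_signal("ACACACGT")
```

In this toy case the repeated "AC" and "CA" dinucleotides produce an alternating signal, which is the kind of periodicity the wavelet analysis is meant to reveal.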

The identification of repetitive DNA sequences is becoming increasingly important. Many algorithms, drawing on various fields of knowledge, have been implemented for repetitive sequence localization. In this context, signal processing approaches have been used to detect repetitive sequences according to their periodicity [25] [26] [27] [28] [29]. In this paper, we propose an efficient algorithm based on signal and image processing tools to localize repetitive DNA sequences. This method has the advantage of being independent of prior knowledge about the repeated sequences. This section presents the new algorithm we designed to detect the repetitive DNA sequences after transforming them into numerical signals. This algorithm is called Find Human Repetitive Sequences (FHRS). It contains three steps:

- DNA signals to DNA images transformation: the scalogram representation;
- Energy calculation of each scalogram image, which is obtained by the wavelet analysis, then retaining the images whose energy amplitude exceeds a chosen threshold (equal to 10 here);
- Finding the reference repetitive sequence in the retained image. It is the longest repeated unit in the considered DNA sequence.

The scalogram representation of a DNA sequence is an image that we obtain by wavelet analysis and encode in the RGB space (three color channels: Red, Green, and Blue). This time-frequency representation is shown to be efficient in terms of visualizing and detecting repetitive patterns. Here, the idea is to use this type of DNA image to find repetitive patterns that correspond to periodic sequences.

The motivation behind this choice is that changing a pixel in the image has no influence on the overall shape of the repetitive pattern. Indeed, even if the repetition pattern contains variations in nucleotide composition, this does not greatly affect the overall shape of the repetitive pattern at the level of the DNA image. Furthermore, our choice of this method is reinforced by its performance in characterizing different classes of transposable elements [30, 31]. For the wavelet analysis, we use the complex Morlet wavelet, which is best suited to localizing repetitive DNA in the time-frequency domain. The principle consists of applying the wavelet analysis to the signal obtained by the FCGS2 coding. This analysis decomposes a given DNA signal into a sum of basis functions called wavelets. These wavelets are derived from the mother wavelet by two operations: dilation and translation. They take into account both time and frequency variations, which allows them to capture the different hidden frequencies in the signal [32] [33] [34]. Unlike the mother wavelet ψ(t), which depends only on time, the daughter wavelet depends on a scale parameter a and a translation (time) parameter b. It is generated following this equation:

where * indicates the complex conjugate. As we have chosen a Gaussian-windowed complex sinusoid (complex Morlet) as the analysis window, the Continuous Wavelet Transform (CWT) will be written as:

Here the number of oscillations ($\omega_0$) must be greater than 5 (admissibility condition). The continuous wavelet coefficients of a DNA signal $x(t)$ form a matrix whose elements are calculated by the following formula:

The modulus of these coefficients, $|W(a,b)|$, provides the scalogram representation of the DNA sequence.
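For reference, the daughter wavelet, the complex Morlet mother wavelet and the CWT coefficients are commonly written in the following standard textbook forms. This is a sketch under the usual conventions ($a$ for scale, $b$ for translation, unit-energy normalisation), stated here as an assumption; the normalisation constants used by the authors may differ.

```latex
% a: scale (dilation), b: translation (position), \omega_0: number of oscillations
\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)

\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^{2}/2}

W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\,
         \psi^{*}\!\left(\frac{t-b}{a}\right) dt
```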

Since chromosomes X and Y are too long to be analyzed at once, we decompose $x(t)$, the corresponding FCGS2 signal, into a set of segments. Each segment $x_i(t)$ has a size of 1000 bp. After segmentation, we apply the CWT and calculate the corresponding energies. As a result, we obtain a new database of human DNA representations. In total, we count 156,041 images of the X chromosome and 57,228 images of the Y chromosome. The wavelet coefficient matrix contains the time-frequency information about a signal. To further explore this information, we calculate the scale-energy (E) of each nucleotide position, according to the following equation:

for each $i = 1, \dots, Length_{Chr}/1000$.

Here, the parameter $a$ represents the scale in the wavelet analysis; it varies from 1 to 64. The index $i$ represents the image number. By applying eq. 5, we obtain a vector that contains the energy of the DNA scalogram. Peak values higher than 10 in the vector indicate the existence of repetitive patterns in the DNA image. Fig. 3 shows an example of the FCGS2 signal, the corresponding scalogram in a 3D representation and the wavelet energy of a sequence located in chromosome X of the human genome. This sequence corresponds to the portion [342,500 bp: 344,000 bp] in the PPP2R3B gene.
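The segmentation and energy steps can be sketched as follows. This is an illustrative, unoptimised version that assumes the energy at each position is the squared scalogram modulus summed over scales, and a unit-energy complex Morlet wavelet with $\omega_0 = 5.5$ (an assumed value above 5). All function names are hypothetical, the toy signal is synthetic, and only 8 scales are used instead of 64 to keep the example fast; the threshold of 10 from the text is meaningful only under the authors' own normalisation.

```python
import cmath
import math

def n_segments(length_bp, segment_bp=1000):
    """Number of fixed-size segments covering a signal (the last may be short)."""
    return -(-length_bp // segment_bp)  # integer ceiling division

def morlet(t, w0=5.5):
    """Complex Morlet mother wavelet; w0 = 5.5 is an assumed value (> 5)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-t * t / 2)

def cwt_modulus(x, scales, w0=5.5):
    """|W(a, b)| by direct summation; rows index scale a, columns position b."""
    n = len(x)
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            acc = sum(x[t] * morlet((t - b) / a, w0).conjugate() for t in range(n))
            row.append(abs(acc) / math.sqrt(a))
        rows.append(row)
    return rows

def scale_energy(modulus):
    """Energy per position b: squared modulus summed over all scales."""
    return [sum(row[b] ** 2 for row in modulus) for b in range(len(modulus[0]))]

def has_repetitive_pattern(energy, threshold=10.0):
    """Keep a segment when its energy peak exceeds the chosen threshold."""
    return max(energy) > threshold

# Toy segment: a period-8 oscillation (synthetic, not a real FCGS2 signal).
toy = [math.sin(2 * math.pi * t / 8) for t in range(64)]
energy = scale_energy(cwt_modulus(toy, range(1, 9)))
flat_energy = scale_energy(cwt_modulus([0.0] * 64, range(1, 9)))
```

With 1000 bp segments, `n_segments` reproduces the image counts quoted above for the two chromosome lengths. A production implementation would use an FFT-based CWT rather than this direct summation.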

As we can see, the magnitude of the wavelet energy indicates the presence of periodicities in the sequence. If we consider the frequency content, we can note that the repetitive sequence is characterized by a specific frequency band. The limits of this frequency band correspond to the repetitive DNA portion in the analyzed sequence. As for the 3D representation, it contains repetitive patterns of particular shape that are related to the DNA repetitions. Following this method, we have constructed our database of repetitive DNA images. The patterned images were selected according to the wavelet energy peaks. The generated database was named "repeat-Data".

For each DNA image in the repeat-Data database, we aim to identify a DNA reference sequence that corresponds to the repetitive pattern present in the scalogram. This DNA reference sequence is the longest subsequence in terms of size and number of repetitions. After this step, we have built a database that contains the location and the repetition number of all the localized reference sequences. As we focus on detecting new repetitive sequences in the human genome, we checked whether each reference repetitive sequence is available in the public databases, that is, whether it is annotated in the DFAM and NCBI databases. If a repetitive sequence is not listed in these public databases, we added it to our new database. This new database is called "New-repeat-Data".

After collecting the new repetitive sequences using the FHRS algorithm, we move on to the step of extracting the repeat patterns using image processing tools. Fig. 4 summarizes the proposed methodology for extracting tandem repeat patterns in the DNA images. It illustrates the results obtained when we considered the "TRseq1" sequence. The sequence is 261 base pairs long; its position is 28,076,765 bp to 28,077,025 bp along the human X chromosome.

As in this example, the data we process here are the scalogram images that we previously stored in the "New-repeat-Data" database. The main goal of this part of the work is to detect and localize the repetitive patterns in the scalogram representations, which is why we based our work on a segmentation algorithm. Our method first decomposes the DNA image into its three color channels (red, green and blue) and keeps the blue one. This choice was made after testing all the color bands: the best segmentation result corresponds to the blue channel, since it has the best contrast. Then, for binarization, a simple thresholding is applied to keep only the pixels having an intensity value less than or equal to 26. Next, to keep only the region of interest, we use an edge detection technique. The Canny edge detector provides good detection and localization relative to other operators [35, 36]. It is a multi-stage algorithm that detects brightness discontinuities over a wide range of edges in images [37, 38]. The Canny operator uses two thresholds, high and low. The high threshold detects important and significant information such as lines and contours in the image. The low threshold ensures that no details are missed. The Canny edge detector is widely used to locate sharp intensity changes and to find object boundaries in an image, especially in computer vision. A pixel is classified as an edge by computing its gradient magnitude and comparing the result with that of its neighbors along the direction in which the intensity varies the most. Finally, we fill the holes in the areas of interest using morphological operators [39]. The result is an image that contains only repetitive patterns. Based on this method, we can then extract and isolate the particular regions of repetitive DNA patterns.

Table 1. Position of "Rseq1" on both X and Y chromosomes of the human genome.
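A simplified sketch of the segmentation chain described above is given below. It covers only the blue-channel thresholding (intensity less than or equal to 26) and the hole-filling step, and it omits the Canny edge detection stage entirely. The hole filling is implemented here as a border flood fill, which is one common way to fill holes and is an assumption rather than the authors' operator; the function names and the 5 × 5 toy image are hypothetical.

```python
def blue_mask(image, threshold=26):
    """Binary mask of pixels whose blue value is <= threshold.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    return [[1 if px[2] <= threshold else 0 for px in row] for row in image]

def fill_holes(mask):
    """Fill background regions that are not connected to the image border."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    # Start from every background pixel lying on the border.
    stack = [(r, c) for r in range(h) for c in range(w)
             if (r in (0, h - 1) or c in (0, w - 1)) and mask[r][c] == 0]
    while stack:
        r, c = stack.pop()
        if outside[r][c]:
            continue
        outside[r][c] = True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and mask[rr][cc] == 0 and not outside[rr][cc]:
                stack.append((rr, cc))
    # Everything not reachable from the border is pattern or filled hole.
    return [[0 if outside[r][c] else 1 for c in range(w)] for r in range(h)]

def column_extent(mask):
    """First and last column containing pattern pixels, or None if empty."""
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[0], cols[-1]) if cols else None

# Toy image: bright background with a dark ring around a bright centre pixel.
image = [[(0, 0, 200)] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        if (r, c) != (2, 2):
            image[r][c] = (0, 0, 10)

mask = blue_mask(image)
filled = fill_holes(mask)
extent = column_extent(filled)
```

Since the horizontal axis of a scalogram corresponds to position in the sequence, the column extent of a segmented pattern can be mapped back to the start and end of the repeat.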

After finding the repetitive DNA sequences in the human X and Y chromosomes (which can be tandem or scattered repeated sequences), we verified their existence in other chromosomes or even in other genomes. To achieve this goal, we used two public bioinformatics tools: BLAT [40] and DFAM [41]. For each new repetitive sequence we detected, we searched for it in the whole human genome and in all other genomes using the BLAT platform. As an example, we consider the new scattered repeated sequence "Rseq1".

Rseq1="CTTTAGAGTCTGCATTGGGCCTAGGTCTCATTGAGGACAGATAGAGAGCAGACTGTGCAAC".

It is a sequence 61 base pairs (bp) long, with a repetition number equal to 12 in the whole human genome. The corresponding positions on both X and Y chromosomes are given in the following table (Table 1).
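Counting the copies of a motif such as "Rseq1" can be sketched with an exact string search. This is only an illustration: BLAT performs an alignment that tolerates mismatches, whereas the hypothetical helper `find_all` below reports exact matches only, and the toy genome is synthetic.

```python
RSEQ1 = "CTTTAGAGTCTGCATTGGGCCTAGGTCTCATTGAGGACAGATAGAGAGCAGACTGTGCAAC"

def find_all(genome, motif):
    """0-based start positions of every exact (possibly overlapping) match."""
    hits = []
    start = genome.find(motif)
    while start != -1:
        hits.append(start)
        start = genome.find(motif, start + 1)
    return hits

def reverse_complement(seq):
    """Reverse complement, for searching the opposite strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Synthetic toy genome with two copies of the motif (hypothetical data).
toy_genome = "AAAA" + RSEQ1 + "GG" + RSEQ1
positions = find_all(toy_genome, RSEQ1)
```

Searching for `reverse_complement(RSEQ1)` as well would also report copies located on the opposite strand.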

After localizing "Rseq1" in the X and Y chromosomes, we searched for the existence of this sequence in other regions. Fig. 5 shows the result of checking the existence of "Rseq1" in other species. As we can see, "Rseq1" exists in several genomes, such as human, gorilla, chimpanzee, green monkey and bonobo.

After establishing the existence of the newly discovered repetitive sequence in other genomes, we tried to find out whether this sequence is located in genes. In particular, we searched for its existence in exonic regions or in other families of DNA. If the sequence exists in none of these DNA types, we classified it as a new repetitive DNA sequence type. In addition, we verified the uniqueness of these new sequences using our FHRS approach, by comparing the repetitive patterns in the scalogram representations.

In order to ensure that our work is as meaningful and effective as possible, we established a classification system to classify these new datasets (new repetitive DNA sequences). For this purpose, we took the scalogram representation (2D image) as input data to the system. As the classifier, we have chosen CNNs, as they are efficient for image classification (Fig. 6).

A CNN is a special type of neural network that works on data having a grid topology [42]. The CNN classification technique was developed by LeCun et al. in 1998 with the aim of recognizing handwritten characters on bank checks. The CNN is a deep learning model inspired by the visual mechanism of living organisms. It uses convolutional layers to extract features from the input data. In the CNN model, convolutional layer neurons are able to extract higher-level abstract features from the features extracted at the previous layer. CNNs have been applied with success in DNA studies [43] [44] [45] [46], breast cancer cell segmentation [47, 48], medical diagnosis [49, 50], character recognition [51] and other areas of application.

In this work, we used CNN to establish a system of new repetitive DNA sequences recognition in human X and Y chromosomes. For this, we took the RGB scalogram representations of DNA as the input of the classification system with a size of 75 × 100.

The DNA images are then passed through a stack of convolutional layers, where we used filters with a very small receptive field (3 × 3). These filters act as scanners, capturing motifs in different orientations (up/down, center, left/right). Each neuron output in a convolutional layer is the result of a convolution operation between the kernel matrix and the neuron input. Max-pooling is performed over a 2 × 2 pixel window. Each convolutional layer is followed by a second layer, a global max-pooling layer. Each max-pooling layer outputs only the maximum value of the outputs of its respective convolutional layer. This second layer acts as a sample-based discretization process whose goal is to down-sample the input and reduce its dimensionality.

To transform the image into a form suitable for the Multi-Layer Perceptron, it must be flattened into a column vector. The flattened output is then fed to a feed-forward neural network.

Back-propagation was applied at every training iteration. A fully-connected layer was added to learn non-linear combinations of the high-level features (which are represented by the output of the flatten layer); this layer learns a possibly non-linear function in that space. Over a series of epochs, using the Softmax classification technique, our model is able to distinguish between dominating and certain low-level features in images, and it can classify repetitive DNA classes.
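The layer operations described above can be illustrated with a small, framework-free sketch: one 3 × 3 convolution without padding, one 2 × 2 max-pooling, flattening and a softmax. It is not the authors' network. The input orientation (75 rows by 100 columns), the single channel, the absence of padding, the kernel values and the class scores are all assumptions made for the example.

```python
import math

def conv2d_valid(img, kernel):
    """2D cross-correlation without padding (the usual CNN 'convolution')."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(img[0]) - kw + 1)]
            for r in range(len(img) - kh + 1)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(len(fmap[0]) // size)]
            for r in range(len(fmap) // size)]

def softmax(scores):
    """Turn class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Synthetic single-channel 75 x 100 image and a vertical-edge kernel.
image = [[(r * c) % 7 for c in range(100)] for r in range(75)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

fmap = conv2d_valid(image, kernel)         # 73 x 98 feature map
pooled = max_pool(fmap)                    # 36 x 49 after 2 x 2 pooling
flat = [v for row in pooled for v in row]  # column vector fed to the dense layers
probs = softmax([1.0, 2.0, 0.5])           # hypothetical class scores
```

The shapes show how each stage reduces dimensionality before the flattened vector reaches the fully-connected layer.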


Only sex chromosomes provide opportunities to understand the mechanisms of evolution from one species to another. These mechanisms can depend on the accumulation of repetitive sequences [2]. In this work, we first applied the FHRS technique to detect new repetitive sequences within the human sex chromosomes (X and Y). After that, we fed these sequences to a CNN-based classification system in order to recognize them.

In this work, we used the FHRS approach (Find Human Repetitive Sequences), which combines wavelet analysis and a specific coding technique, to represent repetitive patterns in the form of an image. This method has the advantage of identifying new repetitive sequences without using any prior knowledge about the input DNA sequence. Based on this, we have discovered various new repetitive DNA sequences within the sex chromosomes, both tandem and interspersed. After that, we looked for the existence of these sequences in all the human chromosomes or in other genomes. We then checked whether these sequences exist in genes. Finally, we classified these repetitive sequences in terms of their location relative to heterochromatin, telomere, and centromere.

As a result, we have constructed a database comprising two subdatabases. The first one contains newly discovered repetitive sequences of type satellites and minisatellites. The second one encloses existing repetitive sequences.

Here, the new repetitive sequences database provides the composition of the new highly repetitive DNA sequences and the correspondent locations. The repetitive sequences are of different sizes and are classified into two types: tandem repeat sequences or interspersed repeat sequences. We called this new database "New-repeat-Data".

With our approach, highly conserved repetitive DNA sequences, having no annotations in the DNA library (NCBI or DFAM), have been found in the human genome.

In the telomeres of the X and Y chromosomes, we have found highly repetitive sequences, both short and long. The sequence "Rseq2" (Rseq2=CTTTAGAGTCTG) is an example of a short minisatellite of 12 base pairs. Its repetition number is 312, extending from 26,304 bp to 249,544 bp. In addition, the sequence (CCCTAA)n, which is annotated in the NCBI database, has been well localized using our algorithm. As a long repetitive minisatellite sequence, we have discovered a new sequence, "Rseq1", of 61 base pairs with a repetition number of 12. These repetitive sequences exist in the same location within great portions of chromosome Y. Fig. 7 shows an example of the global signature of a new telomeric repetitive sequence with a size of 71,000 bp.

Fig. 7. Telomere image signature of homologue regions corresponding to the minisatellite "Rseq2" (CTTTAGAGTCTG)n within X and Y chromosomes.
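The notation (CTTTAGAGTCTG)n denotes back-to-back copies of one unit. A minimal sketch of how the longest such run could be located by exact matching is shown below; the helper `longest_tandem_run` is hypothetical, it ignores mutated copies, and the toy sequence is synthetic.

```python
RSEQ2_UNIT = "CTTTAGAGTCTG"

def longest_tandem_run(seq, unit):
    """Return (start, copies) for the longest run of back-to-back exact copies."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    k = len(unit)
    best = (-1, 0)
    for i in range(len(seq) - k + 1):
        if not seq.startswith(unit, i):
            continue
        if i >= k and seq.startswith(unit, i - k):
            continue  # inside a run that was already counted from its first copy
        copies, j = 0, i
        while seq.startswith(unit, j):
            copies += 1
            j += k
        if copies > best[1]:
            best = (i, copies)
    return best

# Synthetic toy sequence: a run of 3 copies, a spacer, then a run of 2 copies.
toy = "GG" + RSEQ2_UNIT * 3 + "A" + RSEQ2_UNIT * 2
run = longest_tandem_run(toy, RSEQ2_UNIT)
```

Because it requires exact copies, this check is stricter than the image-based detection, which is described as tolerant to a few nucleotide changes.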

On the other hand, a highly repetitive sequence, "Rseq3" (Rseq3='TTTAAAGAT', 9 bp in size), was identified as a new repetitive sequence in the human genome. This short repetitive DNA sequence was also found in many species such as chimpanzee and bonobo, and even in the SARS-CoV-2 (COVID-19) coronavirus genome, with a repetition number of 2. Table 2 shows the location of this microsatellite in some chromosomes of the human genome.

Other sequences are found to be very highly repetitive in the human genome, such as the sequence "Rseq4" (Rseq4='GTATACA'), which appears 1375 times in the X chromosome. This sequence also exists in the COVID-19 coronavirus genome.

Furthermore, we have found a new minisatellite with a size of 61 bp in human. Using the BLAT algorithm, this sequence was also found in the X chromosome of gorilla (gorGor4), at position 15,499 bp to 15,559 bp. Fig. 8 shows the method adopted to localize this repetitive sequence in other regions. Fig. 8 is divided into two result blocks. In the first one, we show the scalogram corresponding to the new repetitive DNA sequence. The second one contains the result of locating the sequence in all the other genomes using the BLAT algorithm.

In the first result block, we provide the scalogram representation of the DNA sequence we have located on the X chromosome of the human genome (Xp22.33, position: [321,001 bp: 322,000 bp]). The scalogram representation makes it possible to see all the specific repetitive patterns. After that, we extracted the reference sequence, which is the most repeated sequence of maximum size in the DNA sequence. We then found two new repetitive sequences that were not referenced by the current bioinformatics systems or sequence alignment programs. The locations of these two new repetitive sequences in both X and Y chromosomes are given in Table 3. The repetitive patterns in the scalograms show the presence of two microsatellites: Rseq5, whose size is 61 bp, and Rseq6, whose size is 28 bp.

These sequences are:

After localizing these two repetitive DNA sequences (Rseq5 and Rseq6), we used the BLAT alignment tool to see whether these sequences have other locations in the other human chromosomes or in other genomes. Indeed, the repetitive sequences that migrate to different regions of the genome are of great importance, and they have been classified as conserved mobile DNA sequences. Their importance is even higher if these conserved regions are localized in genes.

As a result, we have found the Rep2 sequence at position 321,267 bp to 321,447 bp in the intronic region of the non-protein coding RNA 685 (LINC00685) gene, in both X and Y chromosomes [52].

In sub-figure b of Fig. 8 (second result), we show that the new repetitive sequence Rep2 is located not only within other chromosomes (1, 5, 15, X and Y) of the human genome, but also in other genomes such as chimpanzee and bonobo. The results in Table 4 show that Rep2 is located in intronic regions of different chromosomes of the human genome: 1, 5, 15, X and Y.

In fact, the sequence "Rseq6" presents a special intronic conserved region located not only in different chromosomes but also in different genomes. The Rseq6 sequence, which has a size of 29 bp, has been localized in two genes of the chimpanzee genome. It is located at the po-

In addition, we present another example of a special new repetitive sequence, "Rseq7", which has been found using our approach. Fig. 9 shows the time-frequency representation of the LOC652608 gene, which has a size of 2532 bp. The gene is found at position 1172583-1175114 bp in the X chromosome of the human genome. This pseudo-gene is a 60S ribosomal protein L6-like. The DNA image shown in Fig. 9 demonstrates three exonic regions and two intronic regions. We can clearly see that the second intronic region is composed of a specific tandem sequence which we called "Rseq7". The corresponding modified version has the same size as "Rseq7", which is equal to 208 bp.

Fig. 9. LOC652608 gene in the X chromosome contains a tandem repeat sequence: Rseq7 starts in the intronic region (Intron 2) and extends into the exonic region (Exon 3).

Fig. 10. Two examples of conserved intronic repetitive sequences (satellites) and noncoding sequence located in coding region such as senescence [53].

This particular repetitive sequence starts in the intronic zone (Intron 2) and extends into and beyond the exonic zone (Exon 3), with a modification of 11 nucleotides.

Intron 2 is a noncoding sequence (208 bp) which is composed of multiple repetitions of "Rseq7".
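As a rough illustration of how a tandem repeat such as the one in Intron 2 appears in a time-frequency image, the sketch below builds a scalogram from a DNA string. It is a minimal sketch under stated assumptions, not the authors' implementation: the binary indicator coding, the complex Morlet wavelet with `w0 = 6`, and the function names `indicator_signal` and `morlet_scalogram` are illustrative choices that do not come from this section.

```python
import numpy as np

def indicator_signal(seq, base="G"):
    """Assumed coding (illustrative only): 1.0 where `base` occurs, else 0.0."""
    return np.array([1.0 if b == base else 0.0 for b in seq.upper()])

def morlet_scalogram(signal, periods, w0=6.0):
    """Magnitude of a complex Morlet wavelet transform.

    Returns an array of shape (len(periods), len(signal)); row k is tuned
    to a repetition period of periods[k] base pairs.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()          # remove the constant component
    n = len(signal)
    rows = []
    for period in periods:
        scale = w0 * period / (2.0 * np.pi)  # scale whose centre period is `period`
        half = int(np.ceil(4.0 * scale))     # truncate the Gaussian at 4 sigma
        t = np.arange(-half, half + 1)
        wavelet = (np.exp(1j * w0 * t / scale)
                   * np.exp(-0.5 * (t / scale) ** 2) / np.sqrt(scale))
        kernel = np.conj(wavelet)[::-1]      # correlation written as convolution
        full = np.convolve(signal, kernel, mode="full")
        rows.append(np.abs(full[half:half + n]))  # keep the part aligned with the signal
    return np.array(rows)

# Toy input (hypothetical): a 10 bp motif repeated 30 times.
toy = "GAAAAAAAAA" * 30
periods = list(range(7, 16))
scalogram = morlet_scalogram(indicator_signal(toy, "G"), periods)
```

In such an image, a tandem repeat of period p bp is expected to show up as a band of high magnitude along the row whose period is close to p, over the positions covered by the repeat.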

Rseq7='TGATGGTTTTCCTGAAGCAGCTGGCTAGTGGCTTGTTACTCGTAACTGGACCTCTGGTCCTCAATCGAGTCCCTCCACGAAGAACGCACCAGAAATTTGTCATTGCCACCTCAACCAAAATCGGTATCAGCAATGTAAAAATCTCAAAACATCTTAGTGATGCTGACTTGAAGAAGAAGAAGCTGTGGAAGCCCAGACACCAGGAGAG'.

Then, we searched for this new tandem repeat "Rseq7" in the other chromosomes. As a result, we found that this sequence exists in 7 chromosomes with some nucleotide modifications. Moreover, we located this modified intronic sequence in gene regions of other chromosomes of the human genome. Fig. 10 shows two reference sequences and the modified version. The first exonic sequence example corresponds to the LOC652608 gene, which is located in the X chromosome (Fig. 10a). The second exonic sequence corresponds to the RPL6P22 gene, which is located in chromosome 7 (Fig. 10b).

For these two examples, the intronic sequence "Rseq7" and the exonic sequence differ by 11 nucleotides, but at different locations.
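The comparison between an intronic reference and its modified exonic version can be pictured as a position-by-position mismatch count between two sequences of equal length. The sketch below is only an illustration under that assumption: the helper name `mismatch_positions` and the toy sequences are hypothetical, and the section does not state how the authors performed the comparison.

```python
def mismatch_positions(reference, variant):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(reference, variant)) if a != b]

# Toy sequences (hypothetical, not taken from the paper).
reference = "ACGTACGT"
variant = "ACGAACGC"
positions = mismatch_positions(reference, variant)
print(len(positions), positions)
```

Two variants can share the same number of mismatches while having different position lists, which is the situation described for the two examples above.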

On the other hand, we chose to use image processing techniques to extract the repetitive sequences. The idea consists in segmenting the scalogram image in order to extract the repetitive patterns. For this purpose, we developed a new segmentation algorithm applied to the DNA scalograms. Fig. 11 illustrates the results obtained by our segmentation algorithm with a threshold value equal to 26. It shows the location of the "Rseq7" repetitive sequences and the corresponding modified versions. Here, we can see in the first subfigure (scalogram) that the repetitive pattern is located at 1173583-1175114 bp in the X chromosome of the human genome. The second subfigure presents the segmented image. The repetitive patterns correspond to the repetitive sequences which start in intronic sequences and end in an exonic region, with some nucleotide modifications (11 nucleotides) at the beginning and at the end (Fig. 11).

Fig. 11. Example of DNA image segmentation, from which we obtain the beginning and the end of the repetitive patterns located in the intronic region (Intron 2), and the corresponding modified sequences (especially in the exonic region) with the modification region.

Table 5. Location of the repetitive intronic satellite sequence "Rseq7" and the corresponding exonic modified sequences in different chromosomes of the human genome.
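The idea of thresholding a scalogram to recover the beginning and the end of each repetitive pattern can be sketched as follows. This is a simplified stand-in, not the authors' segmentation algorithm, whose details are not given in this section: it assumes the scalogram is a 2D array of intensities (rows are scales, columns are sequence positions), marks a column as active when any value reaches the threshold, and reports contiguous active runs. The function name `segment_runs` and the `min_length` parameter are hypothetical.

```python
import numpy as np

def segment_runs(scalogram, threshold=26, min_length=1):
    """Return (start, end) column ranges, end exclusive, of contiguous
    columns in which at least one value is >= threshold."""
    mask = np.asarray(scalogram) >= threshold
    active = mask.any(axis=0)
    # Pad with False so that runs touching the borders are also closed.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]

# Toy image (hypothetical values): two bright regions on a dark background.
image = np.zeros((3, 10))
image[1, 2:5] = 30
image[0, 7:9] = 40
print(segment_runs(image))
```

Adding the genomic offset of the analysed window to each returned range would turn the column indices into chromosome coordinates.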

After localizing the repetitive sequences, we checked whether these sequences are located in other regions of the human genome and even in the genomes of other species. Table 5 shows the location of the repetitive sequence "Rseq7" and its modified repetitive sequences in different gene regions of different chromosomes in the human genome. We can note that this new repetitive sequence characterizes a ribosomal protein (RP) region in the human genome. The ribosomal RNA gene repeat (rDNA) is the largest repetitive region in the eukaryotic genome. Genome stability depends on the stability of the rDNA, which in turn affects cellular functions.

The next example, in Fig. 12, shows highly repetitive patterns in the X chromosome at position 2277000-2282500 bp (Xp22.33 region) in the human genome. This region contains tandem repeat sequences and interspersed repeat sequences. In addition, the localization results show that these specific patterns are located in the intronic region of the DHRSX gene (2,219,506 bp to 2,500,974 bp) in the X chromosome and even in other genes located in other chromosomes.

The DHRSX gene is a new gene, discovered in 2014, located at Xp22.33 and Yp11.2 in the human genome. It has been shown that the protein encoded by this gene is implicated in the positive regulation of starvation-induced autophagy [54].

The scalogram in Fig. 12 indicates the presence of repetitive patterns in intronic regions. The reference sequence corresponding to the tandem repeat sequence "Rseq8" has a size of 89 bp and a repetition number of 14. Other repetitive sequences are localized in these intronic regions:

- "Rseq9", with a size of 42 bp and a repetition number of 26
- "Rseq10", with a size of 19 bp and a repetition number of 63
- "Rseq11", with a size of 6 bp and a repetition number of 123.

All these repetitive sequences are of the minisatellite type. In the NCBI database, these regions are defined as low-complexity G-rich repeats, and no further information is given.

• Rseq8="AGGGAGAGAGAGGGAGGGCAAACGAGAGGGAGAGAGAAGGAGGAGGAGGAAATGGGGGAAAGAGAGAGAAAGAGAGATGGAGAGGGAAC"
• Rseq9="AGAGAGATGGAGAGGGAACAGGGAGAGAGAGGGAGGGCAAAC"
• Rseq10="AGAGAGATGGAGAGGGAAC"
• Rseq11="AGAGAGAA"

These repetitive sequences are also located at the same position, in the intronic region of the DHRSX gene, in the Y chromosome of the human genome. Table 6 details the location of the new repetitive sequence "Rseq8" inside the X and Y chromosomes.

Furthermore, this repetitive sequence is located inside the intronic region of the DHRSX gene in both tandem repeat and dispersed repeat forms. Table 7 provides the locations of "Rseq9" in the X chromosome of other genomes.

Fig. 12. Scalogram corresponding to a DNA sequence in the X chromosome that contains repetitive sequences in an intronic region.

Table 6. Location of the intronic repetitive sequence "Rseq8" in the X and Y chromosomes of the human genome.

Table 7. Position of "Rseq9" in the X chromosome of other genomes.

Fig. 13 shows an example of another tandem repeat pattern found in the X chromosome at position 27210460-27211308 bp in the human genome.

• Rseq12="ATATATGATATATACTATATATGTCATATATACATATACAC"

The short repetitive sequence "TACATA" (6 bp) appears 22 times in this DNA sequence and 69,710 times in the X chromosome.
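Counting how often a short motif such as "TACATA" occurs in a sequence can be sketched as below. This is an illustrative helper, not the tool used by the authors; the name `count_occurrences` is hypothetical, and the choice to count overlapping matches is an assumption, since the section does not say whether overlaps were counted.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of `motif` in `sequence`, overlapping matches included."""
    if not motif:
        raise ValueError("motif must not be empty")
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)  # step by one to allow overlaps
    return count

# Toy sequence (hypothetical): two copies of the motif sharing two bases.
print(count_occurrences("TACATACATA", "TACATA"))
```

Applied to a whole chromosome string, the same loop gives a chromosome-wide count; whether it matches a published figure depends on the assembly version and on the overlap convention.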

After searching for the existence of this tandem repeat sequence

Using our algorithm, we successfully found 9 repetitions of another new short repetitive sequence in the form of a tandem repeat (TR). We called this sequence of 29 base pairs "Rseq13".

- Rseq13="CTGTATAACCTAAATAATATAGGTTATAT"

Fig. 14 shows the scalogram of a new repetitive DNA sequence built from "Rseq13". The sequence has a size of 261 bp and is located at 28076765-28077025 bp in the X chromosome. It is a tandem repeat sequence made of patterns of 29 bp: "Rseq13". The NCBI and Dfam databases do not indicate the existence of such a repetitive sequence ("Rseq13"). With our approach, we succeeded in detecting this tandem repeat without any prior knowledge of its existence.
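The relation between the 29 bp motif, its 9 repetitions and the 261 bp region can be checked with a small exact-match counter. The sketch below is only an illustration: it searches for back-to-back exact copies of a known motif, whereas the approach described in this paper works on scalogram images and needs no prior knowledge of the motif. The function name `longest_tandem_run` and the flanking bases in the toy region are hypothetical.

```python
def longest_tandem_run(sequence, motif):
    """Return (start, copies) for the longest run of back-to-back exact
    copies of `motif` in `sequence`; (-1, 0) if the motif is absent."""
    if not motif:
        raise ValueError("motif must not be empty")
    best_start, best_copies = -1, 0
    m = len(motif)
    for i in range(len(sequence) - m + 1):
        copies = 0
        while sequence.startswith(motif, i + copies * m):
            copies += 1
        if copies > best_copies:
            best_start, best_copies = i, copies
    return best_start, best_copies

rseq13 = "CTGTATAACCTAAATAATATAGGTTATAT"
# Toy region: nine exact copies with hypothetical flanking bases.
region = "ACGT" + rseq13 * 9 + "TTGA"
start, copies = longest_tandem_run(region, rseq13)
print(start, copies, copies * len(rseq13))
```

An exact-match counter like this one stops at the first substitution, insertion or deletion, which is the limitation of classical tools that the image-based approach aims to avoid.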

The repetitive sequence "Rseq13" is located not only in the X chromosome of the human genome but also in other genomes, such as the X chromosome of bonobo (at [28,

Fig. 15 shows the scalogram of a new DNA sequence, "TRseq2", with a size of 261 bp. The sequence is positioned at 156029111-156029371 bp in the X chromosome. As we can see, the scalogram contains a repetitive pattern corresponding to a tandem repeat sequence: "Rseq14". This subsequence ("TCTCTGCGCCTGCGCCGGCGCGGCGCGCC") has a size of 29 bp and a repetition number of 9.

Rseq14 is not annotated as a tandem repeat in the NCBI or Dfam databases, but it is defined as a TAR1 of the telomeric satellite family [55].

In Table 8, we provide the localization results of "Rseq14" in the whole human genome and in other genomes.

Fig. 16 shows the scalogram of another new DNA sequence, "TRseq3", with a size of 500 bp, extending from 2845001 bp to 2845500 bp in the X chromosome of the human genome. The sequence contains a tandem repeat sequence, "Rseq15" (CGTGTGTATGTATATTTATATACA), whose size is 24 bp and whose repetition number is equal to 18. This sequence is not annotated as a tandem repeat sequence in the NCBI database or in the Dfam database.

Our "New-repeat-Data" database of all newly discovered repetitive sequences is presented in the "Supplementary Material" file. To conclude, we succeeded in implementing an efficient algorithm for repetitive sequence detection. The sequences we detected are of two types: satellites and minisatellites. Moreover, we obtained better results than those of the bioinformatics tools. The main advantage of this work is its independence from any prior knowledge about the searched repeat.

Fig. 15. Scalogram image corresponding to the DNA sequence "TRseq2", confirming the existence of the "Rseq14" tandem repeat sequence (TCTCTGCGCCTGCGCCGGCGCGGCGCGCC)n, annotated in [45] as a minisatellite sequence, whose repetition number is equal to 9.

In this section, we present the results of using a CNN model to classify the DNA scalograms obtained in the first part of this work. Our goal is to identify the different classes of the new repetitive sequences that we discovered and stored in the "New-repeat-Data" database. As data, we randomly took 200 non-repetitive sequences (NonRep) and 780 repetitive sequences (Rep). The repetitive sequences are divided into 4 classes depending on their repetitive pattern length (Table 9): Rep1 (size >100), Rep2 (size between 60 and 100), Rep3 (size between 30 and 60) and Rep4 (size <30). Overall, our database contains five classes: four contain scalograms of repetitive sequences and one contains scalograms without repetitive sequences. For classification, the whole dataset (980 scalogram images) was split into 80% for training (784 images) and 20% for testing (196 images). With such a classification system, we can discover images that contain similar repetitive patterns. We can also differentiate these images from others that do not contain repetitions.
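The class definition and the 80/20 split described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the handling of the boundary values (30, 60 and 100 bp) is an assumption because the text leaves them ambiguous, and the plain random shuffle is one possible choice since the section does not say whether the split was stratified by class.

```python
import random

def length_class(pattern_length):
    """Map a repetitive pattern length (bp) to a class label.
    Boundary handling is assumed: 100 and 60 go to Rep2, 30 goes to Rep3."""
    if pattern_length > 100:
        return "Rep1"
    if pattern_length >= 60:
        return "Rep2"
    if pattern_length >= 30:
        return "Rep3"
    return "Rep4"

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 980 placeholder image identifiers (200 NonRep + 780 Rep in the paper).
train, test = train_test_split(range(980))
print(len(train), len(test))
print([length_class(n) for n in (208, 89, 42, 29)])
```

With 980 images and a test fraction of 0.2, the split sizes match the 784 training and 196 test images reported above.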

Fig. 17 presents the classification results of the four repetitive DNA classes (images with repetitive patterns) against one class of non-repetitive DNA (images with no repetitive patterns).

Fig. 16. Scalogram image corresponding to the DNA sequence "TRseq3", which contains "Rseq15" as a tandem repeat motif.

Table 9. Description of the input data to the CNN classification system. (Columns: repetitive pattern with size X; number.)

With the CNN model, we distinguished different specific types of DNA images. The score ranges from 89% to 100%, with an average score of 94.4%.

The confusion matrix of the classification rates confirms that our system is efficient in distinguishing between small repetitive patterns (Rep4) and non-repetitive DNA sequences (NonRep), with a score equal to 100%. This result is expected, since the scalogram images of these two classes are very different.

Table 10 reports the three evaluation measures that we used to assess our classification system: precision, recall and F1-score. Overall, our system gives good results in recognizing the four new repetitive DNA sequence classes, with an average of 95% in precision, recall and F1-score.
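For reference, the three measures can be derived from a confusion matrix as sketched below. The helper name `per_class_metrics` and the 2x2 matrix are hypothetical and serve only to show the arithmetic; they are not the values of Table 10. The sketch assumes that rows are true classes and columns are predicted classes.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall and F1-score per class from a square
    confusion matrix (rows: true class, columns: predicted class)."""
    metrics = {}
    size = len(labels)
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[r][k] for r in range(size)) - tp
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical two-class example, not the paper's results.
toy_confusion = [[8, 2],
                 [1, 9]]
for label, values in per_class_metrics(toy_confusion, ["Rep", "NonRep"]).items():
    print(label, values)
```

Averaging the per-class values over the five classes gives macro-averaged scores; the section does not state which averaging convention underlies the reported 95%.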

Improving our genetic knowledge of the human genome is a complex and continuous research process. To contribute to this process, bioinformatics and signal and image processing tools have been applied to reveal hidden spectral features of DNA sequences. Although repetitive DNA sequences occupy 40% of the human genome, the localization of these sequences remains incomplete, as it is a very difficult task.

In this paper, we proposed a new algorithm based on signal and image processing tools to extract, from DNA images, the repetitive patterns that correspond to repetitive DNA sequences. The main goal is to create a new database that contains the locations of all the newly discovered repetitive sequences. As an example of the obtained results, we found a new modified repetitive sequence that can characterize the 60S ribosomal protein: "Rseq7". Deeper studies that may give a biological interpretation of these results would therefore be welcome.

In this article, we also proposed a novel and highly effective method for DNA image prediction based on a CNN model. In our prediction system, the accuracy scores obtained over 100-fold cross-validation ranged from 89% to 100%, with an overall score of 94.4%.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

The authors declare that there is no conflict of interest and that there are no competing interests regarding the publication of this paper.

Afef Elloumi Oueslati: PhD in electrical engineering from the National Engineering School of Tunisia (ENIT). She is an Associate Professor at the National School of Engineers of Carthage (ENICarthage). Her research interests include issues related to signal and image processing applied to the biomedical and genomic fields.

Zied Lachiri: PhD in electrical engineering from the National Engineering School of Tunisia (ENIT). He is a Professor and Research Director in the Signal, Image and Information Technology laboratory (LR-SITI, ENIT). His research interests include pattern recognition, and signal and image processing in biomedical, multimedia, and man-machine communication.

The sequence of the human genome

Early stages of XY sex chromosomes differentiation in the fish Hoplias malabaricus (Characiformes, Erythrinidae) revealed by DNA repeats accumulation

Mini- and microsatellites

Repetitive DNA and next-generation sequencing: computational challenges and solutions

Characterization of human centromeric regions of specific chromosomes by means of alphoid DNA sequences

A tandemly repeated sequence at the termini of the extrachromosomal ribosomal RNA genes in tetrahymena

Maintaining the end: roles of telomere proteins in end-protection, telomere replication and length regulation

A highly conserved repetitive DNA sequence, (TTAGGG)n, present at the telomeres of human chromosomes

Structure and function of telomeres

Epigenetic regulation of heterochromatic DNA stability

Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance

Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.)

mreps: efficient and flexible detection of tandem repeats in DNA

Sputnik - DNA Microsatellite Repeat Search Utility

wEMBOSS: a web interface for EMBOSS

Tandem repeats finder: a program to analyze DNA sequences

Using RepeatMasker to identify repetitive elements in genomic sequences

Sense from sequence reads: methods for alignment and assembly

Repetitive elements may comprise over two-thirds of the human genome

The sequence of the human genome

Comparative genomic signature representations of the emerging COVID-19 coronavirus and other coronaviruses: high identity and possible recombination between Bat and Pangolin coronaviruses

The Helitron family classification using SVM based on Fourier transform features applied on an unbalanced dataset

Detection and visualization of tandem repeats in DNA sequences

Identification of short exons disunited by a short intron in eukaryotic DNA regions

Search of hidden periodicities in DNA sequences

Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation

Helitron's periodicities identification in C. Elegans based on the smoothed spectral analysis and the frequency Chaos game signal coding

A combined support vector machine-FCGS classification based on the wavelet transform for Helitrons recognition in C. elegans

Distinguishing between intragenomic helitron families using time-frequency features and random forest approaches

Decomposition of Hardy functions into square integrable wavelets of constant shape

Wavelet theory and applications: a literature study, DCT rapporten 2005

The continuous wavelet transform and variable resolution time-frequency analysis

Algorithm and technique on various edge detection: a survey

Breast cancer detection using image processing techniques

A computational approach to edge detection

Canny edge detection enhancement by scale multiplication

Morphological Image Analysis: Principles and Applications

BLAT - the BLAST-like alignment tool

Dfam: a database of repetitive DNA based on profile hidden Markov models

Gradient-based learning applied to document recognition

Bacterial classification with convolutional neural networks based on different data reduction layers

Convolutional neural network architectures for predicting DNA-protein binding

CNN-MGP: convolutional neural networks for metagenomics gene prediction

Lightweight convolutional neural network for breast Cancer classification using RNA-Seq gene expression data

Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images

Cervical cancer classification using convolutional neural networks and extreme learning machines

A convolutional neural network approach to detect congestive heart failure

An experimental study on upper limb position invariant EMG signal classification based on deep neural network

P300 based character recognition using convolutional neural network and support vector machine

Cancer specific long noncoding RNAs show differential expression patterns and competing endogenous RNA potential in hepatocellular carcinoma

Genome instability of repetitive sequence: lesson from the ribosomal RNA gene repeat

DHRSX, a novel nonclassical secretory protein associated with starvation induced autophagy

Structure and polymorphism of human telomere-associated DNA

This study was funded by the Ministry of Higher Education and Research, LR99ES10 Human Genetics Laboratory.

Supplementary material related to this article can be found, in the online version, at https://doi.org/10.1016/j.bspc.2020.102207.