key: cord-0976919-vvzkpetn authors: Olyaee, Mohammad Hossein; Pirgazi, Jamshid; Khalifeh, Khosrow; Khanteymoori, Alireza title: RCOVID19: Recurrence-based SARS-CoV-2 features using chaos game representation date: 2020-08-07 journal: Data Brief DOI: 10.1016/j.dib.2020.106144 sha: 933201e67ab1e85401b32b854cb6eadfd53222c0 doc_id: 976919 cord_uid: vvzkpetn

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the COVID-19 pandemic. It was first detected in China and spread rapidly to other countries. Several thousand whole genome sequences of SARS-CoV-2 have been reported, and it is important to compare them and identify distinctive evolutionary/mutant markers. Using chaos game representation (CGR) together with recurrence quantification analysis (RQA), a powerful nonlinear analysis technique, we propose a process for extracting features from genomic sequences of SARS-CoV-2. The resulting features make it possible to compare genomic sequences of different lengths. The provided dataset comprises 18 RQA-based features for 4496 instances of SARS-CoV-2.

Subject: Genetics, Genomics and Molecular Biology
Specific subject area: Bioinformatics, Sequence analysis, Nonlinear analysis

Value of the data
• The dataset can be used by those working in bioinformatics and artificial intelligence, who can apply machine learning methods to assess genomic information.
• The dataset can be used for clustering and classification of SARS-CoV-2 genomes.
• The dataset can be used to investigate the genetic diversity of SARS-CoV-2 genomic sequences.
• The dataset contains features that make it possible to compare genomic sequences of different lengths.

A new coronavirus named SARS-CoV-2 emerged in Wuhan, China, and has spread rapidly to other provinces of China as well as to other countries.
According to the situation report of the World Health Organization (WHO), as of 4 June 2020, more than six million cases of COVID-19 had been confirmed around the world [1]. On 5 January 2020, the whole genome sequence of SARS-CoV-2 was made available, and several thousand whole genome sequences have been released since [2]. Investigating these nucleotide sequences provides insight into the evolutionary similarity of the virus to other viruses as well as into novel mutations, and it can inform the design of vaccines and drugs [3]. To this end, it is necessary to extract informative features that describe the nucleotide sequences. In this work, following the diagram in Fig. 1, several recurrence-quantification-based features are extracted from the nucleotide sequences. Recurrence quantification analysis (RQA) is a nonlinear method that yields representative features and has been used successfully to compare biological sequences of different lengths [4, 5].

In this paper, we introduce a new dataset of nonlinear features derived from genomic sequences of SARS-CoV-2. For this purpose, the nucleotide sequences of 4496 variants of the SARS-CoV-2 virus were gathered. The collected genomic sequences are publicly available from the National Center for Biotechnology Information (NCBI). As shown in Fig. 1, each nucleotide sequence is first transformed into 2D space by applying chaos game representation. This map preserves all details of the input sequence. Next, the obtained picture is decomposed into two coordinate series that contain the positions of the points in the picture. In the next step, a recurrence plot (RP) is used to reveal the recurrent properties of each coordinate series.
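The mapping and decomposition steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function name `cgr_series` and the assignment of letters to corners (A, C, G, T at (0,0), (0,1), (1,1), (1,0)) are assumptions, because the text does not state which vertex corresponds to which letter. The halfway rule, the starting point (0.5, 0.5), and the omission of codes other than A, C, G and T follow the CGR description given later in the text.

```python
# Assumed corner assignment of the unit square (not specified in the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_series(sequence):
    """Return the two coordinate series (CGRX, CGRY) of a nucleotide sequence."""
    x, y = 0.5, 0.5  # start at the centre of the square
    cgrx, cgry = [], []
    for letter in sequence.upper():
        corner = CORNERS.get(letter)
        if corner is None:
            # Codes other than A, C, G, T (for example R or Y) are omitted.
            continue
        # Move halfway from the previous point towards the matching vertex.
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        cgrx.append(x)
        cgry.append(y)
    return cgrx, cgry


if __name__ == "__main__":
    xs, ys = cgr_series("ACGT")
    for i, (px, py) in enumerate(zip(xs, ys), start=1):
        print(i, px, py)
```

Because each point depends only on the previous point and the current letter, the two returned lists have one entry per retained nucleotide, so sequences of different lengths simply produce coordinate series of different lengths.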
In the final step, recurrence quantification analysis (RQA) is applied to each extracted coordinate series to obtain 9 features, giving 18 ($2 \times 9$) features in total. The extracted features are described below. The first feature is the recurrence rate (RR):

$$ RR = \frac{1}{N^2} \sum_{i,j=1}^{N} R_{i,j} \quad (1) $$

where $N$ is the number of points and $R_{i,j}$ is the entry of the RP for points $i$ and $j$. This measure describes the density of recurrence points in the RP. The second feature (DET) describes the amount of determinism, obtained as the ratio of the number of points forming diagonal lines to the number of all recurrence points:

$$ DET = \frac{\sum_{l=l_{min}}^{N} l \, P(l)}{\sum_{l=1}^{N} l \, P(l)} \quad (2) $$

where $P(l)$ is the number of diagonal lines of length $l$. $L_{max}$ and $L$ are the next features: the length of the longest diagonal line in the RP and the mean length of the diagonal lines, respectively. ENT is the Shannon information entropy, which describes the diversity of the diagonal lines. This measure is computed as below:

$$ ENT = - \sum_{l=l_{min}}^{N} p(l) \ln p(l) \quad (3) $$

In the above relation, $l_{min}$ is the minimum length of diagonal lines in the RP. Moreover, $p(l)$ is obtained as below:

$$ p(l) = \frac{P(l)}{\sum_{l=l_{min}}^{N} P(l)} \quad (4) $$

The next feature is laminarity (LAM), which is computed as below:

$$ LAM = \frac{\sum_{v=v_{min}}^{N} v \, P(v)}{\sum_{v=1}^{N} v \, P(v)} \quad (5) $$

where $P(v)$ is the number of vertical lines of length $v$ and $v_{min}$ is the minimum length of vertical lines. Trapping time (TT) is the next measure, which equals the average length of the vertical line structures. $V_{max}$ is another feature: the maximum length of the vertical lines in the RP. Finally, the last feature is the clustering coefficient $C$, the average of the local clustering coefficients. This measure gives the probability that two neighbors of any state are also neighbors and is obtained as below:

$$ C = \frac{1}{N} \sum_{v=1}^{N} \frac{\sum_{i,j=1}^{N} A_{v,i} A_{i,j} A_{j,v}}{k_v (k_v - 1)} \quad (6) $$

In the above formula, $A_{i,j} = R_{i,j} - \delta_{i,j}$ is the RP with self-recurrences removed, and $k_v$ is the degree centrality of node $v$, which gives the number of its neighbors.

Fig. 1. Step 1: map each nucleotide sequence by chaos game representation. Step 2: extract two coordinate series. Step 3: draw a recurrence plot for each coordinate series. Step 4: extract features from the coordinate series by applying recurrence quantification analysis.

The files of the dataset are organized in two folders named "RQA" and "CGR". The first contains two Excel files named "GenomeInfo.xlsx" and "RQADataset.xlsx". The former contains, for each nucleotide sequence, the GenBank accession, strain name, sequence length, and nucleotide sequence.
The latter, "RQADataset.xlsx", is a table with 4496 rows and 18 columns that contains the extracted features for the collected instances. The "CGR" folder contains the raw information of the viruses: for each virus, a text file stores the positions of the points of its CGR.

As explained above, producing the dataset involves several steps. In this section, each step is reviewed in more detail.

Chaos game representation (CGR) is a map that transforms an input sequence into a two-dimensional space. The result of this map is a picture that reveals hidden subsequence structures [6] [7] [8]. Let $S = s_1 s_2 \ldots s_n$ be a given nucleotide sequence. Since it is composed of four kinds of letters (A, T, C, and G), the resulting map is a square $[0,1] \times [0,1]$ in which each vertex corresponds to a letter. The first point is located halfway between the center of the square, (0.5, 0.5), and the vertex corresponding to the first letter. Each subsequent letter $s_i$ is iteratively mapped to a unique point halfway between the previous point and the vertex matching $s_i$. Fig. 2 demonstrates the resulting plot for MT503004. Note that, as in current implementations, the input sequence is assumed to use a four-letter alphabet, and other codes such as R and Y are omitted. Since direct investigation of the obtained plot is a challenging task, the CGR plot is decomposed, according to the coordinates x and y of each point, into two coordinate series named CGRX and CGRY. Fig. 3 shows a part of the extracted coordinate series for the CGR plot of Fig. 2.

Recurrence is an essential feature of dynamical systems which emerges in the phase space [9]. A recurrence plot (RP) is a graphical tool that makes it possible to detect patterns of recurrence in a given time series. An RP is an $N \times N$ matrix, where $N$ is the number of points in the phase space with dimension $m$, and each entry of the matrix is $R_{i,j} \in \{0, 1\}$. When an element $R_{i,j}$ equals 1, the states at the two time points $i$ and $j$ are close. Fig. 4 illustrates the RPs for the two coordinate series shown in Fig. 3.

[1] Coronavirus disease 2019 (COVID-19): situation report
[2] A new coronavirus associated with human respiratory disease in China
[3] Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
[4] Improved protein structural class prediction based on chaos game representation
[5] Predicting protein structural classes based on complex networks and recurrence analysis
[6] Sequence analysis by iterated maps, a review
[7] Chaos game representation of gene structure
[8] Application of Chaotic Laws to Improve Haplotype Assembly Using Chaos Game Representation
[9] Recurrence plots for the analysis of complex systems

Fig. 4. The corresponding recurrence plots for the coordinate series in Fig. 3. The former is related to CGRX and the latter is the recurrence plot of CGRY.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
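As an illustrative supplement to the recurrence plot and RQA measures described earlier, the following Python sketch builds a recurrence matrix for a single coordinate series and computes three of the measures (RR, DET and LAM). It is a sketch under stated assumptions rather than the authors' code. The function names, the distance threshold `eps`, the absence of embedding (m = 1), the exclusion of the main diagonal from the diagonal-line counts, and the default minimum line lengths of 2 are all choices made here for illustration; the paper does not report these settings.

```python
def recurrence_matrix(series, eps):
    """R[i][j] = 1 when |x_i - x_j| <= eps (assumes no embedding, m = 1)."""
    n = len(series)
    return [[1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
            for i in range(n)]


def run_lengths(cells):
    """Lengths of the maximal runs of 1s in a sequence of 0/1 values."""
    runs, current = [], 0
    for cell in cells:
        if cell:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def line_ratio(lengths, minimum):
    """Share of points lying on lines of at least `minimum` length."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= minimum) / total


def rqa_measures(R, l_min=2, v_min=2):
    """Return RR, DET and LAM for a square 0/1 recurrence matrix R."""
    n = len(R)
    rr = sum(sum(row) for row in R) / (n * n)

    # Diagonal lines on every diagonal parallel to the main one.
    # The main diagonal itself is skipped (an assumption of this sketch).
    diagonals = []
    for k in range(1, n):
        diagonals += run_lengths([R[i][i + k] for i in range(n - k)])
        diagonals += run_lengths([R[i + k][i] for i in range(n - k)])

    # Vertical lines are runs of 1s within each column.
    verticals = []
    for j in range(n):
        verticals += run_lengths([R[i][j] for i in range(n)])

    return {"RR": rr,
            "DET": line_ratio(diagonals, l_min),
            "LAM": line_ratio(verticals, v_min)}


if __name__ == "__main__":
    R = recurrence_matrix([0.0, 1.0, 0.0, 1.0], eps=0.1)
    print(rqa_measures(R))
```

The matrix is built in quadratic time and memory, so this pure-Python version is only practical for short series. A coordinate series from a full genome would need a vectorized or windowed implementation.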