key: cord-0907301-clagd6rh authors: Sen, Dwaipayan; Roy, Proyasha; Nandy, Ashesh; Basak, Subhash C.; Das, Sukhen title: Graphical representation methods: How well do they discriminate between homologous gene sequences? date: 2018-09-24 journal: Chem Phys DOI: 10.1016/j.chemphys.2018.07.031 sha: c44fb8bc94087ea0a7117236f8d9d8b5c0a17d1e doc_id: 907301 cord_uid: clagd6rh Graphical representation methods constitute a class of alignment-free techniques for comparative study of biomolecular sequences. In this brief commentary, we study how well some of these methods can discriminate among closely related genes. Graphical representations of biomolecular sequences have generated a lot of interest as a tool for alignment-free analysis, evidenced by the large number of research work on the subject. A principle application of these techniques has been in determining the evolutionary relationships analysed between different gene families [1] . The results obtained from the different graphical methods show small differences among them and display conformity with standard phylogenetic studies. Graphical representations often provide a visual clue to the pattern of distribution of bases along DNA or RNA sequences. The representations are slightly more complicated in the case of protein sequences where one has to contend with 20 basic units, the amino acids, but ingenious schemes from 2D to 20D abstract graphs have been utilised to represent them too [2] . The graphical representation methods remain, to date, among the best to 'see' the base distribution in a DNA or a RNA sequence and follow the variations in a family of genes. This has enabled many applications of the techniques, for example, generation of new plant varieties [3] , determination of origins of the SARS-coronavirus [4] , identification of conserved regions in influenza virus neuraminidase gene for vaccine design and development [5] . Applications of these methods to viral sequence analyses have been a prime area of interest in view of the need for rapid development of drugs and vaccines against viral diseases. In a number of studies, our group has advocated rational designs of peptide vaccines against Influenza [5] , Rotavirus [6] , Human Papillomavirus [7] and Zika virus [8] . In particular, we have worked out detailed differences between base distributions within the species of the Flavivirus genus and have shown that the Dengue virus which rapidly becomes endemic in a country or region has country/region-specific sequence differences, requiring customized vaccines for better efficacy [9] . Similar graphical studies of viruses have also been carried out by other researchers, including analysis of the Coronavirus by Liao et al. [4] . Such observations raise the question of which among the many graphical representation methods should be employed in order to obtain the most accurate results. An expected tendency would be to consider those methods which discriminate among sequences more clearly than others, i.e., sequence differences should lead to more pronounced differences in their descriptors. In order to address this issue, we had earlier carried out a comparative analysis of several different methods [10] and showed that results of some of the methods correlate strongly with results of other methods, implying that the sequence descriptors arising from the various approaches essentially map the same features; those that do not exhibit high level of correlation could be considered as mapping some attributes unique to those schemes. In this brief commentary, we work out the descriptors for two families of sequencesthe globin genes of Homo sapiens and the envelope genes of flavivirusesto determine precisely how much discrimination is evidenced in a selected group of methods. In our comparative study, we have used six methods based on 2dimensional (2D) analysis and one 3-dimensional (3D) system (see Ref. [10] for details). The 2D methods are primarily based on the Cartesian coordinate system. In the Nandy plot [11, 12] , the four nucleic acid bases are assigned to the four axes of a 2D Cartesian coordinate system. A given sequence is plotted based on the distribution of its bases in the corresponding direction; in computations for this note adenine (A) was assigned to the negative x-axis, cytosine (C) to the positive y-axis, guanine (G) to the positive x-axis and thymine (T) to the negative yaxis. The weighted average of the x-and y-coordinates of each point of a sequence of length N represents the centre of mass. The Euclidean distance between the origin and the centre of mass provides a quantitative graph descriptor, termed as the graph radius (g R ). A variant of this method can be seen in the Yau plot [13] in which two quadrants are used with the first quadrant addressing thymine (T) and cytosine (C), and the second quadrant denoting guanine (G) and adenine (A). However, the authors of this method did not prescribe a sequence descriptor. As a result, for the purpose of this commentary, Nandy's graph radius method is used to define the sequence descriptor of the Yau plot for comparative analysis. The 2D Randic plot [14] and Song-Tang plot [15] , employ parallel horizontal lines to denote nucleic acid bases. The key difference The coordinates, the centre of mass of all the points, and the graph radius g R of the sequence as per Nandy rectangular plot. Coordinates between the two systems is that, the Randic 2D plot consists of four lines whereas the Song-Tang plot comprises three lines. In the former, each of the four bases are assigned to each of the four lines and in the latter, the central line represents a pair of bases while the two peripheral lines each represent one of the remaining two bases. In both methods, the bases of a given sequence are plotted on the corresponding line along in the direction of the x-axis and then subsequently, joined by straight lines. The sequence descriptors are represented by the leading eigenvalues computed from M/M matrices of each graphical method. The Wang-Zhang [16] and Ji-Li TB curve [17] deviate from the aforementioned systems in that, the Wang-Zhang plot employs a binary system in which the presence and absence of particular bases are represented by 1 and 0, respectively. There are three such systems: non-A, non-C and non-G. If, for example, the non-A configuration is used, A will be denoted with 0 and the remaining bases by 1. In the Ji-Li TB curve, the given sequence is mapped into nodes which are then connected sequentially in order to obtain three curves, R-Y, M-K and W-S. The L/L matrices are used to calculate the leading eigenvalues which are taken as the descriptors of sequences for both the analyses. For a 3D graphical representation of a sequence, Randic [18] used each of the four corners of a tetrahedron to represent each of the four nucleic acid bases. The sequence descriptor is given by the leading eigenvalues calculated from the ratio of the Euclidean pairwise distance matrix D E and pairwise graph theoretic distance matrix D G of all the points on the graph. The sequence descriptors for the aforementioned 7 graphical methods and the standard deviation of the descriptors from their averages have been calculated. For the Nandy and Yau methods, graph radii have been calculated. Leading eigenvalues from M/M matrices have been determined for Randic 3D, Randic 2D and Song-Tang methods as their numerical indices. For Wang-Zhang and Ji-Li plots, L/ L matrices have been used to compute the leading eigenvalues; the L/L matrix for the Ji-Li plot is for the R-Y curve. To fix our ideas of what these various approaches to graphical representations of nucleotide sequences represent and how the sequence descriptors characteristic to each method are computed, we consider a 10-base oligonucleotide, ATGAAGGCAA, for a simple application of each method. This short sequence constitutes the first ten bases in the hemagglutinin gene of the influenza virus, A/Novosibirsk/02/ 2009(H1N1), and therefore can be taken as representative of a real-life situation. 2.1.1.1. Nandy 2D rectangular plot. In the Nandy plot [11] , the graph of our 10-base sequence drawn according to the given prescription is as shown in Fig. 2 .1. Table 2 .1 represents the coordinates, the centre of mass of all the points, and the graph radius g R , the sequence descriptor, of the sequence as described in Section 2.1 [12] . In the representation by Yau et al. [13] , the four bases being assigned in the first and second quadrant only, the coordinates of the four bases are defined as: For the purpose of this commentary, as mentioned in Section 2.1, we define a centre of mass and a graph radius as descriptors analogously to the Raychaudhury-Nandy method [12] . The graph and sequence descriptor for our sequence in the Yau et al. [13] plot are given in Fig. 2 The D G matrix in Randic 2D model. [14] proposed four horizontal equidistant parallel lines labelled as A, T, G, and C from top to bottom. To plot the nucleic acid sequence, the bases are numbered along the x-axis, and straight lines are drawn from individual points on the four parallel lines according to the bases occurring in the sequence (Fig. 2.3 ). An M/M matrix method is used to determine a descriptor for the given sequence ATGAAGGCAA. The entries of the M/M matrix are determined by dividing the Euclidean distance D E between two vertices of the zigzag curve by the graph theoretic distance D G between the two vertices. As an example, if we take the first base A and the eighth base C of our sequence, then the corresponding matrix element is obtained by taking the ratio of Euclidean distance between the A and C bases, i.e., 58 (Table 2. 3.1), to the number of edges between A and C, which is 7 here ( Keeping one of such groups, such as purine, as central line, and cytosine (C) and thymine (T) as the peripheral lines, we can plot a graph. Repeating the same process for the other groups, we get a total of 6 ( 4 C 2 ) combinations of plots. For each of them, the 2 peripheral lines can also be exchanged. In total, 12 (6 × 2) plots can be obtained. In where ED, GD and PD are the Euclidean distance matrix, graph theoretical distance matrix and path distance matrix, respectively. The leading eigenvalue in this case turns out to be 10.5856. This is L/L matrices ( Table 2 .5) are computed based on the three 2D graphs shown in Fig. 2 .5 as a typical example for characterizing the DNA sequence. Next, eigenvalues are computed from the matrices and then the descriptor: λ non-N = maxeig (maximal eigenvalue) + mineig (minimal eigenvalue), where N = A, G, C. Table 2 .5 tabulates the leading eigenvalues for our sequence ATGAAGGCAA. 2.1.1.6. Ji-Li method. Ming Ji and Chun Li [17] proposed a 2D graphical representation method where = ⋯ X X X X n 1 2 is taken as a DNA primary sequence with n bases and defined a homomorphic map [18] , each of the four bases are assigned to the corners of a tetrahedron where the points are as follows: (+1, −1, −1) for A, (−1, +1, −1) for G, (−1, −1, +1) for C and (+1, +1, +1) for T. The graph is plotted by placing the 1st base say A (for our mini sequence ATGAAGGCAA) at the corner position i.e. at (+1, −1, −1) and the next base i.e. T at (2, 0, 0). Proceeding in this manner, we obtain a plot of the sequence ATGAAGGCAA as shown in Fig. 2 .7. In this method, a D E /D G matrix (from the Euclidean pairwise distance matrix D E and a pairwise graph theoretic distance matrix D G of all the points on the graph) is constructed in which the sequence descriptor is defined as the leading eigenvalue. In the following table each matrix element is calculated as follows: Taking the (1,8) element, i.e., 1st base A to the 8th base C as an example, the Euclidean distance is calculated from the coordinates shown in Table 2 .7.1 and is divided by the minimum distance between two consecutive points which is √ 3. The division by the minimum distance is required to normalize the distance scale, so that the Euclidean distance between adjacent vertices equals 1, and not √ 3 (due to taking the side of the cube to be 1). Dividing this value by the number of edges before the 8th base, i.e., 7 here, the value obtained is =(√11/√3)/7 = 0.2736, the matrix element (Table 2.7.2). The descriptor value, i.e. the leading eigenvalue, in the Randic 3D model for this sequence is 5.6174. Our sample data consist of four members of the Homo sapiens globin family and four mosquito-borne human-infecting members of the flavivirus group (Table 2. 2.1). Among the globin genes, which are about 444 bases in length, the alpha globin being CG-rich is quite different from the beta globin group which are primarily AT-rich. The flaviviruses are of approximately 1500 bases in length and the Dengue virus is quite different in its adenine richness from the other three flaviviruses [9] . The sequence descriptors should therefore yield sufficiently different numbers to discriminate among the globin genes and between the flaviviruses, which will be compared with the computed standard deviations of the results as percentages of the corresponding descriptor average; higher the percentage, more is the discriminatory power of the method. Table 3 .1 shows the results for the flavivirus group of envelope genes and the globin group of genes. As can be seen, the two oldest of the current crop of graphical representations, viz., Nandy plot and Randic-3D plot, show the best clear discriminatory power. For example, in case of the flaviviral envelope genes, the Nandy plot descriptor ranges from 72.61 for DENV2 to 8.91 for WNV and as low as 4.51 for ZIKV; the Randic 3D plot shows descriptors ranging from 221.99 for DENV2 to 113.57 for WNV and 156.22 for ZIKV. The observation that there is a wide difference between DENV2 and, WNV and ZIKV which reflects the base distributions in their genetic composition was first mooted in a previously published paper in detail [9] . On the other hand, the same three sequences yield descriptor values, respectively, of 1485.51, 1503.43, 1495.21 in the Song-Tang representation and 1487.07, 1505.02, 1496.75 in the Randic 2D representation, showing very little distinction between DENV2, WNV and ZIKV. This is reflected Base Coordinates in the standard deviation (SD) as a percentage of the average descriptor values (SD%) for each method; while the Nandy and Randic 3D plot descriptors yield SD% 133.51 and 29.60, respectively, for the envelope genes; in case of the Song-Tang and Randic 2D methods, the SD% are 1.03 and 1.27, respectively. Higher SD% implies greater differences in the sequence descriptor values showing increased discriminatory power among related sequences. This is also borne out by the analysis of the human globin genes (Table 3 .1) where, again, the Nandy and Randic 3D plots show the best discriminatory power among the seven methods tested. Our analysis, therefore, indicates that it would be advisable to consider methods like the Nandy plot, Randic 3D plot and others of the same ilk, not analysed in this brief commentary, that have the ability to distinguish among sequence differences for analysis of DNA/RNA sequences of a family of closely similar genes. However, given a random collection of gene sequences of a variety of related and non-related sequences, the methods not meaningful in the above analysis could come into effect; for example, between the globin and flavivirus gene sets, the Randic 2D method or the Song-Tang method yield sequence descriptors that are different by at least one order of magnitude (Table 3 .1). The non-similar sequence descriptors would tend to provide distinct separate groups where similar sequences, such as the family of flavivirus genes for instance, having almost identical descriptor values in these methods would cluster together, thus providing an easy way of discriminating between divergent sequences. In this context, it is relevant to recall that in the chemical domain it has long been known that no single topological index can discriminate graphs uniquely [19] . Perhaps, it may be worthwhile to introduce a concept of some "super-descriptor" on the lines of topological "super-index" first proposed by Bonchev, Mekenyan and Trinajstic [20] or extracted principal components (PCs) derived from the collection of individual sequence descriptors [21] as some combination of descriptors from various methods for discriminatory power of inter-gene and intra-gene sequences. X Y Z A 1 −1 −1 T 2 0 0 G 1 1 −1 A 2 0 −2 A 3 −1 −3 G 2 0 −4 G 1 1 −5 C 0 0 −4 A 1 −1 −5 A 2 −2 −6 In graphical representation methods, the base composition and distribution within a nucleic acid sequence are visually projected and further numerically characterized by way of sequence descriptors which provide a quantitative description of the base pattern in the query sequence. Multiple methods are available to compute this. A comparative analysis of seven such methods -Yau method, Wang-Zhang method, Song-Tang method, Randic 3D method, Randic 2D method, Nandy method and Ji-Li methodin this brief commentary have elucidated that not all of them have the full capacity to discriminate between closely related sequences. In a retrospective study of flaviviral envelope gene and human globin genes, where one element of each sample set is known to be characteristically different from the rest, the Nandy and Randic 3D methods have yielded results that allow good discrimination between the sequences, whereas the other methods were unable to reflect similar outcome. However, on a broader scale, for classification of a set of close and distant sequences, the remaining five graphical representation methods are able to discern diverging sequences. We suggest that a kind of "super-descriptor", which is a blend of elementary descriptors, could be considered to distinguish sufficiently between inter-and intra-gene sequences. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Mathematical descriptors of DNA sequences: development and applications Graphical representation of proteins 2D random walk representation of Begonia × tuberhybrida multiallelic loci used for germplasm identification Coronavirus phylogeny based on triplets of nucleic acids bases Computational analysis and determination of a highly conserved surface exposed segment in H5N1 avian flu and H1N1 swine flu neuraminidase In silico study of rotavirus VP7 surface accessible conserved regions for antiviral drug/vaccine design Rational design of peptide vaccines against multiple types of human papillomavirus A bioinformatics approach to designing a Zika virus vaccine Comparison of base distributions in Dengue, Zika and other flavivirus envelope and NS5 genes Intercorrelation of major DNA/RNA sequence descriptors -a Preliminary Study Graphical representation and analysis of DNA sequence structure: I. methodology and application to globin genes Indexing scheme and similarity measures for macromolecular sequences DNA sequence representation without degeneracy Novel 2-D graphical representation of DNA sequences and their numerical characterization A new 2-D graphical representation of DNA sequences and their numerical characterization Characterization and similarity analysis of DNA sequences grounded on a 2-D graphical representation TB curve, a new 2D graphical representation of DNA sequence On 3-D graphical representation of DNA primary sequences and their numerical characterization Discrimination of isomeric structures using information theoretic topological indices Isomer discrimination by topological information approach Characterization of isospectral graphs using graph invariants and derived orthogonal parameters