key: cord-1013072-9t3kneoo
authors: Abd Elwahaab, Marwa A.; Abo-Elkhier, Mervat M.; Abo el Maaty, Moheb I.
title: A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector
date: 2019-05-08
journal: Biomed Res Int
DOI: 10.1155/2019/8702968
sha: 46e0a035e55b44ef196234e32b2c550b87db4fdc
doc_id: 1013072
cord_uid: 9t3kneoo

Similarity/dissimilarity analysis is a key way of understanding the biology of an organism because it helps identify the origin of new genes and sequences. Sequence data are grouped according to biological relationships, and the number of sequences in any group grows every day. Existing alignment-free methods demonstrate their utility by producing a similarity/dissimilarity matrix. Although this matrix is easy to read, it measures the degree of similarity only between individual pairs of sequences. In our work, a representative is introduced for each of three groups of protein sequences. Based on this group representative, a similarity/dissimilarity vector is evaluated instead of the ordinary similarity/dissimilarity matrix. The approach is applied to three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-group comparison is performed to confirm the individuality of each group. A qualitative comparison of our approach with previous articles and with the phylogenetic tree of these protein sequences supports its utility.

Sequence comparison is used to study structural and functional conservation and evolutionary relations among sequences. The importance of similarity/dissimilarity among biological sequences stems from its relationship with structure and function: proteins with similar sequences usually have similar structures. The rate at which new sequences are added to the databases is increasing exponentially [1]. Comparing these new sequences with those of known function is a key way of understanding the biology of an organism. Thus, sequence analysis can be used to assign functions to genes and proteins by studying the similarities between the compared sequences. Many tools and techniques are available for sequence comparison.

Sequence comparison methods can be classified into alignment-based methods and alignment-free methods [2, 3]. Alignment-based methods assign scores to different possible alignments and pick the alignment with the highest score. The algorithms perform either global or local alignment [4-6]. BLAST [7] and FASTA [8] are the most widely used applications. Alignment-based methods become computationally expensive when multiple sequences are aligned at the same time. A wide range of scoring systems has been proposed, such as the PAM and BLOSUM amino acid substitution matrices for protein alignment [9].

Alignment-free approaches overcome the limitations of alignment-based methods. Graphical representation approaches are one family of them. A graphical representation is usually accompanied by a numerical characterization, which yields a descriptor for each protein sequence. A similarity/dissimilarity analysis is then carried out on these descriptors by evaluating the Euclidean distance or the correlation angle between them. The smaller the Euclidean distance or correlation angle, the more similar the sequences. Many graphical representations of DNA and protein primary sequences have been proposed. Other approaches, the nongraphical representation methods, characterize protein sequences numerically without a prior graphical representation [10, 11].

In this article, an alignment-free method is introduced. It is a nongraphical representation method. Three groups of protein sequences are selected to illustrate our approach: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. They are selected because the sequences within each group have a similar range of lengths. The most common sequences of each group are selected. The selected sample is assumed to be unbiased, and the population distribution of each group is assumed to be normal. Therefore, the selected sample represents the group, and statistics can be used to estimate the population's parameters. The adjacency vector is introduced as a novel descriptor for protein sequences. It is computed for each sequence in the selected samples of the three groups. A reference vector is then computed for each group. This vector acts as the representative of the group. The degree of similarity of each sequence is measured against its group's representative vector. Thus, a similarity/dissimilarity vector is constructed instead of the ordinary similarity/dissimilarity matrix. Our approach is independent of the protein sequence length, does not require any prior graphical representation, and is mathematically simple.

Table 1: Beta globin protein sequences of seven species.
No.  Species     Accession ID  Length
1    Human       AAA16334      147
2    Chimpanzee  CAA26204      125
3    Gorilla     CAA43421      121
4    Mouse       CAA24101      147
5    Rat         CAA29887      147
6    Gallus      CAA23700      147
7    Opossum     AAA30976      147

Table 2 (surviving row): Opossum  NP_007105  602

The protein sequences used in this article are listed in Tables 1, 2, and 3. The sequences were downloaded from the National Center for Biotechnology Information (NCBI), "https://www.ncbi.nlm.nih.gov/", as FASTA files. These FASTA files were imported into Wolfram Mathematica 8, where all the results and figures were produced. The phylogenetic tree of these protein sequences was created with the Basic Local Alignment Search Tool (BLAST), "https://blast.ncbi.nlm.nih.gov/Blast.cgi". Table 1 shows the first sample set, which consists of the beta globin protein sequences of seven species. Their lengths range from 121 to 147. This sample set was used previously in [12]. Table 2 shows the second sample set, which consists of nine ND5 protein sequences. Their lengths range from 602 to 610. This sample set was used previously in [12-25]. Table 3 shows the third sample set, which consists of 29 spike protein sequences. Their lengths range from 1162 to 1447. These viruses are coronaviruses, classified into four classes. Class I includes the porcine epidemic diarrhea virus (PEDV) and the transmissible gastroenteritis virus (TGEV). Class II includes the bovine coronavirus (BCoV), human coronavirus OC43 (HCoV-OC43), and the murine hepatitis virus (MHV). Class III contains the infectious bronchitis virus (IBV). The others are severe acute respiratory syndrome coronaviruses (SARS-CoV). This sample set was used previously in [26].

In this approach, a new vector is suggested as a descriptor of a protein sequence. This vector is called the adjacency vector (A_xy); x refers to the species' protein sequence and y refers to its related group. It counts the occurrences of all possible pairwise adjacencies obtained by reading the protein primary sequence from left to right. The protein sequence is composed of the 20 common amino acids, which are "A," "R," "N," "D," "C," "Q," "E," "G," "H," "I," "L," "K," "M," "F," "P," "S," "T," "W," "Y," and "V," written in one-letter code and listed in the alphabetical order of their three-letter codes. Therefore, the adjacency vector (A_xy) consists of 400 elements. Each consecutive block of 20 elements corresponds to one amino acid as the first member of the pair: the first 20 elements relate to "A," the second 20 to "R," the third 20 to "N," and so on in the order given above. We borrow our idea from the 20 × 20 adjacency matrix [27]. The adjacency vector counts the number of times each pair occurs along the sequence. If a pair does not occur, its value in the adjacency vector is zero.

Table 4: First 20 elements of the adjacency vector of the human beta globin protein sequence.
AA  AR  AN  AD  AC  AQ  AE  AG  AH  AI  AL  AK  AM  AF  AP  AS  AT  AW  AY  AV
1   0   1   0   0   0   0   1   4   0   3   0   0   1   0   0   1   0   1   2

For example, consider the adjacency vectors of two short segments of a yeast (Saccharomyces cerevisiae) protein [16, 19, 22-24, 28]:

Protein I: "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
Protein II: "WFFESRNDPANDPIILWLNGGPGCSSFTGL"

Each of the two segments is composed of 30 amino acids. Reading from left to right, Protein I is converted to 29 adjacent pairs: WT, TF, FE, ES, SR, RN, ND, DP, PA, AK, KD, DP, PV, VI, IL, LW, WL, LN, NG, GG, GP, PG, GC, CS, SS, SL, LT, TG, GL. Protein II is likewise converted to 29 adjacent pairs: WF, FF, FE, ES, SR, RN, ND, DP, PA, AN, ND, DP, PI, II, IL, LW, WL, LN, NG, GG, GP, PG, GC, CS, SS, SF, FT, TG, GL. The "ND" pair has a count of one in Protein I and two in Protein II. The "DP" pair has a count of two in both Protein I and Protein II. The "SL" and "LT" pairs each have a count of one in Protein I and zero in Protein II.
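The counting step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Mathematica code: the names `AMINO_ACIDS`, `PAIRS`, and `adjacency_vector` are assumptions introduced here, and pairs containing letters outside the 20 common amino acids are silently ignored, which the article does not specify.

```python
from collections import Counter

# The 20 common amino acids in the order used for the adjacency vector.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# All 400 ordered pairs: AA, AR, ..., AV, RA, ..., VV.
PAIRS = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]


def adjacency_vector(sequence):
    """Count every adjacent residue pair, reading left to right."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    # Pairs that never occur get 0; non-standard pairs are dropped (assumption).
    return [counts[pair] for pair in PAIRS]


protein_1 = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
protein_2 = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"

vec_1 = adjacency_vector(protein_1)
vec_2 = adjacency_vector(protein_2)

# Print the counts of the pairs discussed in the text for both segments.
for pair in ("ND", "DP", "SL", "LT"):
    i = PAIRS.index(pair)
    print(pair, vec_1[i], vec_2[i])
```

For the two yeast segments this prints the counts quoted in the text: "ND" 1 and 2, "DP" 2 and 2, "SL" 1 and 0, "LT" 1 and 0. Each vector has 400 elements whose sum is 29, the number of adjacent pairs in a 30-residue segment.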

Our approach is applied to three selected groups of protein sequences: beta globin, ND5, and spike protein sequences, as listed in Tables 1, 2, and 3, respectively. The most common protein sequences are selected in each group. The selected sample is assumed to be unbiased, and the population distribution of each group is assumed to be normal. Therefore, the three selected samples can represent the three groups. The samples consist of seven beta globin, nine ND5, and 29 spike protein sequences.

Seven adjacency vectors for beta globin proteins, nine adjacency vectors for ND5 protein sequences, and 29 adjacency vectors for spike proteins are evaluated. For example:

(1) The first 20 elements of the adjacency vector of the human beta globin protein sequence (A_{human, beta globin}) are shown in Table 4.

(2) The last 20 elements of the adjacency vector of the gorilla ND5 protein sequence (A_{gorilla, ND5}) are shown in Table 5.

The adjacency vector describes each protein sequence individually within its corresponding group. This article also provides a descriptor for the group itself. The median vector is selected to play the role of the group representative (GR_y), where y refers to the group. It acts as a reference vector for each group. The median is chosen as the measure of central tendency because it separates the higher half of the sample's data from the lower half and, unlike the mean, is not sensitive to extreme values. The group representative vector (GR_y) is also composed of 400 elements. Each element is the median of the corresponding elements in all adjacency vectors of the sample that represents the group. The representative vectors of the beta globin, ND5, and spike protein sequences are computed. For example:
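The element-wise median can be illustrated with a short Python sketch. The function name `group_representative` and the toy four-element vectors are assumptions made for illustration only; real adjacency vectors have 400 elements, and the numbers below are not taken from the article's tables.

```python
from statistics import mean, median


def group_representative(adjacency_vectors):
    """Element-wise median across equally long adjacency vectors."""
    return [median(column) for column in zip(*adjacency_vectors)]


# Toy vectors with made-up counts standing in for 400-element vectors.
toy_vectors = [
    [1, 0, 4, 2],
    [1, 0, 3, 2],
    [2, 0, 4, 9],  # the 9 plays the role of an extreme value
]

representative = group_representative(toy_vectors)
column_means = [mean(column) for column in zip(*toy_vectors)]

print(representative)  # [1, 0, 4, 2]
print(column_means)    # the last mean is pulled up by the extreme value
```

The last element shows the point made in the text: the median stays at 2 while the mean is pulled toward the extreme value. Because the three sample sets contain odd numbers of sequences (7, 9, and 29), each median is one of the observed integer counts rather than an average of two middle values.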

(1) The first 20 elements of the beta globin representative vector (GR_{beta globin}) are shown in Table 6.

(2) The last 20 elements of the ND5 representative vector (GR_{ND5}) are shown in Table 7.

(3) The first 20 elements of the spike protein representative vector (GR_{spike proteins}) are shown in Table 8.

Table 8: First 20 elements of the spike protein representative vector (GR_{spike proteins}).
AA  AR  AN  AD  AC  AQ  AE  AG  AH  AI  AL  AK  AM  AF  AP  AS  AT  AW  AY  AV
9   1   3   5   2   4   3   6   0   7   6   2   1   3   6   4   7   1   5   3

A similarity/dissimilarity vector is introduced instead of the regular similarity/dissimilarity matrix [10, 11]. The similarity/dissimilarity matrix is a square symmetric matrix with zeros on its main diagonal. To evaluate this matrix, the degree of similarity between each protein sequence and every other sequence in the same group must be measured. If the first row represents human and the second row represents gorilla, the similarity of all species to human is measured in the first row. Then the similarity of all species to gorilla is measured in the second row, and so on. The number of calculations for this matrix equals $\sum_{K=1}^{n} (K-1) = n(n-1)/2$, where n is the number of compared species.

The similarity/dissimilarity vector is suggested to reduce the time and the number of calculations. It is a vector whose number of elements equals the number of protein sequences in the selected sample of each group. It measures the degree of similarity between each protein sequence's adjacency vector, that is, each protein's descriptor, and the group representative vector. It is simpler than the matrix described above because it is calculated only once for each sequence. The number of calculations for this vector equals n, where n is the number of compared species.
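The difference in the number of comparisons can be checked directly for the three sample sizes used in this article. The sketch below is illustrative; the function names are assumptions, and the count for the vector follows the text in not including the one-time cost of computing the group representative itself.

```python
from itertools import combinations


def matrix_comparisons(n):
    """Distinct unordered pairs among n sequences (the matrix's upper triangle)."""
    return len(list(combinations(range(n), 2)))


def vector_comparisons(n):
    """One comparison per sequence against the group representative."""
    return n


for name, n in [("beta globin", 7), ("ND5", 9), ("spike", 29)]:
    # Enumerating the pairs agrees with the closed form n(n-1)/2.
    assert matrix_comparisons(n) == n * (n - 1) // 2
    print(name, matrix_comparisons(n), vector_comparisons(n))
```

For the seven, nine, and 29 sequences this gives 21, 36, and 406 pairwise comparisons for the matrix, against 7, 9, and 29 comparisons for the vector.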

To measure the degree of similarity, we suggest two methods: 

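(ii) The 2nd Method. Compute the angle between each sequence's adjacency vector (A_xy) and the group representative vector (GR_y) in radians by

$$\theta_{xy} = \cos^{-1}\left(\frac{A_{xy} \cdot GR_{y}}{\lVert A_{xy} \rVert \, \lVert GR_{y} \rVert}\right)$$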
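The angle between two vectors is conventionally the arccosine of their normalized dot product. The following Python sketch assumes that standard definition; the names `angle_between` and `similarity_vector` and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def angle_between(u, v):
    """Angle in radians between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def similarity_vector(adjacency_vectors, representative):
    """One angle per sequence, measured against the group representative."""
    return [angle_between(vec, representative) for vec in adjacency_vectors]


# Toy 3-element vectors (made-up values, not from the article's tables).
toy_representative = [1, 0, 1]
toy_vectors = [[2, 0, 2], [1, 1, 1], [0, 3, 0]]
print(similarity_vector(toy_vectors, toy_representative))
```

A vector parallel to the representative gives an angle close to zero, and a vector sharing no nonzero element with it gives π/2. The result has one element per sequence, in the order of the input list.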

For the beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as listed in Table 1. Seven adjacency vectors correspond to them. The group representative GR_{beta globin} is evaluated from these seven adjacency vectors. Therefore, the similarity/dissimilarity vector has seven elements. The first element corresponds to human, the second element corresponds to chimpanzee, and so on, in the same order as in Table 1. In the same way, the elements for the ND5 and spike protein sequences follow the order of Tables 2 and 3, respectively. The similarity/dissimilarity vectors corresponding to the beta globin, ND5, and spike protein sequences are given in Tables 9, 10, and 11, respectively, for the two methods discussed above.

The results in Table 9 show that the magnitude, evaluated for each species x, does not measure the degree of similarity/dissimilarity well among all beta globin sequences. Human, chimpanzee, and gorilla share the same value, 0.5568, while the similarity between mouse and rat is measured well. The dissimilarity between opossum and human is also very clear. The angle successfully measures similarity/dissimilarity among all the species, as shown in Figure 1. For both the magnitude and the angle, closer values mean greater similarity. The results in Table 10 show that both the magnitude and the angle measure the degree of similarity/dissimilarity well among the ND5 protein sequences, as shown in Figure 2. It is obvious that pigmy chimpanzee, common chimpanzee, human, and gorilla are very similar. The results also show the similarity of the blue whale and fin whale, and of the mouse and rat, as pairs, and the dissimilarity between human and opossum. These results are consistent with [13, 14, 16, 18, 19, 21-25].

The results in Table 11 show that both the magnitude and the angle classify the three classes of viruses and the SARS-CoVs well, each as a single coherent class, with the sole exception of the "MHVJHM" virus. This virus belongs to Class II, but our approach does not classify it correctly. The classification of the 29 spike proteins into classes by our approach is illustrated in Figure 3. The MHVJHM virus, colored red, is the only misclassified sequence. Despite this misclassification, our approach corrects the broken classification of Class I in [26]. According to the results in Tables 9, 10, and 11, the angle is the preferred measure, as shown in Figures 1, 2, and 3.

The group representative vector (GR_y) carries the information of its group. A cross-group comparison is performed to demonstrate the individuality of each group. Tables 9, 10, and 11 are evaluated by comparing each group's sample set of protein sequences with its own group representative vector. Tables 12, 13, 14, and 15 are evaluated by comparing each group's sample set of protein sequences with another group's representative vector. The similarity/dissimilarity analysis among the seven beta globin sequences measured according to (GR_{ND5}) is given in Table 12 and shown in Figure 4. The similarity/dissimilarity analysis among the ND5 sequences measured according to (GR_{beta globin}) is given in Table 13 and shown in Figure 5. The similarity/dissimilarity analysis among the beta globin sequences measured according to (GR_{spike}) is given in Table 14 and shown in Figure 6. The similarity/dissimilarity analysis among the ND5 sequences measured according to (GR_{spike}) is given in Table 15 and shown in Figure 7. The results show a large distortion, which confirms the individuality of each group.
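The cross-group comparison can be sketched end to end on toy data. Everything below is an illustrative assumption: the two toy groups are made-up strings rather than real protein sequences, the helper names are hypothetical, and the angle is computed with the standard arccosine of the normalized dot product. The toy groups share no adjacent pair, so the separation is far sharper than it would be for real protein groups.

```python
import math
from collections import Counter
from statistics import median

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
PAIRS = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]


def adjacency_vector(sequence):
    """400 counts of adjacent residue pairs, read left to right."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    return [counts[pair] for pair in PAIRS]


def group_representative(vectors):
    """Element-wise median of a group's adjacency vectors."""
    return [median(column) for column in zip(*vectors)]


def angle_between(u, v):
    """Angle in radians; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))


# Made-up toy groups, NOT real protein data.
groups = {
    "toy_A": ["ALALAL", "ALALAA", "LALALA"],
    "toy_B": ["GSGSGS", "GSGSGG", "SGSGSG"],
}

vectors = {g: [adjacency_vector(s) for s in seqs] for g, seqs in groups.items()}
representatives = {g: group_representative(vs) for g, vs in vectors.items()}

# Compare every group's sequences with every group's representative.
angles = {
    (seq_group, rep_group): [angle_between(v, rep) for v in vs]
    for seq_group, vs in vectors.items()
    for rep_group, rep in representatives.items()
}

for (seq_group, rep_group), values in angles.items():
    label = "within-group" if seq_group == rep_group else "cross-group"
    print(seq_group, "vs GR of", rep_group, label, [round(a, 3) for a in values])
```

In this toy setting every cross-group angle equals π/2, while the within-group angles stay small, which mirrors the kind of distortion the cross-group tables are meant to reveal.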

The phylogenetic tree is a branching diagram showing the evolutionary relationships among various biological species based upon similarities and differences in their sequences. A qualitative comparison between our results and the phylogenetic tree of the protein sequences is used to assess the utility of our approach. Agreement between the results and the phylogenetic trees means agreement with the naïve measure of sequence similarity (sequence homology). The Basic Local Alignment Search Tool (BLAST) is used to draw the phylogenetic trees. The phylogenetic trees of the seven beta globin species, the nine ND5 species, and the 29 spike protein sequences are illustrated in Figures 8, 9, and 10, respectively. The qualitative comparison of the results in Tables 9, 10, and 11 with Figures 8, 9, and 10 shows the utility of our work, especially of the angle results.

The proposed method is alignment-free. An adjacency vector is suggested as a descriptor of any protein sequence. It does not require any graphical representation. A group representative vector is introduced to represent each group of protein sequences. A similarity/dissimilarity vector is produced instead of the regular similarity/dissimilarity matrix. The similarity/dissimilarity analysis is carried out by two methods. Our approach is applied to three sample sets from three groups of protein sequences, each with a different range of lengths. Our approach does not depend on the protein sequence length, and it successfully measured similarity/dissimilarity among sequences of different lengths. It is mathematically very simple. A cross-group comparison is introduced to demonstrate the individuality of each group. The results support the utility of our approach when compared with previous articles and with the phylogenetic trees obtained by the BLAST program.

We hope to extend the method to handle ambiguous amino acid residues and nonstandard amino acids. We also hope to include the analysis of partial or gapped sequences.

All data are described in Section 2 of the manuscript under the title "Dataset," where they are listed in three tables: Tables 1, 2, and 3. As stated in the first paragraph of that section, the data were downloaded from "Gene Bank." All data files have the extension ".fasta".

The authors declare that they have no conflicts of interest.

[1] DNA sequence comparison by a novel probabilistic method

[2] Linear regression model of short k-word: a similarity distance suitable for biological sequences with various lengths

[3] Sequence comparison via polar coordinates representation and curve tree

[4] A general method applicable to the search for similarities in the amino acid sequence of two proteins

[5] Identification of common molecular subsequences

[6] An improved algorithm for matching biological sequences

[7] Basic local alignment search tool

[8] Rapid and sensitive protein similarity searches

[9] Amino acid substitution matrices from protein blocks

[10] Graphical representation of proteins

[11] Similarity/dissimilarity calculation methods of DNA sequences: a survey

[12] 3-D maps and coupling numbers for protein sequences

[13] A novel descriptor for protein similarity analysis

[14] Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices

[15] A new method to analyze protein sequence similarity using dynamic time warping

[16] A 2D graphical representation of protein sequence and its numerical characterization

[17] Graphical representation and similarity analysis of protein sequences based on fractal interpolation

[18] ADLD: a novel graphical representation of protein sequences and its application

[19] Comparative analysis of protein primary sequences with graph energy

[20] UC-curve: a highly compact 2D graphical representation of protein sequences

[21] The graphical representation of protein sequences based on the physicochemical properties and its applications

[22] F-curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids

[23] A novel method of 2D graphical representation for proteins and its application

[24] 3D graphical representation of protein sequences and their statistical characterization

[25] Novel numerical characterization of protein sequences based on individual amino acid and its application

[26] Similarities/dissimilarities analysis of protein sequences based on PCA-FFT

[27] On novel representation of proteins based on amino acid adjacency matrix

[28] A sequence-segmented method applied to the similarity analysis of long protein sequence

Supplementary Materials: a figure that summarizes our approach, submitted under the name Graphical Abstract.