DLPAlign: A Deep Learning based Progressive Alignment for Multiple Protein Sequences
Mengmeng Kuang (The University of Hong Kong; mmkuang@cs.hku.hk)
bioRxiv, 2020-07-17. DOI: 10.1101/2020.07.16.207951

Abstract: This paper proposes a novel and straightforward approach to improving the accuracy of progressive multiple protein sequence alignment. We trained a decision-making model based on convolutional neural networks and bi-directional long short-term memory networks, and, based on this model, we progressively aligned the input sequences by calculating different posterior probability matrices. To test the accuracy of this approach, we implemented a multiple sequence alignment tool called DLPAlign and compared its performance with eleven leading alignment methods on three empirical alignment benchmarks (BAliBASE, OXBench and SABmark). Our results show that DLPAlign obtains the best total-column scores on all three benchmarks. When evaluated on the 711 low-similarity families with average PID ≤ 30%, DLPAlign improved by about 2.8% over the second-best MSA software. Besides, we also compared the performance of DLPAlign and other alignment tools on a real-life application, namely protein secondary structure prediction for four protein sequences related to SARS-CoV-2, and DLPAlign gave the best result in all cases.

A multiple sequence alignment (MSA) can be seen as a table constructed from protein sequences with an appropriate number of gaps inserted. MSAs are used in many settings, such as recovering the evolutionary history of, or relationships between, protein or amino acid sequences, and uncovering structural or functional roles of the sequences [1]. More and more biological modeling methods rely on the assembly of precise MSAs [2]. It is therefore of considerable significance to design algorithms that assist MSA construction.

The MSA construction problem can be stated mathematically. Given $n$ sequences $S_i$, $i = 1, 2, \cdots, n$, as in Equation (1),

$$S_1 = (S_{11}, S_{12}, \cdots, S_{1m_1}),\quad S_2 = (S_{21}, S_{22}, \cdots, S_{2m_2}),\quad \ldots,\quad S_n = (S_{n1}, S_{n2}, \cdots, S_{nm_n}) \quad (1)$$

an MSA is constructed from this set $S$ of sequences by inserting an appropriate number of gaps into each of the $S_i$ sequences of $S$ until the modified sequences $S'_i$ all have the same length $l$ and no column $m$ of the modified set consists only of gaps. The mathematical form of an MSA of the above sequence set is shown in Equation (2):

$$S'_1 = (S'_{11}, S'_{12}, \cdots, S'_{1l}),\quad S'_2 = (S'_{21}, S'_{22}, \cdots, S'_{2l}),\quad \ldots,\quad S'_n = (S'_{n1}, S'_{n2}, \cdots, S'_{nl}) \quad (2)$$

For a protein family, if the number of sequences is small and the sequences are short, we may be able to align them manually; but when we face a large number of sequences or long sequences, we need computer programs. Since the early 1980s, the MSA construction problem has been tackled by the algorithm-centric approach [2]: design algorithms to solve the combinatorial optimization problem of finding the alignment with the largest sum of column scores. Many well-established algorithms have been applied, such as dynamic programming [3], divide and conquer [4], and so on.
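To make the definition of Equations (1) and (2) concrete, here is a toy example of our own (not from the paper): for $n = 3$ input sequences, one possible MSA of length $l = 5$ is

```
S1 = ACDE       S1' = A C D - E
S2 = ACDFE  ->  S2' = A C D F E
S3 = ADF        S3' = A - D F -
```

All three modified sequences have the same length 5, and no column consists only of gaps.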
Besides, over the past few decades, many alignment strategies have been proposed, such as the progressive strategy [5], non-progressive strategies, consistency-based methods [6], iterative refinement [7], and so on. The progressive strategy is one of the most mature MSA strategies, with a large body of validating research and the highest accuracy. A typical MSA method using the progressive strategy mainly includes five parts: (1) posterior probability matrix calculation, (2) distance matrix calculation, (3) "guide tree" [8] generation by clustering methods, (4) consistency transformation, and (5) refinement. In past studies, the main research focus has been on the calculation method for the posterior probability matrix, the generation method for the guide tree, and the consistency transformation method. The most popular MSA tools adopting the progressive strategy in the past 10 years are as follows: (1) ProbCons [9] uses a pair hidden Markov model (pair-HMM) to calculate the posterior probability matrix, an unweighted probabilistic consistency transformation, the unweighted pair group method with arithmetic mean (UPGMA) [10] hierarchical clustering method to generate the guide tree, and iterative refinement to construct an MSA. (2) Probalign [11], another popular, highly accurate MSA tool, uses a partition function instead of ProbCons' pair-HMM to calculate the posterior probability matrix. (3) MSAProbs [12] combines (1) and (2), using the root mean square (RMS) of the pair-HMM and partition function values as the calculation method for the posterior probability matrix, and adopts a weighted consistency transformation. (4) GLProbs [13] introduced the random HMM and adaptively uses (i) the partition function, (ii) the global pair-HMM, or (iii) the RMS of the global pair-HMM and the random HMM to calculate the posterior probability matrix, depending on the average pairwise percentage identity (PID) of each protein family. PID is the percentage of homologous positions in the pairwise alignment of two sequences. (5) PnpProbs [14] adaptively applies UPGMA or the weighted pair group method with arithmetic mean (WPGMA) [15] to generate the guide tree in its progressive branch.

Although these MSA tools achieve relatively high accuracy overall, their accuracies on a specific protein family often differ [16]. Most importantly, after so many years of research, the algorithm-centric method has not significantly improved accuracy further. For the progressive alignment strategy, protein families with low similarity have always been the most challenging [17]. This paper explores a different approach, namely the data-centric approach [18], to tackle the MSA construction problem and further improve the accuracy of MSAs, especially on low-similarity protein families. We first determined which specific part of progressive MSA methods could provide the greatest improvement in accuracy. Then we applied deep learning methods to train decision-making models that choose the most appropriate calculation method for that part.

Our contributions: After determining which specific part of progressive MSA methods to improve, we transformed the classification of MSA families into the classification of sequence pairs, and in this process we obtained large-scale training data (954,854 training pairs), as sketched below.
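A minimal sketch of this pair expansion (our illustration; the function name and the data layout are assumptions, not the paper's code):

```python
from itertools import combinations

def expand_to_pairs(families):
    """families: list of (sequences, label) tuples, where label marks the
    family's best-performing calculation method (its class).  Every one of
    the C(n, 2) = n*(n-1)/2 unordered sequence pairs inherits the family's
    label and becomes an independent training example."""
    pairs = []
    for seqs, label in families:
        for x, y in combinations(seqs, 2):  # all unordered pairs
            pairs.append((x, y, label))
    return pairs
```

Applied to the 2,766 collected families, this expansion yields the 954,854 labeled pairs mentioned above.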
We obtained the protein sequence data from several datasets: SISYPHUS [19], SABmark [20], BAliBASE [21] together with part of its extension set, BAliBASE-X, OXBench [22], HOMSTRAD [23], and Mattbench [24]. Table I summarizes these datasets. Deep learning methods were used to train decision-making models that help us choose the most appropriate calculation method for the selected part. We give more details of the decision-making models and their implementation in Sections II-B and III-B.

Further, based on the most accurate decision-making model, we built a new progressive MSA tool called DLPAlign. We compared DLPAlign with eleven popular MSA tools, including PnpProbs, QuickProbs [25], GLProbs, PicXAA [26], ProbCons, MSAProbs, MAFFT [27], MUSCLE [28], Clustal Ω [29], Probalign and T-Coffee [30], on three empirical benchmarks: SABmark, BAliBASE and OXBench. Tables V, VI and VII (in Section IV) report the alignment accuracies of the MSAs constructed by the tools for families in BAliBASE, OXBench and SABmark, respectively. DLPAlign achieved the highest TC-scores among all the tools on all benchmarks. Notably, DLPAlign performed better on the low- and medium-similarity protein families that other progressive methods handle poorly. Figures 1(a) and 1(b) compare the average TC-scores on the low-similarity (PID ≤ 30%; 711 families in total) and medium-similarity (30% < PID ≤ 60%; 352 families in total) families in BAliBASE, OXBench and SABmark. The figures show that the improvement of DLPAlign is pronounced, especially on the low-similarity families.

Fig. 1: (a) Over the 711 families in BAliBASE, OXBench and SABmark with PID ≤ 30%, DLPAlign gets the best average TC-score, 47.17, improving about 2.8% over the second best, 45.89. (b) Over the 352 families with PID between 30% and 60%, DLPAlign gets the best average TC-score, 81.21, improving about 0.8% over the second best, 80.60. (c) Over the 141 families with PID > 60%, all the MSA tools achieve an average TC-score above 96.80.

We believe this tool can be used in actual MSA tasks, so we have uploaded the source code, as well as the benchmarks used for testing, to GitHub (https://github.com/kuangmeng/DLPAlign).

In this section, we first explain how we determined which part to use to improve accuracy, then consider which data to use as the training data, and finally introduce the decision-making process of our deep learning methods.

A. How to select the best promotion part?

As mentioned in Section I, a typical progressive alignment method consists of five main parts: (1) posterior probability matrix calculation, (2) distance matrix calculation, (3) "guide tree" generation by clustering methods, (4) consistency transformation, and (5) refinement. The most studied of these are the posterior probability matrix calculation, the clustering methods for generating the guide tree, and the consistency transformation. We chose these three from Parts (1)-(5) as our candidate promotion parts and refer to the posterior probability matrix calculation as Part A, guide tree generation as Part B, and consistency transformation as Part C. For each part, we extracted several candidate options from previous studies [9], [11], [12], [13], [14], as shown below:

Options for Part A: 1) pair-HMM; 2) partition function; 3) the RMS of pair-HMM and partition function; 4) the RMS of pair-HMM, partition function and random HMM.
Options for Part B: 1) UPGMA; 2) WPGMA.
Options for Part C: 1) unweighted consistency transformation; 2) weighted consistency transformation.

When we test a single part, we use the same methods for the other parts; we used the calculation method numbered 1) in each part as the default method for that part, as sketched below.
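For illustration, the pipeline bookkeeping can be pictured as follows; this is a hypothetical sketch, with the option names and the `pipeline` helper being ours rather than anything in the released code:

```python
# Candidate options per part; option 1 is the default used whenever a
# part is not under test (mirroring the option lists above).
OPTIONS = {
    "A": ["pair_hmm", "partition_function",
          "rms_hmm_partition", "rms_hmm_partition_random"],
    "B": ["upgma", "wpgma"],
    "C": ["unweighted_consistency", "weighted_consistency"],
}

def pipeline(part, i):
    """Configuration of pipeline P_part^i: option i (1-based, as in the
    paper) for the tested part, defaults everywhere else."""
    config = {p: opts[0] for p, opts in OPTIONS.items()}
    config[part] = OPTIONS[part][i - 1]
    return config

# e.g. pipeline("A", 3) corresponds to P_A^3.
```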
For the other parts that we did not select, such as the distance matrix calculation and refinement, we adopted the same implementations as GLProbs. The critical concern is the upper bound of the various calculation methods in the different parts of the progressive alignment strategy, and in which part the maximum improvement can be made. For the $i$-th option of Part A, we implemented a pipeline $P_A^i$; we obtained pipelines $P_B^i$ and $P_C^i$ in the same way. We report the results of these pipelines in Section III-A.

In Part X of the progressive alignment strategy, choosing, for a specific protein family $F$, the method with the highest accuracy with which to construct an MSA can be expressed as the classification problem $C^{Aln}_{P_X}$, defined as follows: $C^{Aln}_{P_X}$ has $n$ classes $P_X^1, P_X^2, \ldots, P_X^n$, where $n$ is the number of options in Part X. A protein family $F$ is in class $P_X^i$, for a positive integer $i$ not greater than $n$, if the MSA constructed by the pipeline $P_X^i$ gets a better TC-score than those constructed by the others.

1) Data augmentation: In the past few years, there have been significant developments in deep learning, which has been applied in bioinformatics [31], mainly thanks to the continuous expansion of data scale. It is not sufficient to use 2,766 protein families or 29,747 sequences to train deep learning models, so we considered coupling any two sequences in the same protein family as an independent piece of data. If a family has $n$ sequences, we can get $\binom{n}{2} = \frac{n(n-1)}{2}$ sequence pairs from it. In this way, our data expanded to 954,854 pairs.

2) The structure of candidate deep learning models: Because the lengths of the sequence pairs varied, we normalized the length before choosing the neural networks. We unified all pairs to a fixed length $\alpha$ ($\alpha = 512$ in our structure). When a pair was shorter than $\alpha$, it was padded with gaps at the end up to length $\alpha$; when it was longer, only the leading fragment of length $\alpha$ was kept. Regarding each character as a single word, a one-hot word vector would be rather large, so we first used the word-embedding technique [32] to convert each word into a small vector (eight-dimensional). Even so, our input scale was still relatively large, so we applied convolutional neural networks (CNNs) [33], which have made significant breakthroughs in computer vision, as the first two layers of our models, to reduce the dimensionality of the data while retaining its characteristics. Because there are order relationships between the characters in a protein sequence pair, we added a recurrent neural network (RNN) [34] layer after the CNNs. The improved versions of the RNN, the long short-term memory network (LSTM) [35] and the gated recurrent unit network (GRU) [36], together with their bi-directional versions (BiLSTM, BiGRU), have many advantages, so they were the alternatives. Subsequently, two fully connected layers were attached; to reduce overfitting, we applied dropout to the first fully connected layer.
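A minimal PyTorch sketch of the CNN + BiLSTM candidate that was eventually selected (see Section III-B and Fig. 3): the embedding size, filter sizes, pooling size, BiLSTM hidden size and dropout rate follow the paper, while the number of convolution channels, the vocabulary size, the width of the first fully connected layer and the use of the last BiLSTM time step are our assumptions, since the paper does not state them.

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """CNN + BiLSTM classifier over a length-512 encoded sequence pair."""

    def __init__(self, vocab_size=28, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 8)      # 512 tokens -> 512 x 8
        self.conv1 = nn.Conv1d(8, 64, kernel_size=6)  # filter size 6 (channels assumed)
        self.conv2 = nn.Conv1d(64, 64, kernel_size=3) # filter size 3
        self.pool = nn.MaxPool1d(2)                   # max-pooling of size 2
        self.lstm = nn.LSTM(64, 64, batch_first=True,
                            bidirectional=True)       # hidden size 64
        self.fc1 = nn.Linear(2 * 64, 64)              # first fully connected layer
        self.drop = nn.Dropout(0.5)                   # dropout 0.5 on first FC
        self.fc2 = nn.Linear(64, num_classes)         # 4 Part A options

    def forward(self, x):                  # x: (batch, 512) integer codes
        h = self.embed(x).transpose(1, 2)  # (batch, 8, 512) for Conv1d
        h = self.pool(torch.relu(self.conv1(h)))
        h = self.pool(torch.relu(self.conv2(h)))
        h, _ = self.lstm(h.transpose(1, 2))
        h = h[:, -1, :]                    # last time step (our choice)
        h = self.drop(torch.relu(self.fc1(h)))
        return self.fc2(h)                 # logits over the 4 classes

# Usage: model = DecisionModel(); logits = model(torch.randint(0, 28, (2, 512)))
```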
This kind of neural network structure is very suitable for, and widely used in, classification tasks [37], [38], [39], [40]. Section III-B gives the implementation of the different deep learning models, as well as the training and testing processes. The final decision-making model was determined according to the accuracy of the different models.

To test the effectiveness of our methods and select the best promotion part, we implemented the pipelines $P_A$, $P_B$ and $P_C$ by adopting the different calculation methods in Parts A, B and C of the progressive alignment strategy. In the meantime, different deep learning models were trained in order to choose the best decision-making model for the best promotion part, Part X. All these efforts served the implementation of DLPAlign.

A. Find the best promotion part.

To evaluate the strengths and weaknesses of the several methods in Part A, and to what extent they could be improved, we derived four different pipelines by plugging the different posterior probability matrix calculation methods into GLProbs' code while implementing Parts B and C with their defaults, naming them $P_A^i$, $i = 1, 2, 3, 4$, corresponding to the options of Part A listed above. In the same way, we implemented $P_B^i$, $i = 1, 2$, where $i$ denotes the clustering method for guide tree generation in Part B, and $P_C^i$, $i = 1, 2$, where $i$ denotes the calculation of the consistency transformation in Part C.

To measure the accuracy of the MSAs constructed by the different pipelines, we used the total-column score (TC-score), first introduced with BAliBASE [41] and the most popular measurement in alignment benchmark tests. The TC-score is the percentage of correctly aligned columns in an alignment compared with the reference. Qscore (http://www.drive5.com/qscore) is the tool used for analyzing the quality of MSAs in this paper. We chose the well-known BAliBASE, OXBench and SABmark benchmarks as the evaluation material. Table II summarizes, for each pipeline $P_A^i$, $P_B^i$ or $P_C^i$ and each benchmark database, the average TC-scores of the alignments constructed by the pipeline for the families in the database. Table II shows that a decision mechanism that selects among the different calculation methods of Part A would obtain the largest theoretical improvement. So next, we chose the right decision-making method for pipeline $P_A$.

We divided the collected pair data into two subsets: (1) 80% was randomly selected for model training, and (2) the remaining 20% was used for final testing. During training, five-fold cross-validation was performed; this kind of validation has proved to be very efficient [42]. We also used early stopping to further reduce overfitting. Table III reports the macro averages (the unweighted mean per label) [43] of the precision, recall and f1-score of the four labels on the 20% test data. If $P_i$ is the precision and $R_i$ the recall of class $i$, where $i = 1, 2, 3$ or $4$ in this paper, the macro-average precision ($P_{macro}$) and recall ($R_{macro}$) are given by Formula (3):

$$P_{macro} = \frac{1}{N}\sum_{i=1}^{N} P_i, \qquad R_{macro} = \frac{1}{N}\sum_{i=1}^{N} R_i \quad (3)$$

where $N = 4$ is the number of categories in this paper. The macro-average f1-score ($F_{macro}$) equals the harmonic mean of $P_{macro}$ and $R_{macro}$. According to the f1-scores in Table III, we decided to use CNN + BiLSTM as our decision-making model.
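The macro averages in Table III can be reproduced with scikit-learn (a usage sketch of ours; `y_true` and `y_pred` stand in for the held-out test labels and the model's predictions). Note that scikit-learn's macro f1 averages the per-class f1-scores, which can differ slightly from the harmonic mean of $P_{macro}$ and $R_{macro}$ used in the paper:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-ins for the 20% held-out test labels and model predictions.
y_true = [1, 2, 3, 4, 1, 2, 3, 4]
y_pred = [1, 2, 3, 3, 1, 2, 4, 4]

# average="macro" is the unweighted per-label mean of Formula (3).
p_macro, r_macro, f_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 2, 3, 4], average="macro", zero_division=0)
print(p_macro, r_macro, f_macro)
```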
Although there are differences in precision or recall among the different categories, the f1-scores of all categories are generally good; they are all over 85%, indicating that our model can handle all categories well. Figure 2 further shows the confusion matrix obtained when using CNN + BiLSTM as the decision-making model to predict the four categories of pipeline $P_A$. A confusion matrix is a table often used to describe the performance of a classifier on a set of test data for which the true values are known. The structure of the model we used is shown in Figure 3.

Fig. 3: The neural network structure of our decision-making model. First, the input is transformed through the word-embedding layer into a 512 × 8 matrix. The matrix then passes through two CNN layers with filter sizes of 6 and 3, respectively (each CNN layer is followed by a max-pooling layer of size 2). Next, the output of the previous layer goes through the bi-directional LSTM layer with a hidden size of 64. Finally, two fully connected layers are attached, with a dropout of 0.5 on the first.

Given the high accuracy of our decision-making model (CNN + BiLSTM, above), we integrated it into existing progressive alignment methods to construct a new alignment tool for multiple protein sequences, which we named DLPAlign. Given a protein family $F$, DLPAlign produces a multiple sequence alignment using mainly Steps III-C1 to III-C5.

1) Decision-making for the posterior probability matrix calculation: Each pair $\langle x, y \rangle$ in $F$ is fed into the most accurate model described in the previous section to get a label (say $label_{x,y}$). These labels indicate the specific calculation method of the posterior probability matrix that should be used. Which calculation method is chosen for the protein family $F$ is determined by the dominant proportion of the labels that all of its pairs receive after passing through the decision-making model. Because we already know the percentage of each correct label of $C^{Aln}_{P_A^i}$, where $i = 1, 2, 3$ or $4$, the dominant proportion of predicted labels can be calculated using Formula (4), where $PLP_i$ is the percentage of the $i$-th predicted label, $TLP_i$ is the proportion of the $i$-th true label, and $i = 1, 2, 3$ or $4$. This process is shown in Figure 4.

Fig. 4: The process of splitting a protein family into pairs, using the decision-making model to determine the label of each pair, and finally calculating the dominant proportion of the labels to obtain the label of the protein family.

Depending on the final family category, we use (1) the pair-HMM, (2) the partition function, (3) the RMS of the pair-HMM and the partition function, or (4) the RMS of the pair-HMM, the partition function and the random HMM to accomplish the calculation of the posterior probability matrix.

2) Distance matrix calculation and guide tree generation: By finding the maximum summing path (or maximum weight trace) through the posterior probability matrix, a pairwise alignment is computed for every pair $\langle x, y \rangle$ in $F$, and the maximum sum is saved as $Prob(x, y)$ (a dynamic-programming sketch of this computation follows below). The distance between sequences $x$ and $y$ can then be measured by Equation (5), where $x.Length$ and $y.Length$ represent the lengths of sequences $x$ and $y$, respectively. A guide tree is a data structure used to determine the relationships between (1) sequence and sequence, (2) sequence and profile, and (3) profile and profile.
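The maximum summing path admits a simple dynamic program; the following sketch (ours, assuming a zero gap score as in the usual maximum-expected-accuracy recurrence) returns the value saved as $Prob(x, y)$. A real implementation would also trace back through `D` to recover the pairwise alignment itself.

```python
import numpy as np

def max_sum_path(P):
    """Maximum summing path (maximum weight trace) through a posterior
    probability matrix P of shape (len(x), len(y))."""
    n, m = P.shape
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(D[i - 1, j - 1] + P[i - 1, j - 1],  # align x_i, y_j
                          D[i - 1, j],                        # gap in y
                          D[i, j - 1])                        # gap in x
    return D[n, m]   # saved as Prob(x, y)
```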
Given two clusters $A$ and $B$, the distance between their union and another cluster $C$ can be expressed by Formula (6), which is also the defining update of UPGMA:

$$d(A \cup B, C) = \frac{|A| \cdot d(A, C) + |B| \cdot d(B, C)}{|A| + |B|} \quad (6)$$

where $|A|$, $|B|$ and $|C|$ represent the weights of clusters $A$, $B$ and $C$. According to this distance calculation formula and the computed distance matrix, we start from the pair of sequences at minimum distance and gradually build a binary tree, named the "guide tree".

3) Consistency transformation: In this step, we use the other sequences to relax the posterior probability matrix of every pair $x$ and $y$ (written $P_{x,y}$), which we calculated in Step III-C1, to determine the substitution scores for the relaxation. The relaxation process can be expressed by Equation (7):

$$P'_{x,y} = \frac{1}{|S|} \sum_{z \in S} P_{x,z} \, P_{z,y} \quad (7)$$

where $S$ stands for the set of sequences in protein family $F$, and $P'_{x,y}$ is the new transformed posterior matrix of the pair $\langle x, y \rangle$.

4) Progressive alignment: Based on the guide tree determined in Step III-C2 and the relaxed posterior probability matrices from Step III-C3, we merge two child nodes (sequences) starting from the deepest node to get a profile, and then merge the profiles up towards the root of the tree to get a complete MSA containing all sequences in $F$.

5) Refinement: The purpose of refinement is to correct errors that may have occurred in the alignments between previous sequences. In our implementation, we used an iterative refinement step that randomly divides all aligned sequences into two groups each time and then realigns them using profile-profile alignment. However, we added an accuracy judgment: each refinement was kept only if the maximum sum described in Step III-C2 was larger than before.

To determine the accuracy of DLPAlign implemented with the CNN + BiLSTM decision-making model, and to compare it with other MSA tools of high accuracy, three empirical benchmarks were selected (BAliBASE 3.0, OXBench 1.3 and SABmark 1.65), and the newest versions of eleven popular MSA tools were chosen for comparison: QuickProbs, PnpProbs, GLProbs, MSAProbs, Probalign, ProbCons, PicXAA, MAFFT, MUSCLE, Clustal Ω and T-Coffee. Of these eleven MSA tools, PicXAA adopts the non-progressive strategy, PnpProbs uses both the non-progressive and the progressive strategy, and the others use the progressive strategy. The TC-score described in Section III-A was the leading indicator in the comparison.

The alignments in BAliBASE are organized into reference sets designed to represent real multiple alignment problems. Table V shows the results for the whole BAliBASE benchmark with 386 families, together with the accuracy on its two divergent reference sets (RV11 and RV12). DLPAlign could also handle RV11, a very divergent subset, obtaining a 1.58% higher TC-score than the second-best MSA tool. Table VI shows the results of DLPAlign, as well as the other MSA tools, for OXBench. In addition to the complete set of families, the table shows the alignment accuracy for families with an average PID of less than and more than 30%. Note that OXBench does not divide the whole database into subsets; we made the division here because the two parts can respectively represent divergent families and high-similarity families. It can be seen that on both the divergent set and the high-similarity set, DLPAlign always produces some improvement, although the improvements are sometimes small.
Table VII summarizes the accuracy on the SABmark benchmark, which is divided into two subsets, Twilight Zone and Superfamilies, according to the SCOP classification. Together, these subsets cover the entire known fold space, using sequences with very low to low, and low to intermediate, similarity, respectively. DLPAlign improved on both subsets and on the whole benchmark. On Twilight Zone, no MSA tool except DLPAlign could get a TC-score of more than 25%; in this subset, DLPAlign's TC-score was 4.38% higher than that of the second-best MSA tool, PnpProbs. These results were very surprising.

It should be noted that ProbCons, Probalign, MSAProbs, GLProbs and PnpProbs all use the standard progressive alignment steps described in Section I, so there is not much difference in their running times. The running time of DLPAlign comprises mainly the running time of the decision-making model plus that of the standard progressive alignment. As Table VIII shows, the time taken by the decision-making model was affected by the benchmark size (BAliBASE was the largest benchmark, while SABmark was the smallest), but in general the time was acceptable.

Protein secondary structure prediction is a suitable application of multiple sequence alignment [44]. We picked some protein sequences related to SARS-CoV-2 from the Protein Data Bank (PDB) [45], with PDB IDs 6YI3, 6VYO, 6W9C and 6W61. We then used the following process to evaluate the performance of DLPAlign against the five other highly accurate MSA tools. Given a protein sequence S, we use Jpred 4 [46] to search for protein sequences similar to it. Then, we construct an MSA M from these sequences together with S, and use the secondary structure prediction tool provided by Jpred 4 to predict the secondary structure of S. Finally, we compare the predictions with the reference secondary structures offered by Jpred 4. Table IX summarizes the number of wrongly aligned residues for each MSA tool. The table shows that DLPAlign always produced the fewest wrongly aligned residues among the leading MSA tools.

The significant contributions of this paper are (i) using deep learning methods to train a decision-making model that determines which specific calculation method to use in the posterior probability matrix calculation of progressive alignment approaches, and (ii) releasing DLPAlign, a new progressive multiple protein sequence alignment tool based on this deep learning model. In hindsight, DLPAlign with the decision-making model achieved better accuracy on all tests and performed especially well on low-similarity families.

We would also like to point out that we have not optimized DLPAlign for efficiency. DLPAlign slows down as the number of sequences increases, because the input of the decision-making model is a sequence pair: the more sequences there are, the more sequence pairs are obtained, and thus the more times the decision-making model runs. One way to improve efficiency is to reduce the number of times the decision-making model runs. In this study, we regard each sequence pair as a sentence to classify; if we could take combinations of more than two sequences as the input of the decision-making model, the model would run fewer times, which would significantly improve the efficiency of DLPAlign. This is one direction for future improvement of DLPAlign.
Besides, if we use a protein family simulation tool such as INDELible [47] to obtain enough protein family data, we could also consider the entire protein family, or a temporary MSA built by a fast MSA tool, as the training data of the decision-making model, so that the decision-making model would run only once and efficiency could be improved further.

ACKNOWLEDGMENTS

This research was guided by Dr. Hing-fung Ting of the University of Hong Kong. We thank Dr. Ting and Dr. Bin Yan from the University of Hong Kong for many helpful discussions and suggestions.

REFERENCES

[1] Recent evolutions of multiple sequence alignment algorithms.
[2] Multiple sequence alignment modeling: methods and applications.
[3] Streaming algorithms for biological sequence alignment on GPUs.
[4] DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment.
[5] Progressive sequence alignment as a prerequisite to correct phylogenetic trees.
[6] COBALT: constraint-based alignment tool for multiple protein sequences.
[7] Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments.
[8] The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses.
[9] ProbCons: probabilistic consistency-based multiple sequence alignment.
[10] Numerical taxonomy.
[11] Probalign: multiple sequence alignment using partition function posterior probabilities.
[12] MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities.
[13] GLProbs: aligning multiple sequences adaptively.
[14] PnpProbs: a better multiple sequence alignment tool by better handling of guide trees.
[15] Numerical Taxonomy. W. H. Freeman and Co.
[16] Recent progress in multiple sequence alignment: a survey.
[17] A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives.
[18] Focus on translational bioinformatics: the coming age of data-driven medicine: translational bioinformatics' next frontier.
[19] SISYPHUS: structural alignments for proteins with non-trivial relationships.
[20] SABmark: a benchmark for sequence alignment that covers the entire known fold space.
[21] BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark.
[22] OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.
[23] HOMSTRAD: a database of protein structure alignments for homologous families.
[24] Touring protein space with Matt.
[25] QuickProbs 2: towards rapid construction of high-quality alignments of large protein families.
[26] PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences.
[27] MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
[28] MUSCLE: multiple sequence alignment with high accuracy and high throughput.
[29] Clustal Omega, accurate alignment of very large numbers of sequences.
[30] T-Coffee: a novel method for fast and accurate multiple sequence alignment.
[31] Deep learning in bioinformatics.
[32] How to generate a good word embedding.
[33] Artificial convolution neural network for medical image pattern recognition.
[34] Deep recurrent neural networks for hyperspectral image classification.
[35] Long short term memory networks for anomaly detection in time series.
[36] Gate-variants of gated recurrent unit (GRU) neural networks.
[37] Acoustic scene classification using parallel combination of LSTM and CNN.
[38] CNN- and LSTM-based claim classification in online user comments.
[39] Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification.
[40] A C-LSTM neural network for text classification.
[41] BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.
[42] A study of cross-validation and bootstrap for accuracy estimation and model selection.
[43] Macro- and micro-averaged evaluation measures.
[44] Integrating protein secondary structure prediction and multiple sequence alignment.
[45] Protein Data Bank.
[46] JPred4: a protein secondary structure prediction server.
[47] INDELible: a flexible simulator of biological sequence evolution.