key: cord-0857474-wch72xks authors: Almog, Gal; Olabode, Abayomi S; Poon, Art FY title: Tuning intrinsic disorder predictors for virus proteins date: 2020-10-27 journal: bioRxiv DOI: 10.1101/2020.10.27.357954 sha: d5b7ef9c0303e2f979db715d7d52b45dd203a658 doc_id: 857474 cord_uid: wch72xks Many virus-encoded proteins have intrinsically disordered regions that lack a stable folded threedimensional structure. These disordered proteins often play important functional roles in virus replication, such as down-regulating host defense mechanisms. With the widespread availability of next-generation sequencing, the number of new virus genomes with predicted open reading frames is rapidly outpacing our capacity for directly characterizing protein structures through crystallography. Hence, computational methods for structural prediction play an important role. A large number of predictors focus on the problem of classifying residues into ordered and disordered regions, and these methods tend to be validated on a diverse training set of proteins from eukaryotes, prokaryotes and viruses. In this study, we investigate whether some predictors outperform others in the context of virus proteins. We evaluate the prediction accuracy of 21 methods, many of which are only available as web applications, on a curated set of 126 proteins encoded by viruses. Furthermore, we apply a random forest classifier to these predictor outputs. Based on cross-validation experiments, this ensemble approach confers a substantial improvement in accuracy, e.g., a mean 36% gain in Matthews correlation coefficient. Lastly, we apply the random forest predictor to SARS-CoV-2 ORF6, an accessory gene that encodes a short (61 AA) and moderately disordered protein that inhibits the host innate immune response. For almost a century, it was assumed that proteins required a properly folded and stable threedimensional or tertiary structure in order to function [1] [2] [3] . More recently, it has become evident that many proteins and protein regions are disordered, which are referred to as intrinsically disordered proteins (IDPs) and intrinsically disordered protein regions (IDPRs), respectively. Both IDPs and IDPRs can perform important biological functions despite lacking a properly folded and stable tertiary structure [2, 4] . These kinds of proteins are an important area of research because they play major roles in cell regulation, signalling, differentiation, survival, apoptosis and proliferation [5, 6] . Some are also postulated to be involved in disease etiology and could represent potential targets for new drugs [7, 8] . Virus-encoded IDPs facilitate multiple functions such as adaptation to new or dynamic host environments, modulating host gene expression to promote virus replication, or counteracting host-defense mechanisms [9] [10] [11] . IDPRs may be more tolerant of non-synonymous mutations than ordered protein regions [12] , which may partly explain why virus genomes can tolerate high mutation rates [13, 14] . Viruses also have very compact genomes with overlapping reading frames [15, 16] , in which mutations may potentially modify multiple proteins. This may confer viruses a greater capacity to acquire novel functions and interactions [17] . Overlapping regions tend to be more structurally disordered when compared to non-overlapping regions [18] . Several experimental techniques are available to detect IDPs and IDPRs. The most common methods identify either protein regions in crystal structures that have unresolvable coordinates (X-ray crystallography) or regions in nuclear magnetic resonance (NMR) structures that have divergent structural conformations [6, 19, 20] . Other experimental techniques include circular dichroism (CD) spectroscopy and limited proteolysis (LiP) [1] . The challenge, however, is that these methods are very labour-intensive and difficult to scale up to track the rapidly accumulating number of unique protein sequences in public databases [6, 19] . At the time of this writing, over 60 million protein sequences have been deposited in the Uniprot database, yet only 0.02% of these sequences have been annotated for disorder [6] . As a result, numerous computational techniques that could potentially predict intrinsic disorder in protein sequences have been developed. These techniques work based on the assumptions that compared to IDPs and IDPRs, ordered proteins have a different amino acid composition as well as levels of sequence conservation [21, 22] . To date about 60 predictors for intrinsic disorder in proteins have been developed [1, 23, 24] , which can be broadly classified into three major categories. The first category, the scoring function-based methods, predict protein disorder solely based on basic statistics of amino acid propensities, physio-chemical properties of amino acids and residue contacts in folded proteins to detect regions of high energy. A second category is characterized by the use of machine learning classifiers (e.g., regularized regression models or neural networks) to predict protein disorder based on amino acid sequence properties. The third category are meta-predictors that predict disorder from an ensemble of predictive methods from the other two categories [1, 6, 25] . Different predictors of intrinsic disorder are developed on a variety of methodologies and will inevitably vary with respect to their sensitivities and biases in application to different protein sequences. As a result, it has been relatively difficult to benchmark these methods to identify a single disorder prediction method that can be classified as the most accurate relative to the others [24] . The DisProt database is a good resource for obtaining experimental data that has been manually curated for disorder in proteins, and can be used for benchmarking the performance of disorder predictors. As of April 27th, 2020, the Disprot protein database contained n = 3500 proteins of which 126 were virus-encoded proteins that have been annotated for intrinsic disorder as a presence-absence characteristic at the amino acid level [26, 27] . Previously, Tokuriki et al. [14] reported preliminary evidence that when compared to non-viruses, viral proteins possess many distinct biophysical properties including having shorter disordered regions. We are not aware of a published study that has previously benchmarked predictors of intrinsic disorder specifically for viral proteins. Here, we report results from a comparison of 21 disorder predictors on viral proteins from the DisProt database to firstly determine which methods work best for viruses, and secondly to generate inputs for an ensemble predictor that we evaluate alongside the predictors used individually. The Database of Protein Disorder (DisProt) [28] was used to collect virus protein sequences annotated with intrinsically disordered regions, based on experimental data derived from various detection methods; e.g., X-ray crystallography, NMR spectroscopy, CD spectroscopy (both far and near UV) and protease sensitivity. DisProt records include the amino acid sequence and all disordered regions annotated with the respective detection methods as well as specific experimental conditions. At the time of our study, DisProt contained 3,500 author-verified proteins, of which all viral proteins were collected for the present study. A total of 126 virus proteins were obtained, derived from different detection methods. Similarly, a set of 126 non-viral proteins was sampled at random without replacement from the protein database for comparison. We evaluated a number of disorder prediction programs and web applications. From the methods tested, we selected a subset of predictors favouring those that were developed more recently, are actively maintained, and performed well in previous method comparison studies [1, 29] . Where alternate settings or different versions based on training data were available for a given predictor, we tested all combinations. Our final set of 21 prediction methods tested were: SPOT-Disorder2 [30] , PONDR-FIT [31] , IUPred2 (short and long) [32] , PONDR (VLXT, XL1-XT, CAN-XT, VL3-BA, and VSL2 variants) [33] , Disprot (VL2 and variants VL2-V, -C and -S; VL3, VL3H, and VSLB) [34] , CSpritz (short and long) [35] , and ESpritz (variants trained on X-ray, NMR, and Disprot data) [36] . Although several other predictor models have been released online, the respective web services were unavailable or broken over the course of our data collection. To obtain disorder predictions from the methods that were only accessible as web applications, i.e., with no source code or compiled binary standalone distribution, we wrote Python scripts to automate the process of submitting protein sequence inputs and parsing HTML outputs. We used Selenium in conjunction with ChromeDriver (v81.0.4044.69) [37] to automate the web browsing and form submission processes. For each predictor, we implemented a delay of 90 seconds between consecutive protein sequence queries to avoid overloading the webservers hosting the respective predictor algorithms with repeated requests. Due to issues with the Disprot webserver, we were only able to obtain predictions for the non-viral protein data set for 13 of the predictors. We converted each DisProt record to a binary vector corresponding to ordered/disordered state of residues in the amino acid sequence. To compare results between disorder prediction algorithms, we dichotomized continuous-valued residue predictions, i.e., intrinsic disorder probability, by locating the threshold that maximized the Matthews correlation coefficient (MCC) for each predictor applied to the DisProt training data. This optimal threshold was estimated using Brent's root-finding algorithm as implemented by the optim function in the R statistical computing environment (version 3.4.4). In addition, we calculated the accuracy, specificity and sensitivity for each predictor from the contingency table of DisProt residue labels and dichotomized predictions. To assess whether the accuracy of existing predictors could be further improved on the virusspecific data set, we trained an ensemble classifier on the outputs of all predictors as features. Specifically, we used the random forest method implemented in the scikit-learn (version 0.23.1) Python module [38] , which employs a set of de-correlated decision trees and averages their respective outputs to obtain an ensemble prediction [39] . To reduce bias, random forests fit the same decision trees to many bootstrap samples of the training data, and each committee of trees 'votes' for a particular classification [40] . By splitting the trees based on different samples of features, random forests reduce the correlation between trees and the overall variance. We split the viral protein data into random testing and training subsets, with 30% of protein sequences reserved for testing. Due to class imbalance in the data (i.e., only a minority of residues are labeled as disordered), we used stratified random sampling using the 'StratifiedShuffleSplit' function in the scikit-learn module. This function stratifies the data by label so that a constant proportion of labels is maintained in the training subset. Continuous-valued outputs from each predictor were normalized to a zero mean and unit variance. Thus, we did not apply the dichotomizing thresholds to these features (predictor outputs) when training the random forest classifier. We used 5-fold cross validation to tune the four hyper-parameters of the random forest classifier; namely: (1) the number of decision trees; (2) the maximum depth of any given decision tree; and the minimum number of samples required to split (3) an internal node or (4) a leaf node. To further minimize the effect of class imbalance in our data, we used over-sampling to balance the data with synthetic cases [41] . As suggested in [42] , we applied an over-sampling procedure at every iteration of the cross-validation analysis to avoid over-optimistic results. We used the Python package imbalanced-learn [43] to over-sample the minority class (residues in intrinsically disordered regions) using the synthetic minority oversampling technique (SMOTE) [44] . SMOTE generates new cases by sampling the original data at random with replacement, evaluates each sample's k nearest neighbours in the feature space, and then generates new synthetic samples along the vectors joining the sample to one of the neighbouring points. Over-sampling enables decision trees to be more generalizable by amplifying the decision region of the minority class. Using the optimized tuning parameters, we fit the final model on all of the training data. We applied this final model to generate predictions on the reserved testing data and calculated the MCC, sensitivity, specificity and accuracy. We repeated this process 10 times with randomly generated seeds to split the data into training and testing subsets, and averaged these performance metrics across replicates. To characterize how the performance of individual disorder predictors might vary among proteins from viruses and non-viruses, we computed the root mean square error (RMSE) for all continuousvalued predictions relative to the Disprot label (0, 1). We visualized this error distribution using principal component analysis (PCA). As well, we trained a support vector machine (SVM) on the RMSE values to determine whether the virus/non-virus labels were separable in this space. We used the default radial basis kernel with the C-classification SVM method implemented in the R package e1071 [45] , with 100 training subsets sampled at random without replacement for half of the data, and the remaining half for validation. We have released the Python scripts for automating queries to the disorder prediction web servers under a permissive free software license at https://github.com/PoonLab/Floppy/. We obtained 126 viral and 126 randomly selected non-viral protein sequences from the DisProt database. The sequences were already annotated manually by a panel of experts for the presence or absence of disorder at each amino acid position, based on experimental data [27] . Supplementary Tables S1 and S2 summarize the composition of the viral and non-viral protein datasets, respectively. The viral protein data set represents 22 virus families and 48 species. Not surpris-ingly, human immunodeficiency virus type 1 was disproportionately represented in these data with 16 entries corresponding to seven different gene products. Similarly, the non-viral protein data set was predominated by 75 human proteins, followed by 23 proteins from the yeast Saccharomyces cerevisiae. We found no significant difference in amino acid sequence lengths between viruses and all other organisms ( Our first objective was to benchmark the performance of different predictors of intrinsic protein disorder to determine which predictor conferred the highest accuracy for viral proteins. These predictors generate continuous-valued outputs that generally correspond to the estimated probability that the residue is in an intrinsically disordered region. To create a uniform standard for comparison to the binary presence-absence labels, we optimized the disorder prediction thresholds as a tuning parameter for each predictor for the viral and non-viral datasets, respectively (Supplementary Tables S3 and S4 ). Put simply, residues with values above the threshold were classified as disordered. We used both the Matthews correlation coefficient (MCC, ranging from −1 to +1 [46] ) and area under the receiver-operator characteristic curve (AUC, ranging from 0 to 1) to quantify the performance of each predictor. were allowed to specialize on different subsets of a partitioned training set; for example, V tended to call higher levels of disorder in proteins of Archaebacteria [34] . Similarly, PONDR-XL1 was optimized to predict longer disordered regions and PONDR-CAN was trained specifically on calcineurins (a protein phosphatase) that is known to perform poorly on other proteins [48] . Supplementary Tables S3 and S4 . To examine potential differences among predictors in greater detail, we calculated the RMSE for each protein and predictor and used a principal components analysis to visualize the resulting matrix (Figure 2 ). The PCA indicated that the different predictors did not exhibit markedly divergent error profiles at the level of entire proteins. However, a support vector machine classifier trained on a random half of these data obtained, on average, an AUC of 0.75 (n = 100, range = 0.65 − 0.83), indicating that the viral and non-viral protein labels were appreciably separable with respect to these RMSE values. Ensemble classifiers are expected to perform better than their constituent models because they can reduce overfitting of the data by the latter [49] . Although multiple predictive models of protein disorder employ an ensemble approach, none of them has been trained specifically on viral protein data. We trained a random forest classifier on the outputs of the predictors used in our study using 10 random training subsets of the viral protein data. Next, we validated the performance of this ensemble model in comparison to these individual predictors to determine if training on viral data conferred a significant advantage. We found that the ensemble classifier performed substantially better, with a mean MCC of 0.72 (range 0.62 − 0.86). This corresponded to a roughly 27% improvement relative to ESpritz.Disorder, the best performing disorder predictor on these data ( Figure 1 ). To examine the relative contribution of the different predictors used as inputs for the ensemble method, we evaluated the feature importance of each input (Figure 3 ) -roughly the prevalence of that feature among the decision trees comprising the random forest. We observed that the individual accuracy of a predictor did not necessarily correspond to its feature importance. Specifically, the best predictors (ESpritz.Disprot, CSpritz.Long and SPOT-Disorder.2) tended to be assigned higher importance values. On the other hand, both Disprot-VL2.C and Disprot-VL2.V also displayed high importance despite having some of the worst accuracy measures when evaluated individually ( Figure 1 ). To illustrate the use of our ensemble model on a novel protein, we applied this model and the 21 individual predictors to the accessory protein encoded by ORF6 in the novel 2019 coronavirus that was first isolated in Wuhan, China (designated SARS-CoV-2). ORF6 is one of the eight accessory 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 Impurity-Based Feature Importance Figure 3 : Box plot of the average decrease in Gini impurity by each feature in the random forest, for 10 random runs of the random forest model. Vertical line indicates the median, the box is the interquartile range (IQR; range from first to third quartiles). The left whisker extends to the first datum greater that Q1 − 1.5×IQR and the right whisker extends to the last datum smaller than Q3 + 1.5×IQR. Individual points are outliers that lie outside this range. genes of this virus. Its protein product is involved in antagonizing interferon activity thereby suppressing host immune response [50] . The protein is predicted to be highly disordered, particularly in its C-terminal region that contains short linear motifs involved in numerous biological activities [51] . We used a heatmap (Figure 4 ) to visually summarize results from the ensemble method and individual predictors, mapped to the ORF6 amino acid sequence. Overall, most predictors assigned a higher probability of disorder in the C-terminal region of the protein, with the conspicuous exception of PONDR-XL1 and PONDR-CAN, which did not predict any disordered residues in this region. We also observed considerable variation among predictors around this overall trend. Although the PONDR-XL1 predictor is documented to omit the first and last 15 residues from disorder predictions, we observed that only 14 residues were reported this way -this treatment was also obtained for PONDR-CAN, although it was not a documented behaviour of that predictor. Intrinsically disordered protein regions play an essential role in many viral functions [11] . It is therefore important to predict these regions accurately in order to make biological inferences from sequence variation. In this study, we found that predictive models of intrinsic disorder were more divergent in performance when evaluated on viral proteins than non-viral proteins. We note that many of these predictors could only be accessed through web applications, and some services become unavailable at different points of our study. Although we obtained more accurate predictions -or at least, predictions that were more concordant with an expert-curated database of intrinsic protein disorder [27] -using an ensemble 'machine learning' method, the erratic availability of the constituent predictors presents a significant obstacle to the practical utility of such approaches. Hence, we encourage researchers in the field of disorder prediction to support open science by releasing their source code or compiled binaries for local execution. A comprehensive assessment of long intrinsic protein disorder from the DisProt database Intrinsically disordered proteins and their 'mysterious' (meta) physics. Frontiers in Physics 100 Years "Schlüssel-Schloss-Prinzip": What made Emil Fischer use this analogy? Intrinsically unstructured proteins: re-assessing the protein structurefunction paradigm MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins Accuracy of protein-level disorder predictions Intrinsically disordered proteins in human diseases: introducing the D2 concept Untapped potential of disordered proteins in current druggable human proteome Rapid evolution of virus sequences in intrinsically disordered protein regions Structural disorder in viral proteins Intrinsically disordered proteins of viruses: Involvement in the mechanism of cell regulation and pathogenesis Comparative analysis of mutational robustness of the intrinsically disordered viral protein VPg and of its interactor eIF4E Viral mutation rates Do viral proteins possess unique biophysical features? Trends in biochemical sciences Structure and organization of the viral genome The evolutionary genetics of emerging viruses The evolution of genome compression and genomic novelty in RNA viruses Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation Single-molecule fluorescence studies of intrinsically disordered proteins Resolving the ambiguity: Making sense of intrinsic disorder when PDB structures disagree Intrinsically disordered protein What does it mean to be natively unfolded? A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Briefings in bioinformatics Disorder prediction methods, their applicability to different protein targets and their usefulness for guiding experimental studies An overview of predictors for intrinsically disordered proteins over 2010-2014 DisProt 7.0: a major update of the database of disordered proteins DisProt: intrinsic protein disorder annotation in 2020 Quality and bias of protein disorder predictors SPOT-Disorder2: Improved Protein Intrinsic Disorder Prediction by Ensembled Deep Learning PONDR-FIT: a metapredictor of intrinsically disordered amino acids IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding Length-dependent prediction of protein intrinsic disorder Flavors of protein disorder. Proteins: Structure, Function, and Bioinformatics CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs ESpritz: accurate and fast prediction of protein disorder ChromeDriver: WebDriver for Chrome Scikit-learn: Machine learning in Python. the Journal of machine Learning research Random forests. Machine learning The Elements of Statistical Learning The class imbalance problem: Significance and strategies COVER: conformational oversampling as data augmentation for molecules Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning SMOTE: synthetic minority oversampling technique LIBSVM: A library for support vector machines Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks Sequence complexity of disordered protein Ensemble Prediction of Intrinsically Disordered Regions in Proteins SARS-CoV-2 nsp13, nsp14, nsp15 and orf6 function as potent interferon antagonists Dark Proteome of Newly Emerged SARS-CoV-2 in Comparison with Human and Bat Coronaviruses. bioRxiv