key: cord-0825158-w8foij5s authors: Rodriguez Horta, Edwin; Weigt, Martin title: On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins date: 2021-05-24 journal: PLoS Comput Biol DOI: 10.1371/journal.pcbi.1008957 sha: 21c7880a33622fae81752c7bd6b65ddabe56e0a3 doc_id: 825158 cord_uid: w8foij5s Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Global coevolutionary modeling approaches have recently seen a lot of interest [1, 2] , either directly for predicting residue-residue contacts from sequence ensembles corresponding to homologous protein families [3] [4] [5] , in predicting mutational effects [6] [7] [8] , or even in designing artificial but functional protein sequences [9] [10] [11] [12] , or as an input to deep-learning based protein structure prediction. The latter approach has recently lead to a breakthrough in predicting protein structure from sequence [13] [14] [15] [16] [17] . The basic idea of coevolutionary models, like the Direct-Coupling Analysis (DCA) [6] , is that the amino-acid sequences, typically given in the form of a multiple-sequence alignment (MSA) of width (or aligned sequence length) L, can be considered as a sample drawn from some unknown probability distribution P(a 1 , . . ., a L ), with (a 1 , . . ., a L ) being an aligned amino-acid sequence. This probabilistic model is typically parameterized as P(a 1 , . . ., a L ) / exp{∑ i 0.97) between the two distances matrices are actually obtained across protein families by our algorithm. Fig B in S1 Text shows the distance histograms between natural and randomized sequences. We find that, with rare exceptions, randomized sequences are at most at 60-70% sequence identity (minimal Hamming distance 30-40%) to the closest natural sequences, showing that Null model II actually generates sequences, which are distant from the original Pfam MSA. Since the sequence profile remains unchanged by this procedure, and the natural distances between sequences are approached, the randomized MSA thus contains approximately the same conservation and phylogenetic properties of the biological sequence data. However, potentially existing functional correlations between MSA columns are eliminated. The resulting data-covariance matrix is expected to have non-trivial properties in agreement with [20] , and DCA is expected to be able to reproduce this correlation structure via couplings J ij (a, b). Repeating the randomization many times, we can assess the statistics of phylogeny-generated couplings, and thereby the significance of the couplings found by running DCA on the original protein sequences collected in D. Note also that, in the limit where the formal inverse temperature in Eq (5) is set to β = 0, i.e. in the case of infinite formal temperature T = β −1 , we recover Null model I. One could use β therefore as an interpolating parameter between these two Null models. To corroborate the results of Null model II, we have also used a complementary strategy using explicitly an evolutionary model and a phylogenetic tree to resample sequences on this tree. The evolutionary model we use is the Felsenstein model for independent-site evolution [32] , i.e. a model which does account for site-specific conservation profiles and phylogeny, but not for any intrinsic correlation / coupling between different sites. In this context, the stationary probability distribution of sequences is described by a profile model which has the same form of the profile model in Eq (4), but the factors are not given directly by the empirical amino-acid frequencies in the MSA columns. All sites i = 1, . . ., L evolve independently, and for each site i the probability of finding some amino acid b, given an ancestral amino acid a some time t before, is given by with μ being the mutation rate and δ a,b the Kronecker symbol, which equals one if and only if the two arguments are equal, and zero else. In this model, there is no mutation with probability e −μt , and the amino acid in position i does not change, or at least one mutation with probability 1 − e −μt . In the latter case, the new amino acid b is emitted with its equilibrium probability ω i (b). While being simple, the Felsenstein model of evolution is frequently used in phylogenetic inference. The algorithm proceeds in the following way, using the implementation of [22] : • A phylogenetic tree T is inferred from the MSA D using FastTree [33] . Instead of using inter-sequence Hamming distances (like in Null model II), FastTree is using a maximumlikelihood approach based on a model of independent-site evolution, i.e. no coevolutionary information is taken into acocunt in tree inference. • The mutation rate μ and all site-specific frequencies {ω i (a)} are inferred using maximum likelihood. • To resample the MSA according to this model, the root sequence is drawn randomly from P ω , and stochastically evolved on the branches of T using the transition probability Eq (7). • The resampled MSA is composed by the sequences resulting in the leaves of T . This procedure allows thus to emit many artificial MSA being evolved on the same phylogeny and with the same stationary sequence distribution as the one inferred from the natural sequences given in D, but no coevolutionary information is taken into account at any of the four steps. Note that the emitted MSA are expected to be more noisy than the ones of Null model II. In particular the column statistics will differ from f i (a), and also the inter-protein Hamming distances D H will differ more from the ones in the training MSA, cf. Fig C in S1 Text showing that correlations between the D H matrices remain large but not as large as in Null model II (Pearson correlations 0.7-0.95 for the protein families in dataset DS1). Again DCA can be run on many of the resampled MSA, and the DCA couplings of the natural MSA can be compared with the statistics of the resampled ones, to assess their statistical significance beyond finite-size and phylogenetic effects. The two Null models II and III, which both include phylogenetic correlations between proteins, lead to qualitatively coherent, but quantitatively slightly different results, which reflect the different randomization strategies. In the main text of this article, we will present almost exclusively the results of Null model II, in comparison to the natural MSA and Null model I. The results for Null model III are delegated to S1 Text, unless explicitly stated. Following the mathematical derivations published in [20] , we would expect that the residueresidue covariance matrix C = {c ij (a, b) | i, j = 1, . . ., L; a, b2{−, A, . . ., Y}} with c ij (a, b) is strongly impacted by phylogenetic correlations in the data. More precisely, while totally random data would lead to the Marchenkov-Pastur distribution for the eigenvalue spectrum of C, the hierarchical structure of data on the leaves of a phylogenetic tree leads to a power-law tail of large eigenvalues. It is thus not very astonishing, that both Null models II and III show fat tails in the spectrum of their data covariance matrices C (even if Null model II does not fulfill the mathematical conditions of the derivation in [20] because not generated according to a hierarchical process), while the spectrum of Null model I has a substantially more compact support, cf. Fig 2, and Figs D and E in S1 Text. The interesting observation is that, at the level of the eigenvalue spectrum, the natural data are hardly distinguishable from the phylogeny-aware Null models II and III, in difference to Null model I. This suggests the following conclusions: the dominant global residue-residue correlation structure, as far as reflected by the largest eigenvalues of the C-matrix, results from phylogeny. A comparison with principal-component analysis (PCA) relates these eigenvalues to the largescale organization of sequences in sequence space, e.g. into clusters of sequences. Note that the eigenvectors are expected to contain complementary information, e.g. used for PCA or for the identifcation of protein sectors [34, 35] , defined as multi-residue groups of coherent evolution. However, the couplings derived by DCA are not directly related to the largest eigenvalues of the residue-residue covariance matrix. Actually, the computationally most efficient DCA approximations based on mean-field [3] or Gaussian [36] approximations, relate the couplings J to the negative of the inverse of C. The DCA couplings are therefore dominated by the smallest eigenvalues of C, cf. also [37] . Here we use plmDCA, the resulting couplings therefore lack any simple relation to the eigenvalues and eigenvectors of the residue-residue covariance matrix. In difference to standard plmDCA we do not use any sequence weighting, since it might interfere with the Fig 2. Eigenvalue spectra of the covariance matrix of the natural MSA and for Null models I and II. We show the cumulative distribution of the unified eigenvalue spectra for the 60-protein dataset DS2, i.e. the fraction of eigenvalues larger than λ is shown as a function of λ. We observe that the phylogeny-aware Null model II shows the same fat tail for large eigenvalues, which is also present in the natural data, while the non-phylogenetic Null model I has a more compact support. The cutoff of the tail for large λ is an effect of the inter-family variability of the largest eigenvalues among the 60 spectra, cf. Phylogenetic correlations and coevolutionary contact prediction phylogenetic signal in a non-controlled way. In Fig 3, we plot histograms of the DCA couplings (APC-corrected Frobenius norms F APC of the coupling matrices for each residue pair, i.e. the standard output of plmDCA, cf. [29] and the Introduction for a short explanation of APC) for Null models I, II and the natural MSA D for the datasets DS2 of large and DS3 of small-medium depth MSA. Equivalent results for the nine individual families in DS1 are shown in Fig F in S1 Text, along with those for Null model III in Fig G in S1 Text. We see that across all protein families, DCA couplings from natural data reach significantly larger values than those derived from both Null models. The latter two miss in particular the strong tail for large values; their supports being much more concentrated. For large MSA (DS2), the phylogeny-aware Null model II generates larger couplings than the phylogenyunaware Null model I, i.e. they go beyond what is to be expected from finite-sample effects alone. This latter difference almost vanishes for smaller MSA (DS3), where the coupling histograms for Null models I and II almost coincide. Interestingly, the phylogenetic couplings of Null model II are almost invariant with respect to family size, while finite-size effects (Null model I) decrease with family size, and the tail of large couplings resulting from natural sequences grows with family size. It is very interesting that, while the spectra are similar for natural data and phylogenyaware null models, the dominant residue-residue couplings are neither explainable by phylogeny nor by finite sample size. They must consequently result from intrinsic evolutionary constraints acting on the proteins due to natural selection for correctly folded and properly functioning proteins. This observation becomes even more interesting, when we compare the couplings of residueresidue contacts and non-contacts. In Fig 4A, we show normalized coupling histograms for the two subsets of residue pairs in dataset DS2. The tail of large couplings is present only in the contacts, explaining why DCA and the closely related GREMLIN accurately predict contacts when couplings are high enough [38, 39] , cf. also the positive predictive value (PPV) in function of the coupling in Fig 3. As is visible in the Fig H in S1 Text for the nine individual protein families, the contact prediction in a family depends crucially on the size of this tail of large couplings. Using Null model II, we cann assess the strength of phylogeny-induced spurious couplings on the same subsets of contacts and non-contacts extracted for DS2, cf. Fig 4B and Fig I in S1 Text (and similarly Fig J in S1 Text for Null model III). We see that the two histograms for contacts and non-contacts get almost identical to each other; due to the larger number of noncontacts the largest couplings are therefore dominated by non-contacts across all studied protein families. Most interestingly, when we compare the non-contact histograms in both panels of Fig 4, they are extremely similar. It appears that the strength of the non-contact couplings in the natural data is mostly consistent with the phylogeny-induced spurious couplings in Null model II. The differences between these histograms translates into differences between PPV-curves as shown in Fig 5. The upper panels show the results for the union of all predictions, i.e. the largest DCA-couplings scores F APC of all families come first, for the dataset DS2 of large MSA (Panel A) and the DS3 of smaller MSA (Panel B). We see that the largest couplings derived from the natural MSA are contacts in both cases, but much more contacts are found in the larger MSA. However, in the lower panels we see that for smaller MSA only about 60% of the considered families have such large scores, leading to an initial average PPV (each family Here and in the following, histograms are normalized as probability distributions, i.e. to area one under the curve. It becomes evident that phylogenetic effects create-at least for sufficiently deep MSA-larger couplings than to be expected from finite sample size alone. However, couplings derived from the natural MSA have substantially larger values. The figures include also the positive predictive value (PPV, scale on the right of each panel), providing the fraction of true contacts in between all couplings F APC above some threshold θ, as a function of θ, for plmDCA run on the natural MSA. We clearly see that almost all large couplings correctly predict contacts, while the PPV starts to drop once we reach F APC reached also by phylogenetic effects in Null model II. We find this to be true for all non-trivial contacts (sequence separation |i − j| > 4) as well as for long-distance contacts (|i − j| > 24). https://doi.org/10.1371/journal.pcbi.1008957.g003 Phylogenetic correlations and coevolutionary contact prediction considered individually, and the individual PPV are averaged) of only about 0.6, while almost all large MSA lead to initially accurate contact predictions. The figures also contain a contact prediction for randomized data from Null model II. In the upper panels we observe a very weak contact signal; it results from the fact that conserved sites tend to lead to larger phylogeny-induced spurious couplings, but they also tend to be concentrated in proteins, e.g. in active sites or the protein core, and consequently to have a higher contact fraction. Table B in S1 Text). Only pairs with linear separation |i − j| > 4 along the chain are taken into account. One might expect that they change from sample to sample. While this is the case for individual couplings, the histograms remain remarkably unchanged when comparing samples, cf. Figs K and L in S1 Text. These observations show us that, while phylogenetic effects result in nonzero couplings between residues when DCA is applied, these couplings are relatively weak and never reach the size of the couplings, which allow for a high-confidence contact prediction. This idea is also corroborated by the quantitative assessment of the statistical significance in the couplings derived from natural sequences, as compared to the ones generated by the Null models. To this aim, we assign a z-score to each residue pair (i, j): Using 50 samples of Null model II, we determine the mean and standard deviation of couplings derived from Null model II, individually for each pair (i, j). We use these values to determine the z-score, i.e. the number of standard deviations, the actual couplings (from natural MSA) is away from the means for Null model II. In Fig 6, we observe, that this z-score is highly correlated with the plmDCA score derived from natural MSA, across all families. Almost all DCA scores above 0.2-0.3 have highly significant z-scores above 3 or even more. Even larger correlations between DCA and z-scores are observed in Null model III, cf. Fig M in S1 Text. One might be tempted to use this statistical significance score instead of the DCA-coupling strength for contact prediction. In Fig 5 we show that the two lead to highly comparable results, with a slight advantage for the standard F APC after the first few predictions. This difference might result from the before-mentioned observation that conserved positions tend to have larger phylogeny-induced couplings (and thus larger variances between different samples of Null model II), causing a systematic reduction of the related z-scores. Global coevolutionary modeling treats multiple-sequence alignments of homologous protein sequences as collections of independently and identically distributed samples of some unknown probability distribution P(a 1 , . . ., a L ), which has to be reconstructed from data. The assumption of independence is obviously violated due to the common evolutionary history, in particular sequences from related species show strong phylogenetic correlations. It is, however, notoriously difficult to unify the idea of a global model including coevolutionary covariation between sites and phylogenetic correlations between sequences. Statistical corrections may improve the situation slightly, but they are too simple to take the hierarchical correlation structure into account, which is generated by the evolutionary dynamics on a phylogenetic tree. Here we have proposed to approach this problem in a complementary way, by introducing null models-i.e. randomized or re-emitted multiple-sequence alignments-which reproduce conservation and phylogeny, but do not contain any real coevolutionary signal. When applying Direct Coupling Analysis as a prototypical global coevolutionary modeling approach, we observe that phylogenetic correlations between sequences lead to a changed residue-residue correlation structure, represented by a fat tail in the eigenvalue spectrum of the data covariance matrix. It leads also to distributed couplings, which, however, are smaller than the largest couplings found when applying DCA to natural sequence data, i.e. smaller than the couplings used for residue-residue contact prediction. The latter are significantly larger than couplings resulting from phylogeny, i.e. we can conclude that the first predicted contacts are influenced only to a very limited degree by phylogenetic couplings. However, it is also striking that, across the studies protein families, the phylogeny-caused couplings in Null models II and III almost reach the DCA-score threshold found before for accurate contact prediction. This suggests that the suppression of phylogenetic biases in the data (or their better consideration in model inference), may shift this threshold down and therefore allow for predicting much more contacts. The potential gain would be limited due to the finite depths of the MSA, whose effects are assessed by Null model I. We can therefore quantitatively assess the potential in removing phylogenetic effects by the following hypothetical DCA output: on all contacts in our dataset DS2 and DS3 we use the standard plmDCA scores derived from natural data, and on all non-contacts we remove phylogenetic effects by using couplings derived by running plmDCA on samples of Null model I. The resulting hypothetical PPV curves are given in Fig 5 as purple lines: they are substantially higher than the real PPV obtained from the original data. In the case of the large MSA in DS2, we find a broad plateau of almost perfect PPV close to one, starting to drop only after about 50 top residue pairs, but staying above 90% even after 100 pairs, as compared to about 70% for the real DCA predictions. Even in the small-to-medium-depth MSA of DS3 the potential effect of removing phylogenetic effects is considerable, even if the finite-sample effect is much more pronounced. While in the real data only less than 60% of the considered protein start with a true-positive prediction, the hypothetical phylogeny-removed prediction starts with a PPV above 70%. Since coevolution-based scores are also input to most of the recent deep-learning-based contact predictors, we could imagine that corrections for phylogenetic effects would also improve the accuracy of these methods. Using many realization of the Null models, we can provide a z-score for the couplings found in the original sequence data, and thereby assess their statistical significance beyond effects of finite and phylogenetically correlated sampling. This is of interest in exploratory studies: in a recent study, one of us has used Null model II in a collaboration aiming at finding potential epistatic effects in a global analysis of more than 50,000 SARS-Cov-2 genomes [40] . Due to the obvious strong correlation between these very recently diverged genomes, potential epistatic couplings have to be assessed carefully, and scoring them by Null model II has turned out to be an essential element in the identification of a sparse, but statistical significant genome-wide network of epistatic couplings. Supporting information S1 Text. The file contains all supporting tables and figures cited in the main text. (PDF) Emerging methods in protein co-evolution Inverse statistical physics of protein sequences: a key issues review Direct-coupling analysis of residue coevolution captures native contacts across many protein families Protein structure prediction from sequence variation Protein structure determination using metagenome sequence data Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution Mutation effects predicted from sequence co-variation Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information Co-Evolutionary Fitness Landscapes for Sequence Design Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility An evolution-based model for designing chorismate mutase enzymes Accurate de novo prediction of protein contact map by ultra-deep learning model High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features Protein structure prediction using multiple deep neural networks in CASP13 Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints Improved protein structure prediction using predicted interresidue orientations Identification of direct residue contacts in proteinprotein interaction by message passing Inferring phylogenies Power law tails in phylogenetic systems Inverse Ising inference with correlated samples Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction Phylogenetic weighting does little to improve the accuracy of evolutionary coupling analyses The Pfam protein families database in 2019 Synthetic protein alignments by CCMgen quantify noise in residueresidue contact prediction The protein data bank PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models CoPAP: coevolution of presence-absence patterns The neighbor-joining method: a new method for reconstructing phylogenetic trees Evolutionary trees from DNA sequences: a maximum likelihood approach FastTree 2-approximately maximum-likelihood trees for large alignments Protein sectors: evolutionary units of three-dimensional structure Evolution-based functional decomposition of proteins Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis Origins of coevolution between residues distant in protein 3D structures Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes We thanks Pierre Barrat-Charlaix and Alejandro Lage-Castellanos for numerous discussions. Conceptualization: Martin Weigt.