key: cord-0825158-w8foij5s
authors: Rodriguez Horta, Edwin; Weigt, Martin
title: On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins
date: 2021-05-24
journal: PLoS Comput Biol
DOI: 10.1371/journal.pcbi.1008957
sha: 21c7880a33622fae81752c7bd6b65ddabe56e0a3
doc_id: 825158
cord_uid: w8foij5s

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.

a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Global coevolutionary modeling approaches have recently seen a lot of interest [1, 2] , either directly for predicting residue-residue contacts from sequence ensembles corresponding to homologous protein families [3] [4] [5] , in predicting mutational effects [6] [7] [8] , or even in designing artificial but functional protein sequences [9] [10] [11] [12] , or as an input to deep-learning based protein structure prediction. The latter approach has recently lead to a breakthrough in predicting protein structure from sequence [13] [14] [15] [16] [17] .

The basic idea of coevolutionary models, like the Direct-Coupling Analysis (DCA) [6] , is that the amino-acid sequences, typically given in the form of a multiple-sequence alignment (MSA) of width (or aligned sequence length) L, can be considered as a sample drawn from some unknown probability distribution P(a 1 , . . ., a L ), with (a 1 , . . ., a L ) being an aligned amino-acid sequence. This probabilistic model is typically parameterized as P(a 1 , . . ., a L ) / exp{∑ i<j J ij (a i , a j ) + h i (a i )} via biases (or fields) h i (a i ) representing site-specificities in aminoacid usage (i.e. patterns of amino-acid conservation), and via statistical couplings J ij (a i , a j ), which represent coevolutionary constraints and cause correlated amino-acid usage in positions i and j [18] .

In most cases, the parameters of these models are inferred by (approximate) maximumlikelihood, under the assumption that the sequences in the MSA are (almost) independently and identically distributed according to P(a 1 , . . ., a L ). On one hand, this assumption is needed to make model inference from MSA technically feasible. On the other hand, it is in obvious contradiction to the fact that sequences in homologous protein families share common ancestry in evolution, and therefore typically show considerable phylogenetic correlations, which can be used to infer this unknown ancestry from data [19] . Phylogeny induces highly non-trivial correlations between MSA columns [20] , which however do not represent any functional relationship.

Disentangling correlations induced by functional or structural couplings from phylogenycaused correlations turns out to be a highly non-trivial task [20] [21] [22] . Simple statistical corrections have been proposed, like down-weighting similar sequences when determining statistical correlations [3] , or the average-product correction (APC) [23] applied to the final coevolutionary coupling scores. While sequence weighting has initially been reported to significantly improve contact prediction, recent works show little effect [24] , probably due to the fact that, e.g., Pfam [25] is now based on reference proteomes and therefore less redundant than databases used to be about a decade ago. APC was shown to be more of a correction of biases related to amino-acid conservation than to phylogeny [26] .

To make progress, we suggest a complementary approach. Instead of removing phylogenetic correlations from DCA-type analyses, we suggest null models having the same conservation and phylogenetic patterns of the original MSA of the protein family under consideration, but strictly lack any functional or structural couplings.

Running DCA on artificial MSA generated by these null models, and comparing them to the results obtained from natural MSA, we find some remarkable results: while the largest eigenvalues of the residue-residue covariance matrix appears to be dominated by phylogenetic effects, the strongest DCA couplings are hardly influenced by phylogeny. The spurious couplings induced by phylogeny are, however, non-zero, and may limit the accuracy of contact prediction when going beyond the first few strongest couplings. More precisely, our observations are quantitatively compatible with the idea, that already the first false-positive contact prediction are caused by not correctly treating phylogeny in DCA. This shows also that methods properly dissecting phylogenetic and functional correlations in sequence data have a high potential to substantially extend the contact predictions beyond current methods.

The paper is organized as follows. After this introduction, we provide the Materials and Methods, with a short review of DCA, but most importantly with the presentation of three null models. The Results section compares the spectral properties of the residue-residue covariance matrix of the real data with those of MSA generated by the null models, followed by an assessment of the couplings inferred by DCA and their relation to residue-residue contacts. The Conclusion sums up the results and discusses potentially interesting future directions. Supporting tables and figures are shown in S1 Text.

Coevolutionary analysis is mostly applied to families of homologous proteins (or protein domains), as provided by the Pfam database [25] . Multiple sequence alignments (MSA) can be downloaded in the form of rectangular arrays D ¼ fa m i j i ¼ 1; . . . ; L; m ¼ 1; . . . ; Mg of width L (aligned sequence length) and depth M (number of aligned sequences). The entries a m i 2 fÀ ; A; C; . . . ; Yg are either one of the 20 standard amino acids, or alignment gaps represented as "−". Note that insertions are not aligned in Pfam alignments, and are therefore typical removed from the MSA before statistical model learning. Here L describes the sequence length after removal of insertions. For the structural analysis, the Pfam MSA is mapped to experimentally resolved PDB protein structures [27] , and distances are measured as minimum distances between heavy atoms. Following established standards in the coevolutionary literature, we use a cutoff of 8Å for residue-residue contacts.

For our work, we have selected three datasets:

• DS1: Detailed results are given for 9 Pfam protein families, with not too long sequences (L < 250), not too large MSA (M < 10, 000 after removing duplicate sequences) and available PDB structures, see details in Table A in S1 Text.

• DS2: Statistical results are given for 60 Pfam protein families belonging to the PSICOV benchmark set [28] . Only families with M < 12, 000 pairwise distinct sequences were considered, cf. Table B in S1 Text.

• DS3: We have also compiled and dataset of 20 smaller Pfam protein families with known PDB structures, to study the influence of finite MSA depth on our findings, cf. Table C in S1 Text.

Global coevolutionary models, as those constructed by DCA, describe the sequence variability between the members of a protein family, i.e. between different rows of the MSA D, via a statistical model

parameterized via pairwise coevolutionary residue-residue couplings J ij (a i , a j ) and single-residue biases (or fields) h i (a i ), while Z is a normalization factor also known as partition function.

In the simplest setting, these parameters are inferred from the data via maximum likelihood, i.e.

This maximization leads directly to the fact, that the model P reproduces the empirical statistics of single MSA columns and of column pairs,

where f i (a) represents the fraction of amino acids a in column i, i.e. the residue-conservation statistics, while the f ij (a, b) describe the fraction of sequences having simultaneously amino acids a and b in columns i and j, thereby representing residue covariation / coevolution, i.e. the correlated usage of amino acids in pairs of columns, cf. Fig 1. The inference of the parameters is a computationally hard task, since, e.g., the computation of the marginals in Eq (3) depends on an exponential sum over Oð21 L Þ terms. Many approximation schemes have been proposed, we use plmDCA [29] based on pseudo-likelihood maximization, since it represents a welltested compromise between accuracy and running time.

A particularity of this approach is that Eq (2) assumes that the sequences in the MSA D form an independently and identically distributed sample of P(a 1 , . . ., a L ) and that the likelihood can be factorized into a product over the rows of D. This assumption is incorrect; biological sequences are the result of natural evolution and thus show hierarchical phylogenetic relations. Phylogeny by itself leads to a non-trivial correlation structure between different residue positions with a power-law spectrum [20] , and this leads to non-zero but also non-functional residue-residue couplings when using DCA. These may interfere with the functional couplings, which are e.g. used for residue-residue contact prediction from MSA data, and thereby negatively impact prediction accuracy.

It is notoriously hard to disentangle the two, cf. [21, 22] . The problem is that evolution is a non-equilibrium stochastic process, whose dynamics in principle depends on the evolutionary constraints represented, e.g., by the couplings and fields in the DCA model. Global model inference from phylogenetically correlated data remains an open questions.

Here we follow a different route. We define different null models, which take residue conservation and in part also phylogeny into account, but do not show any intrinsic amino-acid covariation. The null models allow us to create large numbers of suitably randomized sequence ensembles, on which standard DCA can be run. The couplings resulting from randomized data can be used to assess the statistical significance of the couplings resulting from the real MSA D, and therefore to discard purely phylogeny-caused couplings. While, to the best of our knowledge, this has never been done in the context of protein families and DCA, somehow similar techniques have been proposed in the context of phylogenetic profiling [30] , but applied to correlations rather than couplings.

The first null model is very simple. It randomizes the input MSA by conserving the single-column statistics f i (a), for all sites i = 1, . . ., L and all amino acids or gaps a 2 {−, A, . . ., Y}. This is done by simple random but independent permutations of all MSA columns. This destroys all correlations between positions (the coevolutionary ones) and between sequences (the phylogenetic ones), only the residue conservation patterns of the original MSA are preserved. Formally, the randomized sequences become an independently and identically distributed sample from the profile model

So in principle there are no couplings between different residues at all. However, when running DCA on this sample, inferred couplings will be non-zero due to finite sample size. They will take distinct values from one randomization to the next, but there may be systematic biases due to the distinct conservation levels, which are maintained as compared to the original Pfam MSA.

The second null model is more complicated, since it preserves also (at least approximately) the phylogenetic information contained in the original MSA. Here we assume that this information is coded in the pairwise distances between sequences, i.e. in the matrix fD H mn j1 � m < n � Mg of Hamming distances between all pairs of sequences, as is done in distance-based phylogeny reconstruction [19, 31] .

The aim of the second null model is to construct a randomized MSA which preserves both the sequence profile given by the position-specific frequencies f i (a), and the matrix fD H mn g of pairwise Hamming distances between sequences. This can be achieved by the following Markov chain Monte Carlo (MCMC) procedure acting on the entire alignment. Our method is initialized using a sample of null model I, i.e. all coevolutionary and phylogenetic information from the original MSA is destroyed, but the profile is preserved. The resulting randomized MSA after t MCMC steps is calledD t ¼ fã m i j i ¼ 1; . . . ; L; m ¼ 1; . . . ; Mg. In step t ! t + 1, one column i 2 {1, . . ., L} is selected randomly, as well as two rows m, n 2 {1, . . ., M}. An exchange of the entriesã m i andã n i is attempted, to obtain a new matrixD tþ1 . This matrix is accepted with a Metropolis-Hastings acceptance probability p acc of

otherwise the exchange is refused and the matrixD t remains invariant in step t + 1. In this expression, D H (D) stands for the matrix of Hamming distances between the rows of the original MSA D, analogously for the randomized MSA, and ||�|| for the Frobenius norm of matrices. This acceptance rule guarantees that each exchange, which brings the distance matrix D H ðD t Þ closer to the target matrix D H (D), is accepted. Exchanges going into the opposite direction are accepted with a smaller probability depending exponentially on the formal "inverse temperature" β. Here we use simulated annealing, i.e. we initialize β in a small value and slowly increase it over time, in order to force the Hamming distances of the randomized MSA closer an closer to the ones of the natural MSA. Fig A in S1 Text illustrates that very high correlations (Pearson correlation > 0.97) between the two distances matrices are actually obtained across protein families by our algorithm. Fig B in S1 Text shows the distance histograms between natural and randomized sequences. We find that, with rare exceptions, randomized sequences are at most at 60-70% sequence identity (minimal Hamming distance 30-40%) to the closest natural sequences, showing that Null model II actually generates sequences, which are distant from the original Pfam MSA.

Since the sequence profile remains unchanged by this procedure, and the natural distances between sequences are approached, the randomized MSA thus contains approximately the same conservation and phylogenetic properties of the biological sequence data. However, potentially existing functional correlations between MSA columns are eliminated. The resulting data-covariance matrix is expected to have non-trivial properties in agreement with [20] , and DCA is expected to be able to reproduce this correlation structure via couplings J ij (a, b). Repeating the randomization many times, we can assess the statistics of phylogeny-generated couplings, and thereby the significance of the couplings found by running DCA on the original protein sequences collected in D.

Note also that, in the limit where the formal inverse temperature in Eq (5) is set to β = 0, i.e. in the case of infinite formal temperature T = β −1 , we recover Null model I. One could use β therefore as an interpolating parameter between these two Null models.

To corroborate the results of Null model II, we have also used a complementary strategy using explicitly an evolutionary model and a phylogenetic tree to resample sequences on this tree. The evolutionary model we use is the Felsenstein model for independent-site evolution [32] , i.e. a model which does account for site-specific conservation profiles and phylogeny, but not for any intrinsic correlation / coupling between different sites. In this context, the stationary probability distribution of sequences is described by a profile model

which has the same form of the profile model in Eq (4), but the factors are not given directly by the empirical amino-acid frequencies in the MSA columns. All sites i = 1, . . ., L evolve independently, and for each site i the probability of finding some amino acid b, given an ancestral amino acid a some time t before, is given by

with μ being the mutation rate and δ a,b the Kronecker symbol, which equals one if and only if the two arguments are equal, and zero else. In this model, there is no mutation with probability e −μt , and the amino acid in position i does not change, or at least one mutation with probability 1 − e −μt . In the latter case, the new amino acid b is emitted with its equilibrium probability ω i (b). While being simple, the Felsenstein model of evolution is frequently used in phylogenetic inference. The algorithm proceeds in the following way, using the implementation of [22] :

• A phylogenetic tree T is inferred from the MSA D using FastTree [33] . Instead of using inter-sequence Hamming distances (like in Null model II), FastTree is using a maximumlikelihood approach based on a model of independent-site evolution, i.e. no coevolutionary information is taken into acocunt in tree inference.

• The mutation rate μ and all site-specific frequencies {ω i (a)} are inferred using maximum likelihood.

• To resample the MSA according to this model, the root sequence is drawn randomly from P ω , and stochastically evolved on the branches of T using the transition probability Eq (7).

• The resampled MSA is composed by the sequences resulting in the leaves of T .

This procedure allows thus to emit many artificial MSA being evolved on the same phylogeny and with the same stationary sequence distribution as the one inferred from the natural sequences given in D, but no coevolutionary information is taken into account at any of the four steps. Note that the emitted MSA are expected to be more noisy than the ones of Null model II. In particular the column statistics will differ from f i (a), and also the inter-protein Hamming distances D H will differ more from the ones in the training MSA, cf. Fig C in S1 Text showing that correlations between the D H matrices remain large but not as large as in Null model II (Pearson correlations 0.7-0.95 for the protein families in dataset DS1).

Again DCA can be run on many of the resampled MSA, and the DCA couplings of the natural MSA can be compared with the statistics of the resampled ones, to assess their statistical significance beyond finite-size and phylogenetic effects.

The two Null models II and III, which both include phylogenetic correlations between proteins, lead to qualitatively coherent, but quantitatively slightly different results, which reflect the different randomization strategies. In the main text of this article, we will present almost exclusively the results of Null model II, in comparison to the natural MSA and Null model I.

The results for Null model III are delegated to S1 Text, unless explicitly stated.

Following the mathematical derivations published in [20] , we would expect that the residueresidue covariance matrix C = {c ij (a, b) | i, j = 1, . . ., L; a, b2{−, A, . . ., Y}} with c ij (a, b) 

is strongly impacted by phylogenetic correlations in the data. More precisely, while totally random data would lead to the Marchenkov-Pastur distribution for the eigenvalue spectrum of C, the hierarchical structure of data on the leaves of a phylogenetic tree leads to a power-law tail of large eigenvalues.

It is thus not very astonishing, that both Null models II and III show fat tails in the spectrum of their data covariance matrices C (even if Null model II does not fulfill the mathematical conditions of the derivation in [20] because not generated according to a hierarchical process), while the spectrum of Null model I has a substantially more compact support, cf. Fig 2, and Figs D and E in S1 Text. The interesting observation is that, at the level of the eigenvalue spectrum, the natural data are hardly distinguishable from the phylogeny-aware Null models II and III, in difference to Null model I.

This suggests the following conclusions: the dominant global residue-residue correlation structure, as far as reflected by the largest eigenvalues of the C-matrix, results from phylogeny. A comparison with principal-component analysis (PCA) relates these eigenvalues to the largescale organization of sequences in sequence space, e.g. into clusters of sequences. Note that the eigenvectors are expected to contain complementary information, e.g. used for PCA or for the identifcation of protein sectors [34, 35] , defined as multi-residue groups of coherent evolution.

However, the couplings derived by DCA are not directly related to the largest eigenvalues of the residue-residue covariance matrix. Actually, the computationally most efficient DCA approximations based on mean-field [3] or Gaussian [36] approximations, relate the couplings J to the negative of the inverse of C. The DCA couplings are therefore dominated by the smallest eigenvalues of C, cf. also [37] .

Here we use plmDCA, the resulting couplings therefore lack any simple relation to the eigenvalues and eigenvectors of the residue-residue covariance matrix. In difference to standard plmDCA we do not use any sequence weighting, since it might interfere with the Fig 2. Eigenvalue spectra of the covariance matrix of the natural MSA and for Null models I and II. We show the cumulative distribution of the unified eigenvalue spectra for the 60-protein dataset DS2, i.e. the fraction of eigenvalues larger than λ is shown as a function of λ. We observe that the phylogeny-aware Null model II shows the same fat tail for large eigenvalues, which is also present in the natural data, while the non-phylogenetic Null model I has a more compact support. The cutoff of the tail for large λ is an effect of the inter-family variability of the largest eigenvalues among the 60 spectra, cf. 

Phylogenetic correlations and coevolutionary contact prediction phylogenetic signal in a non-controlled way. In Fig 3, we plot histograms of the DCA couplings (APC-corrected Frobenius norms F APC of the coupling matrices for each residue pair, i.e. the standard output of plmDCA, cf. [29] and the Introduction for a short explanation of APC) for Null models I, II and the natural MSA D for the datasets DS2 of large and DS3 of small-medium depth MSA. Equivalent results for the nine individual families in DS1 are shown in Fig F in S1 Text, along with those for Null model III in Fig G in S1 Text.

We see that across all protein families, DCA couplings from natural data reach significantly larger values than those derived from both Null models. The latter two miss in particular the strong tail for large values; their supports being much more concentrated. For large MSA (DS2), the phylogeny-aware Null model II generates larger couplings than the phylogenyunaware Null model I, i.e. they go beyond what is to be expected from finite-sample effects alone. This latter difference almost vanishes for smaller MSA (DS3), where the coupling histograms for Null models I and II almost coincide. Interestingly, the phylogenetic couplings of Null model II are almost invariant with respect to family size, while finite-size effects (Null model I) decrease with family size, and the tail of large couplings resulting from natural sequences grows with family size.

It is very interesting that, while the spectra are similar for natural data and phylogenyaware null models, the dominant residue-residue couplings are neither explainable by phylogeny nor by finite sample size. They must consequently result from intrinsic evolutionary constraints acting on the proteins due to natural selection for correctly folded and properly functioning proteins.

This observation becomes even more interesting, when we compare the couplings of residueresidue contacts and non-contacts. In Fig 4A, we show normalized coupling histograms for the two subsets of residue pairs in dataset DS2. The tail of large couplings is present only in the contacts, explaining why DCA and the closely related GREMLIN accurately predict contacts when couplings are high enough [38, 39] , cf. also the positive predictive value (PPV) in function of the coupling in Fig 3. As is visible in the Fig H in S1 Text for the nine individual protein families, the contact prediction in a family depends crucially on the size of this tail of large couplings.

Using Null model II, we cann assess the strength of phylogeny-induced spurious couplings on the same subsets of contacts and non-contacts extracted for DS2, cf. Fig 4B and Fig I in S1 Text (and similarly Fig J in S1 Text for Null model III). We see that the two histograms for contacts and non-contacts get almost identical to each other; due to the larger number of noncontacts the largest couplings are therefore dominated by non-contacts across all studied protein families. Most interestingly, when we compare the non-contact histograms in both panels of Fig 4, they are extremely similar. It appears that the strength of the non-contact couplings in the natural data is mostly consistent with the phylogeny-induced spurious couplings in Null model II.

The differences between these histograms translates into differences between PPV-curves as shown in Fig 5. The upper panels show the results for the union of all predictions, i.e. the largest DCA-couplings scores F APC of all families come first, for the dataset DS2 of large MSA (Panel A) and the DS3 of smaller MSA (Panel B). We see that the largest couplings derived from the natural MSA are contacts in both cases, but much more contacts are found in the larger MSA. However, in the lower panels we see that for smaller MSA only about 60% of the considered families have such large scores, leading to an initial average PPV (each family Here and in the following, histograms are normalized as probability distributions, i.e. to area one under the curve. It becomes evident that phylogenetic effects create-at least for sufficiently deep MSA-larger couplings than to be expected from finite sample size alone. However, couplings derived from the natural MSA have substantially larger values. The figures include also the positive predictive value (PPV, scale on the right of each panel), providing the fraction of true contacts in between all couplings F APC above some threshold θ, as a function of θ, for plmDCA run on the natural MSA. We clearly see that almost all large couplings correctly predict contacts, while the PPV starts to drop once we reach F APC reached also by phylogenetic effects in Null model II. We find this to be true for all non-trivial contacts (sequence separation |i − j| > 4) as well as for long-distance contacts (|i − j| > 24).

https://doi.org/10.1371/journal.pcbi.1008957.g003

Phylogenetic correlations and coevolutionary contact prediction considered individually, and the individual PPV are averaged) of only about 0.6, while almost all large MSA lead to initially accurate contact predictions. The figures also contain a contact prediction for randomized data from Null model II. In the upper panels we observe a very weak contact signal; it results from the fact that conserved sites tend to lead to larger phylogeny-induced spurious couplings, but they also tend to be concentrated in proteins, e.g. in active sites or the protein core, and consequently to have a higher contact fraction. Table B in S1 Text). Only pairs with linear separation |i − j| > 4 along the chain are taken into account. One might expect that they change from sample to sample. While this is the case for individual couplings, the histograms remain remarkably unchanged when comparing samples, cf. Figs K and L in S1 Text. These observations show us that, while phylogenetic effects result in nonzero couplings between residues when DCA is applied, these couplings are relatively weak and never reach the size of the couplings, which allow for a high-confidence contact prediction. This idea is also corroborated by the quantitative assessment of the statistical significance in the couplings derived from natural sequences, as compared to the ones generated by the Null models. To this aim, we assign a z-score to each residue pair (i, j): Using 50 samples of Null model II, we determine the mean and standard deviation of couplings derived from Null model II, individually for each pair (i, j). We use these values to determine the z-score, i.e. the number of standard deviations, the actual couplings (from natural MSA) is away from the means for Null model II. In Fig 6, we observe, that this z-score is highly correlated with the plmDCA score derived from natural MSA, across all families. Almost all DCA scores above 0.2-0.3 have highly significant z-scores above 3 or even more. Even larger correlations between DCA and z-scores are observed in Null model III, cf. Fig M in S1 Text. One might be tempted to use this statistical significance score instead of the DCA-coupling strength for contact prediction. In Fig 5 we show that the two lead to highly comparable results, with a slight advantage for the standard F APC after the first few predictions. This difference might result from the before-mentioned observation that conserved positions tend to have larger phylogeny-induced couplings (and thus larger variances between different samples of Null model II), causing a systematic reduction of the related z-scores.

Global coevolutionary modeling treats multiple-sequence alignments of homologous protein sequences as collections of independently and identically distributed samples of some unknown probability distribution P(a 1 , . . ., a L ), which has to be reconstructed from data. The assumption of independence is obviously violated due to the common evolutionary history, in particular sequences from related species show strong phylogenetic correlations.

It is, however, notoriously difficult to unify the idea of a global model including coevolutionary covariation between sites and phylogenetic correlations between sequences. Statistical corrections may improve the situation slightly, but they are too simple to take the hierarchical correlation structure into account, which is generated by the evolutionary dynamics on a phylogenetic tree.

Here we have proposed to approach this problem in a complementary way, by introducing null models-i.e. randomized or re-emitted multiple-sequence alignments-which reproduce conservation and phylogeny, but do not contain any real coevolutionary signal. When applying Direct Coupling Analysis as a prototypical global coevolutionary modeling approach, we observe that phylogenetic correlations between sequences lead to a changed residue-residue correlation structure, represented by a fat tail in the eigenvalue spectrum of the data covariance matrix. It leads also to distributed couplings, which, however, are smaller than the largest couplings found when applying DCA to natural sequence data, i.e. smaller than the couplings used for residue-residue contact prediction. The latter are significantly larger than couplings resulting from phylogeny, i.e. we can conclude that the first predicted contacts are influenced only to a very limited degree by phylogenetic couplings.

However, it is also striking that, across the studies protein families, the phylogeny-caused couplings in Null models II and III almost reach the DCA-score threshold found before for accurate contact prediction. This suggests that the suppression of phylogenetic biases in the data (or their better consideration in model inference), may shift this threshold down and therefore allow for predicting much more contacts. The potential gain would be limited due to the finite depths of the MSA, whose effects are assessed by Null model I. We can therefore quantitatively assess the potential in removing phylogenetic effects by the following hypothetical DCA output: on all contacts in our dataset DS2 and DS3 we use the standard plmDCA scores derived from natural data, and on all non-contacts we remove phylogenetic effects by using couplings derived by running plmDCA on samples of Null model I. The resulting hypothetical PPV curves are given in Fig 5 as purple lines: they are substantially higher than the real PPV obtained from the original data. In the case of the large MSA in DS2, we find a broad plateau of almost perfect PPV close to one, starting to drop only after about 50 top residue pairs, but staying above 90% even after 100 pairs, as compared to about 70% for the real DCA predictions. Even in the small-to-medium-depth MSA of DS3 the potential effect of removing phylogenetic effects is considerable, even if the finite-sample effect is much more pronounced. While in the real data only less than 60% of the considered protein start with a true-positive prediction, the hypothetical phylogeny-removed prediction starts with a PPV above 70%. Since coevolution-based scores are also input to most of the recent deep-learning-based contact predictors, we could imagine that corrections for phylogenetic effects would also improve the accuracy of these methods.

Using many realization of the Null models, we can provide a z-score for the couplings found in the original sequence data, and thereby assess their statistical significance beyond effects of finite and phylogenetically correlated sampling. This is of interest in exploratory studies: in a recent study, one of us has used Null model II in a collaboration aiming at finding potential epistatic effects in a global analysis of more than 50,000 SARS-Cov-2 genomes [40] . Due to the obvious strong correlation between these very recently diverged genomes, potential epistatic couplings have to be assessed carefully, and scoring them by Null model II has turned out to be an essential element in the identification of a sparse, but statistical significant genome-wide network of epistatic couplings.

Supporting information S1 Text. The file contains all supporting tables and figures cited in the main text. (PDF)

Emerging methods in protein co-evolution

Inverse statistical physics of protein sequences: a key issues review

Direct-coupling analysis of residue coevolution captures native contacts across many protein families

Protein structure prediction from sequence variation

Protein structure determination using metagenome sequence data

Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection

Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution

Mutation effects predicted from sequence co-variation

Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information

Co-Evolutionary Fitness Landscapes for Sequence Design

Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility

An evolution-based model for designing chorismate mutase enzymes

Accurate de novo prediction of protein contact map by ultra-deep learning model

High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features

Protein structure prediction using multiple deep neural networks in CASP13

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints

Improved protein structure prediction using predicted interresidue orientations

Identification of direct residue contacts in proteinprotein interaction by message passing

Inferring phylogenies

Power law tails in phylogenetic systems

Inverse Ising inference with correlated samples

Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data

Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction

Phylogenetic weighting does little to improve the accuracy of evolutionary coupling analyses

The Pfam protein families database in 2019

Synthetic protein alignments by CCMgen quantify noise in residueresidue contact prediction

The protein data bank

PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments

Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models

CoPAP: coevolution of presence-absence patterns

The neighbor-joining method: a new method for reconstructing phylogenetic trees

Evolutionary trees from DNA sequences: a maximum likelihood approach

FastTree 2-approximately maximum-likelihood trees for large alignments

Protein sectors: evolutionary units of three-dimensional structure

Evolution-based functional decomposition of proteins

Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners

From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction

Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis

Origins of coevolution between residues distant in protein 3D structures

Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes

We thanks Pierre Barrat-Charlaix and Alejandro Lage-Castellanos for numerous discussions.

Conceptualization: Martin Weigt.