key: cord-0213201-nxo4yfa5
authors: Zeng, Hong-Li; Liu, Yue; Dichio, Vito; Aurell, Erik
title: Temporal epistasis inference from more than 3,500,000 SARS-CoV-2 Genomic Sequences
date: 2021-12-24
journal: nan
DOI: nan
sha: 4864e1b80f72e64826fc2e8ce5a1dea4ebcede6d
doc_id: 213201
cord_uid: nxo4yfa5

We use Direct Coupling Analysis to determine epistatic interactions between loci of variability of the SARS-CoV-2 virus, segmenting genomes by month of sampling. We use full-length, high-quality genomes from the GISAID repository up to October 2021, in total over 3,500,000 genomes. We find that DCA terms are more stable over time than correlations, but nevertheless change over time as mutations disappear from the global population or reach fixation. We discuss the validity of a DCA analysis under these conditions in terms of a transient Quasi-Linkage Equilibrium state. We identify putative epistatic interaction mutations involving loci in Spike.

The ongoing global pandemic of the disease COVID-19 caused by coronavirus SARS-CoV-2 has so far led to more than 270 million confirmed cases and more than 5 million deaths [1] . Efforts to counter the epidemic have included extensive use of Non-Pharmaceutical Interventions (NPI) [2] [3] [4] [5] [6] , and the development of several vaccines [7, 8] . To date more than 8,3 billion vaccine doses have been administered world-wide [1] . Although drugs such as dexamethasone that lower the fatality rate are now in wide use, effective anti-viral drugs which would open an another frontline in the pandemic are so far lacking [9, 10] .

Both vaccines and antiviral drugs are based on an understanding of the biology of the pathogen, its strengths and potential weaknesses [11] [12] [13] . The COVID pandemic is the first to have occurred after massive DNA sequencing became a commodity service. The number of SARS-CoV-2 genomes publicly available in data repositories is many orders of magnitude larger than ever seen in the past. While disparities in sampling and other sources of bias are serious issues [14] , such large amounts of data should nevertheless be marshaled in support of the common good to the fullest extent possible. In this work we have relied on a full-length high-quality SARS-CoV-2 genome sequences from the GISAID repository [15] with sampling date up to October 2021: in all more than three and half million viral genomes. These are the virtually exact genetic blueprints of actual viruses infecting actual persons in more than one percent of the confirmed cases world-wide. That such quasi-real-time monitoring is at all possible is a staggering achievement. We are likely only the beginning of the process of understanding what information that can be unlocked from such vast yet extremely rich and precise data [16] .

The focus on this work is on epistasis, synergistic or antagonistic contributions to fitness from allele variations at two or more loci. Long-standing theoretical arguments predict that the distribution of genotypes in a population directly reflect such multi-loci fitness terms when recombination is the dominant force of evolution [17] [18] [19] . Coronaviruses in general exhibit recombination due to their mode of RNA replication [20] [21] [22] [23] , and recombination has been observed between different strains of SARS-CoV-2 co-infecting the same human host [24] [25] [26] [27] [28] . While partly conflicting reports have appeared in the literature as to the impact of recombination on the total SARS-CoV-2 population [29, 30] , it is also the case that recent theoretical advances have shown similar correspondences also when mutation is the dominant force of evolution, provided some recombination is present [31] . If both mutation and recombination are slower (weaker) processes than selection, there is on the other hand no simple relation between epistatic contributions to fitness and variability in the population [32] [33] [34] and the approach taken here does not apply. In this work we assume that the latter scenario does not pertain.

In an earlier contribution we inferred epistatic interactions from about 50,000 SARS-CoV-2 genome sequences deposited in GISAID until August 2020 [35] . A slightly later contribution used about 130,000 sequences available until October 19, 2020 , and reached largely consistent results [36] . An important aspect of both analyses was to separate linkage disequilibrium (LD) due to epistasis (the objective of the studies) and LD due to phylogeny (a confounder). In [35] interactions imputed to phylogeny were separated out by a randomized null model procedure [37] , while [36] leveraged GISAID metadata (sample geographic position) and assignment of samples to clades. In this work we used almost two orders of magnitude more data. This necessitated a different approach, as will be described below. Signs of epistasis in large-scale SARS-CoV-2 data was also recently investigated by other methods in [38, 39] , the authors of which found limited amounts at the RBD surface of Spike. This is consistent with the results in [35, 36] where epistasis was mostly detected between loci outside Spike.

In this work we stratify genome sequences as to sampling data. We collect all sequences sampled in the same month since the beginning of the pandemic, and analyze epistasis month-by-month. In contrast to the earlier analysis we find in the new larger data several mutations in Spike that appear epistatically linked to other mutations in Spike and outside Spike. Among the highest-ranked such predictions we single out S:S112L (21897), recently associated to vaccine breakthrough infections [40] .

The input data for this analysis are the genomic sequences of SARS-CoV-2 as stored in the GISAID [15] public repository (high quality and full lengths, number of bps ∼ 30, 000). The sequences were sorted by collection date -the typical delay with respect to their appearance on GISAID being > 2 weeks -and grouped on a monthly basis until the end of October 2021. Considering the small number of sequences available for the first months after the outbreak in the 2020, data until the end of March 2020 are grouped together. In total, we hence have 20 datasets and 3,532,252 sequences. The number of collected sequences per month is shown in Fig. 1 : there number of instances markedly increments towards the 2021, slightly decreases in the first half on the 2021, and then increases again starting from July/August 2021.

Multiple Sequence Alignments were constructed exploiting the help of the online tool MAFFT [41, 42] . Groups of sequences relative to each month are aligned separately with respect to the reference sequence "Wuhan-Hu-1" -GenBank accession number NC-045512 [43] . Note that this is different from what previously done in [35] , where a pre-aligned MSA was used to lighten the computational burden. The resulting MSAs are given as Supplementary Information (SI) Dataset S2, and are also available on the Github repository [44] . Each MSA is a matrix σ = {σ n i |i = 1, ..., L, n = 1, ..., N }, where the N rows represent genomic sequences and the L columns stand for the genomic positions [45, 46] . Here L = 29, 903 is the length of the reference sequence, while N by construction varies from month to month, see Fig. 1 . Each entry σ n i of the MSA σ is either one of the 4 nucleotides (A,C,G,T), or "unknown nucleotide" (N), or the alignment gap '-' (introduced to account for nucleotide deletions or insertions), or some minorities "KYF..."

For the MSA filtering, we follow the methods already employed in [35] . As a first step, the minorities "KYF..." are merged with the 'N', so that there are 6 states "-,N,A,C,G,T" left, represented as "0,1,2,3,4,5" respectively. Subsequently, the following criteria are used to filter all the 18 MSAs. Each locus (column) in each MSA is discarded if one of the following two condition is matched: (1) if one same nucleotide is found with a frequency greater than a given p; (2) if the sum of the frequencies of A, C, G and T at this position is less than 0.2. As an example, in Fig. 2 the number of survived loci is shown for each dataset and for p = 0.98.

The QLE state for a genomic population was found by M. Kimura in a study of the steady-state distribution over two bi-allelic loci evolving under selection, mutation and recombination in presence of both additive and epistatic contributions to fitness [17] . A global QLE state over many loci was reviewed and investigated in [18] and the generalization to the case of more than two alleles per locus was considered in [47] . In a static QLE state, the covariance of alleles at each pair of loci is a small but non-zero quantity. In presence of pairwise epistasis and sufficiently high rate of recombination, the steady-state distribution takes the Gibbs-Boltzmann form

with

In a QLE state, there is a direct relation between the parameters J ij (σ i , σ j ) describing statistic dependencies in the distribution over genotypes in the population, and epistasis between loci i and j which give rise to these dependencies. The relation has been derived for with bi-allelic and multi-allelic loci, and if either recombination or mutation is the fastest process [18, 31, 47] and have been verified in silico [19, 31] (in simulations). The relation between correlations and epistasis is on the other hand indirect as it goes through the relation between correlations and parameters J ij (σ i , σ j ) in probability distributions of this type; a problem variously called "parameter inference in models in exponential families" [48] , or "Direct Coupling Analysis" [49] or "inverse Ising problem" [50, 51] . In this work our focus is on the parameters J ij (σ i , σ j ) themselves.

A QLE state can prevail over a finite time in the sense that correlations and DCA terms J ij in (2) inferred from a temporal snapshot of the population remain stable, while single-locus frequencies change. The mechanism behind such an effect is time-constant epistatic fitness parameters (f ij ) coexisting with genetic drift and/or time-changing additive fitness parameters (f i ). One scenario when this occurs is when two weakly advantageous mutations appear at about the same time in a population, and then grow in frequency towards fixation. At the very beginning there is only one mutation present, and there is no variability on which epistatic effects to act. When both mutations are present but one is still at low prevalence, both correlation and DCA analysis will give non-zero but noisy output due to small sample size.

The equations satisfied by single-locus frequencies and two-locus joint frequencies in a finite population were derived in [18] (Eqs. 36 and 37) under a diffusion approximation and for an Ising genome model (two alleles per locus). This approximation is valid when both allele frequencies at both loci are significant, i.e. none is close to zero or to one (fixation). The equations take the forṁ

where χ i = σ i and χ ij = σ i σ j − χ i χ j are signed frequencies and correlations in physical notation, f i and f ij are additive and epistatic fitness parameters, µ is mutation rate, r overall recombination rate, c ij is a measure of closeness of loci i and j, F j is the expected additive fitness component in the population andζ i andζ ij genetic drift noise terms. f i , and as a result F i , can be time-dependent. It is readily seen that (3) and (4) are qualitatively different. The first equation describes a process driven by noise and (1 − χ 2 i )F i − 2µχ i , modulated, if there are non-zero correlations in the population, by j =i χ ij F j . Depending on the sign of the net drift, it will hence tend to drive χ i towards ±1 (fixation or elimination of the mutation). The second equation on the other hand has vanishing drift whenever the expression in the bracket vanishes. It can be checked that without the small field assumptions used in [18] when deriving (3) and (4), and stated in terms of the (time-dependent) Ising model parameters, this vanishing of the bracket corresponds to J ij = fij rcij , and that this is a stable equilibrium ( [18] , Eq. 25). There can thus be a transient QLE phase where single-locus frequencies may go up in a fluctuating manner for a fairly long time, while J ij and two-mode frequencies remain steady because governed by a relaxation dynamics. An extension of the above to the case where the fastest process is mutations and not recombination can be found in [31] .

Towards the end when one (or both) mutation are close to fixation both correlation and DCA analysis will give non-zero but noisy output due to small sample size, and (3) and (4) are no longer valid. At the very end when there is only one mutation left, there is again no variability on which epistatic effects to act and correlation or DCA analysis applied to the data will again yield nothing.

We computed correlations as a measure of linkage disequilibrium (LD), statistical dependency between distributions over alleles at pairs of loci. For multi-allele distributions, statistical co-variance matrices are defined as

where 1 σi,a = 1 if σ i = a and zero otherwise, and where · · · indicates the average over q different alleles per locus. As discussed above, in our representation of the GISAID data, q is 6. We compute correlations, a score of LD, as Frobenius norms over the inner indices a, b of the statistical co-variance matrices

Correlation analysis may differ from statistical dependency through (1) -(2) because two loci i and j may be correlated even if their direct interaction J ij is zero, provided they both interact with a third locus k. Many techniques have been developed to infer the direct couplings in Eq. (1), see [51] and references therein. In this work we have used the Pseudo-Likelihood Maximization (plmDCA) method [47, 50, [52] [53] [54] [55] to estimate the parameters J ij . The basic idea of plmDCA is to substitute maximum-likelihood inference of parameters from the joint distribution (1) by the simpler one of estimating which parameters best match the conditional probabilities

Here q = {0, 1, 2, 3, 4, 5} the possible states of σ i in the data set and σ \i stands for all the loci of variability on the genome except locus i. Assuming independent samples, the functions to optimize (one for each locus of variability) are

where s labels the sequences (samples), from 1 to N . We use the asymmetric version of plmDCA [55] as implemented in [56] with l 2 regularization with penalty parameter λ = 0.1. The inferred epistatic interaction between loci i and j was scored by the Frobenius norm over the inner indices a, b as in (6) , and as implemented in [56] .

Statistical dependency between allele distributions at two loci can arise from epistatic contributions in QLD, and from inheritance. In the second case recombination rate and mutation rate can be small compared to fitness. However, the global distribution of genotypes is then not of the Boltzmann form (1), but instead reflects mixtures of clones [33] .

All data from which one wishes to infer epistasis from LD to some extent contain such a combination of the actual effect, and of phylogeny. In particular, when recombination acts approximately in the same manner throughout a genome, LD due to phylogeny should dominate between pairs of loci that are close, while LD due to epistasis can dominate between pairs of loci that are distant. In earlier studies on whole-genome data from bacterial pathogens, a distance cut-off was therefore employed [57, 58] as well as in previous work on SARS-CoV-2 data [35, 36] . The effect of phylogenetic correlations in DCA-based contact prediction in proteins was recently investigated in [37] .

In the current work we have leveraged the well-documented growth of large clones in the global SARS-CoV-2 population. We have hence attributed pairs of loci with large scores J ij as an effect of phylogeny when one or both appear as mutations in the three Variants of Concern 'alpha' [59] , 'beta' [60, 61] or 'delta' [62] . The corresponding tables of mutations and time course of mutation frequencies were recently reported in [63] .

We considered the fraction of retained links R over n. Within the nth highest ranked links, k links are removed either because the terminals are located in the non-coding region or the distance between two terminals of links are too close or the terminals are in alpha, beta or delta variants. Then the fraction of retained links is

We applied this to the inferred links by plmDCA and correlation analysis respectively. As shown in Fig. 3 , the curve for retained pairs defined by plmDCA lies well above the curve defined by correlations for small values of rank n.

A. The number of sequences sampled per month has increased during the pandemic Fig. 1 shows the number of whole-genome high-quality SARS-CoV-2 sequences deposited in GISAID and stratified by month. With some irregularity this number has grown exponentially since the summer of 2020, and is now around half a million SARS-CoV-2 genomes per month.

The relevant diversity in epsistatic analysis of many genomes of overall high similarity is how many loci show variability. A locus is classified as variable if less than a fraction p of sequences carry another allele than the major (most frequent) allele. The threshold p is a free parameter, subject to the condition (1 − p)·(number of sequences) 1 (expected number of all minor alleles is greater than one). Fig. 2 shows this number for threshold p = 98%. Total diversity decreased in the beginning of the pandemic until early fall of 2020, and has since increased from about 500 loci to about 1,200 loci. In the same time period the number of sequences increased tenfold (Fig. 1) ; the diversity per sequence has hence decreased. Similar results hold for other choices of p equals 90% and 99, 9% are discussed further in Appendix A.

C. Leading correlations can mostly be explained by the growth of SARS-CoV-2 Variants-of-Concern 'alpha', 'beta' and 'delta'

A simple way to visualize if ranked effects are due to one out of several factors is to plot the enrichment of the factor of interest as function of rank. A standard procedure in DCA analysis of tables of homologous protein sequences is indeed top-k plots illustrating the fraction of k highest rank predictions which correspond to spatially proximate residue pairs. This is given by eq. (9) . Similarly, in Fig. 3 we plot for two representative months the fraction of highest ranked correlations and DCA terms (plm scores) where no terminal is in a non-coding region, no pair of terminals are close, and no terminal appear in the list of mutations of Variants of Concern 'alpha' [59] , 'beta' [60, 61] and 'delta' [62] . Both panels show that leading correlations can mostly be explained by variations VoCs alpha, beta and delta. Leading DCA terms on the other hand contain much fewer pairs where one or both terminals appear in the 

In earlier work we showed that a subset of the mutations in alpha, beta and delta have different dynamics than would be expected from a clone growing de novo [63] . In Fig. 4 we show that the same holds for the newly characterized variant omicron [64] . The red-dot trajectory shows the frequency of a nucleotide substitution at position 21846, which is T95I in Spike. This mutation in the N-terminal domain of S1 rose quickly in frequency from May-21 to June-21 in GISAID sequences, and has since been found in about half of the samples. Its prevalence can hence not be explained by the rise of omicron, which came later. Indeed, T95I was among the most common in the variant B.1.526 (Iota) which spread mainly in USA in late 2020 and early 2021 [65] , and has been observed in strains classified as VoC delta circulating in France [66] , and in UK and Germany [67] . It is therefore an example of a mutation which though classified as a defining mutation of one strain of the virus, in fact is more widely spread, and can be found also in other strains. The other curves in Fig. 4 with a dynamics different from omicron can mostly be explained as also belonging to VoCs alpha and delta, and are discussed in caption to the table.

One novelty of the analysis presented in this work is that the data is much larger in previous contributions [35, 36] , and that it has been stratified as to sampling time. A second novelty is that phylogenetic confounders have been eliminated by excluding inferred links where one or both loci appear in the 'alpha' [59] , 'beta' [60, 61] or 'delta' [62] variants of SARS-CoV-2. Fig. 5 hence displays the ranks of leading retained epistatic interactions as function of sampling time, together with the ranks of the correlations between the same pairs of loci. One feature that stands out on this graph is that as long as they appear in the data, both types of ranks appear fairly stable, but the ranks of correlations is far lower. A second feature is that none of the interactions appear throughout the period, but only in some time window. Outside this window, the frequency of the major allele of one or both loci in a pair rises above the threshold p, and the pair hence does not appear in the analysis. We can therefore at best have a transient QLE phase, for definitions, see subsection III A and III B.

In an earlier study using data up to August 2020, only one large plmDCA score involving a locus in Spike was detected, at genomic position 23403 [35] (Table 1) . This G → A substitution was detected by a phylogenetic random- Links with one or both terminals located in non-coding region or of distance between the terminals less than 3 nucleotides or one or both terminals listed for the SARS-CoV-2 VoCs 'alpha', 'beta' or 'delta' are not retained. The data of which this figure is an illustration is given in Datasets 2 mentioned in appendix E. ization procedure [35] (Table 2) , and therefore not retained as a predicted epistatic link in that study. In fact, this substitution corresponds to the well-known mutation Spike D614G which grew in frequency in the early phase of the pandemic.

In the larger month-by-month data we find both persistent plmDCA links with terminals in Spike which correspond to variants alpha, beta, delta and also the recent omicron [64] , as well as plmDCA links with terminals which do not appear in any large variant. Table I gives for the months August-October 2021 the pairs where both terminals are in  Spike while Table II gives the pairs where but one of the terminals are in Spike; corresponding data for all months is mentioned in Appendix E Dataset 4.

The largest inferred Spike-Spike interaction in all three months, not related to any mutation appearing in alpha, beta, delta or omicron, is between two synonymous mutations D574D and D1259D. Most of the other pairs in Table I either have one terminal listed in delta, or are somewhat close along the genome, within 35 bp, or involves a synonymous mutation. This includes the pair (21846, 24208) appearing in October, where the first locus is the mutation S:T95I, part of the definition of omicron and discussed above, while the second locus is S:I882I.

The first prediction appearing in Table II , in all three months, is the pair (17236, 24208) where the first locus is in gene nsp13 and the second is a synonymous mutation in Spike. This is fact the largest effect detected by the DCA analysis in all three months (rank 1 in Table II ).

The second prediction in Table II , ranked respectively 14, 13 and 10, is the pair (7851, 21846) where the first is the mutation A1711V in nsp3. In Table II the variation at locus 21846 (S:T95I) appears also together with nsp2:K81N in all three months, together with ORF8:P38P and ORF7a:V71I in August, and together with nsp6:A2V, N:S327L, nsp13:I334V, nsp12:S837S, ORF3a:E239, N:Q9L and ORF7a:G38G in October. The mutation nsp3:1711V was a defining mutation of Variant of Interest labelled N.9 discovered in Brazil in 2020 [68] . The mutation nsp2:K81N has been detected in variants of VoC delta circulating in Russia [69] .

The third prediction in Table II , ranked respectively 20, 16 and 17, is the pair (28461, 24410) where the first is the mutation G63D in N, and the second is N950D in Spike. N:G63D is a defining mutation of VoC delta while S:N950D is a reverse mutation of delta defining mutation D950N, identified as such in a recent study [67] .

The next two predictions which appear in all three months and which do not involve any of the above or any variants both involve locus 21897 (S:S112L) with partners respectively 26107 (ORF3a:E239Q) at ranks 52, 57 and 60, and 27507 (ORF7a:G38G) at ranks 57, 71 and 74. Spike mutation S112L was recently associated to vaccination breakthrough infections in New York City [40] . That study also identified that genes ORF3a (56%) and ORF8 (67%) had higher numbers of sites with enriched mutations in breakthrough sequences. ORF3a mutation E239Q is located at the protein C-terminal, and appears in the sub-variant of VoC delta variously labelled AY. 25 

In this work we have applied the Direct Coupling Analysis (DCA) methodology [49, 51, 54, 55, 70, 71] to identify putative epistatic interactions between pairs of loci in the SARS-CoV-2 virus. We have described the rationale for such an approach based on the Quasi-Linkage Equilibrium (QLE) mechanism of Kimura [17, 18] , which we have recently combined with DCA in in silico validation [19, 31, 47] . As part of the world-wide effort to combat the COVID19 epidemic an unprecedented number of genomes of the disease agent have been obtained and released through open repositories. In this study we have thus been able to use more than three and a half million full-length high-quality SARS-CoV-2 genomes from GISAID deposited until October 2021 [15] . Such very large, quasi-exact and easily accessible data resources will very likely be the norm in future pandemics. Methods to turn them into actionable information in new ways are therefore of high relevance. Except for the more that order of magnitude larger data size, the main methodological novelty in this study has been to separate genomes as to sampling date (by month). We have hence been able to carry out a temporal epistasis inference, to the best of our knowledge for the first time.

Our main finding is that the leading terms identified by DCA are more stable over time than correlations. This is an argument in favor of the global SARS-CoV population being approximately in a state of QLE, as would be expected from the substantial rate of recombination characteristic of coronaviruses [20] and the sometimes high rate of circulating infections in the human population world-wide. This finding however comes with a caveat: DCA analysis (and correlation analysis) is necessarily based on observed variability which disappears if an allele at a locus is lost. This is indeed also what we find. The stability of DCA terms therefore only pertain for the time window when the mutations at both terminals appear in a significant proportion of the samples. Few of the epistatic interactions found in two earlier studies [35, 36] are in fact found in the later data, as one or both of the corresponding mutations have either since been lost or reached fixation. We therefore refer to the mode of analysis introduced here as temporal epistasis inference. The potential practical relevance is that epistasis can be detected in a transient phase, and then used as input to further analysis at a later time, when variations at one or both terminals will have disappeared, and epistasis can no longer be detected from the sequences present in the population.

The main success story of DCA applied to biological data has been to predict spatial residue-residue contacts in proteins [45] . In that important application accuracy of predictions can be assessed by comparing to distance data in resolved protein structures. Spatial proximity is the main mechanism behind and a relevant proxy for epistasis within one gene (one protein). It is a general feature of DCA that the accuracy is generally highest for the largest predictions, typically visualized through plots of the True Prediction Rate of k'th largest predictions (T P R(k)) [45] . On the global genome scale labelled test data of the same kind is not available, and evaluation will necessarily be in terms of potential biological or medical relevance, compared to literature, or other data. In an earlier study based on around 3,000 full-length genomes of the bacterial pathogen Streptococcus pneumoniae we hence found as main terms epistatic interactions between loci in the PBP family of proteins central to antibiotic resistance in the pneumococcus; analogous results have also been found for the gonococcus [58] .

For the SARS-CoV-2 data used here we have limited ourselves to a discussion of the top-200 predictions per month that are also stable in rank over the last three months of data (August-October 2021), and which involve loci in Spike. We find several DCA terms associated to Variants of Concern delta and omicron, which we in the case of delta attribute to a phylogenetic effect. For a recent discussion of phylogeny as a confounder of DCA we refer to [37] . The most prominent of the mutations in omicron is S:T95I at genomic position 21846. Although a defining mutation for this VoC, it was actually found in approximately half of the genomes collected world-wide in the time period August-October 2021. The inferred epistatic interactions between S:T95I and loci in other genes are hence examples detectable now, but which will soon not be, if the omicron variant will take over fully. Among the remaining predictions we surmise that the most interesting are two epistatic interactions involving Spike mutation S112L, recently shown to be associated to vaccination breakthrough infections [40] . One of its interaction partners is mutation ORF3a:E239Q where ORF3a is a cation channel protein unique to the coronavirus family [72] and known to be involved in inflammation of lung tissue and severe disease outcomes [73] [74] [75] . In the earlier study [35] several other mutations in ORF3a appeared prominently; in this study a new one does so together with a mutation in Spike. We computed the frequencies of nucleotides along each locus/column in the MSA matrix. If the frequency of any of the nucleotides is larger than the given value of p, this locus will be excluded in the following epistasis analysis. To complement the Fig. 2 in the main context, we present the plots for the number of retained loci with different values of p here in Fig. 6 .The upper panel is for p = 0.9 while the bottom one for p = 0.999 respectively.

This appendix shows the fraction of retained epistatic links for plmDCA and correlation analysis as a function of the top-k links considered. Data for the months Oct 2020 and Oct 2021 are also shown in Fig. 3 of the main context. For the highest ranks, plmDCA gives a greater fraction of true epistatic predictions with respect to correlation analysis. Here links are excluded if a) one or both terminals fall in a non-coding region, b) the distance between the terminals is ≤ 3 bps, c) one or both terminals appear in the list of mutations of the Variants of Concern 'alpha' [59] , 'beta' [61] and 'delta' [62]. In the following analysis, loci (MSA columns) where any of the nucleotides is found with a frequency greater than p will be excluded.

In addition to the stability analysis as shown in Fig. 5 of the main body of the paper, we present in Fig. 8 for other cases. In the upper panel, some of the epistatic links within top-100 plmDCAs are picked and shown in solid lines. The marked dashed lines are for ranks of the counterparts from correlation analysis. Differently from Fig. 5 in the main body, we here include links with one or both terminals in loci defining any of the variants. In the other two panels, we first select some correlations from the top-2000s (marked-dashed lines), then we plot the corresponding ranks in plmDCA scores with solid lines. In the middle panel links related to variants alpha, beta, delta (in the sense above) are excluded, in the bottom panel they are instead included.

For each monthly dataset, three different p values are employed for filtering the loci. If the percentage of a same major nucleotide along a column is larger than the given value (say, 0.93, 0.95 and 0.98 respectively), then the column is discarded in the following DCA analysis.

With plmDCA analysis, each pair of retained loci gets a score, which is related to the the epistasis between them. The pair-wise epistatic links can be sorted by ranking the scores. Here, we plot the top-200 epistasis with the "Circos" software. Only those located within the coding region are shown. The short links i.e., those with distances between two terminals less than 4bps, are not included. Color links are for those within top 50 ranks. Red ones for short links while blue ones for the long links. The other greys are those within rank 51 to 200. 

All datasets listed below are available on github [44] .

• Dataset1-Accession IDs.xlsx The Accession IDs for the genomic sequences we used in the analysis. The prefix of each sequence "EPI ISL " is excluded to decrease the file size. This dataset is cut into two separate files further to satisfy the limitation of file size on Github, names as "Dataset1-1-Mar-2020-May-2021-Accession IDs.xlsx" and "Dataset1-2-Jun2021-Oct-2021-Accession IDs.xlsx" respectively on Github.

• Dataset2-p0.98 plm Top200 No 3variants.xlsx This dataset contains selected links in top 200 plmDCA epistasic pairs, as ranked by their score. The plmDCA links shown in Fig. 3 in the main text are based on this dataset. Here, the links located in the non-coding region and with close locus (≤ 5) and any loci included in • Dataset5-protein aa mut for links in Dataset4.xlsx We provide the links within top 200s plmDCA scores that containing Spike terminals, for each month. The short links with loci located within 5 bps are discarded.

Here, we also annotate the genes to which loci in the Dataset4 belong to and the corresponding amino acid mutations. The annotated genes and corresponding amino acid mutations in Table I and II in the main body of the manuscript are based on this dataset.

World Health Organization, Coronavirus disease (covid-19) pandemic (2020)

non-pharmaceutical interventions during the COVID-19 pandemic: a review

Controlling the SARS-CoV-2 outbreak, insights from large scale whole genome sequences generated across the world

Recombinant sars-cov-2 genomes are currently circulating at low levels, bioRxiv

Proc. Natl. Acad. Sci

Statistical genetics and direct coupling analysis beyond quasi-linkage equilibrium (2021)

Epistasis at the sars-cov-2 rbd interface and the propitiously boring implications for vaccine escape

hlzeng/Filtered MSA SARS CoV

Sars-cov-2 variants of concern and variants under investigation in england technical briefing 29

Investigation of SARS-CoV-2 variants of concern in England

Mutation frequency time series reveal complex mixtures of clones in the world-wide sars-cov-2 viral population

Investigation of novel SARS-CoV-2 Variant of Concern 202112/01

Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid nanodiscs, bioRxiv

alpha, beta, delta are excluded. Fig.8 . Similarly to its plm counterpart, links in coding region and the distance between loci is larger than 5bps are considered. No variants are included.• Dataset4-links with Spike locus or loci ranks.xlsx The epistasis provided by plmDCA and correlation analysis are included in this dataset for each month. Only those within top-200s, for which the distance between two terminals is > 5 loci and whose both terminals located in the coding region are listed in the dataset. The genomic positions provided in Table 1 and 2 in the main context are based on this dataset.