key: cord-0502754-gf249url authors: Ge, Xiaofei; You, Kaichao; Tan, Zeren; Hou, Hedong; Tian, Yang; Sun, Pei title: Self-organized critical dynamics of RNA virus evolution date: 2022-04-19 journal: nan DOI: nan sha: f8aba1f06b6ba1050f8fb03f664a720049ba2b30 doc_id: 502754 cord_uid: gf249url RNA virus (e.g., SARS-CoV-2) evolves in a complex manner. Studying RNA virus evolution is vital for understanding molecular evolution and medicine development. Scientists lack, however, general frameworks to characterize the dynamics of RNA virus evolution directly from empirical data and identify potential physical laws. To fill this gap, we present a theory to characterize the RNA virus evolution as a physical system with absorbing states and avalanche behaviors. This approach maps accessible biological data (e.g., phylogenetic tree and infection) to a general stochastic process of RNA virus infection and evolution, enabling researchers to verify potential self-organized criticality underlying RNA virus evolution. We apply our framework to SARS-CoV-2, the virus accounting for the global epidemic of COVID-19. We find that SARS-CoV-2 exhibits scale-invariant avalanches as mean-field theory predictions. The observed scaling relation, universal collapse, and slowly decaying auto-correlation suggest a self-organized critical dynamics of SARS-CoV-2 evolution. Interestingly, the lineages that emerge from critical evolution processes coincidentally match with threatening lineages of SARS-CoV-2 (e.g., the Delta virus). We anticipate our approach to be a general formalism to portray RNA virus evolution and help identify potential virus lineages to be concerned. With high mutation and replication capacities, RNA viruses exhibit intricate evolution [1, 2, 3] . Studying the complex evolution dynamics of RNA virus remains as an attractive challenge in biology, as RNA virus evolution is a frequent cause of new pathogens and may be a pivotal route in medicines [4, 5, 6] . Meanwhile, RNA virus interests physicians for its potential to serve as a representative model of long-term molecular evolution [7, 8] . Despite its apparent significance, the physics underlying RNA virus evolution remains elusive. While important progress has been accomplished in characterizing the dynamics of mutant virus population size (e.g., see Refs. [9, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] ), there remains another non-negligible aspect of evolution for exploration. From the biological perspective, the dynamics accountable for phylogenetic tree formation is informative for the global evolution properties [24, 25, 26, 27] . Compared with the evolution previously considered in the context of population dynamics [9, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] , this unexplored aspect may convey key information about the macroscopic laws of RNA virus evolution [24, 25, 26, 27 ] but yet remains unclear. Moreover, existing models either nontrivially handle multiple time scales with extreme differences in magnitude orders or are simplified by phenomenological parameters [23] . Although phenomenological models are more solvable on empirical data than non-trivial analytic ones, these models are powerful in imitating phenomena but limited in explaining mechanisms. Interpretations of phenomenological parameters with experiments should be prudently considered [23] . In sum, much unknown remains about the general frameworks that are applicable to identify potential physical laws of RNA virus evolution from empirical data. The objectives of the current research are twofold. We aim at exploring the global properties of RNA virus evolution by developing a physics theory of the dynamics underlying phylogenetic tree formation. The theory is demanded to map fundamental biological factors to general characterizations of system dynamics so as to guarantee its applicability on real data and universality to arbitrary RNA viruses. Applying this theory, we present a systematic analysis of SARS-CoV-2, the virus accounting for COVID-19 [28] . The global infection of SARS-CoV-2 provides us a unique opportunity with rich data (e.g., a sufficiently large phylogenetic tree) to validate our theory. Moreover, confirming the macroscopic characteristics of SARS-CoV-2 evolution may benefit to developing coping strategies of COVID-19. To make our analysis comprehensible, we begin with a fundamental perspective of RNA virus. In RNA virus populations, the replication implies the vanishing of the original RNA virus and the birth of new RNA viruses that can also replicate themselves [29] . The proliferation of RNA virus is bounded by time-dependent host cell resource λ (t) [1, 8, 30] . Specifically, the maximum cumulative mutation rate, sup ω ζ (ω), of an RNA virus population of initial size ω ∈ N + during a duration of [0, κ] is limited by λ (t) in the following way where σ is the mutation rate of single RNA virus, parameter τ denotes entrance and eclipse periods (assume τ is divisible by κ), and φ 10 3 denotes burst size (the number of viruses produced by a host cell) [1, 8, 30] . Here sup ω ζ (ω) can reach the maximum cumulative mutation rate of an RNA virus population under ideal conditions when host cell resources are ample In summary, host cell resource λ (t) intrinsically determines the cumulative daily mutation rate of a RNA virus population. Instead of proposing detailed definitions of λ (t) as previous studies (e.g., define life circles of cells) [13, 15, 17, 19, 20, 22, 23] , we leave the characterization of λ (t) as a task of epidemic and observe the evolution dynamics of RNA virus controlled by λ (t) directly from empirical data. Certainly, one can not expect to count λ (t) in a real epidemic. This parameter can be reasonably replaced by the number of human hosts. Assuming that a lineage l i of RNA virus emerges at moment t and one of its progeny lineage l j is born at moment t , we can measure the duration length of evolution as T = t − t . Meanwhile, we define S (T ) as the number of cumulative confirmed patients with lineage l i during [t , t ] and denote the time-dependent number of confirmed patients with lineage l i by S (t | T ) (here t ∈ [t , t ]). The terminology "lineage" can be generally understood as a kind of definition of virus sub-type (e.g., see examples in Ref. [31] ). Viruses of lineage l i need to accumulate sufficient mutations to make lineage l j emerge. We can relate these concepts with a branching process concerning the number of copies of l i during [t , t ], that is, a process where a patient with lineage l i infects multiple people before the virus mutates to lineage l j during replication. The realization of this branching process, from the initialization t to the termination t after which a new lineage l j emerges, is referred to as an avalanche. The size and life time of avalanche are exactly S (T ) and T . This branching process is slightly different from the standard one, where the termination is defined as a moment after which the population size (e.g., the number of patients with lineage l i ) remains 0 indefinitely [32, 33] . We can further relate the branching process of evolution with directed percolation, a universality class exhibiting a phase transition separating an absorbing state from an active state [32, 34, 35] . In directed percolation (e.g., the contact process [36] ), sites (e.g., people) can be either active (infected) or inactive (healthy). With different balance between infection and recovery, the infection process may propagate over the system or vanish gradually. If infection vanishes, the system is trapped in a completely inactive state, the so-called absorbing state. Directed percolation is a non-equilibrium process since detailed balance breaks in the absorbing state (it can be reached but not be escaped) [37] . We are interested in potential self-organized criticality (SOC) due to its pervasiveness in biological systems (e.g., neural avalanches in the brain [38, 39] ). The self-organization to criticality distinguishes SOC (e.g., Refs. [40, 41] ) from ordinary critical phenomena [37, 42] . At criticality, the system jumps between absorbing configurations by avalanches [37, 42] . From the biological perspective, this property implies an unpredictable and unpreventable evolution of RNA virus. Numerous mean field theories of directed percolation have been proposed to determine avalanche exponents to study the scale-invariance underlying P T and P S , the probability distributions of T and S (e.g., see Ref. [43] ). Note that S is an abbreviation of S (T ). In Appendix B, we re-calculate avalanche exponents following a similar idea of Refs. [44, 32, 45] . Our results show that consistent with other mean field theory predictions of self-organized criticality [43, 44] . To verify whether the system is at criticality, we can consider a scaling relation [46, 47, 37, 42] Inserting α = 2 and β = 3 2 into Eq. (4), we derive γ = 2. Apart from that, more precise verification of criticality can be implemented according to the collapse shape [47, 39] and the slow and exponential decay of auto-correlation [39] . Please see Appendix C for details of these approaches. With host cell resource λ (t) as a clue, we have derived a directed percolation theory to characterize RNA virus evolution as a system with absorbing states and avalanche behaviors. A series of approaches are presented to verify the existence of self-organized criticality. Applying this theory, we aim at presenting a systematic analysis of SARS-CoV-2, the virus that causes COVID-19 [28] . We use an open-source database [48] to acquire the necessary data. The criterion of lineage definition can be seen in [31] . Our data includes all lineages that emerge before 27 December 2021. In the SARS-CoV-2 database [48] , the probability for a patient to be infected with each lineage of virus is estimated with confidence intervals. By multiplying the midpoint of confidence interval with the total number of worldwide patients recorded by the World Health Organization (WHO) [49] , we can estimate the accumulative and daily numbers of confirmed patients with each lineage of virus, corresponding to S (T ) and S (t | T ) of each lineage (see Fig. 1a ). Similarly, we can estimate the upper and lower bounds of S (T ) and S (t | T ) based on the upper and lower bounds of confidence interval. In Fig. 1b , we show the probability distributions of life time T and avalanche size S (for midpoint and bounds). In real cases, the power-law may only hold on specific tails {P (X) |X ≥ X } of the empirical distribution of variable X, where X denotes a certain distribution cutoff [50] . We apply the approach in Ref. [51] to estimate cutoffs T , S , S − (lower bound), and S + (upper bound) as the points where the Kolmogorov-Smirnov (KS) statistic η reduces to a sufficiently small threshold at the first time (Fig. 1c) . Details of cutoff estimation can be seen in Appendix D. Then, a maximum likelihood estimation of power-law exponent [51] is implemented on the samples above cutoffs to find ideal exponents that maximize the likelihood L (Fig. 1c) [51] . In our results, exponents are estimated as α = 1.69 (υ = 7.13%), β = 1.32 (υ = −4.53%), β − = 1.32 (υ = −5.06%), and β + = 1.34 (υ = −14.88%), where υ < 10% is a reasonable standard for the goodness of ideal estimations. Our estimated power-law models are shown in Fig. 1d . Please see Appendix D for details of power-law exponent estimation. Meanwhile, we implement least square fitting on S (T ) vs. T to derive γ + = 1.8431 (R 2 = 0.9112), γ = 1.9363 (R 2 = 0.8871), and γ − = 1.9944 (R 2 = 0.8652) in Fig. 1b . Together with our estimated avalanche exponents α and β, these results coincide with the scaling relation in Eq. (4), suggesting potential criticality. To analyze criticality more precisely, we calculate collapse shape following Refs. [47, 39] . Please see details of collapse analysis in Appendix C. To ensure the robustness of our results, we further verify if the estimators of distribution cutoffs for T and S affects collapse shape. In Fig. 2a Fig. 2a) . By increasing S, we can simultaneously increase the distribution cutoff of T . Our results suggest that the trend of V (t | T ) is principally independent of S. Meanwhile, we notice that V (t | T ) exhibits a mixture of global increasing trend and parabolic trend. The increasing trends may reflect the intrinsic property of the raw data. Therefore, we recalculate S by multiplying the midpoint of confidence interval [48] with the average total number of worldwide patients in the WHO database [49] for detrending. In Fig. 2a , V (t | T ) plausibly exhibits a parabolic trend in the detrended data. Meanwhile, we show several instances of V (t | T ) vs. t T in Fig. 2a . In Fig. 2b , the parabolic trend of V (t | T ) is verified quantitatively by quadratic polynomial fitting. We show the average quadratic coefficient µ (averaged across avalanches under each condition of S) as a function of S. Consistent with Fig. 2a , µ < 0 holds for every S in both original and detrended data, suggesting plausible parabolic trends of V (t | T ) . The probability distributions of µ and the goodness of fitting, R 2 , are also shown. In Fig. 2c , we show the auto-correlation (covariance) Cov (S (t i | T ) , S (t j | T )) under each condition of S, where we select t i ∈ {0.3T, 0.5T } as instances. Please see Appendix C for details. In Fig. 2d , the corresponding decays of auto-correlation are fitted to derive the decay rate ξ. In the raw data, we obtain ξ = −0.5407 (averaged R 2 = 0.6301) for t i = 0.3T and ξ = −0.5159 (averaged R 2 = 0.6324) for t i = 0.5T . In the detrended data, we find that ξ = −1.6679 (averaged R 2 = 0.6301) for t i = 0.3T and ξ = −1.4251 (averaged R 2 = 0.6332) for t i = 0.5T . These results suggest exponential decays of auto-correlation with reasonable errors, offering a practical verification of potential criticality. Interestingly, our results coincide with threatening lineages of SARS-CoV-2 if we filter directed edges in its original phylogenetic tree (edges correspond to avalanches and indicate evolution paths of lineages) as following: an edge is removed unless its avalanche satisfies S ≥ S and T ≥ T (Fig. 3) . The filtered result, referred to as critical phylogenetic tree, is a sub-graph of the original one that includes only critical avalanches. Threatening lineages, such as B.1.167.2 (the Delta virus), B.1.1 (the parent of Alpha, Beta, and Omicron viruses), and B.1 (the parent of Epsilon, Eta, Iota, and Mu viruses), can be found by searching non-isolated lineages with high out-degree δ (number of offspring lineages) in the critical tree. Note that the out-degree in the original tree does not completely determine the possibility of being isolated in the critical tree. Numerous lineages with high out-degrees in the original tree still become isolated (e.g., 20.83%, Figure 3 : Original and critical phylogenetic trees. Edges are colored according to T of corresponding avalanches. 8.33%, and 9.09% of lineages with δ ≥ 5, δ ≥ 15, and δ ≥ 20 in the original tree become isolated after filtering). Following a clue that host cell (human host) resources shape RNA virus mutation and replication, we have naturally derived a theory to characterize RNA virus evolution as a system with absorbing states and avalanches. Our framework maps fundamental biological factors (e.g., phylogenetic tree and infection) to general theories of directed percolation and self-organized criticality, identifying potential physics laws of RNA virus evolution directly from accessible empirical data. Compared with the evolution characterized from the perspective of population dynamics [9, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] , our framework may be more applicable to analyzing the macroscopic characteristics of evolution dynamics underlying RNA virus phylogenetic tree formation [24, 25, 26, 27] . Applying this theory, we discover that SARS-CoV-2 evolution may proceed at selforganized criticality. Threatening lineages (e.g., the Delta virus) can be coincidentally found on the critical phylogenetic tree. These findings suggest the potential of our theory in analyzing RNA virus evolution and related phenomena. Although the present framework has been validated on SARS-CoV-2, its generalizability is suggested to be further verified on other RNA viruses. Moreover, future studies can also investigate the relation between the self-organized critical evolution process and the threatening degree of RNA virus. Hopefully, this direction may help identify potential virus lineages to be concerned from a physically-explainable perspective. If host cell resources are sufficient, RNA viruses may have an opportunity to replicate independently and freely. When these happen, a supremum of the cumulative mutation rate of an RNA virus population of initial size ω ∈ N + during a duration of [0, κ] (assume τ is divisible by κ) can be reached where notion · denotes the floor function. However, these ideal conditions are not realistic since host cell resources are always limited. Therefore, biologically sound models, including the discussed previous works [13, 15, 17, 19, 20, 22, 23] , of RNA virus population dynamics bound the proliferation of RNA virus by time-dependent host cell resource λ (t) (e.g., the number of susceptible cells) sup ω ζ (ω) = ωσ The difference sup ω ζ (ω) − sup ω ζ (ω) will be minimized when λ (t) ≥ φ t/τ − φ t/τ −1 . Consider a time-continuous infection process, where a random patient at moment t ∈ [t , t ] implies three possibilities: becoming absorbed with probability ρ, creating another patient with probability θ, or remaining effect-free with probability 1 − (ρ + θ). In critical states, we have ρ = θ [44] . We define A n (t) as the probability for n patients to exist at t * + t given that 1 patient exists at t * . Assuming the independence of patient emergence, we have Assume that A n (t), n ∈ N + admits a Maclaurin expansion A n (t) = a n t + o (t 2 ) (when n = 1) or A n (t) = a n t + 1 + o (t 2 ) (when n = 1) where a n = dA n (0) /dt, we can readily derive a 0 = a 2 = ρ and a 1 = −2ρ [44] . Meanwhile, we can know 1] denotes the generating function. Applying a trick introduced in Ref. [44] , we define which naturally leads to can be derived based on a 0 , a 1 , and a 2 [44] . Taken together, we have Note that the initial condition is G (x, 0) = x since one patient emerges at t * . Solving Eq. (B.7), we derive an analytic expression Therefore, we have A 0 (t) = G (0, t) = ρt ρt+1 , supporting a calculation of lifetime distribution P T (t) Following Refs. [44, 32, 45] , one can similarily calculate These exponents are consistent as the predictions of other mean field theories [43, 44] . Compared with the scaling relation [46, 47, 37, 42] , a more precise verification of criticality can be implemented by the collapse shape [47, 39] in which the expectation is averaged across different T . Notion F (·) denotes a universal scaling function. Notion V (t | T ) is defined as the average collapse shape of all avalanches with the same life time T where S (t | T ) denotes the time-dependent avalanche size during these avalanches. At criticality, all data of V (t | T ) should collapse onto F (·), a parabolic function, with reasonable errors [47, 39] . In some cases, function F (·) may contain both the parabolic component and a slight global trend (e.g., increasing or decreasing), making the shape not perfectly parabolic. De-trending can be used to deal with this issue during data pre-processing. Moreover, a more practical verification concerns the slow and exponential decay of auto-correlation [39] ln The probability distributions of life time T and avalanche size S (for midpoint and bounds) are derived after a fine-grained data binning process with 1000 bins is applied, which is useful in de-noising while controlling information loss. We discover that P T (·) does not follow power-law properties for all values of T (see Fig. 1b) . Therefore, distribution cutoff estimation is necessary for T and S. We apply the method introduced in Ref. [51] to estimate T , S , S − (lower bound), and S + (upper bound), which calculates the Kolmogorov-Smirnov (KS) statistic η between the observed cumulative probability distribution above cutoff and the cumulative probability distribution of a standard power-law variable. An ideal cutoff is expected to minimize η [51] . To control the reduction of sample size, we relatively relax the restriction and estimate cutoffs when η − 0.5 std (η) is reached at the first time, where std (·) denotes the standard deviation. Distribution cutoffs are estimated as T = 29, S = 13320, S − = 12760, and S + = 13980. The distribution tails above cutoffs cover 93.879%, 76.38%, 70.526%, and 91.683% of original samples, supporting robust estimations of power-law exponents. A maximum likelihood estimation of power-law exponent [51] is implemented on the samples above cutoffs. An ideal exponent can maximize the likelihood L [51] . After estimating power-law exponents, we carry out a KS-statistic-based analysis to derive the goodness of estimation. To calculate the average KS statistic η * between sample distributions and estimated models, we first generate 1000 sample distributions (each sample distribution contains n ∈ [500, 5000] samples) of estimated power-law models. Then, we define υ = η−η * η * to reflect the goodness of estimation, where η is the KS statistic between the cumulative probability distributions of estimated power-law models and empirical distributions above cutoffs. We suggest that υ < 10% can be a reasonable standard for ideal estimations. Virus dynamics: mathematical principles of immunology and virology: mathematical principles of immunology and virology The origins of order: Self-organization and selection in evolution Multi-scale problem in the model of rna virus evolution Rules for the designation and naming of pango lineages URL The theory of branching processes Advances in physics Network P 2021 Pango network lineage database of sars-cov-2 URL Correspondence should be addressed to Y.T. and P.S. Authors Z.R.T. and H.D.H. contribute equally to this work. This project is supported by the Artificial and General Intelligence Research Program of Guo Qiang Research Institute at Tsinghua University (2020GQG1017) as well as the Tsinghua University Initiative Scientific Research Program. With reasonable errors, the mutation rate σ of a single RNA virus is typicallywhere χ ∈ N + measures the number of nucleotides, parameter τ denotes entrance and eclipse periods, and φ 10 3 denotes burst size (the number of viruses produced by a host cell) [1, 8, 30] . These parameters are principally constant for a given RNA virus. Note that the original RNA virus vanishes after replication [29] .