key: cord-1015996-83v1032q
authors: Lee, Geon; Yoon, Se-eun; Shin, Kijung
title: Simple epidemic models with segmentation can be better than complex ones
date: 2022-01-12
journal: PLoS One
DOI: 10.1371/journal.pone.0262244
sha: 2cbbe6df86c178c51d08ec9e26c30bdef21186dd
doc_id: 1015996
cord_uid: 83v1032q

Given a sequence of epidemic events, can a single epidemic model capture its dynamics during the entire period? How should we divide the sequence into segments to better capture the dynamics? Throughout human history, infectious diseases (e.g., the Black Death and COVID-19) have been serious threats. Consequently, understanding and forecasting the evolving patterns of epidemic events are critical for prevention and decision making. To this end, epidemic models based on ordinary differential equations (ODEs), which effectively describe dynamic systems in many fields, have been employed. However, a single epidemic model is not enough to capture long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors (e.g., lockdown and the capability to perform tests). In this work, we demonstrate that properly dividing the event sequence regarding COVID-19 (specifically, the numbers of active cases, recoveries, and deaths) into multiple segments and fitting a simple epidemic model to each segment leads to a better fit with fewer parameters than fitting a complex model to the entire sequence. Moreover, we propose a methodology for balancing the number of segments and the complexity of epidemic models, based on the Minimum Description Length principle. Our methodology is (a) Automatic: not requiring any user-defined parameters, (b) Model-agnostic: applicable to any ODE-based epidemic models, and (c) Effective: effectively describing and forecasting the spread of COVID-19 in 70 countries.

Infectious diseases have been serious threats to global public health.
They not only change the lifestyles of millions of people worldwide but also bring about dramatic changes in many areas, including economies, cultures, and ecologies. Unfortunately, the war against infectious diseases has continued throughout human history. The Black Death killed a third of the world's population in the 1340s, and the Spanish flu in 1918 is estimated to have infected about 500 million people and caused tens of millions of deaths. Recent epidemic outbreaks of SARS, Ebola, Zika, and COVID-19 show that the war is not over yet. Consequently, understanding and predicting epidemic spreads are important for prevention and effective decision making. How many people will be infected within a week? How will lockdowns affect the spread? To answer these questions, we require a method that is simple enough to be comprehensible but expressive enough to accurately model and predict the spread of infectious diseases.

Ordinary differential equations (ODEs) have successfully described dynamic systems in various fields, including ecology, economics, physics, and biology. ODEs have also been used to model epidemics. Some of the earliest epidemic models, such as SIS, SIR, and SEIR, are compartment models [1]. These models divide the population into several compartments and capture patterns of dynamic changes in the sizes of the compartments over time. The dynamics are expressed as predefined ODEs, which are based on human knowledge, with tunable parameters. While these models are intuitive and simple, they often have limited expressiveness, failing to capture epidemic dynamics accurately. On the other hand, data-driven models [2, 3] aim to model and forecast co-evolving time-series data using ODEs, without relying on human knowledge. They employ latent variables and non-linear differential equations to capture complicated temporal dynamics.
Despite the development of epidemic models, describing long-term dynamics of epidemics using a single epidemic model often faces limitations due to the unpredictability and abruptness of real-world events. Indeed, various external factors may substantially change the dynamics of epidemic events. For example, policies reducing contacts between individuals (e.g., lockdown) and the capability to perform tests can significantly affect the dynamics.

In this work, we demonstrate that properly dividing an epidemic event sequence into multiple segments and fitting a simple epidemic model to each segment greatly helps describe and predict the epidemic propagation concisely and accurately. For example, in Fig 1(a) and 1(b), the entire sequence of events regarding COVID-19 in Italy is fitted to two epidemic models with different numbers of parameters. On the other hand, in Fig 1(c), the sequence is split into multiple segments, and then a simple model is fitted to each segment. As seen in Fig 1(d), the segmentation leads to 8.09× smaller fitting error with fewer parameters than using a single model for the entire sequence.

Then the following questions naturally arise: Given a sequence of epidemic events, where should we divide it? How many segments should we divide it into? We propose a segmentation scheme that greedily decides where to split. It also decides the number of segments by balancing the fitting error and the sizes of the models for all segments, based on the Minimum Description Length (MDL) principle.

We validate our approach using event sequences regarding the recent Coronavirus Disease 2019 (COVID-19), specifically the numbers of active cases, recoveries, and deaths in 70 countries. COVID-19 was recognized as a pandemic by the World Health Organization. By early April 2021, 129 million confirmed cases and 2.8 million deaths were reported worldwide.
Our experiments reveal that our segmentation scheme enhances three epidemic models in explaining and predicting the propagation of COVID-19. The strengths of our approach are summarized as follows:
• Automatic: It does not require any user-defined parameters, such as the number of segments.
• Model-agnostic: It is applicable to any ODE-based epidemic models without being restricted to certain models.
• Effective: Applied to the COVID-19 datasets, it significantly reduces the fitting error (up to 14.29× with fewer parameters) and forecasting error (up to 31.54×) of three epidemic models.

We briefly review previous work on two related topics: epidemic models and time-series analysis models.

A variety of epidemic models have been proposed to understand and predict the spread of infectious diseases [4]. In the SI model, the population is divided into two different groups: susceptible and infectious; and the size of each group changes based on predefined differential equations. Taking realistic conditions, such as reinfection, recovery, immunity, population change, and exposure, into consideration, the SI model has been extended to SIS, SIR [5], SIRS [6], SIRD [7], SEIR [8], and many more. The spread of COVID-19 has been analyzed using modified SIRs: Li et al. [9] take human mobility into account, and Dandekar et al. [10] consider quarantine controls. These models are intuitive, explainable, and simple since they are based on human knowledge. However, they show weakness in capturing long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors.

Mining and modeling time-series data is a building block of many analytical and predictive tasks, such as pattern discovery [11, 12], disaggregation [13], and forecasting [2, 3, 14, 15], in a variety of fields, including social media [16, 17], web [14], and medical science [18].
In particular, ordinary differential equations (ODEs) have attracted much attention due to their simplicity and expressiveness, and several studies focus on learning ODEs from data [19, 20, 21, 22]. Recently, Chen et al. [19] introduced a generative model to solve ODEs using neural networks.

There have been several studies on learning to segment temporal data. Most of them [2, 3, 15, 23] focus on detecting repetitive patterns in activities (e.g., sensor data and motion events), while we focus on segmenting epidemic data, where dynamics suddenly change due to external factors, with the eventual goal of better modeling and forecasting the spread of COVID-19. Recently, Jiang et al. [24, 25] proposed piecewise linear quantile models that detect multiple change-points, where an SN-based test statistic is above a properly chosen threshold, for capturing the ever-changing growth rate of daily new cases of COVID-19. Note that our segmentation scheme has two distinct advantages over those used in these models: (a) automatic: it does not require any prior hyperparameters, and (b) model-agnostic: it is applicable to any ODE-based epidemic models, including non-linear fitting models.

Our segmentation scheme belongs to the class of binary segmentation [26]. While existing binary segmentation schemes are known to cause loss when detecting non-monotonic changes [27, 28], we demonstrate that our MDL-based segmentation scheme accurately divides the sequences and fits a model to each segment. Specifically, as shown in the experiment section, our segmentation scheme detects splitting points 3.59× more accurately and leads to 3.23× smaller fitting error (with the same number of parameters) than the non-binary segmentation method inspired by [2].

In this section, we introduce some notations and three main epidemic models that are used in the paper. Refer to Table 1 for the frequently-used notations.
We first review the Susceptible-Infectious-Recovered (SIR) model, which is one of the most classical compartment models. Then, we introduce two latent dynamics models that are based on linear and non-linear dynamics of latent variables.

The SIR model is one of the most classical epidemic models. Given a group of individuals forming a closed population of size P, each individual is assigned to one of three states: S (susceptible), I (infectious), and R (recovered). Here, we use S(t), I(t), and R(t) to denote the numbers of individuals in the three states, respectively, at timestamp t. The model assumes that each individual goes through two types of transitions: infection and recovery. That is, the state to which an individual belongs changes from S to I and then from I to R. Additionally, the model assumes that the probability of a susceptible individual getting infected at each time t is proportional to the number of infected individuals with a coefficient β, and that the probability of an infected individual recovering at each time t is γ. These dynamics can be expressed as the following three differential equations, where β and γ are model parameters:

\[ \frac{dS(t)}{dt} = -\beta S(t) I(t), \qquad \frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t), \qquad \frac{dR(t)}{dt} = \gamma I(t). \]

Note that these equations imply S(t) + I(t) + R(t) = P, since the three derivatives sum to zero.

The non-linear latent dynamics (NLLD) model [2] consists of two multi-dimensional event sequences: a k-dimensional latent (i.e., unobservable) event sequence w(t) and a d-dimensional observable event sequence v(t). The observed events v(t) are assumed to be determined by a non-linear dynamical system of the latent factors w(t) (Eqs (1) and (2)), where ⊙ denotes the Hadamard product (i.e., the elementwise product); and p ∈ R^k, Q ∈ R^{k×k}, and A ∈ R^k describe the linear, exponential, and non-linear dynamics between latent factors. In addition, u ∈ R^d and V ∈ R^{k×d} are used to project the latent factors to the observed events. The model parameters are p, Q, A, u, V, and the initial condition w(0) = w_0 of the latent factors.
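As an illustration of the SIR model reviewed earlier in this section, the following is a minimal sketch that integrates the three SIR equations numerically. It is not the authors' implementation: the function names (`sir_derivatives`, `rk4_step`, `simulate_sir`), the fixed-step fourth-order Runge-Kutta integrator, and the parameter values (β = 3e-5, γ = 0.1, P = 10,000) are all assumptions chosen only for illustration.

```python
def sir_derivatives(state, beta, gamma):
    """Right-hand side of the SIR equations: (dS/dt, dI/dt, dR/dt)."""
    s, i, r = state
    infections = beta * s * i   # flow from S to I
    recoveries = gamma * i      # flow from I to R
    return (-infections, infections - recoveries, recoveries)


def rk4_step(state, h, beta, gamma):
    """Advance the state by one Runge-Kutta (RK4) step of size h."""
    def shifted(k, scale):
        return tuple(x + scale * dx for x, dx in zip(state, k))

    k1 = sir_derivatives(state, beta, gamma)
    k2 = sir_derivatives(shifted(k1, h / 2), beta, gamma)
    k3 = sir_derivatives(shifted(k2, h / 2), beta, gamma)
    k4 = sir_derivatives(shifted(k3, h), beta, gamma)
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(s0, i0, r0, beta, gamma, days, steps_per_day=10):
    """Return one (S, I, R) tuple per day, including day 0."""
    h = 1.0 / steps_per_day
    state = (float(s0), float(i0), float(r0))
    trajectory = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, h, beta, gamma)
        trajectory.append(state)
    return trajectory


# Hypothetical outbreak: 10 initial infections in a population of 10,000.
trajectory = simulate_sir(9990, 10, 0, beta=3e-5, gamma=0.1, days=120)
```

Because β multiplies the raw counts S(t) and I(t) in this formulation, its scale depends on the population size; here β · S(0) ≈ 0.3 exceeds γ = 0.1, so the simulated outbreak grows initially. The sum S(t) + I(t) + R(t) stays at P throughout, up to floating-point rounding.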
We also consider the linear latent dynamics (LLD) model, a special case of the NLLD model, where the d-dimensional observed event sequence v(t) is assumed to be determined by a linear dynamical system of the k-dimensional latent factors w(t), that is, the NLLD dynamics without the non-linear term A.

The NLLD and LLD models can naturally be used as epidemic models if we regard I(t) and R(t) (i.e., the numbers of infected and recovered individuals) in the SIR model as the 2-dimensional observed event sequence v(t). Unlike the SIR model, the latent dynamics models are fully data driven, and thus they capture the temporal patterns in the event sequences without any prior knowledge of epidemics. Moreover, they describe the dynamics of the observed events using latent factors, which are not directly observed. Many real-world events are known to be largely affected by latent factors, and as shown in the experiment section, the latent dynamics models predict the spread of COVID-19 significantly more accurately than the SIR model. Our segmentation scheme described in the following section is model-agnostic. That is, it can be applied to any epidemic or time-series analysis models, including but not limited to the three considered ones.

In this section, we present our approach for deciding the number of segments and their locations automatically without user-defined parameters. We first define the description length of an event sequence. Then, based on the definition, we describe how we adapt the Minimum Description Length (MDL) principle to evaluate segmentation. Finally, we propose a search algorithm for finding the best segmentation.

Given a sequence X and a model M, the description length (in bits) of X, denoted by Cost(X), is defined as:

\[ \mathrm{Cost}(X) = \mathrm{Cost}(M) + \mathrm{Cost}(X \mid M), \]

where the model cost Cost(M) is the number of bits required to describe the model M, and the data cost Cost(X|M) is the number of bits to encode X given M. The model cost and the data cost are described below. To measure the model cost Cost(M), we examine the parameters of the model M and their sizes in bits.
Below, we consider the three aforementioned epidemic models. Note that the model cost of any other model can be measured in a similar way.
• SIR Model: The infection rate β and the recovery rate γ are two real numbers, and encoding each requires C_F bits (we set C_F to 8 by convention). Thus, the model cost required to describe the SIR model in bits is (we ignore the cost required to encode the population P since it is required only once regardless of the number of segments):
\[ \mathrm{Cost}(M_{\mathrm{SIR}}) = 2 \cdot C_F \]
• Non-linear Latent Dynamics (NLLD) Model: This model is described by a set of six parameters: w_0, p, Q, A, u, and V (see Eqs (1) and (2)). They contain k, k, k^2, k, d, and kd real-valued parameters, respectively. Thus, the model cost in bits required to describe the NLLD model is:
\[ \mathrm{Cost}(M_{\mathrm{NLLD}}) = C_F \cdot (k + k + k^2 + k + d + kd) \qquad (3) \]
• Linear Latent Dynamics (LLD) Model: The model cost required by the LLD model is:
\[ \mathrm{Cost}(M_{\mathrm{LLD}}) = C_F \cdot (k + k + k^2 + d + kd) \]
Note that the cost in bits required to encode A is subtracted from Eq (3).

Algorithm 1
Input: (1) epidemic event stream X_{1:n}, (2) epidemic model solver f
Output: segmented event stream X_{s_1:e_1}, …, X_{s_r:e_r}
1 if n ≤ 2 then return X_{1:n} ⊳ Base Case
2 C ← Cost(f(X_{1:n})) + Cost(X_{1:n} | f(X_{1:n}))
3 i* ← arg min_{i ∈ {2, …, n−2}}

The data cost Cost(X|M) is the number of bits required to describe X given M. We assume Huffman coding [29] to encode the difference between the observed event sequence X and the event sequence V estimated by the model M. Then, the number of bits required is the negative log-likelihood under a Gaussian distribution N(0, σ^2) as follows:

\[ \mathrm{Cost}(X \mid M) = -\sum_{t} \sum_{i=1}^{d} \log_2 p_{\mathcal{N}(0, \sigma^2)}\big(x_i(t) - v_i(t)\big), \]

where x_i(t) and v_i(t) are the i-th dimensions of the actual and estimated events at time t. We fix σ to the standard deviation of the elements of X − V during the period of each segment.

In order to fit M to X, we use the Levenberg-Marquardt (LM) algorithm to minimize the mean squared error between the given data sequence and the estimated sequence.
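The model cost and data cost defined above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names and the `min_sigma` floor (added only to avoid taking the logarithm of zero for a perfect fit) are assumptions, the model costs simply multiply C_F by the parameter counts listed in the text, and the data cost is evaluated as the Gaussian negative log-density in bits, which can become negative when σ is small.

```python
import math

C_F = 8  # bits per real-valued parameter, the value used in the text


def model_cost_sir():
    """Two real-valued parameters: beta and gamma."""
    return 2 * C_F


def model_cost_nlld(k, d):
    """Parameter counts of w_0, p, Q, A, u, V are k, k, k^2, k, d, kd."""
    return C_F * (k + k + k * k + k + d + k * d)


def model_cost_lld(k, d):
    """Same as NLLD, minus the k values of the non-linear term A."""
    return model_cost_nlld(k, d) - C_F * k


def data_cost(observed, estimated, min_sigma=1e-6):
    """Bits to encode X given the model estimate V under N(0, sigma^2).

    observed, estimated: lists of time steps, each a list of d values.
    sigma is the standard deviation of the elements of X - V, floored at
    min_sigma (the floor is an assumption of this sketch).
    """
    residuals = [
        x - v
        for row_x, row_v in zip(observed, estimated)
        for x, v in zip(row_x, row_v)
    ]
    n = len(residuals)
    mean = sum(residuals) / n
    sigma = max(min_sigma, math.sqrt(sum((e - mean) ** 2 for e in residuals) / n))
    nats = sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + e * e / (2 * sigma ** 2)
        for e in residuals
    )
    return nats / math.log(2)  # convert nats to bits


# Hypothetical two-dimensional sequence with a close and a loose estimate.
observed = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
close_fit = [[0.9, 2.1], [2.9, 4.1], [4.9, 6.1]]
loose_fit = [[0.0, 3.0], [2.0, 5.0], [4.0, 7.0]]
cost_close = data_cost(observed, close_fit)
cost_loose = data_cost(observed, loose_fit)
```

In this toy example the residuals of `loose_fit` are ten times those of `close_fit`, so its data cost is larger by log2(10) bits per encoded value. The description length of a single segment is then the model cost plus the data cost.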
The LM algorithm adaptively varies the parameter updates, interpolating between the Gauss-Newton update and the gradient descent update by means of a damping parameter. The lmfit library we used in our implementation requires two arguments, xtol and ftol, which are the desired relative errors in the approximate solution and in the sum of squares, respectively. That is, termination occurs (a) when the relative error between two consecutive iterates is at most xtol or (b) when both the actual and predicted relative reductions in the sum of squares are at most ftol. However, as discussed in Section 5.5.1, our segmentation scheme is insensitive to these parameters, and thus we consistently use the same values throughout the experiments. For the NLLD model, we split the parameters into the linear parameter set (p, Q, u, and V) and the non-linear parameter set (A) and separately optimize them using the expectation-maximization (EM) algorithm, as suggested in [2]. In practice, this accelerates convergence compared to simultaneously optimizing all parameters.

We adapt the Minimum Description Length (MDL) principle [30] for segmentation evaluation. Consider an event sequence X (= X_{1:n}) and a solver f of an epidemic model. We denote the division of X into r segments, where the i-th segment starts at time s_i and ends at time e_i, by X_{s_1:e_1}, …, X_{s_r:e_r}, where s_1 = 1, e_r = n, and e_i + 1 = s_{i+1} for each i ∈ {1, …, r − 1}. Let f(X_{i:j}) be the epidemic model fitted to the segment X_{i:j}. Then, the description length in bits of X_{s_1:e_1}, …, X_{s_r:e_r} is:

\[ \sum_{i=1}^{r} \Big[ \mathrm{Cost}\big(f(X_{s_i:e_i})\big) + \mathrm{Cost}\big(X_{s_i:e_i} \mid f(X_{s_i:e_i})\big) \Big] + (r - 1) \cdot \log_2(n), \qquad (4) \]

where (r − 1) · log_2(n) is the cost in bits required to encode the r − 1 splitting points (i.e., s_2, …, s_r). Since each splitting point is a positive integer smaller than n, the number of bits required to encode it is log_2(n). The description length (i.e.,
Eq (4)) balances the fitting error and the size of the parameters required to encode the epidemic models for all segments, and we use it to evaluate segmentation. Specifically, based on the MDL principle, we prefer the segmentation that minimizes Eq (4), and in the following subsection, we discuss how we search for such a segmentation.

Given an event sequence X, how can we find the segmentation that minimizes the description length (i.e., Eq (4))? Since the number of ways to segment a length-n sequence grows exponentially in n (on the order of 2^n), naïvely trying all possible segmentations is computationally prohibitive. Thus, we propose to greedily segment the sequence, as described in Algorithm 1, throughout which we keep the length of each segment at least two. Given an event sequence X_{1:n}, we find a splitting point i* ∈ {2, …, n − 2} where the description length (i.e., Eq (4)) of the corresponding segmentation is minimized (Line 3). If splitting X_{1:n} at time i* strictly decreases the description length, we divide X_{1:n} into X_{1:i*} and X_{i*+1:n}, and then recursively divide each segment (Line 6). Otherwise, we stop segmentation (Line 5).

In this section, we review our experiments designed to answer the following questions:
• Q1. Effectiveness of Segmentation: Does segmentation help understand the spread of COVID-19? Does it give a better trade-off between model complexity and fitness?
• Q2. Effectiveness of our Segmentation Scheme: How well does our greedy segmentation algorithm based on the MDL principle work? Does it yield a smaller fitting error than the baseline with the same number of segments?
• Q3. Accuracy of Forecasting: Is segmentation beneficial for accurately predicting the spread of COVID-19? Is it beneficial regardless of the epidemic models used?

• Machines: We conducted all the experiments on a machine with an AMD Ryzen 9 3900X CPU and 128GB RAM.
• Datasets: We considered the 70 countries with the most confirmed cases as of the end of March, 2021.
We used the number of active cases as I(t) and the number of recoveries and deaths as R(t) in each of the 70 countries from March 1, 2020 to March 30, 2021. The dataset is publicly available at [31]. Since the number of recoveries in the US is not available, we used the number of deaths as R(t) for the US.
• Implementations: We implemented the SIR model, the LLD model, and the NLLD model in Python. We used the lmfit library for the optimization (see https://lmfit.github.io/lmfit-py/ for details).
• How to choose k: For the LLD and NLLD models, we chose the number of latent factors k between 1 and 6 so that the description length (i.e., Eq (4)) is minimized.

We measure how segmentation by Algorithm 1 affects the model complexity and fitting error of the three considered epidemic models. As seen in Fig 2, segmentation leads to significantly better trade-offs between the model cost (in bits) and the fitting error (in terms of RMSE), regardless of the epidemic models used. For example, in the India dataset, the NLLD model with segmentation yields 11.54× smaller fitting error with smaller model cost than the same model without segmentation. Fig 3 shows the input and estimated event sequences when the description length is minimized. The description length is minimized when a simple epidemic model with few latent factors is used with a sufficient number of segments. Simple epidemic models with segmentation provide a more concise and accurate description of the spread of COVID-19 than complex models without segmentation. The results in the other countries can be found in the supplement.

We further qualitatively analyze the splitting points detected by our segmentation scheme in the dataset collected in Japan. Specifically, in this dataset our segmentation scheme detects three splitting points: (1) May 14, 2020, (2) August 25, 2020, and (3) January 13, 2021.
As shown in Fig 4, these dates coincide with the periods when the state of emergency (SOE) was declared or lifted by the Japanese Government. The result indicates that there is a close correspondence between the segmentation derived by the proposed scheme and the deployed policies.

We demonstrate the effectiveness of our greedy segmentation scheme based on the MDL principle by comparing it with the incremental method inspired by [2]. The incremental method goes through the sequence from the start and initiates a new segment whenever the fitting error within the current segment exceeds a given threshold ε. As in [2], we set the threshold proportional to the L_2 norm of the current segment X_c with a coefficient α. That is, ε = α · ||X_c||_2. Note that a smaller α is expected to yield more segments. As seen in Fig 5, where we fix k to 2 and vary α from 0.05 to 0.5, our proposed segmentation scheme significantly outperforms the incremental method. Specifically, our scheme gives up to 3.23× smaller fitting error than the incremental segmentation with the same model cost, which is proportional to the number of segments. The results in the other countries can be found in the supplement.

Furthermore, to numerically evaluate the accuracy of the segmentation, we generate synthetic sequences with randomly selected splitting points where each segment is generated by a different set of random parameters of the NLLD model. We carefully sample parameters based on the model parameters fitted to real-world sequences. Specifically, we sample −0.1 < p < 0.1, −0.1 < Q < 0.1, −0.001 < A < 0.001, −0.1 < u < 0.1, −1.0 < V < 1.0, and −1 < w_0 < 1 uniformly at random. Then, we compare the detected splitting points, i.e., timestamps where the segmentation occurs, and the ground-truth ones by measuring F1 scores. When measuring F1 scores, for robust evaluation, we consider a detected splitting point correct if it is within δ time units from a ground-truth one.
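The tolerance-based F1 score described above can be sketched as follows. The text does not say how ties or repeated matches are handled, so this sketch assumes a one-to-one matching in which each ground-truth point can validate at most one detected point and "within δ" means an absolute difference of at most δ; the function name `f1_with_tolerance` and the example values are hypothetical.

```python
def f1_with_tolerance(detected, truth, delta):
    """F1 score of detected splitting points against ground-truth ones.

    A detected point counts as correct if an unmatched ground-truth point
    lies within delta time units; each ground-truth point is used once
    (the one-to-one matching is an assumption of this sketch).
    """
    unmatched = sorted(truth)
    true_positives = 0
    for point in sorted(detected):
        nearest = None
        for candidate in unmatched:
            if abs(point - candidate) <= delta and (
                nearest is None or abs(point - candidate) < abs(point - nearest)
            ):
                nearest = candidate
        if nearest is not None:
            unmatched.remove(nearest)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three detected points fall within delta = 3.
example_f1 = f1_with_tolerance([52, 119, 160], [50, 120, 200], delta=3)
```

With these made-up inputs, precision and recall are both 2/3, so the F1 score is 2/3 as well.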
As shown in Table 2, splitting points detected by our segmentation scheme match the ground-truth splitting points closely, and in particular, our segmentation scheme is more accurate than the incremental method.

We examine the effect of segmentation on the accuracy of future prediction using the three considered epidemic models. To this end, we divide each sequence into the training sequence and the test sequence, which span 327 days and 37 days, respectively. Then, we fit the epidemic models to each training sequence with and without segmentation and predict the event sequence during the test period. When segmentation is applied, we ensure that the last segment is at least as long as the test period, and we use the model fitted to the last segment of the training sequence for prediction. We can ensure this by modifying Algorithm 1 so that it never splits the training sequence during its last 37 days. That is, it searches for splitting points during the first 290 days. This constraint is helpful for forecasting, as shown experimentally in Section 5.5.2. For the LLD and NLLD models without segmentation, we vary the number of latent factors k from 1 to 6.

In Table 3, we compare the prediction error (in terms of RMSE) of the three epidemic models with and without segmentation. When the LLD model or the NLLD model is used, among 7 different settings (k = 1 to 6 without segmentation, and ours), our segmentation scheme leads to the most accurate prediction in 32 and 33 (out of 70) countries, respectively. The second best one, which is the LLD model with k = 2 and no segmentation, is most accurate in only 9 countries. When the SIR model is used, segmentation increases the prediction accuracy in 70 (out of 70) countries. Moreover, prediction without segmentation is unstable with unreasonably large RMSE in some countries, while it is stable with segmentation in all countries. To sum up, segmentation tends to improve the prediction accuracy of all three considered epidemic models.
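The greedy MDL-based search of Algorithm 1, together with the constraint that no split may fall in the final part of the training sequence, can be sketched as below. This is a simplified stand-in, not the authors' implementation: instead of an ODE-based epidemic model fitted with lmfit, each segment is fitted with a least-squares line (two parameters), the sequence is one-dimensional, σ is floored at 1 (`SIGMA_FLOOR`), and log2 of the current subsequence length is charged per split. The names `segment_cost`, `greedy_segment`, and `last_allowed` are hypothetical.

```python
import math

C_F = 8            # bits per real-valued parameter, as in the text
SIGMA_FLOOR = 1.0  # assumption: floor on sigma so perfect fits are not over-rewarded


def segment_cost(ys):
    """Description length (bits) of one segment under a toy linear model."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    residuals = [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(ys)]
    mean = sum(residuals) / n
    sigma = max(SIGMA_FLOOR, math.sqrt(sum((e - mean) ** 2 for e in residuals) / n))
    data_bits = sum(
        0.5 * math.log2(2 * math.pi * sigma ** 2)
        + e * e / (2 * sigma ** 2 * math.log(2))
        for e in residuals
    )
    return 2 * C_F + data_bits  # model cost (slope, intercept) + data cost


def greedy_segment(ys, lo=0, hi=None, last_allowed=None):
    """Recursively split ys[lo:hi]; return half-open (start, end) index pairs.

    last_allowed: largest index at which a new segment may start, mirroring
    the forecasting setup where the tail of the training data is never split.
    """
    hi = len(ys) if hi is None else hi
    last_allowed = hi if last_allowed is None else last_allowed
    # Both sides of a split must contain at least two points.
    candidates = [p for p in range(lo + 2, hi - 1) if p <= last_allowed]
    if not candidates:
        return [(lo, hi)]
    whole = segment_cost(ys[lo:hi])
    best = min(candidates, key=lambda p: segment_cost(ys[lo:p]) + segment_cost(ys[p:hi]))
    split = segment_cost(ys[lo:best]) + segment_cost(ys[best:hi]) + math.log2(hi - lo)
    if split >= whole:  # split only if it strictly decreases the description length
        return [(lo, hi)]
    return (greedy_segment(ys, lo, best, last_allowed)
            + greedy_segment(ys, best, hi, last_allowed))


# Toy sequence: 30 rising values followed by 30 falling values.
series = [10.0 * t for t in range(30)] + [300.0 - 10.0 * t for t in range(30)]
segments = greedy_segment(series)
constrained = greedy_segment(series, last_allowed=20)
```

On this toy sequence the unconstrained search places its single split at the change in trend, while the constrained call never starts a segment after index 20, so its last segment spans at least 40 points. In the forecasting setup described above, the analogous setting would be a 327-day training sequence with splits allowed only within the first 290 days.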
Note that with segmentation, only the last segment, not the entire sequence, is used for prediction. Despite this, segmentation increases the accuracy of prediction by letting epidemic models focus on the part that represents the current epidemic dynamics while ignoring the part before inherent changes in the dynamics.

Below, we present the results of additional experiments.

5.5.1 Insensitivity to two arguments: xtol and ftol. For optimization, we used the lmfit library for Python, which solves non-linear least-squares problems. The leastsq function, which we used, requires two arguments, xtol and ftol, which are the desired relative errors in the approximate solution and the sum of squares, respectively (see https://lmfit.github.io/lmfit-py/fitting.html#lmfit.minimizer.Minimizer.leastsq for details). We tested the NLLD model in the Japan dataset using eight different xtol and ftol values (10^-1 to 10^-8) and five different numbers of latent factors k (2 to 6). In the 40 considered settings, the splitting points of the segmentation were exactly the same (the 71st, 198th, and 324th days), which implies that the proposed scheme is insensitive to these parameters. Thus, in this work, we do not tune xtol and ftol but fix them to 10^-8 in all experiments in the main paper.

Table 3. Segmentation is helpful to accurate prediction of the spread of COVID-19. (Column headers: Single Segment (r = 1) with k = 1 to 6, and Ours.)

5.5.2 The effect of the constraint on the last segment. One might be concerned that avoiding segmentation within the last 37 days before the test set may degrade the flexibility of the model and thus the accuracy of forecasting. Empirically, however, this constraint is helpful for accurate prediction by preventing overfitting. Note that if the last segment is too short, overfitting easily occurs, resulting in a large generalization (i.e., prediction) error.
In order to demonstrate the effect of the constraint, we compared the forecasting errors of the NLLD model with (our original setting) and without the constraint in 70 countries. As shown in Fig 6, without the constraint, NLLD greatly overestimated the numbers of infected and recovered individuals in some countries (specifically, Lebanon and Lithuania). It should be noted that the estimates were even larger than the populations of these countries. On the other hand, the constraint helped prevent such absurd predictions, and specifically, NLLD with the constraint always made predictions within the populations of the countries. In addition, out of the 70 countries, NLLD with the constraint outperformed that without the constraint in 39 countries. The average forecasting error (in terms of RMSE) was also smaller when adopting the constraint. Specifically, it was 94.3 with the constraint and 116.3 without the constraint (averaged over only the 68 countries with reasonable results).

In this work, we propose to divide epidemic event sequences into multiple segments and fit a simple model to each segment. To this end, we propose a greedy algorithm based on the MDL principle to decide where to split the sequences. Through extensive experiments using the COVID-19 event sequences from 70 countries, we demonstrate that our methodology has the following advantages:
• Automatic: All parameters are tuned automatically based on the MDL principle without relying on users.
• Model-agnostic: Any ODE-based epidemic models can be used with our segmentation scheme.
• Effective: The fitting error and prediction error of three epidemic models decrease up to 14.29× and 31.54×, respectively, with our segmentation scheme.

The code and datasets used in the paper are available at https://github.com/geonlee0325/covid_segmentation.

Supporting information S1 Appendix.
(PDF)

[1] Periodicity and stability in epidemic models: a survey
[2] Regime shifts in streams: Real-time forecasting of co-evolving time sequences
[3] Dynamic modeling and forecasting of time-evolving data streams
[4] Infectious diseases of humans: dynamics and control
[5] FastSIR algorithm: A fast algorithm for the simulation of the epidemic spread in large networks by using the susceptible-infected-recovered compartment model
[6] A stochastic epidemic model with nonmonotone incidence rate: Sufficient and necessary conditions for near-optimality
[7] Mathematical modelling of the transmission dynamics of ebola virus
[8] Modelling the SARS epidemic by a lattice-based Monte-Carlo simulation
[9] Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)
[10] Quantifying the effect of quarantine control in Covid-19 infectious spread using machine learning
[11] Streaming pattern discovery in multiple time-series
[12] Optimal multi-scale patterns in time series streams
[13] Ares: automatic disaggregation of historical data
[14] The web as a jungle: Non-linear dynamical systems for co-evolving online activities
[15] BeatLex: Summarizing and Forecasting Time Series with Patterns. In: ECML-PKDD
[16] Rise and fall patterns of information diffusion: model and implications
[17] Early online identification of attention gathering items in social media
[18] Network discovery via constrained tensor analysis of fmri data
[19] Neural ordinary differential equations
[20] Hidden physics models: Machine learning of nonlinear partial differential equations
[21] Probabilistic ODE solvers with Runge-Kutta means
[22] Numerical Gaussian processes for time-dependent and nonlinear partial differential equations
[23] Autoplait: Automatic mining of co-evolving time sequences
[24] Time series analysis of COVID-19 infection curve: A change-point perspective
[25] Modelling the COVID-19 infection trajectory: A piecewise linear quantile trend model
[26] A cluster analysis method for grouping means in the analysis of variance
[27] Narrowest-over-threshold detection of multiple change points and change-point-like features
[28] Circular binary segmentation for the analysis of array-based DNA copy number data
[29] Ric: Parameter-free noise-robust clustering
[30] Modeling by shortest data description
[31] Novel Corona Virus 2019 Dataset. Day level information on covid-19 affected cases

The authors have declared that no competing interests exist.