Inferring statistical trends of the COVID19 pandemic from current data. Where probability meets fuzziness.
Bruno Apolloni
Inf Sci (N Y), 2021-06-09. DOI: 10.1016/j.ins.2021.06.011

We introduce unprecedented tools to infer approximate evolution features of the COVID19 outbreak when these features are altered by containment measures. In this framework we present: 1) a basic tool to deal with samples that are both truncated and not independently drawn, and 2) a two-phase random variable to capture a game changer along a process evolution. To overcome these challenges we work in an intermediate domain between probability models and fuzzy sets, while still maintaining probabilistic features of the employed statistics as the reference KPI of the tools. This research uses as a benchmark the daily cumulative death numbers of COVID19 in two countries, with no ancillary data. Numerical results show: i) the capability of the model to capture the inflection point and to forecast the end-of-infection time and the related outbreak size, and ii) the out-performance of the model inference method according to conventional indicators.

Facing a dramatically imminent phenomenon like the current Covid19 pandemic, we were driven to issue forecasts on its evolution, albeit in the presence of not always clean data [7]. As part of the scientific community's massive effort to put science at the immediate service of society, we focused on epidemic situations where measures to contain the contagion produce tangible effects. This led us to use a two-phase model [5] consisting of a first trait ruled by the physics of Brownian motion followed by a second one in the realm of Lévy flights. The final goal of this modeling is to capture relevant features of the epidemic process that enable a proper protection of public health. As usual, we identify them with: inflection date, outbreak size and end-of-epidemic date (the target features from now on), which we expressly refer to the paths of the daily cumulated deaths.

The trail to manage this model needs a set of steps, as shown in Figure 1. In fact, we had to rely on samples of the phenomenon data that are not independent, because they are both truncated to the current date and possibly affected by correlation along the span of time. This accounts for the first two blocks in the above figure, which we get rid of in two ways: by moving from the probability framework to the fuzzy set one, to manage non-iid (independently, identically distributed) samples of a pseudo-random variable; and by using the Algorithmic Inference approach [6] to devise a dependency generator under our control. Once these two steps have been accomplished, we are equipped with inference tools for identifying the mentioned two-phase process in terms of a probability distribution of a random variable which is compatible, in a proper acceptation, with the observed sample. This closes the trail from death prevalences in input to an ensemble of Cumulative Distribution laws, represented through their complements to 1 (CCDF) in LogLog scale, on the right end of Figure 1. With this output we easily obtain the target features as random variables as well, from which we may compute summary statistics such as point estimators, spread quantifiers and so on.
The resulting procedure is devised to produce relatively long term forecasts, where approximately one fifth of the entire process from start to end of the outbreak is used to compute the target features concerning the process as a whole. The approximation of the results will be checked once data on the entire course are available, as for the first wave of the epidemic. The value of the procedure leading from data to distribution laws will be compared with that of specific competitors in the literature.

The paper is structured as follows. After framing our work in the literature in Section 2, in Section 3 we introduce the theoretical aspects within the statistical framework of Algorithmic Inference, with the focus on our non-iid samples. In Section 4 we recall a phase model introduced elsewhere and adapt it to the current problem. In Section 5 we implement the entire contrivance on actual Covid19 datasets and discuss the value of our results. We devote the last Section (Section 6) to a discussion of the advantages of our approach and the performance of the results in comparison with those of competitors in the literature. We also highlight some advances in the carrying out of the intermediate blocks of the trail in Figure 1, which are both indispensable for completion of the procedure and unprecedented as for the adopted solutions.

The four steps in Figure 1 likewise address research tracks calling for advanced solutions to cope with COVID19 data analytics. Though the terms are in some cases used synonymously, what differentiates censored samples from truncated samples is knowledge of the sample size [12]. Thus, of a sample of known size m, with a censored release we may exploit a smaller number of observations because of a threshold on their values or on their number; for instance, we may observe only the c smallest values. In a truncated sample we have only r observations available and we do not know how many data we missed because of truncation (see the small illustration at the end of this section). While methods of inferring from censored data are well developed, especially in survival analysis [28] and insurance [15], truncated samples are dealt with mainly in specific cases [12]. Actually, the problem has been faced for a long time, with solutions found initially through moment methods (see for instance [33]) and subsequently through maximum likelihood ones (see for instance [21]). In recent years the problem has taken on the features of a machine learning procedure [31], possibly endowed with some unfeasibility lemmas [30]. The benchmark of Covid19 data we use constitutes an instance of truncated data with unknown truncation and absence of an oracle (see [30]).

Samples of non independent observations are the typical subject of longitudinal data analysis in medical data, with wide application precisely in epidemiology [26, 24]. Per se, sample correlation heavily hampers the suitability of the computed statistics. Hence, methods such as mixed effects [35] and generalized estimating equations [34] partition the samples into subsamples which are internally independent and model the dependence among them via regression models. By contrast, we introduce a rather empirical method to actually process the sample correlation, as we will show in the next section. Non-homogeneous stochastic processes comprise a wide chapter of mathematical statistics which lists many sophisticated methods in terms either of simulation procedures or of analytical tools, or a combination of them.
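To make the censored/truncated distinction concrete, the following toy illustration (our own sketch, not part of the paper's procedure; the sample and the threshold value are illustrative assumptions) generates a sample and releases it in the two ways described above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                   # known sample size
full = rng.exponential(scale=10.0, size=m)

# Censored release: we know m, but observe only the c smallest values.
c = 60
censored = np.sort(full)[:c]              # 60 observations, plus the knowledge that 40 exceed them

# Truncated release: we observe only values below a threshold and
# we do NOT know how many observations were lost.
threshold = 12.0
truncated = full[full < threshold]        # r = len(truncated) observations; m unknown to the analyst
print(len(censored), len(truncated))
```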
Basic time descriptors in the case of epidemic processes are the numbers of people who are: i) infected, ii) hospitalized, iii) dismissed and iv) deceased. Of these variables we commonly follow the trends in terms of prevalence, i.e. the cumulated number of related individuals from the start of the epidemic up to a given day, and incidence, i.e. its daily increment. Alternatively, we may be interested in some meaningful features such as peak size and date [19], outbreak size [38], or basic reproduction numbers such as R0 [20]. In studying these kinds of processes, a preliminary choice to make is between regression methods and modeling tools. With the former, we fit in a rather agnostic way the available trend to forecast its continuation in the future and possibly derive some analytical features, such as peaks and inflections, from the interpolating curve. Many approaches have been specially developed for the epidemic at hand, such as GLM [18], ARIMA [9] and Neural Networks [36]. In this paper we choose the second option in the perspective of Explainable AI [1], hence with the aim of producing a model that may be understood in terms of the actual mechanism behind the pandemic, thus resulting suitable to public health stakeholders in making operational decisions.

Modeling epidemic evolution rests on the two companion threads of ordinary differential equations (ODEs), for the deterministic approach, and Markov models/stochastic differential equations (SDEs), for the stochastic approach [3]. Roughly speaking, we may say that the former consider only the expected values of the random distributions dealt with in the stochastic approach. The ancestor is the SIR model [39], which models the variables S(t), I(t) and R(t) denoting the fractions of susceptible, infectious and removed individuals at time t, respectively. Later, many variants were developed, possibly specifically for COVID19 [23, 29]. At the core we have a birth-death process that we may model on average via ODEs [14] (see the numerical sketch below). Their numerical implementation constitutes highly descriptive tools that are susceptible to embedding time dependence and relaxing the principled homogeneity of susceptible people, but only when sufficient details about them can be evaluated. The stochastic companions are mainly used to investigate the overall structure of the process, to deduce the presence or absence of major outbreaks and related features from asymptotic state distributions. A bridge between the two approaches is represented by closed analytical forms of the probability distribution of the above variables, to be fitted directly onto sampled paths. Actually, while homogeneous versions of these processes are at the basis of queuing theory [11], non homogeneous versions prove to be analytically manageable mainly in elementary instances [42], with direct estimation of the involved parameters (see for instance [37]). Moving to more complex instances, we may preserve this benefit through the simple strategy of approximating the analytical solution with functions of the exponential family that are endowed with a sufficient number of parameters to make them fit the experimental companions. For instance, in [27] the case prevalence π is modeled by a Weibull function of the exposure e where, for short, exposure is the time since the start of the epidemic, while a more complex variant of this distribution is proposed in [40]. The variable π may encompass the non homogeneity of the underlying stochastic process.
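As a reference for the deterministic thread mentioned above, here is a minimal numerical integration of the classical SIR ODEs [39]. It is only a sketch; the parameter values (beta, gamma, initial fractions) are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

def sir_step(s, i, r, beta, gamma, dt):
    """One explicit Euler step of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I."""
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    return s + ds * dt, i + di * dt, r + dr * dt

# Illustrative parameters: contact rate beta and removal rate gamma (per day).
beta, gamma, dt = 0.30, 0.10, 0.1
s, i, r = 0.999, 0.001, 0.0          # fractions of susceptible, infectious, removed
trajectory = []
for t in np.arange(0.0, 200.0, dt):
    trajectory.append((t, s, i, r))
    s, i, r = sir_step(s, i, r, beta, gamma, dt)

peak_t, _, peak_i, _ = max(trajectory, key=lambda row: row[2])
print(f"peak infectious fraction {peak_i:.3f} around day {peak_t:.0f}")
```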
In a previous work on opportunistic networks [5] we focused only on two phases and marginalized over the phase transition time distribution. The fourth block in Figure 1 falls within the sphere of our own approach to statistics. Instead of looking for the features of the true distribution law (if any) of a random variable, we list features that are compatible with the observed sample [6]. We will discuss this at greater length in the next section. In this paper we look directly at an approximate distribution law of a spurious random variable. Namely, with reference to the randomness of the prevalence π, it depends both on the exposure e and on some kind of noise at a given e. Considering a sample of prevalences of the same local infection phenomenon, we have a monotonic relation between observed value and related exposure, so that we may hide this dependence and handle the observations as an ordered sample of the random variable Π. This is a rather spurious variable for the reasons we mentioned in the introduction, which we will deal with in the next sections within the same approximation thread of [27]. By adopting the mentioned two-phase model, we are left with a unique random variable as a reward for the many approximations we carried out in a framework that is partly rooted in probability, partly in fuzziness.

In this section we introduce a variant of the Algorithmic Inference tools to cope with the non-iid sample of our two-phase model. We adopt a generative approach to random variables, called Algorithmic Inference [6]. In this approach, each such variable X is explained through a sampling mechanism M_X = (Z, g_θ), where a random seed Z translates into a random variable X via an explaining function g_θ(Z), for a proper function g and parameter θ. Namely, for a continuous X we adopt the sampling mechanism

  x = g_θ(u) = F_{X_θ}^{-1}(u)        (1)

where U is the unit continuous uniform random variable, F_{X_θ} is the Cumulative Distribution Function (CDF) of X parametrized in θ, and F_{X_θ}^{-1} is its inverse. The main inference tool of Algorithmic Inference is the master equation

  s = h(θ; u_1, …, u_m)        (2)

where s is a statistic properly synthesizing an observed iid sample x = {x_1, …, x_m} and h is the expression of this statistic that we get by substituting each x_i with the right-hand term of (1) evaluated on the corresponding u_i. It is a sort of reverse engineering where we know x and want to identify θ. Since the seeds {u_1, …, u_m} are unknown, apart from their distribution law and their independence, the identification result is a probability distribution on Θ in the role of random parameter. For instance, for a Negative Exponential distribution law with parameter λ, (1) and (2) read, respectively,

  x_i = −log(1 − u_i)/λ,    λ = −(1/s) Σ_{i=1..m} log(1 − u_i),  with s = Σ_{i=1..m} x_i,

and the corresponding random parameter Λ is a function of the random seeds {U_1, …, U_m}. By simple analytical considerations we recognize Λ to follow the Gamma distribution with parameters (s, m). While full details of this inference method may be found in [4], we remark that s must be properly devised. The key feature of such an s is to provide the master equation with a solution in θ that is always defined and unique, for whatever seed sample (a condition that denotes it as a well behaving statistic). This feature allows us to establish a bootstrap procedure to derive the parameter distribution law, which we will exploit in this paper when analytical tools are not available as they are for the above Λ. Namely, the following pseudo-code generates a bootstrap population of the random parameter Θ from which to derive its empirical distribution law. For a large number of replicas:
  (a) draw a seed sample {u_1, …, u_m} from U;
  (b) solve the master equation (2) in θ, obtaining θ*;
  (c) add θ* to the Θ population.
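A minimal numerical transcription of the above bootstrap, instantiated on the Negative Exponential example (our own sketch; the sample values are illustrative): for each replica we draw fresh seeds, solve the master equation for λ and store the solution, obtaining an empirical distribution of the random parameter Λ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed sample and its well behaving statistic s = sum of the observations.
x = np.array([2.1, 0.7, 3.4, 1.2, 0.5, 2.8, 1.9, 0.9])
m, s = len(x), x.sum()

# Bootstrap population of the random parameter Lambda.
replicas = 10_000
lambdas = np.empty(replicas)
for j in range(replicas):
    u = rng.uniform(size=m)                     # (a) draw a seed sample from U
    lambdas[j] = -np.log(1.0 - u).sum() / s     # (b) solve the master equation in lambda
                                                # (c) add the solution to the population
print("median and 0.90 interval for Lambda:",
      np.quantile(lambdas, [0.5, 0.05, 0.95]))
```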
The observations of the random process we will handle in the next sections constitute a spurious sample for two reasons: 1. they are truncated; 2. they are sequences of correlated data. Since a monotone relation exists between sampled data and their seeds, we may transfer both of the above defects onto the seeds. Moreover, the monotonic feature of the explaining function g in (1) guarantees that the autocorrelation sign is maintained while shifting from non independent seeds to the generated variables. Finally, the sampling mechanism does not distinguish whether the seeds have been drawn from U or from U^d when the related ECDFs almost coincide, as in Figure 2(c). As a whole, by sampling from U^d we have in our hands a generator of non independent samples working like the speed inverter control lever of a boat, as in Figure 3. It is a rather rough device which, with d playing the role of the lever angle, implements the general rule (4) and asks for a proper adjustment of this angle during maneuvers, where the main feature is the monotonicity of d versus r and hence versus ρ. Obviously, the autocorrelation of a sample may be a more articulated function of its items, whereas (4) refers to a special family of autocorrelations. However, it works generally well both with simple random variable models and, on the contrary, with complex models that are endowed with a relatively large number of free parameters to fit the data. Since the correlation lever has the effect of transferring the correlation onto the seeds of the sampling mechanism, which in turn remain independent, to infer θ from a non independent sample we:

1. define a new explaining function in place of the one in (1), namely x_i = g_θ(u_i^d);
2. adopt the general strategy of leaving the general inference procedures unchanged, with the sole substitution of u_i with u_i^d in the master equation (2), where d is an external parameter to be optimized by inspection methods according to a loss function L;
3. store par(d) = θ*;
4. compute the loss function L(d).

Refinements and the loss function depend on the inference task.

The third feature of the epidemic trends we want to manage is the possible non homogeneity of the underlying stochastic process. In this paper we come to the more circumscribed task of discovering a change of phase between two dynamics: an initial one that we denote as unintentional and a second one that we denote as intentional. Our approach aims to capture a pseudo-equilibrium distribution of the process, i.e. a description of the observed data in terms of a law encompassing their frequencies. Hence our strategy consists of: i) focusing on two large families of processes, respectively without memory and with memory, ii) concatenating them in a proper way, and iii) studying some of the main properties of the resulting process iv) through a general purpose distribution law v) endowed with a relatively large number of free parameters, vi) to be identified through both well devised statistics and numerical methods.

In very essential terms, we speak of memory if we have a direction along which to order the events. Now, for any ordered variable T, such that events on its sorted values are of interest to us, a master equation (8) holds relating the probability of exceeding a time t to the probability of exceeding an earlier time q. What is generally the target of the memory divide in stochastic processes is the time t − k elapsing between two events. In this perspective, the template of the memoryless phenomena descriptor is the (homogeneous) Poisson process, whose basic property is P(T > t) = P(T > q) P(T > t − q), if t > q.
It says that if a random event (for instance a hard disk failure) did not occur before time q and you ask what will happen within time t, you must forget about this former situation (it means that the disk became neither more robust nor weaker), since your true question concerns whether or not the event will occur within a time t − q. Hence your local variable is T − q, and the above property is satisfied by the (negative) exponential distribution law with P(T > t) = e^{−λt} for constant λ > 0, since with this law (8) reads

  e^{−λt} = e^{−λq} · e^{−λ(t−q)}.

On the contrary, we introduce a memory of the past (q-long) if we cannot separate T − q from q. In this paper we consider very simple cases of this dependence. The simplest solution of (8) is then represented by the CCDF P(T > t) = (k/t)^α for t ≥ k, so that the master equation reads

  P(T > t) = P(T > q) (q/t)^α.

Note that this distribution, commonly called the Pareto distribution, is defined only for t ≥ k, with k > 0 denoting the true time origin and α identifying the distribution through the scale of its logarithm. The main difference w.r.t. the exponential distribution is highlighted by the LogLog plots of their CCDFs (complement to 1 of the CDF), denoted as F̄_T in Fig. 4: a straight line segment for the Pareto curve (see picture (a)) in contrast to a more than linearly decreasing curve for the exponential distribution (Fig. 4(b)). A first operational consequence is that, for the same mean value of the variable, we may expect its occurrence at a more delayed time if we maintain memory of it as a target to be achieved (getting a Pareto distribution), rather than relying on chance (getting an exponential distribution).

We relate the above temporal evolutions to companion space evolutions, represented by Brownian motion and Lévy flights, respectively. In fact, it is well known that at sufficiently low densities the distribution of times between successive collisions of a molecule in a fluid is approximately exponential, while its trajectory follows a Brownian motion [10]. Analogously, experimental studies show a Pareto distribution reckoning the time intervals between changes of direction in the Lévy flights that we see in nature. This occurs, for instance, with albatrosses [16] in search of food.

For the Covid19 pandemic, we consider the following wait and chase model that concatenates the two dynamics. It is iconically described by the course of a dodgem car at an amusement park, but may be applied to virus activity as well. Assume you are playing with dodgem cars. You drive around until, from time to time, you decide to bang into a given car which is unaware of your intent. Thus, initial bumps occur by chance, as with molecules in a gas, while the second kind of bumps are intentional, i.e. directed toward the target, albeit disturbed by occasional diversions from the impact trajectory. We may assume the trajectory of each car to be a plane Brownian motion before the triggering of the chase. The change of dynamics during the chase derives from the fact that one dimension of your car's motion is now represented exactly by the line connecting your car and the target car. In a previous paper [5] we show that the distribution of the random time T elapsed between two subsequent bumps of the dodgem cars (the intercontact time) has the expression in (13), with a, c > 0 and b ≥ 0, whose template shape is reported in Figure 5; the knee of this curve marks the transition between the two phases of the process, and analytically its abscissa is close to b^{1/a}. Letting dodgem cars play the role of viruses, we are mainly interested in the evolution of the cumulative number N(t) of contacts (as primers of death) with elapsing time t.
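To give a concrete feel for the wait and chase dynamics, the following toy simulation (our own sketch, not the paper's algorithm; the switch time and parameter values are illustrative assumptions) draws inter-contact times from an exponential law in the first phase and from a Pareto law in the second, and accumulates the contact count N(t). On a LogLog plot of the empirical CCDF of the waits, the two phases would show the curved exponential decay versus the straight Pareto segment discussed around Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_contacts(t_switch=30.0, t_end=200.0, lam=1.5, k=1.0, alpha=1.3):
    """Return (times, counts): cumulative number of contacts over time.

    Phase 1 (t < t_switch): memoryless waits, T ~ Exponential(rate=lam).
    Phase 2 (t >= t_switch): heavy-tailed waits, T ~ Pareto with CCDF (k/t)**alpha, t >= k.
    """
    t, n, times, counts = 0.0, 0, [0.0], [0]
    while t < t_end:
        if t < t_switch:
            wait = rng.exponential(1.0 / lam)
        else:
            wait = k * (1.0 - rng.uniform()) ** (-1.0 / alpha)   # inverse-CDF sampling of Pareto
        t += wait
        n += 1
        times.append(t)
        counts.append(n)
    return np.array(times), np.array(counts)

times, counts = simulate_contacts()
print("contacts by the switch:", counts[times <= 30.0][-1], "| total contacts:", counts[-1])
```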
For a large number of cars and T following the negative exponential distribution law of the first phase, at each t this variable follows a Poisson distribution law with parameter µ(t) = λt, where the former parameter coincides with the expected value E[N(t)] of N(t) and λ is the constant inter-contact rate. Passing to the Pareto distribution of the second phase, the additional count N′(t) still behaves as a Poisson variable, but the related λ now decreases proportionally to 1/t, so that µ(t) ∝ ∫_{t₀}^{t} λ(τ) dτ = c log t, for a proper c, increases less than linearly with t. Actually, recognizing this function to be the limit of the integral of λ(τ) = c/τ^ε when ε goes to 1, we integrate this function to get µ(t) still as a power of t. Hiding the explicit dependence on t, at the i-th observation N_i is a function of two random variables: the inter-contact time T and the Poisson spread of N_i around its mean µ(T). By invoking a locally loose ergodicity of the process in the thread of [32], we approximately equate µ(T) to the local mean along the temporal path, so that µ(T) constitutes an interpolation Ñ of N. In this way, in the first phase the CDF of Ñ has the same shape as the CDF of T, while in the second phase this CDF is well approximated by a Pareto distribution.

We may consider our estimation problem in terms of drawing a regression curve through the set of pairs (n_i, F̄_N(n_i)), coupling the observed prevalences with the ECCDF computed on them. According to our model, the regression curve depends on the three parameters a, b, c of (13), which we want to estimate with the Algorithmic Inference tools. This requires establishing:

1. Sampling mechanism. As stated in the previous section, the expression of F̄_N(n) is the same as (13) with T and t replaced by N and n, respectively. Hence, by solving the equation F_N(n) = u in n, we obtain the sampling mechanism (14).
2. Relevant statistics. Denoting by n_(i) the i-th element of the sorted sample of N and by med the quantity ⌊(m+1)/2⌋, we adopt three statistics s_1, s_2 and s_3. They almost completely fulfill the well-behavingness requirements. Indeed, thanks to the explaining function in (14), we obtain the master equations (16)-(18). Since (a, b, c) ≥ 0, from these expressions we see that both s_1 and s_3 are monotonic in the three parameters, thus allowing for unique solutions for whatever seeds {u_1, …, u_m}. The dependencies of s_2 are less univocal: singularly, the two addends show the same monotonic trends with the parameters, but their difference may give rise to non unique solutions.

We solve these master equations in the parameters in correspondence with a large set of randomly drawn seeds {u_1, …, u_m}. In this way we obtain a sample of fitting curves, as in Fig. 7, which we statistically interpret to be compatible with the observed data. In the figure we also report the 0.90 confidence region for these curves, which we obtain through a standard peeling method [4]. To complete the procedure we introduce the correlation lever of Sect. 3.2. This accounts for fitting the data ECDF with the function in (19). With known d nothing changes in the above statistical procedures, apart from the new seed generation, since the explaining function now reads as in (20). This requires replacing u_i with u_i^d in formulas (16)-(18).
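The paper's specific rule (4) linking the lever position d to the induced autocorrelation is not reproduced here. As a stand-in, the sketch below generates correlated seeds with uniform marginals through a Gaussian AR(1) copula (a swapped-in technique, clearly not the paper's U^d generator) and feeds them to a simple explaining function, illustrating the general point made above: correlation injected at the seed level propagates, with the same sign, to the generated sample.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def correlated_uniform_seeds(m, rho):
    """Correlated seeds with uniform marginals via a Gaussian AR(1) copula (stand-in for the lever)."""
    z = np.empty(m)
    z[0] = rng.standard_normal()
    for i in range(1, m):
        z[i] = rho * z[i - 1] + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
    return norm.cdf(z)                      # uniform marginals, autocorrelated sequence

# Feed the correlated seeds to the exponential explaining function x = -log(1-u)/lam.
lam, m = 0.5, 500
u = correlated_uniform_seeds(m, rho=0.8)    # positive rho ~ pushing the lever forward
x = -np.log(1.0 - u) / lam
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(f"lag-1 autocorrelation of the generated sample: {lag1:.2f}")
```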
The latest advances in this research have been fostered by the aim of exploiting data on the current Covid19 pandemic. Statistics on pandemic evolution are relatively abundant, but their quality is not always satisfactory. This stems from the absence of shared protocols with which they are collected and from some political biases as well. Therefore we decided to narrow the focus solely to the numbers of deaths, which, while systematically underestimated in the official data, are less affected by variations in data collection procedures. To recall this option, henceforth we will refer to an infection, rather than an epidemic, as the phenomenon generating these numbers. No ancillary data have been formally taken into account; rather, we considered two datasets that proved quite familiar to us:

1. deaths from COVID19 in 13 regions of Italy in the period February-December 2020, for obvious reasons given the author's nationality;
2. deaths in Senegal in the same period, due to a special connection with epidemiologists in that country.

We considered the daily cumulative deaths (prevalence) referring to the 13 most infected regions in Italy, as they have been reported from the start on GitHub [22], with the exclusion of Sicily, which initially appeared less hit. Table 1 lists the names of these regions, jointly with the related prevalence at the end of the observation period. In Figure 8 (left) we report the resulting incidence course in those regions. The gray sections identify the two infection waves that characterized the phenomenon up to the end of 2020 (a third wave was emerging at the time of writing). However, numerical considerations will be carried out also on the remaining data for comparison's sake.

The core of our inference is the estimation of the CDF of the random variable representing the death prevalence along the infection time. From this distribution we derive the target features. We allow for both directions of the correlation lever, i.e. opposite d's, with the prevailing one emerging from the data when we use the explaining function (20). Those values are computed according to Algorithm 2, with a loss function based on a proper distance between the experimental CDF and the computed one. As a whole, replicas of the four parameter estimates are computed with the mentioned nesting of Algorithm 2 into Algorithm 1; meanwhile some replicas are discarded because they do not pass a consistency check (due to a bad solution of equations (16)-(18)). In the pictures of Fig. 9 we report these curves in gray for three template regions. In the same pictures we report in thick black the curves parametrized with the medians of the parameters of the gray curves. Finally, the thick red lines show the related ECCDFs. From left to right, the pictures refer to regions whose data are progressively less compliant with our model. In particular Lombardia, jointly with Piemonte and Toscana, is affected by anomalous epidemic trends that are understood only in part by scientists. Though the gray curves include the red ones, the latter show a greater slope toward the end as a consequence of the sample truncation effect.

The pivot of these curves is the mentioned knee vertex of the distribution. We call it the turning point and assign it exactly the abscissa b^{1/a}. From the turning point we derive the target features as follows (a numerical sketch of this derivation is given after the list):

• Inflection point. We equate it to the turning point.
• Outbreak size. We exploit the linearity of the LogLogPlot of the Pareto CCDF to derive this value at the crossing between the linear interpolation of the CCDF after the turning point, in the above representation, and a horizontal threshold at height 2/outbreak size.
• End-of-infection. We compute it as the day on which the outbreak size is reached. It entails that we expect 2 further deaths after this time.
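The following minimal sketch illustrates the geometry behind the last two bullets under illustrative fitted values (assumptions of ours, not estimates from the paper): beyond the turning point the CCDF is a straight line on the LogLog plot, F̄(n) ≈ C·n^(−α); the outbreak size N* is the abscissa where this line meets the level 2/N*, i.e. the solution of C·n^(−α) = 2/n, and the end-of-infection day is the day on which the interpolated prevalence reaches N*.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative linear fit of the LogLog CCDF beyond the turning point:
# log F̄(n) ≈ log C − alpha·log n, i.e. F̄(n) ≈ C · n**(−alpha).
C, alpha = 5.0e3, 1.8        # hypothetical values, not fitted to real data

def line_minus_threshold(n):
    """Crossing condition between the Pareto line and the level 2/n."""
    return C * n ** (-alpha) - 2.0 / n

n_star = brentq(line_minus_threshold, 10.0, 1.0e7)
print(f"outbreak size ~ {n_star:,.0f} deaths")

def prevalence_curve(day, size=n_star, mid=60.0, rate=0.08):
    """Hypothetical smooth prevalence interpolation (logistic placeholder for the fitted curve)."""
    return size / (1.0 + np.exp(-rate * (day - mid)))

# End-of-infection: first day on which the interpolated prevalence is within 2 deaths of N*.
end_day = next(d for d in range(1, 2000) if prevalence_curve(d) >= n_star - 2)
print("end-of-infection around day", end_day)
```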
Declaring the end of the infection when only two further deaths are expected is a rather loose condition for epidemiologists, who generally wait for a certain number of days without new cases, but it is close enough. Table 2 details the values for all regions and the related statistics. In particular, the first column reports the numbers of data that have been processed to compute these values; they coincide with the lengths of the first traits of the prevalence curves. As for the inflection point, its location depends on the smoothing method we adopt for the prevalence graphs and on the policy we choose for placing it in the almost rectilinear segment separating the opposite curvatures. According to the above discussion, for representation purposes we locate the end-of-infection point at height 1.5. Overall, the three regions show the same graduation of model compliance mentioned in regard to Figure 9. As a whole, our forecasting achieves root-mean-square errors on the target features as in Table 3. In spite of the long term forecasting and the absence of any ancillary data, they are on the order of 10%, 3% and 10% of the means of the predicted values, respectively.

In early September, a second COVID19 wave hit Italy. We repeated the above procedure on the new data from September to December 2020 and found that the two-phase process works relatively well for some regions, while with others it suffers from some well known anomalies. The second traits of the curves in Figure 8(b) are not covered by the two-phase model. Rather, they denote a faster decrease of the incidence than found with the Pareto trait. Namely, by analyzing with the same approach the ECCDF of these data we get an incidence R that we approximately handle with the CDF in (21). Albeit a very early modeling, with this distribution we recover with an acceptable approximation the ECDF of the death incidence in the various regions, as shown in Fig. 12, with shifts at the right end of the graphs as a consequence of the infinite mean of the distribution law (21). This trend helps on the one hand to explain the shift between forecasted and actual end-of-infection times; on the other hand, it highlights a rarefaction phenomenon where, besides their reduced number, inter-contacts possibly occur when either the virus has partly exhausted its infection power and/or the disease treatment has improved. The tail of the second wave is less tractable, possibly representing the start of a third infection wave.

Table 4 is analogous to Table 2 for Senegal. Actually, the data concern mainly the capital Dakar, where containment measures are hardly comparable with those that were put in place in Italy. Apart from the vagueness of these considerations, from the picture we may determine that the two-phase model appears to be sufficiently suitable also in this instance.

(Figure: Senegal data graphical synthesis. Same notation as Figures 8, 9 and 10, respectively.)

Forecasting Covid19 evolution is a formidable challenge, both for a profitably immediate exploitation of the results it produces and for the theoretical problems it raises. In this paper we pursued our objective of getting long term forecasts of the outbreak size, the inflection point and the end-of-infection time which are accurate for most regions, while identifying some problematic ones whose evolution is scarcely covered by our model. We avoid comparing our results with those of the various agnostic regression methods, such as those mentioned in Section 2, because the underlying models are not interpretable and are generally based on a great number of parameters (e.g. the neural network connection weights), with the consequence of producing good short term forecasts of the outbreak curves (about 20 days) but no direct evaluation of the above target features.
Rather, among the various modeling proposals [17, 40], an objective analogous to ours has been pursued in [43], where a nonlinear regression model with meaningful parameters has been employed. From the inferred parameters the authors compute the inflection point and the outbreak size on Italy data with an accuracy higher than ours; however, their figures are based on fitted data (with no prediction), exploit data of other countries through a mixed effects model and refer to the pandemic in the country as a whole (not to the individual regions). Our approach is an unprecedented one, since we model a univariate pseudo-distribution law of the death prevalence, whose parameters are related to a physical model ruling the contagion process along two specific phases. In this way, we too come to a regression problem which concerns, however, the shape of the CDF of this distribution. Identifying it leads, as a fringe benefit, to the estimation of the target features. Hence we test the methodological value of our approach exactly on this regression task, having as competitors two sets of distributions: LogNormal [41], Weibull [27] and Extended Weibull [40], which have been used in the literature for this task, and Pareto, Tapered Pareto [25] and Truncated Pareto [5], which are in the same family as ours.

To compare their performances, we use three well-known statistics, namely the Maximum Likelihood (ML), the Akaike Criterion and the Cramer-von Mises (CvM) test. In very essential terms, ML expresses the fitness of the distribution in terms of the product of the density functions of the recorded prevalences, optimized versus the free parameters of the distribution. The Akaike Criterion is a balanced sum of the above fitness and a second term penalizing the distribution complexity [2]. The CvM test bases the acceptance of a given distribution law as the source of the observed data (the null hypothesis) on the mean square difference between the data ECDF and the hypothesized distribution CDF [13]. We progressively compute these statistics on the first wave of the 13 regions, then on the same plus the performances in the second wave, and finally on the same with the addition of the Senegal data. Thus, in the cells of Table 5 we report the fraction of instances on which the distribution in the row proved to be the winner with respect to the first two statistics in the columns, and the fraction of instances not rejected by the CvM test (with significance level 0.05) in the third one. Though the additional benefits of the Algorithmic Inference approach [6] are not evidenced by these conventional criteria, the ML column in the first window clearly indicates the outperformance of our method, with Truncated Pareto as the sole competitor. Moreover, the winning fraction of our method becomes 1 if we remove the three anomalous regions (Lombardia, Piemonte and Toscana) from the reckoning. The prevailing score is smoothed in the second column, since the Akaike Criterion penalizes the higher number (4 in place of 3) of free parameters of our distribution. Finally, the CvM test highlights the inadequacy of the simplest models (Exponential and Pareto), and of NFEWeibull too, in coping with the complex process under examination. This trend is reversed along the other windows, clearly denoting the selectivity of our model, which is not a general purpose one.
Namely, while the Senegal data are well covered by our method, as mentioned in Section 5.1.2, the second epidemic wave does not fully comply with our two-phase process. This is particularly evident in some regions, while in others it shows up anyway through some drifts. This turns the ML and Akaike criteria in favor of the Truncated Pareto in most cases, as a distribution that is more adaptable in the absence of a particular underlying model. The CvM pattern remains substantially unvaried along the windows.

Besides these operational achievements, this paper introduces an unprecedented way of dealing with non-iid samples. The iconic image in Figure 3 recalls this novelty and its limits as well. Moving our imaginary lever forward we introduce a positive autocorrelation on the data generated through the universal sampling mechanism, and a symmetric effect when moving it backward. These are specific effects that we analyze without issuing general theorems, albeit framing the analysis in a robust theoretical framework. Rather, we propose them as an approximation tool to capture trends of complex processes, as a primer for further investigations.
References
Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)
A new look at the statistical model identification
A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis
The Puzzle of Granular Computing
Mobility timing for agent communities, a cue for advanced connectionist systems
Algorithmic Inference in Machine Learning
A study on the quality of novel coronavirus (COVID-19) official datasets
Visualization and machine learning for forecasting of COVID-19 in Senegal
Application of the ARIMA model on the COVID-2019 epidemic dataset
The collision rate in a dilute classical gas
Introduction to the Queueing Theory
Truncated and Censored Samples
The exact and asymptotic distributions of Cramer-von Mises statistics
From discrete to continuous evolution models: A unifying approach to drift-diffusion and replicator dynamics
Patterns of neuronal migration in the embryonic cortex
Revisiting Lévy flight search patterns of wandering albatrosses, bumblebees and deer
Forecasting COVID-19 dynamics and endpoint in Bangladesh: A data-driven approach (medRxiv)
Estimation of the basic reproduction number for infectious diseases from age-stratified serological survey data
Final and peak epidemic sizes for SEIR models with quarantine and isolation
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Estimation of the parameters in a truncated normal distribution
Modeling infectious disease parameters based on serological and social contact data: A modern statistical perspective
Statistical analysis of correlated data using generalized estimating equations: an orientation
Estimation of the upper cutoff parameter for the tapered Pareto distribution
The multicenter AIDS cohort study: rationale, organization, and selected characteristics of the participants
Age-specific incidence and prevalence: A statistical perspective
Introduction to Survival Analysis
Modeling the dynamics of COVID19 spread during and after social distancing: interpreting prolonged infection plateaus (medRxiv)
Efficient truncated statistics with unknown truncation
Estimation of the parameters in a truncated normal distribution
Understanding the simulation of mobility models with palm calculus
Table of the gaussian tail functions; when the tail is larger than the body
Longitudinal data analysis using generalized linear models
A unified approach to mixed linear models
Convolutional neural networks and temporal CNNs for COVID-19 forecasting in France
Age-specific incidence and prevalence: A statistical perspective
Prediction of inflection point and outbreak size of COVID-19 in new epicentres
Infectious diseases of humans
Comparison of COVID-19 pandemic dynamics in Asian countries with statistical modeling
A bimodal lognormal distribution model for the prediction of COVID-19 deaths
Using epidemic prevalence data to jointly estimate reproduction and removal
Deep learning methods for forecasting COVID-19 time-series data: A comparative study