title: Temporal evolution of the Covid19 pandemic reproduction number: Estimations from proximal optimization to Monte Carlo sampling
authors: Abry, Patrice; Fort, Gersende; Pascal, Barbara; Pustelnik, Nelly
date: 2022-02-11

Abstract. Monitoring the evolution of the Covid19 pandemic constitutes a critical step in sanitary policy design. Yet, the assessment of the pandemic intensity within the pandemic period remains a challenging task, because of the limited quality of the data made available by public health authorities (missing data, outliers and pseudo-seasonalities, notably), which calls for cumbersome and ad-hoc preprocessing (denoising) prior to estimation. Recently, the estimation of the reproduction number, a measure of the pandemic intensity, was formulated as an inverse problem, combining data-model fidelity and space-time regularity constraints, solved by nonsmooth convex proximal minimizations. Though promising, that formulation lacks both robustness against the limited quality of the Covid19 data and confidence assessment. The present work aims to address both limitations: First, it discusses solutions to produce a robust assessment of the pandemic intensity by accounting for the low quality of the data directly within the inverse problem formulation. Second, exploiting a Bayesian interpretation of the inverse problem formulation, it devises a Monte Carlo sampling strategy, tailored to a nonsmooth log-concave a posteriori distribution, to produce relevant credibility interval-based estimates for the Covid19 reproduction number.

Clinical relevance. Applied to daily counts of new infections made publicly available by the Health Authorities for around 200 countries, the proposed procedures permit robust assessments of the time evolution of the Covid19 pandemic intensity, updated automatically and on a daily basis.

Context. The online and daily surveillance of the Covid19 pandemic intensity has become a critical societal stake and constitutes a key preliminary step in the implementation of any counter-measures by public authorities. The evolution of the pandemic is usually assessed from epidemiological models fed by daily counts of new infections or death cases, the core data of any pandemic surveillance strategy. At the outbreak of the Covid19 pandemic, facing the urgent need for data to monitor its evolution, significant efforts were devoted by most national public health authorities to collect such data and to make them publicly available. However, because of the emergency and sanitary crisis contexts, the available data were of low quality, strongly corrupted by missing samples, outliers and pseudo-seasonalities. More surprisingly, after two years of pandemic, the data collected by most countries remain of very limited quality. That low quality of the available data, combined with the need for online and regular (ideally daily) monitoring, turns the assessment of the evolution of the pandemic intensity into a far more difficult task than when it is performed once the pandemic is over and with consolidated data. Further, assessing the confidence that can be granted to such estimates constitutes another critical and difficult challenge. The online, daily updated, within-pandemic estimation of the pandemic intensity, via credibility intervals and from limited-quality data, thus constitutes the core issue of this work.
Pandemic surveillance can be achieved with a large variety of tools, different in nature [1]. Estimating the pandemic evolution retrospectively, once it is over and after the data have been post-processed, is usually performed with compartmental models [2], [3]. These, however, suffer from heavy computational costs and show limited robustness against the low quality of the Covid19 data. Instead, the pandemic intensity can be measured by the reproduction number, R, which quantifies the number of secondary infections stemming from a single primary infection (cf. e.g., [4], [5], [6], [7]). It has recently been proposed that relevant estimates of the time evolution of the reproduction number can be obtained from nonsmooth convex optimization procedures [8], with the functional to minimize built from a pandemic model [7]. Attempts to construct credibility intervals from a Bayesian interpretation of that model, complemented with Monte Carlo sampling schemes, were recently reported in [9] (see also [10]). Though these tools deliver epidemiologically realistic assessments of the temporal evolution of the pandemic intensity, there remains a significant need to increase their robustness against the limited quality of the Covid19 data.

Goals, contributions and outline. Elaborating on [9], [11], the goal of the present work is to further improve estimation robustness against the limited quality of the Covid19 data. To that end, the reproduction number-based epidemiological model of [7] is recalled in Section II-A. Section II-B recalls the model-based and regularized inverse problem formulations for the estimation of R. Sections II-C and II-D detail how the regularized inverse problem formulations can be modified to bring robustness against the limited quality of the Covid19 data, while remaining close to the original epidemiological model. Section II-E details the construction of credibility interval-based estimates of R, relying on an original Markov chain Monte Carlo sampler, referred to as the Metropolis Adjusted Proximal-Gradient Algorithm, which refines classical Metropolis Adjusted Langevin procedures. Using the real Covid19 data described in Section III, Section IV discusses the performance of the proposed estimation procedures for different countries.

The pandemic model developed in [7] and used here assumes that the count of daily new infections at time t, Z_t, is drawn from a Poisson distribution, conditionally to past counts Z_{1:t-1} := {Z_1, . . . , Z_{t-1}}. It further postulates that the Poisson parameter, p_t, varies along time and depends on the past counts Z_{1:t-1}, on the causal serial interval function Φ and on the reproduction number at time t, R_t:

p_t = R_t ∑_{s=1}^{t-1} Φ_s Z_{t-s}.   (1)

The serial interval function Φ models the main epidemic evolution mechanism: the random delays between the onsets of symptoms in primary and secondary cases [7], [6], [12], [2]. For the Covid19 pandemic, and for earlier pandemics of the same type, it was shown that Φ can be approximated by a Gamma distribution, with shape and rate parameters corresponding to a mean of 6.6 days and a standard deviation of 3.5 days, indicating a high risk of infecting other persons from 3 to 10 days after symptoms have appeared [13], [14], [15].
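To make Eq. (1) concrete, the following minimal Python sketch (not part of the original paper; the function names and the truncation of Φ at 25 days are illustrative choices) discretizes the Gamma serial interval with mean 6.6 and standard deviation 3.5 days and evaluates the Poisson intensity p_t:

```python
import numpy as np
from scipy.stats import gamma

# Serial interval Phi: Gamma with mean 6.6 days and std 3.5 days (Section II-A).
MEAN, STD = 6.6, 3.5
SHAPE = (MEAN / STD) ** 2       # from mean = shape * scale
SCALE = STD ** 2 / MEAN         # from var  = shape * scale**2

def serial_interval(t_max=25):
    """Discretize the Gamma density over days 1..t_max via CDF increments."""
    edges = np.arange(t_max + 1)
    phi = np.diff(gamma.cdf(edges, a=SHAPE, scale=SCALE))
    return phi / phi.sum()      # renormalize the truncated tail (a choice)

def poisson_intensity(R, Z, phi):
    """Eq. (1): p_t = R_t * sum_{s>=1} phi_s Z_{t-s}; p_0 is left at 0."""
    T = len(Z)
    p = np.zeros(T)
    for t in range(1, T):
        s = np.arange(1, min(t, len(phi)) + 1)
        p[t] = R[t] * np.dot(phi[s - 1], Z[t - s])
    return p
```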
Maximum Likelihood estimation. A natural estimation strategy is based on maximizing the log-likelihood of the data. Because of the Poisson distribution assumption in the model above, the negative log-likelihood (also referred to as the data fidelity term) is, up to terms independent of R, a sum in time of standard Kullback-Leibler divergences between the counts Z_t and the intensities p_t:

L(R | Z) := ∑_t ( p_t - Z_t ln p_t ),  with p_t = R_t ∑_{s=1}^{t-1} Φ_s Z_{t-s}.

Explicit calculations yield a simple closed-form expression for the maximum likelihood estimate:

R^(0)_t = Z_t / ∑_{s=1}^{t-1} Φ_s Z_{t-s}.

Both because of the low quality of the data and because the problem is ill-conditioned (a single new observation Z_t is available to estimate each daily R_t), R^(0) turns out to be extremely irregular along time (cf. Fig. 1, second row) and is thus useless for epidemic surveillance.

Penalized Maximum Likelihood estimation. To favor temporal regularity in the estimate of R, it was proposed in [8] to complement the data fidelity term L(R | Z) with a regularization term based on the ℓ1-norm of the Laplacian D_2 R of R:

R^(1) := argmin_R L(R | Z) + λ_R ‖D_2 R‖_1,   (2)

with λ_R > 0 a regularization hyperparameter balancing the regularization term against the data fidelity term. The use of the Laplacian operator favors a piecewise-linear estimate of R, while the use of the ℓ1-norm imposes sparsity on the locations where changes in the second derivative actually occur. While yielding more realistic estimates than R^(0), R^(1) still lacks significant robustness against the low quality of the data, cf. Fig. 1 and [11].

A classical approach to address the low quality of the data would consist in first performing a data preprocessing or denoising step, followed, second, by the estimation of R. This, however, requires constructing a detailed model of the data corruption (missing data, outliers and pseudo-seasonalities, notably). This is a tedious task, as such a model is likely to be specific to each country, and even likely to vary, for one same country, across the different stages of the pandemic (cf. [11]). Instead, it was proposed in [11] to perform both data denoising and reproduction number estimation within a single step. The leading thread is to modify the functional form in (2) to account for outliers, while staying as close as possible to the pandemic model in [7]. The only assumption on data corruption is that it can be modeled as sparse outliers O_t, i.e., isolated rather than runs of successive irrelevant values, whose unknown values need to be estimated in addition to R_t. Ideally, the epidemic model in [7] should hence be modified to state that, conditionally to past counts Z_{1:t-1} and outliers O_{1:t-1}, the denoised infection counts Z_t - O_t follow a Poisson distribution. This, however, results in a functional that is not jointly convex in (R, O), which impairs fast and robust minimization [11]. To preserve convexity, it has instead been proposed in [11] to weaken the model into: conditionally to past counts and outliers (Z_{1:t-1}, O_{1:t-1}), Z_t follows a Poisson distribution with nonstationary parameter

p_t = R_t ∑_{s=1}^{t-1} Φ_s Z_{t-s} + O_t.

This leads to estimating R and O as

(R^(2), O^(2)) := argmin_{R,O} L(R, O | Z) + λ_R ‖D_2 R‖_1 + λ_O ‖O‖_1 + ι_{≥0}(R),   (3)

with λ_R > 0 and λ_O > 0 regularization hyperparameters, balancing the strengths of the different constraints against one another and against the data fidelity term. The regularization ‖O‖_1 favors sparsity in the outliers, and the {0, +∞}-valued indicator function ι_{≥0}(R) ensures the non-negativity of R^(2). While robust to outliers [11], the approximation leading to the criterion in (3) is likely to induce a bias in the estimation of R, leading to the refinement proposed here.

To remove the bias described above, and hence to further improve estimation, it is proposed here to assume that, conditionally to Z_{1:t-1} and O_{1:t-1}, the denoised counts Z_t - O_t follow a Poisson distribution with parameter

p_t = R_t ∑_{s=1}^{t-1} Φ_s (Z_{t-s} - O_{t-s}),   (4)

thus leading to

R^(3) := argmin_R L(R | Z^(D)) + λ_R ‖D_2 R‖_1 + ι_{≥0}(R).   (5)

In other words, starting from the daily new infection counts Z, the minimization in Eq. (3) is first applied to obtain O^(2); the minimization in Eq. (5) is then applied to the denoised counts Z^(D) := Z - O^(2).
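The competing terms of these criteria can be made explicit with a minimal numerical sketch; it only evaluates the functionals (the actual minimization in [8], [11] relies on proximal convex optimization algorithms not reproduced here), and the variable names and numerical safeguards are illustrative assumptions:

```python
import numpy as np

def ml_estimate(Z, phi_conv):
    """Closed-form ML estimate R^(0)_t = Z_t / (Phi*Z)_t,
    with phi_conv[t] = sum_s phi_s Z_{t-s} precomputed."""
    return np.where(phi_conv > 0, Z / np.maximum(phi_conv, 1e-12), 0.0)

def data_fidelity(Z, p):
    """Poisson negative log-likelihood up to terms independent of (R, O):
    sum_t  p_t - Z_t ln p_t."""
    p = np.maximum(p, 1e-12)    # guard against log(0); illustrative choice
    return np.sum(p - Z * np.log(p))

def criterion_eq3(R, O, Z, phi_conv, lam_R, lam_O):
    """Convex criterion of Eq. (3): relaxed intensity p_t = R_t (Phi*Z)_t + O_t,
    l1 penalty on the Laplacian of R, l1 penalty on O, and R >= 0."""
    if np.any(R < 0):
        return np.inf                          # indicator function iota_{>=0}(R)
    p = R * phi_conv + O
    lap = np.diff(R, n=2)                      # discrete Laplacian D_2 R
    return (data_fidelity(Z, p)
            + lam_R * np.abs(lap).sum()
            + lam_O * np.abs(O).sum())
```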
The above penalized Maximum Likelihood procedure provides an optimization-based point estimate of R, yet without any confidence assessment. To complement it, an estimation by means of credibility intervals is also proposed in this work. It adopts a Bayesian, hence stochastic, perspective on Eq. (5) and assumes that R is a random vector, with a posterior density π(R) written as [9]:

π(R) ∝ exp( -L(R | Z^(D)) - λ_R ‖D_2 R‖_1 - ι_{≥0}(R) ).   (6)

To produce an estimation of R by means of credibility intervals, one resorts to Monte Carlo schemes producing a sequence of samples {R^n}_{n≥0} that approximates π. The most classical ones are referred to as Metropolis samplers, and combine two steps, Proposition and Accept/Reject, as sketched in Algorithm 1. When -ln π is a smooth convex function, the proposition step relies on Langevin dynamics [16], [17], whose key idea is to drive the proposition with the gradient ∇ of ln π, as

µ(R^n) := R^n + γ Γ ∇ ln π(R^n),   (7)

and to perturb it with an additive correlated Gaussian noise √(2γΓ) ε^{n+1}, where ε^{n+1} ∼ N(0_T, Id_T) and γ is a positive step size. The accept/reject step relies on a Metropolis mechanism, cf. Eq. (8). We denote by q(R, R') the density at R' of the Gaussian proposition with mean µ(R) and covariance 2γΓ.

Algorithm 1.
Step 1: R^{n+1/2} := µ(R^n) + √(2γΓ) ε^{n+1}.
Step 2: R^{n+1} = R^{n+1/2} with probability

α(R^n, R^{n+1/2}) := min{ 1, [π(R^{n+1/2}) q(R^{n+1/2}, R^n)] / [π(R^n) q(R^n, R^{n+1/2})] },   (8)

and R^{n+1} = R^n otherwise.

In the case of Eq. (6), because of the regularization terms, -ln π is a convex but nonsmooth function. To preserve the key intuition of the Langevin scheme (7), the gradient step is replaced with a proximal-gradient step, a suitable extension to nonsmooth functions. Several developments were conducted along that line [18], [19], [20], [21]. Recently, we proposed in [9] a sampling scheme well suited to the structure and properties of π in Eq. (6). It amounts to writing the drift µ of the Gaussian proposition as a proximal-gradient step, in which the proximal operator of the ℓ1 term is applied after an invertible change of variables: D_o is a T × T invertible matrix obtained by orthogonal complementation of the (T - 2) × T Toeplitz matrix associated with the Laplacian D_2, and Prox_{γλ_R‖·‖_1} reduces to a soft-thresholding operation. Full technical details are provided in [9], [10].
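To fix ideas on Algorithm 1, here is a heavily simplified sketch of one proposition/accept-reject iteration, with Γ = Id and generic callables standing in for the gradient of the smooth part of ln π and for the proximal step; it illustrates the mechanism of Eqs. (7)-(8) but is not the exact D_o-based construction of [9]:

```python
import numpy as np

def soft_threshold(x, tau):
    """prox of tau*||.||_1 (soft thresholding), a building block for the prox step."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def mapga_step(R, log_pi, grad_smooth, prox_nonsmooth, gamma, rng):
    """One Metropolis step in the spirit of Section II-E (sketch, Gamma = Id).

    `grad_smooth`: gradient of the smooth part of ln pi.
    `prox_nonsmooth(x, gamma)`: proximal step on the nonsmooth part (in [9],
    built from soft_threshold and the D_o change of variables).
    `log_pi` should return -inf where iota_{>=0} is violated (auto-reject).
    All three callables are assumptions of this sketch."""
    def drift(r):
        # proximal-gradient step replacing the plain gradient step of Eq. (7)
        return prox_nonsmooth(r + gamma * grad_smooth(r), gamma)

    proposal = drift(R) + np.sqrt(2.0 * gamma) * rng.standard_normal(R.size)

    def log_q(r_to, r_from):
        # log-density of the Gaussian proposition: mean drift(r_from), cov 2*gamma*Id
        return -np.sum((r_to - drift(r_from)) ** 2) / (4.0 * gamma)

    # Metropolis-Hastings acceptance of Eq. (8), computed in log scale
    log_alpha = (log_pi(proposal) - log_pi(R)
                 + log_q(R, proposal) - log_q(proposal, R))
    return proposal if np.log(rng.uniform()) < log_alpha else R
```

A full run would iterate this step from an initialization R^0 and discard a burn-in; credibility intervals then follow from empirical quantiles of the retained samples.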
The Johns Hopkins University has developed and maintains a remarkable Covid19 data repository, https://coronavirus.jhu.edu/, impressively started at the outbreak of the pandemic. It notably collects, on a daily basis, new infection and death counts, as produced by the National Health Authorities of around 200 countries or autonomous territories, and makes them publicly available in a consistent setting. This constitutes an exceptional source of data for Covid19 pandemic monitoring and intensity assessment, as the data are made available in (quasi-)real time and within the pandemic. Because the present work focuses on the reproduction number, only daily new infection counts are used here. Figs. 1 and 2 (top plots) illustrate the time evolution of such counts for several countries.

Outcomes of the proposed estimation procedures are illustrated for a few countries only, for space reasons; the procedures are nevertheless operational for any country. Daily updated estimations are automatically made available at perso.ens-lyon.fr/patrice.abry/ and www.math.univ-toulouse.fr/gfort/. Following [9], [11], the hyperparameters are set to λ_O = 0.05 and λ_R = 3.5 × std(Z)/4, with std(Z) the standard deviation of Z.

Inverse problem estimation. Fig. 1 reports (top row) the raw (Z) and denoised (Z^(D)) daily counts of new infections for the full period of the pandemic. Fig. 1 also compares (bottom row) the different estimates proposed here, R^(0), R^(1), R^(2) and R^(3), leading, for all countries, to the following conclusions. The crude estimator R^(0) yields estimates far too irregular in time to be useful to epidemiologists. The time-regularized estimator R^(1) consists of piecewise-linear estimates of R and thus provides far more regular, hence more realistic, assessments of the evolution of the pandemic intensity. Yet, Fig. 1 also shows that R^(1) lacks robustness against outliers and missing counts in the raw Z. Further, the joint minimization in Eq. (3) provides estimates O^(2) of the outliers together with piecewise-linear estimates R^(2) of R that are robust to irrelevant counts in Z, and thus of far greater interest to epidemiologists. Yet, a careful examination of the approximation made in the outlier modeling to maintain the convexity of the functional suggests a possible overestimation in R^(2). Finally, R^(3), obtained from the denoised counts Z^(D), provides the most relevant and useful estimate of R, with smooth (piecewise-linear) and accurate estimates, notably permitting the detection of the occurrences of changes between pandemic growth and regression phases.

Credibility interval estimation. Fig. 2 focuses on the most recent five weeks of the pandemic. It reports the raw (Z) and denoised (Z^(D)) daily counts of new infections for other countries (top row). It shows (middle row) the estimated median (50% quantile) of the posterior distribution sampled by the strategy described in Section II-E, together with the corresponding centered 95% credibility intervals, obtained from the 2.5% and 97.5% quantiles after subtraction of the 50% quantile (bottom row), leading, for all countries, to the following conclusions. The credibility intervals are extremely narrow (within a few percent) around the median, and relatively homogeneous along time, yet with a mild increase around the piecewise-linearity change points.
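For completeness, a minimal sketch (with illustrative names, assuming the sampler draws are stored as rows of an array) of how the median and centered 95% credibility band reported in Fig. 2 can be computed from the retained samples:

```python
import numpy as np

def credibility_band(samples, level=0.95):
    """Pointwise median and centered credibility band from sampler output.

    `samples`: (n_kept, T) array of R draws (after burn-in). For level=0.95,
    this uses the 2.5%, 50% and 97.5% quantiles, as in the middle and bottom
    rows of Fig. 2 (the band is returned after subtraction of the median)."""
    lo = (1.0 - level) / 2.0
    median = np.quantile(samples, 0.5, axis=0)
    lower = np.quantile(samples, lo, axis=0)
    upper = np.quantile(samples, 1.0 - lo, axis=0)
    return median, lower - median, upper - median
```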
These results show that both the inverse problem formulations and the Metropolis Adjusted Proximal-Gradient sampler proposed here yield extremely realistic estimates of the time evolution of R, which are hence actually usable by epidemiologists. Notably, these estimation tools have a double potential value: retrospectively, they permit quantifying the impact of given sanitary measures on the pandemic evolution; prospectively, the piecewise-linear nature of the estimate of R permits short-term forecasts (nowcasts) of the evolution of the pandemic intensity. Further, sampling strategies for the joint credibility-interval estimation of both the reproduction number R and the outliers O are being devised and compared, with several formulations of convex nonsmooth compliant proposition steps (cf. [10]). Finally, these estimation tools are being made publicly available in a documented toolbox, as a contribution to open science and to the dedication of science to major societal stakes.

References
[1] Describing, modelling and forecasting the spatial and temporal spread of COVID-19 - A short review.
[2] Measurability of the epidemic reproduction number in data-driven contact networks.
[3] Mathematical models in epidemiology.
[4] On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations.
[5] Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission.
[6] The R0 package: A toolbox to estimate reproduction numbers for epidemic outbreaks.
[7] A new framework and software to estimate time-varying reproduction numbers during epidemics.
[8] Spatial and temporal regularization to estimate COVID-19 reproduction number R(t): Promoting piecewise smoothness via convex optimization.
[9] Credibility interval design for Covid19 reproduction number from nonsmooth Langevin-type Monte Carlo sampling.
[10] Covid19 reproduction number: Credibility intervals by blockwise proximal Monte Carlo samplers.
[11] Nonsmooth convex optimization to estimate the Covid-19 reproduction number space-time evolution with robustness against low quality data.
[12] Improved inference of time-varying reproduction numbers during infectious disease outbreaks.
[13] Epidemiological parameters of coronavirus disease 2019: A pooled analysis of publicly reported individual data of 1155 cases from seven countries.
[14] Epidemiological characteristics of COVID-19 cases in Italy and estimates of the reproductive numbers one month into the epidemic.
[15] The impact of a nation-wide lockdown on COVID-19 transmissibility in Italy.
[16] Correlation functions and computer simulations.
[17] Exponential convergence of Langevin distributions and their discrete approximations.
[18] A Shrinkage-Thresholding Metropolis Adjusted Langevin Algorithm for Bayesian Variable Selection.
[19] Efficient Bayesian Computation by Proximal Markov Chain Monte Carlo: When Langevin Meets Moreau.
[20] Analysis of Langevin Monte Carlo via Convex Optimization.
[21] Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm.