title: Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers
authors: Fort, Gersende; Pascal, Barbara; Abry, Patrice; Pustelnik, Nelly
date: 2022-03-17

Monitoring the Covid19 pandemic constitutes a critical societal stake that has received considerable research effort. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, which quantifies the rate of growth of daily new infections. Recently, estimates of the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists, which the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the functional into a Bayesian framework, several sampling schemes are tailored to the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with proximal operators. The performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts is compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools is illustrated on Covid19 data from several different countries.

We propose Proximal-Gradient based extensions of the Langevin Metropolis algorithms: PGdec and PGdual. We establish their ergodicity, and carry out a comparative study.
Using real Covid19 data, made available at the Johns Hopkins University repository and described in Section IV, the performance of up to twelve variations of the sampling strategies is assessed and compared, using well-thought-out indices quantifying their efficiency (cf. Section V). Finally, in Section VI, the relevance of the proposed blockwise Proximal-Gradient samplers is illustrated for several different countries, representative of the evolution of the pandemic across the world, for a recent 5-week period. Daily updates of these credibility interval estimates, as well as Matlab routines for their calculation, are available on the authors' web pages.

Notations. Vectors are column-vectors. For p ≤ q, the vector x_{p:q} concatenates the scalars or vectors {x_i, i = p, . . . , q}. For a matrix A, A^⊤ (resp. det(A) and A^{−1}) denotes the transpose of A (resp. the determinant and the inverse of A). We set A^{−⊤} := (A^⊤)^{−1} = (A^{−1})^⊤. I_p is the p × p identity matrix, and 0_{p×q} is the p × q null matrix. For a vector x ∈ R^p, ‖x‖_1 is the L1-norm and ‖x‖ is the L2-norm. Finally, N_r(µ, C) denotes the R^r-valued Gaussian distribution with expectation µ and covariance matrix C. For some γ > 0, the proximity operator of a proper, convex, lower semi-continuous function f from R^d to ]−∞, +∞] is defined as

(∀x ∈ R^d)  prox_{γf}(x) := argmin_{y ∈ R^d} { γ f(y) + ½ ‖y − x‖² }.

The present work makes use of a pandemic model devised by epidemiologists in [15] that focuses on a main pandemic index, the reproduction number R, to be estimated from daily new infection counts Z. Elaborating on [15], it was further proposed in [37] to account for the limited quality of the intra-pandemic Covid19 data (highly corrupted by irrelevant, missing and mis-reported counts, or by pseudo-seasonal effects) by means of additional outliers O, also unknown and to be estimated. The goal of the present work is thus to estimate, from a vector of T observed new daily infection counts Z := (Z_1, . . .
, Z_T)^⊤ ∈ N^T, the vector of unknowns θ := (R, O). The unknowns θ and the observed data Z are realizations of random vectors Θ and Z, whose distributions need to be specified. We define the a posteriori distribution (which is proportional to the joint distribution by the Bayes theorem) by its density with respect to the Lebesgue measure on (R^+)^T × R^T, involving the product of the likelihood and of the prior over (R_t, O_t), specified below.

Likelihood. The likelihood is modeled as follows: the rate at which a person infected at time t − u generates new infections at time t is R_t Φ_u, where Φ := (Φ_u)_{1≤u≤τ_φ} denotes the serial interval function, describing the average infectiousness profile after infection [15, 29, 49]. Φ is assumed known and, following [27, 41], is classically modeled as a Gamma distribution truncated over τ_φ = 26 days, with mean and standard deviation of 6.6 and 3.5 days. Then, following [15, 37], the pandemic diffusion is modeled as a Poisson distribution: the conditional distribution of Z_t given the past Z_{t−τ_φ:t−1} and (R_t, O_t) is Poisson with intensity p(R_t, O_t | Z_{t−τ_φ:t−1}), which has to be nonnegative; by convention, a Poisson distribution with null intensity is the Dirac mass at zero. The likelihood of the observations Z is hence defined, for any θ in the measurable set D_Z ⊆ (R^+)^T × R^T on which the intensities are nonnegative, by

L(Z|θ) := ∏_{t=1}^T exp(−p(R_t, O_t | Z_{t−τ_φ:t−1})) p(R_t, O_t | Z_{t−τ_φ:t−1})^{Z_t} / Z_t!.   (1)

A key element of that modeling is that the intensity of the Poisson distribution varies along time as

p(R_t, O_t | Z_{t−τ_φ:t−1}) := R_t ∑_{u=1}^{τ_φ} Φ_u Z_{t−u} + O_t.

By convention, all distributions are conditional on the initial values (Z_{1−τ_φ}, . . . , Z_0), omitted from the notations.

Prior distribution. Regarding the prior, it is assumed that R and O are mutually independent. Further, on one hand, the outliers O are assumed independent and distributed as a Laplace distribution (with parameter λ_O > 0), as commonly encountered in the literature (see, e.g., [22, 33, 36]). The decay of the Laplace distribution in the tails favors a few large values among many small ones.
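The Poisson intensity above can be made concrete with a short numerical sketch. The function names are ours, not from the paper, and the Gamma discretization (evaluating the pdf at integer days, then renormalizing over the truncation window) is one reasonable convention among several:

```python
import math
import numpy as np

def serial_interval(tau_phi=26, mean=6.6, std=3.5):
    """Serial interval Phi: a Gamma pdf with the given mean and standard
    deviation, evaluated at days 1..tau_phi and renormalized (truncation)."""
    a = (mean / std) ** 2          # Gamma shape parameter
    s = std ** 2 / mean            # Gamma scale parameter
    t = np.arange(1, tau_phi + 1, dtype=float)
    phi = t ** (a - 1) * np.exp(-t / s) / (math.gamma(a) * s ** a)
    return phi / phi.sum()

def poisson_intensity(R_t, O_t, Z_past, phi):
    """Intensity p_t = R_t * sum_u Phi_u Z_{t-u} + O_t, clipped at 0.
    Z_past holds (Z_{t-1}, ..., Z_{t-tau_phi}), most recent count first."""
    return max(R_t * float(np.dot(phi, Z_past)) + O_t, 0.0)
```

With a flat history of 100 daily cases, R_t = 1 and no outlier, the intensity is 100, as expected from the renormalization of Φ.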
This yields the prior density π_O(O) ∝ exp(−λ_O ‖O‖_1). On the other hand, to model smooth piecewise linear time evolutions for R, or equivalently a sparse set of components where the discrete second order derivative in time of R is nonzero, R is assumed distributed as a Laplace AR(2) process: for every t > 2, the conditional density of R_t given (R_{t−1}, R_{t−2}) is proportional to

exp(−λ_R |R_t − 2 R_{t−1} + R_{t−2}|),   (4)

where λ_R > 0. λ_R, λ_O are (fixed) positive regularization hyperparameters, balancing the strengths of the different penalizations against the likelihood term. This yields the a priori distribution:

π_R(R) ∝ exp(−λ_R ∑_{t=3}^T |R_t − 2 R_{t−1} + R_{t−2}|).   (5)

Posterior distribution. Combining the likelihood (1) and the priors (4) and (5) leads to the a posteriori density with respect to the Lebesgue measure on (R^+)^T × R^T:

π(θ|Z) ∝ exp(−f_Z(θ) − g(θ)) 1_{D_Z}(θ),  with  f_Z(θ) := ∑_{t=1}^T d_KL(Z_t | p(R_t, O_t | Z_{t−τ_φ:t−1})),   (6)

g(θ) := λ_R ‖D_2 R‖_1 + λ_O ‖O‖_1,   (7)

where 1_A denotes the {0, 1}-valued indicator function of the set A, and d_KL denotes the Kullback-Leibler divergence, related to the log-likelihood of a Poisson process, whose definition is, for some z ∈ N,

d_KL(z | p) := z ln(z/p) + p − z  if z > 0 and p > 0,  d_KL(0 | p) := p,  and d_KL(z | p) := +∞ otherwise;

and D_2 is the discrete-time second order derivative (T − 2) × T matrix, with rows of the form (0, . . . , 0, 1, −2, 1, 0, . . . , 0). In the Bayesian approach to Decision Theory, the maximum, the median, and the expectation of the a posteriori distribution are Bayes estimators θ̂ associated to a loss function ℓ:

θ̂(Z) := Argmin_{τ ∈ D_Z} ∫_{D_Z} ℓ(τ, θ) π(θ|Z) dθ;

ℓ is, respectively, the 0−1 loss, the L1-norm and the squared L2-norm (see e.g. [42, Sections 2.3. and 2.5.]). Computing the Maximum a Posteriori (MAP) fits the minimization problem proposed in [37] for the reconstruction of θ:

θ̂_MAP ∈ Argmin_{θ ∈ D_Z} { f_Z(θ) + λ_R ‖D_2 R‖_1 + λ_O ‖O‖_1 }.

This is a non-smooth convex minimization problem encapsulating the transmission process, while favoring piecewise linear behavior of the reproduction number along time and sparsity of the outliers. The minimization is performed with the Chambolle-Pock primal-dual algorithm, which handles both the non-differentiability and the linear operators [7, 10]. Properties of the MAP are established in Proposition 1; its assumption is that there are at least two positive averaged counts. Proof. The first statement is adapted from [37]; the second statement is established in [37].
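Both penalties in the objective above are (weighted) L1 norms, whose proximity operator is the componentwise soft thresholding; it reappears throughout the samplers below. A minimal numpy sketch (function name ours):

```python
import numpy as np

def prox_l1(x, gamma):
    """Proximity operator of gamma * ||.||_1: componentwise soft thresholding,
    i.e. argmin_y gamma*||y||_1 + 0.5*||y - x||^2."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
```

For instance, with threshold 1, the vector (3, −0.5, 1) is mapped to (2, 0, 0): large entries are shrunk, small ones are set exactly to zero, which is the mechanism producing sparse outliers O and sparse second derivatives of R.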
The sign conditions result from a first order expansion of the L1-norm. For a detailed proof, see section VIII in the supplementary material. Proposition 1 implies that the MAP is either unique, or that there are uncountably many MAPs. In addition, it shows that f_Z and g are constant over the set of the minimizers. Thus, following the same lines as in [3], a sufficient condition for the uniqueness of the MAP is derived (see section VIII in the supplementary material).

The expression of the a posteriori distribution (6) is so complex that the distribution is known only up to a normalizing constant. Consequently, the computation of most statistics of π(·|Z) relies on Monte Carlo samplers, which produce points {θ^n, n ≥ 0} in D_Z approximating π(·|Z) (see e.g. [17, section 2.3]): for example, the estimation of the median, and more generally of quantiles of π(·|Z), can rely on the order statistics of the points, and the mean a posteriori can be approximated by the Monte Carlo sum N^{−1} ∑_{n=1}^N θ^n.

The aim is now to devise Monte Carlo sampling strategies for the posterior distribution π defined in (6)-(7). However, this section addresses a broader class of densities, defined on R^d with respect to the Lebesgue measure, and expressed as π(θ) ∝ exp(−F(θ)) 1_D(θ), where F := f + g and f, g, and D satisfy the smoothness and blockwise structure assumptions A1 and A2 defined below.

A. Blockwise structure

A1. f and g are finite on D ⊆ R^d, and f is continuously differentiable on the interior of D.

Additionally, g has a blockwise structure that we aim to exploit in the design of the proposed samplers. This blockwise structure stems both from the decomposition of θ into J blocks (θ_1, . . . , θ_J) ∈ R^{d_1} × . . . × R^{d_J} and from the writing of g as a sum of several functions of the θ_j, possibly combined with a linear operator:

A2. g(θ) = ∑_{j=1}^J ∑_{i=1}^{I_j} g_{i,j}(A_{i,j} θ_j), where each g_{i,j} is proper, convex and lower semi-continuous, and A_{i,j} is a c_{i,j} × d_j matrix. In addition, the proximity operator of g_{i,j} has a closed form expression.
In Bayesian inverse problems, f may stand for the data fidelity term and the non-smooth part g for several penalty terms acting on blocks of the parameter θ. Different splittings of the prior defined by (7) fit Assumption A2.

Example 2. The prior g given by (7) satisfies A2 with J = 2 and I_1 = I_2 = 1: g(θ) = g_{1,1}(A_{1,1} θ_1) + g_{1,2}(A_{1,2} θ_2), where θ_1 := R, θ_2 := O, A_{1,1} := D_2, A_{1,2} := I_T, g_{1,1} := λ_R ‖·‖_1, and g_{1,2} := λ_O ‖·‖_1.

Example 3. The prior g given by (7) satisfies A2 with J = 2, I_1 = 3 and I_2 = 1, where θ_1 := R and θ_2 := O. A_{i,1} collects the rows i, i + 3, i + 6, . . . of the matrix D_2, A_{1,2} := I_T, g_{i,1} := λ_R ‖·‖_1 and g_{1,2} := λ_O ‖·‖_1. This second example follows the block splitting strategies described in [38, 39].

The design of an optimization strategy to minimize F on D and the design of an algorithm targeting the distribution π both rely on the activation of an operator µ : R^d → R^d. To be more specific, when minimizing F, we aim to design a sequence of the form

θ^{n+1} := µ(θ^n),   (11)

where µ is an operator built from F and D in such a way that the sequence (θ^n)_{n∈N} converges to a minimizer of F (cf. [8, 11, 14] for an exhaustive list of algorithmic schemes). When building a Metropolis-Hastings algorithm (say with Gaussian proposal), a new point is proposed as

θ^{n+1/2} ∼ N_d(µ(θ^n), C),

where C ∈ R^{d×d} is a positive definite matrix. The Langevin dynamics is recovered in the specific case where F is smooth, with µ(θ) := θ − γ∇F(θ) being a gradient ascent on ln π with step size γ > 0, and C := 2γ I_d. Scaled Langevin samplers are also popular: given a d × d positive definite matrix Γ, set

µ(θ) := θ − γ Γ ∇F(θ)  and  C := 2γ Γ;   (13)

they are inherited from the so-called tempered Langevin diffusions [28] (see also [46] for a pioneering work on their use in the Markov chain Monte Carlo literature). Either this proposed point is the new point, θ^{n+1} = θ^{n+1/2} (thus yielding the Langevin Monte Carlo algorithm [35]; see also [19, 21]), or an acceptance-rejection Metropolis mechanism is applied (thus yielding the Metropolis Adjusted Langevin Algorithm (MALA) [45]).
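One MALA iteration as described above can be sketched in a few lines of numpy. This is an illustrative sketch on a generic smooth target, not the paper's implementation; log densities are passed in explicitly, and the Hastings correction uses the Gaussian proposal density with mean µ(θ) = θ − γ∇F(θ) and covariance 2γI:

```python
import numpy as np

def mala_step(theta, log_pi, grad_F, gamma, rng):
    """One MALA step targeting pi ∝ exp(-F): Langevin drift, Gaussian proposal
    with covariance 2*gamma*I, then Metropolis accept/reject."""
    def mu(x):
        return x - gamma * grad_F(x)
    def log_q(x, y):  # log density (up to constants) of N(mu(x), 2*gamma*I) at y
        return -np.sum((y - mu(x)) ** 2) / (4.0 * gamma)
    prop = mu(theta) + np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
    log_alpha = log_pi(prop) + log_q(prop, theta) - log_pi(theta) - log_q(theta, prop)
    return prop if np.log(rng.uniform()) < log_alpha else theta
```

On a standard Gaussian target (F(θ) = ‖θ‖²/2, so grad_F is the identity), iterating this step produces samples with mean near 0 and variance near 1.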
The general Metropolis-Hastings procedure with Gaussian proposal is summarized in Algorithm 1, where we denote by q(θ, τ) the density of the distribution N_d(µ(θ), C) evaluated at τ ∈ R^d. The constraint θ ∈ D is managed by the acceptance-rejection step (since π(θ^{n+1/2}) = 0 when θ^{n+1/2} ∉ D), but not necessarily in the proposal mechanism.

Algorithm 1: given θ^n, sample θ^{n+1/2} ∼ N_d(µ(θ^n), C); set θ^{n+1} = θ^{n+1/2} with probability min(1, π(θ^{n+1/2}) q(θ^{n+1/2}, θ^n) / (π(θ^n) q(θ^n, θ^{n+1/2}))), and θ^{n+1} = θ^n otherwise.

The MALA algorithms drift the proposed moves towards areas of high probability for the distribution π, using first order information on π. Building on this idea, many strategies were proposed in the literature in the setting defined by A1: µ can either be a gradient step when F is smooth, or a proximal step (i.e., µ = prox_{γF}, also referred to as an implicit subgradient descent step), or a Moreau-Yosida envelope gradient step (i.e., µ = I − γ∇(^γF), where the Moreau envelope of a function F with parameter γ > 0 is defined as ^γF(x) := inf_y { F(y) + ‖x − y‖² / (2γ) }). In [18], explicit subgradient steps, possibly combined with a proximal step, are used. In [13], µ relies on a Gaussian smoothing of convex functions with Hölder-continuous sub-gradients; this method applies under regularity conditions and convexity assumptions on g which are not implied by A1-A2. In [20], the authors add a Moreau-Yosida envelope term and a gradient term. In [30], µ composes a Moreau-Yosida envelope step on g(A·) and a gradient step. Let us also cite [6, 47, 48], who use proximal operators in order to define trans-dimensional Monte Carlo samplers, an objective which is out of the scope defined by A1. However, in the optimization context (11) with a non-smooth objective function F, proximal-based strategies are often preferred to explicit subgradient steps, as convergence is ensured with a fixed step size; explicit subgradient descent steps require decreasing step sizes, which are far from numerically efficient.
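The link between the Moreau-Yosida envelope step and the proximal step mentioned above can be checked numerically on the scalar example F = λ|·|, whose envelope is the Huber function: the identity µ = I − γ∇(^γF) = prox_{γF} then reduces to soft thresholding. A sketch under these assumptions (function names ours):

```python
import numpy as np

def moreau_env_grad(x, lam, gamma):
    """Gradient of the Moreau envelope of F = lam*|.| (the Huber derivative):
    x/gamma inside the threshold region, +/- lam outside."""
    return np.clip(x / gamma, -lam, lam)

def prox_abs(x, lam, gamma):
    """prox_{gamma*lam*|.|}: soft thresholding with threshold gamma*lam."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
```

The envelope gradient step x − γ∇(^γF)(x) and prox_{γF}(x) coincide for every x, illustrating why the two activations µ are interchangeable here.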
When F = f + g where f and g satisfy Assumption A1, deriving a proximal activation µ is often a tedious task, as no closed form expression for prox_{f+g} exists in a general framework [40, 52]. The standard solution consists in a proximal-gradient activation: µ(θ) := prox_{γg}(θ − γ∇f(θ)). When dealing with a blockwise structure for g as in A2, the choice of µ has to manage both the additive structure of g and the combination of the g_{i,j}'s with a linear operator. Unfortunately, the proximity operator has a closed form expression only in the limited cases recalled below in Lemma 4.

Lemma 4. 1) Let g := λ‖·‖_1 with λ > 0. For every γ > 0, prox_{γg} is the componentwise soft thresholding with threshold γλ. 2) Let g be a proper lower semi-continuous convex function and let AA^⊤ be invertible. For every γ > 0,

prox_{γ g(A·)}(x) = x + A^⊤ (AA^⊤)^{−1} ( û − Ax ),  where  û := argmin_u { γ g(u) + ½ (u − Ax)^⊤ (AA^⊤)^{−1} (u − Ax) }.

3) Let g(A·) := ∑_{ℓ=1}^c g_ℓ(a_ℓ ·), where g_ℓ is convex, lower semi-continuous, and proper from R to ]−∞, +∞], and a_ℓ ∈ R^d denotes the row #ℓ of A. Suppose that AA^⊤ = Λ, where Λ := diag(χ_1, . . . , χ_c) and χ_ℓ > 0. Then, for every γ > 0,

prox_{γ g(A·)}(x) = x + A^⊤ Λ^{−1} ( prox_{γ Λg}(Ax) − Ax ),

where, for all ζ = ζ_{1:c}, we set Λg(ζ) := ∑_{ℓ=1}^c χ_ℓ g_ℓ(ζ_ℓ), so that prox_{γΛg} acts componentwise. 3) is extracted from [39] and is a direct consequence of 2) for specific choices of A and g.

None of the algorithms recalled in the previous section directly applies to the context of Assumptions A1-A2. We propose two novel Metropolis-Hastings algorithms, the PGdec sampler and the PGdual sampler, which use the proximal operator to handle the non-smooth part g of −ln π, its additive structure, and its combination with linear operators.

• The PGdec sampler. In addition to Assumptions A1 and A2, we further assume:

A3. For every j ∈ {1, . . . , J} and i ∈ {1, . . . , I_j}, the proximity operator of g_{i,j}(A_{i,j} ·) has a closed form expression.

A3 requires that each component g_{i,j}(A_{i,j}·) has a tractable proximal operator; this does not imply that the nonsmooth component g itself admits a tractable proximal operator. The Proximal-Gradient Decomposition sampler (PGdec) is described by algorithm 2.
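Item 3) of Lemma 4 can be implemented in a few lines for the L1 case g_ℓ := λ|·| relevant here: when the rows of A are pairwise orthogonal (AA^⊤ diagonal), the prox of x ↦ γλ‖Ax‖_1 reduces to a soft thresholding in the row space of A. A sketch under these assumptions (function names ours):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gA_l1(x, A, gamma, lam):
    """prox of x -> gamma * lam * ||A x||_1 when A has pairwise orthogonal,
    nonzero rows (A A^T diagonal), following Lemma 4, item 3:
    x + A^T Lambda^{-1} (prox_{gamma*Lambda*g}(A x) - A x)."""
    chi = np.sum(A * A, axis=1)          # diagonal of A A^T
    z = A @ x
    u = soft(z, gamma * lam * chi) - z   # componentwise prox in the row space
    return x + A.T @ (u / chi)
```

Two sanity checks: with A = I the formula collapses to plain soft thresholding, and with a single scaled row A = (2, 0, 0) it matches the direct computation of prox of γλ|2x_1|.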
It is a Metropolis-Hastings sampler with Gaussian proposal: conditionally to the current point θ^n, PGdec proposes a move to θ^{n+1/2} = (θ^{n+1/2}_1, . . . , θ^{n+1/2}_J) by sampling independently the J blocks from Gaussian distributions (see θ^{n+1/2} in line 5 of algorithm 2); then a Metropolis acceptance-rejection step is applied (see (17)). The originality of our method is the definition of µ: for every j ∈ {1, . . . , J} and i ∈ {1, . . . , I_j},

µ^PGdec_{i,j}(θ) := prox_{γ_j g_{i,j}(A_{i,j}·)}( θ_j − γ_j ∇_j f(θ) ),

where γ_j is a positive step size and ∇_j denotes the differential operator with respect to (w.r.t.) the block #j of θ. The proposed drift takes benefit of the blockwise separable expression of g and computes, at each iteration, the proximal operator associated to part of the sum, in order to perform the proximal activation with a closed form expression. Conditionally to θ^n, for each block #j, one of the components #i_j is selected at random in {1, . . . , I_j} (see line 3); then, θ^{n+1/2}_j is sampled from a R^{d_j}-valued Gaussian distribution with expectation µ^PGdec_{i_j,j}(θ^n) and covariance matrix C_{i_j,j}. We denote by q_{i,j}(θ, τ) the density of the distribution N_{d_j}(µ_{i,j}(θ), C_{i,j}) evaluated at τ ∈ R^{d_j}.

Remark 5. Let j ∈ {1, . . . , J} and i ∈ {1, . . . , I_j}. Given θ^n = θ^n_{1:J} ∈ R^d, µ^PGdec_{i,j}(θ^n) successively computes a gradient step w.r.t. the smooth function f and the variable θ_j, and a proximal step with respect to the function g_{i,j}(A_{i,j}·); the step size is γ_j for both steps. Hence, µ^PGdec_{i,j} is a blockwise proximal-gradient (PG) step.

Example 6 (continuing Example 2). The proximal operator of R → λ_R ‖D_2 R‖_1 is not explicit, so that, when decomposing g as proposed in Example 2, PGdec can not be applied to sample π(·|Z).

Example 7 (continuing Example 3). For i = 1, 2, 3, the rows of A_{i,1} are pairwise orthogonal with equal norms, so that A_{i,1}A_{i,1}^⊤ is proportional to the identity matrix; by Lemma 4, prox_{γ_1 λ_R ‖A_{i,1}·‖_1} is explicit. In addition, prox_{γ_2 λ_O ‖·‖_1} has a closed form expression. Hence, when decomposing g as proposed in Example 3, PGdec can be applied to sample π(·|Z).

• The PGdual sampler. The Proximal-Gradient dual sampler (PGdual) is defined along the same lines as algorithm 2.
It is designed for situations when, for any i, j, the dimensions of the matrices A_{i,j} satisfy c_{i,j} ≤ d_j and A_{i,j} can be augmented into an invertible d_j × d_j matrix, denoted by Ā_{i,j}. For every j ∈ {1, . . . , J} and i ∈ {1, . . . , I_j}, let Ā_{i,j} be a d_j × d_j invertible matrix and ḡ_{i,j} be a function such that, for any θ_j ∈ R^{d_j},

ḡ_{i,j}(Ā_{i,j} θ_j) = g_{i,j}(A_{i,j} θ_j).

Remark 8. For every j ∈ {1, . . . , J}, select i_j ∈ {1, . . . , I_j}, and consider the partial objective function θ ↦ f(θ) + ∑_j g_{i_j,j}(A_{i_j,j} θ_j). Consider the one-to-one maps θ̃_j = Ā_{i_j,j} θ_j for any j, and the partial objective function expressed in these dual variables. The PG step w.r.t. θ̃_j reads

θ̃_j ↦ prox_{γ_j ḡ_{i_j,j}}( θ̃_j − γ_j Ā^{−⊤}_{i_j,j} ∇_j f(θ) ).

Therefore, applying a PG step w.r.t. the dual variable θ̃_j in this dual space, and going back to the original space by applying Ā^{−1}_{i_j,j}, yields the drift

µ^PGdual_{i_j,j}(θ) := Ā^{−1}_{i_j,j} prox_{γ_j ḡ_{i_j,j}}( Ā_{i_j,j} θ_j − γ_j Ā^{−⊤}_{i_j,j} ∇_j f(θ) ).   (18)

Example 9 (continuing Example 2). Denote by D̄_2 any T × T invertible augmentation of D_2, obtained by adding two vectors of R^T; for example, the rows (1, 0, . . . , 0) and (0, 1, 0, . . . , 0), of norm 1, placed on top of D_2, yield an invertible matrix. Then (D̄_2 R)_{3:T} = D_2 R, and for R̄ ∈ R^T, ḡ_{1,1}(R̄) := λ_R ‖R̄_{3:T}‖_1. In addition, ḡ_{1,2} = g_{1,2}. Hence, when decomposing g as proposed in Example 2, PGdual can be used to sample π(·|Z).

• Choice of the covariance matrix C. Based on Remark 8, Ā_{i,j} µ^PGdual_{i,j}(θ) is a PG step in a dual space. Since such a step can be seen as an extension of a gradient step to a nonsmooth function, a natural idea is to mimic the MALA proposal and add a Gaussian noise with covariance matrix 2γ_j I_{d_j} in the dual space. Therefore, in the original space, this yields the covariance matrix C_{i,j} := 2γ_j Ā^{−1}_{i,j} Ā^{−⊤}_{i,j}. In (18), note the preconditioning matrix Ā^{−⊤}_{i,j} before the gradient and the matrix Ā^{−1}_{i,j} before the proximal operator: there is a parallel between such a choice of C_{i,j} and the scaled MALA proposal mechanism (13). An equivalent argumentation can be developed for µ^PGdec. Set G_j(θ^n) := θ^n_j − γ_j ∇_j f(θ^n) (a gradient step w.r.t. block #j), and let Π_{i,j} := A^⊤_{i,j}(A_{i,j}A^⊤_{i,j})^{−1} A_{i,j} denote the orthogonal projection matrix on the range of A^⊤_{i,j}. In the case where A_{i,j}A^⊤_{i,j} =: Λ_{i,j} is diagonal, Lemma 4 yields

µ^PGdec_{i,j}(θ^n) = (I_{d_j} − Π_{i,j}) G_j(θ^n) + A^⊤_{i,j} Λ^{−1}_{i,j} prox_{γ_j Λ_{i,j} g_{i,j}}( A_{i,j} G_j(θ^n) ).   (23)
(23) shows that µ^PGdec_{i,j}(θ^n) is the sum of two orthogonal terms: the first one is the orthogonal projection of G_j(θ^n) on the orthogonal complement of the range of A^⊤_{i,j}, and the second term is in the range space of A^⊤_{i,j}. This second term may be seen as a proximal-contraction of Π_{i,j} G_j(θ^n). Theorem 10 makes the PGdual idea explicit, by proposing a matrix Ā_{i,j} augmenting A_{i,j}, computing the associated drift µ^PGdual_{i,j}, and comparing it to µ^PGdec_{i,j}.

Theorem 10. Assume A1 and A2. Let j ∈ {1, . . . , J} and i ∈ {1, . . . , I_j}. Assume that c_{i,j} < d_j, and let U_{i,j} be a (d_j − c_{i,j}) × d_j matrix such that U_{i,j} A^⊤_{i,j} = 0 and the rows of U_{i,j} are orthogonal with equal norms; augment A_{i,j} with U_{i,j} into the invertible matrix Ā_{i,j}. Then µ^PGdual_{i,j}(θ^n) is the sum of two orthogonal terms: the first one coincides with the first term of (23), (I_{d_j} − Π_{i,j}) G_j(θ^n), and the second one belongs to the range of A^⊤_{i,j}.

Proof. The main ingredients are block expressions for Ā^{−1}_{i,j} and Ā_{i,j} Ā^⊤_{i,j}; the proof then follows from standard matrix algebra. A detailed proof is given in section IX of the Supplementary material. The result remains true when the norms of the rows of U_{i,j} are not required to be equal. Observe that the conditions on U_{i,j} are satisfied as soon as the rows of U_{i,j} are orthonormal and orthogonal to the rows of A_{i,j}.

Let us derive two strategies for the application of PGdual to sample the target density defined by (6)-(7).

Example 11 (continuing Examples 2 and 9). A first strategy is to decompose g as in Example 2; Example 9 provides a possible augmentation D̄_2 of D_2. Another one is proposed by Theorem 10: D_o := [U_{1,1}; D_2], where U_{1,1} is obtained by orthogonalizing the two rows added in D̄_2 and making them orthogonal to the rows of D_2. This strategy acts globally on ln π(·|Z) by proposing, at each iteration, a PG approach for the full function f_Z + g. It will be numerically explored in section V.

Example 12 (continuing Example 3). A second strategy is to decompose g as in Example 3, and define the matrices Ā_{i,j} as described in Theorem 10. This strategy defines µ^PGdual by considering part of ln π(·|Z): at each iteration, having selected the components i_j, it proposes a PG approach for the corresponding partial objective function (see Remark 8).

E. Convergence analysis of PGdec and PGdual. We prove that both PGdec and PGdual produce a sequence of points {θ^n, n ≥ 0} which is an ergodic Markov chain having π as its unique invariant distribution. Proposition 13.
Assume A1, A2 and A3. Assume also that π is continuous on D. Then the sequence {θ^n, n ≥ 0} given by algorithm 2 applied with µ = µ^PGdec is a Markov chain, taking values in D and with unique invariant distribution π. In addition, for any initial point θ^0 ∈ D and any measurable function h such that ∫ |h(θ)| π(θ) dθ < ∞, almost surely,

lim_{N→∞} N^{−1} ∑_{n=1}^N h(θ^n) = ∫ h(θ) π(θ) dθ.

The additive structure of g in A2 naturally suggests the extension of PGdec and PGdual to the Gibbs sampler algorithm [26]. Since exact sampling from the conditional distributions π_j(·|θ) on R^{d_j}, j = 1, . . . , J, is not always explicit, we propose a Metropolis-within-Gibbs strategy [12]. The pseudo-code of the so-called Gibbs Blockwise Proximal sampler is given by algorithm 3, in the case of a systematic scan order of the J components (our method easily extends to other scan orders; details are left to the reader): at each iteration #(n + 1), and for each block #j, (i) sample i_j ∈ {1, . . . , I_j} with probability 1/I_j; (ii) sample a candidate from N_{d_j}(µ_{i_j,j}(ϑ), C_{i_j,j}), where ϑ := (θ^{n+1}_{1:j−1}, θ^n_{j:J}) ∈ R^d is the current value of the chain; and (iii) accept/reject this candidate via a Metropolis step targeting the conditional distribution π_j(·|ϑ).

To illustrate, assess, and compare the relevance and performance of the Monte Carlo procedures for credibility interval estimation proposed here, use is made of real Covid19 data made available by the Johns Hopkins University. The repository has collected, impressively, since the outbreak of the pandemic, the daily new infection and death counts reported by the National Public Health Authorities of 200+ countries and territories of the world. Counts have been updated on a daily basis since the early stage of the pandemic (Jan. 1st, 2020) until today. This repository thus provides researchers with a remarkable dataset to analyse the pandemic.
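Returning to the Metropolis-within-Gibbs scan of algorithm 3, its control flow can be sketched in a simplified setting: here the drifts are taken as the identity (the RW-within-Gibbs special case used for comparison in section V), so the proposal is symmetric and the Hastings ratio reduces to the target ratio. Function names and the toy target are ours:

```python
import numpy as np

def gibbs_rw(theta0, log_pi, step, n_iter, rng):
    """Metropolis-within-Gibbs with systematic scan over blocks: for each block,
    propose a Gaussian random-walk move with variance 2*step[j] and accept or
    reject it with a Metropolis step (symmetric proposal, so the acceptance
    ratio only involves the target density)."""
    theta = [b.copy() for b in theta0]
    chain = []
    for _ in range(n_iter):
        for j in range(len(theta)):
            cand = [b.copy() for b in theta]
            cand[j] = theta[j] + np.sqrt(2.0 * step[j]) * rng.standard_normal(theta[j].shape)
            if np.log(rng.uniform()) < log_pi(cand) - log_pi(theta):
                theta = cand  # blockwise accept: other blocks stay put
        chain.append(np.concatenate(theta))
    return np.array(chain)
```

On a two-block standard Gaussian target, the per-block accept/reject illustrates the point made below for the Gibbs samplers: a move on one block can be accepted while the other is rejected.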
As mentioned in the Introduction, because of the sanitary crisis context, the data made available by most National Public Health Authorities in the world are of very limited quality, corrupted by outliers and missing or negative counts. Data quality also varies a lot across countries, or even within a given country, depending on the phases and periods of the pandemic. The present work uses the new infection counts only, as the estimation of the space-time evolution of the pandemic reproduction number R is targeted. A mild and non-informative preprocessing is applied to the data by replacing negative counts with a null value.

The present section aims to assess and compare the performance of several variations of the samplers PGdec and PGdual introduced in section III, using the real Covid19 data described in section IV. Performance assessments were conducted on data from several countries and for different periods of interest. For space reasons, they are reported only for the United Kingdom and for a recent time period (Dec. 6th, 2021 to Jan. 9th, 2022), with the corresponding Z in Fig. 2 [top right plot]. Following [37], we set λ_O = 0.05 and λ_R = 3.5 × √6 σ_Z / 4, with σ_Z the standard deviation of Z. Samplers are run with N_max = 1e7 iterations, including a burn-in phase of 3e6 iterations. Except for the plots in Fig. 1 [first row], the reported performance is computed by discarding the points produced during the burn-in phase. The initial values Z_{−τ_φ+1}, . . . , Z_{−1}, Z_0 are set to the observed counts. For each algorithm, performance is computed from averages over 15 independent runs. Initial points consist of random perturbations around the non-informative point R^0 := (1, . . . , 1)^⊤ ∈ R^T and O^0 := (0, . . . , 0)^⊤ ∈ R^T. For all samplers, θ is seen as a two-block vector (θ_1 = R, θ_2 = O) (hence J = 2).

PGdec and Gibbs PGdec. These two samplers are run by decomposing g as described in Example 3.
This yields two different drift functions for the blocks R and O; see Example 7. For the R-part, the drift functions are defined, for i = 1, 2, 3, as

µ_{i,1}(θ) := prox_{γ_1 λ_R ‖A_{i,1}·‖_1}( R − γ_1 ∇_1 f_Z(θ) ),   (26)

obtained from (23), where ∇_1 (resp. ∇_2) denotes the gradient operator w.r.t. R (resp. O); for the O-part,

µ_{1,2}(θ) := prox_{γ_2 λ_O ‖·‖_1}( O − γ_2 ∇_2 f_Z(θ) ).   (27)

PGdual and Gibbs PGdual. These two samplers are run by decomposing g as described in Example 2. Consequently, for the R-part, the drift function is given by

µ_{1,1}(θ) := D^{−1} prox_{γ_1 ḡ_{1,1}}( D R − γ_1 D^{−⊤} ∇_1 f_Z(θ) ),   (28)

where D := D̄_2 (see Example 9; the sampler is referred to as PGdual Invert (I)) or D := D_o (see Example 11; the sampler is referred to as PGdual Ortho (O)); for the O-part, for both PGdual I and PGdual O, the drift is as in (27) (numbered (29) below for reference).

RW and Gibbs RW. For comparison, we also run Random Walk-based samplers (RW) with Gaussian proposals: RW and Gibbs RW are defined respectively from algorithm 2 and algorithm 3, with I_1 = I_2 = 1 and µ_{1,j}(θ) = θ_j for j = 1, 2.

Covariance matrices. For the covariance matrices C_{i,j}, we choose C_{1,2} := 2γ_2 I_T for the O-part, where γ_2 is the same step size as in (27) and (29). For the R-part, based on the comment in subsection III-C, we choose C_{1,1} := 2γ_1 D^{−1} D^{−⊤}, with D the matrix used in (28). We also run the PGdec-based samplers and the RW-based samplers with the same covariance matrices: when C_{i,1} := 2γ_1 D^{−1} D^{−⊤} with D = D̄_2 (resp. D = D_o), the samplers are referred to as I samplers (resp. O samplers).

Step sizes. Different strategies are compared for the definition of the step sizes (γ_1, γ_2). All of them consist in adapting the step sizes during the burn-in phase, in such a way that the mean acceptance-rejection rate reaches approximately 0.25 (which is known to be optimal for some RW Metropolis-Hastings samplers, [24]). We observed that convergence of this adaptation occurs before 5e5 iterations; see Fig. 1 [row 1, columns 2 and 3], where, when the step size is initially too large, the proposals are all rejected and the chain does not move from its initial point during the first iterations. At the end of the burn-in phase, the step sizes are frozen and no longer adapted. For the Metropolis-Hastings samplers PGdec, PGdual, and RW, γ_1 is adapted and γ_2 := γ_1 σ³_Z / T.
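The step-size adaptation toward a target acceptance rate described above can be implemented as a simple stochastic-approximation update during burn-in. This is a generic sketch (the paper does not specify its update rule; the gain `eta` and the multiplicative form on the log scale are our assumptions):

```python
import numpy as np

def adapt_step(gamma, accept_rate, target=0.25, eta=0.05):
    """One adaptation update of a step size during burn-in: increase gamma
    when the recent empirical acceptance rate exceeds the target, decrease it
    otherwise. The update acts on the log scale so gamma stays positive."""
    return gamma * np.exp(eta * (accept_rate - target))
```

Repeated over burn-in batches, this drives the acceptance rate toward the target; freezing gamma afterwards, as done above, preserves the Markov property of the post-burn-in chain.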
For the Gibbs samplers, the acceptance-rejection steps are specific to each block R and O. A first consequence is that a move on one block can be accepted while the other one is rejected; this may yield larger step sizes (γ_1, γ_2), which in turn favor larger moves of the chains. A second consequence is that we use the acceptance-rejection rate of the R-part (resp. of the O-part) to adapt γ_1 (resp. γ_2): the two step sizes have their own efficiency criterion.

The sampler performances are assessed and compared using three different criteria; see Fig. 1.

Distance to the MAP. We compute the normalized distance ‖R^n − R^MAP‖ / ‖R^MAP‖, where R^MAP denotes the MAP estimator (computed as in [37]); it is displayed vs. the iteration index n, in the burn-in phase (row 1) and after the burn-in phase (row 2). This criterion quantifies the ability of the chains to perform a relevant exploration of the distribution: the chains have to visit the support of π(·|Z), and they show better ergodicity rates when they are able to rapidly escape from low density regions to move to higher density regions. Paths start from a non-informative initial point, considered as a point in a low density region. This criterion identifies a relevant behavior of the Markov chain when (i) the distance drifts rapidly towards zero during the burn-in phase and (ii) it fluctuates in a large neighborhood of zero after the burn-in phase.

Autocorrelation function (ACF). We then compute the ACF of each of the 2T components of the vector θ along the Markov path, for lags from 1 to 1e5. On row 3, we report the mean value, over these 2T components, of the absolute value of the ACF versus the lag. This criterion identifies a relevant behavior of the Markov chain when the ACF converges rapidly to zero; it is indeed related to the effective sample size of the Markov chain (see [43]) and to its mixing properties (see e.g. [32, Theorem 17.4.4 and Section 17.4.3]).
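The empirical ACF used for this criterion can be computed directly from a scalar chain; a minimal sketch (function name ours):

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation of a scalar chain for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    var = float(np.dot(xc, xc)) / len(x)
    return np.array([float(np.dot(xc[:-k], xc[k:])) / (len(x) * var)
                     for k in range(1, max_lag + 1)])
```

For an i.i.d. (perfectly mixed) chain the ACF is close to zero at all lags, while a strongly correlated chain (e.g. a random walk) has an ACF close to one at lag 1; faster decay means a larger effective sample size.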
For example, a weaker ACF means that fewer iterations of the sampler are required to reach a given estimation accuracy.

Gelman-Rubin (GR) statistic. Finally, we compute the GR statistic (see [9, 25]), which identifies a relevant behavior of the Markov chain when it converges rapidly to one. It measures how fast the sampler forgets its initial value and provides homogeneous approximations of the target distribution π(·|Z) after a given number of iterations. On row 4, we report this statistic versus the iteration index.

Covariance matrices. The normalized distance to MAP indices (Fig. 1 [rows 1 & 2]) and the GR indices (Fig. 1 [row 4]) show that, for all algorithms RW, PGdec, and PGdual, strategy O is more efficient than strategy I, as the corresponding indices decay more rapidly (to 0 and 1, respectively) for the former than for the latter.

Metropolis-Hastings vs. Gibbs samplers. The evolution of the ACF criteria (Fig. 1 [row 3]) shows that the Gibbs strategies are globally more efficient than the Metropolis-Hastings ones. Further, the GR indices confirm the better efficiency of the Markov chains obtained from the Gibbs O strategies.

Drift functions. The benefit of using functions µ miming optimization algorithms to drift the proposed points towards the higher probability regions of π(·|Z) is clearly quantified in Fig. 1. During the burn-in phase, the normalized distance to MAP indices decay toward 0 significantly faster for all PGdual-based algorithms, compared to RW-based ones. Also, the PGdual O algorithms reach the expected plateau early in the burn-in phase, while this is barely the case at the end of the burn-in phase for the RW O algorithms. After the burn-in phase (second row), the samplers using optimization-based drift functions µ show better behavior after reaching the high density regions, as they permit a broader and more rapid exploration of the support of the distribution around its maximum, with large amplitude moves away from R^MAP and faster returns to R^MAP.
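The GR diagnostic invoked above compares within-chain and between-chain variability across independent runs; a minimal sketch of the classical potential scale reduction factor (function name ours):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor from m chains of length n
    (one chain per row). Values close to 1 indicate that the chains have
    forgotten their initial points and agree on the target."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```

Chains drawn from the same distribution give a statistic near 1; chains stuck around different values (poor forgetting of the initial point) give a much larger value.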
This is notably clearly visible for the Gibbs PGdual O strategy. The ACF indices decay more rapidly for the PGdual strategies than for the RW ones, irrespective of the choice of the covariance matrix or of the Metropolis-Hastings or Gibbs versions. Also, the ACF indices for the PGdual algorithms are less sensitive to the choice of the covariance matrices than those of the RW ones. Finally, the GR statistic indices (row 4) clearly illustrate that the Markov chains produced by the PGdual samplers have better mixing properties than the others, and hence less sensitivity to the choice of the initial point.

Optimal sampler. Combined together, these observations, globally consistent with those stemming from other countries or time periods, lead to the following generic comments. While yielding essentially equivalent performance across time periods and countries, the O and I choices of the covariance matrices significantly outperform algorithms relying on identity covariance matrices (not reported here, as they show poor performance). This non trivial construction actually stemmed from the thorough and detailed mathematical analysis conducted in subsection III-C: Theorem 10 advocates performing the augmentation of the (T − 2) × T matrix D_2 into a T × T invertible matrix D by adding two rows which are orthogonal to the rows of D_2. The observation that the Gibbs samplers show better performance may stem from the fact that they benefit from larger values of the step sizes, learned on the fly by the algorithms, which favor larger jumps when proposing θ^{n+1/2} from θ^n and imply better mixing properties. Finally, the PGdual algorithms show homogeneous performance, for example when varying the covariance matrix, and are thus less sensitive to parameter tuning, an important practical feature. Also, the PGdual algorithms show systematically better performance than the PGdec ones.
This may result from the definition of µ_PGdec which, because of the block-splitting approach (see (26)), uses only partial information on π(·|Z) at each iteration. As an overall conclusion, systematically observed across all studied time periods and countries, the Gibbs PGdual O algorithm, devised from the careful analysis of the theoretical properties of the distributions defined by A1-A2, is consistently found to be the most efficient strategy for a relevant assessment of the Covid19 pandemic time evolution.

Credibility interval-based estimations are reported for the Gibbs PGdual O sampling scheme only, as Section V established that it achieves the best performance amongst the twelve sampling schemes tested. Estimations are computed over a time period of five weeks (T = 35 days), which corresponds to a few typical pandemic time scales, of the order of 7 days, induced by the serial interval function Φ, cf. Section II. This period is set to a recent phase of the pandemic (Dec. 13th, 2021 to Jan. 17th, 2022). Estimations are reported for several countries, arbitrarily chosen as representative of the pandemic across the world, but the conclusions are valid for most countries.

Computation of the credibility intervals. Credibility intervals are computed as follows: for each day t ∈ [T_1, T_2] in the period of interest [T_1, T_2], the chosen sampling scheme (Gibbs PGdual O) outputs 7 × 10^6 points of a Markov chain approximating the a posteriori distribution of R_t; quantiles of that distribution are estimated using the empirical cumulative distribution function; for a chosen credibility level 1 − α, the upper and lower limits of the credibility interval are defined by the empirical 1 − α/2 and α/2 quantiles. For illustration purposes, 1 − α is set here to 95%. The same procedure is applied to produce credibility intervals for the outliers O_t.
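The quantile-based construction just described can be sketched as follows; this is a minimal illustration of the equal-tailed interval (the function name and interface are ours, not the authors' Matlab routines):

```python
import numpy as np

def credibility_interval(samples, level=0.95):
    """Equal-tailed credibility interval from Monte Carlo samples.

    samples: draws of, e.g., R_t produced by the sampler, approximating
    its a posteriori distribution. The limits are the empirical alpha/2
    and 1 - alpha/2 quantiles, with alpha = 1 - level.
    """
    alpha = 1.0 - level
    low, high = np.quantile(samples, [alpha / 2.0, 1.0 - alpha / 2.0])
    return low, high
```

With level = 0.95, the returned limits are the empirical 2.5% and 97.5% quantiles of the Markov chain output.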
Finally, credibility intervals for the estimated denoised new infection counts Z^(D) are obtained by subtracting the credibility intervals for O_t from the raw, possibly corrupted, counts Z_t. Fig. 2 reports, in the top plots, the daily counts of new infections (black lines), on which are superimposed the 95% credibility interval-based estimations of the denoised counts Z^(D) (red pipes). Fig. 2 further reports, in the bottom plots, the 95% credibility interval-based estimations of R.

Relevance of credibility interval-based estimations. Fig. 2, together with the examination of equivalent plots for other countries, yields the following generic conclusions. The estimated denoised counts show a far smoother evolution along time than the raw counts, hence providing far more realistic assessments of the intensity of the pandemic. Notably, for most countries, the zero or low counts associated with week-ends or non-working days, followed by days with overestimated counts by compensation, are smoothed out by the outlier estimation procedure, while the counts for the regular (non-corrupted) days are left unchanged. This is the direct benefit of the nonlinear filtering procedure underlying the estimation formulation in (6)-(7), as opposed to traditional denoising performed by classical linear filtering (such as moving averages). The sizes of the credibility intervals for R range, depending on countries and time periods, from below 1% to above 10%. A careful examination shows that the credibility interval size is mostly driven by data quality: it increases when outliers are detected. Further, for a given country, the size of the credibility intervals varies only mildly along time over a five-week period. Changes in size are often associated with changes in the trends of the estimates of R, or with the occurrence of outliers.
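The subtraction step for the denoised counts can be sketched as below; this is our illustration of the convention stated in the text (the interval for O_t subtracted from Z_t, assuming Z_t^(D) = Z_t − O_t, so that the lower outlier limit yields the upper denoised limit and conversely):

```python
import numpy as np

def denoised_count_interval(z_t, o_samples, level=0.95):
    """Credibility interval for the denoised count Z_t - O_t.

    z_t: raw (possibly corrupted) daily count.
    o_samples: Monte Carlo draws of the outlier O_t.
    Subtracting the outlier interval flips the bounds: the upper
    outlier quantile gives the lower denoised limit, and conversely.
    """
    alpha = 1.0 - level
    o_low, o_high = np.quantile(o_samples, [alpha / 2.0, 1.0 - alpha / 2.0])
    return z_t - o_high, z_t - o_low
```

For an equal-tailed interval, this is equivalent to taking the empirical quantiles of z_t − o_samples directly.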
These credibility intervals hence provide a relevant assessment not only of the intensity of the pandemic, but also of the confidence that can be granted to this assessment, by providing epidemiologists with a range of likely values of R rather than a single value. This permits comparing the evolution of the pandemic across several countries on a more scientifically grounded basis. These credibility interval-based estimations permit a double analysis of the pandemic. Retrospectively, they permit evaluating the impact of sanitary measures on the pandemic evolution. Additionally, the smooth nature of the estimate of R (close to piecewise linear) performs an implicit short-term forecast (or nowcast) of the evolution of the pandemic intensity: for instance, for several countries (e.g., France, Mali, Brazil, Singapore, . . . ), the estimate of R is decreasing over the last 5 to 10 days of the studied period, predicting that daily new infections will reach the maximum of the current wave within the coming days and then start to decrease. These credibility interval estimates can be complemented with other estimates such as the maximum, median or mean a posteriori (cf. e.g., [1, 5]). Finally, let us emphasize that these estimates, denoised counts and credibility intervals, are obtained using a single set of hyperparameters, λ_R/σ_Z and λ_O, common to all countries.

The proposed tools perform a relevant credibility interval-based estimation of the Covid19 reproduction number and denoised new infection counts, by combining a Bayesian modeling of the time evolution of the pandemic with Monte Carlo Metropolis sampling strategies. Robustness against the low quality of the Covid19 data is achieved by engineering the Bayesian model to impose sparsity both in the changes of an otherwise smooth time evolution of the reproduction number and in the outlier occurrences modeling data corruption.
This comes at the price of the non-differentiability of their a posteriori distribution, thus precluding the use of the classical Metropolis Adjusted Langevin Algorithm to produce credibility intervals. This led us to propose several Proximal-Gradient Monte Carlo algorithms tailored to the sampling of non-differentiable a posteriori distributions, which also constitute valid sampling schemes for a much broader range of applications than strict Covid19 pandemic monitoring, e.g., in image processing or, more generally, in any Bayesian inverse problem with several nonsmooth priors. Estimation performance was assessed and compared on real Covid19 data, using a set of well-selected indices quantifying the efficiency of the different sampling schemes. Finally, it was shown, for several countries and for a recent five-week period, that the achieved credibility interval-based estimations of both the denoised new infection counts and the reproduction number provide practitioners with an efficient and robust tool for the actual and practical monitoring of the Covid19 pandemic. Such estimates are updated on a daily basis on the authors' web pages. Automated and data-driven estimation of the hyperparameters λ_R and λ_O is under current investigation. In an effort toward reproducible research and open science, Matlab codes implementing PGdec and PGdual, in both their Metropolis-Hastings and Gibbs versions, are made publicly available at https://github.com/gfort-lab.

Let D_Z be the subset of (R_+)^T × R^T given by (2). To shorten the notations, we write p_t(θ) instead of p(R_t, O_t | Z_{t−τ_φ:t−1}) (see (3)), and π instead of π(·|Z) (see (6)). Observe that − ln π(θ) is equal to +∞ for θ ∉ D_Z, and is defined on D_Z up to an additive normalizing constant C_π. By convention, 0 ln 0 = 0.
Define the T × T invertible matrix D_2 and compute its inverse. Finally, define the criterion C. We start with two lower bounds on the criterion C, which will help us study its behavior on some boundaries of D_Z.

◮ Lower bounds on C. Let (R̃, O) ∈ D_Z. Then (33) holds and, for any τ ∈ {1, · · · , T}, so does (34).

Proof. For any p > 0 and z ∈ N, p^z exp(−p)/z! ∈ (0, 1), thus implying that z ln p − p − ln(z!) ≤ 0. When p = z = 0, then z 1_{z>0} ln p − p = 0 and ln(z!) = 0; hence we obtain (33) and (34).

◮ Behavior of C on some boundaries of D_Z. Assume that there exist t⋆ < t⋆⋆ in {1, . . . , T} such that (ΦZ)_{t⋆} > 0 and (ΦZ)_{t⋆⋆} > 0.
1) For any sequence (R̃^n, O^n) ∈ D_Z s.t. lim_n (‖R̃^n‖_1 + ‖O^n‖_1) = +∞, we have lim_n C(R̃^n, O^n) = +∞.
2) Let t ∈ {1, . . . , T} be such that Z_t > 0.

Proof. First statement. Let {(R̃^n, O^n), n ≥ 0} be a sequence in D_Z. We distinguish two cases: either ‖O^n‖_1 tends to infinity or ‖R̃^n_{3:T}‖_1 tends to infinity (first case); or these two norms are bounded but ‖R̃^n_{1:2}‖_1 tends to infinity (second case). By convention, the last term in the RHS is zero when t = 1, 2. Under the assumptions of the second case, sup_n |Σ_{k=3}^{t} (t − k + 1) R̃^n_k| < ∞ for any t ∈ {3, . . . , T}. We prove that either (D_2^{-1} R̃^n)_{t⋆} or (D_2^{-1} R̃^n)_{t⋆⋆} tends to infinity, which will imply by (34) that the criterion tends to infinity.
• Subcase 2A. Assume that lim_n {t⋆ R̃^n_1 + (t⋆ − 1) R̃^n_2} = +∞ (observe that this limit cannot be −∞, as a consequence of the assumptions of the second case, (36) and (35)). Applying (36) with t = t⋆ yields lim_n (D_2^{-1} R̃^n)_{t⋆} = +∞ (see (30)), and then lim_n C(R̃^n, O^n) = +∞ by (34).
• Subcase 2B. Assume that sup_n |t⋆ R̃^n_1 + (t⋆ − 1) R̃^n_2| < ∞. Then, necessarily, lim_n |R̃^n_2| = +∞ (otherwise, we are in Subcase 2A). Since t⋆ < t⋆⋆, this, together with (35), implies that lim_n R̃^n_2 = +∞.
Therefore, since t⋆ < t⋆⋆, we have lim_n (D_2^{-1} R̃^n)_{t⋆⋆} = +∞. We then conclude, along the same lines as in Subcase 2A, that lim_n C(R̃^n, O^n) = +∞.

Second statement. Let {(R̃^n, O^n), n ≥ 0} be a sequence in D_Z and let τ be such that Z_τ > 0. By (34), the corresponding RHS tends to +∞, since the intensity p_τ tends to +∞ along the sequence.

The goal of the proof below is to build a closed bounded set K in D_Z such that, outside K, − ln π ≥ 1 + M_0. This implies that (R_0, O_0) ∈ K. Since − ln π is continuous on D_Z, it reaches its minimum on the compact subset K, and this minimum is upper bounded by M_0; hence this minimizer is also a global minimizer. Let us define K. We have a bound in which λ_min (resp. λ_max) is the minimal (resp. maximal) eigenvalue of D_2^{-⊤} D_2^{-1}; both are positive and finite. Consequently, setting R̃^n := D_2 R^n, we have ‖(R̃^n, O^n)‖_ℓ → +∞ iff ‖(R^n, O^n)‖_ℓ → +∞ for ℓ = 1, 2, since the norms are equivalent on R^{2T}. This property, the coercivity property (statement 1), and the equality (32) imply that lim_n − ln π(R^n, O^n) = +∞ for any D_Z-valued sequence {(R^n, O^n), n ≥ 0} such that lim_n ‖R^n‖ + lim_n ‖O^n‖ = +∞. As a consequence, there exists a radius beyond which − ln π ≥ 1 + M_0; similarly, there exists c_{1+M_0} > 0 controlling the behavior near the boundary of D_Z; K is defined from these two constants. We just proved that there exists a compact subset of the interior of D_Z that contains a minimizer of − ln π. Let θ⋆ = (R⋆, O⋆) be a minimizer.

• One or uncountably many. The function − ln π is convex and finite on a convex set: hence, given a second minimizer θ⋆⋆, µθ⋆ + (1 − µ)θ⋆⋆ is also a minimizer, whatever µ ∈ [0, 1].
• Same intensity, data-fidelity term, and penalty term. Let f_Z and g be defined by (7). Following the same lines as in [37], where the strict convexity of the Kullback-Leibler term f_Z is the key ingredient, it can be proved that p_t(θ⋆) = p_t(θ⋆⋆) for any t ∈ {1, . . . , T}, and thus f_Z(θ⋆) = f_Z(θ⋆⋆); since − ln π(θ⋆) = − ln π(θ⋆⋆), both points minimizing − ln π, we also have g(θ⋆) = g(θ⋆⋆).
• Sign conditions.
Set sign(a) = 1 when a > 0, sign(a) = −1 when a < 0, and sign(a) = 0 when a = 0. For A, B in R and µ > 0, we have |A + µB| = |A| + µ sign(A) sign(B) |B| 1_{A≠0} + µ |B| 1_{A=0} + µ o(1), where o(1) is a function satisfying lim_{µ→0} o(1) = 0. Hence the corresponding expansion holds for τ = τ_{1:d}, τ′ = τ′_{1:d} ∈ R^d and µ ∈ (0, 1), with terms of the form sign(τ_t) sign(τ′_t).

Second, for any τ = τ_{1:d_j} ∈ R^{d_j}, prox_{γ_j ḡ_{i,j}}(τ) = prox_{γ_j g_{i,j}(A Ā^{-1} ·)}(τ), since under the stated assumptions the two functions coincide. Applying Ā^{-1} and using (37), and finally since Ā^{-1} Ā = I_{d_j}, this concludes the proof of (24). When A A^⊤ = I_{c_{i,j}}, we have Ω_{i,j} = A^⊤ A and A Ω_{i,j} = A. When U U^⊤ = I_{d_j − c_{i,j}}, we have Ω_{i,j} = U U^⊤ and I_{d_j} − Π_{i,j} = U^⊤ U; this yields the final result of Theorem 10.

Let S be a finite set of indices. For any θ ∈ D, let {ρ_ι(θ), ι ∈ S} be a weight function: Σ_{ι∈S} ρ_ι(θ) = 1 and ρ_ι(θ) ≥ 0. Finally, for any ι ∈ S, let q_ι(θ, θ′) dθ′ be a Markov transition with respect to the Lebesgue measure. At iteration n + 1, the algorithm draws an index ι with probabilities {ρ_ι(θ^n), ι ∈ S}, draws θ^{n+1/2} ∼ q_ι(θ^n, ·), then sets θ^{n+1} = θ^{n+1/2} with probability α_ι(θ^n, θ^{n+1/2}), where
α_ι(x, y) := 1 ∧ [π(y) ρ_ι(y) q_ι(y, x)] / [π(x) ρ_ι(x) q_ι(x, y)],
and θ^{n+1} = θ^n otherwise. The PGdec and PGdual algorithms correspond to ρ_ι(θ) constant in θ, and to q_ι(θ, θ′) = Π_{j=1}^{J} q_{i_j,j}(θ, θ′_j), θ ∈ D, θ′ ∈ D, where ι = (i_1, . . . , i_J) and θ′ = (θ′_1, . . . , θ′_J).

Claim 1. Assume: [B1] for any θ, θ′ ∈ D, there exists ι ∈ S such that ρ_ι(θ) q_ι(θ, θ′) ∧ ρ_ι(θ′) q_ι(θ′, θ) > 0; [B2] π is continuous on D; [B3] for any compact set K of D, inf_{K×K} Σ_{ι∈S} ρ_ι q_ι > 0. Then the sequence {θ^n, n ≥ 0} obtained by Algorithm 4 is a Markov chain taking values in D. It is φ-irreducible, strongly aperiodic, and π is its unique invariant distribution.

Proof. • θ^n ∈ D for any n. The proof is by induction on n. The property holds true for n = 0. Assume that θ^n ∈ D. If θ^{n+1/2} ∉ D, then π(θ^{n+1/2}) = 0 and α_ι(θ^n, θ^{n+1/2}) = 0, so that θ^{n+1} = θ^n and θ^{n+1} is in D. This concludes the induction.
• π is an invariant probability measure.
Conditionally on ι and θ^n, the distribution of θ^{n+1} is
P_ι(θ^n, dθ′) := δ_{θ^n}(dθ′) (1 − ∫_{R^d} α_ι(θ^n, τ) q_ι(θ^n, τ) dτ) + α_ι(θ^n, θ′) q_ι(θ^n, θ′) dθ′,
where δ_x(dθ′) denotes the Dirac mass at x. Conditionally on θ^n, the distribution of ι is {ρ_ι(θ^n), ι ∈ S}. Hence the conditional distribution of θ^{n+1} given θ^n is
P⋆(θ^n, dθ′) := Σ_{ι∈S} ρ_ι(θ^n) P_ι(θ^n, dθ′).
Following the same lines as in [43, Theorem 7.2] (details are omitted), the detailed balance condition with π can be established:
π(θ) Σ_{ι∈S} ρ_ι(θ) α_ι(θ, θ′) q_ι(θ, θ′) = π(θ′) Σ_{ι∈S} ρ_ι(θ′) α_ι(θ′, θ) q_ι(θ′, θ);
hence π is invariant for P⋆.
• Irreducibility. By [B1], the chain is φ-irreducible (see [31, Lemma 1.1]).
• Aperiodicity. Let us prove that the compact sets are 1-small and that the chain is aperiodic; the proof follows the same lines as that of [31, Lemma 1.2]. Let K be a compact set in D. Since π is finite on D, sup_K π < ∞ by [B2]. For any measurable set A ⊆ K and any θ ∈ K, the corresponding lower bound holds; its RHS is positive by [B3], which proves that K is 1-small and the chain is aperiodic.

Proof. • Both algorithms satisfy [B1]. For both algorithms, ρ_ι(θ) = 1/(I_1 · · · I_J) and q_ι(θ, θ′) is proportional to Π_{j=1}^{J} exp(−0.5 (θ′_j − µ_{i_j,j}(θ))^⊤ C^{-1}_{i_j,j} (θ′_j − µ_{i_j,j}(θ))), where ι = (i_1, · · · , i_J) and µ is µ_PGdec or µ_PGdual. Therefore, since ‖µ_{i,j}(τ)‖ < ∞ for any τ ∈ D, we have q_ι(θ, θ′) ∧ q_ι(θ′, θ) > 0 for any ι ∈ S and θ, θ′ ∈ D.
• Both algorithms satisfy [B3]. For any compact set K of D, we have sup_K ‖µ_{i,j}‖ < ∞; in addition, µ is a continuous function on D (the function f is continuously differentiable and the proximal operator is continuous by [8, Proposition 12.28]). Hence inf_{K×K} q_ι > 0 and [B3] holds.
• Positive Harris recurrence. From Claim 1, the PGdec and PGdual Markov chains are positive recurrent (they are φ-irreducible with an invariant distribution, and recurrent by [32, Proposition 10.4.4]).
Following the same lines as in [44, Theorem 8], we prove that the chain is Harris recurrent by showing that, for any measurable set A such that ∫_A π(θ) dθ = 1 and any θ ∈ D, P_θ(τ_A < ∞) = 1, where τ_A is the return time to the set A ([44, Theorem 6(v)]); here P_θ denotes the probability on the canonical space of the Markov chain with initial distribution the Dirac mass at θ and with kernel P⋆. Let A be a measurable subset of D such that ∫_A π(θ) dθ = 1, and let θ ∈ D. We write the kernel P⋆ with
r(θ) := 1 − Σ_{ι∈S} ρ_ι(θ) ∫_D q_ι(θ, θ′) α_ι(θ, θ′) dθ′, and
M(θ, A) := (1 − r(θ))^{-1} Σ_{ι∈S} ρ_ι(θ) ∫_A q_ι(θ, θ′) α_ι(θ, θ′) dθ′.
Hence, P⋆(θ, ·) is a mixture of two distributions: a Dirac mass at θ and M(θ, ·). Since ∫_{A^c} π(θ) dθ = 0 (here, A^c := D \ A), the Lebesgue measure of A^c is 0. This implies that M(θ, A^c) = 0 and M(θ, A) = 1. It holds that
P_θ(τ_A = +∞) = E_θ[1_{X_1 ∉ A} P_{X_1}(τ_A = +∞)] = E_θ[1_{X_1 ∈ A^c} P_{X_1}(τ_A = +∞)] = r(θ) P_θ(τ_A = +∞);
indeed, starting from θ, the chain cannot reach A^c when the kernel M(θ, ·) is selected, and remains at θ when this kernel is not selected. Since r(θ) < 1 (otherwise the chain could not be φ-irreducible), we have P_θ(τ_A = +∞) = 0. This concludes the proof.
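For illustration, one step of a mixture Metropolis-Hastings kernel of the form P⋆ analyzed above can be sketched as follows. This is a generic sketch with simple random-walk proposals, not the paper's PGdec/PGdual samplers with their drift functions µ and covariances C; the function names and interfaces are ours.

```python
import numpy as np

def mixture_mh_step(theta, log_pi, proposals, rho, rng):
    """One step of a mixture Metropolis-Hastings kernel.

    proposals[i] = (sample_i, log_q_i): sample_i(x, rng) draws from
    q_i(x, .), and log_q_i(x, y) returns log q_i(x, y) (up to a
    constant, provided the same constant is used for both directions).
    rho(theta) returns the selection probabilities over the index set S.
    """
    weights = rho(theta)
    i = rng.choice(len(proposals), p=weights)
    sample_i, log_q_i = proposals[i]
    prop = sample_i(theta, rng)
    # log acceptance ratio: target, selection weights, proposal densities
    log_alpha = (log_pi(prop) - log_pi(theta)
                 + np.log(rho(prop)[i] / weights[i])
                 + log_q_i(prop, theta) - log_q_i(theta, prop))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return prop
    return theta
```

Because the acceptance ratio includes both the selection weights ρ and the proposal densities q, detailed balance with π holds for each mixture component, matching the invariance argument in the proof above.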
• Sufficient conditions for uniqueness. The proof is adapted from [3, Section 4]. Let θ⋆⋆ = θ⋆ + ω be another minimizer. The Fermat rule ([8, Theorem 16.2]), which characterizes optimality, implies that zero is in the subdifferential of − ln π at any minimizer. Since ∇f_Z(θ⋆) depends on θ⋆ only through the p_t's, which are constant on the set M of the minimizers (see above), and since U^⊤ has full column rank, γ(θ⋆) is the same whatever the minimizer θ⋆. Observe that any minimizer is in the kernel K_1 of the matrix U_I which, by definition, collects the rows of U indexed by I; hence ω ∈ K_1. In addition, ω is in the kernel K_2, since all the p_t's are constant on M. Therefore, if K_1 ∩ K_2 = {0}, the MAP is unique.

Throughout the proof, we write A, U and Ā as shorthand for A_{i,j}, U_{i,j} and Ā_{i,j}. Under the stated assumptions, we first focus on the gradient step in (18).

References
[1] Temporal evolution of the Covid19 pandemic reproduction number: Estimations from Proximal optimization to Monte Carlo sampling
[2] Spatial and temporal regularization to estimate COVID-19 reproduction number R(t): Promoting piecewise smoothness via convex optimization
[3] The Generalized Lasso Problem and Uniqueness
[4] Describing, modelling and forecasting the spatial and temporal spread of COVID-19 — A short review
[5] Credibility interval design for Covid19 reproduction number from nonsmooth Langevin-type Monte Carlo sampling
[6] A Moreau-Yosida approximation scheme for a class of high-dimensional posterior distributions
[7] Convex Analysis and Monotone Operator Theory in Hilbert Spaces
[8] Convex Analysis and Monotone Operator Theory in Hilbert Spaces
[9] General methods for monitoring convergence of iterative simulations
[10] A first-order primal-dual algorithm for convex problems with applications to imaging
[11] An introduction to continuous optimization for imaging
[12] Discussion: Markov Chains for Exploring Posterior Distributions
[13] Langevin Monte Carlo without smoothness
[14] Proximal splitting methods in signal processing
[15] A new framework and software to estimate time-varying reproduction numbers during epidemics
[16] On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations
[17] Markov Chains. Springer Series in Operations Research and Financial Engineering
[18] Analysis of Langevin Monte Carlo via Convex Optimization
[19] Nonasymptotic convergence analysis for the unadjusted Langevin algorithm
[20] Efficient Bayesian Computation by Proximal Markov Chain Monte Carlo: When Langevin Meets Moreau
[21] Log-concave sampling: Metropolis-Hastings algorithms are fast
[22] Adaptive sparseness for supervised learning
[23] COVID-19 cacophony: is there any orchestra conductor? The Lancet
[24] Weak convergence and optimal scaling of random walk Metropolis algorithms
[25] Inference from Iterative Simulation Using Multiple Sequences
[26] Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images
[27] The impact of a nation-wide lockdown on COVID-19 transmissibility in Italy
[28] Time-Reversible Diffusions
[29] Measurability of the epidemic reproduction number in data-driven contact networks
[30] Sampling from Non-smooth Distributions Through Langevin Diffusion
[31] Rates of convergence of the Hastings and Metropolis algorithms
[32] Markov Chains and Stochastic Stability
[33] Analysis of multiresolution image denoising schemes using generalised Gaussian and complexity priors
[34] The R0 package: A toolbox to estimate reproduction numbers for epidemic outbreaks
[35] Correlation functions and computer simulations
[36] The Bayesian Lasso
[37] Nonsmooth convex optimization to estimate the Covid-19 reproduction number space-time evolution with robustness against low quality data
[38] Block-coordinate proximal algorithms for scale-free texture segmentation
[39] Parallel proXimal algorithm for image restoration using hybrid regularization
[40] Proximity operator of a sum of functions; application to depth map estimation
[41] Epidemiological characteristics of COVID-19 cases in Italy and estimates of the reproductive numbers one month into the epidemic
[42] The Bayesian choice: a decision-theoretic motivation
[43] Monte Carlo Statistical Methods (Springer Texts in Statistics)
[44] Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains
[45] Exponential convergence of Langevin distributions and their discrete approximations
[46] Langevin Diffusions and Metropolis-Hastings Algorithms
[47] Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm
[48] A Shrinkage-Thresholding Metropolis Adjusted Langevin Algorithm for Bayesian Variable Selection
[49] Improved inference of time-varying reproduction numbers during infectious disease outbreaks
[50] Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission
[51] Different Epidemic Curves for Severe Acute Respiratory Syndrome Reveal Similar Impacts of Control Measures
[52] On decomposing the proximal map