key: cord-0987903-w5aijz03
authors: Parag, K. V.
title: Improved estimation of time-varying reproduction numbers at low case incidence and between epidemic waves
date: 2020-09-18
journal: nan
DOI: 10.1101/2020.09.14.20194589
sha: 7c3cda052cf0e6467b742832eef8441ac125fb8c
doc_id: 987903
cord_uid: w5aijz03

We construct a recursive Bayesian smoother, termed EpiFilter, for estimating the effective reproduction number, R, from the incidence of an infectious disease in real time and retrospectively. Our approach borrows from Kalman filtering theory, is quick and easy to compute, generalisable, deterministic and unlike many current methods, requires no change-point or window size assumptions. We model R as a flexible, hidden Markov state process and exactly solve forward-backward algorithms, to derive R estimates that incorporate all available incidence information. This unifies and extends two popular methods, EpiEstim, which considers past incidence, and the Wallinga-Teunis method, which looks forward in time. This combination of maximising information and minimising assumptions, makes EpiFilter more statistically robust in periods of low incidence, where existing methods can struggle. As a result, we find EpiFilter to be particularly suited for assessing the risk of second waves of infection, in real time.

During an unfolding epidemic, one of the most commonly available and useful types of surveillance data is the daily (or weekly) number of new infected cases. This time-series of case counts, also known as the incidence curve, not only measures the epidemic size and burden, but also provides information about trends or changes in its transmissibility [1] , [2] . These trends are captured by the time-varying effective reproduction number, denoted R s at time s, which defines how the number of secondary cases generated per primary case varies across the duration of the outbreak [3] . When R s > 1 we can expect and prepare for growing incidence, whereas R s < 1 signifies that the epidemic is being controlled [4] .

Inferring changes in R s given an observed incidence curve is therefore crucial, both to understanding transmissibility and to forecasting upcoming case loads, especially in an ongoing epidemic, where it can help inform policymaking and intervention choices [1] , [5] . Real-time and retrospective R s estimates have been used to characterise rates and patterns of spread in various diseases such as malaria [6] and Ebola virus disease [7] and have already proven valuable across the COVID-19 pandemic, providing updating synopses of global transmission [8] or evidencing the impact of implemented control actions (e.g. lockdowns and social distancing) [9] , [10] .

Many of the studies that aim to infer R s either apply the Wallinga-Teunis (WT) method [2] or the Cori et al method, known as EpiEstim [11] . Both methods take complementary viewpoints on how incidence data inform on R s and hence have diverse use-cases. The WT method reconstructs the average number of new cases caused by infectious individuals at s and so requires incidence data beyond time s for its estimate, which is called the case reproduction number. It is better suited for retrospective analyses [12] . Alternatively, EpiEstim infers how past infections propagate to form the incidence observed at s, only requiring data prior to time s. EpiEstim is preferred for real-time investigations and its R s estimate is termed the instantaneous reproduction number [3] .

While both methods provide useful and important estimators of transmission, they are not perfect. Two main limitations exist. First, each suffers from data censoring or edge-effects [3] . Because the WT method is forwardlooking i.e. depends on data later than s, its estimates are right censored when s is close to the last observed time point [12] . In contrast, EpiEstim looks backward in time and suffers edge-effects when s is near the first observed time point [11] . Estimates in the vicinity of the start and end of the incidence time-series are therefore unreliable under EpiEstim and the WT method, respectively. Techniques have been proposed to limit this unreliability [5] , [14] , but the problem is intrinsic, and inevitable near the actual start and end times of an epidemic, where there is necessarily no or scarce data. This leads into the second key limitation: inference in periods of small incidence. This presents a fundamental challenge for any R s estimation approach and effectively creates additional edge-effects. When few or no case counts are available to EpiEstim WT method

EpiFilter

EpiFilter algorithm (mse optimal)

Forward pass (compute ): p s q t = p t , q 1 = r 1 Fig. 1 : EpiFilter algorithm. We summarise the construction of EpiFilter and compare its use of information to that of EpiEstim and the WT method. Left panels show that EpiEstim and the WT method consider complementary parts of the incidence curve, I t 1 (blue dots), to be informative about the reproduction number R s , while EpiFilter makes use of the entire curve. Red windows indicate the portions of the curve that inform on R s under a window of size k as in [11] , while blue windows show the portions involved in constructing the main posterior distributions in this work: p s , r s and q s . Right panels outline the main assumptions (the model box) and computations (the algorithm box) necessary for realising EpiFilter, which allow us to obtain the most informative (and minimum mse) posterior distribution q s . See the main text for the specific equations employed in our implementation [13] .

constrain inference, methods are largely driven by their inherent prior distributions and assumptions, which can result in misleading and unreliable estimates [11] , [15] . Understanding how to best mediate the trade-off between prior assumptions and data when incidence is small is of both statistical and epidemiological importance.

Following a period of low incidence, two outcomes are possible: either the epidemic continues to exhibit small or zero case counts until it goes extinct or a resurgence in infections, also termed a second wave, occurs. Deciphering, in real time, which of these conditions is likely is a key challenge for infectious disease epidemiology given the limited information available at low incidence [16] . Better inference of transmission under these conditions is currently considered central to designing data-informed COVID-19 intervention exit or relaxation strategies [17] . With many countries now facing potential second waves in this pandemic, estimating fluctuations in transmission during suspected epidemic troughs could be essential to achieving sustained control [10] .

Here we present and develop a novel method, termed EpiFilter, for reliably estimating R s in real time, which ameliorates the above limitations. We take an engineering inspired approach and construct an exact, recursive and deterministic (i.e. EpiFilter produces the same output for fixed input data and requires no Monte Carlo steps) inference algorithm that is quick and easy to compute both across an unfolding outbreak and in retrospect. Our method solves what is called the smoothing problem in control engineering [13] . This means we compute R s estimates that formally integrate both forward and backward looking information. This unifies the WT method and EpiEstim, and nullifies their edge-effect issues.

Further, EpiFilter only makes a minimal Markov assumption for R s , which allows it to avoid the strong prior window size and change-point assumptions that existing methods may apply to infer shifts in transmission [11] , [9] . Using simulated and empirical data, we show that EpiFilter accurately tracks changes in R s and provides reliable one-step-ahead incidence predictions. Moreover, we find that EpiFilter is appreciably more robust than even optimised versions of EpiEstim [14] , if performing real-time inference in periods of low incidence and between epidemic waves. We illustrate the practical utility of EpiFilter on the COVID-19 incidence curve of New Zealand, which exhibits a potential second wave.

Our method, which is outlined in Fig. 1 , provides a straightforward yet formally optimal (in mean squared error [18] ) solution to real-time and retrospective R s estimation. Because it couples minimal prior assumptions with maximum information extraction, it more gracefully handles periods with scarce data. Hopefully our approach will serve as a useful inference tool for investigating the risk of second waves of infection in COVID-19 and other epidemics. An implementation of EpiFilter is available at https://github.com/kpzoo/EpiFilter and its mechanics are explored and validated in the Appendix.

We consider an infectious disease epidemic observed over some period of time 1 ≤ s ≤ t. If the incidence or number of newly infected cases at time s is I s then I t 1 = {I s : 1 ≤ s ≤ t} is known as the incidence curve of the epidemic. For convenience, we assume that incidence is available on a daily scale so that I t 1 is a vector of t daily counts (weeks or months could be used instead). A common problem in infectious disease is the inference of the transmissibility of the epidemic given this incidence curve. The renewal model [1] , [19] presents a general and popular framework for investigating this problem.

This model posits that epidemic transmissibility, summarised by the effective reproduction number, R s , generates the observed incidence as in Eq. (1). There Pois indicates Poisson observation noise, d = means that both distributions are equal and Λ s := s−1 u=1 I s−u w u .

In Eq. (1) R s describes the branching of the epidemic i.e. it captures the number of secondary cases generated per effective primary one at s, while Λ s , which is termed the total infectiousness, defines how many effective past cases are still infectious at s. The generation time distribution of the epidemic, w s−1 1 = {w u : 1 ≤ u ≤ s − 1}, controls how past incidence influences Λ s , with w u as the probability that a primary case takes between u − 1 and u days to generate a secondary case [1] .

We make the standard assumption that w ∞ 1 is known for the epidemic of interest and well approximated by its serial interval distribution and focus on inferring the complete set of R s values i.e. R t 1 given I t 1 with t as the last recorded time of the incidence curve. Estimating R t 1 is important, not only for learning whether the epidemic is still growing (whenever R s > 1) or under control (i.e. if R s < 1) [4] , but also for assessing how much existing interventions can be relaxed without risking a second wave of infection [10] . This latter problem can be particularly difficult (as we will show) since, under these conditions, incidence data are scarce.

We define three key inference problems, commonly studied in engineering [18] , based on how information in I t 1 is recruited to construct R t 1 estimates. We represent these problems in terms of the posterior distribution they induce over R t 1 . Estimates are then functions of these posteriors. The first is called causal filtering, where we sequentially calculate p s = P(R s | I s 1 ) [20] . Causal means that estimates of R s only depend on data up to time s. Solving this problem is fundamental to real-time inference [21] . The second problem is non-causal filtering and is the complement of the first. Here we calculate r s = P(R s | I t s ), which is important for retrospective or backward-looking estimates [22] .

The last problem is termed smoothing and is our main interest. To solve it we must extract all the information possible to formulate q s = P(R s | I t 1 ) [13] . Note that p s , r s and q s depend on the choice of state space model, which describes the dynamics of R s across time, and observation model, which explains how changes in R s lead to trends in the observed incidence data [18] . These models encode our assumptions about the epidemic of interest and determine how estimates tradeoff assumptions against data. We will find this latter point important for determining performance when incidence data are scarce. We next explain how the most popular approaches for estimating R t 1 , the WT method and EpiEstim fit within this engineering framework.

We mostly detail EpiEstim as real-time R s estimates are the key focus of this work. EpiEstim assumes that all reproduction numbers are the same over a window back in time, τ (s) = {s, s−1, . . . , s−k+1} of size k. Consequently, only I s s−k+1 provides information about R s and the causal filtering distribution p τ (s) = P(R s | I s s−k+1 ) follows as in Eq. (2) with a shape-scale gamma prior distribution applied i.e. P(R s ) d = Gam(a, c −1 ) [11] .

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint

Here i τ (s) = u∈τ (s) I u and λ τ (s) = u∈τ (s) Λ u are sums over the window of interest.

The resulting posterior mean estimate is thenR τ (s) = p τ (s) R s dR s = E[R s | I s s−k+1 ] = (a + i τ (s) )(c + λ τ (s) ) −1 and the Fisher information that I s s−k+1 contains about R s is proportional to λ τ (s) [23] . The observation model is Eq. (1) but the state space model of EpiEstim is not explicit. However, if R τ (s) is the assumed average reproduction number in τ (s) (which is used to estimate R s ) then λ τ (s) R τ (s) = u∈τ (s) Λ u R u [14] . As a result, EpiEstim somewhat incorporates a linear moving average state space model, and assumes that p s ≈ p τ (s) by deeming I s−k 1 to be uninformative. Since p τ (s) can be computed sequentially across an ongoing epidemic, EpiEstim provides real-time inference.

The WT method takes a complementary approach to EpiEstim and computes an estimate of R s based on a forward-looking window [2] . It also uses the observation model of Eq. (1) and has an implicit moving average state model R τ (s) = u∈τ (s) R u w u−s+1 [3] . We abuse notation and define τ (s) = {s, s + 1, . . . , s + k − 1} in this case as a window into the future of size k. The WT method assumes that r s ≈ r τ (s) = P(R s | I s−k+1 s ) and approximately solves the non-causal filtering problem. We illustrate the information windows employed by the WT method and EpiEstim and the complete causal and non-causal filtering windows in Fig. 1 . The goodness of p τ (s) and r τ (s) as measures of p s and r s will depend on k and the appropriateness of the state model [14] .

While EpiEstim and the WT methods are powerful tools for inferring R s in real-time and in retrospect, they have two main and related limitations, which necessarily reduce the reliability of their outputs [12] . First, their performance degrades as s gets close to 1 for EpiEstim and t for the WT method [11] . These edge or censoring effects correspond, at the extreme, to p 1 = p τ (1) = P(R 1 | I 1 ) and r t = r τ (t) = P(R t | I t ), which are weakly informed posterior distributions. As a result, at the beginning of the incidence curve EpiEstim can be unreliable (and even unidentifiable). The WT method suffers similarly at the end of that curve [1] (see Fig. 1 ).

The second limitation occurs in phases of the epidemic where incidence is low and prolonged [23] , [17] . In these periods information is scarce and the quality of estimates will depend on how well the method of choice mediates between the little available data and its inherent assumptions [15] . We illustrate this effect using EpiEstim by considering an epidemic with a sequence of n zero incidence days. This sequence is realistic for an epidemic in its tail phase, and can precede either elimination (i.e. the epidemic goes extinct) or resurgence (i.e. a second wave of infection emerges) [24] . As n increases both i τ (s) and λ τ (s) shrink, though i τ (s) decays more rapidly. For every n ≥ k, i τ (s) = 0, meaning that the shape of p τ (s) is completely determined by the prior P(R s ).

As n increases further λ τ (s) becomes negligible and we exactly recover our prior distribution (see Eq. (2)). This problem is acute if the window size, k, is small. Previous studies, which formally optimise k to maximise estimate reliability, found that small k is needed when inferring sharp changes in transmissibility (e.g. due to lockdowns) [14] . Moreover, as λ τ (s) falls, the Fisher information available about R s from I s s−k+1 shrinks, leading to an inflation in estimate uncertainty and a loss of statistical identifiability [25] , [23] . An analogous effect occurs in the WT method if there is low incidence across its forward-looking window [12] .

While some estimate degradation is guaranteed for any R s inference method when faced with either edge-effects or low incidence, robustness can still be improved. Edgeeffects can be largely overcome by constructing q s . Solving the smoothing problem melds the advantages of the opposite looking windows of EpiEstim and the WT method, removing the vulnerability near the ends of I t 1 . This follows as q 1 = r 1 and q t = p t . Further, by maximising the information used for inferring every R s and by minimising our state model assumptions, we can ameliorate the impact of low incidence. We develop a method, termed EpiFilter, to realise these improvements in the next section.

We reformulate the causal inference problem of estimating R s from I s 1 as an optimal Markov state filtering problem. Filtering is the term used in engineering for a general class of estimation problems aimed at optimally (usually in a mean squared error (mse) sense) inferring some hidden state causally and in real time from noisy observations [18] , [13] . Given functions f s and g s , which describe the state (R s in our case) space dynamics and the process of generating noisy observations (the I s here), the filtering problem tries to construct the posterior distribution p s [20] , which EpiEstim approximates. The conditional mean estimateR

The famed Kalman filter [21] was the genesis of these methods. Here we focus on Bayesian recursive filters for models with noisy count observations. These generalise the Kalman filter [20] and have been successfully applied to similar problems in phylodynamics and computational biology [26] , [27] , [28] , [29] . We reconsider our renewal model inference problem within this engineering state-. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint observation framework, as described in Eq. (3) [13] .

Here R s is the hidden Markov state that we wish to infer. It dynamically depends on the previous state R s−1 and a noise term s−1 via f s . Observation I s is then elicited due to R s and a noise term ν s , according to g s [30] . We formulate our filter under two very mild assumptions. First, we define some closed space, R, over which R s is valid i.e. for a given resolution m, extrema R min and R max , and grid size δ = m −1 (R max − R min ) then R := {R min , R min +δ, . . . , R max }. This means R s must take a discrete value in R, the i th element of which is

This is not restrictive since we can compute our filter for large m if needed and usually we are only interested in R s on a coarse scale (e.g. policymakers may only want to know if R s ≤ 1 or not). Note that other approaches, which depend on MCMC or related sampling methods, all implicitly assume some sort of discretisation [27] . Second, we propose a linear model for f s , as defined in Eq. (4b). There s−1 is a standard white noise term i.e. P( s−1 ) d = Norm(0, 1) with Norm signifying a normal distribution and η is a free parameter. We assume that a noisy linear projection of states over consecutive timepoints provides a good approximation of the state trajectory. Not only is this assumption standard in engineering [20] and epidemiology [31] but it is also more flexible than the state model inherent to EpiEstim and the WT method. We scale the noise of this projection by a fraction, η < 1 of the magnitude of R s−1 . This allows variation but ensures R s is a-priori non-negative.

Our observation model, g s , is implicit and leads to the probability law in Eq. (1). As a result, both our observations and state models are discrete and are summarised in Fig. 1 . We now define the Bayesian recursive filtering procedure, which is a main contribution of this work, and can be solved exactly, in real-time and with minimal computational effort. We adapt general recursive filtering equations from [30] , [13] , [22] , [18] , which are valid for various types of observation and state models, for our renewal model inference problem. The proof of the equations we employ can be found in these works.

Recursive filtering involves two steps: prediction and correction. The first, given in Eq. (5a), constructs a sequential prior predictive distribution, p s = P(R s | I s−1 1 ), which is informed by past data I s−1 1 and the last state R s−1 . The second step then corrects or updates this prior into the posterior distribution, p s , which constrains p s using the latest observation, I s , as in Eq. (5b).

is the observation model from Eq. (1) and the constant of proportionality for Eq. (5b) is simply a normalising factor. Solving Eq. (5) iteratively and simultaneously over the grid of R leads to our novel real-time estimate of the time varying effective reproduction number. We initialise this process with a uniform prior over R for p 1 and note that p s and p s are m element vectors that sum to 1, with i th term corresponding to when R s = R[i]. Eq. (5) forms the first half of EpiFilter, is flexible and can be adapted to many related problems [13] . A key difference between EpiFilter and the EpiEstim-type methods [11] , [14] is that the latter approximate P(R s | I s 1 ) and P(R s | I s−1 1 ) with P(R s | I s s−k+1 ) and P(R s ), respectively. Estimators based on these approximations can be suboptimal, especially when data (i.e. cases) are scarce.

While Eq. (5) provides a complete real-time solution to the causal filtering problem, it is necessarily limited at the starting edge of the incidence curve, where past data are scarce or unavailable. Further, because it does not update past estimates as new data accumulate, it cannot provide optimal retrospective estimates. Here we develop the second half of EpiFilter, which involves solving the optimal smoothing problem and hence computing q s . To our knowledge, smoothing has not yet been explicitly considered in infectious disease epidemiology.

We specialise the general methodology from [13] , [22] to obtain the recursive smoother of Eq. (6a).

We can realise Eq. (6a) exactly by taking a forwardbackward algorithmic approach with p s and p s+1 first computed from Eq. (5). Remaining terms emerge from the state space model and by noting that q t = p t and iterating backwards. This approach neatly links p s and q s . If we assume that r s is reasonably approximated by P(R s | I t s+1 ) then we can also apply the two-filter smoothing solution of [22] to get Eq. (6b). This shows how smoothing connects EpiEstim and the WT methods, . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint and explains why EpiFilter can be used for both realtime and retrospective inference. We summarise the main ingredients of the EpiFilter algorithm in Fig. 1 .

The smoothed posterior q s yields the conditional mean estimateR s = E[R s | I t 1 ], which is known to significantly improve on the mse of the filtered equivalentR s [20] . While filtering provides the minimum mse given causal knowledge, smoothing provides the minimum given all knowledge. This relationship is formal, with filtered and smoothed mse values mapping to the amount of mutual information that I t 1 provides about R t 1 [32] . Extracting the maximum information from I t 1 should engender estimates that are more robust to periods of low incidence.

While our main interest is on optimised and rigorous real-time and retrospective estimates of transmissibility, which are completely defined by q s , it is also important to predict future incidence, for informing epidemic preparedness plans and for validating past R s estimates [33] , [14] . We compute the filtered posterior predictive distribution as in Eq. (7) (integrals are over R [13] .

We assume, as in [34] :

Replacing p s with q s yields the smoothed equivalent of Eq. (7). We will use Eq. (7) to compare EpiFilter against APEestim, which is the prediction-optimised version of EpiEstim [14] . Since predictions depend strongly on p s or q s , optimising these distributions can be crucial, for example, to forecasting second waves of infection.

The estimation of time-varying effective reproduction numbers when incidence is low is seen as a key challenge limiting our understanding of transmission [17] . Periods with small counts of new cases contain little information and so present necessary statistical difficulties [23] . Here we compare EpiFilter, which allows exact inference but conditioned on R, with EpiEstim and APEestim, which minimises the prediction error of EpiEstim by optimising its window size, at these data-poor settings [11] , [14] . We use EpiEstim with a long window, as this is known to improve robustness at low incidence [24] . The assumptions and choices inherent in estimation methods become important and visible when data are scarce and can bias inference or support spurious predictions [15] .

We simulate epidemics using Eq. (1)) under the serial interval distribution of Ebola virus disease available from [35] . We examine several diverse incidence trajectories (e.g. outbreaks with small sizes, large peaks or multiple modes) that all eventually decline to periods of few or no cases. We apply APEestim with optimal window k * , EpiEstim with a long window k and EpiFilter with state noise η (see Eq. (4b)) to estimate R s and causally predict I s for each epidemic. For the first two methods we compute mean predictions (Ĩ τ (s) ) as in [14] and estimates (R τ (s) ) from Eq. (2) with τ (s) delimiting the window times used. We obtain smoothed EpiFilter estimates (R s ) and causal predictions (Ĩ s ) from Eq. (6) and Eq. (7).

Our main results are in Fig. 2 , which exposes how these three approaches degrade as data diminish. Each quadrant of this figure examines a different epidemic scenario. In all plots the true R s and I s are dashed black and dotted grey respectively, estimates are in red and predictions are blue. The central lines are conditional means and the shaded regions are their 95% confidence intervals. As our main focus is on real-time performance we do not investigate the WT method. See [11] , [12] for comparisons of the WT method and EpiEstim. Our reasons for comparing to EpiEstim with a long window and APEestim follow from the Methods.

There we showed that since EpiEstim-type methods group data over a window into the past, they revert to their prior distribution as the total cases in this window becomes small. During low incidence periods, e.g. when the epidemic is dying or in its early growth phase, using long windows makes sense [24] . However, this reduces predictive accuracy as fluctuations in R s are underfit. This is the use-case for APEestim, which optimises for prediction. The top and middle panels of each quadrant of Fig. 2 illustrate the clear consequences of these tradeoffs. APEestim often selects shorter windows to better fit R s and sequentially predict I s . Each predicted I s is a one-step-ahead prediction depending on the past data up to s − 1 as in Eq. (7) (we have abused notation).

While shorter windows improve estimates and predictions the reversion to the prior distribution (seen as wide confidence intervals in Fig. 2 ) when I s is low is acute. Long windows resolve this latter difficulty, showing a slower rise to the prior but at the expense of getting other parts of the epidemic trajectories incorrect. Interestingly, in the bottom panels of the quadrants of Fig. 2 we see that EpiFilter has superior performance for these types of epidemics. Not only does it give good estimates and predictions throughout the trajectory of the epidemic but its rise in uncertainty as the epidemic dies is slower and more controlled. It therefore combines the advantages of APEestim and long-window EpiEstim.

Maintaining robust R s estimation at low incidence is not just statistically important. Two possible outcomes . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint Fig. 2 : Small or dying epidemics. We compare reproduction number estimates (R τ (s) orR s ) and causal one-stepahead incidence predictions (Ĩ τ (s) orĨ s ) from APEestim with optimised window k * (top panels), EpiEstim with long window k (middle panels) and EpiFilter with state noise η (bottom panels). We simulate epidemics with low daily case numbers or long tails (long sequences of zero cases) using the standard renewal model (Eq. (1) ). The true R s is in black (dashed, left panels) and its corresponding I s in grey (dots, right panels). All mean estimates or predictions are in red with 95% confidence intervals. APEestim and EpiEstim use a Gam(1, 2) prior distribution and EpiFilter a grid with m = 2000, R min = 0.01 and R max = 10. We find that EpiFilter is more robust to small incidence (better uncertainty), whereas the other approaches can quickly decay to their prior distribution. may follow periods of small I s : either the epidemic goes extinct, or a second wave of infection (also known as resurgence) occurs e.g. due to imports or unmonitored local transmission. Predicting which outcome is likely, rapidly and in real-time, is of major global concern as countries aim to relax interventions during the ongoing COVID-19 pandemic, while also minimising the risks of second waves [17] . As changes in reproduction numbers predate and signal corresponding variations in incidence, reliably identifying and inferring R s trends in the trough preceding potential new peaks is crucial for preparedness and both timely and effective epidemic control [10] .

Reliable estimation of R s between epidemic waves depends on the prior assumptions of the inference method used and on how that method relies on those assumptions when data are scarce [36] , [37] . Here we examine this dependence and exactly investigate this scenario, where after a low incidence period resurgence occurs. As in the . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint Fig. 3 : Epidemics with multiple waves. We compare reproduction number estimates (R τ (s) orR s ) and causal one-step-ahead incidence predictions (Ĩ τ (s) orĨ s ) from APEestim with optimised window k * (top panels), EpiEstim with long window k (middle panels) and EpiFilter with state noise η (bottom panels). We simulate epidemics with multiple waves of infection using the standard renewal model (Eq. (1) ). The true R s is in black (dashed, left panels) and its corresponding I s in grey (dots, right panels). All mean estimates or predictions are in red with 95% confidence intervals. APEestim and EpiEstim use a Gam(1, 2) prior distribution and EpiFilter a grid with m = 2000, R min = 0.01 and R max = 10. We find that due to the improved robustness of EpiFilter to low incidence regions, it is best able to negotiate troughs between epidemic peaks and hence infer re-emerging infectious dynamics.

previous section, we test APEestim with optimal window k * (top panels), EpiEstim with a long window k (middle) and EpiFilter with state noise η (bottom panels) on four diverse epidemics, which in this case all feature multiple waves or early hints of upcoming resurgence. Simulation parameters are unchanged from the last section.

Our main results are in Fig. 3 with each quadrant specifying a different epidemic trajectory. The true R s and I s are in dashed-black and dotted-grey respectively.

All estimates (R τ (s) for EpiEstim and APEestim andR s for Epifilter) are in red with corresponding causal predictions (Ĩ τ (s) andĨ s ) in blue. Plots provide conditional means with 95% confidence intervals. For all second and additional wave epidemic examples we observe a similar tradeoff in performance between APEestim and long-window EpiEstim as in our previous analysis of Fig. 2 . APEestim is better able to predict upcoming I s terms and follow the overall trend in R s than EpiEstim.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint However, the long window of EpiEstim makes it more stable during the trough between waves, which can be advantageous for understanding transmission there.

EpiFilter once again combines the advantages of the other approaches. For every scenario in Fig. 3 it provides accurate tracking of changes in R s with stable confidence intervals. Concurrently, its predictive performance rivals that of APEestim. EpiFilter is therefore a powerful tool when dealing with second-wave scenarios. We also find that the η = 0.1 parameter value seems to be an allpurpose heuristic, meaning that usage of EpiFilter can be somewhat simpler than APEestim and EpiEstim. The superior capability of EpiFilter in these scenarios likely results from its minimalistic assumptions (see Eq. (4)) and its increased information extraction. We next test our method on empirical incidence data.

The previous sections confirmed EpiFilter as a powerful inference and prediction tool, especially in data-poor conditions, using simulated epidemics. We now confront our method with empirical data from the 1918 H1N1 influenza pandemic in Baltimore (USA) [38] and the ongoing COVID-19 pandemic in New Zealand (up to 17 Aug 2020) [39] . The H1N1 dataset has been well-studied and so we first use this to benchmark EpiFilter. We clean this data by applying a 5-day moving average filter as recommended in [38] . Previous work [11] analysed this data with EpiEstim and found that sensible R s estimates result when a weekly window (k = 7) is applied.

A recent study, which re-examined this epidemic with APEestim [14] , shows that, while EpiEstim with k = 7 provides stable estimates for this epidemic, it is a poor causal predictor of the incidence data. Instead, an optimised window of 2 days (k * = 2) yields good predictions but the resulting R s estimates are noisy. We reproduce the estimates (R τ (s) ) and predictions (Ĩ τ (s) ) from both studies in Fig. 4 and compare them against EpiFilter with η = 0.1 (R s andĨ s ). The cleaned H1N1 incidence is in dotted grey, the critical R s = 1 line in dashed black and all R s estimates and I s predictions (with 95% confidence intervals) are in red and blue respectively.

Top and middle rows of Fig. 4 show the mentioned trade-off between estimate stability and prediction accuracy. The bottom row confirms the power of EpiFilter. Our R s estimates are of comparable stability to that of EpiEstim at k = 7, yet our prediction fidelity matches APEestim. Our improved inference again benefits from using more information (i.e. the backward pass in Fig. 1 ) and making less restrictive prior assumptions. We see the latter from the R s confidence intervals over 40 ≤ s ≤ 60.

There EpiEstim seems overconfident, and this results in a rigid overestimation of incidence. However, EpiFilter mediates its confidence to a level similar to APEestim. Fig. 4 : H1N1 influenza transmission estimates for Baltimore (1918) . We compare APEestim (top), EpiEstim with recommended weekly window (middle) (both with Gam(1, 2) prior) and EpiFilter (with m = 2000, η = 0.1, R min = 0.01 and R max = 10) on the H1N1 influenza dataset of [38] . We use a 5-day moving average filter, as in [38] , to remove known sampling biases. Estimates of R s and corresponding 95% confidence intervals are in red. One-step-ahead predictions of I s (with 95% confidence) are in blue with the actual incidence in grey. We find that EpiFilter combines the benefits of the other methods, achieving good estimates and predictions.

We explore COVID-19 transmission patterns in New Zealand using incidence data up to 17 Aug 2020 from [39] . New Zealand presents an insightful case study because officials combined swift lockdowns with intensive testing to achieve and sustain very low incidence levels that some believed could engender local elimination of COVID-19 [40] . However, an upsurge in cases in early Aug inspired concerns about a second wave (which led to new interventions and is why we do not consider data beyond 17 Aug). Here we investigate time-varying transmission in New Zealand to see if this uptick suggests that the epidemic was resurfacing in mid Aug. We believe smoothing can confer important inferential advantages in exactly these types of low incidence scenarios.

We make the common assumptions that case underreporting is constant and that reporting delays are negligible [11] . While this makes our investigation somewhat naive, we consider this to be reasonable given the intense surveillance that New Zealand employed [41] . We also use the serial interval distribution from [42] and do not . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint explicitly distinguish imported from local cases in our analysis. The latter could bias our study [24] but our focus is on demonstrating differences between filtering and smoothing on R s trends instead of specifying the precise value of R s during this period. We plot the results of our exploration in Fig. 5 . More involved analyses can be performed by pre-processing the data for known delays or case ascertainment fractions as in [12] , [6] . We compute smoothed and filtered reproduction number estimates,R s andR s respectively, from the COVID-19 incidence curve for New Zealand (available at [39] ). We use EpiFilter with m = 2000, η = 0.1, R min = 0.01 and R max = 10 with a uniform prior over the grid R. The top panel shows conditional mean estimates and 95% confidence intervals forR s (red) andR s (grey). Vertical lines indicate the start and end of lockdown, a major intervention that was employed to halt transmission. The additional 'future' information used in smoothing has a notable effect. The bottom panel provides smoothed onestep-ahead predictionsÎ s (blue, with 95% confidence) of the actual reported cases I s (grey). The inset gives the probability ofR s ≤ 1 and lockdown times.

We apply EpiFilter and obtain causally filtered (R s , grey) and smoothed (R s , red) conditional mean estimates together with their 95% confidence intervals. These are in the top panel of Fig. 5 and computed from Eq. (5) and Eq. (6) respectively. The times of lockdown and release are the grey vertical dashed lines. Interestingly, we see a notable difference in the quality of inference betweeñ R s andR s . The former, as expected, is unreliable at the beginning of the incidence curve and features wider uncertainty and unclear trends. The smoothed estimate, by using both forward and backward-looking data largely overcomes these issues and clarifies transmission trends.

Our smoothed analysis suggests that R s has resurged and supports the reimplementation of measures around 14 Aug. In the bottom panel of Fig. 5 we provide predictions (which are from the smoothedR s hence the notationÎ s ) and their 95% confidence intervals in blue against the reported incidence from [39] in dotted grey. These confirm thatR s reasonably describes the data. In the inset we show that the P(R s ≤ 1) also supports the resurgence hypothesis. Last, we note that transmission decreased considerably (and rise in P(R s ≤ 1)) across the lockdown period.

Estimating time-varying trends in effective reproduction number R s reliably and in real-time is an important and popular problem in infectious disease epidemiology [5] . As the COVID-19 pandemic has continued to unfold, the interest in solving this problem has only elevated [8] . Initially, the focus was on characterising how R s might respond to interventions such as lockdowns and social distancing [10] . However, as COVID-19 has progressed and countries have entered the controlled phase, concerns have shifted to trying to understand how existing interventions might be relaxed with minimum risk.

The literature on intervention exit strategies is, however, still in development, and several challenges remain to modelling the transmission behaviour of epidemics following prolonged periods of low incidence [17] , [24] . While changes in R s over this period are not the only analytic for assessing this risk, they do provide a key realtime diagnostic since upticks in R s generally precede corresponding rises in incidence. Unfortunately, current approaches to estimating R s under these conditions tend to be underpowered or prior-constrained [11] , [23] .

Here we re-examined existing methodology for inferring R s from an engineering perspective. We observed that two of the most useful and popular inference approaches, EpiEstim [11] and the WT method [2] , only capitalise on a portion of the data available, deeming either upcoming or past incidence informative (see Fig. 1 ). This informative portion is directly controlled by prior assumptions on the speed of possible R s changes, which are often characterised by a window of size k. Other methods are also known to apply similarly strong change-point assumptions on R s [23] , [9] . When data are scarce these assumptions can control inference.

In control engineering a common problem, known as filtering, involves optimally (in a mse sense) estimating hidden Markov states, causally in real-time, from noisy and uncertain observations [18] . A related problem termed smoothing provides accompanying and optimal . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint retrospective inference [13] . By reinterpreting R s as a Markov state (Eq. (4)) observed through a noisy renewal process (Eq. (1)) and defining R s on a predetermined grid R, we were able to construct exact filtering (Eq. (5)) and smoothing (Eq. (6)) solutions. This led to EpiFilter, which is our central contribution.

Generally, filtering and smoothing can be involved and require sophisticated sequential Monte Carlo techniques [30] . However, because our system is low dimensional and as we make only minimal assumptions about R s , modelling it as a simple diffusion, we were able to solve these problems exactly and without complex sampling algorithms [27] . Our solutions are computationally simple and deterministic (i.e. precisely reproducible given the same data). While EpiFilter is limited by one free parameter η, our η = 0.1 heuristic seems statistically justified by its good causal one-step-ahead predictive performance (see Appendix and Fig. 2 and Fig. 3 ) [14] .

Importantly, EpiFilter is able to look both forward and backward through the incidence data, and so maximise the information extracted [32] . This property means it combines the advantages of both EpiEstim and the WT method (see Fig. 1 ) and largely negates their edge-effect issues [12] , offering both real-time and retrospective R s estimations that are robust. We illustrated the advantages of EpiFilter by comparing it to EpiEstim and APEestim on many simulated examples with diverse periods of low incidence and epidemic resurgences ( Fig. 2 and Fig. 3) . Interestingly, we found EpiFilter able to track arbitrary changes in transmission without requiring any preset change-points or window sizes or any tuning of η.

Moreover, EpiFilter was notably better at negotiating periods of low incidence, offering a graceful degradation to its prior without sacrificing predictive accuracy. When incidence is low, it is usually beneficial to use long windows with EpiEstim [24] . This keeps R s estimates reasonably stable. However, it often leads to poor predictions [14] . APEestim, which optimises window size for prediction fidelity, showed that in many of the simulated scenarios short windows are necessary for describing transmission patterns. Consequently, we have a tradeoff between estimate robustness and prediction accuracy.

We found that EpiFilter overcomes this tradeoff, concurrently achieving good estimates and predictions. We confirmed these advantages on empirical data from the H1N1 pandemic of 1918 (see Fig. 4 ). The ability to integrate the benefits of long window EpiEstim and APEestim make EpiFilter particularly useful for investigating resurgence after a period of low incidence, as it is better able to forewarn of increasing case numbers and can more speedily infer upticks in transmission (see Fig. 3 ). Capitalising on these properties, we performed a naive exploratory analysis of COVID-19 in New Zealand and found evidence of a potential second wave of infection after a prolonged period of control (see Fig. 5 ).

Balancing the assumptions inherent to a model against the data it is applied on, to produce reliable inference is a non-trivial problem that is still under active investigation in several fields [15] , [36] , [37] . EpiFilter, by maximising the information extracted from the incidence data and minimising its state space model assumptions, appears to strike this balance. Consequently, it performs strongly on a wide range of problems, including those involving sparse data, where other methods might struggle. Given its demonstrated advantages, straightforward formulation and theoretical underpinning, we hope that EpiFilter will be useful as a diagnostic tool for reliably warning about second waves of infection as COVID-19 unfolds.

Having constructed EpiFilter in Eq. (3) -Eq. (6) we now investigate its mechanics and performance. First we illustrate how the choice of the grid resolution m allows our approach to achieve estimates at differing levels of accuracy. Our recursive implementation is also known as a grid smoother, since its estimates are optimal (in mse) but conditioned on the grid R [27] . Left panels of Fig. A1 show how the smoothed estimatorR s (red together with 95% confidence bounds) becomes more accurate with increased resolution. The true R s is in black. Generally, we find that no visible improvement occurs above m ∼ 10 3 . Wider grid ranges R max − R min require larger m for the same resolution.

We also test the behaviour under different choices of the process or state noise parameter η. This parameter is vaguely similar to the window size parameter, k, in EpiEstim approaches, but is easier to select. The right panels of Fig. A1 , show how a smaller η might lead to underfitting, while large η (where we would not consider η ≥ 1 given Eq. (4b)) could promote overfitting. However, while the optimal choice of window in EpiEstim varies with the form of the true R s (e.g. large k is needed for stable R s time-series and small k for strongly fluctuating ones [14] ), we find that a fixed η = 0.1 works well across many and diverse epidemic scenarios.

We illustrate this further in Fig. A2 . There we compare R s estimates from APEestim (left panels, denotedR τ (s) . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint Fig. A1 : Filtering mechanics. We examine how EpiFilter behaves under various choices of process noise parameter η and grid resolution m. Left panels indicate that increasing m leads to improved estimation, as expected, but that even at coarse settings the estimate behaves sensibly. Right panels show that reducing η sharpens confidence intervals but can potentially underfit. We find that η = 0.1 serves as a good general heuristic. All simulations are at R min = 0.01 and R max = 10, black is the true R s (a seasonal epidemic) and red is the EpiFilter mean estimate (R s ) with 95% confidence intervals.

with optimal window size k) [14] , [11] to EpiFilter (right panels, denotedR s ) at η = 0.1 and m = 2000 for several simulated epidemic scenarios. We generate epidemics according to the standard renewal model of Eq. (1) using the serial interval distribution for Ebola virus disease [35] . The true R s is in black with respective estimates in red (together with 95% confidence intervals). To keep comparisons fair we set the usual gamma prior distribution over R s in APEestim to Gam(1, 2) so that its domain is essentially bounded by R max .

EpiFilter has broadly comparable inference performance to APEestim, despite our using a single η value across the entire range of simulated scenarios. This is in stark contrast to the discordant optimal k values, which are shown in Fig. A2 . Moreover, our implementation is fast (it executes in seconds) and unlike other window-less approaches does not require any choices of the timing or number of change-points i.e. at this single η it easily handles unchanging trajectories, seasonal variations and rapidly or gradually fluctuating transmission.

The bottom panels of Fig. A2 illustrate the central advantage of EpiFilter. There the associated incidence values (not shown) are small. While this causes APEestim inference to destabilise (see Methods for explanation), EpiFilter offers more robust and usable estimates. We explore this in detail in the main text. Note that in none of these scenarios does the true R s have either the state model of Eq. (4b) or the sliding-window relationship in EpiEstim. Consequently, these simulations also indicate robustness of the different inherent state assumptions in both approaches to moderate model mismatch. Fig. A2 : Validation against APEestim. We compare estimates from EpiFilter (right panels) with optimised ones from EpiEstim (called APEestim, left panels [14] ) over various epidemic scenarios, simulated using Eq. (1). We use m = 2000, R min = 0.01 and R max = 10. To be fair we use a Gam(1, 2) prior distribution with APEestim, which sums to 1 by R max . Here k is the optimal window choice in APEestim and η = 0.1 is fixed for EpiFilter inferences. True R s is in black and conditional mean estimates with 95% confidence intervals are in red. Both approaches offer comparable performance but EpiFilter is more stable especially early in the epidemic when incidence is low.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted September 18, 2020. . https://doi.org/10.1101/2020.09.14.20194589 doi: medRxiv preprint

Estimating individual and household reproduction numbers in an emerging epidemic

Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures

The effective reproduction number as a prelude to statistical estimation of time-dependent epidemic trends

Infectious diseases of humans: dynamics and control

Estimating in real time the efficacy of measures to control emerging communicable diseases

Measuring the path toward malaria elimination

Ebola virus disease in West Africa -the first 9 months of the epidemic and forward projections

Short-term forecasts of COVID-19 deaths in multiple countries

Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe

Reconstruction of the full transmission dynamics of COVID-19 in Wuhan

A new framework and software to estimate time-varying reproduction numbers during epidemics

Practical considerations for measuring the effective reproductive number, rt

Bayesian Filtering and Smoothing

Using information theory to optimise epidemic models for real-time prediction and estimation

Are skyline plot-based demographic estimates overly dependent on smoothing prior assumptions?

Five challenges for stochastic epidemic models involving global transmission

Key questions for modelling COVID-19 exit strategies

Feedback Systems: An Introduction for Scientists and Engineers

How generation intervals shape the relationship between growth rates and reproductive numbers

Random Point Processes in Time and Space

A new approach to linear filtering and prediction problems

Smoothing algorithms for nonlinear finite-dimensional systems

Adaptive estimation for epidemic renewal and phylogenetic skyline models

An exact method for quantifying the reliability of end-of-epidemic declarations in real time

Identification in parametric models

Optimal point process filtering and estimation of the coalescent process

Exact bayesian inference for phylogenetic birth-death models

Point Process Analysis of Noise in Early Invertebrate Vision

Uncoupled analysis of stochastic reaction networks in fluctuating environments

Bayesian filtering: From Kalman filters to particle filters, and beyond

A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis

Mutual information and minimum mean-square error in Gaussian channels

Assessing the performance of real-time epidemic forecasts: A case study of Ebola in the western area region of Sierra Leone

A simple approach to measure transmissibility and forecast incidence

A review of epidemiological parameters from Ebola outbreaks to inform early public health decision-making

Modeling the growth and decline of pathogen effective population size provides insight into epidemic dynamics and drivers of antimicrobial resistance

Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories

Influenza in Maryland: preliminary statistics of certain localities

WHO coronavirus disease (COVID-19) dashboard

New Zealand eliminates COVID-19

Coronavirus pandemic (COVID-19)

Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand

This work is jointly funded under grant reference MR/R015600/1 by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement and is also part of the EDCTP2 programme supported by the European Union.