key: cord-0984032-2ufwx5jp
authors: Koyama, S.; Horie, T.; Shinomoto, S.
title: Estimating the time-varying reproduction number of COVID-19 with a state-space method
date: 2020-07-11
journal: nan
DOI: 10.1101/2020.07.09.20150219
sha: a3d5c4467fb420fd581a674295ffccd9486492db
doc_id: 984032
cord_uid: 2ufwx5jp

After slowing down the spread of the novel coronavirus COVID-19, many countries have started to relax their severe confinement measures in the face of critical damage to socioeconomic structures. At this point, it is desirable to monitor the degree to which political measures or social affairs have exerted influence on the spread of disease; however, tracing back individual transmission of infections whose incubation periods are long and highly variable seems to be difficult. Nevertheless, it may be possible to estimate the changes that may have occurred in the past, if we can suitably fit a proper model to daily event-occurrences. We have devised a state-space method for fitting the Hawkes process to a given dataset of daily confirmed cases. This method detects changes occurring in the spread of the contagion in each country. Furthermore, this method can assess the impact of social events in terms of the temporally varying reproduction number representing the average number of cases directly caused by a single infected case. This information might serve as a reference for the behavioral guidelines that should be adopted according to the varying risk of infection.

In response to the steep increase in infections and deaths due to the novel coronavirus COVID-19, almost every country has taken severe confinement measures, such as lockdowns and border closures. The interventions that enforce social distancing have succeeded in slowing down the contagion, but they have also caused severe damage to socioeconomic structures, including an economic crisis, mental health deterioration, and a lag in education. There is no universal principle of adopting a specific intervention, and accordingly, there was a wide variety of initial intervention strategies, which was later followed by liberalization and relaxed behaviors of individuals, across different countries.

Nevertheless, it is desirable to capture the degree to which individual political interventions or social affairs in each country have exerted influence on the spread of the disease. A fundamental metric representing the degree of the spread is the reproduction number, R, which is defined as the average number of cases directly caused by a single case [1, 2] . One can easily infer the current status solely by glancing at the daily new cases; R > 1 if the daily cases are expanding on average and R ≤ 1 otherwise. However, it is difficult to trace concrete processes by which infections are transmitted among individual persons, particularly in light of protecting private information. Thus, a statistical analytical method is needed to infer the underlying process from the available data, considering the possibility that they were obtained with partial observation and are accompanied by errors.

Mathematical epidemiological studies using ordinary differential equation models, such as the susceptibleinfectious-recovered (SIR) model, have contributed to our understanding of causal factor dynamics, and these were further used in suggesting control measures needed in given situations [3] [4] [5] . While the original study of Kermack and McKendrick in 1927 [6] considered integrodifferential equations taking into account the delay in the transmission of disease, the majority of later studies used ordinary differential equations in favor of analytical treatment. COVID-19 cases were also discussed using the susceptible-exposed-infectious-recovered (SEIR) model [7] . Though the ordinary differential equation models also assume the transmission delay, they cannot take account of the detailed distribution of delays, which are widely dispersed from 2 to 14 days [8] [9] [10] [11] . To evaluate how individual political measures have exerted influence on the transmission of disease, it is necessary to incorporate the delay distribution explicitly in the analysis, as has been done previously in an analysis using the semimechanistic Bayesian hierarchical model [12] .

Here, we construct a state-space method for estimating R that is changing in time, by fitting the Hawkes process [13] , which adopts the delay distribution explicitly. While the semi-mechanistic Bayesian hierarchical model [12] requires the change-points be assigned manually, our method automatically detects the change-points solely from a given series of the number of daily cases. We first apply the method to synthetic data obtained by the simulation to confirm that the method properly detects the change-points in the reproduction number. We then apply our method to the real data and examine if the detected changes were consistent with the times at which political measures had been taken in each country. Importantly, our method assesses the impact of individual political measures accurately in terms of the time-varying reproduction number.

We have developed a state-space method of estimating the temporally changing reproduction number from a given series of the number of daily new infections, by adopting the Hawkes process [13] as a basic model describing the transmission of disease.

A. The rate process on a daily basis The Hawkes process describes a self-excitation process in terms of the instantaneous occurrence rate λ(t) as

where µ is the spontaneous occurrence rate, and the second term represents a self-excitation effect such that the occurrence of an event adds the probability of future events (Fig. 1) . R is the reproduction number representing the average number of events induced by a single event, t k is the occurrence time of a past (kth) event, and ϕ(t) is a kernel representing the distribution of the transmission delays, satisfying the normalization

Schematic description of the Hawkes process Eq. (1). The occurrence rate λ(t) is increased according to past events occurred at times t = t k (k = 1, 2, 3, · · · ) with the transmission delays t − t k distributed with ϕ(t − t k ). R is the reproduction number that represents the average number of events induced by a single event.

We convert the original instantaneous rate Eq. (1) into the expected number of events on a daily basis:

where λ j is the expected number of events on jth day. The first term µ ′ on the right-hand-side referes to the expected number of spontaneous occurrences on a daily basis. The second term represents the self-excitation process, the manner in which ν i events that have occurred on a day i exerted influence with the delay of j − i days. Here, we assume that the reproduction number R may change and represent the daily dependence as R i . ϕ j−i represents a distribution of the transmission delays d = j − i, satisfying the normalization ∑ ∞ d=1 ϕ d = 1. Given the mean rate λ i or λ, the number of events ν i or ν would be derived from the Poisson distribution p(ν|λ) = λ ν ν! e −λ . However, real data are subject to erroneous observation and accordingly they tend to be overdispersed, or the sample variance exceeds the sample mean. Here, we incorporate over-dispersed data using the negative binomial distribution [14] :

where ρ (> 0) represents the degree of over-dispersion, or the variance is Var(ν) = (1 + ρ)λ. The Poisson distribution is in the limit of ρ → 0.

The COVID-19 model parameters were chosen as follows: the spontaneous occurrence of infection is absent, µ ′ = 0, because there is no spontaneous occurrence for COVID-19 except at the initial occurrence in China. The virus is transmitted between individuals during close contact, and each individual is determined to have an episode of infection. The transmission delay d = j − i is an interval between the day i that an individual was found to be infected and the day j that another individual who caught infection was found. The duration between symptom onsets of successive cases is referred to as the serial interval [15] ; it is considered to be slightly different from the incubation period [16, 17] . It is reported that the distribution of the serial intervals is suitably approximated with the log-normal distribution function of the mean 4.7 days and SD 2.9 days for COVID-19 [11] . We have adopted this distribution as the transmission delay kernel ϕ d . The distribution of transmission delays on a daily basis is given as the difference of a cumulative distribution function of the log-normal distribution,

where the parameters µ and σ are given in terms of the mean m = 4.7 and the SD s = 2.9 as µ = log(m 2 / √ s 2 + m 2 ) and σ = √ log(1 + s 2 /m 2 ).

To detect change-points in the reproduction number R i , we introduce a method of estimating stepwise dynamics [18] . We assume that system's state x i obeys the evolution

with the Cauchy random number ξ:

. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. . https://doi.org/10.1101/2020.07.09.20150219 doi: medRxiv preprint

We assume that the reproduction number is given as R i = f (x i ) with the non-negative function. Here we adopted a ramp function f (x) = max(0, x).

We constructed a state-space model for estimating the temporally changing reproduction number R i from a given dataset of daily confirmed cases {ν 1 , . . . , ν T }. The basic procedure of constructing the state-space method is similar to the one we developed for estimating exogenous and endogenous factors in a chain of point events [19] .

To put the model in the state-space form, we take the summation in Eq. (2) over the last L days,

and introduce a concatenated state vector,

so that the rate process (7) depends only on the current state X i . Accordingly, the state X i obeys the evolution

where

We have chosen L = 30 in the following analysis.

The posterior distribution of system's state X i , given a set of daily new cases until ith day Y i := {ν 1 , . . . , ν i } is obtained using Bayes' theorem as

Here, p(X i |Y i−1 ) may be obtained using a system model p(X i |X i−1 ) and the posterior distribution on day i − 1,

Starting from the initial distribution p(X 1 |Y 0 ), we iterate Eqs. (11) and (12) to compute p(X i |Y i−1 ) and p(X i |Y i ) for i = 1, 2, . . . , T .

Then, we compute the distribution of system's states {X 1 , . . . , X T }, given an entire set of occurrences Y T :=

(13) in reverse order as i = T − 1, T − 2, . . . , 1, using the distribution functions p(X i |Y i ) and p(X i |Y i−1 ), which were obtained with Eqs. (11) and (12).

We then take the median of the posterior distribution p(X i+1 |Y T ) for the estimate of the stateX i+1 . The estimate of the reproduction number,R i , is then given by the first element of f (X i+1 ). With the estimated reproduction number, we obtain the estimated total rate aŝ

We devised an algorithm that performs the integrations in Eqs. (11) , (12) , and (13) numerically using a sequential Monte Carlo method [20, 21] .

To avoid bias in estimating the state, which is caused by outliers in the data, we may discard the preassigned outliers and treat them as "missing observations" [21] , for which the posterior distribution of

A large variation has been observed by day of the week in the number of reported infections, as reported infections tend to be fewer on the weekend than on the weekdays. There might have been variations in the original infectious activity due to human behavior, but it is more likely that this variation was caused by the delay in confirming infections and compiling the results at the weekend. The variation by day of the week is commonly observed, but there are large differences between countries, presumably due to the cultural difference in weekly activities (Fig. 2) .

Before analyzing a sequence of daily cases of a given country, we process the data as follows; we first obtained the gross daily variation β i in a week by averaging over the entire infection record (from March 1 to the present), so that the average over a week is normalized as

Then, we convert the original data of daily infections {n 1 , . . . , n T } to an adjusted dataset {ν 1 , . . . , ν T } by

to which we apply the state-space method.

. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. 

We first evaluated the functionality of the state-space method by applying it to synthetic data, which was generated by numerically simulating the self-exciting process. After confirming that the estimation method was able to detect the change-points and estimate the amplitude of the reproduction number properly, we applied the method to real data of daily confirmed cases in several countries; then we further examined whether the detected changes were consistent with social events that occurred in each country.

We constructed simulations of the Hawkes process mimicking prototypical evolutions in several countries. We took µ ′ = 0, as there has been no spontaneous occurrence of COVID-19 except at the initial occurrence in China. We began with a few infections as initial seeds, mimicking those who introduced the disease into each country. With an initial reproduction number R > 1, the daily cases initially grew exponentially. To reproduce a variety of evolutions in the number of infected cases in different countries, we have evaluated several schedules of the reproduction number R i . Figure 3 depicts three prototypical cases: the rapid increase is followed by a slow decrease, as in Italy and France (type A); the rapid increase is followed by a rapid decrease, and then it started to increase slowly, as in Japan, and Australia (type B); and the increase is followed by a decrease, and then another large increase, as in Iran and Saudi Arabia (type C). For each type of timevarying reproduction number R i , the Hawkes process was simulated over an interval of length T = 120 days, in order to generate daily cases {ν 1 , . . . , ν T } (Fig. 3) . The parameter of the negative binomial distribution (3) was set to ρ = 50.

We were able to estimate the reproduction numberR i by applying the state-space method to a series of the number of daily new cases. We then tested different values for the hyperparameter γ as 10 −2 , 10 −3 , and 10 −4 , and confirmed that the free energy was lowest for the case of γ = 10 −3 . Accordingly, we fixed the hyperparameter as γ = 10 −3 throughout the following analysis.

To estimate R i from the synthetic data using the statespace method, we first estimated ρ using the equation

whereλ i = ∑ 3 j=−3 ν i+j /7 represents the mean daily cases averaged over a week. The parameter of the Cauchy distribution (6) was set to γ = 10 −3 . With these parameters, we performed the sequential Monte Carlo algorithm with 10 6 particles to compute the posterior distributions of the reproduction number for each day. Figure 3 depicts the median (solid line) and 95% range (shaded areas) of the posterior distributions. We see that the amplitude of the reproduction number is estimated properly. In particular,R i was able to detect change-points.

The number of daily new cases in various countries is made available on websites hosted by public research centers such as Our World in Data (https://ourworldindata.org/coronavirussource-data) and the Humanitarian Data Exchange (https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases). We used data from the former site in this analysis. For a sequence of daily cases {n 1 , . . . , n T } in each country, we obtained the gross daily variation β t in a week and adjusted the real data using Eq. (16) to {ν 1 , . . . , ν T }. The parameter of the negative binomial . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. . https://doi.org/10.1101/2020.07.09.20150219 doi: medRxiv preprint distribution ρ was estimated using Eq. (17) . The parameter of the Cauchy distribution (6) was set to γ = 10 −3 , as in the analysis of synthetic data. With these parameters, the reproduction number was estimated using the sequential Monte Carlo method with 10 6 particles. Figure 4 depicts the daily new cases in several countries and the reproduction numberR i estimated with our state-space method:

A rapid increase in new cases was followed by a slow decrease. The estimated reproduction number wasR > 1 at the outset and dropped toR < 1.

It is interesting to note that the drop in the reproduction number is temporally similar to the time at which political measures, such as lockdown and border closure, were enforced.

The number of cases in countries of Asia and Oceania is found to be relatively small compared to those in Europe. An increase in the number of cases was followed by a rapid decrease, and then by a second slow increase. Accordingly, the reproduction number exhibited a drop fromR > 1 tô R < 1, and then it increased slightly.

The number of new cases in these countries alternately moved up and down, and so accordingly, the estimated reproduction numberR repeatedly crossed the value unity. It can be observed that Ramadan has promoted increased reproduction number, as it may have facilitated human contact.

A rapid increase of new cases is followed by a very slow decrease, and then another growth. The estimated reproduction numberR was higher than unity at the beginning, dropped off to near unity due to the confinement measures taken, but then it exceeded unity again. The political measures taken were found to vary by state, making it difficult to interpret the data from this country as a whole in our model.

The number of new cases keeps growing. The estimated reproduction numberR has remained greater than unity.

There have been debates about why infection rate and mortality rate change by orders of magnitude across different countries. Though these numbers likely reflect the confinement measures taken in individual countries, there might also have been differences across nations in susceptibility to COVID-19, reflecting not only genetic resistance but also lifestyle and cultural differences, such as shaking hands or hugging.

Because most governments did not implement serious confinement measures at the initial phase, the initial exponential increase of infections might reflect the natural susceptibility of citizens of each country. We realized that the estimated reproduction number was stable in a certain period before each country took confinement measures such as a lockdown or social distancing. Figure 5a depicts the reproduction numbers estimated with our state-space method for 10 days until the day before the confinement measures of each country. The initial variation in the numbers of daily new cases is depicted in Fig. 5b , indicating that the estimated reproduction number is correlated to the slope in the log plot. Here we have selected the period shifted by 5 days by taking account of the typical transmission delays. We can observe that countries in different regions tend to cluster, indicating that the susceptibility tended to be similar between nations in the same region.

Society and the media currently alternate between hope and despair in response to the temporary decrease or increase of daily new COVID-19 infections, which came out after a long latency period. To make an objective assessment of the current status, we have developed a state-space method for estimating the control status, in particular quantifying the time-dependent reproduction number R.

Using our method, it is also possible to predict the number of new cases in the future. This can be done by simulating the converted Hawkes process Eq. (2) with the negative binomial distribution Eq. (3), whose parameter ρ is chosen at the value estimated from the data using Eq. (17) . One may adopt the reproduction number R i as constant at the value of an endpoint of estimation if the current conditions are assumed to be maintained. Alternatively, one may also examine various time schedules of R i , by assuming possible relaxation or confinement.

We adopted the Hawkes process, or the self-exciting model, in describing the variables underlying the transmission of disease. In contrast to ordinary differential equation models such as SIR or SEIR models, the Hawkes process is advantageous in that it explicitly specifies the distribution of transmission delays. However, the Hawkes . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 11, 2020. process does not account for the finite size effect, in which infected and recovered people represent a finite fraction among the entire population. There have been some models that incorporate the finite population size effect into the Hawkes process, as has been done with the SIR or SEIR models [22, 23] . To analyze the current status of COVID-19, however, we do not take the finite size effect into account, as the fraction of the recovered or removed people is still less than a few % of the entire population.

While preparing this manuscript, the authors discovered another interesting approach that applied the Hawkes process for modeling COVID-19 transmission [24] . In that study, the authors combined the Hawkes process with spatial and temporal covariates, such as demographic features and Google mobility indices, to explain the variability of the reproduction number, and to forecast future cases and deaths in the USA. By contrast, our state-space method estimates the change-points in the reproduction number solely from a series of daily cases. Though these methods apply the Hawkes process differently, both methods are seen to demonstrate the effectiveness of this model in analyzing epidemics.

We introduced the Cauchy distribution, Eq. (6), into our analysis, assuming the stepwise changes in the reproduction number R i . Accordingly, we were able to infer change-points from the posterior distribution taking on stepwise characteristics. It should be noted that, however, there may be an additional latency between the times at which political measures were conducted and the changes in the reproduction number, which may reflect the change in behavior. This delay may also be countryspecific. Therefore, it could be interesting to investigate the delay in the change-points in the reproduction number following social events.

When inferring the transmission of disease from daily confirmed cases, we have considered potentially erroneous observations made in the real data. We took into account counting errors by assuming a negative binomial distribution that represents the over-dispersion. We also took into account the variation by day of the week and adjusted the data by compensating for the periodic dependency. Note that there may still be an underestimation of infection numbers, as asymptomatic cases may have been overlooked. Though this is unavoidable unless the inspection is enforced, it is reported that the infections caused by asymptomatic people are relatively small (about 6%) for COVID-19 [16] .

We have assumed that the transmission delay is a serial interval defined as the duration between symptom onsets of successive cases and adopted the log-normal distribution with the mean 4.7 days and SD 2.9 days, as suggested in reference [11] . As our mathematical formulation is more general, it is possible to search for a more suitable transmission kernel ϕ d without relying on such external knowledge, if the numbers of daily cases are accurately provided.

The most crucial assumption in the majority of mathe-. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 11, 2020. . https://doi.org/10.1101/2020.07.09.20150219 doi: medRxiv preprint matical model studies, including this study, is the meanfield assumption, in which all individuals are assumed to interact uniformly. Though difficult to incorporate, it is desirable to consider the heterogeneity of the real-world community in analyzing the communicability of disease.

Despite these assumptions, our state-space method may be of worth in assessing the status of the disease systematically, based on reported daily confirmed cases. This method might serve as a reference for governments adopting variable regulations that should be changed according to current infection circumstances.

The application program and example datasets are available at our site hosted publicly on GitHub, accessible via https://github.com/shigerushinomoto.

Infectious Disease Modelling

Introduction to Time Series Modeling

Proceedings of the 2018 World Wide Web Conference

We thank Shin Takagi, Masaki Ogura, Ryota Kobayashi, Takaaki Aoki, and Hideaki Shimazaki for their constructive comments on this manuscript. S.S. is supported by the New Energy and Industrial Technology Development Organization (NEDO).