key: cord-0779857-7m8x6hl0
authors: Fernández-Fontelo, Amanda; Moriña, David; Cabaña, Alejandra; Arratia, Argimiro; Puig, Pere
title: Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case
date: 2020-12-03
journal: PLoS One
DOI: 10.1371/journal.pone.0242956
sha: 19159ce4c1d1b03a0a3326a6ec981625aadc97b9
doc_id: 779857
cord_uid: 7m8x6hl0

The present paper introduces a new model used to study and analyse the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) epidemic-reported-data from Spain. This is a Hidden Markov Model whose hidden layer is a regeneration process with Poisson immigration, Po-INAR(1), together with a mechanism that allows the estimation of the under-reporting in non-stationary count time series. A novelty of the model is that the expectation of the unobserved process’s innovations is a time-dependent function defined in such a way that information about the spread of an epidemic, as modelled through a Susceptible-Infectious-Removed dynamical system, is incorporated into the model. In addition, the parameter controlling the intensity of the under-reporting is also made to vary with time to adjust to possible seasonality or trend in the data. Maximum likelihood methods are used to estimate the parameters of the model.

A major difficulty in the fight against the pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) is the large number of people who become infected and experience a mild form of the disease but can pass it on to others [1, 2] . The lack of tests to carry out large-scale diagnoses and the different protocols regarding testing policies add an extra source of uncertainty about the true number of infected individuals. This causes the number of cases reported by the authorities that serve as a basis for public health policies to severely underestimate the actual number of cases in the population [3] .

The problem of under-reporting affects data quality and therefore contributes to misrepresent results and conclusions, as reported observations do not reflect the total amount of cases of interest but only a fraction of them. Any measure related to the evolution or impact of the epidemic (e.g., lethality rates, basic and effective reproduction numbers, and others) will be distorted. This problem is not exclusive of epidemics but pervades in most areas of public a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 health, economics, and society, among others. During the past years, several authors have studied this phenomenon in different applications. Among these authors, we can highlight [4] who studied the under-reporting in worker absenteeism through Markov chain Monte Carlo analysis, [5] who considered a Bayesian approach to estimate the number of committed crimes in Málaga in 1993 and Stockholm between 1980 and 1986, [6] who described the under-reported in work-related skin diseases in Norway from 2000 to 2013, or [7] who studied under-reporting in tuberculosis in Brazil. Global percentages of under-reporting during a given period of time can be estimated, for instance, with stochastic Susceptible-Exposed-Infectious-Recovered (SEIR) models, including unobservable compartments of non-ascertained individuals. In order to estimate the parameters in such models, it is necessary to have data on individual evolution of the epidemic, that is, for each individual, the date of contagion or appearance of first symptoms, number of days in quarantine, hospital, or similar information is required, regardless of the estimation methodology. There are many examples of this situation. For instance, [8] who estimates SARS-CoV2 in Wuhan via MCMC, [9] who uses least squares estimator for SARS-CoV2 in Uruguay, or [10] who employs a simpler version of an SEIR model, called Susceptible-Infected-Recovered (SIR) model, to understand the relationship between the observed and unobserved cases of the Hong Kong seasonal influenza epidemic in New York between 1968 and 1969. Although the previous works proposed new methods to describe, identify or estimate under-reporting of data, none of them, to our knowledge, tried to model the underreporting in integer-valued time series data.

In [11] it is proposed a simple model for integer-valued time series data that estimates the under-reporting of the human papillomavirus infections in the province of Girona from 2010 to 2014, the number of deaths attributable to a rare, aggressive tumour (pleural and peritoneal mesotheliomas) in Great Britain from 1968 to 2013, and the number of botulism cases in Canada from 1970 to 2013. The model mentioned above was extended by considering a more complex correlation structure among the time series observations in [12] , where the authors studied the number of real cases of gender violence in Galicia from 2007 and 2017. The adequacy of the models in [11, 12] was assessed through simulations of different scenarios that were well recovered by the estimation procedure, and as for real data, the results coincide with the expert's opinion.

Our original motivation for this work was to study daily reported cases of SARS-CoV2 in different areas of Spain. The protocol for testing as of February 2020 only included clinically suspicious patients who recently arrived from China [13] . The protocol experienced changes in the succeeding weeks, and by May 2020, the norm became the polymerase chain reaction (PCR) or molecular tests performed to individuals with a broader collection of symptoms and contacts of confirmed patients [14] . This scenario suggests a hidden process that governs the evolution of the daily number of infected individuals, and an observed process that reflects only part of it. Moreover, the proportion of unobserved cases varies in time, due at least to the changes in testing protocols. On the other hand, it is reasonable to assume that the underlying process is non-stationary since the evolution of the epidemic of SARS-CoV2 has been observed to evolve initially drawing a mild logarithmic curve followed by an outbreak with exponential growth, which later slows down and also declines exponentially, with varying growth-decay rates which depend much on the application and effectiveness of public health prevention measures.

In light of all the above considerations, we propose here a new extension of the model in [11] , which deals with the non-stationary behaviour of the hidden process and estimates the under-reporting in epidemics such as the SARS-CoV2 and potential outbreaks. The unobserved process is modeled with an INAR(1) structure, assuming that for each case counted during day n, a new case appears in day n + 1 with a fixed (yet unknown) probability α, and to these, a random number of other counts are added (innovations). We shall assume that these innovations are independent of the past and Poisson distributed. The mean of the innovations will be modelled as the difference of the affected individuals from day n to day n + 1, found through the solution of a SIR (Susceptible, Infectious, and Removed) compartmental model, thus taking into account the spread of the epidemic. We reconstruct the most plausible count for each time and propose different forecasting methods. The latter allows us to estimate more precisely measures such as the lethality rate and provide more accurate predictions for applying more realistic control and prevention measures. The model is applied to the time series of the number of new daily SARS-CoV2 cases confirmed by PCR in different regions with different characteristics and climate conditions in Spain. Despite being especially useful to model and estimate the under-reporting in small areas with low counts, the application also shows that our model can be used in larger areas that can be split into smaller regions following geographic or sanitary criteria (e.g., dividing large areas into smaller sanitary areas).

Our approach to the modelling of the non-reported daily counts in the SARS-CoV2 cases series is an extension of the model introduced in [11] , that we briefly discuss here.

Consider that the true (unobserved) counts come from a process X n , n 2 N, defined with an integer-valued autoregressive model of order 1 (INAR(1)):

where 0 < α < 1 is a fixed parameter, and W n are the innovations, distributed according to a discrete probability law, independent of X n . The operator˚is the binomial thinning or subsampling operator defined by:

where {B j } is a sequence of independent and identically distributed Bernoulli random variables with parameter α, denoted as Bern(α). Note that [α˚X n−1 |X n−1 = x n−1 ] * Binomial(x n−1 , α). The model in (1) can be seen as an homogeneous Markov chain with transition probabilities given by:

where, in the case of the so-called INAR(1) process with Poisson innovations,

The standard interpretation of an INAR(1) model is that a proportion α of the individuals at time t "survive" and are part of the population at time t + 1. However, this interpretation is misleading in our context. The observations at time t + 1 are all new individuals; some correspond to the binomial thinning and the others to the independent innovations. It is known that for many applications for where INAR(1) models can be applied, this meaningful interpretation is not possible. However, the thinning is needed for modelling the autocorrelation of time series. For instance, this is the situation for the example of meningococcal infection analysed in [15] .

More details on the INAR(1) model and several extensions can be found in [16] [17] [18] [19] [20] or in [21] [22] [23] [24] where INAR models based on generalisations of the binomial thinning operators (e.g., expectation thinning operators) are defined. Now consider a very simple mechanism that can lead to an observable and potentially under-reported process Y n :

That is, for each n, we observe X n with probability 1 − ω, and a q-thinning (as defined in Eq (2)) of X n with probability ω, independent of the past {X j : j < n}. Therefore, what we observe (the reported counts) is Y n ¼ ð1 À 1 n ÞX n þ 1 n P X n j¼1 x j where 1 n * Bern(ω) and ξ j * Bern (q).

In the next sections, we will generalise this process to allow for non-time-homogeneous processes, by modelling the mean of the innovations in (1), as well as the under-reporting parameter q in (4), as functions of time.

Parameter estimation. The parameters of the model can be estimated using different strategies. In [11] , the authors proposed a moments-based method and a likelihood-based method. Since the first method is only appropriate when the series is stationary, we will focus on the second method of estimation based on the likelihood function.

The model described in (1) and (4) is a Hidden Markov Model (HMM) with an infinite number of states [25, 26] , and hence, the maximum likelihood estimators of the parameters involve intensive numerical computations. For a given n, the possible values of the series X n must be equal to or greater than the observed value of Y n , which implies a wide range of possible trajectories. Given the observed series, there are a countable number of potential sequences that can lead to it, and therefore the likelihood function cannot be computed directly. A way to circumvent this problem consists of using the forward algorithm [25, 26] . This recursive algorithm is linear in n; it is based on the forward probabilities of the Markov Chain that can be computed in terms of the transition and emission probabilities. These forward probabilities are defined by

Thus, the likelihood function of the model can be computed as

. . . ; Y n ¼ y n Þ ¼ P 1 X n ¼y n g n ðy 1:n ; x n Þ. In this case, the transition probabilities, P(X k = x k |X k−1 = x k−1 ) are given by the Eq (3). That is, the transition probabilities are defined by the conditional probability mass function of the INAR(1) model. On the other hand, the emission probabilities are defined by:

Notice that, in practice, an upper threshold has to be defined in the sum that computes the likelihood function. In this application, this threshold is fixed as 1.5 times the maximum value of the series.

The reconstruction of the most likely latent sequence is a key point in the current analysis since it gives us a picture of how the unobserved process behaves. To do so, the Viterbi algorithm [27] is used, which consists of finding the sequence X � that maximises the likelihood function of X n given the observed process Y n and a known vector of parameters. That is, X � ¼ argmax XP ðX 1:n jY 1:n Þ ¼P ðX 1:n jY 1:n Þ PðY 1:n Þ . However, since the denominatorPðY 1:n Þ, does not depend on X n , it suffices to maximise the joint probabilityPðX 1:n jY 1:n Þ.

The key interest of researchers when dealing with an epidemic such as the current SARS-CoV2 is to estimate the propagation of the disease and predict its possible end date to apply appropriate measures of control and prevention [28] . The literature offers different approaches to deal with so, as the so-called SIR and SEIR compartmental models. These models have extensively used for study influenza's epidemic evolution as [29] who use an SEIR model to evaluate vaccine policies effects on England and Wales's influenza epidemics, [30] who employs SEIR model to study seasonal influenza evolution in England by linking the prior seasonal information to the immunity in the following period in order to ensure non-independence between the successive influenza seasons, or [31] who presents a set of different research works aimed at modelling the influenza epidemics.

We shall link the expectation of the innovations in (1) to the daily number of individuals affected by the disease. For that purpose, we will study a simplified version of the SIR model. This model belongs to the class of compartmental models, and a system of ordinary differential equations governs its behaviour. Consider three classes of individuals at each time t 2 R: those who are healthy but susceptible to get the disease (S(t)), those who are infected and thus transmitters of the disease (I(t)), and those individuals who have been removed from the system and will not get infected again (R(t)) [32, 33] . The SIR model describes the dynamics of the spread of the virus and it is formally defined by a system of differential equations given in S1 Appendix.

The parameters of interest are β, γ, and N, which are the infection rate, the removal rate, and the total susceptible population, respectively. For each t, the following condition is fulfilled: S(t) + I(t) + R(t) = N. Usually, the initial conditions are set to R(0) = 0, I(0) = I 0 and S(0) = N − I 0 . Although this model sensibly represents certain epidemics' evolution, it is hard to fit into real-world data due to the sensitiveness to slight changes in both the parameters' values and the initial conditions.

Consider now the number of affected individuals

In S1 Appendix, we show that the number of individuals affected by the disease can be fairly represented by:

where k = β − γ and M � ¼ NðbÀ gÞ bÀ g=2 . Recall that β is the infection rate, γ is the recovery rate and N the total susceptible population.

The solution given in (7) allows to take into account the information on the spread of the epidemics in the model given by (1) and (4), by considering that the expectation of the innovations, instead of being constant, that is, λ, it will be a function of time such that λ t = new(t) = A (t) − A(t − 1), where new(t) are the new affected cases at time t. It can be seen that the Eq (7) behaves as an exponential function close to the origin, that is, A(t)�A 0 e kt when t � 0. Therefore, the new affected cases grows exponentially at the beginning, that is,

In addition, the function A(t) tends to M � as t tends to 1. The maximum value of A(t), that is, A(1) can be obtained by numerically solving the following equation (see S1 Appendix for details):

Eq (8) can be especially useful in the reconstruction of the SIR process by recovering the parameters β, γ, and N once the under-reporting model is estimated.

Taking into account this SIR representation of the expected value of the innovations λ t in the latent process model, a more realistic description of the model for the SARS-CoV2 data will be derived, which will allow estimating the characteristics of the under-reporting in such data and the spread of the epidemic jointly.

The model described in Eqs (1) and (4) is useful for detecting and quantifying the underreporting at a local scale because of the likelihood computations work well with relatively small counts. It is also meant to model weakly stationary processes, i.e., with expectation, variance, and auto-covariances not varying in time.

However, many real-world time series data are non-stationary as they may be governed by trends or volatilities and may have different seasonal and cyclic patterns. For example, the series of daily new SARS-CoV2 cases analysed in the present work show intricate trend patterns. The observed series in the cases we analyse present a seasonal component due to the "weekend effect". That is, the number of cases reported during certain days of the week decreases, and thus the official records show fewer cases periodically. This behaviour repeats weekly. It is a problem attributable to the reporting process and not to the nature of the underlying phenomenon. The model in Eqs (1) and (4) has to be modified in order to take this particularities into account.

We proceed now to incorporate the information on the evolution of the SARS-CoV2 epidemic into the model in Eqs (1) and (4), and thus to fit the trend displayed in Figs 1, 3, and 4 appropriately.

Our analysis is based on the daily number of new cases of SARS-CoV2. For each time n, the new counts can be expressed in terms of the affected number of individuals introduced in the previous section. That is, for each n, the number of new individuals can be defined in terms of the number of affected individuals, as follows: (7). This information can be appropriately incorporated into the model to accommodate the trend present in the data using the information on the propagation of the epidemic provided by the data themselves.

A sensible way to do this is by considering that the expectation of the innovations of the latent process X n in expression (1) varies with the number of new cases at each time n, and thus that the model in (1) and (4) is not stationary anymore. Specifically, the innovations of X n will have Poisson distribution with λ n = new(n) = A(n) − A(n − 1), where A(n) is given by (7) . Therefore, the unobserved process X n becomes:

where the value of parameter α is still fixed between (0, 1), but now the parameter of the (7), assuming that A 0 = 1. The value of A 0 is known, representing the number of affected people at the starting time. However, if this value also needs to be estimated, it should thus be kept in the expression (7) as a parameter. The model (4) defines the under-reported as an independent process in the sense that the state of under-reporting at time n is not affected by the same state at time n − 1. However, Fernández-Fontelo et.at. in [12] introduced a version of the under-reporting scheme according to 

Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case a two-state Markov chain. Although this approach of the under-reporting process could have been considered in the current model, the resulting model would be significantly more complicated than the one presented here from a computational perspective. In addition, the under-reporting process in (4) could also be considered stationary since it remains constant throughout the study (e.g., the under-reporting does not vary with time).

In the present work, however, the under-reporting process is flexible in that we do not restrict the under-reporting parameters to be constant over time; they can vary throughout the study if needed. To do so, both under-reporting parameters ω and q can be made to vary with time, that is, ω n = f(n, Γ) and q n = f(n, Δ), where Γ and Δ are vectors of parameters. In our current model, just the parameter q is considered time-dependent, and not both parameters to reduce computational issues. The latter means that, if both parameters ω and q are considered time-dependent, the resulting model is more complex and often shows convergence problems.

Particularly, the intensity of the under-reporting (i.e., q) is adjusted by the following logistic function:

Hence, we ensure that q n 2 (0, 1). In expression (10), γ 1 indicates whether q increases or decreases over time, while γ 2 and γ 3 indicate whether the series has a seasonal pattern with period p = 7 (weekly). Notice that if γ 1 = γ 2 = γ 3 = 0, then the previous logistic function becomes constant and thus q n = q, resulting in the model (4) . Hence, considering this function for the intensity of the under-reporting, the under-reporting process in the present model is defined by:

The parameters of the new model defined in (9) and (11) can be estimated using the forward algorithm through the forward probabilities defined in (5) . In addition, the Viterbi algorithm introduced before can also be used to reconstruct the most likely latent sequences.

Model forecasting. Another interesting point of the current analysis is the prediction of new daily cases of SARS-CoV2. These predictions can be used as a tool for foreseeing potential future outbreaks of the disease and, therefore, helping to implement earlier measures to lessen the impact of outbreaks. We propose two different predictors.

The most straightforward way to predict the values of Y n+1 , Y n+2 , . . . given the sample values Y 1 , Y 2 , � � �, Y n is by considering their average point predictions, that is, E(Y n+1 ), E(Y n+2 ). . .. In particular, since the model (1) is an auto-regressive model of order 1, these average point predictions are expressed in terms of the last observed value, that is, Y n .

According to the properties of the binomial thinning operator, we have that E(Y n+k ) = E(X n+k )(1 − ω(1 − q n )), where EðX nþk Þ ¼ l nþk 1À a and λ n+k = A(n + k) − A(n + k − 1). On the other hand, it is easy to see that EðX nþk Þ ¼ a k EðX n Þ þ P k i¼1 a kÀ i l nþi . Hence, if we have estimates for the corresponding parameters at a given time n + k, the average point prediction of Y n+k can be computed as follows:

where φ nþk ¼ 1À oð1À q nþk Þ 1À oð1À q n Þ . See S2 Appendix for more details on the computations.

The standard errors of these predictions (12) can be estimated using the Delta method. Briefly, the estimated variance of the predictionÊðY nþk Þ, that is, We can also predict an individual value of Y n+k based on its conditional distribution given the last value of the latent process X n . This distribution is (see S3 Appendix):

The distribution (13) is a mixture of two components that are sums of a Binomial distribution and a Poisson distribution. To compute the corresponding probabilities for each component, a direct modification of expression (3) can be used. Finally, if P 1 (Y n+k = j|X n = x n ) is the probability of Y n+k = j in the first component of the mixture (13) , and P 2 (Y n+k = j|X n = x n ) the same probability in the second component, the probability that P(Y n+k = j|X n = x n ) of the mixture (13) 

Given the distribution (13) , and replacing the parameters by the maximum likelihood estimates, we can also estimate regions of prediction of size 1 − α � finding the lower and upper limits r 1 and r 2 that satisfy:

The current application is based on the official daily number of confirmed SARS-CoV2 cases in different areas of Spain. In particular, it shows that the model presented before can be used to identify and quantify the under-reporting in small regions of Spain as well as in larger areas that can be officially and hierarchically divided into smaller regions (e.g., areas that can be divided into provinces or sanitary regions). That is, the model is ideal for quantifying the under-reporting issue locally and brings a solution to study that phenomenon in larger areas by aggregating the information in their smaller regions. Also, at the same time the underreporting is estimated, the model accommodates the spread of the pandemic and provides this information through the parameters M � and k.

Three different small areas from Spain in the North (Cantabria), South (Islas Canarias), and Mediterranean coast (Islas Baleares) have been selected. The data from these areas consist of the number of confirmed cases by PCR tests. The day of confirmation coincides with the actual day the patient manifests symptomatology. See "Availability of data and codes" section for data availability. All time series range in the period from March 5 to May 20, 2020. The time series corresponding to Cantabria takes values ranging from 0 to 161 cases per day, with a mean of 36 and 2788 positive PCR cases. The number of deaths is 209, 36 deaths per 100000 inhabitants since the beginning of the pandemic, set on February 20, 2020.

The time series for Islas Canarias takes values ranging from 0 to 147 positive PCR cases per day and with an average of 30 roughly and a total of 2299 positive cases by PCR. A total of 155 people died, which means seven deaths per 100000 inhabitants since the beginning of the pandemic.

The time series for Islas Baleares has values ranging from 0 to 107, with an average of 28 and 2125 positive cases by PCR. Two hundred twenty-one deaths are registered in this area since the beginning of the pandemic. This implies a total of 19 deaths per 100000 inhabitants. Fig 1 shows the evolution over time of the new daily positive cases by PCR in Cantabria (top), Islas Canarias (middle), and Islas Baleares (bottom). The graph shows that these time series are governed by a trend that increases to a maximum peak (the peak of the pandemic) and decreases. Therefore, it is evident that the time series are non-stationary. Additionally, the series shows periodic peaks that coincide with the "weekend effect" previously described. Table 1 shows the maximum likelihood estimates (MLE) of the model defined in (9) and (11) . For Cantabria, the overall frequency of under-reporting, that is, ω is estimated asô ¼ 0:8814 and the intensity, that follows the function (10), isq n ¼ On 20 May, 209, 155, and 221 people died due to SARS-CoV2 in Cantabria, Islas Canarias, and Islas Baleares, respectively. As expected, the lethality rates computed using the observed and reconstructed number of confirmed cases by PCR differ. While these rates are 7.5%, 6.7%, and 10.4% in Cantabria, Islas Canarias, and Islas Baleares, the reconstructed rates significantly decrease to 3.4%, 4.6%, and 5.4%.

Results in Table 1 also allow reconstructing the SIR model, and therefore estimating the parameters β, γ, and N, also using the number of affected people A � when the curve A(t) grows fastest (see S1 Appendix).

Although the SIR model's exact solution can be derived, in our model, an approximate solution to the SIR model has been considered the logistic function A(t) to make the model computationally less expensive. Because our SIR estimation relies on an approximated solution, in practice, in some cases, the reconstruction of the parameters β, γ, and N is not possible since the equation (S1.11) has no proper solution.

In the case of Cantabria, Islas Canarias and Islas Baleares, a proper solution for (S1.11) has been found for the three regions. In particular, for Cantabria, considering the estimated parametersâ ¼ 0:9653,M � ¼ 237:99,k ¼ 0:3344 and observing that the fastest growth of A (t) occurs at A � = 1631.9, we obtain A 1 � 6858.5, A 0 ¼ 1=ð1 ÀâÞ ¼ 28:8 and, solving numerically (S1.11),N ¼ 427418:7. Then, plugging the value ofN in (8) and using thatb Àĝ ¼ 0:3344 we findĝ ¼ 85:57 andb ¼ 85:90.

Acting similarly for the other two regions, for Islas CanariasN ¼ 50629:4,ĝ ¼ 10:07 and b ¼ 10:40. For Islas Baleares,N ¼ 79584:6,ĝ ¼ 13:05 andb ¼ 13:34 Fig 2 shows the forecasting results based on the average point prediction and the k-ahead forecasting distribution (e.g., percentiles 2.5%, 50%, and 97%) for the areas mentioned above using a dynamic and static approach. The dynamic method is usually used to evaluate the models' predictive capability and consists of splitting down the time series into the training and testing time series sets of sizes n − k and k. The method starts training the model over the n − k observation in the training set. The prediction and the 95% confidence levels for the observation n − k + 1 are computed through the trained model and compared to the true observation n − k + 1. After that, the training set is updated by including the first n − k + 1, the model is re-fitted over the new training set, and a new prediction for the observation n − k + 2 is computed. This recursive process is repeated until the last prediction for the n observation is computed over the training set containing the n − 1 first observations. The static method is usually used to predict unknown future values. The idea consists of using the last X n value to predict both the average point prediction at time t + k and the k-ahead forecasting distribution. 

Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case

In this second example, the number of daily SARS-CoV2 cases confirmed by PCR is studied in two large areas of Spain by splitting these areas into smaller hierarchical regions (e.g., areas divided into smaller areas according to geographical or sanitary reasons). In these cases, as before, the day of confirmation coincides with the actual day the patient had symptoms. Recall that the model introduced here is especially useful for studying the evolution of the SARS-CoV2 cases and identifying and quantifying its under-reported in small areas. However, in this example, we will show that larger areas can also be studied with these models if we have a way to split these vast areas.

The autonomous community of Galicia is divided into seven different sanitary areas. These areas' data consist of the number of daily new cases of SARS-CoV2 confirmed by PCR from 12 March to 27 April. For the Galician data, the series had to be cut on 28 April since the region's health system changed the definition of new cases from 28 April onwards. Overall the autonomous community, the minimum and the maximum number of new daily cases confirmed by PCR range from 0 to 185, with 21 cases per day on average. A total of 6974 cases are registered in Galicia as confirmed cases by PCR on 28 April. See "Availability of data and codes" section for data availability.

The autonomous community of Andalucía is divided into seven provinces. In this case, each province's time series is the number of new daily SARS-CoV2 cases confirmed by PCR from 5 March to 20 May. Overall Andalucía, the minimum and the maximum number of daily cases confirmed by PCR range from 0 to 185, with 20.4 cases per day on average. A total of 12591 cases are officially registered in Andalucía as confirmed cases by PCR on 20 May. See "Availability of data and codes" section for data availability. Table 2 shows the maximum likelihood estimates of the model in (9) and (11) with the same characteristics as those in the registered cases; that is, only 65.6% are registered in this autonomous community. As before, the lethality rate strongly differs depending on whether the denominator is considered as the observed or reconstructed total of cases over the specified period. In particular, overall Andalucia, 1371 people died over the specified period, which implies lethality rates of 10.9% or 7.1% if the number of total cases corresponds to the officially registered or reconstructed, respectively. The mortality rate overall Andalucía is estimated as 16 deaths per 100000 inhabitants. transmission, combined with the lack of tests that impede carrying out large-scale screenings [1] . However, the quick and efficient identification of those people is vital to earlier control potential trends of infection and evaluate the pandemic's impact (e.g., to get unbiased estimators on the spread and impact of the epidemic).

Since the epidemic showed up last December in China, the concern with the under-reporting in official data as the number of infections and deaths worldwide has been on everyone's lips, including the media [34] [35] [36] .

With this in mind, this paper aims to introduce an extension of the model proposed in [11] to estimate the magnitude of the under-reporting in epidemics such as the current SARS--CoV2. That article presents a model that considers two processes: an underlying process (true process) that we do not observe, and the observed and potentially under-reported process that provides a proportion of what happens (a proportion of the true process). The model's particularity is that the underlying process assumes an INAR(1) structure, and hence a particular correlation structure. The model measures the under-reporting with two parameters that estimate the number of times the process is under-reported (ω), and the overall distance between the most likely sequence of latent states and the observed sequence (q). However, the model in [11] is intended to fit a stationary time series, which is not the case of the SARS-CoV2 data. Therefore, to adapt the model above to the SARS-CoV2 case, the underlying process's expected value is allowed to be time-dependent through an approximated solution of the SIR differential equations that depend on the new affected individuals at each time. The new version of the model also allows considering time-varying under-reporting, which, in the SARS-CoV2 case, may sometimes be more realistic (e.g., the intensity of the under-reporting may decrease if the number of large-scale screenings increases). Thus, the resulting model measures the underreporting and adapts the pandemic's evolution based on the SIR model at the same time.

This paper's results confirm that the under-reporting is effectively present in SARS-CoV2 data from various regions in Spain conditioned to different management, policies, and climate conditions. Results also show that the model has powerful predictive characteristics exhibited in Fig 2 and that the SIR parameters N, β, and γ can be relatively quickly recovered from the results in Tables 1 and 2 . As expected, the under-reporting from almost all regions is not constant over time but varies with times showing a decreasing trend (ĝ 1 ) and a seasonal pattern with seven days of periodicity (ĝ 2 andĝ 3 ). A decreasing trend means that the q parameter tends to 0 as time increases, and therefore the intensity of the under-reporting phenomenon is stronger as time passes. This result is surprising since it was expected a less intense underreporting process as time increases. However, the changes in protocols, data collection strategies, among others, could have affected the evolution of this under-reporting process. On the other hand, the seven-days periodicity is explained by the "weekend effect" that produces a systematic decrease in the number of new cases during the weekends.

It has been shown that the coverage percentages vary from 33.7% (Ourense in Galicia) to 71.8% (Málaga in Andalucía) and that the estimated lethality rates decrease significantly when the number of reconstructed cases, rather than the number based only on officially reported cases, are considered as the total number of affected individuals. In particular, lethality rates with official cases range from 5.8% to 10.9% and decrease to 2.5% and 7.1% with the reconstructed cases. The example with the lethality rates reveals the under-reporting influence on the parameters' estimates, which are often used to make decisions. Thus, the importance of having appropriate methods to identify, control, and estimate under-reporting.

Besides the under-reporting quantification, the model allows estimating the SIR model under the underlying process. In particular, for Cantabria, Islas Canarias, and Islas Baleares is estimated that the number of susceptible people (N ) is 427418.7, 50629.4, and 79584.6, while the infection and removal rates (scaled byN ) are 0.0201% and 0.0200%, 0.0205% and 0.0199%, and 0.0168% and 0.0164%, respectively. The latter means the basic reproduction numbers (b=ĝ) are slightly over 1, which means that the virus, although nearly to be controlled, is not under control yet. For the second example, which is mainly intended to show how the model can be used for large areas with large counts, similar numbers have been obtained forb andĝ.

One of the current work's main limitations is based on the available data since the model can only estimate the unobserved counts that have similar characteristics to the observed data. Since our data do not contain asymptomatic, our estimated reconstructions in Figs 1, 3 and 4 do not contain those asymptomatic and only people with the same characteristics than the observed counts who have not been officially registered. To also include asymptomatic in the reconstruction of the underlying process, random tests should be done to include cases that have passed the infection asymptomatically. However, to the authors' knowledge, this information is not available yet. Other limitations are related to computational issues, especially when dealing with relatively large counts, and the sensitivity of the SIR model's approximated solution that sometimes does not allow recovering the parametersN ,b, andĝ.

The model presented here constitutes a first approach to a reliable method to estimate a pandemic's under-reporting, such as the current SARS-CoV2. Furthermore, the model can be extended in different ways, such as considering more complex correlation structures in the underlying process (e.g., INAR(p) or INARMA model), or considering more general thinning operators for representing the observed process. 

Prevalence of Asymptomatic SARS-CoV-2 Infection

The role of asymptomatic SARS-CoV-2 infections: rapid living systematic review and meta-analysis. medRxiv

Prediction of Epidemic Spread of the 2019 Novel Coronavirus Driven by Spring Festival Transportation in China: A Population-Based Study

Markov Chain Monte Carlo Analysis of Under-reported Count Data With an Application to Worker Absenteeism

Estimating with incomplete count data. A Bayesian approach

A hierarchical framework for correcting underreporting in count data

Reconstruction of the full transmission dynamics of COVID-19 in Wuhan

Modelo CoVID-19 Uruguay

Identifying the number of unreported cases in SIR epidemic models

Under-reported data analysis with INAR-hidden Markov chains

Untangling serially dependent underreported count data for gender-based violence

El Agujero por donde se coló la pandemia

Vigilancia y Control en la Fase de Transición de la Pandemia de Covid-19

On the application of integer-valued time series models for the analysis of disease incidence

Some simple models for discrete variate time series

First-order integer-valued autoregressive (INAR(1)) process

Discrete Variate Time Series

Thinning operations for modeling time series of counts-a survey

Handbook of Discrete-Valued Time Series

A new type of discrete self-decomposability and its application to continuous-time Markov processes for modelling count data time series

Negative binomial time series models based on expectation thinning operators

First-order integer valued AR processes with zero inflated poisson innovations

Integer Valued AR(1) with Geometric Innovations

Hidden Markov Models for Time Series: An Introduction Using R. Chapman & Hall/CRC

A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE

The Viterbi algorithm

COVID-19: towards controlling of a pandemic

Assessing Optimal Target Populations for Influenza Vaccination Programmes: An Evidence Synthesis and Modelling Study

Seasonal influenza: Modelling approaches to capture immunity propagation

Real-time modelling of a pandemic influenza outbreak

Infectious diseases of humans: dynamics and control

An introduction to infectious disease modelling

Underreporting Of COVID-19 Coronavirus Deaths In The U.S. And Europe (Update)

What we'll need to find the true COVID-19 death toll

UK coronavirus death toll reaches 1,789 amid data reporting concerns

The authors would like to thank Gustavo Eduardo Á valos Villaseñor for his help in data scrapping and processing.Software: Argimiro Arratia.

Writing -review & editing: Amanda Fernández-Fontelo, David Moriña, Alejandra Cabaña, Argimiro Arratia, Pere Puig.