key: cord-1016182-pvieqi52
authors: Paixão, Balthazar; Baroni, Lais; Pedroso, Marcel; Salles, Rebecca; Escobar, Luciana; de Sousa, Carlos; de Freitas Saldanha, Raphael; Soares, Jorge; Coutinho, Rafaelli; Porto, Fabio; Ogasawara, Eduardo
title: Estimation of COVID-19 Under-Reporting in the Brazilian States Through SARI
date: 2021-03-14
journal: New Gener Comput
DOI: 10.1007/s00354-021-00125-3
sha: 682e42ffcce949d9aa6e81177d84a6f6c05fddd8
doc_id: 1016182
cord_uid: pvieqi52

Given its impact, COVID-19 has pushed the academic community to search for ways to cure, mitigate, or control it. Under-reporting is believed to be a relevant factor in determining the actual mortality rate and, if not considered, can cause significant misinformation. Therefore, this work aims to estimate the under-reporting of cases and deaths of COVID-19 in Brazilian states using data from InfoGripe, which targets notifications of Severe Acute Respiratory Infection (SARI). The methodology is based on the combination of data analytics (event detection methods) and time series modeling (inertia and novelty concepts) over hospitalized SARI cases. The estimate of real cases of the disease, called novelty, is calculated by comparing SARI cases in 2020 (after COVID-19) with the total expected cases based on recent years (2016–2019). The expected cases are derived from a seasonal exponential moving average. The results show that under-reporting rates vary significantly between states and that there are no general patterns for states in the same region in Brazil. The states of Minas Gerais and Mato Grosso have the highest rates of under-reporting of cases. The rate of under-reporting of deaths is high in Rio Grande do Sul and Minas Gerais. This work stands out for its combination of data analytics and time series modeling. Our calculation of under-reporting rates based on SARI is conservative and better characterized for deaths than for cases.

In January 2020, the World Health Organization (WHO) declared the outbreak of the new coronavirus disease (COVID-19) a Public Health Emergency of International Concern. Later, in March, WHO characterized the disease as a pandemic. Due to its relevance, many efforts are being made to combat COVID-19, whether by discovering the characteristics of the virus, methods of prevention, and treatments, or by directing public policy action [5].

In Brazil, interventional measures such as the creation of field hospitals, surveillance information systems, and actions to reduce the economic impact are being adopted to mitigate the effects caused by COVID-19. One of the main objectives is to slow down the spread of the virus to avoid overloading the health system. To this end, policies to encourage prevention are adopted, such as the recommendation or imposition of physical isolation and quarantine [32].

Decision-making for the adoption of public policies in this pandemic scenario is a challenging task. Part of the difficulty comes from the lack of specific information about essential characteristics such as the total number of people infected. Tests to confirm infection by SARS-CoV-2 are scarce and, with exceptions, end up being performed only in the more severe cases of the disease. Such a scenario makes the capacity of the health system to monitor the evolution of the number of cases uncertain. The discrepancy between the actual number of infected individuals and the number diagnosed constitutes under-reporting [21].

It is estimated that under-reporting is a relevant factor in determining the actual mortality rate and, if not considered, can cause significant misinformation [20]. Therefore, this work aims to estimate the under-reporting of cases and deaths of COVID-19 in Brazilian states. Since testing the entire population is not viable, data from InfoGripe is used. InfoGripe targets notifications of Severe Acute Respiratory Infection (SARI).

Our paper stands out for adopting a methodology based on the combination of data analytics (event detection methods) and time series modeling (inertia and novelty). Data analytics is applied to determine the parameters to be used for time series modeling. The estimated parameters consider time series analysis through event detection methods.

The estimate of real cases of the disease, called novelty, is calculated by comparing SARI cases in 2020 (after COVID-19) with the total expected cases based on recent years (2016-2019). The expected cases are derived from a seasonal exponential moving average. The novelty is based on inertial concepts. That is, there is a force that keeps the values of a time series in a stable state through time [12]. Inertia remains until a rupture occurs. In this case, the rupture is the influence of COVID-19. Under-reporting, then, is given by the difference between the novelty and the number of reported cases. In the end, under-reporting (cases and deaths) is presented as a rate for each state in Brazil.

For the sake of clarity, it is important to introduce some background for time series, moving averages, and event detection used in the context of this work.

A time series is a sequence of observations collected over time. Usually, a time series $y$ can be considered a stochastic process, i.e., a sequence of $n$ random variables $\langle y_1, y_2, \ldots, y_n \rangle$ [11, 28]. A specific observation of a time series is represented as $y_i$, indexed in time by $i = 1, \ldots, n$, where $y_1$ represents the first observation and $y_n$ is the most recent observation.

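The $i$th subsequence of size $p$ in a time series $y$, represented as $seq_{i,p}(y)$, is a continuous ordered sequence of values $\langle y_{i-(p-1)}, y_{i-(p-2)}, \ldots, y_i \rangle$, where $|seq_{i,p}(y)| = p$ and $p \le i \le |y|$. The sequence contains the $i$th observation and its $p - 1$ predecessors.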

The $i$th seasonally lagged subsequence of a time series $y$, represented as $seq^s_{i,p}(y)$, is an ordered sequence of values $\langle y_{i-(p-1) \cdot s}, y_{i-(p-2) \cdot s}, \ldots, y_i \rangle$, where $p$ corresponds to the size of the sequence ($|seq^s_{i,p}(y)| = p$, with $p \le i \le |y|$) and $s$ corresponds to the seasonality ($s \ll |y|$). The sequence contains the $i$th observation and its $p - 1$ seasonally lagged predecessors.
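As an illustration, both kinds of subsequence can be extracted with a single helper. This is a minimal Python sketch rather than the authors' R code; the function name `subsequence` and the bounds check are assumptions made for the example.

```python
def subsequence(y, i, p, s=1):
    """Return <y_{i-(p-1)s}, ..., y_{i-s}, y_i> using the 1-based index i.

    s=1 gives the contiguous subsequence seq_{i,p}(y); s>1 gives the
    seasonally lagged subsequence seq^s_{i,p}(y).
    """
    first = i - (p - 1) * s
    if first < 1 or i > len(y):
        raise ValueError("subsequence falls outside the series")
    # Python lists are 0-based, so observation y_j lives at y[j - 1].
    return [y[j - 1] for j in range(first, i + 1, s)]


weekly = list(range(1, 21))              # toy series where y_i = i
print(subsequence(weekly, 10, 3))        # three observations ending at i = 10
print(subsequence(weekly, 20, 3, s=5))   # every fifth observation back from i = 20
```

With weekly data and s = 52, the same call would pick the same epidemiological week from consecutive years.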

The $i$th moving average $\bar{y}_{i,p}$ of $p$ terms in a time series $y$ is the average of the observations $t_k$ in the sequence $seq_{i,p}(y)$, as shown in Eq. 1. The $i$th exponential moving average $\hat{y}_{i,p}$ of $p$ terms in a time series $y$ is the weighted average of the observations $t_k$ in the sequence $seq_{i,p}(y)$ with weights $\alpha_k$. The $\hat{y}_{i,p}$ is described in Eq. 2, where there is more emphasis on the most recent observations.

The $i$th seasonal moving average $\bar{y}^s_{i,p}$ and the $i$th seasonal exponential moving average $\hat{y}^s_{i,p}$ of $p$ terms in a time series $y$ are calculated similarly, replacing the continuous sequence $seq_{i,p}(y)$ in Eqs. 1 and 2 with the seasonal sequence $seq^s_{i,p}(y)$ (see "Time series"), as shown in Eqs. 3 and 4, respectively.

$$\bar{y}_{i,p} = \frac{1}{p} \sum_{k=1}^{p} t_k, \quad t_k \in seq_{i,p}(y) \qquad (1)$$
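The averages described above can be sketched in a few lines of Python. The plain average follows directly from the definition; the exponential weights are an assumption, since the text only states that recent observations receive more emphasis. Here the weights decay geometrically with a hypothetical smoothing factor `alpha` and are normalized. The seasonal variants are obtained by passing a seasonally lagged subsequence instead of a contiguous one.

```python
def moving_average(values):
    """Plain average of the observations in a subsequence."""
    return sum(values) / len(values)


def exponential_moving_average(values, alpha=None):
    """Weighted average that favours the most recent (last) observations.

    ASSUMPTION: weights decay as (1 - alpha)^(p - k) with alpha = 2 / (p + 1)
    and are normalised to sum to one; the source does not give the weights.
    """
    p = len(values)
    if alpha is None:
        alpha = 2.0 / (p + 1)
    weights = [(1 - alpha) ** (p - k) for k in range(1, p + 1)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Same week observed in four consecutive years (toy numbers, oldest first).
same_week = [100, 110, 130, 160]
print(moving_average(same_week))              # 125.0
print(exponential_moving_average(same_week))  # larger, since recent years weigh more
```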

Event detection methods include the discovery of anomalies and change points. Anomalies are observations that stand out because they do not appear to have been generated by the same process as the other observations in the time series [19] . Change points characterize a transition between different states in a process that generates the time series data [9, 30] . There are several methods to address the detection of anomalies [6, 13] and change points [1] . Among them, some methods consider the effects of inertia on time series data. As this work is based on inertial concepts [12] , two methods of this group are presented.

Adaptive normalization [23] is used to detect anomalies. This technique uses inertia to address heteroscedastic non-stationary series. Given a time series $y$, the outlier removal process consists of three stages: (i) inertia calculation, (ii) noise calculation, and (iii) anomaly identification. In the inertia calculation, a moving average $\bar{y}_{i,p}$ with $p$ terms is calculated for the series, as described by Eq. 1. The higher the value of $p$, the greater the inertia and the lower the adaptation speed. The noise $\epsilon_i$ is the difference between $y_i$ and $\bar{y}_{i,p}$, i.e., $\epsilon_i = y_i - \bar{y}_{i,p}$. Finally, the observations whose noise $\epsilon_i$ is classified as an outlier by the boxplot criterion (Eq. 5) correspond to anomalies.
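The three stages can be sketched as follows. This is an illustrative Python version under stated assumptions, not the Harbinger implementation: the window is a trailing one, the boxplot criterion is taken to be the usual 1.5 × IQR fences, and the quartiles use the default method of `statistics.quantiles`. The function names are hypothetical.

```python
import statistics


def boxplot_outlier_flags(values):
    """Flag values outside the usual boxplot fences (1.5 * IQR beyond Q1/Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < low or v > high for v in values]


def inertia_anomalies(y, p):
    """Return 0-based positions whose noise y_i - mean(window) is an outlier."""
    positions = range(p - 1, len(y))
    # (i) inertia: trailing moving average; (ii) noise: deviation from it.
    noise = [y[i] - sum(y[i - p + 1:i + 1]) / p for i in positions]
    # (iii) anomaly identification on the noise series.
    flags = boxplot_outlier_flags(noise)
    return [i for i, flagged in zip(positions, flags) if flagged]


series = [10 + (i % 3) for i in range(30)]
series[20] = 100                     # injected spike
print(inertia_anomalies(series, p=5))
```

Because the trailing average stays inflated for a few steps after a spike, the observations immediately following it can be flagged as well.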

Change Finder is a technique that detects change points in univariate time series data [30]. Given a time series $y$, the event detection process consists of two phases. In the first phase, outliers are detected. For this, a learning model is fitted to the time series $y$, resulting in the fitted values $\hat{y}_i$. Next, a score $s_i$ is calculated for each observation in the series, related to its deviation from the learned model. This calculation produces a time series $s$, as presented in Eq. 6. The highest scores in $s$, classified according to Eq. 5, indicate anomalies.

In the second phase, change points are detected. For this, a new time series $s_p$ is produced, composed of the moving averages of $s$ with $p$ terms, according to Eq. 1. The detection of change points is then reduced to the outlier detection problem in $s_p$, as in the first phase.
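The two phases can be illustrated with a simplified Python sketch. It is not the Change Finder algorithm as published: the learning model is replaced by a trailing mean of the previous $p$ observations, the score is taken to be the squared deviation, and only the upper boxplot fence is used. All three choices are assumptions made to keep the example short.

```python
import statistics


def upper_outlier_flags(values):
    """Flag values above the upper boxplot fence Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    high = q3 + 1.5 * (q3 - q1)
    return [v > high for v in values]


def change_point_candidates(y, p):
    """Return 0-based series positions whose smoothed score is an outlier."""
    # Phase 1: score each observation by its squared deviation from a
    # stand-in model (mean of the p previous observations).
    scores = [(y[i] - sum(y[i - p:i]) / p) ** 2 for i in range(p, len(y))]
    # Phase 2: smooth the scores with a p-term moving average, then look
    # for outliers in the smoothed series.
    smoothed = [sum(scores[k:k + p]) / p for k in range(len(scores) - p + 1)]
    flags = upper_outlier_flags(smoothed)
    # smoothed[k] ends at scores[k + p - 1], i.e. series position k + 2p - 1.
    return [k + 2 * p - 1 for k, flagged in enumerate(flags) if flagged]


level_shift = [10.0] * 30 + [50.0] * 30
print(change_point_candidates(level_shift, p=3))
```

The first flagged position marks the onset of the new level; the following positions are flagged while the smoothing window still overlaps the transition.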

Due to its relevance and recent outbreak, COVID-19 has been attracting much interest in the academy. Therefore, many works on COVID-19 have been published since the beginning of 2020 until today. However, there are still few studies focused on under-reporting estimates. This low number of related publications can be a consequence of the time spent on the execution, review, editing, and publication of papers in scientific journals. Krantz et al. [17] used harmonic analysis and wavelets to model the underreporting of COVID-19 in several countries worldwide. They developed susceptibility and infection equations with parameters varied according to the characteristics of each country to build adaptive models. The under-reporting rate was calculated by the difference between the numbers predicted by the model and reported numbers. The result provided the ratio between reported and unreported cases in the format (1 to x) in seven countries. The authors concluded that the results are not entirely accurate due to the lack of some important information that should be included in the model and was not available.

Similarly, to review the numbers of reported COVID-19 cases in several countries, Lachmann et al. [20] also estimated expected cases. For this, the authors used demographic data and fixed mortality rates of the countries, together with a paired comparison with a reference country (South Korea). They presented and discussed estimates of the number of people infected with COVID-19, considering a set of conditions that must hold to justify the model.

Ribeiro et al. [25] used regression techniques on hospitalization data in Brazil with a type of acute respiratory syndrome as the cause. They analyzed the time evolution of hospitalizations for each month in the period between 2012 and 2019. They created a mathematical function that replicates the typical behavior of cases of hospitalization for SARI. This function was compared with data from 2020 in the same months to estimate under-reporting. The results showed an under-reporting rate of 7.7:1 for Brazil.

Bastos and Cajueiro [3] modeled and predicted the initial evolution of the COVID-19 pandemic in Brazil using about a month of data provided by the Ministry of Health of Brazil. They sought to model the spread of the virus and evaluate existing countermeasures. For that purpose, they used two variations of the SIR model and included a parameter that captures the effects of social distancing measures. They concluded that a social distancing policy can flatten the infection pattern of COVID-19, but that it is only effective if it lasts until mid-June, according to their predictions. They also pointed out the importance of testing the population, given the proportion of asymptomatic individuals.

Silva et al. [29] fitted growth curve models using a Bayesian approach to calculate the total number of cases and daily new cases in the state of Goiás, Brazil. The analysis also investigated the possible date of the outbreak peak in the state. The study did not take into consideration possible changes in government control measures.

Saulo et al. [4] discussed the role of uncertainty in the prediction of the number of infected individuals and deaths. They proposed an adapted susceptible-infected-recovered (SIR) model, which explicitly incorporates under-reporting and the response of the population to public policies to produce short-term and long-term predictions. As a contribution, it seeks to capture the role that under-reporting uncertainty plays in model-based predictions of the COVID-19 contagion, which strongly affects the outlook for its spread in Brazil.

Our work stands out for estimating the under-reporting of COVID-19 in Brazilian states weekly. The estimate considers the weighted historical record (in which most recent years have more weight than less recent ones) to predict expected SARI cases in 2020. It enriches the analysis allowing an estimate closer to reality. This work can also be highlighted for focusing on time series and using event detection tools in the study. Furthermore, except for the article by Ribeiro et al. [25] , as far as we know, the data used in this work to obtain under-reporting rates were not used in any other work with the same or similar purpose.

In seasonal phenomena, time series are generated by superimposing a seasonal process and random noise. Based on this premise, Eq. 7 models the seasonal component of the time series, where $y_i$ is an observation, $\hat{y}^s_{i-s,p}$ is the seasonal exponential moving average (SEMA) in the previous season, and $\epsilon_i$ is the random noise. The obtained seasonal component brings up the concept of inertia in time series. It enables the analysis of the intrinsic random noise of the observed phenomenon, while the influences that determine the behavior of the series are not changed [12].

In the case of a rupture (i.e., a "break" in inertial behavior), we adopt the concept of novelty. The novelty is the influence introduced in each interval resulting from a rupture in a time series. Once the novelty begins, the SEMA modeled from past data is no longer the only process representative of the new behavior of the time series. In this context, Eq. 7 is expanded to Eq. 8, which expresses the novelty $\eta_i$ and the error $\hat{\epsilon}_i$. The error $\hat{\epsilon}_i$ is approximated by the average error observed in the pre-novelty period, i.e., $\hat{\epsilon}_i$ is expected to lie inside the confidence interval for $\bar{\epsilon}$.

Until the seasonal component $\hat{y}^s_{i-s,p}$ incorporates the novelty $\eta_i$, $\eta_i$ defines a new phenomenon in the time series. Regarding SARI, we assume that $\eta_i$ is directly associated with COVID-19, i.e., the new known phenomenon.

From this concept, we first compute the inertial behavior of the time series to estimate under-reporting. Let $t$ be the period in which the rupture $y_t$ occurs. In the novelty period (i.e., $t \le i \le |y|$), $\eta_i$ is obtained by subtracting from the observation $y_i$ the SEMA value of the previous season $\hat{y}^s_{i-s,p}$ and the error $\hat{\epsilon}_i$ (approximated by $\bar{\epsilon}$). Equation 8 shows the calculation of the time series $\eta_i$ for each $i$ in the novelty period. The novelty $\eta_i$ estimates the raw number of observations that exceed what is expected according to the inertial behavior of the time series and its fundamental error.
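The subtraction described above amounts to one line per week. The sketch below assumes the expected values (the SEMA of the previous season) and the mean pre-novelty error have already been computed; the function and argument names are hypothetical, and the numbers are toy values.

```python
def novelty_series(observed, expected, mean_error):
    """Novelty per week: observation minus expected value minus mean error.

    observed   -- SARI counts y_i for the weeks of the novelty period
    expected   -- SEMA values for the same weeks, built from past seasons
    mean_error -- average model error in the pre-novelty period
    """
    return [y - e - mean_error for y, e in zip(observed, expected)]


# Two weeks of toy data: counts well above what the past seasons suggest.
print(novelty_series([120, 150], [100, 105], mean_error=5))  # [15, 40]
```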

To estimate the raw number of under-reported observations, we use the number of observations classified as SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) in the novelty period. Equation 9 presents the calculation of the time series with absolute numbers of under-reported observations, $sub_i$, where $cov_i$ are the observations classified as SARS-CoV-2.

As we assume that the modeled novelty $\eta_i$ represents COVID-19 cases, the time series $sub_i$ defines the number of under-reported observations per week. The estimates $sub_i$ are then added together to form the accumulated number of under-reported observations in the period, represented as $cur_i$ in Eq. 10.

The under-reporting rate is estimated by dividing the accumulated number of under-reported observations $cur_i$ by the accumulated number of observations classified as SARS-CoV-2 ($cov_i$) for the period. Equation 11 describes the under-reporting rate, denoted as $tx_i$, where $tx_{|y|}$ is the final rate. In this work, this calculation provides the estimated under-reporting rates for cases and deaths of COVID-19 for each Brazilian state individually. Thus, these rates allow for a comparable interpretation between the states.
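The chain from novelty to rate can be sketched as below, reading the description as: weekly under-reporting is novelty minus records classified as SARS-CoV-2, and the rate is the running total of under-reported records divided by the running total of classified records. The function name and the toy numbers are assumptions for illustration.

```python
from itertools import accumulate


def under_reporting_rates(novelty, cov):
    """Return the weekly rate series tx_i for the novelty period.

    novelty -- estimated COVID-19 records per week
    cov     -- records classified as SARS-CoV-2 per week
    """
    sub = [n - c for n, c in zip(novelty, cov)]   # under-reported per week
    cur = list(accumulate(sub))                   # accumulated under-reported
    cum_cov = list(accumulate(cov))               # accumulated classified
    return [cu / cc if cc else float("nan") for cu, cc in zip(cur, cum_cov)]


rates = under_reporting_rates([30, 50, 80], [10, 30, 60])
print(rates)       # [2.0, 1.0, 0.6]
print(rates[-1])   # final rate for the period
```

In this toy series the rate falls over time because the classified records grow faster than the under-reported ones.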

This section discusses the experimental setup of the scenario in which the methodology was applied. It first presents the process of data acquisition and preparation, then describes the methods and parameters applied in the analysis, and finally presents the implementation details.

InfoGripe is the principal data source used for the analysis and development of the work. It is an initiative of the Oswaldo Cruz Foundation (Fiocruz) with the Getulio Vargas Foundation (FGV) and the Brazilian Health Surveillance System of the Ministry of Health. It has recorded weekly reported SARI cases since January 2009. The data comes from the Influenza Epidemiological Surveillance Information System (SIVEP-Gripe). It presents the cases meeting the criteria: (fever) AND (cough OR sore throat) AND (dyspnoea OR oxygen saturation < 95% OR respiratory difficulty) AND (hospitalization OR death), symptoms equivalent to SARI international records [16]. For the sake of simplicity, we call the dataset DT_SARI.

To keep only the relevant data, we apply the following filter: type = "State" ∧ gender = "Total" ∧ scale = "Cases". The resulting dataset shows the number of cases or deaths per epidemiological week of a given year for each state. Besides, it specifies the number of observations that correspond to Influenza A, Influenza B, SARS-CoV-2, Respiratory Syncytial Virus (RSV), Parainfluenza 1, Parainfluenza 2, Parainfluenza 3, and Adenovirus.

Next, the case observations that evolved to death are separated from the others. For this, we apply a second filter that results in two datasets, one with cases (DT_SARI_c) and another with deaths (DT_SARI_d). Finally, five attributes of interest are selected: Year, Week, State, total, and SARS-CoV-2. Table 1 describes these attributes.

In addition to these data, we use the number of confirmed cases (DT_MH_c) and confirmed deaths (DT_MH_d) from COVID-19 by state, provided by the Ministry of Health. These numbers are updated daily on the COVID-19 Portal, the official communication channel on the epidemiological situation of COVID-19 in Brazil [14]. The values are used for comparison with the results obtained in this work.

The method and parameter selection are a determining factor for the quality of the results obtained in the research. This section aims at justifying the applied methodology, which includes the choice of the used dataset, and the methods and parameters adopted in the data analysis.

Datasets. The most severe cases of COVID-19 manifest respiratory symptoms, such as difficulty in breathing or shortness of breath, and chest pain or pressure [27]. These symptoms are also present in Acute Respiratory Infection (ARI). Fever is another common symptom, even in mild cases of the disease. This is the reason for choosing SARI data (DT_SARI) instead of ARI data (DT_ARI). DT_SARI is a subset of DT_ARI; they differ only in the manifestation of fever. Therefore, we consider that the probable cases of COVID-19 with severe symptoms also present fever, making DT_SARI the most suitable dataset to estimate the under-reporting of the disease [18, 26].

Table 1 Attributes of processed datasets DT_SARI_c and DT_SARI_d
Attribute | Description
Year | The epidemiological year of first symptoms
Week | The epidemiological week of first symptoms
State | The state name
total | The total number of recorded cases (DT_SARI_c) / deaths (DT_SARI_d)
SARS-CoV-2 | The total number of cases with positive results for COVID-19 (DT_SARI_c) / deaths by COVID-19 (DT_SARI_d)

SEMA for Inertial Model. To compute the under-reporting of COVID-19 in Brazil, it is necessary to identify the SARI observations that correspond to COVID-19. For this, data from years predating COVID-19 should be observed to model the inertial behavior that would be expected had there been no pandemic. Thus, it is possible to estimate the number of COVID-19 cases as the value exceeding what is expected for the same period of the year.

SEMA provides an appropriate method to create the inertial function since it is a trend indicator that assigns more weight to the most recent data considering a seasonal pattern. It is efficient to estimate an inertial behavior of a time series if the series has not undergone any significant behavior change in the period.

First, we define the time series for which SEMA is calculated. For this, three parameters are required: p, i, and s (see "Introduction" section). The i represents the time index of the reference time series, p is the number of predecessors, and s is the seasonality to be considered. Note that p and s are defined based on the locality of i.

The s is chosen based on the seasonal variation of respiratory viral diseases. The annual epidemics of the common cold and the flu affect the human population of temperate regions in the winter season [7, 10, 22, 31] . Therefore, s is defined as 52, since 52 corresponds to the number of weeks in the year. In this way, we guarantee the analysis of comparable observation sequences in the SARI series.

The parameters p and i are based on the response of the event detection algorithms in each state. The event detection (targeting both change points and anomalies) in the series DT_SARI_c and DT_SARI_d consistently shows, in several states, a behavior change in two periods: (i) between the end of 2015 and the beginning of 2016, and (ii) between February and March 2020. Table 2 shows the dates of events detected in 2020 for each state.

The events detected in 2020 are a consequence of COVID-19 in Brazil. These events coincide with the first record of the disease in the country, considering the time for the disease to spread and for symptoms to manifest [2, 15]. For most of the states, the events appear on March 7 and March 14. They correspond, respectively, to the 11th and 12th epidemiological week, two or three weeks after the first confirmed case of COVID-19 in Brazil.

It is possible to identify the beginning period (t) of the novelty for a given state (see footnote 4). The online method consists of seeking a change point in 2020, running the detection weekly from the first week of 2020 until a change point is detected in the year. When the change point is detected, the method stops and considers that week as the beginning of the period. So, for each state, the parameter i takes values from t until the last week of data (|y|), which corresponds to week 26 of 2020 (i.e., June 27, 2020). Figure 1 shows the events detected in the SARI cases curve in Brazil. In addition to 2009 (H1N1) and 2020 (COVID-19), events are observed in the 2016 period. The events presented in Fig. 1 correspond to abnormal behavior and can affect the previous inertial behavior of the series. For this reason, the value attributed to p is 4, meaning that the previous 4 years (2016-2019) are considered.
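The weekly search for the beginning of the novelty can be sketched as a loop around any change-point detector. The sketch below treats the detector as a black box passed in as a function and returns the first 2020 week at which a run reports a change point inside 2020; that reading of "that week", as well as all names, is an assumption.

```python
def novelty_start_week(history, weeks_2020, detect):
    """Return the first 2020 week (1-based) at which a change point is found.

    history    -- observations before 2020
    weeks_2020 -- weekly observations of 2020, in order
    detect     -- function mapping a series to 0-based change-point positions
    """
    for week in range(1, len(weeks_2020) + 1):
        series = history + weeks_2020[:week]
        # Only change points located inside 2020 count.
        if any(pos >= len(history) for pos in detect(series)):
            return week
    return None  # no change point detected in the available 2020 data


# Toy detector for the example only: flags values above a fixed threshold.
def toy_detector(series):
    return [i for i, v in enumerate(series) if v > 100]


print(novelty_start_week([50] * 10, [55, 60, 58, 150, 200], toy_detector))  # 4
```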

The model errors (random noise) for this period for both the cases and deaths in each state are described in Tables 3 and 4, respectively. Since $\epsilon_i$ follows a non-normal distribution, the confidence interval for $\bar{\epsilon}$ is computed by bootstrap with 1000 repetitions. These values are important for the novelty calculation, as they reduce the chance of attributing to the novelty an increase generated by a random event.
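A bootstrap interval for the mean error can be sketched with the standard library. The percentile method and the 95% level are assumptions, since the text specifies only the number of repetitions; the error values are toy numbers.

```python
import random
import statistics


def bootstrap_mean_ci(errors, repetitions=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `errors`."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    means = sorted(
        statistics.fmean(rng.choices(errors, k=len(errors)))
        for _ in range(repetitions)
    )
    lower = means[int((1 - level) / 2 * repetitions)]
    upper = means[int((1 + level) / 2 * repetitions) - 1]
    return lower, upper


# Skewed toy errors, as one might see with count data.
pre_novelty_errors = [1, 2, 2, 3, 3, 3, 4, 10, 15, 2]
print(bootstrap_mean_ci(pre_novelty_errors))
```

Resampling with replacement avoids assuming a normal distribution for the errors, which is the motivation given in the text.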

The adopted methodology was implemented in R [24]. The code description and a Jupyter notebook, also developed in R, complement this work (see footnote 5). In the notebook, it is possible to check the entire process of calculating the under-reporting rates and all numerical and graphical results. The graphics with the cases and deaths series from DT_SARI and the marking of the detected events are presented in this notebook for all states. Also, the site contains graphics with the evolution of under-reported records over the weeks after COVID-19 for each state. There it is possible to see whether under-reported records increase, decrease, or remain constant over time.

Fig. 1 Anomalies (yellow) and change points (red) detected in SARI cases of Brazil

4 According to the corresponding epidemiological week identified by change points. They are presented in Table 2. For the state of Paraná, the date detected for deaths was used instead.

5 Available at https://eic.cefet-rj.br/~dal/covid-19-under-report/.

The Harbinger framework was used for detecting events in time series (adaptive normalization and change finder). It receives the time series and parameters and returns the detected events. The parameters used are those defined in the "Method and parameter selection" section. For each state, two time series were submitted to the process described in the "Methods" section, both from the InfoGripe dataset on hospitalizations for SARI (DT_SARI). The first is the weekly series with the number of registered SARI cases in the state. The second is the weekly series with the number of SARI deaths in the state.

Under-reporting rates were calculated for states where it was found that there were, in fact, novelty and under-reported notification. For this, two independent tests were carried out using the Wilcoxon test. The average error observed in the pre-novelty period ($\bar{\epsilon}$) was compared with the novelty ($\eta_i$) to check whether there was a novelty. To check whether there was under-reported notification, the computed novelty ($\eta_i$) was compared with the number classified as SARS-CoV-2 in the InfoGripe data ($cov_i$) in a paired test. In both cases, the under-reporting rates were calculated only when there was a significant difference at a significance level of 0.05.

This work focuses on estimating under-reporting rates for cases and deaths of COVID-19. In "Data analytics" section, exploratory analysis is conducted. It contains discussions based on event detection (change points and anomaly) over the SARI time series. These findings bring valuable information to help understand the disease scenario in the most affected states. Besides, they helped to evaluate the choice of the method and the confidence of the estimates. Then, "Under-reporting rates" section briefly discusses the characteristics of the under-reporting rates calculated. Finally, "Evolution of the under-reporting rates" section presents the evolution of under-reporting in the period considered in this work.

The detection of change points and anomalies in the time series of SARI hospitalizations in Brazil was important to understand the beginning of the COVID-19 pandemic in the country. It also enabled the analysis of epidemic moments over the last years. In Figs. 2 and 3, it is possible to observe the behavior of the data and the specificities of the most affected Brazilian states. Amazonas state is the epidemic center in the North region, and its capital, Manaus, was the first capital in Brazil to suffer from a wave of deaths. The state presented an increase in the number of hospitalizations in 2019. This increase is also observed in other states from 2016 until 2019. The Amazonas time series shows some anomalies, but just one change point for both the number of cases and deaths, marked in the 11th epidemiological week of 2020. The state reached its peak of hospitalizations and deaths in the 17th epidemiological week and now presents a decrease in the curve.

In the Northeast region, the cases and deaths in Ceará, Pernambuco, and Bahia stand out. Ceará and Pernambuco displayed the highest numbers in the region. All three states present both change points (cases and deaths) in the 10th week. Pernambuco and Ceará reached their peaks of hospitalizations in the 18th week (more than 1000 cases) and the 19th week (more than 1800 cases), respectively. The peak of deaths for both of these states is located in the 18th week. In Bahia and Pernambuco, the numbers of cases and deaths show, between 2016 and 2019, a similar rise and fall, forming a curve between March and July.

Distrito Federal, located in the Central-West region of Brazil, was then considered one of the main focuses of COVID-19 contagion, alongside Rio de Janeiro and São Paulo. Previously, the peak in the number of cases in Distrito Federal was in August 2009, during the H1N1 epidemic. The pandemic surpassed this high number in 2020. Besides, the number of deaths caused by H1N1 was not as expressive as the number of deaths registered for COVID-19.

The Southeast is the most populous region and the most infected area in the country. São Paulo was the first state to register a case (February) and death (March) by COVID-19. It is still the epicenter of the disease in Brazil. The state has the mark of the change point for cases and deaths in the 10th week.

Rio de Janeiro, also in the Southeast region, was impacted by SARS-CoV-2. Two change points can be observed for cases: the first in 2009 and the second in 2020. For deaths, however, only one change point was observed, in 2020, showing the seriousness of this pandemic. In 2020, the change point was detected in the 11th epidemiological week for both cases and deaths.

The 2009 H1N1 crisis also impacted the states in the South region. According to the time series, it is noticeable that Paraná and Rio Grande do Sul were affected in the number of cases. On the other hand, comparing the number of deaths makes it possible to observe and analyze the lethality of these two epidemic moments. Paraná

The under-reporting rates were computed according to the proposed methodology. Tables 5 and 6 show the values of the under-reporting rates of cases and deaths for the 27 states of Brazil (columns cases rate and deaths rate, respectively). The rates shown are calculated for the period between the week detected by the event detection methods (see Table 2) and epidemiological week 26 (which corresponds to the date 27/06/2020). Thus, the periods considered vary for cases and deaths and between states.

Table 5 Under-reporting rates of cases of COVID-19 for the states of Brazil. • The difference between computed novelty and reported values as SARS-CoV-2 was not statistically significant

The second column of both tables (cum. novelty) presents the novelty values ($\eta_i$) computed according to the methodology. The third column (cum. cases DT_SARI_c and cum. deaths DT_SARI_d) gives the number of cases/deaths classified as SARS-CoV-2 in the InfoGripe data. The fifth column (disclosed cum. cases DT_MH_c and disclosed cum. deaths DT_MH_d) gives the number of cases/deaths reported by the Ministry of Health, for comparison purposes. The information published by the Ministry of Health covers all confirmed cases/deaths of COVID-19. They are presented regardless of whether there was hospitalization for SARI or not, so they capture a broader number of reported records.

Table 6 Under-reporting rates of deaths by COVID-19 for the states of Brazil. • The difference between computed novelty and reported values as SARS-CoV-2 was not statistically significant

The under-reporting rates presented in this paper can be applied to compute the under-reported cases or deaths of COVID-19 in each state. The number of under-reported records is calculated by multiplying the under-reporting rate by the number of confirmed cases or deaths of COVID-19. The result can be added to the reported cases/deaths to estimate the expected number of cases or deaths of COVID-19 in the state.
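As a worked illustration of this arithmetic, the short Python sketch below applies a rate to a confirmed count. The function name and the numbers are hypothetical and are not taken from Tables 5 and 6.

```python
def expected_total(confirmed, rate):
    """Confirmed records plus the under-reported records implied by the rate."""
    under_reported = confirmed * rate
    return confirmed + under_reported


# Hypothetical state with 1000 confirmed cases and a rate of 0.5:
# 500 under-reported cases, hence 1500 expected cases in total.
print(expected_total(1000, 0.5))  # 1500.0
```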

The under-reporting rates of cases vary between 0.124 and 1.811, while the under-reporting rates of deaths vary between 0.072 and 0.983. Among the states for which it was possible to calculate the two rates, most had a higher under-reporting rate of cases than of deaths. Only the states of Rio Grande do Sul, Roraima, Distrito Federal, Amazonas, and Amapá behaved differently.

There is no dominant pattern among the states in each region of Brazil. This suggests that under-reporting is a characteristic of each state and that regional similarity is not a relevant factor. The states of Minas Gerais and Mato Grosso have the highest rates of under-reporting of cases. The rate of under-reporting of deaths is high in Rio Grande do Sul and Minas Gerais.

The Distrito Federal, São Paulo, and Rio de Janeiro are identified as the focus of the contagion of COVID-19 in Brazil. Nevertheless, they are not among the states with the highest rates of under-reporting. This may be because they are better structured and less susceptible to reporting failures. The same observation does not hold for Mato Grosso and Minas Gerais, which belong to the Midwest and Southeast regions, respectively, and have the highest rates of under-reporting of cases across Brazil.

The proposed model did not capture the under-reporting of cases in Mato Grosso do Sul. Similar behavior occurred for the under-reporting of deaths in the states of Acre and Mato Grosso do Sul. These are the cases in which under-reporting cannot be observed ( • ).

Regarding the margin of error considered for the case rates, the states of the South region stand out. A factor that may have been determinant for this result is their historical temperature. As they have low temperatures, they generally have a higher number of SARI records. Thus, the novelty modeled in this work takes longer to be noticed, as it needs to reach even higher values to produce statistically significant changes.

To better characterize the behavior of the under-reporting rates, we analyze them week by week. It is important to keep in mind that COVID-19 tests were not available in most states at the beginning of the pandemic (11th week). Therefore, aiming for a better comparison, we present the analysis from the 12th week onward for all states.

The lack of tests for the population results in a higher under-reporting rate at the beginning. Over time, testing is expected to become more widespread, and the rates start to decrease. This behavior can be observed in the weekly rate graphs (Fig. 4).

As can be observed, under-reporting rates tend to stabilize over time. This convergence gives more confidence in the computed under-reporting rates. Besides, it shows that even when more tests for COVID-19 are available, there is still a high under-reporting rate for some states, such as Minas Gerais and Rio Grande do Sul.

The three sections of the results complement each other. Data analytics (with results presented in the "Data analytics" section) sets the parameters applied in the time series modeling, which are determinant for calculating the under-reporting rates. The subsequent analysis (with results presented in the "Evolution of the under-reporting rates" section) shows the trend towards stability in the behavior of the calculated under-reporting rates. When rates are stable, the long-term estimation is more reliable, as there is no significant change in rate values over time.
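The week-by-week analysis and the notion of stability can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this section: the rate is taken as (cumulative novelty − cumulative reported) / cumulative reported, and "stable" is approximated by a small spread of the rate over the last few weeks. The function names, the window, the tolerance, and the input numbers are all hypothetical.

```python
from itertools import accumulate


def weekly_cumulative_rates(novelty, reported):
    """Cumulative under-reporting rate after each week.

    Assumed definition: (cum. novelty - cum. reported) / cum. reported.
    Weeks with no reported records yet yield None.
    """
    rates = []
    for cum_nov, cum_rep in zip(accumulate(novelty), accumulate(reported)):
        rates.append((cum_nov - cum_rep) / cum_rep if cum_rep > 0 else None)
    return rates


def is_stable(rates, window=4, tolerance=0.05):
    """True if the last `window` rates all lie within `tolerance` of each other."""
    recent = [r for r in rates[-window:] if r is not None]
    if len(recent) < window:
        return False
    return max(recent) - min(recent) <= tolerance


# Hypothetical weekly counts for one state, starting at the 12th week.
novelty = [30, 40, 50, 60, 60, 60]
reported = [10, 20, 25, 30, 30, 30]
rates = weekly_cumulative_rates(novelty, reported)
print([round(r, 3) for r in rates])
print(is_stable(rates, window=4, tolerance=0.2))
```

In this toy series the rate starts high and decreases as reporting catches up, which mirrors the qualitative behavior described for the weekly graphs; the choice of window and tolerance determines how strict the stability check is.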

Limitations should be noted. One limitation is inherent to the dataset used. In times of epidemic, health services tend to be more sensitive and report more occurrences. Thus, the increase in the number of SARI cases in 2020 is partially explained by over-notification by health units. This over-notification, however, is mitigated when only hospitalized cases are observed.

Another limitation is due to random noise i . The states with higher i are slower to characterize the novelty i . Again, the computed under-reporting rates presented in this paper are conservative. They can be improved by predicting i using autoregressive models.

Since the under-reporting is inferred from SARI data, estimates are limited to cases of COVID-19 in which patients were hospitalized with the specific symptoms: fever, cough or sore throat, dyspnoea, or oxygen saturation below 95% and difficulty breathing. This corresponds to a portion of the cases of COVID-19, as many individuals have milder symptoms or are even asymptomatic. Thus, we can consider the computed under-reporting rates as conservative, since they only consider symptomatic and hospitalized cases of the disease.

For this same reason, we believe that the results are better characterized for under-reporting of deaths than of cases. This is reasonable since people who died are much more likely to have been hospitalized and, therefore, to be present in SARI data. It becomes clear when looking at Tables 5 and 6. The cases reported by the Ministry of Health mostly exceed those determined by novelty. Conversely, the number of deaths found by novelty is sometimes even higher than the number presented by the Ministry of Health.

An important observation that must be highlighted is the relationship between under-reporting and the impact of COVID-19 on the Health System. When health surveillance fails to identify cases, at times due to under-reporting, it becomes more difficult to control the dissemination of the disease. As a result, the dynamics and the complexity of the disease change, and the Health System is overloaded. A consequence is that people are precluded from getting proper treatment not just for COVID-19 but also for other diseases, leading to an increase in deaths without medical assistance and from ill-defined causes compared to previous years [8].

This paper estimates the rates of under-reporting of cases and deaths in the states of Brazil. The methodology studies the time series of hospitalized SARI cases as a proxy variable for COVID-19. The paper contributes by combining data analytics (event detection methods) and time series modeling (inertia and novelty concepts). Data analytics ensures transparency and consistency in the choice of the adopted parameters. In turn, novelty and inertia enable an understandable approach to estimating under-reporting.

COVID-19 causes a rupture in the SARI series inertial behavior, changing the statistical properties of the time series. Event detection techniques identify this rupture. Assuming that the change that occurred is due to COVID-19, the computed novelty then corresponds to estimates of the values of cases and deaths from the disease. From this, under-reporting rates were computed for both cases and deaths.
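To make the idea of inertia and novelty concrete, the following Python sketch computes an expected weekly series from previous years and accumulates the excess observed after a detected rupture. It is only an illustration under stated assumptions: the smoothing factor `alpha`, the per-week averaging across years, the clipping of negative differences at zero, and all function names and numbers are hypothetical and are not taken from the paper's implementation.

```python
def seasonal_ema(history_by_year, alpha=0.5):
    """Expected value per epidemiological week.

    history_by_year: list of yearly series (oldest first), one value per week.
    For each week, an exponential moving average is taken across years,
    so more recent years weigh more (alpha is an assumed smoothing factor).
    """
    expected = list(history_by_year[0])
    for year in history_by_year[1:]:
        expected = [alpha * obs + (1 - alpha) * prev
                    for obs, prev in zip(year, expected)]
    return expected


def cumulative_novelty(observed, expected, start_week):
    """Sum of (observed - expected) from start_week (0-based) onward.

    Negative differences are clipped at zero (an assumption of this sketch).
    """
    pairs = zip(observed[start_week:], expected[start_week:])
    return sum(max(o - e, 0.0) for o, e in pairs)


# Hypothetical three-week series for two historical years and the current year.
history = [[10, 10, 10], [20, 20, 20]]
expected = seasonal_ema(history, alpha=0.5)
observed = [15, 25, 40]
print(expected, cumulative_novelty(observed, expected, start_week=1))
```

Here `start_week` plays the role of the week flagged by an event detection method; only the excess after that week is attributed to the new disease.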

The rates of under-reporting of cases were estimated for all states except for Mato Grosso do Sul. The values vary between 0.124 (Espírito Santo) and 1.811 (Minas Gerais), thus reaching almost two under-reported cases for each notified case. In most states, the novelty observed by our SARI analysis is lower than the number of cases reported by the Ministry of Health. This is expected, since many diagnosed cases of COVID-19 are asymptomatic.

Under-reporting rates for deaths were estimated for 25 of the 27 states in Brazil. For the states of Acre and Mato Grosso do Sul, under-reporting was not verified, and, therefore, death rates were not calculated for these states. Rates vary between 0.072 (Espírito Santo) and 0.983 (Rio Grande do Sul), thus indicating that there may be almost twice as many deaths as reported. The novelties for deaths using SARI analysis in the states are commonly higher than the deaths notified by the Ministry of Health. This helps to corroborate the justification that the death rates are better estimated, since SARI covers most of the individuals who die.

No pattern of behavior was observed for the events detected or for the evolution and values of under-reporting rates between states in the same Brazilian region. Therefore, the states behave in different and independent ways concerning the occurrence/notification of COVID-19. The analysis for each state allows state authorities to make strategic decisions about avoiding the spread of the disease in each geographic area.

The methodology developed in this paper can be adapted to estimate the under-reporting rate for other diseases, as long as there exists a proxy variable that presents an inertial behavior. Besides, the methodology can also support the detection of outbreaks, as it combines event detection and inertia concepts.

Funding BP, LB, FP were supported by CNPq. RS was supported by CAPES (finance code 001). MP was supported by FAPERJ. EO was supported by both CNPq and FAPERJ. The content is solely the responsibility of the authors. It does not necessarily represent the official views of the funding agencies. The funding agencies had no role in the study design, data collection, and analyses, decision to publish, or preparation of the manuscript.

Availability of data and materials The datasets analyzed during the current study and additional documentation are freely and openly available. They correspond to weekly aggregates of anonymized records of patients contained in the SIVEP-Gripe. The Ministry of Health of Brazil is committed to respecting the ethical precepts and guaranteeing the privacy and reliability of the data. The continuously updated SARI data was obtained from the GitLab repository of InfoGripe at https://gitlab.procc.fiocruz.br/mave/repo/-/blob/master/Dados/InfoGripe/dados_semanais_faixa_etaria_sexo_virus.csv. In this paper, we used a copy of InfoGripe made on July 27th, 2020. It can be accessed on the GitHub repository at https://github.com/balthapaixao/Covid19_BR_underreport/tree/master/Aux_arqs.

A survey of methods for time series change point detection

COVID-19 and hospitalizations for SARI in Brazil: a comparison up to the 12th epidemiological week of 2020

Modeling and forecasting the early evolution of the Covid-19 pandemic in Brazil

The Covid-19 (sars-cov-2) uncertainty tripod in Brazil: assessments on model-based predictions with large under-reporting

The coronavirus pandemic in five powerful charts

Anomaly detection: a survey

Seasonal trends of viral respiratory tract infections in the tropics

Óbitos em excesso, dentro e fora de hospitais, mostram quadro de desassistência á saúde no município do rio de janeiro

Multiple change point analysis: fast implementation and strong consistency

Seasonality of infectious diseases and severe acute respiratory syndrome-what we don't know can hurt us

Time-series data mining

Basic Econometrics, 4th edn

Outlier detection for temporal data: a survey

Covid-19 epidemiological surveillance guide

Special epidemiological bulletin 14: Coronavirus Disease

Level of underreporting including under diagnosis before the first peak of COVID-19 in various countries: preliminary retrospective results based on wavelets and deterministic modeling

A novel coronavirus associated with severe acute respiratory syndrome

Outlier (anomaly) detection modelling in PMML

Correcting under-reported COVID-19 case numbers: estimating the true scale of the pandemic

COVID-19 in Brazil

Seasonality of respiratory viral infections

Adaptive normalization: a novel data normalization approach for non-stationary time series

R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing

Estimate of underreporting of COVID-19 in Brazil by Acute Respiratory Syndrome hospitalization reports

Characterization of a novel coronavirus associated with severe acute respiratory syndrome

The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak

Time Series Analysis and Its Applications: With R Examples

A Bayesian analysis of the total number of cases of the Covid 19 when only a few data is available. A case study in the state of Goias

A unifying framework for detecting outliers and change points from time series

Seasonal pattern of hospitalization from acute respiratory infections in Yaoundé. Cameroon

Risk factors of critical & mortal COVID-19 cases: a systematic literature review and meta-analysis

The authors thank CNPq, CAPES (finance code 001), and FAPERJ for partially funding this research.

Author Contributions All authors contributed equally to the study. EO conceptualized the study design. MP and RFS acquired the data. BP and LB conducted data analysis and interpretation. RS, LE, CS, RC, FP, and JS revised it critically for intellectual content. All authors approved the final version. The dataset used in this paper has not been reported in any other submission by us or anyone else.

The authors are committed to keeping the under-reporting rates updated. This means that the under-reporting rates will be recalculated periodically, provided that new data referring to SARI are made available by InfoGripe. The new under-reporting rates to be included will undergo the same methodological process described in this paper.

The authors declare that they have no competing interests.

Ethics approval and consent to participate DATASUS provided the datasets used in this study. They were produced by aggregating and anonymizing all personal information of SARI registers contained in the SIVEP-Gripe. The Ministry of Health of Brazil is committed to respecting the ethical precepts and guaranteeing the privacy and reliability of the data.