key: cord-323074-u3bs5sj0 authors: Garcia, L. P.; Goncalves, A. V.; de Andrade, M. P.; Pedebos, L. A.; Vidor, A. C.; Zaina, R.; Canto, G. d. L.; de Araujo, G. M.; Amaral, F. V. title: ESTIMATING UNDERDIAGNOSIS OF COVID-19 WITH NOWCASTING AND MACHINE LEARNING: EXPERIENCE FROM BRAZIL date: 2020-07-02 journal: nan DOI: 10.1101/2020.07.01.20144402 sha: doc_id: 323074 cord_uid: u3bs5sj0
Background: Brazil has the second largest number of COVID-19 cases worldwide. Even so, underdiagnosis in the country is massive. Nowcasting techniques have helped to overcome underdiagnosis. Recent advances in machine learning techniques offer opportunities to refine nowcasting. This study aimed to analyze the underdiagnosis of COVID-19, through nowcasting with machine learning, in a state capital in the South of Brazil.
Methods: The study has an observational ecological design. It used data from 3916 notified cases of COVID-19, from April 14th to June 2nd, 2020, in Florianópolis, Santa Catarina, Brazil. We used a machine learning algorithm to classify cases that had no diagnosis yet, producing the nowcast. To analyze the underdiagnosis, we compared the data without nowcasting with the median of the nowcasted projections, both for the entire period and for the six days from the date of onset of symptoms to diagnosis at the moment of data extraction.
Results: The number of new cases throughout the entire period, without nowcasting, was 389. With nowcasting, it was 694 (95% UI 496-897). In the six-day period, the number was 19 without nowcasting and 104 (95% UI 60-142) with nowcasting. The underdiagnosis was 37.29% in the entire period and 81.73% in the six-day period.
Conclusions: The underdiagnosis was more critical in the six days from the date of onset of symptoms to diagnosis before the data collection than in the entire period. The use of nowcasting with machine learning techniques can help to estimate the number of new cases of the disease.
The World Health Organization has reported more than 10 million cases of SARS-CoV-2 infection and 500,000 deaths, 1 a significant part of which occurred in Brazil. According to the Brazilian Ministry of Health, the country has surpassed 1.3 million cases and 58 thousand deaths, 2 which is in line with the Imperial College London prediction of a growing number of deaths caused by COVID-19. 3 Brazil has the largest number of deaths among Latin American countries. 4 The Lancet recently dedicated an editorial to the political and sanitary disaster that is devastating the country. 5 Despite the already alarming numbers, the editorial 5 and other studies 6 have drawn attention to the possibility of a large number of underdiagnosed cases. One of the causes of underdiagnosis is the low testing rate of suspected individuals: 4.71 tests per thousand inhabitants. 7 This rate is much lower than that of countries like Iceland (184.11), the United States (66.76), Chile (30.01) and South Africa (16.34). 7 Dealing with underdiagnosis is essential so that appropriate actions can be taken to reverse the progression of deaths in the country. 8
Many countries are using a combination of containment and mitigation activities to stem the progression of SARS-CoV-2 and thus manage the demand for hospital beds. 9 Nonpharmacological measures have been shown to be effective in controlling the transmission of COVID-19. [10] [11] [12] [13] [14] They can reduce the impact on the health system, giving managers time to organize the system properly. These measures also reduce the need for hospitalization for other conditions that could compete for beds with SARS-CoV-2 patients. 15 In addition, they increase the chance that a substantial number of people will not be infected until a treatment and a vaccine are developed.
medRxiv preprint, not certified by peer review, posted July 2, 2020. https://doi.org/10.1101/2020.07.01.20144402. Made available under a CC-BY-NC-ND 4.0 International license; the copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
In outbreak situations, in which rapid changes are common, the actual number of infected cases must be closely monitored. Artefactual variations produced during the monitoring process should be distinguished from real variations in the number of cases. 8 One such artefact, for example, is a testing capacity lower than the volume of suspected case notifications. If the number of individuals notified as suspects is much higher than the testing capacity at a given time, the difference can cause an underdiagnosis of current cases. Data on pathogen transmissibility and the susceptibility of the exposed population, on population density and the demographic characteristics of the affected population, as well as on the temporal and spatial distribution of cases and population mobility, can contribute to the correction of such artefacts. 16 The natural history of the disease, in turn, is an important factor in determining the optimal frequency of case count updates in monitoring. Rapidly progressing diseases like COVID-19 require daily updates, while monthly updates may be sufficient for others with slower progression, such as HIV/AIDS. More frequent analysis may also be necessary at times when transmissibility is expected to be changing, for example when control actions are initiated, enhanced, or stopped. 16
Nowcasting approaches try to estimate the number of a given event in the present. 8, 10, 17 This strategy has been used to improve surveillance of infectious diseases like AIDS, 18, 19 cholera, 20 influenza infections 8, 21 and, recently, COVID-19. 3, 8, 17, 20, 22 Nowcasting techniques, in general, use time-series predictions. 23-25 Recent advances in machine learning techniques offer opportunities to refine the nowcasting of epidemic behavior. 16 The main objective of machine learning techniques is to produce a model that can be
used to classify, predict, or estimate a phenomenon. This approach is useful in several applications in biomedical research, [26] [27] [28] [29] [30] [31] [32] including concerning 34
Monitoring the impact of non-pharmacological actions is essential to optimize the allocation of scarce resources in non-high-income countries, like Brazil. 16 In these countries, maintaining long quarantine periods is even more challenging due to deficiencies in the social protection system, the economic vulnerability of the population and the large share of people who are informal workers. No single set of interventions is appropriate to all contexts, owing to the combination of these factors with the climatic, demographic and organizational characteristics of each country. 10 Thus, monitoring in near real time should be a key part of the strategy to cope with SARS-CoV-2. Among the challenges for timely monitoring are delays in providing medical care after the onset of symptoms and delays in diagnosis. 8 It is plausible to assume that these challenges are even greater in non-high-income countries, with less comprehensive health systems. To help overcome this challenge, the present study aimed to analyze the underdiagnosis of COVID-19 cases, through nowcasting with machine learning, in a state capital in the South of Brazil.
This project was submitted to the Ethics in Research with Human-Beings Council at the
under CAE nº 33374820.2.0000.0121/2020. We used exclusively secondary and anonymized databases.
The present study has an observational ecological design, using data from notified cases of COVID-19 from April 14th to June 2nd, 2020, in Florianópolis, Santa Catarina, Brazil. We used the random forest 37 machine learning algorithm to classify the notified cases that had no diagnosis yet, producing the nowcast. To analyze the underdiagnosis, we compared the data without nowcasting with the median of the nowcasted projections, both for the entire period of analysis and for the period from May 28th to June 2nd, 2020. The latter corresponds to the six days from the date of onset of symptoms to diagnosis at the moment of data extraction.
Notification of suspected cases of COVID-19 within 24 hours is mandatory in Brazil. 38 From April 14th, 2020, Florianópolis adopted the same notification criterion for COVID-19 as the Brazilian Ministry of Health: fever accompanied by cough, dyspnea, runny nose or sore throat. 38 The cases have been confirmed by real-time reverse-transcriptase polymerase chain reaction (RT-PCR), serological tests or clinical-epidemiological criteria.
We used three data sources for the nowcasting, all from the Public Health Department of Florianópolis: 1) an anonymized database of suspected and confirmed cases among Florianópolis residents; 2) demographic data for the 49 health regions; and 3) traffic data, as a proxy for the movement of people in the municipality.
The following variables were extracted from the anonymized database of suspected and confirmed cases: i) diagnosis (confirmed, discarded or missing), ii) sex, iii) age (in years),
The number of infected people (with a positive diagnosis and fewer than 14 days since symptom onset) and the rate of infected people per 100,000 inhabitants were calculated for the health region where each notified person resides. In addition, the following demographic data from these regions were included in the analysis: i) the total number of inhabitants and the number by sex, ii) the number of persons aged 1 year, 2 years and so on up to 100 years or more, iii) the number of people by race (white, black, yellow, brown, indigenous and ignored), iv) the number of people by years of schooling (from 1 to 17 or more years completed, in addition to literate, non-literate, literate through youth and adult literacy programs, and with uninformed schooling), v) total income per household, average income of households, total income of heads of households, average income of heads of households, total income per person and average income per person. The proportions of male people, people aged 60 years or over, non-white people and people with 10 or fewer years of schooling were calculated as possible indicators of vulnerability.
The average daily traffic on four important avenues was used as a proxy for people's mobility in the city.
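The region-level rate and the vulnerability proportions described above could be computed along the following lines. This is a minimal Python sketch, not the authors' R code: the `HealthRegion` record, its field names and the two example regions are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class HealthRegion:
    """Hypothetical record for one health region (illustrative fields only)."""
    name: str
    inhabitants: int
    active_cases: int      # positive diagnosis, fewer than 14 days since symptom onset
    aged_60_or_over: int


def rate_per_100k(cases: int, population: int) -> float:
    """Cases per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


def proportion(part: int, whole: int) -> float:
    """Share of a subgroup in the regional population."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Made-up regions, not data from the study.
regions = [
    HealthRegion("Region A", inhabitants=12_000, active_cases=6, aged_60_or_over=1_800),
    HealthRegion("Region B", inhabitants=8_000, active_cases=2, aged_60_or_over=2_000),
]

indicators = {
    region.name: {
        "infected_per_100k": rate_per_100k(region.active_cases, region.inhabitants),
        "share_60_or_over": proportion(region.aged_60_or_over, region.inhabitants),
    }
    for region in regions
}
```

Each notified case would then inherit the indicators of the health region in which the person resides.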
We hypothesize that there is a lag between an increase in mobility and the identification of an increase in cases, so we used the average traffic on the day of symptom onset of the notified cases and the daily averages lagged up to the thirteenth day. There was no imputation for missing data.
To compare the characteristics of people with a confirmed and a discarded diagnosis of COVID-19, the t test was used for continuous variables and the chi-square test for categorical variables, adopting a p-value < 0.05 as the threshold of statistical significance.
We used the random forest to carry out the nowcasting. The database was initially split into the training-validation-test database, formed by cases whose diagnosis (confirmed or discarded) was known, and the prediction database, formed by cases that had no diagnosis. The training-validation-test database was then divided into a training-validation database and a test database, using 70% and 30% of the data, respectively. The training-validation database was subjected to undersampling to improve the balance of the sample, as the number of discarded cases was much higher than the number of confirmed cases.
The balanced training-validation database was used to perform the feature selection and hyperparameter tuning. Nested cross-validation was performed with 5 folds, both in the inner and in the outer loop. The feature selection and hyperparameter tuning were performed simultaneously in the inner loop, using a random search to maximize the accuracy. Folds were balanced with respect to the outcome. Table 1 shows the range for feature selection and hyperparameters used. We analyzed the training and validation results.
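The 70/30 split followed by undersampling of the training-validation database can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' R implementation: the record layout, the function names `split_train_test` and `undersample`, the simple (non-stratified) random split and the synthetic class ratio are all hypothetical.

```python
import random
from collections import Counter


def split_train_test(cases, train_fraction=0.7, seed=0):
    """Shuffle labelled cases and split them into training-validation and test sets."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def undersample(cases, label_key="diagnosis", seed=0):
    """Randomly drop cases from the larger classes until all classes have equal size."""
    rng = random.Random(seed)
    by_label = {}
    for case in cases:
        by_label.setdefault(case[label_key], []).append(case)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Synthetic labelled cases (one confirmed for every four discarded); not study data.
labelled = [
    {"id": i, "diagnosis": "confirmed" if i % 5 == 0 else "discarded"}
    for i in range(1000)
]

train_val, test = split_train_test(labelled)
balanced_train_val = undersample(train_val)   # only the training-validation set is balanced
class_counts = Counter(case["diagnosis"] for case in balanced_train_val)
```

Leaving the test set untouched keeps its class mix close to what the classifier would face on undiagnosed cases, which is why only the training-validation portion is balanced here.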
The model with the best fit was used for classification in the test database. The test database was not undersampled, so that it reflected the prediction database as closely as possible. Finally, the cases were classified as confirmed or discarded, based on the predictions. We repeated the resampling of the databases and the training and testing of the algorithms 1000 times to determine the 95% Uncertainty Intervals (UI) and the median of accuracy, sensitivity and specificity, in addition to the final classification of cases.
The underdiagnosis was analyzed by the difference between the median of the number of cases predicted by the model (incidence with nowcasting) and the number of cases diagnosed by the Public Health Department of Florianópolis (incidence without nowcasting). This analysis was carried out for the entire period and for the period from May 28th to June 2nd, 2020. The number of cases was also smoothed by a LOESS 39 regression, and the cumulative numbers, without and with nowcasting, were presented graphically by day of symptom onset.
All analyses were performed using the software R v.3.6.3. Anonymous scripts and databases are available at: https://github.com/lpgarcia18/underdiagnosis_of_covid_19_cases_in_brazil
During the analysis period, 3916 individuals residing in Florianópolis were reported as suspected cases of COVID-19. Among all notified individuals, 603 had a positive diagnosis, 2413 had a discarded diagnosis and 900 had no diagnosis yet. The association of individual characteristics, health regions and displacement of people with confirmed or discarded cases can be seen in Table 2 and in the Supplement.
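The summary of the repeated runs described in the methods (a median with a 95% UI) and the underdiagnosis comparison can be sketched as follows. This is a hedged Python illustration, not the authors' R scripts: the percentile-based interval, the function names and the simulated nowcast counts are assumptions, and the underdiagnosis formula (relative difference with the nowcast median as denominator) is inferred from the six-day figures reported in the abstract rather than taken from the authors' code.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize(nowcast_counts):
    """Median and 2.5th/97.5th percentiles of the counts from repeated runs."""
    ordered = sorted(nowcast_counts)
    return statistics.median(ordered), percentile(ordered, 2.5), percentile(ordered, 97.5)


def underdiagnosis_pct(observed, nowcast_median):
    """Assumed definition: share of nowcasted cases that were not yet diagnosed."""
    return (nowcast_median - observed) / nowcast_median * 100


# Simulated counts standing in for 1000 repetitions; not study data.
rng = random.Random(42)
simulated_counts = [round(rng.gauss(100, 20)) for _ in range(1000)]
median, lower, upper = summarize(simulated_counts)

# Six-day counts reported in the abstract: 19 diagnosed, nowcast median 104.
six_day_underdiagnosis = underdiagnosis_pct(19, 104)
```

Under this assumed definition, the six-day counts give about 81.73%, which agrees with the value reported for that period.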
The group of individuals with a positive result for SARS-CoV-2 had earlier symptom onset dates and later notification dates than individuals with negative results. There was also a difference in the distribution by sex and race between the two groups. The average age among confirmed cases was higher than among discarded cases. There was a heterogeneous distribution of confirmed and discarded cases among Table 3.
The incidence without nowcasting throughout the entire period was 389 new cases. With nowcasting it was 694 (95% UI 496-897). In the period from May 28th to June 2nd, 2020, the incidence without nowcasting was 19 new cases, and with nowcasting it was 104 (95% UI 60-142) (Table 3). Thus, the underdiagnosis was 37.29% in the entire period and 81.73% in the six days from the date of onset of symptoms to diagnosis at the moment of data extraction. The difference in the progression of new cases with and without nowcasting can be seen in Figure 1.
non-high-income countries due to the magnitude of underreporting 40 Even so, the number of COVID-19 cases has grown rapidly in Brazil, and the country currently has the second largest number of cases in the world. 7 The city of Florianópolis has, so far,
Maintaining strong social distancing measures for long periods, however, may not be sustainable. These restrictions have already caused a slowdown in the world economy. 43
Research in non-high-income countries shows an average 70% fall in income and a 30% fall in consumption expenditure. 42 Strategies that combine more restrictive periods with moments of relaxation of these restrictions have been identified as ideal for countries with few resources. 44 Interleaving periods of greater restriction of social contact with periods of relaxation, accompanied by intensified testing, case isolation, contact tracing and protection of vulnerable people, can allow a return to minimum coexistence between people and the resumption of economic production. 44
Florianópolis has carried out more than 10,000 tests so far, 45 that is, more than 20 tests per thousand inhabitants, more than 4 times the national average. Even with this greater number of tests, which should reduce the impact of underdiagnosis in the municipality, there is a great disparity between the number of new cases confirmed by the municipality and the number predicted by the nowcasting. The underdiagnosis was more pronounced in the most recent period of analysis, which shows the significance of underdiagnosis in the six days between the date of onset of symptoms and the date of diagnosis prior to the data collection. The underdiagnosis, probably produced by the mismatch between the onset of symptoms and the time of testing, may interfere with the current estimate and with future projections of the disease incidence. In this context, the use of machine learning techniques can help enable adequate monitoring of the number of new cases and better decision making. 46
The algorithm performed better in detecting negative cases (specificity) than positive cases (sensitivity). In this sense, a greater number of false positives are expected compared to false negatives, and the interpretation of nowcasting should take this into account.
A greater amount of individual data, such as data related to symptomatology, can improve model sensitivity. In addition, an association of SARS-CoV-2 infection rates with climatic factors has been described. 47, 48 The introduction of these data may also help and should be considered in future studies.
The present study demonstrated the underdiagnosis of cases of COVID-19 in Florianópolis. The underdiagnosis was more pronounced in the six days before the data collection than in the entire period, corresponding to an artefact in the monitoring that was probably caused by a greater capacity for notification than for testing.
FUNDING
This work has not received any financial support. Thus, there is no funding interest in the study design, data collection, data analysis, data interpretation, writing of the manuscript, or in the decision to submit the manuscript for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding sources.
The authors have no competing interests.
Legend: the shaded area corresponds to the 95% Uncertainty Interval.
WHO. Coronavirus disease
Short-term forecasts of COVID-19 deaths in multiple countries
Interim Analysis of Pandemic Coronavirus Disease 2019 (COVID-19) and the SARS-CoV-2 virus in Latin America and the Caribbean: Morbidity, Mortality and Molecular Testing Trends in the Region. medRxiv
Characterization of the COVID-19 pandemic and the impact of uncertainties, mitigation strategies, and underreporting of cases in South Korea, Italy, and Brazil
Covid-19 Coronavirus Pandemic
Nowcasting by Bayesian Smoothing: A flexible, generalizable model for real-time epidemic tracking
COVID-19: towards controlling of a pandemic
Nowcasting and Forecasting the Spread of COVID-19 and Healthcare Demand In Turkey, A Modelling Study. medRxiv
Impact of nonpharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Impact assessment of nonpharmaceutical interventions against coronavirus disease 2019 and influenza in Hong Kong: an observational study
Effect of non-pharmaceutical interventions to contain COVID-19 in China
Implementation of Mitigation Strategies for Communities with Local COVID-19 Transmission. 2020.
15. IHME COVID-19 health service utilization forecasting T. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. medRxiv
Real-time epidemic forecasting: challenges and opportunities. Heal Secur
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Regression Analysis of Censored and Truncated Data: Estimating Reporting-Delay Distributions and AIDS Incidence from Surveillance Data
Real-Time Estimation of the Risk of Death from Novel Coronavirus (COVID-19) Infection: Inference Using Exported Cases
Nowcasting the number of new symptomatic cases during infectious disease outbreaks using constrained P-spline smoothing. Epidemiology. 2019.
25. Review of "Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study"
The whole is greater than the sum of its parts: combining classical statistical and machine intelligence methods in medicine
Machine learning classifies cancer
Probabilistic machine learning and artificial intelligence
Machine learning: Trends, perspectives, and prospects. Science (80-)
Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil. Chaos, Solitons & Fractals; 2020.
34. Chimmula VKR, Zhang L. Time Series Forecasting of COVID-19 transmission in Canada Using LSTM Networks
Random Forests
Emergência de Saúde Pública de Importância Nacional pela Doença pelo Coronavírus 2019. Vigilância integrada de Síndromes Respiratórias Agudas Doença pelo Coronavírus
Available from: ourworldindata.org. 2020.
41. Brasil. Painel de Casos de doença pelo coronavirus 2019 (COVID-19) no Brasil pelo Ministério da Saúde
Mathematical modelling of COVID-19 transmission and mitigation strategies in the population of Ontario, Canada. CMAJ; 2020.
43. Baldé MAMT. Fitting SIR model to COVID-19 pandemic data and comparative forecasting with machine learning
The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study
Remuzzi A, Remuzzi G. COVID-19 and Italy: what next? Lancet
Temperature and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19
Table 3 column headers: Incidence - Entire period (UI 95%); Incidence May 28th