key: cord-0607127-tdor9e3t authors: Silva, Nívea B. da; Valencia, Luis Iván O.; Filho, Fábio M. H. S.; Ferreira, Andressa C. S.; Pereira, Felipe A. C.; Oliveira, Guilherme L. de; Oliveira, Paloma F.; Rodrigues, Moreno S.; Ramos, Pablo I. P.; Oliveira, Juliane F. title: Brazilian COVID-19 data streaming date: 2022-05-10 journal: nan DOI: nan sha: 9d522b4e54b599530685e65e70d14cc19a48e6eb doc_id: 607127 cord_uid: tdor9e3t

We collected individualized (unidentifiable) and aggregated openly available data from various sources related to suspected/confirmed SARS-CoV-2 infections, vaccinations, non-pharmaceutical government interventions, human mobility, and levels of population inequality in Brazil. In addition, a data structure allowing real-time data collection, curation, integration, and extract-transform-load processes for different objectives was developed. The granularity of this dataset (state- and municipality-wide) enables its application to individualized and ecological epidemiological studies, statistical, mathematical, and computational modeling, data visualization, as well as the scientific dissemination of information on the COVID-19 pandemic in Brazil.

Figure: The objectives of the Platform for Analytical Models in Epidemiology (PAMEpi).

Due to the continental size of Brazil, which ranks among the countries with the highest numbers of reported COVID-19 cases and deaths, the collection and curation of primary data on health, human mobility and non-pharmaceutical interventions, as well as information on the social and economic conditions of the population, are challenging tasks. PAMEpi's aims include disease modeling, dissemination of scientific information on the COVID-19 pandemic at individual or aggregated levels, and the promotion of multi-national cooperative studies, for which clean, harmonised and normalised data are a fundamental resource 15, 16.
Considering the relevance of open science, the dataset we produced is essential to inform timely policy responses and aid in the understanding of the pandemic's impact on other outcomes. In addition, this data architecture offers great potential for reuse by Brazilians and the international community, enabling comparative studies. We collected individualized and aggregated openly available data from various sources related to suspected/confirmed SARS-CoV-2 infections, vaccinations, non-pharmaceutical government interventions, human mobility and deprivation index values. All individualized and unidentifiable information related to suspected and confirmed SARS-CoV-2 infections and vaccinations (Vacdb) is available via the OpenData SUS system, provided by the Brazilian Ministry of Health (MoH) [8] [9] [10] . SARS-CoV-2 infections are notified via two databases: the Flu Syndrome Database (FSdb), which contains all suspected and confirmed mild-to-moderate COVID-19 cases, and the Severe Acute Respiratory Syndrome Database (SARSdb), containing cases of severe-to-critical infection. FSdb, SARSdb and Vacdb are licensed under a Creative Commons Attribution License (CC BY v4.0) and no ethical approval by an institutional review board was required to use this data (Brazilian National Health Council Resolutions 466/2012 and 510/2016, Article 1, Sections III and V). All data can be freely downloaded directly at the OpenData SUS website [8] [9] [10] or our Python algorithm can be used to download all data documented in this manuscript. It is important to mention that, in accordance with the Brazilian General Data Protection Law -13.709/2018 (LGPD), some personally identifiable information has been suppressed from the datasets provided by the OpenData SUS portal, which reduces the total number of variables described in the respective data dictionaries. 
With regard to aggregated data, Google Mobility enables users to access information on daily human mobility patterns at grocery stores and pharmacies, parks, transit stations, retail and recreation entities, residences and workplaces. In addition, historical series detailing average human flows on roads, rivers and via air are made freely available by the Brazilian Institute of Geography and Statistics (IBGE) 6, 7. Other aggregated information was provided through a cooperative agreement with the legal information start-up JusBrasil 12, which allowed automatic retrieval of Brazilian decrees describing non-pharmaceutical measures enacted by state or municipal governments to contain the pandemic. This database was obtained as raw text files and was subjected to human curation within PAMEpi, with the enriched output made freely available as text and as stringency values representing government containment efforts at state or municipal levels. Lastly, we also gathered information on community-level deprivation measures to assess socioeconomic inequalities. Together, these collected data form what we call the Brazilian COVID-19 Data Lake. Figure 2 schematically illustrates the data sources and their respective characteristics.

For the creation of the data lake, all raw data were obtained in their original file format (usually CSV text files). The deprivation index and intercity mobility datasets are static, with updates expected only following census tabulation. Data from the other sources may be updated daily, weekly or monthly, depending on the respective source. Python scripts were written to create and update the data lake automatically 17. Users can download and organize data locally, on an isolated partition or in an online repository, and can run these tasks manually or schedule them as system tasks. Due to the large volume of individualized information stored across several databases, an API available from the OpenData SUS portal can be used to update the information in the data lake.
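The update cadences described in this manuscript (weekly for FSdb, SARSdb and Vacdb, daily for Google mobility, static for the deprivation index and intercity mobility) suggest a simple way to decide which sources a scheduled task should refresh. The sketch below is illustrative only and is not taken from the PAMEpi scripts: the source labels, the `REFRESH_DAYS` table and the `sources_due` function are hypothetical names.

```python
from datetime import date, timedelta

# Refresh cadence per source, in days; None marks a static source.
# Cadences follow the text; the labels themselves are hypothetical.
REFRESH_DAYS = {
    "FSdb": 7,
    "SARSdb": 7,
    "Vacdb": 7,
    "google_mobility": 1,
    "deprivation_index": None,
    "intercity_mobility": None,
}


def sources_due(last_fetched, today):
    """Return sources whose local copy is missing or older than its cadence."""
    due = []
    for name, cadence in REFRESH_DAYS.items():
        fetched = last_fetched.get(name)
        if fetched is None:
            due.append(name)  # never downloaded, so fetch it once
        elif cadence is not None and today - fetched >= timedelta(days=cadence):
            due.append(name)  # local copy is older than the source cadence
    return due
```

A scheduled task could call `sources_due` with the dates of the last successful downloads and fetch only the returned sources; static sources are fetched once and then left alone.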
The next step involves data curation. For this, we 1) examine field names, variable types and categorical values (if applicable) and compare them to previous versions of the data at each update; 2) perform semantic enrichment, data harmonization and cleaning to facilitate the use of information from multiple databases; 3) normalize data, to better structure the dataset by reducing data redundancy and improving data integrity; and 4) generate metadata. The environment used to load and process these datasets was built in Python (using PySpark), and all code for these tasks is available from the PAMEpi GitHub code repository 17.

The last stage involves formatting data for analysis and modelling. Both individualised and aggregated data analyses can be performed according to the goals of each researcher. As standard practice, we generate datasets aggregated at state, health region and municipal levels throughout Brazil. This comprises one of the most important outputs of our data architecture, which facilitates epidemiological studies, data visualization, health surveillance, etc. Figure 3 details the data collection, curation and ETL operations of the PAMEpi data architecture.

In the following two sections, we describe key information about the datasets that comprise the Brazilian COVID-19 Data Lake. Although the PAMEpi data architecture utilizes a limited number of datasets, the structure described is adaptable and can integrate any other datasets needed to address public health challenges. All generated metadata and dataset descriptions are freely available via our online platform (https://pamepi.rondonia.fiocruz.br/en/data_en.html). To facilitate visualisation, we have provided a data explorer that allows users to view the first rows of each dataset along with metadata, including column descriptions, variable type, and variable harmonisation where applicable.
This also allows broader re-use of this dataset, particularly since the original descriptors and data dictionaries are usually only available in Portuguese. The advent of the COVID-19 pandemic in March 2020 prompted the MoH to implement a scheme to report mild-to-moderate suspected cases of COVID-19, denominated the Flu Syndrome database (FSdb). The MoH characterizes a patient as suspected of having influenza syndrome (IS) if at least two of the following symptoms are present: fever (measured or reported), chills, sore throat, headache, cough, runny nose, olfactory or taste abnormalities. In children, nasal congestion is another condition warranting consideration. In elderly individuals, syncope, mental confusion, excessive drowsiness, irritability, and loss of appetite may also constitute relevant symptoms 18 . The Severe Acute Respiratory Syndrome (SARS) database (SARSdb) was created by the MoH through the Health Surveillance Secretariat after the last influenza virus subtype A (H1N1) pandemic in 2009. This database contains reports of influenza and other respiratory viruses, which were previously recorded only via influenza syndrome sentinel surveillance. After March 2020, severe-to-critical COVID-19 cases also began to be integrated into this database. According to the MoH, cases are reported in SARSdb if one of the following symptoms is present in addition to two other IS symptoms: shortness of breath/breathing difficulties, persistent chest discomfort or pain, O 2 saturation <95%, or bluish discolouration (cyanosis) of the lips or face. In addition, tachypnea, hypoxemia, cyanosis, intercostal retraction, dehydration, and loss of appetite are relevant symptoms among children. In the elderly, anosmia, ageusia, diarrhea, abdominal pain, myalgia or symptoms of exhaustion may warrant consideration in determining inclusion in SARSdb 18 . Importantly, all cases reported in SARSdb required hospitalization due to disease severity. 
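The two general notification rules above can be written as a small decision function. This is a minimal sketch for illustration, not the MoH algorithm and not a clinical tool: it encodes only the general rules (at least two IS symptoms for FSdb; one severe sign plus two IS symptoms for SARSdb) and ignores the age-specific symptoms for children and the elderly. The symptom labels and the `suggested_database` function are hypothetical, and treating olfactory or taste abnormalities as a single item is an assumption.

```python
# Influenza syndrome (IS) symptoms listed in the text; labels are hypothetical.
IS_SYMPTOMS = frozenset({
    "fever", "chills", "sore_throat", "headache", "cough",
    "runny_nose", "smell_or_taste_disorder",
})

# Signs that, together with two IS symptoms, lead to a SARSdb report.
SEVERE_SIGNS = frozenset({
    "shortness_of_breath", "persistent_chest_pain",
    "low_oxygen_saturation",  # O2 saturation below 95%
    "cyanosis",
})


def suggested_database(findings):
    """Return "SARSdb", "FSdb" or None for a set of reported findings."""
    findings = set(findings)
    n_is = len(findings & IS_SYMPTOMS)
    has_severe_sign = bool(findings & SEVERE_SIGNS)
    if has_severe_sign and n_is >= 2:
        return "SARSdb"
    if n_is >= 2:
        return "FSdb"
    return None  # does not meet either general definition
```

For example, `suggested_database({"fever", "cough"})` returns `"FSdb"`, while adding `"low_oxygen_saturation"` to the same set returns `"SARSdb"`.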
A COVID-19 case can be diagnosed via three methods: clinical-epidemiological investigation, laboratory testing or by imaging. Clinical-epidemiological diagnosis is based on a patient's clinical manifestations and possible contact with other infected individuals 14 days prior to the onset of symptoms; laboratory diagnosis can be confirmed by molecular testing (real-time PCR), serology (ELISA, CLIA or ECLIA) or by rapid tests; imaging by high resolution computed tomography can also provide evidence of COVID-19 disease 18 . FSdb and SARSdb contain information provided by private and public health institutions, which report suspected and confirmed cases via the e-SUS NOTIFICA system. FSdb is organized according to each federal unit (i.e., state) of the Brazilian federation, while SARSdb files are organized annually (one per year). FSdb is available in .csv format, with a total of 30 variables and a size of around 15 GB in December 2021. In turn, SARSdb, also available in .csv format, contains 161 variables and a size of around 2 GB in December 2021. Variables provide spatial and temporal information on reported cases, as well as patient clinical (symptoms, comorbidities, etc.) and demographic (age, municipality of residence, etc.) data ( Figure 4) . Additionally, SARSdb also contains each patient's vaccination status against influenza or COVID-19 (vaccination cycle, type of vaccine, etc.) and hospitalization information (clinical ward or intensive care unit). FSdb and SARSdb are updated weekly with newly included case information. As of November 2021, more than 21% of the cases registered in FSdb were confirmed as COVID-19 (13,895,387 registries), with more than 67% resulting in cure and less than 1% in deaths ( Figure 5 ). By contrast, SARSdb contained reports of more than two million suspected severe-to-critical cases. Of these, 66.1% were confirmed to be COVID-19, with 32.4% being admitted to an ICU and a resulting death rate >50%. 
Among patients who were hospitalized but not admitted to ICU, 17.5% died from COVID-19, and more than 74% achieved cure ( Figure 5 ). Assigning a definitive classification to cases of suspected SARS-CoV-2 infection remains a challenge for Brazilian health authorities, as evidenced by high percentages of indeterminate cases in FSdb and SARSdb ( Figure 5 ). As is typical for other diseases in Brazil (such as arboviruses and leprosy) 19, 20, many cases go undetected, and those reported in health databases are usually not updated with a final classification due to the huge number of registered patients and the lack of laboratory facilities and testing. This limitation affects the estimation of true disease prevalence and the implementation of appropriate control measures. Under this scenario, mathematical, statistical and computational methodologies have been extensively employed to address deficiencies in the information contained in Brazilian databases and to provide more reliable estimates of true disease prevalence by inferring diagnosis status for cases not definitively classified in the database records 20-22.

All data related to the vaccination campaign are made available by the MoH through the National Immunization Program's information system. The OpenData SUS Vacdb provides information on the doses administered (vaccine type, patient vaccination status) as well as patient demographic characteristics. Additional information related to administrative codes (system patient identifiers, IBGE or vaccine manufacturer) is also present. The dataset, available in .csv format, contains 30 variables, is updated weekly and, as of December 2021, had a total size of around 80 GB. Table 1 lists the main variables contained in Vacdb.

Google has provided a Community Mobility Report 5 to assess how human mobility has been affected by the spread of COVID-19 since February 2020 21, 25, 26.
The report compares the attendance of people at a given location and day to the median attendance value at the same location for that day of the week, calculated for the pre-pandemic period from January 3 through February 6, 2020 (these values are defined as a baseline). The resulting metric, available for each country and the regions within it, reflects the percentage change in the number of people who visited grocery stores or pharmacies, parks, transit stations, retail and recreation entities, residences, and workplaces compared to the baseline period. Google's mobility database contains 15 variables and is updated daily. The information available is based on a sample of users who have allowed the company to track their location history, which may introduce bias depending on the analysis performed.

In addition, IBGE provides information on the flow of people between Brazilian cities on roads, rivers and via air. Air mobility data indicate the average number of people travelling between the country's airports 7 for the years 2010, 2019 and 2020. Another report contains data on intercity vehicle (bus) transport via roads or rivers estimated for 2016 6. Figure 6 illustrates the Brazilian interstate and intercity flow network (via air, road and river). It is important to note that the IBGE databases do not indicate the direction of flow. Therefore, it is impossible to discern whether planes/vehicles are arriving at or leaving a given city by air/road/river. Moreover, the road/river mobility data only consider the average number of vehicles per week, and do not specify vehicle occupancy rates.

The Brazilian Deprivation Index (BDI) 13, 14 is a composite measure of health and social vulnerability. It represents the most up-to-date area-based measure of deprivation in Brazil, allowing socioeconomic inequalities to be estimated at multiple scales: census tracts, municipalities, states, macroregions, or the entire country.
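As a worked illustration of the Community Mobility metric described above, the sketch below computes a per-weekday baseline as the median over January 3 to February 6, 2020 and expresses a later day's value as a percentage change from that baseline. It assumes hypothetical raw daily visit counts for one location and category; the published reports contain only the resulting percentages, and the function names are illustrative.

```python
from datetime import date
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Median visits per weekday (0 = Monday) over the baseline window."""
    by_weekday = {}
    for day, count in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday.setdefault(day.weekday(), []).append(count)
    return {weekday: median(counts) for weekday, counts in by_weekday.items()}


def percent_change(visits, day):
    """Percentage change of `day` relative to the baseline for its weekday."""
    baseline = weekday_baselines(visits)[day.weekday()]
    return 100.0 * (visits[day] - baseline) / baseline
```

With a baseline of 100 visits for a given weekday, a later day of that weekday with 60 visits yields a change of -40%.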
The BDI combines information on the income, education and household conditions in a given region to measure social inequalities. The index combines the z-scores of three factors: the percentage of families with per capita incomes less than half the minimum wage (US$117); the percentage of illiterates older than seven years; and the percentage of people without adequate access to potable water, sanitation, garbage disposal, bathroom or shower. This metric has limitations, since it comprises three dimensions of deprivation and excludes others such as employment, crime, health, education, and access to public services. Comparisons over time are also not possible, since the data that compose the index were gathered from the 2010 Brazilian census. Because of this, the measure does not need to be frequently updated in our data lake. Still, the BDI contributes essential insights for characterising the impact of the spread of COVID-19 on the Brazilian population 27. The BDI database we collect measures deprivation at the municipal level. It has 15 variables and 5,566 registers (rows), and is currently 734 KB in size. The data are shared under a Creative Commons Share-Alike license, with copyright owned by the authors' institutions (CIDACS/Fiocruz-Bahia and the University of Glasgow). The database, along with its extensive documentation, can be accessed at https://researchdata.gla.ac.uk/980/ or https://cidacs.bahia.fiocruz.br/ibp/painel-en/.

Like most countries, as the epidemic unfolded the Brazilian government began to enact non-pharmaceutical interventions (NPIs) to mitigate the effects of COVID-19, especially the surge in hospitalization of COVID-19 patients. Some of these NPIs were lifted over time following declines in the number of cases and deaths between infection waves.
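To make the z-score construction of the BDI concrete, the sketch below standardizes three indicator columns and combines them per municipality. The text does not state how the three z-scores are combined, so the unweighted sum used here is an assumption, as is the use of the population standard deviation; the function names are illustrative and this is not the published BDI code.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def deprivation_scores(low_income_pct, illiteracy_pct, poor_housing_pct):
    """Combine three indicators into one score per area (assumed: plain sum)."""
    columns = [
        z_scores(column)
        for column in (low_income_pct, illiteracy_pct, poor_housing_pct)
    ]
    # zip(*columns) regroups the standardized values area by area.
    return [sum(area) for area in zip(*columns)]
```

Each argument holds one value per area, in the same order. Higher scores indicate greater deprivation under this construction; an indicator with no variation across areas would need special handling, since its standard deviation is zero.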
Several studies reported the effects of NPIs on the incidence of COVID-19 21, 25, 26, highlighting how, coupled with human mobility patterns, they can be applied to gauge the intensity of government stringency needed to achieve control of viral transmission rates 28. The governmental decrees issued during this period can be extracted from state- and federal-level Official Bulletins (in Portuguese, Diário Oficial). Although the federal-level bulletin can be bulk downloaded in computer-readable XML format (https://www.in.gov.br/acesso-a-informacao/dados-abertos/base-de-dados), the same is not possible for state-level bulletins, which are decentralized and usually do not permit automatic data query or retrieval; this is also the case for municipality-level decrees. As a result of a collaboration agreement with the legal information start-up JusBrasil (http://jusbrasil.com.br), we accessed state- and city-level text files describing the NPIs enacted by governments during the COVID-19 pandemic through JusBrasil's proprietary Application Programming Interface, allowing fast retrieval of COVID-19-related legislation issued during the period. The automatic retrieval was followed by manual human curation, in which each decree was read and categorized, and multiple metadata were systematically derived, including the geographic coverage of the measures and the validity period of the decree. Based on the type of measure processed, a metric was created, defined as the stringency index 25, which combines the different enacted laws to summarize the level of governmental strictness over any given period of the pandemic. The contents of the evaluated decrees (in Portuguese) and the resulting stringency index over time are freely available and regularly updated within the PAMEpi platform at https://pamepi.rondonia.fiocruz.br/en/decree_en.html.
In an attempt to enhance our understanding of the spread of respiratory disease in Brazil, particularly the COVID-19 pandemic, our data lake incorporates multiple databases that can be accessed, downloaded and routinely updated via automated scripts. However, as changes in disease status occur due to human interventions (e.g., changes in mobility patterns, non-pharmaceutical interventions, vaccine campaigns), the source databases may add or remove variables (or even include additional categories within variables). For example, this was observed in SARSdb and Vacdb when new variables were added to account for individuals infected with SARS-CoV-2 who were vaccinated, as well as to incorporate the administration of additional doses during the ongoing national vaccination campaign, respectively. Therefore, regular monitoring of data sources, which we perform, is necessary to update the automated script used to execute data lake updates. We created datasets containing community-level aggregated information extracted from the variables in the source databases. The integration of variables from different sources facilitates data analysis and epidemiological study, since the resulting aligned dataset contains relevant datapoints collected since the beginning of the pandemic in Brazil, as well as prior to the pandemic. For example, users can analyze a daily time series of mild-to-moderate cases resulting from the FSdb, or hospital occupancy and death rates from the SARSdb, in addition to daily vaccine doses administered, among other variables. The dataset enables real-time assessment of the pandemic in each Brazilian municipality through the lens of data visualization, which is essential to fostering community involvement during local and world health challenges [1] [2] [3] 11 . Finally, the dataset metadata we provide also facilitates federated data access and analytics, which are important for future integrative research and preparedness for the next pandemic. 
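The monitoring of source databases mentioned above (variables added or removed, new categories within variables) can be partly automated by comparing a snapshot of each source's schema with the snapshot taken at the previous update. The sketch below is illustrative and does not reproduce the PAMEpi scripts: the snapshot layout, a dictionary mapping each field to its type and optional set of categories, and the `schema_diff` function are assumptions.

```python
def schema_diff(previous, current):
    """Compare two {field: {"type": str, "categories": set}} snapshots."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped, new_categories = {}, {}
    for field in set(previous) & set(current):
        old, new = previous[field], current[field]
        if old["type"] != new["type"]:
            retyped[field] = (old["type"], new["type"])
        # Categories present now that were absent in the previous snapshot.
        extra = new.get("categories", set()) - old.get("categories", set())
        if extra:
            new_categories[field] = sorted(extra)
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "new_categories": new_categories,
    }
```

A non-empty result signals that the update script or the harmonization rules may need to be revised before the new files are loaded.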
Users can access the updated dataset using a data explorer and dictionary, which are available at https://pamepi.rondonia.fiocruz.br/en/aggregated_en.html. In addition, all code used to create the dataset is freely available from our GitHub repository 11.

After performing pre-processing and data curation, the databases directly related to COVID-19 information (FSdb, SARSdb and Vacdb) were verified through exploratory analyses to ascertain their technical quality. Verification consisted of evaluating the completeness of the main variables contained in these databases, while checking for inconsistencies in date-type variables. To assess the completeness of a given variable, we constructed a new column termed 'indeterminate' that served to quantify the amount of missing and incompatible data. Accordingly, database completeness is represented by the percentage of completion of variables, varying from poorly filled (0%) to well-filled (100%). Completeness was evaluated by selecting key variables from each database. More extensive descriptions of completeness are provided for each individualised database as Supplementary material. These reports include initial exploratory analyses in both table and plot format. The period considered for this evaluation was between January 1, 2020 and November 22, 2021.

FSdb and SARSdb do not contain individual identification codes. Therefore, no procedure to identify duplicate records was carried out. We then labelled categorical variables and categorised age variables into groups to evaluate completeness. The spatial unit used for this analysis was the state in which cases were notified. The completeness analysis in FSdb examined the following demographic and clinical variables: sex, age group, final classification, case evolution, symptoms, test result and test type.
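The completeness measure described above can be sketched as follows: a value counts as indeterminate when it is missing, blank, or outside the set of values accepted for the variable, and completeness is the percentage of records that are not indeterminate. This is a minimal illustration with hypothetical field names and codes, not the PAMEpi implementation.

```python
def completeness(records, field, valid_values=None):
    """Percentage of records whose `field` holds a usable value."""
    if not records:
        return 0.0
    filled = 0
    for record in records:
        value = record.get(field)
        indeterminate = (
            value is None
            or str(value).strip() == ""  # blank entry
            or (valid_values is not None and value not in valid_values)
        )
        if not indeterminate:
            filled += 1
    return 100.0 * filled / len(records)
```

For instance, with four records whose `sex` field holds `"F"`, `"M"`, an empty string and an unexpected code, and with `valid_values={"F", "M"}`, the function returns 50.0.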
For SARSdb, we assessed the completeness of sex, age group, scholarity, residential area (urban or rural), date of hospitalisation, ICU admission, case evolution, final classification and symptoms (each is represented by a variable in SARSdb), such as fever, cough, throat pain, dyspnea, respiratory discomfort, saturation, diarrhea, vomiting, abdominal pain and fatigue. We also explored the completeness of the symptoms variable using cross-tabulations of final classification, case evolution, mortality and COVID-19 case distribution by age group. Vacdb contains an identification code for each individual, as one person can receive more than one vaccine dose. Considering the data available to date, the maximum number of duplicate records for each subject should be three (first/single dose, second and third/booster dose). Therefore, we normalized Vacdb to identify each individual in a row versus the type of dose applied, as the original database is organized 11 . We then evaluated the completeness of first dose, second dose, third dose/reinforce, single-dose and single-dose/reinforce variables across Brazil, and in each state by age and ethnicity. Table 2 details completeness information for FSdb, SARSdb and Vacdb at a national level. Age group, sex and symptoms presented the highest completeness in FSdb (>99%), while final classification and evolution were lower (47.14% and 33.34%, respectively). For SARSdb, 63.88% of the records contained indeterminate information on scholarity. The lowest percentage of completeness was found for the sex variable. Among symptoms, the variables dyspnea and cough showed higher completeness, while fatigue and abdominal pain were lower (Table 2) . Completeness for all vaccination dose types in Vacdb by age group reached over 99% (Table 2) . Sex by all vaccination dose types achieved almost 100% completeness. 
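The Vacdb normalization mentioned above, one row per individual against the type of dose applied, amounts to pivoting dose-level records by person. The sketch below assumes hypothetical field names (`person_id`, `dose`, `date`) and ISO-formatted date strings, and it collapses repeated records of the same dose to the earliest date; the actual PAMEpi routine may differ.

```python
def one_row_per_person(dose_records):
    """Pivot dose-level records into {person_id: {dose_type: date}}."""
    people = {}
    for record in dose_records:
        doses = people.setdefault(record["person_id"], {})
        known = doses.get(record["dose"])
        # Repeated records of the same dose collapse to the earliest date;
        # ISO date strings (YYYY-MM-DD) compare correctly as text.
        if known is None or record["date"] < known:
            doses[record["dose"]] = record["date"]
    return people
```

In the resulting structure, a dose type missing from an individual's row means that no record of that dose was found, which is the basis for evaluating completeness per dose type.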
Ethnicity by all vaccination dose types attained around 75% completeness (see the completeness data analyses available at https://pamepi.rondonia.fiocruz.br/en/data_en.html). The validation presented in the material available at https://pamepi.rondonia.fiocruz.br/en/data_en.html shows that variable completeness varies according to the state in which a case was notified. This may serve to indicate the quality of data collection processes at the regional health district level (data are then forwarded to the MoH data systems). However, an in-depth analysis of the quality of data collection is outside the scope of this manuscript.

The COVID-19 pandemic continues to pose a threat to global health. Populations worldwide face severe inequalities with regard to the distribution of resources essential to diagnosing, treating and preventing COVID-19 27. Concomitantly, extreme poverty and its consequences, particularly poverty-related infectious diseases such as tuberculosis, HIV/AIDS and malaria, persistently imperil the health of vulnerable populations, and scant attention has been paid to these conditions due to limited health services during the pandemic. Therefore, the data lake we have constructed represents a powerful resource for studying past and current aspects of the COVID-19 pandemic, as well as its impact on other health, economic and social outcomes. We have designed the PAMEpi data architecture to incorporate different data sources in order to enhance the power of a data-driven approach to understanding the COVID-19 pandemic in Brazil. To enrich the available health data, we aggregated data on human mobility, social inequality and non-pharmaceutical government interventions to better reflect the complexities of the COVID-19 pandemic.
We have provided a description of all routines employed to collect, store, organise, integrate and use data in real-time for potential use in exploratory analysis, modelling and visualisation at both individual and aggregated levels. Furthermore, this approach is not solely limited to COVID-19, as the health data ingested also contains reports on other respiratory diseases circulating in the country. Therefore, our efforts can be extended to other diseases (particularly respiratory diseases) to include additional important variables (e.g., biogeoclimatic parameters) and potentially aid in the response to future emerging or re-emerging infectious diseases.

County Level COVID-19 Tracking Map
COVID-19 Coronavirus Pandemic
COVID-19) Dashboard
Covid-19 strategic preparedness and response plan
Google COVID-19 Community Mobility Reports
Brazilian institute of geography and statistics - IBGE: Ligações rodoviárias e hidroviárias
Brazilian institute of geography and statistics - IBGE: Ligações aéreas
Open Datasus. Notificações de Síndrome Gripal
Banco de dados SRAG
Campanha Nacional de Vacinação contra Covid-19
Platform for analytical models in epidemiology - PAMEpi
Developing a small-area deprivation measure for brazil
Two-dose chadox1 ncov-19 vaccine protection against covid-19 hospital admissions and deaths over time: a retrospective, population-based cohort study in scotland and brazil
Estimating excess mortality due to the covid-19 pandemic: a systematic analysis of covid-19-related mortality
Platform For Analytical Models in Epidemiology
Guia de vigilância epidemiológica: Emergência de saúde pública de importância nacional pela doença pelo coronavírus
Interdependence between confirmed and discarded cases of dengue, chikungunya and zika viruses in brazil: A multivariate time-series analysis
Estimating underreporting of leprosy in brazil using a bayesian approach
Mathematical modeling of covid-19 in 14.8 million individuals in bahia, brazil
Classification algorithm for congenital zika syndrome: characterizations, diagnosis and validation
WHO Covid-19 vaccines
Assessing the nationwide impact of covid-19 mitigation policies on the transmission rate of sars-cov-2 in brazil
Covid-19 no nordeste brasileiro: sucessos e limitações nas respostas dos governos dos estados
Profile of covid-19 in brazil: Risk factors and socioeconomic vulnerability associated with disease outcome
A control framework to optimize public health policies in the course of the covid-19 pandemic

As described in the Methods section, all code has been documented and is freely available, in R or Python, on the project's GitHub page (https://github.com/PAMepi).
This study was financed by Bill and Melinda Gates Foundation and Minderoo Foundation HDR UK, through the Grand Challenges ICODA COVID-19 Data Science, with reference number 2021.0097 and the Fiocruz Innovation Promotion Program -Innovative ideas and products -COVID-19, orders and strategies INOVA-FIOCRUZ, with reference Number VPPIS-005-FIO-20-2-40. We thank Andris K Walter for English revision and manuscript proofing and suggestions. The authors declare no competing interests.