key: cord-0595789-isgfuc91 authors: Agarwal, Mayank; Chakraborti, Tathagata; Grover, Sachin; Chaudhary, Arunima title: COVID-19 India Dataset: Parsing COVID-19 Data in Daily Health Bulletins from States in India date: 2021-09-27 journal: nan DOI: nan sha: 5dd080cba0f36faebbe6aaad3bb1b845440481ad doc_id: 595789 cord_uid: isgfuc91

While India has been one of the hotspots of COVID-19, data about the pandemic from the country has proved to be largely inaccessible at scale. Much of the data exists in unstructured form on the web, and limited aspects of such data are available through public APIs maintained manually through volunteer effort. This has proved difficult both for access to detailed data and for sustaining manual data-keeping over time. This paper reports on our effort to automate the extraction of such data from public health bulletins with a combination of classical PDF parsers and state-of-the-art machine learning techniques. We describe the automated data-extraction technique, the nature of the generated data, and exciting avenues of ongoing work.

Availability of COVID-19 data is crucial for researchers and policymakers to understand the pandemic and react to it in real-time. However, unlike countries with well-defined data reporting mechanisms, pandemic data from India is available either through volunteer-driven initiatives, through special access granted by the government, or manually collected from daily bulletins published by states and cities on their own websites or platforms. While daily health bulletins from Indian states contain a wealth of data, they are available only in unstructured form, as PDF documents and images. On the other hand, volunteer-driven manual data curation cannot scale to the volume of data over time.
For example, covid19india.org, one of the most well-known sources of COVID data from India, has manually maintained public APIs for limited data throughout the pandemic. Such approaches are limited in the detail of data they make available, and they are unlikely to continue in the long term because of the volunteer manual labor they require indefinitely. This project began in anticipation of that outcome, which has since come to pass for covid19india.org, for reasons similar to those outlined in [12]. As such, detailed COVID-19 data from India, in a structured form, remains inaccessible at scale. [20] notes pleas from researchers in India, earlier this year, for urgent access to detailed COVID data collected by government agencies.

The aim of this project is to use document and image extraction techniques to automate the extraction of such data in structured (SQL) form from the state-level daily health bulletins, and to make this data freely available. Our target is to automate the data extraction process so that, once the extraction for a state is set up, it requires little to no further attention (other than responding to changes in the schema). The role of machine learning here is to make that extraction automated and robust in coverage and accuracy. This data goes beyond daily case and vaccination numbers to comprehensive state-wise metrics such as hospitalization data, age-wise distribution of cases, asymptomatic and symptomatic cases, and even case information for individuals in certain states.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

India, one of the most populous countries in the world, has reported over 33 million confirmed cases of COVID-19, second only to the United States.
The massive scale of this data not only provides intriguing research opportunities in data science, document understanding, and NLP for AI researchers but will also help epidemiologists and public policy experts to analyze and derive key insights about the pandemic in real-time. At the time of this writing, covid19india.org has also released possible alternatives going forward once the current APIs are sunset next month. These suggestions, detailed in [11], also align perfectly with this project and give us hope that we can continue providing this data, at scale and with much more detail than ever before.

We segment the system into 3 major components: (a) the backend, which is responsible for extracting data from health bulletins, (b) the database, which stores the parsed structured data, and (c) the frontend, which displays key analyses extracted from the parsed data. We describe each of these components in greater detail in the following sections.

Since we aim to extract data from health bulletins published by individual states on their respective websites, there is no standard template across these data sources for where and how a bulletin is published, or for what information it includes and how that information is presented. To account for these variations, we modularize the system into the following 3 main components: (a) bulletin download, (b) datatable definition, and (c) data extraction. We provide an overview of the system in Figure 1 and look at the three components in greater detail. The open-sourced code can be accessed at: https://github.com/IBM/covid19-india-data.

The bulletin download procedure downloads the bulletins from the respective state websites to local storage while keeping track of the dates already processed. We use the BeautifulSoup library to parse the state websites and identify bulletin links and dates for download.
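The bulletin download step can be illustrated with a minimal sketch. It is not the project's code: to stay dependency-free it uses Python's standard-library `html.parser` in place of BeautifulSoup, and it assumes, purely for illustration, that each bulletin link ends in `.pdf` and carries an ISO date in its path. The names `BulletinLinkParser`, `new_bulletins`, and the `health.example.org` URLs are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the bulletin date appears in the link as YYYY-MM-DD.
DATE_IN_HREF = re.compile(r"(\d{4}-\d{2}-\d{2})")


class BulletinLinkParser(HTMLParser):
    """Collects {date: absolute_url} for links that point to dated PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.bulletins = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or not href.lower().endswith(".pdf"):
            return
        match = DATE_IN_HREF.search(href)
        if match:
            self.bulletins[match.group(1)] = urljoin(self.base_url, href)


def new_bulletins(html, base_url, processed_dates):
    """Return the bulletins on the page whose dates have not been processed yet."""
    parser = BulletinLinkParser(base_url)
    parser.feed(html)
    return {d: u for d, u in parser.bulletins.items() if d not in processed_dates}


# Hypothetical listing page, standing in for a state health department website.
SAMPLE_PAGE = """
<ul>
  <li><a href="/bulletins/2021-09-01.pdf">Bulletin 1 Sep 2021</a></li>
  <li><a href="/bulletins/2021-09-02.pdf">Bulletin 2 Sep 2021</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

todo = new_bulletins(SAMPLE_PAGE, "https://health.example.org/", {"2021-09-01"})
print(todo)
```

Real state websites differ widely in how they list bulletins and encode dates, so the link filter and the date pattern would need to be adapted per state.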
Since each state provides different information, we define table schemas for each state by manually investigating the bulletin (done once per state). We then use the free, open-source SQLite database to interface with the data extractor and store the data.

States typically provide the bulletins in the form of PDF documents. To extract information from them, we use a combination of classical PDF parsers and state-of-the-art machine-learning-based extraction techniques:

Classical PDF parsing: Since a substantial amount of information in the bulletins is in the form of data tables, we use the Tabula and Camelot Python libraries to extract these tables as Python data structures. While these libraries cover many use cases, they fail in certain edge cases.

Deep-learning augmented PDF parsing: Libraries that extract data tables from PDFs typically use either the Lattice or the Stream [13] method of detecting table boundaries and inferring table structure. While these heuristics work well in most cases, when tables are not well separated or are spread wide, they fail to separate the tables from each other and group them all together. To correct for such errors, we utilize CascadeTabNet [19], a state-of-the-art convolutional neural network that identifies table regions and structure. We use the detected table boundaries to restrict table parsing to specific areas of the PDF, thereby increasing the parsing accuracy. We show an example of the performance gain from this approach in Appendix A.2.

Data extraction from images: While the majority of information in health bulletins is provided as textual tables, some information is provided as images of tabular data. This information cannot be processed through the aforementioned techniques and requires Optical Character Recognition (OCR). We employ the Tesseract OCR engine [23] to read and extract tabular data provided as images.
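Whichever of the three techniques produces the rows, they are written to state-specific SQLite tables, as described at the start of this section. The following is a minimal sketch of that step using Python's built-in `sqlite3` module. The table name `DL_hospitalizations`, its columns, and the inserted values are hypothetical placeholders, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical per-state table; in the project, each state's schema is
# defined once by manually inspecting that state's bulletin.
SCHEMA = """
CREATE TABLE IF NOT EXISTS DL_hospitalizations (
    date TEXT PRIMARY KEY,
    beds_total INTEGER,
    beds_occupied INTEGER
)
"""


def store_rows(conn, rows):
    """Create the table if needed and upsert one row per bulletin date."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO DL_hospitalizations "
        "(date, beds_total, beds_occupied) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would be used in practice

# Made-up values standing in for rows parsed from two daily bulletins.
store_rows(conn, [("2021-04-25", 20000, 18000), ("2021-04-26", 20500, 18900)])

# Re-parsing a bulletin overwrites that date's row instead of duplicating it.
store_rows(conn, [("2021-04-26", 20500, 19000)])

rows = conn.execute(
    "SELECT date, beds_occupied FROM DL_hospitalizations ORDER BY date"
).fetchall()
print(rows)
```

Keying each table on the bulletin date is one simple way to make re-runs of the extractor idempotent; it is a design choice of this sketch, not something the paper specifies.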
In Appendix A.3, we provide an example of a bulletin parsed through Tesseract OCR. The detected text is overlaid in green boxes. Note that this is an experimental feature, and we are actively working on assessing and improving its efficacy.

To process information for a state, a separate data extractor routine is used, which has access to all three of the aforementioned APIs. Depending on the format of the particular bulletin, we utilize a combination of the three techniques to extract information.

The frontend, or landing page, for the project is generated automatically from the database schema and provides easy access to 1) the raw data (sampled at an appropriate rate to be loaded in the browser); and 2) pages for highlights and analysis based on SQL queries (such as those described in Section 3).

The system described above runs daily and produces a SQL database that is publicly available for download. However, one can also use the source code to generate data customized with their own parameters and deploy it on their own systems.

Current Status: At the time of writing, we have completely indexed information from seven major Indian states, covering a population of over 382 million people, or roughly 28.67% of India's population. Additionally, we are in the final stages of integrating 5 new states, covering an additional 271.5 million people, for a total coverage of 653.5 million people. In Appendix A.1, we provide an overview of the categories of information available in our database and contrast it with the information in the covid19india.org database.

In this section, we perform some preliminary analysis on the data collected from the health bulletins of Delhi and West Bengal. We would like to emphasize that some of these analyses are, to the best of our knowledge, the first such analyses available for the two states.
However, they are still preliminary; they offer a glimpse of what such data makes possible for researchers interested in the subject.

India has seen two major waves of COVID-19, with the second wave, fuelled primarily by the Delta variant [25], being more deadly than the first [7, 16]. We aim to understand the difference between the two waves by computing the weekly Case Fatality Rate (CFR) as the ratio of total fatalities to total newly confirmed cases in a particular week. The charts for Delhi and West Bengal are presented in Figure 2. While the weekly CFR for the first wave seems to be comparable for the two states, there appears to be a stark difference in the numbers for the second wave.

Currently, India uses reverse-transcriptase polymerase-chain-reaction (RT-PCR) tests and Rapid Antigen Tests (RATs) to detect COVID-19 cases. While RT-PCR tests are highly accurate and are considered the gold standard for detecting COVID-19 [6], they are more expensive and time-consuming than the less accurate RATs. While the official advisory is to prefer RT-PCR tests over RATs [18], there is a discrepancy in how the two testing methods are used [9] and in how their ratio affects the reported case results [8]. The state government of Delhi has in the past been called out for over-reliance on RATs as opposed to the preferred RT-PCR tests [22]. Following this criticism, the government increased the share of RT-PCR tests. We compute the ratio of RT-PCR tests to total tests conducted in the state (Figure 2). As is evident, in 2020, less than 50% of the total tests conducted in the state were RT-PCR tests. However, starting in 2021, and especially during the second wave of COVID-19 in India, this ratio increased to over 70%.

Both Delhi (DL) and West Bengal (WB) report dedicated COVID-19 hospital infrastructure and occupancy information in their bulletins. Using these numbers, we compute the COVID-19 bed occupancy as the ratio of occupied beds to total beds (Figure 2).
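The three ratios defined above (weekly CFR, RT-PCR share, and bed occupancy) are simple to compute once the daily figures are in structured form. The following is a minimal sketch, assuming daily records are available as `(date, new_cases, new_deaths)` tuples and that weeks follow the ISO calendar; the paper does not specify its week boundaries. The function names and the sample numbers are hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_cfr(daily):
    """Weekly CFR: total fatalities / total newly confirmed cases per ISO week.

    `daily` is an iterable of (date, new_cases, new_deaths) tuples.
    Returns {(iso_year, iso_week): cfr}; weeks with zero cases are skipped.
    """
    cases = defaultdict(int)
    deaths = defaultdict(int)
    for day, new_cases, new_deaths in daily:
        iso = day.isocalendar()
        key = (iso[0], iso[1])
        cases[key] += new_cases
        deaths[key] += new_deaths
    return {key: deaths[key] / cases[key] for key in cases if cases[key] > 0}


def ratio(numerator, denominator):
    """Plain ratio, e.g. RT-PCR tests / total tests or occupied beds / total beds."""
    return numerator / denominator if denominator else None


# Made-up daily figures, for illustration only.
sample = [
    (date(2021, 4, 26), 100, 1),
    (date(2021, 4, 27), 300, 3),
    (date(2021, 5, 3), 200, 4),
]

cfr = weekly_cfr(sample)
rtpcr_share = ratio(70, 100)      # RT-PCR tests / total tests
bed_occupancy = ratio(900, 1000)  # occupied beds / total beds
print(cfr, rtpcr_share, bed_occupancy)
```

Returning `None` for a zero denominator keeps days on which a bulletin reports no tests or no beds from raising an error; how such days should be treated in an actual analysis is left open here.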
Similar to the results in Section 3.1, bed occupancy for Delhi shows a steep increase during the second wave, reaching about 90% occupancy, while the occupancy for West Bengal does not differ significantly between the two waves.

To treat COVID-19 patients, India adopted a two-pronged strategy of hospitalization along with home isolation: patients with a mild case of COVID-19 were advised home isolation, whereas hospitals were reserved for patients with more severe cases [24, 4]. We compute the hospitalization percentage as the ratio of the number of occupied hospital beds to the number of active cases. This is an estimate of how many of the currently active COVID-19 patients are in hospitals versus home isolation (Figure 2). The peaks we see for the two states correspond to time periods after the respective wave has subsided; the minima and the subsequent rise in hospitalization correspond to the onset of a particular wave.

The primary aim of this project is to extract as much information about the pandemic as possible from public sources so that this data can be made accessible in an easy and structured form to researchers who can utilize such data (from one of the most populous and heavily COVID-affected countries in the world) in their research. We foresee two main areas of future work for this project:

1. In the immediate future, we aim to integrate information for all Indian states into the dataset. Additionally, the project currently relies on health bulletins alone to extract the data. There are other platforms where the authorities release data, such as Twitter and Government APIs [10]. We hope to integrate these additional sources of information into the dataset.

2.
We anticipate that this data will be helpful in validating or extending models developed for other countries [14, 5], developing pandemic models that integrate additional variables available in our dataset [17, 2, 1, 3], and understanding other aspects of the pandemic [21, 15].

Figure 3: Table extraction from a state health bulletin using classical PDF parsing and CascadeTabNet-enhanced parsing. There are eight tables in the bulletin page (see (a)). Classical parsing detects only two tables because of insufficient separation between them, while CascadeTabNet improves detection significantly, extracting seven tables and missing one.

In Figure 4, we show an example of a data table provided in the form of an image. Standard table extraction tools do not support extracting data from such a format, and we therefore utilize Optical Character Recognition (OCR) for data extraction in these cases. In this figure, we show the detected text and the bounding boxes around it. As is evident, this technique fails to identify certain text, such as the header of the table and certain numbers from the table itself. This is currently an experimental feature, and we are actively working on assessing and improving its efficacy.

Figure 4: State bulletin sample providing tabular data in the form of an image. We use Tesseract OCR to extract data from the image (green bounding boxes). However, the OCR engine fails to extract all the information correctly; for instance, it fails to identify the table header.
[1] Mathematical models for COVID-19 pandemic: a comparative analysis.
[2] SUTRA: An approach to modelling pandemics with asymptomatic patients, and applications to COVID-19.
[3] Extending the Susceptible-Exposed-Infected-Removed (SEIR) model to handle the high false negative rate and symptom-based administration of COVID-19 diagnostic tests: SEIR-fansy.
[4] Analysis of Facility and Home Isolation Strategies in COVID 19 Pandemic: Evidences from Jodhpur.
[5] Modeling of future COVID-19 cases, hospitalizations, and deaths, by vaccination rates and nonpharmaceutical intervention scenarios - United States.
[6] Diagnostic Performance of an Antigen Test with RT-PCR for the Detection of SARS-CoV-2 in a Hospital Setting.
[7] Differentials in the characteristics of COVID-19 cases in Wave-1 and Wave-2 admitted to a network of hospitals in North India.
[8] Is India missing COVID-19 deaths? The Lancet.
[9] Optimizing Testing for COVID-19 in India. medRxiv.
[13] Anssi Nurminen. Algorithmic extraction of data in tables in PDF documents.
[14] Predictive performance of international COVID-19 mortality forecasting models.
[15] Inter-state transmission potential and vulnerability of COVID-19 in India.
[16] Clinical characterization and Genomic analysis of COVID-19 breakthrough infections during second wave in different states of India. medRxiv.
[17] The mathematics of infectious diseases.
[18] Advisory on Strategy for COVID-19 Testing in India.
[19] CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents.
[20] Indian scientists plead with government to unlock COVID-19 data.
[21] Mukherjee. Predictions, role of interventions and effects of a historic national lockdown in India's response to the COVID-19 pandemic: data science call to arms.
[22] It isn't just Delhi. Kerala, Bihar & UP also conduct more than 50% rapid antigen tests.
[23] An overview of the Tesseract OCR engine.
[24] COVID-19 in India: Moving from containment to mitigation.
[25] COVID-19 pandemic dynamics in India and impact of the SARS-CoV-2 Delta (B.1.617.2) variant. medRxiv.

We would like to thank all our open source contributors, in addition to those who have joined as co-authors of this paper, for their amazing contributions to this project and this dataset. In particular, we thank Sushovan De (Google) for helping us extend the dataset to the Indian state of Karnataka.

In Table 1, we present the different attributes that are available in our dataset and contrast them with the popular covid19india.org dataset. While covid19india.org contains Case, Testing, and Vaccination information for all states, we include additional features such as Hospital infrastructure and hospitalization statistics, Individual fatality data, Age and gender distribution of cases, and Mental Health counselling, among others.