key: cord-0835283-cf9h21uw
authors: Mahasuar, Kiran
title: Lies, damned lies, and statistics: The uncertainty over COVID‐19 numbers in India
date: 2021-08-10
journal: Knowledge and Process Management
DOI: 10.1002/kpm.1685
sha: 4ac7cdbd7df5ae704dbcf3383442bcf12a176e50
doc_id: 835283
cord_uid: cf9h21uw

This paper intends to ascertain the veracity of reported data on deaths and testing pertaining to the novel coronavirus in India. We use a widely used forensic audit technique called Benford's law to analyze the data, and our findings suggest anomalies in the reported numbers: the reported data for most of the states do not adhere to the Benford distribution. The implications of these findings are manifold, especially on the trajectory of policy‐making, vaccination strategy, and preparedness for future waves and new variants. We strongly argue for the need for a robust data collection and reporting mechanism, creating a central data repository, and instituting a data‐driven policy framework as key steps in the process management bulwark for managing such future pandemics and other events concerning public health.

The outbreak and spread of the novel coronavirus (COVID-19) has been catastrophic for India and has impacted the people and the economy severely. The policy response to the pandemic has been largely inadequate and muted and consequently, the second-most populous country is staring at an unprecedented humanitarian crisis with the healthcare system on the brink of collapse. Experts cite the paucity of accurate data and sampling as a major constraint in envisaging the uneven pattern of the spread, genome sequencing in cluster level investigations, identifying new variants, and predictive modeling of future waves in India (Ethiraj, 2021; Mallapaty, 2021). The reported data have also been under the scanner with anecdotal evidence suggesting massive under-reporting of deaths in particular.
This issue has also attracted considerable negative press, including in premier outlets like Time Magazine and New York Times (Bajekal, 2021; Gettleman et al., 2021). To be fair, such concerns were also raised for the reported data from China, but Koch and Okamura (2020) debunk that speculation and find no evidence that the Chinese authorities massaged the COVID-19 statistics. The aforementioned concerns and the issues of data integrity engendered our investigation, in which we focus on decoding the COVID-19 data pertaining to deaths and tests as reported from India. We use Benford's law, a widely used and acclaimed technique in forensic audit, to assess the veracity of the reported data on Covid deaths and tests. This empirical research assumes considerable significance in the context of high levels of distrust and skepticism regarding the Covid numbers and the ensuing policy responses from the government. The paper is structured as follows. In the second section, we discuss the methods and materials used for this study. In the third section, we discuss the findings in detail. In the fourth section, we discuss the ramifications of the anomaly of underreporting and the policy implications. In the last section, we conclude the study along with limitations and future research directions.

The origin of Benford's law (also known as the first-digit law or the Newcomb-Benford law) can be traced to Newcomb's work published in the American Journal of Mathematics (Newcomb, 1881). The mathematician observed that the book of logarithms at the library was more worn on the front pages and less worn on the back pages. Newcomb subsequently devised a formula for calculating the probability of any non-zero initial digit of a number. Benford (1938) revisited this pattern several decades later and confirmed its consistency in a myriad of distributions like home addresses, river lengths, and so on.
The incidence of the first significant digits (FSD), according to Benford, follows a logarithmic distribution:

$$P_k = \log_{10}\left(1 + \frac{1}{k}\right), \quad k = 1, 2, \ldots, 9$$

where $P_k$ is the probability that a given number in a distribution has $k$ as its FSD. Thus, the probability of occurrence of each digit is as follows (see Table 1). Benford's law is scale-invariant: compliance is not affected by changes in the unit of measurement of the data (Mir et al., 2014). Given the empirical evidence of the higher occurrence of lower digits vis-à-vis higher digits, a dataset derived organically should ideally follow Benford's theoretical distribution. An anomalous result demonstrating non-compliance with this law indicates possible data manipulation, and therefore the dataset and its sources must be thoroughly investigated. Benford's law has been extensively used to detect fraud in economic, behavioral, and accounting data (Varian, 1972). Early applications in accounting include work by Carslaw (1988) and Thomas (1989), followed by Nigrini (1996, 2005).

We have used Pearson's Chi-square test (Pearson, 1900) as a significance test to ascertain whether the "true" distribution underlying the sample follows the theoretical (Benford) distribution, that is, whether the sample comes from a distribution with a certain probability density function (González, 2020). The Chi-square statistic is calculated as follows:

$$\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency of first digit $i$ and $E_i$ is the frequency expected under Benford's law. In addition, we have supplemented this analysis with another goodness-of-fit test, the Kolmogorov-Smirnov (KS) test, wherein the empirical distribution function $F_n$ for $n$ independent and identically distributed ordered observations $X_i$ is defined as:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$$

where $I_{(-\infty, x]}(X_i)$ is the indicator function, equal to 1 if $X_i \le x$ and equal to 0 otherwise. The KS statistic for a given cumulative distribution function $F(x)$ is

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

where $\sup_x$ is the supremum of the set of distances. In other words, the statistic takes the largest absolute difference between the two distribution functions across all $x$ values.
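The first-digit probabilities and the two goodness-of-fit statistics described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the input is assumed to be a list of positive integer counts (such as daily deaths or tests), and the KS statistic is computed in its discrete form over the nine first-digit classes, which is one common way to apply it to digit data.

```python
import math
from collections import Counter


def benford_probs():
    """Expected first-digit probabilities P_k = log10(1 + 1/k) for k = 1..9."""
    return {k: math.log10(1 + 1 / k) for k in range(1, 10)}


def first_digit(n):
    """First significant digit of a non-zero integer (sign is ignored)."""
    n = abs(int(n))
    while n >= 10:
        n //= 10
    return n


def digit_counts(values):
    """Observed frequency of each first digit 1..9; zeros are skipped."""
    counts = Counter(first_digit(v) for v in values if int(v) != 0)
    return [counts.get(k, 0) for k in range(1, 10)]


def chi_square_stat(observed):
    """Pearson statistic: sum over digits of (O_i - E_i)^2 / E_i."""
    n = sum(observed)
    probs = benford_probs()
    return sum(
        (observed[k - 1] - n * probs[k]) ** 2 / (n * probs[k])
        for k in range(1, 10)
    )


def ks_stat(observed):
    """Largest absolute gap between the observed and Benford cumulative
    distributions, evaluated at the nine digit classes (discrete KS)."""
    n = sum(observed)
    probs = benford_probs()
    cum_obs = cum_exp = 0.0
    largest_gap = 0.0
    for k in range(1, 10):
        cum_obs += observed[k - 1] / n
        cum_exp += probs[k]
        largest_gap = max(largest_gap, abs(cum_obs - cum_exp))
    return largest_gap


if __name__ == "__main__":
    # Synthetic illustration only (powers of two), not the paper's data.
    sample = [2 ** n for n in range(1, 201)]
    observed = digit_counts(sample)
    print("observed counts:", observed)
    print("chi-square:", chi_square_stat(observed))
    print("KS distance:", ks_stat(observed))
```

With nine digit classes the Chi-square statistic has 8 degrees of freedom, so the standard 5% critical value is 15.507; a statistic above that value would reject conformity with Benford's law at that level.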
Both the Chi-square test and the KS test are prone to type I errors in large samples, where even small differences between the observed and expected proportions become statistically significant, so their verdict on compliance with Benford's law must be treated with caution (Barney & Schulzke, 2016; Druică et al., 2018; González, 2020). To alleviate this concern, we use the mean absolute deviation (MAD) to check the concordance of the frequency distribution with Benford's law:

$$\mathrm{MAD} = \frac{1}{K} \sum_{i=1}^{K} \left| p_i^{(o)} - p_i^{(e)} \right|$$

where $p_i^{(o)}$ is the proportion of observations observed for class $i$, $p_i^{(e)}$ is the expected proportion for class $i$ according to Benford's law, and $K = 9$ is the number of first-digit classes. The adjusted MAD critical value ranges developed by Drake and Nigrini (2000) are shown in Table 2. We have used the same to define the conformity levels of our data.

3 | RESULTS

The results for the analysis of the reported Covid deaths in the selected period are presented below in […] Tamil Nadu, and Chhattisgarh seem to follow Benford's law. We present the actual versus expected probability of the initial digit occurrence in accordance with Benford's law for a few selected states in Figure 1. Based on our analysis of the cumulative data and the second wave-specific data, we conclude that the possibility of data manipulation in reported deaths for several Indian states cannot be ruled out. Therefore, the data necessitate further examination at the district and town levels to identify the actual sources of malfeasance. The reported data on deaths are closely interlinked with the robustness of state-level testing and subsequent mitigation measures. Therefore, we also examined the conformability of Benford's law to the testing […] Covid data massaging in Gujarat (Desai, 2020) and the impact of state elections on slowing the pace of Covid testing in West Bengal (Sharma, 2021). The adverse reportage regarding Covid testing in the popular press suggests that the anomalous results and nonconformance with Benford's law are not a mere aberration and indicate a systemic failure in these states.
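The MAD check used in the methods can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the cutoffs that separate conformity levels are deliberately left as parameters, to be filled in from the Drake and Nigrini (2000) ranges in Table 2, because those values are not reproduced in this text.

```python
import math


def benford_mad(observed):
    """Mean absolute deviation between observed first-digit proportions
    and Benford's expected proportions, averaged over the nine digits.

    `observed` is a list of nine counts for first digits 1..9.
    """
    n = sum(observed)
    total = 0.0
    for k in range(1, 10):
        expected = math.log10(1 + 1 / k)   # Benford proportion for digit k
        total += abs(observed[k - 1] / n - expected)
    return total / 9


def conformity_level(mad, cutoffs, labels):
    """Map a MAD value to a conformity label.

    `cutoffs` are ascending upper bounds supplied by the caller (taken from
    the published table, not hard-coded here); `labels` must contain one
    more entry than `cutoffs`, the last one covering everything above the
    highest cutoff.
    """
    for upper, label in zip(cutoffs, labels):
        if mad <= upper:
            return label
    return labels[-1]


if __name__ == "__main__":
    # Synthetic counts with every first digit equally frequent.
    counts = [100] * 9
    mad = benford_mad(counts)
    # Placeholder cutoffs for illustration only; substitute the Table 2 values.
    placeholder_cutoffs = (0.01, 0.02, 0.03)
    placeholder_labels = ("close", "acceptable", "marginal", "nonconforming")
    print(mad, conformity_level(mad, placeholder_cutoffs, placeholder_labels))
```

Unlike the Chi-square and KS statistics, the MAD does not depend on the sample size, which is why it is used as a complement to those tests on large datasets.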
The incorrect reporting of testing and death numbers has several ramifications. […] deaths from the other deaths occurring in the hospital, there is no parity in the manner in which COVID-19 deaths are construed and counted. There are reported instances of hospitals omitting deaths due to comorbidities from the Covid deaths tally (Bhattacharya, 2020). The anomalies in the data and the consequent non-conformance to Benford's law underscore the absence of standardized surveillance, data collection, and reporting. Moreover, this issue is further compounded by the lack of a centralized data repository for the use of data scientists, public policy experts, and epidemiologists. This underlying deficiency also means that most states and the central government are plagued by a lack of real-time visibility beyond major cities. This drawback poses major challenges to epidemiologists and public health professionals in predicting the spread of the pandemic to other clusters, formulating the strategy for designing containment zones, and modeling future waves (Ethiraj, 2021). This handicap was evident when the system failed to envisage the gargantuan impact of the second wave, and the extant healthcare infrastructure across the states failed to cope with the multifold increase in cases. The possibility of under-reported death numbers and testing data is an outcome of this malaise, and it has three major repercussions: (a) a lack of understanding of the actual path traversed by the virus across the states, (b) uncertainty about the vulnerabilities to future waves, and (c) uncertainty about what the ideal vaccination strategy should be.

It must be noted that a distribution not adhering to Benford's law merely indicates that the data are anomalous. Noncompliance is not conclusive evidence of data manipulation, and further scrutiny is necessary to draw such a strong conclusion. Our contribution in this regard is limited to detecting the presence of anomalous data in the reported numbers.
A second limitation is that the data collected from the various states represent aggregated data rather than district or town level data. Although the large data distributions embedded in an aggregated number enhance the effectiveness of Benford's law in identifying aberrations, district or town level data are more helpful in identifying specific clusters where erroneous reporting or manipulation might have occurred. Our analysis is primarily concerned with the composite data from the onset of the pandemic till May 19, 2021, and we have not looked at the impact of lockdowns and shutdowns imposed by the governments from time to time. In line with Koch and Okamura (2020), we expect the pre-lockdown period to follow Benford's law and the post-lockdown period to be disruptive and not adhere to the Benford distribution. Future research could carry out this period-specific analysis of the pre-lockdown and post-lockdown data to validate this hypothesis. Another avenue for further research is a comparative analysis of excess deaths based on death certification data, which can be sourced from local government bodies like the municipal corporations. Preliminary studies in this direction show a multifold increase in deaths in 2021 vis-à-vis 2019 and 2020 (Ramani & Radhakrishnan, 2021; Tumbe, 2021). We strongly advocate a robust data collection mechanism and the institution of a data-driven policy-making framework at the central and state levels. We expect this paper to foster further research and debate regarding the fallibility of extant data in predicting outcomes and to provide an impetus to data-driven policymaking.

The data that support the findings of this study are available from the corresponding author upon reasonable request.
Kiran Mahasuar https://orcid.org/0000-0002-3849-8479

ENDNOTES
1 India's population is estimated to be 1.391 billion according to https://www.worldometers.info/world-population/india-population/ (last accessed on May 28, 2021).
2 Genome sequencing is a technique that reads and interprets genetic information found within DNA or RNA. Read more at https://www.who.int/publications-detail-redirect/9789240018440
3 The Pulse Polio Immunization Programme was rolled out in India on October 2, 1994, when India accounted for around 60% of the global polio cases. The last polio case in India was reported a decade ago in Howrah on January 13, 2011, and the country has since been free of polio (Source: https://www.who.int/india/news/feature-stories/detail/110million-children-vaccinated-in-the-country-s-first-polio-drive-of-thedecade, last accessed on May 29, 2021).

India's covid collapse, part 1: How Modi government's complacency in keeping track of new mutants triggered a second wave How did India's Covid-19 Crisis become a catastrophe? Moderating "cry wolf" events with excess MAD in Benford's law research and practice The law of anomalous numbers Covid-19 deaths in Bengal double in 5 days as govt junks casualty audit panel Anomalies in income numbers: Evidence of goal oriented behavior Covid testing drops in Delhi, labs say they don't have enough staff, skilled technicians. The Print Gujarat govt is managing data, not coronavirus, say medical Experts: Percentage of deaths in state among highest, even as number of tests remain low Computer assisted analytical procedures using Benford's law Benford's law and the limits of digit analysis By hiding the real number of COVID-19 cases and Deaths, some Indian states are Disempowering people As Covid-19 devastates India Benford's Law and Macroeconomic Data Quality Self-reported income data: Are people telling the truth? 
Applying Benford's Law to individual financial reports: An empirical investigation on the basis of SEC XBRL filings (Working Paper No. 2012-1 [rev.]). Working Papers in Accounting Valuation Auditing Financial sleuthing using Benford's law to analyze quarterly data with various industry profiles Detecting problems in survey data using Benford's law Benford's law and COVID-19 reporting India's massive COVID SURGE Puzzles scientists Benford's law predicted digit distribution of aggregated income taxes: The surprising conformity of Italian cities and regions Note on the frequency of use of the different digits in natural numbers A taxpayer compliance application of Benford's law An assessment of the change in the incidence of earnings management around the Enron-Andersen episode The political economy of numbers: On the application of Benford's law to international macroeconomic statistics X. on the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine Indian scientists plead with government to unlock COVID-19 data COVID-19 in India: Dataset on Novel Corona Virus Disease 2019 in India The meta bloc: Why India faces a data shortage on genome sequencing of coronavirus Kolkata's COVID-19 deaths in 2021 could be 4 times higher Fact and fiction in EUgovernmental economic data Data shows drastic fall in COVID-19 tests is helping Bengal Unusual patterns in reported earnings Why 'excess mortality' figures for covid must be calculated Benford's law (letters to the editor) Benford's law and the FSD distribution of economic behavioral micro data Lies, damned lies, and statistics: The uncertainty over COVID-19 numbers in India. Knowledge and Process Management