key: cord-0314817-mfpgbqc8
authors: Kolias, P.
title: Are COVID-19 data reliable? The case of the European Union
date: 2021-12-25
journal: nan
DOI: 10.1101/2021.12.24.21268373
sha: eb2a996bd1d7b48ef63975c80bbd2030fe4eb075
doc_id: 314817
cord_uid: mfpgbqc8

Previous studies have used Benford's distribution to assess whether there is misreporting of COVID-19 cases and deaths. Data inaccuracies provide false information to the media, undermine the global response, and hinder the preventive measures taken by countries worldwide. In this study, we analyze daily new cases and deaths from all the countries of the European Union and estimate their conformance to Benford's distribution. For each country, two statistical tests and two measures of deviation are calculated to determine whether the reported statistics comply with the expected distribution. Four country-level developmental indexes are also included: the GDP per capita, health expenditures, the Universal Health Coverage index, and the full vaccination rate. Regression analysis is used to show whether the deviation from Benford's distribution is affected by these indexes. The findings indicate that only three countries were in line with the expected distribution: Bulgaria, Croatia, and Romania. For daily cases, Denmark, Greece, and Ireland showed the greatest deviation from Benford's distribution, and for deaths, Malta, Cyprus, Greece, Italy, and Luxembourg had the highest deviation from Benford's law. Furthermore, it was found that the vaccination rate is positively associated with deviation from Benford's distribution. These results suggest that, overall, the official data provided by authorities do not conform to Benford's law, yet this approach acts only as a preliminary tool for data verification. More extensive studies should be conducted, with a more thorough investigation of the countries that showed the greatest deviation.

The COVID-19 pandemic has affected the lives of millions of people worldwide.
Due to the rapid contagiousness of the virus (Hafeez et al., 2020), nearly every country employed measures against its spread, such as national lockdowns and restrictions on typical activities. The pandemic showed that statistical and machine learning modelling procedures can potentially predict the number of new cases or deaths for a given country (Cássaro & Pires, 2020; Niazkar & Niazkar, 2020; Neto et al., 2020). An accurate forecast of the infection curve can facilitate government measures to suppress the growth rate. However, in order to accurately predict or model the spread of COVID-19, reliable and valid data must be collected from the authorities.

The COVID-19 pandemic raised issues about data collection and handling. Media reports have questioned whether the statistics provided by countries are trustworthy (Kilani, 2021). Several studies have questioned the accuracy of government data and have linked data manipulation with transparency and democracy indexes (Adsera, Boix & Payne, 2003; Magee & Doces, 2015; Rozenas & Stukal, 2019). Previous studies, in different fields, have applied Benford's distribution (or law) analysis to detect fraudulent and manipulated data. Specifically, for COVID-19, it was found that deaths were underreported in the USA (Campolieti, 2021), while in China no manipulation was found (Koch & Okamura, 2020). A study for Japan also showed deviation from Benford's distribution (Lee, Han & Jeong, 2020). Furthermore, it was found that countries with higher values of the developmental index are less likely to deviate from Benford's law (Balashov, Yan & Zhu, 2021).

This study applies Benford's law to detect deviations of the first digits of the announced cases and deaths from the expected frequencies in the European Union (EU). We further investigate whether the deviation present for each country is associated with four developmental indexes: the GDP per capita, health expenditures (% of GDP), the Universal Health Coverage Index, and the full vaccination rate.

medRxiv preprint doi: https://doi.org/10.1101/2021.12.24.21268373; this version posted December 25, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

The public COVID-19 data of the European Union on daily cases and deaths were exported from the European Centre for Disease Prevention and Control (ECDC) and consisted of observations from the 2nd of March to the 20th of December 2021 (N = 8820). The ECDC's Epidemic Intelligence team collects and refines daily data on new cases and deaths associated with COVID-19, based on reports from health authorities worldwide. Apart from the COVID-19 data, we included the gross domestic product per capita (GDPc), the healthcare expenditures of countries as a percentage of GDP (HGDP), and the Universal Health Coverage Index (UHC) from the World Bank (https://data.worldbank.org/). Finally, we included the full COVID-19 vaccination rate as of the 16th of December 2021, obtained from the ECDC.

Benford's law (or the first-digit law) is a probability distribution for the first digit in a set of numbers. It was formally proposed in 1938 by the physicist Frank Benford, following earlier work by the mathematician Simon Newcomb. Benford claimed that in natural and unrestricted data sets, the probability of each digit $d$ appearing as the first digit is given by the formula:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9.$$
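The formula can be evaluated directly. The following is a minimal sketch, not the author's code: the function names `benford_probabilities` and `first_digit` are hypothetical, and the treatment of zero counts (rejected here, since they have no leading digit) is an assumption, because the text does not say how such days were handled.

```python
import math


def benford_probabilities():
    """Return {d: P(d)} for d = 1..9, with P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """Leading digit of a positive integer count (hypothetical helper)."""
    if n <= 0:
        # Assumption: zero or negative daily counts carry no first digit.
        raise ValueError("first digit is undefined for non-positive counts")
    return int(str(n)[0])


probs = benford_probabilities()
for d, p in probs.items():
    print(d, round(p, 3))
```

The nine probabilities sum to one because the product of the terms (d + 1)/d for d = 1, ..., 9 telescopes to 10.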
Based on Benford's distribution, the probabilities for each number d as the first digit are presented in Table 1. The most common application of the law is in economics, where it has been considered as a tool for checking tax validity and detecting fraud (Nigrini, 1996; Durtschi, Hillison & Pacini, 2004; Tam Cho & Gaines, 2007). More recent studies have used Benford's law to investigate whether COVID-19 data provided by countries are accurate (Kilani, 2021; Silva & Figueiredo Filho, 2021; Campolieti, 2021; Koch & Okamura, 2020) and whether the deviation from Benford's distribution could be affected by developmental indexes (Balashov, Yan & Zhu, 2021).

First, in order to investigate to what extent the observed cases and deaths conform to the frequencies expected under Benford's law, two goodness-of-fit tests were applied: the chi-squared ($\chi^2$) goodness-of-fit test and the Kolmogorov-Smirnov (K-S) test. The chi-squared test statistic is given by:

$$\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i},$$

where the index $i$ is the digit, and $O_i$ and $E_i$ are the observed and expected frequencies of the $i$-th digit, respectively. The degrees of freedom for this test are equal to 8, and the critical value is $\chi^2_{0.05;8} = 15.507$ for the significance level set at $\alpha = 0.05$; thus, any value of the statistic greater than the critical value would imply significant deviation from the expected distribution.
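As an illustration of the chi-squared test described above, here is a minimal sketch using only the Python standard library. The helper names and the `toy_counts` values are made up for demonstration and are not ECDC data; the critical value 15.507 is the one quoted in the text, and skipping zero counts is an assumption.

```python
import math
from collections import Counter

CRITICAL_VALUE = 15.507  # chi-squared critical value for df = 8, alpha = 0.05 (from the text)


def first_digit_counts(values):
    """Observed first-digit frequencies O_1..O_9 (zero counts skipped by assumption)."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    return [counts.get(d, 0) for d in range(1, 10)]


def chi_squared(observed):
    """Chi-squared statistic of observed digit frequencies against Benford's law."""
    n = sum(observed)
    stat = 0.0
    for d, o in zip(range(1, 10), observed):
        e = n * math.log10(1 + 1 / d)  # expected frequency E_d
        stat += (o - e) ** 2 / e
    return stat


# Made-up daily counts, for illustration only.
toy_counts = [1, 12, 134, 150, 23, 9, 310, 47, 2, 118, 61, 0, 19]
observed = first_digit_counts(toy_counts)
stat = chi_squared(observed)
print(observed, stat, stat > CRITICAL_VALUE)
```

With so few observations the chi-squared approximation is unreliable; the toy series only shows the mechanics of counting first digits and comparing the statistic with the critical value.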
However, in large samples, the interpretation of significance should be avoided, as the test has enough power to detect even small deviations from the expected distribution (Lin, Lucas & Shmueli, 2013).

Both the chi-squared and the K-S D statistics are greatly affected by sample size; we therefore included two measures that are not affected by large sample sizes, namely the Euclidean distance (ED) in the nine-dimensional space (Tam Cho & Gaines, 2007), given by:

$$ED = \sqrt{\sum_{i=1}^{9} (o_i - e_i)^2},$$

and the Mean Absolute Distance (MAD), given by:

$$MAD = \frac{1}{9} \sum_{i=1}^{9} |o_i - e_i|,$$

where $o_i$ and $e_i$ are the observed and expected proportions of the $i$-th first digit, respectively.

The regression models were estimated with bootstrapping, creating the sampling distribution of each coefficient along with 95% bootstrap CIs (Davison & Hinkley, 1997).

The results of the goodness-of-fit tests, along with the two measures of deviation, are presented in Table 2. For almost all countries, the exceptions being Bulgaria, Croatia, and Romania, significant deviations were found for both cases and deaths. For daily cases, Denmark, Ireland, and Greece were associated with the highest chi-squared statistics, and this was also confirmed by the two distance measures (Figures 1 and 2). Regarding deaths, Cyprus, Italy, and Greece had the highest chi-squared statistics and distance measures. The K-S D statistic in most cases agreed with the chi-squared test.
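The two sample-size-free measures used in this analysis, ED and MAD, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the author's implementation: the function names are hypothetical, and `toy_observed` is a made-up vector of first-digit counts, not data from Table 2.

```python
import math

# Expected first-digit proportions under Benford's law, digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def proportions(observed_counts):
    """Convert nine first-digit counts into observed proportions."""
    n = sum(observed_counts)
    if n == 0:
        raise ValueError("no observations")
    return [c / n for c in observed_counts]


def euclidean_distance(obs_props):
    """ED: Euclidean distance between observed and expected proportions."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs_props, BENFORD)))


def mean_absolute_distance(obs_props):
    """MAD: mean of the nine absolute differences in proportions."""
    return sum(abs(o - e) for o, e in zip(obs_props, BENFORD)) / 9


# Made-up first-digit counts for digits 1..9, for illustration only.
toy_observed = [28, 20, 13, 10, 8, 7, 6, 5, 3]
small = proportions(toy_observed)
large = proportions([10 * c for c in toy_observed])  # same shape, 10x the sample

print(euclidean_distance(small), euclidean_distance(large))
print(mean_absolute_distance(small), mean_absolute_distance(large))
```

Multiplying every count by ten leaves both measures unchanged, which is the property that motivates using them alongside the chi-squared and K-S statistics.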
The bootstrap estimates and 95% bootstrap CIs of the regression analysis for the two measures of deviation are presented in Table 3. In order to avoid small coefficients, GDP per capita was log-transformed and the other three predictors were divided by 100. Regarding new cases, no predictor was found to significantly affect either MAD or ED. Vaccination rate was positively associated with deviation from Benford's distribution in new cases (0.076, 95% CI [0.020, 0.144]) and deaths, indicating that countries with a higher full vaccination percentage tend to deviate more from Benford's law.

The full vaccination rate was positively associated with non-conformity with Benford's law: countries with the highest vaccination percentage exhibited greater deviation. The results of this study imply that the deviation from Benford's law is not associated with a country's economy, an association that was suggested by earlier findings (Hollyer, Rosendorff & Vreeland, 2011). However, the effect would possibly be more apparent if developing countries were included alongside developed ones (Judge & Schechter, 2009).
Deviations from Benford's distribution are a preliminary step towards obtaining evidence of data manipulation; for the specific economies that showed the greatest deviations, further studies could validate the data reported by the authorities. Additional parameters can be included, such as lockdown restrictions, preventive measures, and regional statistics and indicators.

Funding: This study did not receive any funding.

The author declares that there is no conflict of interest.

Are you being served? Political accountability and quality of government
Using the Newcomb-Benford law to study the association between a country's COVID-19 reporting accuracy and its development
COVID-19 deaths in the USA: Benford's law and underreporting
Can we predict the occurrence of COVID-19 cases? Considerations using a simple model of growth
Bootstrap methods and their application
The effective use of Benford's law to assist in detecting fraud in accounting data
A review of COVID-19 (Coronavirus Disease-2019) diagnosis, treatments and prevention
Democracy and transparency
Detecting problems in survey data using Benford's Law
Authoritarian regimes' propensity to manipulate Covid-19 data: a statistical analysis using Benford's Law
Benford's law and COVID-19 reporting
COVID-19, flattening the curve, and Benford's law
Research commentary-too big to fail: large samples and the p-value problem
Reconsidering regime type and growth: lies, dictatorships, and statistics
Application of artificial neural networks to predict the COVID-19 outbreak
A taxpayer compliance application of Benford's law
Compartmentalized mathematical model to predict future number of active cases and deaths of COVID-19
How autocrats manipulate economic news: Evidence from Russia's state-controlled television
Using Benford's law to assess the quality of COVID-19 register data in Brazil
Breaking the (Benford) law: Statistical fraud detection in campaign finance