key: cord-0101274-lohzagrs
authors: Zhang, Junyi
title: Testing Case Number of Coronavirus Disease 2019 in China with Newcomb-Benford Law
date: 2020-02-13
sha: 4381919fdd7d653aa85f7be289649c49cc2c10a5
doc_id: 101274
cord_uid: lohzagrs

The coronavirus disease 2019, which broke out about two months ago in Wuhan, has caused the death of more than a thousand people. China is fighting hard against the epidemic with help from all over the world. On the other hand, doubts have been raised about the reported case numbers. In this article, we propose a test of the reported case numbers of coronavirus disease 2019 in China with the Newcomb-Benford law. We find a $p$-value of $92.8\%$ in favour of the hypothesis that the cumulative case numbers abide by the Newcomb-Benford law. Even though the reported case numbers can be lower than the real number of affected people for various reasons, this test does not seem to indicate the detection of fraud.

The coronavirus disease 2019 (COVID19), first observed in mid-December 2019 in Wuhan, China, has hitherto (as reported by Feb. 12, 2020) [...] The World Health Organization declared the outbreak to be a public health emergency of international concern [2]. China is facing the huge challenges of providing timely health care for patients and preventing the further spread of COVID19. Although many countries and individuals have provided various kinds of support to China for fighting the virus, hospitals, especially those in small counties outside Wuhan, are still short of biomedical equipment and other necessary supplies. A severe impact on the Chinese economy is also expected due to the outbreak of the disease. If the spread of COVID19 is not controlled effectively, given its high basic reproduction number $R_0$ and long incubation period, there might be a worse impact worldwide. At the same time, the case numbers reported by the Chinese government have been questioned, out of concern that the government is intentionally hiding the real situation.
In this article, we provide a quantitative analysis of the reported COVID19 data. A fraud test based on the Newcomb-Benford law has been implemented to examine the reported case numbers of COVID19 in China. We find a p-value of 92.8% in favour of the hypothesis that the cumulative case numbers of COVID19 in China abide by the Newcomb-Benford law. Therefore this test does not seem to indicate the detection of fraud. Nevertheless, it is not conclusive that the reported case numbers precisely reflect the current situation. The details of the hypothesis and testing are shown in Sec. II. Further discussion is presented in Sec. III.

The skewed distribution of significant digits was discovered by Newcomb [3] and later advanced by Benford [4]. They found that the frequencies of the first significant digit $d$ of the numbers in some statistics follow the distribution

$$P_{NB}(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9, \qquad (1)$$

which is now often referred to as the Newcomb-Benford law (NBL). One may refer to Ref. 5 for historical developments and applications of the NBL. Not all distributions obey the NBL. Some are known to obey the NBL exactly, e.g., Fibonacci numbers, factorials, powers of 2 and exponential growth processes [5, 8-10], while others are known not to obey it, e.g., square roots and reciprocals [11]. An interesting application is the detection of fraud in financial activities, even though no general theory proves that the NBL should hold [5-7].

[...] Province-Level Divisions" [12] by adding the numbers in chronological order. Fig. 1 shows the total cumulative case number and the cumulative case number in Hubei Province versus time from 15 Jan. 2020 to 10 Feb. 2020 in a semilog plot. For the first 15 days, the total cumulative case number increases exponentially, which is not surprising according to simple models of epidemics. The slope of the more recent total cumulative case number in the semilog plot, however, decreases. This transition around 30 Jan. seems to agree with the incubation period counting from 20-23 Jan.
when several districts in China declared the highest-level emergency status for the epidemic, particularly when Wuhan was closed. However, it is not conclusive that COVID19 has been controlled and that the total cumulative case number tends to saturate. It is also possible that the total cumulative case number is still increasing exponentially but at a smaller rate. The decrease of the rate may be due to the emergency isolation strategy and better protection (everyone wearing face masks). Nevertheless, due to the limited medical resources, many suspected cases remain to be confirmed, and people with extraordinarily long incubation periods are still potential sources of further expansion. Movement of the population after the Chinese New Year might worsen the situation and cause another explosion.

To the best knowledge of the author, no general theory proves that the statistics of epidemics like COVID19 should obey the NBL. It will be interesting to test whether any of these statistics follow the NBL. This naturally has potential applications in detecting fraud in epidemic statistics. A reasonable expectation is that the cumulative case numbers follow the NBL, as they seem to grow exponentially, particularly at the beginning stage.

We count the frequency of the first significant digit of the COVID19 case numbers according to province-level divisions. [...] Hubei Province, the spread afterwards within the division may be considered an independent development that follows the same propagation law of COVID19. The inter-province movement of the population was suppressed significantly by the Chinese New Year vacation and the isolation policy after 24 Jan. We propose the null hypothesis ($H_0$): the cumulative case numbers of COVID19 obey the NBL. Fig. 2 shows both the theoretical distribution of the NBL according to Eq. 1 (blue circles) and the empirical distribution of the first significant digit of the reported cumulative case numbers (red crosses).
At first sight, the theoretical and empirical distributions seem to agree quite well. To quantify the agreement between the empirical distribution and the hypothetical distribution, we may consider a more general hypothesis ($H_0'$) as follows. Suppose the cumulative case numbers of COVID19 follow the distribution $\Psi$ in the absence of fraud but $\Phi$ otherwise, and the probability of fraud is measured by $\tau$. Then the hypothetical distribution $\Pi$ should interpolate between the distributions with and without fraud [5], i.e.,

$$\Pi = (1 - \tau)\,\Psi + \tau\,\Phi. \qquad (2)$$

Then the hypothesis $H_0$ is nothing but the special case of $H_0'$ with $\tau = 0$, i.e., in the absence of fraud, the cumulative case numbers of COVID19 obey the NBL. One of the simplest statistics to test the agreement of $\Pi$ and $P_{NB}$ can be chosen as

$$V = \sum_{d=1}^{9} \frac{\left[N(d) - N_{tot}\,P_{NB}(d)\right]^2}{N_{tot}\,P_{NB}(d)}, \qquad (3)$$

where $N(d)$ is the count of digit $d$ as the first significant digit among the $N_{tot}$ cumulative case numbers. When $N_{tot} \to \infty$, $V$ follows the $\chi^2_\nu$ distribution, where $\nu = 9 \times 10^{k-1} - 1$ [7] with $k = 1$, i.e., $\nu = 8$. It is easy to calculate $V$ directly according to Eq. 3; we find $V = 3.10$, and the corresponding p-value is 92.8%. Therefore, we conclude that the test is in favour of the null hypothesis $H_0$ that the cumulative case numbers of COVID19 obey the NBL. If we assume that it is hard to fabricate data closely following the NBL [7], we may conclude that we did not detect fraud with this NBL test.

In our hypothesis testing, we showed that the NBL holds for the cumulative case numbers of COVID19; nevertheless, as in the case of financial activities, there is no general theory predicting the NBL [5-7]. To check the agreement with the NBL, similar tests may be applied to other epidemics such as the flus (HxNy), SARS, etc. It is worth noting that COVID19 still seems to be in its expanding stage with exponential scaling, and we know that exponential growth processes satisfy the NBL, while it remains to be tested whether the NBL is valid when epidemics are no longer in the initial exponentially growing stage.
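The χ² computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's actual code: the function names (`benford_probs`, `first_digit`, `pearson_v`, `chi2_sf_even`) are hypothetical, the sample data are powers of 2 (chosen only because they are known to follow the NBL) rather than the reported case numbers, and the p-value uses the closed-form survival function of the χ² distribution, which is valid only for an even number of degrees of freedom such as ν = 8.

```python
import math
from collections import Counter


def benford_probs():
    """NBL probabilities for first digits d = 1..9: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """First significant digit of a positive integer."""
    return int(str(n)[0])


def pearson_v(numbers):
    """Pearson statistic V comparing first-digit counts with the NBL."""
    counts = Counter(first_digit(n) for n in numbers)
    n_tot = len(numbers)
    v = 0.0
    for d, p in benford_probs().items():
        expected = n_tot * p
        v += (counts.get(d, 0) - expected) ** 2 / expected
    return v


def chi2_sf_even(x, dof):
    """Chi-square survival function; closed form valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(dof // 2))


# Illustrative data only (NOT the reported case numbers): powers of 2.
sample = [2 ** k for k in range(1, 301)]
v = pearson_v(sample)
p_value = chi2_sf_even(v, dof=8)
print(f"V = {v:.2f}, p-value = {p_value:.3f}")

# Consistency check with the values quoted in the text: V = 3.10, nu = 8.
print(f"p-value for V = 3.10: {chi2_sf_even(3.10, 8):.3f}")
```

The last line evaluates to about 0.928, in agreement with the quoted p-value of 92.8% for V = 3.10 and ν = 8. For odd ν or for production use, a library routine for the χ² distribution would be the safer choice.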
We have in total 628 data points for the statistics. This is not a big sample set. A simple χ²-test may be subject to large fluctuations. More advanced tests, like the Kolmogorov-Smirnov test or the Kuiper test, may be used instead. It seems quite reasonable to accept the null hypothesis $H_0$ that the cumulative case numbers of COVID19 obey the NBL. Taking this as an assumption, we may apply the NBL to detect fraud in epidemic statistics. Furthermore, we do not think fraud is detected in the cumulative case numbers of COVID19. Nevertheless, this does not mean that the reported cumulative case numbers precisely reflect the current situation, particularly in Hubei Province. Due to the lack of medical equipment and resources, many suspected cases remain to be confirmed; some people showing symptoms may not even be counted as suspected cases yet. Therefore the statistical data can be biased. Unfortunately, two exponential growth processes differing by a multiplicative constant both abide by the NBL. This test with the NBL therefore cannot tell whether a systematic fraction of patients is not being included properly.

The Earth is the common home of human beings and other creatures in nature. Our transportation technologies have advanced unprecedentedly over the last centuries, which also brings more and more possibilities for the quick and wide spread of epidemics. We should stand together and help each other in facing the crisis of the epidemic, regardless of race, gender or religion. Spring cannot be far behind.
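The remark above that two exponential growth processes differing by a multiplicative constant both abide by the NBL can be illustrated numerically. The sketch below is a toy example under loud assumptions: the growth rate, the number of time steps and the reporting fraction 0.5 are arbitrary illustrative choices, not estimates for COVID19, and `first_digit_freqs` is a hypothetical helper.

```python
import math
from collections import Counter


def first_digit(x):
    """First significant digit of a positive number, via scientific notation."""
    return int(f"{x:e}"[0])


def first_digit_freqs(values):
    """Relative frequency of each first digit 1..9 in `values`."""
    counts = Counter(first_digit(v) for v in values)
    return [counts.get(d, 0) / len(values) for d in range(1, 10)]


# Arbitrary illustrative parameters (assumptions, not fitted to any data).
rate = 0.07      # growth rate per time step
steps = 2000     # number of time steps
true_cases = [math.exp(rate * t) for t in range(steps)]
reported = [0.5 * c for c in true_cases]  # constant fraction reported

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
f_true = first_digit_freqs(true_cases)
f_rep = first_digit_freqs(reported)

print("d  NBL    true   reported")
for d in range(1, 10):
    print(f"{d}  {benford[d - 1]:.3f}  {f_true[d - 1]:.3f}  {f_rep[d - 1]:.3f}")
```

Both columns stay close to the NBL frequencies, which is why a constant undercounting factor would be invisible to a first-digit test of this kind.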
WHO Coronavirus disease 2019 Situation Report
Note on the frequency of use of the different digits in natural numbers
The law of anomalous numbers
Benford's Law: Theory and Applications
Breaking the (Benford) law: statistical fraud detection in campaign finance
Newcomb-Benford law and the detection of frauds in international trade
Benford's Law for Fibonacci and Lucas Numbers
An Application of Uniform Distribution to the Fibonacci Numbers
An Observation on the Significant Digits of Binomial Coefficients and Factorials
The first digit problem
Timeline of the 2019-20 Wuhan coronavirus outbreak in