key: cord-0164436-ir067a8m authors: Kazemitabar, Javad title: Double-Crossing Benford's Law date: 2021-05-20 journal: nan DOI: nan sha: 5a18458494dbaaea0d19fa6ba44d3d5ba83f3252 doc_id: 164436 cord_uid: ir067a8m

Benford's law is widely used for fraud detection. The underlying assumption for using the law is that a "regular" dataset follows the significant-digit phenomenon. In this paper, we address the scenario where a shrewd fraudster manipulates a list of numbers in such a way that it still complies with Benford's law. We develop a general family of distributions that gives such a fraudster several degrees of freedom, such as the minimum, maximum, mean and size of the manipulated dataset. The conclusion further corroborates the idea that Benford's law should be used with utmost discretion as a means of fraud detection.

It has been suggested (Kalinin and Mebane n.d.) that perfect adherence to this law could also imply manipulation, as we expect small levels of deviation from the law in regular lists of numbers. The question then arises as to whether it is possible to systematically manipulate a list such that it still complies with the law. In this paper we address this question and show that it is possible to deceive an auditor by generating a Benford-compliant list with desired statistics, such as max, min, average and size. We do so by building a distribution with tunable parameters that provide such degrees of freedom.

Hill's 1995 paper (Hill 1995) provides a statistical explanation of Benford's law. The author shows that "if probability distributions are selected at random, and random samples are then taken from each of these distributions in any way so that the overall process is scale (or base) neutral" then Benford's law holds. He then poses an open problem: "An interesting open problem is to determine which common distributions (or mixtures thereof) satisfy Benford's law".
Several researchers pursued this question and found conditions for a Benford-compliant distribution (Balanzario and Sánchez-Ortiz 2010) (Balanzario 2015) (Leemis L M and Evans 2000) (Berger and Hill 2015). They also proposed example distributions that satisfy these conditions. However, the proposed distributions do not provide the necessary degrees of freedom for a fraudster to build synthetic¹ Benford-compliant samples with desired statistics. In this paper, we provide two families of Benford-compliant distributions with tunable parameters. The mere existence of such distributions shows that Benford's law should be used carefully as a means of fraud detection.

¹ The term synthetic Benford set was first used by the celebrated author Mark Nigrini (Nigrini 2011). He provides a method based on the uniform mantissa concept to build synthetic Benford-compliant samples, where the user can designate the maximum and minimum of the generated numbers.

In (Leemis L M and Evans 2000), a few Benford-compliant distributions were proposed that are the building blocks of the distributions to be introduced in this paper.

• Example 1: Let Y ∼ U(0, 2). Then, X = 10^Y is a Benford-compliant distribution defined on (10^0, 10^2). The result can be generalized to Y ∼ U(a, b) for integers a and b.

• Example 2: Let Y ∼ Triangular(0, 1, 2). In other words, f_Y(y) = y for 0 ≤ y ≤ 1, f_Y(y) = 2 − y for 1 < y ≤ 2, and f_Y(y) = 0 otherwise. Then, X = 10^Y is a Benford-compliant distribution defined on (10^0, 10^2). The result can be generalized to symmetric triangular distributions of Y such as Triangular(a, b, c), where a, b, and c are all integers and b = (a + c)/2.

In both of the above examples, even though the maximum and minimum of the distribution (in its general form) are tunable, the average is not. To amend this shortcoming we use a lemma that was independently proven by a number of authors (Kazemitabar and Kazemitabar 2020) (Balanzario and Sánchez-Ortiz 2010) (Balanzario 2015).

Lemma: Let Y be a random variable with probability density function f_Y such that \sum_{k=-\infty}^{\infty} f_Y(y + k) = 1 for all y in [0, 1). Then, X = 10^Y is a Benford-compliant distribution.
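The two examples above can be checked empirically. The following is a minimal sketch, not part of the original paper, that draws Y, forms X = 10^Y, and compares first-digit frequencies with Benford's probabilities log10(1 + 1/d). The sample size, the seed, the helper names (`first_digit`, `max_deviation`) and the extra control case Y ∼ U(0, 1.5) are assumptions introduced only for illustration.

```python
import math
import random

# Benford's first-digit probabilities P(d) = log10(1 + 1/d), d = 1..9.
BENFORD = [math.log10(1.0 + 1.0 / d) for d in range(1, 10)]


def first_digit(x):
    """Leading decimal digit of a positive number x."""
    mantissa = 10.0 ** (math.log10(x) % 1.0)  # significand in [1, 10)
    return min(max(int(mantissa), 1), 9)      # guard against rounding at the edges


def max_deviation(ys):
    """Largest gap between the first-digit frequencies of 10**y and Benford's law."""
    counts = [0] * 9
    for y in ys:
        counts[first_digit(10.0 ** y) - 1] += 1
    return max(abs(c / len(ys) - p) for c, p in zip(counts, BENFORD))


rng = random.Random(0)  # seed fixed only for reproducibility
n = 200_000             # illustrative sample size

# Example 1: Y ~ U(0, 2).
dev_uniform = max_deviation([rng.uniform(0.0, 2.0) for _ in range(n)])
# Example 2: Y ~ Triangular(0, 1, 2); note the argument order (low, high, mode).
dev_triangular = max_deviation([rng.triangular(0.0, 2.0, 1.0) for _ in range(n)])
# Control (assumption, not from the paper): a non-integer width breaks compliance.
dev_control = max_deviation([rng.uniform(0.0, 1.5) for _ in range(n)])

print(f"uniform(0, 2):       max deviation = {dev_uniform:.4f}")
print(f"triangular(0, 1, 2): max deviation = {dev_triangular:.4f}")
print(f"uniform(0, 1.5):     max deviation = {dev_control:.4f}")
```

The first two deviations should be at the level of sampling noise, while the control case should deviate visibly, which illustrates why the endpoints in the examples are required to be integers.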
Using this lemma, we build upon these examples to introduce our tunable distributions. Concretely, we design the distributions such that the shifted versions of the density function add up to 1.

• Let
Then, X_1 = 10^{Y_1} is a Benford-compliant distribution with the following statistics:
where mean(X_1) ranges between 3.9 × 10^m and 3.9 × 10^{m+K−1} for very small and very large values of a, respectively.

[ Figure 1 about here.]

• Let
Then, X_2 = 10^{Y_2} is a Benford-compliant distribution with the following statistics:
where mean(X_2) ranges between 2.7 × 10^{m+1} and 2.7 × 10^{m+2K−1} for very small and very large values of a, respectively.

[ Figure 2 about here.]

One might wonder whether the maximum and minimum points of the above-mentioned distributions have to be powers of 10. To answer this, recall that Benford compliance is scale-invariant. As such, the generated numbers can be multiplied by a constant. Nevertheless, both proposed distributions require that the max and min be apart by an integer power of 10, that is, max/min = 10^K in the first distribution and max/min = 10^{2K} in the second.

In this section, we show how a fraudster can generate a Benford-compliant dataset. The generated data could be passed off as journal entries of a company trying to look profitable. Of course, in order to fake a journal, the fraudster needs to generate two separate datasets: one for income and the other for expenses. For each dataset we can tune the maximum and minimum as well as the number of items and the total sum. This is achieved directly by plugging the right values of m, K and a into the distributions introduced in the previous section. Moreover, we note that the total sum of the numbers in a dataset is equal to the size of that dataset multiplied by its average. Since we have control over the size and the average, we also have control over the total sum. We use inverse transform sampling (Luc 1986) to generate random samples.
In this technique, a set of uniformly generated samples is fed into F^{-1}(y), where F(y) is the cumulative distribution function of interest.

Now, suppose the hypothetical company's income and expenses total $5,700,000 and $2,310,000, respectively. Also, let us assume there are 1320 income entries in the journal ranging from $1,000 to $100,000 and 760 expense entries ranging from $100 to $100,000. Using Equation (4), we find m, K and a to be 3, 2, and 0.01177886831, respectively, for the income entries. For the expense entries, the same parameters are 2, 3, and 0.25927727232382797. While we were able to solve for a analytically in this scenario, numerical methods may be necessary in general, especially when K is large. Figure 3 shows the histograms of the income and expense entries. We then generate X = 10^Y to populate the journal entries for revenue and expense separately. The revenue samples sum to 5,556,356, which is about 97% of the requested revenue of $5,700,000. The fake expense entries sum to 2,192,381, a deviation of about 5% from the desired expense total of $2,310,000.

[ Figure 3 about here.]

We tested the generated journal entries with three popular Benford tests, namely chi-square, mantissa-arc and mean absolute deviation (MAD). The results of all three tests are shown in Table 1. As can be seen, the generated datasets comfortably pass all three tests.

Generating fake Benford-compliant datasets is easy so long as the average is not too close to either end, i.e. the minimum or maximum of the desired set. Concretely, the mean of the first proposed distribution, X_1, ranges between 3.9 times the minimum value, i.e. 10^m, and 0.39 times the maximum value, i.e. 10^{m+K}. In practical scenarios, a dataset is rarely so skewed that its average falls outside these limits.
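The parameter fitting and sampling steps described above can be sketched as follows. This is an illustration under a stated assumption, not the paper's own code: the density of Y_1 is not recoverable from the text here, so the sketch assumes Y is uniform within each unit interval [m+k, m+k+1), k = 0..K−1, with interval probabilities proportional to a^k. This form satisfies the shifted-density condition and gives a mean between roughly 3.9 × 10^m and 3.9 × 10^{m+K−1}, but the author's actual parameterization may differ. All function names (`weights`, `mean_x`, `solve_a`, `sample_x`, `benford_stats`) and the seed are hypothetical.

```python
import math
import random

BENFORD = [math.log10(1.0 + 1.0 / d) for d in range(1, 10)]


def weights(a, K):
    """ASSUMED form: probability of interval [m+k, m+k+1) is proportional to a**k."""
    raw = [a ** k for k in range(K)]
    total = sum(raw)
    return [w / total for w in raw]


def mean_x(a, m, K):
    """Mean of X = 10**Y; a uniform piece on [m+k, m+k+1) has mean 9 * 10**(m+k) / ln 10."""
    pieces = [9.0 * 10.0 ** (m + k) / math.log(10.0) for k in range(K)]
    return sum(w * mu for w, mu in zip(weights(a, K), pieces))


def solve_a(target_mean, m, K, lo=1e-9, hi=1e9):
    """Numerical solve for a by bisection on a log scale (mean_x increases with a)."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if mean_x(mid, m, K) < target_mean:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def sample_x(a, m, K, n, rng):
    """Inverse transform sampling: map uniform u through the piecewise-linear F^{-1}."""
    w = weights(a, K)
    xs = []
    for _ in range(n):
        u = rng.random()
        k, cum = 0, 0.0
        while k < K - 1 and u >= cum + w[k]:  # locate the interval containing u
            cum += w[k]
            k += 1
        y = m + k + (u - cum) / w[k]          # linear interpolation inside the interval
        xs.append(10.0 ** y)
    return xs


def benford_stats(xs):
    """First-digit chi-square statistic (8 degrees of freedom) and MAD."""
    counts = [0] * 9
    for x in xs:
        d = min(max(int(10.0 ** (math.log10(x) % 1.0)), 1), 9)
        counts[d - 1] += 1
    n = len(xs)
    chi2 = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, BENFORD))
    mad = sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / 9
    return chi2, mad


# Scenario from the text: total = size * mean, so the target mean is total / size.
a_income = solve_a(5_700_000 / 1320, m=3, K=2)
a_expense = solve_a(2_310_000 / 760, m=2, K=3)

rng = random.Random(0)  # seed fixed only for reproducibility
income = sample_x(a_income, 3, 2, 1320, rng)
expense = sample_x(a_expense, 2, 3, 760, rng)
chi2_income, mad_income = benford_stats(income)
chi2_expense, mad_expense = benford_stats(expense)

print(f"a (income)  = {a_income:.8f}, sum = {sum(income):,.0f}")
print(f"a (expense) = {a_expense:.8f}, sum = {sum(expense):,.0f}")
print(f"income:  chi2 = {chi2_income:.2f}, MAD = {mad_income:.4f}")
print(f"expense: chi2 = {chi2_expense:.2f}, MAD = {mad_expense:.4f}")
```

Under this assumed density, `solve_a` returns values close to the a reported in the text for both scenarios, which suggests the assumption is at least consistent with the paper's numbers. The realized sums of the samples fluctuate around the targets because only the expected mean is matched, which is in line with the few-percent deviations reported above.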
As such, building fake data to deceive the auditor is usually achievable, and thus the auditor should not rely solely on the Benford test.

[ Table 1 about here.]

Dataset    Chi-squared test p-value    Mantissa arc test p-value    Mean Absolute Deviation (MAD)
Revenue    0.9                         0.93                         Close conformity
Expense    0.54                        0.28                         Acceptable conformity

Table 1: Benford test results confirm compliance of fake data.

Benford's Law for Mixtures
Sufficient conditions for Benford's law
An Introduction to Benford's Law
A Statistical Derivation of the Significant-Digit Law
Using Benford's law to investigate Natural Hazard dataset homogeneity
When the Russians fake their election results, they may be giving us the statistical finger
Measuring the conformity of distributions to Benford's law
Survival Distributions Satisfying Benford's Law
Non-Uniform Random Variate Generation
A taxpayer compliance application of Benford's law
Forensic Analytics, Methods and Techniques for Forensic Accounting Investigations
National COVID numbers - Benford's law looks for errors
Benford's Law (Letters to the Editor)
Financial Statement Fraud Strategies for Detection and Investigation

X_1 = 10^{Y_1} follows Benford's

Figure 3: Histogram of the synthetic (fake) data generated based on the proposed Y_1 distribution. The actual journal entries will be populated by taking 10 to the power of these numbers. (a) Expense-related Y samples with m = 2 and K = 3. (b) Income-related Y samples with m = 3 and K = 2.