Estimation of Tail Probabilities by Repeated Augmented Reality
Benjamin Kedem and Saumyadipta Pyne
J Stat Theory Pract, published 2021-01-20. DOI: 10.1007/s42519-020-00152-1

This article is part of the topical collection "Celebrating the Centenary of Professor C. R. Rao," guest edited by Ravi Khattree, Sreenivasa Rao Jammalamadaka, and M. B. Rao.

Abstract: Synthetic data, when properly used, can enhance patterns in real data and thus provide insights into different problems. Here, the estimation of tail probabilities of rare events from a moderately large number of observations is considered. The problem is approached by a large number of augmentations, or fusions, of the real data with computer-generated synthetic samples. The tail probability of interest is approximated by subsequences created by a novel iterative process. The estimates are found to be quite precise.

The citation accompanying his U.S. National Medal of Science in 2002 honored Calyampudi Radhakrishna Rao "as a prophet of new age for his pioneering contributions to the foundations of statistical theory and multivariate statistical methodology and their applications." When Professor Rao organized the International Conference on the Future of Statistics, Practice and Education in Hyderabad (Indian School of Business, 12.29.04-01.01.05), one of us participated in it. Befitting this connection, we decided to contribute what we believe is a "futuristic" application of augmented reality in his honor.

In its February 4th, 2017 edition, The Economist noted the promise of augmented reality, claiming that "Replacing the real world with a virtual one is a neat trick. Combining the two could be more useful." We concur. Combining real data with synthetic data, i.e., augmented reality (AR), opens up new perspectives regarding statistical inference. Indeed, augmentation of real data with virtual information is an idea that has already found applications in fields such as robotics, medicine, and education.

In this article, we advance the notion of repeated augmented reality in the estimation of very small tail probabilities, even from moderately sized samples. Our approach, much like the bootstrap, is computationally intensive and would not be viable without the computing power of modern systems. However, rather than looking repeatedly inside the sample, we look repeatedly outside the sample. Fusing a given sample repeatedly with computer-generated data is referred to as repeated out of sample fusion (ROSF) in Pan et al. [1, 2]. Related ideas concerning a single fusion are studied in Fithian and Wager [3], Fokianos and Qin [4], Katzoff et al. [5], and Zhou [6].

In 1984, the so-called Watras incident drew intense media and congressional attention in the USA to the problem of residential exposure to radon, a known carcinogenic gas. Radon in the home of Stanley Watras, a construction engineer living in Boyertown, Berks County, on the Reading Prong geological formation in Pennsylvania, was recorded at almost 700 times the safe level, a lung cancer risk equivalent to smoking 250 packs of cigarettes per day. As noted by George [7], this news caused major alarm and led the US EPA to establish a radon measurement program. In this regard, the present article reviews the underpinnings of ROSF in the estimation of small tail exceedance probabilities.
We will illustrate its application using residential radon level data from Beaver County, Pennsylvania.

Consider a random variable X ~ g and a corresponding moderately large random sample X_0 = (X_1, ..., X_{n_0}) in which all the observations are smaller than a high threshold T; that is, max(X_0) < T. We wish to estimate p = P(X > T) without knowing g. As it stands, however, the sample may not contain a sufficient amount of information to tackle the problem. To gain more information, we approach the problem by combining, or fusing, the sample repeatedly with externally generated computer data: that is, by ROSF.

Let X_i denote the i-th computer-generated sample of size n_1 = n_0. The fused samples are then the augmentations

    (X_0, X_1), (X_0, X_2), (X_0, X_3), ..., (X_0, X_N),    (1)

where X_0 is a real reference sample and the X_i are different independent computer-generated samples supported on (0, U), with U > T. The number of fusions N can be as large as we wish.

From each pair (X_0, X_j), under a mild condition, we obtain (in a way made precise below) an upper bound B_j for p. Let {B_(j)} be the sequence of order statistics. The sorted pairs (j, B_(j)) then produce a monotone curve, referred to as the B-curve, which for large N contains a point "•" as in Fig. 1. As N increases, the ordinate of that point essentially coincides with p with probability approaching one. It follows that the sequence contains subsequences which approach p. The subsequences can be obtained by an iterative process described in Sect. 3.

Deferring details to later sections, it is helpful to shed light early on our iterative method, which produces estimates of tail probabilities, using reference samples X_0 from the F(2, 7) and LN(1, 1) distributions.

In the first illustration, X_0 is a random sample from F(2, 7) and T = 21.689, giving p = 0.001. Here n_0 = n_1 = 100, max(X_0) = 12.25072, and the computer-generated samples consist of independent Unif(0, 50) observations. With N = 10,000 fusions, and starting from j = 450, our iterative process (9) below produces a converging "Down" subsequence which approaches p from above. Starting from j = 210, the process produces an "Up" subsequence which converges from below in a single iteration.

In the second illustration, X_0 is a random sample from LN(1, 1) and T = 59.75377, giving p = 0.001. Here n_0 = n_1 = 200, max(X_0) = 33.63386, and the computer-generated samples consist of independent Unif(0, 80) observations. With N = 10,000 fusions, and starting from j = 800, the iterative process (9) below produces a converging "Down" subsequence which approaches p from above in a single iteration. Starting from j = 790, it produces an "Up" subsequence which converges from below in a single iteration.

Notice that the "Down-Up" convergence in both illustrations is remarkably close to the true p = 0.001. We have obtained quite a few similar results where the tail behavior differed markedly. The computation requires an important parameter called the "p-increment," which in the present examples was 0.0001. We shall deal with this numerical issue shortly.

A useful feature of the present article is the realization that we can make educated guesses as to the magnitude of p from the value of max(X_0) relative to T. This in turn suggests a set of discrete points in the interval (min(B_j), max(B_j)) at which p-estimates are evaluated, spaced by the "p-increment" mentioned above: a single number used to create the grid on which p-estimates are searched.
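As a quick check, the ingredients of the two illustrations can be reproduced in R. The following minimal sketch verifies the stated thresholds and tail probabilities and builds one fused pair as in (1); the seed is arbitrary, and the F(2, 7) draw merely plays the role of the real reference sample.

```r
## Tail probabilities behind the two illustrations in Sect. 1.
## F(2, 7): the threshold T = 21.689 should give p = P(X > T) ~ 0.001.
1 - pf(21.689, df1 = 2, df2 = 7)              # approx 0.001
## LN(1, 1): the threshold T = 59.75377 should give p ~ 0.001.
1 - plnorm(59.75377, meanlog = 1, sdlog = 1)  # approx 0.001

## One fused pair (X0, X1) as in (1): a reference sample plus an
## independent synthetic Unif(0, U) sample of the same size, U > T.
set.seed(1)
n0 <- n1 <- 100
X0 <- rf(n0, df1 = 2, df2 = 7)     # stands in for the real sample
X1 <- runif(n1, min = 0, max = 50) # synthetic sample on (0, 50)
fused <- c(X0, X1)                 # the augmented (fused) data
```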
Recall that X_0 = (X_1, ..., X_{n_0}) is a reference sample from some reference probability density function (pdf) g(x), and let G(x) denote the corresponding cumulative distribution function (CDF). Since we shall deal with radon data, we assume x ∈ (0, ∞). The goal is to estimate the small tail probability p = P(X > T) = 1 − G(T).

Let X_1 be a computer-generated random sample of size n_1, and assume X_1 ~ g_1 with CDF G_1. The augmentation of size n_0 + n_1 gives the fused data

    t = (t_1, ..., t_{n_0+n_1}) = (X_0, X_1).    (2)

We shall assume the density ratio model [8, 9]

    g_1(x) = exp(α_1 + β_1′ h(x)) g(x),    (3)

where α_1 is a scalar parameter, β_1 is an r × 1 vector parameter, and h(x) is an r × 1 vector-valued function. Clearly, to generate X_1 we must know the corresponding g_1; beyond the generating process, however, we make no use of this knowledge. Thus, in our estimation procedure none of the probability densities g, g_1 and the corresponding G, G_1, and none of the parameters α_1 and β_1, are assumed known; strictly speaking, only the so-called tilt function h must be a known function. In the present application, however, the requirement of a known h is weakened considerably by the mild assumption (4) below, which may hold even for a misspecified h, as numerous examples with many different tail types show. Accordingly, based on numerous experiments, some of which are discussed in Pan et al. [1], we assume the "gamma tilt" h(x) = (x, log x)′. Further justification for the gamma tilt is provided by our data analysis below.

Under the density ratio model (11), the maximum likelihood estimate of G(x) based on the fused data t = (X_0, X_1) is given in (14) in Appendix A.1, along with its asymptotic distribution, described in Theorem A.1. From the theorem we obtain confidence intervals for p = 1 − G(T) for any threshold T using (17). In particular, we get an upper bound B_1 for p. In the same way, from additional independent computer-generated samples X_2, X_3, ..., X_N, we get upper bounds for p from the pairs (X_0, X_2), (X_0, X_3), ..., (X_0, X_N). Thus, conditional on X_0, the sequence of upper bounds B_1, B_2, ..., B_N is an independent and identically distributed sequence of random variables from some distribution F_B. It is assumed that

    0 < F_B(p) < 1,    (4)

so that, as the number of fusions grows, p is eventually captured between the smallest and the largest of the B's.

Let B_(1), B_(2), ..., B_(N) be the order statistics from smallest to largest. Then, as N → ∞, B_(1) decreases and B_(N) increases. Hence, as mentioned before, as the number of fusions N increases, the plot consisting of the pairs

    (j, B_(j)),  j = 1, ..., N,    (5)

contains a point "•" whose ordinate is p with probability approaching 1. It follows that as N → ∞ there is a B_(j) which essentially coincides with p. The plot of the pairs (j, B_(j)) in (5) is referred to as the B-curve.
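To make the single-fusion step concrete, the following R sketch computes one upper bound B_j from a pair (X_0, X_1). It is a sketch, not the authors' implementation: it exploits the standard equivalence between the two-sample density ratio model with gamma tilt and logistic regression (cf. [9]), and it substitutes a crude binomial-type plug-in standard error for the asymptotic variance of Theorem A.1, which is not reproduced here.

```r
## One fusion -> one upper bound B_j for p = P(X > T).
## Under the density ratio model, the empirical-likelihood mass on each
## fused point is dG(t_i) = (1 - pi_i)/n0, where pi_i is the fitted
## logistic probability of membership in the synthetic sample.
one_upper_bound <- function(X0, X1, T_thr, level = 0.95) {
  t <- c(X0, X1)
  y <- c(rep(0, length(X0)), rep(1, length(X1)))
  fit <- glm(y ~ t + log(t), family = binomial) # gamma tilt h(x) = (x, log x)
  dG  <- (1 - fitted(fit)) / length(X0)         # mass on each fused point under G
  G_T <- sum(dG[t <= T_thr])                    # estimate of G(T)
  ## Crude plug-in stand-in for the variance of Theorem A.1 (assumption,
  ## not the paper's formula (15)).
  se <- sqrt(G_T * (1 - G_T) / length(X0))
  z  <- qnorm(1 - (1 - level) / 2)              # upper endpoint as in (17)
  min(1, (1 - G_T) + z * se)                    # upper bound B for p
}

## Example: reference sample from F(2, 7), synthetic Unif(0, 50) sample.
set.seed(2)
B1 <- one_upper_bound(rf(100, 2, 7), runif(100, 0, 50), T_thr = 21.689)
```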
We now make the following important observations.

a. Assumption (4) implies that, as N increases,

    B_(1) < p < B_(N)    (6)

with probability approaching one.

b. The point "•" moves down the B-curve when max(X_0) approaches T, and it moves up the B-curve when max(X_0) decreases away from T. Hence, as N increases, the size of max(X_0) relative to T provides useful information as to the approximate magnitude of p. Specifically, the first quartile of B_1, B_2, ..., B_N is a sensible guess of p when max(X_0) approaches T, and the third quartile, or even max(B), is a sensible approximation of p when max(X_0) is small. Otherwise, the mean or median of B_1, B_2, ..., B_N provides a practical guess of the approximate magnitude of p.

c. Let F̂_B be the empirical distribution obtained from the sequence of upper bounds B_1, B_2, ..., B_N. Then, by the Glivenko-Cantelli theorem, F̂_B converges to F_B uniformly almost surely as N increases. Since the number of fusions can be as large as we wish, our key observation is that F_B is known for all practical purposes. Hence, as seen from b, F_B provides information about p.

Knowing F_B is a significant consequence of repeated out of sample fusion. Its implication is that the exact distribution of any B_(j) is practically known. For a sufficiently large number of fusions N, the monotonicity of the B-curve and (6) imply that there are B_(j) which approach p from above, so that there is a B_(j) very close to p. Likewise, the B_(j) can approach p from below. Thus, the B-curve establishes a relationship between j and p.

Another relationship between j and p is obtained from the well-known distribution of order statistics,

    P(B_(j) ≤ x) = Σ_{k=j}^{N} (N choose k) F_B(x)^k (1 − F_B(x))^{N−k},    (7)

which can be computed since F_B is practically known for sufficiently large N. Iterating between these two relationships provides a way to approximate p, as described next.

From (7), we can get the smallest p_j such that

    P(B_(j) ≤ p_j) ≥ 0.95.    (8)

The 0.95 probability bound was chosen arbitrarily and can be replaced by other high probabilities. It is important to note that in practice, and in what follows, the p_j in (8) are evaluated on a grid, incrementally along specified small increments. Thus, with B_(j_k)'s from the B-curve, p_{j_k}'s the smallest p's satisfying (8) with j = j_k, and B_(j_{k+1}) closest to p_{j_k}, k = 1, 2, ..., we have an iterative process [1, 10] in which each B_(j_k) yields p_{j_k}, and p_{j_k} in turn yields the next order statistic B_(j_{k+1}), so that at convergence p_{j_k} keeps giving the same B_(j_{k+1}) (and hence the same j_{k+1}) and vice versa. This can be expressed more succinctly as

    B_(j_1) → p_{j_1} → B_(j_2) → p_{j_2} → B_(j_3) → ⋯    (9)

In general, starting with any j, convergence occurs when for the first time B_(j_k) = B_(j_{k+1}) for some k, after which we keep getting the same probability p_{j_k}. Clearly, the p_{j_k} sequence can decrease or increase, producing "down" and "up" subsequences.

For example, suppose that the probabilities in (8) are sufficiently high and that from the B-curve we get the closest approximations B_(j_{k+1}) of the p_{j_k}. Then, with high probability, we get a decreasing "down" sequence. However, when the probabilities are sufficiently low, it is possible for the closest B_(j) approximations of the p_j to reverse course, leading to an increasing "up" sequence. This "down-up" tendency has been observed numerous times with real and artificial data, and it manifests itself clearly in the radon examples below. In particular, as was illustrated earlier in Sect. 1.3, the "down-up" phenomenon tends to occur in a neighborhood of the true p, where a transition or shift occurs from "down" to "up" or vice versa, resulting in a "capture" of p. This is summarized in the following proposition.

Proposition. Assume that the sample size n_0 of X_0 is large enough, and that the number of fusions N is sufficiently large, so that B_(1) < p < B_(N). Consider the smallest p_j ∈ (0, 1) which satisfy inequality (8), where the p_j are evaluated along appropriate numerical increments. Then, (8) produces "down" and "up" sequences depending on the B_(j) relative to the p_j. In particular, in a neighborhood of the true tail probability p, with high probability, there are "down" sequences which converge from above and "up" sequences which converge from below to points close to p.
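The iterative process (9) can be sketched in R as follows. The sketch assumes one particular reading of (8), namely that p_j is the smallest grid point at which the order-statistic probability P(B_(j) ≤ p_j), computed from (7) with the empirical F̂_B, reaches 0.95; the function name and the convergence cap are ours.

```r
## Iterative "down-up" search of process (9).
## B: vector of upper bounds; j: starting order-statistic index;
## p_inc: the p-increment defining the search grid.
iterate_p <- function(B, j, p_inc, prob = 0.95, max_iter = 100) {
  B  <- sort(B)
  N  <- length(B)
  FB <- ecdf(B)                              # empirical F_B
  grid <- seq(p_inc, max(B), by = p_inc)     # the p-increment grid
  p_j <- NA
  for (k in seq_len(max_iter)) {
    ## (7): P(B_(j) <= p) = 1 - pbinom(j - 1, N, F_B(p)) at each grid point
    ok <- (1 - pbinom(j - 1, N, FB(grid))) >= prob
    if (!any(ok)) return(NA)                 # no grid point satisfies (8)
    p_j    <- grid[which(ok)[1]]             # smallest qualifying p_j
    j_next <- which.min(abs(B - p_j))        # B_(j') closest to p_j
    if (j_next == j) break                   # B_(j_k) = B_(j_{k+1}): converged
    j <- j_next                              # a "Down" or "Up" move in j
  }
  p_j
}
```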
We shall now demonstrate the proposition using radon data examples; many additional examples were given in Pan et al. [1]. All the examples point to a remarkable "down-up" pattern in a neighborhood of the true p, providing surprisingly precise estimates of p. It should be noted that the number of iterations decreases as the B_(j) approach p, a telltale sign that convergence is about to occur. The iterative process (9) is repeated with different starting j's until a clear pattern emerges in which different adjacent j's give rise to Down-Up subsequences that converge to the same value, that common value being our estimate p̂. The process may be repeated with different p-increments.

To enable computation with R, the binomial coefficients in (8) were evaluated with N = 1000, as if there were 1000 fusions only. However, there are no restrictions on the number of fusions, and F_B was obtained throughout from 10,000 fusions, and hence 10,000 B's. Each entry in the following tables was obtained from a different sample of 1000 B's drawn at random from the 10,000 B's. More precisely, for each entry we iterated between an approximate B-curve, constructed from the sampled 1000 B's, and an approximate version of (8) with N = 1000.

An important consideration is the choice of the increments of p along which the probability in (8) is evaluated. Certainly, any approximation of p must reside between consecutive B's. Hence, sensible p-increments are fractions of the mean, the median, the first or third quartile, or even of max(B) = B_(10,000). In the following examples, the p-increments are of an order of magnitude approximately equal to one tenth of one of these quantities.

Radon-222, or simply radon, is a tasteless, colorless, and odorless radioactive gas, a decay product of Uranium-238 and Radium-226, both of which are naturally abundant in the soil. Radon is known around the world as a carcinogen, and exposure to it is the leading risk factor for lung cancer among non-smokers. Geological radon exposure takes place mostly through cracks and openings in the ground due to underlying geological formations. Approximately 40 percent of Pennsylvania (PA) homes have radon levels above the US EPA action guideline of 4 pCi/L. Residential radon test levels were collected statewide by the PA Department of Environmental Protection (PADEP) in the period from 1990 to 2007. See Zhang et al. [11] for a study of indoor radon concentrations in Beaver County and its neighboring counties in PA.

In the following examples, ROSF is applied to Beaver County radon data from 1989 to 2017, for various p-increments. There were 7425 radon observations, taken as a population, of which only 2 exceed 200. Hence, with T = 200, we wish to estimate the small probability p = 2/7425 = 0.0002693603. Throughout the examples, X_0 is a reference random sample chosen without replacement from the 7425 radon observations. The generated X_1 samples are from Unif(0, 300), and n_0 = n_1 = 500. In the tables below, "Down," "Up," and "No j change" mean that in the iterative process (9) there was a downward, upward, or no change in j, respectively.
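Putting the pieces together, a hypothetical ROSF driver for the radon analysis might look as follows, reusing one_upper_bound() and iterate_p() from the sketches above. The vector radon holding the 7425 measurements is an assumed input, and the starting index j is arbitrary.

```r
## ROSF driver for the Beaver County analysis (sketch; `radon` is a
## hypothetical numeric vector of the 7425 measurements).
set.seed(2021)
T_thr <- 200                     # avoid `T`, which aliases TRUE in R
n0 <- n1 <- 500
N_fusions <- 10000
X0 <- sample(radon, n0)          # reference sample, without replacement
B  <- replicate(N_fusions,
                one_upper_bound(X0, runif(n1, 0, 300), T_thr))

## Educated guess at the p-increment from max(X0) relative to T:
## a "middle" point suggests about one tenth of the median (or mean) of the B's.
p_inc <- median(B) / 10

## As in the text, each table entry iterates on 1000 B's drawn from the 10,000.
B1000 <- sample(B, 1000)
iterate_p(B1000, j = 450, p_inc = p_inc)   # starting j is arbitrary
```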
Figure 1 shows how the "•" moves along the B-curve as a function of the size of max(X_0) relative to T. The figure should be referred to when reading the following examples.

Example 1: max(X_0) = 107. Since 107 is close to T/2, the "•" point is in the "middle" of the B-curve, far removed from both ends. Hence, we use the p-increment Median/10 ≈ 0.000018. We observed that the third quartile was 0.0002686, very close to the true p. From Table 1, the shift from Down to Up occurs at p̂ = 0.0002689389, very close to the true p = 0.0002693603, giving an error of the order of 10^{-7}.

Example 2: max(X_0) = 123.1. A different reference sample X_0 was again fused 10,000 times with different independent X_1 ~ Unif(0, 300) samples. Since max(X_0) = 123.1, we again have, relative to T = 200, a "middle" "•" point, suggesting a p-increment of one tenth of the mean of the B's. As the order of the mean was 10^{-4}, we chose the p-increment 0.00002, which is of the same order as Mean(B)/10. From Table 2, the shift from Down to Up occurs at p̂ = 0.0002601254, not far from p = 0.0002693603, giving an error of the order of 10^{-5}.

Example 3: max(X_0) = 193.7. A different reference sample X_0 was again fused 10,000 times with different independent X_1 ~ Unif(0, 300) samples. Since max(X_0) = 193.7, we have, relative to T = 200, a "•" point close to the lower end of the B-curve, suggesting a p-increment on the order of one tenth of the first quartile of the 10,000 B's. As the first quartile was 0.0002697, we chose a p-increment of 0.00001; a p-increment of 0.00002 gave identical results. We observe that the first quartile is very close to p. From Table 3, the shift from Down to Up occurs at p̂ = 0.0002600818, not far from p = 0.0002693603, giving an error of the order of 10^{-5}.

Example 4: max(X_0) = 77.9. A different reference sample X_0 was again fused 10,000 times with different independent X_1 ~ Unif(0, 300) samples. Since max(X_0) = 77.9, we have, relative to T = 200, a "•" point close to the upper end of the B-curve, a difficult case, suggesting a p-increment on the order of one tenth of max(B) from the 10,000 B's. As max(B) = 0.0004583, we chose a p-increment of 0.00004583. From Table 4, the shift from Down to Up occurs at p̂ = 0.0002286204, not far from p = 0.0002693603, giving an error of the order of 10^{-5}.

Table 5 provides our estimates of p = 0.0002693603 from various random radon samples X_0 of size n_0 = 500 fused repeatedly with independent X_1 ~ Unif(0, 300) samples.

There are numerous situations where the interest is in the prediction of an observable exceeding a large, or even catastrophically large, threshold level, while the data at hand fall short of the threshold. For example, consider the daily rainfall amount in a region where all the diurnal amounts fall short of a high threshold level, say 10 inches in 24 hours, and yet, for risk management, it is important to obtain the chance that a future amount exceeds 10 inches in 24 hours, an extreme situation by all accounts. Similar problems concern annual flood levels, daily coronavirus counts, monthly insurance claims, earthquake magnitudes, and so on, where the sample values are below certain high thresholds and the interest is in very small tail probabilities. Furthermore, in many cases the given data may be only moderately large.

In this paper, it has been shown how to approach such problems by a large number of augmentations, or fusions, of the given data with computer-generated external samples. From this we obtained a curve, called the B-curve, containing a point whose ordinate is close to the tail probability of interest. Moreover, the magnitude of the largest sample value relative to a given high threshold provided rough guesses as to the true value of the tail probability. These rough guesses were needed for successful applications of our iterative method, which produced accurate estimates of tail probabilities.
Indeed, as illustrated in the paper, max(X_0) relative to T provides useful information about the true tail probability p, represented as "the point," and this fact can be interpreted in terms of under-specification and over-specification of the tail probability under the density ratio model. This is clearly a consequence of the fact that F_B provides information about p. The large number of fusions resulted in a large number of upper bounds B_1, ..., B_N for a tail probability p, from some unknown CDF F_B(x), where it was assumed that 0 < F_B(p) < 1. The examples in this paper, and many more in Pan et al. [1], indicate that the choice of the (mostly misspecified) tilt function h(x) = (x, log x) in the density ratio model did not go against that assumption. Clearly, other tilts are possible as long as F_B(p) is bounded away from 0 and 1.

The estimation of very small tail probabilities can also be approached by extreme value methods. A well-known method is referred to as peaks-over-threshold [12, 13], where, as the name suggests, only values above a sufficiently high threshold are used. However, if the sample is not large to begin with, discarding the values deemed not sufficiently large reduces the sample size further and calls into question the applicability of the method. A comparison with ROSF is given in Wang [10] and in Pan et al. [1].

The estimation of tail probabilities from fused residential radon data has been studied recently in Zhang et al. [11, 14] using the density ratio model with variable tilts. There, a given radon sample from a county of interest was fused with radon samples from neighboring counties.

References

[1] Pan et al. Estimation of small tail probabilities by repeated fusion.
[2] Pan et al. Interval estimation of small tail probabilities: application in food safety.
[3] Fithian, Wager. Semiparametric exponential families for heavy-tailed data.
[4] Fokianos, Qin. A note on Monte Carlo maximization by the density ratio model.
[5] Katzoff et al. Out of sample fusion in risk prediction.
[6] Zhou. Out of sample fusion.
[7] George. The history, development and the present status of the radon measurement programme in the United States of America.
[8] Lu. Asymptotic theory for multiple-sample semiparametric density ratio models and its application to mortality forecasting.
[9] Qin, Zhang. A goodness of fit test for logistic regression models based on case-control data.
[10] Wang. Data fusion based on the density ratio model.
[11] Zhang et al. Estimation of residential radon concentration in Pennsylvania counties by data fusion.
[12] Statistics of extremes: theory and applications.
[13] On the block maxima method in extreme value theory: PWM estimators.
[14] Zhang et al. Model selection in radon data fusion.
[15] Owen. Empirical likelihood.
[16] Zhang. A goodness of fit test for multiplicative-intercept risk models based on case-control data.

Conflict of interest: The authors declare that there is no conflict of interest.

Appendix A.1

The appendix addresses the density ratio model (11) for m + 1 data sources. Thus, we deal with the density ratio model more generally, where X_0 is fused with m computer-generated samples; above we dealt with the special case m = 1. Assume that the reference random sample X_0 of size n_0 follows an unknown reference distribution with probability density g, and let G be the corresponding cumulative distribution function (CDF). Let X_1, ..., X_m be additional computer-generated random samples, where X_j ~ g_j, G_j, with size n_j, j = 1, ..., m. The augmentation of the m + 1 samples, of size n = n_0 + n_1 + ⋯ + n_m, gives the fused data t = (t_1, ..., t_n).
The density ratio model stipulates that

    g_j(x) = exp(α_j + β_j′ h(x)) g(x),  j = 1, ..., m,    (11)

where β_j is an r × 1 parameter vector, α_j is a scalar parameter, and h(x) is an r × 1 vector-valued distortion or tilt function. None of the probability densities g, g_1, ..., g_m and the corresponding G_j's, and none of the parameters α_j and β_j, are assumed known; strictly speaking, only the so-called tilt function h must be a known function.

Maximum likelihood estimates of all the parameters and of G(x) can be obtained by maximizing the empirical likelihood over the class of step cumulative distribution functions with jumps at the observed values t_1, ..., t_n [15]. Let p_i = dG(t_i) be the mass at t_i, for i = 1, ..., n. Then the empirical likelihood becomes

    L(α, β, G) = ∏_{i=1}^{n} p_i ∏_{j=1}^{m} ∏_{i=1}^{n_j} exp(α_j + β_j′ h(x_{ji})),    (12)

where x_{ji} denotes the i-th observation in the j-th computer-generated sample. Maximizing L(α, β, G) subject to the constraints

    p_i ≥ 0,  Σ_{i=1}^{n} p_i = 1,  Σ_{i=1}^{n} p_i [exp(α_j + β_j′ h(t_i)) − 1] = 0,  j = 1, ..., m,    (13)

we obtain the desired estimates. In particular,

    Ĝ(t) = Σ_{i=1}^{n} p̂_i I(t_i ≤ t),  p̂_i = 1 / ( n_0 [ 1 + Σ_{j=1}^{m} ρ_j exp(α̂_j + β̂_j′ h(t_i)) ] ),    (14)

where ρ_j = n_j/n_0 and I(t_i ≤ t) equals one for t_i ≤ t and zero otherwise. Similarly, Ĝ_j is estimated by summing exp(α̂_j + β̂_j′ h(t_i)) dĜ(t_i). The asymptotic properties of the estimators have been studied by a number of authors, including Qin and Zhang [9], Lu [8], and Zhang [16]. The asymptotic distribution of Ĝ(t) for m ≥ 1 is given by the following result due to Lu [8].

Theorem A.1 [8]. Assume that the sample size ratios ρ_j = n_j/n_0 are positive and finite and remain fixed as the total sample size n = Σ_{j=0}^{m} n_j → ∞. Then the process √n(Ĝ(t) − G(t)) converges to a zero-mean Gaussian process in the space of real right-continuous functions with left limits, with covariance function (15); the explicit expression of (15), which involves the p × p identity matrix I_p and the Kronecker product ⊗, is given in Lu [8]. For a complete proof, see Lu [8]; the proof for m = 1 is given in Zhang [16].

Denote by V̂(t) the estimated variance of Ĝ(t) as given in (15). Replacing parameters by their estimates, a 1 − α level pointwise confidence interval for G(t) is approximated by

    Ĝ(t) ± z_{α/2} √( V̂(t) / n ),    (16)

where z_{α/2} is the upper α/2 point of the standard normal distribution. Hence, a 1 − α level pointwise confidence interval for 1 − G(T), for any T, and in particular for relatively large thresholds T, is approximated by

    (1 − Ĝ(T)) ± z_{α/2} √( V̂(T) / n ).    (17)
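For the m = 1 case used in the body of the paper, the estimator (14) and the interval (17) can be sketched in R as follows. As before, the logistic-regression device replaces the empirical-likelihood maximization, and the standard error is a crude plug-in stand-in for (15), not Lu's expression; the function name is ours.

```r
## m = 1 special case of (14) with the pointwise interval (17).
drm_tail_ci <- function(X0, X1, T_thr, alpha = 0.05) {
  t <- c(X0, X1)
  y <- c(rep(0, length(X0)), rep(1, length(X1)))
  fit <- glm(y ~ t + log(t), family = binomial) # gamma tilt h(x) = (x, log x)
  dG  <- (1 - fitted(fit)) / length(X0)         # the masses p_i-hat of (14)
  G_T <- sum(dG[t <= T_thr])                    # G-hat(T) from (14)
  se  <- sqrt(G_T * (1 - G_T) / length(X0))     # stand-in for sqrt(V-hat(T)/n)
  z   <- qnorm(1 - alpha / 2)
  c(lower    = max(0, 1 - G_T - z * se),        # interval (17) for 1 - G(T)
    estimate = 1 - G_T,
    upper    = min(1, 1 - G_T + z * se))
}
```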