Nearest neighbor ratio imputation with incomplete multinomial outcome in survey sampling
Chenyin Gao, Katherine Jenny Thompson, Shu Yang, Jae Kwang Kim
Date: 2022-02-23. DOI: 10.1111/rssa.12841

Abstract

Nonresponse is a common problem in survey sampling, and appropriate treatment can be challenging, especially when dealing with detailed breakdowns of totals. Often, the nearest neighbor imputation method is used to handle such incomplete multinomial data. In this article, we investigate the nearest neighbor ratio imputation estimator, in which auxiliary variables are used to identify the closest donor and the vector of proportions from the donor is applied to the total of the recipient to implement ratio imputation. To estimate the asymptotic variance, we first treat nearest neighbor ratio imputation as a special case of predictive mean matching imputation and apply the linearization method of Yang and Kim (2020). To account for non-negligible sampling fractions, parametric and generalized additive models are employed to incorporate the smoothness of the imputation estimator, which results in a valid variance estimator. We apply the proposed method to estimate expenditure detail items based on empirical data from the 2018 collection of the Service Annual Survey, conducted by the United States Census Bureau. Our simulation results demonstrate the validity of our proposed estimators and also confirm that the derived variance estimators perform well even when the sampling fraction is non-negligible.

1 Introduction

Sample surveys are often designed to estimate totals (e.g., revenue, earnings). However, in addition to collecting this information, many surveys request and produce sets of compositional variables (details) that sum to a total, such as a breakdown of total expenditures by type of expenditure or a breakdown of total income by source. Examples from the United States Census Bureau include the 2020 Current Population Survey's Annual Social and Economic (ASEC) Supplement, which defines total household earnings as the sum of total wages and salaries, farm income, and self-employment income, and the 2018 Service Annual Survey (SAS), which requests a detailed breakdown of each sampled business' reported total expenditures into expenditures on annual payroll, fringe benefits for employees, and expenditures on software, among other categories. This paper is concerned with the missing data treatment for these sets of detail items.

In contrast to collected totals, reliable auxiliary data are rarely available for all requested detail items. Furthermore, there are often true zeros that differ across units. These limitations make it difficult to develop feasible parametric imputation models for each individual detail item and motivate the use of hot deck imputation instead. Hot deck imputation, sometimes called "donor imputation," obtains replacement values for nonresponding (or missing) data items by matching a donor record containing valid data to a recipient record containing invalid or missing data, then imputing the missing values from the donor record (Andridge and Little, 2010; Beaumont and Bocci, 2009). Kalton and Kasprzyk (1982) recommend using the same donor to impute sets of compositional details to preserve inter-item relationships in the multinomial data.
To ensure additivity as well, it is sensible to use some form of hot deck imputation that imputes the vector of proportions from a matched respondent (donor) and then derives the imputed value for each item by multiplying the (donated) proportion by the nonresponding recipient's total (Bankier et al., 2000; Little et al., 2008; Andridge and Thompson, 2015). If the proportions associated with each detail item within an imputation class appear to be approximately the same for each unit, then the donor can be selected at random within imputation cells. However, it is possible that the multinomial distribution of the details is related to unit size. In this case, the unit size should be incorporated into the hot deck matching procedure. The simplest version selects the "nearest" donor using a suitable distance function on unit size or, as in this case, on the total, assumed available for all donors and recipients. Since the total is univariate, the absolute value of the difference is a natural distance function, yielding estimates that are asymptotically unbiased (Yang and Kim, 2020). Hereafter, we refer to the imputation process that selects the nearest neighbor as the donor and imputes the set of donor proportions as the nearest neighbor ratio imputation (NNRI) method.

Consider the Service Annual Survey (SAS), the subject of the empirical application presented in Section 4. This program collects aggregate and detailed revenues and expenses from a stratified sample of business firms with paid employees in selected industries in the services sector. Given a lack of auxiliary data, weak historic reporting, and a high reported zero rate, developing good predictive models for each individual detail item collected by the SAS is perhaps infeasible. That said, there are verifiable predictors of the set of detail items, i.e., of the multinomial distribution: specifically, the industry in which the firm is classified, the tax-exempt status of the firm, and the size of the firm as measured by total revenue or total expenses. For example, a finance business will often report high proportions of total expenses in all three personnel cost categories, a scientific business is unlikely to report costs from temporary or leased employees, and a full-service restaurant would likely report most of its expenses from gross payroll and expensed equipment, materials, costs, or supplies. Consequently, imputation classes in the SAS are defined by industry code and tax-exempt status. As regards unit size, larger businesses are more likely to expense costs and track depreciation than smaller businesses in the same industry. Accordingly, one would expect zero or nearly zero proportions of costs and depreciation values from smaller businesses, and positively valued proportions for the same component items from the larger businesses. In turn, the proportion of total expenses represented by gross annual payroll tends to decrease as unit size increases. In practice, it is often extremely difficult, if not impossible, to delineate exactly where the changes in multinomial distributions occur. However, the nearest neighbor donor selection procedure implicitly accounts for these subtle shifts.

The primary purpose of imputation is to "fill in the blanks with plausible (i.e., realistic) and consistent values" (Sande, 1982). From a bias reduction perspective, there are advantages in preserving reported and previously imputed totals, as is done with NNRI.
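To make the NNRI mechanics concrete, the following is a minimal illustrative sketch (in Python) for a single imputation cell. It assumes the total is observed for every unit and uses the absolute difference on the total as the distance function; the function and variable names are ours and are not taken from any production implementation.

```python
import numpy as np

def nnri_impute(x, y, respondent):
    """Minimal nearest neighbor ratio imputation (NNRI) sketch for one cell.

    x:          (n,) totals, observed for every sampled unit
    y:          (n, T) detail items, valid only where respondent is True
    respondent: (n,) boolean, True if the unit reported a usable detail set
    Returns a copy of y with every nonrespondent row imputed.
    """
    x = np.asarray(x, dtype=float)
    y_imp = np.array(y, dtype=float, copy=True)
    donors = np.flatnonzero(respondent)
    # Donor proportions R_i = y_i / x_i (each row sums to 1 when the donor's
    # details add to its total).
    ratios = y_imp[donors] / x[donors, None]
    for j in np.flatnonzero(~respondent):
        # Nearest neighbor on the total, using the absolute difference.
        nearest = donors[np.argmin(np.abs(x[donors] - x[j]))]
        # Apply the donor's proportions to the recipient's own total,
        # so the imputed details add exactly to x_j.
        y_imp[j] = x[j] * ratios[donors == nearest][0]
    return y_imp

# Toy example: 5 units, 3 detail items, units 1 and 3 are nonrespondents.
x = np.array([10.0, 12.0, 50.0, 55.0, 100.0])
y = np.array([[6, 3, 1], [0, 0, 0], [30, 15, 5], [0, 0, 0], [55, 30, 15]], float)
resp = np.array([True, False, True, False, True])
print(nnri_impute(x, y, resp))
```

Because each recipient receives an entire vector of donor proportions scaled by its own total, the imputed details automatically add to the recipient's total.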
Auxiliary data are often available for direct substitution of missing or invalid totals (Beaumont et al., 2011) or for modeling. Consequently, reported totals are generally validated (edited) early in the editing process and, when necessary, imputed with very strong models. Not only are such totals made available for all units for subsequent hot deck imputation, they are usually "gold-plated" against further changes (Sigman and Wagner, 1997). Fundamentally, totals are considered to be more reliably reported than sets of compositional details in the survey methodology literature, especially when the totals have accounting or financial record-keeping definitions (e.g., income, total sales, total expenses). In contrast, the queried sets of compositional details are generally collected for statistical reporting purposes, not accounting purposes, and are therefore not readily available (Willimack and Snijkers, 2013). Indeed, it is likely that the reported percentage distributions are more reliably reported than the values themselves. For example, when interviewing a non-probability sample of large businesses, Willimack and Nichols (2010) learned that "company reporters resorted (sic) to estimation strategies rather than leaving items unreported (i.e., blank). Moreover, they only used estimation schemes when company data did not include the type of detail requested on the report." This dovetails with the findings presented in Andridge and Thompson (2015) via a proxy pattern-mixture model analysis of selected items collected by the SAS: when annual payroll was used as the single predictor of total sales in a ratio imputation model, the fraction of missing information (FMI) values were close to zero, indicative of extremely accurate imputation models, whereas the FMI values for collected detail items using total sales as the sole predictor were near one (the maximum value).

Finally, deriving hot deck imputed values by multiplying the recipient's total by the corresponding nearest-neighbor ratio is frequently used in business surveys (Beaumont and Bocci, 2009). In such positively skewed distributions, donating a ratio instead of a total guards against substitution of an overly large or small value from the "nearest neighbor" respondent. Of course, this phenomenon is not confined to business surveys. For example, the 2019 ASEC reports the U.S. median income as $68,703 (± $904); the 95th percentile is $270,002 (± $4,831). See https://www.census.gov/content/dam/Census/library/publications/2020/demo/p60-270.pdf.

Estimation from nearest neighbor ratio imputed data is straightforward. Variance estimation is less so, in part because the donor selection procedure is deterministic. Variance estimation is further complicated in our setting by the potential shift in multinomial distributions described above. Andridge, Bechtel, and Thompson (2021) investigate multiple imputation of proportions with nearest neighbor ratio hot deck imputation using the Approximate Bayesian Bootstrap, finding consistent underestimation of the variance. As in the cited reference, we assume that the set of details is either reported or missing; in practice, detail components that do not sum to their associated total and are not within a small raking tolerance are often treated as missing, and all detail items are imputed, regardless of their original reporting status.
Statistical inference under nearest neighbor imputation in survey sampling has been discussed by Chen and Shao (2001), Shao and Wang (2008), Kim et al. (2011), and Yang and Kim (2019), among others. In this paper, we discuss statistical inference under NNRI in survey sampling. To make statistical inference, we derive the asymptotic linearization of the NNRI estimator, which allows us to decompose the asymptotic variance of the NNRI estimator into two components accounting for sampling and matching. Based on the asymptotic variance formula, we propose an alternative variance estimator by approximating these components with parametric or nonparametric approaches. We establish theoretical guarantees for the plug-in variance estimator. The proposed variance estimator is easily applied to a variety of probability sampling designs and accounts for non-negligible sampling fractions.

The rest of the paper is organized as follows. Section 2 introduces the basic setup and the NNRI procedure in detail, including asymptotic properties. Section 3 derives variance estimation for the NNRI estimator. In Section 4, we apply the proposed variance estimator to empirical expenditures data for reference year 2018 from a subset of industries surveyed in the SAS. The studied data are typical of many business surveys, in that many units report a total but may not provide the associated detail items, especially when the item definitions are complex or the number of requested detail items is large (Willimack et al., 2000). The SAS uses a stratified simple random sample without replacement (SRS-WOR) design with high sampling rates in several strata. The empirical application highlights the differences between our proposed variance estimators and a naive variance estimator but does not provide insight into their statistical properties. Consequently, Section 5 investigates the finite-sample performance of the proposed estimators over repeated stratified SRS-WOR samples via a simulation study patterned after Andridge et al. (2021). We close in Section 6 with some general observations along with directions for future research.

2 Basic Setup

Let $y_i = (y_{i1}, \ldots, y_{iT})'$ be the study variable of interest and $x_i$ be the auxiliary variable. We assume that $x_i$ is observed throughout the sample but $y_i$ is observed only for a subset of the sample. Let $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ otherwise. Let $I_i$ be the sampling indicator, and let the finite population $\mathcal{F}_N = \{(x_i, y_i, \delta_i) : i = 1, \ldots, N\}$ be a random sample from a superpopulation model $\zeta$ with known $N$. We make the following assumption for the missing data process.

Assumption 1 (Missing at random and positivity) (i) The response indicator $\delta_i$ satisfies $P(\delta_i = 1 \mid x_i, y_i) = P(\delta_i = 1 \mid x_i)$, which we denote by $\pi(x_i)$, and (ii) $\pi(x_i) > \epsilon$ for a constant $\epsilon > 0$ with probability 1.

Assumption 1 (i) states that the response indicator depends only on the observed $x$ but not on the outcome value $y$. Essentially, it assumes that the covariates contain all the information about the outcome that affects the probability of response, i.e., the outcome is missing at random at the population level (Rubin, 1976; Berg et al., 2016). Assumption 1 (ii) indicates that all sampled units have a positive probability of responding given any possible value of $x$, in turn implying that the support of the respondents and the nonrespondents is the same. This assumption guarantees that all donor values are plausible. If Assumption 1 (ii) is violated, our proposed estimator in Section 2.2 would no longer be asymptotically unbiased.
Throughout, we assume that the strong ignorability condition in Assumption 1 holds.

To motivate the NNRI estimator, we first consider the full response case using the Horvitz-Thompson (HT) estimator. Under full response, we can use $\hat{T}_y = \sum_{i \in S} w_i y_i$ to estimate $T_y = \sum_{i=1}^{N} y_i$, the population total of $y_i$, where $w_i$ is the sampling (design) weight computed as the inverse of the sampling inclusion probability and $S$ is the index set of the sample with $|S| = n$. Let $E_p$ and $\mathrm{var}_p$ be the expectation and variance with respect to the sampling mechanism; that is, $E_p(\cdot) = E(\cdot \mid \mathcal{F}_N)$ and $\mathrm{var}_p(\cdot) = \mathrm{var}(\cdot \mid \mathcal{F}_N)$. We assume a sequence of finite populations and samples in order to investigate the asymptotic properties, as defined in Fuller (2009).

Assumption 2 The HT estimator of a population total, $\hat{T}_y = \sum_{i \in S} w_i y_i$, satisfies $\mathrm{var}_p(\hat{T}_y) = O(n^{-1} N^2)$ and $\{\mathrm{var}_p(\hat{T}_y)\}^{-1/2} (\hat{T}_y - T_y) \mid \mathcal{F}_N \to N(0, 1)$ in distribution.

Assumption 2 is widely accepted in survey sampling to allow for valid inferential conclusions via asymptotic normality.

Nearest neighbor ratio imputation (NNRI) matches a donor to a recipient (nonrespondent), then multiplies the recipient's (available) total by the donated ratio, under the following ratio model, assumed true for all $i \in S$:
$$y_i = m(x_i) + e_i, \qquad (1)$$
where $E_\zeta(e_i \mid x_i) = 0$ and $m(x_i) = x_i R(x_i)$ for some smooth function $R(\cdot)$. Let $y_i^*$ be the imputed value of $y_i$ using NNRI,
$$y_i^* = x_i R_{i(1)},$$
where the ratio $R_i = (y_{i1}, \ldots, y_{iT})'/x_i$ is only available from the responding units (donors), and $i(1)$ is the index of the nearest neighbor of unit $i$ within the same imputation cell, i.e., the nearest neighbor of $i$ satisfies
$$D(x_{i(1)}, x_i) \le D(x_j, x_i)$$
for all $j$ in the subsample of respondents, where $D(\cdot, \cdot)$ is a suitable distance function (in this application, the absolute difference). Then, the imputed estimator of $T_y$ is given by
$$\hat{T}_y^{\mathrm{NNRI}} = \sum_{i \in S} w_i \{\delta_i y_i + (1 - \delta_i) y_i^*\}. \qquad (2)$$
The goal is to estimate the variance of the imputed estimator in (2). If we define $d_{ij} = 1$ if unit $i$ is used as the donor for unit $j$, and $d_{ij} = 0$ otherwise, then we can express $y_j^* = x_j \sum_{i \in S} \delta_i d_{ij}\, y_i / x_i$. Thus, the imputed estimator in (2) can be written as
$$\hat{T}_y^{\mathrm{NNRI}} = \sum_{i \in S} w_i \delta_i (1 + \kappa_i)\, y_i, \qquad (3)$$
where $\kappa_i = \sum_{j \in S} \{w_j x_j / (w_i x_i)\} (1 - \delta_j)\, d_{ij}$. Note that $\kappa_i$ satisfies
$$\sum_{i \in S} w_i \delta_i \kappa_i x_i = \sum_{j \in S} w_j (1 - \delta_j)\, x_j. \qquad (4)$$

To study the asymptotic properties of the NNRI estimator in (3), we assume the following regularity condition holds.

Assumption 3 (i) Let the matching variable $X$ be a random variable on a compact and convex support $\mathcal{X}$, with its density $f_X$ bounded and bounded away from zero. Suppose that $f_X$ is also differentiable in the interior of $\mathcal{X}$ with bounded derivatives; (ii) let $R(x)$ be Lipschitz continuous in $x$.

Assumption 3 (i) imposes a compact and convex support for the random variable $X$, which will be essential for studying the asymptotic properties of the NNRI estimator; Assumption 3 (ii) restricts the ratio function $R_i$ to be smooth in $x_i$ (Abadie and Imbens, 2006; Yang and Kim, 2020). Note that the underlying ratio model in (1) is a special case of the general models $m(\cdot)$ that clearly satisfies the Lipschitz condition, since the Lipschitz condition on $R(x)$ implies that on $m(x)$. The following lemma describes the key asymptotic property, with the proof deferred to the Supplementary Material.

Lemma 1 Under Assumptions 1-3, we have
$$\sum_{i \in S} w_i \delta_i (1 + \kappa_i)\, m(x_i) = \sum_{i \in S} w_i\, m(x_i) + O_p(n^{-1} N). \qquad (5)$$
Combining (3) with (5) and noting that $E(y_i \mid x_i) = x_i R(x_i)$ under (1), we can express the NNRI estimator as
$$\hat{T}_y^{\mathrm{NNRI}} = \sum_{i \in S} w_i\, m_i + \sum_{i \in S} w_i \delta_i (1 + \kappa_i)\, e_i + O_p(n^{-1} N), \qquad (6)$$
where $m_i = x_i R(x_i)$ and $e_i = y_i - m_i$. This decomposition ensures that the first component, involving the $m_i$, is uncorrelated with the second component, involving the $e_i$, conditional on the $x_i$.
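As a small numerical check of the equivalence between the imputed-data form (2) and the donor-weight form (3), the sketch below computes the estimated total both ways on toy data and also returns the quantities $\kappa_i$. This is an illustrative sketch under our own naming, not the authors' code.

```python
import numpy as np

def nnri_total_two_ways(w, x, y, respondent):
    """Compute the NNRI estimator of the total vector T_y two ways:
    (i) summing w_i * (observed or imputed) y_i, and
    (ii) the donor-side form sum_i w_i * delta_i * (1 + kappa_i) * y_i.
    """
    w, x = np.asarray(w, float), np.asarray(x, float)
    y = np.asarray(y, float)
    delta = np.asarray(respondent, bool)
    donors = np.flatnonzero(delta)
    n = len(x)

    # d[i, j] = 1 if donor i is matched to nonrespondent j.
    d = np.zeros((n, n))
    for j in np.flatnonzero(~delta):
        i_star = donors[np.argmin(np.abs(x[donors] - x[j]))]
        d[i_star, j] = 1.0

    # (i) Imputed-data form: y*_j = x_j * (y_i / x_i) for the matched donor i.
    y_star = y.copy()
    for j in np.flatnonzero(~delta):
        i_star = np.flatnonzero(d[:, j])[0]
        y_star[j] = x[j] * y[i_star] / x[i_star]
    t_hat_1 = (w[:, None] * y_star).sum(axis=0)

    # (ii) Donor-weight form: kappa_i = sum_j {w_j x_j / (w_i x_i)} (1 - delta_j) d_ij.
    kappa = (d * (w * x * (~delta))[None, :]).sum(axis=1) / (w * x)
    t_hat_2 = (w[:, None] * (delta * (1 + kappa))[:, None] * y).sum(axis=0)

    return t_hat_1, t_hat_2, kappa

w = np.array([4.0, 4.0, 2.0, 2.0, 1.0])
x = np.array([10.0, 12.0, 50.0, 55.0, 100.0])
y = np.array([[6, 3, 1], [0, 0, 0], [30, 15, 5], [0, 0, 0], [55, 30, 15]], float)
resp = np.array([True, False, True, False, True])
t1, t2, kappa = nnri_total_two_ways(w, x, y, resp)
print(t1, t2)   # the two forms agree
print(kappa)    # nonzero only for donors that are actually used
```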
The decomposition leads to the asymptotic distribution of the NNRI estimator as follows.

Theorem 1 Under Assumptions 1-3, suppose the ratio model in (1) holds and define $V = V_m + V_e$, where $V_m$ is the design-based variance component arising from estimating $\sum_{i=1}^{N} m_i$ by $\sum_{i \in S} w_i m_i$ and $V_e$ is the component arising from the residuals $e_i$ and the matching. Then the NNRI estimator in (3) satisfies $V^{-1/2}\, n^{1/2} N^{-1} (\hat{T}_y^{\mathrm{NNRI}} - T_y) \to N(0, 1)$ in distribution. In particular, if $n/N = o(1)$, then $V_e$ reduces to
$$V_e = n N^{-2} \sum_{i=1}^{N} w_i \delta_i (1 + \kappa_i)^2\, E(e_i^2 \mid x_i).$$

The detailed proof of Theorem 1 is presented in the Supplementary Material. The results in Theorem 1 are obtained by taking the reverse sampling arguments following Shao and Steel (1999) and Kim et al. (2006), so that the sample-response path begins with a census with nonrespondents from which a sample is selected. In the reverse sampling framework, the outer expectation (denoted $E$) is taken with respect to the superpopulation model for the census with nonrespondents, with inner expectations and variances taken with respect to the sampling design conditional on $(\delta_1, \ldots, \delta_N)$.

Yang and Kim (2019, 2020) describe asymptotically unbiased variance estimators that fully account for both error terms ($V_m$ and $V_e$) presented in Section 2.2, under predictive mean matching imputation. Since NNRI is a special case of predictive mean matching, we modify their variance estimator to obtain an asymptotically unbiased estimator, i.e., $\hat{V} = \hat{V}_m + \hat{V}_e$. Both the empirical application presented in Section 4 and the simulation study presented in Section 5 employ stratified SRS-WOR designs that include a certainty stratum (all units sampled with probability = 1) and at least one sampling stratum with a large sampling rate (greater than 0.50). The sample design is highly characteristic of business surveys, which are sampled from highly skewed populations. The presented applications use approximate sampling variance (design-based) estimators for $V_m$ to easily incorporate the non-negligible sampling fractions, although we also introduce a general replication variance estimator for use when sampling fractions are negligible.

Assumption 3 requires a smooth estimator of $R(x)$ that can be used for all sampled units to estimate $m_i$ as $x_i R(x_i)$. However, NNRI is not a smooth imputation procedure. We therefore approximate the smooth ratio function $R(x)$ in (6) with a plug-in estimator $\hat{R}(x_i)$, considering three approaches:

1. PARAM1 Parametric ratio estimator with $\hat{R}(x_i) = \hat{\beta}$, where $\hat{\beta}$ is estimated from all respondents in the domain (industry) by fitting a weighted no-intercept linear regression of $y_i$ on $x_i$. This is the PARAM1 estimator utilized in Beaumont and Bocci (2009);

2. PARAM2 Parametric ratio estimator with $\hat{R}(x_i) = \hat{\beta}_h$, where $\hat{\beta}_h$ is estimated separately within each sampling stratum $h$ by fitting the same regression model as in PARAM1; and

3. NONPARAM Generalized Additive Model (GAM) estimator (Hastie and Tibshirani, 1990) approximating the unknown smooth function $R(x)$ with multinomial link functions.

The first two estimators are frequently employed in the survey research methods literature (e.g., Magee (1998), among others). Of course, these estimators require a very specific relationship between the detail items and the auxiliary variable, and this strong association is less likely for rarely reported detail items. Furthermore, the parametric approximations develop separate regression models for each detail item component in $y_i$. As a nonparametric alternative, we appeal to the GAM for a more flexible representation. In short, GAMs can be considered as fitting several spline smoothers to approximate the unknown smooth function, in this case the ratio function $R(x)$. Applying an order-$n$ basis expansion of $R(x)$ under the multinomial link function gives, for $i = 1, \ldots, n$ and $t = 1, \ldots, T-1$,
$$\log\!\left\{ \frac{R^{(t)}(x_i)}{R^{(T)}(x_i)} \right\} = \sum_{k=1}^{n} \beta_k^{(t)}\, b_k^{(t)}(x_i),$$
where the $\{\beta_k^{(t)}\}_{k=1}^{n}$, $t = 1, \ldots, T-1$, are the regression coefficients for the $t$-th detail item and the $\{b_k^{(t)}(x_i)\}_{k=1}^{n}$, $t = 1, \ldots, T-1$, are $n$ known basis functions (e.g., splines, radial functions) associated with the $n$ sample points, which are usually assumed to have good theoretical approximation properties.
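Before turning to the penalized fitting of the GAM, here is a brief sketch of the two parametric fits. It assumes the classical ratio-model variance structure (residual variance proportional to $x$), under which the weighted no-intercept fit reduces to $\hat{\beta}_t = \sum_i w_i y_{it} / \sum_i w_i x_i$; other weightings give a different but analogous estimator. All names are illustrative.

```python
import numpy as np

def ratio_fit(x, y, w):
    """Weighted no-intercept ratio fit for one detail item. Assuming the
    residual variance is proportional to x, the weighted least squares
    solution is the classical ratio estimator sum(w*y) / sum(w*x)."""
    return np.sum(w * y) / np.sum(w * x)

def param_ratios(x, Y, w, strata=None):
    """PARAM1-style fit (one beta_t per detail item over all respondents) and,
    if strata are supplied, PARAM2-style fits (a beta_{t,h} per stratum h).
    x: (n,) totals; Y: (n, T) detail items; w: (n,) weights (respondents only)."""
    T = Y.shape[1]
    beta_overall = np.array([ratio_fit(x, Y[:, t], w) for t in range(T)])
    beta_by_stratum = None
    if strata is not None:
        beta_by_stratum = {
            h: np.array([ratio_fit(x[strata == h], Y[strata == h, t], w[strata == h])
                         for t in range(T)])
            for h in np.unique(strata)
        }
    return beta_overall, beta_by_stratum
```

The fitted value for unit $i$ is then $\hat{m}_i = x_i \hat{\beta}$ under PARAM1 or $x_i \hat{\beta}_{h(i)}$ under PARAM2; a NONPARAM fit would instead replace the constant $\hat{\beta}$ with a smooth estimate $\hat{R}(x_i)$, for example from a penalized spline fit as described next.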
To alleviate the overfitting issue, a model penalty can be imposed during the model fitting to reduce the number of basis functions from $n$ to $K$ (Wood, 2003; Wood et al., 2016). The penalized objective function is specified as
$$\sum_{i:\, \delta_i = 1} \big\{ R_i - \hat{R}(x_i) \big\}^2 + \lambda\, J\{\hat{R}(x)\},$$
where the first term measures the closeness of the fitted function to the observed ratios, while the second term $J\{\hat{R}(x)\}$ penalizes the wiggliness of the function; the associated tuning parameter $\lambda$ can be obtained by cross validation. Here, we adopt the wiggliness penalty functional from Wood (2003), defined over an arbitrary reproducing kernel Hilbert space $\mathcal{H}$ and illustrated in Example 1.

Example 1 Begin with a simple objective function,
$$\sum_{i:\, \delta_i = 1} \{ R_i - R(x_i) \}^2 + \lambda \int_a^b \{ R''(x) \}^2\, dx,$$
with arbitrary values $a$ and $b$ as long as they cover the variable in question (Hastie and Tibshirani, 1990). The resulting solution $R^*(x)$ can be considered a type of basis spline or B-spline (see Figure 1 for an illustrative example). However, its estimation requires $O(n^3)$ operations in the univariate case, which is computationally infeasible for large-scale datasets. One approach to this problem is to employ regression splines, from which an optimal approximation of $R(x)$ can be produced by choosing truncated bases with lower ranks $K^{(t)}$ for $t = 1, \ldots, T-1$ (Wood, 2003).

Using the pseudo observations $\hat{m}_i$ obtained from the postulated parametric or nonparametric models of $R(x)$ in Section 3.1, a standard design-based estimator of $V_m$ under complete response is given by
$$\hat{V}_m = \frac{n}{N^2} \sum_{i \in S} \sum_{j \in S} \Omega_{ij}\, \hat{m}_i\, \hat{m}_j,$$
where $\Omega_{ij}$ accounts for various sampling designs. For sampling designs with negligible sampling fractions, the ease of a replication variance estimator (Wolter, 2007) may be preferable for the NNRI estimator. The general replication variance estimator when $y_i$ is observed throughout the sample is
$$\hat{V}_{\mathrm{rep}} = \sum_{k=1}^{L} c_k \big( \hat{T}_y^{(k)} - \hat{T}_y \big)^2,$$
where $c_k$ is the $k$th replication factor and $\hat{T}_y^{(k)} = \sum_{i \in S} w_i^{(k)} y_i$, with $w_i^{(k)}$ the $k$th replicate weight for unit $i$, using a replication method that appropriately accounts for the complex sampling design. [Note: for a stratified SRS-WOR sample with non-negligible sampling fractions, the $c_k$ should be modified to include the finite population correction factors.] We illustrate replicate weights through the following example.

Example 2 Suppose that the probability sample is obtained by a single-stage design where each unit $i$ has a sampling weight $w_i$; then the delete-1 jackknife method yields an unbiased estimate of the sampling variance under complete response. Therefore, $L = n$, $c_k = (n-1)/n$, and $w_i^{(k)} = w_i\, n/(n-1)$ for $i \neq k$, with $w_k^{(k)} = 0$.

We now discuss estimation of the $V_e$ term. If an asymptotically unbiased estimator of $\sigma_e^2(x_i) = E(e_i^2 \mid x_i)$ is available, then we may use the plug-in estimator obtained by replacing $\sigma_e^2(x_i)$ with $\hat{\sigma}_e^2(x_i)$ in the expression for $V_e$. If, additionally, we assume Lipschitz continuity of $\sigma_e^2(x)$ in $x$, a result similar to Lemma 1 can be obtained; substituting it back into the asymptotic variance yields a computable estimator $\hat{V}_e$. Now, we can directly use the residuals obtained from the modeled values, i.e., $\hat{e}_i = y_i - \hat{m}_i$, to estimate $\sigma_e^2(x_i)$, where $\hat{m}_i$ is obtained using the PARAM1, PARAM2, or NONPARAM models presented in Section 3.1. In particular, we take $\hat{\sigma}_e^2(x_i) = \hat{e}_i^2$, computed componentwise for each detail item. Alternatively, we can model the variation of the residuals to estimate $\sigma_e^2(x_i)$. We consider three approaches:

1. PARAM1(M): $\hat{\sigma}_e^2(x_i)$ specified by the multinomial distribution of $y_i$ given the total $x_i$, with cell probabilities $\hat{\beta}$, the PARAM1 estimator from Section 3.1;

2. PARAM2(M): the same multinomial specification with the stratum-level PARAM2 estimator $\hat{\beta}_h$; and

3. NONPARAM(M): the same specification with $\hat{m}_i$ obtained using the NONPARAM estimator described in Section 3.1.

Combining the $\hat{V}_m$ components obtained with the proposed estimators of $m_i$ described in Section 3.1 and the $\hat{V}_e$ components yields six candidate variance estimators.
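To make the replication estimator of Example 2 concrete, the following is a small sketch of a delete-1 jackknife for the pseudo total $\sum_{i \in S} w_i \hat{m}_i$ in a single-stage (or single-stratum) design; the optional finite population correction is folded into the replication factor $c_k$, as suggested by the bracketed note above. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def jackknife_variance(w, m_hat, fpc=1.0):
    """Delete-1 jackknife estimate of the sampling variance of the
    pseudo total sum_i w_i * m_hat_i under a single-stage design.

    w:     (n,) sampling weights
    m_hat: (n,) pseudo observations m_hat_i = x_i * R_hat(x_i), one detail item
    fpc:   optional finite population correction folded into c_k,
           e.g. (1 - n/N) for a (stratified) SRS-WOR stratum
    """
    w, m_hat = np.asarray(w, float), np.asarray(m_hat, float)
    n = len(w)
    t_full = np.sum(w * m_hat)
    c_k = fpc * (n - 1) / n
    replicate_totals = np.empty(n)
    for k in range(n):
        # k-th replicate: drop unit k and rescale the remaining weights by n/(n-1).
        w_k = w * n / (n - 1)
        w_k[k] = 0.0
        replicate_totals[k] = np.sum(w_k * m_hat)
    return np.sum(c_k * (replicate_totals - t_full) ** 2)
```

For a stratified design, the same calculation would be carried out stratum by stratum and the contributions summed.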
In this section, we apply the proposed variance estimators to empirical data from the 2018 collection of the Service Annual Survey (SAS). Conducted by the U.S. Census Bureau, the SAS is a mandatory survey of approximately 78,000 employer businesses (companies) having one or more establishments located in the U.S. that provide services to individuals, businesses, and governments. The SAS collects aggregate and detailed revenues and expenses, e-commerce, exports, and inventories data from a stratified sample of business firms with paid employees in selected industries in the services sector. As mentioned in Section 1, the key items collected by the SAS are total revenue and total expenses; detailed breakdowns of these two totals items are requested from all sampled firms. The revenue detail items vary by industry within sector. Expense detail items, however, are primarily the same for all sectors, with an occasional additional expense detail or two collected for select industries. Complete information on the SAS methodology is available at https://www.census.gov/programs-surveys/sas/technical-documentation/methodology.html.

SAS uses imputation to account for unit and item nonresponse, relying heavily on the ratio imputation models presented in Section 3.1 (specifically, PARAM1 and PARAM2). For total receipts and total expenses, the independent variables are either a highly correlated data item from the same reference period or historic data for the same item. The models for detail items use the corresponding totals item (reference period only). Thompson and Washington (2013) evaluated these imputation models in two of the sectors included in the SAS, explicitly fitting weighted no-intercept linear regression models within industry using respondent data to assess model fit (i.e., the PARAM1 method). For the totals, the ratio imputation models were appropriate, with adjusted $R^2$ consistently above 95%. Given such strong predictors, the nonresponse adjustment is robust to the assumed response mechanism and should decrease the variance (Little and Vartivarian, 2005). However, the results for the detail items were far less convincing, with adjusted $R^2$ values often well below 75 percent and non-significant slopes ($\alpha$ = 0.10). The weak predictive power of this ratio imputation approach was exacerbated in the 2020 data collection year, as many businesses were closed or had curtailed operations due to the COVID-19 pandemic. Not surprisingly, the SAS program managers were interested in exploring other imputation methods for these detail items, which are historically less frequently and reliably reported than their associated totals and whose imputed values are generally more difficult to independently validate.

The empirical application is restricted to five industries, collectively representing a cross-section of the survey's expenditures data collections. We chose a subset of industries from a candidate list provided by subject matter experts, requiring a minimum of three sampled units per stratum in addition to validating that the NNRI model assumptions appeared to hold. Consequently, these industries are not representative of the larger survey. The input data for this application consist of the sampled companies in the selected industries that tabulated a positive (non-zero) total expenses value; companies reporting zero-valued expenses were dropped.
Furthermore, the SAS uses industry-average ratio imputation (not NNRI) for missing and invalid expenses items and implements a naive random group variance estimator for all items. For these reasons, our estimates and variance estimates differ from the official published values.

Tables 1 and 2 describe selected features of the study industries. The sample sizes and response rates are rounded to comply with internal regulations. The unweighted response rates presented in Table 1 represent the proportion of sampled units that provided a complete donor record and do not correspond to the official response rates. On inspection, the response patterns displayed in Table 1 appear to be atypical of business surveys, in that generally the larger businesses (e.g., the certainty companies) respond at higher rates. However, for NNRI, all detail items are imputed if either (1) the company was a nonrespondent or (2) the reported set of expenditures details did not add to the total, thus reducing the total number of respondents.

Typical of many business surveys, all five distributions of total expenses are highly positively skewed, and the largest businesses are sampled with probability = 1 (certainty). These distributions of the total expenses variable are well approximated by lognormal distributions in all industries. Notice that the variance of the approximating lognormal distribution is small in three industries (517210, 541210, and 713110), so that the compact support requirement in Assumption 3 is approximately true. Unfortunately, conformance to this requirement is unlikely for the other two study industries, although the convex support assumption should hold for all five industries.

Table 2 provides sample design characteristics of the study industries. The overall sampling rate ($n/N$) is non-negligible in four of the five study industries. All industries include a certainty stratum (all units included with probability = 1), which is excluded from the range provided in Table 2 but is included in the computation of the overall sampling rate. The sampling rates for the noncertainty industry strata vary greatly, with at least one stratum in each industry having a non-negligible sampling rate.

Table 3 compares the candidate variance estimates via variance ratios; ratios greater than 1,000 are indicated by "XXX." The parametric ratio estimator employed by $\hat{V}_{\mathrm{PARAM1}}$ is clearly problematic. The variance estimates obtained with $\hat{V}_{\mathrm{PARAM1}}$ for the rarely reported detail items (values less than 10%) are often considerably larger than the counterparts obtained with the other non-naive variance estimators. However, using modeled residuals for $\sigma_e^2$ appears to underestimate the variances, as evidenced by the frequency of variance ratios with values less than one. Similarly, the variance estimates obtained using the modeled residuals with the PARAM2 model ($\hat{V}_{\mathrm{PARAM2(M)}}$) tend to be much larger than those obtained with any other considered variance estimator, providing evidence of a poor model fit for $\sigma_e^2$.

Table 4 presents the coefficients of variation (c.v.'s) of each item for each considered variance estimator, in percentages. At the 95% confidence level, a total with an associated c.v. greater than 51% (= 1/1.96) is not significantly different from zero. Thus, the c.v.'s provide a measure of the practical impact of the variance estimators on inference.
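The 51% threshold follows directly from the normal-approximation confidence interval: for a positive estimated total $\hat{T}$ with estimated standard error $\widehat{\mathrm{se}}(\hat{T})$,
\[
\hat{T} - 1.96\,\widehat{\mathrm{se}}(\hat{T}) \le 0
\quad \Longleftrightarrow \quad
\mathrm{c.v.} = \frac{\widehat{\mathrm{se}}(\hat{T})}{\hat{T}} \ge \frac{1}{1.96} \approx 0.51,
\]
so any item whose estimated c.v. exceeds roughly 51% has an approximate 95% confidence interval that includes zero.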
As one might expect given the results in Table 3, the c.v.'s obtained using the $\hat{V}_{\mathrm{PARAM1}}$ or $\hat{V}_{\mathrm{PARAM2(M)}}$ estimates are much larger than those obtained with any other considered variance estimator. Recall that the PARAM1 model utilizes an industry-level ratio estimator, which is likely inappropriate. Table 3 provides further indications that the modeled residuals in $\hat{V}_{\mathrm{PARAM1(M)}}$ do not improve this ratio estimator's performance, as the associated c.v.'s tend to be smaller than their naive variance counterparts. In contrast, the c.v.'s obtained with the $\hat{V}_{\mathrm{PARAM2}}$, $\hat{V}_{\mathrm{NONPARAM}}$, and $\hat{V}_{\mathrm{NONPARAM(M)}}$ estimators are generally similar, with a few visible exceptions. These large differences could be due to model misspecification, but could also be confounded with small sample size effects for many of the detail items. Despite the similarity of their c.v.'s, the variance estimates of corresponding items obtained with $\hat{V}_{\mathrm{PARAM2}}$, $\hat{V}_{\mathrm{NONPARAM}}$, and $\hat{V}_{\mathrm{NONPARAM(M)}}$ are quite different. This affects confidence interval width, and expected coverage by extension. Without a "gold standard" against which to measure these variances, however, we have no viable recommendation. Consequently, we conducted the Monte Carlo simulation study described in Section 5.

To evaluate the finite-sample performance of the proposed method over repeated samples, we conducted a simulation study. The simulation varies four factors: the parametric distribution of the size (auxiliary) variable $x_i$, the size of the finite population ($N$), the relationship between the auxiliary variable and the detail items ($x_i$ and $y_i$), and the response propensity. The data generation is largely patterned after the realistic procedures described in Andridge, Bechtel, and Thompson (2021), with each separate process outlined below. Each finite population is stratified using the strata boundaries provided in Table 5.

We used a two-step process to generate the sets of detail item values $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4}, y_{i5})$ associated with each unit, capturing important data features observed in several of the U.S. Census Bureau's economic programs. Specifically, the $y_i$ for all units have non-zero values assigned to $(y_{i1}, y_{i2})$ but may have zero values in $(y_{i3}, y_{i4}, y_{i5})$. To justify the use of the NNRI procedure, the number of non-zero detail items is directly related to unit size. Within each stratum, we generate $C_i$ (the number of non-zero detail items) for unit $i$ from a discrete distribution on $\{2, 3, 4, 5\}$, with selection probabilities $P(C_i = c \mid x_i)$ determined by the unit's stratum so that larger units tend to have more non-zero detail items. Conditioning on the assigned $c_i$, we draw the proportions $R_i$ for each unit $i$ from a multinomial distribution. By design, $p_1 = 0.60$ for all units regardless of class; this detail item therefore represents about 60% of each unit's total ($x_i$). Figure 4 illustrates the subtle change in multinomial distributions as unit size (sampling stratum) increases. The largest proportion of the total is always reported in item $Y_1$, with item $Y_2$ following. The remaining three items are more rarely reported, with the probability of a reported nonzero value being strongly related to unit size. This mimics patterns that were observed by Andridge et al. (2021) in several economic census datasets. Lastly, using these $R_i$, we compute $y_i = x_i R_i$, which ensures that $x_i = \sum_{t=1}^{5} y_{it}$. We select a single stratified SRS-WOR sample from each finite population.
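To illustrate the two-step generation of the detail items, here is a small sketch. The lognormal size distribution matches Population Scenario 2 described later; the stratum-specific selection probabilities for $C_i$ are placeholders of our own (the paper's exact values are not reproduced here), chosen only so that larger strata tend to receive more non-zero items, and the Dirichlet split of the remaining 40% is a simple stand-in for the multinomial draw of $R_i$ used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2018)

def generate_details(x, stratum, n_items=5, p1=0.60):
    """Two-step generation of detail items y_i = (y_i1, ..., y_i5):
    1) draw C_i, the number of non-zero items (between 2 and 5), with
       probabilities that shift toward 5 as the stratum (unit size) grows;
    2) draw proportions R_i with the first item fixed at p1 = 0.60 and the
       remainder split over the other C_i - 1 items; then set y_i = x_i * R_i.
    The probabilities below are illustrative placeholders, not the paper's values.
    """
    c_probs = {1: [0.70, 0.20, 0.07, 0.03],
               2: [0.40, 0.30, 0.20, 0.10],
               3: [0.10, 0.20, 0.30, 0.40],
               4: [0.02, 0.08, 0.30, 0.60]}
    n = len(x)
    Y = np.zeros((n, n_items))
    for i in range(n):
        c = rng.choice([2, 3, 4, 5], p=c_probs[stratum[i]])
        # Split the remaining 40% over items 2..c with a flat Dirichlet draw
        # (a stand-in for the paper's multinomial draw of R_i).
        rest = rng.dirichlet(np.ones(c - 1)) * (1.0 - p1)
        R = np.zeros(n_items)
        R[0] = p1
        R[1:c] = rest
        Y[i] = x[i] * R           # guarantees sum_t y_it = x_i
    return Y

x = rng.lognormal(mean=4.1, sigma=0.66, size=8)
stratum = np.array([1, 1, 2, 2, 3, 3, 4, 4])
Y = generate_details(x, stratum)
print(np.allclose(Y.sum(axis=1), x))   # True: the details add to the total
```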
Table 6 provides the finite population size ($N$), the average stratum sizes $N_h$, the sampling fractions $f_h$, and the average sample sizes $n_h$. Notice that the overall sampling fraction ($f$) is 302/1000 in the $N = 1000$ populations and 152/500 in the $N = 500$ populations, and both are therefore non-negligible. Finally, within each sample, we generate missingness indicators for each unit $i$ as $\delta_i \sim \mathrm{Bernoulli}(\pi)$ under different response mechanisms: uniform (missing completely at random) with $\pi$ = 75% and 50% (MCAR); uniform response within strata (missing at random) with smaller units being more likely to respond (Negative MAR); and uniform response within strata with larger units being more likely to respond (Positive MAR). Table 7 presents the strata response propensities for the negative and positive MAR response mechanisms. Notice that the between-stratum differences in response propensities are quite large; these discrepancies are exaggerated for illustration. The MCAR response propensities resemble the observed patterns for industry 713110 and, to a lesser extent, for industries 541211 and 621410 presented in Section 4, although the latter two industries could equally be categorized with negative MAR response patterns. The negative MAR response mechanism mimics the observed pattern for industry 221122, whereas the positive MAR response mechanism mimics the observed pattern for industry 517210.

Imputation and estimation are performed separately within each stratum, with $x_i$ as the matching variable for the NNRI of $y_i = (y_{i1}, \ldots, y_{iT})$ as outlined in Section 2.2. To evaluate the performance of the proposed NNRI variance estimators over repeated samples selected from different population distributions (scenarios), finite population sizes, and response mechanisms, we compute the relative bias of each variance estimator $\hat{V}_{yp}$ and the coverage rates of the approximate 95% confidence intervals constructed with $\hat{T}_y$ and $\hat{V}_{yp}$, for variance estimation method $p$. Table 8 reports the relative biases using directly obtained residuals for all five detail items for each population scenario and response mechanism for the $N = 1000$ populations. Results for the $N = 500$ populations are similar and are consequently not presented here, but are available upon request from the authors. The relative bias of each variance estimator is computed as
$$\mathrm{RB}(\hat{V}_{yp}) = \frac{B^{-1} \sum_{b=1}^{B} \hat{V}_{yp}^{(b)} - V_{yp}}{V_{yp}},$$
where $\hat{V}_{yp}^{(b)}$ is the estimate from the $b$th sample ($B$ = 2000) and $V_{yp}$ is the Monte Carlo empirical (true) variance. Recall that $\sum_{i=2}^{5} Y_i$ represents a maximum of 40% of the corresponding total, $\sum_{i=3}^{5} Y_i$ represents a maximum of 20% of the corresponding total, and $\sum_{i=4}^{5} Y_i$ represents 5% or less of the corresponding total. Consequently, the analysis of relative biases of the variance estimates will be confounded with small sample size effects for each detail item except $Y_1$, with small unit differences in totals potentially representing large percentage differences.

The main results of Table 8 can be summarized as follows:

• The naive variance estimator severely underestimates the true variance of detail items $Y_2$, $Y_3$, $Y_4$, and $Y_5$ for all populations and response mechanisms;

• In Population Scenarios 1 and 2, the PARAM1 variance estimator overestimates the true variance of all detail items except $Y_1$, essentially confirming the conjecture of overestimation posited in Section 4 for industries 517210, 541210, and 713110. This variance estimator generally underestimates the true variance of the same detail items in Population Scenario 3, mimicking the empirical results for industries 221122 and 631400.
• In Population Scenarios 1 and 2, the PARAM2 variance estimates are nearly unbiased, regardless of response mechanism. However, in Population Scenario 3, the PARAM2 variance estimator tends to underestimate the variance of all detail items except $Y_1$, with the degree of underestimation being more severe than for their PARAM1 counterparts.

• In Population Scenarios 1 and 2, the NONPARAM variance estimates have inconsistent relative bias performance, although they are generally improved over the corresponding naive variance estimates. This is not true in Population Scenario 3, where the variance estimates for all detail items except $Y_1$ are severely underestimated, and the variance estimates for $Y_1$ are overestimates, with the level of overestimation related to the response mechanism.

Table 8: Relative biases of variance estimates using directly obtained residuals for all detail items by population scenario and response mechanism, computed from 2,000 independent stratified SRS-WOR samples from the $N = 1000$ populations. Negative relative biases are in parentheses. Population Scenario 1 = Uniform; Population Scenario 2 = Lognormal ($\mu$ = 4.1, $\sigma$ = 0.66); Population Scenario 3 = Lognormal ($\mu$ = 12.0, $\sigma$ = 1.7).

The simulation conditions in Population 2 with an MCAR or negative MAR response mechanism resemble the 541210 and 713110 industries' conditions; Table 8 provides evidence that the $\hat{V}_{\mathrm{PARAM2}}$ are likely the most accurate estimates in this situation, even for the rarely reported detail items. The simulation conditions in Population 2 with a positive MAR response mechanism resemble the industry 517210 conditions; the simulation results are close to the same for the $\hat{V}_{\mathrm{PARAM2}}$ and the $\hat{V}_{\mathrm{NONPARAM}}$, without a clear-cut favorite. The simulation conditions in Population 3 with a negative MAR response mechanism resemble the 221122 and 621400 industries' data; the relative biases for the MCAR response mechanism are similar for $\hat{V}_{\mathrm{PARAM1}}$ and $\hat{V}_{\mathrm{PARAM2}}$, but the $\hat{V}_{\mathrm{PARAM2}}$ are less biased under the negative MAR response mechanism (the $\hat{V}_{\mathrm{NONPARAM}}$ are severe underestimates for all detail items).

Table 9 reports the relative biases using the modeled residuals (see Section 3.3) for all five detail items for each population scenario and response mechanism for the $N = 1000$ populations.

Table 9: Relative biases of variance estimates using modeled residuals for all detail items by population scenario and response mechanism, computed from 2,000 independent stratified SRS-WOR samples from the $N = 1000$ populations. Negative relative biases are in parentheses. Population Scenario 1 = Uniform; Population Scenario 2 = Lognormal ($\mu$ = 4.1, $\sigma$ = 0.66); Population Scenario 3 = Lognormal ($\mu$ = 12.0, $\sigma$ = 1.7).

With the "high" uniform response rate (i.e., MCAR with $\pi$ = 0.75), as well as the negative and positive MAR response mechanisms, modeling the residuals for $\sigma_e^2(x_i)$ generally reduces the relative bias of the PARAM2 variance estimates. Notice that the PARAM2(M) variance estimates are nearly unbiased in Population Scenario 3 for all detail items, except under the lowest uniform response rate mechanism (MCAR with $\pi$ = 0.50); in this situation, the bias effects are likely overstated for the most rarely reported detail items (e.g., $Y_4$ and $Y_5$).
These results contrast with those presented in the empirical case study in Section 4, and we suspect this could be an artifact of the data simulation process. The other approaches (PARAM1(M) and NONPARAM(M)) do not yield consistent improvements over their counterparts obtained with directly obtained residuals. Given that the second parametric method of estimating $m_i$ (PARAM2) generally yields accurate variance estimates and the associated procedures for obtaining $\hat{\sigma}_e^2$ appear tractable, we dropped the alternative parametric approach (PARAM1) from further consideration.

Regardless of population scenario, population size, and considered response mechanism, all NNRI totals were unbiased across the repeated samples. Again, we conjecture that this is an artifact of the simulation design, which ensures appropriate conditions for the distance function used to select the nearest neighbor for imputation. Nevertheless, coverage rates provide a measure of the practical impact of the bias of the considered variance estimators, given the unbiased estimates. [Figure: coverage rates of the approximate 95% confidence intervals for the considered variance estimators.]

Taken collectively, the simulation results support the generally poor performance of the PARAM1 and PARAM1(M) methods demonstrated in Section 4. Ultimately, the results obtained with PARAM2 are promising, especially given that the data generation models were not congenial to the WLS regression used to obtain $\hat{m}_i$ and given the data violations in the third data generation scenario (found in two of the five studied empirical distributions). Despite its poor performance in this simulation study, the nonparametric method remains appealing due to its flexibility. It is possible that the model fit and estimation might be improved with a different choice of basis functions, although we would not recommend utilizing either variation with the studied industries in Section 4. That said, some caution should be exercised in over-generalizing these results, as the simulation utilizes a very specific multinomial distribution and a single sampling design.

Nearest neighbor ratio imputation (NNRI) is a useful approach for imputing an entire set of component detail items. Instead of directly imputing the set of detail item values ($y_i$) from the donor, NNRI imputes the donor's vector of proportions ($R_i$), which is in turn multiplied by the recipient's available total to derive imputed values for all items. This imputation method has several appealing properties, especially from a bias reduction perspective, as discussed in Section 1. NNRI guarantees additivity, as the summed details always equal the associated total. It accommodates subtle changes in unit-level multinomial distributions that are associated with unit size. It yields realistic microdata, preserving multivariate relationships. In contrast to other frequently used imputation methods, NNRI avoids inadvertently imposing possibly outdated historical patterns on the imputed data, as it is entirely restricted to current data (Andridge et al., 2021). The numerous advantages of the NNRI procedure can be offset by the difficulty of obtaining a valid variance estimator. However, Yang and Kim (2019) show that by identifying the nearest donor using a single scalar with a suitable distance function, the NNRI estimator is asymptotically consistent.
Following the frameworks of Shao and Steel (1999) and Yang and Kim (2019), we decompose the asymptotic variance into two parts and extend the variance estimation further to the case of non-negligible sampling fractions, employing both parametric and nonparametric models to obtain smooth estimators for the set of ratios. Selecting an appropriate model for the data set at hand is essential, as demonstrated by the empirical application and the simulation study. Model determination is not completely straightforward: in both the empirical application and our simulation, the frequently reported detail items tend to be strongly associated with the total, whereas the relationship between rarely reported detail items and the total is less obvious. Nevertheless, our simulation studies provide fairly promising results in terms of bias and coverage, even with some model misspecification. In practice, one would expect that methodologists would develop and validate any implemented models after careful data analysis before implementing the proposed variance estimator.

Although promising, the empirical and simulation study results collectively suggest several areas of future research. First, we limited ourselves to uniform and lognormally distributed size variables in our study and restricted the response mechanisms to tractable MCAR or MAR models. Thus, the sensitivity of our variance estimator to more complex MAR models, or even to non-random response mechanisms, bears study. Second, if the proportion of recipients to donors is large, then NNRI may repeatedly use the same donor, yielding insufficient variation within each imputation cell. Andridge et al. (2021) propose a modification of the NNRI method that addresses this issue in a multiple imputation framework; it would be useful to develop a single imputation analogue. Third, in practice, many auxiliary variables can be used to determine nearest neighbors, in which case dimension reduction is necessary to mitigate matching discrepancies. Several techniques such as the propensity score (Rosenbaum and Rubin, 1983), the prognostic score (Hansen, 2008), or their combination (Yang and Zhang, 2020) can potentially be adopted in the NNRI framework. Finally, this paper considers only population totals. However, extending the current framework to general parameter estimation is also feasible (Yang and Kim, 2020). Given that hot deck imputation is used to create realistic microdata (as well as macrodata), such extensions are especially compelling topics for our future research.

Supporting Information for "Nearest neighbor ratio imputation with incomplete multinomial outcome in survey sampling" by Gao et al.

S.1 Proof of (4) and Lemma 1

To demonstrate (4), we write
$$\sum_{i \in S} w_i \delta_i \kappa_i x_i = \sum_{i \in S} \delta_i \sum_{j \in S} w_j x_j (1 - \delta_j)\, d_{ij} = \sum_{j \in S} w_j x_j (1 - \delta_j) \sum_{i \in S} \delta_i d_{ij} = \sum_{j \in S} w_j (1 - \delta_j)\, x_j,$$
where the last equality uses the fact that $\sum_{i \in S} \delta_i d_{ij} = 1$, i.e., for every nonrespondent $j$ there always exists a donor $i \in S$ such that $d_{ij} = 1$.

To demonstrate Lemma 1, using algebra, we obtain
$$\sum_{i \in S} w_i \delta_i (1 + \kappa_i)\, m(x_i) - \sum_{i \in S} w_i\, m(x_i) = \sum_{j \in S} w_j (1 - \delta_j)\, x_j \{ R(x_{j(1)}) - R(x_j) \}. \qquad \mathrm{(S.1)}$$
Under Assumption 3, (S.1) can be bounded by the following argument:
$$\Big| \sum_{j \in S} w_j (1 - \delta_j)\, x_j \{ R(x_{j(1)}) - R(x_j) \} \Big| \le C \sum_{j \in S} w_j (1 - \delta_j)\, x_j\, |x_j - x_{j(1)}|,$$
where $|x_i - x_{i(1)}|$ is termed the matching discrepancy and is uniformly bounded by $O_p(n^{-1})$, as shown in Lemma S.2 (provided immediately below). Hence, under Assumptions 1-3, the term in (S.1) is bounded by $O_p(n^{-1} N)$, and it follows that (6) holds asymptotically.

Lemma S.2 Under Assumption 3, let $|V| = n |X_i - X_{i(1)}|$; then the limiting distribution of $V$ is non-degenerate. Hence $|X_i - X_{i(1)}|$ is bounded by $O_p(n^{-1})$ uniformly.
To start, we consider the conditional probability that unit $j$ is the nearest neighbor of $X_i = x_i$, i.e., $i(1) = j$:
$$P(i(1) = j \mid X_j = x_j) = \{ P(|X - x_i| \ge |x_j - x_i|) \}^{n-1}.$$
Since the $x_i$ are i.i.d. with density $f_X$ ($i = 1, \ldots, n$), the marginal probability that unit $j$ is a match for any nonrespondent is $1/n$, and the distribution of $X_j$ conditional on being the nearest neighbor (first closest match) follows accordingly. Next, we transform $X_j$ to a new random variable $V = n(X_i - X_j) = n(X_i - X_{i(1)})$. The distribution of $V$ can then be expressed in terms of $A(x_i, v, n) = n P(|X - x_i| \le |v|/n)$, with $S = \{\omega \in \mathbb{R} : \|\omega\| = 1\}$ the unit sphere and $\lambda_S$ its surface measure. Therefore, by l'Hopital's rule, we can obtain the limit of $A(x_i, v, n)$ as $n \to \infty$, and hence the limiting distribution of $V$. In addition, $f_X$ is bounded on the support $\mathcal{X}$ by Assumption 3 and $|V| = n |X_i - X_{i(1)}|$ is non-degenerate by simple algebra, i.e., $|V| = O_p(1)$, so $|x_i - x_{i(1)}| = O_p(n^{-1})$ uniformly.

By Theorem 1 of Yang and Kim (2020) and writing $T_y = \sum_{i=1}^{N} (m_i + e_i)$, we can express the linearized difference $\hat{T}_y^{\mathrm{NNRI}} - T_y$ as
$$D_N^* = \sum_{i=1}^{N} (I_i w_i - 1)\, m_i + \sum_{i=1}^{N} \{ I_i w_i \delta_i (1 + \kappa_i) - 1 \}\, e_i,$$
where $I_i = 1$ if $i \in S$ and $I_i = 0$ otherwise. The expression for $D_N^*$ holds true even when $f = n/N$ is non-negligible. Next, we verify that the covariance of the two terms in $D_N^*$ equals zero, since $E_\zeta(e_i \mid x_i) = 0$ by definition. Therefore, the asymptotic variance is essentially equal to $\mathrm{var}(D_N^*) = V_m + V_e$, where we can show $V_e = O(1)$ with $\sigma_e^2(x_i) = E(e_i^2 \mid x_i)$. The first term, $V_m$, can be characterized under Assumptions 2 and 3. In the second term, $\kappa_i = \sum_{j \in S} \{w_j x_j / (w_i x_i)\} (1 - \delta_j)\, d_{ij}$ is bounded by
$$\frac{C_1 \min_{x \in \mathcal{X}} |x|}{C_2 \max_{x \in \mathcal{X}} |x|}\, \tilde{\kappa}_i \;\le\; \kappa_i \;\le\; \frac{C_2 \max_{x \in \mathcal{X}} |x|}{C_1 \min_{x \in \mathcal{X}} |x|}\, \tilde{\kappa}_i,$$
where $\tilde{\kappa}_i = \sum_{j \in S} (1 - \delta_j)\, d_{ij}$ is bounded by $O(1)$ (Abadie and Imbens, 2006; Yang and Kim, 2020, p. 17). The second term can be bounded similarly. Note that when $n/N$ is negligible, the term $V_e$ is dominated by the first term, $n N^{-2} \sum_{i=1}^{N} w_i \delta_i (1 + \kappa_i)^2 E(e_i^2 \mid x_i)$.

References

Large sample properties of matching estimators for average treatment effects
Finding a flexible hot-deck imputation method for multinomial data
Assessing nonresponse bias in a business survey: Proxy pattern-mixture analysis for skewed data
A review of hot deck imputation for survey nonresponse
A generic implementation of the nearest-neighbour imputation methodology (NIM)
On variance estimation under auxiliary value imputation in sample surveys
Variance estimation when donor imputation is used to fill in missing values
Imputation under informative sampling
Jackknife variance estimation for nearest-neighbor imputation
Sampling statistics
The prognostic analogue of the propensity score
Generalized additive models
The treatment of missing survey data
Variance estimation for nearest neighbor imputation for U.S. Census long form data
Replication variance estimation for two-phase stratified sampling
Does weighting for nonresponse increase the variance of survey means?
A hot-deck multiple imputation procedure for gaps in longitudinal data on recurrent events
Improving survey-weighted least squares regression
The central role of the propensity score in observational studies for causal effects
Inference and missing data
Imputation in surveys: coping with reality
Variance estimation for survey data with composite imputation and nonnegligible sampling fractions
Confidence intervals based on survey data with nearest neighbor imputation
Algorithms for adjusting survey data that fail balance edits
Challenges in the treatment of unit nonresponse for selected business surveys: a case study
A hybrid response process model for business surveys
Using focus groups to identify analysts' editing strategies in an economic survey
The business context and its implications for the survey response process
Introduction to variance estimation
Thin plate regression splines
Smoothing parameter and model selection for general smooth models
Nearest neighbor imputation for general parameter estimation in survey sampling
Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework
Multiply robust matching estimators of average and quantile treatment effects