key: cord-0454425-8icnt3v1 authors: Wu, Yangyang; Wang, Jun; Miao, Xiaoye; Wang, Wenjia; Yin, Jianwei title: Differentiable and Scalable Generative Adversarial Models for Data Imputation date: 2022-01-10 journal: nan DOI: nan sha: 0d37a47e67b8b3aed896390b3596fbec77f620d1 doc_id: 454425 cord_uid: 8icnt3v1 Data imputation has been extensively explored to solve the missing data problem. The dramatically increasing volume of incomplete data makes the imputation models computationally infeasible in many real-life applications. In this paper, we propose an effective scalable imputation system named SCIS to significantly speed up the training of differentiable generative adversarial imputation models with accuracy guarantees for large-scale incomplete data. SCIS consists of two modules, differentiable imputation modeling (DIM) and sample size estimation (SSE). DIM leverages a new masking Sinkhorn divergence function to make an arbitrary generative adversarial imputation model differentiable, while for such a differentiable imputation model, SSE can estimate an appropriate sample size to ensure the user-specified imputation accuracy of the final model. Extensive experiments upon several real-life large-scale datasets demonstrate that our proposed system can accelerate the generative adversarial model training by 7.1x. Using around 7.6% of the samples, SCIS yields accuracy competitive with the state-of-the-art imputation methods in a much shorter computation time. Many real-life scenarios, such as e-commerce, transportation science, and health care, encounter the problem of missing data [1]-[5] whenever data collection is involved. Data might be missing for various reasons, including collection device failure [6], unstable system environments [5], privacy concerns [7], etc. The missing data problem poses a daunting challenge to downstream data analysis. To alleviate the missing data problem, a substantial amount of data imputation studies [8], [9] has been carried out, including statistical ones [10], machine learning ones [11], multilayer perceptron (MLP) based ones [12], autoencoder (AE) based ones [13], and generative adversarial network (GAN) based ones [14], [15]. Ideally, a preferable data imputation algorithm learns the true underlying data distribution well from the observed data, and counts on the learned distribution to impute the missing data. GAN-based imputation methods [14], [15] are attempts in this direction. They maximize the distributional proximity of the generated data and the true underlying data by introducing an adversarial game between a generator and a discriminator. Many empirical investigations [14]-[16] have demonstrated the promising performance brought by GAN-based methods for data imputation. The ubiquity of data collection technologies with unprecedented processing power and substantial storage capacity has given rise to a dramatic increase in the volume of incomplete data. For example, a real-world COVID-19 case surveillance public use dataset [17] contains 22,507,139 cases with 7 clinical and symptom features, with a 47.62% missing rate. The large volume of incomplete data, however, means that it is expensive and unwieldy to exploit the above imputation algorithms. Although GAN-based methods achieve better imputation performance than other imputation approaches, they incur exceedingly high training costs over large-scale incomplete data.
Our experimental results show that almost all imputation methods take more than 10^5 seconds to train over million-size incomplete data. In general, effective and scalable data imputation over large-scale incomplete data is indispensable in many real-life scenarios. It is challenging to apply GAN-based imputation methods to specific real-life scenarios, particularly for large-scale incomplete data. First, there is strong theoretical evidence showing that the Jensen-Shannon (JS) divergence of the GAN-based imputation model fails in the case that the true underlying and generated data distributions have non-overlapping supports [18]-[20]. This makes the JS divergence based imputation loss function non-differentiable, and makes it suffer from the "vanishing" gradient problem. Second, existing GAN-based imputation methods consume high training cost to calculate gradients with the mini-batch strategy for both the generator and the discriminator over the entire dataset. In general, the model complexity and the training sample size are two primary factors that affect the efficiency of GAN-based imputation. Therefore, in this paper, we propose an effective SCalable Imputation System SCIS to optimize GAN-based imputation models. SCIS makes the GAN-based imputation model differentiable via the optimal transport theory. Then, for the differentiable model, it pays attention to the training sample size, and estimates an appropriate sample size for efficient and accuracy-guaranteed training. The system is composed of two modules, a differentiable imputation modeling (DIM) module and a sample size estimation (SSE) module. In terms of the first challenge, the DIM module leverages a masking Sinkhorn (MS) divergence to convert a GAN-based imputation model into a differentiable one, which can always provide reliable gradients to avoid the "vanishing" gradient problem. Regarding the second challenge, the SSE module estimates the minimum sample size that enables the trained differentiable GAN-based model to meet a user-specified error tolerance. Hence, SCIS employs the minimum sample size to make the differentiable GAN-based model scalable on large-scale incomplete data through SSE. In summary, the main contributions of this paper are described as follows.
• We propose an effective scalable imputation system SCIS with differentiable GAN-based imputation models, which can be effectively and efficiently trained.
• In the DIM module, we put forward an MS divergence to quantify the closeness between the true underlying and generated data distributions. It employs the optimal transport theory to make the GAN-based imputation model differentiable, avoiding the "vanishing" gradient issue.
• The SSE module leverages the differentiability of the MS divergence to estimate the minimum sample size for training an approximate differentiable GAN-based model, according to a user-specified imputation accuracy. It thus significantly saves the model training cost.
• Extensive experiments using several real-life large-scale incomplete datasets demonstrate the computational benefits of SCIS, compared with the state-of-the-art methods.
The rest of the paper is organized as follows. We introduce the background in Section II. Section III gives an overview of the proposed system SCIS. We elaborate on the DIM and SSE modules in Section IV and Section V, respectively. Section VI reports the experimental results and findings. Finally, we conclude this work in Section VII.
Existing imputation algorithms can be categorized by their mainly used models, including statistical ones, machine learning ones, and deep learning ones. The statistical imputation methods substitute the missing values with statistics [10], or with the most similar values among the training data [21], [22]. This kind of method has a limited ability to handle the missing data problem, since it ignores the data distribution analysis. In contrast, the machine learning imputation approaches train parametric machine learning models [23], [24] to estimate the missing values, including decision tree models like XGBoost imputation [25], MissFI (MissForest imputation) [11], and Baran [26], and regression models like MICE (multivariate imputation by chained equations) [8] and imputation via individual models [27]. These imputation methods employ batch gradient descent techniques [28] to calculate the model gradient over the entire dataset. The main issue is that the incomplete dataset may be too large to fit in memory. Moreover, the deep learning imputation methods leverage deep learning models [29], [30] to impute missing values with mini-batch gradient descent. This category consists of i) MLP-based ones like DataWig [31] and RRSI (round-robin Sinkhorn imputation) [12], ii) AE-based ones, e.g., MIDAE (multiple imputation denoising autoencoder) [32], VAEI (variational autoencoder imputation) [33], EDDI (efficient dynamic discovery of high-value information framework) [34], HIVAE (heterogeneous incomplete variational autoencoder) [35], MIWAE (missing data importance-weighted autoencoder) [9], and not-MIWAE (not-missing-at-random importance-weighted autoencoder) [13], and iii) GAN-based ones, such as GINN (graph imputation neural network) [15] and GAIN (generative adversarial imputation network) [14]. The above methods calculate the model gradients with a series of random partitions of the dataset, to train the imputation models over large-scale incomplete data. Nevertheless, both the iteration times and the training cost of these methods increase dramatically with the rising volume of incomplete data. As analyzed earlier, the data generated by GAN-based methods tend to be closer to the true underlying data manifold than that of the other ones. Nevertheless, due to the high training cost and the "vanishing" gradient problem, the GAN-based imputation methods face a big challenge in dealing with large-scale incomplete data imputation. In this paper, our proposed system SCIS aims to provide good effectiveness and scalability for GAN-based imputation models, so as to make them practical in real-world applications. The input incomplete dataset is stored in a matrix X = (x_1, · · · , x_N)^⊤, in which each data sample x_i = (x_i1, · · · , x_id) with x_ij ∈ X_j takes values from a d-dimensional space X = X_1 × · · · × X_d. For encoding its missing information, we define a mask matrix M = (m_1, · · · , m_N)^⊤, where each mask vector m_i = (m_i1, · · · , m_id) corresponds to a data sample x_i. In particular, m_ij takes value from {0, 1}, i = 1, · · · , N, and j = 1, · · · , d; m_ij = 1 (resp. 0) iff the j-th dimension of x_i is observed (resp. missing). Note that we use X and X_N interchangeably throughout this paper.
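To make this encoding concrete, the following minimal NumPy sketch (illustrative only, not the authors' released code) builds the mask matrix M from an incomplete matrix X whose missing entries are stored as NaN:

```python
import numpy as np

# Toy incomplete dataset: N = 3 samples, d = 2 features; NaN marks a missing entry.
X = np.array([[0.2, np.nan],
              [0.5, 0.7],
              [np.nan, 0.1]])

# Mask matrix M: m_ij = 1 iff the j-th dimension of x_i is observed, 0 otherwise.
M = (~np.isnan(X)).astype(np.float32)

print(M)
# [[1. 0.]
#  [1. 1.]
#  [0. 1.]]
```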
Table I lists the frequently used symbols throughout this paper.

TABLE I: Frequently used symbols.
X: an input incomplete dataset (stored in a matrix)
M: the mask matrix w.r.t. X
M, M′: an imputation model and the optimized model
X̄, X̂: the reconstructed matrix and the imputed matrix of X
X_0, M_0: the initial dataset matrix and its mask matrix
X_v, M_v: the validation dataset matrix and its mask matrix
n_0, N_v: the initial sample size and the validation sample size
ε, α: the user-tolerated error bound and the confidence level

In general, we formalize the data imputation problem over X and M in Definition 1.

Definition 1 (Data imputation). Given an incomplete dataset X with its mask matrix M, the data imputation problem aims to build an imputation model M to find appropriate values for the missing entries in X, i.e., to produce the imputed matrix
  X̂ = M ⊙ X + (1 − M) ⊙ X̄,   (1)
where ⊙ is the element-wise multiplication and X̄ = M(X) is the reconstructed matrix predicted by M over X. In this way, the imputation model M (i) makes X̂ as close to the real complete dataset (if it exists) as possible, or (ii) helps downstream prediction tasks achieve better performance when adopting X̂ than when using only the original X.

In this paper, our study mission is to empower the GAN-based imputation model with efficiency and effectiveness for large-scale incomplete data, such that for the optimized model, (i) the training cost is minimized, and (ii) the imputation accuracy satisfies a user-tolerated error bound. In particular, the GAN-based imputation model builds an adversarial training platform for two players, i.e., the generator and the discriminator. The generator is applied to generate data as close to the true underlying data distribution as possible, while the discriminator distinguishes the generated data from the true underlying data as correctly as possible. The objective function of GAN-based imputation is defined as a minimax problem over the generator and the discriminator. As a result, the GAN-based model employs the optimized generator to impute missing values via Eq. 1.

In this section, we present an overview of the proposed system SCIS. It consists of a differentiable imputation modeling (DIM) module and a sample size estimation (SSE) module. To facilitate effective and scalable imputation on large-scale incomplete data, SCIS first converts a GAN-based imputation model into a differentiable one, to avoid the "vanishing" gradient problem. Then, for such a differentiable GAN-based imputation model, SCIS minimizes the training sample size under accuracy guarantees to accelerate the imputation. Its general procedure is described in Algorithm 1. At first, SCIS samples a size-N_v validation dataset X_v ∈ R^{N_v×d} (with the validation mask matrix M_v) from the incomplete input dataset X ∈ R^{N×d} (with the mask matrix M), while it samples a size-n_0 dataset X_0 ∈ R^{n_0×d} (with the initial mask matrix M_0) from the rest (line 1). Then, SCIS invokes the DIM module to train an initial model M_0 with the masking Sinkhorn (MS) divergence imputation loss function over X_0 and M_0 (line 2). Next, with the support of the differentiability of the MS divergence, SCIS consults the SSE module to estimate the minimum sample size n′ (with n_0 ≤ n′ ≤ N) based on M_0, such that the imputation difference between the models trained on the size-n′ dataset and on the (whole) size-N dataset does not exceed the user-tolerated error bound ε with probability at least (1 − α). In particular, if n′ is equal to n_0, SCIS simply outputs M_0 and the matrix X̂ imputed by M_0. Otherwise, SCIS constructs a size-n′ sample set X′ and its mask matrix M′ from X and M, and invokes the DIM module again to retrain M_0 on X′ and M′ (lines 3-5).
Finally, SCIS returns the trained model M′ and the matrix X̂ imputed by M′.

Algorithm 1: The Procedure of SCIS
Input: an incomplete set X with its mask matrix M, a validation size N_v, an initial size n_0, a GAN-based model M, a user-tolerated error bound ε, and a confidence level α
Output: the trained model M′ and the imputed data X̂
1: sample a size-N_v validation dataset X_v (with M_v) and a size-n_0 initial dataset X_0 (with M_0)
2: invoke DIM to train an initial model M_0 with the MS divergence loss function over X_0 and M_0
3: consult SSE to derive the minimum size n′ that satisfies the error bound ε with probability at least (1 − α)
4: if n′ = n_0 then M′ ← M_0
5: else invoke DIM to train M_0 on the minimum sample set X′ and M′ to obtain the optimized model M′
6: reconstruct X via M′ to obtain X̄
7: X̂ = M ⊙ X + (1 − M) ⊙ X̄
8: return M′ and X̂

For the GAN-based imputation model, the intersection of the supports of the true underlying and generated data distributions is usually negligible [18]. In such a case, the JS divergence makes the GAN-based imputation models non-differentiable, and even makes them suffer from the "vanishing" gradient problem. This problem may prevent the model parameters from updating, or even stop the model from further training. In SCIS, the differentiable imputation modeling (DIM) module designs a masking Sinkhorn (MS) divergence by deploying the mask matrix from the imputation task together with optimal transport theory, to make GAN-based imputation models differentiable, and thus obtain reliable gradients during model training and circumvent the "vanishing" gradient problem. In this section, we first introduce the MS divergence imputation loss function. Then, we elaborate on how to optimize the GAN-based imputation model via the MS divergence.

The differentiable masking Sinkhorn (MS) divergence is devised by introducing the mask matrix, an entropic regularizer, and corrective terms into the optimal transport metric. Specifically, in the MS divergence, we first empower the optimal transport metric with the mask matrix, to devise the masking optimal transport metric for data imputation. Then, we put forward the entropic regularizer to make the masking regularized optimal transport metric differentiable and programmable. We also equip the MS divergence with corrective terms to correct the bias introduced by the entropic regularizer. To be more specific, the masking optimal transport metric is introduced by deploying the mask matrix within the optimal transport metric, as stated in Definition 2. It is based on a simple intuition that the observed and generated data should share the same distribution. Let μ̂_x, μ̂_m, and ν̂_x̄ denote the empirical measures over a size-n data matrix X_n ⊂ X, its mask matrix M_n, and the reconstructed matrix X̄_n, respectively, built from the Dirac distributions δ_{x_i}, δ_{m_i}, and δ_{x̄_i}.

Definition 2 (The masking optimal transport). The masking optimal transport metric OT_m(ν̂_x̄, μ̂_x) over ν̂_x̄ and μ̂_x is defined as the optimal transport metric OT(ν̂_x̄ ⊗ μ̂_m, μ̂_x ⊗ μ̂_m) over ν̂_x̄ ⊗ μ̂_m and μ̂_x ⊗ μ̂_m, i.e.,
  OT_m(ν̂_x̄, μ̂_x) = OT(ν̂_x̄ ⊗ μ̂_m, μ̂_x ⊗ μ̂_m) = min_{P ∈ Γ_{n,n}} ⟨P, C_m⟩,
where ν̂_x̄ ⊗ μ̂_m (resp. μ̂_x ⊗ μ̂_m) stands for the product measure of ν̂_x̄ (resp. μ̂_x) and μ̂_m. The transportation plan matrix P is from the set Γ_{n,n} := {P ∈ R^{n×n} : P 1_n = (1/n) 1_n, P^⊤ 1_n = (1/n) 1_n}. The masking cost matrix C_m is built by applying the mask vectors element-wise inside the cost function f_c(x, y) = ||x − y||_2^2, where ⊙ is the element-wise multiplication and ⟨P, C_m⟩ = tr(P^⊤ C_m) is the Frobenius dot-product of P and C_m. Nevertheless, the masking optimal transport metric is still not differentiable [36]. Moreover, computing the transport plan P in the masking optimal transport metric requires solving a costly linear program.
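To make the role of the mask in the cost concrete, here is a small NumPy sketch of a mask-filtered squared-Euclidean cost matrix between reconstructed and observed samples; the pairing of mask vectors with samples is an illustrative assumption for this sketch, not the paper's exact definition of C_m:

```python
import numpy as np

def masked_cost_matrix(X_bar, X, M):
    """Pairwise masked squared-Euclidean costs between reconstructed and observed samples.

    X_bar : (n, d) reconstructed samples
    X     : (n, d) observed samples (missing entries already filled somehow)
    M     : (n, d) mask matrix, 1 = observed, 0 = missing
    The mask of the target observed sample filters both arguments, so only
    observed coordinates contribute to the cost (an illustrative choice).
    """
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = M[j] * (X_bar[i] - X[j])   # element-wise masking
            C[i, j] = np.dot(diff, diff)      # squared two-norm cost f_c
    return C
```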
As a result, we further introduce a masking regularized optimal transport metric, as stated in Definition 3. This metric becomes differentiable and programmable through an entropic regularization term [37].

Definition 3 (The masking regularized optimal transport). The masking regularized optimal transport metric OT_m^λ(ν̂_x̄, μ̂_x) is defined as
  OT_m^λ(ν̂_x̄, μ̂_x) = min_{P ∈ Γ_{n,n}} ⟨P, C_m⟩ + λ h(P),   (2)
where h(P) is the entropic regularizer over the transport plan P and λ is a hyper-parameter.

Due to the entropic regularizer h(P) in Eq. 2, the resolution of the optimal transport plan P becomes tractable using Sinkhorn's algorithm [38], and thus the masking regularized optimal transport metric becomes algorithmically differentiable and easily programmable. However, the entropic regularizer makes the masking regularized optimal transport metric no longer positive. This may make the imputation models focus too much on learning the mean of the true underlying data distribution (i.e., ignoring the whole distribution). To get rid of this issue, we add corrective terms into the MS divergence to ensure its positivity.

Definition 4 (The masking Sinkhorn divergence). The masking Sinkhorn (MS) divergence S_m(ν̂_x̄||μ̂_x) between the empirical measures ν̂_x̄ and μ̂_x is defined as
  S_m(ν̂_x̄||μ̂_x) = OT_m^λ(ν̂_x̄, μ̂_x) − (1/2) OT_m^λ(ν̂_x̄, ν̂_x̄) − (1/2) OT_m^λ(μ̂_x, μ̂_x).

Additionally, we instantiate how the MS divergence handles the "vanishing" gradient problem in the following example. Hereby, we also show the non-differentiable property of the JS divergence and the differentiable property of the MS divergence. Throughout this paper, we assume that the data are missing completely at random (i.e., MCAR).

Example 1: For the imputation task defined in Definition 1, we consider d = 1 and X_1 = R. Under the MCAR mechanism, the mask vector m ∈ {0, 1} is independent of the sample x ∈ X_1, i.e., p_m(m|x) = p_m(m). Thus, a joint distribution of x and m can be defined as p(x, m) = p_x(x) p_m(m). In particular, the true underlying and generated data distributions are defined as p_x^0(x) = δ_0 and p_x^θ(x) = δ_θ with the parameter θ ∈ R, respectively. Besides, the missingness distribution p_m(m) is supposed to follow a Bernoulli distribution Ber(q) with a constant q ∈ (0, 1). For simplicity, we denote p_0 and p_θ as the distributions p_x^0(x) p_m(m) and p_x^θ(x) p_m(m), respectively. Thus, the JS divergence between p_0 and p_θ is
  JS(p_0 || p_θ) = log 2 if θ ≠ 0, and 0 if θ = 0.
We can find that JS(p_0 || p_θ) is not continuous at θ = 0, and thus non-differentiable. Moreover, the gradients of JS(p_0 || p_θ) are 0 almost everywhere. Thus, JS(p_0 || p_θ) provides useless gradient information to update the model parameter θ, which underlies the "vanishing" gradient problem. In contrast, the differentiable MS divergence between p_0 and p_θ is an infimum, over the set Π(p_0, p_θ) of all joint distributions γ(y, y′) whose marginals are p_0 and p_θ respectively, of the expected masked transport cost; the optimal joint distribution γ*(y, y′) simply couples each point of p_0 with its shifted counterpart under p_θ, so that S_m(p_0, p_θ) grows quadratically in θ. It is obvious that S_m(p_0, p_θ) is differentiable everywhere with respect to θ. The gradients of S_m(p_0, p_θ) vary linearly, which can always provide reliable gradient information to update θ, and thus dispose of the "vanishing" gradient problem.
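As a rough illustration of why the entropic regularizer makes Eq. 2 easy to compute and differentiate, the following sketch runs plain Sinkhorn iterations on a given cost matrix; it is a generic textbook version of Sinkhorn's algorithm [38] for uniform marginals, not the paper's exact implementation:

```python
import numpy as np

def sinkhorn_plan(C, lam, n_iters=200):
    """Entropy-regularized OT plan for uniform marginals (1/n) 1_n.

    C   : (n, n) cost matrix (e.g., the masking cost matrix C_m)
    lam : regularization weight lambda; smaller lam is closer to exact OT
    Returns the transport plan P and the transport cost <P, C>.
    """
    n = C.shape[0]
    a = np.full(n, 1.0 / n)          # row marginal
    b = np.full(n, 1.0 / n)          # column marginal
    K = np.exp(-C / lam)             # Gibbs kernel
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iters):         # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)  # transport plan satisfying the marginals
    return P, float(np.sum(P * C))
```

In a deep learning framework, these same iterations can be written with differentiable tensor operations, so gradients flow through the plan; this is what underlies the differentiability exploited by the MS divergence.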
In general, the differentiable MS divergence imputation loss function can be defined as
  L_s(X̄_n, M_n) = S_m(X̄_n ⊙ M_n, X_n ⊙ M_n).
By virtue of the differentiable MS divergence, the MS divergence imputation loss function provides a usable and reliable gradient during the training of the GAN-based model, and thus helps get rid of the "vanishing" gradient issue. It is worthwhile to note that the proposed MS divergence in SCIS minimizes the MS divergence between the true underlying data distribution in the observed data and the generated data distribution for GAN-based imputation models. Hence, with the support of the MS divergence, the true underlying data distribution can be well preserved in the imputed data from the observed values. This is different from the Sinkhorn divergence used in the round-robin Sinkhorn imputation algorithm (RRSI) [12], where the Sinkhorn divergence between any two imputed batches is minimized for the MLP-based imputation model. In essence, the data distribution predicted by RRSI easily converges to a mixed distribution of the observed data and the initially imputed data, rather than the true underlying one, especially when there is a large amount of missing data. More intuitively, at the initial step, RRSI fills the missing data with the mean value of the observed data for each incomplete feature separately. Thus, RRSI is likely to regard two samples with similar missing patterns as similar, while these two samples may actually be different in the ground-truth space.

In the DIM module, we convert a GAN-based imputation model into a differentiable one by taking the MS divergence to measure the closeness between the true underlying data distribution in X_n and the generated data distribution in X̄_n. In pursuit of deriving a GAN-based imputation model that predicts X̄_n as similar to the observed data in X_n as possible, our goal becomes to find the optimized parameter θ̂ minimizing S_m(ν̂_x̄||μ̂_x) over a parameter space Θ. Formally, we rewrite S_m(ν̂_x̄||μ̂_x) as S_m(X̄_n ⊙ M_n, X_n ⊙ M_n). Therefore, the MS divergence imputation loss minimizer is given by
  θ̂ = argmin_{θ ∈ Θ} S_m(X̄_n ⊙ M_n, X_n ⊙ M_n).   (3)
By using the chain rule and the barycentric transport map [39], the MS divergence gradient function can be derived by the following proposition.

Proposition 1: The MS divergence gradient function g(θ) can be calculated by
  g(θ) = ∇_θ S_m(X̄_n ⊙ M_n, X_n ⊙ M_n) = (∇_θ X̄_n)^⊤ ∇_{X̄_n} S_m(X̄_n ⊙ M_n, X_n ⊙ M_n),
where ∇_θ S_m(·) and ∇_{X̄_n} S_m(·) are the derivatives of S_m(·) with respect to the parameter θ and the reconstructed matrix X̄_n, respectively; ∇_θ X̄_n is the derivative of X̄_n with respect to θ; P = (P_ij)_ij is the optimal transport plan; and T(m_i) transforms a mask vector m_i into a diagonal matrix.

Proof. First, by using the barycentric transport map, we express ∇_{X̄_n} S_m in terms of the optimal transport plan P and the diagonal mask matrices T(m_i), for all i ∈ {1, ..., n}. Then, by using the chain rule, the gradient of L_s(X̄_n, M_n) with respect to θ follows.

Fig. 1. The architecture of the DIM module.

Consequently, as depicted in Figure 1, the DIM module takes the data matrix X_n and its mask matrix M_n as inputs, and outputs the optimized GAN-based imputation model M_n and the data matrix X̂_n imputed by M_n. Using the MS divergence gradient function, DIM employs a mini-batch gradient descent technique [40] to solve the optimization problem in Eq. 3 for the differentiable GAN-based imputation model. To be more specific, similar to the studies [19], [41], the discriminator is trained to maximize the MS divergence between the true underlying and generated data distributions, while the generator is trained by minimizing the MS divergence metric evaluated by the newly updated discriminator. As a result, the DIM module employs the optimized GAN-based imputation model M_n to impute missing values in X_n via Eq. 1.
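The alternating scheme described above can be pictured with the following PyTorch-style sketch. The names `generator`, `discriminator`, `ms_divergence`, and `loader` are hypothetical stand-ins (the paper does not publish this exact interface); the loop only illustrates the sign convention that the discriminator ascends the MS divergence while the generator descends it.

```python
import torch

def train_dim(generator, discriminator, ms_divergence, loader, lam=130.0,
              lr=1e-3, epochs=100):
    """Illustrative alternating mini-batch training with an MS-divergence loss.

    generator, discriminator : torch.nn.Module instances (hypothetical models)
    ms_divergence(x_obs, x_gen, mask, disc, lam) -> scalar tensor (hypothetical)
    loader : yields (x, m) mini-batches, x with missing entries pre-filled
    """
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(epochs):
        for x, m in loader:
            x_bar = generator(x, m)                 # reconstructed batch
            x_hat = m * x + (1 - m) * x_bar         # Eq. 1 combine

            # Discriminator step: maximize the masked Sinkhorn divergence.
            d_loss = -ms_divergence(x, x_hat.detach(), m, discriminator, lam)
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Generator step: minimize the divergence under the updated critic.
            g_loss = ms_divergence(x, x_hat, m, discriminator, lam)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return generator
```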
With the support of the differentiable GAN-based imputation model in the DIM module, the goal of the sample size estimation (SSE) module is to analytically infer the minimum sample size n′ (with n_0 ≤ n′ ≤ N) under accuracy guarantees. With the SSE module, the imputation difference D(M′, M_N) between the model M′ trained on the size-n′ (in)complete samples from X and the model M_N trained on the given incomplete dataset X is well-controlled, i.e.,
  P( D(M′, M_N) ≤ ε ) ≥ 1 − α,   (4)
where D(M′, M_N) measures the expected difference between the imputations produced by the two models on the missing entries; ε is a user-tolerated error bound; α is a confidence level; m is the mask vector of a sample x; and x̄′ and x̄_N are the vectors reconstructed by M′ and M_N over x, respectively. The core idea of SSE is that the differentiability of the MS divergence enables SCIS to determine the distributions of the parameters θ′ and θ_N for the GAN-based imputation models (i.e., M′ and M_N), based on which we exploit an empirical estimation to calculate the probability P(D(θ′, θ_N) ≤ ε), i.e., P(D(M′, M_N) ≤ ε). Specifically, in the SSE module, we first estimate the distributions of the parameters θ′ and θ_N with the given parameter θ_0 of the initial model M_0 by using the minimum distance estimator on the differentiable MS divergence, as shown in Theorem 1. Then, we employ an empirical estimation to calculate the probability P(D(θ′, θ_N) ≤ ε), as presented in Proposition 2. Finally, with the estimation of P(D(θ′, θ_N) ≤ ε), we use binary search to find the minimum size n′ of (in)complete samples that satisfies the users' expectation on imputation accuracy in Eq. 4.

To begin with, we study the estimation of the distribution of the parameter θ_n in an unknown differentiable GAN-based imputation model M_n with a size-n (n_0 ≤ n ≤ N) (in)complete sample set from X, by using the parameter θ_0 of the initial model M_0. Inspired by the asymptotic normality of the minimum distance estimator [42], we estimate the parameter distribution of θ_n in M_n with the given θ_0 of M_0, as stated in Theorem 1. For brevity, we write, for two positive-valued functions g and h over the same domain and upper bounded by a positive constant, g ≍ h iff g/h and h/g are bounded by a constant. In particular, the MS divergence gradient function makes Theorem 1 work for GAN-based imputation models, while the corresponding theorem in [43] only targets traditional machine learning models.

Theorem 1: Given the parameter θ_0 of a differentiable generative adversarial imputation model M_0, the conditional probability distribution of θ̂_n with respect to M_n satisfies
  θ̂_n | θ̂_0 → N(θ_0, η H^{−1})   (5)
as n_0 → ∞ and n → ∞, where η is a scale factor determined by n_0, n, and the hyper-parameter λ in the MS divergence; H is the invertible Hessian matrix of the MS divergence imputation loss function at the given parameter θ_0; and N(θ_0, ηH^{−1}) denotes a multivariate normal distribution with mean θ_0 and covariance matrix ηH^{−1}.

Proof. To estimate the conditional probability distribution of θ̂_n, we first derive the distribution of θ̂_0 − θ_∞ by the multivariate central limit theorem, where θ_∞ is the conceptual optimal parameter when the sample size approaches infinity. Then, we infer the distribution of θ̂_0 − θ̂_n based on that of θ̂_0 − θ_∞. Finally, we exploit Bayes' theorem to estimate the distribution of θ̂_n | θ̂_0 from the distribution of θ̂_0 − θ̂_n. Specifically, since the MS divergence is differentiable, θ_0 is obtained by finding the parameter at which g(θ) becomes 0, i.e., θ_0 satisfies g(θ_0) = 0.
According to the mean-value theorem, there exists θ̃_0 between θ̂_0 and θ_∞ that satisfies g(θ̂_0) = g(θ_∞) + ∇g(θ̃_0)(θ̂_0 − θ_∞). Besides, we note that θ_0 is simply an instance from the distribution of θ̂_0. Therefore, we can find that θ̂_0 − θ_∞ asymptotically follows a zero-mean normal distribution with covariance (1/n_0) ζ(λ) H^{−1} (Eqs. 6 and 7), where ζ(λ) ≍ e^{2κ/λ} (1 + 1/λ^{d/2})²; J and H are the information matrix and the invertible Hessian matrix of the MS divergence imputation loss function with the given parameter θ_0, respectively. The transition from Eq. 6 to Eq. 7 is based on the multidimensional central limit theorem and the sample complexity result for Sinkhorn divergences [44], in which the constant κ depends on the Lipschitz constant L of the cost function f_c and the size |X| of the data domain. In our case, both |X| and L will be 1, since the input data are normalized to [0, 1]^d and the cost function f_c is a squared two-norm function. Thus, ζ(λ) ≍ e^{6/λ} (1 + 1/λ^{d/2})². Moreover, the information matrix equality J = −H implies the last equation. Then, with the estimated distribution of θ̂_0 − θ_∞, we further infer the distribution of θ̂_0 − θ̂_n. To be more specific, since X_n can be regarded as a union of X_0 and X_n − X_0, we introduce two random variables V_1 and V_2 independently following N(0, ζ(λ)H^{−1}) to capture the randomness of X_0 and X_n − X_0, respectively. Note that θ̂_0 − θ_∞ → (1/√n_0) V_1. In this way, we can express θ̂_n − θ_∞ as a size-weighted combination of V_1 and V_2. Moreover, we exploit the fact that θ̂_0 − θ̂_n, which is the sum of two independent normal random variables, asymptotically follows a normal distribution. Thus, when both n_0 and n approach infinity, θ̂_0 − θ̂_n follows
  θ̂_0 − θ̂_n → N(0, ζ(λ)(1/n_0 − 1/n) H^{−1}).
Finally, we exploit Bayes' theorem to estimate the distribution of θ̂_n | θ̂_0 based on the distribution of θ̂_0 − θ̂_n. Observe that θ̂_0 − θ̂_n and θ̂_n − θ_∞ are independent because the covariance between them is zero. By using Bayes' theorem, P(θ̂_n | θ̂_0) = P(θ̂_0 | θ̂_n) P(θ̂_n) / Z for some normalization constant Z. Since we do not have any extra information about the prior probability P(θ̂_n), we set P(θ̂_n) to a constant. Therefore, we can find that the conditional probability distribution of θ̂_n with respect to M_n satisfies θ̂_n | θ̂_0 → N(θ_0, ηH^{−1}).

For the differentiable GAN-based imputation model M_0, the invertible Hessian matrix H in Eq. 5 can be computed from the first- and second-order derivatives of the reconstructed entries with respect to θ. In particular, ∇²_θ x̄_ik indicates the Hessian matrix of x̄_ik with respect to θ. The approximation follows by ignoring the first term in the first equation, i.e., the term that couples ∇²_θ x̄_ik with the residual between the reconstructed and observed values. This first term is negligible, since the reconstructed vector x̄_i happens to be very close to the observed values in x_i ∈ X_0, on which the initial model M_0 has been trained [45].

As a result, with the given parameter θ_0 and the estimated probability distribution of the parameter θ_n, we employ an empirical estimation to determine P(D(θ_0, θ_n) ≤ ε). It is inspired by the conservative estimation developed on Hoeffding's inequality in [43]. We can guarantee the probability P(D(θ_0, θ_n) ≤ ε) to be at least (1 − α) by requiring the empirical estimation of P(D(θ_0, θ_n) ≤ ε) to be no less than a certain value, as described in Proposition 2. In essence, we exploit interval estimation to derive a new lower bound for the empirical statistic of P(D(θ_0, θ_n) ≤ ε) in Proposition 2. The lower bound is orthogonal to the one in [43], since it considers an additional hyper-parameter β to bound the error between P(D(θ_0, θ_n) ≤ ε) and its empirical statistic.

Proposition 2: Let f(θ_n) be the density function of the conditional probability distribution θ̂_n | θ̂_0. Suppose θ_{n,1}, · · · , θ_{n,k} are i.i.d. samples drawn from f(θ_n), where k is the number of parameter samples.
The inequality P(D(θ_0, θ_n) ≤ ε) ≥ 1 − α holds if ε satisfies that the empirical statistic (1/k) Σ_{i=1}^{k} I(D(θ_0, θ_{n,i}) ≤ ε) is no less than a threshold determined by α, β, and k, where β is a hyper-parameter (0 < β ≤ α ≤ 1) and I is the indicator function that returns 1 if its argument is true and 0 otherwise.

Proof. By applying Hoeffding's inequality, the true probability a = P(D(θ_0, θ_n) ≤ ε) deviates below its empirical statistic b = (1/k) Σ_{i=1}^{k} I(D(θ_0, θ_{n,i}) ≤ ε) by more than ε_1 with probability at most exp(−2kε_1²). Hence, for a given k, we can select ε_1 ≥ √(ln(1/β)/(2k)), so as to have (1 − β) probability of assuring that a is not less than b − ε_1. Furthermore, to guarantee that P(D(θ_0, θ_n) ≤ ε) ≥ 1 − α, we take a conservative estimation, i.e., we ensure that b − ε_1 ≥ 1 − α. Thus, we can finally have at least the required confidence in guaranteeing D(θ_0, θ_n) ≤ ε.

Therefore, according to Proposition 2, the probability P(D(θ_n, θ_N) ≤ ε) can be approximated by
  (1/k) Σ_{i=1}^{k} I( D(θ_{n,i}, θ_{N,i}) ≤ ε ),
where (θ_{n,i}, θ_{N,i}) is a parameter pair sampled from the conditional probability distributions θ̂_{n,i} | θ̂_0 and θ̂_N | θ̂_{n,i}, respectively. Moreover, P(D(θ_n, θ_N) ≤ ε) is an increasing function with respect to n, which is derived in a similar way as in [43]. Based on Proposition 2, the SSE module uses binary search to find the minimum size n′ (with n_0 ≤ n′ ≤ N) of (in)complete samples in X that satisfies the users' expectation on imputation accuracy. The search procedure terminates if n′ satisfies Eq. 4 while (n′ − 1) does not, i.e., P(D(θ_{n′}, θ_N) ≤ ε) ≥ 1 − α and P(D(θ_{n′−1}, θ_N) ≤ ε) < 1 − α, where θ_{n′−1} is the parameter of the GAN-based imputation model trained with the size-(n′ − 1) (in)complete samples.

In this section, we evaluate the performance of our proposed system SCIS via comparisons with eleven state-of-the-art imputation methods. All algorithms were implemented in Python. The experiments were conducted on an Intel Core 2.80GHz server with a TITAN Xp 12GiB GPU and 192GB RAM, running Ubuntu 18.04.

Datasets. In the experiments, we use six public real-world incomplete COVID-19 datasets over several scenarios. Table II lists the statistics of these datasets.

Metrics. In the evaluation, we use the training time and the root mean squared error (RMSE) to measure the efficiency and effectiveness of imputation models. We also report the training sample rate R_t, i.e., how many samples are used for training the models (100% for the original methods and n′/N × 100% for SCIS). The smaller the metric value, the better the imputation performance. To obtain the RMSE values, we randomly remove 20% of the observed values during training for imputation, and use these observed values as the ground truth for the missing values. In the evaluation, each value is reported by averaging the results of five runs under different random data divisions.

Imputation methods. In the experiments, the baselines include eleven state-of-the-art imputation methods, namely three machine learning ones: MissFI, Baran, and MICE; two MLP-based ones: DataWig and RRSI; four AE-based ones: MIDAE, VAEI, EDDI, and HIVAE; and two GAN-based ones: GINN and GAIN. We evaluate SCIS on top of the GAN-based imputation methods.

Implementation details. For all imputation methods, we thank the authors of each algorithm for providing the source codes, and we directly utilize these source codes with default parameter settings in our experimental evaluation. Specifically, for all machine learning imputation methods, the learning rate is 0.3, and the number of iterations is 100. The number of decision trees in MissFI is 100. Baran employs AdaBoost as the prediction model. The number of imputations in MICE is 20. For all deep learning imputation methods, the learning rate is 0.001, the dropout rate is 0.5, the training epoch is 100, and the batch size is 128. The ADAM algorithm is utilized to train the networks. MIDAE is a 2-layer network with 128 units per layer.
For VAEI, the encoder and decoder are fully connected networks with two hidden layers, each with 20 neurons per layer, and the latent space is 10-dimensional. HIVAE uses only one dense layer for all the parameters of the encoder and decoder, each with 10 neurons per layer. In GINN, the discriminator is a simple 3-layer feed-forward network trained 5 times for each optimization step of the generator. In GAIN, both the generator and the discriminator are modeled as 2-layer fully connected networks. Moreover, in SCIS, the hyper-parameter λ is 130, the confidence level α is 0.05, the hyper-parameter β is 0.01, the number of parameter samples k in SSE is 20, and the user-tolerated error bound ε is 0.001. The initial sample size n_0 is 500 for Trial, 500 for Emergency, 2,000 for Response, 6,000 for Search, 20,000 for Weather, and 20,000 for Surveil. The validation sample size N_v is equal to n_0.

Table III and Table IV report the performance of the imputation methods over the six real-world incomplete datasets. Some results are unavailable (represented by "−"), since the corresponding methods are not able to finish within 10^5 seconds. One can observe that SCIS takes less training time and a smaller training sample rate R_t than the baselines, while it achieves competitive imputation accuracy (i.e., similar RMSE values). Moreover, SCIS-GAIN outperforms the other methods in most cases. Specifically, SCIS uses only 7.58% of the training samples and saves 41.75% of the training time of the GAN-based methods on average. For the last three million-size incomplete datasets in Table IV, the savings are even more pronounced. Moreover, the speedup of SCIS is not very obvious on the first three datasets in Table III, compared with the other ones in Table IV. This is because the initial sample size n_0 for the first three datasets is significantly smaller than that of the other ones. The smaller n_0, the larger the variance derived by Eq. 5, resulting in a higher training sample rate and training time. The training time (resp. training sample rate) even decreases to 4.52% (resp. 0.67%) on Search (resp. Surveil) for GAIN. This is attributed to the SSE module in SCIS that minimizes the required training sample size of GAN-based methods. In addition, the accuracy that is competitive with (or even better than) the original methods results from the MS divergence imputation loss function employed in SCIS, which measures the closeness between the true underlying data and generated data distributions. Also, the accuracy guarantee in SSE benefits the accuracy of SCIS. It even improves the accuracy of GAIN by 3.02% on Trial. In particular, the experimental results of SCIS-GINN are unavailable over the Search and Surveil datasets, since GINN has a high complexity in constructing the similarity graph. For further extensive experimental study on SCIS, we employ GAIN as the baseline, since GAIN can work on these datasets, providing a clear comparison benchmark.

Effect of R_m. When varying the missing rate R_m (i.e., the fraction of originally observed values that are dropped) from 10% to 90%, the corresponding results are depicted in Figure 2. The figure also reports the time cost of the SSE module, which is the core module of SCIS; the reported SCIS training time includes the execution time of SSE. We can find that, compared with GAIN, SCIS takes much less training time and fewer training samples to obtain similar or even higher imputation accuracy in all cases. It is also more robust than GAIN as the missing rate R_m increases. The SSE module takes 28.31% of the SCIS training time on average.
In addition, the imputation accuracy of both GAIN and SCIS is comparable, and it decreases consistently as the missing rate grows. The reason is that, as the missing rate increases, less observed information is available to the algorithms, making the imputation algorithms less effective.

Effect of ε. To verify the effectiveness of the imputation accuracy guarantee, we study the effect of the user-tolerated error bound ε on the performance of SCIS with GAIN. Figure 3 plots the corresponding results, together with the sample rate of the initial training set X_0, i.e., R_1 = n_0/N, and that of the minimum sample set X′, i.e., R_2 = n′/N. The user-tolerated error is derived by R_mse^u + ε, where R_mse^u is the RMSE value of GAIN with the MS divergence imputation loss function trained on X. In order to verify the effect of the DIM module on SCIS, we also report the imputation error R_mse^o + ε in the figure, where R_mse^o is the RMSE value of the original GAIN model trained on X. Some results are unavailable in the figure, because the corresponding methods cannot finish within 10^5 seconds. As inferred from the figure, SCIS has higher imputation accuracy than the user-tolerated error R_mse^u + ε and the imputation error R_mse^o + ε in most cases. It means that the SSE module can estimate an appropriate sample size to achieve imputation accuracy as good as the users' expectation, i.e., it indeed satisfies the accuracy requirement of users. Besides, the error R_mse^u derived by SCIS is smaller than R_mse^o derived by GAIN in many cases. It confirms that the DIM module using the MS divergence does boost the imputation accuracy. Moreover, in most cases, the RMSE of SCIS increases with the growth of ε, while R_2 shows the opposite trend. This is because a smaller value of ε signifies a lower user-tolerated error. Hence, more samples (i.e., a larger R_2) are needed to satisfy a lower RMSE requirement. In addition, when ε exceeds 0.005, the RMSE of SCIS changes only slightly, since n′ equals the lower bound n_0.

Effect of n_0. Figure 4 depicts the experimental results of varying the initial sample size n_0. For SCIS with GAIN, different datasets require different optimal n_0. SCIS achieves the best imputation accuracy (in terms of RMSE) when the optimal n_0 is chosen, i.e., 500 for Trial, 500 for Emergency, 2,000 for Response, 6,000 for Search, 20,000 for Weather, and 20,000 for Surveil. Meanwhile, the time consumption and training sample rate remain reasonable and acceptable. In addition, the R_t of SCIS increases with the decrease of n_0 in most cases. This is partially because the smaller the initial sample size n_0, the larger the variance derived by Eq. 5, leading to more training samples.

We investigate the influence of the different modules of SCIS on the imputation performance. The corresponding experimental results of the RMSE, training time, and training sample rate R_t are shown in Table V and Table VI. DIM-GAIN is the variant of SCIS-GAIN without the SSE module over GAIN. Fixed-DIM-GAIN, a variant of DIM-GAIN, randomly selects ten percent of the samples as the training data to accelerate the model training process. In Table VI, the results of DIM-GAIN are unavailable (represented by "−"), since they are not able to finish within 10^5 seconds. We can observe that DIM-GAIN gains better imputation accuracy (i.e., smaller RMSE values) than GAIN, while requiring higher training time.
Besides, compared with DIM-GAIN and Fixed-DIM-GAIN, SCIS-GAIN takes significantly less training time and fewer training samples, while showing a negligible decrease in imputation accuracy, especially on the last three incomplete datasets. Specifically, DIM-GAIN improves the accuracy of GAIN by 3.24% on average, but its training time is 4.68x that of GAIN on average. This confirms the effectiveness of the MS divergence imputation loss function in the DIM module. SCIS-GAIN takes 6.79% (resp. 67.88%) of the training samples and saves 72.29% (resp. 20.27%) of the training time of DIM-GAIN (resp. Fixed-DIM-GAIN) on average, while only decreasing the accuracy by 0.72% (resp. 0.51%) compared with DIM-GAIN (resp. Fixed-DIM-GAIN). In general, these results further confirm that the sample size estimation strategy in the SSE module is valid and indispensable. Moreover, for Fixed-DIM-GAIN and SCIS-GAIN, which use the same imputation model (i.e., DIM-GAIN), the more training samples (i.e., the higher the training sample rate R_t) the imputation method requires, the higher the imputation accuracy it achieves. This is because more training samples provide more available information for the imputation algorithms, making them more powerful.

The ultimate goal of imputing missing data is to benefit downstream data analytics, e.g., regression and classification. In the last set of experiments, we verify the superiority of SCIS over GAIN on the post-imputation prediction task. For the classification task over Trial and Surveil and the regression task over Emergency, Response, Search, and Weather, the corresponding prediction results are depicted in Table VII. A larger AUC value corresponds to a better prediction effect, while for MAE the opposite holds. The imputation methods are first employed to impute the missing values in the incomplete datasets. Then, a regression/classification model with three fully-connected layers is trained over the imputed data. The training epoch is 30, the learning rate is 0.005, the dropout rate is 0.5, and the batch size is 128. We can observe that the prediction performance under different imputation algorithms is consistent with the imputation performance of these algorithms, i.e., SCIS-GAIN achieves accuracy competitive with (or even better than) GAIN. Specifically, SCIS-GAIN decreases the MAE of GAIN by 0.51% on the regression task, while increasing the AUC of GAIN by 0.27% on the classification task. In addition, on the regression (resp. classification) task, SCIS-GAIN achieves the larger improvement over the Weather (resp. Response) dataset. Thus, it further confirms the effectiveness of SCIS.

In this paper, we propose an effective scalable imputation system SCIS to accelerate GAN-based imputation models. It consists of a DIM module and an SSE module. DIM makes generative adversarial imputation models differentiable by leveraging a new MS divergence. SSE estimates the minimum training sample size for the differentiable imputation model, so that the final trained model satisfies a user-tolerated error bound. Extensive experiments over several real-world datasets demonstrate that SCIS significantly accelerates the model training and meanwhile harvests imputation accuracy competitive with the state-of-the-art GAN-based methods. Almost all existing GAN-based imputation algorithms rely on the simple MCAR assumption. This to some extent limits the effectiveness of SCIS under real, more complex missing mechanisms.
In the future, we intend to further explore the imputation problem under more complex missing mechanisms for large-scale incomplete data.

REFERENCES
[1] Discovery of genuine functional dependencies from relational data with missing values
[2] Generative semi-supervised learning for multivariate time series imputation
[3] HoloClean: Holistic data repairs with probabilistic inference
[4] Supporting ranking queries on uncertain and incomplete data
[5] Embedded functional dependencies and data-completeness tailored database design
[6] Deep learning for missing value imputation in tables with non-numerical data
[7] BRITS: Bidirectional recurrent imputation for time series
[8] Multiple imputation by chained equations (MICE): Implementation in Stata
[9] MIWAE: Deep generative modelling and imputation of incomplete data sets
[10] A novel framework for imputation of missing values in databases
[11] MissForest: Non-parametric missing value imputation for mixed-type data
[12] Missing data imputation using optimal transport
[13] not-MIWAE: Deep generative modelling with missing not at random data
[14] GAIN: Missing data imputation using generative adversarial nets
[15] Missing data imputation with adversarially-trained graph convolutional networks
[16] A survey of missing data imputation using generative adversarial networks
[17] COVID-19 Open-Data: Curating a fine-grained, global-scale data repository for SARS-CoV-2, 2020, work in progress
[18] Towards principled methods for training generative adversarial networks
[19] The Cramér distance as a solution to biased Wasserstein gradients
[20] Wasserstein generative adversarial networks
[21] Comparison of various methods for handling incomplete data in software engineering databases
[22] An introduction to kernel and nearest-neighbor nonparametric regression
[23] Parrot: A progressive analysis system on large text collections
[24] Keyword search on large graphs: A survey
[25] XGBoost: A scalable tree boosting system
[26] Baran: Effective error correction via a unified context representation and transfer learning
[27] Learning individual models for imputation
[28] An overview of gradient descent optimization algorithms
[29] Deep multiple autoencoder-based multi-view clustering
[30] Predicting geolocation of tweets: Using combination of CNN and BiLSTM
[31] DataWig: Missing value imputation for tables
[32] Multiple imputation using deep denoising autoencoders
[33] Variational autoencoders for missing data imputation with application to a simulated milling circuit
[34] EDDI: Efficient dynamic discovery of high-value information with partial VAE
[35] Handling incomplete heterogeneous data using VAEs
[36] Computational optimal transport: With applications to data science
[37] Stochastic optimization for large-scale optimal transport
[38] A relationship between arbitrary positive matrices and doubly stochastic matrices
[39] Fast computation of Wasserstein barycenters
[40] SeqGAN: Sequence generative adversarial nets with policy gradient
[41] Improving GANs using optimal transport
[42] Large sample estimation and hypothesis testing
[43] BlinkML: Efficient maximum likelihood estimation with probabilistic guarantees
[44] Sample complexity of Sinkhorn divergences
[45] Efficient computation of Hessian matrices in TensorFlow
[46] A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker)