title: Deep Distribution-preserving Incomplete Clustering with Optimal Transport
authors: Luo, Mingjie; Wang, Siwei; Liu, Xinwang; Tu, Wenxuan; Zhang, Yi; Guo, Xifeng; Zhou, Sihang; Zhu, En
date: 2021-03-21

Clustering is a fundamental task in the computer vision and machine learning community. Although various methods have been proposed, the performance of existing approaches drops dramatically when handling incomplete high-dimensional data, which is common in real-world applications. To solve this problem, we propose a novel deep incomplete clustering method, named Deep Distribution-preserving Incomplete Clustering with Optimal Transport (DDIC-OT). To avoid the insufficient sample utilization of existing methods, which are limited by the scarcity of fully-observed samples, we propose to measure the distribution distance with optimal transport for reconstruction evaluation instead of the traditional pixel-wise loss function. Moreover, a clustering loss on the latent features is introduced to regularize the embedding with more discriminative capability. As a consequence, the network becomes more robust against missing features, and the unified framework that combines clustering and sample imputation enables the two procedures to negotiate with and better serve each other. Extensive experiments demonstrate that the proposed network achieves superior and stable clustering performance improvements over existing state-of-the-art incomplete clustering methods across different missing ratios.

Clustering is one of the fundamental and important unsupervised learning tasks in the data science, image analysis and machine learning communities [9], [14], [15], [16], [21], [27], [29], [32], [33]. A wide variety of data clustering methods have been proposed to organise similar items into the same groups and achieve promising performance, e.g., k-means clustering, Gaussian Mixture Models (GMM), spectral clustering and, more recently, deep clustering. However, existing clustering approaches all rest on the premise that the data themselves are complete, while data with missing features are quite common in reality. Data incompleteness occurs due to many factors, e.g., sensor failure, unfinished collection and data storage corruption. For example, face images covered with masks during COVID-19 lead to missing features [5]. Faced with various types of missing features, incomplete data clustering has drawn increasing attention in recent years [13], [14], [15], [24], [25], [30] (see Figure 1). Existing incomplete clustering methods can be roughly categorized into two mechanisms, heuristic-based and learning-based. Both of them first impute the missing features so that the completed data matrix can be handled by traditional clustering algorithms. The heuristic imputation methods often rely on statistical properties, e.g., zero-filling (ZF) and mean-filling (MF) after normalization. Median values are also popular for imputation in genetic studies. In particular, the KNN-filling method considers locally reliable partners and fills the missing entries with the mean value of the k closest neighbors. When faced with complex high-dimensional data, heuristic-based methods perform poorly since these simple imputations cannot capture enough information to precisely recover the data. Recently, learning-based imputation methods have received enormous attention and have become the mainstream.
Existing work can be categorized into shallow and deep learning frameworks. The shallow representatives normally assume that the data are low-rank and therefore apply iterative methods to recover the missing values [3], [8], [17], [20], [26]. Moreover, the Expectation-Maximization (EM) algorithm iteratively estimates the maximum likelihood and then infers the missing variables until convergence. With the improvement of deep learning architectures, various deep networks have been proposed to handle incompleteness. A desirable attribute of deep approaches is that they should accurately infer the joint and marginal distributions of the data. Therefore, variants of generation-style networks have been introduced, including Generative Adversarial Networks (GAN) and Variational Auto-Encoders (VAE). In [31], a generator utilizes the observed features to generate 'complete' data and the discriminator attempts to determine which components are actually observed or imputed. With this adversarial training strategy, the generated missing features can approximate the real data distribution. Following this line, numerous GAN- and VAE-based approaches have been put forward to minimize the distances between the real values and the imputed matrices.

(Figure caption) Instead of pixel-to-pixel reconstruction, we impute features by minimizing the distribution distance with optimal transport. Moreover, the latent embedding representations are regularized with a clustering loss to ensure intra-cluster discrimination. The joint loss functions seamlessly negotiate the incomplete imputation and clustering tasks in a unified framework so that they improve each other's performance.

Although the aforementioned methods offer solutions for incomplete data clustering, several drawbacks of the existing mechanisms cannot be neglected: i) Existing incomplete clustering methods follow a two-step manner, where the imputation stage and the clustering stage are separated from each other. In other words, the imputed features are not designed for the clustering task, which may in turn heavily degrade the clustering performance. ii) When faced with high-dimensional data (e.g., images, text), both shallow and deep methods perform poorly because the observed information is insufficient and the imputation is inaccurate, which results in a sharp degradation of clustering performance. In this paper, we propose a novel deep incomplete clustering method, which we refer to as Deep Distribution-preserving Incomplete Clustering with Optimal Transport (DDIC-OT), that generalizes the well-known Deep Embedded Clustering (DEC) network to handle missing features. Different from the pixel-by-pixel reconstruction in traditional autoencoders, we propose to minimize the Wasserstein distance between the observed data and the reconstructed data with optimal transport. Moreover, an additional clustering layer is added at the embedded representation level, with a KL-divergence term measuring the clustering loss. By optimizing the novel network, the distribution of the original data is well preserved and, in return, the missing features can be more accurately imputed under the guidance of the latent clustering structure. Thus, the proposed DDIC-OT jointly exploits the imputation and embedded clustering procedures so that they negotiate with each other and reach a consensus that best serves the clustering task. Finally, the proposed DDIC-OT is showcased in extensive experiments on a wide variety of benchmarks with different missing ratios to evaluate its effectiveness.
As demonstrated, the proposed network enjoys superior clustering performance, outperforming existing state-of-the-art imputation methods by large margins. The contributions of our DDIC-OT are summarized as follows. 1) We theoretically analyze why existing incomplete clustering methods fail when facing high-dimensional data. To avoid the insufficient training caused by the sparseness of fully-observed data, a novel end-to-end deep clustering network is proposed to minimize the Wasserstein distance between the original and reconstructed distributions. 2) We regularize the latent distribution towards more discriminative separation to further enhance task performance. Under the guidance of the unified loss function, the network decodes informative latent representations that contribute to better recovery and clustering. To the best of our knowledge, this could be the first end-to-end deep incomplete clustering network. 3) Comprehensive experiments are conducted on six high-dimensional benchmark datasets with various incomplete ratios. As the experimental results show, the proposed network significantly outperforms state-of-the-art incomplete clustering methods by large margins.

Throughout this paper, we use boldface uppercase letters and lowercase letters to denote matrices and vectors, respectively. The (i, j)-th element of a matrix U is referred to as U_ij. Let X ∈ R^{n×d} be the data matrix with n samples and d dimensions. We then define a missing index matrix M ∈ {0, 1}^{n×d} as follows:

M_ij = 1 if X_ij is observed, and M_ij = 0 otherwise.   (1)

With the defined mask matrix M, the incomplete data matrix we observe can be defined as

X_obs = M ∘ X + (1 − M) ∘ NaN,   (2)

where ∘ denotes the Hadamard (element-wise) product and NaN means Not a Number throughout the paper.

Statistical imputation. Basic statistical methods try to exploit the information in the incomplete data by means of numerical properties. Most of them use statistical attributes to estimate the missing feature values rather than directly discarding the incomplete feature information. Incomplete entries are filled with constants (e.g., zero, mean or median) to obtain complete data so that they can be directly used in machine learning tasks. Additionally, KNN imputation has been considered as an alternative that estimates the missing features with the mean of the k nearest reliable neighbors [1]. The Bayesian framework differs from the previous methods in that it considers the joint and conditional distributions when dealing with incomplete features. These frameworks are generally expressed in terms of a maximum-likelihood method, which estimates missing values with the most probable numbers. The most popular method in the Bayesian framework is the Expectation-Maximization (EM) algorithm [2], [4].

Deep incomplete clustering. Although deep clustering has received much attention in recent years, none of the existing methods has yet considered clustering with incomplete features in an end-to-end manner. The typical methods follow a two-step strategy: imputation and clustering are performed separately. They propose to fill missing values through neural networks and then apply clustering algorithms to the estimated dataset. GAIN [31] first proposed to impute incomplete features with a GAN. Different from traditional GAN networks, the goal of the discriminator in GAIN is to accurately distinguish whether each entry is imputed or real, so as to force the samples produced by the generator to be close to the real data distribution.
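As a concrete illustration of the incompleteness model in Eqs. (1)-(2) and of the zero-/mean-filling baselines discussed above, the following is a minimal NumPy sketch. The function and variable names are ours, not from the paper, and uniformly random missingness is an assumption.

```python
import numpy as np

def make_incomplete(X, missing_ratio, seed=None):
    """Drop entries of X uniformly at random (sketch of Eqs. (1)-(2)).

    Returns the mask M (1 = observed, 0 = missing) and the observed
    matrix X_obs with NaN in the missing positions.
    """
    rng = np.random.default_rng(seed)
    M = (rng.random(X.shape) >= missing_ratio).astype(float)  # Eq. (1)
    X_obs = np.where(M == 1, X, np.nan)                        # Eq. (2)
    return M, X_obs

def mean_fill(X_obs):
    """Mean-filling (MF): replace NaNs with the per-feature mean."""
    col_mean = np.nanmean(X_obs, axis=0)
    return np.where(np.isnan(X_obs), col_mean, X_obs)

# usage sketch
# X = np.random.rand(100, 784)
# M, X_obs = make_incomplete(X, missing_ratio=0.3, seed=0)
# X_init = mean_fill(X_obs)          # MF initialization
# X_init = np.nan_to_num(X_obs)      # or zero-filling (ZF)
```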
Unfortunately, these models suffer from the same problems as common GANs: they are generally difficult to train since the optimization process is hardly stable. Apart from GAN-like networks, VAEAC [7] proposes a neural probabilistic model based on the variational autoencoder, which can estimate the missing features from the observed ones using stochastic gradient variational inference [10]. However, these VAE-based methods may lead to poor results when the variational posterior approximation is far from the actual posterior. In addition, a Markov chain Monte Carlo (MCMC) scheme that fits the conditional distribution of the missing data has been developed in [23].

Optimal transport and Sinkhorn divergence. Let α = Σ_{i=1}^{n} a_i δ_{x_i} and β = Σ_{i=1}^{n'} b_i δ_{y_i} be two discrete distributions formed by the given empirical data samples X, Y, with supports X ∈ R^{n×d}, Y ∈ R^{n'×d} and frequency vectors a, b. It immediately holds that ||a||_1 = 1, a ≥ 0, ||b||_1 = 1, b ≥ 0. The q-th Wasserstein distance between the two distributions α and β is defined as

W_q^q(α, β) = min_{F ∈ Π(a,b)} Σ_{ij} F_{ij} C_{ij},   (3)

where C ∈ R^{n×n'} denotes the cost matrix of pairwise squared distances between the support sets and Π(a,b) = {F ≥ 0 : F1 = a, F^T 1 = b} is the set of admissible transport plans. In our paper, we set q = 2. The Wasserstein distance defined in Eq. (3) is often combined with an entropy regularization,

W_{q,ε}(α, β) = min_{F ∈ Π(a,b)} Σ_{ij} F_{ij} C_{ij} − ε h(F),   (4)

where h(F) := − Σ_{ij} F_{ij} log F_{ij} denotes the entropy regularization. Eq. (4) can be efficiently optimized using the Sinkhorn algorithm [22]. Based on Eq. (4), a symmetric divergence can be defined as

S_ε(α, β) = W_{q,ε}(α, β) − (1/2) (W_{q,ε}(α, α) + W_{q,ε}(β, β)).   (5)

The Sinkhorn divergence in Eq. (5) offers a tractable alternative to exact Wasserstein distance calculations and is easily accelerated on GPUs. In our paper, we use the Sinkhorn divergence to measure the OT distance between two distributions.

Problem analysis. Although the aforementioned methods solve incomplete data clustering to some extent, most of them are evaluated on data of very small dimension, which makes them impractical in real scenarios. When faced with high-dimensional data (e.g., images, text), both the existing shallow and deep methods perform poorly due to the insufficient observed information and inaccurate imputation. We theoretically analyze this phenomenon with the following Theorem 1.

Theorem 1. Suppose the data are i.i.d. (independently and identically distributed); then a fully-observed high-dimensional data sample exists only with low probability in the presence of incompleteness.

Proof 1. Suppose the missing ratio is p (0 ≤ p ≤ 1) and the data matrix is X ∈ R^{n×d}. For each sample x_i, we obtain

P(x_i is fully observed) = Π_{j=1}^{d} P(X_{ij}) = (1 − p)^d,   (6)

where P(X_{ij}) denotes the probability that X_{ij} is observed. Taking p = 0.1 and d = 300 as an example, the sample x_i is fully observed only with probability (0.9)^300 ≈ 10^{−14}. With increasing dimension d, the probability becomes smaller and approaches 0. This completes the proof.

Theorem 1 illustrates that very few samples are fully complete when the dimension is relatively high. Therefore, the traditional statistical and deep generative methods, e.g., KNN-filling and GAN-style solutions, fail to impute proper values for lack of sufficient information. In [19], the Wasserstein distance was first applied to impute missing features under the assumption that the discrepancy between the missing-data distribution and the complete-data distribution should be minimized. In low-dimensional incomplete settings (fewer than 50 dimensions), the reported experimental results are promising and shown to be more stable as the incomplete ratio changes.
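The entropy-regularized OT cost of Eq. (4) and the Sinkhorn divergence of Eq. (5) can be computed with a few matrix operations. Below is a minimal PyTorch sketch assuming uniform weights and a squared-Euclidean cost (q = 2); it is illustrative only. For the small regularization value used later in the paper (0.01), a stabilized log-domain solver (e.g., from the POT or GeomLoss libraries) would be preferable to avoid numerical underflow.

```python
import torch

def entropic_ot(x, y, eps=0.1, n_iter=200):
    """Entropy-regularized OT cost <F, C> between two empirical
    distributions with uniform weights (sketch of Eq. (4), q = 2).
    For very small eps a log-domain solver is needed in practice."""
    n, m = x.shape[0], y.shape[0]
    a = torch.full((n,), 1.0 / n, dtype=x.dtype, device=x.device)
    b = torch.full((m,), 1.0 / m, dtype=x.dtype, device=x.device)
    C = torch.cdist(x, y) ** 2          # pairwise squared Euclidean costs
    K = torch.exp(-C / eps)             # Gibbs kernel
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iter):             # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    F = u[:, None] * K * v[None, :]     # transport plan
    return (F * C).sum()

def sinkhorn_divergence(x, y, eps=0.1, n_iter=200):
    """Debiased Sinkhorn divergence, sketch of Eq. (5)."""
    return (entropic_ot(x, y, eps, n_iter)
            - 0.5 * (entropic_ot(x, x, eps, n_iter)
                     + entropic_ot(y, y, eps, n_iter)))
```

Because every operation is differentiable, this quantity can serve directly as a loss between a batch of (initialized) inputs and their reconstructions, which is how the distribution-preserving loss is used in the following section.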
However, when confronted with much higher-dimensional data types (e.g., images, videos and text), very few fully-complete samples are available, making the empirical estimation of the target distribution (the complete-data distribution) difficult and inaccurate. Therefore these methods perform poorly in downstream clustering tasks (see the results in Table 1). Different from existing assumptions, we propose to jointly solve the two processes, reconstruction and clustering, in a unified framework. The natural way to reconstruct is to apply autoencoder models, where the observed values can be regarded as 'supervised' signals for the reconstruction. However, with so little observed information, it is not reasonable to reconstruct only the missing counterparts, since such reconstruction may destroy the geometric distribution of the data and heavily affect the clustering performance. In this paper, we therefore recover the latent distribution instead of pursuing a pixel-level approximation. Specifically, we adopt a latent variable model defined in an encoder-decoder manner, where we first encode the original data x_i into a latent code z_i in the latent space Z, and z_i is then decoded into the reconstruction x̂_i. This process can be expressed as

z_i ~ p(z | x_i),   x̂_i ~ p(x̂ | z_i),   (7)

where p(z | x) and p(x̂ | z) are parameterized by the encoder network f_e and the decoder network f_d. The distribution-preserving loss can then be measured with Eq. (5) with respect to p_X and p_X̂:

L_s = S_ε(p_X, p_X̂).   (8)

In this section, we leverage the one-stage deep incomplete clustering introduced above as a basis to describe the proposed learning algorithm; the overall flowchart is illustrated in Figure 2. The proposed clustering model consists of three parts: an encoder, a decoder, and a soft clustering layer. Specifically, the method relies on a linear combination of two objective functions, the optimal transport distance and the clustering loss. The joint optimization objective can be written as

L = L_s + γ L_c,   (9)

where L_s is the Sinkhorn divergence shown in Eq. (5) and L_c is the clustering loss. γ is a hyper-parameter used to balance the two costs. Consider a dataset X with n samples, each x_i ∈ R^d where d is the dimension, and assume the number of clusters k is known. For each input x_i we denote the nonlinear mapping into the latent space by z_i = f_e(x_i). The clustering loss is defined as the KL divergence between the distributions P and Q proposed in [28], where P is the soft assignment computed from the latent codes z:

p_{ij} = (1 + ||z_i − μ_j||^2)^{−1} / Σ_{j'} (1 + ||z_i − μ_{j'}||^2)^{−1},   (10)

where μ_j denotes the j-th cluster center in the latent space; the cluster assignment is then obtained as s_i = arg max_j p_{ij}. Q is the target distribution derived from P:

q_{ij} = (p_{ij}^2 / f_j) / Σ_{j'} (p_{ij'}^2 / f_{j'}),   with f_j = Σ_i p_{ij}.   (11)

Therefore, the clustering loss is defined as

L_c = KL(Q ∥ P) = Σ_i Σ_j q_{ij} log (q_{ij} / p_{ij}).   (12)

We summarize the merits of the proposed framework as follows: i) it handles incomplete clustering in high-dimensional space more naturally, since the L_s loss reconstructs samples while preserving their geometric characteristics; ii) it is more flexible, since it does not require a prior distribution on X or Z; instead of an explicit distribution formulation, our encoder and decoder networks implicitly estimate the latent distribution; iii) it regularizes the latent distribution towards more discriminative separation to further enhance task performance. As the empirical results show, the guidance of the joint loss function updates the network and leads to improved clustering performance.
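To make the soft clustering layer and the joint objective of Eqs. (9)-(12) concrete, the sketch below implements them in PyTorch. The MLP widths follow the d-500-500-1000-10 encoder described in the next section; everything else (activations, initialization, class and function names) is our assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DDICOTSketch(nn.Module):
    """Hypothetical encoder-decoder with a soft clustering layer."""
    def __init__(self, d, n_clusters, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 1000), nn.ReLU(),
            nn.Linear(1000, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, d))
        # cluster centers mu_j in the latent space (Eq. (10))
        self.centers = nn.Parameter(torch.randn(n_clusters, latent_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

    def soft_assign(self, z):
        """Student's t soft assignment p_ij of Eq. (10)."""
        dist2 = torch.cdist(z, self.centers) ** 2
        p = 1.0 / (1.0 + dist2)
        return p / p.sum(dim=1, keepdim=True)

def target_distribution(p):
    """Target distribution q_ij of Eq. (11)."""
    w = p ** 2 / p.sum(dim=0, keepdim=True)   # p_ij^2 / f_j
    return w / w.sum(dim=1, keepdim=True)

def clustering_loss(p, q):
    """KL(Q || P) of Eq. (12), averaged over the batch."""
    eps = 1e-12
    return (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1).mean()
```

A fine-tuning step would then form Eq. (9) as something like `loss = sinkhorn_divergence(x_batch, x_hat) + gamma * clustering_loss(p, q)`, where `q = target_distribution(p).detach()` reflects the common DEC-style practice of holding the target distribution fixed within an update.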
The training of the model consists of two phases: a pre-training phase, in which the network is optimized with the reconstruction loss only, and a fine-tuning phase, in which both the optimal transport distance and the clustering loss are optimized. Our encoder is a fully-connected multi-layer perceptron with dimensions d-500-500-1000/2000-10, and the decoder is a mirrored version of the encoder; the details of the network architecture are provided in the supplementary material. The Adam optimizer with initial learning rate η = 0.001 is used for all datasets, and the batch size is set to 256. After pre-training, we provide two options for the parameter λ: 150 and 100. For the Sinkhorn divergence, we set the entropy regularization parameter to 0.01. In addition, training is stopped if the percentage of label assignments that change between two consecutive updates of the target distribution is less than the threshold δ, or if the number of iterations reaches MaxIter; the convergence threshold δ and MaxIter are set to 0.1 and 200, respectively. The full training procedure is summarized in Algorithm 1. Its fine-tuning loop proceeds, for each mini-batch, by computing P_i and Q_i using Eq. (10) and Eq. (11), computing the clustering assignments for {x_i}_{i=1}^{m}, computing the overall loss L by Eq. (9), and back-propagating to update the model weights; afterwards, Z = f_e(X), P, Q and the clustering assignment S are recomputed, and training stops once sum(S_{iter+1} ≠ S_{iter})/n < δ.

Following the existing incomplete clustering task setting, we use seven incomplete ratios, {0.1, 0.2, 0.3, ..., 0.7}, for each dataset in our experiments. The incomplete ratio denotes the percentage of missing features over all samples.

Evaluation metrics. In our experiments, we use three standard clustering performance metrics: (1) Accuracy (ACC), computed by assigning each cluster the dominating class label and taking the average correct classification rate as the final score; (2) Normalized Mutual Information (NMI), which quantifies the normalized mutual dependence between the predicted labels and the ground truth; and (3) Purity, which measures the proportion of correctly clustered samples over the total number of samples. All of these metrics range from 0 to 1, and higher values indicate better performance. In particular, we report the mean values and standard deviations of 10 independent runs to avoid the randomness brought by the different initializations of k-means. (Dataset sources: https://www.cs.columbia.edu/CAVE/software/softlib/ and https://www.nist.gov/itl/products-and-services/emnist-dataset.) Among the compared methods, [3] proposes factor group-sparse regularizers to accomplish the low-rank matrix completion task.

Table 1 shows the aggregated clustering comparison of the above algorithms on the benchmark datasets. The best results are highlighted in boldface, and '-' indicates an out-of-GPU-memory failure. Based on the results, we have the following observations: • Our proposed method outperforms all the SOTA imputation competitors in clustering performance by large margins. For example, our algorithm surpasses the second best by 50.4%, 17%, 12%, 26%, 10% and 30% in terms of ACC on the benchmark datasets. In particular, the margins on the four datasets Mnist, Usps, Reuters and Letter are very impressive. These results clearly verify the effectiveness of the proposed network. • Compared with the generative-style methods, the proposed DDIC-OT consistently improves the clustering performance and achieves better results across the benchmark datasets.
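For completeness, here is a self-contained sketch of the three evaluation metrics described above; the helper names are ours. ACC uses the Hungarian algorithm to find the best one-to-one mapping between clusters and classes, NMI is taken from scikit-learn, and purity counts the dominant class within each cluster. Integer-coded labels starting at 0 are assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best cluster-to-class mapping via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                          # contingency counts
    rows, cols = linear_sum_assignment(-w)    # maximize matched counts
    return w[rows, cols].sum() / y_true.size

def purity(y_true, y_pred):
    """Purity: fraction of samples in the dominant class of their cluster."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correct = sum(np.bincount(y_true[y_pred == c]).max()
                  for c in np.unique(y_pred))
    return correct / y_true.size

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```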
GAIN, VAEAC and MIWAE are the chosen representative methods. As can be seen, they concentrate on the generation or imputation task while ignoring the impact of the downstream clustering procedure. Our joint optimization framework further contributes to improving performance: the embedding layer serves the clustering task and makes the distributions more discriminative.

In this section, we further analyze the clustering performance with respect to various missing ratios and the evolution of the learned representation. In order to show the comparison between different methods more clearly, we plot the ACC and NMI of the compared methods under different missing ratios as line graphs in Figures 3 and 4. From the figures, we make the following observations: (1) As the incomplete ratio increases, all the methods suffer a degradation of clustering performance due to more unavailable information. Especially for the generative-based methods (VAEAC and MDIOT), performance drops sharply due to inaccurate imputations. (2) The ACC of our proposed method is higher than that of all the competing algorithms for every incomplete ratio. Moreover, our method maintains stable performance as the incomplete ratio increases. These results clearly demonstrate the effectiveness of DDIC-OT. (3) We also show the relative NMI performance of the compared methods in Figure 4; the results are consistent with the ACC observations.

We first investigate how the clustering loss and the distribution-preserving loss affect the clustering performance on Mnist/Usps/Reuters; the results are shown in Table 3. In this experiment, we uniformly adopt the datasets with a 10% missing ratio. It appears that L_c contributes more than L_s on Mnist/Usps for clustering, and the opposite holds on Reuters. We also conclude that combining the two losses further improves performance. The initialization of imputed values has been shown to be an essential part of incomplete clustering. We tested its effect on the performance of DDIC-OT on Mnist/Usps/Reuters using two commonly-used initializations: zero-filling (ZF) and mean-filling (MF). Table 5 shows that DDIC-OT works stably, without clear variation in overall performance.

In this paper, we propose a novel incomplete clustering method termed DDIC-OT, which jointly performs clustering and missing data imputation in a unified framework. Extensive experiments demonstrate the effectiveness of optimal transport for clustering tasks. In the future, we will consider constructing more advanced networks to further improve incomplete clustering performance.

In this section, we supplement our work from two aspects, namely the specific experimental settings and additional experimental results. Table 6 summarizes the dataset-specific optimal values of the hyperparameter γ, which trades off the optimal transport loss against the clustering loss, as well as the two different fully-connected layer widths chosen according to the dimensionality of each dataset; in particular, we only change the last layer of the encoder and the corresponding decoding layer. The initialization of imputed values has been shown to be an essential part of incomplete clustering. We further tested its effect on the performance of DDIC-OT on Fmnist/COIL20/Letter.
Experimental results show that mean-filling is not better than zero-filling on all datasets; for example, on Fmnist with a 10% missing ratio, zero-filling performs noticeably better. Table 5 shows that, under different initializations, DDIC-OT runs stably without significant changes in overall performance. To provide more comprehensive experimental results on the performance of our algorithm under different missing ratios, we list the ACC, NMI and PUR of each dataset with missing ratios ranging from 10% to 70% in Tables 7, 8, 9, 10, 11 and 12. It is clear that our algorithm achieves an apparent lead for almost all missing proportions on the six benchmarks. In order to show the comparison between different methods more clearly, we plot the Purity of the compared methods under different missing ratios as line graphs in Figure 5. We observe that our algorithm has advantages on all datasets except Reuters. Moreover, the performance of our algorithm on Reuters has obvious advantages on the two indicators ACC and NMI. By analyzing the cases where ACC and NMI are much smaller than PUR, we can clearly see that the compared algorithms assign most samples to one cluster instead of achieving genuinely fine clustering results. The large number of experimental results above demonstrates the effectiveness of the proposed algorithm. In order to show more clearly how our model converges to its target, the evolution of the clustering performance with respect to training epochs is shown in Figure 6. Specifically, we show the changes of ACC, NMI and PUR over training epochs for the three datasets with a 10% missing ratio. It is evident that our algorithm converges quickly and exhibits a satisfactory evolution. Figure 7 illustrates the complete data and the imputation results of the competing methods with a 30% missing ratio on Mnist. Rows (a) and (b) contain the complete images and the initialized (mean-imputed) images at a 30% missing ratio, respectively. The compared imputation methods take the observed images in (b) as input, and the complete images are shown for reference. Rows (c) and (d) show the imputed results of GAIN and MIDOT, respectively. Compared with GAIN and MIDOT, row (e) shows that our method is visually impressive. More importantly, different from the pixel-level reconstruction used in GAIN and MIDOT, our method better preserves the structure of the intended image, which benefits the clustering task. Specifically, the embedding layer learns the structural information and removes noise, and this is why our method obtains leading clustering results in the face of missing data. The experimental results in Table 7 also support this statement.
[1] yaImpute: An R package for kNN imputation.
[2] Maximum likelihood from incomplete data via the EM algorithm.
[3] Factor group-sparse regularization for efficient low-rank matrix recovery.
[4] Supervised learning from incomplete data via an EM approach.
[5] Face masks for the public during the COVID-19 crisis.
[6] A database for handwritten text recognition research.
[7] Variational autoencoder with arbitrary conditioning.
[8] Low-rank matrix completion using alternating minimization.
[9] Partition level multiview subspace clustering.
[10] Stochastic gradient VB and the variational auto-encoder.
[11] ImageNet classification with deep convolutional neural networks.
[12] Gradient-based learning applied to document recognition.
[13] Partial multi-view clustering.
[14] Efficient and effective regularized incomplete multi-view clustering.
[15] Late fusion incomplete multi-view clustering.
[16] Multiple kernel k-means with incomplete kernels.
[17] Depth enhancement via low-rank matrix completion.
[18] MIWAE: Deep generative modelling and imputation of incomplete data sets.
[19] Missing data imputation using optimal transport.
[20] Low-rank matrix recovery via efficient Schatten p-norm minimization.
[21] Structured autoencoders for subspace clustering.
[22] Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning.
[23] MCFlow: Monte Carlo flow models for data imputation.
[24] A face annotation framework with partial clustering and interactive labeling.
[25] Partial multi-view clustering via consistent GAN.
[26] Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm.
[27] Deep comprehensive correlation mining for image clustering.
[28] Unsupervised deep embedding for clustering analysis.
[29] Sparse subspace clustering with missing entries.
[30] Adaptive sample-level graph combination for partial multiview clustering.
[31] GAIN: Missing data imputation using generative adversarial nets.
[32] Deep partial multi-view learning.
[33] Low-rank tensor constrained multiview subspace clustering.