title: Deep Low-Density Separation for Semi-supervised Classification
authors: Burkhart, Michael C.; Shan, Kyle
date: 2020-05-22
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50420-5_22

abstract: Given a small set of labeled data and a large set of unlabeled data, semi-supervised learning (ssl) attempts to leverage the location of the unlabeled datapoints in order to create a better classifier than could be obtained from supervised methods applied to the labeled training set alone. Effective ssl imposes structural assumptions on the data, e.g. that neighbors are more likely to share a classification or that the decision boundary lies in an area of low density. For complex and high-dimensional data, neural networks can learn feature embeddings to which traditional ssl methods can then be applied in what we call hybrid methods. Previously-developed hybrid methods iterate between refining a latent representation and performing graph-based ssl on this representation. In this paper, we introduce a novel hybrid method that instead applies low-density separation to the embedded features. We describe it in detail and discuss why low-density separation may be better suited for ssl on neural network-based embeddings than graph-based algorithms. We validate our method using in-house customer survey data and compare it to other state-of-the-art learning methods. Our approach effectively classifies thousands of unlabeled users from a relatively small number of hand-classified examples.

In this section, we describe the problem of semi-supervised learning (ssl) from a mathematical perspective. We then outline some of the current approaches to solving this problem, emphasizing those most relevant to our work.

Consider a small labeled training set D0 = {(x_1, y_1), (x_2, y_2), . . . , (x_ℓ, y_ℓ)} of vector-valued features x_i ∈ ℝ^d and discrete-valued labels y_i ∈ {1, . . . , c}, for 1 ≤ i ≤ ℓ. Suppose we have a large set D1 = {x_{ℓ+1}, x_{ℓ+2}, . . . , x_{ℓ+u}} of unlabeled features to which we would like to assign labels. One could perform supervised learning on the labeled dataset D0 to obtain a general classifier and then apply this classifier to D1. However, this approach ignores any information about the distribution of the feature-points contained in D1. In contrast, ssl attempts to leverage this additional information in order to either inductively train a generalized classifier on the feature space or transductively assign labels only to the feature-points in D1. Effective ssl methods impose additional assumptions about the structure of the feature-data (i.e., {x : (x, y) ∈ D0} ∪ D1); for example, that features sharing the same label are clustered, that the decision boundary separating differently labeled features is smooth, or that the features lie on a lower-dimensional manifold within ℝ^d. In practice, semi-supervised methods that leverage data from D1 can achieve much better performance than supervised methods that use D0 alone. See Fig. 1 for a visualization.

We describe both graph-based and low-density separation methods along with neural network-based approaches, as these are most closely related to our work. For a full survey, see [7, 37, 50].
Fig. 1. A schematic for semi-supervised classification. The grey line corresponds to a decision boundary obtained from a generic supervised classifier (incorporating information only from the labeled blue and orange points); the red line corresponds to a boundary from a generic semi-supervised method seeking a low-density decision boundary. (Color figure online)

Fig. 2. A schematic for tsvm segmentation. The grey lines correspond to maximum margin separation for labeled data using a standard svm; the red lines correspond to additionally penalizing unlabeled points that lie in the margin. In this example, the data is perfectly separable in two dimensions, but this need not always be true. (Color figure online)

Graph-based methods calculate the pairwise similarities between labeled and unlabeled feature-points and allow labeled feature-points to pass labels to their unlabeled neighbors. For example, label propagation [51] forms an (ℓ + u) × (ℓ + u) dimensional transition matrix T, with transition probabilities proportional to similarities (kernelized distances) between feature-points, and an (ℓ + u) × c dimensional matrix Y of class probabilities, and (after potentially smoothing this matrix) iteratively sets Y ← T Y, row-normalizes the probability vectors, and resets the rows of probability vectors corresponding to the already-labeled elements of D0. Label spreading [48] follows a similar approach but normalizes its weight matrix and allows for a (typically hand-tuned) clamping parameter that assigns a level of uncertainty to the labels in D0. There are many variations on the graph-based approach, including those that use graph min-cuts [4] and Markov random walks [40].

Low-density separation methods attempt to find a decision boundary that best separates one class of labeled data from the other. The quintessential example is the transductive support vector machine (tsvm: [1, 6, 8, 15, 21, 30]), a semi-supervised maximum-margin classifier of which there have been numerous variations. As compared to the standard svm (cf., e.g., [2, 32]), the tsvm additionally penalizes unlabeled points that lie close to the decision boundary. In particular, for a binary classification problem with labels y_i ∈ {−1, 1}, it seeks parameters w, b that minimize the non-convex objective function

    J(w, b) = (1/2)‖w‖² + C Σ_{i=1}^{ℓ} H(y_i(w·x_i + b)) + C* Σ_{i=ℓ+1}^{ℓ+u} H(|w·x_i + b|),    (1)

where H(t) = max(0, 1 − t) is the hinge loss function. The hyperparameters C and C* control the relative influence of the labeled and unlabeled data, respectively. Note that the third term, corresponding to a loss function for the unlabeled data, is non-convex, providing a challenge to optimization. See Fig. 2 for a visualization of how the tsvm is intended to work and Ding et al. [13] for a survey of semi-supervised svm's. Other methods for low-density separation include the more general entropy minimization approach [17], along with information regularization [39] and a Gaussian process-based approach [27].

Both the graph-based and low-density separation approaches to ssl rely on the geometry of the feature-space providing a reasonable approximation to the true underlying characteristics of the users or objects of interest. As datasets become increasingly complex and high-dimensional, Euclidean distance between feature vectors may not prove to be the best proxy for user or item similarity. As the Gaussian kernel is a monotonic function of Euclidean distance, kernelized methods such as label propagation and label spreading also suffer from this criticism. While kernel learning approaches pose one potential solution [9, 49], neural network-based embeddings have become increasingly popular in recent years.
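For concreteness, the objective in (1) can be written down directly. The following minimal NumPy sketch evaluates J(w, b) for a candidate linear classifier; the function and variable names are illustrative only, and the sketch shows the loss itself rather than the cccp-based optimization discussed later.

import numpy as np

def hinge(t):
    # hinge loss H(t) = max(0, 1 - t)
    return np.maximum(0.0, 1.0 - t)

def tsvm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, C_star=0.1):
    # J(w, b) from (1): margin term, labeled hinge loss, and hinge loss on the
    # absolute margin of the unlabeled points (the non-convex third term)
    labeled_margins = y_lab * (X_lab @ w + b)      # y_i (w . x_i + b)
    unlabeled_margins = np.abs(X_unlab @ w + b)    # |w . x_i + b|
    return (0.5 * (w @ w)
            + C * hinge(labeled_margins).sum()
            + C_star * hinge(unlabeled_margins).sum())

# toy usage with random data and labels in {-1, +1}
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(6, 2))
y_lab = np.array([1, 1, 1, -1, -1, -1])
X_unlab = rng.normal(size=(20, 2))
print(tsvm_objective(np.array([1.0, -1.0]), 0.0, X_lab, y_lab, X_unlab))

Minimizing the first two terms alone recovers a standard soft-margin svm; the third term is what pushes the boundary away from the unlabeled points.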
Variational autoencoders (vae's: [24]) and generative adversarial nets (gan's: [12, 29]) have both been successfully used for ssl. However, optimizing the parameters for these types of networks can require expert hand-tuning and/or prohibitive computational expense [35, 53]. Additionally, most research in the area concentrates on computer vision problems, and it is not clear how readily the architectures and techniques developed for image classification translate to other domains of interest.

Recently, Iscen et al. introduced a neural embedding-based method to generate features on which to perform label propagation [19]. They train a neural network-based classifier on the supervised dataset and then embed all feature-points into an intermediate representation space. They then iterate between performing label propagation in this feature space and continuing to train their neural network classifier using weighted predictions from label propagation (see also [52]). As these procedures are similar in spirit to ours, we outline our method in the next section and provide more details as part of a comparison in Subsect. 2.4.

In this section, we provide a general overview of our algorithm for deep low-density separation and then delve into some of the details. We characterize our general process as follows:

1. We first learn a neural network embedding f : ℝ^d → ℝ^m for our feature-data, optimized to differentiate between class labels. We define a network g : ℝ^m → ℙ_c (initialized as the initial layers from an autoencoder for the feature-data), where ℙ_c is the space of c-dimensional probability vectors, and optimize g ∘ f on our labeled dataset D0, where we one-hot encode the categories corresponding to each y_i.
2. We map all of the feature-points through this deep embedding and then implement one-vs.-rest tsvm's for each class on this embedded data to learn class-propensities for each unlabeled data point. We augment our training data with the x_i from D1 paired with the propensities returned by this method and continue to train g ∘ f on this new dataset for a few epochs.
3. Our neural network f now provides an even better embedding for differentiating between classes. We repeat step 2 for a few iterations in order for the better embedding to improve tsvm separation, which upon further training yields an even better embedding, and so forth.

Algorithm 1. Deep Low-Density Separation.
  Input: labeled data D0, unlabeled data D1, number of iterations T;
  Train g_ψ ∘ f_θ on D0 to obtain initial parameters θ0, ψ0;
  for t = 1, . . . , T do
    Form D̃0 = {(f_{θ_{t−1}}(x), y) : (x, y) ∈ D0} and D̃1 = {f_{θ_{t−1}}(x) : x ∈ D1};
    Perform one-vs.-rest tsvm training on D̃0 and D̃1 to obtain predicted probabilities p̂_i, i = ℓ + 1, . . . , ℓ + u, that the x_i in D1 lie in each class, and then set D̂1 = {(x_i, p̂_i)};
    Obtain θt, ψt by continuing to optimize g_ψ ∘ f_θ using D0 ∪ D̂1;
  end
  return g_{ψ_T}(f_{θ_T}(x_i)), or an exponential moving average of the probabilistic predictions g_{ψ_t}(f_{θ_t}(x_i)), for x_i ∈ D1.

This is our basic methodology, summarized as pseudo-code in Algorithm 1 and visually in Fig. 3. Upon completion, it returns a neural network g ∘ f that maps feature-values to class/label propensities and that can easily be applied to D1 to solve our problem of interest. In practice, we find that taking an exponentially decaying moving average of the returned probabilities as the algorithm progresses provides a slightly improved estimate. At each iteration of the algorithm, we reinitialize the labels for the unlabeled points and allow the semi-supervised tsvm to make inferences using the new embedding of the feature-data alone. In this way, it is possible to recover from mistakes in labeling that occurred in previous iterations of the algorithm.
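As a small illustration of the exponentially decaying moving average mentioned above, the snippet below averages the per-iteration probabilistic predictions; the update convention is one common choice and the names are illustrative (the text later reports a discount rate of ρ = 0.8 but does not spell out the exact recursion).

import numpy as np

def ema_of_predictions(prediction_history, rho=0.8):
    # prediction_history: list of (n_points, n_classes) arrays, one per iteration t;
    # each row of each array is a probability vector g_psi_t(f_theta_t(x_i)).
    ema = prediction_history[0].copy()
    for P_t in prediction_history[1:]:
        ema = rho * ema + (1.0 - rho) * P_t
    # a convex combination of probability vectors is again a probability vector,
    # so no renormalization is needed
    return ema

# toy usage: T = 3 iterations of predictions for 4 unlabeled points and c = 5 classes
rng = np.random.default_rng(1)
history = [rng.dirichlet(np.ones(5), size=4) for _ in range(3)]
print(ema_of_predictions(history).round(3))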
In our instantiation, the neural network f : ℝ^d → ℝ^m has two layers, the first of size 128 and the second of size 32, both with hyperbolic tangent activation. In between these two layers, we apply batch normalization [18] followed by dropout at a rate of 0.5 during model training to prevent overfitting [38]. The neural network g : ℝ^m → ℙ_c consists of a single layer with 5 units and softmax activation. We let θ (resp. ψ) denote the trainable parameters for f (resp. g) and sometimes use the notation f_θ and g_ψ to stress the dependence of the neural networks on these trainable parameters. Neural network parameters receive Glorot normal initialization [16]. The network weights for f and g receive Tikhonov regularization [43, 44], which decreases as one progresses through the network.

We form our underlying target distribution by one-hot encoding the labels y_i and slightly smoothing these labels. We define h : {1, . . . , c} → ℙ_c by smoothing the one-hot encoding of each label with a small smoothing parameter, so that nearly all probability mass remains on the true class, where we set ε = 10^{-3} to be our smoothing parameter.

Training proceeds as follows. We begin by training the neural network f_θ to minimize D_KL(h(y_i) ‖ g_ψ(f_θ(x_i))), the Kullback–Leibler (kl) divergence between the true distributions h(y_i) and our inferred distributions g_ψ(f_θ(x_i)), on D0 in batches. For parameter updates, we use the Adam optimizer [23], which maintains different learning rates for each parameter like AdaGrad [14] and allows these rates to sometimes increase like Adadelta [47], but adapts them based on the first two moments from recent gradient updates. This optimization on labeled data produces parameters θ0 for f and ψ0 for g.

Fig. 3. A schematic for Deep Low-Density Separation. The first two layers of the neural network correspond to f, the last to g. The semi-supervised model corresponds to the tsvm segmentation. We optimize on the unlabeled dataset using mean square error (mse) and on the labeled dataset using cross-entropy (X-Entropy).

Upon initializing f and g, f_θ0 is a mapping that produces features well-suited to differentiating between classes. We form D̃0 = {(f_θ0(x), y) : (x, y) ∈ D0} and D̃1 = {f_θ0(x) : x ∈ D1} by passing the feature-data through this mapping. We then train c tsvm's, one for each class, on the labeled data D̃0 and unlabeled data D̃1. Our implementation follows Collobert et al.'s tsvm-cccp method [11] and is based on the R implementation in rssl [25]. The algorithm decomposes the tsvm loss function J(w, b) from (1) into the sum of a concave function and a convex function by creating two copies of the unlabeled data, one with positive labels and one with negative labels. Using the concave-convex procedure (cccp: [45, 46]), it then reduces the original optimization problem to an iterative procedure in which each step requires solving a convex optimization problem similar to that of the supervised svm. These convex problems are then solved using quadratic programming on the dual formulations (for details, see [5]). Collobert et al. argue that tsvm-cccp outperforms previous tsvm algorithms with respect to both speed and accuracy [11].

Upon training the tsvm's, we obtain a probability vector p̂_i ∈ ℙ_c for each i = ℓ + 1, . . . , ℓ + u, with elements corresponding to the likelihood that x_i lies in a given class. We then form D̂1 = {(x_i, p̂_i)} and obtain a supervised training set for further refining g ∘ f. We set the learning rate for our Adam optimizer to 1/10th of its initial rate and minimize the mean square error between g(f(x_i)) and p̂_i for (x_i, p̂_i) ∈ D̂1 for 10 epochs (cf. the "consistency loss" of [42]), and then minimize the kl-divergence between h(y_i) and g(f(x_i)) on D0 for 10 epochs.
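Concretely, the architecture and the two refinement losses described above might look as follows. This is a hedged PyTorch sketch (the paper does not name a framework): layer sizes, activations, dropout rate, and ε follow the text, while the optimizer settings beyond those stated, the single uniform weight-decay value, the exact smoothing convention, and all names are our own illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, m, c = 100, 32, 5   # d is a placeholder feature dimension; m and c follow the text

# f: R^d -> R^m, two tanh layers (128 then 32), with batch norm and dropout in between
f = nn.Sequential(
    nn.Linear(d, 128), nn.Tanh(),
    nn.BatchNorm1d(128), nn.Dropout(0.5),
    nn.Linear(128, m), nn.Tanh(),
)
# g: R^m -> P_c, a single softmax layer with c = 5 units
g = nn.Sequential(nn.Linear(m, c), nn.Softmax(dim=1))

for layer in list(f) + list(g):
    if isinstance(layer, nn.Linear):
        nn.init.xavier_normal_(layer.weight)   # Glorot normal initialization

# a single weight-decay value as a stand-in for the per-layer Tikhonov penalties
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()),
                       lr=1e-3, weight_decay=1e-4)

def smoothed_targets(y, eps=1e-3):
    # h(y): one-hot encoding smoothed with parameter eps (one common convention;
    # the text states only that eps = 1e-3)
    t = torch.full((y.shape[0], c), eps / (c - 1))
    t.scatter_(1, y.unsqueeze(1), 1.0 - eps)
    return t

def supervised_step(x_lab, y_lab):
    # KL( h(y_i) || g(f(x_i)) ) on a labeled batch from D0
    p = g(f(x_lab)).clamp_min(1e-12)
    loss = F.kl_div(p.log(), smoothed_targets(y_lab), reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def refinement_step(x_unlab, p_hat):
    # MSE between g(f(x_i)) and the tsvm propensities p_hat_i from the augmented set
    # (run at 1/10 of the initial learning rate in the text)
    loss = F.mse_loss(g(f(x_unlab)), p_hat)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with random tensors (batches of 8 points)
x = torch.randn(8, d)
y = torch.randint(0, c, (8,))
print(supervised_step(x, y))
print(refinement_step(torch.randn(8, d), torch.full((8, c), 1.0 / c)))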
"consistency loss" from [42] ) and then minimize the kl-divergence between h(y i ) and g(f (x i )) for 10 epochs. This training starts with neural network parameters θ 0 and ψ 0 and produces parameters θ 1 and ψ 1 . Then, f θ1 is a mapping that produces features better suited to segmenting classes than those from f θ0 . We pass our feature-data through this mapping and continue the iterative process for T = 6 iterations. Our settings for learning rate, number of epochs, and T were hand-chosen for our data and would likely vary for different applications. As the algorithm progresses, we store the predictions g ψt (f θt (x i )) at each step t and form an exponential moving average (discount rate ρ = 0.8) over them to produce our final estimate for the probabilities of interest. We view our algorithm as most closely related to the work of Iscen et al. [19] and Zhuang et al. [52] . Both their work and ours iterate between refining a neural network-based latent representation and applying a classical ssl method to that representation to produce labels for further network training. While their work concentrates on graph-based label propagation, ours uses low-density separation, an approach that we believe may be more suitable for the task. The representational embedding we learn is optimized to discriminate between class labels, and for this reason we argue it makes more sense to refine decision boundaries than it does to pass labels. Additionally, previous work on neural network-based classification suggests that an svm loss function can improve classification accuracy [41] , and our data augmentation step effectively imposes such a loss function for further network training. By re-learning decision boundaries at each iterative step, we allow our algorithm to recover from mistakes it makes in early iterations. One failure mode of semi-supervised methods entails making a few false label assignments early in the iterative process and then having these mislabeled points pass these incorrect labels to their neighbors. For example, in pseudo-labelling [28] , the algorithm augments the underlying training set D 0 with pairs (x i ,ŷ i ) for x i ∈ D 1 and predicted labelsŷ i for which the model was most confident in the previous iteration. Similar error-reinforcement problems can occur with boosting [31] . It is easy to see how a few confident, but inaccurate, labels that occur in the first few steps of the algorithm can set the labeling process completely askew. By creating an embedding f : IR d → IR m and applying linear separation to embedded points, we have effectively learned a distance metric κ : IR d × IR d → IR ≥0 especially suited to our learning problem. The linear decision boundaries we produce in IR m correspond to nonlinear boundaries for our original features in IR d . Previously, Jean et al. [20] described using a deep neural network to embed features for Gaussian process regression, though they use a probabilistic framework for ssl and consider a completely different objective function. In this section, we discuss the practical problem of segmenting users from survey data and compare the performance of our algorithm to other recently-developed methods for ssl on real data. We also perform an ablation study to ensure each component of our process contributes to the overall effectiveness of the algorithm. At Adobe, we are interested in segmenting users based on their work habits, artistic motivations, and relationship with creative software. 
In this section, we discuss the practical problem of segmenting users from survey data and compare the performance of our algorithm to other recently-developed methods for ssl on real data. We also perform an ablation study to ensure that each component of our process contributes to the overall effectiveness of the algorithm.

At Adobe, we are interested in segmenting users based on their work habits, artistic motivations, and relationship with creative software. To gather data, we administered detailed surveys to a select group of users in the US, UK, Germany, and Japan (just over 22 thousand of our millions of users). We applied Latent Dirichlet Allocation (lda: [3, 33]), an unsupervised model for discovering latent topics, to one-hot encoded features generated from this survey data in order to classify each surveyed user as belonging to one of c = 5 different segments. We generated profile and usage features using an in-house feature generation pipeline (that could in the future readily be used to generate features for the whole population of users). In order to be able to evaluate model performance, we masked the lda labels from our surveyed users at random to form the labeled and unlabeled training sets D0 and D1.

We compare our algorithm against two popular classification algorithms. We focus our efforts on algorithms we might actually have used in practice rather than on more similar methods that have only recently appeared in the literature. The first, LightGBM [22], is a supervised method that attempts to improve upon other gradient boosted decision tree algorithms (e.g. the popular xgBoost [10]) using novel approaches to sampling and feature bundling. It is our team's preferred nonlinear classifier, due to its low requirements for hyperparameter tuning and its effectiveness on a wide variety of data types. As part of the experiment, we wanted to evaluate the conditions for semi-supervised learning to outperform supervised learning. The second, Mean Teacher [42], is a semi-supervised method that creates two supervised neural networks, a teacher network and a student network, and trains both networks using randomly perturbed data. Training enforces a consistency loss between the outputs (predicted probabilities in ℙ_c) of the two networks: optimization updates parameters for the student network, and an exponential moving average of these parameters becomes the parameters for the teacher network. The method builds upon Temporal Ensembling [26] and uses consistency loss [34, 36].
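A minimal sketch of the Mean Teacher update just described is given below; it is our own simplification for illustration (not the implementation used in the experiments), with a small fully connected student, Gaussian input perturbations, and an EMA coefficient chosen arbitrarily.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

d, c = 100, 5
student = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, c))
teacher = copy.deepcopy(student)        # the teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)             # the optimizer never updates the teacher

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def mean_teacher_step(x_lab, y_lab, x_unlab, alpha=0.99, lam=1.0, noise=0.1):
    # supervised loss on the labeled batch
    sup = F.cross_entropy(student(x_lab), y_lab)
    # consistency loss: student and teacher see independently perturbed copies
    p_student = F.softmax(student(x_unlab + noise * torch.randn_like(x_unlab)), dim=1)
    p_teacher = F.softmax(teacher(x_unlab + noise * torch.randn_like(x_unlab)), dim=1)
    loss = sup + lam * F.mse_loss(p_student, p_teacher)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # teacher parameters <- exponential moving average of student parameters
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
    return loss.item()

# toy usage
x_l, y_l = torch.randn(8, d), torch.randint(0, c, (8,))
print(mean_teacher_step(x_l, y_l, torch.randn(16, d)))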
We test our method with labeled training sets of successively increasing size ℓ ∈ {35, 50, 125, 250, 500, 1250, 2500}. Each training set is a strict superset of the smaller training sets, so with each larger set we strictly increase the amount of information available to the classifiers. To tune hyperparameters, we use a validation set of size 100, and for testing we use a test set of size 4780. The training, validation, and test sets are selected to all have equal class sizes. For our algorithm, we perform T = 6 iterations of refinement, and in the tsvm we set the cost parameters C = 0.1 and C* = (ℓ/u) C. To reduce training time, we subsample the unlabeled data in the test set by choosing 250 unlabeled points uniformly at random to include in the tsvm training. We test using our own implementations of the tsvm and Mean Teacher.

Table 1 reports our classification accuracy on five randomized shuffles of the training, validation, and test sets. These results are summarized in Fig. 4. The accuracies of our baseline methods are shown first, followed by those of three components of our model:

1. Initial NN: the output of the neural network after initial supervised training.
2. DeepSep-NN: the output of the neural network after iterative refinement with Algorithm 1.
3. DeepSep-Ensemble: the exponential moving average described in Algorithm 1.

We find that Deep Low-Density Separation outperforms or matches LightGBM in the range ℓ ≤ 1250. The relative increase in accuracy of Deep Separation is as much as 27%, and is most pronounced with a very small amount of training data (ℓ ≤ 50). Some of this increase can be attributed to the initial accuracy of the neural network; however, the iterative refinement of Deep Separation improves the accuracy of the initial network by up to 8.3% (relative). The addition of a temporal ensemble decreases variance in the final model, further increasing accuracy by an average of 0.54% across the range. Compared to Mean Teacher, the iterative refinement of Deep Separation achieves a larger increase in accuracy for ℓ ≤ 500.

To visualize how the iterative refinement process and exponentially weighted average improve the model, Fig. 5 shows the accuracy of our model at each iteration. We see that for each random shuffle, the refinement process leads to increased accuracy compared to the initial model. However, the accuracy of the neural network fluctuates by a few percent at a time. Applying the exponential moving average greatly reduces the impact of these fluctuations and yields more consistent improvement, with a net increase in accuracy on average.

Regarding training time, all numerical experiments were performed on a mid-2018 MacBook Pro (2.6 GHz Intel Core i7 processor; 16 GB 2400 MHz DDR4 memory). Deep Separation takes up to half an hour on the largest training set (ℓ = 2500). However, we note that for ℓ ≤ 500, the model takes at most three minutes, and this is the regime in which our method performs best in comparison to other methods. In contrast, LightGBM takes under a minute to run with all training set sizes.

In this paper, we introduce a novel hybrid semi-supervised learning method, Deep Low-Density Separation, that iteratively refines a latent feature representation and then applies low-density separation to this representation to augment the training set. We validate our method on a multi-segment classification dataset generated from surveying Adobe's user base. In the future, we hope to further investigate the interplay between learned feature embeddings and low-density separation methods, and to experiment with different approaches for both representational learning and low-density separation. While much of the recent work in deep ssl concerns computer vision problems, and image classification in particular, we believe these methods will find wider applicability within academia and industry, and we anticipate future advances in the subject.
References
[1] Semi-supervised support vector machines
[2] Pattern Recognition and Machine Learning
[3] Latent Dirichlet allocation
[4] Learning from labeled and unlabeled data using graph mincuts
[5] Convex Optimization
[6] Semi-supervised classification by low density separation
[7] Semi-Supervised Learning
[8] Optimization techniques for semi-supervised support vector machines
[9] Cluster kernels for semi-supervised learning
[10] XGBoost: a scalable tree boosting system
[11] Large scale transductive SVMs
[12] Good semi-supervised learning that requires a bad GAN
[13] An overview on semi-supervised support vector machine
[14] Adaptive subgradient methods for online learning and stochastic optimization
[15] Learning by transduction
[16] Understanding the difficulty of training deep feedforward neural networks
[17] Semi-supervised learning by entropy minimization
[18] Batch normalization: accelerating deep network training by reducing internal covariate shift
[19] Label propagation for deep semi-supervised learning
[20] Semi-supervised deep kernel learning: regression with unlabeled data by minimizing predictive variance
[21] Transductive inference for text classification using support vector machines
[22] LightGBM: a highly efficient gradient boosting decision tree
[23] Adam: a method for stochastic optimization
[24] Semi-supervised learning with deep generative models
[25] RSSL: semi-supervised learning in R
[26] Temporal ensembling for semi-supervised learning
[27] Semi-supervised learning via Gaussian processes
[28] Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks
[29] Triple generative adversarial nets
[30] Towards making unlabeled data never hurt
[31] SemiBoost: boosting for semi-supervised learning
[32] Machine Learning: A Probabilistic Perspective
[33] Inference of population structure using multilocus genotype data
[34] Semi-supervised learning with ladder networks
[35] Large-scale evolution of image classifiers
[36] Regularization with stochastic transformations and perturbations for deep semi-supervised learning
[37] Learning with labeled and unlabeled data
[38] Dropout: a simple way to prevent neural networks from overfitting
[39] Information regularization with partially labeled data
[40] Partially labeled classification with Markov random walks
[41] Deep learning using linear support vector machines
[42] Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results
[43] On the stability of inverse problems
[44] Solution of incorrectly formulated problems and the regularization method
[45] The concave-convex procedure
[46] The concave-convex procedure (CCCP)
[47] Adadelta: an adaptive learning rate method
[48] Learning with local and global consistency
[49] Nonparametric transforms of graph kernels for semi-supervised learning
[50] Semi-supervised learning literature survey
[51] Learning from labeled and unlabeled data with label propagation
[52] Local label propagation for large-scale semi-supervised learning
[53] Neural architecture search with reinforcement learning