key: cord-0043190-r5wpmz0w authors: Mendes, Andre; Togelius, Julian; dos Santos Coelho, Leandro title: Adversarial Autoencoder and Multi-Task Semi-Supervised Learning for Multi-stage Process date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_1 sha: b5aa972ee869e3e38514dfaac802458c90d4ea46 doc_id: 43190 cord_uid: r5wpmz0w

In selection processes, decisions follow a sequence of stages. Early stages have more applicants and general information, while later stages have fewer applicants but more specific data. This is represented by a dual funnel structure, in which the sample size decreases from one stage to the next while the information increases. Training classifiers for this case is challenging. In the early stages, the information may not contain distinct patterns to learn, causing underfitting. In later stages, applicants have been filtered out and the small sample can cause overfitting. We redesign the multi-stage problem to address both cases by combining adversarial autoencoders (AAE) and multi-task semi-supervised learning (MTSSL) to train an end-to-end neural network for all stages together. The AAE learns the representation of the data and performs data imputation for missing values. The generated dataset is fed to an MTSSL mechanism that trains all stages together, encouraging related tasks to contribute to each other using a temporal regularization structure. Using real-world data, we show that our approach outperforms other state-of-the-art methods, with a gain of 4x over the standard case and a 12% improvement over the second-best method.

In many applications, including network intrusion detection [15] and medical diagnosis [1], decision systems are composed of an ordered sequence of stages, which can be referred to as a multi-stage process. In selection processes (such as hiring or student intake), for instance, applicants submit general information in the initial stages, such as resumes. The evaluator screens through the resumes and selects applicants to move on to the next round. In each following stage, applicants are filtered out until the final pool is selected. In terms of information, the initial stages have general data about the applicants, and in each subsequent round more specific information is gathered. As the process continues, the cost to acquire and evaluate new data increases, which is an incentive to decrease the pool of applicants. Hence, in the final stages, the evaluator has a smaller pool with much more information about each applicant. Training classifiers for a multi-stage process can be challenging because of this dual funnel structure, as shown in Fig. 1(a). During the initial stages, the number of applicants decreases whereas the data about them increases. Therefore, the dataset grows in dimensionality but decreases in sample size. Classifiers trained in initial stages have sufficiently large samples to generalize, but the available features might not contain enough information to differentiate applicants, causing high bias and underfitting. In the final stages, there is more information about each applicant, but the sample size is reduced significantly, causing classifiers to suffer from high variance and overfitting. To address this problem, we redesign the multi-stage problem and present a framework with two components.
In the first part of our framework, we use an adversarial autoencoder (AAE) [19] that learns the data representation and is able to generate synthetic data for the imputation of missing values. This component makes it possible for our framework to fill in the data for an applicant in all the stages that they have not reached. Therefore, we can generate a complete dataset with data for all applicants in all stages. In the second part of our framework, we use multi-task learning to train a single classifier that can learn different tasks together. We introduce a temporal regularization structure so that related tasks can share information, and we adopt a semi-supervised approach to handle the newly generated samples from the AAE, which do not have labels. This results in the aDversarial autoEnCoder and multI-task SemI-superVised lEarning (DECISIVE) framework. The main contributions of this paper are:
1. We adapt an adversarial autoencoder to perform data imputation and generate a complete dataset to be shared across different stages.
2. We redesign the multi-stage problem and present a temporal multi-task semi-supervised framework that allows knowledge from all stages to be shared.
3. The effectiveness of the proposed model is demonstrated by extensive longitudinal experiments on real data from 3 different selection processes. Our model is able to outperform other single-task and multi-task frameworks, particularly in the later stages where the sample size is significantly reduced.

Our method builds on three subfields of machine learning, namely data imputation [3], multi-task learning [20], and multi-stage or decision cascades [15, 17].

Data Imputation with Autoencoders - Many methods for data imputation have been proposed, ranging from simple column averages to complex imputations based on statistical and machine learning models. Such methods can be categorized as discriminative [14] or generative [16]. Missing data is a special case of noisy input, and deep architectures such as denoising autoencoders (DAE) [16] have performed well due to their capability to automatically learn latent representations. The work presented in [5] uses an overcomplete DAE as a base model to create a multiple imputation framework, which simulates multiple predictions by initializing the model with random weights at each run. Our AAE component is based on the framework proposed in [19], which explores the generative adversarial network (GAN) [6] approach to create: a generator to accurately impute missing data; a discriminator to distinguish between observed and imputed components; and a hint mechanism to help generate samples according to the true underlying data distribution.

Multi-Task and Semi-Supervised Learning - The goal of multi-task learning (MTL) is to learn multiple related tasks simultaneously so that knowledge obtained from each task can be reused by the others. This can increase the effective sample size for each task, making MTL beneficial for tasks with small training samples. MTL has been applied in different approaches, including neural nets (NN) and kernel methods [1, 7]. More recent methods have explored the application of MTL to deep neural nets (DNN) [11]. The regularization parameters in MTL control how information is shared between tasks and prevent overfitting. The framework proposed in [7] enables information to be selectively shared across tasks by placing a structural constraint on the learned weights. Our framework builds on [21], in which temporal information is encoded using regularization terms.
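To make the hint-based imputation idea from [19] concrete, the following minimal sketch shows how the generator input can be assembled from a feature matrix with missing entries: a binary mask B marks observed positions, missing positions are filled with noise, and a hint matrix H reveals part of B to the discriminator. The shapes, noise range, and hint rate are illustrative assumptions, not values taken from the paper.

import numpy as np

def make_gain_inputs(X, hint_rate=0.9, rng=np.random.default_rng(0)):
    """Build generator/discriminator inputs for a GAIN-style imputation setup.

    X is an (m, n) array with np.nan marking missing entries.
    Returns the noise-filled matrix X_tilde, the mask B (1 = observed),
    and the hint matrix H (reveals B with probability hint_rate).
    """
    B = (~np.isnan(X)).astype(float)          # 1 where a value was observed
    noise = rng.uniform(0.0, 0.01, size=X.shape)
    X_tilde = np.where(B == 1, X, noise)      # fill missing slots with noise
    reveal = (rng.uniform(size=X.shape) < hint_rate).astype(float)
    H = B * reveal + 0.5 * (1.0 - reveal)     # 0.5 marks "no hint" positions
    return X_tilde, B, H

# toy example: 4 applicants, 3 features, some values missing
X = np.array([[1.0, np.nan, 0.3],
              [0.2, 0.7, np.nan],
              [np.nan, 0.1, 0.9],
              [0.5, 0.4, 0.8]])
X_tilde, B, H = make_gain_inputs(X)

In the full adversarial setup, X_tilde and B are fed to the generator, while the generator output and H are fed to the discriminator.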
MTL can also be combined with semi-supervised learning (SSL) to create classifiers coupled by a joint prior distribution over the parameters of all classifiers [4].

Multi-stage Classification - Multi-stage and cascade classifiers share many similarities; however, an important difference between cascades [17] and multi-stage systems lies in the system architecture. Detection cascades make partial decisions, delaying a positive decision until the final stage. In contrast, multi-stage classifiers, shown in Fig. 1(b), can deal with multi-class problems and can make classification decisions at any stage [15]. The approach proposed in [12] explores the connection between deep models and multi-stage classifiers such that classifiers can be jointly optimized and can cooperate across the stages. This structure is similar to ours, but in our case the algorithm only has access to the original information in a specific stage.

Let us define $s \in \{0, 1, \dots, S\}$ as a single stage in a multi-stage process. Every stage has a dataset with training examples $\{x_i^s, y_i^s\}_{i=0}^{m_s}$, where $m_s$ refers to the number of samples. The number of features is given by $n_s$, the features are given by $X^s \in \mathbb{R}^{m_s \times n_s}$ and the labels by $Y^s \in \mathbb{R}^{m_s \times 1}$. Let us also define $A \in \mathbb{R}^{m \times 1}$ as the vector of applicants, where $m$ represents the total number of applicants and $a \in A$ represents a single applicant. The feature vector for a single applicant $a$ in stage $s$ is given by $x_a^s \in \mathbb{R}^{1 \times n_s}$. A prediction matrix $\hat{P} \in \mathbb{R}^{m \times S}$ can be defined, where $p_a \in \mathbb{R}^{1 \times S}$ is the prediction vector for an applicant $a$ in all stages, and $p_a^s \in [-1, 1]$ represents the prediction for a single stage $s$.

Underfitting in Earlier Stages - In each stage $s$, only features $x^s$ up to that stage are available. Therefore, a prediction for an applicant with data up to $s$ is given by $p_a^s = f(x_a^s)$, where $f(\cdot)$ is a classification function. For example, for an applicant with information in $s = 0$, all predictions will be made using $x_a^0$. Since the features in early stages are more general and less discriminative, models trained in these stages have poor performance when predicting the applicant's future in the process. A method to address this problem could incorporate future features in the early stages. More specifically, a data imputation process can be used to fill in missing information and generate a complete dataset for all stages. In other words, $\hat{X} = g(X^s)$, where $X^s$ contains the input features in stage $s$ and $g(\cdot)$ is a data imputation function.

Overfitting in Later Stages - In each new stage, new data is received while the number of samples decreases, which means $X^{s+1} \neq X^s$, $m_{s+1} < m_s$ and $n_{s+1} > n_s$. As $m_s$ gets significantly smaller, both in absolute value and in comparison to $n_s$, classifiers trained on the stage-specific data tend to overfit. One possible way to address this problem is to use the generated complete dataset $\hat{X}$. Since any stage $X^s$ can be mapped to $\hat{X}$, more training samples can be used to train classifiers in later stages. However, $\hat{X}^{s+1} = g(X^s)$ only generates new samples but no labels, since the applicants in $s$ were not evaluated in $s+1$. Hence, $Y^{s+1} \neq Y^s$ and $m_{s+1} < m_s$ for the label matrix. In this case, semi-supervised learning can be applied so that labeled and unlabelled data are combined to create a better classifier than would be obtained with the labeled data alone. Finally, with $S$ stages, the traditional way would be to construct $S$ classifiers to predict the outcomes in each stage (a minimal sketch of this baseline is shown below).
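For reference, the following sketch illustrates this traditional per-stage baseline under simplifying assumptions: synthetic stage datasets and a generic scikit-learn classifier stand in for the real data and models.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_stage_baselines(stage_data):
    """Traditional setup: one classifier per stage, trained in isolation.

    stage_data is a list of (X_s, y_s) tuples, one per stage; the sample
    size shrinks and the feature count grows from stage to stage.
    """
    models = []
    for X_s, y_s in stage_data:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_s, y_s)      # later stages: few samples, many features -> overfitting risk
        models.append(clf)
    return models

# toy dual funnel: 3 stages with shrinking samples and growing features
rng = np.random.default_rng(0)
stage_data = [(rng.normal(size=(m, n)), rng.integers(0, 2, size=m))
              for m, n in [(1000, 5), (200, 20), (30, 60)]]
models = train_per_stage_baselines(stage_data)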
However, we believe that the tasks in each stage are related and could be combined to share knowledge during training. This problem can be addressed by MTL, which seeks to improve the generalization performance of multiple related tasks by learning them jointly.

Our method is based on two components: data imputation using adversarial autoencoders (AAE) and multi-task semi-supervised learning (MTSSL). Here we explain each of these components as well as their combination to form our approach.

In each stage, we have a combination of numerical and categorical features (encoded using one-hot encoding). For data imputation, our method is based on [19] and is shown in Fig. 2. We use an adversarial approach in which the generator $G$ receives as input the features $X \in \mathbb{R}^{m \times n}$, a binary mask $B$ indicating the positions of the missing data in $X$, and a source of noise. We create the dataset $\tilde{X}$ by filling the missing positions in $X$ with the random noise. The goal of $G$ is to generate the data $\hat{X}$ with values as close as possible to the original values $X$. We also have a discriminator $D$ that receives $\hat{X}$ and tries to guess whether each variable value is original or imputed. Additionally, a hint mechanism is added by using a random variable $H$ that depends on $B$. For each (imputed) sample $(\hat{x}, b)$, we draw $h$ according to the distribution $H \mid B = b$, and $h$ is passed as an additional input to $D$. The hint $H$ provides information about $B$, so that we can reduce the number of distributions that are optimal with respect to $D$ and that $G$ could reproduce. Therefore, the discriminator tries to predict a binary mask $\hat{B} = D(\hat{X}, H)$ that is as close as possible to $B$. We define the cross-entropy loss function as
$\mathcal{L}(B, \hat{B}) = \sum_i \big[ B_i \log \hat{B}_i + (1 - B_i) \log (1 - \hat{B}_i) \big]$.
In the adversarial approach, $D$ maximizes the probability of correctly predicting $B$ while $G$ minimizes the probability of $D$ predicting $B$, which is given by
$\min_G \max_D \; \mathbb{E}\big[\mathcal{L}(B, \hat{B})\big]$,
where $G$ influences the loss by generating $\hat{X}$ in the term $\hat{B} = D(\hat{X}, H)$. To solve this problem, we first optimize the discriminator $D$ with a fixed generator $G$. Second, we optimize the generator $G$ with the newly updated discriminator $D$ using
$\mathcal{L}_G = - \sum_{i : z_i = 0} \log \hat{B}_i$,
where $z \in Z$ and $Z \in \{0, 1\}^n$ is a random variable defined by first sampling $k$ from $\{1, \dots, n\}$ uniformly at random. The reconstruction error for the non-missing values is
$\mathcal{L}_R(x, \hat{x}) = \sum_{i : b_i = 1} \ell(x_i, \hat{x}_i)$,
where $\ell$ is the squared error for numerical features and the cross-entropy for categorical ones. The final objective for $G$, for a mini-batch $k_G$ and a hyperparameter $\alpha$, is given by
$\min_G \; \sum_{j \in k_G} \mathcal{L}_G(b_j, \hat{b}_j) + \alpha \, \mathcal{L}_R(x_j, \hat{x}_j)$.

In the second component, we use an SSL approach to create a model that can use both the labeled and unlabeled data for model training. We combine SSL and MTL so that different tasks can be learned simultaneously in a joint framework. Let us define the input features $X \in \mathbb{R}^{m \times n}$, with labeled samples $X_L \in \mathbb{R}^{m_L \times n}$ and corresponding labels $Y \in \mathbb{R}^{m_L \times 1}$, and unlabeled samples $X_U \in \mathbb{R}^{m_U \times n}$. Let us define a task $t$ as the task of predicting the label for an applicant in a given stage. For $S$ stages, we have $T$ tasks, hence $T = S$. For supervised learning in a task $t$, we use the cross-entropy loss. To achieve semi-supervised learning, we rely on the common assumption that if two feature vectors $x_i$ and $x_j$ are close in a weighted graph, their predictions $f(x_i)$ and $f(x_j)$ should be similar [4]. Therefore, we can use a graph regularization term that depends on the affinity matrix $O$, where the affinity between $x_i$ and $x_j$ is given by
$o_{ij} = \exp\!\big(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j)\big)$ if $x_j \in N_K(x_i)$ or $x_i \in N_K(x_j)$, and $o_{ij} = 0$ otherwise,
where $N_K(x_i)$ denotes the set of $K$ nearest neighbors of $x_i$. The tuning parameters $\sigma_i$ and $\sigma_j$ can be set as the standard deviation of the related $K$-nearest-neighbor set. Using the affinity matrix, we can calculate the semi-supervised term as
$\sum_{i,j} o_{ij} \, \| f(x_i) - f(x_j) \|^2$.
As shown in [4], this term can be simplified into the closed form $\gamma_1 \operatorname{tr}\!\big( f(X^t)^\top \Delta^t f(X^t) \big)$, where $\gamma_1 \geq 0$ is a model tuning hyperparameter, $\Delta^t = D^t - O^t$ is the graph Laplacian matrix, and $D^t$ is a diagonal matrix with $d_{ii} = \sum_j o_{ij}^t$ for $t = \{0, 1, \dots, T\}$. A sketch of this graph construction is shown below.
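The following sketch illustrates one way to build such an affinity matrix and evaluate the graph term. The dense distance matrix, the default K, and the handling of ties are simplifications for illustration rather than the exact implementation used in the paper.

import numpy as np

def knn_affinity(X, K=4):
    """Gaussian affinity over K-nearest-neighbour pairs (dense, for clarity).

    sigma_i is set to the standard deviation of the distances from x_i to
    its K nearest neighbours, as described in the text.
    """
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    O = np.zeros((m, m))
    sigmas = np.zeros(m)
    neighbours = []
    for i in range(m):
        nn = np.argsort(dist[i])[1:K + 1]        # skip the point itself
        neighbours.append(set(nn))
        sigmas[i] = dist[i, nn].std() + 1e-8
    for i in range(m):
        for j in range(m):
            if i != j and (j in neighbours[i] or i in neighbours[j]):
                O[i, j] = np.exp(-dist[i, j] ** 2 / (sigmas[i] * sigmas[j]))
    return O

def graph_ssl_term(F, O):
    """Closed-form graph penalty tr(F^T Delta F) with Delta = D - O,
    proportional to sum_ij O_ij * ||f(x_i) - f(x_j)||^2."""
    D = np.diag(O.sum(axis=1))
    return np.trace(F.T @ (D - O) @ F)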
For regularization, we introduce a temporal structure to encourage sequential tasks to share knowledge. This is achieved with a graph regularization similar to the affinity matrix but applied to tasks. An edge $r \in R$ can be defined as $r_{tt} = 1$, $r_{(t+1)t} = -1$, and $r_{ij} = 0$ otherwise, so that $\|WR\|_F^2$ penalizes differences between the weights of consecutive tasks. Putting everything together, the multi-task semi-supervised loss function is given by
$\min_W \; \sum_{t=0}^{T} \sum_{j} \mathcal{L}\big(y_j^t, f(x_j^t)\big) + \gamma_1 \sum_{t=0}^{T} \operatorname{tr}\!\big( f(X_U^t)^\top \Delta^t f(X_U^t) \big) + \lambda_1 \|W\|_F^2 + \lambda_2 \|WR\|_F^2 + \lambda_3 \|W\|_{2,1}$,
where $x_j^t$ denotes sample $j$ of the $t$-th task, $y_j^t$ denotes its corresponding label, $X_U^t$ is the unlabeled dataset, $W$ are the model parameters, $\lambda_1$ controls the $\ell_2$-norm penalty to prevent overfitting, $\lambda_2$ is the regularization parameter for temporal smoothness, and $\lambda_3$ controls group sparsity for joint feature selection.

(Fig. 2 caption) The training flow uses the labeled datasets and the unlabeled datasets $\{X_U^t\}_{t=0}^{T}$ for each task, as well as an affinity matrix $O$, to train the classifier for all tasks. The weights $W$ are regularized using a temporal graph $R$ that enforces temporal smoothness. On the bottom, the prediction flow uses $W_{AAE}$ to generate the complete dataset $\hat{X}$ and $W_{MTSSL}$ to generate the prediction matrix $\hat{P}$.

Figure 2 shows our entire method. We first use the AAE component to generate $\hat{X}$, which is the input to the MTSSL component used to obtain the predictions. This framework allows us to (1) create a dataset common to all stages that can be used for other tasks and (2) train a single classifier for all stages, promoting contribution among correlated tasks and better generalization using labeled and unlabeled data. A minimal sketch of the combined objective follows.
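The sketch below assembles the terms above for simple linear (logistic) task models: a supervised loss per task, the graph smoothness term, an l2 penalty, the temporal smoothness penalty on W R, and an l2,1 group-sparsity penalty. The shapes, the logistic form of f, the construction of R, and the default hyperparameter values are illustrative assumptions rather than the network actually used.

import numpy as np

def temporal_graph(T):
    """R in R^{T x (T-1)}: column t couples tasks t and t+1 (r_tt = 1, r_{t+1,t} = -1)."""
    R = np.zeros((T, T - 1))
    for t in range(T - 1):
        R[t, t] = 1.0
        R[t + 1, t] = -1.0
    return R

def mtssl_loss(W, tasks, laplacians, lam1=0.1, lam2=1.0, lam3=10.0, gamma1=10.0):
    """Multi-task semi-supervised objective with temporal regularization.

    W: (n, T) weight matrix, one column per task (stage).
    tasks: list of (X_L, y, X_U) per task; laplacians: graph Laplacian per task.
    """
    T = W.shape[1]
    loss = 0.0
    for t, (X_L, y, X_U) in enumerate(tasks):
        p = 1.0 / (1.0 + np.exp(-X_L @ W[:, t]))               # logistic predictions
        loss += -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
        f_u = X_U @ W[:, t]                                    # unlabeled predictions
        loss += gamma1 * f_u @ laplacians[t] @ f_u             # graph smoothness term
    R = temporal_graph(T)
    loss += lam1 * np.sum(W ** 2)                              # l2 penalty
    loss += lam2 * np.sum((W @ R) ** 2)                        # temporal smoothness
    loss += lam3 * np.sum(np.sqrt(np.sum(W ** 2, axis=1)))     # l2,1 group sparsity
    return loss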
To demonstrate the application of our method in a real-world setting, we perform experiments using 3 datasets from 3 different multi-stage selection processes.

Datasets - For privacy reasons, we refer to the companies by index, such that $C_1$ refers to Company 1. The companies have similar processes but distinct goals: $C_1$ is an organization that selects students for a fellowship; $C_2$ is a big retail company whose process focuses on its recent-grad hire program; $C_3$ is a governmental agency that selects applicants to work in the public administration. Each dataset contains sub-datasets. For example, in $C_1$ the process happens annually and we have data for three years, $year^1_{C_1}$, $year^2_{C_1}$, $year^3_{C_1}$. The dual funnel structure of this process is shown in Fig. 1. The process in $C_2$ happens every semester and data for 6 semesters is available; on average, 13000 applicants start the process and 300 are selected. The process in $C_3$ happens annually and data from 2 years is available; in this case, 5000 applicants start the process and 35 are selected. Each stage in the process contains its own set of variables. For example, in the Demographics stage, information about state and city is collected. Therefore, we refer to Demographics features as those collected in the Demographics stage. The data collected in each process is very similar in stages such as Demographics and Education, both in content and structure. For stages with open questions, such as Video, each process has its own specific questions (see Table 1 for details). Additionally, in all datasets, the Video stage marks the last part where data is collected automatically.

Feature Preparation - Early stages have structured data in a tabular format. Categorical variables are converted to numerical values using a standard one-hot encoding transformation. In the later stages, such as Video, the speech is extracted and the data is used as text. To convert this data to numerical values, we create word embeddings using Word2Vec [9], where we assign high-dimensional vectors (embeddings) to words in a text corpus while preserving their syntactic and semantic relationships.

Validation and Performance Metrics - We perform longitudinal experiments, in which we use a previous year as the training and test set and the following year as the validation set. For $C_1$, for example, we split the dataset from $year^1_{C_1}$ into train and test, find the best model, and validate its results using the dataset from $year^2_{C_1}$. The model is trained on $year^1_{C_1}$ and has never seen any data from $year^2_{C_1}$. We repeat the process for all other years. Finally, we also combine the datasets from $year^1_{C_1}$ and $year^2_{C_1}$ and validate the results on $year^3_{C_1}$, which results in 4 groups. For the train and test split, we also perform 10-fold cross-validation, resulting in a total of 40 runs. In $C_2$, we obtain 33 groups (330 runs) and for $C_3$ we have only 1 group (10 runs). In all experiments, we compare the models in terms of the F1-score for the positive class, which balances precision and recall specifically for the applicants that are selected in each stage. A sketch of this longitudinal protocol is shown below.
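A minimal sketch of the longitudinal protocol follows; the per-year datasets, the generic scikit-learn estimator, and the refit-on-all-earlier-data step before validation are assumptions made for illustration.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def longitudinal_evaluation(estimator, train_years, validation_year, n_splits=10):
    """Train/test on earlier years with 10-fold CV, then validate on the following year.

    train_years: list of (X, y) arrays for the years used for model selection.
    validation_year: (X, y) for the held-out following year.
    Returns per-fold test F1 and the F1 on the validation year (positive class).
    """
    X = np.vstack([Xy[0] for Xy in train_years])
    y = np.concatenate([Xy[1] for Xy in train_years])
    fold_scores = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        estimator.fit(X[train_idx], y[train_idx])
        fold_scores.append(f1_score(y[test_idx], estimator.predict(X[test_idx]),
                                    pos_label=1))
    X_val, y_val = validation_year
    estimator.fit(X, y)                      # refit on all earlier data
    val_score = f1_score(y_val, estimator.predict(X_val), pos_label=1)
    return fold_scores, val_score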
Benchmark Methods - We compare the results with other established methods such as: Support Vector Machines (SVM) [2] with $C = 0.1$; and neural networks (NN) with dropout regularization, $p = 0.5$ [13]. We apply them in the standard setting, i.e. without AAE (N), with Single Task Learning (STL) and Supervised Learning (SL), denoted N-STL-SL. In our second setting, we add data imputation and run experiments with semi-supervised learning, AAE-STL-SSL. For this setting we use: Laplacian Support Vector Machines (LapSVM) [8] with a Gaussian kernel, $\sigma = 9.4$, $nn = 10$, $p = 2$, $\gamma_A = 10^{-4}$, $\gamma_I = 1$; and a DNN with Virtual Adversarial Training (VAT) [10] with $K = 1$, $\epsilon = 2$, $\alpha = 1$. For NN methods, we use dense NNs with fully connected layers, and the structure is chosen such that the number of parameters is similar in all settings. Our third setting uses the MTL component in the SL case, hence AAE-MTL-SL. We use: Infinite Latent SVM (MT-iLSVM) [22] with $\alpha = 0.5$, $C = 1$, $\sigma^2_{m0} = 1$; and Deep Multi-Task Representation Learning with tensor factorization (DMTRL), with the Tucker method [18]. Finally, in the fourth setting, AAE-MTL-SSL, we use DMTRL in the SSL setting and Semi-Supervised Multi-task Learning (S2MTL) [4] with $\lambda_1 = 10$, $\lambda_2 = 10$, $k = 4$. For our approach, DECISIVE, we use $\lambda_1 = 0.1$, $\lambda_2 = 1$, $\lambda_3 = 10$ and $\gamma_1 = 10$. All hyperparameters are found using cross-validation, and the performance is robust for values chosen in the set $\{10^{-2}, 10^{-1}, 1, 10, 100\}$.

In this section we present the results obtained in all experiments, which can be seen in Fig. 3 for $C_1$ and in Table 2 and Table 3 for $C_2$ and $C_3$, respectively. Data refers to the number of features in each stage compared to the final dataset. App refers to the number of applicants in each stage compared to the number in the first stage. We first investigate the general results across all experiments. SVM achieved the best result in the N-STL-SL setting (to save space we omit NN); however, this group had the worst performance, mostly due to overfitting, since each classifier is trained individually and knowledge across stages is not shared. The second-best group is VAT and LapSVM in AAE-STL-SSL. Classifiers in this group are still trained individually, but the addition of data imputation and unlabeled data helps the classifiers generalize better in the later stages. SVM methods (LapSVM) have better performance than NN (VAT), especially when the sample size is relatively small. When MTL is introduced in the third setting, AAE-MTL-SL, results are improved since knowledge is shared across all tasks. The parameters of the multi-task classifier are updated based on the loss for all predictions, which helps mitigate the challenge of small sample sizes in the later stages (Video E. to Final). In this setting, the NN-based method (DMTRL) can perform similarly to the SVM-based method (MT-iLSVM), since the NN can learn more complicated patterns and decision boundaries without overfitting as in previous cases. However, the best results for DMTRL are in the SSL case. The best performing methods are in the AAE-MTL-SSL setting. DMTRL performs slightly better than the S2MTL method, in which knowledge is shared at the feature level. In general, our method (DECISIVE) outperforms the other methods, and the difference is higher in later stages (see Sect. 6.2). Compared to DMTRL and S2MTL, the temporal regularization introduced in our method encourages tasks that are close together in time to contribute more to each other. This allows our model to share knowledge among all tasks in the multi-task framework while still preserving some of the temporal structure of the process.

Figure 3 (left) shows that the results among methods are consistent for $C_1$ as we vary the training data. Interestingly, predicting the data in $year^3_{C_1}$ using $year^1_{C_1}$ works better than using $year^2_{C_1}$. This shows that the profile of applicants changes from year to year, and the applicants approved in $year^3_{C_1}$ are more similar to the ones approved in $year^1_{C_1}$. Therefore, it is important to have data from different years to improve generalization. When we combine data from $year^1_{C_1}$ and $year^2_{C_1}$, the results improve due to the increase in sample size and the combination of profiles from different years. In terms of stages, it is clear from Figure 3 (right) that the algorithms perform well while there is enough data to generalize. However, in later stages, there is a drop in performance caused by the small sample size. Algorithms based on AAE-MTL-SSL perform significantly better than the others. They achieved a gain of 4x compared to the best standard case (SVM) for later stages. Additionally, our method outperforms the second best with a 12% gain.

Results for $C_2$ are shown in Table 2. This process contains more longitudinal data (6 semesters), which makes our model more robust to changes across years. Additionally, the number of applicants is more evenly distributed and more applicants reach the final stages, which causes the average performance to be more similar across all stages. In other words, the drop in performance is less steep for later stages compared to $C_1$. We also see that learning from unsupervised samples is less impactful, as methods from AAE-MTL-SL (MT-iLSVM) have similar performance to methods in AAE-MTL-SSL (DMTRL, DECISIVE). Our method outperforms the other methods in this case (16% gain), which shows the importance of the multi-task learning and regularization in our structure. Compared to SVM, the gain is about 3.46x. Both gains relate to the later stages.

Table 3 shows the results for $C_3$. This process has an overall smaller sample size and only two years are available.
We reason that the methods could overfit the training set and not generalize as well to the validation set. Nevertheless, the algorithms in the AAE-MTL-SSL setting still have the best performance, but the difference from the standard case is smaller when compared to the other experiments in later stages (80% gain). Our approach outperforms the other methods with a 13% gain over DMTRL.

We presented a framework that combines adversarial autoencoders (AAE) and multi-task semi-supervised learning (MTSSL) to train an end-to-end neural network for all stages of a selection problem. We showed that the AAE makes it possible to create a complete dataset using data imputation, which allows downstream models to be trained in SL and SSL settings. We also introduced a temporal regularization on the model to use information from different stages while still conserving the temporal structure of the process. By combining MTL and SSL, our method can outperform other STL and SL methods. Our validation includes real-world data, and our method is able to achieve a gain of 4x over the standard case and a 12% improvement over the second-best method. For future research, we want to introduce interpretability techniques to understand the profiles learned by the model and investigate the effect of bias. We also want to apply the framework to other multi-stage processes in different fields.

[1] Multitask learning
[2] LIBSVM: a library for support vector machines
[3] Optimization algorithms on subspaces: revisiting missing data problem in low-rank matrix
[4] User attribute discovery with missing labels
[5] MIDA: multiple imputation using denoising autoencoders
[6] Generative adversarial nets
[7] Learning task grouping and overlap in multi-task learning
[8] Laplacian support vector machines trained in the primal
[9] Efficient estimation of word representations in vector space
[10] Virtual adversarial training: a regularization method for supervised and semi-supervised learning
[11] An overview of multi-task learning in deep neural networks
[12] Deep-cascade: cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes
[13] Dropout: a simple way to prevent neural networks from overfitting
[14] MissForest: non-parametric missing value imputation for mixed-type data
[15] Multi-stage classifier design
[16] Extracting and composing robust features with denoising autoencoders
[17] Robust real-time object detection
[18] Deep multi-task representation learning: a tensor factorisation approach
[19] GAIN: missing data imputation using generative adversarial nets
[20] A survey on multi-task learning
[21] Modeling disease progression via fused sparse group Lasso
[22] Infinite latent SVM for classification and multi-task learning