key: cord-0043123-r253ygx0
authors: Zhang, Jianyu; Hao, Jianye; Fogelman-Soulié, Françoise
title: Cross-data Automatic Feature Engineering via Meta-learning and Reinforcement Learning
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47426-3_63
sha: bd4bc10c493a66a37e2d4d9bfe0d1c7130dde550
doc_id: 43123
cord_uid: r253ygx0

Feature Engineering (FE) is one of the most beneficial, yet most difficult and time-consuming tasks of machine learning projects, and requires strong expert knowledge. It is thus important to design generalized ways to perform FE. The primary difficulties arise from the multiform information to consider, the potentially infinite number of possible features and the high computational cost of feature generation and evaluation. We present a framework called Cross-data Automatic Feature Engineering Machine (CAFEM), which formalizes the FE problem as an optimization problem over a Feature Transformation Graph (FTG). CAFEM contains two components: a FE learner (FeL) that learns fine-grained FE strategies on a single dataset by Double Deep Q-learning (DDQN), and a Cross-data Component (CdC) that speeds up FE learning on an unseen dataset through the generalized FE policies learned by Meta-Learning on a collection of datasets. We compare the performance of FeL with several existing state-of-the-art automatic FE techniques on a large collection of datasets. The results show that FeL outperforms existing approaches and is robust to the choice of learning algorithm. Further experiments also show that CdC not only speeds up FE learning but also increases learning performance.

As machine learning becomes more and more widespread, it has been recognized that feature engineering (FE) is the most critical factor for model performance [1]. Various researchers have demonstrated the benefit of using additional features [11]. FE aims at reducing the model error and making learning easier by deriving new features from the original ones through mathematical functions (operators). Normally a data scientist combines feature generation, selection and model evaluation iteratively, generating a long sequence of decisions before obtaining the "optimal" set of derived features. This process heavily relies on expert domain knowledge, intuition and technical expertise to handle the complex feedback and make the best decisions. As a result, the process is difficult, time-consuming and hard to automate.

Most existing methods for automatic FE either generate a large set of possible features with predefined transformation operators followed by feature selection [3, 7, 15], or apply simple supervised learning (a simple algorithm and/or simple meta-features derived from the FE process) to recommend a potentially useful feature [4, 5, 9]. The former makes the process computationally expensive, which is even worse for complex features, while the latter significantly limits the performance boost.

A recently proposed FE approach [5] is based on Reinforcement Learning (RL). It treats all features in the dataset as a whole and applies traditional Q-learning [14] on FE-augmented examples to learn a strategy for automating FE under a given computing budget. RL is more promising for providing general FE solutions. However, this work uses Q-learning with linear approximation and 12 simple manual features, which limits the ability of automatic FE. Furthermore, it ignores the differences between features and applies a transformation operator to all of them at each step. Because of this lack of discrimination between features, it is computationally expensive, especially for large datasets and complex transformation operators.
To address the above limitations, in this work we propose FeL (Feature Engineering Learner) and CAFEM (Cross-data Automatic Feature Engineering Machine). The former is a novel approach for automatic FE on one particular dataset, based on off-policy Deep Reinforcement Learning (DRL). In order to speed up the FE process and take advantage of the FE knowledge learned from a large set of datasets, the latter extends FeL to the cross-data level by Meta-Learning. We define a Feature Transformation Graph (FTG), a directed graph representing relationships between different transformed versions of features, to organize the FE process. FeL sequentially trains an agent for each feature with DRL algorithms, learning a feature engineering strategy on one dataset and the corresponding FTG representation. We thus view the goal of FE as maximizing model accuracy by searching through a set of features F+ to generate and a set of features F− to eliminate. CAFEM extends this process to the cross-data level by training one agent on a large set of datasets, so that the learned policy performs well on unseen datasets.

In this section we review the Reinforcement Learning (RL) [10] background and describe the problem formulation. RL is a family of algorithms that formalizes the interaction of an agent A with its environment as a Markov Decision Process (MDP) and allows the agent to devise an optimal sequence of actions. An MDP is defined by a tuple ⟨S, A, T, R, γ⟩, where S is a set of states, A a set of actions, T : S × A → Δ(S) a transition function that maps each state-action pair to a probability distribution over the possible successor states, R : S × A × S → ℝ a reward function and γ ∈ [0, 1] a discount factor controlling the importance of future rewards. A policy π : S → A is a mapping from states to actions. At every time step t, an agent in state s_t produces an action a_t = π(s_t). Based on the transition function T, the agent moves to a next state s_{t+1} drawn from T(s_t, a_t) and obtains an immediate reward r_t = R(s_t, a_t, s_{t+1}). The goal of the agent is to find an optimal policy π* maximizing its expected discounted cumulative reward E[R_0 | s_0], where R_t = Σ_{i=t}^{∞} γ^{i−t} r_i is the discounted sum of future rewards.

Q-learning is a well-known model-free RL algorithm for finding an optimal policy π* for any finite MDP. In Q-learning we define the Q-function, or action-value function, of a policy π as Q^π(s, a) = E[R_t | s_t = s, a_t = a, π], the expected return of taking action a in state s and following π thereafter. Given an optimal policy π*, we are interested in the optimal function Q^{π*}(s, a), or Q*(s, a) for short, where ∀π, s ∈ S, a ∈ A, Q^{π*}(s, a) ≥ Q^π(s, a). As a result, Q* satisfies the Bellman optimality equation Q*(s, a) = E_{s' ∼ T(s,a)}[R(s, a, s') + γ max_{a'} Q*(s', a')].

Double Deep Q-network (DDQN) [12] is a model-free RL algorithm which estimates the state-action value approximately through a deep neural network with parameters θ. It uses an ε-greedy policy to select the next action. During training, the tuples ⟨s_t, a_t, r_t, s_{t+1}⟩ generated by the ε-greedy policy are stored in R, the so-called replay buffer. The neural network is then trained by sampling mini-batches from the replay buffer and performing gradient descent on the loss L = E[(Q(s_t, a_t) − y_t)²], where y_t = r_t + γ max_a Q(s_{t+1}, a) and Q(s, a) is approximated by the network g with parameters θ.
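To make the training step concrete, here is a minimal sketch of the replay-buffer loss just described, written in PyTorch. It uses the standard Double DQN target, in which the online network selects the next action and a separate target network evaluates it, which refines the simplified y_t written above; all names (q_net, target_net, batch) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, target_net, batch, gamma=0.99):
    """L = E[(Q(s_t, a_t) - y_t)^2] on one sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t) from the online network
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN target: online network selects the action,
        # target network evaluates it.
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        y = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_sa, y)
```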
The goal of meta-learning is to quickly train a model for a new task with the help of data from many other similar tasks. Model-Agnostic Meta-Learning (MAML) [2] is one of the best-known meta-learning algorithms trained by gradient descent. We denote a set of tasks by {T_i}. MAML performs one gradient descent step for a task T_i on a loss L, with network g and network parameters θ, obtaining adapted parameters θ_i' as in Equation (2). It then performs a second gradient descent step on θ, using the loss computed with the adapted parameters θ_i', as in Equation (3). In this way, MAML finds parameters θ that are close to the optimal parameters of every task:

θ_i' = θ − α ∇_θ L_{T_i}(g_θ),   (2)

where α is the learning rate of each task T_i, and

θ ← θ − β ∇_θ Σ_{T_i} L_{T_i}(g_{θ_i'}),   (3)

where β is the meta step size.

We consider a collection of typical supervised learning tasks (binary classification or regression) T = {T_1, T_2, ..., T_N}. Each task T_i can be represented as T_i = ⟨D, L, m⟩, where D = ⟨F, y⟩ is a dataset with a set of features F = {f_1, f_2, ..., f_n} and a corresponding target variable y, L is a learning algorithm (e.g. Random Forest, Logistic Regression, Neural Network) to be applied on dataset D, and m is an evaluation measure (e.g. log-loss, relative absolute error, F1-score). We use P^m_L(F, y), or P(D), to denote the cross-validation performance of learning algorithm L under evaluation measure m on dataset D. The goal of each task is to maximize P(D).

A transformation operator τ in FE is a function applied to a set of features to generate a new feature f+ = τ({f_i}), where the order of the operator is the number of features in {f_i}. We denote the set of derived features by F+. For instance, a product transformation applied to two features (Order-2) generates a new feature f+ = product(f_i, f_j). We use T to denote the set of all transformation operators. Feature engineering aims at constructing a subset of features F* from the original features F and the derived features F+, while eliminating a set of features F−, so as to maximize the model performance. The goal of feature engineering is thus to find a good policy π* that maximizes the model performance for a given algorithm L and measure m on a dataset D.
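As a concrete, purely illustrative reading of these definitions, the snippet below computes P^m_L(F, y) as a 5-fold cross-validated F1-score with a Random Forest, and derives one Order-1 and one Order-2 feature. The dataset and the specific operators here are assumptions for the example, not choices made by the paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def performance(F, y):
    """P^m_L(F, y): cross-validated score of learning algorithm L under measure m."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)   # L
    return cross_val_score(model, F, y, cv=5, scoring="f1").mean()     # m = F1-score

# Order-1 operator applied to one feature: f+ = log(f_i)
f_log = np.log1p(X[:, [0]])                 # log1p avoids log(0) on non-negative data
# Order-2 operator applied to two features: f+ = product(f_i, f_j)
f_prod = (X[:, 0] * X[:, 1]).reshape(-1, 1)

F_star = np.hstack([X, f_log, f_prod])      # a candidate feature set F*
print(performance(X, y), performance(F_star, y))
```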
In this section, we present a new framework called Cross-data Automatic Feature Engineering Machine (CAFEM). In order to highlight the differences between features and integrate feature generation and feature selection effectively, we propose a Feature Transformation Graph (FTG) to represent the FE process at the feature level. Based on the FTG, CAFEM can perform feature engineering for each particular feature using the information related to it. It thus avoids the drawback of generating a large set of features at each step as in [5], especially for complex features and large numbers of features. One component of CAFEM, called FE Learner (FeL), uses Reinforcement Learning to find the optimal feature set F* for each feature iteratively, instead of using expensive graph search algorithms [6]. FeL focuses on one particular supervised learning task, which gives it the ability to dig deeply into that task. However, it loses the opportunity to learn and integrate useful experience from other tasks, which could speed up the FE process on a similar task. In order to balance performance and speed, another component of CAFEM, called the Cross-data Component (CdC), applies a Model-Agnostic Meta-Learning (MAML) [2] method, originally designed for supervised learning and on-policy reinforcement learning, to off-policy reinforcement learning algorithms, so as to speed up FE learning on one particular dataset by integrating the FE knowledge gathered from a set of datasets.

We propose a structure called Feature Transformation Graph (FTG) G, a directed acyclic dynamic graph, to represent the FE process. Each node f in the FTG corresponds to either one original feature in F_o or one feature derived from original features. An edge from node f_i to node f_j indicates that feature f_j is derived from f_i by a transformation operator. At the start of FE, G contains n nodes, which correspond to the n original features {f_1, f_2, ..., f_n}. As the FE process proceeds, the FTG grows dynamically (more nodes and edges are added), so we denote the FTG at time step t by G_t. An illustrative example is given in Fig. 1.

So far, we have introduced the representation of FE with the FTG in our automatic FE framework. What remains is to find a suitable strategy to control the growth of the FTG. An important property is that the FTG is not designed for any particular strategy, but is a general representation of an FE process. As a result, we can apply many different strategies on top of the FTG to control it, such as graph search or RL. In this paper, we choose RL to learn a strategy that makes a sequence of decisions on top of the FTG, due to its efficiency.

We consider the FE process with the FTG on one dataset D as an MDP defined by a tuple ⟨S, A, T, R, γ⟩. At each time step t, a state s_t ∈ S consists of the Feature Transformation Graph G_t and the features {f_t} we are working on. Depending on the order of the transformation operator, {f_t} may contain one or more features: one feature for Order-1 operators (e.g. log, square), two features for Order-2 operators (e.g. product, sum). An action a_t ∈ A = A_G ∪ A_S comes from the following two groups of actions:

- A_G is a set of actions for feature generation, which apply a transformation τ ∈ T to the current features {f_t} to derive one new feature.
- A_S contains one action for feature selection by RL, which drops the current feature f_t and moves back to the previous feature. One special case is when the current feature f_t ∈ {f_1, f_2, ..., f_n} is an original feature; in this case, the feature selection action drops it and stops the current FE process.

The learning objective here is to find a state s_i with a feature set F* in the FTG that maximizes the model accuracy P^m_L(F*, y). The trajectory from an original feature to a new feature f_i encodes the final feature engineering strategy for f_i. Since the target of FE is to maximize the performance P(D), the reward r_t of this FE problem on the FTG at time step t is set as the resulting change in performance, r_t = P(D_{t+1}) − P(D_t), where D_t is the dataset with the feature set available at step t.
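The sketch below illustrates one such MDP step in plain Python: a generation action (A_G) applies an operator to the current feature and adds the result to the dataset, a selection action (A_S) drops it, and the reward is the resulting change in cross-validated performance. It is a simplified toy version of the formulation above, with all names and the concrete dataset chosen for illustration only; it ignores, for instance, the FTG bookkeeping and the move back to the previous feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def performance(F, y):
    """P(D): 5-fold cross-validated F1-score of a Random Forest."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, F, y, cv=5, scoring="f1").mean()

def step(F, y, current_idx, action):
    """One FE step: action is either 'drop' (A_S) or an Order-1 operator (A_G)."""
    p_before = performance(F, y)
    if action == "drop":
        F_next = np.delete(F, current_idx, axis=1)           # feature selection
    else:
        new_col = action(F[:, current_idx]).reshape(-1, 1)   # feature generation
        F_next = np.hstack([F, new_col])
    reward = performance(F_next, y) - p_before               # r_t = P(D_{t+1}) - P(D_t)
    return F_next, reward

X, y = load_breast_cancer(return_X_y=True)
F_next, r = step(X, y, current_idx=0, action=np.square)      # try f+ = square(f_0)
print("reward:", r)
```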
Until now, we have introduced the organization of the FE process and the MDP formulation of the FE problem. The most critical remaining part is the algorithm that finds a good FE strategy. We introduce the CAFEM framework, which mainly contains two parts: 1) an algorithm called FeL that applies an off-policy DRL algorithm A (such as DQN [8] or Double DQN [12]) on the FTG for one particular dataset to perform automatic FE; 2) an extension of the model-agnostic meta-learning [2] algorithm to off-policy DRL, which speeds up FE learning by taking advantage of the generalized FE strategies learned from a set of datasets. The algorithms are called off-policy since the policy being learned can differ from the policy being executed. In the following sub-sections, we introduce the details of these two parts.

Although FeL works as a component of CAFEM in this paper, it is also a complete algorithm that sequentially optimizes FE strategies for each feature on one particular dataset. The details of FeL are shown in Algorithm 1. Given a supervised learning task T with n features F = {f_1, f_2, ..., f_n} and n off-policy DRL agents {A_i}, FeL sequentially optimizes a FE policy for each feature (line 2 in Algorithm 1). As in the traditional training stage of off-policy RL algorithms, FeL starts by performing M episodes of the FE process with ε-greedy exploration and storing the corresponding transitions in a replay buffer (lines 3-10). In this process, FeL either generates a new feature f̂ from feature f by an action a_t ∈ A_G or drops the current feature f and moves back to the previous feature (a_t ∈ A_S). FeL then trains the corresponding agent A_i by performing gradient descent on mini-batches sampled from the replay buffer R_i (lines 11-14). During the test stage, the same FE procedure as in Algorithm 1, with ε = 0, is used to perform FE for each feature sequentially. Note that the operators in the transformation operator set T are not all of the same complexity level: some unary features (e.g. log(f_i), square(f_i)) are less complex than binary features (e.g. product(f_i, f_j), sum(f_i, f_j)). As in [15], we introduce features in order of feature complexity, deriving simple features first (e.g. unary features) and complex features afterwards (e.g. binary features).

Algorithm 1 (FeL), lines 2-17:
2:  for f_i in F do
3:    for episode = 1, M do
4:      Get initial state s_0
5:      while not terminal do
6:        Get an action a_t by ε-greedy and execute a_t on f: f̂ = a_t(f)
7:        Obtain reward r_t and next state s_{t+1}
8:        Store transition (s_t, a_t, r_t, s_{t+1}) in R_i and reset current feature: f ← f̂
9:      end while
10:   end for
11:   for t = 1, N do
12:     Sample a mini-batch from replay buffer R_i
13:     Perform one optimization step on A_i
14:   end for
15: end for
16: Reset dataset D = ⟨{f_1, ..., f_n}, y⟩
17: end while

Cross-data Component: In order to speed up the FE process and take advantage of a large set of datasets, we apply Model-Agnostic Meta-Learning [2] to off-policy RL to perform cross-data level automatic feature engineering. The details of the Cross-data Component (CdC) are shown in Algorithm 2. Given a set of datasets {D = ⟨F, y⟩} and an off-policy RL agent A represented by g_θ (we use DDQN here as it achieves good performance on many tasks [12]), the Cross-data Component samples a batch of features {f_i} with their corresponding datasets {D_{f_i}} and constructs a batch of supervised learning tasks {T_i = ⟨D_{f_i}, L, m⟩} (line 2). For each task T_i ∈ {T_i}, CdC uses the RL agent A together with ε-greedy exploration to perform M episodes for T_i and stores the corresponding transitions in a replay buffer R_i (lines 4-5). Then CdC samples K transitions from R_i and computes one gradient descent step as in Algorithm 2 (lines 7-8), where the loss L is the same as in Algorithm 1. Finally, we sample a batch of transitions and perform the meta-update (lines 9-11).

Algorithm 2 (Cross-data Component), lines 6-11:
6:  Store all transitions in R_i
7:  Sample K transitions T from R_i
8:  Compute adapted parameters with gradient descent: θ_i' = θ − α ∇_θ L_{T_i}(g_θ)
9:  Sample transitions T_i' from R_i for the meta-update
10: end for
11: Perform the meta-update of Equation (3) on θ
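For intuition, the following is a rough sketch of how such a meta-update can be combined with per-task replay buffers, in the spirit of Algorithm 2 and Equations (2)-(3). It is a simplified first-order variant (gradients are not propagated through the inner update), and every name in it (q_net, replay_buffers, td_loss, the learning rates) is an illustrative assumption rather than the authors' implementation.

```python
import copy
import torch

def meta_update(q_net, replay_buffers, td_loss, alpha=1e-3, beta=1e-3):
    """One cross-data meta-step over a batch of tasks; one replay buffer R_i per task T_i."""
    meta_grads = [torch.zeros_like(p) for p in q_net.parameters()]
    for buffer in replay_buffers:
        adapted = copy.deepcopy(q_net)                    # start the inner update from theta
        inner_loss = td_loss(adapted, buffer.sample())    # loss on K transitions from R_i
        grads = torch.autograd.grad(inner_loss, list(adapted.parameters()))
        with torch.no_grad():                             # Eq. (2): theta_i' = theta - alpha * grad
            for p, g in zip(adapted.parameters(), grads):
                p -= alpha * g
        outer_loss = td_loss(adapted, buffer.sample())    # fresh transitions for the meta-update
        outer_grads = torch.autograd.grad(outer_loss, list(adapted.parameters()))
        for mg, g in zip(meta_grads, outer_grads):        # accumulate sum over tasks
            mg += g
    with torch.no_grad():                                 # Eq. (3): theta <- theta - beta * sum_i grad
        for p, mg in zip(q_net.parameters(), meta_grads):
            p -= beta * mg
```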
Network Design: So far, we have discussed the details of the FeL algorithm and the Cross-data Component. One remaining part is the structure of the neural network that approximates the Q-values of DDQN in the FeL algorithm. In this project, instead of building one approximation function with parameters θ for each action a [5], we use a single unified function, approximated by one neural network, for all actions. Thus, we only need to train one DRL model. As discussed in Sect. 4.2, the state s_t at time t captures the FTG G_t and the features {f_t} being worked on at time t; in order to cover these two parts of information, we represent s_t with a set of features describing both G_t and {f_t}.

This section describes our experimental results. First, we introduce our experimental settings as well as our training procedure. Then we use the F1-score (for classification) and 1 − Relative Absolute Error (1-RAE, for regression) criteria to compare the performance of the FeL algorithm with several state-of-the-art automatic FE techniques. After that, we evaluate the robustness of our algorithm with respect to different learning algorithms (Random Forest, Logistic Regression). Finally, we show the efficiency of CAFEM on different supervised learning tasks by comparing it with FeL; to our surprise, CAFEM also helps improve prediction performance. Source code is available on GitHub (https://github.com/TjuJianyu/CAFEM.git).

We randomly collect 120 binary classification or regression datasets from OpenML, keeping only datasets that contain no missing values and not too many features or instances. We randomly split them into 100 datasets for training and the remaining 20 datasets for testing. Following [5, 9], we choose 13 transformation operators (the set T), including Order-1: Log, Round, Sigmoid, Tanh, Square, Square Root, ZScore, Min-Max-Normalization, and Order-2: Sum, Difference, Product, Division. Following [9], we choose Random Forest and Logistic Regression (Lasso for regression), from Scikit-learn (http://scikit-learn.org), as our learning algorithms and use F1-score/1-RAE to measure performance. A 5-fold cross-validation (with the same seed for all experiments) using random stratified sampling is used to measure average performance. FeL is applied to the 20 testing datasets directly, while CAFEM is trained on the 100 training datasets by meta-learning. For Order-2 operators, as the number of candidate features is very large, FeL randomly samples a small batch (100) at each step.

To showcase the abilities of different FE algorithms, we compare the performance of FeL with the following approaches:

- Baseline: applies the learning algorithm on the original dataset (features) directly.
- Random-FeL (RS): applies a random strategy on the FTG, rather than the strategy learned by RL as in CAFEM, to find a set of features that maximizes P(D). This shows the effect of the FTG without RL and Meta-learning, and can be seen as a random graph search method on the FTG. As some graph search algorithms, such as depth-first search (DFS) or breadth-first search (BFS), are extremely time consuming [5], we do not compare FeL with DFS or BFS in this paper.
- Brute-force (BF): is inspired by DSM [3], OneBM [7] and [15]. It applies all transformation operators to all original features and performs feature selection on the augmented dataset (a top-down approach); a rough sketch is given after this list.
- LFE [9]: uses a Quantile Sketch Array (QSA) to generate the representation of each feature in classification problems. Following [9], a neural network with one hidden layer, L2 regularization and dropout is used to predict whether applying a transformation operator to a feature will yield at least a 1% improvement in model performance.
- FERL: organizes the FE problem into a Transformation Graph, where each node is either the original dataset D or a dataset transformed from D, and then uses Q-learning with linear approximation. We use the same settings as [5]. For Order-2 transformation operators, native FERL is extremely computationally expensive since the number of new features is very large; during the training stage, we prune branches in the Transformation Graph that would generate more than 10,000 new features in the next step, to make it trainable.

As the source code of these methods is not publicly available and some experimental details are not provided (such as the random seed of the learning algorithm and the train-test dataset split), we implemented all the benchmarks ourselves.
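For concreteness, here is a rough, self-contained sketch of the Brute-force (BF) baseline described above: expand the dataset with every operator applied to every original feature, then reduce it with a feature selection step. The dataset, the subset of operators and the selection method (model-based importance with a median threshold) are our own illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

order1_ops = {"log": np.log1p, "square": np.square,
              "sqrt": np.sqrt, "tanh": np.tanh}

# 1) Expand: apply all Order-1 operators to all original features.
derived = [op(X) for op in order1_ops.values()]
X_aug = np.hstack([X] + derived)

# 2) Reduce: feature selection on the augmented dataset.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold="median")
X_selected = selector.fit_transform(X_aug, y)
print(X_aug.shape, X_selected.shape)
```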
For all the FE approaches except Baseline, we evaluate performance with Order-1 and with Order-1 & Order-2 transformation operators, to compare their ability to handle simple and complex transformation operators. With Order-1 transformation operators, FeL outperforms all approaches on most datasets. On average, FeL improves performance by 4.2% on the test datasets. In the best two cases, Credit card and Pc4, FeL even improves baseline performance by 18.2% and 9.3%. One interesting phenomenon is that the Random method (random graph search on the FTG) obtains relatively high performance on some datasets. This indicates that the FTG represents the FE process in an effective way and contributes significantly to the subsequent learning of FE strategies.

With Order-1 & Order-2 transformation operators, the complexity of FE increases significantly, so an inefficient method can easily run out of time or memory, or fail to work at all. FeL improves performance by 6.9% on average compared with existing approaches. In the best two cases, Pc4 and Ilpd, it improves by 42.5% and 20.9%. As mentioned above, some FE approaches are strongly limited as the complexity of the transformation operators increases. Comparing the performance of each approach on Order-1 & Order-2 with that on Order-1, we found that LFE and Brute-force obtain worse performance (−1.54% on average) on half of the datasets, while FeL does not show any performance decrease. The FERL approach is extremely computationally expensive here: most of the datasets run out of time (36 h).

In order to showcase the robustness of FeL, we evaluate its performance with two learning algorithms on the 20 test datasets: Random Forest (a tree-based ensemble learning algorithm) and Logistic Regression (Lasso for regression; a generalized linear algorithm). FeL gains 10.8% and 4.2% performance increases on average with Logistic Regression and Random Forest, respectively. The improvement of FeL with Logistic Regression ranges from 0.2% to 25.8%; with Random Forest, it ranges from 0% to 18.2%. This shows that our algorithm is robust with respect to different learning algorithms.

One main aim of the Cross-data Component is to speed up FE learning. We evaluate FeL and CAFEM on the test datasets and, due to space limitations, show the comparison on 4 randomly chosen datasets. Figure 2 shows that CAFEM increases model performance more rapidly: it gains a high score within the first epoch and outperforms the best of FeL within around ten epochs. To our surprise, CAFEM also reaches better final model performance than FeL in most cases. We hypothesize that this is because CAFEM has learned general FE rules from a large set of datasets, which help the agent learn quickly on a new dataset and regularize its behavior.

In this paper, we have presented a novel framework called CAFEM to perform automatic feature engineering (FE) and transfer FE experience from a set of datasets to a particular one. It contains a Feature Transformation Graph (FTG) that organizes the FE process, a single-data FE learner (FeL) and a Cross-data Component. On most datasets, the framework outperforms state-of-the-art automatic FE approaches for both simple and complex transformation operators.
With the help of the Cross-data Component, CAFEM can speed up FE and increase FE performance. Moreover, the framework is robust to the choice of learning algorithm.

Acknowledgments. The work is supported by the National Natural Science Foundation of China (Grant Nos. 61702362, U1836214).

References
[1] A few useful things to know about machine learning
[2] Model-agnostic meta-learning for fast adaptation of deep networks
[3] Deep feature synthesis: towards automating data science endeavors
[4] ExploreKit: automatic feature generation and selection
[5] Feature engineering for predictive modeling using reinforcement learning
[6] Cognito: automated feature engineering for supervised learning
[7] One button machine for automating feature engineering in relational databases
[8] Human-level control through deep reinforcement learning
[9] Learning feature engineering for classification
[10] Reinforcement Learning: An Introduction
[11] The BigChaos solution to the Netflix grand prize. Netflix prize documentation
[12] Deep reinforcement learning with double Q-learning
[13] Quantiles over data streams: an experimental study
[14] Q-learning
[15] Towards automatic complex feature engineering