title: Evolving Long Short-Term Memory Networks
authors: Lobo Neto, Vicente Coelho; Passos, Leandro Aparecido; Papa, João Paulo
date: 2020-06-15
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50417-5_25

Machine learning techniques have been massively employed in the last years over a wide variety of applications, especially those based on deep learning, which obtained state-of-the-art results in several research fields. Despite this success, such techniques still suffer from some shortcomings, such as the sensitivity to their hyperparameters, whose proper selection is context-dependent, i.e., the model may perform better over each dataset when using a specific set of hyperparameters. Therefore, we propose an approach based on evolutionary optimization techniques for fine-tuning Long Short-Term Memory networks. Experiments were conducted over three public word-processing datasets for part-of-speech tagging. The results showed the robustness of the proposed approach for the aforementioned task.

Machine learning techniques have achieved outstanding outcomes in the last years, mostly due to the notable results obtained using deep learning techniques in a wide variety of applications [1], ranging from medicine [2] to route obstruction detection [3]. In this context, a specific kind of model, the so-called Recurrent Neural Network (RNN), has been extensively employed to model temporal-dependent data, such as stock market forecasting and applications that involve text and speech processing. Such networks are time-dependent models whose neurons receive recurrent feedback from others, in contrast to densely connected models (e.g., the Multilayer Perceptron). In short, the model's current output at time t depends on its input as well as the output obtained at time t − 1. Due to this intrinsic characteristic, this kind of network deals very well with data that carries temporally-related information, such as videos, audio, and text.

In natural language, each word in a sentence performs a specific function, e.g., a verb indicates an action, an adjective adds some attribute to a noun, and a noun names a being or thing. Natural languages, such as English, Portuguese, and any other, possess a considerable number of words, and many of them admit ambiguous meanings, i.e., the same word can hold two or more different connotations. To illustrate the idea, consider the word "seed": it may stand for a noun, referring to the reproductive unit of a plant, as well as denote a verb, representing the act of sowing the soil with seeds. To handle such ambiguities, one can make use of the Part-of-Speech (PoS) tagging process. The method aims at, given a sentence, identifying the grammatical function of each compounding word inside it, as depicted in Fig. 1. Understanding the true meaning of a word in a sentence is essential to thoroughly comprehend the message conveyed through it. Although it is a trivial task for human beings, automatic PoS tagging presents itself as quite challenging for computer systems. Among the various ways of solving this task, one can cite rule-based algorithms [29] as well as Hidden Markov Models [30]. Finally, one should notice the importance of contextual information in order to assign a tag to a word correctly. One approach that yields good results while addressing such context employs machine learning techniques through Recurrent Neural Networks.
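To make the task concrete, the snippet below tags a short sentence containing the ambiguous word "seed" with NLTK's off-the-shelf tagger. It is purely illustrative and not part of the proposed approach, and the tagger resource name may differ across NLTK versions (e.g., 'averaged_perceptron_tagger_eng' in newer releases).

```python
# Illustrative sketch only: PoS tagging with NLTK's pre-trained tagger.
# The resource name below works for older NLTK releases; newer ones may
# require 'averaged_perceptron_tagger_eng' instead.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = "Farmers seed the field and later harvest each seed".split()
print(nltk.pos_tag(tokens))
# Ideally, the first "seed" is tagged as a verb and the second as a noun,
# i.e., the tag resolves the word's grammatical function from its context.
```

Taggers of this kind rely mostly on local features rather than the longer-range context that recurrent models can capture.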
Therefore, the next section introduces the LSTM, the current state-of-the-art approach for natural language processing.

Although LSTMs present an architecture similar to traditional RNNs, they employ a different function for computing their hidden states. Such a modification is essential for avoiding problems such as the well-known vanishing gradient, which can occur while relating current information to long-term network knowledge, the so-called long-term dependency. The LSTM addresses this problem by using structures called cells, which receive as input their previous states as well as the current input. Internally, these cells decide what should be kept and forwarded to the next iterations, and what should be erased from memory. Therefore, they are capable of combining the input information with previous states, as well as the current memory.

A cell is composed of states, as well as three gates: (i) the input gate, which is responsible for deciding what information is relevant and how much of it will be updated or deleted; (ii) the forget gate, used when the network needs to erase any previous information present in the cell's memory in order to store new information; and (iii) the output gate, which is employed to compute the cell's state [31]. Besides, every gate is activated by a sigmoid function, and the cell's output is subject to a combination of the gates' outputs, which decides whether information should be forgotten or kept in the cell's memory.

Let f_t, i_t, o_t ∈ R^{m×1} be the outputs of the forget, input, and output gates at time step t, respectively, which can be computed as follows:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f),   (1)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i),   (2)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o),   (3)

where σ(·) denotes the sigmoid activation function, x_t ∈ R^{n×1} denotes the input vector at time step t, h_{t−1} ∈ R^{m×1} stands for the output of the previous cell, W_f, W_i, W_o ∈ R^{m×n} and U_f, U_i, U_o ∈ R^{m×m} are the weight matrices, and b_f, b_i, b_o ∈ R^{m×1} correspond to the biases of each gate. Further, one can compute the cell's state c_t ∈ R^{m×1} at time step t considering the forget and input gates' outputs as follows:

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh(W_c x_t + U_c h_{t−1} + b_c),   (4)

where ⊗ stands for the Hadamard product, W_c ∈ R^{m×n} and U_c ∈ R^{m×m} are weight matrices, and b_c ∈ R^{m×1} corresponds to the bias of the cell's state. Therefore, one can compute the output h_t ∈ R^{m×1} of a cell at time step t as follows:

h_t = o_t ⊗ tanh(c_t).   (5)

Finally, the weights of an LSTM network are updated using a gradient-based backward pass as follows:

W_f^{t+1} = W_f^t − α δ_f^t,   (6)
W_i^{t+1} = W_i^t − α δ_i^t,   (7)
W_o^{t+1} = W_o^t − α δ_o^t,   (8)

where α ∈ R is the learning rate, δ_f^t, δ_i^t, and δ_o^t denote the partial derivatives of the loss with respect to W_f, W_i, and W_o at time step t, and δ_f^{t,z}, δ_i^{t,z}, and δ_o^{t,z} correspond to the z-th columns of matrices δ_f^t, δ_i^t, and δ_o^t, respectively. Further, the bias of each gate is also updated in a similar fashion, and the gradient is propagated to the former layers using the backpropagation chain rule. Finally, the weight matrices U_f^t, U_i^t, and U_o^t are updated similarly to their respective counterparts, i.e., Eqs. 6, 7, and 8.
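For concreteness, the forward computations in Eqs. (1)-(5) can be summarized in the following minimal NumPy sketch; it is an illustration with assumed shapes and random placeholder weights, not the implementation used in the experiments.

```python
# A minimal sketch of a single LSTM cell step, mirroring Eqs. (1)-(5) above.
# Shapes follow the text: W_* in R^{m x n}, U_* in R^{m x m}, b_* in R^{m x 1}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step; W, U, b are dicts keyed by 'f', 'i', 'o', 'c'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate, Eq. (1)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate, Eq. (2)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                  # cell output, Eq. (5)
    return h_t, c_t

# Usage with random placeholder weights (m = 4 hidden units, n = 3 input features):
m, n = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(m, n)) for k in 'fioc'}
U = {k: rng.normal(size=(m, m)) for k in 'fioc'}
b = {k: np.zeros((m, 1)) for k in 'fioc'}
h, c = lstm_step(rng.normal(size=(n, 1)), np.zeros((m, 1)), np.zeros((m, 1)), W, U, b)
```

In practice such cells are provided by deep learning libraries; the sketch only makes the gate interactions explicit.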
In the context of parameterized machine learning techniques, we must refer to two different denominations: (i) parameters and (ii) hyperparameters. Typically, the first term represents low-level parameters that are not controlled by the user, such as the connection weights in neural networks. The second term refers to high-level parameters that can be adjusted and chosen by the user, such as the learning rate and the number of layers, among others. Both are of crucial importance for improving the performance of neural models. Regarding LSTM parameter optimization, the well-known backpropagation algorithm with gradient descent presents itself as a suitable method for the task. However, a proper selection of the hyperparameters poses a more challenging task, since it requires prior knowledge of the problem from the user. Hence, an automatic method to select the model's most relevant hyperparameters is strongly desirable, namely: the learning rate, which denotes the step size towards the gradient direction; the embedding layer size, which is responsible for the representation of words and their relative meanings; and the LSTM size itself, which denotes the number of LSTM cells composing the model.

Therefore, this work proposes employing evolutionary algorithms to fine-tune such hyperparameters. These methods are capable of automatically selecting a good set of hyperparameters by randomly initializing a set of candidate solutions that evolve over time. In this context, each candidate solution is represented by a vector, where each position denotes a hyperparameter to be fine-tuned. In this work, for instance, a 3-dimensional representation is considered, denoting the learning rate, the embedding layer size, and the LSTM size. Further, every solution is evaluated after each evolution cycle. This evaluation consists of training the LSTM network using the candidate solution's set of hyperparameters together with the training set. Afterward, the trained model is employed to classify the evaluation set, and the loss obtained over this procedure, denoted the fitness function, is used to evaluate and update the solutions towards a minimum (ideally, the global minimum). At the end of the process, it is expected that the model learns a reasonably good set of hyperparameters. Figure 2 depicts the model adopted in this work.

This section briefly describes the four metaheuristic optimization techniques employed in this work concerning the task of LSTM hyperparameter fine-tuning.

GA is a specific case of evolutionary algorithms that employs concepts and expressions from natural genetics and models each possible solution as a chromosome. Among a variety of implementations of the model, this work represents each individual as an array composed of the hyperparameters to be optimized, leaving aside the binary version. Moreover, the following evolution operators were employed (a minimal sketch of the resulting evolution loop is given right after this list):

1. Elitism: it retains the individuals with the best fitness values of the current population to accelerate convergence;
2. Random selection: it adds some random elements of the current population to the new one to promote greater genetic diversity;
3. Mutation: it modifies a random position of individuals of the new population, generating values within the allowed limits; and
4. Crossover: it randomly selects two individuals from the population and creates two new individuals through single-point crossover.
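A minimal sketch of the GA evolution loop built from the four operators above is given below, assuming real-valued candidates and the rates reported in the experimental setup; it is an illustration, not the authors' implementation, and `fitness` is assumed to train the LSTM with a candidate's hyperparameters and return the validation loss (a sketch of such a function appears in the experimental section).

```python
# Sketch of the GA loop over 3-dimensional candidates [learning rate, E, L].
import random

# Assumed ranges: learning rate in [0.001, 0.1]; embedding and LSTM sizes are
# searched as real values and later rounded to the nearest power of 2.
BOUNDS = [(0.001, 0.1), (2, 1024), (2, 1024)]

def random_candidate():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def mutate(ind, rate=0.01):
    # Mutation: modify a random position, keeping the value within the allowed limits.
    if random.random() < rate:
        j = random.randrange(len(ind))
        ind[j] = random.uniform(*BOUNDS[j])
    return ind

def crossover(a, b):
    # Single-point crossover between two parents.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def evolve(population, fitness, elite_rate=0.2, random_rate=0.05, mutation_rate=0.01):
    ranked = sorted(population, key=fitness)                     # lower loss is better
    n_elite = int(elite_rate * len(ranked))
    new_pop = ranked[:n_elite]                                   # 1. elitism
    new_pop += [ind for ind in ranked[n_elite:]
                if random.random() < random_rate]                # 2. random selection
    while len(new_pop) < len(population):                        # 4. crossover fills the rest
        parent_a, parent_b = random.sample(population, 2)
        for child in crossover(parent_a, parent_b):
            new_pop.append(mutate(child, mutation_rate))         # 3. mutation
    return new_pop[:len(population)]

# Usage: 10 individuals evolving for 20 iterations, as in the experimental setup.
population = [random_candidate() for _ in range(10)]
# for _ in range(20):
#     population = evolve(population, fitness)
```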
GP employs the very same idea behind the GA, except for the data structure, which is represented by an expression tree. In this model, each tree's leaf node contains an array, similar to the ones applied in GA, representing the hyperparameters to be optimized. Further, inner nodes stand for mathematical operators, such as addition, multiplication, and logarithm, among others.

Despite the advantages of GP, since it applies purely syntactic operators, which do not take into account the values of the individuals' decision variables, it may perform very abrupt changes in the values of the variables as the evolution process occurs, which may adversely affect convergence. Therefore, GSGP considers the evolution of individuals by selecting semantically close solutions [27]. The procedure is performed by introducing some restrictions in the mutation and crossover operators.

SGP is a variation of GP that, instead of using a tree data structure, employs two stacks: one for the data input and one for the output. Thus, the SGP input stack is composed of a set of operations and operands responsible for describing a mathematical expression, similar to GP, while the output stack contains the result of the expression evaluation [28]. Since there are no syntactic constraints in this technique, individuals can be represented as vectors. Besides, it is possible to use the same genetic operators as the standard Genetic Algorithm for mutation and crossover. Hence, the technique combines the high variability of individuals provided by GP with the simplicity of GA's genetic operators, generally providing better results than GP, depending on the implementation and the problem addressed.

This section presents the methodology employed in the experiments, i.e., the datasets, the experimental setup, and the network modeling concerning the PoS tagging problem.

This work employs three well-known public datasets for the task of Part-of-Speech tagging: the Brown Corpus of Standard American English, the Floresta Sintá(c)tica, and the Penn Treebank. Such datasets are composed of natural language text fragments, and they are available in the Natural Language Toolkit (NLTK) library [32].

Since LSTM networks require a vector of real numbers as input, the datasets were pre-processed such that the words were encoded into a numerical dictionary, where each word corresponds to a non-zero number. Besides, the input is composed of an n-dimensional vector, i.e., all sentences should have the same size. Therefore, n was selected based on the size of the longest sentence, and a padding approach was employed to fill all non-occupied positions with zeros. Further, each sentence is represented by an array of tuples following the pattern (word, PoS tag). Later, these tuples are split, and the tags are converted into their categorical (one-hot) representation to match the LSTM output format.

Therefore, the sequential architecture employed in this work for the task of PoS tagging is composed of an embedding layer, an LSTM layer, and a dense output layer, in which the Softmax is employed as the activation function. Concerning the loss function, the model aims to minimize the categorical cross-entropy, which is the most commonly applied approach for text classification tasks.

This work compares the performance of four well-known evolutionary optimization algorithms, i.e., GA, GP, GSGP, and SGP, as well as a random search, for the task of LSTM hyperparameter fine-tuning in the context of PoS tagging. The experiments were performed over 400 randomly selected sentences from three public datasets commonly employed for the task: Brown, Floresta, and Treebank. Further, these sentences were partitioned into three sets, i.e., training, validation, and testing, using the percentages of 60%, 20%, and 20%, respectively.

This work focuses on finding the set of hyperparameters that minimizes the categorical cross-entropy loss function, thus maximizing the LSTM performance for the task of PoS tagging using evolutionary metaheuristic algorithms. Therefore, the problem was designed to search for three hyperparameters: (i) the LSTM layer size L and (ii) the embedding layer size E, both selected from a closed set of powers of 2 ranging from 2^1 to 2^10, i.e., {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}, and (iii) the learning rate α ∈ [0.001, 0.1].
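As an illustration of how a candidate solution can be mapped to the architecture and fitness described above, consider the following sketch, assuming the Keras/TensorFlow stack cited in the experiments; the SGD optimizer, the masking of padded positions, and all names are assumptions rather than the authors' exact implementation.

```python
# Sketch: decode a candidate (learning rate, E, L), build the embedding-LSTM-softmax
# model, and return the validation categorical cross-entropy as the fitness value.
import math
import tensorflow as tf

def nearest_power_of_two(value):
    # The metaheuristics handle real values; E and L are decoded to the nearest power of 2.
    return 2 ** int(round(math.log2(value)))

def build_model(vocab_size, n_tags, emb_size, lstm_size, learning_rate):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, emb_size, mask_zero=True),  # zero is the padding index
        tf.keras.layers.LSTM(lstm_size, return_sequences=True),           # one output per word
        tf.keras.layers.Dense(n_tags, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def fitness(candidate, x_train, y_train, x_val, y_val, vocab_size, n_tags, epochs=20):
    learning_rate, emb, lstm = candidate
    model = build_model(vocab_size, n_tags,
                        nearest_power_of_two(emb), nearest_power_of_two(lstm),
                        learning_rate)
    model.fit(x_train, y_train, epochs=epochs, verbose=0)
    loss, _ = model.evaluate(x_val, y_val, verbose=0)   # categorical cross-entropy
    return loss                                          # value minimized by the metaheuristics
```

The returned loss is the quantity each metaheuristic minimizes during the evolution cycles.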
Since the metaheuristic techniques work with real numbers only, one should convert them to the nearest power of 2 when dealing with L and E. Concerning the optimization procedure, each technique employed 10 individuals evolving during 20 iterations and considered a retention rate of 20%, a random selection rate of 5%, and a mutation rate of 1%. Moreover, GP and GSGP employed expression trees with a maximum tree height of 5, while SGP employed arrays with 10 elements. Also, they implemented the addition, subtraction, multiplication, division, modulo, negation, logarithm, and square root operators. Notice that the latter two were not employed in GP since they adversely affected convergence. Finally, these techniques were compared against a random search, which stands as the experiments' baseline. Afterward, the best hyperparameters obtained during the optimization process are employed to train the network for 200 iterations for further classification. Besides, each experiment was performed over 20 runs for statistical analysis through the Wilcoxon signed-rank test [33] with a significance of 5%. Finally, the experiments were conducted using both the Keras open-source library [34] and TensorFlow [35].

This section presents the experimental results. Table 1 provides the mean accuracy and the standard deviation concerning the Brown Corpus of Standard American English, Floresta Sintá(c)tica, and Penn Treebank datasets. Notice that the best results, according to the Wilcoxon signed-rank test, are presented in bold. From these results, it is possible to extract two main conclusions: (i) the metaheuristic optimization techniques performed better than a random search over the three datasets, which validates the employment of such methods for the task; and (ii) the tree-based algorithms, i.e., GP, GSGP, and SGP, performed better than GA in two out of three datasets. Such behavior is expected since these techniques employ more robust approaches to solve the optimization problem.

Considering the evaluation of the learned models, Fig. 4 depicts the convergence of the techniques over the Brown, Floresta, and Treebank datasets, respectively. Concerning the Treebank dataset, although GSGP obtained the best values in the initial iterations, it is outperformed by GA at the third iteration, whose convergence keeps improving until reaching the stopping criterion of 20 iterations. Moreover, it can be observed that GSGP and SGP do not present significant progress during the iterations, getting stuck on local optima and even obtaining worse results than the ones achieved in previous iterations. Finally, considering the GP algorithm, despite its slightly positive convergence, it was not capable of outperforming GA in any of the three datasets.

It is interesting to notice that, although GA outperformed the tree-based methods on the validation set when trained for 20 iterations, the tree-based methods obtained the best results when the same hyperparameters were considered over the testing set with 200 training iterations. Such a phenomenon may suggest the model becomes overfitted when trained with the GA set of hyperparameters.

In this paper, we addressed the problem of LSTM hyperparameter fine-tuning through evolutionary metaheuristic optimization algorithms in the context of PoS tagging. For this task, four techniques were compared, i.e., the Genetic Algorithm, Genetic Programming, Geometric Semantic Genetic Programming, and Stack-based Genetic Programming, as well as a random search, employed as the baseline.
Experiments conducted over three well-known public datasets confirmed the effectiveness of the proposed approach, since all four metaheuristic algorithms outperformed the random search. Further, although GA presented a smoother convergence during the optimization steps, the tree-based evolutionary techniques obtained the best loss and accuracy results when considering a more extended training period over the testing sets. Such results highlight that complex mathematical operations, like the ones performed by GP, GSGP, and SGP, contribute to the convergence of the model in this context. Regarding future work, we intend to validate the proposed approach in contexts other than PoS tagging, whether textual or not, as well as with different artificial neural networks.

References

1. Deep learning
2. A hybrid approach for breast mass categorization
3. A novel siamese-based approach for scene change detection with applications to obstructed routes in hazardous environments
4. Neural networks and physical systems with emergent collective computational abilities
5. Bag of samplings for computer-assisted Parkinson's disease diagnosis based on recurrent neural networks
6. Long short-term memory
7. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
8. Dependency based embeddings for sentence classification tasks
9. Long short-term memory recurrent neural network architectures for large scale acoustic modeling
10. LSTM: a search space odyssey
11. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks
12. Embedding gravitational search algorithms in convolutional neural networks for OCR applications
13. GSA: a gravitational search algorithm
14. Gradient-based learning applied to document recognition
15. Fine-tuning convolutional neural networks using harmony search
16. Music-Inspired Harmony Search Algorithm: Theory and Applications, 1st edn.
17. Learning parameters in deep belief networks through firefly algorithm
18. Firefly algorithm, stochastic test functions and design optimisation
19. A fast learning algorithm for deep belief nets
20. Fine tuning deep Boltzmann machines through meta-heuristic approaches
21. Temperature-based deep Boltzmann machines
22. An efficient learning procedure for deep Boltzmann machines
23. Genetic algorithm-optimized long short-term memory network for stock market prediction
24. Particle swarm optimization-based CNN-LSTM networks for forecasting energy consumption
25. Genetic algorithms and machine learning
26. Genetic Programming: On the Programming of Computers by Means of Natural Selection
27. Geometric semantic genetic programming
28. Stack-based genetic programming
29. A simple rule-based part of speech tagger
30. Robust part-of-speech tagging using a hidden Markov model
31. Learning to forget: continual prediction with LSTM
32. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics
33. Individual comparisons by ranking methods
34. Keras
35. TensorFlow: large-scale machine learning on heterogeneous systems, software available from tensorflow

Acknowledgments. The authors are grateful to FAPESP grants #2013/07375-0, #2014/12236-1, #2017/25908-6, #2018/10100-6, and #2019/07665-4, as well as CNPq grants #307066/2017-7 and #427968/2018-6.