key: cord-0108086-8sndsphi authors: Vazan, Milad title: Joint Learning for Aspect and Polarity Classification in Persian Reviews Using Multi-Task Deep Learning date: 2022-01-17 journal: nan DOI: nan sha: e2838d96c45e025a66adf0027197b48c494c36c0 doc_id: 108086 cord_uid: 8sndsphi

This paper focuses on two sub-tasks of aspect-based sentiment analysis in the Persian language: aspect category detection (ACD) and aspect category polarity (ACP). Aspect-based sentiment analysis is important and useful because it can identify all aspects discussed in a text, and it is most useful when it identifies the polarity of those aspects as well. Most previous methods either solve only one of these sub-tasks or use two separate models. The process is therefore pipelined: aspects are identified first, and polarities afterwards. In practice, such pipelines propagate errors, since ACD mistakes are passed on to ACP, which makes them unsuitable for practical applications. In this paper, we propose a multi-task learning model based on deep neural networks that detects aspect categories and their polarities concurrently. We evaluated the proposed method with different deep learning models on a Persian-language dataset in the movie domain. The final experiments show that the CNN model achieves better results than the other models. The reason is CNN's capability to extract local features: since sentiment is expressed through specific words and phrases, CNN identified them more efficiently in this dataset.

With the rise of Web 2.0 tools and the advent of social media, their use has become an essential daily activity in today's society. Hence, huge amounts of data are generated daily by users of these media.
Social media in its various forms, such as blogs, microblogs, and social networks, has created a platform for interaction, conversation, and the posting of ideas and opinions. The comments that users provide on social networks are very useful. In an online store, the opinions and views about a product reflect customer satisfaction and product quality, which can be an excellent guide for other buyers. In addition, these online comments can be used to predict election results. It is not easy to categorize and organize this vast volume of views on one specific topic manually. Hence, the need for an automated system for collecting opinions led to the emergence of a new field of research called sentiment analysis. Sentiment analysis is a subfield of natural language processing [1] [2] [31] whose main objective is to extract the sentiments expressed in a text. Sentiment analysis is generally researched at three levels: document level, sentence level, and aspect level. Sentiment analysis at the document and sentence levels assumes that the whole text contains opinions about only one subject. This assumption does not hold in many cases [3]. A sentence can contain opinions on different aspects with different polarities. Hence, by assuming that the whole sentence expresses a single positive or negative opinion, we cannot get a correct assessment of a person's opinions. In contrast, Aspect-Based Sentiment Analysis (ABSA), also called Aspect-Level Sentiment Analysis (ALSA), allows us to identify the commenter's point of view on each feature of the entity mentioned in the text [3] [4] [5]. Aspect-based sentiment analysis is generally divided into four sub-tasks [6] [7] [8]: Aspect Term Extraction (ATE), Aspect Term Polarity (ATP), Aspect Category Detection (ACD), and Aspect Category Polarity (ACP).
While ATE extracts the terms that refer to an aspect, ATP determines the polarity (positive, negative, or neutral) of each term extracted by ATE. ACD, on the other hand, assigns to a single review a subset of a predefined set of categories [4, 6, 8, 9, 10, 11, 12, 13, 14, 15]. The last sub-task of aspect-based sentiment analysis, ACP, determines the polarity of each category identified by ACD [7] [8]. Table 1 shows the difference between these four sub-tasks with an example. A lot of work has been done on sentiment analysis of reviews. However, due to several challenges, relatively little work has been done on aspect sentiment classification [11]. Since the categories of an aspect are usually not mentioned explicitly in the text, identifying the aspect category is more challenging than extracting the aspect term [4, 7, 11]. Because the aspect term is stated explicitly in the sentence, extracting it is much easier than identifying the aspect category [15]. Most research on aspect-based sentiment analysis has been conducted in a supervised manner. Researchers often extract features manually and use machine learning algorithms to train the model. However, such methods require feature engineering or extensive linguistic resources. In recent years, deep neural networks, known as deep learning, have shown remarkable performance in various natural language processing tasks, such as text classification, text summarization, question answering, and many more. The principal advantage of these methods is that they can learn text features automatically [16]. Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) are two widely used networks in this field. Most of the previous methods focus on solving only one of these sub-tasks.
Pipeline methods first identify aspects and then identify polarity. Such methods are unsuitable for practical applications because errors propagate through the pipeline: ACD errors are passed on to ACP. Because of the importance of aspect-based sentiment analysis, and especially of its two sub-tasks ACD and ACP, this article focuses on them. Since most previous work solves these two sub-tasks separately, this research presents a new strategy that uses multi-task deep learning to solve ACD and ACP simultaneously for aspect-based sentiment analysis of Persian comments. To do this, various deep learning models were examined in order to select the best structure. The proposed method was evaluated on a data set of comments on Persian movies. In the following, after reviewing the relevant work and the problem background, we describe the proposed method and the evaluation results in detail.

Recent advances in artificial neural network technology, especially deep learning, have greatly influenced the field of natural language processing. In many areas of natural language processing, deep learning has led to results exceeding those previously achieved by traditional machine learning and statistical methods [17]. In traditional machine learning, features were extracted manually through feature engineering [32] [33] [34] [35], which requires background knowledge and is a time-consuming and costly process. Deep learning, a subfield of machine learning, is also known as representation learning or feature learning. Representation learning is a technique that enables the machine to detect features from raw data automatically, without human intervention. This critical ability of deep networks, which makes them superior to traditional machine learning methods, is made possible by the many layers in their structure.
This ability of deep networks to extract features automatically quickly made them popular, and they replaced traditional machine learning methods. In the remainder of this section, we give a brief overview of some deep network architectures.

Convolutional neural networks (CNNs) are a branch of deep neural networks suited to processing data with a grid-like topology. A CNN consists of several convolutional and pooling layers, followed by one or more fully connected layers [18] [19]. CNNs are common in computer vision because of the repetitive patterns in images. They are also used to process natural language because text contains repeating patterns as well. Instead of image pixels, the text input in natural language processing is represented as a matrix in which each row is the vector of one word [18]. These vectors are usually obtained by word embedding or one-hot encoding. Figure 1 shows a convolutional neural network architecture for text (sentiment) classification. As the figure shows, the input to this network is a matrix whose rows are the word vectors and whose number of columns is the dimension of those vectors. Feature vectors are obtained by applying two filters of sizes 2 and 4 to the input matrix. Once the feature vectors are obtained, the max pooling operator is applied to reduce the size of the features. Finally, the classification is done using a fully connected layer and an activation function.

In real-world scenarios, the semantic content of a word is frequently influenced by the meaning of the words that precede it in the text. CNNs cannot capture this dependency because they consider each word in the text independently [21]. Recurrent neural networks (RNNs) are designed to work with sequential data and are an excellent alternative for tasks in which such interdependence is essential.
Neurons in RNNs are connected like a chain, and each of them sends a message to its successor; this sequential transfer of information creates a kind of "memory" [18, 22]. However, these networks cannot maintain long-term dependencies because of the vanishing gradient problem. Hence, a new neural network for long-term dependencies, called long short-term memory (LSTM), was introduced. Components called cell states and gates enable LSTMs to maintain long-term dependencies and create a memory-based architecture. Figure 2 shows the internal structure of an LSTM network. Efforts to simplify LSTM led to another type of recurrent neural network called the gated recurrent unit, or GRU for short. GRU addresses the vanishing gradient problem of RNNs with two gates, the update gate and the reset gate, whereas LSTM has three gates. These two gates decide what information should be passed to the output. Figure 3 shows an overview of the internal structure of a GRU network.

Bidirectional neural networks were constructed to improve the performance of neural networks. In a Bidirectional LSTM (Bi-LSTM), training is performed with two LSTMs. The first network learns the input sequence as provided, while the other learns the reversed input sequence. Since there are two trained models at the end, the outputs of both networks are combined to form the final output.

Word embedding is a mapping of words to real-valued vectors. These vectors are smaller than the vectors produced by statistical approaches to converting words into numbers in natural language processing. If well trained, they can also capture semantic and syntactic relationships between words. Word embedding is the cornerstone of much natural language processing work that uses deep learning. Word embedding can be achieved in two general ways.
The first approach is to learn the embedding together with the main task, as part of the network structure. The second approach is to use a pre-trained embedding that has already been trained by specific algorithms.

Traditional machine learning methods optimize a specific measure for a single task. To achieve this goal, a model is trained and its hyperparameters are fine-tuned. Although training the model in this way can give a satisfactory result, some information that would help to improve the performance of the model is ignored. In other words, we ignore the knowledge that could be gained from the training signals of related tasks. To make effective use of this information, a new approach called multi-task learning was proposed [23] [24] [25] [26]. Multi-task learning aims to learn several different but related tasks jointly and simultaneously in order to maximize the efficiency of the model. This is done by sharing information between the tasks. Each task can benefit from the others, which increases the model's efficiency. In general, whenever you train an optimization problem with more than one loss function, or part of your loss function stems from another task, you are using multi-task learning [23, 26, 27, 28]. In classification, two concepts are easily confused: multi-class classification, which means classification with more than two classes, and multi-label classification, which assigns a set of target labels to each instance. Since both have the prefix "multi", both may be mistaken for multi-task learning. However, this is not true; when it comes to multi-task learning, we refer to multi-label classification [27]. Two general approaches are used to perform multi-task learning in deep learning: hard parameter sharing and soft parameter sharing. Hard parameter sharing is the most common method of multi-task deep learning [24].
In this method, the hidden layers are shared between all tasks, while each task has its own output layer. In this way, the parameters of the shared hidden layers are forced to generalize to all tasks. This lowers the risk of overfitting on each specific task. In other words, the more tasks are trained simultaneously, the more the model has to find a representation that captures all of them, which reduces overfitting on the main task. In soft parameter sharing, on the other hand, each task has its own model with its own parameters, that is, its own set of weights and biases. The distance between the parameters of the different models is then adjusted so that the parameters remain similar [23] [24] [26] [27]. Figure 4 shows the structures of these two methods.

Kumar et al. [11] used an association rule technique to identify the aspect categories. To deal with the limitations of these statistical rules, they proposed an approach based on combined rules, which consist of associative rules and semantic rules. They used word embedding for the semantic rules. They concluded that adding and combining these additional rules increases the accuracy of the classification. Ghaderi et al. [9] presented a supervised model called Language-Independent Category Detector (LICD), which detects aspect categories based on text matching, without special tools or manually extracted features. They developed their model under two assumptions. The first is that a sentence should be assigned to a category if there is a high semantic similarity between the sentence and a set of words representing that category. The second is that a sentence belongs to a category if it is semantically and structurally similar to a sentence related to that category. They used the soft cosine distance for the first assumption and the word mover's distance for the second. Movahedi et al.
[13] presented a deep learning model based on the attention mechanism that can identify different categories by focusing on different parts of a sentence. Xue et al. [12] proposed a model based on the convolutional neural network and a gating mechanism, which performed better than the LSTM. The main idea of this method is to use Tanh-ReLU gate units, which can selectively produce the polarity output depending on the aspect and entity. To address the dependence of supervised approaches on labeled data, Fu et al. [30] proposed a semi-supervised approach based on the variational autoencoder and the attention mechanism. In addition, to learn each word better, they added the word's sentiment vector to the input alongside the word itself. Their results showed that their models could obtain more accurate sentiments for a given aspect. Ruder et al. [29] proposed a hierarchical model for aspect-based sentiment analysis. They demonstrated that a model that considers the structure of the review and the sentential context in its predictions can perform better than models that rely solely on sentence information, and can achieve performance competitive with models that use large external sources.

In this section, we first define the two problems of aspect category detection and aspect category polarity. Then, we present our proposed method, which detects aspect categories and their polarities simultaneously. In aspect category detection, given a list of predefined categories, the goal is to assign a subset of these categories to a text (sample). Since each text can contain a set of categories, this corresponds to a multi-label classification problem. In aspect category polarity, given the set of categories identified for a text, the goal is to assign a positive or negative polarity to each of them.
By changing the data labeling, or in other words by adding a neutral class, the proposed method gives the model the ability to solve the two sub-tasks of aspect category detection and aspect category polarity simultaneously using deep multi-task learning. To do this, we leave the aspect category classes unchanged and only add a neutral class to the aspect polarity classes. Accordingly, each aspect category has three polarity classes: positive, negative, and neutral. Neutral here means that the comment does not belong to this category. The neutral class is needed because the final classification cannot be decided with only the positive and negative polarity classes. More precisely, since this is a multi-label classification problem in which each instance may belong to one or more categories, we create a third class so that the classifier can detect when an instance does not belong to a category. Without a neutral class, a probability of being positive or negative would have to be assigned for every aspect category. That is, every opinion would be forced to be positive or negative about every aspect category, which is clearly undesirable. With the neutral class, the model can recognize whether or not a comment belongs to a category. Figure 5 shows an example of labeling with this method. In the design of the model for the proposed method, the SoftMax function must be used in the output layer because each category has three different classes: positive, negative, and neutral. Since there are nine categories, nine SoftMax functions with three neurons each were used. Figure 6 shows the general architecture of the proposed method, which is based on hard parameter sharing, for jointly learning the two sub-tasks of aspect category detection and aspect category polarity.
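The labeling scheme described above can be illustrated with a minimal sketch. It assumes that each category is encoded as a one-hot vector over (positive, negative, neutral), that a category not mentioned in a comment is labeled neutral, and that a prediction is decoded by taking the most probable class of each SoftMax output. The category names, the index order of the three classes, and the helper names `encode_labels` and `decode_predictions` are hypothetical and do not come from the paper.

```python
from typing import Dict, List

# Assumed index order of the three classes in each SoftMax output (not from the paper).
POLARITY_TO_INDEX = {"positive": 0, "negative": 1, "neutral": 2}
INDEX_TO_POLARITY = {index: name for name, index in POLARITY_TO_INDEX.items()}


def encode_labels(annotations: Dict[str, str], categories: List[str]) -> List[List[int]]:
    """Build one three-class one-hot target per category.

    Categories that the comment does not mention receive the neutral class.
    """
    targets = []
    for category in categories:
        polarity = annotations.get(category, "neutral")
        one_hot = [0, 0, 0]
        one_hot[POLARITY_TO_INDEX[polarity]] = 1
        targets.append(one_hot)
    return targets


def decode_predictions(probabilities: List[List[float]], categories: List[str]) -> Dict[str, str]:
    """Turn per-category class probabilities back into detected aspects and polarities.

    A category is reported only if its most probable class is not neutral,
    so detection and polarity are read from the same output.
    """
    detected = {}
    for category, probs in zip(categories, probabilities):
        best = max(range(len(probs)), key=lambda i: probs[i])
        polarity = INDEX_TO_POLARITY[best]
        if polarity != "neutral":
            detected[category] = polarity
    return detected


# Hypothetical category names, used only for illustration.
categories = ["acting", "story", "music"]
targets = encode_labels({"acting": "positive", "music": "negative"}, categories)
predicted = decode_predictions(
    [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.6, 0.2]], categories
)
```

With nine categories, as in the paper, `encode_labels` would produce nine such three-element targets, one for each SoftMax output.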
5 Experiment

5-1 Dataset

Every machine learning system needs a suitable data set to learn from. Unfortunately, there is no comprehensive and unified Persian data set for aspect-based sentiment analysis. Therefore, for this study, a collection of social media users' opinions about movies was gathered from cinematicket.org. The training set contains 2000 samples and the test set 200 samples, with nine classes in both positive and negative polarities. Table 2 shows the distribution of the data across the classes.

Figure 6. The general architecture of the proposed method.

Since the problem under discussion is multi-label classification, we must use appropriate measures to evaluate the models. These measures differ from single-label classification measures. We used three measures to evaluate the efficiency of the models: subset accuracy, the Jaccard index (multi-label accuracy), and Hamming loss. Subset accuracy is one of the most stringent measures: the predicted set must be exactly the same as the actual set. Equation 1 shows how to calculate it. The Jaccard index, also known as multi-label accuracy, is defined as the number of correctly predicted labels divided by the size of the union of the predicted and actual labels. Equation 2 shows how this evaluation measure is calculated. Hamming loss is based on the symmetric difference between the actual output and the classifier output. In other words, it gives the fraction of labels that were not correctly predicted for a sample. Equation 3 shows how this criterion is calculated. In Equations 1 to 3, Y_i represents the actual output for sample i, h(x_i) the classifier output, and n the number of samples in the data set.

In this study, we used four basic deep learning models, namely CNN, LSTM, Bi-LSTM, and GRU, to train and then evaluate the results.
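As a concrete illustration of the three evaluation measures defined in this section, the following sketch computes them on label sets. It follows the usual textbook definitions; representing each label as a (category, polarity) pair and normalizing the Hamming loss by the total number of possible labels are assumptions, not details taken from the paper.

```python
from typing import List, Set, Tuple

# A label is assumed to be a (category, polarity) pair.
Label = Tuple[str, str]


def subset_accuracy(actual: List[Set[Label]], predicted: List[Set[Label]]) -> float:
    """Fraction of samples whose predicted label set equals the actual set exactly."""
    matches = sum(1 for y, h in zip(actual, predicted) if y == h)
    return matches / len(actual)


def jaccard_index(actual: List[Set[Label]], predicted: List[Set[Label]]) -> float:
    """Mean of |intersection| / |union| over all samples."""
    total = 0.0
    for y, h in zip(actual, predicted):
        union = y | h
        # Two empty sets are treated as a perfect match (a convention chosen here).
        total += len(y & h) / len(union) if union else 1.0
    return total / len(actual)


def hamming_loss(actual: List[Set[Label]], predicted: List[Set[Label]], num_labels: int) -> float:
    """Mean size of the symmetric difference, normalized by the number of possible labels."""
    total = 0.0
    for y, h in zip(actual, predicted):
        total += len(y ^ h) / num_labels
    return total / len(actual)


# Made-up example with two samples and four possible labels.
actual = [{("acting", "positive")}, {("story", "negative"), ("music", "positive")}]
predicted = [{("acting", "positive")}, {("story", "negative")}]

print(subset_accuracy(actual, predicted))   # first sample matches, second does not
print(jaccard_index(actual, predicted))
print(hamming_loss(actual, predicted, num_labels=4))
```

Higher values are better for subset accuracy and the Jaccard index, whereas lower values are better for Hamming loss, so a model can lead on one measure and trail on another.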
The CNN model used in this study consists of several steps. First, we fed the data to a word embedding layer using the Keras library. Next, the convolution operator was applied to the output of the word embedding layer, with 256 filters of kernel size 3, to form the final features of the input sample. Finally, a fully connected layer was used to classify the sample into the existing classes. Like CNN, the other three models consist of several stages. In the first stage, word embedding was used to convert words into numerical vectors. These vectors were then passed to the blocks of these networks to extract textual information. Finally, a fully connected network was used to map the features extracted by these models to the existing class set. In all model architectures, dropout was used to prevent overfitting. Table 3 lists the hyperparameters used in this study.

Table 4 compares the models on the three measures of subset accuracy, Jaccard index, and Hamming loss. As can be seen, the CNN model showed the best performance among the models on subset accuracy, with a score of 67.5%, and on the Jaccard index, with a score of 77.075%. However, it performed worse than Bi-LSTM on Hamming loss. Since subset accuracy is one of the most stringent measures in multi-label classification, it follows that CNN performed best on this task and data set. Remarkably, CNN is superior to the recurrent neural networks. This is because of CNN's good performance in extracting local features. Because sentiment is expressed in comments and texts through specific words and phrases, CNN was better able to identify these words and phrases in this dataset and showed better efficiency.
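To make the notion of local features more concrete, the following sketch applies a single convolution filter of kernel size 3 to a sequence of word vectors and then takes the maximum activation. It is only an illustration under stated assumptions: the ReLU activation, the max pooling step, the two-dimensional word vectors, and the filter weights are all made up here and are not the configuration reported in the paper, which uses 256 filters learned during training.

```python
from typing import List

Vector = List[float]


def conv1d_feature(word_vectors: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one filter over consecutive words and return one activation per window.

    Each window covers len(kernel) neighbouring words, so every activation
    depends only on a short local phrase.
    """
    width = len(kernel)
    if len(word_vectors) < width:
        raise ValueError("sequence is shorter than the kernel")
    activations = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        total = bias
        for kernel_row, word in zip(kernel, window):
            total += sum(w * x for w, x in zip(kernel_row, word))
        activations.append(max(0.0, total))  # ReLU, assumed for illustration
    return activations


def global_max_pool(activations: List[float]) -> float:
    """Keep the strongest response of the filter, wherever in the text it occurred."""
    return max(activations)


# Made-up 2-dimensional vectors for a 4-word comment and one 3-word filter.
comment = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

activations = conv1d_feature(comment, kernel)
strongest = global_max_pool(activations)
```

In this toy example the filter responds most strongly to the first three-word window, which mirrors how a trained filter could respond to a specific sentiment-bearing phrase regardless of the rest of the comment.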
Aspect-based sentiment analysis is important and widely applicable because of its ability to identify all aspects discussed in a text. However, it is most effective when, in addition to identifying all the aspects discussed in the text, it can also identify their polarity. Most previous methods use the pipeline approach, that is, they first identify the aspects and then identify the polarities. Such methods are unsuitable for practical applications because errors propagate: ACD mistakes are passed on to ACP. Therefore, in this study, we used an approach that identifies aspects and their polarity simultaneously in a single model. Since no similar study has been done in Persian in this way, different deep learning models were developed for the proposed method and compared with each other. Subset accuracy, Jaccard index, and Hamming loss were used to evaluate the performance of the developed models. It can be seen from the results in Table 4 that the CNN model outperformed the other deep learning models. This is due to CNN's effectiveness in extracting local features.

In this article, we used a simple but effective strategy to learn two important sub-tasks of aspect-based sentiment analysis in Persian, namely ACD and ACP. To this end, we changed the data labeling, or in other words added a neutral class, to enable the multi-task deep learning model to detect both category and polarity. We then trained some basic deep learning models and compared and evaluated the results. Finally, the CNN model, owing to its strong ability to extract local features, performed better than the other models on the data set of this study.
[1] Issues and Challenges of Aspect-based Sentiment Analysis: A Comprehensive Survey
[2] Sentiment analysis for customer relationship management: an incremental learning approach
[3] Deep Learning for Aspect-Level Sentiment Classification: Survey, Vision, and Challenges
[4] Aspect Aware Learning for Aspect Category Sentiment Analysis
[5] Ensemble Models for Aspect Category Related ABSA Subtasks
[6] UWB: Machine Learning Approach to Aspect-Based Sentiment Analysis
[7] Aspect Based Sentiment Analysis Survey
[8] Application of Deep Learning Approaches for Sentiment Analysis. Algorithms for Intelligent Systems
[9] LICD: A Language-Independent Approach for Aspect Category Detection
[10] Understanding Citizen Issues through Reviews: A Step towards Data Informed Planning in Smart Cities
[11] Aspect category detection using statistical and semantic association
[12] Aspect Based Sentiment Analysis with Gated Convolutional Networks
[13] Aspect Category Detection via Topic-Attention Network
[14] Bindu CS, 2020. Aspect category polarity detection using multi class support vector machine with lexicons based features and vector based features
[15] Sentence Constituent-Aware Aspect-Category Sentiment Analysis with Graph Attention Networks
[16] Enhanced Aspect Level Sentiment Classification with Auxiliary Memory
[17] A Survey of the Usages of Deep Learning in Natural Language Processing
[18] Deep Learning for Natural Language Processing in Radiology-Fundamentals and a Systematic Review
[19] Text analysis for email multi label classification
[20] Jointly Modeling Aspect and Polarity for Aspect-based Sentiment Analysis in Persian Reviews
[21] Co-LSTM: Convolutional LSTM model for sentiment analysis in social big data. Information Processing & Management
[22] Multi-Label Text Classification with Transfer Learning for Policy Documents
[23] Multi-Task Learning Based Network Embedding
[24] An Overview of Multi-Task Learning in Deep Neural Networks
[25] A Survey of Multi-Task Deep Reinforcement Learning
[26] Neural Transfer Learning for Natural Language Processing
[27] Multi-Task Deep Learning for Affective Content Detection from Text
[28] A Comparison of Loss Weighting Strategies for Multi task Learning in Deep Neural Networks
[29] A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
[30] Semi-supervised Aspect-level Sentiment Classification Model based on Variational Autoencoder. Knowledge-Based Systems
[31] A Novel Approach for Enhancing Sentiment Classification of Persian Reviews Using Convolutional Neural Network and Majority Voting Classifier
[32] Classification of membrane proteins using a deep hybrid model
[33] Predicting mRNA degradation in the development of covid-19 vaccine using deep learning methods
[34] Machine Learning and Data Science: Foundations, Concepts, Algorithms, and Tools
[35] Deep Learning: From Basics to Building Deep Neural Networks with Python