key: cord-0043218-gei9ktvy
title: Learning Discriminative Neural Sentiment Units for Semi-supervised Target-Level Sentiment Classification
authors: Zhao, Jingjing; Yang, Yao; Pang, Guansong; Lv, Lei; Shang, Hong; Sun, Zhongqian; Yang, Wei
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_60
sha: 45c9c1022e0345f344a37f8e1522a048b57ed805
doc_id: 43218
cord_uid: gei9ktvy

Target-level sentiment classification aims at assigning sentiment polarities to opinion targets in a sentence, for which it is significantly more challenging to obtain large-scale labeled data than for sentence/document-level sentiment classification due to the intricate contexts and relations of the target words. To address this challenge, we propose a novel semi-supervised approach that learns sentiment-aware representations from easily accessible unlabeled data specifically for fine-grained sentiment learning. This is very different from current popular semi-supervised solutions that use the unlabeled data via pretraining to generate generic representations for various types of downstream tasks. Particularly, we show for the first time that we can learn and detect some highly sentiment-discriminative neural units from the unsupervised pretrained model, termed neural sentiment units. Due to their discriminability, these sentiment units can be leveraged by downstream LSTM-based classifiers to generate sentiment-aware and context-dependent word representations that substantially improve their sentiment classification performance. Extensive empirical results on two benchmark datasets show that our approach (i) substantially outperforms state-of-the-art sentiment classifiers and (ii) achieves significantly better data efficiency.

Target-level sentiment classification (TSC) is the task of classifying sentiment polarities on opinion targets in sentences. It can provide more detailed insights into sentence polarities, but it involves significantly more intricate sentiment relations than sentence/document-level sentiment analysis. For example, the sentence "The voice quality of this phone is not good, but the battery life is long" holds negative sentiment on the target "voice quality" but is positive on the target "battery life". In recent years, deep neural network-based methods have been extensively explored for target-level sentiment classification to learn the representations of sentences and/or targets. Recurrent neural networks are one of the most popular approaches for this task because of their strong capability of learning sequential representations [2, 9]. However, these methods fail to distinguish the importance of each word to the target. A range of attention mechanisms have been introduced to address this issue, such as target-to-sentence attention [2], fine-grained word-level attention [3], and multiple attentions [4]. Convolutional neural network (CNN)-based models have also recently been used for this task because of their capability to extract informative n-gram features [5]. All the aforementioned methods focus on exploiting labeled data to build the classification model, whose performance is often largely limited. This is because they normally require large-scale, high-quality labeled data to be well trained, but in practice only small amounts of target-level labeled data are available, since such data are very difficult and costly to collect due to the complex nature of the task, e.g., fine granularity, co-existence of multiple targets in a sentence, and context-sensitive sentiment.
Two main methods to address this issue include: (i) generating and incorporating extra sentiment-informative representations by using auxiliary knowledge resources, e.g., sentiment lexicons [17, 28]; and (ii) pretraining the embeddings of words or the parameters of networks using large-scale unlabeled data [3, 16]. However, neither method can capture context-dependent sentiment. For example, the opinion word "long" can have completely opposite sentiment in different contexts, e.g., it is positive in "battery life is long" but negative in "the start-up time is too long". Additionally, sentiment lexicons require very expensive human involvement to handle data with evolving and highly diversified linguistics, so pretraining is the more practical option. Pretraining aims at generating generic representations for different learning tasks, and can often extract some transferable features for a particular task. However, due to the generic learning objective, it can also extract a large number of features that are irrelevant or even noisy w.r.t. a given task such as sentiment classification, leading to ineffective use of the unlabeled data.

In this study, we introduce a novel approach to associate the feature learning on unlabeled data with the downstream sentiment classification to extract highly relevant features w.r.t. sentiment classification. Specifically, besides pretraining on unlabeled data, we take a step further to learn and extract highly sentiment-discriminative neural units from a pretrained model, e.g., a long short-term memory (LSTM)-based Variational Autoencoder (VAE) [11]. These selected sentiment-aware units, termed Neural Sentiment Units (NeSUs), can generate highly relevant sentiment-aware representations, which are then leveraged by LSTM networks to perform sentiment classification on small labeled data. This enables LSTM networks to achieve significantly improved data efficiency and to learn context-dependent sentiment representations, resulting in substantially improved LSTM networks. In summary, this paper makes the following two main contributions:

- We discover for the first time that feature learning on unlabeled data can be associated with downstream sentiment classification to learn some highly sentiment-discriminative neural units (NeSUs). These NeSUs can be leveraged by LSTM-based classifiers to generate sentiment-aware and context-dependent representations, carrying substantially more task-dependent information than the generic representations obtained by pretraining.
- We further propose a novel LSTM-based target-level sentiment classifier called NeaNet that effectively incorporates the most discriminative NeSU to exemplify the applications of the NeSUs. Extensive empirical results on two benchmark datasets show that NeaNet (i) substantially outperforms 13 (semi-)supervised state-of-the-art sentiment classifiers and (ii) achieves significantly better data efficiency.

Many methods have been introduced for target-level sentiment analysis, including rule-based approaches [1, 6], statistical approaches [7, 8] and deep learning approaches [9, 21]. Due to page limits, below we discuss two closely relevant research lines. Recursive neural networks are one popular architecture explored at an early stage [29], but they rely heavily on the effectiveness of syntactic parsing trees. Recurrent neural networks have also shown strong performance on this task.
TD-LSTM [9] incorporated target information into an LSTM and modeled the preceding and following contexts of the target to boost performance. Target-sensitive memory networks (TMNs) [21] were proposed to capture the sentiment interaction between targets and contexts to address the context-dependent sentiment problem. However, these models fail to identify the contribution of each word to the targets. Attention mechanisms [2, 4, 10, 22] have then been applied to address this issue. For example, a target-to-sentence attention mechanism, ATAE-LSTM [2], was introduced to explore the connection between the target and its context; IARM [22] leveraged recurrent memory networks with multiple attentions to generate target-aware sentence representations. As CNNs can capture informative n-gram features, convolutional memory networks were explored in [18] to incorporate an attention mechanism that sequentially computes the weights of multiple memory units corresponding to multiple words. Instead of attention networks, [5] proposed a component to generate target-specific representations for words, and employed a CNN layer as the feature extractor, relying on a mechanism that preserves the original contextual information. Some other works [20] exploited the human reading cognitive process for this task. These neural network-based methods represent the current state of the art, but their performance is generally limited by the amount of high-quality labeled data.

Semi-supervised Methods. Many semi-supervised methods have been explored for sentence-level sentiment classification, such as pretraining with Restricted Boltzmann Machines or autoencoders [23, 26], auxiliary task learning [24] and adversarial training [25, 27]. However, there are only a few studies [16, 19] on semi-supervised target-level sentiment classification. [19] explored both pretraining and multi-task learning for transferring knowledge from document-level data, which is much less expensive to obtain. [16] used a Transformer-based VAE for pretraining, which modeled the latent distributions via variational inference. However, it fails to distinguish relevant from irrelevant features with respect to sentiment.

We introduce a novel semi-supervised framework to learn sentiment-discriminative neural units (NeSUs) on large-scale unlabeled data to enhance downstream classifiers on small labeled data. Unlike the widely-used pretraining approaches that learn generic representations, our proposed approach is specifically designed for fine-grained sentiment classification, by incorporating sentiment-aware neural units hidden in the pretrained model into downstream LSTM-based classifiers. This enables a substantially more effective use of the unlabeled data, greatly improving sentiment classification with limited labeled data. The procedure of our framework is presented in Fig. 1. It consists of four modules: LSTM-based VAE pretraining, measuring neuron sentiment discriminability, detection of NeSUs, and NeSU-enabled sentiment classification. The details of each module are introduced below.

A VAE is composed of an encoder and a decoder. The encoder maps an input x into a latent space and outputs the representation z. The decoder decodes z to reconstruct the input x. An LSTM-based VAE is used for pretraining for two main reasons: (i) the VAE retains sentiment-related features, which are important for generating sentences;
(ii) LSTMs use an internal memory to remember semantic information, which helps learn intricate context-dependent opinions in sentiment analysis. The VAE is trained on the unlabeled data DS_unlabel by minimizing the reconstruction loss and the KL divergence loss, yielding H neuron units in the encoding/decoding LSTMs. We then exploit the small labeled data to examine the discriminability of each neuron unit as follows.

[Fig. 1: Firstly, an LSTM-based VAE is trained on unlabeled data DS_unlabel. We then evaluate the discriminability of each encoding LSTM neuron unit using labeled data. A distribution separation measure d(·) is further applied to find a set of NeSUs (F) that have the best discriminability. Since NeSUs are often redundant to each other, only the most discriminative NeSU (C*) is leveraged by the downstream classifiers.]

Definition 1. Let DS_pos be a set of M sentences with positive sentiment and DS_neg be a set of K sentences with negative sentiment; we then define the discriminability measure function d(·) w.r.t. a neuron unit C_i as

d(C_i) = φ(η_i(DS_pos), η_i(DS_neg)), (1)

where η_i : DS → R^{M+K} returns a vector that contains the last hidden states of the neuron unit C_i for all the sentences in the set DS = {DS_pos, DS_neg}, i.e., for the M positive sentences and K negative sentences; the unit C_i has a scalar output; φ(·, ·) is a measure that evaluates the separability of the hidden-state distributions resulting from the samples of the two classes.

The main intuition of Definition 1 is that if a neuron unit has good discriminability, the hidden-state distributions of the different classes' samples should be well separable. Motivated by the fact that the Gaussian distribution is the most general distribution for fitting values drawn from Gaussian/non-Gaussian variables according to the central limit theorem, we specify φ using the Bhattacharyya distance to measure the separability of the two distributions, which assumes the resulting hidden states in the neuron unit C_i for each class's samples follow a Gaussian distribution. Accordingly, the discriminability of C_i is calculated as

d(C_i) = (1/4) ln( (1/4) (σ_pos² / σ_neg² + σ_neg² / σ_pos² + 2) ) + (1/4) (μ_pos − μ_neg)² / (σ_pos² + σ_neg²), (2)

where μ_pos, σ_pos² and μ_neg, σ_neg² are the mean and variance of the hidden states of C_i over DS_pos and DS_neg, respectively. The set of discriminative NeSUs F is then obtained by thresholding the discriminability:

F = {C_i | d(C_i) ≥ ξ, i = 1, ..., H}, (3)

where ξ is a threshold hyperparameter.

Since each NeSU is an LSTM neural unit, it works as a non-linear mapping function η : R^D → R, the same η as in Eq. (1), which can be formally defined as

s_t = η(v_t), (4)

where v_t is the embedding vector of the t-th word and s_t is a scalar sentiment indication value, with larger s_t indicating more positive sentiment.

In Fig. 1(b), we illustrate the discriminability values of all encoding LSTM neural units on the Laptop dataset. It is clear that only a small number of neural units are sentiment-aware; most units do not capture much sentiment information. Therefore, simply using all units may disregard discriminative information. Instead, as defined in Eq. (3), we only retain the sentiment-aware neural units based on their discriminability to fully exploit the unlabeled data. The parameter ξ can be tuned via cross validation using the labeled data. We find that retaining the single most discriminative neural sentiment unit (NeSU) always results in the best downstream classification performance; adding more NeSUs does not perform better. This demonstrates that the NeSUs in F capture similar transferable features, so they are often redundant to each other. We therefore extract only the following NeSU for the downstream classification:

C* = argmax_{C_i ∈ F} d(C_i),

where C* is the most discriminative unit, whose mapping function is denoted by η*; it is the only neural sentiment unit incorporated into downstream classifiers.
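To make the NeSU detection step concrete, the sketch below is an illustrative NumPy reimplementation under stated assumptions, not the authors' code: it scores each encoding-LSTM unit with the Gaussian Bhattacharyya distance of Eq. (2), thresholds the scores as in Eq. (3), and keeps the single highest-scoring unit. The array layout, the function names, and the use of NumPy are assumptions; only the scoring and selection logic follows the text above.

```python
import numpy as np

def bhattacharyya_gaussian(mu_p, var_p, mu_n, var_n, eps=1e-8):
    """Bhattacharyya distance between two 1-D Gaussians (Eq. 2)."""
    var_p, var_n = var_p + eps, var_n + eps
    return (0.25 * np.log(0.25 * (var_p / var_n + var_n / var_p + 2.0))
            + 0.25 * (mu_p - mu_n) ** 2 / (var_p + var_n))

def detect_nesu(h_pos, h_neg, xi):
    """Select sentiment-discriminative units from a pretrained encoder.

    h_pos: (M, H) last hidden states of the encoding LSTM for M positive sentences.
    h_neg: (K, H) last hidden states for K negative sentences.
    xi:    threshold hyperparameter on the discriminability d(C_i).
    Returns the index set F of all NeSUs, the most discriminative unit C*,
    and the per-unit discriminability scores.
    """
    d = np.array([
        bhattacharyya_gaussian(h_pos[:, i].mean(), h_pos[:, i].var(),
                               h_neg[:, i].mean(), h_neg[:, i].var())
        for i in range(h_pos.shape[1])          # one score per neuron unit C_i
    ])
    F = np.where(d >= xi)[0]                    # Eq. (3): units above the threshold
    c_star = int(d.argmax())                    # only the top unit is used downstream
    return F, c_star, d
```

In this sketch, `h_pos` and `h_neg` would be obtained by running the pretrained VAE encoder over the labeled sentences and stacking the last hidden state of every unit; the returned `c_star` plays the role of C*, whose per-word output η*(v_t) provides the sentiment value s_t used by the downstream classifier.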
We further introduce a novel NeSU-enabled Attention Network, namely NeaNet, which uses two parallel LSTMs to fully exploit the NeSU and generate sentiment-aware representations for target-level sentiment classification.

Task Statement. Target-level sentiment analysis predicts a sentiment category for a (sentence, target) pair. Given a sentence-target pair x = (w, w^T), where w = {w_1, w_2, ..., w_n}, w^T = {w^T_1, w^T_2, ..., w^T_m}, and w^T is a subsequence of w, the goal of this task is to predict the sentiment polarity y ∈ {P, N, O} of the sentence w w.r.t. the target w^T, where P, N, and O denote "positive", "negative" and "neutral" sentiments respectively.

The architecture of NeaNet is shown in Fig. 2. The bottom is an embedding layer, which maps the words in an input sequence w to word vectors.

SUSP: Using NeSU as a Sentiment Prior. Since the NeSU can discriminate the sentiment of the input words, we integrate it into the memory computation of the LSTM to generate sentiment-aware word representations. Moreover, the sentiment information can be carried forward along the word sequence due to the LSTM structure. Besides the three gates (input, forget and output gates) in the vanilla LSTM, we define an additional read gate r_t ∈ [0, 1] to control the sentiment information captured by the NeSU η*. This yields a NeSU-enabled Sentiment LSTM. The NeSU works like a sentiment prior, so we call the whole module NeSU-based Sentiment Prior (SUSP). In the defining equations of SUSP, σ refers to the sigmoid activation function and tanh to the hyperbolic tangent function; i_t, f_t, o_t ∈ R^H respectively denote the input, forget and output gates; v_t is the t-th word embedding and h_{t−1} is the hidden state at time step t−1; the gate weights and z_su ∈ R^H are the network weights, where H is the number of hidden cells; s_t = η*(v_t) denotes the sentiment value output by the retained NeSU mapping function η* as in Eq. (4); ⊙ denotes element-wise multiplication. Essentially, SUSP uses the NeSU η*, via the underlined parts in Eqs. (8)–(9), to capture context-dependent sentiment information and propagate this information to generate the context-dependent representation h_t.

The position information between the target and its context is also used to weight opinion words. The position weight l_i of w_i is computed from k, the index of the first target word, m, the length of the target, and a dataset-dependent constant C. Finally, h_t is weighted with l_t, i.e., h_t ← l_t · h_t.

SUCR: Using NeSU as a Context Reinforcer. Due to the integrated computation of the Sentiment LSTM, some original context information might be lost. To preserve the genuine context, we employ, in parallel, a Context LSTM initialized with the VAE encoder to learn generic word representations, and further incorporate the NeSU together with the position weight l to reinforce, in a sentiment-aware way, the context representations generated by the Context LSTM. We call this whole module NeSU-based Context Reinforcer (SUCR). In its definition, h_et is the hidden state generated by the Context LSTM at the t-th time step and s_t is the sentiment value output by η* as in Eq. (4).

We further consolidate the word-level representations generated by SUSP and SUCR via summation to form the final sentiment-aware and context-sensitive word representations. Then we apply a standard attention layer to fuse the semantic information of the context and the target.
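To illustrate the SUSP module, here is a minimal PyTorch-style sketch of an LSTM cell extended with the read gate r_t that injects the NeSU's sentiment value s_t into the memory update. The exact injection terms (the underlined parts of Eqs. (8)–(9)) are not recoverable from this text, so the cell update below, the layer names, and the shapes are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SentimentLSTMCell(nn.Module):
    """Sketch of a NeSU-enabled Sentiment LSTM cell (SUSP).

    `nesu` is assumed to be a callable mapping a word embedding v_t to a scalar
    sentiment value s_t, playing the role of the retained unit eta* in Eq. (4).
    """

    def __init__(self, input_dim, hidden_dim, nesu):
        super().__init__()
        self.nesu = nesu
        # input, forget, output and the extra read gate, plus the candidate memory
        self.gates = nn.Linear(input_dim + hidden_dim, 5 * hidden_dim)
        # z_su: weights that broadcast the scalar sentiment value into R^H
        self.z_su = nn.Parameter(0.01 * torch.randn(hidden_dim))

    def forward(self, v_t, state):
        h_prev, c_prev = state
        x = torch.cat([v_t, h_prev], dim=-1)
        i_t, f_t, o_t, r_t, g_t = self.gates(x).chunk(5, dim=-1)
        i_t, f_t, o_t, r_t = (torch.sigmoid(g) for g in (i_t, f_t, o_t, r_t))
        g_t = torch.tanh(g_t)                     # candidate memory content
        s_t = self.nesu(v_t)                      # scalar sentiment value from the NeSU
        # assumed injection: the read gate controls how much NeSU sentiment
        # information enters the cell memory
        c_t = f_t * c_prev + i_t * g_t + r_t * torch.tanh(s_t * self.z_su)
        h_t = o_t * torch.tanh(c_t)
        return h_t, (h_t, c_t)
```

Under the same assumptions, the Context LSTM of SUCR would be a standard LSTM initialized from the pretrained VAE encoder, whose hidden states h_et are then reinforced with s_t and the position weight l_t before being summed with the SUSP outputs.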
Particularly, let h_Tm be the target representation generated by SUSP, and let h_t and h_et respectively denote the word representations generated by SUSP and SUCR; the consolidated word representations h_t + h_et and the target representation h_Tm form the input of the attention layer.

We evaluate our method on two benchmark datasets: Laptop and Rest. from SemEval 2014 [30], containing reviews in the laptop and restaurant domains. Following previous works [4, 5], we remove the samples labeled "conflict". For VAE pretraining, a relatively large unlabeled dataset was collected, including Laptop, Rest. and Elec. The unlabeled Laptop and Rest. data are respectively obtained from the Amazon Product Reviews 1 and Kaggle 2, while Elec. is from [14]. The statistics of all datasets and the detailed hyperparameters are listed in Table 1. For both labeled and unlabeled data, any punctuation is treated as a space.

To understand the discriminability of the NeSU, this section examines the sentiment the NeSU perceives for each word in different sentences. It is clear that the NeSU responds to the sentiment word "long" adaptively depending on the context, i.e., it is positive in Fig. 3(a) and negative in Fig. 3(b). In Fig. 3(a), benefiting from the LSTM, the target "battery life" can arouse the NeSU's memory of "long", generating a higher value. Fig. 3(c) shows an example with a subjunctive style, a challenging case for [5]. The NeSU can correctly assign a negative value to the positive sentiment word "friendly", and a downtrend/uptrend for "bit"/"more", demonstrating that the NeSU is also aware of implicit semantics.

Overall Performance. The results are shown in Table 2. On both datasets, our model NeaNet consistently achieves the best performance in both accuracy (ACC) and macro-F1 compared to all 13 supervised and semi-supervised methods. For example, compared to RAM, MGAN, TNet and ASVAET, which are the best competing methods in overall ACC, NeaNet substantially outperforms them by 1.18%-3.05% on Laptop and 2.64%-4.61% on Rest. The superiority of NeaNet is mainly due to the incorporation of the NeSU-driven SUCR and SUSP components, which effectively leverage the discriminability of the NeSU to capture context-dependent sentiment information and enable the LSTM networks to classify the sentiment of opinion targets more accurately. Particularly, as PRET+MULT is pretrained on document-level labeled sentiment data, its pretraining may introduce ambiguity for the fine-grained sentiment task, leading to significantly less effective performance than NeaNet. ASVAET is also pretrained on unlabeled data, but it generates generic representations only, which are much less expressive than the NeSU-enabled sentiment-aware representations.

Breakdown Performance. NeaNet obtains the best F1 performance in the negative class on both Rest. and Laptop, achieving 8.69% and 2.43% improvements over the best competing methods respectively. NeaNet also performs very competitively with the best results in the positive and neutral classes. These results indicate that NeaNet leverages unlabeled data well to capture fine-grained sentiment features and achieves impressive improvements by using SUCR and SUSP.

This section answers whether the discriminability of the NeSU enables NeaNet to achieve more data-efficient learning. We evaluate the performance of NeaNet with randomly reduced training data, with RAM and TNet as the baselines. The results are shown in Fig. 4. NeaNet performs significantly better than RAM and TNet in both ACC and macro-F1 with different amounts of labeled training data on both Laptop and Rest.
Particularly, even when NeaNet is trained with 50% less labeled data, it can obtain ACC and/or macro-F1 performance that is comparable to, or better than, that of RAM on both datasets. Similarly, NeaNet achieves comparably good performance to TNet even when 25% less training data is used to train NeaNet. This demonstrates that NeaNet can leverage the sentiment-aware property of the NeSU to achieve substantially more effective exploitation of the small labeled data.

NeaNet is compared with its four ablations as follows to investigate the contribution of its different components.

- aLSTM*: a simple semi-supervised version of aLSTM obtained by initializing it with our pretrained VAE encoder.
- aLSTM*+NeSU: a simple NeSU-enabled aLSTM*, in which the NeSU-based sentiment value is added into the attention layer.
- SUCR-enabled aLSTM*: an enhanced aLSTM* with its plain LSTM replaced with SUCR. It is equivalent to NeaNet with SUSP removed.
- SUSP-enabled aLSTM*: improves aLSTM* by replacing its LSTM with SUSP. It is a simplified NeaNet with SUCR removed.

The results are given in the last group in Table 2. aLSTM* performs significantly better than aLSTM on all datasets, showing that the pretrained VAE can extract highly transferable features from unlabeled data. aLSTM*+NeSU, SUCR-enabled aLSTM* and SUSP-enabled aLSTM* outperform aLSTM* in all performance measures, which indicates that the discriminability of the NeSU can enhance the downstream classifiers in various ways, e.g., enhancing the attention as in aLSTM*+NeSU or the memory architecture of the LSTM as in SUCR/SUSP-enabled aLSTM*. SUCR/SUSP-enabled aLSTM* performs much better than aLSTM*+NeSU, indicating that SUSP and SUCR can exploit the power of the NeSU more effectively; both of them underperform NeaNet, so both SUSP and SUCR are important to NeaNet. Particularly, SUSP-enabled aLSTM* performs consistently better than SUCR-enabled aLSTM*, revealing that SUSP leverages the sentiment-aware property of the NeSU to learn better representations than SUCR.
This paper introduces a novel semi-supervised approach to leverage large-scale unlabeled data for target-level sentiment classification on small labeled data. We discover for the first time that a few neuron units in the encoding LSTM cells of the pretrained VAE demonstrate highly sentiment-discriminative capability. We further explore two effective ways to incorporate the most discriminative neural sentiment unit (NeSU) into attention networks to develop a novel LSTM-based target-level sentiment classifier. Empirical results show that our NeSU-enabled classifier substantially outperforms 13 state-of-the-art methods on two benchmark datasets and achieves significantly better data efficiency.

References
1. A holistic lexicon-based approach to opinion mining
2. Attention-based LSTM for aspect-level sentiment classification
3. Multi-grained attention network for aspect-level sentiment classification
4. Recurrent attention network on memory for aspect sentiment analysis
5. Transformation networks for target-oriented sentiment classification
6. Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis
7. Target-dependent Twitter sentiment classification
8. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews
9. Target-dependent sentiment classification with long short term memory
10. Aspect level sentiment classification with deep memory network
11. Generating sentences from a continuous space
12. Effective attention modeling for aspect-level sentiment classification
13. Adam: a method for stochastic optimization
14. Semi-supervised convolutional neural networks for text categorization via region embedding
15. On a measure of divergence between two statistical populations defined by their probability distributions
16. Variational semi-supervised aspect-term sentiment analysis via transformer
17. Sentiment lexicon enhanced attention-based LSTM for sentiment classification
18. Convolution-based memory network for aspect-based sentiment analysis
19. Exploiting document knowledge for aspect-level sentiment classification
20. A human-like semantic cognition network for aspect-level sentiment classification
21. Target-sensitive memory networks for aspect sentiment classification
22. IARM: inter-aspect relation modeling with memory networks in aspect-based sentiment analysis
23. Variational pretraining for semi-supervised text classification
24. Semi-supervised learning with auxiliary evaluation component for large scale e-commerce text classification
25. Adversarial training methods for semi-supervised text classification
26. Fuzzy deep belief networks for semi-supervised sentiment classification
27. Learning adversarial networks for semi-supervised text classification via policy gradient
28. Attention and lexicon regularized LSTM for aspect-based sentiment analysis
29. PhraseRNN: phrase recursive neural network for aspect-based sentiment analysis
30. SemEval-2016 task 5: Aspect based sentiment analysis