key: cord-0864806-b575p9un
authors: Mukherjee, Rajdeep; Naik, Atharva; Poddar, Sriyash; Dasgupta, Soham; Ganguly, Niloy
title: Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task Approach
date: 2021-05-09
journal: 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
DOI: 10.1145/3404835.3463080
sha: e5dfdd4c2ceeb05354539d3e1366466ab926ab3d
doc_id: 864806
cord_uid: b575p9un

We propose VADEC, a multi-task framework that exploits the correlation between the categorical and dimensional models of emotion representation for better subjectivity analysis. Focusing primarily on the effective detection of emotions from tweets, we jointly train multi-label emotion classification and multi-dimensional emotion regression, thereby utilizing the inter-relatedness between the tasks. Co-training especially helps in improving the performance of the classification task: we outperform the strongest baselines with 3.4%, 11%, and 3.9% gains in Jaccard Accuracy, Macro-F1, and Micro-F1 scores respectively on the AIT dataset. We also achieve state-of-the-art results with 11.3% gains averaged over six different metrics on the SenWave dataset. For the regression task, VADEC, when trained with SenWave, achieves 7.6% and 16.5% gains in Pearson Correlation scores over the current state-of-the-art on the EMOBANK dataset for the Valence (V) and Dominance (D) affect dimensions respectively. We conclude our work with a case study on COVID-19 tweets posted by Indians that further helps in establishing the efficacy of our proposed solution.

With the proliferation of social media, as more and more people express their opinions online, detecting human emotions from their written narratives, especially tweets, has become a crucial task given its widespread applications in e-commerce, public health monitoring, disaster management, etc. [17, 18].
Categorical models of emotion representation such as Plutchik's Wheel of Emotion [21] or Ekman's Basic Emotions [8] classify affective states into discrete categories (joy, anger, etc.). Dimensional models, on the other hand, describe emotions relative to their fundamental dimensions. Russell and Mehrabian's VAD model [23], for instance, interprets emotions as points in a 3-D space with Valence (degree of pleasure or displeasure), Arousal (degree of calmness or excitement), and Dominance (degree of authority or submission) being the three orthogonal dimensions. Accordingly, the literature on text-based emotion analysis can be broadly divided into coarse-grained classification systems [10, 12-14, 28] and fine-grained regression systems [22, 24, 29, 30]. Although a coarse-grained approach is better suited for the task of detecting emotions from tweets, as observed in [4], prior works fail to exploit the direct correlation between the two models of emotion representation for finer interpretation. We utilize the better representational power of dimensional models [4] to improve the emotion classification performance by proposing VADEC, which jointly trains multi-label emotion classification and multi-dimensional emotion regression in a multi-task framework.

Multi-task learning [6] has been successfully used across a wide spectrum of NLP tasks including emotion analysis [1, 30]. While AAN [30] takes an adversarial approach to learn discriminative features between two emotion dimensions at a time, All_In_One [1] proposes a multi-task ensemble framework to learn different configurations of tasks related to coarse- and fine-grained sentiment and emotion analysis. However, none of these methods combines the supervision from VAD and categorical labels.
Our proposed framework (Section 2) consists of a classifier module that is trained for the task of multi-label emotion classification, and a regressor module that co-trains the regression tasks corresponding to the V, A, and D dimensions. Owing to the unavailability of a common annotated corpus, the two tasks are trained using supervision from their respective benchmark datasets (reported in Section 3.1), which further justifies the utility of our proposed multi-task approach. By jointly training the two modules, VADEC learns better shared representations, which especially help in improving the performance of the classification task, thereby achieving state-of-the-art results on the AIT [17] and SenWave [27] datasets (Section 3.3). For the regression task, we achieve SOTA results on the EMOBANK dataset [5] for the V and D dimensions (Section 3.4). We conclude our work with a detailed case study in Section 3.5, where we apply our trained multi-task model to detect and analyze the changing dynamics of Indian emotions towards the COVID-19 pandemic from their tweets. We discover the major factors contributing towards the various emotions and find their trends to correlate with real-life events.

Figure 1 illustrates the architecture of VADEC, which jointly trains a multi-label emotion classifier and a multi-dimensional emotion regressor with supervision from their respective datasets. Since we primarily focus on detecting emotions from tweets, we use BERTweet [19] as our text encoder. It is shared by the two modules and is hereafter referred to as the shared layer. In each of the two modules, the 768-dim. [CLS] token embedding of the sentence/tweet obtained from BERTweet is first passed through a fully connected (FC) layer with 256 neurons. The classifier passes this intermediate representation through another FC layer with 11 output neurons, each activated using Sigmoid with a threshold of 0.5 to predict the presence/absence of one of the 11 emotion categories.
Binary Cross-Entropy (BCE) with L2-norm regularization is used as the loss function, hereafter referred to as the EC Loss ($L_{EC}$). Similarly, the regressor passes the 256-dim. intermediate representation through an FC layer with 3 output neurons (with Sigmoid activation) corresponding to the V, A, and D dimensions. It then jointly optimizes the Mean Squared Error (MSE) loss of all three dimensions, hereafter referred to as the VADR Loss ($L_{VADR}$). VADEC jointly trains the two modules by optimizing the following multi-task objective:

$L = \lambda \cdot L_{EC} + (1 - \lambda) \cdot L_{VADR}$

Here, $\lambda$ represents a balancing parameter between the two losses. The weighted joint loss backpropagates through the shared layer, thereby fine-tuning the BERTweet parameters end-to-end.

For our experiments, we consider EMOBANK, a VAD dataset, and two categorical datasets, AIT and SenWave, as described below:
• EMOBANK (Buechel and Hahn [5]): A collection of around 10k English sentences from multiple genres (8,

For all our model variants, we perform extensive experiments with different sets of hyper-parameters and select the set with the lowest validation loss. Before evaluating the performance on the test set, we combine the training and validation data and re-train the models with the best obtained set of hyper-parameters (learning rate = 2e-5, weight decay = 0.01, $\lambda$ = 0.5, and no. of epochs = 5 for VADEC). For the regression task, the outputs of the Sigmoid activation at each of the three output neurons are suitably scaled before calculating the MSE loss, since the ground-truth VAD scores are in the range of 1-5. As model ablations, we investigate the role played by features derived from affect lexicons by additionally appending a 194-dim. Empath [9] feature vector to the intermediate representations learnt by our model variants, to be used for the final predictions. Parameters of our shared encoder are initialized with pre-trained model weights (roberta-base for RoBERTa, and bertweet-base for BERTweet) from the HuggingFace Transformers library [25].
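The two output heads and the joint objective described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the joint loss is assumed to be a convex combination weighted by the balancing parameter, the "suitable scaling" of the Sigmoid outputs is assumed to be a linear map onto the 1-5 range, L2 regularization is omitted, and all function names (`bce_loss`, `vad_mse_loss`, `joint_loss`, `predict_labels`, `scale_to_range`) are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Multi-label decision: one Sigmoid per emotion, thresholded at 0.5.
    Treating a probability equal to the threshold as 'present' is an assumption."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def bce_loss(logits, targets):
    """Mean binary cross-entropy over the emotion labels of one example
    (the L2 regularization term mentioned in the text is omitted here)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def scale_to_range(p, low=1.0, high=5.0):
    """Assumed linear rescaling of a Sigmoid output onto the 1-5 VAD range."""
    return low + (high - low) * p


def vad_mse_loss(logits, targets):
    """Mean squared error over the V, A, and D outputs after rescaling."""
    preds = [scale_to_range(sigmoid(z)) for z in logits]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def joint_loss(ec_loss, vadr_loss, lam=0.5):
    """Assumed convex combination of the two task losses."""
    return lam * ec_loss + (1.0 - lam) * vadr_loss


# Toy example: 3 emotion logits (instead of 11) and 3 VAD logits.
ec = bce_loss([2.0, -1.0, 0.3], [1, 0, 0])
vadr = vad_mse_loss([0.0, 1.0, -1.0], [3.0, 4.0, 2.0])
total = joint_loss(ec, vadr, lam=0.5)
```

In a real training loop the two losses would be computed on batches drawn from the categorical and VAD datasets respectively, and the combined value would be backpropagated through the shared encoder.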
Other model parameters are randomly initialized. All our model variants are trained end-to-end with the AdamW optimizer [16] on a Tesla P100-PCIE (16GB) GPU. We additionally ensure the reproducibility of our results and make our code repository publicly accessible.

We first discuss the comparative results of our model variants and ablations on the AIT dataset. We then report our state-of-the-art results achieved on the AIT and the SenWave datasets. As metrics, we use Jaccard Accuracy, Macro-F1, and Micro-F1 [17]. Among recent baselines: (i) BERTL (Park et al. [20]) denotes the scores obtained by fine-tuning BERT-Large [7] on the AIT dataset, and (ii) NTUA-SLP (Baziotis et al. [3]) represents the winning entry for this (sub)task of SemEval 2018 Task 1 [17], where the authors take a transfer learning approach by first pre-training their Bi-LSTM architecture, equipped with multi-layer self-attention, on a large collection of general tweets and the dataset of SemEval 2017 Task 4A, before fine-tuning their model on this dataset. Among our model variants and ablations: (i) EC represents our classifier module when trained as a single task (Fig. 1a), and (ii) EC RoBERTa uses RoBERTa [15] instead of BERTweet as the shared layer.

From Table 1, NTUA-SLP surprisingly outperforms BERTL (on Jaccard Accuracy and Micro-F1), a heavier model with 336M parameters. EC (trained with BERTweet) comfortably beats EC RoBERTa, demonstrating the better efficacy of BERTweet in learning features from tweets. The sparse Empath feature vectors, however, do not add any value to the rich 768-dim. contextual representations learnt using BERT-based methods. We obtain our best results with VADEC, with 3.4% and 3.9% gains in Jaccard Accuracy and Micro-F1 respectively over NTUA-SLP, and an 11% gain in Macro-F1 over BERTL.
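For readers unfamiliar with the three multi-label metrics reported above, the following plain-Python sketch shows how they are commonly computed from binary label vectors. It is an illustrative assumption about the standard definitions (per-sample intersection over union for Jaccard Accuracy, per-label averaging for Macro-F1, pooled counts for Micro-F1), not the authors' evaluation script; the helper names and the convention of scoring an empty gold/predicted pair as 1.0 are hypothetical choices.

```python
def jaccard_accuracy(y_true, y_pred):
    """Average over samples of |gold ∩ predicted| / |gold ∪ predicted|."""
    scores = []
    for gold, pred in zip(y_true, y_pred):
        inter = sum(1 for g, p in zip(gold, pred) if g and p)
        union = sum(1 for g, p in zip(gold, pred) if g or p)
        # Assumption: a sample with no gold and no predicted labels scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def _label_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one label."""
    tp = fp = fn = 0
    for gold, pred in zip(y_true, y_pred):
        if gold[label] and pred[label]:
            tp += 1
        elif pred[label]:
            fp += 1
        elif gold[label]:
            fn += 1
    return tp, fp, fn


def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    per_label = [_f1(*_label_counts(y_true, y_pred, k)) for k in range(n_labels)]
    return sum(per_label) / n_labels


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all labels."""
    n_labels = len(y_true[0])
    tp = fp = fn = 0
    for k in range(n_labels):
        a, b, c = _label_counts(y_true, y_pred, k)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


# Toy example with 2 tweets and 3 emotion labels.
gold = [[1, 0, 1], [1, 1, 0]]
pred = [[1, 0, 0], [1, 1, 1]]
jac = jaccard_accuracy(gold, pred)   # (1/2 + 2/3) / 2
mac = macro_f1(gold, pred)           # (1 + 1 + 0) / 3
mic = micro_f1(gold, pred)           # 2*3 / (2*3 + 1 + 1)
```

Macro-F1 weights every emotion equally, so it is more sensitive to rare labels than Micro-F1, which is one reason the two scores can move differently across models.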
Considering the superior performance of VADEC over all its model variants and ablations in Table 1, here we directly compare the results of VADEC, re-trained with SenWave [27], with the ones reported by the authors of [27], which serve as the only available baseline on this dataset. Following [27], we use Label Ranking Average Precision (LRAP), Hamming Loss, and Weak Accuracy (Accuracy) as metrics in addition to the ones reported in Table 1. As observed from Table 2, VADEC achieves SOTA by outperforming the baseline scores with an 11.3% performance gain averaged over all 6 metrics. Overall, our results from Tables 1 and 2 demonstrate the advantage of utilizing the VAD supervision for improving the performance of the multi-label emotion classification task.

The Pearson Correlation Coefficient r is used as the evaluation metric for the regression task. All the models are evaluated on the EMOBANK dataset. Among recent baselines: (i) AAN (Zhu et al. [30]) employs adversarial learning between two attention layers to learn discriminative word weight parameters for scoring two emotion dimensions at a time. The authors report the VAD scores for all 6 domains and 2 perspectives of EMOBANK. For comparison, we use their highest correlation score for each dimension. (ii) All_In_One (Akhtar et al. [1]) represents a multi-task ensemble framework which the authors use for learning four different configurations of multiple tasks related to emotion and sentiment analysis. (iii) SVR-SLSTM (Wu et al. [26]) represents a semi-supervised approach using variational autoencoders to predict the VAD scores. (iv) BERTL (EB ← AIT) [20], the current state-of-the-art, fine-tunes BERT-Large [7] on the AIT dataset to predict VAD scores by minimizing the EMD between the predicted VAD distributions and sorted categorical emotion distributions, which serve as a proxy for the target VAD distributions.
For comparison, we use their reported scores obtained upon further fine-tuning their best-trained model on the EMOBANK corpus. Our model variants include (i) VADR, which represents our regressor module when trained as a single task (Fig. 1b), (ii) VADR RoBERTa, an ablation where we experiment with RoBERTa as the shared layer, (iii) VADEC (AIT), and (iv) VADEC (SenWave), representing the scores of our multi-task model when trained with the AIT and SenWave datasets respectively.

From Table 3, VADR RoBERTa shows the highest correlation (0.511) on the D dimension. VADR (w/ BERTweet), however, outperforms VADR RoBERTa on the other two dimensions. Contrary to our observations in the classification task, co-training does not help in improving the performance of the regression task, as can be confirmed from the results of VADEC (AIT) and VADR. Although we are outclassed by BERTL (EB ← AIT) on the A dimension, VADEC (AIT) comfortably outperforms BERTL (EB ← AIT) on the V and D dimensions. VADEC (SenWave) further outclasses both VADEC (AIT) and BERTL (EB ← AIT) on V and D with 7.6% and 16.5% gains respectively. To conclude, although joint learning does not help the regression task as much as it helps in improving the classification performance (which in fact is our main objective), we still achieve noticeable improvements in the majority of emotion dimensions.

For this analysis, we consider Twitter_IN, a subset of the COVID-19 Twitter chatter dataset (version 17) [2], containing around 140K English tweets from India posted between January 25th and July 4th 2020.
Owing to very few reported cases in India before March 2020, we begin our analysis by predicting emotions from tweets, posted on or after March 1st 2020, using VADEC trained on EMOBANK and SenWave. A few tweets with their predicted emotions are listed in Table 4.

Table 4: Tweets and their predicted labels.
Single Label
  Tweet: Let us spare a moment and thought for the junior resident doctors of Mumbai on the frontline fighting it out alone with little help from the government against all odds and at great personal risk
  Predicted Labels: Thankful
  Tweet: This is the time to fight Covid19 at present but some intelligent Generals are focusing on war and terrorism
  Predicted Labels: Annoyed
Multiple Labels
  Tweet: The first Covid 19 positive from Meghalaya Dr John Sailo Rintathiang passed away early this morning. Sailo 69 who was also the owner of Bethany hospital was tested positive on April 13 2020
  Predicted Labels: Sad, Official Report
  Tweet: Media is so obsessed with a particular community that they even misspell coronavirus
  Predicted Labels: Annoyed, Joking, Surprise

For each emotion, we obtain its contributing aspects by training an unsupervised neural topic model, ABAE (He et al. [11]), on the subset of tweets containing the given emotion as per VADEC predictions. A few emotions along with their most accurate aspects are reported in Table 5. For each emotion, the extracted aspect terms are further filtered and assigned meaningful sub-categories by means of a many-to-many mapping. In Figure 2, we plot the temporal trends of the sub-categories (using bins of roughly equal size in terms of the number of tweets predicted with the plotted emotion) that made Indians feel annoyed (Fig. 2a) and optimistic (Fig. 2b) over time. In Fig. 2a, the peak in Crowd gathering between March 28th and April 7th can be attributed to the Tablighi Jamaat gatherings, which unfortunately triggered widespread criticism. Fig. 2b shows a high level of Community gratitude in general, with occasional peaks which may be attributed to the events targeted at raising solidarity among the public.
For Technology and AI, we observe a peak near the launch date of the Arogya Setu App, developed by the Indian Government to identify COVID-19 clusters.

All-in-One: Emotion, Sentiment and Intensity Prediction using a Multi-task Ensemble Framework
Yuning Ding, and Gerardo Chowell. 2020. A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
NTUA-SLP at SemEval-2018 Task 1: Predicting Affective Content in Tweets with Deep Attentive RNNs and Transfer Learning
Emotion Analysis as a Regression Problem - Dimensional Models and Their Implications on Emotion Representation and Metrical Evaluation
EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
An argument for basic emotions
Empath: Understanding Topic Signals in Large-Scale Text
Latent Emotion Memory for Multi-Label Emotion Classification
An Unsupervised Neural Attention Model for Aspect Extraction
Seq2Emo for Multi-label Emotion Classification Based on Latent Variable Chains Transformation
A Deep Learning-Based Approach for Multi-Label Emotion Classification in Tweets
Weakly-Supervised Deep Learning for Domain Invariant Sentiment Classification
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Decoupled Weight Decay Regularization
SemEval-2018 Task 1: Affect in Tweets
Understanding Emotions: A Dataset of Tweets to Study Interactions between Affect Categories
BERTweet: A pre-trained language model for English Tweets
Toward Dimensional Emotion Detection from Categorical Emotion Annotations
Theories of Emotion
Modelling Valence and Arousal in Facebook posts
Evidence for a three-factor theory of emotions
Dimensional Sentiment Analysis Using a Regional CNN-LSTM Model
Transformers: State-of-the-Art Natural Language Processing
Semi-supervised dimensional sentiment analysis with variational autoencoder. Knowledge-Based Systems
Xin Gao, and Xiangliang Zhang. 2020. SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic
Improving Multi-label Emotion Classification via Sentiment Classification with Dual Attention Transfer Network
Predicting Valence-Arousal Ratings of Words Using a Weighted Graph Method
Adversarial Attention Modeling for Multi-dimensional Emotion Regression

This research is partially supported by IMPRINT-2, a national initiative of the Ministry of Human Resource Development (MHRD), India. Niloy Ganguly was partially funded by the Federal Ministry of Education and Research (BMBF), Germany (grant no. 01DD20003).