key: cord-0043202-ey4ziomv
authors: Jiang, Tao; Wang, Jiahai; Liu, Zhiyue; Ling, Yingbiao
title: Fusion-Extraction Network for Multimodal Sentiment Analysis
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_59
sha: 2a6241610ed600f734487193921d7e8538f9e283
doc_id: 43202
cord_uid: ey4ziomv

Multiple modality data bring new challenges for sentiment analysis, as combining varieties of information in an effective manner is a demanding task. Previous works do not effectively utilize the relationship and influence between texts and images. This paper proposes a fusion-extraction network model for multimodal sentiment analysis. First, our model uses an interactive information fusion mechanism to interactively learn the visual-specific textual representations and the textual-specific visual representations. Then, we propose an information extraction mechanism to extract valid information and filter out redundant parts of the specific textual and visual representations. The experimental results on two public multimodal sentiment datasets show that our model outperforms existing state-of-the-art methods.

With the prevalence of social media, social platforms such as Twitter and Instagram have become part of our daily lives and play an important role in people's communication. As social networks become increasingly multimodal, more and more multimodal data combining images and texts appear on social platforms. Though providing great convenience for communication, multimodal data bring growing challenges for social media analytics. In fact, the sentiment often cannot be fully reflected by single-modality information alone. Our motivation is to leverage the varieties of information from multiple sources to build an effective model. This paper studies the task of sentiment analysis for social media that contains both visual and textual contents.

Sentiment analysis is a core task of natural language processing, and aims to identify sentiment polarity towards opinions, emotions, and evaluations. Traditional methods [14, 21] for text-only sentiment analysis are mainly statistical methods that rely heavily on the quality of feature selection. With the rapid development of machine learning techniques and deep neural networks, researchers have introduced many dedicated methods [7, 13] that achieve significantly improved results. In contrast to single-modality sentiment analysis, multimodal sentiment analysis has attracted more and more attention in recent works [20, 24, 26, 28]. However, most previous works cannot effectively utilize the relationship and influence between visual and textual information. Xu et al. [22] only take the single-direction influence of image on text into consideration and ignore the interactive promotion between visual and textual information. A co-memory network [23] is then proposed to model the interactions between visual contents and textual words iteratively. Nevertheless, the co-memory network only applies a weighted textual/visual vector as the guide to learn attention weights on the visual/textual representation. It can be seen as a coarse-grained attention mechanism and may cause information loss, because attending to multiple contents with one attention vector may hide the characteristics of each attended content. Further, the previous studies directly apply multimodal representations for final sentiment classification.
However, these multimodal representations contain partially redundant information that may introduce confusion and is not beneficial to the final sentiment classification.

This paper proposes a new architecture, named Fusion-Extraction Network (FENet), to solve the above issues for the task of multimodal sentiment classification. First, a fine-grained attention mechanism is proposed to interactively learn cross-modality fused representation vectors for both visual and textual information. It can focus on the relevant parts of texts and images, and fuse the most useful information for each single modality. Second, a gated convolution mechanism is introduced to extract informative features and generate expressive representation vectors. The powerful capability of Convolutional Neural Networks (CNNs) for image classification has been verified [8, 19], and applying CNNs to extract the relatedness of different regions of an image is a common practice. It deserves to be pointed out that CNNs also have a strong ability to process textual information [25]; they have been observed to be capable of extracting informative n-gram features as sentence representations [10]. Thus, the convolution mechanism is quite suitable for the extraction task in multimodal sentiment classification. Meanwhile, we argue that there should be a mechanism controlling how much of each multimodal representation can flow to the final sentiment classification procedure. The proposed gating mechanism plays the role of modulating the proportion of multimodal features. The experimental results on two public multimodal sentiment datasets show that FENet outperforms existing state-of-the-art methods.

The contributions of our work are as follows:
• We introduce an Interactive Information Fusion (IIF) mechanism to learn fine-grained fusion features. IIF is based on cross-modality attention mechanisms, aiming to generate the visual-specific textual representation and the textual-specific visual representation for both modalities.
• We propose a Specific Information Extraction (SIE) mechanism to extract the informative features for textual and visual information, and leverage the extracted visual and textual information for sentiment prediction. To the best of our knowledge, no CNN-gated extraction mechanism for both textual and visual information has been proposed in the field of multimodal sentiment analysis so far.

Various approaches [1, 4, 5] have been proposed to model sentiment from text-only data. With the prevalence of multimodal user-generated contents in social network sites, multimodal sentiment analysis has become an emerging research field that combines textual and non-textual information. Traditional methods adopt feature-based approaches for multimodal sentiment classification. Borth et al. [2] first extract 1200 adjective-noun pairs as the middle-level features of images for classification, and then calculate the sentiment scores based on the English grammar and spelling style of the texts. However, these feature-based methods depend heavily on laborious feature engineering, and fail to model the relation between visual and textual information, which is critical for multimodal sentiment analysis. With the development of deep learning, deep neural networks have been employed for multimodal sentiment classification. Cai et al. [3] and Yu et al. [27] use CNN-based networks to extract feature representations from texts and images, and achieve significant progress.
In order to model the relatedness between text and image, Xu et al. [22] extract scene and object features from images, and absorb the text words with these visual semantic features. However, they only consider the visual information for the textual representation, and ignore the mutual promotion of text and image. Thus, Xu et al. [23] propose a co-memory attentional mechanism to interactively model the interactions between text and image. Though taking the mutual influence of text and image into consideration, Xu et al. [23] adopt a coarse-grained attention mechanism which may not have enough capacity to extract sufficient information. Furthermore, they simply concatenate the visual representation and the textual representation for final sentiment classification. Instead, our model applies a fine-grained information fusion layer, and introduces an information extraction layer to extract and leverage visual and textual information for sentiment prediction.

Given a text-image pair (T, I), where T = {T_1, T_2, ..., T_M} is the text and I is a single image, the goal of our model is to predict the sentiment label y ∈ {positive, neutral, negative} towards the text-image pair. The overall architecture of the proposed FENet is shown in Fig. 1. The bottom layer includes a text encoding layer and an image encoding layer, which transform the text and the image into fixed-size vectors separately, where d_w denotes the dimension of the word embeddings. The middle part of our model is an interactive information fusion (IIF) layer, which is used to interactively learn cross-modality fusion for text and image simultaneously. The IIF layer contains a fine-grained attention mechanism and identity mapping [9], which allows fusing the information of one modality with the data of another modality and learning more specific features. The top part is a specific information extraction (SIE) layer, which consists of two gated convolution layers and a max-pooling layer. The SIE layer first utilizes convolution to extract informative features, and then selectively adjusts and generates expressive representations with gate mechanisms and a max-pooling layer. Finally, the visual-specific textual representation and the textual-specific visual representation from the SIE layer are concatenated for sentiment classification.

The function of the text encoding layer is to map each word into a low-dimensional, continuous, real-valued vector, also known as a word embedding. Traditional word embeddings can be treated as parameters of neural networks or pretrained on a proper corpus via unsupervised methods such as Glove [17]. Further, a pretrained bidirectional transformer language model, known as BERT [6], has shown its powerful capacity as a word embedding. We apply Glove-based embedding as the basic embedding and BERT-based embedding as the extension embedding. The model variants are named FENet-Glove and FENet-BERT, respectively.
• FENet-Glove. It applies Glove as the basic embedding to obtain the word embedding of each word. Specifically, we employ a word embedding matrix L ∈ R^{d_w×|V|} to preserve all the word vectors, where d_w is the dimension of a word vector and |V| is the vocabulary size. The word embedding of a word w_i can be denoted as l ∈ R^{d_w}, which is a column of the embedding matrix L.
• FENet-BERT. It uses BERT as the extension embedding to obtain the word representation of each word. Specifically, we use the last layer of BERT-base to obtain a fixed-dimensional representation sequence of the input sequence.
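To make the two embedding variants concrete, the following is a minimal PyTorch sketch of how the token representations could be obtained. The use of the HuggingFace transformers package, the toy vocabulary, and all variable names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the two text-encoding variants (illustrative assumptions,
# not the authors' released code). Requires torch and HuggingFace transformers.
import torch
from transformers import BertModel, BertTokenizer

sentence = "serious injury reported after the crash"

# FENet-BERT style: take the last hidden layer of BERT-base as the 768-d
# token representation sequence.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    X_bert = bert(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

# FENet-Glove style: look up each token in a d_w x |V| embedding matrix L.
# Here L is randomly initialized; in practice it would be filled with the
# 300-dimensional pretrained Glove vectors.
toy_vocab = {"serious": 0, "injury": 1, "crash": 2}
L = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=300)
token_ids = torch.tensor([[toy_vocab["serious"], toy_vocab["injury"]]])
X_glove = L(token_ids)  # shape: (1, 2, 300)
```

In the paper's notation, the columns of L correspond to the word vectors l ∈ R^{d_w}, and the resulting token sequence plays the role of the encoded textual representation that is later fed into the IIF layer.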
Given an image I_p, where I_p indicates the image I rescaled to 224 × 224 pixels, we use Convolutional Neural Networks (CNNs) to obtain the representations of images. Specifically, the visual embedding V is obtained from the last convolutional layer of ResNet152 [8] pretrained on ImageNet [18] classification, where the dimension of V is 2048 × 7 × 7; 2048 denotes the number of feature maps and 7 × 7 is the shape of each feature map. We then flatten each feature map into a 1-D feature vector v_i corresponding to a part of the image.

The above encoded representations only consider a single modality, and the attention mechanism is often applied to capture the interactions between different modality representations. However, previous works [22, 23] adopt coarse-grained attention, which may cause information loss, as the text contains multiple words and the image representation contains multiple feature maps. In contrast, as shown in the middle part of Fig. 1, we adopt the IIF layer to solve this problem; the details of the IIF mechanism are shown in Fig. 2(a). Given inputs from two modalities, one of them is the target modality input, which we fuse with the other modality input, named the auxiliary input, to generate the target modality output. Specifically, given a target input S = {S_1, S_2, ..., S_n} ∈ R^{d_s×n} and an auxiliary input A = {A_1, A_2, ..., A_l} ∈ R^{d_a×l}, we first project the target input S and the auxiliary input A into the same shared space and compute a matching matrix M between them. For each row of M, a softmax function is applied to quantify the importance of each piece of the auxiliary input to a specific piece of the target input. Then, the fine-grained attention output F ∈ R^{d_a×n} is obtained by matrix multiplication of the attention weights and the auxiliary input. Finally, the concatenation of the target input S and the fine-grained attention output F is fed into a fully connected layer with weight W_g ∈ R^{d_s×(d_s+d_a)} to obtain the specific representation G = {G_1, G_2, ..., G_n} of the target input, where G_i ∈ R^{d_s}. Thus, the overall process of IIF can be summarized as G = IIF(S, A). Therefore, the textual-specific visual representation V_g = IIF(V, X) and the visual-specific textual representation X_g = IIF(X, V) are obtained, where X denotes the encoded textual representation.

After interactively fusing the information of the two modalities, we need to extract the most informative representation and control the proportion contributing to the final sentiment classification. As shown in the top part of Fig. 1, we introduce the SIE layer for this task; the details of the SIE layer are depicted in Fig. 2(b). The SIE layer is based on convolutional layers and gated units. Given a padded input Q = {q_1, q_2, ..., q_k} ∈ R^{d_q×k}, we pass it through the SIE layer to get the final representation. First, n_k one-dimensional convolution kernel pairs are applied to capture the active local features. Each kernel corresponds to a feature detector which extracts a specific pattern of active local features [11]. However, the two kernels of a pair differ in their nonlinear activation functions. The first kernel of each pair transforms the information and obtains an informative representation, while the second kernel is a gate which controls the proportion of the first kernel's result flowing to the final representation.
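To illustrate the IIF computation above, here is a minimal PyTorch-style sketch of one fusion direction. The shared-space projection size, the flattening of the image features into 49 region vectors, and the omission of the identity-mapping connection are assumptions made for illustration; this is not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class IIF(nn.Module):
    """Sketch of an IIF-like fine-grained fusion step (illustrative only)."""
    def __init__(self, d_s, d_a, d_shared=256):
        super().__init__()
        self.proj_s = nn.Linear(d_s, d_shared)  # project target pieces to a shared space
        self.proj_a = nn.Linear(d_a, d_shared)  # project auxiliary pieces to a shared space
        self.fc = nn.Linear(d_s + d_a, d_s)     # plays the role of W_g in R^{d_s x (d_s+d_a)}

    def forward(self, S, A):
        # S: (n, d_s) target input pieces, A: (l, d_a) auxiliary input pieces
        M = self.proj_s(S) @ self.proj_a(A).t()    # (n, l) matching matrix
        alpha = torch.softmax(M, dim=-1)           # row-wise attention weights
        F = alpha @ A                              # (n, d_a) fine-grained attention output
        return self.fc(torch.cat([S, F], dim=-1))  # (n, d_s) specific representation G

# One way to instantiate both fusion directions:
X = torch.randn(20, 768)    # 20 encoded tokens (e.g., BERT last hidden states)
V = torch.randn(49, 2048)   # 49 flattened image-region vectors from ResNet152
X_g = IIF(d_s=768, d_a=2048)(X, V)   # visual-specific textual representation
V_g = IIF(d_s=2048, d_a=768)(V, X)   # textual-specific visual representation
```

Both X_g and V_g would then be passed to the SIE layer, whose kernel-pair formulation is detailed next.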
Specifically, in the SIE layer, a convolution kernel pair W_a and W_b maps the r columns in its receptive field to single features a and b with tanh and sigmoid activation functions, respectively. The feature e is the product of a and b, and stands for the representation after extraction and adjustment. As the filters slide across the whole sentence, a sequence of new features e = {e_1, e_2, ..., e_{k−r+1}} is obtained by e_i = tanh(W_a * Q_{i:i+r−1} + b_a) · σ(W_b * Q_{i:i+r−1} + b_b), where W_a, W_b ∈ R^{d_q×r} are the weights of the convolution kernel pair, b_a, b_b ∈ R are the biases of the convolution kernel pair, and "*" denotes the convolution operation. As there are n_k kernel pairs, the output features form a matrix E ∈ R^{(k−r+1)×n_k}. Finally, we apply a max-pooling layer to obtain the most informative feature from each convolution kernel pair, which results in a fixed-size vector z whose size is equal to the number of kernel pairs n_k. The above process can be summarized as z = SIE(Q). We treat V_g and X_g as the inputs of SIE to obtain the final visual and textual representations, respectively.

After obtaining the final feature representation vectors for image and text, we concatenate them as the input of a fully connected layer for classification, where W_p ∈ R^{class×2n_k} and b_p ∈ R^{class} are the learnable parameters of this layer.

Datasets. We use two datasets, MVSA-Single and MVSA-Multiple [15]. The former contains 5129 text-image pairs from Twitter and is labeled by a single annotator. The latter has 19600 text-image pairs labeled by three annotators. For a fair comparison, we process the original two MVSA datasets in the same way as [22, 23]. We randomly split the datasets into training, validation, and test sets with a split ratio of 8:1:1.

Tokenization. To tokenize the sentences for the Glove-based embedding method, we apply the same rules as [16], except that we separate the tags '@' and '#' from the words that follow them. For the BERT-based embedding method, we use the WordPiece tokenization introduced in [6].

Word Embeddings. To initialize words as vectors, FENet-Glove uses the 300-dimensional pretrained Glove embeddings, and FENet-BERT applies the 768-dimensional pretrained BERT embeddings, whose model contains 110M parameters.

Pretrained CNNs. We use the pretrained ResNet152 [8] from PyTorch.

Optimization. The training objective is cross-entropy, and the Adam optimizer [12] is adopted to compute and update all the training parameters. The learning rate is set to 1e−3 and 2e−5 for the Glove-based and BERT-based embeddings, respectively.

Hyper-parameters. We list the hyper-parameters used during our training process in Table 1. All hyper-parameters are tuned on the validation set, and the hyper-parameter configuration producing the highest accuracy score is used for testing.

We compare FENet with the following baseline methods on the MVSA datasets. SentiBank & SentiStrength [2] extracts 1200 adjective-noun pairs as the middle-level features of images and calculates the sentiment scores based on the English grammar and spelling style of texts. CNN-Multi [3] learns textual features and visual features by applying two individual CNNs, and uses another CNN to exploit the internal relation between text and image for sentiment classification. DNN-LR [27] trains a CNN for text and employs a deep convolutional neural network for images, and uses an averaging strategy to aggregate the probabilistic results output by logistic regression.
MultiSentiNet [22] extracts deep semantic features of images and introduces a visual feature attention LSTM model to absorb the text words with these visual semantic features. CoMN [23] proposes a memory network to iteratively model the interactions between visual contents and textual words for sentiment prediction. Besides, this paper also presents two ablations of FENet to evaluate the contribution of our components.

Table 2 shows the performance comparison of FENet with the baseline methods. As shown in Table 2, we have the following observations. SentiBank & SentiStrength is the worst, since it only uses traditional statistical features to represent the image and text multimodal information, which cannot make full use of the high-level characteristics of multimodal data. Both CNN-Multi and DNN-LR are better than SentiBank & SentiStrength and achieve close performance by applying CNN architectures to learn the two modality representations. MultiSentiNet and CoMN get outstanding results as they take the interrelations of image and context into consideration. CoMN is slightly better than MultiSentiNet because MultiSentiNet only considers the single-direction influence of image on text and ignores the mutually reinforcing and complementary characteristics between visual and textual information. However, CoMN employs a coarse-grained attention mechanism which may cause information loss, and directly uses redundant textual and visual representations for final sentiment classification. In contrast, FENet applies an information fusion layer based on fine-grained attention mechanisms, and leverages visual and textual information for sentiment prediction by adopting an information extraction layer. Thus, the FENet variants perform better than CoMN and achieve a new state-of-the-art performance.

Figure 3 shows an example of visual and textual attention visualization. We use the first feature map of the image and the first token of the sentence as attention queries, respectively. With the help of the interactive fine-grained attention mechanisms, the model can successfully focus on appropriate regions based on the associated sentences and pay more attention to the relevant tokens. For example, Fig. 3(a) depicts a traffic accident, and the corresponding text describes the casualties. As shown in Fig. 3(b), our model pays more attention to the head and seat of the broken car according to the sentence context. Also, based on the accident image, important words such as "serious" and "injury" have greater attention weights in Fig. 3(c). Thus, our model correctly captures the important parts of the text and image, and predicts the sentiment of this sample as negative.

This paper proposes FENet for sentiment analysis in multimodal social media. Compared with previous works, we employ a fine-grained attention mechanism to effectively exploit the relationship and influence between text and image. Besides, we explore a new approach based on gated convolution mechanisms to extract and leverage visual and textual information for sentiment prediction. The experimental results on two datasets demonstrate that our proposed model outperforms the existing state-of-the-art methods.
[1] Concept-level sentiment analysis with dependency-based semantic parsing: a novel approach
[2] Large-scale visual sentiment ontology and detectors using adjective noun pairs
[3] Convolutional neural networks for multimedia sentiment analysis
[4] SenticNet 4: a semantic resource for sentiment analysis based on conceptual primitives
[5] SenticNet 5: discovering conceptual primitives for sentiment analysis by means of context embeddings
[6] BERT: pre-training of deep bidirectional transformers for language understanding
[7] Multi-grained attention network for aspect-level sentiment classification
[8] Deep residual learning for image recognition
[9] Identity mappings in deep residual networks
[10] Semi-supervised convolutional neural networks for text categorization via region embedding
[11] A convolutional neural network for modelling sentences
[12] Adam: a method for stochastic optimization
[13] Hierarchical attention transfer network for cross-domain sentiment classification
[14] Sentiment analysis and opinion mining
[15] Sentiment analysis on multi-view social data
[16] Improved part-of-speech tagging for online conversational text with word clusters
[17] Glove: global vectors for word representation
[18] ImageNet large scale visual recognition challenge
[19] Going deeper with convolutions
[20] VistaNet: visual aspect attention network for multimodal sentiment analysis
[21] Recognizing contextual polarity in phrase-level sentiment analysis
[22] MultiSentiNet: a deep semantic network for multimodal sentiment analysis
[23] A co-memory network for multimodal sentiment analysis
[24] Multi-interactive memory network for aspect based multimodal sentiment analysis
[25] Aspect based sentiment analysis with gated convolutional networks
[26] Visual sentiment analysis by attending on local image regions
[27] Visual and textual sentiment analysis of a microblog using deep convolutional neural networks
[28] Tensor fusion network for multimodal sentiment analysis