key: cord-020905-gw8i6tkn
authors: Qu, Xianshan; Li, Xiaopeng; Farkas, Csilla; Rose, John
title: An Attention Model of Customer Expectation to Improve Review Helpfulness Prediction
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_55
sha:
doc_id: 20905
cord_uid: gw8i6tkn

Many people browse reviews online before making purchasing decisions. It is essential to identify the subset of helpful reviews from the large number of reviews of varying quality. This paper aims to build a model to predict review helpfulness automatically. Our work is inspired by the observation that a customer's expectation of a review can be greatly affected by review sentiment and by the degree to which the customer is aware of pertinent product information. Consequently, a customer may pay more attention to the specific content of a review that contributes most to its helpfulness from their perspective. To model such customer expectations and capture important information from a review text, we propose a novel neural network which leverages review sentiment and product information. Specifically, we encode the sentiment of a review through an attention module to extract sentiment-driven information from the review text. We also introduce a product attention layer that fuses information from both the target product and related products, in order to capture the product-related information in the review text. Our experimental results show an AUC improvement of 5.4% and 1.5% over the previous state of the art model on the Amazon and Yelp data sets, respectively.

E-commerce has become an important part of our daily life. Increasingly, people choose to purchase products online. According to a recent study [9], most online shoppers browse reviews before making decisions. It is essential for users to be able to find reliable reviews of high quality. Therefore, an automatic helpfulness evaluation mechanism is in high demand to help users evaluate these reviews. Previous work typically derived useful information from different sources, such as review content [6, 15, 28], metadata [4, 15, 17], and context [14, 20, 25]. However, such features are extracted from each source independently, without considering interactions. In particular, previous approaches do not take into account the process by which customers evaluate reviews. A customer's perception of what is helpful in a review is affected by the sentiment of the review and by what the customer already knows about the product. Before reading a review of a product, the customer is very likely to be aware of background information such as the star rating, product attributes, etc. When a customer reads a review with a lower star rating, they may initially hold a negative opinion of the item and mainly look for those aspects of the review that support the lower star rating. Although review sentiment has been explored previously [7, 15, 17], previous work has not used review sentiment to identify useful information in the review text. Moreover, the customer likely has some preconceptions about the product features they are most interested in. With these expectations in mind, the customer pays special attention to those aspects of the review text that they find most salient. There have been earlier efforts [4, 6, 13] at capturing useful information from a review by considering product information. However, the unique aspects of each product (differing importance of attributes, evaluation standards, etc.) were not fully identified in those efforts.
In order to address the above issues, we propose a novel neural network architecture that introduces sentiment and product information when identifying helpful content in a review text. The main contributions are summarized as follows:
- To our knowledge, we are the first to propose that customers may have different expectations for reviews that express different sentiments. We design a sentiment attention layer to model sentiment-driven changes in user focus on a review.
- We propose a novel product attention layer. The purpose of this layer is to automatically identify the important product-related attributes in reviews. This layer fuses information not only from related products, but also from the specific product.
- We evaluate the performance of our model on two real-world data sets: the Amazon data set and the Yelp data set. We consider two application scenarios: cold start and warm start. In the cold start scenario, our proposed model demonstrates an AUC improvement of 5.4% and 1.5% on the Amazon and Yelp data sets, respectively, when compared to the state of the art model. We also validate the effectiveness of each of the attention layers of our proposed model in the cold start and warm start scenarios.

Previous studies have concentrated on mining useful features from the content (i.e., the review itself) and/or the context (other sources such as reviewer or user information) of the reviews [6, 10, 13, 15, 18-20, 25, 27, 28]. Content features have been extracted and widely utilized. They can be roughly broken down into the following categories: structural features [6, 10, 13, 15, 27, 28], lexical features [10, 15, 27, 28], syntactic features [10, 15, 27], emotional features [15, 28], semantic features [6, 10, 13, 15, 18, 27, 28], and argument features [12]. For instance, Kim et al. [10] investigated a variety of content features from Amazon product reviews, and found that features such as review length, unigrams and product ratings are most useful in measuring review helpfulness. Context features have also been studied to improve helpfulness prediction [14, 20, 25]. For example, Lu et al. [14] examined social context that may reveal the quality of reviewers in order to enhance the prediction of review quality. While context information shows promise for improving helpfulness prediction, it may not be available across different platforms and is therefore not appropriate for designing a universal model. Deep neural networks have recently been proposed for helpfulness prediction of online reviews [1-4, 23]. Chen et al. [1] designed a word-level gating mechanism to represent the relative importance of each word. Fan et al. [3] proposed a multi-task paradigm to predict the star ratings of reviews and to identify helpful reviews more accurately. They also utilized the metadata of the target product, in addition to the textual content of a review, to better represent the review [4]. The methods summarized above are representative of the research progress in review helpfulness prediction. Sentiment and product information have been explored previously [4, 7, 10, 15]. With respect to sentiment, Martin and Pu [15] extracted emotional words from review text to serve as important parameters for helpfulness prediction. However, previous research has not taken into account differences in customer expectations that can result from review sentiment perception. With respect to product information, Fan et al.
[4] tried to better represent the salient information in reviews by considering the metadata information (title, categories) of the target product. However, this information can be quite similar for products of the same type, so the unique aspects of each product (differing importance of attributes, evaluation standards, etc.) cannot be fully captured from reviews. Wu et al. [26] presented an architecture similar to the one we propose here. However, the design of the sentiment attention layer and the product attention layer in our architecture differs from their attention layers. Moreover, their architecture is intended for classifying review sentiment.

Our model, shown in Fig. 1, is built upon a hierarchical bi-directional LSTM. We incorporate sentiment and product information to improve review representations through two attention layers. As the main components of our model are the hierarchical bi-directional LSTM, the sentiment attention layer, and the product attention layer, we refer to our model as HSAPA. A bi-directional LSTM model is able to learn past and future dependencies. This provides a better understanding of context [16]. The hierarchical architecture includes two levels: the word level and the sentence level. These levels learn dependencies between words and sentences, respectively. A bi-directional LSTM consists of two LSTM networks that process data in opposite directions. At the word level, we feed the embedding of each word into a unit of both LSTMs, and obtain two hidden states. We then concatenate these two hidden states as the representation of a word. The process is defined as $\overrightarrow{h}_{ij} = \overrightarrow{\mathrm{LSTM}}(x_{ij})$, $\overleftarrow{h}_{ij} = \overleftarrow{\mathrm{LSTM}}(x_{ij})$, $h_{ij} = [\overrightarrow{h}_{ij}, \overleftarrow{h}_{ij}]$, where $x_{ij}$ is the embedding vector of the $i$th word of the $j$th sentence, and $\overrightarrow{h}_{ij}$ and $\overleftarrow{h}_{ij}$ are the hidden states learned by the bi-directional LSTM. The state $h_{ij}$ is the concatenation of these hidden states for the word $x_{ij}$. Sentence Encoder. At the sentence level, a sentence representation is learned through an architecture similar to that used for the word level: $\overrightarrow{h}_{j} = \overrightarrow{\mathrm{LSTM}}(s_{j})$, $\overleftarrow{h}_{j} = \overleftarrow{\mathrm{LSTM}}(s_{j})$, $h_{j} = [\overrightarrow{h}_{j}, \overleftarrow{h}_{j}]$, where $s_j$ refers to the weighted representation of the $j$th sentence after applying the attention layer. The state $h_j$ is the final representation of the sentence $s_j$, obtained by concatenating the hidden states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$.

For reviews that express different types of sentiment (positive, negative, etc.), customers may have different expectations, and attend to different words or sentences of a review. Consider the following example: "I loved the simplicity of the mouse, ... and it was very comfortable ... About 4 months of owning the mouse the scroll wheel seemed to be in always clicked in position, and would only stop after clicking it down hard for a couple seconds ..." The above review has a star rating of 2 out of 5. For a review with an overall negative sentiment like this, we may pay more attention to its descriptions of the bad aspects of the product than to the good aspects. Therefore, each word/sentence may contribute unequally to the helpfulness of a review, with regard to its sentiment. In order to learn the sentiment-influenced importance of each word/sentence, we propose a custom attention layer. In this layer, we use an embedding vector to represent each type of sentiment. We use the star rating (ranging from 1 to 5) of each review to indicate its sentiment, and map each discrete star rating into a real-valued, continuous vector $Sent$. This vector is initialized randomly, and updated gradually through the training process by reviews with the corresponding star rating.
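A minimal PyTorch-style sketch of the word-level bi-directional LSTM encoder and the star-rating sentiment embedding just described; layer names and sizes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Bi-directional LSTM over the words of one sentence.

    Hidden sizes are illustrative; the paper does not fix them here.
    """
    def __init__(self, embed_dim=100, hidden_dim=100):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, num_words, embed_dim)
        # Each output h_ij is the concatenation of the forward and backward
        # hidden states for that word, as in the word-encoder equations above.
        h, _ = self.bilstm(word_embeddings)   # (batch, num_words, 2 * hidden_dim)
        return h

# The sentiment vector Sent: one learned embedding per star rating (1-5),
# updated during training only by reviews that carry that rating.
sentiment_embedding = nn.Embedding(num_embeddings=5, embedding_dim=200)
star_rating = torch.tensor([2])               # a 2-star review (index 1)
sent = sentiment_embedding(star_rating - 1)   # (1, 200) sentiment vector Sent
```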
Sent can be interpreted as a high level representation of the sentiment-specific information. We measure the similarity between the sentiment and each word/sentence using a score function. The score function is defined as $f(h^s_{ij}, Sent) = (v^s_w)^T \tanh(W^s_{wh} h^s_{ij} + W^s_{ws} Sent + b^s_w)$, where $v^s_w$ is a weight vector and $(v^s_w)^T$ indicates its transpose, $W^s_{wh}$ and $W^s_{ws}$ are weight matrices, and $b^s_w$ is the bias vector. At the word level, the input to the score function is the abstract sentiment representation $Sent$ and the hidden state of the $i$th word in the $j$th sentence, $h^s_{ij}$. Next, we use the softmax function to normalize the scores and obtain the attention weights: $\alpha^s_{ij} = \exp(f(h^s_{ij}, Sent)) / \sum_{k=1}^{l} \exp(f(h^s_{ik}, Sent))$, where $\alpha^s_{ij}$ is the attention weight for the word representation $h^s_{ij}$. The sentence representation is a weighted aggregation of the word representations; the $j$th sentence is represented as $s^s_j = \sum_{i=1}^{l} \alpha^s_{ij} h^s_{ij}$ (Eq. 3), where $l$ denotes the number of words in the $j$th sentence. The representation of a review is also a weighted combination of sentence representations, defined as $r^s = \sum_{j=1}^{m} \beta^s_j h^s_j$ (Eq. 4), where $h^s_j$ is the hidden state of the $j$th sentence $s^s_j$, learned through the bi-directional LSTM. The value $m$ refers to the number of sentences in a review, and $\beta^s_j$ indicates the corresponding attention score for $h^s_j$. The weight $\beta^s_j$ is calculated with the same score function $f(\cdot)$, applied at the sentence level to $h^s_j$ and $Sent$: $\beta^s_j = \exp(f(h^s_j, Sent)) / \sum_{k=1}^{m} \exp(f(h^s_k, Sent))$.

As shown in the top right corner of Fig. 1, the product attention layer consists of two components: related product information and unique product information. Metadata information is embedded and fed into a CNN model [11] to capture the related product information, and the product identifier is encoded to represent the unique product information. When reading a review, customers may refer to different attributes depending on the product the review references. For example, for a review of a mouse, we may expect to see comments related to attributes such as the scroll wheel, hand feel, etc. Such attributes are considered helpful and garner more attention. We take advantage of the metadata information (such as title, product description, product category, etc.) of each product to learn common attributes shared by related products. We use pretrained GloVe embeddings [21] to initialize each token in the metadata as a 100-dimensional embedding. We extract important attributes from the metadata through a CNN model [11], which is widely used for different NLP tasks [8, 24, 30, 31] such as text understanding, document classification, etc. The CNN model consists of a convolution layer, a max-pooling layer, and a fully connected layer. In the convolution layer, each filter is applied to a window of words to generate a feature map. For example, we apply a filter $w \in \mathbb{R}^{hk}$ to a window of words $x_{i:i+h-1}$. Here $k$ indicates the dimension of the word vector, and $x_{i:i+h-1}$ refers to the concatenation of $h$ words from $x_i$ to $x_{i+h-1}$. The context feature $c_{ih}$ is generated as $c_{ih} = g(w \cdot x_{i:i+h-1} + b)$, where $g$ is a non-linear activation function and $b$ is the bias term. A feature map of the text is then generated as $c_h = [c_{1h}, c_{2h}, \ldots, c_{nh}]$, where $c_{1h}, \ldots, c_{nh}$ refer to the context features extracted from different sliding windows of the text, and $c_h$ indicates the concatenation of these features. The feature map $c_h$ is then fed into a max-pooling layer, and the maximum value $c = \max\{c_h\}$ is extracted as the important information captured by a particular filter. A number of filters are used, and the extracted features are concatenated and fed into a fully connected layer to generate a vector $Prod_1$. $Prod_1$ is a representation of the important related-product attributes in the metadata.
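A hedged sketch of the metadata CNN described above, producing the related-product vector $Prod_1$. Only a single filter width is shown, and the tanh activation, filter count, and dimensions are assumptions rather than the authors' settings:

```python
import torch
import torch.nn as nn

class MetadataCNN(nn.Module):
    """Convolution + max-pooling + fully connected layer over metadata tokens,
    yielding the related-product vector Prod_1. All sizes are illustrative."""
    def __init__(self, embed_dim=100, num_filters=64, window=3, out_dim=200):
        super().__init__()
        # Each filter spans `window` consecutive word vectors (w in R^{h*k}).
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window)
        self.fc = nn.Linear(num_filters, out_dim)

    def forward(self, metadata_embeddings):
        # metadata_embeddings: (batch, num_tokens, embed_dim), e.g. GloVe vectors
        x = metadata_embeddings.transpose(1, 2)    # (batch, embed_dim, num_tokens)
        feature_map = torch.tanh(self.conv(x))     # context features c_h per filter
        pooled, _ = feature_map.max(dim=2)         # c = max{c_h} for each filter
        return self.fc(pooled)                     # Prod_1

prod1 = MetadataCNN()(torch.randn(1, 50, 100))     # (1, 200)
```

Multiple filter widths can be handled by running several such convolutions in parallel and concatenating the pooled outputs before the fully connected layer, which matches the multi-filter description above.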
Although reviews for the same type of product may share the same important attributes, the degree of importance of these attributes may vary from product to product. Consider pet food, for example. Some pet food may be of good quality and fair price, but the flavor may not appeal to a picky eater. Conversely, price may be the most salient feature for some brands. In order to represent the unique characteristics of each product, the unique product identifier of each product is mapped into a continuous vector $Prod_2$. At the outset, $Prod_2$ is randomly initialized. During the training process, this vector is only updated when reviews specific to the product are used for training. Thus $Prod_2$ can be interpreted as a high level representation of product-specific information. The final product representation $Prod$ is generated by combining the two vectors $Prod_1$ and $Prod_2$ as $Prod = W_1 Prod_1 + W_2 Prod_2 + b_p$, where $W_1$ and $W_2$ are weight matrices for $Prod_1$ and $Prod_2$ respectively, and $b_p$ is the bias vector. We calculate the product attention weights based on the score function $f(\cdot)$, where the input to the score function is the product representation $Prod$ and the hidden state of a word $h^p_{ij}$: $f(h^p_{ij}, Prod) = (v^p_w)^T \tanh(W^p_{wh} h^p_{ij} + W^p_{wp} Prod + b^p_w)$, where $(v^p_w)^T$ denotes the transpose of the weight vector $v^p_w$, $W^p_{wh}$ and $W^p_{wp}$ are weight matrices, and $b^p_w$ is the bias vector. We then apply the softmax function to obtain a normalized attention score $\alpha^p_{ij}$. At the word level, the sentence representation is defined as $s^p_j = \sum_{i=1}^{l} \alpha^p_{ij} h^p_{ij}$ (Eq. 10), where $\alpha^p_{ij}$ indicates the product attention score of the word representation $h^p_{ij}$. The representation of a review is obtained as $r^p = \sum_{j=1}^{m} \beta^p_j h^p_j$ (Eq. 11), where $\beta^p_j$ indicates the attention weight for the hidden state of the $j$th sentence, $h^p_j$.

After applying the sentiment attention layer and the product attention layer separately, we obtain two different review representations, $r^s$ and $r^p$. These two representations are concatenated as the final representation of a review, $r = [r^s, r^p]$. Then, we apply a fully connected layer on top of $r$ to classify the helpfulness of a review. To minimize the difference between the predicted helpfulness value and the actual helpfulness label, we utilize the cross entropy loss as the objective function. It is a commonly used loss function for binary classification, and is defined as $L = -\sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log(1 - p(y_i)) \right]$, where $y_i$ indicates the actual helpfulness label, $p(y_i)$ indicates the predicted probability of helpfulness, and $N$ is the number of training observations. We present details on how the labels $y_i$ are assigned in the following section.

In this section, we evaluate the performance of our architecture in two application scenarios: the cold start scenario and the warm start scenario. Correspondingly, we split the data into training and test data differently for the two scenarios. We pre-process the data in the same way as Fan et al. [4]: First, we join each product review with the corresponding metadata information. Second, we filter out the reviews that have no votes. Last, we label reviews that receive more than 75% helpful votes out of the total votes as helpful, and label the remaining reviews as unhelpful. Evaluation Metric. In this study we use the Receiver Operating Characteristic Area Under the Curve (ROC AUC) statistic to evaluate the performance of our proposed model. This is a standard statistic used in the machine learning community to compare models. It is a robust statistic where imbalanced data sets are involved, and is a good metric for our problem, where there are nearly four times as many helpful reviews as unhelpful reviews.
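To make the model description above concrete, here is a compact, hedged sketch of how the pieces fit together: the additive score function used by both attention layers, the product fusion, and the classification head. The tanh nonlinearity, all dimensions, and the reuse of a single attention instance are illustrative simplifications, not the authors' implementation (the paper uses separate parameters for the sentiment and product attention layers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Score function f(h, ctx) = v^T tanh(W_h h + W_c ctx + b), followed by a
    softmax over the sequence; ctx is either Sent (sentiment attention) or
    Prod (product attention)."""
    def __init__(self, hidden_dim, ctx_dim, attn_dim=100):
        super().__init__()
        self.W_h = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_c = nn.Linear(ctx_dim, attn_dim)      # its bias plays the role of b
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h, ctx):
        # h: (batch, seq_len, hidden_dim); ctx: (batch, ctx_dim)
        scores = self.v(torch.tanh(self.W_h(h) + self.W_c(ctx).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)             # attention weights
        return (alpha * h).sum(dim=1)                # weighted aggregation

# Toy shapes: 30 sentence hidden states of size 400, context vectors of size 200.
attn = AdditiveAttention(hidden_dim=400, ctx_dim=200)  # reused here only for brevity
sent, prod1, prod2 = torch.randn(1, 200), torch.randn(1, 200), torch.randn(1, 200)

# Product fusion Prod = W1*Prod1 + W2*Prod2 + b_p (one linear layer over the
# concatenation computes the same linear combination).
fuse = nn.Linear(400, 200)
prod = fuse(torch.cat([prod1, prod2], dim=-1))

h = torch.randn(1, 30, 400)        # sentence-level hidden states of one review
r_s = attn(h, sent)                # sentiment-attended review representation
r_p = attn(h, prod)                # product-attended review representation

# Final head: concatenate the two representations and classify helpfulness.
classifier = nn.Linear(800, 1)
logit = classifier(torch.cat([r_s, r_p], dim=-1))
loss = F.binary_cross_entropy_with_logits(logit, torch.tensor([[1.0]]))  # 1 = helpful
```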
In practice, a new product may not yet have received any helpful votes. Therefore, assessment standards cannot be captured from past voting information, which leads to the cold start problem. To evaluate model performance in this scenario, we randomly select 80% of the products and their corresponding reviews as the training data set. The remaining products and their reviews are employed as the test data set. Therefore, all of the reviews for a given product appear in either the training data set or the test data set, but not both. Consequently, all products in this test data set face the cold start problem. The statistics of the two data sets are summarized in Tables 1 and 2. Even though the partitioning approach is the same as that reported by Fan et al. [4], a consequence of the random selection of products into test and training data sets is that the actual number of reviews differs from that of Fan et al. [4]. However, the difference is less than 1%, which is not statistically significant.

We compare our proposed model with several baseline models. Below is a list of the hand-crafted features.
- Structural features (STR), as introduced by Xiong et al. [27] and Yang et al. [28], include the number of tokens, the number of question sentences, and the star rating. They are used to reveal a user's attitude towards a product.

The baseline models against which we compare our model are:
- Fusion (SVM) uses a Support Vector Machine to fuse features from the preceding feature list.
- Fusion (R.F.) uses a Random Forest to fuse features from the preceding feature list.
- Embedding-Gated CNN (EG-CNN) [1] introduces a word-level gating mechanism that weights word embeddings to represent the relative importance of each word.
- Multi-task Neural Learning (MTNL) [3] is based on a multi-task neural learning architecture with a secondary task that tries to predict the star ratings of reviews.
- Product-aware Review Helpfulness Net (PRH-Net) [4] is a neural network-based model that introduces target product information to enhance the representation of a review. Fan et al. evaluate this model on the two data sets we are using and claim that PRH-Net is the state of the art.

We use the same data sets as Fan et al. [4] in the cold start scenario. This allows us to directly compare the performance of our model with the results reported by Fan et al. [4]. We also randomly select 10% of the products and their corresponding reviews from the training set as a validation data set. We then perform a grid search of the hyper-parameter space on the validation data set to determine the best choice of hyper-parameters. The models are then trained on the entire training data set with these fixed hyper-parameters. Tables 3 and 4 show the results on the Amazon data set and the Yelp data set, respectively. In Table 3 we see that our model outperforms previous models on all categories of the Amazon data set. The average improvement in AUC is 5.4% over the next best model. We observe that the degree of improvement varies from category to category. In the category AC3 (Electronics), our model achieves an improvement of 7.9%. In contrast, for the category AC4 (Grocery & Gourmet Food), the improvement is only 0.3%. We note that the category AC4 has less data than most of the other categories (Table 1). Only the category AC8 (Pet Supplies) contains fewer products and reviews. However, there are proportionally more reviews per product for the category AC8 than for the category AC4.
We suspect that the sentiment embedding and product embedding may not be learned well with such limited and divergent data. Therefore the improvement is not as high as that for the other categories. The results for the Yelp data set are presented in Table 4. We find that our model also outperforms the previous models in all categories. The average improvement in AUC is 1.5% over the next best model. We note that the overall improvement is not as high as that demonstrated on the Amazon data set. This may be due to the relatively small number of products and reviews in the Yelp data set. With the exception of the category YC4 (Restaurants), the other categories have fewer products and reviews than all of the categories of the Amazon data set. The comparison results presented in Tables 3 and 4 show that our model outperforms the baseline models in the cold start scenario. In order to examine the significance of the improvement of our proposed model, we conducted a one-tailed t-test. As we are not certain whether the values reported in Fan et al. [4] refer to variance or standard deviation, we performed three tests: the one-sample t-test, and the two-sample t-test under both interpretations (variance and standard deviation) of the values reported in Fan et al. [4]. In all cases, the statistical results validate that our method is significantly better than the other baselines (p < 0.001).

In order to tease out the performance contribution of each of the components of our model, we evaluated different combinations of the components. The results are shown in Table 5. Here HBiLSTM refers to the hierarchical bi-directional LSTM model without either of the attention layers. We use it as the baseline model for comparison. HSA refers to the combination of the HBiLSTM with the sentiment attention layer. HPA refers to the combination of the HBiLSTM with the product attention layer. Finally, HSAPA refers to the complete model, which implements both attention layers. From Table 5, we see that adding a sentiment attention layer (HSA) to the base model (HBiLSTM) results in an average improvement in the AUC score of 2.0% and 2.6% on the Amazon and Yelp data sets, respectively. By adding a product attention layer (HPA) to the base model (HBiLSTM), the improvement is 0.7% and 1.3% on the Amazon and Yelp data sets, respectively. Combining all three components results in an even larger increase in AUC score: 3.4% and 4.8% on the Amazon and Yelp data sets, respectively. We observe a synergistic effect resulting from the addition of the two attention layers. We also note that in both data sets, the improvement from the product attention layer is lower than that from the sentiment attention layer. This may be due to the fact that in the cold start scenario we have no information about the target product, so the helpful attributes learned from related products may not be sufficiently accurate.

In order to verify that the gain in AUC is a consequence of the additional attention layers and not simply a result of adding more parameters, we conducted additional experiments. We adjusted the hyper-parameters of the HBiLSTM, HSA and HPA models to ensure that they have approximately the same number of parameters as the complete model HSAPA. For example, for the category Grocery in the Amazon data set, the number of parameters of the complete model HSAPA is 30,194,490. We increased the number of hidden units in these models accordingly.
(Table 6: Performance of each model on the Yelp data set in the warm start scenario. YC1-YC5 are described in Table 4.)
(Table 7: Performance of each model on the Amazon data set in the warm start scenario. AC1-AC10 are described in Table 3.)
Recall that the selection of hyper-parameters was determined by a grid search of the hyper-parameter space. Not surprisingly, the new models with more parameters do not demonstrate an improvement in performance in comparison to the models with hyper-parameters determined by grid search. Our proposed model demonstrates improved performance not simply because of the greater modelling power of more parameters, but because it leverages sentiment and product-related information through the sentiment and product attention layers.

The warm start scenario is another commonly seen scenario, in which some reviews of a product have user votes, while other reviews have not yet received user votes. For this scenario, we randomly select 80% of the reviews as the training data, and use the remaining reviews as the test data. The data statistics in the warm start scenario are essentially the same as in the cold start scenario (Tables 1 and 2). As 80% of the reviews for the products are in the training data set, this partitioning produces a warm start scenario. We also evaluated the contribution of each attention layer in the warm start scenario. Tables 6 and 7 show that, on average, the addition of the sentiment attention layer (HSA) to the base model increases the AUC by 1.8% and 2.7% on the Yelp and Amazon data sets, respectively. We also find that the addition of the product attention layer (HPA) to the base model increases the AUC by 3.8% and 5.5% on the Yelp and Amazon data sets, respectively. Comparing the results from the warm start scenario to those from the cold start scenario, we make the following observations. First, the average performance of the base model (HBiLSTM) is very similar in both scenarios. Second, the AUC improvement from the product attention layer (HPA) is higher than that from the sentiment attention layer (HSA) on both data sets in the warm start scenario. This pattern is not seen in the cold start scenario. In the cold start scenario the product embedding can only be learned from reviews of related products. In contrast, in the warm start scenario product information can be learned from both the target product and related products. This explains why the product attention layer achieves better performance in the warm start scenario than in the cold start scenario. From the performance results described in the cold start scenario section (Tables 3 and 4), we see that HSAPA outperforms the other models. We also see that HSAPA has even better performance in the warm start scenario (Tables 6 and 7). In practice, one can expect a mix of cold start and warm start cases, in which HSAPA can be expected to perform even better than in the pure cold start scenario.
(Table 8: Two example reviews with word-level attention weights. Example 1: "did not fit on any of the tub spouts and was unable to stretch it enough to work. Had to return". Example 2: "This is a great blade. Almost no sanding needed after use and they remain sharp after several uses. Don't use them on rough construction material if you want them to keep doing the job they were meant to do." In the original table each review appears twice, once shaded by sentiment attention scores and once by product attention scores.)
We demonstrate a visual examination of the attention scores applied at the word level by randomly sampling two review examples (shown in Table 8). We use two colors, red and green, to represent the sentiment attention scores and the product attention scores, respectively. The lightness/darkness of the color is proportional to the magnitude of the attention score. There are a few interesting patterns to note. First, for the sentiment attention layer, the words that are assigned large weights have sentiment that is close to the overall sentiment of the review. For instance, in example 2 the overall sentiment of the review is positive (5 out of 5). Although there are several negative words such as "no" and "don't", positive words/phrases like "great" and "remain sharp" are still assigned higher attention weights. This observation is consistent with our previous hypothesis that the importance of a word in a review can be affected by the review sentiment. Second, the attributes of the product, and the words describing those attributes, gain higher weights from the product attention layer. For instance, in the first example the descriptive words "fit" and "enough" and the noun "tub" are assigned relatively high attention scores. Third, the combination of the important words captured by the two attention layers can give us a brief yet thorough summary of a review. It may also visually explain why the combination of the two attention layers achieves a better result than a single attention layer.

In this paper, we describe our analysis of review helpfulness prediction and propose a novel neural network model with attention modules that incorporate sentiment and product information. We also describe the results of our experiments in two application scenarios: cold start and warm start. In the cold start scenario, our results show that the proposed model outperforms PRH-Net, the previous state of the art model. The increase in performance, measured by AUC, as compared with PRH-Net is 5.4% and 1.5% on the Amazon and Yelp data sets, respectively. Furthermore, we evaluate the effect of each attention layer of the proposed model in both scenarios. Both attention layers contribute to the improvement in performance. In the warm start scenario, the product attention layer is able to attain better performance than in the cold start scenario, since it has access to reviews of the target products. In this work, we evaluate review helpfulness from the perspective of review quality. For future work, we may rank the helpfulness of reviews by incorporating a user's own preferences [22] in order to make personalized recommendations.

References
Multi-domain gated CNN for review helpfulness prediction
Cross-domain review helpfulness prediction based on convolutional neural networks with auxiliary domain discriminators
Multi-task neural learning architecture for end-to-end identification of helpful reviews
Product-aware helpfulness prediction of online reviews
Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering
What reviews are satisfactory: novel features for automatic helpfulness voting
A study of factors that contribute to online review helpfulness
Effective use of word order for text categorization with convolutional neural networks
Surprise! Most consumers look at reviews before a purchase
Automatically assessing review helpfulness
Convolutional neural networks for sentence classification
Using argument-based features to predict and analyse review helpfulness
Low-quality product review detection in opinion summarization
Exploiting social context for review quality prediction
Prediction of helpful reviews using emotions extraction
context2vec: learning generic context embedding with bidirectional LSTM
What makes a helpful online review? A study of customer reviews on Amazon
Exploring latent semantic factors to find useful product reviews
Modeling and prediction of online product review helpfulness: a survey
Learning to recommend helpful hotel reviews
GloVe: global vectors for word representation
A dynamic neural network model for CTR prediction in real-time bidding
Review helpfulness assessment
Twitter sentiment analysis with deep convolutional neural networks
Context-aware review helpfulness rating prediction
Improving review representations with user attention and product attention for sentiment classification
Automatically predicting peer-review helpfulness
Semantic analysis and helpfulness prediction of text for online product reviews
Text understanding from scratch
Character-level convolutional networks for text classification

Acknowledgments. This work was partially supported by a 2018 IBM Faculty Award to the University of South Carolina.