MM-Rec: Multimodal News Recommendation
Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
2021-04-15

Abstract: Accurate news representation is critical for news recommendation. Most existing news representation methods learn news representations only from news texts, ignoring the visual information in news such as images. In fact, users may click news not only out of interest in the news title but also because of the attraction of the news image. Thus, images are useful for representing news and predicting user behaviors. In this paper, we propose a multimodal news recommendation method that incorporates both the textual and the visual information of news to learn multimodal news representations. We first extract regions of interest (ROIs) from news images via object detection. Then we use a pre-trained visiolinguistic model to encode both news texts and news image ROIs, and model their inherent relatedness using co-attentional Transformers. In addition, we propose a crossmodal candidate-aware attention network that selects relevant historical clicked news for accurate user modeling by measuring the crossmodal relatedness between clicked news and candidate news. Experiments validate that incorporating multimodal news information can effectively improve news recommendation.

News representation is critical for news recommendation [28]. Most existing news representation methods learn news representations merely from news texts [1, 3, 6, 13-15, 18, 20-22, 25, 26, 30, 31]. For example, Okura et al. [13] used autoencoders to learn news representations from news content. Wu et al. [23] used a CNN and a personalized attention network to learn news representations from news titles. Wu et al. [24] used multi-head self-attention networks to model news from their titles. In fact, besides titles, many news websites also display images to better attract users' clicks [10], as shown in Fig. 1.

Users may click news not only out of interest in the content of the news title, but also because of the attraction of the news image [2, 7, 29]. For example, in Fig. 1 the image of the second clicked news shows a highlight moment in an NFL game, which may be attractive to users interested in football. The visual information of news images can therefore provide rich signals for news content understanding and future behavior prediction. In this paper, we study how to incorporate visual news information to enhance news recommendation.

Our work is motivated by the following observations. First, news titles and images usually have some relatedness in describing news content and attracting clicks. For example, in the second news of Fig. 1, the word "Cowboys" in the news title is related to the players shown in the news image. Modeling this relatedness can help better model news and infer user interest for news recommendation. Second, a user may have multiple interests, and a candidate news may only be related to a specific interest encoded in part of the clicked news. For example, in Fig. 1 the candidate news is only related to the second clicked news.
Thus, modeling the relevance between clicked news and candidate news can help predict a user's specific interest in a candidate news. Moreover, candidate news may have crossmodal relatedness with clicked news. In Fig. 1 the image of the candidate news is related to both the image and the title of the second clicked news, because both images show the same football team and its name is mentioned in the title of the second clicked news. Modeling the crossmodal relations between candidate news and clicked news can help measure their relevance accurately.

In this paper, we present a multimodal news recommendation method named MM-Rec, which leverages both textual and visual news information for news recommendation. In our approach, we first extract regions of interest (ROIs) of news images via a pre-trained Mask R-CNN model [4] for object detection. Then we use a pre-trained visiolinguistic model [11] to encode both news texts and news image ROIs and model their inherent crossmodal relatedness via co-attentional Transformers to learn accurate multimodal news representations. In addition, we propose a crossmodal candidate-aware attention network that selects relevant clicked news for user modeling by evaluating the crossmodal relevance between candidate news and clicked news, which helps model users' specific interest in candidate news. Experiments on a real-world dataset show that incorporating multimodal news information can effectively improve news recommendation.

Next, we introduce our MM-Rec method, which uses both textual and visual news information for news recommendation, as shown in Fig. 2. We first introduce its multimodal news encoder, which learns multimodal news representations from texts and images, and then introduce how to make news recommendations based on these representations.

On news websites, many news articles have images displayed together with their titles, as shown in Fig. 1. Users may click news not only due to their interest in the news title, but also because of the attraction of the news image [2]. Thus, modeling visual news content such as images is important for news representation. Since different regions in a news image may be differently informative for news modeling, we use a Mask R-CNN [4] model pre-trained on an object detection task to extract the ROIs of news images. We further use a ResNet-50 [5] model to extract the features of these ROIs, which form a feature sequence $[e_1, e_2, ..., e_K]$, where $K$ is the number of ROIs. To model textual content, following previous works [21, 24] we take news titles as model input. We tokenize a news title into a word sequence $[w_1, w_2, ..., w_M]$, where $M$ is the number of words.

An intuitive way is to model news texts and images with separate models. However, the title and image of the same news usually have some relations. For instance, in the first news of Fig. 1, the word "Fauci" in the news title is related to his photo. Capturing the relatedness between news titles and images can help better understand their content and infer user interests. Visiolinguistic models are effective in modeling the crossmodal relations between texts and images [9, 11, 16, 17]. Thus, we apply the pre-trained ViLBERT model [11] to capture the inherent relatedness between news title and image when learning their representations. The inputs of ViLBERT are the ROI and word sequences.
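To make the encoder input concrete, the following sketch shows one plausible way to prepare the ROI feature sequence with off-the-shelf torchvision models. The paper does not release code, so the score threshold, the ROI cap, and the crop-and-resize strategy here are illustrative assumptions rather than the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F
import torchvision

# Pre-trained Mask R-CNN as the ROI detector (the paper uses a Mask R-CNN
# pre-trained on object detection; the torchvision weights are a stand-in).
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

# ResNet-50 with its classification head removed as the ROI feature extractor.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_roi_features(image, max_rois=8, score_thresh=0.5):
    """Return the ROI feature sequence [e_1, ..., e_K] for one news image.

    image: float tensor of shape [3, H, W] with values in [0, 1].
    max_rois and score_thresh are illustrative hyperparameters.
    """
    detections = detector([image])[0]
    keep = detections["scores"] > score_thresh
    boxes = detections["boxes"][keep][:max_rois]
    features = []
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        # Resize each ROI crop to the ResNet input resolution.
        crop = F.interpolate(crop, size=(224, 224), mode="bilinear",
                             align_corners=False)
        features.append(backbone(crop).squeeze(0))  # 2048-d ROI feature
    return torch.stack(features) if features else torch.zeros(0, 2048)
```

The resulting feature sequence, together with the tokenized title, forms the input to the visiolinguistic encoder described next.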
It first models the contexts of words via several Transformers [19], and then uses several co-attentional Transformers [12] to capture the crossmodal interactions between the image and the title. The outputs are a hidden ROI representation sequence $H^v = [h^v_1, h^v_2, ..., h^v_K]$ and a hidden word representation sequence $H^w = [h^w_1, h^w_2, ..., h^w_M]$. Then we apply a word attention network to learn the title representation and an image attention network to learn the image representation. The attention weight of the $i$-th word in the news title is computed as $a^w_i = \frac{\exp(q_w^\top \tanh(W_w h^w_i))}{\sum_{j=1}^{M} \exp(q_w^\top \tanh(W_w h^w_j))}$, where $q_w$ is an attention query vector and $W_w$ is a parameter matrix. The final representation of the news title is the summation of the hidden word representations weighted by their attention weights, i.e., $r^t = H^w \times a^w$. The attention weights of the ROIs are computed in a similar way as $a^v_i = \frac{\exp(q_v^\top \tanh(W_v h^v_i))}{\sum_{j=1}^{K} \exp(q_v^\top \tanh(W_v h^v_j))}$, where $q_v$ and $W_v$ are parameters. The final representation of the news image is the summation of the hidden ROI representations weighted by their attention weights, i.e., $r^v = H^v \times a^v$.

Next, we introduce how to make news recommendations using the multimodal news representations. Since news recommendation usually relies on the relevance between candidate news articles and a user's personal interest to rank candidate news for a target user, we first introduce our user interest modeling method. Following many prior works [13, 24], we model a user's interest in news from the representations of their previously clicked news. We use the multimodal news encoder to learn the text and image representations of the previously clicked news from their titles and images, denoted as $R^t = [r^t_1, r^t_2, ..., r^t_N]$ and $R^v = [r^v_1, r^v_2, ..., r^v_N]$, where $N$ is the number of clicked news.

However, not all clicked news are informative for inferring a user's interest in a candidate news, because the candidate may be relevant to only a few clicked news. For example, the candidate news in Fig. 1 is only related to the second clicked news. Thus, selecting clicked news according to their relevance to the candidate news in user modeling may help accurately match the candidate news with user interest. In addition, candidate news may have crossmodal relations with the images and titles of clicked news. For example, in Fig. 1 the players in the candidate news image are related to the players in the image of the second clicked news and to the word "Cowboys" in its title. Motivated by these observations, we propose a crossmodal candidate-aware attention network that measures the crossmodal relevance between clicked news and candidate news to better model user interest in candidate news. We denote the image and text representations of a candidate news as $r^v_c$ and $r^t_c$, respectively. We compute the text-text attention weights of the clicked news, which represent their text-text relevance to the candidate news, as $a^{t,t} = \mathrm{softmax}(R^t \times r^t_c)$. In a similar way, we compute the text-image, image-text and image-image attention weights of the clicked news as $a^{t,v} = \mathrm{softmax}(R^t \times r^v_c)$, $a^{v,t} = \mathrm{softmax}(R^v \times r^t_c)$, and $a^{v,v} = \mathrm{softmax}(R^v \times r^v_c)$. The unified user embedding $u$ is computed by aggregating the clicked news representations under these four attention distributions, i.e., $u = \frac{1}{4}\left(R^t \times (a^{t,t} + a^{t,v}) + R^v \times (a^{v,t} + a^{v,v})\right)$.

In our method, the news click score for ranking is derived from the multimodal representations $r^t_c$ and $r^v_c$ of the candidate news and the user representation $u$. Motivated by [13], the click score $\hat{y}$ is predicted as $\hat{y} = {r^t_c}^\top u + {r^v_c}^\top u$. Following [24], we use negative sampling to build labeled samples from news click logs for model training, and use cross-entropy as the loss function.
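The following sketch ties these pieces together in PyTorch. The module names, the dimension choices, and the averaging of the four attention distributions in the user embedding follow our reading of the formulas above, so they should be treated as assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Attention pooling with query vector q and projection W, as in a^w and a^v."""

    def __init__(self, dim, query_dim=200):
        super().__init__()
        self.W = nn.Linear(dim, query_dim)
        self.q = nn.Parameter(torch.randn(query_dim) * 0.01)

    def forward(self, H):                        # H: [seq_len, dim]
        scores = torch.tanh(self.W(H)) @ self.q  # [seq_len]
        a = F.softmax(scores, dim=0)             # attention weights
        return H.t() @ a                         # weighted sum, [dim]

def candidate_aware_user_embedding(Rt, Rv, rt_c, rv_c):
    """Crossmodal candidate-aware attention over N clicked news.

    Rt, Rv: [N, dim] text/image representations of clicked news.
    rt_c, rv_c: [dim] text/image representations of the candidate news.
    """
    a_tt = F.softmax(Rt @ rt_c, dim=0)  # text-text attention weights
    a_tv = F.softmax(Rt @ rv_c, dim=0)  # text-image
    a_vt = F.softmax(Rv @ rt_c, dim=0)  # image-text
    a_vv = F.softmax(Rv @ rv_c, dim=0)  # image-image
    # Aggregate the clicked news under the four distributions (averaging is
    # our assumption; the paper's exact combination rule is not recoverable).
    return 0.25 * (Rt.t() @ (a_tt + a_tv) + Rv.t() @ (a_vt + a_vv))

def click_score(rt_c, rv_c, u):
    """Dot-product ranking score over the text and image candidate views."""
    return rt_c @ u + rv_c @ u

def negative_sampling_loss(pos_score, neg_scores):
    """Cross-entropy over one clicked news and K sampled non-clicked news."""
    logits = torch.cat([pos_score.view(1), neg_scores]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)  # index 0 is the clicked news
    return F.cross_entropy(logits, target)
```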
Since there is no high-quality news recommendation dataset that contains multimodal information, we constructed one based on logs collected from a commercial news website during three weeks (from Feb. 25, 2020 to Mar. 16, 2020). Logs in the first week were used to construct user histories, and the remaining sessions were used to form click and non-click samples. We sorted the sessions by time and used the first 1M sessions for training, the next 100K sessions for validation, and the rest for testing. Table 1 shows the dataset statistics.

In our experiments, we finetuned the last three layers of ViLBERT. We used Adam [8] as the optimizer (learning rate 1e-5) with a batch size of 32. We tuned hyperparameters on the validation set. We repeated each experiment 5 times and report the average AUC, MRR, NDCG@5 and NDCG@10 scores.
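For completeness, here is a standard per-impression computation of these ranking metrics; this is the usual formulation used in news recommendation evaluation, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mrr_score(labels, scores):
    """Mean reciprocal rank of the clicked news within one impression."""
    order = np.argsort(scores)[::-1]
    ranks = np.where(np.asarray(labels)[order] == 1)[0] + 1
    return float(np.mean(1.0 / ranks))

def ndcg_score(labels, scores, k):
    """NDCG@k with binary click labels for one impression."""
    labels = np.asarray(labels)
    order = np.argsort(scores)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, order.size + 2))
    dcg = float((labels[order] * discounts).sum())
    ideal = np.sort(labels)[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# AUC per impression comes directly from scikit-learn, e.g.
# auc = roc_auc_score(labels, scores), and all metrics are averaged
# over the test impressions.
```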
We compare the proposed MM-Rec method with several baseline methods, including: (1) EBNR [13], which learns news embeddings via autoencoders and user embeddings with a GRU network; (2) DKN [21], a deep knowledge-aware network for news recommendation; (3) NRMS [24], which models news with multi-head self-attention; and (4) PLM-NR [27], a pre-trained language model empowered approach for news recommendation, for which we use BERT as the news model. All of these methods consider news texts only.

The results are summarized in Table 2. Our MM-Rec approach, which considers the visual information of news, outperforms the methods based on textual content only, and t-test results validate the significance of the improvement (p < 0.01). This is because users usually click news articles not only based on their interest in news texts, but also because of the attraction of news images. Thus, the visual information of news images can enrich news representations for recommendation. Our MM-Rec method incorporates both textual and visual news information into news representation learning and meanwhile models their inherent relatedness for better news content understanding, whereas existing news recommendation methods ignore image-related information. In addition, our approach models the crossmodal relatedness between clicked news and candidate news for more accurate interest matching, which yields better performance.

Next, we study the effectiveness of multimodal information for news representation. We compare MM-Rec with its two variants that use images or titles only. The results are shown in Fig. 3. We find that both news titles and images are useful for learning news representations for recommendation, which shows that both the textual and the visual information of news are highly useful for understanding news content and inferring user interest. Interestingly, MM-Rec without image information is also slightly better than the PLM-NR baseline in Table 2. This is because the ViLBERT model is pre-trained on multimodal data and can leverage visual signals to enhance text understanding. In addition, incorporating multimodal news information further improves the recommendation performance, which shows that it helps learn accurate news representations.

Then we study the effectiveness of the co-attentional Transformers in the ViLBERT model and of the crossmodal candidate-aware attention network for user interest modeling. We compare MM-Rec with its variants without the co-attentional Transformers, or with the crossmodal candidate-aware attention replaced by the vanilla attention mechanism used in [22]. The results are shown in Fig. 4. We find that incorporating the co-attentional Transformers is helpful. This is because there is inherent relatedness between news title and image in representing news content and attracting clicks, so modeling their interactions can enhance their representations. In addition, the candidate-aware attention network is useful. This is because different clicked news usually have different importance for modeling a user's specific interest in a candidate news, and selecting them according to their crossmodal relatedness with the candidate news helps better match user interest.

We also present several ablation studies on the multimodal candidate news embeddings and the clicked news embeddings used for user modeling. The results are illustrated in Fig. 5. We find that performance drops when any of the candidate news embeddings or user embeddings is removed, which shows that all of them are useful and that both textual and visual information contribute to news and user modeling. In addition, textual information plays a more important role in news and user modeling, which is consistent with the results in Fig. 3. This is an interesting phenomenon, because texts are usually less attractive than images. We think this is mainly because a single news image usually cannot comprehensively summarize news content, and accurately understanding visual information can be very challenging.

We conduct several case studies to visually demonstrate the effectiveness of incorporating multimodal information into news recommendation. We show the clicked news of a randomly selected user and the rankings given by NRMS and MM-Rec in Fig. 6. We find that both NRMS and MM-Rec assign the last candidate news a low ranking, because from its title we can easily infer that it is irrelevant to the user's interests. However, NRMS fails to promote the first candidate news, which is highly related to the user's clicked news about the NFL. This may be because it is difficult to measure their relevance solely from the titles. In contrast, our MM-Rec method ranks the first candidate news at the top position, because it can easily be matched with the user's interests based on visual information. These results show the effectiveness of multimodal information in news recommendation.

In this paper, we presented MM-Rec, which utilizes both textual and visual news information to model news for recommendation. We use a visiolinguistic model to encode both news texts and images and capture their inherent crossmodal relatedness. In addition, we propose a crossmodal candidate-aware attention network that selects relevant clicked news based on their crossmodal relevance to candidate news, which better models users' specific interest in candidate news. Experiments show that MM-Rec can effectively exploit multimodal news information to improve news recommendation.
REFERENCES
[1] Neural News Recommendation with Long- and Short-term User Representations.
[2] 'Value Added': Language, Image and News Values.
[3] Graph Enhanced Representation Learning for News Recommendation.
[4] Mask R-CNN.
[5] Deep Residual Learning for Image Recognition.
[6] Graph Neural News Recommendation with Long-term and Short-term Interest Modeling.
[7] News Recommendation System Based on Text and Image Tag Data.
[8] Adam: A Method for Stochastic Optimization.
[9] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training.
[10] NewsREEL Multimedia at MediaEval 2018: News Recommendation with Image and Text Content.
[11] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.
[12] Hierarchical Question-Image Co-Attention for Visual Question Answering.
[13] Embedding-based News Recommendation for Millions of Users.
[14] PP-Rec: News Recommendation with Personalized User Interest and Time-aware News Popularity.
[15] HieRec: Hierarchical User Interest Modeling for Personalized News Recommendation.
[16] VL-BERT: Pre-training of Generic Visual-Linguistic Representations.
[17] LXMERT: Learning Cross-Modality Encoder Representations from Transformers.
[18] Joint Knowledge Pruning and Recurrent Graph Convolution for News Recommendation.
[19] Attention Is All You Need.
[20] Fine-grained Interest Matching for Neural News Recommendation.
[21] DKN: Deep Knowledge-Aware Network for News Recommendation.
[22] Neural News Recommendation with Attentive Multi-View Learning.
[23] NPA: Neural News Recommendation with Personalized Attention.
[24] Neural News Recommendation with Multi-Head Self-Attention.
[25] Personalized News Recommendation: A Survey.
[26] User Modeling with Click Preference and Reading Satisfaction for News Recommendation.
[27] Empowering News Recommendation with Pre-trained Language Models.
[28] MIND: A Large-scale Dataset for News Recommendation.
[29] Why Do We Click: Visual Impression-aware News Recommendation.
[30] AMM: Attentive Multi-field Matching for News Recommendation.
[31] UNBERT: User-News Matching BERT for News Recommendation.
[32] DAN: Deep Attention Neural Network for News Recommendation.