key: cord-0446699-b1z82xbl authors: Raza, Shaina; Ding, Chen title: Deep Dynamic Neural Network to trade-off between Accuracy and Diversity in a News Recommender System date: 2021-03-15 journal: nan DOI: nan sha: 42499dfae0611149f4f455de1127bd1bd4ab0078 doc_id: 446699 cord_uid: b1z82xbl
News recommender systems face a few challenges that are specific to the news domain. These challenges arise from readers' rapidly evolving interests over news items that are generated dynamically and change continuously over time. News reading is also driven by a blend of a reader's long-term and short-term interests. In addition, diversity is required in a news recommender system, not only to keep readers engaged in the reading process but also to expose them to different views and opinions. In this paper, we propose a deep neural network that jointly learns informative news representations and readers' interests in a unified framework. We learn the news representation (features) from the headlines, snippets (body) and taxonomy (category, subcategory) of news. We learn a reader's long-term interests from the reader's click history, short-term interests from the recent clicks via LSTMs, and the reader's diversified interests through the attention mechanism. We also apply different levels of attention in our model. We conduct extensive experiments on two news datasets to demonstrate the effectiveness of our approach.
Nowadays, many online websites publish news continuously, which can cause an information overload problem for their readers. A news recommender system (NRS) can narrow down the limitless options and provide readers with news based on what they have liked in the past. To understand its readers, an NRS needs to analyze the news content and make recommendations in real time. For news recommendation, three key issues need to be addressed. First, we observe that readers' interests are not fixed and tend to change over time.
Generally, their interests in different topics are stable in the long run, while the content in which they are currently interested is often affected by their up-to-date concerns, or by certain events and contexts [1], such as breaking news and weather alerts. For example, a reader who is a fan of 'Cristiano Ronaldo' may have read soccer-related news for several years. This reflects the reader's long-term interests. Recently, affected by the pandemic, the reader has also begun to browse news related to COVID-19. This reflects the reader's short-term interests.
Second, we observe that the readers' interests naturally form a sequence over time. Each sequence consists of a list of the consumed items together with their associated timestamps. For example, consider a reader who reads about 'Lionel Messi' at time t1, 'Soccer tournament postponed due to COVID-19' at t2, and then 'Is it safe to play soccer again' at t3. At each time step, the reader's next click depends sequentially on the previous ones. In this paper, we use click as a general term representing any interaction (browse, click, comment, read) between the reader and the news item.
Third, we observe that readers may get bored of only reading news on similar topics. It is not wise to keep recommending similar news to a reader repeatedly. For example, suppose the reader reads about the tennis player 'Novak Djokovic'. Based on the reader's interests, the NRS recommends more similar news about Djokovic. After a while, the reader gets bored and probably wants to read about other tennis players or different news. This points towards recommending diversified news (in addition to personalized news) to the readers. We also observe that the short-term interests of the readers implicitly show diversified patterns [2, 3]. For example, while reading sports-related news as usual, the reader might start browsing news related to the real estate market, due to a recent interest in purchasing a property.
Or the reader may be compelled to read US political news after receiving breaking news about the violent attack on the US Capitol in early 2021. In each such case, diversified reading patterns appear in the short-term interests. To address these issues, the NRS needs to consider both the long-term and short-term interests of the readers. In addition, the NRS should capture the sequential patterns in the readers' clicks and recommend diverse news to the readers. This whole phenomenon, in which the readers' interests change over time (for various reasons), is called temporal dynamics [1].
Generally, based on the optimization objective [4], recommendation approaches are categorized into two types: (i) accuracy-based (also known as relevance-based) [5]; and (ii) diversity-based [6, 7]. The accuracy-based approach optimizes the recommendation results by matching them with the user profile. A major limitation of this approach is that only the items that match what the user has liked in the past are recommended. The diversity-based approach, on the other hand, aims at recommending a wide variety of items to the user regardless of whether they are similar to the previously liked items.
Many state-of-the-art NRS [2, 8] apply accuracy-based approaches, which address the readers' long- and short-term interests but often do not consider diversity. There is limited work [3, 9] in NRS that considers diversity, and it has its own limitations: (i) the readers' interests are often taken as static; and (ii) a balance between accuracy and diversity is not fully explored. Ignoring these factors could result in sub-optimal recommendations.
Research shows that there always exists a tradeoff between accuracy and diversity [10]. Not considering this tradeoff in an NRS has the following negative implications: (i) recommendations based solely on accuracy could fail to suggest a variety of news items to the readers.
The readers may get bored of reading similar news. They may also be trapped in echo chambers, where they receive only news that reflects like-minded opinions; and (ii) conversely, recommendations based only on diversity could present news to the readers that deviates entirely from their interests. The readers may leave the system after receiving only news that they are not interested in. To provide a better user experience, an NRS should extend beyond the conventional accuracy evaluation criteria and consider a balanced combined objective.
In this work, we aim to address the temporal dynamics issue and offer balanced news recommendations to news readers. We still focus on accuracy while making sure that the diversity is acceptable (at least medium-level). We propose a deep dynamic neural network (D2NN) to provide effective news recommendations to the readers. Our proposed model has three components: (i) the news modeling; (ii) the reader interest modeling; and (iii) the recommendation part.
We design a novel news modeling component with BERT [11], a Convolutional Neural Network (CNN) [12] and attention mechanisms [13]. For each news item, we not only consider the news title and/or topics, similar to some recent work [2, 8, 14], but also consider various types of side information such as the news body, taxonomy, timestamps, etc. The embeddings are obtained from the BERT model, which considers the side information of the news data. Different from [2, 8, 14, 15], we pretrain the BERT model on news-specific data to generate the embeddings. We use a CNN on top of the embeddings to capture the local context information. We apply word-level attention on the news to select the important words. We also apply news-level attention to weigh each type of side information according to its effectiveness in the NRS.
We design a novel reader modeling component to profile the reader's interests.
We learn the reader's long-term interests from the whole history, the reader's short-term interests using a Long Short-Term Memory (LSTM) network [5] on the recent history, and the diverse patterns in the reader's interests within a session (short-term interests) through an attention module. The intuition is that the reader's sequences consist of repeated patterns. With the LSTM, the reader modeling component can forget or retain some portions of the history. Using attention, the model learns the important clicks at successive steps and reduces the effect of repeating patterns. Finally, the recommendation module defines the probability of a candidate news article being clicked by a reader. To the best of our knowledge, we are the first to provide such a wide list of aspects (features) related to the news (i.e., the side information) and readers in an NRS.
There are three main contributions in our work:
(1) We address the dynamics in a reader's interests over time with a seamless integration of long-term, short-term, diverse, and sequential interests. Our approach inherently provides a balance between highly accurate and reasonably diverse news recommendations.
(2) We emphasize the inclusion of heterogeneous side information, including the news headlines, snippets (passages from the news body) and taxonomy (category, subcategory), to learn multifaceted news representations consisting of textual, temporal, and contextual features from a sequence of news articles. The embeddings are generated from the BERT model pretrained on news-specific data. Most NRS [2, 8, 14] only consider the title to represent the news. There are also NRS that consider the news body, such as [15]. However, in [15], the embeddings (i.e., GloVe; see footnote 1) are pretrained on generic corpora such as Wikipedia 2014. In contrast, we train the word embeddings on a news-specific corpus.
(3) Through our extensive experiments on two news datasets, we show that our proposed model achieves a high level of recommendation accuracy while maintaining a requisite level of diversity. We also introduce a new metric to measure the tradeoff performance with respect to accuracy and diversity.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 discusses our D2NN framework. Section 4 explains the experiment setup. Section 5 analyzes the results. Section 6 gives the conclusion and lists the future work.
Temporal Dynamics: In a recommender system, temporal dynamics refers to users' preferences changing over time [1]. The earlier work on temporal dynamics is based on simple time-decay functions [16]. A time-decay function assigns different weights to a user's feedback and gives more importance to the recent interactions (or ratings) than to the older ones. A general limitation of this method is that it places too much emphasis on the recent user feedback and underestimates the importance of the past feedback. The past feedback is also important in a recommender system for determining the user's long-term preferences (interests). Another classical method for temporal dynamics is the time binning method, such as timeSVD++ [17]. In binning-based methods, the longer time bins usually represent the users' long-term interests, while the shorter time bins usually reflect the users' short-term interests. A general [...]
The above-mentioned state-of-the-art NRS [2, 8, 14, 23, 25] do represent the sequential, long-term and/or short-term interests of the news readers; however, a few limitations are noted. First, these models normally use the news IDs or news titles to represent the whole news. However, there are other pieces of information that may be more descriptive (e.g., the news story) or that reflect a reader's short-term or long-term preferences (e.g., topics, categories).
Second, these methods do not consider the variations in the readers' interactions. The focus is mainly on the prediction accuracy. As a result, the recommended items are usually very similar to each other and may cover only a small fraction of items, with no consideration of diversity.
Diversity: Maximal marginal relevance (MMR) is a classical technique to increase the diversity of documents retrieved against a query in an information retrieval (IR) system. MMR tries to reduce the redundancy of retrieval results while preserving their relevance to the query. Basically, this technique evaluates the query results and provides another, reranked list of documents. The same idea is also used in recommender systems to include diversity during the reranking of recommended items [6].
Generally, diversity is incorporated in recommender systems in two ways: (i) as a two-stage recommendation strategy; and (ii) in the optimization model. In a two-stage recommendation strategy [6, 27], an existing CF method is typically used to predict the missing ratings in the first stage, and the second stage promotes the desired diversification through a modified ranking strategy (rather than a conventional ranking scheme). The diversity in these methods is usually evaluated in two ways: (i) the individual diversity [6] or (ii) the aggregate diversity [28]. Individual diversity is quantified as the pairwise dissimilarity between items in a given user's recommendation list. This technique is used to assess the diversity from the user's perspective. The aggregate diversity, on the other hand, is measured as the number of unique items recommended across users. This scheme captures the system-centric notion of diversity. The two-stage recommendation strategy usually makes use of a tuning parameter that is defined explicitly to control the tradeoff between the accuracy and the diversity.
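To make the role of such a tuning parameter concrete, the following is a minimal sketch of greedy MMR-style reranking. It is not code from any of the cited systems: the function name `mmr_rerank`, the parameter `lam`, and the assumption that relevance scores and a pairwise similarity matrix are already available are illustrative choices.

```python
import numpy as np

def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy MMR-style reranking (illustrative sketch).

    relevance:  (n,) relevance score per candidate item (assumed precomputed)
    similarity: (n, n) pairwise item similarity (assumed precomputed)
    k:          number of items to select
    lam:        tradeoff parameter; 1.0 ranks by relevance only,
                smaller values penalize redundancy more strongly
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: highest similarity to any item already selected.
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam` close to 1 the list follows the relevance order; lowering it swaps near-duplicate items for less similar ones, which is the accuracy-versus-diversity control discussed above.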
The selection of the parameter is based on heuristics, where the items rated above the threshold are considered as prospective candidates for recommendation. This approach, in addition to increasing the computational burden, does not guarantee an optimum solution.
In the optimization model, a weighted combination of diversity and accuracy is incorporated in a joint optimization strategy. Some works [29, 30] consider diversity in the recommendation process through the use of regularization terms on the items' feature space. In a recent NRS [31], regularization terms (through a unique combination of Lasso and Ridge regressions) are used to trade off between a high level of accuracy and a reasonable amount of diversity. Some other works related to the optimization technique use determinantal point processes (DPPs) [7] to introduce diversity among the set of items. DPPs are probabilistic models and can be used to address the balance between accuracy and diversity within a set of diverse items. However, the applicability of DPPs is limited by high-complexity matrix operations on large datasets.
Dueling Bandit Gradient Descent (DBGD) [21] is another optimization technique to introduce diversity in recommendation models. DBGD is an online learning-to-rank algorithm based on multi-armed bandit algorithms and is used to model the exploration-versus-exploitation tradeoff for relative feedback. DBGD was recently used in a state-of-the-art NRS [9] to improve the recommendation diversity. Despite the robust design introduced in this model [9], a few things make it harder to implement in a real-world setting. First, the learning efficiency of this model is limited in a high-dimensional parameter space. Second, this method assumes only binary feedback because there is no way of directly observing the reward of users' actions.
In this paper, we consider the short-term and long-term interests of the news readers.
In addition, we consider the sequential dependencies among the readers' feedback and also address the diversified readers' interests from their recent feedback. We also include rich side information from the news content in our recommendation model. Different from the previous work, we provide a balance between highly accurate and reasonably diverse news recommendations.
Given a set of readers, a set of news items, and an interaction sequence created in chronological order for each reader who has interacted with the news at a timestamp, the recommendation task is to predict the news item that the reader will interact with next. The notations used in this paper are given in Table 1.
Table 1. Notations used in the paper. The notation covers sets of: (i) readers (r); (ii) news (n); (iii) candidate news (cn); (iv) headlines (h); (v) snippets (s); (vi) taxonomy (t) (tc: category; tsc: subcategory); (vii) word embeddings (we) of h (same for s); (viii) embeddings (e) of tc and tsc; (ix) contextualized embeddings (ce) of h (also s); (x) the reader's click history and the embedding of the reader's click history. All of the above are in a set format, and k, l and m are vector lengths.
Our proposed model D2NN has three components: (i) the news modeling (NM) (Fig. 1), (ii) the reader interest modeling (RIM), and (iii) the news recommendation (NR) components (Fig. 2). We first explain the NM, then the RIM and finally the NR component.
The architecture of the NM is shown in Fig. 1 and discussed below. The input to the NM is the news N. We give the heterogeneous side information (i.e., headline, snippet, category, subcategory) from the news data as inputs to produce a combined news representation. The NM consists of three parallel neural networks (NNs), which respectively take inputs from the headline H, snippet S and taxonomy T to learn the news representations. There are three layers in the NNs for H and S, and two layers for T.
BERT Embedding Layer: The first layer in the NNs for H and S is the embedding layer. The input to this layer is the word sequence from H and S, and the output is the sequence of dense word embedding vectors $we^h$ and $we^s$ respectively. We utilize the token-level embeddings from the BERT model to generate the word embeddings. The advantage of our approach compared to other pretrained embeddings (e.g., GloVe or original BERT) is that we use news-specific data to generate the embeddings. In our preliminary experiments, we used a variety of methods to obtain the word embeddings. We tried word embeddings pretrained on generic corpora (e.g., GloVe, original BERT). We also pretrained the BERT model on the recent Wiki dumps (footnote 2). In addition, we pretrained BERT on the WikiNews dumps (footnote 3) and also on our news datasets. After all these efforts, we found that with BERT trained on news-specific corpora, the performance of news recommendation methods is greatly improved. The first layer in the NN for T is also the embedding layer, which converts the category and subcategory from T into the embeddings $e^{tc}$ and $e^{tsc}$ respectively.
CNN layer: The second layer in the NNs for H and S is the CNN layer, which captures the local contexts (order of words) within H and S. The local contexts are important to learn useful news representations. For example, in the headline "City prepares for the second COVID-19 wave", the local contexts "second" and "wave" are important to infer that this headline belongs to the "COVID-19" event. We apply the CNN with different filters on the input word embeddings $we^h$. The output from the CNN layer is the set of contextualized word embeddings $ce^h$, as shown in Equation 1.
$ce^h_i = \mathrm{ReLU}(K^h \odot we^h_{(i-w):(i+w)} + b^h)$ (1)
Here $\odot$ is the convolution operator, $we^h_{(i-w):(i+w)}$ is the concatenation of the word embeddings from position $(i-w)$ to $(i+w)$, $K^h$ is the kernel and $b^h$ is the bias parameter. The parameter $N_f$ is the number of filters and $2w+1$ is the filter size. The CNN filter on S is applied in the same way.
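As an illustration of the CNN layer described above, the following NumPy sketch slides a window of 2w+1 word embeddings over a headline and applies ReLU to each filter response. It is an illustrative sketch rather than the TensorFlow implementation used in the experiments; the function name `conv1d_relu`, the argument `half_width` (playing the role of w), and zero padding at the sequence borders are assumptions.

```python
import numpy as np

def conv1d_relu(word_emb, kernel, bias, half_width):
    """Windowed convolution over word embeddings followed by ReLU.

    word_emb:   (L, d) one embedding per word of the headline or snippet
    kernel:     (n_filters, (2 * half_width + 1) * d) filter weights
    bias:       (n_filters,) bias per filter
    half_width: number of neighbours taken on each side of a word
    returns:    (L, n_filters) contextualized word embeddings
    """
    seq_len, _ = word_emb.shape
    # Assumption: zero-pad both ends so every word has a full window.
    padded = np.pad(word_emb, ((half_width, half_width), (0, 0)))
    out = np.empty((seq_len, kernel.shape[0]))
    for i in range(seq_len):
        # Concatenate the embeddings from position i-w to i+w.
        window = padded[i:i + 2 * half_width + 1].reshape(-1)
        out[i] = np.maximum(0.0, kernel @ window + bias)
    return out
```

Each output row depends only on a word and its immediate neighbours, which is what allows the layer to pick up local cues such as "second" next to "wave".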
We apply a separate filter on the snippet's word embeddings to produce the context embeddings $ce^s$, as shown in Equation 2.
$ce^s_i = \mathrm{ReLU}(K^s \odot we^s_{(i-w):(i+w)} + b^s)$ (2)
There is no CNN layer in the NN for T.
Attention mechanism for the words: The third layer in the NNs for H and S is the attention layer. We apply word-level attention to learn the related and relevant words from the news. In the headline "Thousands of people died of this COVID-19", the word "thousands" is linked to "people" and the word "this" is related to "COVID-19". Without the attention mechanism, the less significant words (e.g., "thousands" or "this") could easily be ignored. The input to the attention layer is the contextualized representations $ce^h$ and $ce^s$ from the CNN layer, and the output is the representations of the headline $\tilde{h}$ and snippet $\tilde{s}$. We apply the attention on the contextualized representations of h, i.e., $ce^h$, to produce the headline representation $\tilde{h}$, as in Equation 3. Here, $\tilde{h}$ is the summation of the contextualized representations ($ce^h_i$) of all the words in h weighted by the attention weights ($\alpha^h_i$). The attention weight of the $i$-th word in h is given in Equation 4. Here $a^h_i$ is the hidden representation of $ce^h_i$, which assists in deciding the important words in h. $V_h$ and $v_h$ are the projection parameters and $q_h$ is the query vector of h. The snippet representation $\tilde{s}$ is calculated in the same manner, as shown in Equation 5, and its attention weight is shown in Equation 6.
There are limited words in the taxonomy T, so there is no word-level attention for T. The second layer in the NN for T is a dense layer that transforms the embeddings $e^{tc}$ and $e^{tsc}$ into the vector representations $\tilde{tc}$ and $\tilde{tsc}$, as shown in Equations 7 and 8 respectively.
Attention mechanism for the news: After building the NNs for H, S and T, we apply news-level attention on all the learnt news representations. The intuition is that different side information has different usefulness and contributes differently.
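The word-level attention described above can be sketched as additive attention pooling. This is a hedged illustration, not the paper's implementation: the function name `attention_pool` and the argument names are hypothetical, and the softmax normalization of the scores is an assumption, since the attention-weight equations did not survive in this text.

```python
import numpy as np

def attention_pool(ce, V, v, q):
    """Additive attention pooling over a sequence of vectors.

    ce: (L, d) contextualized word vectors from the CNN layer
    V:  (p, d) projection weights
    v:  (p,)   projection bias
    q:  (p,)   query vector
    returns (pooled, weights): a (d,) weighted sum and the (L,) weights
    """
    # One hidden score per word: q^T tanh(V x + v).
    scores = np.tanh(ce @ V.T + v) @ q
    # Assumption: scores are normalized with a numerically stable softmax.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Weighted sum of the input vectors.
    return weights @ ce, weights
```

The same routine could be reused for the news-level attention by stacking the headline, snippet, category and subcategory representations as the rows of `ce`, provided they share one dimensionality.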
For example, if the snippet is less informative than the headline, then it should be weighed less, and the headline should be weighed more. So, we take each type of side information independently and weigh each piece of information accordingly. The final news representation $\tilde{n}$ is the summation of $\tilde{h}$, $\tilde{s}$, $\tilde{tc}$ and $\tilde{tsc}$ weighted by their attention weights, as in Equation 9. The attention weight $\alpha_h$ is given in Equation 10, where $a_h = q_h^{\top} \tanh(V_h \tilde{h} + v_h)$. The attention weights $\alpha_s$, $\alpha_{tc}$ and $\alpha_{tsc}$ are calculated in a similar way.
The CNN and attention layers are also found in DKN [8], LSTUR [2] and NAML [15]. However, our network structure is different. Firstly, we include more side information (DKN [8] and LSTUR [2] include the titles and topics). Although NAML [15] includes the news body, the dataset in [15] is quite limited, as it includes only one month of user logs. In contrast, one of our datasets (NYTimes) ranges over a longer period of time (two years of data). This amount of data is suitable for modeling the temporal dynamics in readers' interests. Secondly, we generate the word embeddings by pretraining BERT on news-specific data (they use embeddings pretrained on general data). And lastly, we consider the sequential readers' clicks (DKN considers the readers' current interests, whereas LSTUR and NAML consider readers' interests without considering the diversified interests within sessions).
The RIM is a sequential model that learns the readers' representations from the click history. We have two components in this module: the long-term interest (LTI) and the short-term interest (STI) modules. We also have a diversity-aware interest modeling (DIM) part within STI to introduce diversity in our model. The architecture of the RIM is shown in Fig. 2, with the following parts:
Long-term interest (LTI) module: The LTI captures the readers' long-term interests. It simply takes as input the whole sequential click history of a reader and sums the embeddings of the reader's click history.
The output from LTI is the sequential representation of the reader's long-term interests, $\tilde{r}^{lt}$, as shown in Equation 11, where $e_i$ is the embedding of the $i$-th clicked news and $k$ is the length of the click history.
$\tilde{r}^{lt} = \sum_{i=1}^{k} e_i$ (11)
Short-term interest (STI) module: The STI captures the readers' short-term interests. It takes as input the reader's click history and outputs the short-term sequential representation $\tilde{r}^{st}$. The output $\tilde{r}^{lt}$ from LTI is sent to the LSTM network, and the LSTM steps in Equation 12 produce the short-term sequences. Step (iv) of Equation 12, the cell-state update, is $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$, where $f_t$ and $i_t$ are the forget and input gates and $\tilde{c}_t$ is the candidate cell state. The short-term representation is the last hidden state (step (vi)) of the LSTM, i.e., $\tilde{r}^{st} = h_t$.
Diversity-aware interest modeling through attention: As hypothesized earlier, diversity is intrinsically reflected in the readers' short-term interests. Generally, the readers' clicks among consecutive sessions are repetitive, e.g., a reader reading about COVID-19 in successive sessions. Due to these repeated clicks in sessions, the recommendations that are produced for these sessions are also similar. So, it is important to attend to the readers' different clicks during different time steps for diversity. The short-term sequences coming from the STI do not consider the variations in a reader's behavior. Inspired by the diversity-based attention in [32], we also use attention to introduce diversity in our model. The diversity-based attention in [32] is applied to a content-based approach. Different from [32], we apply the diversity to the user (reader) modeling task. We apply the attention on the short-term sequences $\tilde{r}^{st}$ in the DIM part. We denote the attention weight of the $i$-th clicked news in $\tilde{r}^{st}$ as $\alpha_i$, which is calculated by evaluating the importance of the interaction between the clicked news ($h_i$) at time t and the clicked news representation $\tilde{n}$, as shown in Equation 13, where $a_h = q_h^{\top} \tanh(V_h \tilde{n} + v_h)$.
The final diversified representation $\tilde{r}^{d}$ is the summation of the clicked news representations weighted by their attention weights, as shown in Equation 14. We combine the reader's long-term ($\tilde{r}^{lt}$) and short-term diversified ($\tilde{r}^{d}$) representations to produce the final reader representation $\tilde{r}$. By doing so, we provide a seamless integration of a reader's long-term and short-term diversified interests.
The NR component predicts the click probability of a reader based on the click history. We compute the click probability score of a reader r clicking the candidate news cn, denoted $\hat{y}(r, cn)$, by taking the dot product of their representations $\tilde{r}$ and $\tilde{cn}$, as shown in Equation 15.
We use negative sampling [2] to train the model. We train our model using positive samples (the reader's observed clicks) and negative samples (unobserved history from the same session) with a ratio of 1:5. Each training sample has a clicked news, a candidate news and a label $\bar{y}$ ($\bar{y} = 1$ is positive, $\bar{y} = 0$ is negative), with an estimated probability $\hat{y} \in [0, 1]$ for each click. We minimize the loss for each sample using the negative log-likelihood function to train the model, as shown in Equation 16. Here $S^{+}$ is the positive sample set and $S^{-}$ is the negative sample set.
We take the reader's whole click history to train LTI. To train STI, we create time-ordered sessions (sequences) in the reader's click history, with both positive and negative samples. We use padding to fill up the shorter readers' sequences. We also pad the headline and snippet inputs of shorter length.
We evaluate our proposed model on two real-world news datasets:
New York Times (NYTimes): we collected the news articles and the anonymized readers' interactions using the NYTimes API. The interactions on articles were retrieved with respect to the timeline of the news data. A sample of the dataset can be accessed here (footnote 4).
MIND: we use a large-scale benchmark news dataset [33] consisting of anonymized behavior logs from Microsoft News. It can be accessed here (footnote 5).
The original MIND dataset has about 1 million readers with 15 million clicks on 160k articles, and we use MINDsmall (a smaller version) in the experiments.
Both datasets consist of English news articles. For both datasets, we generated the training samples from the click histories and impression logs according to the format given in the MIND paper [33]. An impression log records the news articles that the reader visits or clicks at a specific time [33]. The basic statistics for both datasets are shown in Table 2.
We use leave-one-out evaluation for the next-item recommendation. We chronologically sort each reader's interactions and hold out the last item in the sequence as the test set, the second-last interaction as the validation set and the rest as the training set.
Evaluation Metrics: We evaluate all models on the following evaluation metrics:
Root Mean Square Error (RMSE): to evaluate the prediction accuracy, we use RMSE [34]. Typically, RMSE is used in a recommender system to evaluate the difference between the predicted rating (estimating the user's like or dislike) and the known rating given by the user.
Area Under Curve (AUC): as the problem defined in this work is a click prediction problem, we also use the AUC [34]. Typically, the AUC is the area under the ROC (Receiver Operating Characteristic) curve for classification problems, and a higher value of AUC means that the recommendations are better.
Normalized Discounted Cumulative Gain (NDCG): to measure the recommendation and ranking accuracy, we use NDCG@k [35]. Typically, an item is viewed as either relevant or not relevant by other ranking measures (e.g., precision, recall), whereas items can have degrees of relevance. NDCG uses graded relevance to rank each item. Compared to the F1-score [35], which is also commonly used to measure the recommendation accuracy, NDCG additionally considers the ranking order of the positive results (good recommendations) in the top-k list.
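The leave-one-out protocol and NDCG@k described above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the function names are hypothetical, click histories are assumed to be lists of (timestamp, news_id) pairs, and the common exponential-gain form of DCG is assumed.

```python
import numpy as np

def leave_one_out_split(clicks):
    """Split one reader's clicks into train / validation / test.

    clicks: list of (timestamp, news_id) pairs (hypothetical format).
    The last click is the test item, the second-last the validation item.
    """
    ordered = [news_id for _, news_id in sorted(clicks)]
    if len(ordered) < 3:
        raise ValueError("need at least three interactions per reader")
    return ordered[:-2], ordered[-2], ordered[-1]

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k ranked relevances."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # Position i (0-based) is discounted by log2(i + 2).
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded relevance of the recommended items in ranked order, so moving a clicked item down the list lowers the score even when precision and recall are unchanged.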
It is important for an NRS to place the good results in higher positions because such news items have better chances of being selected, especially when the readers do not have the patience to go through all k results. In this work, we focus on NDCG to evaluate the ranking performance.
Diversity (DIV): to calculate the diversity, we compute the individual diversity as the average dissimilarity [6] of all the pairs of items in a user's recommended list at a specific cut-off value (k). Specifically, we calculate the recommendation diversity DIV@k using the intra-list similarity (ILS) [6, 34].
Tradeoff: we define a tradeoff metric that measures the degree of tradeoff between accuracy (the mean NDCG) and diversity (the mean DIV) as: tradeoff = 2 * accuracy * diversity / (accuracy + diversity). These mean scores are calculated over all k values. We consider four k values: 5, 10, 20, and 50 for all metrics except RMSE and AUC. Our general assumption for choosing these evaluation metrics is that the readers will be more satisfied if the recommended results are accurate (matching their interests) and diverse (giving something new).
Task Settings: We pretrain BERT on the Wikinews dumps for the word embeddings. We implement our models using TensorFlow on the GPUs provided by Google Colab Pro (footnote 6). For the hyperparameters, we consider: [...]
In our work, we want to compare our model with the baselines in terms of accuracy (prediction accuracy, recommendation accuracy, ranking accuracy), diversity and, most importantly, tradeoff. The expectation (based on our claimed contribution) is that our tradeoff score should be the best, accuracy should also be high, and diversity should be reasonable.
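The diversity and tradeoff metrics defined above can be illustrated with a short sketch. The tradeoff function follows the stated formula directly; the diversity function is an assumption-laden example, since the similarity measure is not specified here: it uses one minus cosine similarity over hypothetical, non-zero item vectors, and the names `div_at_k` and `tradeoff` are illustrative.

```python
from itertools import combinations

import numpy as np

def div_at_k(item_vectors, k):
    """Average pairwise dissimilarity of the top-k recommended items.

    item_vectors: (n, d) one non-zero feature vector per recommended
                  item, in ranked order (hypothetical representation).
    Assumption: dissimilarity is 1 - cosine similarity.
    """
    vecs = np.asarray(item_vectors, dtype=float)[:k]
    if len(vecs) < 2:
        return 0.0
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    pairs = combinations(range(len(unit)), 2)
    return float(np.mean([1.0 - unit[i] @ unit[j] for i, j in pairs]))

def tradeoff(accuracy, diversity):
    """Harmonic mean of accuracy (mean NDCG) and diversity (mean DIV)."""
    if accuracy + diversity == 0:
        return 0.0
    return 2.0 * accuracy * diversity / (accuracy + diversity)
```

Because the tradeoff is a harmonic mean, a model cannot obtain a high score by excelling on only one of the two objectives: a very low value of either accuracy or diversity pulls the result down.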
We use a mix of baselines in our experimentation and group them into three categories: (i) NRS, (ii) general DL models, and (iii) sequential models, as discussed below.
NRS as baselines: The first group of baselines consists of NRS:
DKN: Deep Knowledge-Aware Network for News Recommendation [8] is an NRS that fuses the semantics and knowledge from the news for the click prediction problem.
LSTUR: Neural News Recommendation with Long- and Short-term User Representations [2] is an NRS that learns the news and readers' (long-term, short-term) representations for the click prediction task. It has another variant, LSTUR-ini, with only the short-term module.
The second group consists of general deep learning recommenders:
DMF: Deep Matrix Factorization Models for Recommender Systems [36] is a CF model that learns the similarities among the users and items to predict the rankings of items.
CDAE: Collaborative Denoising Auto-Encoders for Top-N Recommender Systems [37] is another CF model that uses the idea of denoising auto-encoders to model users' preferences with implicit feedback.
Our third group consists of sequential (session-based) recommenders:
GRU4Rec+: A Gated Recurrent Unit for Recommendation with Improved Version [5] is a recommender system that uses RNNs to model the user-item interaction sequences for the session-based task. Compared to the simple GRU4Rec [22], GRU4Rec+ [5] adopts a different loss function and a different sampling strategy.
SASRec: Self-Attentive Sequential Recommendation [38] is a self-attention based sequential model for the task of next-item recommendation. It captures the entire user sequence to predict the next item.
SRGNN: Session-Based Recommendation with Graph Neural Networks [39] aims to predict a user's next action based on the session information.
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer [26] adapts the BERT architecture based on the Transformer model for sequential recommendations.
It captures the user's sequential behavior for the task of click prediction. Among the baselines, DKN, LSTUR, SASRec, SRGNN and BERT4Rec are attention-based. We also share the implementation details of our baselines in this link 4 .

Results and Analyses

The comparison between our model and all the baselines is shown in Table 3 (bold is the best score). We discuss the results on both datasets (NYTimes and MIND); where the dataset makes a difference, we highlight the point explicitly. Overall, Table 3 shows that our D2NN model has the highest prediction accuracy (the lowest RMSE) and the highest click prediction accuracy (the highest AUC). We also get the highest recommendation accuracy (the highest NDCG for all k values on NYTimes and the highest NDCG for k = 20 and 50 on MIND). We achieve a reasonable level of diversity (medium-level DIV) for all k values. The tradeoff score of our model is the highest among all models, which shows that we achieve the right balance between high accuracy and reasonable diversity. Overall, we get better performance on NYTimes. This is probably because our NYTimes dataset covers a longer range of readers' sequences and news items. Our D2NN method is designed to better model the sequential information from the news and the readers' long-term and short-term interests. The MIND dataset is a short-sequence dataset, which is better modeled by recommenders that focus more on short-term user modeling. However, regardless of the nature of the datasets, our D2NN model shows the best tradeoff score (and the best score on most of the accuracy measures) on both. Among the baselines, we find that the overall performance of the NRS baselines is better than that of the general methods. This can be seen in the higher tradeoff scores of DKN and LSTUR compared to the other baselines.
This is because these NRS methods were designed from the beginning to learn the news and reader representations, whereas the general methods need this information to be provided explicitly. We also find that the NRS methods generally perform better than the CF models (DMF and CDAE). CF suffers from the data sparsity problem, which is why these methods have relatively lower recommendation accuracy. CF also suffers from an inherent popularity bias, which explains the lower diversity. This indicates that we should go beyond rating-only CF by adding content-based, contextual and attention-based solutions in an NRS. We observe that the non-sequential models, in general, perform better than the sequential models, particularly on the accuracy and tradeoff scores. This is probably because a pure sequential recommender system usually treats the reader-item interactions as strictly ordered sequences. However, there is considerable uncertainty in the reading behavior in an NRS that cannot be captured by pure sequential models. We also find that the overall performance of the attention-based methods is better than that of the traditional neural recommenders. This is seen in the relatively better performance of the attention-based NRS methods compared to the general methods without attention. The accuracy of GRU4Rec+ (non-attention) is also lower than that of the other sequential recommenders. This result indicates the effectiveness of the attention mechanism in focusing on details and selecting the important information for the readers. Although the non-sequential recommenders are in general better than the sequential ones in terms of accuracy, the sequential recommender GRU4Rec+ has the highest diversity among all the models on both datasets. More generally, the sequential models show higher diversity on both datasets. It is important to mention here that the highest diversity does not mean the best performance.
This higher diversity level is achieved with a big loss in accuracy. In an NRS, we would never want to achieve high diversity at the cost of a big loss in accuracy [31]. In the subsequent results (Table 4), we find that the diversity of one of our model variants is comparable (very close) to the highest diversity of GRU4Rec+, which shows that by adjusting our model components, we can achieve high diversity if that is the goal. In the following experiments, we show the results on AUC and NDCG, which are considered standard evaluation metrics for recommendation accuracy [33]. We also show the DIV and the tradeoff score. The results of the D2NN variants show similar patterns on both datasets, so we report the results on the NYTimes dataset only.

We perform an ablation study to analyze the impact of different components in our model. Table 4 shows the performance of our default D2NN method and its variants. D2NN consists of LTI, STI, s (snippet), t (taxonomy) and h (headline). Our model variants are named as follows: D2NN-LTI (D2NN with only LTI) and D2NN-STI (D2NN with only STI). When we remove some side information, we use a minus sign as the superscript, e.g., D2NN-STI(st⁻) is D2NN-STI without s and t. We observe similar patterns on NDCG and DIV for all k values, so we show their averages (i.e., the mean NDCG and the mean DIV).

We demonstrate the impact of modelling readers' long-term, short-term and combined interests in our D2NN model. As shown in Table 4, the overall accuracy and diversity scores of D2NN-STI (modeling short-term and diverse interests) are higher than those of D2NN-LTI (modeling only long-term interests). This is because a typical news dataset contains more short-term users; without considering short-term interests, the model performance drops. We also see that our sequential model with LSTM and attention outperforms the Transformer (sequence-to-sequence) model in modeling the readers' short-term interests.
This is shown by the better accuracy and diversity of our D2NN-STI (Table 4) compared to the scores of BERT4Rec (Table 3). The LSTM can model the readers' short-term interests in different shorter time ranges, whereas the attention can model the related and diversified readers' interests. Our model also has far fewer parameters. Because news reading is a sequential process (usually left to right) that is better captured by RNNs, there is no need to consider both directions as in Transformers.

D2NN-STI has a built-in diversity-aware module, DIM, with the reader-level attention. We explore the impact of removing the reader-level attention from the model. The variant D2NN(r⁻) refers to D2NN without the reader-level attention. As shown in Table 4, the model accuracy drops when we remove the reader-level attention. Since the primary purpose of the reader-level attention in STI is to introduce diversity, removing it from the STI also reduces the model diversity. We also see that the diversity of D2NN(r⁻) is close to that of D2NN-LTI. This is expected, as D2NN-LTI has no reader-level attention by default. Overall, we see that it is necessary to include both STI and LTI, which is why the default D2NN performs the best.

We demonstrate the efficacy of our D2NN model with respect to including the side information in the NM component. The results in Table 4 show that D2NN with all the side information is more competitive than the models that exclude one or more pieces of the side information. This is shown by the overall better accuracy of D2NN compared to D2NN(s⁻), which in turn is better than D2NN(st⁻). We also demonstrate the impact of removing the side information in our model variants D2NN-LTI and D2NN-STI, and the results are: (i) when we remove only the snippet from both variants, the overall accuracy of both models gets lower.
This is expected, since we can learn more about readers when we have more information about the clicked news; and (ii) when we remove the snippet and taxonomy from both variants, the performance of D2NN-STI increases and that of D2NN-LTI decreases. This shows that headlines alone are probably enough to make recommendations when the reader sequences are typically short. The lower performance of D2NN-LTI(st⁻) is mainly because the news categories are skipped. The news taxonomy browsed by the reader has a decisive influence on the long-term behavior. Finally, combining all the side information improves the performance, as seen in the best performance of D2NN.

We also explore the impact of different levels of attention in the NM component. D2NN(w⁻) refers to D2NN without the word-level attention, and D2NN(n⁻) is D2NN without the news-level attention. We see in Table 4 that the model performance drops when we remove the word-level attention. This shows that word-level attention is important for capturing the relatedness of the words. The performance drops (slightly) more when we remove the news-level attention. This shows the usefulness of news-level attention in learning different side information for the news representation. Overall, we see that both attention levels (word, news) are very important. This is demonstrated by the better scores of the default D2NN.

We study the effectiveness of including self-attention in our model. We make two changes in D2NN: (i) we replace the CNN layer in the NM with self-attention; and (ii) we replace the attention in DIM with multi-head self-attention. The goal is to see if we can perform better with multiple heads. We produce the contextualized representations of the headline h using self-attention, as in Equation 17. Here, $h^{h}_{i,k}$ is the multi-head representation of the i-th word in the headline h under the k-th attention head. The attention weight $\alpha^{k}_{i,j}$ is shown in Equation 18. Here, $\alpha^{k}_{i,j}$ refers to the relative importance of the interaction between the i-th and j-th words in the k-th self-attention head. Two projection parameters are used in the k-th head, and $h^{h}_{i}$ aggregates the ℏ attention heads, i.e., $h^{h}_{i} = [h^{h}_{i,1}, h^{h}_{i,2}, \ldots, h^{h}_{i,\hbar}]$. The context representation for the snippet s is calculated in the same manner. We also learn the representation of the reader who clicked the news in the same way.

Multi-head attention is particularly useful in scenarios where we want to learn different types of information from text [12]. For example, in the news "Ontario identify new COVID-19 cases, with a high number", the word 'COVID-19' interacts with many words, i.e., 'new', 'cases' and 'number'. For this kind of problem, where a concept must be broken down word by word, multiple heads can be used to absorb knowledge from each word. In this experiment, we vary the number of heads in the self-attention layer and report the results for 3, 5, 10 and a maximum of 16 heads (a maximum of 16 is also used in a previous work on multi-heads [40]). We compare these D2NN variants with our original D2NN that has vanilla (additive) attention. We show the effect of multiple heads on our model performance in Fig. 3. The results in Fig. 3 show that the default D2NN is better considering its overall performance, especially in terms of AUC, NDCG and tradeoff. D2NN(3ℏ) shows a marginally higher tradeoff than D2NN, which is probably related to its higher diversity. We also observe a trend as the number of heads in these variants increases or decreases. We find that as we increase the number of heads from 3ℏ to 5ℏ, the model accuracy increases, but then the accuracy drops when we further increase the heads to 10ℏ and 16ℏ. D2NN(16ℏ) has the lowest accuracy. In terms of diversity, we see the opposite trend to accuracy. The variants with lower accuracy show increased diversity.
This indicates that there exists a negative correlation between the accuracy and the diversity. Besides the recommendation tasks, we also examine the effect of attention heads on model efficiency. Similar to the results reported in [40], we find that pruning half of the heads could result in better model efficiency. In our experiment, we also find that the attention heads beyond a certain number become an overhead for the model. Our original D2NN does not use multiple heads and still achieves good performance.

We use BERT for the news language modeling (representation) task. We pre-train BERT on WikiNews 3 dumps (same timeline as our datasets), then we fine-tune BERT on our datasets. We use only the token-level embeddings from BERT. We compare our model using these language models: (i) the (original) BERT pretrained on Wikipedia and Books; (ii) our own trained BERT-News; and (iii) the GloVe 1 model. The results in Fig. 4 show the overall best performance of our model with BERT-News. The performance of the (original) BERT and the GloVe models is relatively lower. This is probably because these models might have missed many domain-specific words in the embeddings. The embeddings from these pretrained GloVe and (original) BERT models also contain some noise, which is why we see an unexpected rise in diversity. Overall, the result shows that domain-specific language models are more useful for understanding news articles.

We also explore the impact of various important hyperparameters in the experiments. While analyzing one hyperparameter, the remaining hyperparameters are fixed at their optimal settings. Here, we report the broad findings for various hyperparameters, as discussed below.

Negative:Positive samples: We try different ratios of negative to positive samples in our model. The result shows that too few negative samples do not represent readers' interests sufficiently, while too many negative samples bring noise.
We find that when we use too few negative samples (e.g., fewer than 5) for training, the model performance drops. This is because we rely too much on positive implicit feedback (clicks) and produce similar recommendations with little consideration for other items. However, when we include too many negative samples, the model performance drops again. This is probably because the excess negative samples introduce noise into the model. Therefore, a ratio of 1:5 (one positive to five negative samples) is most appropriate in our case. The result shows that negative samples are helpful for learning useful news across various topics.

Embedding size: The original BERT embedding size is 768, which is too large, particularly when we want to use word-level embeddings only. When we use the original embedding vector of BERT, the model performance drops. So, we reduce the embedding size and test sizes from 50 to 300. We find that the model performance increases up to an embedding size of 300. Beyond that, the model performance drops again with larger dimensions (greater than 300). This shows that a too small (<50) or too large (>300) dimension reduces the model performance in our experiments. A size of around 300 is best for an NRS to represent the semantic information from the headline and snippet.

Sequence length: We also explore the sequence length of a reader's click history and find that a too small or too large sequence length results in poor model performance. A too large sequence length results in the reader's short-term interests being ignored, whereas a too small sequence length cannot capture the patterns in the reader's historical records. Therefore, we choose an intermediate sequence length for training our model. We test sequence lengths of a reader's clicks from 10 to 300 and find that a length of ~100 reasonably reflects the reader's interests in the news recommendation problem.
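The sampling ratio and the sequence-length cut-off discussed above can be illustrated with a minimal sketch of how training examples might be assembled. The text does not specify how negative samples are drawn, so uniform sampling from non-clicked candidate news is an assumption made here, and the helper names `sample_negatives` and `build_training_examples` are hypothetical. The defaults of five negatives per positive and a history of at most 100 clicks follow the values reported in the text.

```python
import random


def sample_negatives(clicked, candidates, n_neg=5, rng=None):
    """Draw up to n_neg candidate items that the reader never clicked.

    Assumption: negatives are sampled uniformly without replacement.
    """
    rng = rng or random.Random()
    pool = [c for c in candidates if c not in clicked]
    return rng.sample(pool, min(n_neg, len(pool)))


def build_training_examples(history, candidates, n_neg=5, max_len=100, seed=0):
    """Pair each click with sampled negatives and a truncated click context.

    history    : news ids clicked by one reader, oldest first
    candidates : all news ids available for sampling
    max_len    : keep only the most recent max_len earlier clicks as context
    """
    rng = random.Random(seed)
    clicked = set(history)
    examples = []
    for t in range(1, len(history)):
        context = history[max(0, t - max_len):t]  # most recent clicks only
        positive = history[t]                     # the next click to predict
        negatives = sample_negatives(clicked, candidates, n_neg, rng)
        examples.append((context, positive, negatives))
    return examples


# Usage with toy news ids.
history = ["n1", "n2", "n3"]
candidates = [f"n{i}" for i in range(1, 21)]
examples = build_training_examples(history, candidates)
```

Keeping the ratio and the history length as explicit parameters makes it straightforward to repeat the sensitivity analysis described above by sweeping `n_neg` and `max_len`.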
Miscellaneous: We also test different RNN variants (e.g., GRU versus LSTM), the number of LSTM gates, with or without the layer normalization and the dropout layer, the number of filters and the context window size in the CNN layer, the batch size, and the number of layers. With all these experiments, we find that our current setup is the best for achieving our goals for an NRS.

In this paper, we propose a deep neural network to provide timely, highly accurate and reasonably diverse news recommendations by predicting the reader's next click. We learn the news representations by incorporating multiple features from the news. We learn the reader's long-term interests from the whole click history, and the short-term and diversified interests from the recent clicks. We trade off between high accuracy and reasonable diversity. We apply different attention levels to learn useful news and reader representations. We conduct extensive experiments on two news datasets to demonstrate our work. In the future, we would like to include more reader feedback and address issues such as missing negative implicit feedback in NRS. We would also like to include position encoding techniques to address the temporal order of news articles and reader clicks. We plan to conduct a user study on whether our proposed method can indeed improve measures such as the click-through rate.
[1] News Recommender System Considering Temporal Dynamics and News Taxonomy
[2] Neural news recommendation with long- and short-term user representations
[3] Modeling and broadening temporal user interest in personalized news recommendation
[4] Online social media recommendation over streams
[5] Recurrent neural networks with top-k gains for session-based recommendations
[6] Improving recommendation lists through topic diversification
[7] Fast greedy map inference for determinantal point process to improve recommendation diversity
[8] DKN: Deep knowledge-aware network for news recommendation
[9] DRN: A deep reinforcement learning framework for news recommendation
[10] Overcoming accuracy-diversity tradeoff in recommender systems: A variance-based approach
[11] Deep attention neural network for news recommendation
[12] Pre-training of deep bidirectional transformers for language understanding
[13] A convolutional neural network for modelling sentences, ArXiv Prepr
[14] Attention is all you need
[15] Neural news recommendation with attentive multi-view learning
[16] Time weight collaborative filtering
[17] Collaborative filtering with temporal dynamics
[18] Recommender system based on temporal models: A systematic review
[19] Sequence-aware recommender systems
[20] A Survey on Session-based Recommender Systems, 2020
[21] Reinforcement Learning based Recommender Systems: A Survey
[22] Session-based recommendations with recurrent neural networks, ArXiv Prepr
[23] CHAMELEON: a deep learning meta-architecture for news recommender systems
[24] Graph neural networks for social recommendation
[25] Graph neural news recommendation with long-term and short-term interest modeling
[26] Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer
[27] Diversification and refinement in collaborative filtering recommender, Int. Conf. Inf. Knowl. Manag. Proc
[28] A new collaborative filtering approach for increasing the aggregate diversity of recommender systems
[29] Promoting diversity in recommendation by entropy regularizer
[30] DiABlO: Optimization based design for improving diversity in recommender system
[31] A Regularized Model to Trade-off between Accuracy and Diversity in a News Recommender System
[32] Diversity driven attention model for query-based abstractive summarization
[33] Proc of the 58th Annual Meeting of the Association for Computational Linguistics
[34] How good your recommender system is? A survey on evaluations in recommendation
[35] Recommender systems: Sources of knowledge and evaluation metrics
[36] Deep matrix factorization models for recommender systems
[37] Collaborative denoising auto-encoders for top-n recommender systems
[38] Session-based recommendation with graph neural networks
[39] Session-based recommendation with graph neural networks
[40] Are sixteen heads really better than one?

This work is partially sponsored by the Natural Sciences and Engineering Research Council of Canada (grant 2020-04760).