title: Unsupervised Ensemble of Ranking Models for News Comments Using Pseudo Answers
authors: Fujita, Soichiro; Kobayashi, Hayato; Okumura, Manabu
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_17

abstract: Ranking comments on an online news service is a practically important task, and thus there have been many studies on it. Although ensemble techniques are widely known to improve the performance of models, there has been little research on ensembling neural ranking models. In this paper, we investigate how to improve performance on the comment-ranking task by using unsupervised ensemble methods. We propose a new hybrid method composed of an output selection method and a typical averaging method. Our method uses a pseudo answer, represented by the average of multiple model outputs. The pseudo answer is used to evaluate the model outputs via ranking evaluation metrics, and the results are used to select and weight the models. Experimental results on the comment-ranking task show that our proposed method outperforms several ensemble baselines, including a supervised one.

User comments on online news services can be regarded as useful content, since readers can see other users' opinions related to each news article. Many online news sites rank comments by the amount of positive user feedback each comment receives, such as "Like"-button clicks, and preferentially display popular comments to readers. However, this type of user feedback is not suitable for assessing comment quality, because the measurement is biased by where a comment appears [7]: earlier comments tend to receive more feedback because they are displayed at the top of the page. In an attempt to solve this problem, several studies have introduced specific aspects of comment quality to focus on, e.g., constructiveness [7, 13] or persuasiveness [22]. In particular, Fujita et al. [7] proposed a new dataset for ranking comments directly according to comment quality. This is a difficult task because there are various criteria for judging whether a comment is good. For example, comments can describe rare user experiences, provide new ideas, or spark discussions. Ranking models often fail to capture such information.

According to recent studies [2, 12, 15], ensemble techniques are widely known to improve the accuracy of machine learning models. These techniques can be roughly divided into two types: averaging and selecting. Averaging methods such as Naftaly et al. [17] simply average multiple model outputs. Selecting methods such as majority vote [15] select the most frequent label among the predicted labels of multiple classifiers in post-processing. Both kinds of methods help models compensate for each other's mistakes and thereby improve the results. Recently, Kobayashi [12] proposed an unsupervised ensemble method, post-ensemble, based on kernel density estimation, which extends majority voting to text generation models. He showed that this method outperformed averaging methods on a text summarization task.

In this paper, we propose a new unsupervised ensemble method, HPA, which is a hybrid of an output selection method and a typical averaging method. In typical averaging methods, a low-accuracy model can act merely as noise. A simple denoising method is to statically remove such low-accuracy models [19].
However, there is rarely a model that fails on every input, particularly among neural models with the same architecture; in general, each model has its own strengths and weaknesses. Therefore, our method adopts dynamic denoising of outputs via a provisional averaging result. We use the provisional averaging result as a pseudo answer. Each predicted ranking is compared to the pseudo answer via a similarity function, and the similarity scores are used for selecting and weighting models. We adopt ranking evaluation metrics as similarity functions to specialize the method for the ranking task. In experiments on a task of ranking constructive news comments, our proposed method HPA outperformed both previous unsupervised ensemble methods and a simple supervised ensemble method. Furthermore, we found that one of the evaluation metrics is useful as a similarity measure in the ensemble process.

Comment Ranking Task: Let an article be associated with comments $C = (c_1, \dots, c_n)$. Each comment has a manually annotated score, $S = (s_1, \dots, s_n)$, such as the degree of comment quality. A ranking model $m$ learns a scoring function $\tilde{s}_i = m(c_i)$. We regard a predicted score sequence $r = (\tilde{s}_1, \dots, \tilde{s}_n)$ as a ranking of the comments, because a ranked comment sequence can be generated from this score sequence.

Ensemble Problem: We prepare $N$ rankings $R = (r_1, \dots, r_N)$ from ranking models $M = (m_1, \dots, m_N)$. The goal of the ensemble is to combine the ranking models to produce a better ranking than any of the individual ranking functions. A simple averaging method calculates the average of the comment scores, i.e., $r^* = \frac{1}{|R|} \sum_{r \in R} r$.

We introduce PostNDCG, which applies the post-ensemble method [12] to the ranking task. Post-ensemble is an unsupervised ensemble method based on kernel density estimation for sequence generation. It compares the similarity between model outputs and selects the majority-like output, i.e., the one most similar to the other outputs. This selection is equivalent to selecting the output whose estimated density is the highest among the outputs. PostNDCG calculates the scoring function $f(r) = \frac{1}{|R|} \sum_{r' \in R} \mathrm{sim}(r, r')$, where $\mathrm{sim}(r, r')$ represents the similarity between $r$ and $r'$. The final ranking of PostNDCG is defined as $r^* = \mathrm{argmax}_{r \in R} f(r)$. We used the normalized discounted cumulative gain (NDCG@k) [1] as the similarity function $\mathrm{sim}(\cdot)$ to compare rankers.

We propose a Hybrid method using the Pseudo Answer (HPA). Figure 1 illustrates an example of HPA: HPA selects the top three rankings $\{r_2, r_3, r_5\}$ that are nearest to the pseudo answer, and then weights each selected ranking via a scoring function based on the pseudo answer. The concept of HPA is to denoise outputs via a pseudo answer $\bar{r}$, represented by the average of the model outputs after L2 normalization: $\bar{r} = \frac{1}{|R|} \sum_{r \in R} \frac{r}{\|r\|}$. The scoring function $g$ is calculated as the similarity between the predicted ranking and the pseudo answer: $g(r) = \mathrm{sim}(r, \bar{r})$. HPA then selects the top $k$ models with the highest scores. The final ranking $r^*$ is given by $r^* = \sum_{r \in \hat{R}} g(r) \cdot r$, where $\hat{R}$ is the set of selected rankings.
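To make the procedure concrete, the following is a minimal Python/NumPy sketch of HPA under the definitions above. It is our own illustration rather than the authors' released code; the function names (ndcg_at_k, hpa_rank) and the use of NumPy are assumptions, and each ranking is represented as a vector of predicted scores over the same set of comments.

```python
import numpy as np

def ndcg_at_k(pred_scores, ref_scores, k=10):
    """NDCG@k of a predicted score vector, using ref_scores as the reference ranking."""
    order = np.argsort(-pred_scores)[:k]               # indices of the predicted top-k
    discounts = np.log2(np.arange(2, len(order) + 2))  # log2(i+1) for positions 1..k
    dcg = np.sum(ref_scores[order] / discounts)
    ideal = np.sort(ref_scores)[::-1][:k]              # best possible top-k gains
    idcg = np.sum(ideal / discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0

def hpa_rank(rankings, k_models=3, sim=ndcg_at_k):
    """Hybrid ensemble with a Pseudo Answer (HPA).

    rankings: list of 1-D score vectors, one per model, all over the same comments.
    """
    normed = [r / np.linalg.norm(r) for r in rankings]   # L2-normalize each output
    pseudo = np.mean(normed, axis=0)                     # pseudo answer r-bar
    g = np.array([sim(r, pseudo) for r in normed])       # similarity to the pseudo answer
    selected = np.argsort(-g)[:k_models]                 # keep the top-k models
    return sum(g[i] * normed[i] for i in selected)       # weighted combination = final ranking
```

For instance, with the 100 trained models described later, rankings would hold the 100 predicted score vectors for one article's comments, and sorting the comments by the returned vector gives the ensemble ranking.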
Dataset: We used a dataset for ranking constructive comments on Japanese articles in Yahoo! News, prepared by Fujita et al. [7]. The dataset consists of triplets of an article title, a comment, and a constructiveness score. The constructiveness score (C-score) is defined as the number of crowdsourced workers, out of 40, who judged a comment to be constructive; it is therefore an integer ranging from 0 to 40. In this research, 130,000 comments from 1,300 articles were used as training data, 11,300 comments from 113 articles as validation data, and 42,436 comments from 200 articles as test data. In the training and validation data, 100 comments were randomly sampled from each article, whereas the test data contains all comments of each article, to simulate an actual service environment.

Preprocessing: We used the morphological analyzer MeCab [14] with the neologism dictionary NEologd [20] to split Japanese text into words. We replaced numbers with a special token and standardized letter types from half-width to full-width. We did not remove stop-words, because function words can affect performance in our task. We cut off low-frequency words that appeared three times or fewer in each dataset.

Ranking Model: We used RankNet [1], a well-known pairwise ranking algorithm based on neural networks. Given a pair of comments $c_1$ and $c_2$ on an article $q$, RankNet solves a binary classification problem of whether or not $c_1$ has a higher score than $c_2$, where the score indicates whether the comment is of high quality. We adopted an encoder-scorer structure for RankNet. The encoder consisted of two long short-term memory (LSTM) instances with 300 units that separately encode a comment and its article title. The scorer predicted the ranking score of the comment via a fully connected layer applied to the concatenation of the two encoded (comment and title) vectors. We used pre-trained word representations as the encoder input, obtained from a skip-gram model [16] trained on 1.5 million unlabeled news comments. We used the Adam optimizer ($\alpha = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$) to train these models. The dimensions of the hidden states of both the title and comment encoders were 300. In the experiments, we trained 100 different models with random initialization for the ensemble methods.

Evaluation Metrics: We used the normalized discounted cumulative gain (NDCG@k) [1]. NDCG@k is calculated over the top-k comments ranked by the model and is defined as $\mathrm{NDCG@}k = Z_k \sum_{i=1}^{k} \frac{score_i}{\log_2(i+1)}$, where $score_i$ is the true ranking score of the $i$-th comment ranked by the model, and $Z_k$ is the normalization constant that scales the value between 0 and 1. In addition to NDCG@k, we use Precision@k as a second evaluation metric. Precision@k is defined as the ratio of comments in the inferred top-k that also appear in the true top-k. In the experiments, we evaluated $k \in \{1, 5, 10\}$. Note that a well-known paper [10] in the information retrieval field considers NDCG more appropriate than Precision@k for graded-score settings like ours.
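As a small illustration of the second metric, here is a hedged Python sketch of Precision@k under the definition above; the toy comment scores are invented purely for the example, and ndcg_at_k from the earlier HPA sketch follows the NDCG@k formula given here.

```python
import numpy as np

def precision_at_k(pred_scores, true_scores, k=10):
    """Fraction of the predicted top-k comments that also appear in the true top-k."""
    pred_top = set(np.argsort(-pred_scores)[:k])
    true_top = set(np.argsort(-true_scores)[:k])
    return len(pred_top & true_top) / k

# Toy example: gold C-scores and one ranking's predicted scores for six comments.
c_scores = np.array([3, 25, 0, 17, 40, 8])
predicted = np.array([0.1, 0.8, 0.05, 0.3, 0.9, 0.7])
print(precision_at_k(predicted, c_scores, k=3))  # two of the predicted top-3 are truly top-3 -> ~0.667
```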
Ensemble Baselines: We prepared the following methods as baselines. RankSVM and RankNet are single-model baselines. ScoreAvg, RankAvg, TopkAvg, and NormAvg are commonly used ensemble methods that combine multiple models in post-processing without training. SupWeight is a popular supervised ensemble method based on weighting.

-RankSVM: The best single RankSVM model proposed in Fujita et al. [7].
-RankNet: The best single RankNet model among the 100 models used for the ensemble.
-ScoreAvg: Average of the models' output scores for each comment.
-RankAvg: Average of each comment's rank orders.
-TopkAvg: Select comments with scores above a threshold from each ranking and average their scores [5].
-NormAvg: Average of the L2-normalized output scores, as typified by [2]; each ranking is normalized as $r' = r / \|r\|$.
-SupWeight: Weighted average of the model outputs, with weights learned from labeled data [19].

Our experimental results are shown in Table 1. We confirmed that all ensemble methods perform better than a single model. In particular, the proposed method HPA achieved the highest NDCG@k. PostNDCG achieved higher accuracy than RankNet, which implies that calculating the similarity between models with evaluation metrics for each article is effective. However, PostNDCG was less accurate than common averaging ensemble methods such as NormAvg. Since the models were originally trained through relative comparisons of rankings, preserving model diversity is more effective for improving performance than selecting only high-confidence models, as PostNDCG does. The unsupervised method HPA outperformed the supervised method SupWeight. Therefore, we confirmed that it is better to determine model importance from the similarity between predicted rankings than to learn it in advance from labeled data.

Furthermore, we verified the effectiveness of NDCG@k as the similarity function in HPA by comparing it with other similarity functions: Precision@k, cosine similarity, the Kendall rank correlation coefficient [11], and the Spearman rank correlation coefficient [21]. Table 2 shows the results of HPA when the similarity function is changed. The NDCG@k variants outperformed the other similarity functions. Furthermore, Precision@k performed better than cosine similarity; note that Precision@k is equivalent to a top-k cosine similarity. This indicates that top-k-focused measures, i.e., ranking evaluation metrics, are useful for the ensemble.
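For completeness, the alternative similarity functions compared in Table 2 can be plugged into the HPA scoring step in place of NDCG@k. The sketch below is our own illustration, assuming SciPy's kendalltau and spearmanr; the paper itself does not provide this code.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def cosine_sim(pred_scores, ref_scores):
    """Cosine similarity between two score vectors."""
    return float(np.dot(pred_scores, ref_scores)
                 / (np.linalg.norm(pred_scores) * np.linalg.norm(ref_scores)))

def kendall_sim(pred_scores, ref_scores):
    """Kendall rank correlation between the orderings induced by the two score vectors."""
    corr, _ = kendalltau(pred_scores, ref_scores)
    return corr

def spearman_sim(pred_scores, ref_scores):
    """Spearman rank correlation between the orderings induced by the two score vectors."""
    corr, _ = spearmanr(pred_scores, ref_scores)
    return corr

# Any of these (or ndcg_at_k / precision_at_k from the earlier sketches) can be passed
# as the `sim` argument of hpa_rank, e.g. hpa_rank(rankings, k_models=3, sim=spearman_sim).
```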
Analyzing comments on online forums, including news comments, has been widely studied in recent years. This line of research includes many studies on ranking comments according to user feedback [6, 9, 22]. On the other hand, there has also been much research on analyzing news comments in terms of "constructiveness" [7, 13, 18]. The most closely related work is Fujita et al. [7]. They ranked comments by using the C-score to evaluate quality instead of relying on user feedback; they created a news comment ranking dataset and improved model performance from the viewpoint of the dataset structure. In our research, we further improve performance from the viewpoint of the model structure. Among ensemble methods for ranking tasks, there are methods that average model outputs [2, 5], as mentioned in Sect. 3.2. Our method extends them by denoising based on the relationships between predicted rankings. There is also research on learning query-dependent weights with semi-supervised ensemble learning in an information retrieval task [8]. That method focuses on selecting documents that are highly relevant to a query (article); it is effective for information retrieval tasks but not for ranking news comments, because almost all comments are associated with their news article. There are also approaches that optimize a ranking model for evaluation metrics such as NDCG@k, e.g., LambdaRank [3] and LambdaMART [4]. These methods use NDCG@k between a gold ranking and a predicted one during training, so NDCG@k is not used at inference time. This fundamentally differs from our method, which calculates NDCG@k between predicted rankings during inference.

We proposed HPA, a hybrid unsupervised method that combines an output selection method with a typical averaging method. Our experiments showed that comparing predicted rankings using evaluation metrics is effective for selecting and weighting models. For future work, we would like to compare the proposed method with supervised ensemble methods in terms of performance and speed. We also plan to combine various types of networks instead of using the same network structure.

References:
1. Learning to rank using gradient descent
2. Learning to rank using an ensemble of lambda-gradient models
3. Learning to rank with nonsmooth cost functions
4. From RankNet to LambdaRank to LambdaMART: an overview
5. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods
6. Ranking mechanisms in Twitter-like forums
7. Dataset creation for ranking constructive news comments
8. Semi-supervised ensemble ranking
9. Ranking comments on the social web
10. Cumulated gain-based evaluation of IR techniques
11. A new measure of rank correlation
12. Frustratingly easy model ensemble for abstractive summarization
13. Constructive language in news comments
14. Applying conditional random fields to Japanese morphological analysis
15. The weighted majority algorithm
16. Distributed representations of words and phrases and their compositionality
17. Optimal ensemble averaging of neural networks
18. Automatically identifying good conversations online (yes, they do exist!)
19. Actively searching for an effective neural network ensemble
20. Implementation of a word segmentation dictionary called mecab-ipadic-NEologd and study on how to use it effectively for information retrieval
21. The proof and measurement of association between two things
22. Is this post persuasive? Ranking argumentative comments in online forum