SciEv: Finding Scientific Evidence Papers for Scientific News

Authors: Hoque, Md Reshad Ul; Li, Jiang; Wu, Jian
Date: 2022-04-30

In the past decade, many scientific news media that report scientific breakthroughs and discoveries have emerged, bringing science and technology closer to the general public. However, not all scientific news articles cite proper sources, such as original scientific papers. A portion of scientific news articles contain misinterpreted, exaggerated, or distorted information that deviates from the facts asserted in the original papers. Manually identifying proper citations is laborious and costly. Therefore, it is necessary to automatically search for pertinent scientific papers that could be used as evidence for a given piece of scientific news. We propose a system called SciEv that searches for scientific evidence papers given a scientific news article. The system employs a 2-stage query paradigm, with the first stage retrieving candidate papers and the second stage reranking them. The key feature of SciEv is that it uses domain knowledge entities (DKEs) to find candidates in the first stage, which proved to be more effective than regular keyphrases. In the reranking stage, we explore different document representations for news articles and candidate papers. To evaluate our system, we compiled a pilot dataset consisting of 100 manually curated (news, paper) pairs from ScienceAlert and similar websites. To the best of our knowledge, this is the first dataset of this kind. Our experiments indicate that the transformer model performs the best for DKE extraction. The system achieves P@1=50%, P@5=71%, and P@10=74% when it uses a TFIDF-based text representation. The transformer-based re-ranker achieves comparable performance but takes twice as long. We will collect more data and test the system for user experience.
This can be generally classified as a citation recommendation problem, the goal of which is to find an article that should be cited given a piece of context Färber and Jatowt [2020]. In proposed solutions, a function is trained to map a citation context $z_{ij}$ and the document it belongs to ($d_i$) to a reference (i.e., the cited document $r_m$). In a supervised method, a classifier is trained to incorporate global (document or cross-document) and local (context) features, e.g., He et al. [2010]. In an unsupervised method, a re-ranking model is applied that assigns probabilistic scores to a list of candidate documents, e.g., Peng et al. [2016]. Existing citation recommender systems have focused on problems in which the original and the recommended documents are both news articles Peng et al. [2016] or both scientific articles He et al. [2010]. In our problem, the original article is a news article and the recommended articles are scientific papers. This problem is more challenging for the following reasons. First, there is usually a gap between the vocabularies used in news articles and scientific papers because reporters usually need to paraphrase scientific papers into more readable text for non-domain experts. The second challenge is data sparsity. Unlike dense citation networks Kataria et al. [2010a], there are far fewer citation relations between news and scientific articles, making it challenging to apply graph embedding methods. Lastly, to the best of our knowledge, there are few open-access datasets that can be used for training and evaluation. Recently, a dataset containing about 800 peer-reviewed articles with mentions of scientific research in popular media was compiled, but the original news articles were not included Anderson et al. [2020]. To overcome this challenge, we compiled a pilot dataset consisting of 100 pairs of scientific news articles and their associated research papers.
In this paper, we propose a system called SciEv as an intermediate step towards evaluating the credibility of scientific news. SciEv automatically finds scientific evidence papers for a given scientific news article in two stages. In the first stage, the system sends queries to a digital library API and retrieves candidate papers. The second stage reranks these candidates based on their semantic similarities to the news article. To overcome the aforementioned vocabulary gap, instead of querying general keyphrases, the system queries domain knowledge entities (DKEs) extracted from news articles Hoque et al. [2019]. We query DKEs instead of regular keywords because many DKEs appear both in news articles and in the corresponding scientific papers, which motivates their use for searching scientific papers given a scientific news article.

DKEs We define DKEs as noun phrases that deliver domain knowledge. DKEs are different from general named entities and keyphrases He et al. [2018] because they are predominantly used in domain-specific articles and usually require some domain knowledge to understand. In Figure 1, we highlight DKEs that appear in both a piece of scientific news and the corresponding paper.

The exponential growth of scientific papers each year makes it very challenging for news editors and researchers to find the most pertinent citations Bornmann and Mutz [2015]. Citation recommendations can be broadly classified into three categories depending on the source articles and the articles to be cited. In the first type (news→news), a news article cites another news article. In the second type (paper→paper), a scientific paper cites another scientific paper. In the third type (news→paper), a news article cites a scientific paper. Citing a news article in a research paper is relatively uncommon and beyond the scope of our study.
The collaborative filtering (CF) method has been widely used for news recommender systems since it was proposed Melville et al. [2002]. This method requires building document and user profiles. However, the reading history of news articles is usually unknown, making it difficult to build user profiles. Many citation recommender systems have been proposed using different text representation models. Early work used Synset Frequency-Inverse Document Frequency (SF-IDF) to represent the news Capelle et al. [2012]. SF-IDF is similar to TFIDF except that it uses WordNet synonym sets to expand the semantic representation of a given word. Other work leverages word embeddings. For example, Peng et al. [2016] developed a news citation recommendation system using word-embedding-based re-ranking and grounded entities (i.e., explicit semantics). Okura et al.

Depending on the type of input, citation recommendation systems can be classified into local and global citation recommendations. Local citation recommendation systems are based on text snippets, such as a sentence or even several words Huang et al. [2015a]. Global citation recommendation systems make recommendations based on the full text or the abstract of a document Kataria et al. [2010b]. We propose the Scientific Evidence paper retrieval system called SciEv, which can be classified as a news→paper recommender system based on global text. To the best of our knowledge, few systems have been proposed with the same functionality Hoque et al. [2019]. Recently, many pre-trained language models have been proposed (e.g., BERT Devlin et al. [2018]) and have shown efficacy in retrieval tasks by representing text with distributed vectors. This motivates us to compare these language models in the news→paper recommender system. The architecture of the SciEv system adopts a previously proposed two-stage retrieval model.
The architecture of the system contains four modules (Figure 2), as described below.

Preprocessing The input of this module is an HTML page of a scientific news article, downloaded from the Web. Only textual content is retained for further analysis. In the experiments below, we used a preprocessor that parses web pages downloaded from ScienceAlert.com. The parser could easily be customized to parse news body text from other websites. The text was then cleansed by removing square brackets, extra spaces, special characters (such as @ and #), and numerical digits. Finally, the cleansed news article was segmented into sentences. Each sentence was tokenized and each token was labeled with a part-of-speech (POS) tag.

DKE Extraction As mentioned above, DKEs are used as queries to retrieve candidate papers, so the next step is to extract them from the news article body text. We use DKEs instead of keyphrases or general named entities because DKEs represent scientific domain knowledge and certain DKEs in news articles also appear in the source papers. This module is based on a transformer model trained on a multi-domain corpus. We elaborate on DKE extraction in Section 5.1.

Candidate Paper Retrieval (CPR) This module searches for scientific paper candidates using the extracted DKEs as queries. Here, we assume a frequency-based ranking algorithm such as BM25 Robertson and Zaragoza [2009], which is implemented in popular search platforms, e.g., Apache Solr and Elasticsearch. In our experiment, we use arXiv.org, a digital repository that hosts around 2 million preprints covering major fields in Computer Science, Mathematics, Physics, Astronomy, Statistics, Materials Science, and Social Science. The website offers a free search API. To obtain a high recall in this stage, we take the union of multiple query results. Each query contains a single DKE or a combination of up to 3 DKEs (connected by "AND").
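As a rough illustration of the query construction just described, the following Python sketch enumerates single DKEs and AND-combinations of up to three DKEs. The function name `build_queries`, the quoting of each DKE as a phrase, and the `limit` cap are assumptions made for this example. The text does not specify which combinations were actually issued, and exhaustive enumeration over 50-200 DKEs would produce far more queries than could reasonably be sent to a search API.

```python
from itertools import combinations, islice


def build_queries(dkes, max_terms=3, limit=None):
    """Build query strings from single DKEs and AND-combinations of DKEs.

    Hypothetical helper: phrase quoting and the `limit` cap are illustrative
    assumptions, not details taken from the SciEv system.
    """
    # Drop empty strings and duplicates while preserving the original order.
    unique = list(dict.fromkeys(d.strip() for d in dkes if d.strip()))

    def generate():
        # Sizes 1..max_terms: single DKEs first, then pairs, then triples.
        for size in range(1, max_terms + 1):
            for combo in combinations(unique, size):
                yield " AND ".join(f'"{term}"' for term in combo)

    # islice with limit=None yields every query.
    return list(islice(generate(), limit))


if __name__ == "__main__":
    for query in build_queries(["dark matter", "neutron star", "lensing"]):
        print(query)
```

In a real deployment the number of combinations would probably need to be bounded, for example by ranking DKEs first and combining only the top few.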
The final candidate result list is obtained by merging the top 10 results of all queries and removing duplicate papers based on titles and authors. This step is necessary for narrowing down the search space because constructing vector representations for millions of papers and ranking them by cosine similarity could take an impractically long time. This module reduces the candidate pool to thousands of articles and boosts the efficiency of the overall retrieval model. Empirically, the system finds approximately 500-3000 candidate papers for each news article.

Paper Re-ranking This module reranks candidate documents based on the vector representations of news articles and paper abstracts. The purpose of this step is to increase precision by promoting scientific papers that are highly topically relevant to the news article. Here we apply cosine similarity, which is common practice in many vector search engines (e.g., Covidex). The key is to generate a high-quality vector representation. The vector representation of a scientific paper is constructed by encoding its abstract into a fixed-length vector using a pre-trained language model. The vector representations of the news articles are constructed in a similar way. We investigate the performance of state-of-the-art language models.

As mentioned above, we are not aware of existing datasets containing scientific news and the corresponding research papers. Our pilot ground truth dataset is built from 100 scientific news articles downloaded from ScienceAlert, ScienceNews, EurekAlert, and Forbes. The articles were manually curated so that at least one source scientific paper is provided as a hypertext link or a reference in the original news article. We found up to 5 papers linked to a single news article. The average length of these news articles is approximately 900-1000 words.
The news articles are from a variety of domains such as history, arts, astronomy, biology, environment, computer science, and medicine.

We use two datasets for training the DKE extractor: the SemEval 2017 Task 10 dataset Augenstein et al. [2017] and the OA-STM dataset. Scientific news articles can be written in multiple domains. To train a robust DKE extractor for articles in various domains, we pre-train a model using the SemEval 2017 Task 10 dataset Augenstein et al. [2017], consisting of 500 passages extracted from journal papers in Computer Science, Materials Science, and Physics. The dataset was double annotated and three types of entities were identified, namely MATERIAL, METHOD, and PROCESS. In total, more than 7000 entities were manually annotated in the whole dataset. However, this dataset covers only three domains, so we fine-tune the pre-trained model on the OA-STM dataset, which contains pre-processed abstracts from scientific papers in 10 domains: agriculture, astronomy, biology, chemistry, computer science, earth science, engineering, materials science, math, and medicine. There are 11 abstracts in each domain. For each abstract, four core scientific concepts were annotated: PROCESS, METHOD, MATERIAL, and DATA. In total, 5595 entities were annotated. Existing studies indicate that a classifier trained on data from all 10 domains performs better than one trained on data from a single domain Brack et al. [2020]. When using these two datasets, we collapse all categories into one category called DKE.

DKE extraction can be seen as a named entity recognition (NER) task. Although many NER models have been proposed, there is no consensus that a certain model beats the others in all scenarios. The performance of NER models usually depends on data properties Li et al. [2022]. Therefore, we compare the following models in the DKE extraction task.
BiLSTM-CRF and Res-BiLSTM-CRF In this model, we apply a bi-directional long short-term memory (BiLSTM) network to obtain the hidden representation of each word-level token, followed by a conditional random field (CRF) layer. This model has been applied to many NER tasks Huang et al. [2015b] and has achieved state-of-the-art performance on standard datasets, e.g., Luo et al. [2020]. We also consider an alternative model with two BiLSTM networks and a residual connection. In the residual unit, the output of a shallow layer is directly added to the output of a deeper layer Srivastava et al. [2015]. Both models start from randomly initialized weights and learn the hidden representation of each token from the context.

In the next model, the representation of each token is constructed by concatenating the hidden representation output by a BiLSTM with the embedding from the pre-trained word2vec model Mikolov et al. [2013]. A CRF layer is then applied to classify each token.

BiLSTM-ChE Character embeddings can be used for capturing morphological information of words Santos and Guimaraes [2015] and mitigating the out-of-vocabulary problem Verwimp et al. [2017]. In this model, we combine character-level and word-level encodings. The first layer uses a BiLSTM to encode each character and combines the character encodings into a word-level vector. The second BiLSTM layer encodes each word-level token into a new vector. These two vectors are concatenated to generate the final representation of each word-level token. A CRF classifier is then applied to tag each token.

BiLSTM-ChE-Attention In this model, a self-attention layer is added after combining the character and word embeddings in the BiLSTM-ChE model.

ELMo is a context-dependent language model trained on the 1 Billion Word Benchmark Peters et al. [2018], providing word representations with rich features. In this model, we initialize the Res-BiLSTM model using the pre-trained ELMo embeddings.
The aggregation of self-attention and positional encoding has made transformer models successful for many tasks such as named entity recognition (NER) Vaswani et al. [2017]. One representative language model is Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. [2018], which has been successfully applied to NER Liang et al. [2020] and other downstream tasks. We implemented two transformer models. One model was trained from scratch on the OA-STM dataset. The other model was developed by fine-tuning a pre-trained BERT model, used as a backbone encoder, on the OA-STM dataset. The BERT encoder extracts high-quality language features from our text data before the classification layer. Based on these features, the classification layer classifies tokens into DKEs and non-DKEs (Figure 3).

TextRank TextRank is an unsupervised graph-based model, inspired by Google's PageRank algorithm, for extracting keyphrases Mihalcea and Tarau [2004]. The algorithm builds an undirected graph for each target document, in which the nodes correspond to words in the target document and edges are drawn between two words that occur next to each other in the text.

HESDK HESDK is a hybrid approach to extracting DKEs Wu et al. [2017]. In the first phase, candidate phrases are extracted by a grammar-based chunk parser and then filtered by a linear support vector machine (SVM). In the second phase, a CRF model is used to predict the probabilities of tags for a given token based on lexical and morphological features. The results from both phases are merged and further filtered by a rule-based filter.

Stanford NER To demonstrate the advantage of using DKEs, we extract regular named entities using Stanford CoreNLP Manning et al. [2014]. We use the 7-class Stanford NER model trained on the MUC6 and MUC7 datasets. The model extracts seven types of named entities: LOCATION, PERSON, ORGANIZATION, MONEY, PERCENT, DATE, and TIME.
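To make the TextRank baseline described above more concrete, the following Python sketch builds the undirected word graph (edges between adjacent words) and scores the nodes with a PageRank-style power iteration. It is a simplified illustration, not the implementation used in the experiments: the function name `textrank_scores`, the regular-expression tokenizer, and the damping and iteration settings are assumptions, and the part-of-speech filtering and phrase assembly steps of the full TextRank algorithm are omitted.

```python
import re
from collections import defaultdict


def textrank_scores(text, damping=0.85, iterations=50):
    """Score words with PageRank on an undirected adjacency graph.

    Simplified sketch: no POS filtering and no merging of words into phrases.
    """
    words = re.findall(r"[a-z]+", text.lower())

    # Draw an undirected edge between every pair of adjacent, distinct words.
    neighbors = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:
            neighbors[left].add(right)
            neighbors[right].add(left)

    nodes = list(neighbors)
    if not nodes:
        return {}

    # Start from a uniform distribution and iterate the PageRank update.
    scores = {word: 1.0 / len(nodes) for word in nodes}
    for _ in range(iterations):
        updated = {}
        for word in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    sample = "graph ranking uses graph edges and graph nodes"
    ranked = sorted(textrank_scores(sample).items(), key=lambda kv: -kv[1])
    for word, score in ranked:
        print(f"{word}\t{score:.3f}")
```

Words that are adjacent to many different words accumulate the highest scores, which is why well-connected terms surface as keyphrase candidates.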
Depending on the length of the news article, the DKE extractors can extract 50-200 DKEs per article, resulting in 500-3000 candidate scientific papers. For all baseline methods, we use the term frequency-inverse document frequency (TFIDF, see below) to represent news articles and scientific papers.

In the reranking phase, we represent a news article and scientific paper abstracts with fixed-length feature vectors. We compare both local and distributed representation models, for two reasons. First, although pre-trained language models have generally exhibited advantages over local representation models on many tasks, the performance of a document representation could be task-dependent. If the similarity of two documents is not on the semantic level but on the literal level, pre-trained language models may lose their advantage. Second, the quality of language-model-based document representations is usually data dependent. The case is more challenging in our task because there is a discrepancy between the vocabularies of news articles and research papers. General-purpose language models, such as BERT Devlin et al. [2018], usually represent well the text prevalent in news and Wikipedia articles. Scientific language models, such as SciBERT Beltagy et al. [2019], on the other hand, usually represent well the text used in scientific papers.

TFIDF weighted Bag-of-Words (BoW) BoW is a traditional text representation model Shahmirzadi et al. [2019]. In this model, each news article or scientific abstract is represented as a sparse vector containing |V| elements, in which |V| is the vocabulary size of a retrieval corpus. Each element is the TFIDF value of the corresponding term. The retrieval corpus is defined as the combination of the news article and its candidate papers. The IDF for each term is calculated based on the retrieval corpus it belongs to.

d2vec In this model, for each given document, the vector representations of the individual words are aggregated to represent the whole document.
We use the pre-trained word2vec model to calculate a 300-dimensional vector representation of each word. The document vector is the average of the vectors of all tokens.

Doc2vec Doc2vec is a model that creates a vector representation of documents of various lengths Le and Mikolov [2014]. The d2vec model above does not account for word order and does not incorporate context into the embedded vectors. In doc2vec, the document vector D is trained together with the word vectors. Once the whole document has been processed, the document vector D holds a representation of the document. We use the Doc2Vec model implemented in the Python Gensim library.

Weighted Doc2Vec After obtaining the document representation using Gensim Doc2Vec, we extract the word representation for each word in the document. We then weight each word representation by the word's TFIDF value. Finally, we combine all the weighted word representations by feature-wise averaging to create a new document representation.

SciBERT SciBERT is a transformer-based encoder trained on a large corpus of scientific text Beltagy et al. [2019]. Because this model is trained on scientific literature, it has shown advantages over BERT in scientific text classification and summarization tasks Gu et al. [2020], partially because of its relatively larger vocabulary overlap with the given corpus. As with d2vec, we obtain the document embedding by averaging the vector representations of all tokens in a document.

SBERT Sentence transformer or sentence-BERT (SBERT) is a modified version of the pre-trained BERT model Reimers and Gurevych [2019]. It uses a Siamese network with the triplet loss function to produce sentence embeddings. Each sentence in a document is encoded using SBERT. We then use the average of the sentence embeddings as the document embedding.

6 Evaluation and Comparison

We use precision, recall, and F1 score to evaluate the DKE extraction models.
Precision is calculated as the number of correctly extracted DKEs divided by the total number of DKEs extracted. Recall is calculated as the number of correctly extracted DKEs divided by the total number of DKEs labeled. The F1 score is the harmonic mean of precision and recall.

We use the following metrics to evaluate the system.

Mean Reciprocal Rank (MRR)

$$\mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}(i)},$$

in which Q is the total number of queries, and rank(i) is the rank of a relevant scientific paper for the i-th query. MRR assumes there is only one relevant document in the search results of each query. When evaluating queries corresponding to multiple papers, we use the top-ranked paper to calculate MRR.

Normalized Discounted Cumulative Gain (NDCG) We use NDCG with binary relevance. The metric is used for evaluating cases in which one query returns multiple relevant papers.

P@K We use the precision at rank K to measure the fraction of relevant scientific papers found within the top K results. It can be used when there are multiple relevant papers. We evaluate P@K for K = 1, 5, 10, 20, and 50.

DKE extraction is the key step for generating the queries used to retrieve paper candidates. Each model was trained on 80% of the documents from all domains and tested on the remaining 20% of the documents in individual domains. Figure 5 shows the comparison of the performance of the DKE extraction models. The results indicate that the fine-tuned BERT model outperforms all other models, achieving nearly perfect performance for all domains, with F1 varying from 0.92 to 1.00 depending on the domain (Table 5). Specifically, the model correctly extracted all DKEs in the math domain. The superior results are attributed to the BERT transformer encoder, which was pre-trained on CoNLL-2003 Devlin et al. [2018]. The other models underperformed most likely because they were trained from scratch on much smaller training datasets and did not generalize well. Among these models, the ELMo-BiLSTM performed relatively better than its peer models.
Specifically, the model achieved F1=52.8%, 53.5%, and 51.1% for materials science, biology, and chemistry, respectively. The results verify the advantage of initializing the BiLSTM encoder with pre-trained language models, e.g., Wu et al. [2020]. One interesting phenomenon is that adding self-attention to the BiLSTM-ChE model boosts the performance on certain domains, such as agriculture, engineering, math, and biology, but decreases the F1 scores on other domains.

Harrison et al. [2019]. The specific corpus used is not available, making a fair comparison impossible.
2 Keyphrases extracted using TextRank Mihalcea and Tarau [2004].
3 Named entities extracted using Stanford CoreNLP Manning et al. [2014].
4 HESDK Wu et al. [2017].

Table 1 shows the performance of the system on retrieving scientific evidence papers. For each system setting, we report the query type, the model used for DKE extraction, the document representation, P@K, MRR, and the average NDCG. We also measured the time it takes to finish ranking (the PRR module only) and the overall time for the entire system (Figure 2). The runtime information can be important for deploying an online system. For comparison, we add four baseline settings, all of which use the TFIDF weighted BoW model to represent documents but use different query types.

The results indicate that the BERT-TFIDF and the BERT-SPECTER settings achieved the top performances. The BERT-TFIDF setting achieves the best P@1/5/10, MRR, and average NDCG. The BERT-SPECTER setting achieves the best P@20 and P@50. Specifically, the best baseline (Baseline4) retrieves 60% of the scientific papers within the top 50 positions, whereas the BERT-SPECTER setting retrieves 91%. The result first demonstrates the efficacy of querying DKEs, as opposed to general named entities. In particular, retrieval settings using DKEs as queries outperformed all retrieval settings using general named entities or keyphrases.
Second, the relatively high P@K when K is high (K = 20, 50) can be attributed to the powerful capability of the language models to capture the semantic similarities between news articles and papers. On the other hand, the simple TFIDF document representation coupled with the BERT-based DKE extractor outperforms all other settings when K = 1, 5, and 10. This result indicates that the TFIDF model can capture news article and scientific paper pairs that exhibit higher literal similarities. However, when we lower the selection threshold (by increasing K), the most relevant scientific papers tend to be semantically rather than literally similar to the news articles. We postulate this could be attributed to paraphrasing instead of using exactly the same terms.

Through error analysis, we found that the major reason our retrieval models failed was that some scientific news articles contain far fewer DKEs. One example is a news article called "How to spot deepfakes? Look at light reflection in the eyes". Other news articles rely more on images, videos, and equations than on plain text to convey scientific discoveries. One example is a news article titled "Math Genius Has Come Up With a Wildly Simple New Way to Solve Quadratic Equations".

Regarding runtime, Baseline3, which uses Stanford CoreNLP, takes the shortest overall time of 13.90 seconds. The top performing setting, BERT-TFIDF, takes the shortest PRR time of 0.8 seconds and a relatively short overall time of 66.70 seconds. The BERT-SPECTER setting takes much longer, almost twice the time of the BERT-TFIDF setting. From the perspective of building a production service, BERT-TFIDF seems a better choice, but its relatively long runtime is still a bottleneck for building a real-time system.

In this work, we proposed a system called SciEv, which automatically recommends scientific papers given a scientific news article.
Although this can be broadly treated as a citation recommender system, we introduced a new scenario in which the citing document is a news article and the cited documents are one or several research papers. Our system consists of four modules: preprocessing, DKE extraction, candidate paper retrieval using DKEs, and paper reranking based on document embeddings. We trained a multi-disciplinary transformer-based transfer-learning model that beats other heuristic and learning-based models, achieving F1=0.93-1. We also compared the capabilities of different document embedding models in capturing the similarities between the news article and research papers. Our experiments on different system settings indicated that using DKEs was an effective and efficient way to retrieve research papers given a scientific news article. However, the TFIDF representation seems more powerful than the language model (e.g., SPECTER) at finding the relevant scientific papers when K is relatively low (K < 20). The language model starts to exhibit advantages over TFIDF when the search results are more inclusive, with a higher K (K ≥ 20 in our case). The results indicate that an ensemble re-ranking model may achieve a higher performance, which we will pursue in future work.

The ultimate goal is to build a public application that is capable of automatically assessing the credibility of scientific news based on pertinent scientific papers. To this end, we need effective and efficient ways to find the papers most relevant to a given scientific news report among the vast number of scholarly papers, and to evaluate the consistency of a scientific report against a list of relevant publications. The SciEv system we proposed here addresses the first of these needs. However, the overall runtime is still over 10 seconds. In the future, we will consider more efficient methods to reduce the overall runtime to seconds by parallelizing queries and text embeddings.
For comparison, we also show a "Transformer" model without initializing tokens using the pre-trained BERT model. Categories along the x-axis are as follows. Arg: agriculture; Astr: astronomy; Bio: biology; Chem: chemistry; CS: computer science; Eng: engineering; ES: environmental science; Math: mathematics; Med: medical science; MS: materials science.

References

Misinformation in and about science
How to fight an infodemic
Determining the credibility of science communication. CoRR, abs/2105.14473
Citation recommendation: approaches and datasets
Context-aware citation recommendation
News citation recommendation with implicit and explicit semantics
Utilizing context in generative Bayesian models for linked corpus
A case study exploring associations between popular media attention of scientific research and scientific citations
Searching for evidence of scientific news in scholarly big data
Keyphrase extraction based on prior knowledge
Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references
Content-boosted collaborative filtering for improved recommendations
Semantics-based news recommendation
Deep neural architecture for news recommendation
DKN: Deep knowledge-aware network for news recommendation
Next news recommendation via knowledge-aware sequential model
A neural probabilistic model for context based citation recommendation
Utilizing context in generative Bayesian models for linked corpus
Pre-training of deep bidirectional transformers for language understanding
Covidex: Neural ranking models and keyword search infrastructure for the COVID-19 open research dataset
The probabilistic relevance framework: BM25 and beyond
SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications
Domain-independent extraction of scientific concepts from research articles
A survey on deep learning for named entity recognition
Bidirectional LSTM-CRF models for sequence tagging
Hierarchical contextualized representation for named entity recognition
Linguistic regularities in continuous space word representations
Boosting named entity recognition with neural character embeddings
Character-word LSTM language models
Deep contextualized word representations
Attention is all you need
BOND: BERT-assisted open-domain named entity recognition with distant supervision
TextRank: Bringing order into text
HESDK: A hybrid approach to extracting scientific domain knowledge entities
The Stanford CoreNLP natural language processing toolkit
SciBERT: A pretrained language model for scientific text
Text similarity in vector space models: a comparative study
Distributed representations of sentences and documents
Domain-specific language model pretraining for biomedical natural language processing. CoRR, abs
Sentence-BERT: Sentence embeddings using siamese BERT-networks
SPECTER: Document-level representation learning using citation-informed transformers
A comparative study of sequential tagging methods for domain knowledge entity recognition in biomedical papers
Recommending research articles to consumers of online vaccination information

A Appendix: DKE Extraction Performance Comparison

Figure 5 shows the differential performance of the DKE extractor for 10 domains in the OA-STM dataset.