key: cord-0460383-bmw1kb24
authors: Ebadi, Nima; Najafirad, Peyman
title: A Self-supervised Approach for Semantic Indexing in the Context of COVID-19 Pandemic
date: 2020-10-07
journal: nan
DOI: nan
sha: 2126665a067a9bee613a3581322ece01c67ea522
doc_id: 460383
cord_uid: bmw1kb24

The pandemic has accelerated the pace at which COVID-19 scientific papers are published. In addition, the process of manually assigning semantic indexes to these papers by experts is even more time-consuming and overwhelming in the current health crisis. There is therefore an urgent need for automatic semantic indexing models that can effectively scale up to newly introduced concepts and to the rapidly evolving distributions of the hyper-focused related literature. In this research, we present a novel semantic indexing approach based on state-of-the-art self-supervised representation learning and transformer encoding, designed specifically for pandemic crises. We present a case study on a novel dataset of COVID-19 papers published and manually indexed in PubMed. Our study shows that our self-supervised model outperforms the best-performing models of BioASQ Task 8a by 0.1 in micro-F1 score and 0.08 in LCA-F score on average. Our model also shows superior performance in detecting supplementary concepts, which is particularly important when the focus of the literature has drastically shifted towards specific concepts related to the pandemic. Our study sheds light on the main challenges confronting semantic indexing models during a pandemic, namely new concepts and drastic changes in their distributions, and, as a superior alternative for such situations, proposes a model founded on approaches that have shown promising performance in improving generalization and data efficiency in various NLP tasks. We also show that jointly indexing major Medical Subject Headings (MeSH) and supplementary concepts improves the overall performance.

To facilitate literature search and storage, curators at the National Library of Medicine (NLM) annotate every article with a set of concepts from established categorical semantic terminologies. [1] This annotation of scientific articles is generally referred to as semantic indexing. However, the manual process of biomedical semantic indexing is time-consuming and financially expensive. [2, 3] Therefore, several automated semantic indexing models have been proposed in the literature, including NLM's official Medical Text Indexing tool (MTI). [4, 5, 6, 7, 8, 9] A pandemic, however, is an extreme scenario that highlights the importance of automated semantic indexing, as researchers desperately require a well-compartmentalized database to gain insights into recent findings. [10] During the current pandemic, related papers are being published at a much faster pace, [11] and the focus of the literature has drastically shifted towards COVID-19 related topics and subtopics, [12] some of which did not have a standard name until a few months ago. [13] Such conditions cause challenges for automatic semantic indexing systems that are based on substantial supervision and hand-coded features. Despite the importance of semantic indexing in a pandemic, there is a lack of studies on the performance of such automated models on the rapidly evolving corpus of COVID-19 related documents. [10] In this research, we present a case study on the state-of-the-art semantic indexing models in the context of the COVID-19 pandemic.
We analyze the key challenges of these models under various evaluation (training and testing) schemes. We find that the key aspects of the pandemic that challenge automatic semantic indexing models are the abrupt changes in the distribution of these indexes, the rapid growth of specific topics covering only a few indexes out of a relatively large set, and the lack of standard terms for newly introduced topics. In this research, we attempt to tackle the problem of semantic indexing specifically in the pandemic situation. We propose a novel semantic indexing methodology suited to the aforementioned challenges, i.e., one that is able to effectively scale up to the COVID-19 literature. Inspired by the state-of-the-art performance of self-supervised learning (SSL) models in various NLP [14, 15] and BioNLP [16] tasks, specifically their generalization and data efficiency capabilities, as well as by the best-performing models in BioASQ Task 8a, [5, 6] we design our methodology around transformer encoding and an attention mechanism between the document and the candidate indexes. Our experimental results establish our model as a superior alternative to the best-performing models of the BioASQ challenge during health crises like the current one. The main contributions of this study are as follows:

1. We propose a novel semantic indexing approach that can effectively scale up to new distributions and is thereby suitable for emergency situations like the current pandemic, where the related literature is rapidly evolving.

2. Our study brings attention to the main challenges confronting semantic indexing models in pandemic crises, and attempts to address them by proposing a novel model inspired by the best-performing models of the BioASQ challenge but, unlike them, able to leverage self-supervised representation learning and a transformer language model to improve efficiency and generalization.

3. We present a case study on a novel semantic indexing dataset based on the COVID-19 related research articles published and manually indexed in PubMed. We use flat and hierarchical measures to evaluate the performance of our model along with the state-of-the-art benchmarks. Our study demonstrates the superiority of our self-supervised approach in scaling to the novel pandemic situation with the relatively small amount of labeled data available.

4. We also discuss the importance of more fine-grained categorization of documents into supplementary concepts, and show that indexing them can actually improve MeSH indexing performance when performed simultaneously. In addition to major MeSH indexing, we evaluate the performance of simultaneous indexing of both major MeSH and supplementary concepts.

5. This paper aims to offer some aid in the process of semantic indexing of the novel COVID-19 literature so as to lighten the load on NLM indexers.

Biomedical literature has been collected by the National Library of Medicine (NLM) for the last 150 years. As of 2020, the PubMed database contains about 30 million biomedical journal citations, up from 12 million in 2004, a growth rate of about 4% per year. Through a laborious process, NLM curators fully examine every document and annotate it with a set of hierarchically organized terminologies developed by NLM, called Medical Subject Headings (MeSH), along with supplementary concepts for more fine-grained categorization.
[17] In 2019, more than 900K biomedical citations were added to PubMed and manually indexed into more than 29K MeSH concept categories. In light of the size and growth rate of such databases, several automated models have been developed to improve the time-consuming and financially expensive process of biomedical semantic indexing, through annual competitions such as BioASQ Task a [18] and the models presented there, [6, 5] as well as other BioNLP research venues. [19, 9, 7] These approaches are based either on i) simple retrieval systems, such as the SNOKES team, which participated in the 6th BioASQ and uses search-engine methods along with the UIMA concept extractor; [20] Iria, another participating team, which combines an ensemble of the best-performing models from previous years' challenges with k-NN MeSH masking algorithms; [21] Segura et al., who utilize Elasticsearch to manage the scalability issue of the task together with the enhanced NLM Medical Text Indexer (MTI); [8] and Zavorin et al., who combine learning-to-rank (L2R) with the Medical Text Indexer; [9] or on ii) deep learning models with substantial hand-coded features and supervision. DeepMeSH, the best-performing model in several editions of the BioASQ challenge, combines document-to-vector models with features crafted from the document and the MeSH indexes, along with ensemble models fed by those features. Other deep learning approaches include UIMA concept extractor links [5] and AUTH, which also uses a document-to-vector approach with an ensemble of machine learning classifiers (SVM) fed with document-MeSH features. [17] Jin et al. and Xun et al. combine retrieval systems with deep recurrent neural networks and attention mechanisms, and also provide explainability for MeSH indexing decisions. [6, 7] The amount of hand-crafted features and supervision required by these models makes it difficult for them to scale up as effectively as the biomedical databases grow during pandemic crises. [22]

These semantic indexing models are designed to perform well in normal situations, when there is no heightened interest in specific concepts, and are evaluated on their overall performance over all major MeSH indexes. [23] In the pandemic situation, however, the focus of the literature has drastically shifted towards the specific concepts and sub-concepts related to the current Coronavirus disease. The number of published documents related to Coronavirus has risen from a few articles per month to more than 10K articles in June 2020; roughly 1 out of every 11.5 citations is about Coronavirus these days. [24] The rapidly growing and evolving COVID-19 literature causes challenges for automatic semantic indexing models. [10] Previously introduced semantic indexing models are based on supervised learning approaches and heavily hand-coded features; therefore, they require a significant amount of labeled data for a specific concept to index related documents well, and they struggle to scale up to newly introduced terminologies and sub-concepts; they are thereby unsuitable for emergency situations like the ongoing health crisis. On the other hand, self-supervised learning (SSL), where a model is initially trained on a data-rich unsupervised pretext task and then fine-tuned on a downstream task, has recently emerged as an effective technique in almost every deep learning problem, ranging from computer vision [25, 26] and NLP [27, 15, 28] to bioinformatics [29, 30] and IoT security.
[31, 32, 33] Self-supervised learning is known to enhance data efficiency [34] and generalization, [35] because the SSL-based model learns general auxiliary knowledge from the pretext task that allows it to "understand" the downstream task better. [36] Therefore, SSL-based models are more robust to changing domains and scale up better to new distributions. [37] For a deep SSL algorithm to be effective, the pretext learning process should be conducive to downstream learning. [36] However, for the semantic indexing models proposed in the literature, either the architecture or the representation learning objective of the downstream task does not allow pre-training in an unsupervised manner that is useful for the downstream learning. In DeepMeSH, only the TF-IDF vectorization can be updated with unlabeled data, without any word-level representation learning. [5] In AttentionMeSH, the word-level encoding of the document can be updated by pre-training the Bi-GRU on a masked language modeling (MLM) pretext task, but these encodings would not be adequate for the downstream task because i) the encoding cannot involve the document-index attention, and ii) more sophisticated architectures, such as transformers, have proven more effective in this regard. [38]

2 MATERIALS AND METHODS

As initial representations of documents, we use the word representations provided by the BioASQ organizers, a word embedding pre-trained on a large-scale corpus of biomedical documents. We also use the BioASQ-provided word tokenizer to parse each document's title and abstract into a list of its constituent words. We perform stemming and eliminate stop words, since different variants of an individual word or a stop word do not affect the semantic index of a document. Afterwards, the bag-of-words representation of each document is computed as

D = [v_{w_1}; v_{w_2}; \dots; v_{w_{|D|}}] \in \mathbb{R}^{|D| \times d_{e1}},   (1)

where |D| is the number of non-stopwords in the document, v_{w_i} is the pre-trained embedding of the i-th word, and d_{e1} is the word embedding size.

For the model to get acquainted with the COVID-19 context, we leverage an unlabeled corpus of COVID-19 related documents called CORD-19, [40] which is significantly larger than our labeled dataset of indexed documents. We pre-train a bi-directional transformer on the masked language modeling (MLM) task using the word-tokenized representations of the documents shown in Eq. 1, along with those of candidate similar indexes. [38, 41] Such algorithms, e.g., SciBERT and BioBERT, have shown promising results in many NLP problems involving deep semantic analysis of scientific documents. [42, 16] In this regard, the document representation D is masked, fed to the transformer model, and passed through a positional encoding step following the original implementation of transformers. [38] Similar semantic indexes are also fed to the transformer, and through a joint document-index attention, an index-specific encoding of the document is generated. The retrieval and embedding of candidate indexes is discussed in the next section. Next, the transformer model is trained to predict the masked tokens from the index-specific encoding (similar to Fig. 1b, but with the softmax applied over the index axis). Unlike BERT and BioBERT, we do not leverage the next sentence prediction (NSP) task, for two reasons: 1) sentence ordering is required for QA-type inference, not for semantic indexing and text classification; and 2) it has been shown to be ineffective. [15, 14]
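To make the preprocessing of Section 2.1 and the MLM corruption concrete, the following is a minimal illustrative sketch, not the authors' code: the embedding table `emb`, the whitespace tokenizer, and the tiny stopword list are hypothetical stand-ins for the BioASQ-provided resources, and the masking follows the standard BERT-style recipe.

```python
# Illustrative sketch (not the authors' code) of document preprocessing
# (Eq. 1) and BERT-style MLM masking. `emb`, the tokenizer, and the
# stopword list are placeholders for the BioASQ-provided resources.
import numpy as np
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
STOPWORDS = {"the", "of", "and", "in", "a", "to"}  # stand-in for a full list

def doc_matrix(text, emb, d_e1):
    """Build D in R^{|D| x d_e1} from a title+abstract string (Eq. 1)."""
    tokens = [stemmer.stem(w) for w in text.lower().split()
              if w not in STOPWORDS]
    rows = [emb.get(t, np.zeros(d_e1)) for t in tokens]  # OOV -> zero vector
    return np.stack(rows) if rows else np.zeros((0, d_e1))

def mask_for_mlm(token_ids, mask_id, mask_prob=0.15, rng=np.random):
    """Hide ~15% of tokens and keep the originals as prediction targets."""
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(mask, mask_id, token_ids)   # masked positions -> [MASK]
    targets = np.where(mask, token_ids, -1)       # -1 = ignore in the loss
    return inputs, targets
```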
Inspired by Jin et al., for major MeSH indexes we initially use a retrieval system to retrieve a subset of related MeSH categories from relevant documents. [6] Note that we perform this retrieval only for major MeSH indexes, not for supplementary concepts, since there are only 19 of the latter in the COVID-19 dataset. In this regard, we translate the target biomedical document into a query to extract the relevant documents from the annotated database. We follow the same preprocessing of parsing, stemming, and stopword removal as in Section 2.1. Every document is represented by both the TF-IDF- and the BM25-weighted sums of its word vectors, following the weighting schemes of Wang et al. [43] and Paik et al., [44] respectively. Each document used as a query is represented as

q_d = \sum_{i=1}^{|d|} \alpha_i v_{w_i},   (2)

where w_i is the i-th word in document d, \alpha_i is its TF-IDF (or BM25) weight, and v_{w_i} is the word vector from the provided pre-trained embeddings. Next, using cosine similarity scores between the target document and the others, we find the K most relevant documents. We then use a scoring scheme to re-rank and collect the candidate MeSH indexes: every MeSH term is scored by summing its IDF weights over the retrieved documents, and the terms are ranked accordingly. The top M terms with the highest scores are considered for indexing and passed to the next stage. (This retrieval module can be regarded as a weak classifier that filters out the negative data, which far outnumber the positives: 29K MeSH terms in total vs. 12.6 terms per document on average. Jin et al. show that doing so enhances the efficiency and performance of indexing models, as the classifier then focuses only on detecting the correct MeSH indexes among a subset of plausible ones. [6])

Semantic index representations are straightforward to extract, as the indexes are single words. The index embeddings are

M = [v_{m_1}; v_{m_2}; \dots; v_{m_{|M|}}] \in \mathbb{R}^{|M| \times d_{e2}},   (3)

where |M| is the number of filtered indexes and d_{e2} is the embedding size of every index m_j. To simplify the model, we set d_{e1} = d_{e2} = d_{model}.

After the candidate indexes are retrieved, the document BoW representation D, along with the index representations M, is fed to the bi-directional transformer pre-trained on the self-supervised pretext task. Positional encoding is applied to documents, not to indexes, to encode word order. Initially, D and M are separately encoded into \bar{D} and \bar{M} with a self-attention mechanism that allows words to attend only to other words and indexes only to other indexes (there is no cross-attention between words and indexes). Self-attention over D captures context-aware representations of words, and over M captures correlations and dependencies between indexes, which have been shown to be important. [5, 45] Next, cross-attention between the encodings of words and indexes is computed using the scaled dot-product attention function, [38, 46] as follows:

O = \mathrm{softmax}(\bar{M}\bar{D}^T / \sqrt{d_{model}}) \, \bar{D},   (4)

where \bar{M}\bar{D}^T \in \mathbb{R}^{|M| \times |D|} packs the dot products between every index and every word into a single matrix multiplication. The softmax is applied over the word axis to obtain attention weights for every index. The rows of O are the index-specific context vectors, each a weighted sum of the word vectors. To compute the likelihood scores of the indexes, we apply a linear projection layer with a non-linear activation function \sigma to the context-specific vectors O and the index encodings \bar{M}:

\hat{Y} = \sigma(O U^T + \bar{M} V^T + B),   (5)

where \hat{Y} \in \mathbb{R}^{|M| \times 1} is the set of likelihood scores for the candidate indexes, and U, V \in \mathbb{R}^{1 \times d_{model}} and B \in \mathbb{R}^{|M| \times 1} are trainable parameters. Finally, the predicted indexes are obtained by thresholding every likelihood score.
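A minimal NumPy sketch of the cross-attention and scoring in Eqs. 4 and 5 follows, assuming the transformer encoders have already produced the encodings D̄ and M̄; the sigmoid is one plausible choice for the activation σ, and all names and shapes here are illustrative rather than the authors' implementation.

```python
# Minimal NumPy sketch of the cross-attention (Eq. 4) and index scoring
# (Eq. 5). D_bar and M_bar stand for the transformer-encoded document
# words and candidate indexes; U, V, B are the trainable parameters.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def index_likelihoods(D_bar, M_bar, U, V, B):
    """D_bar: (|D|, d_model), M_bar: (|M|, d_model), U, V: (1, d_model),
    B: (|M|, 1) -> likelihood scores of shape (|M|, 1) in (0, 1)."""
    d_model = D_bar.shape[1]
    # Scaled dot-product attention of every index over every word (Eq. 4).
    attn = softmax(M_bar @ D_bar.T / np.sqrt(d_model), axis=1)  # (|M|, |D|)
    O = attn @ D_bar                       # index-specific context vectors
    logits = O @ U.T + M_bar @ V.T + B     # linear projection (Eq. 5)
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid as one choice for sigma

# Toy usage with random encodings:
rng = np.random.default_rng(0)
D_bar, M_bar = rng.normal(size=(50, 64)), rng.normal(size=(8, 64))
U, V, B = rng.normal(size=(1, 64)), rng.normal(size=(1, 64)), np.zeros((8, 1))
print(index_likelihoods(D_bar, M_bar, U, V, B).ravel())
```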
Thresholds are set by maximizing the micro F-measure on the training set, following [47].

3 RESULTS

For the self-supervised representation learning (pre-training) stage of our methodology, we use the CORD-19 dataset, which includes 141K research articles about Coronavirus published in peer-reviewed venues and archival services such as bioRxiv. For the supervised fine-tuning stage, we use two labeled datasets: i) the BioASQ dataset of articles published and manually indexed by National Library of Medicine indexers. The dataset includes the journal in which each article was published and the article's title and abstract, along with MeSH indexes for the training sets. On average, each article is indexed with 12.84 MeSH categories. [18] ii) Recently collected COVID-19 related documents from PubMed: we use the 13K latest documents related to COVID-19, published and annotated in 2020, crawled from PubMed using the query "covid-19 AND severe acute respiratory syndrome 2 AND sars-cov-2", [24] and the evaluation results are computed on this dataset. Table 1 provides statistics of the datasets. We utilize the major MeSH terms from both sets, as well as the supplementary concepts from our set, to measure the performance in detail. As shown in Figure 2, major MeSH indexes and supplementary concepts form a directed acyclic graph (DAG) in which there are hierarchical (parent-child) relations between major MeSH terms and mapping relations between MeSH terms and supplementary concepts.

For the MeSH retrieval part of our methodology, we use the BM25/TF-IDF bag-of-words representations with a vocabulary size of 90K. The retrieval components, i.e., the global vectorization features as well as the thresholding features, are fit on the training set only, to avoid data leakage. [48] The bi-directional transformer is implemented using TensorFlow (2.0) Eager [49, 50] and the tensor2tensor library. The hyperparameter values are shown in Table 2. We use the Adam optimizer and early stopping. [51] We apply three versions of our methodology: i) the base bi-transformer without self-supervised training, where the model is not pre-trained on the CORD-19 dataset (sections b and c of Figure 1, with random initialization of the parameters); ii) the base bi-transformer with masked language modeling (MLM) as the self-supervised pretext task; and iii) the large bi-transformer with the MLM self-supervised task.

Table 2: Hyperparameter values. Bold marks the optimal values among all those tried; * refers to the values for Bi-Trans Large.

Following the BioASQ challenge, we evaluate the performance of the semantic indexing models with two sets of evaluation measures: i) flat: accuracy and the micro and macro F-measures; and ii) hierarchical: the lowest common ancestor F-measure (LCA-F). Accuracy is the fraction of correct predictions. In multi-label classification problems, however, the true and predicted classes are sets of labels for every example; therefore, there is an additional notion of partial correctness. To capture this, precision and recall are computed for every class separately, and the results are aggregated using micro-averaging and macro-averaging strategies to compute micro (MiP/MiR) and macro (MaP/MaR) precision/recall, respectively. Micro-averaging pools the predictions over all labels and examples globally before computing precision and recall, whereas macro-averaging evaluates each label separately and then averages over all labels. Finally, the micro and macro F-measures (MiF and MaF) are computed as the harmonic means of the corresponding precision and recall.
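Tying the thresholding step at the start of this section to the micro F-measure just defined, here is a hypothetical sketch, not the authors' code, that greedily tunes one decision threshold per candidate index to maximize micro-F1 on training data; `scores` would come from Eq. 5.

```python
# Hypothetical sketch of per-index threshold tuning by maximizing
# micro-F1 on the training set, in the spirit of the threshold
# optimisation literature cited above. One greedy pass is shown;
# the search could be iterated until micro-F1 stops improving.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(scores, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """scores, y_true: (n_docs, n_labels); returns (n_labels,) thresholds."""
    n_labels = scores.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best = 0.0
        for t in grid:
            preds = np.where(
                np.arange(n_labels) == j,
                scores >= t,               # candidate threshold for label j
                scores >= thresholds)      # current thresholds elsewhere
            f1 = f1_score(y_true, preds.astype(int),
                          average="micro", zero_division=0)
            if f1 > best:
                best, thresholds[j] = f1, t
    return thresholds
```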
MiF is more affected by the performance of frequent indexes, while MaF treats every index equally. [52] Following BioASQ, MiF is the main flat measure in our case study. As shown in Figure 2, semantic indexes are hierarchically related to one another. Therefore, in addition to the flat measures, hierarchical measures are also used to evaluate the hierarchical classification performance of the semantic indexing models. In this regard, we leverage the lowest common ancestor F-measure (LCA-F) algorithm provided by Kosmopoulos et al., [53] the same algorithm used in the BioASQ challenge. In the LCA-F measure, the sets of true and predicted classes are compared based on the union of their corresponding augmented graphs, which encompass all the lowest common ancestors between every pair. The algorithm has shown desirable results in various hierarchical text classification tasks.

Table 3 shows the performance of the semantic indexing models when indexing only major subject headings (i.e., major MeSH). The models are trained on BioASQ data from 2015-2019 (excluding the recent COVID-19 documents) as well as the COVID training set, and tested on the COVID testing set. As shown in Table 3, the higher performance of the self-supervised models reveals that they acquire some general knowledge about the pandemic and the new distribution of major MeSH indexes.

To evaluate how efficiently the semantic indexing models scale up to the novel Coronavirus-related literature, we chronologically sort the COVID-19 training dataset and train each model with the following proportions of the data to evaluate their zero- and few-shot performance along with their data efficiency: 0.0 (zero-shot evaluation), 0.05, 0.1, 0.2, 0.5, and 1 (the whole data); a sketch of this protocol follows at the end of this section. Figure 3 shows the MeSH indexing performance of the top-performing models from Table 3 as a function of the amount of exclusive COVID-19 training data.

Figure 3: MeSH indexing performance w.r.t. the size of the COVID-19 training data, measured by Micro-F score. The COVID-19 related data is chronologically ordered and then divided; the horizontal axis therefore corresponds directly to the dates the papers were published.

The leftmost point is the zero-shot performance, where the models have been trained only on the BioASQ dataset and have not yet seen a COVID-19 paper. The SSL-based versions of our BioTrans model achieve substantially superior performance until almost half of the training data has been fed, especially at the very beginning: they reach 0.95 of their optimal performance with only 0.2 of the data. The other models reach this point only once half of the training data has been fed, which amounts to a delay of a couple of months, a critical issue in a pandemic crisis. The BioTrans model that is not pre-trained on the COVID-19 SSL corpus does not learn effectively from the COVID-19 supervised data and simply follows a learning speed similar to those of AttentionMeSH and DeepMeSH, by which it was inspired. This shows that the major strength of such an architecture (bi-directional transformer encoding with attention between documents and indexes) emerges when it undergoes a self-supervised learning process.
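As referenced above, the chronological data-efficiency protocol behind Figure 3 can be summarized in a short sketch; `fine_tune` and `evaluate_micro_f` are hypothetical stand-ins for the actual training and evaluation code.

```python
# Sketch of the data-efficiency protocol behind Figure 3: sort the
# COVID-19 training documents by publication date, then fine-tune on
# growing chronological prefixes. `fine_tune` and `evaluate_micro_f`
# are hypothetical helpers, not part of the authors' released code.
PROPORTIONS = [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]

def data_efficiency_curve(covid_train, covid_test, base_model):
    docs = sorted(covid_train, key=lambda d: d["pub_date"])  # chronological
    curve = []
    for p in PROPORTIONS:
        subset = docs[: int(p * len(docs))]     # p = 0.0 -> zero-shot
        model = fine_tune(base_model, subset) if subset else base_model
        curve.append((p, evaluate_micro_f(model, covid_test)))
    return curve
```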
As the literature becomes hyper-focused on specific topics in the context of a pandemic, classification into more fine-grained indexes becomes critical. Therefore, we also evaluate the simultaneous indexing of major MeSH terms and supplementary concepts. In this regard, the models trained on BioASQ are simply fine-tuned to detect the supplementary concepts of the COVID training set in addition to the major MeSH indexes; the supplementary concepts are added as new classes to the set of potential indexes. As demonstrated in Table 4, the performance of the baselines improves when they are fine-tuned to detect supplementary concepts as well, which shows the importance of more fine-grained indexing. Compared to the baselines, our model improves even more with the aid of the supplementary concepts.

In this research, we propose a novel semantic indexing approach based on self-supervised deep representation learning to tackle this task in the current health crisis. We present a case study on COVID-19 literature collected from recently indexed documents in PubMed. We compare the performance of our model with the state-of-the-art baselines using flat and hierarchical measures. Our study shows that the presented self-supervised model outperforms the baselines with the small amount of labeled data available. We further evaluate the indexing of supplementary concepts along with the major MeSH indexes, demonstrating state-of-the-art performance. We also show that indexing supplementary concepts improves the MeSH indexing performance of our model, underlining the importance of more fine-grained categorization of documents in the current pandemic. In future work, we will continue our case study as COVID-19 documents are published and indexed in PubMed. We will mainly focus on improving the data efficiency and generalization of semantic indexing models as the COVID-19 literature rapidly evolves. We will also explore sophisticated few- and zero-shot learning techniques to better handle newly introduced concepts.

REFERENCES

Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition.
Recommending MeSH terms for annotating biomedical articles.
The NLM Medical Text Indexer system for indexing biomedical literature.
Tackling MeSH indexing dataset shift with time-aware concept embedding learning.
DeepMeSH: deep semantic representation for improving large-scale MeSH indexing.
AttentionMeSH: Simple, effective and interpretable automatic MeSH indexer.
MeSHProbeNet: a self-attentive probe net for MeSH indexing.
LABDA at the 2016 BioASQ challenge task 4a: Semantic indexing by using ElasticSearch.
Using learning-to-rank to enhance NLM Medical Text Indexer results.
Lessons from COVID-19 to future evidence synthesis efforts: first living search strategy and out of date scientific publishing and indexing industry (submitted).
CO-Search: COVID-19 information retrieval with semantic search, question answering, and abstractive summarization.
TREC-COVID: Rationale and structure of an information retrieval shared task for COVID-19.
Retweets of officials' alarming vs reassuring messages during the COVID-19 pandemic: Implications for crisis management.
ALBERT: A lite BERT for self-supervised learning of language representations.
A robustly optimized BERT pretraining approach.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Large-scale semantic indexing and question answering in biomedicine.
Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition.
LIVIVO: the vertical search engine for life sciences.
Georgios Paliouras, and Ioannis Kakadiaris. Results of the fifth edition of the BioASQ challenge.
COLE and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment.
High dimensional model representation of log-likelihood ratio: binary classification with expression data.
Results of the seventh edition of the BioASQ challenge.
Keep up with the latest coronavirus research.
Self-supervised learning of motion capture.
Self-supervised learning of pretext-invariant representations.
XLNet: Generalized autoregressive pretraining for language understanding.
Multi-task self-supervised learning for robust speech recognition.
Self-supervised learning model for skin cancer diagnosis.
Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. bioRxiv.
On data-driven curation, learning, and analysis for inferring evolving internet-of-things (IoT) botnets in the wild.
Internet-scale insecurity of consumer internet of things: An empirical measurements perspective.
Detecting internet of things attacks using distributed deep learning.
A simple framework for contrastive learning of visual representations.
Adversarial robustness: From self-supervised pre-training to fine-tuning.
Exploring the limits of transfer learning with a unified text-to-text transformer.
Language models are unsupervised multitask learners.
Attention is all you need.
Genetic cluster analysis of SARS-CoV-2 and the identification of those responsible for the major outbreaks in various countries.
MASS: Masked sequence to sequence pre-training for language generation.
SciBERT: A pretrained language model for scientific text.
Text clustering based on the improved TF-IDF by the iterative algorithm.
A novel TF-IDF weighting scheme for effective ranking.
Automatic text summarization using customizable fuzzy features and attention on the context and vocabulary.
Human action performance using deep neuro-fuzzy recurrent attention model.
Threshold optimisation for multi-label classifiers.
A simple but tough-to-beat baseline for the fake news challenge stance detection task.
TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning.
Large-scale machine learning on heterogeneous distributed systems.
On early stopping in gradient descent learning.
Implicit life event discovery from call transcripts using temporal input transformation network.
Evaluation measures for hierarchical classification: a unified view and novel approaches.

The authors gratefully acknowledge the use of the services of the Jetstream cloud, funded by National Science Foundation (NSF) award 1445604, and the Cloud Technology Endowed Professorship.