key: cord-0318030-9yvk7l1v
authors: Nentidis, Anastasios; Krithara, Anastasia; Bougiatiotis, Konstantinos; Krallinger, Martin; Rodriguez-Penagos, Carlos; Villegas, Marta; Paliouras, Georgios
title: Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
date: 2021-06-28
journal: nan
DOI: 10.1007/978-3-030-58219-7_16
sha: 1665937b629bd9ea070b21b1ab2da8633a6cfadf
doc_id: 318030
cord_uid: 9yvk7l1v

In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ is a series of challenges aiming at the promotion of systems and methodologies for large-scale biomedical semantic indexing and question answering. To this end, shared tasks have been organized yearly since 2012, in which different teams develop systems that compete on the same demanding benchmark datasets, representing the real information needs of experts in the biomedical domain. This year, the challenge has been extended with the introduction of a new task on medical semantic indexing in Spanish. In total, 34 teams with more than 100 systems participated in the three tasks of the challenge. As in previous years, the results of the evaluation reveal that the top-performing systems managed to outperform the strong baselines, which suggests that state-of-the-art systems keep pushing the frontier of research through continuous improvements.

This paper aims at presenting the shared tasks and the datasets of the eighth BioASQ challenge in 2020, as well as at providing an overview of the participating systems and their performance. Towards this direction, in section 2 we provide an overview of the shared tasks, which took place from February to May 2020, and the corresponding datasets developed for the challenge. In section 3, we present a brief overview of the systems developed by the participating teams for the different tasks. Detailed descriptions for some of the systems are available in the proceedings of the lab. In section 4, we focus on evaluating the performance of the systems for each task and sub-task, using state-of-the-art evaluation measures or manual assessment. Finally, in section 5, we sum up this version of the BioASQ challenge.

This year, the eighth version of the BioASQ challenge comprised three tasks: (1) a large-scale biomedical semantic indexing task (task 8a) and (2) a biomedical question answering task (task 8b), both considering documents in English, and (3) a new task on medical semantic indexing in Spanish (task MESINESP). In this section we provide a brief description of the two established tasks, with a focus on differences from previous versions of the challenge [30]. A detailed overview of these tasks and the general structure of BioASQ are available in [43]. In addition, we describe the new MESINESP task on semantic indexing of medical content written in Spanish (medical literature abstracts, clinical trial summaries and health-related project descriptions), which was introduced this year [21], providing statistics about the dataset developed for it.

In Task 8a the aim is to classify articles from the PubMed/MEDLINE digital library into concepts of the MeSH hierarchy. In particular, new PubMed articles that are not yet annotated by the indexers at NLM are gathered to form the test sets for the evaluation of the participating systems. Some basic details about each test set and batch are provided in Table 1.
As done in previous versions of the task, the task is divided into three independent batches of 5 weekly test sets each, providing an on-line and large-scale scenario, and the test sets consist of new articles without any restriction on the journal of publication. The performance of the participating systems is calculated using standard flat information retrieval measures, as well as hierarchical ones, when the annotations from the NLM indexers become available. As usual, participants have 21 hours to provide their answers for each test set. However, as new MeSH annotations have been observed to be released in PubMed earlier than in previous years, we shifted the submission period accordingly, to avoid having some annotations available from NLM while the task is still running. For training, a dataset of 14,913,939 articles with 12.68 labels per article, on average, was provided to the participants.

Task 8b aims at providing a realistic large-scale question answering challenge, offering the participating teams the opportunity to develop systems for all the stages of question answering in the biomedical domain. Four types of questions are considered in the task: "yes/no", "factoid", "list" and "summary" questions [4]. A training dataset of 3,243 questions, annotated with golden relevant elements and answers, is provided for the participants to develop their systems. Table 2 presents some statistics about the training dataset as well as the five test sets. As in previous versions of the challenge, the task is structured into two phases that focus on the retrieval of the required information (phase A) and on answering the question (phase B). In addition, the task is split into five independent bi-weekly batches, and the two phases for each batch run during two consecutive days. In each phase, the participants receive the corresponding test set and have 24 hours to submit the answers of their systems. In particular, in phase A, a test set of 100 questions written in English is released and the participants are expected to identify and submit relevant elements from designated resources, including PubMed/MEDLINE articles, snippets extracted from these articles, concepts and RDF triples. In phase B, the manually selected relevant articles and snippets for these 100 questions are also released and the participating systems are asked to respond with exact answers, that is entity names or short phrases, and ideal answers, that is natural language summaries of the requested information.

There is a pressing need to improve access to the information contained in health- and biomedicine-related documents, not only for professional medical users but also for researchers, public healthcare decision makers, the pharmaceutical industry and, particularly, patients. Currently, most biomedical NLP and IR research is done on content in English, despite the fact that a large volume of medical documents is published in other languages, including Spanish. Key resources like PubMed focus primarily on data in English, but also provide links to articles originally published in Spanish. MESINESP attempts to promote the development of systems for automatic indexing with structured medical vocabularies (DeCS terms) of healthcare content in Spanish: IBECS, LILACS, REEC and FIS-ISCIII.
The main aim of MESINESP is to promote the development of semantic indexing tools of practical relevance for non-English content, determining the current state of the art, identifying challenges and comparing the strategies and results to those published for English data. This task was organized within the framework of the Spanish Government's Plan for Promoting Language Technologies (Plan TL), which aims to promote the development of natural language processing, machine translation and conversational systems in Spanish and co-official languages.

A training dataset with 369,368 articles manually annotated with DeCS codes (Descriptores en Ciencias de la Salud, derived and extended from MeSH terms) was released. In addition, 1,500 articles were manually annotated and verified by at least two human experts (from a pool of 7 annotators), and from these a development set and a gold standard for evaluation were generated. A further background dataset was produced from diverse sources, including machine-translated text. Consistently, the different collections averaged, per document, around 10 sentences, 13 DeCS codes and 300 words, of which between 130 and 140 were unique. In order to explore the diversity of the content of this dataset, we generated clusters of semantically similar records from the training dataset's titles by first creating a Doc2Vec model with the gensim library, and then feeding the resulting similarity matrix to the unsupervised DBSCAN algorithm from the sklearn Python package, which creates clusters from high-density samples. The resulting 27 clusters were visualized with the libraries from the Carrot Workbench project (Figure 1).

This year, 7 teams participated in the eighth edition of task 8a, submitting predictions from 16 different systems in total. Here, we provide a brief overview of those systems for which a description was available, stressing their key characteristics. A summary of the participating systems and corresponding approaches is presented in Table 3.

This year, the LASIGE team from the University of Lisboa, in its "X-BERT BioASQ" system, proposes a novel approach for biomedical semantic indexing combining a solution based on Extreme Multi-Label Classification (XMLC) with a Named Entity Recognition (NER) tool. In particular, their system is based on X-BERT [7], an approach to scale BERT [12] to XMLC, combined with the use of the MER [10] tool to recognize MeSH terms in the abstracts of the articles. The system is structured into three steps. The first step is the semantic indexing of the labels into clusters using ELMo [36]; a second step matches the indices using a Transformer architecture; and finally, the third step focuses on ranking the labels retrieved from the previous indices.

Other teams improved upon existing systems already participating in previous versions of the task. Namely, the National Library of Medicine (NLM) team, in its "NLM CNN" system, enhances the previous version of their "ceb" systems [37], based on an end-to-end Deep Learning (DL) architecture with Convolutional Neural Networks (CNN), with SentencePiece tokenization [22]. The Fudan University team also builds upon their previous "AttentionXML" [50] and "DeepMeSH" [35] systems, as well as their new "BERTMeSH" system, which are based on document-to-vector (d2v) and tf-idf feature embeddings, learning to rank (LTR), DL-based extreme multi-label text classification, attention mechanisms and Probabilistic Label Trees (PLT) [16].
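For reference, the exploratory clustering pipeline described in the previous section for the MESINESP titles (Doc2Vec embeddings with gensim, followed by density-based clustering with scikit-learn) can be approximated with the following minimal sketch. The placeholder titles, the gensim 4.x API and all hyper-parameter values are illustrative assumptions, not the configuration actually used to produce Figure 1.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# Placeholder titles standing in for the MESINESP training-set titles.
titles = [
    "Tratamiento de la diabetes mellitus tipo 2 en atencion primaria",
    "Ensayo clinico de una vacuna frente a la gripe estacional",
    "Control glucemico y complicaciones de la diabetes tipo 2",
    "Efectividad de la vacunacion antigripal en mayores de 65 anos",
]

# 1) Train a Doc2Vec model on the tokenized titles (gensim >= 4.0 API assumed).
corpus = [TaggedDocument(simple_preprocess(t), [i]) for i, t in enumerate(titles)]
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)  # assumed settings
vectors = [model.dv[i] for i in range(len(titles))]

# 2) Cluster on the cosine distance matrix with DBSCAN; label -1 marks noise points.
distances = cosine_distances(vectors)
labels = DBSCAN(eps=0.4, min_samples=2, metric="precomputed").fit_predict(distances)
print(labels)
```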
Finally, this year's versions of the "Iria" systems [40] are also based on the same techniques used by the systems in previous versions of the challenge, which are summarized in Table 3.

As in previous versions of the challenge, two systems developed by NLM to facilitate the annotation of articles by indexers in MEDLINE/PubMed were available as baselines for the semantic indexing task: MTI [29], as enhanced in [51], and an extension of it based on features suggested by the winners of the first version of the task [44].

This version of Task 8b was tackled by 94 different systems in total, developed by 23 teams. In particular, 8 teams participated in the first phase, on the retrieval of relevant material required for answering the questions, submitting results from 30 systems. In the second phase, on providing the exact and ideal answers for the questions, 18 teams participated with 72 distinct systems. Three of the teams participated in both phases. An overview of the technologies employed by the teams is provided in Table 4 for the systems for which a description was available. Detailed descriptions for some of the systems are available in the proceedings of the workshop.

Table 4. Systems and approaches for Task 8b. Systems for which no information was available at the time of writing are omitted.

The "ITMO" team participated in both phases of the task, experimenting in its "pa" systems with different solutions across the batches. In general, for document retrieval the systems follow a two-stage approach. First, they identify initial candidate articles based on BM25, and then they re-rank them using variations of BERT [12], fine-tuned for the binary classification task with the BioASQ dataset and pseudo-negative documents. They extract snippets from the top documents and re-rank them using biomedical Word2Vec, based on cosine similarity with the question. To extract exact answers they use BERT fine-tuned on the SQuAD [38] and BioASQ datasets, employing a post-processing step to split the answer for list questions and additional fine-tuning on PubMedQA [17] for yes/no questions. Finally, for ideal answers they generate candidates from the snippets and their sentences and re-rank them using the model used for phase A. In the last batch, they also experiment with generative summarization, developing a model based on BioMed-RoBERTa [15] to improve the readability and consistency of the produced ideal answers.

Another team participating in both phases of the task is the "UCSD" team with its "bio-answerfinder" system. In particular, for phase A they rely on their previously developed Bio-AnswerFinder system [32], which is also used as a first step in phase B, for re-ranking the sentences of the snippets provided in the test set. For identifying the exact answers for factoid and list questions they experimented with fine-tuning Electra [8] and BioBERT [23] on the SQuAD and BioASQ datasets combined. The answer candidates are then scored considering the classification probability, the top ranking of the corresponding snippets and the number of occurrences. Finally, a normalization and filtering step is performed and, for list questions, an enrichment step based on coordinated phrase detection. For yes/no questions they fine-tune BioBERT on the BioASQ dataset and use majority voting.
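Several of the phase B systems described in this section share the same extractive pattern: a transformer-based reader, fine-tuned on SQuAD-style data and then on BioASQ, predicts an answer span within a relevant snippet. The following minimal sketch illustrates this pattern with the Hugging Face transformers pipeline; it is not the pipeline of any particular team, and the general-domain SQuAD checkpoint is a stand-in assumption for the biomedical readers (e.g. BioBERT, Electra) the teams actually used.

```python
from transformers import pipeline

# A general-domain SQuAD-tuned checkpoint stands in for the domain-specific readers
# fine-tuned on SQuAD + BioASQ by the participating systems.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question = "Which gene is mutated in cystic fibrosis?"
snippet = ("Cystic fibrosis is caused by mutations in the CFTR gene, "
           "which encodes a chloride channel expressed in epithelial cells.")

# The reader returns the highest-scoring answer span found inside the snippet.
prediction = reader(question=question, context=snippet)
print(prediction["answer"], round(prediction["score"], 3))
```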
For summary questions, the "UCSD" team employs hierarchical clustering, based on weighted relaxed word mover's distance (wRWMD) similarity [32], to group the top sentences, and selects the sentences ranked highest by Bio-AnswerFinder to be concatenated to form the summary.

In phase A, the "Google" team participated with four distinct systems based on different approaches. In particular, they used a BM25 retrieval model; a neural retrieval model, initialized with BioBERT and trained on a large set of questions developed through Synthetic Query Generation (QGen); and a hybrid retrieval model based on a linear blend of BM25 and the neural model [26]. In addition, they also used a re-ranking model, rescoring the results of the hybrid model with a cross-attention BERT rescorer [34].

The team from the University of Aveiro also participated in phase A with its "bioinfo" systems, which consist of a fine-tuned BM25 retrieval model based on Elasticsearch [14], followed by a neural re-ranking step. For the latter, they use an interaction-based model inspired by the DeepRank [33] architecture, building upon previous versions of their system [2]. The improvements focus on the sentence-splitting strategy, the extraction of multiple relevance signals, and the independent contribution of each sentence to the final score.

In phase B, this year the "KU-DMIS" team participated in both exact and ideal answers. For exact answers, they build upon their previous BioBERT-based systems [49] and try to adapt the sequential transfer learning of Natural Language Inference (NLI) to biomedical question answering. In particular, they investigate whether learning entailment between sentence pairs can improve exact answer generation, enhancing their BioBERT-based models with alternative fine-tuning configurations based on the MultiNLI dataset [46]. For ideal answer generation, they develop a deep neural abstractive summarization model based on BART [24] and beam search, with particular focus on the pre-processing and post-processing steps. In particular, alternative systems were developed that either consider the answers predicted by the exact answer prediction system in their input or not. In the post-processing step, the generated candidate ideal answers for each question were scored using the predicted exact answers and grammar scores provided by the language check tool. For factoid and list questions in particular, the BERN [19] tool was also employed to recognize named entities in the candidate ideal answers for the scoring step.

The "NCU-IISR" team also participated in both parts of phase B, constructing two BioBERT-based models for extracting the exact answers and ranking the ideal answers, respectively. The first model is fine-tuned on the BioASQ dataset, formulated as a SQuAD-type QA task, and extracts the answer span. For the second model, they regard the sentences of the provided snippets as candidate ideal answers and build a ranking model with two parts. First, a BioBERT-based model takes as input the question and one of the snippet sentences and provides their representation. Then, a logistic regressor, trained on predicting the similarity between a question and each snippet sentence, takes this representation and outputs a score, which is used for selecting the final ideal answer.

The "UoT" team participated with three different DL approaches for generating exact answers.
In their first approach, they separately fine-tune two distinct BioBERT-based models, extended with an additional neural layer depending on the question type: one for yes/no questions and one for factoid and list questions together. In their second system, they use a joint-learning setting, where the same BioBERT layer is connected to both additional layers and jointly trained for all types of questions. Finally, in their third system they propose a multi-task model that learns to recognize biomedical entities and to answer questions simultaneously, aiming at transferring knowledge from the biomedical entity recognition task to question answering. In particular, they extend their joint BioBERT-based model with simultaneous training on the BC2GM dataset [42] for recognizing gene and protein entities.

The "BioNLPer" team also participated in the exact answers part of phase B, focusing on factoid questions. They proposed 5 BioBERT-based systems, using external feature enhancement and auxiliary task methodologies. In particular, in their "factoid qa model" and "Parameters retrained" systems they consider the prediction of answer boundaries (start and end positions) as the main task and the prediction of the whole answer content as an auxiliary task. In their "Features Fusion" system they leveraged external features, including named entities and part-of-speech (POS) tags extracted with the NLTK [25] and ScispaCy [31] tools, as additional textual information and fused them with the pre-trained language model representations to improve answer boundary prediction. Then, in their "BioFusion" system they combine the two methodologies. Finally, their "BioLabel" system employed the classification of general versus biomedical domain corpora as the auxiliary task to help answer boundary prediction.

The "LabZhu" systems also participated in phase B, with a focus on exact answers for factoid and list questions. They treat answer generation as an extractive machine comprehension task and explore several different pre-trained language models, including BERT, BioBERT, XLNet [47] and SpanBERT [18]. They also follow a transfer learning approach, training the models on the SQuAD dataset and then fine-tuning them on the BioASQ datasets. Finally, they also rely on voting to integrate the results of multiple models.

The "MQ" team, as in past years, focused on ideal answers, approaching the task as query-based summarization. In some of their systems they retrain their previous classification and regression approaches [28] on the new training dataset. In addition, they also employ reinforcement learning with Proximal Policy Optimization (PPO) [41] and two variants for representing the input features, namely Word2Vec-based and BERT-based embeddings.

The "DAIICT" team also participated in ideal answer generation, using the standard extractive summarization techniques TextRank [27] and LexRank [13], as well as sentence selection techniques based on similarity with the query. They also modified these techniques, investigating the effect of UMLS-based [5] query expansion for sentence selection and summarization.

Finally, the "sbert" team also focused on ideal answers. They experimented with different embedding models and multi-task learning in their systems, using parts of the previous "MQU" systems for the pre-processing of the data and for the prediction step based on classification and regression [28].
In particular, they used a Universal Sentence Embedding Model [9] (BioBERT-NLI), based on a version of BioBERT fine-tuned on the SNLI [6] and MultiNLI datasets, as in Sentence-BERT [39]. The features were fed to either a single logistic regression or classification model to derive the ideal answers. Additionally, in a multi-task setting, they trained the model on both the classification and regression tasks, selecting one of them for the final prediction.

In this challenge too, the open-source OAQA system proposed by [48] served as the baseline for the exact answers of phase B. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed based on the UIMA framework. ClearNLP is employed for question and snippet parsing. MetaMap, TmTool [45], C-Value and LingPipe [3] are used for concept identification, and UMLS Terminology Services (UTS) for concept retrieval. The final steps include the identification of concept, document and snippet relevance based on classifier components, scoring and, finally, ranking techniques.

For the newly introduced MESINESP task, 6 teams from China, India, Portugal and Spain participated, and results from 24 different systems were submitted. The approaches were similar to those for the comparable English task and included KNN and Support Vector Machine classifiers, as well as deep learning frameworks like X-BERT and multilingual BERT, already described in subsection 3.1. A simple lookup system was provided as a baseline for the MESINESP task. This system extracts the labels from an annotated list and then checks whether these annotations are present in a set of text documents, essentially computing the intersection between the tokens of the annotations and the tokens of the documents. This simple approach obtains a MiF of 0.2695.

Table 6. Average system ranks across the batches of task 8a. A hyphenation symbol (-) is used whenever a system participated in fewer than 4 test sets in a batch. Systems participating in fewer than 4 test sets in all three batches are omitted.

In Task 8a, each of the three batches was evaluated independently, as presented in Table 6. Standard flat and hierarchical evaluation measures [4] were used for measuring the classification performance of the systems. In particular, the micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to identify the winners for each batch [20]. As suggested by Demšar [11], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. In this task, the system with the best performance in a test set gets rank 1.0 for this test set, the second best rank 2.0 and so on. In case two or more systems tie, they all receive the average rank. Then, according to the rules of the challenge, the average rank of each system for a batch is calculated based on the four best ranks of the system in the five test sets of the batch. The average ranks of the systems, based on both the flat MiF and the hierarchical LCA-F scores, for the three batches of the task are presented in Table 6.

The results in Task 8a show that, in all test batches and for both flat and hierarchical measures, the best systems outperform the strong baselines. In particular, the "dmiip fdu" systems from the Fudan University team achieve the best performance in all three batches of the task. More detailed results can be found on the online results page.
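The batch-level ranking scheme described above can be made concrete with the following sketch (a simplified illustration, not the official BioASQ evaluation code): systems are ranked within each test set, ties receive the average rank, and each system's batch score is the mean of its four best ranks over the five test sets.

```python
import numpy as np
from scipy.stats import rankdata

def batch_average_rank(scores, n_best=4):
    """scores: (n_systems, n_test_sets) array of per-test-set scores (e.g. MiF or LCA-F).
    Returns each system's average over its n_best best (lowest) ranks in the batch."""
    # Rank systems within each test set: rank 1.0 for the best score, ties share the average rank.
    per_set_ranks = np.column_stack(
        [rankdata(-scores[:, j], method="average") for j in range(scores.shape[1])]
    )
    # Average each system's n_best best (i.e. smallest) ranks across the test sets of the batch.
    return np.sort(per_set_ranks, axis=1)[:, :n_best].mean(axis=1)

# Toy example with three systems and five weekly test sets.
mif = np.array([[0.70, 0.71, 0.69, 0.72, 0.70],
                [0.68, 0.71, 0.70, 0.69, 0.71],
                [0.66, 0.65, 0.67, 0.66, 0.64]])
print(batch_average_rank(mif))  # lower is better; the third system gets the worst average rank
```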
Comparing these results with the corresponding results from previous versions of the task suggests that both the MTI baseline and the top-performing systems keep improving through the years of the challenge, as shown in Figure 2.

Phase A: In the first phase of Task 8b, the systems are ranked according to the Mean Average Precision (MAP) measure for each of the four types of annotations, namely documents, snippets, concepts and RDF triples. This year, the calculation of Average Precision (AP) in MAP for phase A was reconsidered, as described in the official description of the evaluation measures for Task 8b. In brief, since BioASQ3 the participating systems have been allowed to return up to 10 relevant items (e.g. documents), and the calculation of AP was modified to reflect this change. However, in recent years the number of golden relevant items has been observed to be lower than 10 in some cases, resulting in relatively small AP values even for submissions containing all the golden elements. For this reason, this year we modified the MAP calculation to consider both the limit of 10 elements and the actual number of golden elements. In Tables 7 and 8 some indicative preliminary results from batch 2 are presented. The full results are available on the online results page of Task 8b, phase A. The results presented here are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.

Phase B: In the second phase of Task 8b, the participating systems were expected to provide both exact and ideal answers. Regarding the ideal answers, the systems will be ranked according to manual scores assigned to them by the BioASQ experts during the assessment of system responses [4]. For the exact answers, which are required for all questions except the summary ones, the measure considered for ranking the participating systems depends on the question type. For yes/no questions, the systems were ranked according to the macro-averaged F1-measure on the prediction of no and yes answers. For factoid questions, the ranking was based on mean reciprocal rank (MRR), and for list questions on mean F1-measure. Some indicative results for exact answers for the third batch of Task 8b are presented in Table 9. The full results of phase B of Task 8b are available online. These results are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.

Figure 3 presents the performance of the top systems for each question type in exact answers during the eight years of the BioASQ challenge. The diagram reveals that this year the performance of systems on the yes/no questions keeps improving. For instance, in batch 3, presented in Table 9, various systems manage to outperform by far the strong baseline, which is based on a version of the OAQA system that achieved top performance in previous years. Improvements are also observed in the preliminary results for list questions, whereas the top system performance for factoid questions fluctuates in the same range as last year. In general, Figure 3 suggests that for the latter types of questions there is still more room for improvement.

Table 9. Results for batch 3 for exact answers in phase B of Task 8b. Only the performance of the top-20 systems and the BioASQ Baseline are presented.
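The exact-answer measures mentioned above can be summarized with the following simplified sketch (an illustration of the measures as described here, not the official evaluation code, and ignoring details such as answer normalization and synonym handling): macro-averaged F1 over the yes and no classes for yes/no questions, mean reciprocal rank for factoid questions, and mean F1 between predicted and golden entity lists for list questions.

```python
from sklearn.metrics import f1_score

def yesno_macro_f1(golden, predicted):
    """Macro-averaged F1 over the 'yes' and 'no' classes (yes/no questions)."""
    return f1_score(golden, predicted, labels=["yes", "no"], average="macro")

def factoid_mrr(golden, ranked_predictions):
    """Mean reciprocal rank: 1/rank of the first correct candidate, averaged over questions."""
    total = 0.0
    for gold, candidates in zip(golden, ranked_predictions):
        rr = 0.0
        for rank, candidate in enumerate(candidates, start=1):
            if candidate.lower() == gold.lower():
                rr = 1.0 / rank
                break
        total += rr
    return total / len(golden)

def list_mean_f1(golden_lists, predicted_lists):
    """Per-question F1 between predicted and golden entity sets, averaged over questions."""
    f1s = []
    for gold, pred in zip(golden_lists, predicted_lists):
        gold, pred = set(gold), set(pred)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy usage on made-up predictions.
print(yesno_macro_f1(["yes", "no", "yes"], ["yes", "yes", "yes"]))
print(factoid_mrr(["CFTR"], [["TP53", "CFTR", "BRCA1"]]))   # 0.5
print(list_mean_f1([["CFTR", "TP53"]], [["CFTR"]]))          # ~0.667
```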
Fig. 3. The official evaluation scores of the best-performing systems in Task B, Phase B, exact answer generation, across the eight years of the BioASQ challenge. Since BioASQ6 the official measure for yes/no questions is the macro-averaged F1 score (macro F1), but accuracy (Acc) is also presented, as the former official measure. The results for BioASQ8 are preliminary, as the final results for Task 8b will be available after the manual assessment of the system responses.

The MESINESP task proved to be a challenging one, but overall we believe the results were good. Compared to the setting for English, the overall dataset was significantly smaller, and the track evaluation contained not only medical literature, but also clinical trial summaries and healthcare project summaries. Moreover, in the case of the provided training data, two different indexing approaches were used by the literature databases: IBECS has a more centralized manual indexing contracting system, while in the case of LILACS a number of records were indexed through a sort of distributed community effort of human indexers. The training set contained 23,423 unique codes, while the 911 articles in the evaluation set contained almost 4,000 correct DeCS codes. The best predictions, by the Fudan University team, scored a MiF (micro F-measure) of 0.4254, using their AttentionXML system with multilingual BERT, compared to the baseline score of 0.2695. Table 10 shows the results of the runs for this task; in fact, the five best scores came from this team. Although MiF represents the official competition metric, other metrics are provided for completeness. It is noteworthy that another team (Anuj-ml, from India), while not among the highest scoring on MiF, nevertheless scored considerably higher than other teams on precision metrics such as EBP (Example-Based Precision), MaP (Macro Precision) and MiP (Micro Precision). Unfortunately, at this time we have not received details on their system implementation. One problem with medical semantic concept indexing in Spanish, at least for diagnosis- or disease-related terms, is the uneven distribution and high variability of such terms [1].

This paper provides an overview of the eighth BioASQ challenge. This year, the challenge consisted of three tasks: the two tasks on biomedical semantic indexing and question answering in English, already established through the previous seven years of the challenge, and the new MESINESP task on semantic indexing of medical content in Spanish, which ran for the first time. The addition of the new challenging task on medical semantic indexing in Spanish revealed that, in a context beyond the English language, there is even more room for improvement, highlighting the importance of the availability of adequate resources for the development and evaluation of systems that can effectively help biomedical experts dealing with non-English resources.

The overall shift of participant systems towards deep neural approaches, already noticed in previous years, is even more apparent this year. State-of-the-art methodologies have been successfully adapted to biomedical question answering and novel ideas have been investigated. In particular, most of the systems adopted neural embedding approaches, notably based on BERT and BioBERT models, for all tasks of the challenge. In the QA task in particular, different teams attempted to transfer knowledge from general-domain QA datasets, notably SQuAD, or from other NLP tasks such as NER and NLI, also experimenting with multi-task learning settings.
In addition, recent advancements in NLP, such as XLNet [47], BART [24] and SpanBERT [18], have also been tested for the tasks of the challenge. Overall, as in previous versions of the challenge, the top-performing systems were able to advance over the state of the art, outperforming the strong baselines on the challenging shared tasks offered by the organizers. Therefore, we consider that the challenge keeps meeting its goal of pushing the research frontier in biomedical semantic indexing and question answering. The future plans for the challenge include the extension of the benchmark data through a community-driven acquisition process.

Google was a proud sponsor of the BioASQ Challenge in 2019. The eighth edition of BioASQ is also sponsored by Atypon Systems Inc. BioASQ is grateful to NLM for providing the baselines for task 8a and to the CMU team for providing the baselines for task 8b. The MESINESP task is sponsored by the Spanish Plan for the Advancement of Language Technologies (Plan TL) and the Secretaría de Estado para el Avance Digital (SEAD). BioASQ is also grateful to LILACS, SciELO, the Biblioteca Virtual en Salud and the Instituto de Salud Carlos III for providing data for the BioASQ MESINESP task.

References

1. ICD-10 coding of Spanish electronic discharge summaries: an extreme classification problem
2. Calling attention to passages for biomedical question answering
3. LingPipe. Available from World Wide Web
4. Evaluation framework specifications
5. The Unified Medical Language System (UMLS): integrating biomedical terminology
6. A large annotated corpus for learning natural language inference
7. X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers
8. ELECTRA: pre-training text encoders as discriminators rather than generators
9. Supervised learning of universal sentence representations from natural language inference data
10. MER: a shell script and annotation server for minimal named entity recognition and linking
11. Statistical comparisons of classifiers over multiple data sets
12. BERT: pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019
13. LexRank: graph-based lexical centrality as salience in text summarization
14. Elasticsearch: the definitive guide: a distributed real-time search and analytics engine
15. Don't stop pretraining: adapt language models to domains and tasks
16. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications
17. PubMedQA: a dataset for biomedical research question answering
18. SpanBERT: improving pre-training by representing and predicting spans
19. A neural named entity recognition and multi-type normalization tool for biomedical text mining
20. Evaluation measures for hierarchical classification: a unified view and novel approaches
21. BioASQ at CLEF2020: large-scale biomedical semantic indexing and question answering
22. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing
23. BioBERT: a pre-trained biomedical language representation model for biomedical text mining
24. Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
25. NLTK: the natural language toolkit
26. Zero-shot neural retrieval via domain-targeted synthetic query generation
27. TextRank: bringing order into text
28. Classification betters regression in query-based multi-document summarisation techniques for question answering
29. Recent enhancements to the NLM medical text indexer
30. Results of the seventh edition of the BioASQ challenge
31. ScispaCy: fast and robust models for biomedical natural language processing
32. Bio-AnswerFinder: a system to find answers to questions from biomedical texts
33. DeepRank: a new deep architecture for relevance ranking in information retrieval
34. Seventh BioASQ workshop: a challenge on large-scale biomedical semantic indexing and question answering
35. DeepMeSH: deep semantic representation for improving large-scale MeSH indexing
36. Deep contextualized word representations. Proceedings of the Conference on Empirical Methods in Natural Language Processing
37. Convolutional neural network for automatic MeSH indexing
38. SQuAD: 100,000+ questions for machine comprehension of text
39. Sentence-BERT: sentence embeddings using Siamese BERT-networks
40. CoLe and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment
41. Proximal policy optimization algorithms
42. Overview of BioCreative II gene mention recognition
43. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition
44. Large-scale semantic indexing of biomedical publications
45. Beyond accuracy: creating interoperable and scalable text-mining web services
46. A broad-coverage challenge corpus for sentence understanding through inference
47. XLNet: generalized autoregressive pretraining for language understanding
48. Learning to answer biomedical questions: OAQA at BioASQ 4B
49. Pre-trained language model for biomedical question answering
50. AttentionXML: label tree-based attention-aware deep model for high-performance extreme multi-label text classification
51. Using learning-to-rank to enhance NLM medical text indexer results