key: cord-0765139-hg5bwj2r
title: Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
authors: Nentidis, Anastasios; Katsimpras, Georgios; Vandorou, Eirini; Krithara, Anastasia; Gasco, Luis; Krallinger, Martin; Paliouras, Georgios
date: 2021-06-28
journal: 12th International Conference of the Cross-Language Evaluation Forum for European Languages, CLEF 2021
DOI: 10.1007/978-3-030-85251-1_18
sha: 7afa74ecfc184d4d8d98549066ed48187ef21bc0
doc_id: 765139
cord_uid: hg5bwj2r

Advancing the state of the art in large-scale biomedical semantic indexing and question answering is the main focus of the BioASQ challenge. BioASQ organizes respective tasks where different teams develop systems that are evaluated on the same benchmark datasets, which represent the real information needs of experts in the biomedical domain. This paper presents an overview of the ninth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2021. This year, a new question answering task, named Synergy, was introduced to support researchers studying the COVID-19 disease and to measure the ability of the participating teams to discern information while the problem is still developing. In total, 42 teams with more than 170 systems registered to participate in the four tasks of the challenge. The evaluation results, similarly to previous years, show a performance gain over the baselines, which indicates the continuous improvement of the state of the art in this field.

In this paper, we present the shared tasks and the datasets of the ninth BioASQ challenge in 2021, as well as an overview of the participating systems and their performance. The remainder of this paper is organized as follows. Section 2 provides an overview of the shared tasks, which took place from December 2020 to May 2021, and the corresponding datasets developed for the challenge. Section 3 presents a brief overview of the systems developed by the participating teams for the different tasks. Detailed descriptions for some of the systems are available in the proceedings of the lab. Then, in Section 4, we focus on evaluating the performance of the systems for each task and sub-task, using state-of-the-art evaluation measures or manual assessment. Finally, Section 5 draws some conclusions regarding this version of the BioASQ challenge.

This year, the ninth version of the BioASQ challenge offered four tasks: (1) a large-scale biomedical semantic indexing task (task 9a), (2) a biomedical question answering task (task 9b), both considering documents in English, (3) a medical semantic indexing task in Spanish (task MESINESP, using literature, patents and clinical trial abstracts), and (4) a new task on biomedical question answering on the developing problem of COVID-19 (task Synergy). In this section, we describe the two established tasks 9a and 9b, with a focus on differences from previous versions of the challenge [25]. Detailed information about these tasks can be found in [39]. Additionally, we discuss the second version of the MESINESP task and present the new Synergy task on biomedical question answering for developing problems, which was introduced this year, providing statistics about the dataset developed for each task.

The aim of Task 9a is to classify articles from the PubMed/MEDLINE digital library into concepts of the MeSH hierarchy.
Specifically, the test sets for the evaluation of the competing systems consist of new PubMed articles that have not yet been annotated by the indexers at the National Library of Medicine (NLM). Table 1 presents a more detailed view of each test set. As in previous years, the task is realized in three independent runs of 5 weekly test sets each. Two scenarios are provided: i) on-line and ii) large-scale. The test sets are collections of new articles, without any restriction on the journal of publication. For the evaluation of the competing systems, standard flat information retrieval measures are used, as well as hierarchical ones, comparing the predictions of the participants with the annotations from the NLM indexers, once available. Similarly to previous years, for each test set, participants are required to submit their answers within 21 hours. Furthermore, a training dataset was available for Task 9a, which contains 15,559,157 articles with 12.68 labels per article on average, and covers 29,369 distinct MeSH labels in total.

Task 9b focuses on enabling the competing teams to develop systems for all the stages of question answering in the biomedical domain by introducing a large-scale question answering challenge. Again this year, four types of questions are considered: "yes/no", "factoid", "list" and "summary" questions [9]. A total of 3,743 questions, which are annotated with golden relevant elements and answers from previous versions of the task, constitute the available training dataset for this task. The dataset is used by the participating teams to develop their systems. Table 2 provides detailed information about both the training and testing sets. Task 9b is divided into two phases: (phase A) the retrieval of the required information and (phase B) answering the question. Moreover, it is split into five independent bi-weekly batches, and the two phases for each batch run on two consecutive days. In each phase, the participants receive the corresponding test set and have 24 hours to submit the answers of their systems. More precisely, in phase A, a test set of 100 questions written in English is released and the participants are expected to identify and submit relevant elements from designated resources, including PubMed/MEDLINE articles, snippets extracted from these articles, concepts and RDF triples. In phase B, the manually selected relevant articles and snippets for these 100 questions are also released and the participating systems are asked to respond with exact answers, that is, entity names or short phrases, and ideal answers, that is, natural language summaries of the requested information.

Over the last year, scientific production has increased significantly and has made more evident than ever the need to improve information retrieval methods under a multilingual IR or search scenario for medical content beyond data only in English [38]. The scenario faced during the year 2020 demonstrates the need to improve access to information in demanding situations such as disease outbreaks or public health threats at a multinational/cross-border scale. In a health emergency scenario, access to scientific information is essential to accelerate research and healthcare progress and to enable resolving the health crisis more effectively.
During the COVID-19 health crisis, the need to improve multilingual search systems became ever more significant, since a considerable fraction of medical publications (especially clinical case reports on COVID-19 patients) were written in the native language of medical professionals. MESINESP was created in response to the lack of resources for indexing content in languages other than English, and to address the lack of semantic interoperability in the search process when attempting to retrieve medically relevant information across different data sources. The MESINESP 2021 track [14], promoted by the Spanish Plan for the Advancement of Language Technology (Plan TL) and organized by the Barcelona Supercomputing Center (BSC) in collaboration with BioASQ, aims to improve the state of the art of semantic indexing for content written in Spanish, a language that ranks among those with the highest numbers of native speakers in the world. In an effort to improve interoperability in semantic search queries, this edition was divided into three subtracks to index scientific literature, clinical trials and medical patents.

MESINESP-L (subtrack 1) required the automatic indexing with DeCS terms of a set of scientific article records (titles and abstracts) from two widely used literature databases with content in Spanish: IBECS and LILACS. We built the corpora for the task from the data available in BvSalud, the largest database of scientific documents in Spanish, which integrates records from LILACS, MEDLINE, IBECS and other databases. First, we downloaded the whole collection of 1.14 million articles present in the platform. Then, only journal articles with titles and abstracts written in Spanish that had been previously manually indexed by LILACS and IBECS experts with DeCS codes were selected, obtaining a final training dataset of 237,574 articles. A development set of records manually indexed by expert annotators was also provided. This development corpus included 1,065 articles manually annotated (indexed) by the three human indexers who obtained the best inter-annotator agreement in the last MESINESP edition. To generate the test set, 500 publications were selected to be indexed by the three experts. We also incorporated a background set of 9,676 Spanish-language clinical practice guidelines to evaluate the performance of the models on this type of biomedical document.

The clinical trials subtrack (MESINESP-T) asked participating teams to generate models able to automatically predict DeCS codes for clinical trials from the REEC database. Last year's task generated a silver standard (codes automatically assigned by participating teams) for a set of REEC clinical trials. The predictions of the best performing team were used as a substitute or surrogate data collection for training systems, pooling a total of 3,560 clinical trials. For the development set, 147 records manually annotated by expert indexers in MESINESP 2020 were provided. For the test set, we calculated the semantic similarity between the MESINESP-L training corpus and a pre-selection of 416 clinical trials published after 2020. Then, the top 250 most similar clinical trials, which included many COVID-19 related trials, were annotated by our indexers. Similarly to what was done for the scientific literature track, we included a background set of 5,669 documents from medicine data sheets to be automatically indexed by the teams (thus generating a silver standard collection).
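The exact similarity measure used for this test-set selection is not detailed here, so the following is only a minimal sketch of how such a similarity-based selection could be implemented, assuming a TF-IDF representation and cosine similarity; all function and variable names are illustrative.

```python
# Sketch of a similarity-based selection step: rank candidate documents by
# their maximum cosine similarity to a reference corpus and keep the top k.
# This is an illustration, not the organizers' actual selection pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def select_most_similar(reference_texts, candidate_texts, top_k=250):
    vectorizer = TfidfVectorizer(lowercase=True)
    ref_matrix = vectorizer.fit_transform(reference_texts)   # training abstracts
    cand_matrix = vectorizer.transform(candidate_texts)      # candidate trials
    # For each candidate, keep its highest similarity to any reference document.
    sims = cosine_similarity(cand_matrix, ref_matrix).max(axis=1)
    return np.argsort(sims)[::-1][:top_k]                    # indices, best first

# Toy usage: pick the single candidate closest to the reference texts.
reference = ["ensayo de vacuna frente al virus", "estudio de diabetes tipo dos"]
candidates = ["ensayo de una vacuna en adultos", "informe financiero anual"]
print(select_most_similar(reference, candidates, top_k=1))
```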
Finally, for the patents subtrack (MESINESP-P), the aim was to explore and evaluate indexing strategies for medical patents written in Spanish, providing only a very small manually annotated patent collection (in addition to the literature corpus). We presented the track as a cross-corpus training challenge, in which participants should transfer/adapt previous models to the patent language without a large manually annotated dataset. All patents written in Spanish having the assigned IPC codes "A61P" and "A61K31" were retrieved using Google BigQuery; only these codes were considered, as they cover medicinal chemistry related topics [18]. After data harvesting, 65,513 patents were obtained, out of which the 228 most semantically similar to the MESINESP-L training set were chosen. After an annotation process, 119 were used as the development set and 109 as the test set. Summary statistics of the datasets used are provided in the corresponding tables.

Some additional resources were published in order to serve as complementary annotations for participating teams. Since the BSC text mining unit had already implemented several competitive medical named entity recognition tools adapted to content in Spanish [20, 21, 19], four different NER systems were applied to each of the corpora to automatically annotate mentions of medical entities that may help improve model performance, namely diseases, procedures, medications/drugs and symptoms. Many DeCS terms do actually correspond to these semantic classes, in particular diseases. Overall, the semantic annotation results for the MESINESP corpora included around 840,000 disease mentions, 170,000 medicine/drug mentions, 415,000 medical procedure mentions and 137,000 symptom mentions.

The established question answering BioASQ task (Task B) is structured as a sequence of phases. First comes the annotation phase; then, with a partial overlap, runs the challenge; and only when this is finished does the assessment phase start. This leads to restricted interaction between the participating systems and the experts, which is acceptable due to the nature of the questions, which have a clear, undisputed answer. However, a more interactive model is necessary for open questions on developing research topics, such as the case of COVID-19, where new issues appear every day and most of them remain open for some time. In this context, a model based on the synergy between the biomedical experts and the automated question answering systems is needed. In this direction, we introduced the BioASQ Synergy task, envisioning a continuous dialog between the experts and the systems. In this model, the experts pose open questions and the systems provide relevant material and answers for these questions. Then, the experts assess the submitted material (documents and snippets) and answers, and provide feedback to the systems, so that they can improve their responses. This process proceeds with new feedback and new predictions from the systems in an iterative way.

This year, task Synergy took place in two versions, focusing on unanswered questions for the developing problem of the COVID-19 disease. Each version was structured into four rounds of system responses and expert feedback on the same questions. However, some new questions, or new modified versions of some questions, could be added to the test sets. The details of the datasets used in task Synergy are available in Table 5.
Version  Round  Questions  Yes/No  Factoid  List  Summary  Answer  Feedback
   1       1       108       33      22      17     36        0        0
   1       2       113       34      25      18     36       53      101
   1       3       113       34      25      18     36       80       97
   1       4       113       34      25      18     36       86      103
   2       1        95       31      22      18     24        6       95
   2       2        90       27      22      18     23       10       90
   2       3        66       17      14      18     17       25       66
   2       4        63       15      14      17     17       33       63

Table 5. Statistics on the datasets of Task Synergy. "Answer" stands for questions marked as having enough relevant material from previous rounds to be answered.

Contrary to task B, this task was not structured into phases; instead, both relevant material and answers were received together. However, for new questions only relevant material (documents and snippets) is required, until the expert considers that enough material has been gathered during the previous rounds and marks the question as "ready to answer". When a question receives a satisfactory answer that is not expected to change, the expert can mark the question as "closed", indicating that no more material and answers are needed for it. In each round of this task, we consider material from the current version of the COVID-19 Open Research Dataset (CORD-19) [41] to reflect the rapid developments in the field. As in task B, four types of questions are supported, namely yes/no, factoid, list, and summary, and two types of answers, exact and ideal. The evaluation of the systems is based on the measures used in Task 9b. Nevertheless, for the information retrieval part we focus on new material. Therefore, material already assessed in previous rounds, available in the expert feedback, should not be re-submitted. Overall, through this process, we aim to facilitate the incremental understanding of COVID-19 and contribute to the discovery of new solutions.

This year, 6 teams participated in Task 9a with a total of 21 different systems. Below, we provide a brief overview of those systems for which a description was available, stressing their key characteristics. The participating systems, along with their corresponding approaches, are listed in Table 6. Detailed descriptions for some of the systems are available in the proceedings of the workshop.

System            Approach
bert dna, pi dna  SentencePiece, BioBERT, multiple binary classifiers
NLM               SentencePiece, CNN, embeddings, ensembles, PubMedBERT
dmiip fdu         d2v, tf-idf, SVM, KNN, LTR, DeepMeSH, AttentionXML, BERT, PLT
Iria              Lucene index, multilabel k-NN, stem bigrams, ensembles, UIMA ConceptMapper

Table 6. Systems and approaches for Task 9a. Systems for which no description was available at the time of writing are omitted.

The team of Roche and Bogazici University participated in task 9a with four different systems ("bert dna" and "pi dna" variations). In particular, their systems are based on the BERT framework with SentencePiece tokenization and multiple binary classifiers. The rest of the teams built upon existing systems that had already competed in previous versions of the task. The National Library of Medicine (NLM) team competed with five different systems [31]. To improve their previously developed CNN model [32], they utilized a pretrained transformer model, PubMedBERT, which was fine-tuned to rank candidates obtained from the CNN. The Fudan University ("dmiip fdu") team also relied on their previous "AttentionXML" [1], "DeepMeSH" [30], and "BERTMeSH" [46] models. In contrast to their previous version, they extended AttentionXML with BioBERT. Finally, the team of Universidade de Vigo and Universidade da Coruña competed with two systems ("Iria") that followed the same approach used by their systems in previous versions of the task [34].
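Several of the approaches in Table 6, such as the "multiple binary classifiers" of the Roche and Bogazici University systems, frame MeSH indexing as an extreme multi-label classification problem with one binary decision per label. The sketch below illustrates only this basic framing, with TF-IDF features and logistic regression; it is not a reproduction of any participant's BERT-based system, and the labels and texts shown are placeholders.

```python
# Minimal illustration of the "one binary classifier per label" framing of
# semantic indexing. Participating systems replace the TF-IDF features and
# logistic regression with BERT-style encoders, rankers and ensembles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data: article texts and their MeSH-style labels (placeholders).
abstracts = [
    "randomized trial of insulin therapy in type 2 diabetes",
    "glucose control and diabetes management in primary care",
    "insulin signaling pathways in muscle tissue",
]
labels = [["DescriptorA", "DescriptorB"], ["DescriptorA"], ["DescriptorB"]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)   # articles x labels indicator matrix

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one classifier per label
)
model.fit(abstracts, Y)

# Predict labels for a new, unseen abstract.
predicted = model.predict(["insulin dose adjustment in diabetes patients"])
print(binarizer.inverse_transform(predicted))
```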
As in previous versions of the challenge, two systems developed by NLM to facilitate the annotation of articles by indexers in MEDLINE/PubMed were available as baselines for the semantic indexing task: MTI [24], as enhanced in [47], and an extension of it based on features suggested by the winners of the first version of the task [40].

This version of Task 9b was undertaken by 90 different systems in total, developed by 24 teams. In phase A, 9 teams participated, submitting results from 34 systems. In phase B, the numbers of participants and systems were 20 and 70 respectively. Only three teams engaged in both phases. An overview of the technologies employed by the teams is provided in Table 7 for the systems for which a description was available. Detailed descriptions for some of the systems are available in the proceedings of the workshop.

Table 7. Systems and approaches for Task 9b. Systems for which no information was available at the time of writing are omitted.

The "UCSD" team [27] participated in both phases of the task with two systems ("bio-answerfinder"). Specifically, for phase A they relied on their previously developed Bio-AnswerFinder system [28], but instead of the LSTM-based keyword selection classifier, they used a Bio-ELECTRA++ based keyword selection classifier, together with the Bio-ELECTRA Mid based re-ranker [26]. This model was also used as an initial step for their systems in phase B, in order to re-rank candidate sentences. For factoid and list questions, they fine-tuned a Bio-ELECTRA model using both SQuAD and BioASQ training data. The answer candidates are then scored considering the classification probability, the top ranking of the corresponding snippets, and the number of occurrences. Finally, a normalization and filtering step is performed and, for list questions, an enrichment step based on coordinated phrase detection. For yes/no questions, they used a Bio-ELECTRA based ternary yes/no/neutral classifier. The final decision is made by score voting. For summary questions, they follow two approaches. First, they employ hierarchical clustering, based on weighted relaxed word mover's distance (wRWMD) similarity [28], to group the top sentences, and select the sentences ranked highest by Bio-AnswerFinder to be concatenated into the summary. Second, an abstractive summarization system based on the unified text-to-text transformer model T5 [33] is used.

In phase A, the team from the University of Aveiro participated with four distinct "bioinfo" systems [5]. Relying on their previous model [3], they improved the computation flow and experimented with the transformer architecture. In the end, they developed two variants that used the passage mechanism from [3] and the BERT model. The "RYGH" team participated in phase A with five systems. They adopted a pipeline that utilized BM25 along with several pre-trained models, including BioBERT, PubMedBERT, PubMedBERT-FullText and T5.

In phase B, this year the "KU-DMIS" team [2] participated in both exact and ideal answer generation. Their systems are based on transformer models and follow either a model-centric or a data-centric approach. The former, which is based on a sequence tagging approach [45], is used for list questions, while the latter, which relies on the characteristics of the training datasets, making data cleaning and sampling important aspects of its architecture, is used for factoid questions.
For yes/no questions, they utilized the BioBERT-large model, as a replacement for the previously used BioBERT-base model. For ideal answers, they followed last year's strategy, where their BART model uses the predicted exact answer as an input for generating an ideal answer.

Four teams from Macquarie University participated in task 9b. The first team ("MQ") [23] competed with five systems which are based on the use of BERT variants in a classification setting. The classification task takes as input the question, a sentence, and the sentence position, and the target labels are based on the ROUGE score of the sentence with respect to the ideal answer. The second team ("CRJ") competed with three systems that followed the Proximal Policy Optimization (PPO) approach to Reinforcement Learning [22], and also utilized word2vec and BERT word embeddings. The third team ("ALBERT") [16] competed with four systems that were based on the transformer-based language models DistilBERT and ALBERT. The pretrained models were fine-tuned first on the SQuAD dataset and then on the BioASQ dataset. Finally, the fourth team ("MQU") participated with five systems. Their systems utilized sentence transformers fine-tuned for passage retrieval, as well as abstractive summarizers trained on news media data.

The Fudan University team participated with four systems ("Ir sys"). All systems utilized variants of the BERT framework. For yes/no questions they used BioBERT, while for factoid/list questions they combined SpanBERT, PubMedBERT and XLNet. For summary questions, they utilized both extractive and abstractive methods. For the latter, they performed conditional generation of answers by employing the BART model. The "LASIGE ULISBOA" team [10], from the University of Lisboa, competed with four systems which are based on BioBERT. The models are fine-tuned on larger non-medical datasets prior to training on the task's datasets. The final decisions for the list questions are computed by applying a voting scheme, while a softmax is utilized for the remaining questions. The University of Delaware team [6] participated with four systems ("UDEL-LAB") which are based on BioM-Transformers models [7]. In particular, they used both BioM-ALBERT and BioM-ELECTRA, and also applied transfer learning by fine-tuning the models on the MNLI and SQuAD datasets. The "NCU-IISR" team [48], as in the previous version of the challenge, participated in both parts of phase B, constructing various BERT-based models. In particular, they utilized BioBERT and PubMedBERT models to score candidate sentences. Then, as a second step, a logistic regressor, trained to predict the similarity between a question and each snippet sentence, re-ranks the sentences. The "Universiteit van Amsterdam" team submitted three systems ("UvA") that focused on ideal answers. They reformulated the task as a seq2seq language generation task in an encoder-decoder setting. All systems utilized variants of pre-trained language generation models; specifically, they used BART and mT5 [43].

In this challenge too, the open source OAQA system proposed in [44] served as the baseline for phase B exact answers. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed based on the UIMA framework. ClearNLP is employed for question and snippet parsing.
MetaMap, TmTool [42], C-Value and LingPipe [8] are used for concept identification, and the UMLS Terminology Services (UTS) for concept retrieval. The final steps include identification of concept, document and snippet relevance, based on classifier components, scoring and, finally, ranking techniques.

The MESINESP track received greater interest from the public in this second edition. Out of 35 teams registered for CLEF Labs 2021, 7 teams from China, Chile, India, Spain, Portugal and Switzerland finally took part in the task. These teams provided a total of 25 systems for MESINESP-L, 20 for MESINESP-T and 20 for MESINESP-P. Like last year, the approaches were quite similar to those of the English track, relying mainly on deep language models for text representation using BERT-based systems and extreme multilabel classification strategies. Table 8 describes the general methods used by the participants. Most of the teams used sophisticated systems such as AttentionXML, graph-based entity linking, or label encoding systems. However, unlike in the first edition, this year some teams also tested models based on more traditional technologies, such as TF-IDF, to evaluate their performance in the indexing of documents in Spanish. This year's baseline was an improved textual search system that searches the text for both DeCS descriptors and synonyms to assign codes to documents. This approach achieved a MiF of 0.2876 for scientific literature, 0.1288 for clinical trials and 0.2992 for patents.

In the first two versions of the new task Synergy, introduced this year, 15 teams participated, submitting results from 39 distinct systems. An overview of the systems and approaches employed in this task is provided in Table 9 for the systems for which a description was available. More detailed descriptions for some of the systems are available in the proceedings of the workshop.

Table 9. Systems and their approaches for Task Synergy. Systems for which no description was available at the time of writing are omitted.

The Fudan University team uses BM25 to fetch the top documents and then uses BioBERT, SciBERT, ELECTRA and T5 models to score the relevance of each document to the query. Finally, reciprocal rank fusion (RRF) is used to obtain the final document ranking by integrating the previous results. Similarly, for the snippet retrieval task, they use the same method with a focus on sentences. They also participate in all four types of questions. For the yes/no type, they use the BERT encoder, a linear transformation layer and the sigmoid function to calculate the yes or no probability. For factoid/list questions, they again employ BERT as the backbone and fine-tune the model with SQuAD. For summary questions, they perform conditional generation of answers by adopting BART as the backbone of the model. As this is a collaborative task, they use the experts' feedback data in two ways: one is to expand the query via named entity recognition, and the other is to fine-tune the model using the feedback data.

The "MQ" team [23] focused on the question answering component of the task, selecting ideal answers using one of their systems that participated in phase B of BioASQ 8b [22]. For document retrieval, they used the top documents returned by the API provided by BioASQ. For snippet retrieval, they re-ranked the document sentences based on their tf-idf cosine similarity with the question or on the sentence score predicted by their QA system.
In run 4, they experimented with a variant of document retrieval based on an independent retrieval system, tuned with the BioASQ data. Moreover, they incorporated feedback from previous rounds to remove false negatives from documents and snippets and to omit all documents and snippets that had been previously judged.

The "bio-answerfinder" team [27] used the Bio-AnswerFinder end-to-end QA system they had previously developed [28]. For exact and ideal answers, they used re-ranked candidate sentences as input to the Synergy challenge subsystems. For factoid and list questions, they used an answer span classifier obtained by fine-tuning ELECTRA Base [11] on combined SQuAD v1.1 and BioASQ 8b training data. For list questions, answer candidates were enriched by coordinated phrase detection and processing. For yes/no questions, they used a binary classifier fine-tuned on ELECTRA Base using training data created/annotated from the BioASQ 8b training set (ideal answers). For summary questions, they used the top 10 selected sentences to generate an answer summary. Hierarchical clustering using weighted relaxed word mover's distance (wRWMD) similarity was used to group the sentences, with the similarity threshold set to maximize the ROUGE-2 score. They used the feedback provided to augment the training data of the BERT-based [3] re-ranker classifier used by Bio-AnswerFinder, applied after wRWMD similarity based ranking and focus-word-based filtering. At each round, the BERT-Base re-ranker was retrained with the cumulative Synergy expert feedback.

The "University of Aveiro" team [4] built on their BioASQ Task 8b implementation [3], modifying it to fit the Synergy task by adding a methodology to exploit the feedback given in each round. Their approach was to create a strong baseline using a simple relevance feedback technique: using a tf-idf score, they expanded the query and then processed it with the BM25 algorithm. This approach was adopted for questions having some feedback from previous rounds; for new questions, they used the BM25 algorithm along with re-ranking, similarly to their BioASQ Task 8b approach.

The "NLM" team [37] first used the BM25 model to retrieve relevant articles and re-ranked them with a Text-to-Text Transfer Transformer (T5) relevance-based re-ranking model. For snippets, after splitting the relevant articles into sentences and chunks, they used a re-ranking model based on T5 relevance. For ideal answers, they used extractive and abstractive approaches. For the former, they concatenated the top-n snippets, while for the latter they fine-tuned a Bidirectional and Auto-Regressive Transformers (BART) model on multiple biomedical datasets.

The "AUEB" team also built on their implementation from BioASQ Task 8b [29], exploiting the feedback to filter out the material that was already assessed. They participated in all stages of the Synergy task. They mostly use JPDRMM-based methods with Elasticsearch for document retrieval and SEmantic Indexing for SEntence Retrieval (SEMISER) for snippet retrieval.

The "JetBrains" team also based their participation on their BioASQ Task 8b approach. In short, they used Lucene full-text search combined with a BERT-based re-ranker for document retrieval and BERT-based models for exact answers, without using the feedback provided by the experts. The "MQU" team used sentence vector similarities on the entire CORD-19 dataset, not considering the expert feedback either.
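Several of the retrieval approaches summarized above share the same retrieve-then-rerank structure: a fast lexical first stage (typically BM25) followed by a neural re-ranking step, sometimes combined across models with reciprocal rank fusion (RRF). The sketch below illustrates only this generic pattern; it assumes the third-party rank_bm25 package, uses a trivial token-overlap score as a stand-in for the BioBERT/T5 re-rankers of the actual systems, and uses the commonly used RRF constant k=60, none of which are taken from the participants' implementations.

```python
# Generic retrieve-then-rerank sketch (not any team's actual system):
# 1) BM25 retrieves a candidate pool, 2) a relevance scorer re-orders it,
# 3) reciprocal rank fusion (RRF) can merge rankings from several scorers.
from rank_bm25 import BM25Okapi  # third-party package, assumed available

def overlap_score(question: str, document: str) -> float:
    """Toy stand-in for a neural re-ranker (BioBERT/T5 in the actual systems)."""
    q, d = set(question.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve_and_rerank(question, corpus, k_candidates=100, k_final=10):
    tokenized = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.lower().split())
    # Keep the best BM25 candidates, then re-order them with the scorer.
    candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    candidates = candidates[:k_candidates]
    reranked = sorted(candidates, key=lambda i: overlap_score(question, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked[:k_final]]

def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of document ids with reciprocal rank fusion."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```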
In Task 9a, each of the three batches was evaluated independently, as presented in Table 10. As in previous versions of the task, standard evaluation measures [9] were used for measuring the classification performance of the systems, both flat and hierarchical. In particular, the official measures used to identify the winners for each batch were the micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) [17]. As suggested by Demšar [12], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. In this task, the system with the best performance in a test set gets rank 1.0 for this test set, the second best rank 2.0, and so on. In case two or more systems tie, they all receive the average rank. Based on the rules of the challenge, the average rank of each system for a batch is the average of the four best ranks of the system in the five test sets of the batch. The average rank of each system, based on both the flat MiF and the hierarchical LCA-F scores, for the three batches of the task is presented in Table 10.

Table 10. Average system ranks across the batches of task 9a. A hyphenation symbol (-) is used whenever the system participated in fewer than 4 test sets in the batch. Systems participating in fewer than 4 test sets in all three batches are omitted.

The results of Task 9a reveal that several participating systems manage to outperform the strong baselines in all test batches, considering either the flat or the hierarchical measures. Namely, the "dmiip fdu" systems from the Fudan University team achieve the best performance and the "NLM" systems the second best in all three batches of the task. More detailed results can be found on the online results page. Figure 2 presents the improvement of the MiF scores achieved by both the MTI baseline and the top performing participant systems through the nine years of the BioASQ challenge.

Phase A: The evaluation of phase A in Task 9b is based on the Mean Average Precision (MAP) measure for each of the three types of annotations, namely documents, concepts and RDF triples. For snippets, where several distinct snippets may overlap with the same golden snippet, interpreting the MAP, which is based on the number of relevant elements, is more complicated. Therefore, this year the F-measure, calculated based on character overlaps, is used for the official ranking of the systems in snippet retrieval. As in BioASQ8, a modified version of Average Precision (AP) is adopted. In brief, since BioASQ3, the participant systems are allowed to return up to 10 relevant items (e.g. documents), and the calculation of AP was modified to reflect this change. However, some questions with fewer than 10 golden relevant items have been observed in the last years, resulting in relatively small AP values even for submissions containing all the golden elements. Therefore, the AP calculation was modified to consider both the limit of 10 elements and the actual number of golden elements [25]. Some indicative preliminary results from batch 4 are presented in Tables 11 and 12 for document and snippet retrieval. The full results are available on the online results page of Task 9b, phase A. The results presented here are preliminary, as the final results for task 9b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.
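For reference, the flat micro F-measure mentioned above is the standard micro-averaged F1 over the MeSH labels, and the modification of AP described for phase A can be written as below. The AP expression is a reconstruction consistent with the description in the text (a limit of 10 returned items combined with the actual number of golden items), not a quotation of the official definition.

```latex
% Micro-averaged precision, recall and F-measure over all MeSH labels i:
\mathrm{MiP} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad
\mathrm{MiR} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, \qquad
\mathrm{MiF} = \frac{2\,\mathrm{MiP}\,\mathrm{MiR}}{\mathrm{MiP} + \mathrm{MiR}}

% Modified average precision for a ranked list of up to 10 returned items,
% where |R| is the number of golden items for the question, P(k) the precision
% at rank k, and rel(k) indicates whether the item at rank k is relevant:
\mathrm{AP} = \frac{1}{\min\left(10, |R|\right)} \sum_{k=1}^{10} P(k)\, rel(k)
```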
Phase B: In phase B of task 9b, both exact and ideal answers are expected from the participating systems. For the sub-task of ideal answer generation, the BioASQ experts assign manual scores to each answer submitted by the participating systems during the assessment of system responses [9]. These scores are then used for the official ranking of the systems. Regarding exact answers, the participating systems are ranked based on their average ranking in the three question types where exact answers are required. Summary questions are not considered, as no exact answers are submitted for them. For yes/no questions, the systems are ranked based on the F1-measure, macro-averaged over the yes and no classes. For factoid questions, the ranking is based on the mean reciprocal rank (MRR), and for list questions on the mean F1-measure. Indicative preliminary results for exact answers from the fourth batch of Task 9b are presented in Table 13. The full results of phase B of Task 9b are available online. These results are preliminary, as the final results for Task 9b will be available after the manual assessment of the system responses by the BioASQ team of biomedical experts.

The top performance of the participating systems in exact answer generation for each type of question during the nine years of BioASQ is presented in Figure 3. These results reveal that the participating systems keep improving in all types of questions. In batch 4, for instance, presented in Table 13, for yes/no questions most systems manage to outperform by far the strong baseline, which is based on a version of the OAQA system that achieved top performance in previous years. For list and factoid questions, some improvements are also observed in the preliminary results compared to previous years, but there is still more room for improvement.

Fig. 3. The official evaluation scores of the best performing systems in Task B, Phase B, exact answer generation, across the nine years of the BioASQ challenge. Since BioASQ6 the official measure for yes/no questions is the macro-averaged F1 score (macro F1), but accuracy (Acc) is also presented as the former official measure.

Regarding the MESINESP task, the performance of the participating teams this year is higher than last year. There has been an increase in F-score of 0.06 for scientific literature, and the state of the art for semantic indexing of clinical trials and patents with DeCS has been established at 0.3640 and 0.4514, respectively. As shown in Table 14, once again, the top performer this year was the Bert-DeCS system developed by Fudan University. Their system was based on an AttentionXML architecture with a Multilingual BERT encoding layer that was trained with MEDLINE articles and then fine-tuned with the MESINESP corpora. This architecture obtained the best MiF score in scientific literature, clinical trials and patents. However, the best code prediction accuracy was achieved by Roche's "pi dna" system. Comparing the performance of the models with the baseline, it is noteworthy that only 7 of the models implemented for patents have been able to outperform the look-up system, highlighting the good performance of iria-2. The results of the task show a drop in performance compared to the English task, despite teams using similar technologies.
This drop in performance could be associated with a lower number of training documents and with inconsistencies in the manual indexing of these documents, because they come from two different bibliographic sources [35]. Alternatively, it could also be explained by the delay in updating deprecated DeCS codes in the historical database. DeCS adds and removes terms twice a year, and the lack of temporal alignment in the update process could lead to inconsistencies between training and test data and decrease overall performance.

Regarding the MESINESP-T track, there is no similar task in English against which to compare the results. The performance of the models is systematically lower than that of the models generated for scientific literature. Because participants reported that they reused the models trained on scientific literature, incorporating the development set to make their predictions, the drop in performance cannot be attributed to a low-quality gold standard. However, given that clinical trial documents are much longer than article abstracts, and that most systems use BERT models with an input size limit of 512 tokens, it is possible that a significant part of each document is not processed by the models and relevant information is lost for indexing.

The patents subtrack presented a major challenge for the participants, as they did not have a large training and development dataset. Since the statistics of the MESINESP-T and MESINESP-P corpora were similar, the participants addressed the lack of data by using the same models generated for scientific literature. The resulting models were promising, and the performance of some of the systems, such as Fudan, Roche and Iria, remained at the same level as in the scientific literature track. On the other hand, although the performance of the models is lower than in the English task, we used the participants' results to see whether the manual annotation process could be improved. To this end, a module for indexing assistance was developed in the ASIT tool, and a set of documents pre-annotated with the predictions of the best-performing team was provided to our expert indexers. After tracking annotation times, we observed that this type of system could reduce annotation times by up to 60% [14].

In task Synergy, the participating systems were expected to retrieve documents and snippets, as in phase A of task B, and, at the same time, provide answers for some of these questions, as in phase B of task B. In contrast to task B, it is possible that no answer exists for some questions. Therefore, only some of the questions provided in each test set, namely those indicated to have enough relevant material gathered from previous rounds, require the submission of exact and ideal answers. Also in contrast to task B, during the first round no golden documents and snippets were given, while in the subsequent rounds a separate file with feedback from the experts, based on the previously submitted responses, was provided. The feedback concept was introduced in this task to further assist the collaboration between the systems and the BioASQ team of biomedical experts. The feedback includes the already judged documents and answers, along with their evaluated relevance to the question. The documents and snippets included in the feedback are not considered valid for submission in the following rounds, and even if accidentally submitted, they will not be taken into account for the evaluation of that round.
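Since already-judged material is invalid in later rounds, a participating system needs a simple filtering step before each submission. The sketch below illustrates this step; the feedback structure assumed here (a set of judged document identifiers per question) is an illustration, not the official BioASQ feedback format.

```python
# Filtering step before each Synergy submission: drop candidate documents that
# were already judged by the experts in previous rounds. The feedback structure
# used here is illustrative, not the official BioASQ feedback format.

def filter_new_material(question_id, ranked_doc_ids, feedback):
    """Keep only documents that have not been judged in earlier rounds,
    preserving the system's ranking order."""
    already_judged = feedback.get(question_id, set())
    return [doc_id for doc_id in ranked_doc_ids if doc_id not in already_judged]

# Toy usage:
feedback = {"q1": {"doc_3", "doc_7"}}                   # judged in previous rounds
candidates = ["doc_7", "doc_1", "doc_3", "doc_9"]       # current system ranking
print(filter_new_material("q1", candidates, feedback))  # -> ['doc_1', 'doc_9']
```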
The evaluation measures for the retrieval of documents and snippets are MAP and the F-measure, respectively, as in phase A of task B. Regarding the ideal answers, the systems are ranked according to manual scores assigned to them by the BioASQ experts during the assessment of system responses, as in phase B of task B [9]. For the exact answers, which are required for all questions except the summary ones, the measure considered for ranking the participating systems depends on the question type. For yes/no questions, the systems were ranked according to the F1-measure, macro-averaged over the yes and no answers. For factoid questions, the ranking was based on the mean reciprocal rank (MRR), and for list questions on the mean F1-measure. Some indicative results for document retrieval in the first round of the Synergy task, version 1, are presented in Table 15. The full results of the Synergy task are available online. As regards the extraction of exact answers, despite the moderate scores in list and factoid questions, the experts found the participants' submissions useful, as most of them (more than 70%) stated that they would be interested in using a tool following the BioASQ Synergy process to identify interesting material and answers for their research.

Table 15. Results for document retrieval in round 1 of the first version of the Synergy task. Only the top-10 systems are presented.

An overview of the ninth BioASQ challenge is provided in this paper. This year, the challenge consisted of four tasks: the two tasks on biomedical semantic indexing and question answering in English, already established through the previous eight years of the challenge; the second version of the MESINESP task on semantic indexing of medical content in Spanish; and the new task Synergy on question answering for COVID-19. In the second version of the MESINESP task, we introduced two new challenging sub-tracks beyond the one on medical literature, namely on patents and clinical trials in Spanish. Due to the lack of big datasets in these new tracks, the participants were pushed to experiment with transferring knowledge and models from the literature track, highlighting the importance of adequate resources for the development of systems that effectively help biomedical experts dealing with non-English resources. The introduction of the Synergy task, in an effort to enable a dialogue between the participating systems and biomedical experts, revealed that state-of-the-art systems, although they still have room for improvement, can be a useful tool for biomedical experts who need specialized information in the context of the developing problem of the COVID-19 pandemic.

The overall shift of participant systems towards deep neural approaches, observed during the last years, is even more apparent this year. State-of-the-art methodologies have been successfully adapted to biomedical question answering, and novel ideas have been explored, leading to improved results, particularly for exact answer generation, this year. Most of the teams developed systems based on neural embeddings, such as BERT, SciBERT, and BioBERT models, for all tasks of the challenge. In the QA tasks in particular, different teams attempted to transfer knowledge from general-domain QA datasets, notably SQuAD, or from other NLP tasks such as NER and NLI. Overall, the top performing systems were able to advance over the state of the art, outperforming the strong baselines on the challenging tasks offered in BioASQ, as in previous versions of the challenge.
Therefore, BioASQ keeps pushing the research frontier in biomedical semantic indexing and question answering, extending beyond the English language, through MESINESP, and beyond the already established models for the shared tasks, by introducing Synergy. The future plans for the challenge include the extension of the benchmark data for question answering through a community-driven process, as well as extending the Synergy task to other developing problems beyond COVID-19.

Google was a proud sponsor of the BioASQ Challenge in 2020. The ninth edition of BioASQ is also sponsored by Atypon Systems Inc. BioASQ is grateful to NLM for providing the baselines for task 9a and to the CMU team for providing the baselines for task 9b. The MESINESP task is sponsored by the Spanish Plan for the Advancement of Language Technologies (Plan TL) and the Secretaría de Estado para el Avance Digital (SEAD). BioASQ is also grateful to LILACS, SciELO, the Biblioteca Virtual en Salud and the Instituto de Salud Carlos III for providing data for the BioASQ MESINESP task.

References

[1] AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification
[2] KU-DMIS at BioASQ 9: Data-centric and model-centric approaches for biomedical question answering
[3] Bit.UA at BioASQ 8: Lightweight neural document ranking with zero-shot snippet retrieval
[4] BioASQ Synergy: A strong and simple baseline rooted in relevance feedback
[5] Universal passage weighting mechanism (UPWM) in BioASQ 9b
[6] Large biomedical question answering models with ALBERT and ELECTRA
[7] BioM-Transformers: Building large biomedical language models with BERT
[8] LingPipe. Available from World Wide Web
[9] Evaluation framework specifications
[10] Post-processing BioBERT and using voting methods for biomedical question answering
[11] ELECTRA: Pre-training text encoders as discriminators rather than generators
[12] Statistical comparisons of classifiers over multiple data sets
[13] Vicomtech at MESINESP2: BERT-based multi-label classification models for biomedical text indexing
[14] Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials
[15] Pidna at BioASQ MESINESP: Hybrid semantic indexing for biomedical articles in Spanish
[16] Transformer-based language models for factoid question answering at BioASQ 9b
[17] Evaluation measures for hierarchical classification: a unified view and novel approaches
[18] Overview of the CHEMDNER patents task
[19] Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results
[20] The ProfNER shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora
[21] Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at the CodiEsp track of CLEF eHealth 2020
[22] Query focused multi-document summarisation of biomedical texts
[23] Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions
[24] Recent enhancements to the NLM Medical Text Indexer
[25] Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering
[26] On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining
[27] End-to-end biomedical question answering via Bio-AnswerFinder and discriminative language representation models
[28] Bio-AnswerFinder: a system to find answers to questions from biomedical texts
[29] AUEB-NLP at BioASQ 8: Biomedical document and snippet retrieval
[30] DeepMeSH: deep semantic representation for improving large-scale MeSH indexing
[31] A neural text ranking approach for automatic MeSH indexing
[32] Automatic MeSH indexing: Revisiting the subheading attachment problem
[33] Exploring the limits of transfer learning with a unified text-to-text transformer
[34] CoLe and UTAI at BioASQ 2015: Experiments with similarity based descriptor assignment
[35] Overview of MESINESP8, a Spanish medical semantic indexing task within BioASQ
[36] LASIGE-BioTM at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on Spanish biomedical documents
[37] NLM at BioASQ 2021: Deep learning-based methods for biomedical question answering about COVID-19
[38] The growth of COVID-19 scientific literature: A forecast analysis of different daily time series in specific settings
[39] An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition
[40] Large-Scale Semantic Indexing of Biomedical Publications
[41] CORD-19: The COVID-19 Open Research Dataset
[42] Beyond accuracy: creating interoperable and scalable text-mining web services
[43] mT5: A massively multilingual pre-trained text-to-text transformer
[44] Learning to answer biomedical questions: OAQA at BioASQ 4B
[45] Sequence tagging for biomedical extractive question answering
[46] BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text
[47] Using learning-to-rank to enhance NLM Medical Text Indexer results
[48] NCU-IISR/AS-GIS: Results of various pre-trained biomedical language models and logistic regression model in BioASQ Task 9b Phase B