key: cord-0459642-o5ntjl4o
authors: Blagec, Kathrin; Kraiger, Jakob; Frühwirt, Wolfgang; Samwald, Matthias
title: Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals
date: 2022-01-18
journal: nan
DOI: nan
sha: 0222e81f8c9d3fb07f9e2bf737f0d32d0503511c
doc_id: 459642
cord_uid: o5ntjl4o

Publicly accessible benchmarks that allow for assessing and comparing model performances are important drivers of progress in artificial intelligence (AI). While recent advances in AI capabilities hold the potential to transform medical practice by assisting and augmenting the cognitive processes of healthcare professionals, the coverage of clinically relevant tasks by AI benchmarks is largely unclear. Furthermore, there is a lack of systematized meta-information that allows clinical AI researchers to quickly determine the accessibility, scope, content and other characteristics of datasets and benchmark datasets relevant to the clinical domain. To address these issues, we curated and released a comprehensive catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP), based on a systematic review of literature and online resources. A total of 450 NLP datasets were manually systematized and annotated with rich metadata, such as targeted tasks, clinical applicability, data types, performance metrics, accessibility and licensing information, and availability of data splits. We then compared the tasks covered by AI benchmark datasets with relevant tasks that medical practitioners reported as highly desirable targets for automation in a previous empirical study. Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed. In particular, tasks associated with routine documentation and patient data administration workflows are not represented despite significant associated workloads. Thus, currently available AI benchmarks are poorly aligned with desired targets for AI automation in clinical settings, and novel benchmarks should be created to fill these gaps.

Recent advances in artificial intelligence (AI) capabilities hold the potential to transform medical practice. In contrast to the oft-discussed replacement of medical professionals, fields of application in the foreseeable future will more likely lie in assisting and augmenting the cognitive processes of professionals. Recent years have witnessed a rapid increase in research on algorithms aimed at a wide variety of clinical tasks, such as diabetic retinopathy detection from fundus images, visual question-answering on radiology images, or information synthesis and retrieval. [1-5] Besides tasks aimed directly at clinical care, administrative healthcare tasks that may help to reduce clinicians' workloads, such as automatic ICD coding from electronic health record (EHR) data, have also been targeted. [6-8] As AI algorithms and training paradigms become better and more versatile, the availability of high-quality, representative data to train and validate machine learning models is key to optimally harnessing their potential for biomedical research and clinical decision making. In AI research, model performance is commonly evaluated and compared using benchmarks.
A benchmark constitutes a task (e.g., named entity recognition), a dataset representative of the task, and one or more metrics to evaluate model performance (e.g., F1 score). Benchmark datasets are commonly publicly available to researchers and therefore provide a transparent, standardised way to assess and compare model performance. Consequently, benchmarks also act as drivers of AI development, and a rapidly growing body of research is devoted to understanding, critically reflecting on and improving AI benchmarking and capability measurement. [9-14]

While benchmarks exist for a large number of general-domain AI tasks, the coverage of biomedical and clinical tasks is largely unclear. A clear understanding of which tasks the biomedical and clinical AI research community is currently tackling, and which of these tasks are or are not covered by benchmarks, helps to inform the future creation of benchmarks. A systematic model of biomedical and clinical tasks can further offer valuable insights into how different research areas might synergize and help identify future research focus areas. Currently, there is a lack of systematized meta-information on biomedical and clinical datasets and benchmarks that would allow researchers to quickly determine their accessibility, scope, content and other characteristics. [15]

Several initiatives to index datasets and benchmark datasets relevant to AI research have been introduced in recent years. These include the 'Papers with Code' datasets 1, the 'Hugging Face' datasets 2 and the Online Registry of Biomedical Informatics Tools (ORBIT) project 3. [16] These are, however, either no longer maintained (ORBIT), or focused on making datasets available for use in programming frameworks (Hugging Face) or on capturing benchmark results (Papers with Code) rather than on systematizing and modeling datasets and tasks. Furthermore, Hugging Face and Papers with Code are focused on general-domain tasks and datasets, with biomedical and clinical datasets making up only a small fraction of their current databases. Finally, the future utility of AI for healthcare hinges on how well AI benchmarks reflect actual needs in healthcare. To the best of our knowledge, no study has yet investigated this essential question.

This paper aims to address these issues in a threefold way, focused on natural language processing (NLP) tasks:
• First, we introduce a comprehensive curated catalogue of 450 biomedical and clinical NLP datasets and benchmark datasets based on a systematic literature review covering biomedical literature, computer science literature and grey-literature data sources.
• Second, we manually systematize and annotate these datasets and benchmarks with meta-information, such as accessibility, performance metrics, availability of data splits and associated tasks, while considering interoperability and harmonisation with existing ontologies, such as EDAM, SNOMED CT and ITO. [17, 18]
• Finally, based on this data source, we analyze the current availability of clinically relevant AI benchmarks and their overlap with actual needs in healthcare.

MEDLINE/PubMed 4 and arXiv 5 were selected as the main literature sources to ensure coverage of both biomedical and computer science literature. The MEDLINE database contains bibliographic information on more than 26 million biomedical journal articles as of January 2020. arXiv is an open-access preprint server for scientific papers covering a wide range of fields, including computer science.
We considered using additional search engines such as Semantic Scholar for the review, but decided to rely on PubMed and arXiv because other systems lack, for example, Boolean search functionality or have non-transparent search modalities, as described previously by Gusenbauer and Haddaway. [19] In addition to the systematic literature review, we included NLP and machine learning challenge and shared-task websites, such as BioASQ 6 and n2c2 7, as sources. We focused our review on benchmarks for NLP tasks.

PubMed was queried using the web interface, and results as of November 12, 2020 were exported. arXiv was queried using its API on November 27, 2020. Records were converted from XML to CSV. Extracted datasets were de-duplicated and annotated with their meta-information. In many cases, datasets appeared under different name variants or did not have an explicit name. We normalized dataset names to the best of our knowledge, using, for example, the names given on official dataset websites or repositories as a reference. For datasets with no explicit name, we instead used a surrogate name in the form of a short description as reported in the respective paper. For each dataset, alias names occurring in the identified records were annotated.

Table 1. Definitions of key terms.
• Clinical: "Relating to the examination and treatment of patients and their illnesses" (Oxford dictionary).
• Benchmark dataset: "Any resource that has been published explicitly as a dataset that can be used for evaluation, is publicly available or accessible on request, and has clear evaluation methods defined."
• Clinically relevant benchmark dataset: "Benchmark datasets directly relating to the entirety of processes involved in the examination and treatment of patients and their illnesses."
• Information retrieval (IR): "Obtaining information system resources that are relevant to an information need from a collection of those resources" (Source: Wikipedia).
• Question answering (QA): "Building systems that automatically answer questions posed in a natural language" (Source: adapted from Wikipedia).
• Clinical care task: Tasks that are directly related to the examination and treatment of patients and their illnesses. Includes reviewing and searching for medical information using a variety of information sources, such as books, scientific literature or web-based information content, as well as the analysis and interpretation of diagnostic tests, including medical imaging results.
• Administrative task: Tasks such as scheduling and managing patient appointments; filing, updating and organizing patient records; or coding medical records for billing.
• Scientific task: Tasks related to the coordination, conduct or reporting of clinical scientific research, including, for example, the selection of eligible patients for clinical trials.

We distinguished between datasets and benchmark datasets. We defined benchmark datasets as "any resource that has been published explicitly as a dataset that can be used for evaluation, is publicly available or accessible on request, and has clear evaluation methods defined" (see Table 1). While our analysis is focused on benchmark datasets, we also documented relevant non-benchmark datasets in the catalogue because they may be used for the creation of novel datasets as well as for unsupervised pre-training of machine learning models. Task descriptions extracted from the identified records were standardized and mapped to broad task families based on two relevant ontologies, i.e., the Intelligence Task Ontology (ITO) 8 and SNOMED CT.
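As an illustration of the arXiv retrieval and XML-to-CSV conversion step described above, the sketch below queries the public arXiv API and writes basic record metadata to a CSV file. The query string, selected fields and output file name are placeholders for illustration only and are not the exact parameters used in our review.

```python
"""Illustrative sketch of the arXiv retrieval and XML-to-CSV conversion step.

Note: the query string, selected fields and file name below are assumptions
for illustration; they are not the exact parameters used in the review.
"""
import csv
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv API


def query_arxiv(search_query: str, max_results: int = 100) -> list[dict]:
    """Query the public arXiv API and return basic metadata for each record."""
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as response:
        feed = ET.fromstring(response.read())
    records = []
    for entry in feed.findall(f"{ATOM}entry"):
        records.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "published": entry.findtext(f"{ATOM}published", default="").strip(),
        })
    return records


if __name__ == "__main__":
    # Illustrative query only -- not the search string used in the review.
    records = query_arxiv('all:"clinical natural language processing" AND all:dataset')
    # Convert the XML-derived records to CSV, analogous to the conversion step above.
    with open("arxiv_records.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "published"])
        writer.writeheader()
        writer.writerows(records)
```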
The complete list of annotated fields, including their descriptions, can be found in Table 2.

Table 2. Annotated fields and their descriptions.
• Name (or description if no name available): Name of the dataset, or a description if no official name is available.
• Task: Task as described in the source paper (if available).
• Mapped task: Mapped task from relevant ontologies, e.g., the Intelligence Task Ontology (ITO) and SNOMED CT.
• Id(s) of mapped task: Id(s) of the ontology classes representing the mapped task.
• Data basis: Data type of the source data used to create the dataset, e.g., "Clinical notes / EHR data".
• Availability of evaluation criteria: 'Yes' (the paper or original source describes evaluation criteria that are used to assess a model's performance on the respective task) or 'No' (the dataset was published without any specific evaluation criteria).
• Performance metrics: Performance metrics for evaluation on benchmark datasets, e.g., F-measure.
• Data splits:
  - 'Available': The dataset can be downloaded together with an official train / validation / test split.
  - 'Described': A data split is described in the source paper but not available together with the data.
  - 'Not described': The source paper does not contain any information on data splits.
  - 'Not available': There is no data split.
• Accessibility:
  - 'Public': The dataset is publicly available and can be accessed and downloaded by everyone.
  - 'Public (planned)': The dataset is not yet publicly available but there are plans to make it available.
  - 'Upon registration': The dataset can be accessed and downloaded after filling out a registration form.
  - 'On request': The dataset can be requested from the owner(s) via email.
  - 'Unknown': The availability of the dataset is unknown, i.e., it is not made available and there is no explicit statement about its availability.
  - 'Not available': It is explicitly stated that the dataset is not available due to, e.g., privacy or other reasons.

The dataset is released in two formats: as a Google sheet 11 and as a versioned TSV file at Zenodo 12. [23] Additionally, we make the raw exports of the literature review results available via Zenodo.

Figure 2 shows the characteristics of all datasets (i.e., benchmark datasets and non-benchmark datasets) included in the catalogue in terms of source data, task family, accessibility and clinical relevance. At the time of analysis, only 28.2% of all datasets were publicly available, while 9.5% and 13.1% of datasets were available after undergoing a registration procedure or upon written request, respectively (see Figure 2c). It should, however, be noted that in many of these cases the applicant may be required to hold a certain position, e.g., be verifiably employed as a researcher at an academic institution.

[Figure 2, partial caption: c) Dataset accessibility. d) Proportion of benchmark datasets among directly and indirectly clinically relevant datasets. The task family and clinical relevance charts include only datasets associated with a concrete task as stated in the reference source (n=314).]

The overwhelming majority (86.7%, n=117) of currently available clinically relevant benchmarks are focused on narrow technical aspects of clinical and biomedical AI tasks. In contrast, few benchmarks with direct clinical relevance exist (13.3%, n=18). Figure 3 shows characteristics of the identified benchmark datasets in terms of source data, task family and the availability of pre-defined data splits.
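Because the catalogue is released as a versioned TSV file, summary statistics such as the accessibility breakdown reported above can be recomputed directly from the released data. The following is a minimal sketch; the local file name "dataset_catalogue.tsv" and the column header "Accessibility" are assumptions and should be checked against the actual release.

```python
"""Sketch: recomputing the accessibility breakdown from the released TSV.

The file name and the "Accessibility" column header are assumptions for
illustration; consult the released TSV / Google sheet for the actual schema.
"""
import csv
from collections import Counter


def accessibility_breakdown(path: str) -> dict[str, float]:
    """Return the percentage of catalogue entries per accessibility category."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    if not rows:
        return {}
    counts = Counter(row.get("Accessibility", "Unknown") for row in rows)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.most_common()}


if __name__ == "__main__":
    for label, pct in accessibility_breakdown("dataset_catalogue.tsv").items():
        print(f"{label}: {pct}%")
```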
Benchmarks classified as directly clinically relevant belonged to four broad task groups, i.e., visual question-answering in the areas of radiology and histopathology (n=6), text-based information retrieval and question-answering in general clinical domains (n=7), radiological/histopathological report generation (n=3), and radiological image annotation (n=2). Table 3 lists the identified benchmarks with direct clinical relevance.

Table 3. Benchmarks with direct clinical relevance (task description; target users; example; accessibility).
• Visual question-answering on radiology images using manually generated clinical questions. Target users: clinicians, radiologists. Example: Q: "Is there evidence of an aortic aneurysm?" A: "no". Accessibility: Public.
• PathVQA [30]: Visual question-answering based on histopathological images from pathology textbooks and digital resources using automatically generated questions. Target users: pathologists. Example: Q: "Is the covering mucosa ulcerated?" A: "yes". Accessibility: Public.
• IU X-Ray [31]: Radiological report generation. Target users: radiologists. Example: "Eventration of the right hemidiaphragm. No focal airspace consolidation. No pleural effusion or pneumothorax."* Accessibility: Public.
• PEIR Gross dataset [32]: Histological report generation. Target users: pathologists. Example: "Carcinoma: Micro low mag H&E needle biopsy with obvious carcinoma". Accessibility: Public.
• PadChest [33]: Radiological report generation. Target users: radiologists. Example: "pleural effusion, costophrenic angle blunting". Accessibility: On request.
• ImageCLEFmedical 2018 Visual Question Answering (VQA) [34]: Visual question-answering on clinical images extracted from PubMed using synthetic question-answer pairs based on image captions. Target users: clinicians. Example: Q: "Is the lesion associated with a mass effect?" A: "no". Accessibility: Upon registration.
• ImageCLEFmedical 2019 Visual Question Answering (VQA) [34]: Visual question-answering on radiology images covering four question categories, i.e., modality, plane, organ system and abnormality. Target users: clinicians. Example: Q: "What is the primary abnormality in this image?" A: "Burst fracture". Accessibility: Upon registration.
• ImageCLEFmedical Visual Question Answering (VQA) 2020 and 2021 [34]: Visual question-answering on radiology images focusing on questions about abnormalities. Target users: clinicians. Example: Q: "What abnormality is seen in the image?" A: "ollier's disease, enchondromatosis". Accessibility: Upon registration.
• ImageCLEFmedical Medical Automatic Image Annotation Task [34]: Annotation of radiology/histology images. Target users: radiologists/pathologists. Example: "Immunohistochemical stain for pancytokeratin, highlighting tumor cells with unstained lymphocytes in the background".

Mapping of AI benchmarks to the list of real-world clinical work activities revealed that many work activities with potential for assisting and disburdening healthcare staff are currently not or only scarcely addressed (Table 4). Work activities with the highest number of associated AI benchmarks were "Process x-rays or other medical images", "Review professional literature to maintain professional knowledge" and "Gather medical information from patient histories".

An informal review of the validity and representativeness of the currently available, directly clinically relevant benchmarks revealed two areas of concern. First, we found that datasets commonly contained redundant items, or items not representative of real-world clinical tasks. Examples of the latter include unrepresentative questions in visual question-answering tasks, such as "What is not pictured in this image?", or bogus synthetically generated questions, such as "The tumor cells and whose nuclei are fairly uniform, giving a monotonous appearance?".
Second, for manually or semi-automatically generated datasets, the varying expertise and number of annotators involved in the creation process might have impacted data quality.

We created this dataset based on a systematic review conducted in accordance with current best-practice guidelines, such as the PRISMA guidelines. [35] The PRISMA guidelines were primarily created for systematic reviews of randomized trials and evaluations of interventions; nonetheless, the general review framework is applicable to all types of systematic literature reviews. We built our analysis of benchmark coverage of clinically relevant tasks on recent work by Frühwirt and Duckworth, who investigated the possibility and desirability of automating work activities in the healthcare domain based on thousands of ratings by domain experts. [22]

We found that AI benchmarks of direct clinical relevance are scarce and fail to cover many work activities that clinicians most want to see addressed. In particular, tasks associated with routine documentation and patient data administration workflows were scarcely represented despite the significant workloads associated with them. There are several potential reasons for this. First, our criteria for benchmark datasets were the availability of clear evaluation criteria and public accessibility of the dataset. Our analysis has shown that a large share of published research on clinically relevant AI is conducted on datasets that are not available to other researchers. This is in line with previous research that investigated the availability of biomedical datasets. [16] Data governance in the biomedical domain, and especially in its clinical subdomains, is strongly marked by privacy and data protection considerations. In applied clinical AI research, institutional data, e.g., from local EHR systems, is often used. Making such data available to other researchers requires adequate measures to preserve patient privacy, which may be associated with increased workload and/or costs. While the safeguarding of patient privacy and data protection is of fundamental importance, it also entails cutbacks in the transparency and reproducibility of current biomedical AI research, which may ultimately impede research progress. Emerging decentralized learning approaches, such as federated learning, may address this problem by enabling models to be trained across institutions without centralizing sensitive data, thereby maintaining patient privacy and data protection. [36] However, while such approaches hold promise to unlock the potential value of sensitive data, widespread adoption has yet to materialize and will require adequate incentivisation.

The lack of coverage of administrative clinical work activities may further point to a research prioritization of tasks that directly impact patient treatment, such as diagnosing a disease or finding information on diseases and/or their treatment. This seems to neglect the fact that a significant share of healthcare providers' workload is caused by administrative tasks and paperwork. [37, 38] Relieving healthcare providers of such administrative tasks may thus indirectly improve the quality of provided healthcare by freeing up time and cognitive resources for actual clinical care and patient communication.

We make the curated catalogue available and expect it to be useful to a broad target audience, including biomedical and clinical researchers, NLP and AI researchers, and ML practitioners in general.
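For example, a researcher looking for openly accessible benchmarks for a given task family could filter the released TSV directly, as in the minimal sketch below. The file name and the column headers used here ("Name", "Mapped task", "Accessibility") are assumptions for illustration and should be checked against the released catalogue.

```python
"""Sketch: filtering the released catalogue (TSV version) for public benchmarks.

The file name and the column headers used below ("Name", "Mapped task",
"Accessibility") are assumptions for illustration; consult the released
TSV / Google sheet for the actual schema.
"""
import pandas as pd


def find_public_benchmarks(path: str, task_keyword: str) -> pd.DataFrame:
    """Return catalogue rows whose mapped task matches a keyword and that are publicly accessible."""
    catalogue = pd.read_csv(path, sep="\t")
    task_match = catalogue["Mapped task"].str.contains(task_keyword, case=False, na=False)
    is_public = catalogue["Accessibility"].eq("Public")
    return catalogue[task_match & is_public]


if __name__ == "__main__":
    hits = find_public_benchmarks("dataset_catalogue.tsv", "question answering")
    print(hits[["Name", "Mapped task", "Accessibility"]])
```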
The dataset can be utilized for a variety of purposes, such as application development, AI research and meta-research. For AI developers and researchers, the dataset offers a comprehensive overview of NLP in medical application areas and enables finding datasets and benchmarks relevant for a specific task and type of data (e.g., scientific, clinical). Its practical utility is further increased by the addition of detailed information on data accessibility and licensing.

One limitation of this initial version of the dataset is that we included datasets and benchmark datasets based on their occurrence in the literature and grey literature, regardless of their size, provenance, generation/annotation process and internal validity. We therefore strongly encourage users to individually verify that the respective dataset is appropriate for the intended use case.

The dataset is intended as a living, extendable resource: newly identified biomedical datasets will continue to be added to the catalogue. To this end, methods for creating a semi-automatic pipeline for extracting datasets from the literature will be investigated. In addition, the dataset is open to additions and suggestions by users, which can be communicated directly in the Google sheet version of the dataset. The TSV version of the dataset will be versioned to allow comparability and tracking.

Finally, in this work, we have limited the scope to NLP tasks, including cross-domain tasks such as visual question-answering. Other high-impact clinical application domains of AI include computer vision tasks, such as the classification of radiology or pathology images, and video-based tasks related to robotic and laparoscopic surgery. Analyses of benchmark datasets for these and other clinically relevant AI focus areas could be the subject of future research.

AI benchmarks of direct clinical relevance are scarce and fail to cover many work activities that clinicians most want to see addressed. Tasks associated with routine documentation and patient data administration workflows are seldom represented despite significant associated workloads. Investing in the creation of high-quality, representative benchmarks for clinical tasks will have a significant positive long-term impact on the utility of AI in clinical practice. Ideally, this should be addressed by allocating more funding to the development of such benchmarks, since their creation can be costly and is currently poorly incentivized.

The dataset is released in two formats: as a Google sheet 13 and as a versioned TSV file at Zenodo.

The authors declare no conflicts of interest.
References

Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs
Overview of the VQA-Med Task at ImageCLEF 2020: Visual Question Answering and Generation in the Medical Domain
SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Retrieval. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes
Hybrid machine learning architecture for automated detection and grading of retinal images for diabetic retinopathy
Interpretable deep learning to map diagnostic texts to ICD-10 codes
An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes
A comparison of deep learning methods for ICD coding of clinical records
BioASQ at CLEF2022: The Tenth Edition of the Large-scale Biomedical Semantic Indexing and Question Answering Challenge
Measuring the occupational impact of AI: tasks, cognitive abilities and AI benchmarks
Research community dynamics behind popular AI benchmarks
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
What will it take to fix benchmarking in natural language understanding?
Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT
Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study
A curated, ontology-based, large-scale knowledge graph of artificial intelligence tasks and benchmarks. 2021
EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats
Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources
A curated, ontology-based, large-scale knowledge graph of artificial intelligence tasks and benchmarks
Towards better healthcare: What could and should be automated?
A living catalogue of artificial intelligence datasets and benchmarks for medical decision making
Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19
FindZebra: a search engine for rare diseases
A Large Corpus for Question Answering on Electronic Medical Records
ShAReCLEF eHealth 2013: Natural Language Processing and Information Retrieval for Clinical Care
Classification of Radiology Reports Using Neural Attention Models
PathVQA: 30000+ Questions for Medical Visual Question Answering
On the automatic generation of medical imaging reports
ImageCLEF 2022: multimedia retrieval in medical, nature, fusion, and internet applications
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
The future of digital health with federated learning
Administrative work consumes one-sixth of U.S. physicians' working hours and lowers their career satisfaction
Medical Practice and Quality Committee of the American College of Physicians. Putting patients first by reducing administrative tasks in health care: A position paper of the American College of Physicians