MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation

Authors: Karargyris, Alexandros; Umeton, Renato; Sheller, Micah J.; Aristizabal, Alejandro; George, Johnu; Bala, Srini; Beutel, Daniel J.; Bittorf, Victor; Chaudhari, Akshay; Chowdhury, Alexander; Coleman, Cody; Desinghu, Bala; Diamos, Gregory; Dutta, Debo; Feddema, Diane; Fursin, Grigori; Guo, Junyi; Huang, Xinyuan; Kanter, David; Kashyap, Satyananda; Lane, Nicholas; Mallick, Indranil; Mascagni, Pietro; Mehta, Virendra; Natarajan, Vivek; Nikolov, Nikola; Padoy, Nicolas; Pekhimenko, Gennady; Reddi, Vijay Janapa; Reina, G Anthony; Ribalta, Pablo; Rosenthal, Jacob; Singh, Abhishek; Thiagarajan, Jayaraman J.; Wuest, Anna; Xenochristou, Maria; Xu, Daguang; Yadav, Poonam; Rosenthal, Michael; Loda, Massimo; Johnson, Jason M.; Mattson, Peter

Date: 2021-09-29

Abstract: Medical AI has tremendous potential to advance healthcare by supporting the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving provider and patient experience. We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data. To meet this need, we are building MedPerf, an open framework for benchmarking machine learning in the medical domain. MedPerf will enable federated evaluation in which models are securely distributed to different facilities for evaluation, thereby empowering healthcare organizations to assess and verify the performance of AI models in an efficient and human-supervised process, while prioritizing privacy. We describe the current challenges healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status, and our roadmap. We call for researchers and organizations to join us in creating the MedPerf open benchmarking platform.

As medical AI has begun to transition from research to clinical care [1][2][3][4], national agencies around the world have started drafting regulatory frameworks to support this new class of interventions. Examples include the US Food and Drug Administration (https://www.fda.gov/medical-devices/digital-health-center-excellence), the European Medicines Agency (https://www.ema.europa.eu/en/about-us/how-we-work/regulatory-science-strategy), and the Central Drugs Standard Control Organisation in India 5. A key point of agreement across these regulatory efforts is a requirement for formal, large-scale validation of medical AI models [6][7][8]. Widespread approval and adoption of medical AI models will thus require expansion and diversification of clinical data sourced from multiple organizations. Furthermore, there are emerging parallels between the approval process for medical AI interventions and the regulatory approval of small molecules or medical devices through clinical trials [9][10][11]. Pioneering research in the medical field and elsewhere 12, 13 has demonstrated that using large and diverse datasets during model training results in more accurate models. Such models are also expected to be more generalizable to other clinical settings.
Other studies have shown that models trained with data from limited and specific clinical settings demonstrate bias toward specific patient populations [14][15][16], and such data biases can lead to models that appear promising during development but have lower performance in wider deployment 17, 18. A given static model may be susceptible to distribution shifts in the model's input, the model's target, or both 19. For example, input distribution shifts may occur when an algorithm is evaluated on a population different from the one on which it was trained, when local demographics or disease prevalence change, or as a result of software or hardware upgrades of the medical imaging equipment used for data acquisition. Similarly, distribution shifts may also arise from variations in geographic insurance reimbursement and medical procedure trends, or from new annotation or labeling guidelines. These issues, which are often intertwined and frequently result in performance degradation (see the short simulation sketch below), can also hinder trust and acceptance of AI among healthcare stakeholders, including clinicians, patients, insurers, and regulators.

We believe a new approach to leveraging diverse data can deliver consistent clinical and business value to healthcare data owners, while creating adoption incentives through lower implementation cost and lower deployment risk 6. Such an approach should allow collaborative model training and evaluation on large, multi-institutional, representative datasets while complying with privacy and regulatory requirements. However, the degree to which these requirements can be met during collaborative training is still an open research question 43. Here we present MedPerf, an approach focused on broader data access during model evaluation, which we believe will best support model generalization and improve clinician and patient confidence.

MedPerf was built upon the group's experience leading and disseminating efforts such as (i) the development of standardized benchmarking platforms (e.g., MLPerf for benchmarking machine learning training 20 and inference 21 across industries in a pre-competitive space; https://mlcommons.org/#MLPerf); (ii) the implementation of federated learning software frameworks (e.g., NVIDIA CLARA, Intel OpenFL 22, and Flower by Adap/University of Cambridge); (iii) the ideation and coordination of federated medical challenges across dozens of clinical sites and research institutes (e.g., BraTS 23 and FeTS 24); and (iv) other prominent medical AI and machine learning efforts spanning multiple countries and healthcare specialties (e.g., oncology 25 and COVID-19 26). MedPerf should also illuminate cases where better models are needed, increase adoption of existing generalizable models, and incentivize further model development, data annotation, curation, and data access, while preserving patient privacy.

The development of this approach requires (a) consistent and rigorous methodologies to evaluate the performance of AI models for real-world use in a standardized manner, (b) a technical approach that enables measuring model generalizability across institutions while maintaining data privacy and respecting model intellectual property, and (c) a community of expert groups to employ the evaluation methodology and the technical approach to define and operate mature performance benchmarks. MedPerf aims to address these goals.
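To make the effect of such shifts concrete, the following toy simulation (synthetic data only; the prevalence values, feature offset, and site names are arbitrary assumptions, and none of this is MedPerf code) trains a single classifier at a "development" site and then evaluates the same frozen model at a "deployment" site with higher disease prevalence and a systematic acquisition offset.

```python
# Toy illustration of input/prevalence shift for a fixed model (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score

rng = np.random.default_rng(0)

def make_site(n, prevalence, feature_shift):
    """Simulate one site's cohort: a single biomarker feature plus a binary label."""
    y = rng.binomial(1, prevalence, size=n)
    # Positive cases have a higher biomarker mean; the whole site is offset by
    # `feature_shift` (e.g., a scanner or protocol difference).
    x = rng.normal(loc=1.5 * y + feature_shift, scale=1.0, size=n).reshape(-1, 1)
    return x, y

x_dev, y_dev = make_site(5000, prevalence=0.10, feature_shift=0.0)   # development site
x_new, y_new = make_site(5000, prevalence=0.30, feature_shift=0.8)   # deployment site

model = LogisticRegression().fit(x_dev, y_dev)

for name, x, y in [("development", x_dev, y_dev), ("deployment", x_new, y_new)]:
    scores = model.predict_proba(x)[:, 1]
    preds = (scores >= 0.5).astype(int)
    print(f"{name}: AUROC={roc_auc_score(y, scores):.3f}  "
          f"sensitivity={recall_score(y, preds):.3f}  "
          f"specificity={recall_score(y, preds, pos_label=0):.3f}")
```

In this toy setting the ranking-based AUROC barely moves, but the fixed decision threshold behaves very differently at the deployment site (specificity drops while apparent sensitivity rises), which is exactly the kind of silent change in operating characteristics that multi-site evaluation is intended to surface.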
MedPerf is an open-source framework designed to develop and support benchmark reference implementations, respect data privacy, and support model evaluation through the formal creation of benchmarking working groups. MedPerf provides the opportunity to set standards, best practices, and benchmarks for medical AI in a pre-competitive space. The current list of contributors includes representatives of 18 companies, 13 universities, 6 hospitals, and 10 countries.

In this section, we discuss challenges to wider data access for AI training and evaluation in healthcare. Convincing data owners to broaden data access is hindered by substantial regulatory, legal, and public perception risks, high up-front costs, and uncertain financial return on investment.

Sharing patient data presents three major classes of risk: liability, regulatory, and public perception. Sharing patient data can expose providers to liability risk in multiple ways; for example, shared data could be stolen or misused in a manner damaging to patients (e.g., to discriminate against patients with certain conditions). Patient data are protected by complex regulations, such as HIPAA in the United States and GDPR in Europe, that carry substantial penalties for violators. The perception of risk is also heightened because AI is a relatively new paradigm in which the application of existing regulations can be unclear. Lastly, even if data are shared legally and used beneficially, people naturally value privacy, and sharing data without explicit consent could lead to negative public perception 27.

Sharing data also requires up-front investment to turn raw data into a useful resource for AI. This transformation involves multiple steps:

1. Data collection: cohorts need to be identified and the corresponding data need to be made accessible.
2. Transformation: once accessible, data must be reformatted to a standardized representation for each data type (e.g., DICOM 28 for medical images) suitable for subsequent steps.
3. Anonymization: data are anonymized by removing identifying information and/or filtering to comply with statistical and regulatory requirements (e.g., k-anonymity 29); a sketch of this kind of check appears after this list.
4. Labeling: for many AI tasks, data must be labeled (i.e., annotated) according to the task (e.g., brain tumor segmentation). To ensure quality and performance, labeling should be consistent across institutions. This step is expensive, highly human-dependent, and error-prone, while carrying additional costs related to annotation correction, versioning, and dataset maintenance 30.
5. Review: data need to be reviewed for regulatory, legal, and policy compliance, and patients or patient groups need to be consulted about the design and perception of the use case.
6. Licensing: data must be licensed in a manner that fulfills business and/or scientific interests while complying with existing regulations.
7. Sharing: data must be physically shared with licensees through complex legal agreements, which may require secure transmission of large data volumes or the creation of specially designed data enclaves.

Navigating these steps can be costly. The technical part of the process is also complex, requiring a mix of medical, artificial intelligence, and software engineering skills. There are multiple opportunities for error that may not be revealed until downstream consequences emerge, necessitating careful validation at each step, sometimes with multiple iterations 31.
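As a concrete illustration of the anonymization step above, the short sketch below computes the k in k-anonymity for a small set of records: the size of the smallest group of records sharing the same combination of quasi-identifiers. This is a simplified, hypothetical check rather than MedPerf functionality, and the field names (age_band, zip3, sex) are made up for the example.

```python
# Simplified k-anonymity check: every quasi-identifier combination should occur >= k times.
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size over all quasi-identifier combinations."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age_band": "60-69", "zip3": "021", "sex": "F", "diagnosis": "NSCLC"},
    {"age_band": "60-69", "zip3": "021", "sex": "F", "diagnosis": "SCLC"},
    {"age_band": "50-59", "zip3": "100", "sex": "M", "diagnosis": "NSCLC"},
]

k = k_anonymity(records, ["age_band", "zip3", "sex"])
print(f"dataset currently satisfies {k}-anonymity")  # prints 1, so further generalization is needed
```

A release policy would compare this value against the required k and generalize or suppress records until the threshold is met.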
Even if a data owner (e.g., a hospital) is willing to pay these costs and mitigate these risks, the benefits can be unclear for financial or technical reasons. For example, if the development of an AI-based solution is driven by the AI model builder rather than the data provider, the AI provider may see a greater share of the eventual benefits than the data owner, even though the data owner may incur a greater share of the risk. From a technical perspective, it can be difficult to prove a model's performance prior to deployment.

Current medical AI community challenge efforts (e.g., FeTS 24, CheXpert 32, BraTS 33, NLST 34, CHAOS 35, fastMRI 36) have been invaluable for advancing research but lack the scope to serve as real-world evaluation mechanisms in clinical settings. These challenges typically focus on a single dataset and task and thus do not reflect the diversity (e.g., multi-modal and multi-institutional data) and complexity (e.g., different clinical and technical workflows) of real-world use cases. Model training and evaluation on non-diverse datasets carries an increased risk of overfitting and the chance that even top-performing models will not generalize in real-world use cases, where clinical data reside in multi-institutional, geographically distributed organizations with significant differences across domains (i.e., domain shifts) 14.

Our goal is to increase the clinical impact of AI by leveraging more data across multiple facilities to address the challenges described above. We are developing an open benchmarking platform that combines a lower-risk, evaluation-focused approach requiring no data sharing with the appropriate infrastructure, technical support, and organizational coordination. This approach can reduce the risk and cost associated with data sharing while increasing the likelihood of business and medical benefits provided by AI solutions. MedPerf should lead to wider adoption, more efficacious and cost-effective clinical practice, and improved patient outcomes.

Our technical approach uses federated evaluation, a reduced-risk form of federated learning. At its core, federated evaluation aims to make sharing models with multiple data owners easy and reliable, to evaluate those models against data owners' data in controlled settings, and to aggregate and analyze the resulting evaluation metrics. Importantly, by limiting the goal to model evaluation, and by aggregating only evaluation metrics, federated evaluation poses significantly lower risk to patient privacy than collaborative model training, while also minimizing the risk 37, 38 of intellectual property theft and data misuse.

More specifically, our open platform for federated evaluation will provide a common, open-source infrastructure for defining medical AI benchmarks and coordinating federated evaluation of models against such benchmarks. We are building the infrastructure with best practices to help align AI model owners/developers with data owners, through an active community with a neutral organization at its core. We intend for our approach to be compatible with, and to build upon, existing federated learning frameworks, rather than to compete with them. Furthermore, as detailed below, we introduce steps that give data owners control over which algorithms run on their data and allow them to confirm benchmarking results.
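The sketch below illustrates the core federated evaluation loop described above under simplifying assumptions: each data owner computes the agreed-upon metrics locally, and the platform receives and aggregates only those metrics, never the underlying labels or predictions. The function names and the case-weighted aggregation rule are illustrative choices, not MedPerf's actual implementation; in practice the metric set and aggregation policy are defined by each benchmark group.

```python
# Minimal federated-evaluation sketch: only metrics leave each site.

def local_evaluation(y_true, y_pred):
    """Runs at the data owner's site; raw labels and predictions never leave it."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return {
        "n": len(y_true),
        "sensitivity": tp / max(tp + fn, 1),
        "specificity": tn / max(tn + fp, 1),
    }

def aggregate(site_metrics):
    """Runs at the benchmarking platform; it only ever sees the metric dictionaries.
    Here: a simple case-weighted average. The real aggregation policy would be
    whatever the benchmark group specifies."""
    total = sum(m["n"] for m in site_metrics)
    return {
        key: sum(m[key] * m["n"] for m in site_metrics) / total
        for key in ("sensitivity", "specificity")
    }

# Two hypothetical sites evaluating the same model locally.
site_a = local_evaluation(y_true=[1, 1, 0, 0, 0], y_pred=[1, 0, 0, 0, 1])
site_b = local_evaluation(y_true=[1, 0, 0, 1], y_pred=[1, 0, 1, 1])
print(aggregate([site_a, site_b]))
```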
MedPerf addresses regulatory, liability, and public perception risks using a three-pronged approach.

First, because the initial focus is on model evaluation rather than training, our federated evaluation approach maximizes value without data leaving the possession of data owners, either directly or through accidental leakage in results. We only need data owners to share agreed-upon evaluation metrics (e.g., specificity), which are aggregated across participating institutions before dissemination. This mitigates most regulatory, public perception, and legal risk.

Second, MedPerf retains human evaluators 39 as a critical part of the process: the MedPerf client software requires a data owner's system administrator to approve all model evaluations and result uploads, and it automatically records transactions to support auditing. Moreover, to protect against malicious or erroneous implementations, MedPerf requires that (a) all novel code has no network access and restricted local file-system access, (b) evaluation algorithm implementations are well vetted and common among benchmarks, and (c) all output (i.e., statistics) must be explicitly approved by data owners before it is uploaded to the MedPerf platform.

Third, we leverage social trust: we enable benchmarks to be specified, developed, and deployed publicly or within closed groups, such as provider networks with pre-existing trusted relationships and business and legal contracts, and these closed-group benchmarks will be visible only to their members.

MedPerf also reduces the up-front cost of participation. First, we package models and benchmark components in standard software containers, such as Docker and Singularity, to offer a simple and consistent file system-based interface for other infrastructure to train or make inferences using AI models (e.g., for testing harnesses or federated learning). Additionally, deployment tools like Docker and Singularity enable hospital information technology groups to evaluate AI model code for security concerns using common methods and tools.

Second, we are developing an open-source hub for medical AI benchmarks and a consistent methodology for benchmarking. The hub will offer coordination among benchmark groups, model developers, and data owners by providing a central model and data registry and by storing results, but it will not directly handle proprietary models or data, ensuring that these assets remain in the hands of their owners. Instead, model and data owners will register hashes to enable checking the integrity of their assets without exposing them to the platform. This method will ensure that benchmark results can be compared to better establish promising technical approaches.

Our approach decreases the uncertainty of deploying AI models by enabling easy evaluation against data held by multiple data owners. We enable model developers to indirectly interact with data owners' datasets and thus tap into a large, virtual test set. In doing so, we increase the size of the test set and thereby reduce the uncertainty of the evaluation, even if all the data are from a single provider. More importantly, by enabling evaluation against data from multiple providers, we can more effectively evaluate how a model will perform when deployed at different facilities with diverse patient populations. And by providing multi-site performance feedback to model developers, we increase the odds of successful model deployment. Ultimately, demonstrating that broad evaluation via federated evaluation is correlated with clinical efficacy will further improve clinician and patient confidence and motivate additional data owners to participate. We believe widespread adoption of federated evaluation will also spur wider adoption of federated learning.
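Two of the safeguards described above, checking a downloaded asset against its registered hash and requiring explicit data-owner approval before execution and before upload, can be sketched as follows. This is a hypothetical fragment rather than MedPerf's actual client code; the function names, prompts, and placeholder metrics are assumptions.

```python
# Sketch of hash-based integrity checking plus human-in-the-loop approval gates.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large model archives are handled safely."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def operator_approves(prompt: str) -> bool:
    """The data owner's administrator must explicitly answer 'yes'."""
    return input(f"{prompt} [yes/no]: ").strip().lower() == "yes"

def run_benchmark(model_path: Path, registered_hash: str) -> None:
    if sha256_of(model_path) != registered_hash:
        raise RuntimeError("Model does not match the hash registered on the platform")
    if not operator_approves(f"Run model {model_path.name} on the local dataset?"):
        return
    results = {"sensitivity": 0.91, "specificity": 0.87}  # placeholder for the local evaluation step
    if operator_approves(f"Upload these metrics (and nothing else)? {results}"):
        print("metrics uploaded")  # placeholder for the actual upload call
```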
Federated learning (FL) is a promising technology for developing AI models by leveraging data from multiple institutions without directly sharing data [40][41][42]. While FL enables model training without data sharing, data may leak through the model parameters themselves, requiring additional mitigations [43][44][45]. Research and development of these mitigations is ongoing, which slows adoption of the technology. We believe that federated evaluation provides concrete benefits today while building industry familiarity with the technology needed for full FL.

In this section, we describe the structure and functionality of an open benchmarking platform for medical AI. We define a MedPerf benchmark in this context, discuss the user roles required to successfully operate a benchmark, and provide an overview of the operating workflow.

For the purposes of our platform, a benchmark is a bundle of assets that enables quantitative measurement of the performance of AI models for a specific clinical problem. A benchmark consists of the following major components:

1. Specifications: a precise definition of the clinical setting (e.g., problem or task and specific patient population) on which trained AI models are to be evaluated, the labeling methodology, and the specific evaluation metrics.
2. Dataset Preparation: a component that prepares each site's data for evaluation and also tests the prepared datasets for quality and compatibility.
3. Evaluation: a component that computes the benchmark's specified metrics for a model on a prepared dataset.
4. Registered Datasets: datasets that meet the benchmark criteria and are approved for evaluation use by their owners, e.g., patient data from multiple facilities representing (as a whole) a diverse patient population.
5. Reference Implementation: an example of a benchmark submission consisting of example model code, the evaluation component above, and publicly available de-identified or synthetic sample data.

Our platform uses the MLCube container format for components such as Dataset Preparation, Evaluation, and the Registered Models. MLCube containers are software containers (e.g., Docker and Singularity) with standard metadata and a consistent file-system level interface. By using MLCube, the infrastructure software can easily interact with models implemented using different approaches and/or frameworks and running on different hardware platforms, and can leverage common software tools for validating proper, secure implementation practices (e.g., CIS Docker Benchmarks).

We have identified four primary roles in operating an open benchmark platform, outlined in Table 1. In many cases, a single organization may participate in multiple roles, and multiple organizations may share any given role. Beyond these roles, the long-term success of medical AI benchmarking requires organizations that create and adopt appropriate community standards for interoperability, such as Vendor Neutral Archives (VNA) 46.

Our open benchmarking platform uses the workflow depicted in Figure 1. To start, a benchmark group registers the benchmark with the benchmarking platform (1) and then recruits data owners (2) and model owners (3). The benchmarking platform sends model evaluation requests to the data owners, who approve and execute the evaluations, successively vetting and then pushing results to the benchmarking platform (4). The benchmarking platform shares the results with participants based on a policy specified by the benchmark group (5). Table 2 provides further details about each workflow step.

Ultimately, we aim to deliver an open platform that enables groups of researchers and developers to use federated evaluation to provide high-confidence evidence of generalized model performance to regulators, health care providers, and patients.
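The consistent file-system level interface that the container components rely on can be illustrated with a minimal evaluation task: it reads labels and predictions from input paths, writes metrics to an output directory, and is driven entirely through command-line path arguments. This is only a sketch of the idea; it is not the MLCube specification or MedPerf's evaluation component, and the file formats and flag names are assumptions.

```python
# Illustrative evaluation task exposing a file-system interface (inputs in, metrics out).
import argparse
import json
from pathlib import Path

def evaluate(labels_path: Path, predictions_path: Path, output_path: Path) -> None:
    labels = json.loads(labels_path.read_text())            # e.g., {"case1": 1, "case2": 0, ...}
    predictions = json.loads(predictions_path.read_text())  # e.g., {"case1": 1, "case2": 1, ...}
    correct = sum(1 for case, y in labels.items() if predictions.get(case) == y)
    metrics = {"n": len(labels), "accuracy": correct / max(len(labels), 1)}
    output_path.mkdir(parents=True, exist_ok=True)
    (output_path / "results.json").write_text(json.dumps(metrics, indent=2))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Evaluation task with a file-system interface")
    parser.add_argument("--labels", type=Path, required=True)
    parser.add_argument("--predictions", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args()
    evaluate(args.labels, args.predictions, args.output)
```

Because every task exposes the same pattern of input and output paths, the platform (or a hospital IT review) can run, inspect, and sandbox each component the same way regardless of the framework used inside the container.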
In Table 3, we review the necessary steps, the scope of each step, and our current progress towards developing this open benchmarking platform.

Our effort is inspired by several classes of related work. First, we adopt a federated approach to data, focusing initially on evaluation to lower the barriers to adoption. Second, we adopt the standardized measurement approach to medical AI from organizations such as RSNA (https://www.rsna.org), SIIM (https://siim.org), and Kaggle (https://www.kaggle.com), and we generalize these efforts into a standard platform that can be applied to many problems rather than focusing on a specific one. Third, we leverage the open, community-driven approach to benchmark development successfully employed to accelerate hardware development through efforts such as MLPerf (https://mlcommons.org) and SPEC (https://www.spec.org/benchmarks.html), and apply it to the medical domain. Lastly, we push towards creating shared best practices for AI as inspired by efforts like MLflow (https://mlflow.org) and Kubeflow (https://www.kubeflow.org) for AI operations, as well as MONAI (https://monai.io) and GaNDLF (https://cbica.github.io/GaNDLF/) for medical models.

Our initial goal is to provide medical AI researchers with reproducible benchmarks based on diverse patient populations to assist healthcare algorithm development. We believe such benchmarks will increase development interest and solution quality, leading to patient benefit and growing adoption. Furthermore, our platform will help advance research related to, but not limited to, data utility, robustness to noisy annotations, and understanding of model failures. If a critical mass of AI researchers adopts these benchmarks, healthcare decision makers will see substantial benefits from aligning with this effort to increase benefit for their patient populations. Ultimately, standardizing best practices and evaluation methods will lead to highly accurate models that are acceptable to regulatory agencies and clinical experts, and will create momentum within patient advocacy groups.

By bringing together these diverse groups, starting with AI researchers and healthcare organizations, and by building trust with clinicians, regulatory authorities, and patient advocacy groups, we envision accelerating the adoption of AI in healthcare and increasing clinical benefits to patients and providers worldwide. However, we cannot achieve these benefits without the help of the technical and medical community. We call for:

• Healthcare stakeholders to form the benchmark groups that define benchmark specifications and oversee the analyses of their results.
• AI researchers to test this end-to-end platform and use it to create and validate their own models across multiple institutions.
• Data owners (e.g., healthcare organizations) to register their data in the platform, again while never sharing the data itself.

(Figure 1; see Table 2 for details of all workflow steps, 1 through 5.)
4. Execute evaluations:
• Once the benchmark, dataset, and models are registered with the benchmarking platform, the platform notifies the data owners that models are available for benchmarking.
• The data owner runs a benchmarking client that downloads the available models, reviews and approves the models for safety, and then approves execution.
• Once execution completes, the data owner reviews and approves upload of the results to the benchmarking platform.
5. Release results:
• Benchmark results are aggregated by the benchmarking platform and shared per the policy specified by the benchmark group.

References
The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database
Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (CADe-DB trial): a double-blind randomised study
Artificial intelligence-enabled electrocardiograms for identification of patients with low ejection fraction: a pragmatic, randomized clinical trial
A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy
Regulating AI in Public Health: Systems Challenges and Perspectives
How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals
Continual learning in medical devices: FDA's action plan and beyond
Artificial intelligence for clinical oncology
Artificial intelligence in oncology: Path to implementation
Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed
Bringing AI to the clinic: blueprint for a vendor-neutral AI deployment infrastructure
Language Models are Few-Shot Learners. arXiv
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
Geographic distribution of US cohorts used to train deep learning algorithms
The "inconvenient truth" about AI in healthcare
Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition
Dissecting Racial Bias in an Algorithm used to Manage the Health of Populations
Causality matters in medical imaging
An open-source framework for Federated Learning. arXiv
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the
The Federated Tumor Segmentation (FeTS) Challenge. arXiv
Pancreatic cancer risk predicted from disease trajectories using deep learning
Federated learning for predicting clinical outcomes in patients with COVID-19
Ethics of using and sharing clinical imaging data for artificial intelligence: A proposed framework
Introduction to the DICOM standard
A Model for Protecting Privacy
Preparing medical imaging data for machine learning
"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
The multimodal brain tumor image segmentation benchmark (BRATS)
Lung cancer screening with low-dose helical CT: results from the National Lung Screening Trial (NLST)
CT-MR) healthy abdominal organ segmentation
An Open Dataset and Benchmarks for Accelerated MRI. arXiv
Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning
End-to-end privacy preserving deep learning on multi-institutional medical imaging
Interactive machine learning for health informatics: when do we need the human-in-the-loop?
The future of digital health with federated learning
Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data
Secure, privacy-preserving and federated machine learning in medical imaging
Advances and open problems in federated learning
Shuffled Model of Federated Learning: Privacy, Communication and Accuracy Trade-offs. arXiv
Privacy-first health research with federated learning. medRxiv (2020)
Implementation and Benefits of a Vendor-Neutral Archive and Enterprise-Imaging Management System in an Integrated Delivery Network
Twenty Years of Digital Pathology: An Overview of the Road Travelled, What is on the Horizon, and the Emergence of Vendor-Neutral Archives
Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers
PRISSMM Data Model
Using HL7 FHIR to achieve interoperability in patient health record

Author contributions: NN: contribution to concept design, revising the work for intellectual content. NP: contribution to concept design, revising the work for intellectual content. PeM: overall work coordination and supervision, contribution to concept design, revising the work for intellectual content, substantial editorial work. PiM: contribution to concept design, revising the work for intellectual content. PR: contribution to concept design, revising the work for intellectual content. PY: contribution to concept design, revising the work for intellectual content. RU: co-first author, implementation supervision, contribution to concept design, revising the work for intellectual content.

The authors declare that there are no competing interests.