key: cord-0784069-v97wr09g
authors: Shelmerdine, Susan Cheng; Arthurs, Owen J; Denniston, Alastair; Sebire, Neil J
title: Review of study reporting guidelines for clinical studies using artificial intelligence in healthcare
date: 2021-08-23
journal: BMJ Health Care Inform
DOI: 10.1136/bmjhci-2021-100385
sha: 6f7363c98120f4472c186a46b28d72d886fc8cec
doc_id: 784069
cord_uid: v97wr09g

High-quality research is essential in guiding evidence-based care, and should be reported in a way that is reproducible, transparent and, where appropriate, provides sufficient detail for inclusion in future meta-analyses. Reporting guidelines for various study designs have been widely used for clinical (and preclinical) studies, consisting of checklists with a minimum set of points for inclusion. With the recent rise in the volume of research using artificial intelligence (AI), additional factors need to be evaluated which do not neatly conform to traditional reporting guidelines (eg, details relating to technical algorithm development). In this review, reporting guidelines are highlighted to promote awareness of the essential content required for studies evaluating AI interventions in healthcare. These include published and in-progress extensions to well-known reporting guidelines such as Standard Protocol Items: Recommendations for Interventional Trials-AI (study protocols), Consolidated Standards of Reporting Trials-AI (randomised controlled trials), Standards for Reporting of Diagnostic Accuracy Studies-AI (diagnostic accuracy studies) and Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis-AI (prediction model studies). Additionally, there are a number of guidelines that consider AI for health interventions more generally (eg, Checklist for Artificial Intelligence in Medical Imaging (CLAIM), minimum information (MI)-CLAIM, MI for Medical AI Reporting) or that address a specific element such as the 'learning curve' (Developmental and Exploratory Clinical Investigation of Decision-AI). Economic evaluation of AI health interventions is not currently addressed, and may benefit from extension to an existing guideline. In the face of a rapid influx of studies of AI health interventions, reporting guidelines help ensure that investigators and those appraising studies consider the well-recognised elements of good study design and reporting, while also adequately addressing the new challenges posed by AI-specific elements.

Recent, rapid developments in computational technologies and increased volumes of digital data for analysis have resulted in unprecedented growth in research activities relating to artificial intelligence (AI), particularly within healthcare. This volume of work has even led to several high-impact journals launching their own subjournals within the 'AI healthcare' field (eg, Nature Machine Intelligence, 1 Lancet Digital Health 2 and Radiology: Artificial Intelligence 3). High-quality research should be accompanied by transparency, reproducibility and validity of techniques to allow adequate evaluation and translation into clinical practice. Standardised reporting guidelines help researchers define the key components of their study, ensuring that relevant information is provided in the final publication. 4
Studies pertaining to algorithm development and clinical application of AI, however, have brought unique challenges and added complexities in how such studies are reported, assessed and compared, relating to elements that are not conventionally prespecified in traditional reporting guidelines. This could lead to missing information and a high risk of hidden bias. If these actual or potential limitations are not identified, publication may confer tacit approval, which in turn may support premature adoption of new technologies. 5 6 Conversely, well-designed, well-delivered studies that are poorly reported may be judged unfavourably as having a high risk of bias, simply because of a lack of information.

Inadequacies in the reporting of AI clinical studies are increasingly well recognised. In 2019, a systematic review by Liu et al 7 screened over 20 500 articles but found that fewer than 1% of these were sufficiently robust in their design and reporting to allow independent reviewers to have confidence in their claims. Similarly, Nagendran et al 8 identified high levels of bias in the field. In another study, 9 it was reported that only 6% of over 500 eligible radiological AI research publications performed any external validation of their models, and none used multicentre or prospective data collection. Likewise, most studies using machine learning (ML) models for medical diagnosis 10 did not describe in adequate detail how the models were evaluated, nor provide sufficient detail for them to be reproduced. Inconsistencies have also been reported in how ML models built from electronic health records are described, with details regarding race and ethnicity of participants omitted in 64% of studies, and only 12% of models being externally validated. 11

In order to address these concerns, adapted research reporting guidelines based on the well-established EQUATOR Network (Enhancing the QUAlity and Transparency Of health Research) 12 13 and de novo recommendations by individual societies have been published, with greater relevance for AI research. In this review, we highlight those that will cover the majority of healthcare-focused AI-related studies, and explain how they differ from the well-known guidance for non-AI clinical work. Our intention is to raise awareness of how such studies should be structured, thereby improving the quality of future submissions and providing a helpful aid for researchers, peer reviewers and editors.

In compiling a detailed, yet relevant, list of study guidelines, we reviewed the EQUATOR Network 13 website for guidelines containing the terms AI, ML or deep learning. A separate search was also conducted using the Medline, Scopus and Google Scholar databases for publications using the same search terms with the addition of 'reporting guideline', 'checklist' or 'template'. Opinion pieces were excluded. Articles were included where a description of the recommendations was provided and had been published at the time of the search (March 2021).

An ideal reporting guideline should be a clear, structured tool with a minimum list of key information to include within a published scientific manuscript. The EQUATOR Network 13 is the international 'standard bearer' for reporting guidelines, committed to improving 'the reliability and value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines'.
Since the landmark publication of the Consolidated Standards of Reporting Trials (CONSORT) statement, 14 the network has overseen the development and publication of a number of guidelines that address other types of study design (eg, diagnostic accuracy studies). The EQUATOR guidelines are centrally registered (available via a core library), which ensures adherence to a robust methodology of development and avoids redundancy from parallel initiatives addressing the same issue. Importantly, these guidelines are not specific to a medical specialty but are focused on the type of study, which helps ensure a consistent approach and quality when addressing the same study design. It is recognised that certain specific scenarios may require specific extensions to these guidelines. For example, the increasing recognition of the importance of patient-reported outcomes (PROs) has led to the development of the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT)-PRO 15 and CONSORT-PRO 16 extensions. In a similar way, the specific attributes of AI as an intervention have led to a number of AI extensions, both published and in progress, which build on the robust methodology of the original EQUATOR guidelines while ensuring AI-specific elements are also addressed.

In parallel to the work of the EQUATOR Network, a number of experts and institutions have developed their own recommendations for good practice and reporting. In contrast to the EQUATOR guidelines, these start with the intervention (ie, AI) rather than the study type (eg, randomised controlled trial (RCT)); they therefore cover essentially the same territory but vary in depth, and there can be differences in nuance depending on their primary purpose. For example, some have originated from the need to support reviewers and editorial staff ('is this complete and is it good enough?'), whereas others are aimed at building a shared understanding of appropriate design and delivery ('this is what good looks like'). Given the number of different reporting guidelines in this area, there is value in setting them in context to help support users in understanding which is most appropriate for a particular setting (table 1).

Ultimately, the most important elements of a high-quality study are contained within the methodology of the study design itself and not within the intervention. It is these elements that help minimise the major biases that all studies must address. In line with leading journals, we would therefore recommend starting with the guideline that addresses the particular study design (eg, CONSORT 14 for an RCT). If an AI extension already exists for that study type, then it is clearly appropriate for that study (eg, CONSORT-AI). 17-19 If no such AI extension exists, we recommend using the appropriate EQUATOR guideline (eg, Standards for Reporting of Diagnostic Accuracy Studies (STARD) 20 for diagnostic accuracy studies), supplemented with AI-specific elements recommended in other guidelines (eg, SPIRIT-AI, 21-23 CONSORT-AI 17-19 or the non-EQUATOR guidelines described below). Indeed, all the guidelines considered here contain valuable insights into the specific challenges of AI studies, and are recommended reading on good practice for design and reporting.
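To make the selection logic above concrete, the short sketch below encodes it as a simple lookup: start from the base EQUATOR guideline for the study design, then prefer its AI extension where one has been published. This is purely illustrative; the mapping, function and names are hypothetical and are not taken from any of the guidelines themselves.

```python
# Illustrative sketch only: map a study design to a base reporting guideline
# and, where one exists, its published AI-specific extension.
GUIDELINES = {
    "trial_protocol":      {"base": "SPIRIT 2013",  "ai_extension": "SPIRIT-AI"},
    "randomised_trial":    {"base": "CONSORT 2010", "ai_extension": "CONSORT-AI"},
    "diagnostic_accuracy": {"base": "STARD 2015",   "ai_extension": None},  # STARD-AI pending at time of writing
    "prediction_model":    {"base": "TRIPOD 2015",  "ai_extension": None},  # TRIPOD-AI pending at time of writing
}

def recommend_guideline(study_design: str) -> str:
    """Return a reporting-guideline recommendation for an AI health study."""
    entry = GUIDELINES.get(study_design)
    if entry is None:
        return "No design-specific guideline mapped; consult the EQUATOR library."
    if entry["ai_extension"]:
        return f"Use {entry['ai_extension']} (extension of {entry['base']})."
    return (f"Use {entry['base']} and supplement with AI-specific items "
            f"(eg, from SPIRIT-AI, CONSORT-AI or CLAIM).")

if __name__ == "__main__":
    print(recommend_guideline("diagnostic_accuracy"))
```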
The quality of a study, and the trustworthiness of its findings, start at the design phase. The study protocol should contain all elements of the study design, sufficient for independent groups to carry out the study and expect replicability. Prepublication of the study protocol helps avoid biases such as post hoc assignment of the primary outcome, in which the trialist can 'cherry pick' whichever of a number of outcomes points in the desired direction. Guidance on recommended items to include in a trial protocol is provided by the SPIRIT statement (latest version published in 2013), 24 which has recently been adapted for trials with an AI-related focus, termed the 'SPIRIT-AI' guideline. 21-23 This adaptation includes an additional 15 items (12 extensions, 3 elaborations) beyond the existing 33-item SPIRIT 2013 guideline. The key differences are outlined in table 2 and are mostly focused on the methodology of the trial (accounting for eight extensions and one elaboration), with emphasis on inclusion/exclusion of data and participants, dealing with poor quality data, and how the AI intervention will be applied to, and benefit, clinical practice.

Table 2 Key SPIRIT-AI additions to SPIRIT 2013 protocol items (selected)
- Eligibility criteria: state the inclusion and exclusion criteria at the level of the input data.
- Interventions for each group, with sufficient detail to allow replication, including how and when they will be administered (extensions): state which version of the AI algorithm will be used; specify the procedure for acquiring and selecting the input data for the AI intervention; specify the procedure for assessing and handling poor quality or unavailable input data; specify whether there is human-AI interaction in the handling of the input data, and what level of expertise is required of users; specify the output of the AI intervention; explain the procedure for how the AI intervention's output will contribute to decision making or other elements of clinical practice.
- Plans for collecting, assessing, reporting and managing solicited and spontaneously reported adverse events and other unintended effects of trial interventions or trial conduct (extension): specify any plans to identify and analyse performance errors; if there are no plans for this, justify why not.
- Access to data (SPIRIT item 29: statement of who will have access to the final trial dataset and disclosure of contractual agreements that limit such access for investigators) (extension): state whether and how the AI intervention and/or its code can be accessed, including any restrictions to access or reuse.

While most AI studies are currently at early-phase validation stages, those evaluating the use of AI interventions in real-world settings are fast emerging and will become increasingly important, since these are required to demonstrate real-world clinical benefit.

RCTs are the exemplar study design for providing a robust evidence base for the efficacy and safety of a given intervention, with the CONSORT statement (2010 version) 14 providing a 25-item checklist of the minimum reporting content for such studies. An adapted version, entitled the 'CONSORT-AI' extension, 17-19 was published in September 2020 for AI intervention studies. This includes an additional 14 items (11 extensions, 3 elaborations) beyond the existing CONSORT 2010 statement, the majority of which (8 extensions, 1 elaboration) relate to the study participants and details of the AI intervention being evaluated, and are similar to the additions already described for the SPIRIT-AI extension. Specific key differences in the new guideline are outlined in table 3.

Table 3 Key CONSORT-AI additions to CONSORT 2010 items (selected)
- Explain the intended use of the AI intervention in the context of the clinical pathway, including its purpose and its intended users (eg, healthcare professionals, patients, public).
- Eligibility criteria for participants (elaboration and extension): state the inclusion and exclusion criteria at the level of participants; state the inclusion and exclusion criteria at the level of the input data.
- Settings and locations where the data were collected (item 4b, extension): describe how the AI intervention was integrated into the trial setting, including any onsite or offsite requirements.
- The interventions for each group, with sufficient detail to allow replication, including how and when they were actually administered (extensions): state which version of the AI algorithm was used; describe how the input data were acquired and selected for the AI intervention; describe how poor quality or unavailable input data were assessed and handled; specify whether there was human-AI interaction in the handling of the input data, and what level of expertise was required of users; specify the output of the AI intervention; explain how the AI intervention's outputs contributed to decision making or other elements of clinical practice.
- Describe the results of any analysis of performance errors and how errors were identified, where applicable; if no such analysis was planned or done, justify why not.
- State whether and how the AI intervention and/or its code can be accessed, including any restrictions to access or reuse.

Although not specific to AI interventions, some aspects of the Template for Intervention Description and Replication (TIDieR) checklist, 2014, 25 may be a helpful addition when reporting details of the interventional elements of a study (ie, as an extension of item 5 of the CONSORT 2010 statement or item 11 of the SPIRIT 2013 statement). These include details regarding any modifications of the intervention during a study, including how and why certain aspects were personalised or adapted. To the best of our knowledge, there are currently no publicly proposed plans to publish an AI extension to this guideline.

The STARD statement (2015 version) 20 is the most widely accepted reporting standard for diagnostic accuracy studies. A steering group has been established to devise an AI-specific extension to the latest version of the 30-item STARD statement (the STARD-AI extension). 26 At the time of writing, this is undergoing an international consensus survey among leaders in the AI field for suggested adaptations and is pending publication.
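Diagnostic accuracy studies reported along STARD lines typically present estimates such as sensitivity and specificity, derived from a cross tabulation against the reference standard, together with their precision (eg, 95% CIs). As a minimal, illustrative sketch (not prescribed by STARD, STARD-AI or any other guideline discussed here), the following computes these quantities with Wilson score intervals; the 2x2 counts are made up for the example.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a proportion."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return (centre - half, centre + half)

# Hypothetical 2x2 cross tabulation of index test result vs reference standard.
tp, fp, fn, tn = 85, 10, 15, 90

sens, sens_ci = tp / (tp + fn), wilson_ci(tp, tp + fn)
spec, spec_ci = tn / (tn + fp), wilson_ci(tn, tn + fp)

print(f"Sensitivity {sens:.2f} (95% CI {sens_ci[0]:.2f} to {sens_ci[1]:.2f})")
print(f"Specificity {spec:.2f} (95% CI {spec_ci[0]:.2f} to {spec_ci[1]:.2f})")
```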
Extensions to reporting guidelines describing prediction models that use ML have been announced and are anticipated to be published soon. These include an adapted version of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, 2015 version, 27 which will be entitled 'TRIPOD-AI', 28 29 supported by the Prediction model Risk Of Bias Assessment Tool (PROBAST, 2019 version), 30 for which an ML-specific version, proposed to be entitled PROBAST-ML, is also planned. 28 29

Human factors
Another upcoming guideline, focused on the evaluation of 'human factors' in algorithm implementation, has been announced: the DECIDE-AI checklist (Developmental and Exploratory Clinical Investigation of Decision-support systems driven by AI). 31 This checklist is intended for use in early, small-scale clinical trials that evaluate and provide information on how algorithms may be used in practice, bridging the gap between the algorithm development and validation stage (which would follow TRIPOD-AI, STARD-AI or the Checklist for Artificial Intelligence in Medical Imaging (CLAIM)) and large-scale clinical trials of AI interventions (where CONSORT-AI would be used). Publication is anticipated in late 2021 or early 2022.

Given the increasing volume of radiological AI-related research across a growing variety of conditions and clinical settings, it is also likely that we will encounter more systematic reviews and meta-analyses that aim to aggregate the evidence from studies in this field (eg, recent systematic reviews of AI and ML techniques for COVID-19). 32-34
No AI-specific updates to the guidance for systematic reviews and meta-analyses have yet been announced, and it is therefore suggested that the PRISMA 2009 35 or PRISMA-DTA 2018 36 statements should be followed.

For the planning stages of systematic reviews of prediction models, the 'Checklist for critical appraisal and data extraction for systematic reviews of prediction modelling studies' (CHARMS, 2014) 37 was developed by the Cochrane Prognosis Methods Group. This was not created for publications relating to AI per se, but is applicable to a wide range of studies, which also happen to include the evaluation of ML models. The developers provide the checklist to help authors frame their review question, design the review, extract relevant items from published reports of prediction models and guide assessment of risk of bias (rather than to guide the analysis itself). The checklist will therefore be useful to those who wish to plan a review of AI tools that provide a 'risk score' or 'probability of diagnosis'. A tutorial on how to carry out a 'CHARMS analysis' for prognostic multivariate models, with real-life worked examples, has been published 38 and may be a helpful resource for readers wishing to carry out similar work. It is worth noting that the authors of CHARMS still recommend reference to the PRISMA 2009 35 and PRISMA-DTA 2018 36 statements for the reporting and analysis of trial results, in conjunction with their own checklist for planning the review design.

Alternative guidelines have been published by expert interest groups and endorsed by different specialty societies; a few are described here to supplement further reading and interest. The Radiological Society of North America published the CLAIM checklist 39 in 2020, which contains elements of the STARD 2015 guideline and is applicable to studies addressing a wide spectrum of AI applications using medical images (eg, classification, reconstruction, text analysis, workflow optimisation). The checklist comprises 42 items, of which 6 are new (pertaining to model design and training), 8 are extensions of pre-existing STARD 2015 items, 14 are elaborations (mostly relating to methods and results) and 14 remain the same. Particular emphasis is given to the data, the reference standard of 'ground truth' and the precise development and methodology of the AI algorithm being tested. These are listed in further detail in table 4, where differences from STARD 2015 are highlighted.

Table 4 Selected CLAIM items in relation to STARD 2015
- Robustness or sensitivity analysis.
- Methods for explainability or interpretability (eg, saliency maps) and how they were validated.
- Validation or testing on external data.
- Intended sample size and how it was determined.
- How data were assigned to partitions; specify proportions.
- Level at which partitions are disjoint (eg, image, study, patient, institution).
- Flow of participants, using a diagram.
- Baseline demographic and clinical characteristics of participants (STARD item 20); demographic and clinical characteristics of cases in each partition.
- Test results (STARD item 23): cross tabulation of the index test results (or their distribution) by the results of the reference standard; performance metrics for optimal model(s) on all data partitions.
- Estimates of diagnostic accuracy and their precision (such as 95% CIs).
- Any adverse events from performing the index test or the reference standard.
- Failure analysis of incorrectly classified cases.
- Study limitations, including sources of potential bias, statistical uncertainty and generalisability.
- Implications for practice, including the intended use and clinical role of the index test.
- Other information: registration number and name of registry; where the full study protocol can be accessed; sources of funding and other support, and role of funders.
This table is based on the STARD 2015 guidelines, 20 with the full CLAIM checklist indicating which aspects are new, the same or elaborated on; items previously present in the STARD guideline but not included in the CLAIM checklist have been removed.

Care should be taken to avoid confusion with another similarly named checklist, 'minimum information about clinical artificial intelligence modelling' (MI-CLAIM), 40 which is less a reporting guideline than a document outlining the shared understanding required in the development and evaluation of AI models, aimed at clinical and data scientists, repository managers and model users.
It is also worth noting that the American Medical Informatics Association produced a set of guidelines in 2020 termed MINIMAR (Minimum Information for Medical AI Reporting), 41 specific to studies reporting the use of AI solutions in healthcare. Rather than a list of items for manuscript writing, this guidance provides suggestions for the details pertaining to data sources used in algorithm development and their intended usage, spread across four key subject areas (ie, study population and setting, patient demographics, model architecture and model evaluation). There are many similarities with the aforementioned CLAIM checklist, although key differences include the granularity with which MINIMAR suggests researchers should explicitly state participant demographics (eg, ethnicity and socioeconomic status, rather than just age and sex) and how code and data can be shared with the wider community.

There is an increasing need to build a cadre of researchers and reviewers with sufficient domain knowledge of technical aspects (including limitations and risk) and of the principles of good trial methodology (including areas of potential bias, analysis issues, etc). There is also a need for ML experts and clinical trial communities to increasingly learn each other's language, to ensure accurate and precise communication of concepts and enable comparison between studies. A number of reviews are highlighted here for further reading, 42-46 along with work explaining the different evaluation metrics used in AI and ML studies. 47 It is also worth bearing in mind the wider clinical and ethical context of how any AI tool would fit into our existing clinical pathways and healthcare systems. 48
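One recurring, concrete source of the optimistic bias these checklists try to surface is leakage between data partitions that are not disjoint at the patient level (see the CLAIM items in table 4 on how data were assigned to partitions and the level at which partitions are disjoint). The sketch below is a minimal illustration, assuming a pandas DataFrame with a hypothetical patient_id column, of how a patient-level split might be made and its proportions reported; it is not prescribed by any of the guidelines discussed.

```python
# Minimal illustration of a patient-level (grouped) train/test split,
# so that no patient contributes data to both partitions.
# Column names (patient_id, label) and values are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 5, 6, 6],
    "label":      [0, 0, 1, 1, 0, 1, 1, 0, 1, 0],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Figures worth reporting alongside the split: partition sizes, proportions,
# and confirmation that patients are disjoint across partitions.
assert set(train["patient_id"]).isdisjoint(test["patient_id"])
print(f"train: {len(train)} rows / {train['patient_id'].nunique()} patients")
print(f"test:  {len(test)} rows / {test['patient_id'].nunique()} patients")
```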
In conclusion, this article has provided readers with an overview of the changes to standard clinical reporting guidelines specific to AI-related studies. The fundamental basics of describing the trial setup, the inclusion and exclusion criteria, and the study methodology and standards used, together with details of algorithm development, should create transparency and address reproducibility. The guidelines most relevant to a particular healthcare specialty will depend on the type of research being conducted in that field (eg, guidelines for AI-related diagnostic accuracy trials may be more relevant for radiological or pathological specialties, whereas those addressing patient outcomes with the aid of an AI algorithm may be more relevant for oncological or surgical specialties). Although the reporting guidelines outlined may seem comprehensive, there remain areas that will need to be addressed, such as the economic evaluation of AI tools and algorithms (many current guidelines were developed for 'pharmacoeconomic evaluations'). 49 It is likely that future guidelines may take the form of an extension to the widely used CHEERS guidance (Consolidated Health Economic Evaluation Reporting Standards), 50 51 available via the EQUATOR Network. 13 Nevertheless, wide variation in opinion regarding the most appropriate economic evaluation guideline already exists for non-AI tools, and this may be reflected in future iterations of such guidelines depending on how the algorithms are funded in different healthcare systems. 52 The guidelines outlined here will likely continue to be updated in the light of new understanding of the specific challenges of AI as an intervention and of how traditional study designs and reports need to be adapted.

Competing interests None declared.

Patient consent for publication Not required.

Provenance and peer review Not commissioned; externally peer reviewed.

Data availability statement Data sharing not applicable as no datasets were generated and/or analysed for this study.
References
1. More than machines
2. A digital (r)evolution: introducing The Lancet Digital Health
3. Artificial intelligence, real radiology
4. Reporting guidelines: doing better for readers
5. Assessing radiology research on artificial intelligence: a brief guide for authors, reviewers, and readers-from the radiology editorial board
6. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed
7. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis
8. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies
9. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers
10. Reporting quality of studies using machine learning models for medical diagnosis: a systematic review
11. Reporting of demographic data and representativeness in machine learning models using electronic health records
12. EQUATOR: reporting guidelines for health research
13. Enhancing the quality and transparency of health research
14. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials
15. Guidelines for inclusion of patient-reported outcomes in clinical trial protocols: the SPIRIT-PRO extension
16. Reporting of patient-reported outcomes in randomized trials: the CONSORT-PRO extension
17. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
18. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
19. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
20. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies
21. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension
22. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension
23. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension
24. SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials
25. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide
26. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI steering group
27. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
28. Reporting of artificial intelligence prediction models
29. Protocol for a systematic review on the methodological and reporting quality of prediction model studies using machine learning techniques
30. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration
31. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence
32. Systematic review of artificial intelligence techniques in the detection and classification of COVID-19 medical images in terms of evaluation and benchmarking: taxonomy analysis, challenges, future solutions and methodological aspects
33. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis
34. Role of machine learning techniques to tackle the COVID-19 crisis: systematic review
35. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration
36. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement
37. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist
38. A general presentation on how to carry out a CHARMS analysis for prognostic multivariate models
39. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers
40. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist
41. MINIMAR (minimum information for medical AI reporting): developing reporting standards for artificial intelligence in health care
42. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view
43. A short guide for medical professionals in the era of artificial intelligence
44. How to read and review papers on machine learning and artificial intelligence in radiology: a survival guide to key methodological concepts
45. A clinician's guide to artificial intelligence: how to critically appraise machine learning studies
46. Basics of deep learning: a radiologist's guide to understanding published radiology articles on deep learning
47. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods
48. Ethical limitations of algorithmic fairness solutions in health care machine learning
49. What guidance are economists given on how to present economic evaluations for policymakers? A systematic review
50. Consolidated Health Economic Evaluation Reporting Standards (CHEERS) explanation and elaboration: a report of the ISPOR Health Economic Evaluation Publication Guidelines Good Reporting Practices Task Force
51. Consolidated health economic evaluation reporting standards (CHEERS) statement
52. National healthcare economic evaluation guidelines: a cross-country comparison