title: Machine Learning for Health: Algorithm Auditing & Quality Control
authors: Oala, Luis; Murchison, Andrew G.; Balachandran, Pradeep; Choudhary, Shruti; Fehr, Jana; Leite, Alixandro Werneck; Goldschmidt, Peter G.; Johner, Christian; Schörverth, Elora D. M.; Nakasi, Rose; Meyer, Martin; Cabitza, Federico; Baird, Pat; Prabhu, Carolin; Weicken, Eva; Liu, Xiaoxuan; Wenzel, Markus; Vogler, Steffen; Akogo, Darlington; Alsalamah, Shada; Kazim, Emre; Koshiyama, Adriano; Piechottka, Sven; Macpherson, Sheena; Shadforth, Ian; Geierhofer, Regina; Matek, Christian; Krois, Joachim; Sanguinetti, Bruno; Arentz, Matthew; Bielik, Pavol; Calderon-Ramirez, Saul; Abbood, Auss; Langer, Nicolas; Haufe, Stefan; Kherif, Ferath; Pujari, Sameer; Samek, Wojciech; Wiegand, Thomas
date: 2021-11-02
journal: J Med Syst
DOI: 10.1007/s10916-021-01783-y

Developers proposing new machine learning for health (ML4H) tools often pledge to match or even surpass the performance of existing tools, yet the reality is usually more complicated. Reliable deployment of ML4H to the real world is challenging, as examples from diabetic retinopathy or Covid-19 screening show. We envision an integrated framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial, we give a summary of ongoing work towards that vision and announce a call for participation for the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal to advance the practice of ML4H auditing.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10916-021-01783-y.

Machine learning (ML) technology promises to automate, speed up or improve medical processes. A large number of institutions and companies are working ambitiously to fulfill this promise, spanning tasks such as medical image classification [1], segmentation [2] or reconstruction [3], protein structure prediction [4] and electrocardiography interpretation [5], among others 1. However, the deployment of machine learning for health (ML4H) tools into real-world applications has been slow because existing approval processes [6] may not account for the particular failure modes and risks that accompany ML technology [7-11]. Certain changes to image data that would not change the decision of a human expert can completely alter the output of an image classification [12] or regression [13, 14] model. Model performance estimates are often not valid for the types of varying input distributions that can occur during real-world deployment [15-17]. The decision heuristics a model learns can differ from the heuristics we would expect a human to use [1, 18-20], and model predictions may come with ill-calibrated statements of confidence [21-23] or no estimate of uncertainty at all [24]. Developers proposing new ML4H technologies sometimes promise to match or even surpass the performance of existing methods [25], yet the reality is often more complicated. Classical ML performance evaluation does not automatically translate to clinical utility, as examples from large diabetic retinopathy projects [26] or Covid-19 diagnosis illustrate [27]. The reliable and integrated management of these risks remains an open scientific and practical hurdle.
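To make one of these failure modes concrete, calibration can be quantified with the expected calibration error (ECE), which compares a model's stated confidence with its empirical accuracy. The following is a minimal sketch on synthetic data; it illustrates the metric itself, not a prescribed audit procedure, and all data in it are invented.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over equal-width
    confidence bins, weighted by the fraction of samples in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin
            conf = confidences[mask].mean()  # mean stated confidence in bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Hypothetical audit data: confidences and correctness indicators for
# a deliberately overconfident model (accuracy below stated confidence).
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < conf * 0.85).astype(float)
print(f"ECE: {expected_calibration_error(conf, correct):.3f}")
```

An ECE near zero indicates that stated confidence tracks empirical accuracy; large values signal the ill-calibrated confidence statements described above.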
In order to overcome this hurdle, we envision a framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial, we give a brief summary of ongoing work towards that vision from our open collective of collaborators. Many of the considerations presented here originate from a consensus-finding effort by the International Telecommunication Union (ITU) and World Health Organization (WHO), which started in 2018 as the Focus Group on Artificial Intelligence for Health (FG-AI4H) [28]. We are convinced that success on this path depends heavily on practical feedback. Auditing processes that are developed on paper have to be put to the test to ensure that they translate to utility in actual auditing practice [29]. That is why we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal (see the Call for Participation for more details 2). The special issue will provide a platform for the submission, discussion and publication of audit methods and reports. The resulting compendium is intended to be a useful resource for users, developers, vendors and auditors of ML4H systems to manage and mitigate their particular risks.

From a bird's eye view, many ML tools share a set of core components comprising data, an ML model and its outputs, as visualized in Fig. 1A. The typical ML product life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring (see Fig. 1B). Feedback loops between stages, for example from product validation back to development, are commonplace 3. An audit entails a detailed assessment of an ML4H tool at one or more of the ML life cycle steps. It can be carried out to anticipate, monitor, or retrospectively review operations of the tool [30, 31]. The audit output should consist of a comprehensive standardized report that different stakeholders can use to efficiently communicate the tool's strengths and limitations [29]. We envision a process by which an independent body, for example appointed by a government, carries out the audit using the methods and tools outlined below. The same methods and tools can also be used by manufacturers and researchers themselves to carry out internal quality control [32]. In either scenario, the assessment is carried out with respect to a dynamic set of technical, clinical and regulatory considerations (see Fig. 1C) that depend on the concrete ML technology and the intended use of the tool. Audit teams should therefore include expertise in all these dimensions and be able to synthesize related requirements across disciplines. In the following, we list a selection of considerations for all three of these auditing dimensions, tools that can be used to aid the auditing process, as well as the role that so-called trial audits can play in advancing ML4H quality control.
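As an illustration of what a machine-readable version of such a standardized audit report could look like, the sketch below defines a minimal structure organized along the three audit dimensions and the life cycle stages. All field names here are hypothetical; the FG-AI4H reporting templates [29, 70] define the authoritative format.

```python
from dataclasses import dataclass, field

@dataclass
class AuditFinding:
    dimension: str         # "technical" | "clinical" | "regulatory"
    life_cycle_stage: str  # e.g. "development", "validation", "monitoring"
    requirement: str       # requirement or metric under assessment
    result: str            # measured value or pass/fail outcome
    severity: str          # e.g. "info", "minor", "major"

@dataclass
class AuditReport:
    tool_name: str
    intended_use: str
    auditor: str
    findings: list[AuditFinding] = field(default_factory=list)

    def summary(self) -> dict:
        """Count findings per severity level for a quick overview."""
        counts: dict = {}
        for f in self.findings:
            counts[f.severity] = counts.get(f.severity, 0) + 1
        return counts
```

A structure like this lets quantitative validation results and qualitative clinical or regulatory observations travel in one artifact between auditors, manufacturers and regulators.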
Fig. 1 A: Most ML tools share a set of core components comprising data, an ML model and its outputs. B: The typical ML life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring. C: An ML4H audit is carried out with respect to a dynamic set of technical, clinical and regulatory considerations that depend on the concrete ML technology and the intended use of the tool.

The technical validation of an ML4H tool comprises the application of data and ML model quality assessment methods to detect possible failure modes in the model's behavior. These include model-oriented metrics, such as predictive performance, robustness [33, 34], interpretability [1, 35], disparity [36] or uncertainty [13, 24, 37], but also data-oriented metrics related to sample size determination [38], sparseness [39], bias [40], distribution mismatch [41, 42] and label quality [7]. The lack of rigorous statistical analysis of model metrics is a common pitfall in both research and industry, and such analysis thus plays an important role during technical validation [43]. FG-AI4H has formulated a standardized quality assessment framework based on existing good practices [44-46] and provides practical guidance and examples for performing technical validation audits on three ML4H tools [29].

Clinical Evaluation comprises an "ongoing procedure to collect, appraise and analyse clinical data pertaining to a medical device and to analyse whether there is sufficient clinical evidence to confirm compliance with relevant essential requirements for safety and performance when using the device according to the manufacturer's instructions for use" [47]. The EQUATOR network, including STARD-AI [48], CONSORT-AI [49] and SPIRIT-AI [50], as well as different scientific journals and associations [51-54], have developed guidelines for the design, implementation, reporting and evaluation of AI interventions in various study designs. Key concerns are whether the ML4H tool delivers utility in clinical pathways, how cost-effective the clinician-tool interaction is [55] and whether it provides the desired benefits for the intended users [56]. To demonstrate reliable performance, it is important to look beyond common machine learning performance statistics such as accuracy and to additionally evaluate whether the ML4H tool is suited to the clinical setting in which it will be used; for example, whether the training and test data represent patient populations that are similar to the intended use population [7, 57] and whether the output translates to medically meaningful parameters [58].

Regulatory Assessment comprises the systematic evaluation of ML4H tools with respect to the applicable regulatory requirements found in laws (MDR [59], IVDR [60]) and in applicable standards and guidance documents [61-68]. Such guidance is of practical concern for stakeholders in the ML4H ecosystem, including manufacturers (e.g. product managers, developers and data scientists, quality and regulatory affairs managers) and regulatory bodies (authorities, notified bodies). The FG-AI4H has identified and critically reviewed general yet fundamental regulatory considerations related to ML4H.
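Among the data-oriented checks named above, distribution mismatch between development and deployment data lends itself well to automation. The sketch below shows one possible screen, a per-feature two-sample Kolmogorov-Smirnov test on hypothetical feature matrices; it is a simple univariate heuristic for illustration, not the FG-AI4H assessment method, and all data and thresholds are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def screen_feature_shift(train_features, deploy_features, alpha=0.01):
    """Flag features whose marginal distribution differs between the
    training sample and data seen at deployment (per-feature KS test).
    Univariate only: joint-distribution shifts can escape this screen."""
    flagged = []
    n_features = train_features.shape[1]
    for j in range(n_features):
        stat, p = ks_2samp(train_features[:, j], deploy_features[:, j])
        if p < alpha / n_features:  # Bonferroni correction across features
            flagged.append((j, stat))
    return flagged

# Hypothetical data: deployment feature 2 drifts upward, e.g. after a
# change in the acquisition device or patient population.
rng = np.random.default_rng(1)
train = rng.normal(size=(500, 5))
deploy = rng.normal(size=(400, 5))
deploy[:, 2] += 0.8
print(screen_feature_shift(train, deploy))  # expected: feature 2 flagged
```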
These regulatory considerations have been converted into specific and verifiable requirements and subsequently published as a comprehensive assessment checklist entitled "Good practices for health applications of machine learning: Considerations for manufacturers and regulators" [45], which covers the entire life cycle outlined in Fig. 1B at a higher resolution. It includes checklist items which should be given high priority when time is limited, an important practical constraint for real-world audits. Examples and comments give further guidance to users. New regulatory developments, such as predetermined change control plans [69], imply faster software update cycles and potentially more frequent audits. Hence, good tooling can become an important means to make effective as well as efficient audits possible.

The auditing process can be supported by appropriate tools to make it more targeted and time-efficient. These include process and requirements descriptions, as mentioned above [44, 45, 56], which help to manage dynamic workflows that may vary by use case and ML technology. They also include reporting templates to present the audit results in a standardized way for communication between different stakeholders [29, 70]. In addition, the nature of ML4H tools, as primarily software that interacts with data, lends itself to the application of test automation and simulations for the purpose of auditing. This requires software tools which can handle custom evaluation scripts, the flexible processing of different ML4H model formats and data modalities, as well as security protocols that protect intellectual property and sensitive patient information [71]. We are working with open source frameworks such as EvalAI [72] and MLflow [73] to develop solutions for automated auditing 4, federated auditing in remote teams 5 and automated report creation. Our first demo platform is available via http://health.aiaudit.org/ 6 and hosted on ITU-provisioned infrastructure. While quantitative performance measures can already be provided, it is essential to also offer qualitative measures. This is realized by requiring users to fill out a standardized questionnaire [74]. Quantitative and qualitative performance results are then provided to the users as a comprehensive and standardized report card [70].

We are convinced that success on the path towards a framework for algorithm auditing and quality control depends heavily on practical feedback. The development and refinement of auditing processes should routinely be accompanied by trial audits, in which draft processes and standards are applied to ML4H tools. The purpose of such an exercise is to ensure that auditing processes developed on paper translate to utility in actual auditing practice [29]. In order to facilitate the implementation of trial audits, we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal. We welcome contributions pertaining to methods, tools, reports or open challenges in ML4H auditing. The materials summarized above bear testimony to the initial progress that has been made towards the creation of frameworks for ML4H algorithm auditing and quality control. Nevertheless, new challenges emerge as we collectively pull at the complex fabric that ML4H systems are.
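To illustrate how an existing framework such as MLflow [73] could support the audit automation and report creation described above, the sketch below logs quantitative metrics and questionnaire-style answers for one hypothetical audit run. The experiment name, metric names and report fields are invented for illustration; only the MLflow calls themselves are real API, and this is not the platform's actual workflow.

```python
import mlflow

# Hypothetical trial-audit run; names and values are illustrative.
mlflow.set_experiment("ml4h-trial-audit")

with mlflow.start_run(run_name="dermatology-classifier-v2"):
    mlflow.log_param("life_cycle_stage", "validation")
    mlflow.log_param("intended_use", "triage of dermoscopic images")

    # Quantitative results, e.g. produced by automated evaluation scripts.
    mlflow.log_metric("auroc", 0.91)
    mlflow.log_metric("expected_calibration_error", 0.07)
    mlflow.log_metric("auroc_drop_under_corruption", 0.12)

    # Qualitative results, e.g. distilled from a standardized questionnaire.
    mlflow.log_dict(
        {
            "clinical_utility": "pending prospective study",
            "known_limitations": ["no pediatric data in training set"],
        },
        "questionnaire_summary.json",
    )
```

Because every run is versioned and queryable, repeated audits of updated models, as implied by predetermined change control plans, can be compared side by side when assembling the standardized report card.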
From the perspective of technical validation, the identification of factors which bias or deteriorate algorithmic performance is often constrained by the absence of relevant metadata. For example, the measurement device types (and related acquisition parameters) used to produce the validation inputs should be available in order to validate whether the model performance is robust under device type changes. This problem can be alleviated by identifying and routinely recording this information during data acquisition.

For clinical evaluation, future considerations include extending and refining the specific requirements for how the clinical effectiveness of a tool should be monitored on an ongoing basis after the algorithm has been implemented [59]. This also requires agreement on clear and clinically useful procedures to obtain ground truth annotations. It might be necessary to refine the ML algorithm for the target population if demographics or clinical characteristics differ from the training settings or if medical guidelines for diagnostics or treatment have changed [75]. Therefore, in order for these insights to be effective, it is imperative that auditors have a solid understanding of the training data, ML algorithm, independent test data and evaluation metrics specific to the intended use.

A challenge for regulatory assessment is that standardization organizations, notified bodies and manufacturers need to efficiently formulate and parse the applicable regulatory requirements for each individual ML4H tool. Comprehensive assessment checklists [45, 51] can help with that task. However, more support is needed in terms of workflow management and assisting tools if we consider the limited time and budgets which professional auditors have at their disposal. Future regulatory checklists should allow for the interactive selection of use-case-specific sub-checklists, automated audit report creation and the issuing of standard minimum test cases, and should come with accompanying glossaries and education materials for auditors. We also have to ensure that protocols are in place which translate audit insights into actual improvements in the ML4H tool. Managing the risks presented by the exciting advances of AI in healthcare is a formidable undertaking, but with collaborative pooling of expertise and resources we believe we can rise to the task.

The online version contains supplementary material available at https://doi.org/10.1007/s10916-021-01783-y.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
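A minimal sketch of the kind of check that such routinely recorded metadata enables: stratifying a performance metric by recorded device type to surface robustness gaps. The data, device names and threshold of concern below are hypothetical.

```python
import numpy as np
from collections import defaultdict

def stratified_accuracy(y_true, y_pred, device_types):
    """Per-device-type accuracy. Wide gaps between strata suggest the
    model is not robust to device changes; without recorded acquisition
    metadata, this check is impossible to run at all."""
    by_device = defaultdict(list)
    for t, p, d in zip(y_true, y_pred, device_types):
        by_device[d].append(t == p)
    return {d: float(np.mean(hits)) for d, hits in by_device.items()}

# Hypothetical validation set with recorded acquisition metadata.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1])
devices = ["scanner_A"] * 4 + ["scanner_B"] * 4
print(stratified_accuracy(y_true, y_pred, devices))
# e.g. {'scanner_A': 0.75, 'scanner_B': 0.5}
```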
References

Resolving challenges in deep learning-based analyses of histopathological images using explanation methods
UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support
Learning the invisible: A hybrid deep learning-shearlet framework for limited angle computed tomography
Improved protein structure prediction using potentials from deep learning
PTB-XL, a large publicly available electrocardiography dataset
How medical AI devices are evaluated: Limitations and recommendations from an analysis of FDA approvals
As if sand were stone: New concepts and metrics to probe the ground on which to build trustable AI
Underspecification presents challenges for credibility in modern machine learning
Adversarial examples are a natural consequence of test error in noise
Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing
Do ImageNet classifiers generalize to ImageNet?
Intriguing properties of neural networks
Interval neural networks as instability detectors for image reconstructions
Detecting failure modes in image reconstructions with interval neural network uncertainty
The many faces of robustness: A critical analysis of out-of-distribution generalization
Measuring robustness to natural distribution shifts in image classification
Post-hoc domain adaptation via guided data homogenization
Unmasking Clever Hans predictors and assessing what machines really learn
Do deep generative models know what they don't know?
Interpretable heartbeat classification using local model-agnostic explanations on ECGs
Improving uncertainty estimation with semi-supervised deep learning for COVID-19 detection using chest X-ray images
On calibration of modern neural networks
Revisiting the calibration of modern neural networks
What uncertainties do we need in Bayesian deep learning for computer vision?
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Google's medical AI was super accurate in a lab. Real life was a different story. MIT Technology Review
CT scanning is just awful for diagnosing COVID-19. Luke Oakden-Rayner
WHO and ITU establish benchmarking process for artificial intelligence in health
ML4H auditing: From paper to practice
Towards algorithm auditing: A survey on managing legal, ethical and technological risks of AI
Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight
Software product quality assurance
Towards evaluating the robustness of neural networks
Benchmarking neural network robustness to common corruptions and perturbations
Explaining deep neural networks and beyond: A review of methods and applications
Aequitas: A bias and fairness audit toolkit
Interval neural networks: Uncertainty scores
Sample-size determination methodologies for machine learning in medical imaging research: A systematic review
Using cluster analysis to assess the impact of dataset heterogeneity on deep convolutional network accuracy: A first glance
Assessing and mitigating bias in medical artificial intelligence: The effects of race and ethnicity on a deep learning model for ECG analysis
The reliability of a deep learning model in clinical out-of-distribution MRI data: A multicohort study
More than meets the eye: Semi-supervised learning under non-IID data
Data analysis strategies in medical imaging
Data and artificial intelligence assessment methods (DAISAM) reference. Reference document DEL 7.3 on FG-AI4H server
The Supreme Audit Institutions of Finland, Germany, the Netherlands, Norway and the UK: Auditing machine learning algorithms
Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group
CONSORT-AI extension
Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI Extension
The need to separate the wheat from the chaff in medical informatics
MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care
Artificial intelligence in dental research: Checklist for authors, reviewers, readers
Clinician checklist for assessing suitability of machine learning applications in healthcare
Cost-effectiveness of artificial intelligence for proximal caries detection
Clinical evaluation of AI for health
Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies
Regulation (EU) 2017/745 of the European Parliament and of the Council on medical devices
Regulation (EU) 2017/746 of the European Parliament and of the Council on in vitro diagnostic medical devices
Code of Federal Regulations
Part 1: Application of usability engineering to medical devices - Amendment
Medical devices - Application of risk management to medical devices
FDA guidance documents
AAMI Technical Report (TR) 57: Principles for medical device security - Risk management
EUR-Lex - 52021PC0206 - EN - EUR-Lex
DAISAM audit reporting template
Data sharing practices. Reference document DEL 5.6 on FG-AI4H server
EvalAI: Towards better evaluation systems for AI agents
Developments in MLflow: A system to accelerate the machine learning lifecycle
Model questionnaire. Reference document J-038 on FG-AI4H server
Key challenges for delivering clinical impact with artificial intelligence
Patterns, predictions, and actions: A story about machine learning