FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Medical Imaging

Lekadir, Karim; Osuala, Richard; Gallin, Catherine; Lazrak, Noussair; Kushibar, Kaisar; Tsakou, Gianna; Aussó, Susanna; Alberich, Leonor Cerdá; Marias, Kostas; Tsiknakis, Manolis; Colantonio, Sara; Papanikolaou, Nickolas; Salahuddin, Zohaib; Woodruff, Henry C; Lambin, Philippe; Martí-Bonmatí, Luis

2021-09-20

The recent advancements in artificial intelligence (AI), combined with the extensive amount of data generated by today's clinical systems, have led to the development of imaging AI solutions across the whole value chain of medical imaging, including image reconstruction, medical image segmentation, image-based diagnosis and treatment planning. Notwithstanding the successes and future potential of AI in medical imaging, many stakeholders are concerned about the potential risks and ethical implications of imaging AI solutions, which are perceived as complex, opaque, and difficult to comprehend, utilise, and trust in critical clinical applications. Despite these concerns and risks, there are currently no concrete guidelines and best practices for guiding future AI developments in medical imaging towards increased trust, safety and adoption. To bridge this gap, this paper introduces a careful selection of guiding principles drawn from the accumulated experiences, consensus, and best practices of five large European projects on AI in Health Imaging. These guiding principles are named FUTURE-AI and their building blocks are (i) Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness and (vi) Explainability. In a step-by-step approach, these guidelines are further translated into a framework of concrete recommendations for specifying, developing, evaluating, and deploying technically, clinically and ethically trustworthy AI solutions into clinical practice.

Amid hope and hype, artificial intelligence (AI) is widely regarded as one of the most promising and disruptive technologies for future healthcare. As machine learning techniques are well suited to the analysis of large and complex datasets, medical imaging is the medical speciality that has seen the most AI developments in recent years [5]. With the advent of big data and machine learning, imaging AI solutions have been developed for the whole value chain of medical imaging and radiology, including image reconstruction [144, 153], medical image segmentation [167, 21], image-based diagnosis [104, 100], and treatment planning [65, 182]. The recent developments in the field are also well illustrated by the comprehensive list of FDA-approved AI algorithms that is maintained online by the American College of Radiology [7]. If suitably implemented, AI is expected to play an important role in future medical imaging by enhancing the acquisition, processing and interpretation of medical images, and by helping to extract and combine new information and imaging biomarkers for enhanced patient assessment, prediction and decision making, thus assisting clinicians in diagnosing and managing patients more efficiently and more accurately. However, despite the advances and developments in the field over the last years, the adoption and deployment of imaging AI technologies remain limited in clinical practice.
A recent survey of clinicians performed in Australia and New Zealand showed that, while the vast majority of radiologists agree that the introduction of AI would improve their field, over 80% of respondents have not yet used AI in their day-to-day practice [151]. At the same time, many stakeholders have expressed concerns about the potential risks, ethical implications and lack of trust in AI in healthcare in general, and in medical imaging in particular. AI tools continue to be viewed as complex, opaque technologies that are difficult for clinicians and patients alike to comprehend, utilise and fully trust [142]. There is a concern that AI tools can generate undetected errors, with harmful consequences for the patient, when they are applied under imaging conditions that differ or unexpectedly deviate, even slightly, from those used for training. Because existing imaging databases are often imbalanced with respect to sex, ethnicity, geography and socioeconomics, there is a risk that trained AI algorithms become biased against under-represented groups and hence exacerbate existing health disparities [157, 85]. There are also concerns about the effect of AI tools on the decision-making and interpretation skills of experienced and less experienced radiologists [132]. Importantly, current AI solutions for medical imaging are rarely developed and validated with mechanisms that enable their monitoring throughout their deployment lifetime, for instance to periodically assess changes in performance, especially when changes in the imaging hardware or protocol take place, or to enable continuous learning and assess its effect on the AI tools as new imaging studies and richer datasets become available over time.

Despite these concerns and risks, there are currently no concrete guidelines and best practices for guiding future AI developments in medical imaging towards increased trust, safety and adoption. A joint statement of European and North American radiological associations on the ethical challenges of AI in radiology recently stated that 'the radiology community should start now to develop codes of ethics and practice for AI' [49]. This paper defines new guiding principles, named FUTURE-AI, and translates them into concrete recommendations and best practices for developing future AI solutions in medical imaging. The proposed guiding principles are (i) Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness and (vi) Explainability, as depicted in Figure 1. These principles and the related recommendations and best practices were defined by building on accumulated experiences and results from five large European projects on AI in Health Imaging (the AI4HI Network, comprising the EuCanImage, PRIMAGE, CHAIMELEON, INCISIVE and ProCancer-I projects). The recommendations detailed in this paper facilitate the application of the FUTURE-AI guiding principles and include a set of 55 checklist points aimed at guiding designers, developers, evaluators, end-users and regulators of AI in medical imaging. In a step-by-step approach, these guidelines will enhance the specification, implementation, evaluation and deployment of imaging AI algorithms that can be trusted, technically, clinically and ethically, within future radiology practice.
Despite continuous advances in medical research and practice, there remain important health inequalities between individuals and groups of individuals, in particular due to differences in sex/gender, age, ethnicity, income, education and geography [105]. There are concerns that these existing disparities may be embedded and even amplified by emerging algorithms in healthcare if these are not properly implemented [126]. A 2019 study received a lot of media attention when it showed that an algorithm widely used in the United States for patient referral discriminated against Black patients [119]. The authors of the study explained that remedying this bias would increase the percentage of Black patients receiving additional help from 17.7% to 46.5%. Among the few works that have investigated AI bias in medical imaging, a recent study evaluated the extent to which three deep learning-based algorithms for the detection of abnormalities in chest X-ray images (e.g., bone fractures, lung lesions, nodules, pneumonia) are biased with respect to attributes such as sex, age, ethnicity and socioeconomic status [157]. The authors found that the highest rate of AI-based under-diagnosis was in young females (age < 20 years), in Black patients, and in patients on basic health insurance, and concluded that 'models trained on large datasets do not provide equality of opportunity naturally, leading instead to potential disparities in care if deployed without modification'.

Fairness can be negatively affected by both quantitative and qualitative biases. Indicative examples of quantitative biases in medical imaging follow. A review of five years' worth of peer-reviewed articles analysed the data used to train deep learning algorithms for image-based diagnosis in various medical specialities in the United States [80]. The authors found that, among the studies in which geographic location was known, the vast majority (71%) used training data from just three states (California, Massachusetts, and New York) and that no data at all were used from 24 of the 50 states in the country. In a study on fairness in cardiac image segmentation based on a training dataset that was gender-balanced (52% male, 48% female) but ethnically imbalanced (80% White, 20% other ethnicities including Asian, Black, Chinese and Mixed), the baseline AI algorithm performed consistently across men and women, but there were significant biases against the under-represented ethnic groups [137] (e.g., 93% average segmentation accuracy for White subjects vs. 83% for Chinese subjects). Similarly, a study on missed imaging appointments in the United States found that racial minorities and people with low socioeconomic status missed more healthcare appointments than other groups, which further contributes to their prevailing under-representation in the imaging datasets of their healthcare provider [54].

Qualitative biases affecting fairness may consist of cognitive biases of the clinicians generating, interpreting or annotating the imaging data: a study on mammography screening, for example, has shown that the detection of malignant breast lesions was less likely in patients with a minority ethnic background or low income [140].¹ Such biases can find their way into training datasets (e.g., datasets can contain less accurate or more missing image annotations for precisely these subject groups) and, therefore, may be exacerbated through AI decision support solutions trained on these datasets.

¹ Further studies are needed to investigate why radiologists were less likely to detect malignant breast lesions in minority/low-income patient groups. Young age might be a confounder for lower income and high breast density, i.e. it is more difficult to detect lesions in breasts with high density. On the other hand, it is likely that low income and minority ethnic background are associated with the absence of private health insurance, translating to a lower standard of care for these patients, which could also have caused the lower detection rate.
These studies illustrate the main cause of algorithmic bias in AI: the training datasets often lack the required quantitative and qualitative diversity and balance to obtain AI solutions that can maintain the same performance across human groups and sub-populations. In particular, if an AI algorithm is trained with imaging data that are imbalanced with respect to sex [90], socioeconomics or ethnicity [137], and given the health differences within and across these groups, it is likely that the resulting model will lead to biased predictions. Some data imbalance is application-specific and can be more difficult to identify based on standard attributes. For example, in breast cancer, women with high breast density are generally under-represented in existing clinical registries. The problem of bias in AI is common to all medical applications, but it is even more problematic in medical imaging, as personal attributes such as sex, age, ethnicity and socioeconomics are not always retained during the data preparation and anonymisation process, which minimises the possibility of patient identification [32]. To make things more complex, a recent early-stage research study [13] reports evidence that AI algorithms can identify a person's race from their medical scans, such as chest and hand X-rays and mammograms, when human medical experts cannot. This raises concerns relevant not only to the Explainability principle (Section 7) but also to fairness, and to the extent to which these unexplained AI predictions could imply a different, unequal AI-based decision support outcome for the treatment of patients depending on their race, as a kind of shortcut decision boundary, even in cases where race is not a relevant treatment criterion.

Equally importantly, even if perfect balance across diverse groups could be achieved in AI training and testing datasets, the integration of AI in real-life clinical practice can still raise fairness issues, especially with regard to its usage by experienced and less experienced clinicians and the effect it has on their decision-making capabilities. For instance, in relation to the interpretation of mammograms, a study [132] found that while automated support positively influenced the decision-making of radiologists with less advanced interpretation skills, it had a negative effect on the decision-making of radiologists with advanced image interpretation skills. This is likely related to over-trusting the outcome of the automated support and devaluing the clinician's professional experience accumulated over years of medical practice, or to misconceptions about the limitations and strengths of AI.
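To make the notion of a group performance gap concrete, the following is a minimal sketch (in Python with NumPy; the variable names and toy data are illustrative assumptions, not part of the FUTURE-AI framework) that computes per-group true positive rates and the resulting gap for a binary imaging classifier, i.e. the kind of subgroup check that the Fairness recommendations below call for.

```python
# Minimal sketch: per-group true positive rate (TPR) and equal-opportunity gap
# for a binary classifier. Variable names and toy data are hypothetical.
import numpy as np

def per_group_tpr(y_true, y_pred, groups):
    """Return {group: TPR}, computed on the positive cases of each group."""
    tprs = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)        # positive cases in group g
        if mask.sum() == 0:
            continue                                # no positives -> TPR undefined
        tprs[g] = float((y_pred[mask] == 1).mean()) # sensitivity within group g
    return tprs

def equal_opportunity_gap(tprs):
    """Largest difference in sensitivity between any two groups."""
    values = list(tprs.values())
    return max(values) - min(values)

# Toy example with synthetic labels/predictions and a 'sex' attribute.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
sex = rng.choice(["female", "male"], size=1000)

tprs = per_group_tpr(y_true, y_pred, sex)
print("Per-group TPR:", tprs)
print("Equal-opportunity gap:", equal_opportunity_gap(tprs))
```

In a real evaluation, the labels, predictions and group attribute would come from a held-out test set enriched with the standardised metadata described in the recommendations below, and a large gap would trigger the counter-measures discussed later in this section.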
The first principle of the FUTURE-AI guidelines is Fairness, which states that 'imaging AI algorithms should be impartial and maintain the same performance when applied to similarly situated individuals (individual fairness) or to different groups of individuals, including under-represented groups (group fairness)'. Equal access to the 'highest attainable standard of health' is considered a fundamental right of every human being, without distinction of race, beliefs, or economic conditions [180]. Medical imaging, which is an expensive but critical service in medical care, should hence be provided equally to all patients, independently of their gender, ethnicity, geographical location and socioeconomic level. AI algorithms should not exacerbate existing health disparities, such as those indicatively presented above, but instead should facilitate and enhance access to high-quality radiology services for all individuals and groups. To assess and achieve fairness when developing imaging AI algorithms, we recommend the following quality checks.

1. Multi-disciplinarity: The design, implementation and testing of the AI algorithms should take into account diverse perspectives brought by multidisciplinary teams of stakeholders, including AI developers, patients, radiologists, specialists (e.g. oncologists) and social scientists (e.g. ethicists). Multi-disciplinarity will help to eliminate subjectivity and identify as many sources of application-specific bias and inequity as possible. The views that social scientists and ethicists offer will enable AI to reach a wider, more diverse public, while engaging the communities most impacted by health disparities.

2. Definition of fairness: Fairness is subjective by nature [131]; therefore, it should be defined in both the general and the application/context-specific settings. Apart from general fairness requirements that should be satisfied in all cases (e.g. with respect to sex/gender), there are often application- and medical context-specific sources of bias that should be identified taking into account the specific clinical goals and end users of the AI solution. Therefore, the very first step is to define and prioritise AI fairness requirements in a specific medical imaging setting from a combined quantitative and qualitative perspective. This entails the definition of possible sources of inequity in the given context of use during all AI development phases, including the requirements, implementation and evaluation phases, as well as the specification of actions to counteract possible biases and of measurements of success in doing so. Such actions should include the identification of critical steps or AI subsystems requiring 'human-in-the-loop' decision-making and feedback by radiologists or physicians to avoid automation bias. Furthermore, in collaboration with domain experts such as radiologists and specialists, potentially hidden qualitative biases (e.g. annotators' cognitive biases) or quantitative biases (e.g. under-representation of high-density breasts in breast cancer imaging datasets, or of non-white skin colour in skin cancer imaging datasets) in the data collection and labelling should be investigated and identified for the particular clinical application, beyond standard categories such as sex, ethnicity and socioeconomics. The inclusion of a diversity of patients during the requirement analysis and prototype testing may help anticipate some of the more complex, hidden biases.
3. Metadata labelling: When collecting and preparing imaging databases for developing and testing new AI algorithms, standardised metadata and key variables (e.g. sex/gender, ethnicity, geography) which allow for the identification of groups and the verification of AI fairness should be included, in addition to the anonymised imaging data. However, this should be achieved while ensuring data privacy and informed consent according to existing data regulations, e.g. the European General Data Protection Regulation (GDPR). To ensure individual fairness, metadata which allow measuring the similarity of medical situations (e.g. clinical information) should be included to verify the equal treatment of all similar cases.

4. After defining what fairness means in the specific AI application context, the diversity and distribution within the training and testing datasets should be carefully planned and inspected. In particular, the data should be balanced as much as possible across sex, age, ethnic and socioeconomic groups. Mechanisms such as random sampling, stratified sampling and adaptive sampling can be used to increase data balance and representativeness. Biases due to the nature of the inclusion and exclusion criteria should be analysed and reported. A recent study of possible biases in the selection of individuals (Supp. Tables 5 & 6) demonstrated that individuals excluded due to missing values (in staging, primary site, tumour size and differentiation) tended to live further away from the institution, differed in some of the calculated clinical indices (e.g., the Framingham risk score) and had a higher percentage of deaths than the selected cohorts [115].

5. Multi-centre data collection: When possible, training and testing datasets should be multi-centric, spanning several radiology centres and/or localities and countries. In this case, the AI developers and collaborating clinicians should examine the extent to which variations in imaging quality, imaging protocols and sample size may impact the fairness of the AI models across radiology centres and/or localities. In particular, the fairness of the AI algorithms when applied to data acquired with imaging equipment of reduced quality (such as in low-to-middle-income countries) should be investigated.

6. Transparency of fairness: The process of collecting and preparing the datasets used for training and testing AI solutions should be transparent and documented, including information on data diversity and imbalance. Where possible, it is recommended to link this quality check with the Explainability principle (Section 7), which can help identify the reasons why a specific AI-based outcome is unfair and help define counter-measures.

7. Fairness evaluation and metrics: For bias estimation in imaging AI, dedicated metrics and statistical tests should be considered, such as True Positive Rates (TPR), Statistical Parity, Group Fairness, Equalised Odds and Predictive Equality [15]. Software toolkits, such as IBM's AI Fairness 360, may also be deployed to regularly check for unwanted biases in AI algorithms [16].
8. Application of counter-measures: If biases are identified during any stage of development or testing of AI, mitigation measures should be investigated and evaluated, including (1) pre-processing approaches to improve the training dataset through re-sampling (under- or over-sampling), data augmentation (e.g. image synthesis using adversarial learning) or sample weighting to neutralise discriminatory effects; (2) in-processing approaches that modify the learning algorithm in order to remove discrimination during the model training process, such as by adding explicit constraints to the loss functions to minimise the performance difference between subgroups of individuals (e.g., learning bias-free representations via an adversarial loss [96]); and (3) post-processing approaches to correct the outputs of the AI algorithm depending on the individual's group, such as by using the equalised odds post-processing technique [130]. All these techniques should be thoroughly evaluated to ensure their positive impact on fairness.

9. Continuous monitoring of fairness: Once implemented, the AI algorithm should be thoroughly and continuously evaluated and retrained for fairness, once again by using imaging datasets with adequate population diversity.

10. Training material and deployment effects: Appropriate training material for the targeted end users of AI solutions (e.g. radiologists, oncologists) should be an inseparable part of any AI deployment package. Such training material should seek to inform clinicians about the strengths and limitations of the given AI solution and raise their awareness about possible misconceptions or misuses in the specific clinical setting, e.g. due to over-generalisation of the diagnostic task that the AI tool is meant to support, or to over-trust in the AI outcome. Once the AI solution has been introduced into the everyday practice of the targeted end users, the training material should be continuously updated to take into account new findings about potential AI effects on clinical practice that may come up over time.

While a certain degree of diversity in the design and implementation of AI solutions in medical imaging is both expected and desirable to promote innovation and differentiation, the Universality principle recommends the definition and application of standards during algorithm development, evaluation and deployment. These standards, including technical, clinical, ethical and regulatory standards, will achieve at least three key objectives: (1) they will enable the development of AI technologies with increased interoperability across clinical centres, radiology units and geographical locations; (2) they will promote a culture of quality, safety and trust in imaging AI based on well-proven, widely accepted frameworks; and (3) they will facilitate co-creation and cooperation in imaging AI between AI developers, manufacturers, radiologists, physicians, data managers and healthcare bodies, based on a unified language and common approaches. In contrast, the development of imaging AI algorithms without relying on community standards will inevitably result in technologies that are restricted, incomparable and non-transparent, and will therefore likely lack public and clinical acceptance. A number of initiatives have already been established to define standards for artificial intelligence, though they are focused on AI in general.
In 2018, the International Organisation for Standardisation (ISO) and the International Electrotechnical Commission (IEC) started a project (still ongoing) on AI standardisation (Subcommittee ISO/IEC JTC 1/SC 42 Artificial Intelligence [10]). At the national level, in the US for example, the National Institute of Standards and Technology (NIST) issued a plan for the long-term definition and maintenance of technical standards and related tools for AI [116]. In the medical imaging community, there is a need for concerted efforts by leading research centres, medical associations and radiological societies, open-source communities, standardisation bodies and private companies in the field to define clinical and technical standards for imaging AI. At the same time, while standardisation holds evident benefits for interoperability, adoption and trust, there is a risk of hindering innovation through too many or inflexible standards. Hence, in FUTURE-AI, we recommend a minimal set of key standards for imaging AI focused on the following aspects:

• Clinical definition of the image analysis tasks: Imaging AI algorithms, such as those for estimating a patient's diagnosis, treatment or prognosis, can be built based on a wide range of clinical definitions and categorisation schemes. COVID-19 diagnosis based on chest imaging scans, for example, has been proposed by various research and clinical centres using multiple classification approaches, including (1) two categories (presence or absence of the disease); (2) four categories [158] (typical appearance, indeterminate, atypical, negative for pneumonia); (3) six categories [135] (very low probability, low, indeterminate, high, very high, PCR positive); and (4) various scoring systems of lesion severity in the lungs [91] (e.g. a 24-point scale based on a 0 to 4 severity rating for each of six lung zones [176], or a 35-point scale based on a 0 to 7 severity rating for each of five lung lobes [67]). The possible use of multiple definitions for the same clinical task can greatly limit the clinical interoperability, benchmarking and acceptance of the resulting imaging AI algorithms. To ensure widespread acceptance, future imaging AI algorithms should be designed based on consensus definitions of the clinical tasks, as established and maintained by recognised, not-for-profit entities such as medical societies. These definitions should detail the criteria for making the clinical assessment, as well as the descriptions of the imaging measurements, image labelling instructions and classification categories.

• Software standardisation: With the rapid advancement of the fields of AI and medical imaging, an increasing number of new software tools, solutions and libraries are becoming available. However, these tools and libraries differ in their scope, approach, programming language, documentation, integrability with other solutions, and adherence to standards such as the DICOM standard [32]. The libraries and frameworks upon which an imaging AI solution is developed need to be inter-compatible while also fulfilling the functional requirements of the clinical task at hand. Moreover, to allow for deployment in different clinical settings, AI solutions need to be able to run on different hardware with different operating systems, within different software systems and IT landscapes, and under different system performance, security, privacy and data processing constraints.
A principled approach towards the choice of well-established libraries and proven frameworks can help prevent potential incompatibility issues at later project stages, while also facilitating the analysis, extension, maintenance, upgrade, monitoring, migration, integration, and audit of the imaging AI solution. Even if reference implementations and designs are followed, it is recommended to document the rationale behind each design decision made during AI model development, such as framework choices, modules, and whether the AI solution will be a standalone solution or rather serve as an integral part of a larger image processing platform.

• Image annotation standardisation: Despite efforts and guidelines towards structured, standardised diagnostic reporting and data labelling, reports in radiology are conventionally written in free-text formats [120]. This increases uncertainty and the subsequent relabelling and annotation efforts when preparing such data for automated processing and as input for AI solutions [178]. Standardisation of reports, labelling and annotation ensures the completeness and interoperability of medical imaging datasets. For instance, annotations should follow one common format, such as contours, as opposed to bounding boxes or circles around anatomies of interest. Once such a common format is found to satisfy the clinical use-case requirements, and once this format is agreed upon with the annotating clinicians and end-users, guidelines for annotation should be established that introduce a consistent annotation collection process, which further aids standardisation across annotations, file formats, and annotation storage options. In this regard, and as many different tools and platforms exist for the purpose of annotation [32], it is recommended that the annotators use the same annotation software, which should be mature, well established and documented, and chosen based on its compliance with the requirements of the clinical use case at hand.

• Standards for the quantification of imaging biomarkers: Existing research has shown that the values of common imaging features, such as radiomics features, can vary when they are calculated with different software packages, which implement the same equation using varying configurations and image processing codes [107, 189]. This has motivated the establishment of the Image Biomarker Standardisation Initiative (IBSI), an international consortium of 25 teams which defined conventions that provide unique schemes for calculating radiomics-based imaging biomarkers [189].

• Criteria and metrics for imaging AI evaluation: To enable objective, widely accepted, community benchmarking of future imaging AI algorithms, standard criteria and metrics should be used for their evaluation, based on the consensus literature. For instance, for evaluating the accuracy of image segmentation tools, the Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HD) have been universally adopted in the image computing community. To assess the robustness of imaging features, the coefficient of variation (CV) and the intraclass correlation coefficient (ICC) have been widely used.

• Reference imaging datasets for AI benchmarking: To promote the objective and comparative assessment of algorithm performance, reference imaging datasets have been proposed. They consist of curated sets of images acquired from representative real-world cases in which the resulting imaging AI algorithms will typically be used.
There are already several reference datasets in many subspecialties of medical imaging, including brain MRI (e.g. ADNI, the Alzheimer's Disease Neuroimaging Initiative [177, 181, 94]), and many more (see, for example, the curated cancer imaging collections hosted at The Cancer Imaging Archive [28]).

• Reporting of imaging AI studies: To enable the wide acceptance of imaging AI, it is important that the key details of the algorithms are clearly reported. This will enable developers, researchers and other stakeholders to critically appraise the relevant information on the design, development and validation, as well as to replicate the AI algorithms and results. Even before the advent of AI, guidelines were proposed for the standardised and comprehensive reporting of predictive models in medicine. The most widely used of these guidelines is the TRIPOD statement (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) [30], which lists key items to report when describing and evaluating clinical prediction models, including: (i) title, abstract, background, and objectives; (ii) methods: source of data, participants, predictors, sample size, missing data, type of prediction model and other model-building procedures; (iii) results: participants (number and characteristics), performance measures, confidence intervals, model updating; (iv) discussion: limitations (e.g. non-representative sample, missing data), interpretation (incl. comparison to similar studies), implications (e.g. potential clinical use); and (v) other information: supplementary information, funding. Recently, the TRIPOD steering committee announced that they are working on extended reporting guidelines named TRIPOD-AI [29], focused on AI-driven predictive models in healthcare, which will be published in the foreseeable future. Furthermore, in medical imaging, TRIPOD has been adapted into the Radiomics Quality Score (RQS) for reporting radiomics-based predictive models [89].

In summary, we recommend the following checklist for maximising the universality of imaging AI algorithms:

11. Definition of clinical task: The AI manufacturers should ensure that the clinical imaging tasks they aim to address using AI are based on universal clinical definitions, as defined and promoted by recognised not-for-profit medical societies in the area of interest.

12. Software standardisation: A standardised approach to AI software design enables developers, maintainers, and auditors to understand, analyse, maintain, migrate, integrate, and extend the imaging AI solution. AI software solution design conventions, code standards, and proven libraries and frameworks should be used to readily allow for extension and integration with other clinical software systems.

13. Image annotation standardisation: The collection and storage of clinical annotations should follow a standardised approach, where annotations are comparable, reproducible and in one common format (e.g., delineated organ contours). A common, standardised and reproducible format for annotating and labelling imaging datasets enables training AI models transparently and with a clear focus on the respective clinical tasks.

14. Variation of quantified biomarkers: Universal definitions and calculation methods for estimating imaging biomarkers should be employed when building feature-based AI models in medical imaging, such as by using IBSI-compliant software packages or ComBat harmonisation of the radiomics features.
15. Evaluation metric selection and reporting: When evaluating imaging AI algorithms, universal criteria and metrics should be used to enable comparative, community-driven assessment of the model's performance and properties.

16. Reference dataset evaluation: Furthermore, when possible, evaluation results should be generated and reported based on reference, open-access imaging datasets that are representative of real-world clinical cases.

17. Reporting standards compliance: Finally, standardised guidelines such as TRIPOD-AI and the RQS should be employed for reporting imaging AI studies.

The High-Level Expert Group on AI (AI HLEG), a group of experts appointed by the European Commission to provide advice on its artificial intelligence strategy, recently released a document with guidelines for attaining 'trustworthy AI' [36], mentioning seven key requirements, transparency being one of them. Transparency is among the top principles promoted by other international initiatives on ethics in AI [74, 168], as mapped by a recent review of existing ethics guidelines for AI across many fields, including medicine [74]. Transparency is a complex construct that evades simple definitions. It can refer to explainability, interpretability, openness, accessibility, and visibility, among others [41]. However, in the AI HLEG's document, transparency was explained in terms of three components: traceability, explainability, and open communication about the limitations of the AI system. In the present section, we focus on the aspect of traceability, a key requirement for trustworthy AI related to 'the need to maintain a complete account of the provenance of data, processes, and artefacts involved in the production of an AI model' [113].

In essence, traceability refers to the mandate to document the whole development process and to track the functioning of an AI model or an AI-based system used to support medical imaging analysis and interpretation. As the variability of AI in the medical imaging space is high, the documentation should be complete and detailed, in compliance with best practices and the standards for software development regulated by certification organisations, as in the case of software as a medical device [1, 40]. In other words, the datasets, the processes, the reference clinical gold standards, and the contributors that yield the AI system should be documented to the best possible standard to allow for traceability and an increase in transparency [19]. This entails providing details about data gathering, with information about the clinical sites, the devices used, the acquisition protocols and the dataset composition (see the Fairness principle, Section 2), about data labelling, including the annotation contributors, the annotation tooling used and the underlying reference standards (e.g., the version of PI-RADS or BI-RADS used by radiologists), as well as about the development framework and the algorithms used. The endeavour of documentation also includes the decisions made by the AI system [56], to enable identifying the reasons why an AI decision was erroneous, which, in turn, could help prevent future mistakes. Documenting the development process of an AI model and making the model transparent and traceable by design [42] is key to avoiding any 'grey' area about what happens if something goes wrong when the model is used in clinical practice.
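As an illustration of what 'traceable by design' could look like in practice, the following is a minimal, hypothetical sketch of a machine-readable provenance record for an imaging AI model, serialised to JSON; all field names and values are illustrative assumptions rather than a prescribed schema, and a real system would preferably follow an established standard such as PROV-O or Datasheets for Datasets, as discussed below.

```python
# Minimal sketch of a machine-readable provenance record ("traceability by design").
# Field names and values are illustrative only, not a standardised schema.
import json
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class DatasetProvenance:
    name: str
    clinical_sites: list
    devices: list
    acquisition_protocol: str
    reference_standard: str                    # e.g. "PI-RADS v2.1"
    annotation_tool: str
    known_limitations: list = field(default_factory=list)

@dataclass
class ModelProvenance:
    model_name: str
    version: str
    intended_use: str
    training_data: DatasetProvenance
    framework: str
    release_date: str

record = ModelProvenance(
    model_name="prostate-lesion-classifier",   # hypothetical example
    version="0.1.0",
    intended_use="Decision support for prostate MRI reading",
    training_data=DatasetProvenance(
        name="example-prostate-mri",
        clinical_sites=["Hospital A", "Hospital B"],
        devices=["Vendor X 3T scanner"],
        acquisition_protocol="axial T2w + DWI",
        reference_standard="PI-RADS v2.1",
        annotation_tool="open-source contouring tool",
        known_limitations=["single-country cohort"],
    ),
    framework="PyTorch 2.x",
    release_date=date.today().isoformat(),
)

# Serialise alongside the model artefact so the documentation travels with the model.
print(json.dumps(asdict(record), indent=2))
```

Keeping such a record next to each released model version is one simple way to make the documentation described above auditable and machine-checkable.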
In this respect, traceability in AI shares part of its scope with general-purpose recommendations for provenance, and it is also supported to different extents by specific tools used by practitioners as part of their efforts to make data analytic processes reproducible or repeatable [113]. Provenance, as defined by the PROV W3C recommendations [175], is 'information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness'. In the field of medical imaging oriented AI-based systems, provenance data can be used to manage, track, and share machine learning models, but also for many other purposes, such as detecting poisoned data and mitigating attacks [14]. Transparency in AI development and deployment requires clear communication about a variety of tasks, such as data management, model development, deployment, and updating/refinement, as well as about the functional details of the system. In particular, in recent years, importance has been placed on data provenance and on the tracking of the entire machine learning lifecycle. Two concepts are key in relation to this: data transparency and model transparency.

• Data transparency: As the outcomes of AI/ML systems depend directly on the training data and process, transparency in data collection, utilisation and storage is an area of significant concern in trustworthy AI. Data provenance (or data lineage) methods are required to improve replication, tracing and quality assessment in data use and data transformation processes [60]. In recent years, a series of standards have appeared for recording data provenance, such as the Open Provenance Model [114], Provenir [146], and the W3C standard PROV-O [174], and several have been devised specifically for the tracking of data and data transformations during the machine learning lifecycle, such as PROV-ML, ProvLake [162], and Hippo [188]. While these solutions assist with internal data provenance, several researchers have also advocated for private, secure, and standardised methods for data sharing. Datasheets for datasets is a standardisation method proposed by Gebru et al. [48] for documenting, among other things, every dataset's motivation, composition, collection process, and recommended uses. Such dataset documentation aims to improve the communication between dataset creators and users, while also encouraging the prioritisation of transparency and accountability in the ML community.

• Model transparency: Due to the rising complexity of modelling, model transparency and provenance methods have also gained interest. Research has focused both on end-to-end tracking of provenance information in the machine learning lifecycle, and on evaluating models for performance and trust. In this context, several modelling provenance solutions have been proposed. Schelter et al. [152] propose a system for the extraction and storage of meta-data and provenance information commonly observed in the ML lifecycle. Hummer et al. [68] propose ModelOps, a cloud-based framework for end-to-end AI pipeline management. One of the key components of ModelOps is a domain abstraction language with first-class support for the common artefacts in AI solutions. This includes datasets, model definitions, trained models, applications, and monitoring events, as well as the algorithms and platforms used to process data, train models, or deploy applications.
Further, several tools for the complete asset tracking of AI pipelines have also been developed, focusing on tracking model inputs, results, and production processes [51, 186]. With regard to AI documentation, a recent trend is the use of FactSheets [11] for communicating 'purpose, performance, safety, security, and provenance information' from the creator to the user of an AI service.

In light of these considerations, transparency and traceability are instrumental in addressing other key concerns about AI models and systems, namely reproducibility, auditability and accountability. Accountability is one of the key principles for trustworthy AI [36], as stated by the AI High-Level Expert Group, and has been translated into the Assessment List on Trustworthy Artificial Intelligence (ALTAI) and included in the Proposal for a Regulation Laying down Harmonised Rules on Artificial Intelligence by the European Commission [2]. Currently, the question of accountability when an AI-based system is deployed in real clinical settings and either fails or produces outcomes with unexpected side effects is still open and pressing [49]. The problem affects any algorithmic application that supports decision-making, and it is well known and debated in the ethical, social and legal communities [110]. In the medical and radiology domains, this question is considered in a collaborative work of the American and European radiology and medical physics societies, as well as in the guidelines published by the Royal Australian and New Zealand College of Radiologists [169]. Under current laws, physicians who are compliant with the standard of care are not held liable for an unwanted outcome, and this still holds when the decision is based on the results of an AI model [164]. Some works have debated the implications for liability when AI is in place in radiology and healthcare in general [117, 133]; however, dedicated regulation is still an open issue. Documenting and tracking the development process and the functioning of an AI model is key to reconstructing all the pieces of information that the physician or the radiologist used to make their decision when using AI models. Overall, the recommendations and code of conduct proposed in this paper are a step forward in regulating and supporting the definition of an AI-aided standard of care.

Considering the data-inductive and dynamic nature of AI models and systems, traceability does not end with documenting the development process and the testing activities; it also entails maintaining the AI model or system, by tracking its behaviour over time and detecting any drift from its training settings or previous states. Indeed, AI models are adaptive, non-deterministic systems whose testing does not and cannot involve all the variables and changing contexts of real-world settings. Thus, testing a system before its deployment, however extensive and however carefully designed to guard against potential ethical concerns, cannot cover the whole host of scenarios and cases that could be encountered in practice. Moreover, clinical practices and technologies continuously evolve as new imaging technologies, evidence and clinical findings come forth, yielding new guidelines, new protocols, or new diagnostic devices and procedures. Finally, training is often limited to a set of cases that can exhibit several kinds of biases or limitations, as debated in Section 2 on Fairness.
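As a concrete illustration of the drift detection mentioned above and elaborated in the following paragraphs, the sketch below compares the distribution of a single summary feature between a development (reference) window and a recent production window using a two-sample Kolmogorov-Smirnov test from SciPy; the feature, threshold and data are hypothetical, and a real surveillance system would monitor many features, sites and time windows.

```python
# Minimal drift-monitoring sketch: compare the distribution of a summary feature
# between a development (reference) window and a recent production window.
import numpy as np
from scipy.stats import ks_2samp

def feature_drift(reference, production, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test; returns (drift_detected, p_value)."""
    statistic, p_value = ks_2samp(reference, production)
    return p_value < alpha, p_value

# Toy example: a simulated acquisition change shifts the production distribution.
rng = np.random.default_rng(42)
reference_values = rng.normal(loc=0.0, scale=1.0, size=2000)   # development data
production_values = rng.normal(loc=0.4, scale=1.1, size=500)   # post-deployment data

drifted, p = feature_drift(reference_values, production_values)
print(f"Drift detected: {drifted} (p={p:.4g})")
# In practice, such checks would run periodically per feature and per site, with the
# results logged in the traceability tool described in the recommendations below.
```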
These phenomena call for the ongoing surveillance of AI models, and for a maintenance system that tracks over time the performance, vitality and conduct of the AI models after their deployment in clinical settings. Such a maintenance system might implement continuous monitoring of the AI models to guarantee their sustained quality, but also to close the data feedback loop by taking advantage of the new data, new knowledge and feedback coming from the clinical production settings. It is well known that a model's performance degrades over time when evaluated in the real world, as several phenomena drive this decay (mainly model/concept drift and data drift). Concept drift [149] refers to a phenomenon in the practical application of AI in which some underlying statistics or characteristics of one or more variables change after the deployment of a model, and as a result the AI model's predictive accuracy changes. A recent study [71] presented a few potential types of concept drift, e.g. novel class arrival and class evolution. Many static AI models for medical imaging have been developed with curated, hand-picked datasets. As a result, static AI models are not designed to remain connected to real-time changes in their production environments and are prone to concept drift over time. This is demonstrated in [128], where three versions of the same model were trained on gradually ageing radiology data. Despite the better quality achieved with large training sets (12 months of data), all three models became significantly inaccurate when their training data aged beyond 8 months. Concept drift is a problem that can be managed by periodically or continuously testing and updating the models, or in some cases by deploying models that take the possibility of concept drift into account [22]. Data drift, on the other hand, which underlies model drift and is also often referred to as dataset shift [138], is defined as a change in the distribution of the data. Production data can diverge or drift from the baseline data over time due to changes in the real world. In the domain of medical imaging this can, for example, be a result of new imaging modalities introduced at a local site, a new version of their software, or adjustments to data acquisition procedures. Therefore, once implemented, ongoing surveillance is needed to monitor and recalibrate AI algorithms [109]. This surveillance is also needed for dynamic algorithms that continuously update themselves based on practice data and published clinical evidence. Thus, evaluation is likely to shift from a one-off activity to a continuous process, to ensure that the use of AI, including AI incorporating dynamic algorithms, meets expectations and adheres to clinical standards of care. Model maintenance can hence be seen as a way to nurture the model, as it can take advantage of the new knowledge coming from real-world settings, thus improving on the originally released version. Not by chance, when it comes to traceability, the ALTAI tool explicitly includes some recommendations pertaining to monitoring the quality and logging the outputs of an AI system (see the ALTAI document for the complete list of questions).

In accordance with these considerations, we envisage that, to ensure traceability, a governance framework/tool for the whole AI model lifecycle should be put in place to ensure the following key aspects:

• the maintenance of up-to-date documentation on policies, motivations, responsibilities and logging information.
• the validation procedures and sandbox test analyses for safety purposes.

• the continuous online or periodic monitoring of model conduct and performance, to orchestrate any remediation needed to keep AI models well-performing, unbiased and ethical for as long as the models are in use in clinical settings.

One sign of the importance of this issue is the rise of MLOps (DevOps for ML), signalling an industry shift from technology R&D (how to build models) to operations (how to run models). To ensure the transparency and traceability by design of AI models, we propose keeping structured documentation of each step of the production process, as suggested in the following recommendations.

18. Model scope: When starting the development of a model, a precise definition of the model's scope should be agreed upon with the radiologists and/or clinicians and described in terms of the model's intended use, use-case scenarios, intended output, supported model inputs, the underlying biological phenomenon and any known limitations of the diagnostic/prognostic problem faced. In this regard, it is also recommended to discuss and document related use cases and scenarios that are outside the model's scope, which serves to transparently highlight the limitations and realistic expectations of the model.

19. Data provenance: In addition to the recommendations coming from the Fairness principle, complete documentation of the imaging and related clinical/genomic/pathology data required for the development of the AI model should be maintained in accordance with an appropriate data provenance standard (e.g., Datasheets for Datasets or PROV-O). This should be done by including information about data provenance and ownership, acquisition protocols, devices, and timing. For the imaging data, the DICOM tags, duly anonymised, should be retained to keep details about the acquisition parameters.

20. Data localisation: We also suggest keeping track of the data location over the network, which could be relevant to federated learning approaches, and analysing dataset statistics with respect to their capability to represent the phenomenon at hand (e.g., distribution analyses). This analysis is relevant for detecting concept and model drift. Quantifying missing values and any gaps or known biases is also advisable, as well as documenting breaks in the data supply and noting when the input data are erroneous, incorrect, inaccurate or mismatched in format.

21. Documenting data preparation: A multitude of diverse data preparation tools and approaches exist, which illustrates the importance of documenting the pre-processing pipelines used when preparing and curating the data. It should be reported whether and which data quality and standardisation procedures are put in place, and their details should then be specified, by describing whether the algorithms are ad hoc or borrowed from existing libraries and tools, and by specifying the requirements for their application as well as any parameters set for them.

22. Specification of clinical references: The radiological or clinical standards or biomarkers used as reference should be carefully detailed (e.g., PI-RADS, BI-RADS, Gleason score). If data labelling or segmentation is used, the authorship and the authors' expertise and experience, alongside the tools and approaches used, should be detailed, along with the results of any stability or consensus analysis and known limitations.
23. Training recording: The model training process should be carefully documented, including standardised descriptions of the imaging and non-imaging features, and detailing the training approach, the assumptions made, the methods used for parameter and hyper-parameter optimisation, the weight initialisation and the framework used (including the framework version). A ModelOps framework can be of service to keep a record of all these pieces of information.

24. Validation documentation: The validation process should be duly described in terms of the evaluation metrics, cross-validation approach, decision thresholds (as agreed with clinicians), confidence intervals, degree of uncertainty of the output, benchmarking information, and auditors. Additional information about the results provided by the AI model should come from the Explainability principle, which is further discussed in Section 7.

25. The final model released should be described with a standardised description of the model's architecture, interfaces, I/O data structures, and its limitations and known points of failure.

26. Traceability tool: Each AI model should be developed together with a traceability tool that enables the monitoring of the live functioning of the AI tool, for instance to flag and record errors, deviations, and degradation in performance. The main statistics of the model should be recorded in a model registry and include the model's functions and predictions while running in clinical production settings, as well as the model's evolution over time. Feedback from clinicians/radiologists/decision makers should be recorded whenever possible. The traceability tool might be included in the ModelOps framework. In the desirable scenario where the model is capable of and configured for online learning from production data or from the feedback provided by health professionals, this learning and the tools and processes used therein should also be recorded in the traceability tool.

27. Model passport: All the above pieces of information should be included in a standardised format in a passport of the AI model, which should accompany the model during its whole lifecycle with a rich set of metadata that guides the model's adoption in clinical practice, supports its usability and makes the model auditable. The passport should also include general information, such as the team responsible for the whole development or part of it; the date and information of each of the model's releases; the model's current version; the model type, reference and licence; and the contact details. This information will make the passport a viable solution for sustaining the accountability of the model. The passport should be kept up to date by (automatically) recording any new version and information on the live functioning of the model, while also including a clear plan for periodic checks and updates of the model.

28. Accountability and risk specification: The model passport should be kept updated and should contain information about the code of conduct followed, as well as an evaluation of the risks that may be raised by the usage of the AI model or system. The risk evaluation may be carried out in accordance with the Proposal for a Regulation Laying down Harmonised Rules on Artificial Intelligence by the European Commission.

According to the ISO 9241-11 definition [70], 'usability is the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use'.
Usability is a key characteristic of every product, offering tangible advantages such as faster acceptance, cost savings and user satisfaction. Although the basic principles remain the same, each application domain needs to be analysed in detail in order to promote enhanced usability and to design dedicated usability testing [61]. A recent study [43] stressed the unmet need to consider human factors and employ user-centred design in order to achieve maximal usability, accelerate AI adoption and achieve the desired paradigm shift based on radiology AI solutions. In more detail, the authors suggest user-centred design principles for each phase of the AI model development life cycle, e.g. observation of the clinical environment and user needs assessment in the design phase, iterative user testing in the development phase, clinical workflow integration in the implementation phase, and performance monitoring in the long-term use phase. Critically, they stress the need to avoid previous mistakes in health technology applications related to computer-aided diagnosis (CAD) or electronic health records (EHR), where user factors were neglected, leading to poor adoption and errors. The FDA has issued a guidance document for medical device manufacturers on human factors and usability engineering [39]. Despite the numerous AI tools, human factors are still not being adequately addressed, and the few relevant scientific publications mainly highlight the need for future usability testing in AI.

However, previous relevant experience, mainly with Picture Archiving and Communication System (PACS) products, has highlighted the need for assessing usability and demonstrated that user satisfaction can vary significantly when different products are evaluated. In one study [76], two sequential versions of commercial PACS software were evaluated 6 months apart by five radiologists with varying PACS experience. They reported a 22% improvement in performance time and a 30% decrease in the number of errors in the second version compared to the first. In a more recent study [3], user satisfaction was assessed by three resident radiologists, and the results revealed that the PACS had not fully met all the demands of physicians and had not achieved its predetermined objectives, such as access from different locations. In a more extensive study [38], 200 individuals using PACS in several hospitals performed a usability evaluation based on the standard Computer System Usability Questionnaire (CSUQ), consisting of 5 sections and 19 items. The results highlighted significant differences in terms of information quality, user interface quality, overall user satisfaction and usability of PACS. At the same time, it was demonstrated that there is a need to speed up image processing tasks and avoid system failures, and it was suggested that the information quality and user interface of such systems be improved through appropriate analysis and needs assessment of the end users. Finally, a usability study on integrating CAD into PACS reported numerous weaknesses that users considered important in the context of the integrated workflow, such as efficient handling and fast computation [50].

AI and Radiomics have shown great potential in many areas of healthcare, including clinical oncology; however, the clinical use of this technology is still in its infancy [159, 82, 37].
For the last five years, we have witnessed a tsunami of radiomics publications [93, 161], demonstrating potential novel clinical applications of radiomic signatures or nomograms, predominantly in the field of oncologic imaging. The problems that radiomics attempts to address are either tasks already performed by humans with currently unsatisfactory performance, or tasks that are currently impossible to achieve through human visual inspection and interpretation of medical images by radiologists. More specifically, radiomics primarily focuses on the prediction of treatment response, before, during or early after the completion of therapy, on accurate patient stratification related to disease prognosis taking survival-related outcomes (OS, PFS) as end-points, and on the prediction of the risk of local or distant recurrence [160]. In addition, another popular topic is the association of radiomics features, often called radiomics signatures, with surrogate biomarkers, including molecular [83], genomic [66], or pathomics [6] markers, since radiomics is non-destructive and non-invasive and can therefore be easily obtained throughout the entire disease continuum. The biggest problems with the latter efforts were related to the study design, which in the vast majority of cases was based on a small retrospective cohort [118] coming from a single institution, with hundreds or even thousands [44] of radiomics features extracted from each patient to construct the radiomics signatures, possibly using the same dataset after splitting to address multiple target variables, and ignoring essential concepts like multiple-comparisons corrections and type I errors [26]. Given that medical imaging technology is fast evolving, such study designs result in clinically non-usable models. They therefore might contribute to further confusion and lack of trust in AI models. To make things worse, clinical user requirements are often neglected in the design process of such new technologies [141], which means that the user experience is still not considered a critical development variable, not least because there is still no generalised consensus on how to design effective end-user interaction with machine learning systems [8]. This problem introduces a serious risk of losing early adopters and of increasing the cost and complexity of the product when trying to resolve usability a posteriori. There are very few available sources elaborating on usability issues with respect to AI. For this reason, we start by analysing the basic pillars of usability and enriching them with AI-specific considerations. Based on a domain-agnostic usability textbook [179], usability is traditionally associated with five important attributes: • Learnability: How quickly a new user learns how to use an AI system is critical for the fast adoption of AI, since users need to rapidly learn the models and incorporate them in their clinical workflow. An important element of this usability dimension is the ability of the user interface to allow exploratory learning, by including e.g. undo functions or comprehensive wizards for novice users. • Efficiency: Clinicians usually face heavy workloads, and one important aspect of AI in medical imaging is to empower the user (e.g., the radiologist) by alleviating his/her heavy workload (e.g. reading of mammograms) while providing efficiency coupled with a high level of productivity.
This is an important factor when validating such systems, in order to ensure that the inclusion of the AI system preserves or improves diagnostic time while increasing performance and efficiency. • Memorability: This is a very particular property of AI systems, characterising how easy it is to remember how to use a particular tool. Many users in the clinical environment deal with different tools and technologies; being able to easily remember how the system works after a period of not having used it, without having to learn everything all over again, is very important for promoting acceptability. This attribute is also an important design element for taking into account the needs of casual users of AI systems. • Limited and non-catastrophic errors: An AI medical imaging system should produce very limited (if any) errors and should in any case warn the user about the possibility of a critical error (e.g. in prediction) based on uncertainty estimates. As mentioned before, training and testing of current AI models on relatively small datasets might give rise to errors during the actual deployment of the systems in the clinical setting, and this remains a poorly addressed issue calling for more extensive research and the definition of widely accepted guidelines. • Satisfaction: It is very often the case that AI developers believe that satisfactory performance is equal to user satisfaction, assuming that all clinical users want is high accuracy and robustness. In practice, the system should be pleasant to use, and subjective satisfaction needs to be monitored in order to help users incorporate it in their daily routine and ensure wide adoption. The latter also depends on the variability of clinical user attributes (e.g. computer enthusiasts vs more conservative users), and it is therefore very important to engage diverse clinical users in the early design phases as well as in measuring subjective satisfaction after deployment. It is also argued that the limited explainability of current AI models might negatively influence user satisfaction. As mentioned at the beginning, usability is a milestone for achieving the goals of an AI system with efficiency and user satisfaction at the same time. It is therefore becoming evident that, for an AI model to be usable, superior effectiveness and efficiency compared to the current modus operandi in the clinical setting need to be demonstrated and, most importantly, end-users must be satisfied. The latter might refer to different stakeholders, including the individual patient, the caregiver, and the hospital administrator interested in the sustainability of the healthcare services provided by their institution, or even policymakers crafting data-informed decisions and clinical guidelines. Poor model usability might be responsible for the limited translation of research in the field of AI and Radiomics to the clinic. In a recent article discussing the plethora of AI tools in COVID-19 medical imaging, it is argued that the poor clinical adoption of such systems may be partially attributed to the lack of awareness and understanding of user needs, and an improved workflow in the AI development process, including iterative usability studies, is proposed [17]. In order to properly design AI projects with a clear clinical intent, active engagement of experts is necessary.
AI projects must be supported by a multi-disciplinary team where each and every member provides his or her own expertise; however, the leader should be an individual with a deep understanding of both the clinical and the technical domain. One of the reasons that can explain the limited translation rate to the clinic is that the involvement of stakeholders with domain knowledge has often been minimal or very superficial, failing to safeguard whether such efforts are clinically relevant and usable. The latter contributes to the lack of trust and, therefore, the lack of translation to the clinical environment, since usability and trust are interconnected. Without engaging domain experts in the model's design, there is always the risk of making assumptions that are clinically irrelevant and, consequently, of developing a clinically irrelevant model. Usability lies at the tip of an iceberg, and many qualities that a model should have might influence its usability. The availability of high-quality curated data is the cornerstone of any ML model, but its presence alone is not enough. ML modelling strategies should respect best practices and avoid data leakage, which might be responsible for overfitted models with limited generalisability. So far, the focus when developing an ML model has been to maximise its performance, often implementing models with limited interpretability, the so-called 'black-box' models. However, such a strategy may harm the trust in and usability of the model, since its decisions cannot be explained, and even if the end-user is willing to accept the predictions, regulatory and legal reasons may prohibit the clinical application of such models, especially in Europe, where the GDPR grants patients the right to obtain explanations and justifications of the grounds on which a model made a specific decision affecting their life. When designing the validation of an AI model, it is often the case that tailored metrics reflecting the relative cost of a false positive or false negative prediction should be utilised instead of the 'off-the-shelf' metrics that are used in most cases. For example, the metric used to evaluate the performance of an ML model that is expected to detect cancers in a screening setting is very different from the metric that should be used when designing a model to support treatment decisions. Even the optimisation of hyperparameters often must be performed on such custom metrics rather than on typical accuracy, AUC or F1 scores (see the sketch below). In Radiomics, most of the efforts concern development and validation, overlooking how and where the signature will be deployed and used. Consequently, essential aspects of integration with current, mostly rigid, clinical workflows should be considered from the first phases of model development, to avoid unpleasant surprises regarding prospective data availability. There are many challenges related to low interoperability, fragmentation of clinical systems, restrictions on the easy flow of data between different systems, and the presence of unstructured information that needs to be transformed and curated to become eligible for further utilisation. As for where the ML models will be deployed and consumed, they can be hosted either in the cloud or on local servers as endpoints within the hospital's firewall. According to Park et al [125], usability evaluation should assess the extent to which the end-users are able to discover, understand, and use system features.
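To make the earlier point about tailored metrics concrete, the following sketch shows how a task-specific, cost-weighted criterion could replace an off-the-shelf score when choosing a decision threshold. The cost values and data are illustrative assumptions only; in practice the relative costs of false negatives and false positives would be agreed with the clinical team.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score

def screening_cost(y_true, y_pred, fn_cost=20.0, fp_cost=1.0):
    """Illustrative cost: in a screening setting a missed cancer (FN) is assumed
    to be far more costly than a recall caused by a false alarm (FP)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn_cost * fn + fp_cost * fp

# Toy labels and predicted probabilities (placeholders, not real data).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, size=200), 0, 1)

# Pick the decision threshold that minimises the task-specific cost,
# rather than the one that maximises accuracy or F1.
thresholds = np.linspace(0.05, 0.95, 19)
costs = [screening_cost(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]
print(f"cost-optimal threshold: {best_t:.2f}")

# A recall-weighted F-beta score is another common stand-in for such asymmetric costs.
print("F2 at that threshold:",
      fbeta_score(y_true, (y_prob >= best_t).astype(int), beta=2))
```

The same cost function can be passed to a hyperparameter search so that model selection, not only thresholding, is driven by the clinically agreed trade-off.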
An important field for quantifying and more objectively assessing the latter requirements is usability testing, which comprises simulation studies or scenario-based testing that run realistic clinical scenarios. During these tests, data pipeline integrity and data flow are checked, and potential disagreement between the human decision and the AI model output is investigated through root-cause analysis to identify possible reasons, including software/hardware malfunction, poor model fit, bias or, finally, human error (a simple agreement check is sketched below). 29. Engage all the stakeholders in the development phase: Active engagement of a multidisciplinary team, including AI developers, medical imaging scientists, radiologists, radiation oncologists, patients and healthcare administrators, is needed to ensure the clinical usability, effectiveness, and cost-benefit of the proposed AI models, and can be the basis for further convincing clinical users and regulatory bodies of the usability and value of the AI solution in healthcare practice. This engagement should be achieved as early as possible and should in any case cover all the product life cycle phases, starting from the design of the AI product. Hands-on sessions during the model design and evaluation phases with a multidisciplinary team are essential to ensure that all user perspectives are taken into consideration. 30. Understanding user needs: It is important to understand in depth the clinical user needs regarding AI in medical imaging solutions; in particular, to ensure that the solution has a favourable learning curve, is efficient in terms of time and performance (e.g. for a diagnostic task) and reduces as much as possible the possibility of serious errors. Understanding the clinical needs in a holistic fashion is a milestone for promoting user-centric design while addressing the actual unmet clinical needs. To this end, we propose at least one co-creation workshop with end-users and developers prior to the AI development phase to understand the needs of the users and define relevant constraints and reliability/evaluation metrics. 31. The absence of adequate human-computer interfaces is a major obstacle for the adoption of ML models into clinical practice [121], and understanding user needs is a sine qua non condition for designing such adequate human-computer interfaces. It is therefore highly recommended to actively engage radiologists in the design of the user interface, in order to ensure that users will be able to use all the provided functionality efficiently and that all their needs (e.g. in terms of execution time) are fulfilled. 32. Explainability for usability: During the design phase it is critical to follow methodologies that produce more explainable results in order to increase trustworthiness and promote clinical adoption. 33. Usability testing: In parallel, as has been recently suggested [17], usability studies should be performed before the final AI solution is released, in order to avoid poor user satisfaction and promote faster and wider adoption. This means that enough time and resources should be planned to re-evaluate and re-design certain aspects of the product's functionality until the user needs are met. Usability testing should be done with multiple clinicians of different characteristics (e.g. level of experience) to identify differences and varying preferences.
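A minimal sketch of how the clinician-AI disagreement mentioned above could be quantified during scenario-based testing is given below, using Cohen's kappa as a chance-corrected agreement measure; the decisions shown are placeholders, and the flagged disagreements would feed the root-cause analysis described earlier.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Placeholder decisions collected during a scenario-based usability test:
# 1 = "refer / positive finding", 0 = "no action / negative".
clinician = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
ai_output = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1])

# Chance-corrected agreement between the human decision and the AI recommendation.
kappa = cohen_kappa_score(clinician, ai_output)
print(f"Cohen's kappa: {kappa:.2f}")
print(confusion_matrix(clinician, ai_output))

# Cases where the two disagree are the candidates for root-cause analysis
# (software/hardware malfunction, poor model fit, bias, or human error).
disagreements = np.flatnonzero(clinician != ai_output)
print("cases to review:", disagreements.tolist())
```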
Even though the assessment of AI model usability should be performed in clinical environments with real-world data, engaging with end-users in the early stages of designing solutions, and keeping them involved as the solution evolves, is the recommended strategy to gain an understanding of the nature of the problem to be addressed and of the issues that emerge during implementation. End-users can also act as great advocates for solutions in their organisations and among their colleagues, which can greatly improve adoption rates [43]. 34. In-silico usability validation: In order to accelerate usability evaluation, it is also recommended to re-use existing retrospective data in a prospective fashion which blinds the researcher to the outcomes, simulating "real" clinical conditions. It is recommended that at least three radiologists from different clinical sites and with different degrees of experience participate in the study. The validation results should include both usability aspects (e.g. user satisfaction) and agreement metrics between the clinician's decision and the AI's recommendation. 35. Usability metrics: It is recommended to define the metrics that will be used throughout the product development. These should include usability questionnaires that measure several usability aspects, including the time required to perform the task, learnability, efficiency, explainability, user satisfaction and intention-to-use. It is very important to incorporate into these metrics widely accepted recommendations, such as the FDA guidance document for medical device manufacturers on human factors and usability engineering [39]. 36. Deployment and integration in the clinical environment: AI developers should have a clear strategy on how to seamlessly integrate the developed models into current clinical workflows, including Electronic Health Records and PACS systems. Usability should also be assessed in terms of these integrative functional aspects (e.g. when using an AI tool integrated with the hospital's PACS system). 37. Provision of training resources for end-users: Training resources such as user guides, training material and user workshops can help to reduce the perceived complexity of the AI solution for end-users. This allows end-users to adopt the AI solution into their clinical practice with less effort. For instance, a user guide can resolve doubts on how to install, calibrate, update, interact with, and interpret the AI solution and its results. 38. Continuous monitoring of user satisfaction: AI model development should be a continuous process influenced by real-world conditions that can only be identified after model deployment, in order to address real-time changes that occur in the input data (images and radiomics features), as well as in the output variables. This will promote error reduction, generalisability and trust. In FUTURE-AI, the Robustness principle refers to the ability of an imaging AI technology to maintain its accuracy when it is applied under highly variable conditions in the real world, outside the controlled environment of the laboratory where the algorithm is built. Compared to other types of biomedical data, medical images are known to be associated with significant variations (both expected and unexpected) across radiological studies, which can impact the performance of the AI algorithms.
There are several sources of this data heterogeneity, which, alongside their causal reasons and relationships [24], need to be closely taken into consideration when developing, evaluating, and deploying new imaging AI algorithms for the real world, including: • Equipment-related heterogeneity: For a given imaging modality (e.g., MRI), there are multiple manufacturers of imaging scanners (e.g., for MRI, Philips, Siemens, General Electric, Toshiba, Canon, Fujifilm). While the main physical principles that govern the manufacturing of these imaging scanners are consistent, there are vendor-specific variations that can make the image characteristics of imaging studies vary between scanners. The recent M&Ms study on deep learning-based cardiac image segmentation in a multi-centre and multi-vendor context showed that models trained with cardiac MRI images from a given vendor generalise poorly to new images from a distinct vendor, losing up to 40% of their initial performance [21]. An additional study based on image data from multiple centres and multiple vendors for the classification of prostate cancer showed that radiomics models that have a decent performance when tested on data from the same centre and/or scanner (AUC of 0.75) may show a significant drop in performance when applied to external data (AUC of 0.54) [23]. • Centre-specific imaging parameters: Despite the existence of reference imaging protocols in the clinical literature, the specific parameters of the image acquisitions, such as image resolution, slice thickness, orientation, contrast type, and post-injection scan delays, generally vary between clinical centres. This can lead to imaging studies with different intensity distributions, which can impact the robustness of the AI algorithms when applied in new clinical centres. An evaluation of the influence of MRI scanning parameters on quantitative imaging features showed significant differences in many texture features when varying MRI acquisition parameters such as magnet strength, flip angle, number of excitations and scanner platform, emphasising the need for a standardised MRI technique [18]. • Operator-related heterogeneity: Imaging studies can vary greatly between acquisitions performed in different shifts, in terms of image quality, scan positioning, level of noise and artefacts, as well as organ coverage and tissue/lesion appearance, depending on the operator's experience, dexterity and workload. This variability is particularly pronounced for certain imaging modalities, such as ultrasound, where the operator is required to carefully and precisely manipulate a probe on the patient's body to identify optimal image planes for subsequent image quantification. In MR images, local intensity shift artefacts can be minimised but not eliminated with optimal patient location, coil design and tuning. Improper coil or patient positioning can produce subtle or, in some cases, severe signal intensity artefacts, which can also occur with a perfectly functioning coil if protocols are not optimised. Improper coil tuning manifests as a shading artefact that can mimic other findings. Operators are recommended to be familiar with the various causes of signal intensity artefacts to maintain optimal image quality as part of an MR imaging quality assurance programme [75].
• Patient-related heterogeneity: The anatomical properties of the patients, such as the size of the organs of interest, the amount of body fat, anatomical variations, or the tissue density (e.g., brain volumetry related to age), can result in highly variable quality between imaging studies. Large inter-subject variability was observed in quantitative cerebral blood flow measurements in normal subjects using various PET and MRI techniques [59]. Another source of image heterogeneity is the level of cooperation of the patients during scanning, such as their propensity to remain still or to move during the image acquisitions. This can be observed when using MRI in paediatric patients, for whom a major challenge is the need for sedation or general anaesthesia [170]. • Context-related heterogeneity: AI algorithms are particularly sensitive to unexpected changes and artefacts in medical image data depending on the context in which the scanning took place. For example, an AI system trained to detect pulmonary lesions on chest X-ray images may be impacted if the X-ray technician forgets to remove adhesive ECG lead connectors on the patient's chest from a recent inpatient ECG, or if the patient inadvertently places their hands on the chest during the X-ray scan [185]. • Variability in image annotation and segmentation: Some imaging AI algorithms require the prior definition of regions of interest in the image, such as bounding boxes around lesions or contours around tissue/organ boundaries. However, it is well known that radiologists and clinicians annotate and delineate the images with significant intra- and inter-observer variability, especially when they have different levels of time, expertise, and experience. Since these annotations are used for AI training, this affects the robustness of subsequent AI-driven predictions, especially in feature-based models. A study in breast MRI found that the variability in lesion segmentation between four different observers resulted in only 20-30% of radiomics features being robust for complex tumours [57]. Automated or semi-automated techniques for medical image segmentation are expected to generate more consistent results, but most existing segmentation software still requires manual inputs and corrections. Because these variations are an integral part of real-world radiology, and given the differences in clinical practices between radiology departments within as well as across centres and countries, it is important to implement preventive and corrective measures to enhance the robustness of the AI algorithms against changing imaging conditions. In the following, we recommend the respective guidelines. 39. Image harmonisation: If differences in imaging and acquisition protocols cannot be prevented between centres, robustness should be enhanced by implementing image harmonisation tools and techniques such as histogram normalisation and discretisation [112], ComBat harmonisation [45, 139], and data augmentation solutions with neural style transfer methods, Generative Adversarial Networks and unsupervised image-to-image translation units [47, 171, 122]. It is recommended to assess and report the variation across features, alongside the reduction in variation after applying harmonisation methods to these features in the dataset.
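As a minimal illustration of one of the harmonisation options named in recommendation 39 (histogram-based normalisation), the sketch below matches the intensity distribution of an image acquired under one protocol to a reference image using scikit-image. The images are synthetic placeholders; feature-level effects or more complex domain shifts would call for methods such as ComBat or GAN-based translation, as cited above.

```python
import numpy as np
from skimage.exposure import match_histograms

rng = np.random.default_rng(42)

# Placeholder 2D "scans": the moving image has a shifted intensity distribution,
# e.g. as if acquired on a different scanner or with different parameters.
reference = rng.normal(loc=100.0, scale=20.0, size=(256, 256))
moving = rng.normal(loc=160.0, scale=35.0, size=(256, 256))

# Map the moving image's intensity histogram onto the reference distribution.
harmonised = match_histograms(moving, reference)

# Report the residual distribution gap, in line with the recommendation to
# assess and report variation before and after harmonisation.
for name, img in [("before", moving), ("after", harmonised)]:
    print(f"{name}: mean diff = {abs(img.mean() - reference.mean()):.1f}, "
          f"std diff = {abs(img.std() - reference.std()):.1f}")
```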
40. Feature harmonisation: The surge of interest in radiomics features, and their lack of reliability, has raised questions about the generalisability of classification models. The design of feature harmonisation pipelines is essential to investigate the repeatability and reproducibility of these features, in order to evaluate their temporal stability with respect to a controlled scenario (test-retest), as well as their dependence on acquisition parameters such as slice thickness or tube current [72]. Feature selection strategies have been proposed to incorporate only robust and stable imaging features into prognostication/prediction models, to improve generalisability across multiple institutions [124]. 41. Intra- and inter-observer variability: Dedicated experiments should be performed to separately assess the effect of intra- and inter-observer manual biases on the imaging AI algorithms. One approach is to gather and analyse multiple annotations per image, annotated by a diverse set of clinicians and by the same clinician at different points in time. In supervised machine learning and annotation-related tasks, a common practice to generate ground-truth label data is to merge observer annotations. A detailed study analysed how the high intra- and inter-observer variability resulting from factors such as image quality, different levels of user expertise and domain knowledge may affect the performance of automated image segmentation solutions and their uncertainty. The results highlighted the large impact of intra- and inter-observer variability and the negative effect of the annotation merging methods applied in deep learning to obtain reliable estimates of segmentation uncertainty [77]. Samples with annotation variation above a certain threshold can be detected by variation proxy measures such as the coefficient of variance (CV), the Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HD) (a minimal sketch of these proxies is given below). Reporting the analysed observer variation increases study reproducibility, while clinician reassessment of these high-variation data samples can help to find the most suitable and robust consensus annotation in each case. 42. Quality control: Quality control capabilities should be implemented to identify abnormal deviations or artefacts in the imaging studies. Inter- and intra-observer variability of manual quality control is high and may lead to the inclusion of poor-quality scans and the exclusion of scans of usable quality. A recent quantitative quality control tool (MRQy), based on unsupervised learning techniques (clustering), was developed to help interrogate MR imaging datasets for site- or scanner-specific variations in image resolution or image contrast, and imaging artefacts such as noise or inhomogeneity, which need correction prior to model development [145]. Other relevant solutions include: Qoala-T, a brain-only tool based on an easy and free-to-use supervised-learning model to reduce observer bias and misclassification in manual quality control procedures using FreeSurfer-processed scans [86]; QC-Automator, based on different CNN architectures and applied to diffusion MR imaging data only, which can handle a variety of artefacts such as motion, multiband interleaving, ghosting, susceptibility, herringbone, and chemical shifts [147]; and PI-QUAL, a prostate-specific tool to assess the diagnostic quality of a scan against a set of objective criteria, as per the Prostate Imaging-Reporting and Data System recommendations, together with criteria obtained from the image [53].
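Relating to recommendation 41, the sketch below computes the two variation proxies named there, the Dice similarity coefficient and the Hausdorff distance, between two observers' binary masks; the masks and the flagging threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

def symmetric_hausdorff(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the foreground point sets (in pixels)."""
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

# Two placeholder delineations of the same lesion by different observers.
obs1 = np.zeros((128, 128), dtype=bool)
obs2 = np.zeros((128, 128), dtype=bool)
obs1[40:80, 40:80] = True
obs2[44:84, 38:78] = True   # slightly shifted contour

dsc = dice(obs1, obs2)
hd = symmetric_hausdorff(obs1, obs2)
print(f"DSC = {dsc:.2f}, Hausdorff = {hd:.1f} px")

# Image pairs whose DSC falls below an agreed threshold can be flagged
# for clinician reassessment and consensus annotation.
DSC_THRESHOLD = 0.7   # illustrative value, to be agreed per task
print("flag for reassessment:", dsc < DSC_THRESHOLD)
```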
43. Phantoms: Phantoms should be scanned across multiple centres and used to calibrate and harmonise future patient images and measurements. The use of a standardised quantitative calibration phantom, together with a procedure well recognised and accepted by the medical imaging and radiology community, would decrease inter-scanner variability [81, 134]. Some existing methods make use of an imaging phantom in which volume change can be applied in a highly controlled way, for standardising measurements of brain atrophy rates between different scanners [9]. 44. Data augmentation for model training: Robustness in training can be improved through synthetic data, by simulating a wide range of challenging imaging conditions (e.g., noise, artefacts, extreme cases) to augment the available data in magnitude and diversity. A recent study defined two families of data augmentations: spatial transformations, which increase sample size through rotation, flipping, scaling or deformation of the original images; and intensity-driven techniques, which maintain the spatial configuration of the anatomical structures but modify their image appearance (e.g., with standard image transformations such as histogram matching, blurring, changes in brightness, gamma and contrast, or the addition of Gaussian noise; and advanced image synthesis using generative adversarial networks (GANs) or variational auto-encoders (VAEs)) [21]. An additional study proposed the use of adversarial attacks to generate small synthetic image perturbations for image reconstruction tasks. By introducing robust training into a reconstruction network, the rate of false negative features in image reconstruction was shown to be reduced [20]. If deployment to clinical domains where the data distribution is unknown a priori is necessary, inference/test-time adaptive AI models [79] can adapt to the new data distribution and hence further foster AI robustness for the unseen clinical domain. 45. Training on heterogeneous data: The imaging AI algorithms should be trained and evaluated with heterogeneous datasets from multiple clinical centres, vendors, and protocols. One of the most recommended strategies to promote further research and scientific benchmarking in the field of generalisable deep learning is the organisation of task-specific challenges which include multi-centre, multi-vendor and multi-disease data [99]. If access to shared anonymised multi-centre imaging samples is not feasible, privacy-preserving federated learning should be considered by the AI developers in collaboration with the participating clinical centres [78]. 46. Uncertainty estimation: Uncertainty estimation must be considered part of any AI system in imaging, to estimate confidence scores or maps given the imaging characteristics and inform the radiologist of a potential lack of robustness [25]. An entire framework based on Bayesian CNNs was proposed to diagnose ischemic stroke patients, incorporating Bayesian uncertainty into the analysis procedure, which resulted in an improvement not only in image-level prediction and uncertainty estimation, but also in the detection of uncertain aggregations at the patient level [62]. Other innovative proposals include the calculation of a new score beyond the classifier's discriminant or confidence score, called the trust score, which constitutes a measure of uncertainty for any trained (possibly black-box) classifier and is more effective than the classifier's own implied confidence (e.g., softmax probability for a neural network) [73].
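One simple way to report uncertainty beyond the classifier's own confidence score (recommendation 46) is Monte Carlo dropout, sketched below with a toy PyTorch model; this is only an illustrative stand-in for the Bayesian CNN and trust-score approaches cited above, and the architecture and sample count are assumptions.

```python
import torch
import torch.nn as nn

# Toy image classifier with dropout; stands in for a trained imaging model.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 2),
)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Keep dropout active at inference and average over stochastic forward passes."""
    model.train()   # enables dropout; in practice batch-norm layers would be frozen
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
        )
    mean = probs.mean(dim=0)   # predictive probability per case
    std = probs.std(dim=0)     # spread across samples, a simple uncertainty proxy
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=1)
    return mean, std, entropy

x = torch.randn(4, 1, 64, 64)   # placeholder batch of scans
mean, std, entropy = mc_dropout_predict(model, x)
print("predictive mean:", mean)
print("predictive entropy (flag high values for radiologist review):", entropy)
```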
47. Equity in accessibility: If the AI tool is intended for global radiology, it should be optimised (such as by using transfer learning and/or domain adaptation) and tested with new imaging samples from resource-limited settings in low-to-middle income countries [112]. In recent years, AI models have begun to outperform radiologists at certain diagnostic tasks using medical images [106, 172]. However, AI solutions in general, and deep neural networks in particular, lack transparency, leading to the term "black box AI", referring to the fact that these models learn complex functions that are inaccessible and often incomprehensible to humans [183]. One promising, but not yet widely applied, exception are causal models that enable comparisons between observed and counterfactual medical imaging data, which help to explain outcomes causally [127]. Nonetheless, the common lack of AI model transparency hinders the incorporation of AI solutions into standard-of-care clinical workflows, as clinicians are unlikely to accept AI solutions in their workflows without some understanding of the underlying principles, even if the algorithms routinely outperform experts [98]. For example, it is important to ensure that the AI solution performs the diagnosis based on the patient's phenotype rather than on image features that are clinically irrelevant for the task, such as the presence of a ruler in the image. A recent study showed that a highly accurate deep learning solution for COVID-19 detection from chest radiographs bases its predictions on confounding variables such as laterality or text markers on the radiographs [31]. Similarly, a deep learning solution for skin cancer diagnosis can assign importance to irrelevant image regions, such as dark corners of the images, and still achieve high performance [184]. Such solutions perform poorly when tested on unseen (real-world) data from new hospitals despite high accuracy during initial testing. Hence, it is important to partially or entirely understand the decision-making process of the AI solution in order to troubleshoot these problems. The European Union's General Data Protection Regulation (GDPR) specifies a 'right to explanation' for the patient in Article 22, which makes it legally binding to offer an explanation of the automated decision-making process [155]. It is a legal, ethical, and clinical requirement to focus on the explainability of the AI algorithm in order to integrate its predictions into clinical practice. Explainability affirms and embraces the need to provide insight into the mechanisms behind AI decision-making processes, thereby allowing for clinical validation and scrutiny of these decisions. Towards a definition of explainability, it should be noted that a general agreement on the term is still needed, as it is often used interchangeably with interpretability [103, 97]. Holzinger et al define explainability as follows: 'Given a certain audience, explainability refers to the details and reasons a model gives to make its functioning clear or easy to understand'. Explainable artificial intelligence (XAI) for medical imaging refers to AI solutions that give end-users insight into their functioning. Explainability of AI solutions should start at the design and requirement-gathering stage, by incorporating the desires, objectives, and challenges of clinicians to understand what type of explanations best suits their needs. These explanations come in a variety of formats, each addressing different questions.
Local explanations, for instance, provide reasons behind a particular prediction by the AI model for an individual image, while global explanations identify the common characteristics that the model considers important for a particular class. Post-hoc explainability methods aim to provide an understanding of how the model works after the model has been built. Attribution maps, or heat-maps, are one of the most common types of post-hoc explainability method; they highlight the regions of the input image that the AI model considers important. Examples of attribution methods include Grad-CAM [156], Integrated Gradients [165], and Guided BackProp [163]. In a recent study on COVID-19 detection from chest X-rays and CT scans, Grad-CAM-based attribution maps highlighted the infected area in the lungs, showing that the deep learning classifier considers it important for the prediction [123]. Attribution maps, however, do not offer any information on how these salient regions influence the decision-making process. The Local Interpretable Model-agnostic Explanations (LIME) method evaluates the contribution of a feature to the prediction made by the AI model [143]. LIME perturbs the features extracted from a medical image to measure their impact on the classifier's prediction. LIME assumes that every complex model behaves like a linear model locally. A new linear model, trained using the generated perturbations and the classifier's output, determines the contribution of each feature by approximating the behaviour of the model locally. For example, a study predicting Isocitrate Dehydrogenase (IDH) mutations in gliomas using dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI)-based radiomics used LIME to explain the predictions made by a random forest model [102]. In this case, the LIME analysis revealed dependence count variance, complexity and normalised grey-level non-uniformity as the strongest radiomics features for IDH mutation status prediction. Shapley additive explanations (SHAP) is an interpretability method derived from game theory, which helps to determine the effect of individual features on the predictions made by the classifier [101]. For instance, a SHAP analysis interpreting the predictions of a Support Vector Machine model trained with radiomics features extracted from the MRI of patients with non-metastatic nasopharyngeal carcinoma after intensity-modulated radiation therapy revealed that tumour shape sphericity, first-order mean absolute deviation, T stage, and overall stage are important features for predicting disease prognosis [34]. Concept attribution quantitatively associates high-level clinical concepts with model predictions. The Testing with Concept Activation Vectors (TCAV) method provides global explanations by determining the influence of high-level image concepts on the neural network's internal states [84]. A study on the interpretability of deep learning models for predicting Diabetic Retinopathy (DR) level using a five-point grading scale showed that TCAV identified correct diagnostic concepts for some DR levels [84]. The micro-aneurysm diagnostic concept was assigned a high TCAV score for the diagnosis of DR level 1, and the aneurysm diagnostic concept was assigned a high TCAV score for the diagnosis of DR level 2. The TCAV method was extended to determine the influence of continuous variables, such as radiomics features, on the neural network's layer activations. In a further study, the TCAV method used radiomics features as concepts to determine that nuclei texture is relevant for the detection of tumour tissue in breast lymph node histopathology samples by a deep learning model [58].
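A minimal sketch of the simplest member of the attribution-map family discussed above, a plain input-gradient ('Gradient') saliency map, is given below; the model and input are toy placeholders, and methods such as Grad-CAM or Integrated Gradients would be computed analogously, typically with dedicated libraries.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for a trained medical-imaging model.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

def gradient_saliency(model: nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return |d score / d pixel| as a per-pixel relevance (saliency) map."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]   # logit of the class being explained
    score.backward()
    return image.grad.abs().squeeze()       # H x W saliency map

x = torch.randn(1, 1, 64, 64)               # placeholder scan
saliency = gradient_saliency(model, x, target_class=1)
print("most relevant pixel (row, col):",
      divmod(int(saliency.argmax()), saliency.shape[1]))
```

In practice the saliency map would be overlaid on the original image and reviewed with clinicians, since (as discussed below) such maps show where the model looks but not how those regions drive the decision.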
Interpretable models inherently provide explanations along with their predictions or, alternatively, their reasoning process is explainable by design. For instance, Concept Bottleneck Models make the decision-making process interpretable, as in Koh et al. [87], by first predicting clinical concepts and then predicting the severity grade based entirely on these clinical concepts. In this case, the Concept Bottleneck Models used 10 clinical concepts describing bone spurs, calcification, etc. to predict the severity of knee osteoarthritis. This also allows clinicians to intervene and change the clinical concepts to observe the effect on the model's prediction. The Prototypical Part Network (ProtoPNet) [27] is an interpretable deep neural network that performs classification by comparing the features extracted from the input image against class-discriminative prototypes. ProtoPNet was utilised for Alzheimer's disease classification with DenseNet-121 as a feature extractor, and the analysis showed that ProtoPNet provided reasoning for its predictions that can facilitate its adoption in clinical practice [111]. Anatomical priors and other domain-specific information related to the medical image analysis task can also be incorporated in the model to make its predictions interpretable. It is a general perception that the performance of an algorithm is inversely proportional to its interpretability. A reason for this perception is that deep neural networks have achieved state-of-the-art performance on many medical imaging tasks while remaining mostly uninterpretable. However, both Concept Bottleneck Models and ProtoPNet are interpretable and achieve performance on par with black-box deep learning models, showing that there is, in principle, no need to compromise between performance and interpretability [183]. Explainability methods are difficult to evaluate because they are subjective, application-specific, and often lack an available ground truth. However, it is important to ensure that explainability methods produce explanations that are robust, sensitive, and faithful to the model, the data, and the prediction. One study used model parameter randomisation and data randomisation tests to evaluate the sanity of different attribution methods. The model parameter randomisation test evaluates the effect of using randomly initialised versus trained model weights on attribution methods, while the data randomisation test evaluates the effect of using random versus correct data labels. These tests revealed that the attribution maps generated by the Gradient and Grad-CAM methods pass them, while some other methods produce inconsistent attribution maps [4]. Explainability methods can reveal diagnostic information that may add clinical value for diagnosis. For instance, an explainability study on the diagnosis of skin cancer revealed that the attribution maps generated by Grad-CAM for the prediction of pigmented actinic keratosis also consider the area outside the lesion important for the prediction [173]. This observation is consistent with the findings that chronic sun damage is responsible for pigmented actinic keratosis.
An increase in diagnostic accuracy was observed when clinicians were told to pay extra attention to chronic sun damage. A study on fetal head circumference estimation from ultrasound images used perturbation analysis and the Area Over the Perturbation Curve (AOPC) for the quantitative evaluation of attribution maps [187]. The AOPC metric is based on perturbation analysis performed in an ordered manner, modifying the most important regions first to observe the performance decay; a large AOPC value corresponds to an informative attribution map. It is also important to determine whether clinical end-users and stakeholder groups are satisfied with the content and quality of the explanations. The System Causability Scale (SCS) determines the extent of the utility of an explanation for the end-user. For example, clinicians can use the SCS to evaluate the quality of different explainability methods [64]. Generally, there is a need for a thorough quantitative and qualitative evaluation of explanations. This is particularly the case considering that deep learning networks are susceptible to adversarial attacks [55], which exposes them and their explainability methods to security concerns. A study investigating the effect of small input perturbations on attribution maps generated by DeepLIFT and Integrated Gradients revealed that input images that look similar and predict the same label can produce very different attribution maps [52]. Also, ProtoPNet is prone to corruption by noise and JPEG compression artefacts [63]. Therefore, there is also a need to investigate the effect of adversarial perturbations and noise on the explainability outputs, to inform the choice and design of the respective explainability methods. In summary, we recommend the following guidelines to enhance the interpretability and explainability of AI solutions and foster trust in the predictions made by AI models. 48. Clinical requirements on explainability: Clinicians should be involved early on in the design phase to discuss options, wishes and requirements regarding the explainability of AI models. The different explainability methods should be presented to the clinicians in an intuitive manner to allow for a clear understanding of their usage, advantages, and limitations. It is recommended to conduct a small trial using example explanations from each of the explainability methods, in order to determine which methods are considered suitable for the task by the clinical end-users and stakeholders. 49. Incorporation of clinical concepts: If possible, clinical concepts important for the diagnosis of a particular disease should also be consistently annotated. These additional concepts can be utilised to make the model interpretable, at an added annotation cost, as exemplified in the following. Explainable capsule networks (X-Cap) can predict lung nodule malignancy by encoding high-level visual object attributes such as sphericity, margin, subtlety, and texture in capsule vectors and performing the diagnosis based entirely on these concepts [88]. The latent space of a deep neural network can be disentangled to understand the model's behaviour using annotated clinical concepts. A study on cardiac resynchronisation therapy response prediction from cine MRI showed that the latent space of a variational autoencoder can be disentangled for interpretability in terms of clinical concepts, such as the presence of septal flash, by using secondary classifiers [136].
Anatomical priors and other domain-specific clinical information for the medical image analysis task should also be utilised to design interpretable deep learning models. For example, a deep learning model for detecting midline shift in MRI images can be designed by exploiting structural knowledge: a two-step approach that first estimates the midline and then predicts the midline shift from the generated curve is more interpretable than predicting the midline shift directly from the entire MRI scan [129]. 50. Multiple explanation methods: Multiple explainability methods that provide complementary explanations should be explored for understanding the decision-making process of the AI model. For example, Gamble et al [46] used attribution maps to provide local explanations for a specific image, in an effort to understand the reasoning process of an AI model that predicts breast cancer biomarker status from hematoxylin and eosin-stained images. They also utilised TCAV to provide global explanations that identify the characteristics influencing the decisions of the AI model for a particular class. 51. Identifying explainable imaging biomarkers: In order to increase clinical value, explainability methods should be used to identify imaging features or structures that can serve as imaging biomarkers for the diagnosis and prognosis of a disease. The detected imaging biomarkers, if previously known, can increase trust in the AI algorithms. For instance, post-hoc explainability methods identified the skin region outside the lesion, indicating chronic sun damage, as a potential biomarker for the diagnosis of pigmented actinic keratosis [173, 95]. Explainability methods can also help in hypothesising new imaging biomarkers. A study utilising the Layer-wise Relevance Propagation (LRP) [12] attribution method to explain a deep learning model for the prediction of estrogen receptor status from H&E images revealed a number of image features, such as nuclear and stromal morphology, as potential biomarkers [154]. 52. Quantitative evaluation of explainability: Quantitative evaluation of explainability methods should be performed to ensure that the explanations are trustworthy, consistent and robust. Quantitative evaluation metrics such as the Area Over the Perturbation Curve (AOPC) score may be used for evaluation [148]. One study investigated four attribution methods for robustness by training multiple deep learning models for Alzheimer's disease classification [35]; attribution sum, attribution density and gain of attribution were used as evaluation metrics for the quantitative comparison. 53. Qualitative evaluation of explainability: Qualitative evaluation of the generated explanations should be performed with the help of clinicians, in order to determine the usefulness of the explainability methods. The System Causability Scale (SCS) measures the utility of explanations for end-users, and clinicians can use it to evaluate the quality of the explanations [166]. 54. Robustness of explainability against adversarial attacks: The robustness of explainability methods against adversarial attacks should be assessed and, whenever possible, enhanced by training with adversarial examples. The input images are subjected to small perturbations and noise to determine whether the explanations remain consistent. Security is a critical factor in automated decision-making systems for healthcare.
A study showed that input images subjected to small perturbations and noise, producing visually indistinguishable images, can highlight entirely different regions in the attribution maps generated by the DeepLIFT and Integrated Gradients methods [52]. 55. Explainability in clinical practice: Application-grounded evaluations [33] involving clinical experts should be performed to evaluate the effect of using the explainability methods in clinical practice. A collaborative human-AI study should be performed in which the doctor performs the clinical task using the AI tool with and without explanations. It is important to identify any bias resulting from the use of the explainability methods. For example, one study investigated the impact of deep learning model predictions and the corresponding integrated gradients attribution maps on 10 ophthalmologists for the diagnosis of Diabetic Retinopathy, revealing that the use of attribution maps results in an over-estimation of normal cases [150]. To facilitate testing whether, and to what extent, a given imaging AI solution is compliant with the FUTURE-AI guiding principles, a pragmatic quality check is presented in the following. This quality check consists of a set of 55 practical questions which, as a whole, summarise the six guiding principles and encapsulate each of their aforementioned recommendations. AI research and development teams may consult the quality check to identify potential improvements in their endeavour of building a trustworthy, deployment-ready medical imaging AI solution. The quality check questionnaire can be found in Table 1; an alternative version of the quality check can also be accessed online. As we recognise that each project stage in the development of an imaging AI solution comprises its own specific challenges and dynamics, the quality check provides guidance as to which project stages are affected by each of its questions. The successive project stages start with (1) clinical conceptualisation, followed by (2) end-user requirement gathering for co-creation, (3) technical design and specification, (4) data selection, collection and/or preparation, (5) AI implementation and optimisation, (6) AI evaluation (retrospective, prospective, in-silico), and (7) AI deployment and monitoring.

Table 1 (excerpt, items 32-47): Overview of the FUTURE-AI checklist, organised per principle and ranked according to the AI stages in which the respective checklist item is applicable. The stages consist of (1) clinical conceptualisation, (2) end-user requirement gathering, (3) technical design, (4) data collection and preparation, (5) AI implementation and optimisation, (6) AI evaluation, and (7) AI deployment and monitoring.
32. Usable explainability: Did you implement any type of explainability that will be usable and actionable by the radiologist?
33. Usability testing (stage 6): Did you design an appropriate usability study?
34. In-silico validation (stage 6): Did you consider an in-silico validation of usability?
35. Usability metrics (stage 6): Did you define the appropriate usability metrics for evaluation?
36. Clinical integration (stages 6-7): Did you evaluate the usability of your tool after integration in the clinical workflows of the clinical sites?
37. Training material (stages 6-7): Did you provide end-users with resources to learn to adopt and appropriately work with your tool?
38. Usability monitoring (stages 5-7): Did you implement monitoring mechanisms to assess changes in user needs and re-evaluate the appropriateness of the AI solution through time?
39. Image harmonisation: Did you implement any image harmonisation solutions to account for image heterogeneity?
40. Feature harmonisation (stage 4): Did you perform any feature harmonisation study before developing your predictive models? Did you assess, minimise, and report the variation across features?
41. Intra- and inter-observer variability (stage 4): Did you perform any intra- and inter-observer annotation studies?
42. Quality control (stage 4): Did you use any quality control tools to identify abnormal deviations or artefacts in images?
43. Phantoms (stage 4): Did you use phantoms to harmonise patient images and/or measurements?
44. Data augmentation for model training (stages 4-5): Did you use data augmentation techniques to improve the training of AI models?
45. Training on heterogeneous data (stages 5-6): Did you train and evaluate your tools with heterogeneous datasets from multiple clinical centres, vendors, and protocols?
46. Uncertainty estimation (stages 5-6): Did you report any kind of model uncertainty beyond the classifier's discriminant or confidence score?
47. Equity in accessibility (stages 6-7): Did you optimise your tool with images from resource-limited settings in low-to-middle income countries?

First and foremost, it is to be noted that the FUTURE-AI principles are a living framework, whereby, after diligent examination, new consensus and emerging best practices can be adopted into the framework. As such, this document represents work in progress and is a living document calling for updates and refinement. In this work, we have provided the FUTURE-AI framework built on consensus from various large-scale medical imaging AI implementation projects. FUTURE-AI is based upon the principles of Fairness (section 2), Universality (section 3), Traceability (section 4), Usability (section 5), Robustness (section 6) and Explainability (section 7). For each principle, we have analysed recent publications and discussed the necessary features for building FUTURE-ready AI solutions. We note that the first principle, Fairness (section 2), is highly subjective and requires a clear definition in the context of the imaging AI solution. In this regard, the diverse perspective of a multidisciplinary team including clinicians, social scientists and ethicists can uncover hidden biases. To develop equitable AI solutions, it is important to label metadata in imaging datasets such as sex, gender, ethnicity, skin colour, socioeconomics, or geography, while, at the same time, ensuring patient privacy preservation. Multi-centre data collection can increase the diversity of datasets. Data imbalances and biases within these datasets can be difficult to identify and estimate, but are crucial for subpopulation fairness analysis, evaluation, reporting, countermeasure planning, and continuous fairness monitoring after deployment. In Universality (section 3), the need for universally applicable, interoperable, and standardised imaging AI solutions was elaborated. After (a) a standardised definition of the clinical problem, a principled approach towards solving that clinical problem with AI includes (b) standardisation of the software solution for maintainability (e.g., using established libraries and proven frameworks), (c) standardised dataset annotations for objectivity, (d) adherence to standardisation initiatives for reproducible imaging biomarker quantification, (e) standardised evaluation criteria with (f) benchmark evaluation for comparability, and (g) adherence to reporting standards (e.g., TRIPOD-AI) for unambiguous communication of the AI solution.
In Traceability (section 4), the importance of model and data transparency was discussed, in order to detect and counteract concept drift and data drift after model deployment. The model passport was also introduced, which travels with the model during its entire lifecycle, i.e. from the lab to the clinical centres. The model passport transparently provides clinical and technical stakeholders with updated model metadata, statistics, model scope, model data provenance, and monitoring information. A traceability tool, together with the model passport, enables AI accountability and risk awareness, for instance by tracing errors back to the recorded training, validation, and data preparation processes. Breaking down the concept of Usability (section 5), we discussed its associated key concepts, which are learnability, efficiency, memorability, limited and non-catastrophic errors, and user satisfaction. It is important to listen to and empower the voice of each group affected in some form by the AI solution, in order to gather a diverse set of stakeholder needs, expectations and requirements. AI solution explainability, end-user training material, and an adequate human-computer interface allow for user-friendly clinical adoption of the AI solution, further guided by validation via usability tests, in-silico trials, usability metrics, and continuous user satisfaction monitoring before and also after clinical deployment. The Robustness principle (section 6) is particularly important in medical imaging due to the multitude of different sources of variation in medical images. For instance, we discussed variation due to equipment-related imaging heterogeneity, varying centre-specific imaging parameters, operator-related heterogeneity, patient-related heterogeneity, context-related heterogeneity, and intra- and inter-observer image annotation variability. To account for and counteract the data and domain shifts resulting from these variations, dedicated experiments and quality controls are necessary to estimate, report, and trace back imaging variations to their origin. Also, optimisation of both the data (e.g., image, feature, and annotation harmonisation) and the AI tool (e.g., domain invariance, domain adaptation, domain generalisation, uncertainty estimation) enables quantifying and reducing the error of the AI model ascribable to data heterogeneity. Lastly, in Explainability (section 7), we strive to motivate the transition from the development of "black box AI" models towards explainable and interpretable AI. Not only is this transition technically, ethically, and scientifically desirable, but it is increasingly being adopted into binding legal frameworks, e.g. requiring that patients be offered an explanation of automated decision-making processes. After gathering clinicians' needs and preferences regarding automated AI model explanation, the annotation of the respective clinical concepts and the complementary usage of multiple explainability methods (e.g., attribution maps, Local Interpretable Model-agnostic Explanations, Shapley additive explanations, Testing with Concept Activation Vectors, Layer-wise Relevance Propagation, explainable capsule networks) can improve local (image-level) and global (model-level) explainability, as is to be shown via quantitative (e.g., Area Over the Perturbation Curve) and qualitative (e.g., System Causability Scale) explainability evaluation measures.
Finally, regarding Explainability, the robustness of the explainability methods themselves (e.g., against adversarial examples), as well as the effect of using these methods in clinical practice, should also be evaluated.
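One simple way to probe this robustness, sketched below under the assumption of a gradient-saliency explainer, is to perturb the input slightly and measure how strongly the attribution map changes; the Gaussian noise level and the correlation score are illustrative choices, not prescribed measures.

```python
# Minimal sketch: probing the stability of an attribution map under a small
# input perturbation (a simple proxy for robustness against adversarial examples).
import torch
import torch.nn as nn

def gradient_saliency(model: nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Flattened absolute input-gradient saliency for one image."""
    image = image.clone().requires_grad_(True)
    model(image)[0, target_class].backward()
    return image.grad.abs().flatten()

def pearson(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation between two flattened saliency maps."""
    a, b = a - a.mean(), b - b.mean()
    return ((a * b).sum() / (a.norm() * b.norm() + 1e-12)).item()

def saliency_stability(model: nn.Module, image: torch.Tensor,
                       target_class: int, noise_std: float = 0.01) -> float:
    """Correlation of saliency maps for a clean and a slightly perturbed input."""
    clean = gradient_saliency(model, image, target_class)
    noisy = gradient_saliency(model, image + noise_std * torch.randn_like(image), target_class)
    return pearson(clean, noisy)   # values close to 1.0 suggest a stable explanation

toy_model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 2),
)
scan = torch.rand(1, 1, 64, 64)   # placeholder for a preprocessed image
print(saliency_stability(toy_model, scan, target_class=1))
```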