title: Machine Learning for COVID-19 Diagnosis and Prognostication: Lessons for Amplifying the Signal While Reducing the Noise
authors: Driggs, Derek; Selby, Ian; Roberts, Michael; Gkrania-Klotsas, Effrossyni; Rudd, James H. F.; Yang, Guang; Babar, Judith; Sala, Evis; Schönlieb, Carola-Bibiane
date: 2021-03-24
journal: Radiol Artif Intell
DOI: 10.1148/ryai.2021210011

Most studies introducing AI models for COVID-19 diagnosis and prognostication exhibit systematic errors that make them unusable in most clinical settings. However, there remain opportunities for machine learning to assist front-line workers during the COVID-19 pandemic, and the steps we take now will leave us better prepared for the future.

Since the emergence of Coronavirus Disease 2019 (COVID-19), researchers in machine learning and radiology have rushed to develop algorithms that could assist with the diagnosis, triage, and management of the disease (1). As a result, thousands of diagnostic and prognostic models using chest radiographs and computed tomography (CT) have been developed. However, with no standardized approach to development or evaluation, it is difficult, even for experts, to determine which models may be of most clinical benefit. Here, we share our main concerns and present some possible solutions.

In April 2020, during the first wave of the novel coronavirus outbreak in Europe and the U.S., Gog published an editorial outlining how researchers could use their skills to help (2). Her paper was a call for researchers to proceed cautiously, stating that the priority should be to "amplify the signal" while avoiding "adding to the noise" in the literature. In the months since this appeal to caution, have we, as a research community, followed her guidance?

Our AIX-COVNET collaboration is a multidisciplinary team of radiologists and other clinicians working alongside image-processing and machine learning specialists to develop AI tools to support front-line practitioners in the COVID-19 pandemic (3). We set out to quantify common problems in the enormous number of papers that developed machine learning models for COVID-19 diagnosis and prognostication using thoracic imaging. We systematically reviewed every such study published between January 1 and October 3, 2020, and found two predominant sources of error (4): first, an apparent deterioration in standards of research, and second, a lack of collaboration between the machine learning and medical communities, leading to inappropriate and redundant efforts.

To create models quickly, researchers have frequently relaxed the standards for developing safe, reliable, and validated algorithms. This laxity is most obvious in the datasets used to train these models: they contain too few examples from COVID-19 patients, their quality is unreliable, and their origins are poorly understood.
Many models have been developed with access to only a few hundred COVID-19 images, whereas comparable models before the pandemic were trained using up to half a million examples (5). Few papers address this small-data issue, or the resulting imbalance of class sizes, making it unlikely that their results will generalize to the wider community. For example, because of the prevalence of data from China, many researchers train on small datasets from China even when the model is intended for European populations, and recent research suggests that such models are ineffective in practice (6). Differences between the training data and the target population, including patient phenotypes and data acquisition procedures, can all affect a model's generalisability (6). Training generalisable models from small amounts of labeled data is a common problem in medical imaging, and techniques such as transfer learning, self- or semisupervised learning, and parameter pruning can ameliorate this issue (7,8).

Although data sharing is critical for the research community to thrive, distributing or using public datasets of poor quality and unknown origins can further damage research efforts. Many public datasets are combinations of images assembled from other public datasets and redistributed under a new name (9,10). This repackaging of data has led researchers to unknowingly validate their models on public datasets that contain their training data as a subset, likely producing an optimistic view of their performance (a hash-based check for such overlap is sketched below). A surprising number of studies also unknowingly use a public dataset of pediatric patients for their non-COVID-19 cases (9). Additionally, many researchers have not acknowledged that some popular public datasets of COVID-19 patients are composed of images taken from journal articles, with no access to the original DICOM files (11). Whether such "pictures of pictures" provide the same quality of data as original images was debated before the pandemic (12,13) without an established consensus; in this time of crisis, these concerns have been ignored.

Given the prevalence of research quality standards for developing medical models, it is perhaps surprising that such widespread issues exist in the COVID-19 literature. We have determined that disconnects between research standards in the medical and machine learning communities partly explain these issues. For example, the Prediction model Risk Of Bias Assessment Tool (PROBAST) checklist (14) for assessing the risk of bias in medical models requires models to be validated on an external dataset, but in machine learning research it is common practice to validate a model using an 80:20 training-to-testing split from the same data source. On the other hand, model quality checklists, such as the Radiomics Quality Score (RQS) (15), suggest that to protect against overfitting, a model must train on at least 10 training examples per model parameter. However, deep learning models have been shown to generalize well despite heavy over-parameterisation (16), so this requirement is often inappropriate for them. Furthermore, with deep learning models it is difficult to interpret the extracted features, making it difficult to run standard risk-of-bias assessments from the medical literature (17). These gaps between research standards in medicine and machine learning allow the dissemination of irreproducible research, and they extend far beyond the immediate COVID-19 crisis.
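Returning to the repackaged-dataset problem described above: verbatim duplicates shared between a training set and a public evaluation set can be screened for by hashing raw file contents. The following is a minimal sketch of such a check, using only the Python standard library; the directory names are hypothetical, and byte-level hashing misses duplicates that were re-encoded or resized, which would require perceptual hashing or pixel-level comparison to catch.

```python
import hashlib
from pathlib import Path

def file_hashes(directory: str) -> dict:
    """Map the MD5 hash of each file's raw bytes to its path."""
    hashes = {}
    for path in Path(directory).rglob("*"):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            hashes[digest] = path
    return hashes

# Hypothetical directories: a local training set and a public test set.
train = file_hashes("data/train")
test = file_hashes("data/public_test")

# Any shared hash means the identical file appears in both sets,
# so evaluation on the public set would be contaminated.
overlap = set(train) & set(test)
print(f"{len(overlap)} duplicated files found")
for digest in overlap:
    print(train[digest], "==", test[digest])
```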
Collaboration and communication between these communities to bridge these gaps will be necessary as more machine learning models move into clinical deployment. Our own collaboration, comprising clinicians, machine learning researchers, mathematicians, and radiologists, is one example. Given our experiences and the findings presented in our discussion of the literature, we propose some guiding principles for developing clinical models in the COVID-19 era and beyond.

• Work as a multidisciplinary team. Many existing studies were performed without any input from clinicians. As a result, models have been built to solve problems that do not necessarily provide significant clinical benefit (4). For example, in the UK, chest radiographs have a much more significant role in COVID-19 diagnosis than CT scans, but early models focused mostly on diagnosis from CT (18,19). Adapting to local medical practices is difficult without collaborating with clinicians.

• Source original data. The origins of public datasets are often unknown, making it difficult to determine their quality or their suitability for inclusion in model development. Such datasets are also unlikely to represent a model's target population, making it less likely that a model's performance will generalize upon deployment. Training on high-quality data that are representative of the target community, with validation on externally sourced data, provides the best estimate of a model's performance.

• Streamline data acquisition and processing. Collecting high-quality data is always a challenge in machine learning, particularly data on a novel virus, but preparation can make data collection easier. Researchers must be familiar with local guidance on the use and sharing of patient data, and pre-emptive protocols for obtaining, anonymising, and securely storing data, including for anticipated future pandemics, are essential. The current crisis has demonstrated that without these pre-emptive protocols, data collection can be severely delayed. Equally important is developing efficient and potentially semiautomated data preprocessing pipelines to ensure rapid access to high-quality, well-curated datasets. Making these procedures publicly accessible also ensures that different groups do not need to spend time curating the same data.

• Acknowledge the small-data problem. Obtaining large amounts of labeled data for medical applications is difficult, especially when they relate to a novel virus. Models should be adjusted to respond to this small-data problem. Although this is an ongoing area of research, several strategies have been shown to boost performance when working with small or sparsely labeled datasets, including semi- and self-supervised learning (7,20), weight transfusion, and limiting the number of trainable parameters (8); a minimal transfer-learning sketch follows this list.

• Follow and improve medical standards. There are gaps between research standards in medicine and machine learning, and more research is required to resolve these inconsistencies. Machine learning researchers should be aware of the RQS (15) and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) (21), standard checklists for evaluating models that use radiomic features. It is also imperative to evaluate a model's risk of bias using standards such as PROBAST (14) and to report results following guidelines such as the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist (22). Conversely, medical standards must be updated to support deep learning practices; calls for an updated TRIPOD checklist (TRIPOD-ML) (23) and the related reporting guidelines SPIRIT-AI (24) and CONSORT-AI (25) are steps in this direction.
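As a concrete illustration of the weight-transfusion strategy mentioned in the small-data principle above, here is a minimal sketch, assuming PyTorch and torchvision are available; the binary COVID-19 head and the learning rate are illustrative choices, not a validated configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone ("weight transfusion").
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pretrained parameters to limit the number that are trainable.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with a small trainable layer for a
# hypothetical binary COVID-19 / non-COVID-19 task.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are passed to the optimiser.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Freezing the pretrained backbone reduces the trainable parameter count from roughly 11 million to about a thousand, a regime far better matched to datasets of a few hundred images.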
With these considerations in mind, there remain plenty of opportunities for machine learning models to aid clinicians during the current pandemic and beyond, with much of the knowledge gained being applicable to other diseases, including future pandemics. Below, we outline several data sources that could be used to develop models that are helpful to clinicians (Figure).

1. Chest radiographs: Chest radiographs are a first-line investigation in many countries, including the UK. Researchers could examine not only the initial imaging findings and the extent of respiratory involvement, but also how radiographic progression in serial studies correlates with patients' clinical phenotypes. Many groups have developed deep learning models using chest radiographs of COVID-19 patients, but further research is required to determine whether similar models, especially prognostic models, could be clinically viable.

2. Thoracic CT: Another promising area of research that has received some attention is developing segmentation and classification methods to locate lung parenchyma that could be affected by COVID-19 and to classify these regions as a manifestation of COVID-19 or a result of another disease. High-quality datasets for chest radiographs and CT include the British National COVID-19 Chest Imaging Database (NCCID) (26) and the Medical Imaging and Data Resource Center (MIDRC-RICORD) datasets curated by the RSNA (27).

3. Comorbidities: Given that patients with cardiovascular comorbidities are at higher risk of severe disease and mortality (28), it is natural to consider the cardiovascular information that is also contained in thoracic CT. Models that incorporate automated calcium scoring, for example, allow the burden of atherosclerotic disease to be incorporated into prognostic models, even in patients with no prior cardiovascular diagnosis. The effects of COVID-19 on the heart have received little attention.

4. Flow cytometry: Many diseases cause irregularities in the physical and chemical properties of blood cells, affecting distinct cell types differently. COVID-19 might cause a specific and unique set of changes that can be rapidly detected by flow cytometry. This often-untapped plethora of granular and longitudinal data has recently shown promising results when used in models for COVID-19 prognostication (29).

Multiple centers collect data in different formats, consider different features, and store data in potentially many different systems; one significant challenge is to design algorithms that are robust to these factors. Ideally, a model would use more than one data source, and an especially promising direction for investigation is how to optimally combine clinical and radiomic features (4); a toy fusion sketch follows below.
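To make the fusion idea above concrete, here is a toy sketch that combines pre-extracted radiomic and clinical feature vectors by simple concatenation ("early fusion"), assuming scikit-learn; the feature counts, synthetic data, and outcome label are entirely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features for 200 patients:
# 10 radiomic features per CT scan and 5 routine clinical variables.
radiomic = rng.normal(size=(200, 10))
clinical = rng.normal(size=(200, 5))
outcome = rng.integers(0, 2, size=200)  # e.g., severe vs. non-severe disease

# Early fusion: concatenate the two feature streams into one matrix.
features = np.concatenate([radiomic, clinical], axis=1)

# Scale the fused features and fit a regularised linear baseline model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(features, outcome)
```

In practice, each stream would come from a validated extraction pipeline, and performance would be assessed on externally sourced data, as argued earlier in this piece.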
Many clinicians welcome helpful and appropriately validated models into the clinic. By making these projects open source, interested hospitals can integrate the models into their clinical workflows. The COVID-19 pandemic presents an opportunity to accelerate cooperation between image scientists, data scientists, radiologists, and other clinicians; our collaboration is but one example. Researchers are close to realizing the potential of machine learning in health care, but there are still many barriers to deployment. To overcome many of these barriers, we do not necessarily need more powerful machine learning models but a better understanding of how to develop these tools responsibly. Bridging the disconnects between the machine learning and medical communities is an important step forward, and the current pandemic will forge vital collaborations with potential benefits beyond COVID-19.

Disclosures of Conflicts of Interest: D.D. disclosed no relevant relationships. I.S. Activities related to the present article: institution received a grant from the Innovative Medicines Initiative (grant funding was paid to the University of Cambridge as part of the DRAGON consortium; the author's salary is paid from a portion of this funding, but no specific or additional payment was made relating to this article, and the source of funding had no influence on the content of this opinion piece). The DRAGON consortium is a group of high-tech SMEs, academic research institutes, biotech and pharma partners, affiliated patient-centered organizations, and professional societies aiming to apply artificial intelligence for improved and more rapid diagnosis and prognosis in COVID-19; further details may be found at https://www.imi.europa.eu/projects-results/project-factsheets/dragon. Activities not related to the present article: disclosed no relevant relationships.

References
How might AI and chest imaging help unravel COVID-19's mysteries?
How you can help with COVID-19 modelling.
Artificial Intelligence of COVID-19 Imaging: A Hammer in Search of a Nail.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.
Momentum Contrast for Unsupervised Visual Representation Learning.
Transfusion: Understanding transfer learning for medical imaging.
Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning.
Augmenting the National Institutes of Health chest radiograph dataset with expert annotations of possible pneumonia.
A CT Scan Dataset about COVID-19.
Adversarial examples in the physical world.
CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting.
PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies.
Radiomics: the bridge between medical imaging and personalized medicine.
Understanding deep learning requires rethinking generalization.
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans.
The role of CT in patients suspected with COVID-19 infection.
British Society of Thoracic Imaging. Radiology decision tool for suspected COVID-19.
COVID-19 Prognosis via Self-Supervised Representation Learning and Multi-Image Prediction.
Checklist for Artificial Intelligence in Medical Imaging (CLAIM).
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement.
Reporting of artificial intelligence prediction models.
Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension.
Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension.
Using imaging to combat a pandemic: rationale for developing the UK National COVID-19 Chest Imaging Database.
The RSNA International COVID-19 Open Annotated Radiology Database (RICORD). Radiology.
Cardiovascular disease and cardiovascular outcomes in COVID-19.
Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test.
Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.

Figure: The AIX-COVNET collaboration's vision for a multistream model incorporates multiple imaging segmentation methods (a, b, and c) with flow cytometry (d) and clinical data. (a) A saliency map on a radiograph from the CheXpert dataset (5). (b) Segmented parenchymal disease on a CT scan from the NCCID (26). (c) Segmentation of calcified atherosclerotic disease on an image from the NCCID (26). (d) A projection of a flow cytometry scatter plot of side-scattered light (SSC) versus side-fluorescence light (SFL), giving insight into cell structures.