title: External COVID-19 Deep Learning Model Validation on ACR AI-LAB: It's a Brave New World
authors: Ardestani, Ali; Li, Matthew D.; Chea, Pauley; Wortman, Jeremy R.; Medina, Adam; Kalpathy-Cramer, Jayashree; Wald, Christoph
date: 2022-04-08
journal: J Am Coll Radiol
DOI: 10.1016/j.jacr.2022.03.013

PURPOSE: Deploying external artificial intelligence (AI) models locally can be logistically challenging. We aimed to use the ACR AI-LAB software platform for local testing of a chest radiograph (CXR) algorithm for COVID-19 lung disease severity assessment.

METHODS: An externally developed deep learning model for COVID-19 radiographic lung disease severity assessment was loaded into the AI-LAB platform at an independent academic medical center, separate from the institution in which the model was trained. The data set consisted of CXR images from 141 patients with reverse transcription-polymerase chain reaction-confirmed COVID-19, which were routed to AI-LAB for model inference. The model calculated a Pulmonary X-ray Severity (PXS) score for each image. This score was correlated with the average of a radiologist-based assessment of severity, the modified Radiographic Assessment of Lung Edema score, independently assigned by three radiologists. The associations between the PXS score and patient admission and intubation or death were assessed.

RESULTS: The PXS score deployed in AI-LAB correlated with the radiologist-determined modified Radiographic Assessment of Lung Edema score (r = 0.80). The PXS score was significantly higher in patients who were admitted (4.0 versus 1.3, P < .001) or intubated or died within 3 days (5.5 versus 3.3, P = .001).

CONCLUSIONS: AI-LAB was successfully used to test an external COVID-19 CXR AI algorithm on local data with relative ease, showing generalizability of the PXS score model. For AI models to scale and be clinically useful, software tools that facilitate the local testing process, like the freely available AI-LAB, will be important to cross the AI implementation gap in health care systems.

"O wonder! How many goodly creatures are there here! How beauteous mankind is! O brave new world, that has such people in't" [1] -William Shakespeare, The Tempest.

A vigorous debate has emerged about how best to bring the benefits of the brave new world of artificial intelligence (AI) to bear in our imaging enterprises. When individual practices deploy AI today, most contract with individual commercial companies to deploy their clinical solutions, or they use a platform vendor to choose from algorithms available on aggregated marketplaces. In either scenario, validation of these algorithms with site-specific data is recommended to ensure that the performance of the algorithm on local data conforms to standalone performance testing. In a broader sense, and during the development of algorithms, AI algorithms need to be validated using real-world data that reflect the spectrum of disease in a range of practice types, prior to commercial clinical deployment. The ability of radiology practices to participate in such algorithm validation is hampered by their rightful reluctance to release their (anonymized) patient data beyond their institution for commercial use. Algorithm developers, on the other hand, are concerned with protecting the proprietary nature of their trained algorithms.
Therefore, a need exists for solutions that serve as intermediaries, bringing together practices and their data with developers to train, test, validate, and assess AI algorithms. Multiple approaches are emerging that address this need in different ways, with or without the need to move source data. As an example, the Medical Imaging and Data Resource Center (MIDRC), funded by the NIBIB and implemented by a consortium of professional societies and academic resources, facilitates central data collection and AI research by various entities. Platform/marketplace vendors are beginning to incorporate tools for acceptance testing into their commercial offerings. Early-stage commercial offerings are emerging that promise to enable interaction of research or commercial algorithms with local data without the need to share the data.

The American College of Radiology (ACR) AI-LAB application has been developed by the ACR Data Science Institute as a platform to lower the barrier to entry for radiologists to engage with AI algorithms under development by external entities, without the need to share patient data externally [2]. This platform aims to democratize participation in AI algorithm development and evaluation. One should remember that even when using an AI algorithm intermediary platform, a practice may need additional resources to participate in these activities. Practices need the ability to identify suitable patient cohorts, build exam-specific filters, and identify suitable images from each exam to present to the AI algorithm. This may be viewed as an insurmountable hurdle by some, particularly in the case of small and midsized practices with limited informatics resources.

Our institution sought to participate in AI algorithm testing using AI-LAB to better understand the process. During the first half of 2020, all hospitals in our metropolitan area were heavily affected by the first wave of COVID-19 infections; for this reason, there was particular interest in testing a COVID-19 chest radiograph (CXR) algorithm trained to assess disease severity [7]. We hypothesized that we could use AI-LAB, in the absence of local data science infrastructure or expertise, to deploy an already trained AI model to reliably and repeatedly assess the severity of COVID-19 lung disease across many patients at our institution. Multiple research groups have developed different AI models that can predict the radiographic severity of lung involvement based on lung opacities [8-10]. In this study, we evaluated the feasibility of deploying and testing such an AI model, developed at another institution, on our local institutional data using AI-LAB. We used the previously published Pulmonary X-ray Severity (PXS) score model, a convolutional Siamese neural network-based model for continuous disease severity evaluation [8, 11, 12]. Model outputs were correlated with manual lung disease severity assessments by radiologists and associated with clinical outcomes at our institution. We describe our experience conducting an applied clinical data science research project using the AI-LAB platform, including site requirements for data preparation, ground truth annotation, validation, and testing of AI algorithms. We tested a chest radiograph algorithm to assess lung disease severity among patients with COVID-19 during the first pandemic surge.
Setting: A midsized academic radiology practice in the Northeastern United States, located in a 335-bed hospital serving a suburban population of a metropolitan area, with minimal data science infrastructure, no internal access to data scientists, and no high-performance graphics processing unit (GPU)-based computers or designated general-purpose AI software.

Clinical scenario: Our radiology group partnered with data scientists at an academic medical center in our metropolitan area to use a COVID-19 chest radiograph (CXR)-based lung disease severity quantification algorithm that had been trained at that other institution. As an early-adopter pilot site, we had joined an ACR-facilitated research consortium for the purpose of AI model testing and exchange. At the outset, we internally assessed our data science infrastructure for conducting the proposed AI algorithm testing and consulted with the AI-LAB developer team to obtain recommendations on the computing capability necessary to implement a local installation of CONNECT/AI-LAB.

This HIPAA-compliant study was performed with approval from the ANONYMIZED HOSPITAL Institutional Review Board with a waiver of informed consent. The study was performed by our radiology group in our radiology department, which is an official participating pilot site for the AI-LAB platform.

We used a combined query of an imaging and laboratory database (Primordial RadMetrix, Nuance Communications Inc., Burlington, MA) and our electronic health record (Epic, Verona, WI) to identify the patient cohort. The query retrospectively identified consecutive patients with positive COVID-19 reverse transcription-polymerase chain reaction (RT-PCR) tests who also had a CXR on clinical presentation (to the emergency room, outpatient clinics, or inpatient wards) performed between 3/16/2020 and 4/18/2020. Because hospital admission was one of the primary outcomes and the presentation CXR was not available for 20 transfer patients, these patients were excluded. Admission, intubation, and death dates were recorded for each patient. Admission, intubation, and death within 3 days of the presentation CXR were calculated and recorded as primary clinical outcomes. Due to the low incidence of death within 3 days of admission, a combined outcome of death or intubation within 3 days was used.

The Radiographic Assessment of Lung Edema (RALE) score was initially devised to assess lung edema based on the degree and extent of lung opacity in patients with acute respiratory distress syndrome (ARDS) [12]. A modified version of this score (mRALE) was used in our study. Each lung was assigned an mRALE extent score for the extent of involvement by consolidation or ground-glass opacities (0 = none, 1 = <25%, 2 = 25-50%, 3 = 50-75%, 4 = >75% involvement) [7]. Each lung's extent score was then multiplied by an overall lung density score (1 = hazy, 2 = moderate, 3 = dense), and the resulting scores for the two lungs were added together to form the patient-level mRALE score. Examples of this scoring are demonstrated in Figure 1.

For the purpose of this study, two staff radiologists and a radiology fellow were trained to visually assess CXRs and assign mRALE scores by first scoring a training set of 10 sample CXRs with feedback on how their scores correlated with the group. Then, each radiologist independently assigned an mRALE score to each frontal CXR image from the study cohort. The average mRALE score across all readers was imported into AI-LAB as the reference standard (Figure 2).
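To make the scoring arithmetic described above concrete, the following is a minimal sketch in Python; the function and variable names are our own illustration, not code from the study.

```python
def mrale_score(extent_left, extent_right, density_left, density_right):
    """Compute a patient-level mRALE score.

    extent_*: 0-4 (0 = none, 1 = <25%, 2 = 25-50%, 3 = 50-75%, 4 = >75%)
    density_*: 1-3 (1 = hazy, 2 = moderate, 3 = dense)
    """
    assert 0 <= extent_left <= 4 and 0 <= extent_right <= 4
    assert 1 <= density_left <= 3 and 1 <= density_right <= 3
    # Each lung's extent score is weighted by its density score,
    # then the two lung scores are summed (possible range: 0-24).
    return extent_left * density_left + extent_right * density_right

# Example: right lung 25-50% involvement with moderate density,
# left lung <25% involvement with hazy density -> score of 5.
print(mrale_score(extent_left=1, extent_right=2, density_left=1, density_right=2))
```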
The PXS score model was previously developed at ANONYMIZED HOSPITAL, a large tertiary care hospital, initially using CXRs from admitted patients with COVID-19, and was further fine-tuned using outpatient clinic CXRs at that institution [11, 12]. The model takes a CXR image of interest and compares it with a pool of normal CXR images. A continuous disease severity score is calculated as the median of the Euclidean distances between the image of interest and each image in the pool of normal studies, as the images pass through twinned neural networks. Please see the cited work for details on the design and implementation of this neural network architecture. The model was packaged into a Docker container (Docker Inc., Palo Alto, CA), which could then be loaded onto the AI-LAB platform and imported locally at our institution.

Pearson correlation was used to evaluate the correlation between the mRALE assessments of the different radiologist raters and the correlation between the average mRALE and PXS scores. The Mann-Whitney U test was used to compare PXS scores between groups. Bootstrap 95% confidence intervals (CIs) were calculated for the correlation between the average mRALE and PXS scores and for the area under the receiver operating characteristic curve (AUROC).

Infrastructure: After engaging our radiology and enterprise informatics teams, we identified a need for a dedicated, high-performance GPU-based server in our institution. While only basic GPU-based hardware is required to run pretrained models (also known as "model inference"), we preferred to "future-proof" our investment in computational resources and opted to acquire a high-performance GPU server, which would equip us to retrain and optimize models locally, if desired. To obtain the server, we worked with the customary hospital hardware supplier to ensure that the physical server had a motherboard and power supply sufficient to support the graphics card(s) of choice. After determining the hardware specifications in collaboration with the AI-LAB team and the vendor, we ultimately decided on a rack-based Dell PowerEdge R740XD server with three Nvidia Tesla T4 GPUs (16 GB memory each) and 4 TB of SAS-based solid-state drives for data storage needs. The installation process, including setting up the AI-LAB software and configuring all required Docker containers, was then performed. At the time we first implemented this software, the installation process required more manual steps, since we were the first institution in the United States to adopt this platform. Since then, the installation process has been streamlined with the development of new installer software that requires less manual input. Please see Table 2 for a description of the professional time required for procurement, installation, and configuration.

The AI-LAB team assisted with the upload of the COVID-19 CXR AI model. During the course of the experiment, the authoring institution made improvements to the algorithm. The new model was packaged using the AI-LAB Inference Model Standards [14]. Because the algorithm was packaged using the appropriate model standards, the AI-LAB platform was able to receive the updated Docker container and make it available for subscription in its cloud. We subsequently downloaded the updated model and ran it on our prepared data on our local instance of AI-LAB. Total time spent by our informatics analyst on this step of the collaboration was approximately 2 hours.
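The severity computation described above (median Euclidean distance from a pool of normal studies) can be sketched compactly. The following is a minimal PyTorch illustration under the assumption of a pretrained embedding network named `encoder`; the cited works describe the actual Siamese architecture and training.

```python
import torch

def pxs_score(encoder, image, normal_pool):
    """Continuous severity score, per the PXS model description:
    the median Euclidean distance between the embedding of the image
    of interest and the embeddings of a pool of normal CXRs.

    encoder: a pretrained network mapping a CXR tensor to a feature vector
             (the twinned Siamese branches share these weights)
    image: tensor of shape (1, C, H, W), the CXR of interest
    normal_pool: tensor of shape (N, C, H, W), the pool of normal CXRs
    """
    encoder.eval()
    with torch.no_grad():
        z = encoder(image)                     # (1, D) embedding of the query image
        z_normals = encoder(normal_pool)       # (N, D) embeddings of normal images
        dists = torch.norm(z - z_normals, dim=1)  # Euclidean distance to each normal
    return dists.median().item()               # median distance = severity score
```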
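The bootstrap confidence intervals named in the statistical analysis can be computed along these lines. This is a generic percentile-bootstrap sketch using NumPy, SciPy, and scikit-learn, not the authors' code; the array names in the usage comments are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, mannwhitneyu
from sklearn.metrics import roc_auc_score

def bootstrap_ci(metric, *arrays, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any metric(a, b, ...) over paired NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        stats.append(metric(*(a[idx] for a in arrays)))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical usage with per-patient arrays mrale_avg, pxs, outcome_labels:
# r, _ = pearsonr(mrale_avg, pxs)                              # observed correlation
# r_ci = bootstrap_ci(lambda a, b: pearsonr(a, b)[0], mrale_avg, pxs)
# auc_ci = bootstrap_ci(roc_auc_score, outcome_labels, pxs)    # AUROC CI
# u, p = mannwhitneyu(pxs[admitted], pxs[~admitted])           # group comparison
```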
We imported the frontal-view DICOM files for each of these CXRs from our institution into AI-LAB. For patients with more than one frontal-view CXR image associated with the study accession (e.g., large body habitus or difficult positioning requiring multiple attempts at image acquisition), we manually selected the image that contained the most lung. All clinical data were used for testing, with no retraining or tuning of the model on our institution's data.

AI-LAB has the ability to bulk upload ground truth information (e.g., radiologist-generated labels of disease) and imaging studies. However, other important data set curation functions are still to be developed. Importantly, almost any imaging-based data science project requires selection of the appropriate series of an imaging exam for input into the AI model. Most digital radiography devices and PACS designate each exposure/image as a separate series within a single exam. For our own experiment, we needed to upload the optimal frontal CXR image and ensure that it was also the one on which the readers in this study had based their evaluation. In 90% of our patients there was only a single image, but in 10% there was more than one, owing to acquisition challenges in often critically ill patients. Lacking a universal series selection tool at our institution, we retrieved studies of interest from PACS, batch anonymized them, and manually selected the image series of interest, which, together with activities such as meetings with readers (to ensure reads matched key series) and with ACR team members, required approximately 12 hours of analyst time for the entire cohort. We used a shared anonymized spreadsheet to ensure that the series of interest was communicated unambiguously to the readers. After migrating the data set from the PACS, a script was used to curate the data based on the series description. We identified that some exams had multiple instances of the same projection image due to clipped anatomy, positioning errors, or exposure issues; each such exam was reviewed by a radiologist, who determined the best image to be used for the AI model.

One hundred forty-one COVID-19 RT-PCR-positive patients who had CXRs were included in the study cohort. Patient demographics are summarized in Table 1. Most patients (n = 130, 92%) were imaged in the emergency room setting. Most patients (n = 120, 85%) required hospital admission. A subset of patients (n = 14, 10%) were intubated within 3 days of CXR acquisition. Six patients (4%) died within 3 days of CXR acquisition.

The pairwise correlations between the mRALE scores assigned by the three radiologists who independently assessed each image varied (r = 0.71, 0.78, 0.82). The average of the assigned mRALE scores was used as the reference standard for the deep learning PXS score. The median of the reference standard mRALE scores in this cohort was 2.7 (interquartile range = 1.3-5). The PXS score deployed in AI-LAB correlated with the mRALE score assigned by the radiologist readers (r = 0.80) (Figure 3). The PXS score was significantly higher in patients admitted to the hospital within 3 days of CXR acquisition than in those patients who did not require admission (4.0 versus 1.3, P < .001), and in patients who were intubated or died within 3 days (5.5 versus 3.3, P = .001).
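The series-description-based curation step described above can be partially scripted. The following is a hypothetical sketch using pydicom; the DICOM fields consulted and the frontal-view heuristic are our assumptions, not the authors' actual script.

```python
from pathlib import Path
import pydicom

def candidate_frontal_images(study_dir):
    """Collect likely frontal-view CXR images from one study and flag
    studies that need manual radiologist review because they contain
    more than one frontal image (repeat exposures, clipped anatomy,
    positioning errors)."""
    frontal = []
    for path in sorted(Path(study_dir).glob("*.dcm")):
        ds = pydicom.dcmread(path, stop_before_pixels=True)  # headers only
        view = getattr(ds, "ViewPosition", "")         # e.g., "AP" or "PA"
        desc = getattr(ds, "SeriesDescription", "").upper()
        # Keep AP/PA views; if ViewPosition is absent, fall back to
        # excluding series described as lateral.
        if view in ("AP", "PA") or (not view and "LAT" not in desc):
            frontal.append(path)
    return frontal, len(frontal) > 1  # True -> route to manual review

# Example: paths, needs_review = candidate_frontal_images("/path/to/study")
```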
We demonstrate with this effort that it is feasible to set up and successfully use clinical data science infrastructure without any previous institutional history or dedicated personnel in this field. The overall investment of time and resources was deemed reasonable in return for the outcome we achieved. The inter-institutional effort taught all stakeholders about data science workflow steps and provides further evidence of the potential of an external platform to facilitate radiology practice participation in AI algorithm assessment. This platform offers an opportunity for successful engagement of clinical radiologists in the absence of on-site data scientists and more robust on-site data science infrastructure.

Many radiology-based AI models have been developed since the start of the COVID-19 pandemic, with the hope of improving diagnostic accuracy, speed, and risk assessment. However, for these algorithms to be safely used in clinical practice, they must be deployed and ideally tested locally before providing inferences on live patient data for use in clinical or operational decision making. AI-LAB enabled our practice to do just that. In this study, we successfully used AI-LAB to deploy and test a COVID-19 CXR AI algorithm that had been developed at another institution, showing generalizability of the previously developed external PXS score model on local data obtained at our own institution. The actual model deployment using AI-LAB was accomplished in a matter of days once the system setup had been completed. This demonstrates the feasibility of using AI-LAB to provide expedient solutions for assessment of algorithms across institutions, without the need to send actual source imaging or clinical data outside of the institution to test the model.

In general, the lack of information technology and data science expertise at small and medium-sized institutions like ours might be considered a major hurdle to participation in AI research or application of AI in the radiology workflow. Platforms such as AI-LAB, designed for federated algorithm use, can reduce the barriers to participation. MIDRC, launched in 2020, pursues a different approach by aggregating anonymized data in a central archive. This "centralized" approach also aims at exposing algorithms to a broader sample of data, but through a different architecture that requires moving the data.

The COVID-19 deep learning model that we deployed and tested in AI-LAB in this study shows potential for predicting hospital admission or intubation/death within 3 days of presentation. This may become a useful tool for data-driven resource management within a health system. During the COVID-19 pandemic, many health systems allocated and moved precious resources (e.g., ventilators) based on actual observed patient census. Similarly, ICU bed capacity was managed based on actual current capacity, initiating patient transfers as needed. One could envision a future state in which the repeated, at-scale use of deep learning model-based predictions of the near-term clinical prognosis of affected patients in a given health system could facilitate prospective, predictive management of resources and capacity.

The correlation between the mRALE score and the PXS score in our study was 0.80 (95% CI 0.72-0.86), which is similar to the original study, which reported correlations of 0.86 (95% CI 0.80-0.90) in an internal test set and 0.86 (95% CI 0.79-0.90) in a different external hospital test set [8].
While the 95% confidence intervals do overlap, the possible decrease in model performance could be related to differences in image acquisition technique and patient population. Patients had less severe disease in our own study cohort (median mRALE 2.7) compared with the original study test sets (median mRALE 4.0 and 3.3). We also found in our study that the PXS score can predict subsequent intubation or death within 3 days with an AUROC of 0.73 (95% CI 0.61-0.84), which is less than the AUROC of 0.80 (95% CI 0.75-0.85) reported in the original study, though the bootstrap 95% confidence intervals overlap. The PXS score model was not trained to predict these outcomes (rather, it was trained to evaluate lung disease severity), so it is not surprising that different patient populations may have different outcomes. Also, as new clinical management guidelines and therapeutic options arise, prediction of such outcomes may change. Thus, ongoing testing is needed to ensure that such predictions are updated, which AI-LAB can help to facilitate.

Many AI models developed using curated institutional data demonstrate high performance initially; however, their performance not uncommonly degrades when they are deployed on data generated at a different institution [15]. This variability in generalization performance is a known issue, especially for models created from single-institution data [16]. External platforms such as AI-LAB provide the opportunity for developers to train and test their models on multiple-institution data, which may result in improved generalizability. The ability of each participating institution to optimize and verify model performance on its own data improves the safety profile of the AI model, hence overcoming one of the major hurdles of AI implementation in medicine (i.e., the implementation gap) [17]. Like many issues in health care, the implementation gap became more evident during the COVID-19 pandemic. With heightened interest in this entity, many AI models have been developed, but they have had little to no impact on the pandemic [3-6, 18]. One of the major obstacles in this rapidly changing environment is the current inability of many practices to optimize AI models for their local (data) environment, and external platforms such as AI-LAB may facilitate this activity. Lastly, continuous learning has been proposed as a method to preserve AI model robustness and promote adaptation to changes in the local environment [19]. Engagement of radiology departments in co-developing and testing AI models has been proposed as a way to create an environment for continuous learning of AI models. Platforms such as AI-LAB and MIDRC might facilitate achieving this goal.

There are a few limitations to our study. First, this study involved using AI-LAB at a single institution; assessment at multiple institutions will be important for future scaling of this work. Second, because some patients had multiple frontal CXR images obtained in a single study accession (due to challenging patient positioning or body habitus), we had to manually select which CXR image to load into AI-LAB. This problem of selecting the correct series is a barrier to scaling such models and needs to be addressed in future work. Third, we tested a CXR-based model in this study; models using different modalities such as CT and MRI may pose other challenges for deployment using AI-LAB.
Fourth, the data and images for this experiment were collected during the first surge of the pandemic in the United States, which mostly affected an older patient population. This is almost certainly associated with a higher pretest probability of a poorer outcome from COVID-19 infection (such as intubation and death) than would be expected in a younger population. During the second surge of the pandemic in the United States, relatively younger patients with fewer comorbidities were more frequently affected [20]. The performance of the AI model may have been impacted by this demographic shift.

References
The Tempest
Iteratively Pruned Deep Learning Ensembles for COVID-19 Detection in Chest X-Rays
Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy
COVID-19 on Chest Radiographs: A Multireader Evaluation of an Artificial Intelligence System
DeepCOVID-XR: An Artificial Intelligence Algorithm to Detect COVID-19 on Chest Radiographs Trained and Tested on a Large US Clinical Dataset
ACR Recommendations for the Use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection. American College of Radiology
Pulmonary Disease Severity on Chest Radiographs Using Convolutional Siamese Neural Networks
Initial chest radiographs and artificial intelligence (AI) predict clinical outcomes in COVID-19 patients: analysis of 697 Italian patients
End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays
Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging
Improvement and Multi-Population Generalizability of a Deep Learning-Based Chest Radiograph Severity Score for COVID-19. medRxiv
Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS
Inconsistent Performance of Deep Learning Models on Mammogram Classification
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study
Bridging the implementation gap of machine learning in healthcare
The challenges of deploying artificial intelligence models in a rapidly evolving pandemic
Continuous Learning AI in Radiology: Implementation Principles and Early Applications
Changing Age Distribution of the COVID-19 Pandemic - United States