key: cord-0983090-du989x9v
title: Association of AI quantified COVID-19 chest CT and patient outcome
authors: Fang, Xi; Kruger, Uwe; Homayounieh, Fatemeh; Chao, Hanqing; Zhang, Jiajin; Digumarthy, Subba R.; Arru, Chiara D.; Kalra, Mannudeep K.; Yan, Pingkun
date: 2021-01-23
journal: Int J Comput Assist Radiol Surg
DOI: 10.1007/s11548-020-02299-5
sha: cbf1121fac742f5735e271c70b966bb646e59250
doc_id: 983090
cord_uid: du989x9v

PURPOSE: Severity scoring is a key step in managing patients with COVID-19 pneumonia. However, manual quantitative analysis by radiologists is time-consuming, while qualitative evaluation may be fast but is highly subjective. This study aims to develop artificial intelligence (AI)-based methods to quantify disease severity and predict COVID-19 patient outcome.

METHODS: We develop an AI-based framework that employs deep neural networks to efficiently segment lung lobes and pulmonary opacities. The volume ratio of pulmonary opacities inside each lung lobe gives the severity score of that lobe; these scores are then used to predict ICU admission and mortality with three different machine learning methods. The developed methods were evaluated on datasets from two hospitals (Site A: Firoozgar Hospital, Iran, 105 patients; Site B: Massachusetts General Hospital, USA, 88 patients).

RESULTS: AI-based severity scores are strongly associated with those assigned by radiologists (Spearman's rank correlation 0.837, p < 0.001). Using AI-based scores produced significantly higher (p < 0.05) area under the ROC curve (AUC) values. The developed AI method achieved the best performance of AUC = 0.813 (95% CI [0.729, 0.886]) in predicting ICU admission and AUC = 0.741 (95% CI [0.640, 0.837]) in mortality estimation on the two datasets.

CONCLUSIONS: Accurate severity scores can be obtained using the developed AI methods over chest CT images.
The computed severity scores achieved better performance than radiologists in predicting COVID-19 patient outcome by consistently quantifying image features. The developed severity-assessment techniques may be extended to other lung diseases beyond the current pandemic.

The SARS-CoV-2 (COVID-19) outbreak at the end of 2019, caused by an extremely contagious beta-coronavirus, has spread worldwide and is responsible for the latest pandemic in human history. Prior studies report frequent use of chest computed tomography (CT) in patients with suspected pneumonia, including COVID-19 [1, 4, 10, 14, 16]. Chest CT is often recommended to assess disease severity; findings indicating extensive lobar involvement and consolidation are associated with severe COVID-19 pneumonia. In these clinical studies, radiologists visually assess the condition of each lobe and estimate the extent of opacities in each lung lobe to assign a severity score [7, 15, 18]. The lobar scores are then summed to form the overall evaluation. However, such descriptions are inconsistent, subjective, and suffer from intra- and inter-observer variation. To accurately quantify a patient's condition and relieve clinicians' labor, automated assessment of the extent of pulmonary parenchymal involvement and the distribution of pulmonary opacities is useful, since routine clinical interpretation of chest CT does not quantify the disease burden. To automate quantification of disease distribution and extent of pulmonary involvement, automatic segmentation of lung lobes and pulmonary opacities, and of the distribution of opacities within each lobe, is desired. Tang et al. [11] employed threshold segmentation based on the Hounsfield unit (HU) range of ground glass opacity (GGO) to detect severe disease on chest CT. He et al. [5] proposed a framework for joint lung lobe segmentation and severity assessment of COVID-19 in CT images.
We hypothesized that AI-based automatic segmentation of lung lobes and of the distribution of pulmonary opacities can help assess disease severity and outcomes in patients with COVID-19 pneumonia. This work proposes an AI-assisted severity scoring method based on automated segmentation of the lung lobes and pulmonary opacities. We first analyze the correlation between scores obtained by AI and by radiologists. We then show the statistics and data distribution of patients and derive the correlation between scores and patient outcomes. To quantify the correlation, we establish an evaluation scheme that uses three machine learning models to predict patient outcomes from the severity scores. The deidentified data used in our work were acquired at two hospitals, i.e., Site A: Firoozgar Hospital (Tehran, Iran) and Site B: Massachusetts General Hospital (Boston, MA, USA). All the CT imaging data were from patients who underwent clinically indicated, standard-of-care, non-contrast chest CT. Site A: We reviewed medical records of adult patients admitted with known or suspected COVID-19 pneumonia between February 23, 2020, and March 30, 2020. Of the 117 patients with a positive RT-PCR assay for COVID-19, three were excluded due to the presence of extensive motion artifacts on their chest CT, and one was excluded due to the absence of ICU admission information. Two thoracic subspecialty radiologists (one with 16 years of experience and the other with 14 years of experience) reviewed all CT images without knowledge of clinical features, laboratory data, or patient outcomes. Chest CT images were reviewed with a DICOM image viewer (MicroDicom DICOM Viewer, Sofia, Bulgaria) in lung windows (window width 1500 HU, window level −600 HU).
The radiologists recorded the type of pulmonary opacities (ground glass opacities, mixed ground glass and consolidative opacities, consolidation, organizing pneumonia, nodular pattern, and ground glass opacities with septal thickening (crazy-paving pattern)). The extent of involvement of each lobe (right upper, right middle, right lower, left upper, and left lower) by pulmonary opacities was assessed using a previously described scale (0: 0% lobar involvement; 1: <5% involvement of lobar volume; 2: 5-25%; 3: 26-50%; 4: 51-75%; 5: >75%) [10]. The two thoracic subspecialty radiologists reviewed all CT images independently; any discordance between them was resolved by consensus readout. The total extent of pulmonary opacities was estimated by summing the scores of all lobes (lowest score 0, highest possible score 25) [8, 10, 13]. Since pulmonary opacity (PO) is an important criterion for patient severity assessment, we further evaluated our AI-based quantification scores against the radiologists' manually graded scores in patient outcome prediction. To obtain explainable severity scores, we divide the AI-assisted procedure into the same two steps followed by the radiologists. In the first step, we use a deep learning-based method to automatically segment the lung lobes and pulmonary opacities. Then, severity scores for each patient are computed from the volumes of the lobes and pulmonary opacities. In this way, the severity scores do not depend on the radiologists' annotations. Furthermore, we use the association of severity scores with patient outcome, i.e., ICU admission and mortality risk of patients with COVID-19 pneumonia, as the criterion to compare the severity assessments of AI against those of the radiologists.
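The lobar scale above is mechanical enough to express directly in code. The sketch below (not the authors' implementation; function names are illustrative) maps a fractional lobar involvement to the 0-5 grade and sums the five lobar grades into the 0-25 total:

```python
def grade_lobe(involvement_fraction):
    """Map fractional lobar involvement (0.0-1.0) to the 0-5 scale:
    0 = 0%, 1 = <5%, 2 = 5-25%, 3 = 26-50%, 4 = 51-75%, 5 = >75%."""
    pct = involvement_fraction * 100.0
    if pct == 0:
        return 0
    if pct < 5:
        return 1
    if pct <= 25:
        return 2
    if pct <= 50:
        return 3
    if pct <= 75:
        return 4
    return 5

def total_severity(lobe_fractions):
    """Sum the five lobar grades into the overall 0-25 severity score."""
    return sum(grade_lobe(f) for f in lobe_fractions)
```

For example, a patient with 60% involvement of one lobe and clear remaining lobes would receive a total score of 4.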
This work employs deep neural networks to segment both lungs, the five lung lobes (left upper, left lower, right upper, right middle, right lower) and pulmonary opacity regions of infection from non-contrast chest CT examinations. For network training, we semi-automatically labeled all five pulmonary lobes in 71 CT volumes from Site A using the chest imaging platform [17]. A radiologist (M.K.K.) annotated the lung opacities slice by slice in 105 CT volumes from Site A. For lung lobe segmentation, we adopted the automated lung segmentation method proposed by Hofmanninger et al. [6], whose work provides a trained U-net model for lung segmentation. The U-Net consists of an encoder with regular convolutions and max pooling layers, and a decoder that applies transposed convolutions along with regular convolutions. The network consists of 19 convolutional layers; each convolutional block in the encoder and decoder uses two 3×3 convolutional layers, and a final 1×1 convolutional layer reduces the number of feature map channels to 2. To improve model performance, residual connections between the encoder and decoder were employed. Following their pre-processing step, the HU intensity range is clipped to the window [−1024, 600] and then normalized to [0, 1]. The pre-trained model was fine-tuned with a learning rate of 10^−5 using our annotated data. During tuning, each slice was randomly cropped into patches of size 224 × 224. The tuned model was then applied to segment all the chest CT volumes. Segmentation of pulmonary opacities was performed with our previously proposed method, the Pyramid Input Pyramid Output Feature Abstraction Network (PIPO-FAN) [3], with publicly released source code.
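The clip-and-normalize pre-processing and the random 224 × 224 cropping described above can be sketched as follows (an illustrative NumPy version under the stated window values, not the released code):

```python
import numpy as np

def preprocess_for_lobe_segmentation(ct_slice_hu):
    """Clip HU values to the [-1024, 600] window used for lung lobe
    segmentation and rescale the result to [0, 1]."""
    lo, hi = -1024.0, 600.0
    clipped = np.clip(ct_slice_hu.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)

def random_crop(image, size=224, rng=None):
    """Randomly crop a size x size patch, as in the fine-tuning step."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]
```

Voxels below −1024 HU (e.g., padding outside the scan field) map to 0 and dense bone above 600 HU saturates at 1, so the lung parenchyma occupies a stable sub-interval of the normalized range.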
It applies spatial pyramid pooling on a 2D slice to generate a pyramid input and hierarchically fuses semantically similar features after convolutional blocks. Pyramid features are then adaptively fused via an attention module to obtain the lesion map of the slice. In the pre-processing step, all slices were resampled to a fixed resolution of 256 × 256 pixels, and to improve the contrast of pulmonary opacity, the HU intensity range was clipped to a window of [−1000, 200]. In training, the learning rate was set to 0.002. A softmax activation function with threshold 0.5 was used to obtain the infection area. Morphological operations were applied to refine the segmentation of lung lobes and pulmonary opacities: using a 3×3 kernel that connects only the nearest neighbors to the center, we first perform an opening operation to remove small noisy segmentations and then apply a closing operation with the same kernel to generate a smooth segmentation. Figure 1 shows the segmentation results of lung lobes and pulmonary opacities. From the axial and 3D views, we can see that the segmentation model can smoothly and accurately delineate isolated regions with pulmonary opacities. Based on the area of pulmonary opacities, we then computed the ratio of opacity volume over lobe volume, a widely used measurement of severity [11, 20]. The Dice similarity coefficient (DSC) was used to quantitatively evaluate segmentation accuracy. On the Site A dataset, we obtained DSC scores of 82.5% and 90.0% for lung opacity segmentation and lobe segmentation, respectively. Based on the segmentation described in the section "Deep learning-based image segmentation," the AI-based quantification scores are obtained from the segmentation results following a procedure similar to the radiologists' (see section "Annotation of severity" for more details). First, the RPO (ratio of pulmonary opacities) over each lobe was calculated and graded into 6 levels (0-5).
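The opening-then-closing refinement with a nearest-neighbor 3×3 kernel corresponds to standard binary morphology; a minimal SciPy sketch (an assumed per-slice implementation, not the authors' code):

```python
import numpy as np
from scipy import ndimage

# 3x3 cross-shaped kernel connecting only the nearest neighbors
# to the center pixel (4-connectivity).
KERNEL = ndimage.generate_binary_structure(2, 1)

def refine_mask(mask):
    """Opening removes small noisy components; closing with the same
    kernel then smooths the surviving segmentation."""
    opened = ndimage.binary_opening(mask.astype(bool), structure=KERNEL)
    return ndimage.binary_closing(opened, structure=KERNEL)
```

Isolated single-pixel detections are erased by the opening, while contiguous opacity regions pass through largely intact.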
Then, the final score for a patient was the sum of the scores of the 5 lobes, which ranges from 0 to 25. To show the relationship between the severity scores and patient outcome, patients were divided into 4 groups (Groups I, II, III and IV) based on their final outcome from mild to severe. Groups I and II included recovered patients without admission to ICU: Group I consisted of patients discharged from hospital within 7 days, while Group II included patients with more than 7 days of hospital stay. Group III patients were admitted to the ICU and recovered, whereas Group IV patients succumbed to COVID-19 pneumonia. We then divided the severity scores into four buckets: the patients were first split into two groups at the mean severity score of 15 assigned by the radiologists, and each group's score range was then halved to form the four buckets. We display the statistics of the severity groups in the different buckets and evaluate the correlation between severity buckets and patient outcome using the mean absolute error (MAE). This work establishes an evaluation scheme to quantify the correlation between severity scores and patient outcome. The severity scores are used as input to different machine learning models for predicting patient outcome; the AUC for this task indicates the correlation between scores and patient outcomes. To account for the variations introduced by different prediction targets, models, and datasets, and to obtain a more objective evaluation, we conducted experiments for both ICU admission and mortality prediction with 3 different models (support vector machine (SVM), random forest (RF) and logistic regression (LR)) on the two datasets. Radial basis function kernels were used to construct the SVM models. The squared L2 norm is used as a regularization term for the SVM and LR models. The RF model has 300 trees, and the Gini index is used as the criterion for calculating information gain. We bootstrap the AUCs with 1000 resamples to obtain the 95% confidence intervals.
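The model configuration and bootstrap procedure above can be sketched with scikit-learn (a minimal reconstruction from the stated hyperparameters — RBF SVM, 300-tree Gini forest, L2 logistic regression, 1000 bootstrap resamples — not the authors' exact pipeline):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def make_models(seed=0):
    """The three classifiers described in the text: an RBF-kernel SVM,
    a 300-tree random forest with the Gini criterion, and an
    L2-regularised logistic regression."""
    return {
        "SVM": SVC(kernel="rbf", probability=True, random_state=seed),
        "RF": RandomForestClassifier(n_estimators=300, criterion="gini",
                                     random_state=seed),
        "LR": LogisticRegression(penalty="l2", max_iter=1000),
    }

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """95% CI for the AUC from 1000 bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # need both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In the paper's setting, `y_score` would be each model's predicted probability of ICU admission or death from the five lobar severity scores.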
In addition to the scores estimated by the radiologists and our AI methods, we also include two additional groups of scores: threshold-based scores, and the mixture of the radiologists' scores and the AI-based scores. As HU values in [−750, −300] correspond to GGO regions [2, 11], the threshold-based method regards voxels within this range as pulmonary opacities and further calculates the severity scores from them. (Fig. 2: Correlation between the severity scores assigned by radiologists and computed by our deep learning-based segmentation on the Site A dataset.) To investigate whether the scores of our AI method are complementary to those of the radiologists in patient outcome prediction, we merged their scores (denoted as AI + Radiologists): while all other methods use the 5 scores of the 5 lobes as inputs of the prediction models, AI + Radiologists uses all 10 scores obtained by both the AI method and the radiologists as inputs. The average of the three models' outputs is used to compute the AUC of each scoring method. We conducted a one-tailed z-test [19] on the ROC curves, comparing scores from radiologists with the other severity scoring methods. This section presents the results of the developed techniques. We show the effectiveness of our proposed segmentation-based severity scoring on the two datasets separately through comparison with different severity scoring methods. The results are summarized in three parts. In the first part, we computed the correlation coefficient between severity scores assigned by radiologists and computed by our deep learning-based segmentation to demonstrate good consistency between AI and radiologists. In the second part, we display the number and proportion of patients in each severity group across the buckets defined in section "Severity scoring." The number of patients in each group is overlaid on the corresponding segment.
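The threshold-based baseline reduces to a per-voxel HU test followed by the same per-lobe ratio computation; a minimal sketch (illustrative names, assuming binary lobe masks):

```python
import numpy as np

def threshold_opacity_mask(ct_hu):
    """Voxels with HU in [-750, -300] are treated as ground glass
    opacity, as in the threshold-based baseline [2, 11]."""
    return (ct_hu >= -750) & (ct_hu <= -300)

def lobe_opacity_ratio(ct_hu, lobe_mask):
    """Ratio of thresholded-opacity volume to lobe volume for one lobe."""
    opacity = threshold_opacity_mask(ct_hu) & lobe_mask
    return opacity.sum() / max(lobe_mask.sum(), 1)
```

The resulting per-lobe ratios are then graded on the same 0-5 scale as the other methods, so the baseline differs from the AI method only in how the opacity mask is produced.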
We use the mean absolute error (MAE) between severity score and patient severity group to evaluate the consistency between severity scores and patient outcome. Finally, to further evaluate the association between the scores and patient outcome, we use the different groups of scores for ICU admission and mortality prediction, respectively. We computed the AUC for mortality and ICU admission prediction using the severity scores to evaluate the association between scores and patient outcomes. Cross-dataset validation (training on Site A and testing on Site B; training on Site B and testing on Site A) was performed for each model on each group of severity scores to compute the AUC. The score for a patient is the sum of the scores of the 5 lobes, which ranges from 0 to 25. Similar to the work of Li et al. [9], we evaluated Spearman's rank correlation and the associated p-value to determine the strength of the correlation between scores computed by the AI-assisted method and those assigned by radiologists on the Site A and Site B datasets. Figure 2 shows the correlation between the two types of severity scores; we obtained a considerable positive Spearman's rank correlation of 0.770 on the Site A dataset (p < 0.001). Figure 3 shows the correlation between the two types of severity scores of the 88 patients; we obtained a considerable positive Spearman's rank correlation of 0.837 on the Site B dataset (p < 0.001), indicating that the AI-assisted prediction is consistent with the radiologists. From Fig. 5, we can see that for our AI method, no patients in Group III or Group IV fall in the bucket [20, 25]. For quantitative evaluation, we then computed the MAE between severity score buckets and patient severity groups, pairing each bucket of severity scores with one severity group. The MAE for radiologists and AI quantification computed over patient counts is 88 and 84, respectively; computed over proportions it is 3.51 and 2.65.
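The agreement analysis is a direct application of Spearman's rank correlation; a one-line SciPy sketch (the wrapper name is illustrative):

```python
from scipy.stats import spearmanr

def score_agreement(ai_scores, radiologist_scores):
    """Spearman's rank correlation and p-value between AI-computed and
    radiologist-assigned severity scores for the same patients."""
    rho, p = spearmanr(ai_scores, radiologist_scores)
    return rho, p
```

Because Spearman's correlation operates on ranks, it is insensitive to any monotone recalibration between the two scoring scales, which is appropriate when comparing a continuous AI ratio against a discretized radiologist grade.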
Our AI method obtained MAE comparable to the radiologists', i.e., the difference in the MAE scores is not statistically significant. Figures 6 and 7 display the number and proportion of the 88 patients in the Site B dataset falling into the different categories based on the computed severity scores, given by radiologists and by our AI algorithm, respectively. (In Figs. 6 and 7, the widths of the bars represent the proportion of patients: mild patients in green for Group I and light green for Group II; severe patients in light red for Group III and red for Group IV.) The MAE for radiologists and AI quantification computed over patient counts is 89 and 89, respectively; computed over proportions it is 3.88 and 3.38 for the radiologists and the AI, respectively. That means the AI-assisted method achieved consistency with the patient severity groups comparable to that of the radiologists. In Fig. 7, we can see that for our AI method, the proportion of severe patients (Groups III and IV) increases monotonically with the severity score, while the radiologists' result dips at the second score bucket. Table 3 summarizes the AUCs and 95% confidence intervals of the three machine learning models on ICU admission and mortality. The three machine learning models were trained on Site B and tested on Site A. The bold values represent the best AUCs. We can see that AI achieved higher AUC than radiologists under all models and tasks. We further use the mean of the scores from the three models as a simple ensemble strategy to compute the AUC value for each severity scoring method. ROC curves for ICU admission and mortality prediction are shown in Fig. 8a, b. AI obtains the best AUCs of 0.755 and 0.723 on ICU admission and mortality, respectively.
A one-tailed z-test is used to evaluate the statistical significance of differences between radiologists and the other scoring methods. The threshold-based method outperforms radiologists in ICU admission with p = 0.031, although performance in mortality prediction was not significantly different (p = 0.426). AI significantly outperforms radiologists with p = 0.044 < 0.05 in ICU admission and p = 0.031 < 0.05 in mortality. The same set of experiments was repeated on the Site B dataset, with the three machine learning models trained on Site A and tested on Site B. Table 4 summarizes the AUCs of the different models with 95% confidence intervals for ICU admission and mortality on the Site B dataset. The bold values represent the best AUCs. We can see that AI achieved the best AUC under all models and tasks. ROC curves for ICU admission and mortality prediction are shown in Fig. 9a, b. AI obtains the best AUCs of 0.813 and 0.741 on ICU admission and mortality, respectively. A one-tailed z-test is again used to evaluate the statistical significance of differences between radiologists and the other scoring methods. Radiologists outperform the threshold-based method in ICU admission with p = 0.016; both methods had similar performance for mortality prediction (p = 0.060). AI significantly outperforms radiologists with p = 0.022 < 0.05 in ICU admission and p = 0.045 < 0.05 in mortality. In current clinical radiology practice, radiologists do not perform a quantitative or semiquantitative assessment of disease severity or of the distribution of pulmonary opacities. This lack of quantification is related to the fact that they are neither trained nor required to assign severity scores in patients with pneumonia. While in patients with cancer and focal lesions radiologists measure and compare single or volumetric dimensions of focal lesions, in the diffuse and ill-defined disease patterns found in pneumonia such measurements are neither feasible nor practical.
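A simplified form of the one-tailed z-test on AUCs can be sketched as follows. Note this treats the two AUC estimates as independent normals with known standard errors, whereas the test of [19] applied to paired ROC curves additionally accounts for the correlation between them; the sketch is illustrative only:

```python
import math

def one_tailed_z_test(auc1, se1, auc2, se2):
    """Simplified one-tailed z-test for H1: AUC1 > AUC2, assuming
    independent, approximately normal AUC estimates. The standard
    errors could come from bootstrap resampling."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # one-tailed p-value from the standard normal survival function
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p
```

With correlated curves computed on the same patients, this independence assumption overstates the variance of the difference, so the sketch gives a conservative p-value relative to a paired test.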
As a result, radiology reports in patients with COVID-19 pneumonia are limited to a semantic description of extent (such as diffuse, multifocal, or localized) and type of opacities rather than an assigned severity score. However, prior studies with subjective severity scores from both chest radiography and CT report on their ability to predict disease severity and patient outcome [12, 16]. Threshold segmentation can detect coarse opacities to some degree, but it is known to be less accurate than AI-based methods. The study confirms that the AI-based severity scoring method yields AUCs in ICU admission and mortality prediction, on the Site A and B datasets, that exceed those of radiologists. Considering the large differences in patient demographics and clinical protocols at the two participating sites, the results highlight the robustness of the AI-based method and the generalization ability of the extracted scores. The statistical test of differences between ROC curves using scores from AI and radiologists suggests that AI significantly outperforms radiologists (A → B: p = 0.022 on ICU admission, p = 0.045 on mortality; B → A: p = 0.044 on ICU admission, p = 0.031 on mortality). The results indicate that the AI-assisted severity score correlates more strongly with patient outcomes than the radiologists' scores, which could effectively improve the accuracy of the severity scoring system. We explored the potential of combining the scores of AI and radiologists to further improve outcome prediction for COVID-19 pneumonia. Both score concatenation and summation were tested; however, compared with the AI-based method alone, the combination of AI and radiologist scores did not yield a significant improvement. We also investigated other machine learning models, the decision tree (DT) and multilayer perceptron (MLP), to predict patient outcomes; these two models did not perform as well as the three models reported in this paper.
Besides the choice of classification model, the segmentation quality of pulmonary lobes and opacities may be an additional factor limiting the performance of outcome prediction. More accurate image segmentation will give more precise lung opacity quantification, which, in turn, may further improve the performance of our AI method. Also, ancillary imaging findings not assessed by our segmentation tool, such as coronary artery disease, cardiac enlargement, pleural effusions, co-existing bronchial wall thickening or emphysema, and mediastinal lymphadenopathy, may have affected disease outcome and thus decreased the performance of our models. The AUC obtained for mortality prediction is lower than for ICU admission prediction. Since mortality is influenced by several factors, including comorbidities, local treatment conditions, and disease exacerbation, mortality prediction is more challenging than ICU admission prediction. However, the AI method still consistently improves the AUC over radiologists for this task by 10%, which further demonstrates the effectiveness of the scores extracted by AI. Although scores from radiologists may not be as accurate as AI, they may convey the overall condition of severe patients. For example, although not included in our study, radiologists assessed findings beyond pulmonary opacities such as architectural distortion in the lungs, pleural effusions, mediastinal/hilar lymphadenopathy, cardiac enlargement, and increased subcutaneous fat stranding or attenuation suggestive of anasarca and underlying fluid overload. Li et al. [9] showed that such findings can be learned by AI algorithms and can help longitudinal disease evaluation in COVID-19 pneumonia. For future work, we will continue to explore AI algorithms that can efficiently incorporate the radiologists' knowledge of the overall patient condition.
It is also worth noting that the developed severity-assessment techniques may extend to other lung diseases beyond the current pandemic. As the world awaits disease prevention and specific treatment for those infected with COVID-19, it is important that future versions of AI-based time-to-event measures assess the extent of chronic changes in the substantial patient population who had an initial recovery but may have long-lasting sequelae from their infection. This paper has proposed an AI-assisted severity scoring method based on automatic segmentation of lung lobes and pulmonary opacities. The severity scores obtained by AI were shown to be consistent with those obtained by radiologists on the two datasets. We further quantitatively evaluated the scores against patient outcomes and demonstrated that the AI segmentation-based method was significantly more accurate than current severity scoring by radiologists alone. The results suggest that AI can significantly improve outcome prediction for patients with severe COVID-19 pneumonia. We believe such techniques have potential for numerous clinical applications involving COVID-19 pneumonia.
References
[1] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
[2] Integrative analysis for COVID-19 patient outcome prediction
[3] Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction
[4] Turkbey B (2020) Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets
[5] Synergistic learning of lung lobe segmentation and hierarchical multi-instance classification for automated severity assessment of COVID-19 in CT images
[6] Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem
[7] The clinical and chest CT features associated with severe and critical COVID-19 pneumonia
[8] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
[9] Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional Siamese neural networks
[10] Lung infection quantification of COVID-19 in CT images with deep learning
[11] Severity assessment of coronavirus disease 2019 (COVID-19) using quantitative features from chest CT images
[12] Clinical and chest radiography features determine patient outcomes in young and middle age adults with COVID-19
[13] Frequency and distribution of chest radiographic findings in COVID-19 positive patients
[14] Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing
[15] Clinical and high-resolution CT features of the COVID-19 infection: comparison of the initial and follow-up changes
[16] Chest CT severity score: an imaging tool for assessing severe COVID-19
[17] Application of the 3D Slicer chest imaging platform segmentation algorithm for large lung nodule delineation
[18] Relation between chest CT findings and clinical conditions of coronavirus disease (COVID-19) pneumonia: a multicenter study
[19] Statistical methods in diagnostic medicine
[20] Joint prediction and time estimation of COVID-19 developing severe symptoms using chest CT scan
Acknowledgements This work was partially supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under Award R21EB028001 and the National Heart, Lung, and Blood Institute (NHLBI) under Award R56HL145172.

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval The study received IRB approval at both participating sites. The need for informed consent was waived due to the retrospective nature of the study.

Informed consent No informed consent was required for the work reported in this manuscript.