key: cord-0187447-hyk93ons authors: Chassagnon, Guillaume; Vakalopoulou, Maria; Battistella, Enzo; Christodoulidis, Stergios; Hoang-Thi, Trieu-Nghi; Dangeard, Severine; Deutsch, Eric; Andre, Fabrice; Guillo, Enora; Halm, Nara; Hajj, Stefany El; Bompard, Florian; Neveu, Sophie; Hani, Chahinez; Saab, Ines; Campredon, Alienor; Koulakian, Hasmik; Bennani, Souhail; Freche, Gael; Lombard, Aurelien; Fournier, Laure; Monnier, Hippolyte; Grand, Teodor; Gregory, Jules; Khalil, Antoine; Mahdjoub, Elyas; Brillet, Pierre-Yves; Ba, Stephane Tran; Bousson, Valerie; Revel, Marie-Pierre; Paragios, Nikos title: AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia date: 2020-04-20 journal: nan DOI: nan sha: 1ed171376f8afb9d629b8f623c910947a1105e6e doc_id: 187447 cord_uid: hyk93ons Chest computed tomography (CT) is widely used for the management of Coronavirus disease 2019 (COVID-19) pneumonia because of its availability and rapidity. The standard of reference for confirming COVID-19 relies on microbiological tests, but these tests might not be available in an emergency setting and their results are not immediately available, contrary to CT. In addition to its role in early diagnosis, CT has a prognostic role by allowing visual evaluation of the extent of COVID-19 lung abnormalities. The objective of this study is to address the prediction of short-term outcomes, especially the need for mechanical ventilation. In this multi-centric study, we propose an end-to-end artificial intelligence solution for automatic quantification and prognosis assessment, combining automatic CT delineation of lung disease that meets the performance of experts with data-driven identification of biomarkers for its prognosis. AI-driven combination of clinical variables with CT-based biomarkers offers perspectives for optimal patient management given the shortage of intensive care beds and ventilators.

Figure 2. Comparison between automated and manual segmentations. Delineation of the diseased areas on chest CT in a COVID-19 patient. First row: input, AI segmentation, expert I segmentation, expert II segmentation. Second row: box-plot comparisons in terms of Dice similarity and Hausdorff distance between the AI solution, expert I and expert II, and plot of the correlation between disease extent measured automatically and the average disease extent from the 2 manual segmentations. Disease extent is expressed as the percentage of lung affected by the disease. Third row: statistical measures comparing the AI, expert I and expert II segmentations.

Previous studies have used deep learning to quantify COVID-19 disease extent on CT, but none of them used a multi-centric cohort while providing comparisons with segmentations done by radiologists 18,19. Disease extent is the only parameter that can be visually estimated on chest CT to quantify disease severity 4,5, but visual quantification is difficult and usually coarse. Several AI-based tools have recently been developed to quantify interstitial lung diseases (ILD) 20-23, which share common CT features with COVID-19 pneumonia, especially a predominance of ground glass opacities. In this study, we investigated a fully automatic method (Figure 1) for disease quantification, staging and short-term prognosis. The approach relied on (i) a disease quantification solution that exploited 2D and 3D convolutional neural networks using an ensemble method, (ii) a biomarker discovery approach that sought to determine the shared space of features most informative for staging and prognosis, and (iii) a robust ensemble supervised classification method to distinguish patients with severe versus non-severe short-term outcomes and, among severe patients, those intubated from those who did not survive. In the context of this work, we report a deep learning-based segmentation tool to quantify COVID-19 disease and lung volume. For this purpose, we used an ensemble network approach inspired by the AtlasNet framework 22.
We investigated a combination of 2D slice-based 24 and 3D patch-based 25 ensemble architectures. The development of the deep learning-based segmentation solution was done on the basis of a multi-centric cohort of 478 unenhanced chest CT scans (208,668 slices) of COVID-19 patients with positive RT-PCR.

Figure 3. Spider-chart distribution of features depicting their minimum and maximum values [mean value (blue), 70th percentile (yellow) and 90th percentile (red) lines] with respect to the different outcomes, in the following order: top: non-severe; bottom left: intensive care support; bottom right: deceased, in the testing set. White and red circles represent respectively 40% and 60% of the maximum value of each feature. Clear separation was observed in this feature space between the non-severe and severe cases. Between deceased and intensive care patients, notable differences were observed with respect to three variables: the age of the patient, the condition of the healthy lung and the non-uniformity of the disease (indicated in gray in the spider-chart).

The multi-centric dataset was acquired at 6 hospitals, equipped with 4 different CT models from 3 different manufacturers, with different acquisition protocols and radiation doses (Table 1). Fifty CT exams from 3 centers were used for training and 130 CT exams from 3 other centers were used for testing (Table 2). Disease and lung were delineated on all 23,423 images used as the training dataset, and on only 20 images per exam, but by 2 independent annotators, in the test dataset (2,600 images). The overall annotation effort took approximately 800 hours and involved 15 radiologists with 1 to 7 years of experience in chest imaging. The agreement between manual (2 annotators) and automated segmentation was measured using the Dice similarity score (DSC) 26 and the Hausdorff distance (HD). CovidENet performed as well as trained radiologists in terms of DSC and better in terms of HD (Figure 2).
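As an illustrative sketch (not the paper's implementation), the two agreement metrics can be computed from binary masks with NumPy and SciPy; the function names and the voxel-spacing argument are assumptions:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0

def hausdorff(a, b, spacing=1.0):
    """Symmetric Hausdorff distance between the voxel sets of two masks,
    scaled by an (assumed isotropic) voxel spacing in mm."""
    pa = np.argwhere(a).astype(float) * spacing
    pb = np.argwhere(b).astype(float) * spacing
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])
```

Identical masks give a DSC of 1 and an HD of 0; a mask shifted by one voxel against itself gives an HD of one voxel spacing.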
The mean/median DSCs between the two experts' annotations on the test dataset were 0.70/0.72 for disease segmentation. For the same task, DSCs between CovidENet and the manual segmentations were 0.69/0.71 and 0.70/0.73. In terms of HD, the observed average value between the two experts was 9.16 mm, while it was 8.96 mm between CovidENet and the two experts. When looking at disease extent, defined as the percentage of lung affected by the disease, we found no significant difference between the automated segmentation and the average of the two manual segmentations. To assess the prognostic value of chest computed tomography (CT), an extended multi-centric dataset was built. We reviewed outcomes in patient charts within the 4 days following chest CT and divided the patients into 3 groups: those who did not survive, those who required mechanical ventilation and those who were still alive and not intubated. Of the 478 included patients, 27 died (6%) and 83 were intubated (17%), forming a group of 110 patients with severe short-term outcome (23%). Data from 383 patients from 3 centers were used for training, and those of 85 patients from 3 other centers composed an independent test dataset (Table 3). Radiomics-based prognosis has gained significant attention in recent years for predicting treatment outcomes 27. In this study we adopted a similar strategy: we extracted 107 features related to first order statistics, higher order statistics, texture and shape information for the lungs, disease extent and heart. Feature selection was performed on the basis of a predictive value consensus. We created several representative partitions of the training set (80% training and 20% validation) and ran 13 different supervised classification methods towards optimal separation of the observed clinical ground truth between severe and non-severe cases (Table 4).
The features shared between the different classifiers were retained as robust imaging biomarkers, using a cut-off probability of 0.25, and were aggregated with patients' age and gender (Table 5). In total, 12 features were retained for the prognosis part, including age, gender, disease extent, descriptors of disease heterogeneity and extension, features of the healthy lung and a descriptor of cardiac heterogeneity. Correlations between some of these features and the clinical outcome are presented in Figure 5, while a representation of this feature space with respect to the different classes is presented in Figure 3. Staging/prognosis was implemented using a hierarchical classification principle, targeting first staging and subsequently prognosis. The staging component sought to separate patients with severe and non-severe short-term outcomes, while the prognosis component sought to predict the risk of death among severe patients. On the basis of the feature selection step, the machine learning algorithms that had a balanced accuracy greater than 60% on validation were considered. The selection of these methods was done on the basis of minimum discrepancy between performance on the training and internal validation subsets of the training dataset. We built two sequential classifiers using this ensemble method, one to determine the severe cases and a second to predict survival. The classifier aiming to separate patients with severe and non-severe short-term outcomes had a balanced accuracy of 74%, a weighted precision of 79%, a weighted sensitivity of 69% and a specificity of 79% for predicting a severe short-term outcome (Figure 4, Table 6). The performance of the second classifier, aiming to differentiate between intubated and deceased patients, was even higher, with a balanced accuracy of 81% (Figure 4, Table 7).
The hierarchical classifier combining the 3 classes had a balanced accuracy of 68%, a weighted precision of 79%, a weighted sensitivity of 67% and a specificity of 83% (Figure 4). The difference in prognosis performance between training and external cohort testing was low, suggesting that the most important information present in the CT scans was recovered, and that additional information should be integrated in order to fully explain the outcome. In conclusion, artificial intelligence enhanced the value of chest CT by providing fast, accurate and precise disease extent quantification and by helping to identify patients with severe short-term outcomes. This could be of great help in the current context of the pandemic, with healthcare resources under extreme pressure. In a context where the sensitivity of RT-PCR has been shown to be low, for example 63% when performed on nasal swabs 28, chest CT has been shown to provide higher sensitivity for the diagnosis of COVID-19 compared with initial RT-PCR from pharyngeal swab samples 10. The current COVID-19 pandemic requires the implementation of rapid clinical triage in healthcare facilities to categorize patients into different urgency categories 29, often in a context of limited access to biological tests. Beyond the diagnostic value of CT for COVID-19, our study suggests that AI should be part of the triage process. The developed tool will be made publicly available. Our prognosis and staging method achieved state-of-the-art results through the deployment of a highly robust ensemble classification strategy with automatic selection of imaging biomarkers and patients' characteristics available within the image metadata. In terms of future work, the continuous enrichment of the database with new examples is a necessary action, on top of updating the outcomes of the patients included in the study.
The integration of non-imaging data and other related clinical and categorical variables such as lymphopenia, D-dimer level and comorbidities 9,30-32 is a necessity towards better understanding the disease and predicting the outcomes. This is clearly demonstrated by the inability of any of the state-of-the-art classification methods (including neural networks and multi-layer perceptron models) to predict the outcome with a balanced accuracy greater than 80% on the training data. Our findings could have a strong impact in terms of (i) patient stratification with respect to the different therapeutic strategies, (ii) accelerated drug development through rapid, reproducible and quantified assessment of treatment response at the different mid/end-points of a trial, and (iii) continuous monitoring of patients' response to treatment. This retrospective multi-center study was approved by our Institutional Review Board (AAA-2020-08007), which waived the need for patients' consent. Patients diagnosed with COVID-19 from March 4th to 29th at six large university hospitals were eligible if they had a positive RT-PCR and signs of COVID-19 pneumonia on unenhanced chest CT. A total of 478 patients formed the full dataset (208,668 CT slices). Only one CT examination was included for each patient. Exclusion criteria were (i) contrast medium injection and (ii) important motion artifacts. For the COVID-19 radiological pattern segmentation part, 50 patients from 3 centers (A: 20 patients; B: 15 patients; C: 15 patients) were included to compose the training and validation dataset, and 130 patients from the remaining 3 centers (D: 50 patients; E: 50 patients; F: 30 patients) were included to compose the test dataset (Table 2). The proportion between CT manufacturers in the datasets was pre-determined in order to maximize model generalizability while taking into account the data distribution.
For the radiomics-driven prognosis study, 298 additional patients from centers A (96 patients), B (64 patients) and D (138 patients) were included to increase the size of the dataset. Data from 383 patients from 3 centers (A, B and D) were used for training, and those of 85 patients from the 3 other centers (C, E, F) composed an independent test set (Table 3). Only one CT examination was included for each patient. Exclusion criteria were (i) contrast medium injection and (ii) important motion artifacts. For short-term outcome assessment, patients were divided into 2 groups: those who died or were intubated in the 4 days following the CT scan composed the severe short-term outcome subgroup, while the others composed the non-severe short-term outcome subgroup. Chest CT exams were acquired on 4 different CT models from 3 manufacturers (Aquilion Prime from Canon Medical Systems, Otawara, Japan; Revolution HD from GE Healthcare, Milwaukee, WI; Somatom Edge and Somatom AS+ from Siemens Healthineers, Erlangen, Germany). The different acquisition and reconstruction parameters are summarized in Table 1. CT exams were mostly acquired at 120 kVp (n=103/180; 57%) and 100 kVp (n=76/180; 42%). Images were reconstructed using iterative reconstruction with a 512 × 512 matrix and a slice thickness of 0.625 or 1 mm depending on the CT equipment. Only the lung images reconstructed with high-frequency kernels were used for analysis. For each CT examination, the dose length product (DLP) and volume computed tomography dose index (CTDIvol) were collected. Fifteen radiologists (GC, TNHT, SD, EG, NH, SEH, FB, SN, CH, IS, HK, SB, AC, GF and MB) with 1 to 7 years of experience in chest imaging participated in the data annotation, which was conducted over a 2-week period. For the training and validation set for the COVID-19 radiological pattern segmentation, whole CT examinations were manually annotated slice by slice using the open-source software ITK-SNAP 1.
On each of the 23,423 axial slices composing this dataset, all the COVID-19-related CT abnormalities (ground glass opacities, band consolidations and reticulations) were segmented as a single class. Additionally, the whole lung was segmented to create another class (lung). To facilitate the collection of the ground truth for the lung anatomy, a preliminary lung segmentation was performed with the Myrian XP-Lung software (version 1.19.1, Intrasense, Montpellier, France) and then manually corrected. As for the test cohort for segmentation, 20 CT slices equally spaced from the superior border of the aortic arch to the lowest diaphragmatic dome were selected to compose a 2,600-image dataset. Each of these images was systematically annotated by 2 of the 15 participating radiologists, who performed the annotation independently. Annotation consisted of manual delineation of the disease and manual segmentation of the lung without using any preliminary lung segmentation. The segmentation tool was built under the paradigm of ensemble methods, using a 2D fully convolutional network together with the AtlasNet framework 22 and a 3D fully convolutional network 25. The AtlasNet framework combines a registration stage of the CT scans to a number of anatomical templates and consequently utilizes multiple deep learning-based classifiers trained for each template. At the end, the prediction of each model is mapped back to the original anatomy and a majority voting scheme is used to produce the final prediction, combining the results of the different networks. A major advantage of the AtlasNet framework is that it incorporates natural data augmentation by registering each CT scan to several templates. Moreover, the framework is agnostic to the segmentation model that is utilized. For the registration of the CT scans to the templates, an elastic registration framework based on Markov random fields was used, providing the optimal displacements for each template 33.
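The template-level majority voting described above can be sketched as follows, assuming each template's prediction has already been mapped back to the original anatomy; the function name is hypothetical:

```python
import numpy as np

def majority_vote(predictions):
    """Fuse per-template binary segmentations (a list of equally shaped
    arrays) by voxel-wise majority voting: a voxel is labeled diseased
    when more than half of the template networks predict it as such."""
    stacked = np.stack(predictions, axis=0)   # (n_templates, ...) vote volume
    votes = stacked.sum(axis=0)
    return (votes * 2 > stacked.shape[0]).astype(np.uint8)
```

With an odd number of templates (6 are used in the paper, so ties are possible and resolved here in favor of background), the fused mask simply keeps the label most networks agreed on.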
The architecture of the implemented segmentation models was based on already established fully convolutional neural network designs from the literature 24,25. Fully convolutional networks following an encoder-decoder architecture in both 2D and 3D were developed and evaluated. For the 2D models, the CT scans were processed slice-wise in the axial view. The network included 5 convolutional blocks, each containing two successive Conv-BN-ReLU layers. Max-pooling layers were placed at the end of each convolutional block in the encoding part. Transposed convolutions were used in the decoding part, together with the same successions of layers, to restore the spatial resolution of the slices. For the 3D pipeline, the model similarly consisted of five blocks, with a down-sampling operation applied after every two consecutive Conv3D-BN-ReLU layers. Additionally, five decoding blocks were utilized for the decoding path; at each block a transposed convolution was performed in order to up-sample the input. Skip connections were also employed between the encoding and decoding paths. In order to train this model, cubic patches of size 64 × 64 × 64 were randomly extracted within a close range of the ground truth annotation border. Corresponding cubic patches were also extracted from the ground truth annotation masks and the lung anatomy segmentation masks. The model was trained with the CT scan patch as input, the annotation patch as target, and the lung anatomy patch as a mask for calculating the loss function only within the lung region. In order to train all the models, each CT scan was normalized by clipping the Hounsfield units to the range [−1024, 1000]. Regarding implementation details, 6 templates were used for the AtlasNet framework, together with normalized cross-correlation and mutual information as similarity metrics.
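The Hounsfield-unit normalization step can be illustrated as below; the rescaling to [0, 1] after clipping is an assumption, as the text only specifies the clipping range:

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 1000.0  # clipping range stated in the text

def normalize_ct(volume):
    """Clip Hounsfield units to [-1024, 1000], then (assumed) min-max
    rescale the clipped values to [0, 1] before feeding the network."""
    v = np.clip(np.asarray(volume, dtype=np.float32), HU_MIN, HU_MAX)
    return (v - HU_MIN) / (HU_MAX - HU_MIN)
```

Values below −1024 HU (air) map to 0 and values above 1000 HU (dense bone, metal) map to 1, so extreme intensities no longer dominate the network input.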
The 2D networks were trained using a weighted cross-entropy loss, with weights depending on the frequency of each class, combined with a Dice loss, while the 3D network was trained using a Dice loss alone. The Dice loss (DL) and weighted cross entropy (WCE) are defined as follows:

DL(p, g) = 1 − (2 Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i),

WCE(p, g) = −(1/N) Σ_i [β g_i log(p_i) + (1 − g_i) log(1 − p_i)],

where p is the value predicted by the network, g the target/ground truth value, N the number of voxels and β the weight given to the less represented class. For network optimization, only the diseased-region class was used. For the 2D experiments we used classic stochastic gradient descent with an initial learning rate of 0.01, a learning rate decrease of 2.5 · 10⁻³ every 10 epochs, a momentum of 0.9 and a weight decay of 5 · 10⁻⁴. For the 3D experiments we used AMSGrad and a learning rate of 0.001. Training a single network, for both the 2D and the 3D architecture, was completed in approximately 12 hours on a GeForce GTX 1080 GPU, while prediction for a single CT scan took a few seconds. Training and validation curves for one template of AtlasNet and for the 3D network are shown in Figure 6. Both the Dice similarity score and the Hausdorff distance were higher with the 2D approach than with the 3D approach (Figure ??). However, the combination of their probability scores led to a significant improvement; thus, the ensemble of 2D and 3D architectures was selected for the final COVID-19 segmentation tool. Moreover, segmentation masks of the lung and heart of all patients were extracted using the ART-Plan software (TheraPanacea, Paris, France). ART-Plan is a CE-marked solution for automatic annotation of organs, harnessing a combination of anatomically preserving and deep learning concepts. This software was trained using a combination of a transformation loss and an image loss. The transformation loss penalizes the normalized error between the prediction of the network and the affine registration parameters describing the registration between the source volume and the whole-body scan.
These parameters are determined automatically using a downhill simplex optimization approach. The second loss function of the network involved an image similarity function, the zero-normalized cross-correlation loss, which seeks to create an optimal visual correspondence between the observed CT values of the source volume and the corresponding ones in the whole-body CT reference volume. This network was trained using as input 360,000 pairs of CT scans of all anatomies and whole-body CT scans. These projections were used to determine the organs present in the test volume. Using the transformation between the test volume and the whole-body CT, we were able to determine a surrounding patch for each organ present in the volume. These patches were used to train the deep learning model for each whole-body CT. The next step consisted of creating multiple annotations in the different reference spaces; for that, a 3D fully convolutional architecture was trained for every reference anatomy. This architecture takes as input the annotations for each organ once mapped to the reference anatomy and then seeks to determine, for each anatomy, a network that can optimally segment the organ of interest, similar to the AtlasNet framework used for disease segmentation. This was applied to every organ of interest present in the input CT scan. On average, 6,600 samples were used for training per organ after data augmentation. These networks were trained using a conventional Dice loss. The final organ segmentation was achieved through a winner-takes-all approach over an ensemble of networks. For each organ, and for each whole-body reference CT, a specific network was built, and the segmentation masks generated by each network were mapped back to the original space. The consensus of the recommendations of the different subnetworks was used to determine the optimal label at the voxel level.
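The Dice loss and weighted cross-entropy used earlier for the segmentation networks can be sketched in NumPy as follows; the smoothing constant eps is an addition for numerical stability, not part of the paper's definitions:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """DL(p, g) = 1 - 2*sum(p*g) / (sum(p) + sum(g)), smoothed by eps.
    p: predicted probabilities, g: binary ground truth."""
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)

def weighted_cross_entropy(p, g, beta=2.0, eps=1e-7):
    """WCE(p, g) = -mean(beta*g*log(p) + (1-g)*log(1-p)); beta up-weights
    the less represented (diseased) class. beta=2.0 is illustrative."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(beta * g * np.log(p) + (1.0 - g) * np.log(1.0 - p))
```

A perfect prediction drives both losses to (near) zero, while completely disjoint masks drive the Dice loss to 1, which is what makes it a usable overlap-based training objective.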
As a preprocessing step, all images were resampled by cubic interpolation to obtain isotropic voxels of 1 mm. Subsequently, the disease, lung and heart masks were used to extract 107 radiomic features 34 for each of them (the left and right lungs were considered separately, both for the disease extent and for the entire lung). The features included first order statistics, 2D and 3D shape-based features, and texture-based features. The radiomic features were enriched with clinical data available from the image metadata (age, gender), disease extent and the number of diseased regions. The minimum and maximum values were calculated on the training and validation cohorts and min-max normalization was used to normalize the features; the same values were then applied to the test set. As a first step, a number of features were selected using a lasso linear model in order to decrease the dimensionality. The lasso estimator seeks to optimize the following objective function:

min_w (1 / (2n)) ||y − Xw||²_2 + α ||w||_1,

where α is a constant, ||w||_1 is the L1-norm of the coefficient vector and n is the number of samples. The lasso method was used with 200 alphas along a regularization path of length 0.01, limited to 1000 iterations. The staging/prognosis component was addressed using an ensemble learning approach. First, the training dataset was subdivided into training and validation sets on an 80%−20% basis, while ensuring that the distribution of classes between the two subsets was identical to the observed one. The selected features included first order features (maximum attenuation, skewness and 90th percentile), shape features (surface, maximum 2D diameter per slice and volume) and texture features (non-uniformity of the GLSZM and GLRLM).
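The min-max normalization and lasso-based dimensionality reduction can be sketched with scikit-learn on synthetic data; the fixed alpha and the data below are illustrative, not the paper's 200-alpha regularization path:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                          # stand-in for radiomic features
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)  # outcome driven by features 0 and 3

# Min-max normalization is fitted on the training cohort only; the same
# learned min/max would then be applied unchanged to the test cohort.
scaler = MinMaxScaler().fit(X)
Xn = scaler.transform(X)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so the surviving indices form the reduced feature set.
lasso = Lasso(alpha=0.05, max_iter=1000).fit(Xn, y)
selected = np.flatnonzero(lasso.coef_)
```

On this toy problem the truly predictive features (indices 0 and 3) survive the penalty, which is the behavior the feature-selection step relies on.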
Subsequently, this reduced feature space was considered most appropriate for training, and 7 classification methods with acceptable performance (balanced accuracy greater than 60%) and coherent performance between training and validation (a decrease in balanced accuracy of less than 20%) were trained and combined through a winner-takes-all approach to determine the optimal outcome (Table 4). The selected methods were the linear, polynomial-kernel and radial basis function (RBF) kernel support vector machines, decision trees, random forests, AdaBoost and Gaussian naive Bayes. To overcome the imbalance between the different classes, each class received a weight inversely proportional to its size. The polynomial-kernel support vector machine used a kernel of degree 3, and all three support vector machines used a penalty parameter of 0.25; in addition, the RBF-kernel support vector machine used a kernel coefficient of 3. The decision tree classifier was limited to a depth of 3 to avoid overfitting. The random forest classifier was composed of 8 such trees. The AdaBoost classifier was based on a decision tree of maximal depth 2, boosted three times. The classifiers were applied hierarchically, performing first the staging and then the prognosis. More specifically, majority voting was applied to classify patients into severe and non-severe cases (Table 6). Then, another majority voting was applied, on the cases predicted as severe only, to classify them as intubated or deceased (Table 7). In this setup, the correlations of the retained features are summarized in Table 5. For the hierarchical prognosis on the three classes, a voting classifier predicting each class against the others was applied to aggregate the predicted outcomes of the 7 selected methods.
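A sketch of the majority-voting ensemble, assuming scikit-learn's VotingClassifier with hard voting; hyperparameters loosely mirror those reported, but the data are synthetic and the second-stage classifier (intubated versus deceased) would be fitted the same way on the predicted-severe cases only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def make_ensemble():
    """Seven classifiers combined by hard (majority) voting; class_weight
    'balanced' approximates the inverse-size class weighting."""
    members = [
        ("svm_lin", SVC(kernel="linear", C=0.25, class_weight="balanced")),
        ("svm_poly", SVC(kernel="poly", degree=3, C=0.25, class_weight="balanced")),
        ("svm_rbf", SVC(kernel="rbf", gamma=3.0, C=0.25, class_weight="balanced")),
        ("tree", DecisionTreeClassifier(max_depth=3, class_weight="balanced")),
        ("forest", RandomForestClassifier(n_estimators=8, max_depth=3,
                                          class_weight="balanced", random_state=0)),
        # depth-2 base tree boosted three times in the paper; defaults kept here
        ("ada", AdaBoostClassifier(n_estimators=3, random_state=0)),
        ("nb", GaussianNB()),
    ]
    return VotingClassifier(members, voting="hard")

# Stage 1: severe (1) vs non-severe (0) on synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
stage1 = make_ensemble().fit(X, y)
severe_idx = np.flatnonzero(stage1.predict(X) == 1)  # cases routed to stage 2
```

Hard voting means each of the seven members casts one label and the majority label wins, which matches the majority-voting scheme described in the text.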
In Figure 8 we visualize the distributions of the different features along with the ground truth labels and the prediction of the hierarchical classifier for each subject. In particular, all samples are grouped by their ground truth labels and a boxplot is generated for each group and each feature. Additionally, color-coded points are overlaid on each boxplot, denoting the predicted label. It is clearly visible that some features, such as the disease extent, the age, the shape of the disease and its uniformity, appear very important in separating the different subjects. The statistical analysis for the deep learning-based segmentation framework and the radiomics study was performed using Python 3.7 and the SciPy 35, scikit-learn 36, TensorFlow 37 and PyRadiomics 34 libraries. The Dice similarity score (DSC) 26 was calculated to assess the similarity between the 2 manual segmentations of each CT exam of the test dataset and between the manual and automated segmentations. The DSC between manual segmentations served as the reference to evaluate the similarity between the automated and the two manual segmentations. Moreover, the Hausdorff distance was calculated to evaluate the quality of the automated segmentations in a similar manner. Disease extent was calculated by dividing the volume of diseased lung by the total lung volume, and expressed as a percentage of the total lung volume. Disease extent measurements between manual segmentations, and between automated and manual segmentations, were compared using paired Student's t-tests. For the stratification of the dataset into the different categories, classic machine learning metrics, namely balanced accuracy, weighted precision, and weighted sensitivity and specificity, were used. Moreover, the correlation between each feature and the outcome was computed using a Pearson correlation over the entire dataset.
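The reported evaluation metrics are available directly in scikit-learn; the toy labels below only illustrate the calls:

```python
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]   # toy ground truth: 1 = severe short-term outcome
y_pred = [0, 0, 0, 1, 1, 0]   # toy predictions

bal_acc = balanced_accuracy_score(y_true, y_pred)            # mean of per-class recalls
w_prec = precision_score(y_true, y_pred, average="weighted")  # support-weighted precision
w_sens = recall_score(y_true, y_pred, average="weighted")     # weighted sensitivity
```

Balanced accuracy averages the recall of each class, so it is not inflated by the majority (non-severe) class the way plain accuracy would be on an imbalanced cohort.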
CT parameters between the 6 centers were compared using analysis of variance, while patient characteristics between the training/validation and test datasets were compared using chi-square and Student's t-tests.

References (cited titles):
- Coronavirus disease 2019 (COVID-19): a perspective from China
- Performance of radiologists in differentiating COVID-19 from viral pneumonia on chest CT
- Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection
- CT image visual quantitative evaluation and clinical classification of coronavirus disease (COVID-19)
- Association of radiologic findings with mortality of patients infected with 2019 novel coronavirus in Wuhan, China
- The toughest triage-allocating ventilators in a pandemic
- A framework for rationing ventilators and critical care beds during the COVID-19 pandemic
- A novel coronavirus from patients with pneumonia in China
- Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
- Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
- Sensitivity of chest CT for COVID-19: comparison to RT-PCR
- Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing
- Artificial intelligence applications for thoracic imaging
- A survey on deep learning in medical image analysis
- End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
- COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
- Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
- Quantification of tomographic patterns associated with COVID-19 from chest CT
- Serial quantitative chest CT assessment of COVID-19: deep-learning approach
- Mortality prediction in IPF: evaluation of automated computed tomographic analysis with conventional severity measures
- Idiopathic pulmonary fibrosis: data-driven textural analysis of extent of fibrosis at baseline and 15-month follow-up