key: cord-0758041-s5s2c3qw
authors: Zhang, Kang; Liu, Xiaohong; Shen, Jun; Li, Zhihuan; Sang, Ye; Wu, Xingwang; Cha, Yunfei; Liang, Wenhua; Wang, Chengdi; Wang, Ke; Ye, Linsen; Gao, Ming; Zhou, Zhongguo; Li, Liang; Wang, Jin; Yang, Zhizhong; Cai, Huimin; Xu, Jie; Yang, Lei; Cai, Wenjia; Xu, Wenqin; Wu, Shaoxu; Zhang, Wei; Jiang, Shanping; Zheng, Lianghong; Zhang, Xuan; Wang, Li; Lu, Liu; Li, Jiaming; Wu, Haiying; Wang, Winston; Li, Oulan; Zhang, Charlotte; Liang, Liang; Wu, Tao; Deng, Ruiyun; Wei, Kang; Zhou, Yong; Chen, Ting; Yiu-Nam Lau, Johnson; Fok, Manson; He, Jianxing; Lin, Tianxin; Li, Weimin; Wang, Guangyu
title: Clinically Applicable AI System for Accurate Diagnosis, Quantitative Measurements and Prognosis of COVID-19 Pneumonia Using Computed Tomography
date: 2020-05-04
journal: Cell
DOI: 10.1016/j.cell.2020.04.045
sha: e7741fed9483bc46c587cdfff414d435b1799dab
doc_id: 758041
cord_uid: s5s2c3qw

Summary Many COVID-19 patients infected by the SARS-CoV-2 virus develop pneumonia (called novel coronavirus pneumonia, NCP) and rapidly progress to respiratory failure. However, rapid diagnosis and identification of high-risk patients for early intervention are challenging. Using a large computed tomography (CT) database from 4,154 patients, we developed an AI system that can diagnose NCP and differentiate it from other common pneumonia and normal controls. The AI system can assist radiologists and physicians in performing a quick diagnosis, especially when the health system is overloaded. Significantly, our AI system identified important clinical markers that correlated with the NCP lesion properties. Together with the clinical data, our AI system was able to provide an accurate clinical prognosis that can help clinicians consider appropriate early clinical management and allocate resources appropriately. We have made this AI system available globally to assist clinicians in combating COVID-19.

The outbreak of the 2019 novel coronavirus (SARS-CoV-2) began in early December 2019 (Munster et al., 2020; Wang et al., 2020). The infection has a mean incubation period of 5.2 days and causes fever, cough, and other flu-like symptoms. It can affect multiple tissues and organ systems, and collectively the virus-induced disease is termed COVID-19. Many affected patients develop pneumonia (called novel coronavirus pneumonia, NCP) and progress rapidly into severe acute respiratory failure with a very poor prognosis and high mortality (Guan et al., 2020; Huang et al., 2020a). Person-to-person transmission has been established (Chan et al., 2020; Phan et al., 2020; Rothe et al., 2020; Zhu et al., 2020) and the World Health Organization (WHO) has declared COVID-19 to be a pandemic. Studies have shown that over 60% of patients died once they progressed into the severe/critical illness stage (Guan et al., 2020; Huang et al., 2020a). Therefore, identifying risk factors and parameters that allow the creation of an accurate prognostic model is critical in such a pandemic for planning early intervention and intensive monitoring, and will hopefully lead to improved clinical outcomes.

Currently, an individual with fever, cough, and flu symptoms is screened first by clinical assessment, laboratory tests, and chest X-ray to rule out pneumonia. If viral pneumonia is diagnosed, being able to diagnose NCP is of paramount importance for obvious public health reasons. COVID-19 diagnosis is confirmed by a positive molecular polymerase chain reaction (PCR) test. Chest computed tomography (CT) is an important tool in the diagnosis of lung diseases including pneumonia. A CT scan has a faster turnaround time than a molecular diagnostic test performed in a standard laboratory, provides more detailed information related to the pathology, and is better suited to the quantitative measurement of lesion size and the extent/severity of lung involvement, which may have prognostic implications. As seasonal flu also causes viral pneumonia, it is also important to differentiate NCP from the common flu and from other types of pneumonia, such as other viral pneumonia and bacterial pneumonia. Therefore, an accurate CT-based artificial intelligence (AI) system may have the potential to assist in early diagnosis for planning, monitoring, and treatment, and in establishing a reference for longitudinal follow-up (Huang et al., 2020b; Xie et al., 2020).

Recent exciting advances in the application of AI in many healthcare areas (Esteva et al., 2019; Gulshan et al., 2016; Li et al., 2018; Norgeot et al., 2019; Poplin et al., 2018; Ravizza et al., 2019; Ting et al., 2017; Topol, 2019) have inspired innovations in the development of novel AI-based radiological diagnostic technology. Chen et al. (2020) reviewed various quantitative models of thin-section chest CT and showed the effectiveness of quantitative tools in both accurate diagnosis and longitudinal follow-up. Another study showed that deep learning algorithms helped to identify head CT scan abnormalities, which can assist the clinical triage process (Chilamkurthy et al., 2018). Recent studies demonstrated the potential of integrating AI into diagnostic systems for both eye and childhood diseases, where it was found to significantly improve clinical diagnostic efficiency and accuracy (Burlina et al., 2017; Kermany et al., 2018; Long et al., 2017; Van Grinsven et al., 2016). Therefore, we hypothesized that an accurate AI system for CT scans can be established to diagnose NCP, and that this will assist radiologists and clinicians in the management of patients with symptoms suggestive of COVID-19 NCP.

Another urgent need is to identify the patients at higher risk of developing acute respiratory failure so that they can be monitored closely and receive intervention early, as such patients otherwise have a higher chance of eventually developing multi-organ failure associated with high mortality. As lesion characteristics, including the number, size, and density of lesions, together with the state of the overall lung parenchyma, are indicators of lung damage and remaining lung reserve, we also tested the hypothesis that an AI system can be established using both clinical data and CT parameters to generate an accurate clinical prognostic model, allowing clinicians to plan for early monitoring and management of these patients. Accordingly, we constructed a large CT dataset on NCP, other common pneumonia, and normal controls and developed an AI diagnostic system to assist accurate diagnosis, which we evaluated in an epidemic area and in two non-epidemic areas in China. We also provided prognosis indications for patients with NCP by using a combination of CT and clinical parameters, with the aim of providing another tool to assist physicians (Figure 1).

A large CT dataset encompassing patient cohorts from the China Consortium of Chest CT Image Investigation (CC-CCII) was constructed, which consisted of a total of 617,775 CT images from 4,154 patients. CT images from 3,777 patients were employed to train and test our AI systems for differentiating NCP from other common pneumonia and normal controls (Figure 1, Figure S1, and Table S1). The common pneumonia group consisted of viral pneumonia, bacterial pneumonia, and mycoplasma pneumonia, all of which are the most common causes of pneumonia in China. Our AI COVID-19 diagnostic system consisted of two models: a lung lesion segmentation model and a diagnosis analysis model (Figure 1 and Figure S2). We constructed a segmentation network trained with 4,695 manually segmented slices from NCP patients and common pneumonia patients. For the classification model, 361,221 CT images from 2,246 patients including 752 NCP, 797 common pneumonia patients and 697 normal control patients were used for training (Table S1). Our AI system performance of diagnosis prediction was also tested in one retrospective cohort and three prospective pilot studies consisting of one in an epidemic area and two in non-epidemic areas in China. In addition, we have tested our AI diagnostic performance internationally using CT data generated outside China. For the CT quantitative and correlation analysis with clinical parameters, 843 hospital-admitted NCP patients with clinical metadata were included. Of these, 456 hospitalized patients with comprehensive CT image and clinical outcome information were subject to a prognosis prediction and survival analysis and the establishment of a clinical prognosis estimation (Figure 1).

A two-stage segmentation framework was constructed to segment lung lesions from normal lung fields and background on raw CT slices both accurately and efficiently (Figure S2). A total of 4,695 CT slice images were manually annotated at the pixel level into seven classes: background, lung field, consolidation (CL), ground-glass opacity (GGO), pulmonary fibrosis, interstitial thickening, and pleural effusion.

To study our AI system's performance on CT slice segmentation, several classic semantic segmentation networks were tested as the backbone of our segmentation framework, including U-net (Ronneberger et al., 2015), DRUNET (Devalla et al., 2018), FCN (Long et al., 2015), SegNet (Badrinarayanan et al., 2017), and DeepLabv3. We evaluated system performance with two evaluation metrics, the Dice coefficient (DC) and pixel accuracy (PA), using five-fold cross-validation (Table S3). We adopted DeepLabv3 as the backbone for subsequent analyses because of its better segmentation performance. Compared to human experts, our segmentation network obtained smoother and clearer lesion segmentation boundaries and achieved high accuracy (Figure 2 and Figure S3).
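For readers unfamiliar with the two segmentation metrics, the following minimal Python sketch shows how a per-class Dice coefficient and an overall pixel accuracy can be computed from a predicted and a reference label map. The function names, the integer label codes, and the toy masks are illustrative assumptions and are not taken from the study's code.

```python
def dice_coefficient(pred, truth, label):
    """Dice coefficient of one class over two equally sized, flattened label maps."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of pixels")
    pred_hits = sum(1 for p in pred if p == label)
    truth_hits = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_hits + truth_hits == 0:
        # Class absent from both maps: counted as perfect agreement (a convention).
        return 1.0
    return 2.0 * overlap / (pred_hits + truth_hits)


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the reference class."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of pixels")
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


# Toy flattened 2x4 slice; hypothetical codes: 0 = background, 1 = lung field, 2 = GGO.
truth = [0, 1, 1, 2, 0, 1, 2, 2]
pred = [0, 1, 2, 2, 0, 1, 1, 2]

ggo_dice = dice_coefficient(pred, truth, label=2)
accuracy = pixel_accuracy(pred, truth)
print(f"GGO Dice = {ggo_dice:.3f}, pixel accuracy = {accuracy:.3f}")
```

In a five-fold cross-validation such values would be computed on each held-out fold and then averaged.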

To distinguish NCP from other common pneumonia and normal controls, we employed a diagnostic system based on a lung lesion segmentation model and a diagnosis analysis model. The diagnosis classifier took the lung-lesion map generated by the segmentation network as input and used the normalized CT volumes for diagnosis prediction. Because real-world scans contain noise and vary across devices and operators, this two-step approach, rather than an end-to-end black-box network, provided better generalization and interoperability during clinical implementation.

We used a total of 40,880 slices from 260 patients, including 83 NCP patients, 91 common pneumonia patients, and 86 normal controls, to test our diagnosis classifier model as an internal validation (Table S1). Our system was able to differentiate NCP from the other two classes (other common pneumonia and normal controls) with 92.49% accuracy, 94.93% sensitivity, 91.13% specificity, and an AUROC of 0.9797 (95% CI: 0.9665-0.9904) on this internal validation dataset. The overall three-way classification obtained an accuracy of 92.49% and an AUROC of 0.9813 (95% CI: 0.9691-0.9902) (Figure 3A and 3B).
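The relationship between the "NCP versus all other groups" figures and the three-way accuracy can be illustrated with a short Python sketch. It collapses three-way labels into NCP versus the rest before counting true and false positives and negatives. The class names, function names, and toy predictions are assumptions made for illustration only.

```python
def one_vs_rest_metrics(y_true, y_pred, positive="NCP"):
    """Accuracy, sensitivity and specificity after collapsing labels to positive vs. rest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def three_way_accuracy(y_true, y_pred):
    """Fraction of patients assigned to exactly the right one of the three classes."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy patient-level labels (hypothetical): 4 NCP, 3 common pneumonia (CP), 3 normal.
y_true = ["NCP"] * 4 + ["CP"] * 3 + ["Normal"] * 3
y_pred = ["NCP", "NCP", "NCP", "CP",
          "CP", "NCP", "CP",
          "Normal", "Normal", "CP"]

binary = one_vs_rest_metrics(y_true, y_pred)
overall = three_way_accuracy(y_true, y_pred)
print(binary, overall)
```

A confusion between common pneumonia and normal controls lowers the three-way accuracy but leaves the NCP-versus-rest figures unchanged, which is why the two accuracies can differ.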

To address regional variations and the general applicability of our AI diagnostic system, its performance was tested in four other regions using different datasets. Specifically, it was tested in a retrospective study of an external cohort from an epidemic area in Hubei, China (City of Yichang). In addition, we tested it prospectively in three other regions, with one cohort from the epicenter in Hubei, China (City of Wuhan) and two cohorts from non-epidemic areas in China (City of Hefei and City of Guangzhou).

In a retrospective study in the City of Yichang, Hubei Province, China, 155 NCP patients, 36 common pneumonia patients, and 17 normal controls who underwent CT imaging were enrolled (Table S1). Our AI diagnostic system achieved 90.70% accuracy, 92.15% sensitivity, 85.92% specificity, and an AUROC of 0.9805 (95% CI: 0.9662-0.9899) for NCP versus all other groups and an accuracy of 89.92% and an AUROC of 0.9805 (95% CI: 0.9662-0.9899) for the overall three-way classification (NCP, common pneumonia and normal controls) (Figure 3C and 3D).

The first prospective pilot study was conducted in Wuhan, Hubei Province, China (the epicenter) from January 25 to March 18, 2020 (Table S1). Our AI diagnostic system achieved 91.20% accuracy, 94.03% sensitivity, 88.46% specificity, and an AUROC of 0.9610 (95% CI: 0.9403-0.9785) for NCP versus all other groups together and an accuracy of 91.20% and an AUROC of 0.9741 (95% CI: 0.9583-0.9856) for the overall three-way classification (NCP, common pneumonia and normal controls) in this prospective study cohort (Figure 4A and 4B).

A second prospective pilot study was conducted in a non-epidemic region in Hefei, Anhui Province, China, from February 2 to March 25 (Table S1). Our AI diagnostic system achieved 90.32% accuracy, 94.74% sensitivity, 89.19% specificity, and an AUROC of 0.9700 (95% CI: 0.9500-0.9872) for NCP versus all other groups together and an accuracy of 91.76% and an AUROC of 0.9776 (95% CI: 0.9630-0.9899) for the overall three-way classification (NCP, common pneumonia and normal controls) in this second prospective study cohort (Figure 4C and 4D).

The third prospective pilot study was conducted in another non-epidemic region in Guangzhou, Guangdong Province, China, from February 8 to March 27 (Table S1). Our AI diagnostic system achieved 84.78% accuracy, 90.00% sensitivity, 84.15% specificity, and an AUROC of 0.9512 (95% CI: 0.9124-0.9820) for NCP versus all other groups together and an accuracy of 89.67% and an AUROC of 0.9755 (95% CI: 0.9545-0.9896) for the overall three-way classification (NCP, common pneumonia and normal controls) (Figure 4E and 4F).

To validate our AI system's general applicability outside China, we obtained CT images from an open source and additional data from our collaborators in Ecuador (Table S1). Our AI diagnostic system achieved 84.11% accuracy, 86.67% sensitivity, 82.26% specificity, and an AUROC of 0.905 (95% CI: 0.8421-0.9612) for NCP versus all other groups together and an accuracy of 85.05% and an AUROC of 0.9381 (95% CI: 0.8944-0.9742) for the overall three-way classification (Figure S5A and S5B). Together, these five studies confirmed the high performance, accuracy, and general applicability of our AI diagnostic system both within and outside China.

We also evaluated the effect of drug treatment on lesion size and volume changes using our AI-generated quantitative measurements. NCP patients undergoing three different drug treatment trials were analyzed. The enrollment criteria included a confirmed NCP diagnosis by a positive viral PCR test and no prior treatment history. We quantified lung lesions (GGO and total lesion volumes) on CT scans and compared the measurements before and after treatment for each of the three experimental drugs. The CT scans demonstrated quantitative lesion changes, indicating the potential of our AI system for evaluating drug treatment efficacy in an objective, quantitative way (Figure S4).

An independent test dataset of 18,392 CT slice images from 150 patients, including 40 NCP patients, 80 common pneumonia patients, and 30 normal controls, was used to compare the AI system's performance with that of practicing radiologists in classifying NCP versus other pneumonia. We employed eight radiologists in two groups: four radiologists in the junior group, with 5 to 15 years of clinical experience, and four radiologists in the senior group, with 15 to 25 years of clinical experience. Ground truth was established based on a consensus from an independent group of four senior radiologists with 25 or more years of clinical experience.

We then compared the performance of our AI system and the radiologists in differentiating NCP from other common pneumonia and normal controls. Overall, the performance of our AI system was superior to that of junior radiologists and comparable to that of mid-senior radiologists (Figure 5A and 5B). We used predicted errors, based on penalty scores (Figure 5C and Figure S5C), to create a metric to evaluate and compare performance between our AI system and the radiologists. Our AI system yielded a weighted error of 9.29%, whereas the weighted errors of the experts ranged from 7.14% to 19.15% with a mean of 13.55% (Figure 5C).
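The penalty scores themselves are given in Figure S5C and are not reproduced in the text, so the following Python sketch uses a purely hypothetical penalty matrix to illustrate one plausible way a weighted error of this kind could be computed: the penalties incurred by a reader's predictions are summed and divided by the largest total penalty that reader could have incurred. Both the penalty values and the normalization are assumptions, not the study's definition.

```python
# HYPOTHETICAL penalty matrix PENALTY[true_class][predicted_class].
# The actual scores are in Figure S5C and may differ.
PENALTY = {
    "NCP": {"NCP": 0, "CP": 1, "Normal": 2},
    "CP": {"NCP": 1, "CP": 0, "Normal": 1},
    "Normal": {"NCP": 2, "CP": 1, "Normal": 0},
}


def weighted_error(y_true, y_pred, penalty=PENALTY):
    """Incurred penalty divided by the worst-case penalty (assumed normalization)."""
    incurred = sum(penalty[t][p] for t, p in zip(y_true, y_pred))
    worst_case = sum(max(penalty[t].values()) for t in y_true)
    return incurred / worst_case


# Toy reading session with six patients (hypothetical labels).
y_true = ["NCP", "NCP", "CP", "CP", "Normal", "Normal"]
y_pred = ["NCP", "Normal", "CP", "NCP", "Normal", "CP"]

error = weighted_error(y_true, y_pred)
print(f"weighted error = {error:.2%}")
```

A metric of this form penalizes clinically costlier mistakes, such as reading an NCP case as normal, more heavily than a plain error rate does.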

To investigate whether our AI system could help junior radiologists improve their diagnostic performance, each junior radiologist was given the AI system's diagnosis probability for each patient and then asked to make a diagnosis with the assistance of the AI-generated results. To avoid potential memorization bias, the follow-up AI-assisted diagnostic test by junior radiologists was performed 2 weeks after the initial test. Their performance improved significantly compared to the initial test and became comparable to that of senior radiologists (Figure 5).

The ratios of total lesions (named lesion), CL, and GGO to the entire lung field were chosen as three lesion indicators. The correlations between these quantitative lesion features and clinical and demographic parameters, including age, gender, and clinical and biochemical markers, were determined. Spearman's and Pearson's correlations were employed to study the association between quantitative lesion features on a CT scan and clinical parameters. C-reactive protein (CRP), age, serum lactic dehydrogenase (LDH), highest body temperature (Tmax), and neutrophil-to-lymphocyte ratio showed highly positive correlations with the lesion features. By contrast, blood oxygen saturation, lymphocyte count, albumin, blood platelets, and Na+ showed highly negative correlations with the lesion features (Figure 6A-6D, Figure S6A-S6E). We also designed a composite score system using international guidelines to evaluate correlations between the lesion features and the degree of lung and liver organ damage (see Methods for more details), which showed a positive correlation between lesion size and the degree of organ damage (Figure 6E). Together, these data suggest that lung lesions correlate not only directly with respiratory system function but also with the clinical parameters of other organ systems, suggesting interconnectivity between lung lesions and the health of other major organs and a potential concomitant advancement of damage contributing to multiple organ failure. We would like to emphasize that our current data show only the correlation of these changes, without any direct knowledge of the exact underlying pathogenetic mechanisms (e.g., secondary to respiratory failure and/or direct viral involvement).
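As a reminder of how the two correlation coefficients differ, the minimal Python sketch below computes Pearson's r on raw values and Spearman's rho as Pearson's r on average ranks. The lesion ratios and CRP values are invented toy numbers, and the function names are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): lesion-to-lung ratio and CRP for five patients.
lesion_ratio = [0.02, 0.10, 0.25, 0.40, 0.60]
crp = [3.0, 12.0, 20.0, 95.0, 80.0]

rho = spearman(lesion_ratio, crp)
r = pearson(lesion_ratio, crp)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Spearman's rho responds only to the ordering of the values, so it is less sensitive than Pearson's r to skewed laboratory markers.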

Next, the clinical and radiological features that contributed to the progression to critical illness were further analyzed to develop an AI-assisted model to estimate the clinical prognosis. A composite score (c-score) based on quantitative lung lesion features extracted by the AI system and clinical parameters (e.g., age, albumin, blood oxygen saturation, CRP) was created and applied to predict the clinical outcome, defined by the progression (and time from the initial hospital admission) to severe/critical illness (defined by the clinical need for transfer to the ICU, mechanical ventilation, or death). We implemented LightGBM and CoxPH regression models for prognosis prediction. To interpret the effects and relative contributions of the lung lesion features and clinical parameters on prognosis prediction, we used the SHAP (Shapley Additive exPlanation) explainer (Lundberg et al., 2018a; Lundberg et al., 2018b).
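The idea underlying SHAP can be illustrated with exact Shapley values on a toy model. The Python sketch below is not the SHAP library and not the LightGBM model of this study: `toy_risk`, its coefficients, and the patient and baseline values are hypothetical, and the attribution is computed by brute-force enumeration of feature orderings, which is only feasible for a handful of features.

```python
from itertools import permutations


def toy_risk(features):
    """HYPOTHETICAL risk score with an interaction term; not the study's model."""
    return (2.0 * features["lesion_ratio"]
            + 0.01 * features["age"]
            + 0.02 * features["crp"] * features["lesion_ratio"])


def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.

    A feature that is 'absent' takes its baseline value (an assumed convention).
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in ordering:
            current[name] = x[name]      # switch this feature to the patient's value
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: totals[name] / len(orderings) for name in names}


patient = {"lesion_ratio": 0.40, "age": 70.0, "crp": 80.0}    # hypothetical patient
reference = {"lesion_ratio": 0.05, "age": 50.0, "crp": 10.0}  # hypothetical baseline

contributions = shapley_values(toy_risk, patient, reference)
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that makes such attributions usable for ranking features.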

As expected, the lesion features were identified as the most significant contributors in the clinical prognosis estimation. In addition, clinical parameters relating to respiratory function (oxygen saturation index and respiratory rate) and general clinical characteristics (age, body temperature on admission, and highest body temperature) also contributed to the prediction of progression to severe/critical illness requiring an ICU admission. Interestingly, additional prognostic markers were also identified, including liver biochemistry markers (albumin, serum LDH, indirect bilirubin), coagulation markers (thrombin time, activated partial thromboplastin time (APTT), platelet count), electrolyte and acid-base balance (Na+, K+, HCO3-), and markers of inflammation (CRP, lymphocyte count, neutrophil count) (Figure 7C and 7D, Table S4), suggesting that the overall health of other tissues/organs and systemic homeostasis also played an important role in determining the clinical prognosis of these patients in terms of their progression to severe/critical illness.

Receiver operating characteristic (ROC) curves were then generated to measure the impact of the NCP lesions and clinical metadata on progression to critical illness. Lung lesion features alone gave an AUC of 0.8479. When combined with clinical metadata, the performance of our model improved significantly to an AUC of 0.9093 (95% CI: 0.8775-0.9369), with a specificity of 80.00% and a sensitivity of 86.71% (Figure 7A and 7B).
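The text does not state how the AUC and its confidence interval were computed. One common approach, sketched below in Python under that caveat, computes the AUC as the probability that a randomly chosen progressing patient scores higher than a randomly chosen non-progressing patient, and obtains a percentile confidence interval by bootstrap resampling. The function names, the resampling settings, and the toy scores are assumptions.

```python
import random


def auroc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [labels[i] for i in idx]
        if 0 < sum(resampled) < n:  # skip resamples that contain a single class
            stats.append(auroc(resampled, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): 1 = progressed to severe/critical illness.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]

auc = auroc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC = {auc:.4f} (95% CI: {low:.4f}-{high:.4f})")
```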

Kaplan-Meier curves using a composite score (c-score) were generated based on these lung lesion and clinical parameters (Figure S6F). Based on the current size of the database, we stratified the patients into two groups: the high-risk group with a c-score of >0.5 and the low-risk group with a c-score of <0.5. The high-risk group (158 observations with 133 events) had a much lower survival probability than the low-risk group (274 observations with 37 events), with a highly significant statistical difference in median survival (p<0.001, log-rank test, Figure 7E). These results suggest that a combination of lung lesions and clinical metadata can contribute significantly to prognosis prediction. It is also important to note that the curves provide estimates of the time (in days) at which critical care demands will arise. With a larger database in the future, our AI system may be able to provide an even more refined clinical prognostic model, stratified into more groups with different levels of risk that are associated with different clinical prognoses.
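The stratification and the Kaplan-Meier estimate can be sketched in a few lines of Python. The product-limit estimator below is the standard one, while the patient records are invented and the handling of a c-score of exactly 0.5 (assigned here to the low-risk group) is an assumption, since the text only specifies >0.5 and <0.5.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each distinct event time.

    events[i] is 1 if progression was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        n_events = 0
        while j < len(data) and data[j][0] == t:  # gather everyone leaving at time t
            n_events += data[j][1]
            j += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= j - i
        i = j
    return curve


def risk_group(c_score, threshold=0.5):
    """High risk if c-score > threshold; exactly 0.5 goes to 'low' (assumption)."""
    return "high" if c_score > threshold else "low"


# Toy records (hypothetical): (c-score, days since admission, event indicator).
patients = [
    (0.90, 2, 1), (0.80, 3, 1), (0.70, 3, 0), (0.60, 5, 1), (0.55, 8, 0),
    (0.40, 6, 1), (0.30, 10, 0), (0.20, 12, 0), (0.10, 14, 0),
]

curves = {}
for group in ("high", "low"):
    members = [(days, event) for c, days, event in patients if risk_group(c) == group]
    curves[group] = kaplan_meier([d for d, _ in members], [e for _, e in members])

print(curves)
```

Reading a curve at a given survival level gives the kind of time-to-progression estimate described above, for example the day by which half of a group is expected to have progressed.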

In this study, we described an AI system for the diagnosis of COVID-19 pneumonia based on chest CT images. The performance of our AI system was comparable to that of practicing radiologists with significant clinical experience and could assist and improve the performance of junior radiologists. This AI effort is driven by the desire to develop a system for rapid diagnosis of NCP to assist radiologists and clinicians in combating this pandemic. Such an AI system can also ease the significant demand for diagnostic expertise when the health system is overloaded in a pandemic situation or in remote areas. At present, our AI system is designed to assist radiologists and clinicians as an efficient first reading/screening tool, as this may reduce patient waiting time and shorten the diagnostic workflow, thereby lessening a radiologist's overall workload and allowing them to respond quickly and more effectively in an emergency situation. The precise measurement of the different lung damage parameters in the CT scan also allows an objective and quantitative measurement of disease severity and has the potential to provide an objective and quantitative evaluation of the efficacy of a drug treatment, including antivirals and other immunomodulators, on the lung lesions.

The correlation between the lesion features in the CT scans, as evaluated by our AI system, and the clinical and biochemical evidence of disease severity based on other organ systems' laboratory parameters highlighted the pathogenetic interlink between different organs in COVID-19. From a mechanistic perspective, our study can only show the correlation; it does not address whether this interlink is established directly through viral infection of these organs and/or secondarily through respiratory failure, which remains to be determined through other pathogenesis studies. The higher correlation of the lung lesions with outcomes, compared to that of other clinical parameters, did highlight the importance of lung damage as the key parameter in the overall prognosis. The association of age and outcome is consistent with recent reports that older age is a significant risk factor for poor outcomes (Huang et al., 2020a). The identification of other parameters, including CRP, serum sodium concentration, serum albumin, and platelet count, as prognostic factors is consistent with the prognostic factors seen in patients with multi-organ failure. An increased CRP level may reflect heightened systemic acute inflammation, mirrored in the severity of pulmonary inflammation, consistent with the prominent lung inflammation observed in CT scans and post-mortem studies. The association of liver and renal biochemistry with prognosis may be another reflection of the multi-organ failure observed in patients with severe/critical COVID-19, although again whether there is direct viral involvement of these organs cannot be addressed in this study.

The development of a clinical prognostic model based on our AI system, utilizing CT parameters and clinical data, was an important advancement toward the use of AI to assist clinical management. Based on our database, we were able to identify patients with a composite score of >0.5 under our current model as the high-risk group in terms of eventual progression to severe/critical illness requiring ICU admission, mechanical ventilation, or death. Importantly, an estimated time to this progression can also be provided. This information will help clinicians plan monitoring and allocate ICU resources and, in the case of an epidemic, will give the healthcare system a few more days to react to the demand for resources.

In summary, an AI system that can accurately diagnose NCP and assist radiologists and physicians has been developed. A good correlation between the NCP lung lesions as revealed by CT parameters and the clinical and biochemical markers of multiple organs was observed, highlighting that multi-organ failure was observed in COVID-19. Together with the clinical prognostic estimation function, this AI system can help radiologists and physicians, such as ER physicians and pulmonologists, to diagnose patients accurately and rapidly, and the prognostic estimation generated can help physicians determine the subset of patients who will require close monitoring and early intervention/support, as well as the estimated timing of these needs. As this AI system may help physicians and healthcare systems globally to better manage their patients during this pandemic, we are opening it up to all radiologists and clinicians. We hope that it can assist their management of patients and that the data they input may further improve our AI system, which we hope can evolve into a versatile tool for the global community to combat COVID-19 and other emerging viral infections. For an example of the clinical deployment, see Figure S7.

We thank the many physicians in China who generously donated their time and expertise for this project. We thank Drs. Roberto Hidalgo and Daniel Recalde in Guayaquil, Ecuador for providing de-identified CT images from NCP patients, other common pneumonia patients, and normal controls. We thank members of the Zhang, Lin, and Wang groups for their assistance and helpful discussions. We thank the many volunteers and physicians for curating and grading CT scans and medical records.

The authors declare no competing financial interests.

Figure 1. Our proposed AI framework for NCP diagnosis and prognosis prediction.

(A) A large CT dataset was constructed using the data from CC-CCII (617,775 CT images from NCP, common pneumonia, and normal controls). The NCP diagnosis system consisted of two models: a lung-lesion segmentation model and a diagnosis prediction model. We first trained a segmentation network with 4,695 manually segmented images from NCP and common pneumonia patients. The diagnosis classifier took the lung-lesion map as input and used classification networks to generate the probabilities of three classes: NCP, common pneumonia, and normal controls. A number of prospective pilot studies were also conducted to test our AI performance for clinical application. (B) AI-assisted clinical prognosis estimation based on CT quantitative parameters and clinical metadata. A system for risk factor evaluation and Kaplan-Meier curve analysis for severe/critical illness as defined in the text was also implemented. CRP, C-reactive protein. See also Figures S1, S2, and S7 and Table S1.

The severity levels were defined as follows: mild, defined as fewer than three GGO lesions of size less than 3 cm; intermediate, defined as a lesion area of more than 25% of the entire lung field; severe, defined as a lesion area of more than 50% of the entire lung field. See also Figure S3 and Table S2.

(A and B) The performance of our AI system and eight practicing radiologists (four junior level and four senior level). Receiver operating characteristic (ROC) curve for diagnosis of NCP versus other classes. Filled dots denote junior and senior radiologists' performance, while the hollow dots denote the performance of junior group with AI assistance. Dashed lines linked the paired performance values of each junior radiologist. (C) Weighted error results based on penalty scores (See Figure S5) . 

Comparative measurements of ground glass opacities (GGO) and total lesion (lesion) volume ratio before and after a drug treatment in three preliminary drug treatment observation trials (drug 1, 2 and 3). (A and B) Bar graphs comparing lesion volume changes before and after treatment by three drugs.

(C-E) Image examples of lesion changes before treatment (left panels) and after treatment (right panels). The NCP total lesion area in the example slice of each patient was quantified as a horizontal bar. A typical image with lesions and the corresponding AI segmentation is presented for each drug treatment. In the AI segmentation color code, blue, purple, and green represent GGO, consolidation (CL), and pulmonary fibrosis, respectively. (C) A representative patient from the drug 1 group. (D) A representative patient from the drug 2 group. (E) A representative patient from the drug 3 group. A t-test was used to measure the statistical significance of differences before and after treatment. The lesion change before and after treatment was not statistically significant in the drug 1 group, whereas it was significant in the drug 2 group (P=0.0345) and the drug 3 group (P=0.00056).

(D) A distribution plot of the severity index (lesion volume ratios) for NCP patients and common pneumonia (CP) patients, which represents a severity level comparison between the two disease groups. The difference between the two distributions was evaluated by the Jaccard similarity (JS), defined as the intersection divided by the union of the distributions of the two samples. The JS of the lesion ratios for CP and NCP patients was 0.939, suggesting that the distributions of severity levels were closely matched and would not bias the diagnosis analysis.
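One way to read "intersection divided by the union" for two distributions is to bin both samples on a shared grid and divide the sum of bin-wise minima by the sum of bin-wise maxima. The Python sketch below follows that reading. The number of bins, the value range, and the toy lesion ratios are assumptions, as the text does not describe the binning.

```python
def normalized_histogram(values, n_bins=10, low=0.0, high=1.0):
    """Fraction of values falling in each of n_bins equal-width bins on [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        index = min(int((value - low) / width), n_bins - 1)  # keep value == high in range
        counts[index] += 1
    return [count / len(values) for count in counts]


def jaccard_similarity(p, q):
    """Sum of bin-wise minima (intersection) over sum of bin-wise maxima (union)."""
    intersection = sum(min(a, b) for a, b in zip(p, q))
    union = sum(max(a, b) for a, b in zip(p, q))
    return intersection / union


# Toy lesion volume ratios (hypothetical) for the two disease groups.
ncp_ratios = [0.05, 0.15, 0.15, 0.35]
cp_ratios = [0.05, 0.15, 0.25, 0.35]

js = jaccard_similarity(normalized_histogram(ncp_ratios),
                        normalized_histogram(cp_ratios))
print(f"JS = {js:.3f}")
```

A value of 1 means the two binned distributions coincide, and a value of 0 means they share no occupied bin.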

Further information and requests for resources should be directed to the Lead Contact, Kang Zhang (kang.zhang@gmail.com). All CT image data, metadata, and code generated in this study are available from the Lead Contact.

This study did not generate new unique reagents.

Chest CT images, clinical metadata, and code are deposited at the China National Center for Bioinformation (http://ncovai.big.ac.cn/download?lang=en).

CT images were collected from cohorts of the China Consortium of Chest CT Image Investigation (CC-CCII), which consists of Sun Yat-sen Memorial Hospital and the Third Affiliated Hospital of Sun Yat-sen University, The First Affiliated Hospital of Anhui Medical University, West China Hospital, Nanjing Renmin Hospital, Yichang Central People's Hospital, and Renmin Hospital of Wuhan University. Human subjects were deemed clinically appropriate for a chest CT scan by the clinicians during the management of the patients. Age and gender were not used as exclusion criteria. Institutional Review Board (IRB)/Ethics Committee approvals were obtained in all the institutions and consent was obtained from all participants. The work was conducted in compliance with the Chinese CDC policy on reportable infectious diseases and the Chinese Health and Quarantine Law and in compliance with patient privacy regulations in China, and adhered to the tenets of the Declaration of Helsinki.

The CC-CCII contained a total of 617,775 CT slices of 6,752 CT scans from 4,154 patients. The study sample size was estimated by a standard AI training and validation approach. Patients were randomly assigned to a training set (80%), an internal validation set (10%) or a test set (10%). We used a total of 444,034 CT slices of 2,778 patients from the CC-CCII for training and internal validation of the AI diagnosis system, including 164,241 slices from 917 NCP patients, 183,933 slices from 983 common pneumonia patients and 95,860 slices from 878 normal controls. An NCP diagnosis was given when a patient had pneumonia confirmed by reverse-transcriptase PCR. The common pneumonia group included viral pneumonia (including adenoviral, influenza, and parainfluenza pneumonia), bacterial pneumonia, and mycoplasma pneumonia, which together are the most common causes of pneumonia and which were diagnosed based on standard clinical, radiological, and culture/molecular assay results. We matched lesion severity levels between NCP and other common pneumonia by lesion volume ratio measurements (Figure S5D). All cohorts had viral pneumonia as a part of the other common pneumonia diagnosis. The percentages of viral pneumonia were as follows: initial training/validation/testing cohort, 48%; external validation cohort (City of Yichang), 27%; prospective cohort 1 (City of Hefei), 36%; prospective cohort 2 (City of Guangzhou), 38%; international cohort, 15%. CT scans from each patient were put into the corresponding dataset based on the patient assignment. We excluded patients without corresponding non-contrast CT scans. Scan sets without serial information or containing any motion artifacts or significant image resolution reductions were also excluded from the study. Details of patient characteristics and the inclusion and exclusion criteria are given in Figure S1.
For the CT quantitative and correlation analysis with clinical parameters, 843 admitted NCP patients with clinical metadata were included. Of these, 456 hospitalized patients with clinical outcome information were used in the prognosis estimation analysis. The endpoint in the study was "severe/critical illness", defined as admission to an intensive care unit (ICU), mechanical ventilation, or death.
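
The patient-level random assignment described above (80% training, 10% internal validation, 10% test, with every scan following its patient) can be illustrated with a minimal sketch. The function names, the fixed seed and the identifier format are assumptions made for illustration; the source does not describe how the randomization was implemented.

```python
import random


def split_patients(patient_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign unique patients to train/validation/test subsets.

    The seed and the rounding of subset sizes are illustrative assumptions.
    """
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(round(fractions[0] * len(ids)))
    n_val = int(round(fractions[1] * len(ids)))
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def assign_scans(scan_to_patient, split):
    """Place each scan in the subset of the patient it belongs to."""
    lookup = {pid: name for name, ids in split.items() for pid in ids}
    return {scan: lookup[pid] for scan, pid in scan_to_patient.items()}
```

Splitting by patient rather than by scan or slice keeps all images of one person in a single subset, so that test performance is not inflated by near-duplicate slices seen during training.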

To train and evaluate our semantic segmentation framework, a subset of 2,879 CT slices from NCP patients and a set of 1,816 CT slices from other pneumonia patients were manually segmented at the pixel level. The annotation was done via polygons. The segmentation labels were selected as relevant pathological features for distinguishing NCP and other common pneumonia. The annotation included the lung field and five commonly seen categories of lesions: CL, GGO, pulmonary fibrosis, interstitial thickening and pleural effusion. There were 4,406 consolidation lesions, 10,544 ground-glass opacities and a total of 2,571 lesions of the other three types annotated (Table S2). The segmentations were annotated and reviewed by five senior radiologists with 15 to 25 years of experience.

For the analysis of CT images from CC-CCII, all images were initially screened for quality control by removing all low-quality or unreadable scans. Each CT image then went through a tiered grading system consisting of two layers of trained graders of increasing expertise for verification and correction of image labels. Each image imported into the database started with a label matching the diagnosis of the patient. The first tier of graders, who were radiologists with 5 to 15 years of clinical practice experience, conducted initial quality control and excluded images containing severe artifacts or significant image resolution reductions. The presence or absence of lung lesions in CT images was recorded. The second tier of five independent radiologists with at least 25 years of clinical practice experience verified the true labels for a randomly selected 10% of all images. This group of senior radiologists also served as the consensus committee that adjudicated the ground truth in the AI versus radiologist comparison.

To compare the performance of our AI system with that of experienced practicing radiologists, an independent dataset consisting of a total of 18,392 slices from 150 patients, including 40 NCP patients, 80 common pneumonia patients and 30 normal controls, was employed (Table S1). Eight practicing radiologists were enrolled to participate in the AI comparison study and were allocated into two groups: a junior group, with 5 to 15 years of clinical experience, and a senior group, with 15 to 25 years of clinical experience. A weighted error based on a penalty score was used to evaluate our AI system and the experts to reflect clinical performance. We assigned a score of 2 to misdiagnosing NCP as normal, as it may cause the most severe outcome, compared with misdiagnosing NCP as "other common pneumonia", which had a score of 1. The scores for all remaining misdiagnoses were set to 1.
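
The penalty scheme above can be written as a small scoring function. This is a minimal sketch: the penalties (2 for NCP read as normal, 1 for any other misdiagnosis) come from the text, whereas the class label strings and the normalization by the number of patients are assumptions, since the source does not state how the total penalty was normalized.

```python
def penalty(true_label, predicted_label):
    """Penalty score for one patient; a correct diagnosis costs 0."""
    if true_label == predicted_label:
        return 0
    # Missing an NCP case entirely is treated as the most severe error.
    if true_label == "NCP" and predicted_label == "Normal":
        return 2
    return 1


def weighted_error(true_labels, predicted_labels):
    """Mean penalty per patient (the per-patient averaging is an assumption)."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    total = sum(penalty(t, p) for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)
```

For example, four patients with true labels NCP, NCP, CP, Normal and predictions Normal, CP, CP, NCP accumulate penalties of 2, 1, 0 and 1.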

We further conducted a study to investigate the impact of the AI diagnostic system in aiding the performance of the four junior radiologists. During this rereading of the same CT scans four weeks after the initial reading, the AI system readout was provided to the radiologists, with lesion labeling of each slice and a final diagnostic probability at the patient level (three-way classification). The radiologists were asked to make a diagnosis again to assess the impact of our AI system on their diagnoses, which were compared with the ground truth from the senior radiologists.

In the first prospective pilot study, conducted in Wuhan (Hubei), we enrolled 201 consecutive NCP patients, 144 common pneumonia patients and 64 normal controls. We performed this study from January 25th to March 25th, 2020. Similarly, the second prospective pilot study was conducted in a non-epidemic central region in China with 41 consecutive NCP patients, 128 common pneumonia patients and 73 normal controls (Hefei, from February 2nd to March 25th). The third prospective pilot study was conducted in another region in southern China with 20 consecutive NCP patients, 57 common pneumonia patients and 63 normal controls (Guangzhou, from February 8th to March 27th; Table S1).

To further validate our AI system outside China, we also obtained CT images from a cohort of patients from Ecuador and data from an open-source website (https://radiopaedia.org/encyclopaedia/cases/all?lang=us). This international cohort consisted of 40 NCP patients, 52 CP patients, and 10 normal controls. This project was approved by the IRB of the respective institutions. Enrollment criteria included a diagnosis of viral pneumonia based on clinical symptoms and signs and standard laboratory tests. CT screening was performed in all participants as a part of clinical management. A confirmatory diagnosis of NCP was made by a molecular PCR test (Table S1).

We constructed a computer-aided diagnosis (CAD) system for detecting COVID-19 patients, which consisted of two models: a lung-lesion segmentation model followed by a diagnosis prediction model. The lung-lesion segmentation model took a raw CT scan as the input and produced a lung lesion map as the output by segmentation networks, assigning each pixel to one of seven classes: background, lung field, and five lesion types (CL, GGO, pulmonary fibrosis, interstitial thickening and pleural effusion). The entire scan of a patient was then cropped and transformed into a normalized CT volume map based on the lung-segmentation model. The diagnosis prediction model took the normalized lung lesion CT volume map and, by classification networks, produced a final prediction on whether a patient was normal, had NCP, or had other common pneumonia.

A CT scan usually consists of many slices, ranging from around 50 (5 mm thickness) to 200 (1 mm thickness), which is challenging for real-time application in clinical practice. To meet the requirement of real-time segmentation, we constructed a fast and accurate segmentation framework to segment the lung field and lesions in each CT slice. We formulated our segmentation framework as a two-stage segmentation scheme that enabled the model to focus on the medically meaningful regions of the input image, which reduced computation cost significantly (see Figure S2A for more details).

In the first stage, we downsampled the input image to a lower resolution and segmented the lung (lung field and lesions) from the image, as the patterns of the lung can be learned at a relatively low resolution. In the second stage, we first calculated the bounding box from the lung field segmentation results. The bounding box was slightly larger than the lung field and contained all key regions with pathological features. Next, the key region from the original input image was cropped and resized to a fixed resolution as the input for the second-stage segmentation model, which segmented all segmentation classes from the cropped image. The results from the second-stage segmentation were transferred to the coordinates of the original input image to form a final segmentation mask. This method increased the efficiency of the segmentation framework and was model agnostic; it could therefore be applied to any image segmentation model.
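
The geometric part of the two-stage scheme (bounding box from the coarse lung mask, crop of the original image, and transfer of the second-stage labels back to the original coordinates) can be sketched as follows. This is an illustration only: the function names, the margin, the integer scale factor between the two resolutions and the use of nested lists are all assumptions, and the resizing and the segmentation networks themselves are omitted.

```python
def lung_bounding_box(mask, margin=1):
    """Bounding box (top, left, bottom, right) around non-zero pixels of a
    coarse lung mask, enlarged by `margin` pixels and clipped to the image.
    Bottom and right are exclusive. Returns None for an empty mask."""
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return None
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, height)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, width)
    return top, left, bottom, right


def scale_box(box, scale):
    """Map a box found at low resolution to original-image coordinates
    (an integer scale factor is assumed for simplicity)."""
    return tuple(value * scale for value in box)


def crop(image, box):
    """Cut the key region out of the original image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def paste_back(full_shape, box, patch_labels, fill=0):
    """Place second-stage labels (already at crop size) into a full-size mask."""
    height, width = full_shape
    top, left, _, _ = box
    full = [[fill] * width for _ in range(height)]
    for r, row in enumerate(patch_labels):
        full[top + r][left:left + len(row)] = row
    return full
```

Because only the box coordinates link the two stages, any segmentation model can be dropped into either stage, which is consistent with the model-agnostic property stated above.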

The two-stage segmentation framework was adopted with different backbones: we chose a range of classic semantic segmentation models for extensive experiments, including U-net, DRUNET, FCN, SegNet and DeepLabv3. All images for training and evaluating the semantic segmentation models were resized to a fixed resolution to balance computation cost and accuracy. Batch normalization was used to accelerate the training procedure. The annotated labels were highly imbalanced, with the lung field taking up the majority of the labels. To address this problem, we used the pixel-level weighted binary cross-entropy and dice loss as the loss function for optimization. During training, an SGD optimizer was employed with an initial learning rate of 0.01, momentum of 0.9 and weight decay of 0.0001. The learning rate decayed by a factor of 0.9 every 5 epochs. The training batch size was 4.
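
To illustrate how a weighted binary cross-entropy term and a dice term can be combined to counter class imbalance, a minimal pure-Python sketch for one class channel is shown here. The positive-class weight, the smoothing constant and the unweighted sum of the two terms are assumptions; the source does not report the weights or how the two terms were balanced, and the actual training used PyTorch.

```python
import math


def weighted_bce(probs, targets, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy averaged over pixels; positive pixels are
    up-weighted by `pos_weight` (an assumed weighting scheme)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss: 1 - 2|P.T| / (|P| + |T|), with additive smoothing."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, pos_weight=1.0):
    """Sum of the two terms (equal weighting is an assumption)."""
    return weighted_bce(probs, targets, pos_weight) + dice_loss(probs, targets)
```

The cross-entropy term gives a per-pixel gradient, while the dice term depends on region overlap and is therefore less dominated by the large number of lung-field and background pixels.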

The segmentation results of the CT slices in a CT scan were stacked vertically to form a volume. The lung field region was then cropped from the volume and normalized to a fixed size in depth, height and width. The normalized volume was then converted into a one-hot representation for each pixel, yielding a tensor. The 3D classification network took the tensor as input and output the diagnosis probability for three categories: NCP, CP and normal controls.
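
The one-hot conversion of the stacked segmentation volume can be sketched as below. The seven classes follow the segmentation model described earlier; the integer class indices and the channel-first layout (class, depth, height, width) are assumptions made for illustration.

```python
NUM_CLASSES = 7  # background, lung field and five lesion types


def one_hot_volume(label_volume, num_classes=NUM_CLASSES):
    """Convert label_volume[d][h][w] (integer class per voxel) into
    tensor[c][d][h][w] with 1 where the voxel belongs to class c, else 0."""
    return [
        [
            [[1 if label == c else 0 for label in row] for row in ct_slice]
            for ct_slice in label_volume
        ]
        for c in range(num_classes)
    ]
```

Each voxel then has exactly one active channel, so the classifier receives the lesion map as categorical input rather than as arbitrary integer codes.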

The detailed structure of the 3D classification network, adapted from 3D ResNet-18 (Hara et al., 2017), is shown in Figure S2B. The network used multiple 3D convolutional blocks with residual connections to continuously extract local and global contextual features, and the final predictions were calculated with a fully connected layer followed by a softmax activation function. For the three-way diagnosis decision, the model output the class with the maximum probability. The 3D classification network was trained with cross-entropy loss between the final predictions and the ground truth labels. During training, we used the Adam optimizer with an initial learning rate of 0.001. The learning rate decayed by a factor of 0.1 every 10 epochs. Training ran for 20 epochs in total with a batch size of 8. The whole training, validation and testing procedures were conducted with PyTorch (v.1.2.0) on NVIDIA GeForce 1080Ti graphical processing units (Paszke et al., 2019).
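
The learning-rate schedules reported for the two networks can be read as step decay, in which the rate is multiplied by a fixed factor after every fixed number of epochs. The sketch below assumes that interpretation and zero-based epoch counting; the function name is hypothetical.

```python
def step_decay_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a zero-based epoch under step decay:
    initial_lr * gamma ** floor(epoch / step_size)."""
    return initial_lr * gamma ** (epoch // step_size)


# Classification network: 0.001, decayed by 0.1 every 10 epochs, 20 epochs.
classification_schedule = [step_decay_lr(0.001, 0.1, 10, e) for e in range(20)]

# Segmentation networks: 0.01, decayed by 0.9 every 5 epochs.
segmentation_lr_epoch_12 = step_decay_lr(0.01, 0.9, 5, 12)
```

Under this reading, the classification network trains at 0.001 for the first ten epochs and at 0.0001 for the last ten.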

We evaluated the effect of drug treatment on lesion size and volume changes using our AI-based quantitative measurements. NCP patients undergoing three different experimental drug treatments in observation trials were analyzed. The drug 1, drug 2 and drug 3 groups enrolled 12, 8 and 22 NCP patients, respectively. The enrollment criteria included a confirmed NCP diagnosis by a positive viral PCR test and no prior treatment history. We quantified lung lesions (GGO and total lesion volumes) on CT scans and compared the differences between pre-treatment and post-treatment scans for the three experimental drugs. A dependent t-test for paired samples was used for statistical analysis (Figure S4).

Clinical records of COVID-19 patients, which consisted of patients' demographics and clinical data, including vital signs, symptoms and signs, as well as imaging studies and laboratory test results on initial hospital admission and follow-ups, were manually annotated. Laboratory tests consisted of comprehensive blood biochemical analysis, blood gas analysis, liver and renal biochemistry indexes, cardiac function tests, coagulation tests, and certain serum protein levels and activities, including lactate dehydrogenase (LDH) and C-reactive protein (CRP).

To investigate the correlations between CT quantitative features and clinical parameters, Pearson's and Spearman's correlation tests were performed. Three volume-level quantitative features were used: GGO to lung volume ratio, CL to lung volume ratio, and total lesion (GGO + CL) to lung volume ratio, since these were common lesions in the progressive stage of NCP patients. A correlation analysis between lung lesion features and lung and liver damage assessment was conducted using a composite score system, as lung and liver functions were highly relevant to outcomes of COVID-19 patients based on a previous study and international guidelines (Force et al., 2012; Singer et al., 2016). We identified and quantified clinical parameters (including age, Tmax, RR, platelet, Na+, albumin, AST, LDH, CRP) that were highly correlated with lung lesions by a linear regression analysis. The resultant correlations were regarded as significant when the P value was <0.05 after correction with the Holm-Bonferroni method.
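
The Holm-Bonferroni correction applied to the correlation P values can be sketched as a step-down adjustment: the P values are sorted, the k-th smallest of m is multiplied by (m - k + 1), and the adjusted values are forced to be non-decreasing. The function name and the example P values in the usage note are illustrative and not taken from the study.

```python
def holm_adjusted_p_values(p_values):
    """Holm-Bonferroni adjusted P values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest P value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity so a larger raw P never gets a smaller adjusted P.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted
```

With the hypothetical raw values 0.01, 0.04, 0.03 and 0.005, only the first and last remain below 0.05 after adjustment, although all four are below 0.05 before it.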

For the NCP prognostic and risk factor analyses, we used "severe/critical illness" as the endpoint of the prognosis of COVID-19 patients. The status of "severe/critical illness" was defined and recorded as admission to an intensive care unit (ICU), the use of mechanical ventilation, or death. CT lesion features (only volume quantitative features) and clinical metadata features were used for prognostic prediction with the Gradient Boosting Decision Tree algorithm (GBDT) as the classifier. The GBDT is a tree-based ensemble model in which each node in a tree can be converted to IF-THEN rules that are easily understandable; the GBDT implementation in the Python package LightGBM was employed with default parameters (Ke et al., 2017). The SHAP method was used to display the impact of relevant risk factors on prognostic prediction for critical illness. SHAP is an explanation tool for tree-based models that can efficiently and exactly compute local and global explanations. The local explanations of SHAP for interpretable prognosis prediction were also investigated. As an example, two patients, one from the critical illness group and one from the non-critical illness group, were used to show the effects of lung lesion features and clinical parameters as the input risk factors for prognosis prediction (Figure S7A and Figure S7B). We used a five-fold cross-validation scheme for prognostic prediction. For each fold, we calculated a probability (c-score) for each patient in the test dataset (20%) using coefficient estimates from the training and validation dataset (70%:10%). Using a c-score cut-off of 0.5, we classified patients into a high-risk group (c-score of >0.5) and a low-risk group (c-score of <0.5). With this stratification, survival in the two groups was estimated with the Kaplan-Meier estimator and compared with a log-rank test.
The importance of each parameter for prognostic estimation was also estimated by fitting a multi-variable Cox proportional hazards model on CT quantitative lesion features and clinical parameters.
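
The stratification by c-score and the Kaplan-Meier estimate used for each risk group can be illustrated with a short pure-Python sketch. The study itself used the lifelines package; the function names here are hypothetical, and assigning a c-score of exactly 0.5 to the low-risk group is an assumption, since the text does not specify that case.

```python
def stratify(c_scores, cutoff=0.5):
    """Label each patient as high or low risk by the c-score cut-off."""
    return ["high" if score > cutoff else "low" for score in c_scores]


def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if the endpoint occurred at that time, 0 if censored.
    Returns a list of (time, survival_probability) pairs.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for time in event_times:
        at_risk = sum(1 for t in durations if t >= time)
        failures = sum(1 for t, e in zip(durations, events) if t == time and e)
        survival *= 1.0 - failures / at_risk
        curve.append((time, survival))
    return curve
```

Running `kaplan_meier` separately on the high-risk and low-risk patients yields the two curves that a log-rank test then compares.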

We evaluated semantic segmentation model performance with two evaluation metrics, Pixel Accuracy (PA) and Dice Coefficient (DC). The PA is the percentage of pixels that are classified correctly. The DC is twice the area of overlap between the predicted segmentation and the ground truth divided by the sum of the areas of the predicted segmentation and the ground truth. ROC and AUC were used to assess model performance for each classification task. Confidence intervals (CI) of AUC were computed using a bootstrapping approach with nonparametric, unstratified resampling of 1000 times (Efron, 1979). Sensitivity, specificity and accuracy were determined by the selected operating point. The operating points, balancing a low false-negative rate (high sensitivity) against a low false-positive rate (1 − specificity), were set at different thresholds accordingly. Pearson's and Spearman's correlation tests with the Holm-Bonferroni method were employed for the statistical analyses. The training, validation and testing procedures of deep learning models were conducted with PyTorch (v.1.2.0). We used the Python scikit-learn library for data analysis and the Python matplotlib and seaborn libraries to plot graphs. The Python packages lightgbm and lifelines were employed for prognostic prediction. The measurements of sensitivity, specificity, and accuracy were calculated by the Python scikit-learn library (Davidson et al., 2019).
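
The two segmentation metrics defined above can be computed directly from flattened label maps, as in the following minimal sketch. The function names are hypothetical, and returning 1.0 when a class is absent from both the prediction and the ground truth is a convention assumed here, not one stated in the source.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted class equals the ground truth."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


def dice_coefficient(predicted, truth, label):
    """Twice the overlap divided by the sum of the two areas for one class."""
    predicted_area = sum(p == label for p in predicted)
    truth_area = sum(t == label for t in truth)
    overlap = sum(p == label and t == label for p, t in zip(predicted, truth))
    if predicted_area + truth_area == 0:
        return 1.0  # class absent in both maps (assumed convention)
    return 2.0 * overlap / (predicted_area + truth_area)
```

Pixel accuracy is dominated by the most frequent class, whereas the per-class dice coefficient exposes how well small lesion classes are segmented.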

• AI system that can diagnose COVID-19 pneumonia using CT scans
• Prediction of progression to critical illness
• Potential to improve performance of junior radiologists to the senior level
• Can assist evaluation of drug treatment effects with CT quantification

Zhang et al. present an AI-based system, based on hundreds of thousands of human lung CT scan images, that can aid in distinguishing patients with pneumonia caused by SARS-CoV-2 versus other viral infections and can help to predict the prognosis of COVID-19 patients. 

Segnet: A deep convolutional encoder-decoder architecture for image segmentation

Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

Quantitative CT Analysis of Diffuse Lung Disease

Rethinking atrous convolution for semantic image segmentation

Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study

lifelines: survival analysis in Python

DRUNET: a dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images

Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics

A guide to deep learning in healthcare

Clinical Characteristics of Coronavirus Disease 2019 in China

Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Use of Chest CT in Combination with Negative RT-PCR Assay for the 2019 Novel Coronavirus but High Clinical Suspicion

Lightgbm: A highly efficient gradient boosting decision tree

Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning

Efficacy of a Deep Learning System for Detecting Glaucomatous Optic Neuropathy Based on Color Fundus Photographs

An artificial intelligence platform for the multihospital collaborative management of congenital cataracts

Fully convolutional networks for semantic segmentation

Consistent individualized feature attribution for tree ensembles

Explainable machine-learning predictions for the prevention of hypoxaemia during surgery

A Novel Coronavirus Emerging in China - Key Questions for Impact Assessment

A call for deep-learning healthcare

PyTorch: An imperative style, high-performance deep learning library

Importation and Human-to-Human Transmission of a Novel Coronavirus in Vietnam

Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

Predicting the early risk of chronic kidney disease in patients with diabetes using real-world data

U-net: Convolutional networks for biomedical image segmentation

Transmission of 2019-nCoV Infection from an Asymptomatic Contact in Germany

Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study. The Lancet Infectious diseases 10

Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes

High-performance medicine: the convergence of human and artificial intelligence

Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images

A novel coronavirus outbreak of global health concern

Chest CT for Typical 2019-nCoV Pneumonia: Relationship to Negative RT-PCR Testing

Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study

A Novel Coronavirus from Patients with Pneumonia in China