key: cord-0714309-nhxrpy37 authors: Mao, X.; Liu, X.-P.; Yang, X.; Mao, J.-H.; Xiong, M.; Zhou, S.; Chang, H. title: Development and validation of chest CT-based imaging biomarkers for early stage COVID-19 screening date: 2020-05-20 journal: nan DOI: 10.1101/2020.05.15.20103473 sha: 553b2968ce371c6937464605778dc4a6222cb159 doc_id: 714309 cord_uid: nhxrpy37

Coronavirus Disease 2019 (COVID-19) is currently a global pandemic, and early screening is one of the key factors in its control and treatment. Here, we developed and validated chest CT-based imaging biomarkers for COVID-19 patient screening. We identified vasculature-like signals in CT images and found that COVID-19 patients showed a significantly higher abundance of these signals than healthy participants and community-acquired pneumonia (CAP) patients. Furthermore, unsupervised feature learning led to the discovery of clinically relevant imaging biomarkers from the vasculature-like signals for accurate and sensitive COVID-19 screening, which was double-blindly validated in an independent hospital (sensitivity: 0.941, specificity: 0.904, AUC: 0.952). Our findings could open a new avenue to assist screening of COVID-19 patients.

Chest CT findings of COVID-19 patients, especially at an early stage, are similar to those of patients with other common pneumonias, including H7N9 influenza virus pneumonia, mycoplasma pneumonia, chlamydial pneumonia and bacterial pneumonia [7]. In this study, we develop and validate chest CT-based imaging biomarkers for COVID-19 patient screening using artificial intelligence (computer vision) methods, which should help reduce the workload of clinicians and assist in the differential diagnosis of COVID-19 from other diseases.

The chest CT images in this case-control study were collected from Hubei Provincial Hospital of Traditional Chinese Medicine and Wuhan Third Hospital.
The inclusion criteria for COVID-19 patients were: (1) patients were diagnosed and confirmed by nucleic acid test between January 2020 and March 2020; (2) patients had mild or moderate disease, where severity was classified according to the Coronavirus Disease 2019 (COVID-19) diagnosis and treatment guideline (trial version 7) issued by the National Health Commission of the People's Republic of China. Specifically, according to the guideline, patients with mild disease have mild clinical symptoms and no obvious pneumonia manifestations on chest CT images, whereas patients with moderate disease have obvious respiratory symptoms such as fever and cough, and obvious pneumonia manifestations on chest CT images. In addition, patients with community-acquired pneumonia (CAP) and healthy participants (with no obvious abnormalities on chest CT images) were randomly collected from the two aforementioned hospitals and used as the control group. The inclusion criteria for the control group were: (1) patients who were diagnosed with lung infection on imaging and clinical grounds a few months before the onset of the epidemic; (2) patients without severe diseases of the respiratory, cardiovascular or cerebrovascular systems; (3) patients without mental illness or cognitive impairment.

This study was approved by the institutional review boards (IRB) of both participating hospitals. Chest CT exams at Hubei Provincial Hospital of Traditional Chinese Medicine were performed with two different scanners: (1) GE Optima 660 CT (GE Healthcare, Milwaukee) and (2) uCT 530 (United Imaging, Shanghai), with reconstruction thickness of 0.625 mm and 1 mm, respectively. CT exams at Wuhan Third Hospital were performed with a GE Discovery CT750 HD (GE Healthcare, Milwaukee) with reconstruction thickness of 0.625 mm.
Vasculature-like signal was recognized and enhanced using an iterative tangential voting (ITV) approach [8] within pre-segmented lung regions in 3D, where ITV enforces the continuity and strength of local linear structures (i.e., vasculature-like structures) and the 3D lung segmentation was obtained with a level-set method [9]. Specifically, ITV operates on CT image gradient information, with sigma set to 0.5 for the training cohort and 1.0 for the validation cohort to accommodate technical differences across hospitals.

We developed an unsupervised feature learning pipeline based on Stacked Predictive Sparse Decomposition (Stacked PSD) [10] for the unsupervised discovery of underlying 3D characteristics from the 'vasculature-like signal' space derived from the raw CT images. Specifically, we used a single network layer with 256 dictionary elements (i.e., signal patterns) at a fixed pattern size of 20x20x20 voxels and a fixed sampling rate of 100 3D patches per sample; these settings were experimentally optimized. In the training cohort, 8 of the 256 dictionary elements were found to correlate significantly with COVID-19 (FDR < 0.05) through a cross-validation approach (training sample rate: 0.8; bootstrap 100 times). Finally, these 8 significant dictionary elements were selected as a set of imaging biomarkers and used to build a random decision forest model for COVID-19 screening. A double-blind study was designed and implemented to validate this model in an independent hospital.

Visualization of these imaging biomarkers was created in three-dimensional space using ITK-Snap (version 3.8.0), Python (version 3.7.0), Matplotlib (version 3.1.2), Blender (version 2.82) and Three.js (version r115 on GitHub). Snapshots of the three-dimensional visualization were used to generate a two-dimensional visualization overlaid on the original CT slices.
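The patch sampling step described above (100 3D patches of size 20x20x20 per sample, drawn from the vasculature-like signal within the lung region) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_patches`, the uniform random choice of patch centres inside the lung mask, and the flattening of each patch into a row vector are all assumptions, since the text does not state how patch locations were chosen.

```python
import numpy as np

PATCH_SIZE = 20           # pattern size per axis, as stated in the text
PATCHES_PER_SAMPLE = 100  # sampling rate per sample, as stated in the text


def sample_patches(volume, lung_mask, n_patches=PATCHES_PER_SAMPLE,
                   size=PATCH_SIZE, seed=0):
    """Draw random cubic patches whose centres lie inside the lung mask.

    Assumptions (not from the source): centres are picked uniformly at
    random among lung voxels, with replacement, and `size` is even.
    """
    volume = np.asarray(volume)
    lung_mask = np.asarray(lung_mask, dtype=bool)
    if volume.shape != lung_mask.shape:
        raise ValueError("volume and lung_mask must have the same shape")

    rng = np.random.default_rng(seed)
    half = size // 2

    # Only centres far enough from the border admit a full patch.
    valid = np.zeros(volume.shape, dtype=bool)
    valid[half:volume.shape[0] - half,
          half:volume.shape[1] - half,
          half:volume.shape[2] - half] = True
    centres = np.argwhere(lung_mask & valid)
    if len(centres) == 0:
        raise ValueError("no lung voxel admits a full patch")

    picks = centres[rng.integers(0, len(centres), size=n_patches)]
    patches = np.stack([
        volume[z - half:z + half, y - half:y + half, x - half:x + half]
        for z, y, x in picks
    ])
    # One row per patch, a common input layout for dictionary learning.
    return patches.reshape(n_patches, -1)


# Synthetic stand-in for an enhanced CT volume and its lung mask.
demo_volume = np.random.default_rng(1).random((40, 64, 64))
demo_mask = np.ones(demo_volume.shape, dtype=bool)
demo_patches = sample_patches(demo_volume, demo_mask)
print(demo_patches.shape)  # (100, 8000)
```

Each row holds one flattened 20x20x20 patch (8000 values), which is the kind of matrix a dictionary learner with 256 elements would take as input.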
To validate the diagnostic performance of our pre-identified 3D imaging biomarkers, we invited two experienced chest radiologists to independently and blindly (blind to clinical data) assess the CT images in our validation cohort. The two chest radiologists have 8 and 10 years of clinical imaging diagnosis experience, respectively. Both have more than 2 months of intense and continuous experience diagnosing COVID-19 in Wuhan, China. Sensitivity and specificity were used to compare the performance of the chest radiologists with that of our 3D imaging biomarkers.

Differences in the vasculature-like signals among groups (COVID-19, CAP and healthy) were assessed by a non-parametric test, and the association between signatures and COVID-19 was assessed by logistic regression. Principal component analysis (PCA) and heatmap cluster analysis were performed in R (version 3.6.1) and MATLAB (version 2012b), respectively. Screening performance was characterized by sensitivity, specificity and area under the curve (AUC).

To identify chest CT-based imaging biomarkers for COVID-19 patient screening, we conducted a case-control study in two hospitals using machine learning (Fig. 1). The population characteristics of the training and double-blind validation cohorts are summarized in Extended Data Table 1. A total of 321 participants were included in this case-control study. The cohort from one hospital (Hospital A, n=116) served as the training set, and the cohort from the other (Hospital B, n=205) served as the double-blind validation set (Fig. 1). The median ages of participants in the training and validation cohorts were 42 (range: 14-76) and 58 (range: 19-89), respectively. There were 53 (45.7%) females and 63 (54.3%) males in the training cohort, and 110 (53.7%) females and 95 (46.3%) males in the validation cohort.
The training cohort contained 47 (40.5%) COVID-19 patients, 20 (17.2%) healthy participants and 49 (42.2%) CAP patients, while the validation cohort had 153 (74.6%) COVID-19 patients, 15 (7.3%) healthy participants and 37 (18.0%) CAP patients.

In our study, the vasculature-like structure was recognized and enhanced with iterative tangential voting in both the training and validation cohorts as a pre-processing step. In the training cohort, the mean vasculature-like signal differed significantly (p-value < 0.05) among healthy participants, CAP patients and COVID-19 patients (Fig. 2b). These findings are consistent with the vascular changes observed in lung tissue from COVID-19 patients, including vascular congestion/enlargement, small-vessel hyperplasia and vessel wall thickening [11, 12]. Furthermore, this distinction alone differentiates COVID-19 from non-COVID-19 groups in our training cohort with a logistic regression approach (AUC=0.721, Extended Data Fig. 2, blue curve). Altogether, these results encouraged us to identify imaging biomarkers from the "vasculature-like signal" space to assist accurate screening of COVID-19.

Next, we applied Stacked PSD to the entire training cohort within the 'vasculature-like signal' space to learn the underlying characteristics for dictionary construction (see Methods for details). A total of 256 dictionary elements were learned and optimized from the entire training cohort. We found that 8 of the 256 dictionary elements have a significantly positive correlation with COVID-19 (FDR < 0.05, Extended Data Table 2 and Supplementary Table 1). These 8 COVID-19-relevant signatures (i.e., imaging biomarkers) are presented graphically (Fig. 1, "3D CT Imaging Biomarkers" panel). These biomarkers also allow the construction of full 3D multispectral staining of the entire lung region (Fig. 2a), which is further demonstrated in 3D animations (Supplementary Videos 1-3).
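The selection of dictionary elements at FDR < 0.05 reported above can be illustrated with a short sketch. The text does not say which FDR procedure was used, so the Benjamini-Hochberg step-up procedure shown here is an assumption, as are the function name `benjamini_hochberg` and the toy p-values. In the study's setting, the input would be one p-value per dictionary element (256 in total).

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha.

    Benjamini-Hochberg step-up procedure; assumed here for illustration,
    since the source only states the cutoff "FDR < 0.05".
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Threshold for the k-th smallest p-value is alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject everything up to the largest passing rank.
        k_max = np.max(np.nonzero(below)[0])
        rejected[order[:k_max + 1]] = True
    return rejected


# Toy p-values (hypothetical, not taken from the study).
toy_p = [0.03, 0.03, 0.03, 0.5]
toy_selected = benjamini_hochberg(toy_p)
print(toy_selected)  # [ True  True  True False]
```

The toy example shows the step-up behaviour: none of the first two ranked p-values passes its own threshold (0.0125 and 0.025), but the third passes 0.0375, so all three are selected.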
The corresponding 2D multispectral staining of the CT image slices was also constructed (Supplementary Videos 4-6). The 8 imaging biomarkers clearly separate COVID-19 patients from others in the training cohort by PCA (Fig. 2c) and clustering (Extended Data Fig. 3a) analysis. Finally, we built a random decision forest model for COVID-19 screening based on these imaging biomarkers within the training cohort (AUC = 1.000, Extended Data Fig. 2, red curve), which was then tested in the validation cohort.

The identical enhancement process was applied to the validation cohort. As in the training cohort, we observed a distinction in mean vasculature-like signal among the groups (i.e., COVID-19, CAP and healthy) (Fig. 2d). The logistic regression model pre-built on this signal in the training cohort accurately distinguished COVID-19 patients from others in the validation cohort (AUC=0.927, Fig. 2f, blue curve). The 8 pre-identified imaging biomarkers also clearly separate the COVID-19 patients from others in the validation cohort (Fig. 2e, Extended Data Fig. 3b). We found that the pre-built random decision forest model based on the pre-obtained biomarkers predicts COVID-19 with high sensitivity (0.941), specificity (0.904) and AUC (0.952), which is competitive with two chest radiologists experienced in COVID-19 (Fig. 2f).

In this study, we developed and validated 3D imaging biomarkers for COVID-19 screening based on chest CT images. Our quantitative evaluation suggests that, compared with healthy participants and CAP patients, COVID-19 patients may have significantly more vascular changes in lung tissue, including vascular congestion/enlargement, small-vessel hyperplasia and vessel wall thickening [11, 12], which led to the discovery of robust imaging biomarkers for COVID-19 screening.
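For readers who want to reproduce the screening metrics used above, the following sketch computes sensitivity, specificity and AUC from labels and predictions. It is illustrative only. The confusion-matrix counts in the example (144 of 153 COVID-19 cases detected, 47 of 52 non-COVID-19 cases correctly excluded) are hypothetical values chosen because they round to the reported 0.941 and 0.904 for a cohort of 153 COVID-19 and 52 non-COVID-19 participants; the source does not report these counts. The function names are likewise assumptions.

```python
import numpy as np


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return float(tp / (tp + fn)), float(tn / (tn + fp))


def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


# Hypothetical counts (not reported in the source): 153 positives with
# 144 detected, 52 negatives with 47 correctly excluded.
y_true = np.array([1] * 153 + [0] * 52)
y_pred = np.array([1] * 144 + [0] * 9 + [0] * 47 + [1] * 5)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3))  # 0.941 0.904

# Toy scores (hypothetical) to illustrate the AUC computation.
toy_auc = auc([1, 1, 1, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.1])
```

In the toy AUC example, five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6.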
Our double-blind validation confirms the robustness and effectiveness of the pre-identified imaging biomarkers in an independent hospital, with high specificity (0.904) and sensitivity (0.941), which is competitive with two chest radiologists experienced in COVID-19.

The current COVID-19 epidemic is a worldwide threat. In Europe and the United States in particular, there are tens of thousands of new confirmed and suspected cases every day [13-16]. This sharp increase in cases is depleting medical resources to varying degrees and is paralyzing the medical system in some countries [17]. Therefore, rapid screening, diagnosis, isolation and treatment of COVID-19 patients are particularly important for the prevention and control of the epidemic. Unfortunately, nucleic acid detection kits for COVID-19 have been in short supply in many countries. At the same time, owing to improper operation, technical variation among nucleic acid detection kits and many other reasons, nucleic acid tests produce a certain proportion of false negatives. On the other hand, CT examination has been shown to have unique advantages in the early screening and diagnosis of COVID-19 [11, 18]. To explore new possibilities for COVID-19 screening, this study used an unsupervised deep learning method to identify robust imaging biomarkers from chest CT scans for accurate COVID-19 screening.
The major advantages of our imaging biomarkers are twofold: (1) they provide robust, accurate and cost-effective COVID-19 screening, which can significantly alleviate the shortage of clinical resources, including both nucleic acid detection kits and experienced chest radiologists; and (2) they provide a non-invasive diagnostic tool that can be applied in practice at worldwide scale. Furthermore, our imaging biomarkers may provide a new avenue for predicting the prognosis and clinical outcome of COVID-19 patients, which will be investigated in our future research.

The chest CT images involved in this study are available upon request to, and consideration by, the corresponding author(s) of this manuscript. Iterative Tangential Voting for vasculature-like structure enhancement is publicly available at http://bmihub.org/project/itv, and the Stacked PSD for imaging biomarker detection is publicly available at http://bmihub.org/project/stackedpsd.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2020.05.15.20103473; this version posted May 20, 2020.

The COVID-19 epidemic
Coronavirus: the spread of misinformation
European Centre for Disease Prevention and Control (ECDC) Public Health Emergency Team: Rapidly increasing cumulative incidence of coronavirus disease (COVID-19) in the European Union/European Economic Area and the United Kingdom