key: cord-1021784-s490to7o
authors: Komolafe, Temitope Emmanuel; Cao, Yuzhu; Nguchu, Benedictor Alexander; Monkam, Patrice; Olaniyi, Ebenezer Obaloluwa; Sun, Haotian; Zheng, Jian; Yang, Xiaodong
title: Diagnostic test accuracy of deep learning detection of COVID-19: a systematic review and meta-analysis
date: 2021-09-17
journal: Acad Radiol
DOI: 10.1016/j.acra.2021.08.008
sha: 4518f42a80bfb6807a2b73a8b3779d659d3715f1
doc_id: 1021784
cord_uid: s490to7o

RATIONALE AND OBJECTIVE: To perform a meta-analysis to compare the diagnostic test accuracy (DTA) of deep learning (DL) in detecting coronavirus disease 2019 (COVID-19), and to investigate how network architecture and type of datasets affect DL performance.

MATERIALS AND METHODS: We searched PubMed, Web of Science and Inspec from January 1, 2020, to December 3, 2020, for retrospective and prospective studies on deep learning detection that reported at least sensitivity and specificity. Pooled DTA was obtained using random-effects models. Sub-group analysis between studies was also carried out for data source and network architecture.

RESULTS: The pooled sensitivity and specificity were 91% (95% confidence interval [CI]: 88%, 93%; I² = 69%) and 92% (95% CI: 88%, 94%; I² = 88%), respectively, for 19 studies. The pooled AUC and diagnostic odds ratio (DOR) were 0.95 (95% CI: 0.88, 0.92) and 112.5 (95% CI: 57.7, 219.3; I² = 90%), respectively. The overall accuracy, recall, F1-score, positive likelihood ratio (LR+) and negative likelihood ratio (LR-) are 89.5%, 89.5%, 89.7%, 23.13 and 0.13. Sub-group analysis shows that sensitivity and DOR vary significantly with the type of network architecture and the source of data, with low heterogeneity for the ResNet architecture (I² = 0%) and for single-source datasets (I² = 18%).
CONCLUSION: The diagnosis of COVID-19 via deep learning has achieved incredible performance, and the source of datasets, as well as network architectures, strongly affect DL performance.

Despite the rollout of vaccines across the world, new cases and new deaths are still recorded daily in some countries. Most newly reported cases are due to the second wave, necessitating immediate and long-term solutions for early detection. Early detection is crucial to the management of the disease and to preventing a third wave, given the highly contagious nature of the disease. The gold standard diagnostic test for COVID-19 is the reverse transcriptase-polymerase chain reaction (RT-PCR) (4), but the time required for the result to become available is considerably long. This perceived shortcoming led to the development of a non-invasive assessment of COVID-19 patients. Radiologic assessment of the chest via plain chest radiography and chest computed tomography (CT) has been found useful in the management of COVID-19. Chest CT is capable of revealing image features in patients with COVID-19 who do not show any detectable abnormalities on a plain radiograph (5). Radiological studies revealed that the typical imaging features of patients with COVID-19 are bilateral, peripheral, multifocal ground-glass opacity (GGO) and consolidation, predominantly located in the subpleural and peribronchovascular regions (1, 5). However, other kinds of viral pneumonia can mimic COVID-19 pneumonia, making the two difficult to differentiate (6). The field of machine learning (ML) cuts across multiple statistics-based techniques useful to radiologists in disease diagnosis, and it complements the currently adopted deep learning (DL) approach (7). The incorporation of ML into deep learning and artificial intelligence (AI) has shown great potential in assisting decision-making for assessing severity and predicting the clinical outcomes of disease in COVID-19 patients (8, 9).
Li et al. (10) conducted a systematic review and meta-analysis of machine learning diagnosis of COVID-19 covering 151 published studies and reported a sensitivity and specificity of 92.5% and 97.9%, respectively, for the XGBoost model. Recently, Li et al. (11) carried out a multi-reader study on the grading of COVID-19 in chest radiography and observed that the AI system improved radiologist performance. Since the deep learning technique has been found useful in the diagnosis of COVID-19 (12, 13), combining radiologist interpretation with the DL approach gives promising results for the detection of COVID-19 (13). To this effect, the potential use of deep learning suggests a better future for the clinical diagnosis of COVID-19, as supported by Islam et al. (14). Li et al. (13) performed a multi-center retrospective study using a deep learning COVID-19 detection neural network (COVNet) to extract visual features from volumetric CT scans for the detection of COVID-19. Accurate detection of features that distinguish COVID-19 from community-acquired pneumonia (CAP) and other lung infections was achieved (13). A study by Javor et al. (15) testing the diagnostic accuracy of a convolutional neural network (ResNet-50) on public chest CT datasets revealed that the diagnostic accuracy of the deep learning model did not differ significantly from that of radiologists at rule-in thresholds, whereas the difference was significant at rule-out thresholds, suggesting better results for deep learning with public datasets (9). Moezzi et al. (16) used a meta-analysis of 36 studies to summarize the evidence on the accuracy of artificial intelligence (AI)-assisted CT scanning for COVID-19. The study compared deep learning (DL), machine learning (ML), and AI systems. The results show that AI systems performed slightly better than their DL and ML counterparts, which implies that AI systems will be useful in identifying COVID-19 symptoms.
This study did not consider the effect of training data or how the network architectures affect DL detection ability. A systematic review with meta-analysis of the effect of deep learning network architectures and data types will bridge this evidence gap, which is the aim of this systematic review. This systematic review and meta-analysis aimed to summarize all the available evidence in order to quantitatively evaluate the diagnostic test accuracy (DTA) of deep learning algorithms for the detection of COVID-19 in chest CT. In doing so, the review provides crucial new information on how network architecture and data types affect the performance of the DL algorithm in COVID-19 diagnosis.

This systematic review and meta-analysis was prospectively registered at PROSPERO with the registration number CRD: 42020223202 (17). The systematic review was performed by two independent reviewers (TEK and YC or PM and EOO) following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines, a well-established review protocol (18). Discrepancies between the two sets of results were discussed by the two reviewers, and a more experienced third reviewer (XY or JZ) was consulted when consensus was not reached.

We conducted a meticulous search focused on deep learning diagnosis of COVID-19 using chest CT images of patients, restricted to studies with well-documented diagnostic accuracy information: at minimum sensitivity and specificity, or a 2x2 confusion matrix. Our search included clinical trials and cohort, prospective, and retrospective studies based on deep learning detection of COVID-19. Most studies were retrospective, because deep learning requires a large number of datasets. All literature reviews were excluded.
PubMed, Inspec, Web of Science, and other biomedical databases were searched, with additional hand searching, to identify relevant literature published from 1 January 2020 to 3 December 2020. The same keywords and search terms were used for the PubMed, Inspec, and Web of Science databases.

Literature was included in the study if it was based on deep learning diagnosis of COVID-19 using chest CT images, in either a screening or a diagnostic protocol, and provided well-documented information on diagnostic test accuracy: at least sensitivity and specificity, or a 2x2 confusion matrix from which other diagnostic test accuracy parameters could be computed. The included studies were mostly retrospective, with a few prospective studies, an observer performance study, a clinical trial, and comparative studies. The exclusion criteria comprised literature reviews, studies on reverse transcription-polymerase chain reaction (RT-PCR), studies of other machine learning detection algorithms, and detection using chest X-ray datasets or a combination of chest CT and chest X-ray. In addition, studies lacking the information needed to compute the DTA and multiple publications were excluded. For studies that reported the same study cohort or a sub-set of it, the most detailed one in terms of data availability was used. Articles retrieved for both arms were manually sorted and duplicates were removed; articles were then screened by title and abstract, followed by full text, according to the predefined search criteria, and the final eligible studies were selected.
We developed a standard extraction sheet, agreed upon by consensus between two independent reviewer teams (TEK and YC or BAN and HS), to extract the required information from eligible studies, with conflicts resolved by consensus. The extracted information included nationality, data source, data partitioning, training model, deep learning techniques, training parameters, the total number of positive (cohort) vs control (negative) cases, and other valuable information. We also extracted quantitative data for the meta-analysis, namely the 2x2 confusion matrix (True Positive (TP), False Negative (FN), True Negative (TN), and False Positive (FP)) needed to compute the required DTA measures: sensitivity, specificity, diagnostic odds ratio (DOR), recall, accuracy, precision, F1-score, the positive and negative likelihood ratios, and the AUC (19, 20). These measures follow their standard expressions in terms of TP, FN, TN, and FP.

The quality of the included studies was assessed using a modified QUADAS-2 to ensure appropriateness for COVID-19 screening (21). The domains assessed were Patient Selection, Index Tests, Reference Standard, Flow and Timing, and Applicability. Two reviewers (TEK and YC) performed an independent quality assessment and the final result was based on consensus. The overall study quality pipeline is shown in Fig. 2.

A univariate meta-analysis was performed separately for sensitivity and specificity to estimate the diagnostic accuracy of each modality using the DerSimonian-Laird random-effects (RE) model (22). We chose the RE model because we suspected high heterogeneity arising from differences in the network architectures used for training and differences in the training data across age, sex, and so on. The primary outcomes were sensitivity, specificity, the summary receiver operating characteristic (SROC) curve, and the diagnostic odds ratio (DOR).
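The accuracy measures listed above and the DerSimonian-Laird pooling step can be illustrated with a minimal Python sketch. It is not the authors' code (the review used the R packages "mada" and "meta"); the function names `dta_metrics` and `dersimonian_laird`, the logit transformation of sensitivity, and the three example studies are assumptions made purely for illustration.

```python
import math


def dta_metrics(tp, fn, tn, fp):
    """Standard accuracy measures from one study's 2x2 confusion matrix.

    No continuity correction is applied, so a zero cell can raise
    ZeroDivisionError.
    """
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)  # algebraically equal to lr_pos / lr_neg
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "precision": precision, "accuracy": accuracy, "f1": f1,
        "lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor,
    }


def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    Returns (pooled effect, standard error, tau^2, Cochran's Q, I^2 in %).
    """
    k = len(effects)
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, q, i2


# Hypothetical (tp, fn, tn, fp) counts for three studies; not from the review.
studies = [(90, 10, 85, 15), (45, 5, 40, 10), (70, 30, 95, 5)]
# Assumption: sensitivity is pooled on the logit scale.
logit_sens = [math.log(tp / fn) for tp, fn, _, _ in studies]
var_logit = [1 / tp + 1 / fn for tp, fn, _, _ in studies]
pooled, se, tau2, q, i2 = dersimonian_laird(logit_sens, var_logit)
pooled_sens = 1 / (1 + math.exp(-pooled))  # back-transform to a proportion
print(f"pooled sensitivity = {pooled_sens:.3f}, I^2 = {i2:.1f}%")
```

The same `dersimonian_laird` call could be applied to logit specificity or to log DOR; whether the original analysis pooled on these scales is not stated in the text.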
We calculated point estimates and 95% confidence intervals (CI) for each study to ensure consistency in sensitivity and specificity. To obtain the SROC curve, we performed a bivariate meta-analysis of sensitivity and specificity in R version 3.6.2 with RStudio version 1.2.5042, using the R packages "mada" and "meta", after which the mean AUC of the SROC was estimated (23). Secondary outcomes included the positive and negative likelihood ratios, accuracy, precision and F1-score. Statistical heterogeneity between studies was evaluated with Cochran's Q test and the I² statistic (19). For the I² statistic, values of 0%-40% imply insignificant heterogeneity, 30%-60% moderate heterogeneity, and 75%-100% considerable heterogeneity. Publication bias was evaluated and visualized by constructing a funnel plot (25). All p-values were based on two-sided tests, and a p-value < 0.05 was considered statistically significant. We conducted sub-group analysis based on the deep learning techniques and the training model (transfer learning and customized method).

The included studies were rated as being of moderate overall quality according to QUADAS-2 (Fig. 2). About 5% of the included studies did not give details about patient selection and 5% provided unclear information about patient selection, leading to high and unclear risks of bias in patient selection, respectively. Some studies provided no clear information about the interval between the index test and the reference standard test or about how they were performed, leading to an unclear risk of bias in flow and timing. The remaining studies were considered to have a low risk of bias. A funnel plot was also used to assess publication bias for the 19 studies that met the inclusion criteria. There is low publication bias in this study: according to Liu (25), the points are symmetrically distributed around the true effect in the shape of an inverted funnel when publication bias is very low, as shown in Fig. 3.
This was also supported by the QUADAS-2 assessment in Fig. 2.

The database search retrieved 283 publications. After duplicates were removed and the publications were screened by title and abstract, a total of 255 publications were screened out (Fig. 1). Twenty-eight full-text articles were assessed for eligibility. Nineteen articles met the inclusion criteria (12, 13, 26-41). Three articles applied a machine learning approach, 3 articles combined chest X-ray and chest CT datasets, and 3 studies did not provide the parameters needed to estimate the diagnostic test accuracy (DTA); hence these 9 studies were excluded, as shown in Fig. 1.

... (13, 26, 32, 34, 36, 40), and finally other models that do not fall into the above two categories (7, 22, 23, 24, 25, 30, 32, 33, 36). Some of the studies included in the quantitative synthesis (meta-analysis) reported a higher DTA performance for deep learning algorithms than for radiologist interpretation (31, 37, 41), while others showed that the deep learning algorithm aided DTA performance (12, 30, 33, 39). Some studies reported higher sensitivity than specificity (26, 27, 28, 29, 33, 35, 39, 41), while others reported higher specificity (13, 15, 30, 31, 32, 34, 36, 37, 40) (Table 1).

... (Fig. 7). The accuracy of all included studies ranges from 0.7600 to 0.9879 with a mean of 0.8948 (Table 2), the precision ranges from 0.7059 to 0.9703 with a mean of 0.8966 (Table 2), the F1-score ranges from 0.7500 to 0.9787 with a mean of 0.8966, and the recall ranges from 0.7935 to 0.9804 with a mean of 0.8949 (Table 2).

... indicates that there was a slightly significant difference in the specificity of single-source and multi-source datasets (Fig. 5) ... for 4 studies).
There was no significant difference between the customized training datasets and pre-trained data (p-value = 0.6008), as shown in Fig. 8 ... Fig. 10. There is no statistical difference between the DOR of the pre-trained and customized models. For convenience and ease of analysis during the meta-analysis, we sub-divided the networks into three categories: ResNet, ResNet Hybrid, and other networks ... These results reveal that there is no significant difference in the specificity of the three categories of network architecture (p-value = 0.1011), as shown in Fig. 12 ... of networks used during the training process (Fig. 13).

To understand how different deep learning models, deep learning architectures, and the nature of the datasets influence the performance of the algorithm for COVID-19 detection, we did a sub-group analysis based on the type of data source, the deep learning model, and the type of network architecture. In terms of data type, sensitivities of 0.889 and 0.956 were recorded for multi-source and single-source datasets, respectively. This indicates that single-source datasets had slightly, but not significantly, higher sensitivity than both the overall estimate and multi-source datasets. For specificity, both multi-source and single-source datasets showed higher results than the overall specificity, and single-source datasets were higher than multi-source datasets, although not significantly. In the pooled estimate of DOR, values of 282.7 and 88.8 were recorded for single-source and multi-source datasets, respectively. This implies that the DOR of single-source datasets was higher than the overall DOR, although not significantly, and differed significantly from that of multi-source datasets (p-value = 0.013).
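The text does not state which test produced the sub-group p-values. One common choice, sketched below under that assumption, is a Q-test for between-sub-group differences applied to the pooled sub-group estimates. The function name `subgroup_q_test` and the standard errors in the example are hypothetical; only the two DOR values come from the text, so the printed p-value is not expected to reproduce the reported one.

```python
import math


def subgroup_q_test(estimates, std_errors):
    """Q-test for differences between sub-group pooled estimates.

    Uses inverse-variance weights; Q_between follows a chi-square
    distribution with (number of sub-groups - 1) degrees of freedom
    under the null hypothesis of no difference.
    """
    weights = [1 / se ** 2 for se in std_errors]
    mean = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q_between = sum(w * (e - mean) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # Closed-form chi-square tail probabilities for the two cases needed here.
    if df == 1:
        p_value = math.erfc(math.sqrt(q_between / 2))
    elif df == 2:
        p_value = math.exp(-q_between / 2)
    else:
        raise ValueError("this sketch handles only two or three sub-groups")
    return q_between, p_value


# Pooled DORs reported for single-source and multi-source datasets,
# compared on the log scale.
log_dor = [math.log(282.7), math.log(88.8)]
# Hypothetical standard errors of log DOR; the text does not report them.
se_log_dor = [0.40, 0.30]
q_between, p_value = subgroup_q_test(log_dor, se_log_dor)
print(f"Q_between = {q_between:.2f}, p = {p_value:.3f}")
```

With three sub-groups (for example ResNet, ResNet Hybrid, and other networks) the same function applies with two degrees of freedom.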
For the sub-group analysis based on the training model, the sensitivity of the customized model was slightly, but not significantly, higher than that of the pre-trained model and the overall estimate, while the specificity of the pre-trained model was non-significantly higher than that of the customized model and the overall estimate. This means there is no significant difference in terms of the training model. We also did a sub-group analysis based on the type of network architecture. The algorithm trained on ResNet alone had the lowest sensitivity, with no heterogeneity, compared to the overall estimate, which had higher sensitivity than the rest. This very low heterogeneity indicates the consistency of ResNet for detection. The highest sensitivity was found in network architectures other than ResNet and its hybrid. The sensitivity in this sub-group analysis differs significantly between ResNet, ResNet Hybrid, and other network variants (p-value = 0.007). For specificity, there is a slight, non-significant difference among the three categories of network used. Similarly, there is a visible but non-significant difference in DOR between ResNet, ResNet Hybrid, and other network variants. This result shows that there is a correlation between diagnostic accuracy, which is a function of sensitivity, and network architecture.

There was substantial heterogeneity across the studies because different countries were included in the meta-analysis (Austria, Iran, Korea, Egypt, China, U.S.A, Japan, and Italy). These differences in the data collected could be a source of potential heterogeneity. Another potential source of heterogeneity is the different DL architectures used, ranging from ResNet and its variants and ResNet Hybrid to other architectures such as AlexNet, GoogleNet, UNET++ and other ensembled DL networks.
This can be ascertained from the sub-group analysis of sensitivity when considering only the ResNet architecture for detection, which has a heterogeneity of I² = 0%. In terms of data source, the single-source sub-group analysis shows extremely low heterogeneity in sensitivity and DOR, with I² = 18% and I² = 0%, respectively. This simply means that multi-source datasets serve as a potential source of heterogeneity in DL detection. Apart from heterogeneity in data type and DL architecture, most of the models rely on radiologist performance as the reference standard. It would therefore be very difficult to conclude that DL outperforms the corresponding radiologist interpretation; rather, DL aids and speeds up detection, since a good-quality image is needed to accurately estimate the DTA of any equipment. Also, most studies of DL detection on chest CT documented only sensitivity and specificity, which may lead to overestimation of the benefits of DTA; hence it is recommended that other DTA measures, such as likelihood ratios and DOR, be estimated alongside sensitivity and specificity. The DL algorithm is regarded as a black box because there is no established mathematical formulation to support its performance, making it difficult to replicate, and this might be another obstacle to wide acceptance. Advances in computing hardware and software will lead to better data acquisition and storage with increased quality, enabling further research into how these models behave and allowing for complete automation of the detection of diseases like COVID-19.

In conclusion, a meta-analysis of the DTA of DL detection of COVID-19 was carried out. The results show the high performance of DL models in detecting COVID-19, while establishing that factors such as the source of datasets and DL architectures strongly affect the detection performance of DL algorithms.
Table S1. PRISMA diagnostic test accuracy checklists.

The authors declare no competing financial interests or personal relationships that could influence the work reported in this review paper.

Written informed consent was not required for this study because the study is a literature review.

... represents sub-group analysis of data, when g = 1 (single-source datasets) and g = 0 (multi-source datasets).

Univariate sub-group analysis of specificity with random model based on data source. g represents sub-group analysis of data when g = 1 (single-source datasets) and g = 0 (multi-source datasets).

Univariate sub-group analysis of DOR based on data source. DOR: diagnostic odds ratio; g represents sub-group analysis of data when g = 1 (single-source datasets) and g = 0 (multi-source datasets).

1. High-resolution computed tomography manifestations of COVID-19 infections in patients of different ages
2. Chest CT features and their role in COVID-19. Radiology of Infectious Diseases
3. Coronavirus Disease (COVID-19) Pandemic; World Health Organization
4. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
5. Imaging features of coronavirus disease 2019 (COVID-19): evaluation on thin-section CT
6. CT manifestations of coronavirus disease (COVID-19) pneumonia and influenza virus pneumonia: A comparative study
7. Machine learning principles for radiology investigators
8. CT Quantification and Machine-learning Models for Assessment of Disease Severity and Prognosis
9. Quantification of COVID-19 Opacities on Chest CT-Evaluation of a Fully Automatic AI-approach to Noninvasively Differentiate Critical Versus Noncritical Patients
10. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis
11. Multi-Radiologist User Study for Artificial Intelligence-Guided Grading of COVID-19
12. Artificial intelligence augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other origin at chest CT
13. Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
14. A review on deep learning techniques for the diagnosis of novel coronavirus (covid-19)
15. Deep learning analysis provides accurate COVID-19 diagnosis on chest computed tomography
16. The diagnostic accuracy of Artificial Intelligence-Assisted CT imaging in COVID-19 disease: A systematic review and meta-analysis
17. Diagnostic accuracy of deep learning detection of COVID-19: a systematic review and meta-analysis
18. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement
19. Early diagnosis of COVID-19-affected patients based on X-ray and computed tomography images using deep learning algorithm
20. How to appraise a diagnostic test
21. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies
22. Diagnostic test accuracy: application and practice using R software
23. When are summary ROC curves appropriate for diagnostic meta-analyses? Stat Med
24. Quantifying heterogeneity in a meta-analysis
25. The role of the funnel plot in detecting publication and related biases in meta-analysis
26. MULTI-DEEP: A novel CAD system for coronavirus (COVID-19) diagnosis from CT images using multiple convolution neural networks
27. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography
28. Automated detection of COVID-19 using ensemble of transfer learning with deep convolutional neural network based on CT scans
29. Accurate screening of COVID-19 using attention-based deep 3D multiple instance learning
30. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets
31. Development and evaluation of an artificial intelligence system for COVID-19 diagnosis
32. COVID-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: model development and validation
33. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
34. Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
35. End-to-end automatic differentiation of the coronavirus disease 2019 (COVID-19) from viral pneumonia based on chest CT
36. Prior-attention residual learning for more discriminative COVID-19 screening in CT images
37. A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis
38. A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT
39. Deep learning-based multiview fusion model for screening 2019 novel coronavirus pneumonia: a multicentre study
40. A deep learning system to screen novel coronavirus disease
41. Deep learning for detecting corona virus disease 2019 (COVID-19) on high-resolution computed tomography: a pilot study
42. Can chest CT improve sensitivity of COVID-19 diagnosis in comparison to PCR? A meta-analysis study
43. Prevalence of COVID-19 Diagnostic Output with Chest Computed Tomography: A Systematic Review and Meta-Analysis
44. Chest CT findings in asymptomatic cases with COVID-19: a systematic review and meta-analysis
45. Coronavirus disease 2019 (COVID-19) CT findings: a systematic review and meta-analysis
46. Diagnostic performance of CT and reverse transcriptase polymerase chain reaction for coronavirus disease 2019: a meta-analysis
47. Reverse-transcriptase polymerase chain reaction versus chest computed tomography for detecting early symptoms of COVID-19. A diagnostic accuracy systematic review and meta-analysis
48. Systematic review with meta-analysis of the accuracy of diagnostic tests for COVID-19

The authors acknowledged Professor Sung Ryul Shim

Institutional Review Board approval was not required because it is a review.