key: cord-0750392-umufaixr authors: Inui, Shohei; Kurokawa, Ryo; Nakai, Yudai; Watanabe, Yusuke; Kurokawa, Mariko; Sakurai, Keita; Fujikawa, Akira; Sugiura, Hiroaki; Kawahara, Takuya; Yoon, Soon Ho; Uwabe, Yasuhide; Uchida, Yuto; Gonoi, Wataru; Abe, Osamu title: Comparison of Chest CT Grading Systems in Coronavirus Disease 2019 (COVID-19) Pneumonia date: 2020-11-05 journal: Radiol Cardiothorac Imaging DOI: 10.1148/ryct.2020200492 sha: e71ee6e77cd23f0f8b34030434a71a946fb781fa doc_id: 750392 cord_uid: umufaixr PURPOSE: To compare the performance and interobserver agreement of the COVID-19 Reporting and Data System (CO-RADS), the COVID-19 imaging reporting and data system (COVID-RADS), the RSNA expert consensus statement, and the British Society of Thoracic Imaging (BSTI) guidance statement. MATERIALS AND METHODS: In this case-control study, a total of 100 symptomatic patients suspected of having COVID-19 were included: 50 patients with COVID-19 (59 ± 17 years, 38 men) and 50 patients without COVID-19 (65 ± 24 years, 30 men). Eight radiologists independently scored chest CT images of the cohort according to each reporting system. The area under the receiver operating characteristic curve (AUC) and interobserver agreement were calculated and statistically compared across the systems. RESULTS: A total of 800 observations were made for each system. The level of suspicion of COVID-19 correlated with the RT-PCR positive rate except for the "negative for pneumonia" classification in all the systems (Spearman's coefficient: ρ = 1.0, P < .001 for all the systems). Average AUCs were as follows: CO-RADS, 0.84 (95% confidence interval, 0.83–0.85); COVID-RADS, 0.80 (0.78–0.81); the RSNA statement, 0.81 (0.79–0.82); and the BSTI statement, 0.84 (0.81–0.86). Average Cohen's kappa across observers was 0.62 (95% confidence interval, 0.58–0.66), 0.63 (0.58–0.68), 0.63 (0.57–0.69), and 0.61 (0.58–0.64) for CO-RADS, COVID-RADS, the RSNA statement, and the BSTI statement, respectively.
CO-RADS and the BSTI statement outperformed COVID-RADS and the RSNA statement in diagnostic performance (P < .05 for all comparisons). CONCLUSIONS: CO-RADS, COVID-RADS, the RSNA statement, and the BSTI statement provided reasonable performance and interobserver agreement in reporting CT findings of COVID-19. During the ongoing pandemic of coronavirus disease 2019 (COVID-19), prompt diagnosis is crucial for swift, optimal clinical decision making and for judging the precaution level necessary on admission to help prevent nosocomial infection in the hospital. Accumulating evidence has documented that early diagnosis and intervention are associated with a better prognosis [1]. The gold standard diagnostic method for COVID-19 is reverse transcription-polymerase chain reaction (RT-PCR), which directly quantifies viral load from a nasopharyngeal swab, sputum, or endotracheal lavage. However, the sensitivity of this method is unclear, as false-negative results have been reported in patients with insufficient specimens or those in the initial stage of infection [2]. The turnaround time is also long, ranging from a few hours to days. Although chest CT is currently not recommended for routine screening purposes, it provides valuable information as a supplementary diagnostic tool for COVID-19 pneumonia, especially when RT-PCR tests are not sufficiently available, when false-negative results are suspected, or when clinical decisions are required before the RT-PCR results become available. With the accumulation of recent publications clarifying the radiological appearance of COVID-19, various attempts have been made to standardize the reporting of chest CT for suspected COVID-19. The British Society of Thoracic Imaging (BSTI) proposed Guidance for the Reporting Radiologist as a diagnostic framework for COVID-19 on chest CT and radiographs [3].
The recent RSNA expert consensus statement on reporting advocates a standard nomenclature and imaging classification for COVID-19 pneumonia made up of four categories (typical appearance, indeterminate appearance, atypical appearance, and negative for pneumonia) [4]. A working group of the Dutch Radiological Society devised the COVID-19 Reporting and Data System (CO-RADS) to facilitate the advancement and worldwide dissemination of COVID-19-related information and tools [5]. Another group of researchers devised a different structured reporting system, based on a review of 37 published papers on the chest CT findings of COVID-19, entitled the COVID-19 imaging reporting and data system (COVID-RADS), which divides the CT findings into five categories [6]. The published grading systems of chest CT findings in COVID-19 patients may facilitate both the radiological diagnosis and smooth communication with professionals in other fields, and their applicability and validity in clinical practice were recently reported in several studies [7,8,9,10]. However, no studies have yet directly compared their diagnostic performance and interobserver agreement. This prompted us to undertake the present study to validate the performance and interobserver agreement of four sets of CT grading systems: CO-RADS, COVID-RADS, the RSNA expert consensus statement, and the BSTI guidance statement. This study was conducted with the approval of our institutional ethics review board. Written informed consent was waived owing to the retrospective nature of the study. The privacy of all patients was protected in full.
Patient backgrounds were standardized by applying the following inclusion criteria: (1) presentation to the outpatient or emergency department of a single institution from January 30 to June 30, 2020; (2) suspicion of COVID-19 because of symptoms suggestive of pneumonia (i.e., fever (>37.5°C) and at least one of the following: cough, dyspnea, tachypnea, or hypoxemia); (3) having undergone RT-PCR examination; and (4) acquisition of chest CT within 5 days of the initial RT-PCR test. Patients were classified as COVID-19 or non-COVID-19 if they tested positive or negative, respectively, on RT-PCR at least once. Those who tested negative on the initial RT-PCR but had a high clinical suspicion of COVID-19 underwent repeat RT-PCR and were categorized as COVID-19 positive if the repeat test was positive. Those who tested negative on the initial RT-PCR and did not have a high clinical suspicion of COVID-19 were classified as non-COVID-19. Cases included in a previous publication were excluded from the current study on the following grounds: (1) the previous publication was used in the process of developing two of the sets of criteria (the RSNA expert consensus statement and COVID-RADS); (2) the cases in the previous publication came from a mass-infection cohort under special circumstances; and (3) the purpose of this study was to compare the CT grading systems in usual clinical settings, which mostly comprise community-acquired infection with COVID-19. Furthermore, patients with COVID-19 were randomly selected and excluded to adjust the sample size. Fifty patients with COVID-19 and 50 patients without COVID-19 were finally included as case and control subjects, respectively. A flow chart of the patient population is shown in Figure 1. Medical records were reviewed for the clinical and imaging findings of patients.
The following data were extracted from the medical records: demographic data, smoking history, underlying comorbidities, symptoms and signs, and duration from onset to CT. Non-enhanced chest CT was performed using a 6-row multi-detector CT unit (SOMATOM Emotion 6 scanner; Siemens, Tokyo, Japan) on admission with the following parameters: tube voltage, 130 kVp; effective current, 95 mA; collimation, 6 × 2 mm; helical pitch, 1.4; field of view, 38 cm; matrix size, 512 × 512. Gapless 1.0-mm sections were reconstructed before being reviewed on a picture archiving and communication system monitor. The four sets of CT grading systems are compared in Table 1. Eight general radiologists (W.G., Y.W., Y.N., R.K., M.K., K.S., H.S., A.F.) from 5 hospitals in Japan served as observers. Four observers had more than 10 years of experience and four had less than 10 years. Because none had experience with CO-RADS, COVID-RADS, the RSNA expert consensus statement, or the BSTI guidance statement prior to this study, we held a practice review session and consensus meeting to review the statements before the initiation of the experiments. In the practice session, each reader independently scored 40 sample cases of COVID-19 that were not included in this study and recorded the reasons for grading and any uncertainty. In the following consensus meeting, we decided on a single consensus grading for each case by majority voting, followed by a review of cases of interobserver disagreement, and listed unclear or unspecified components in each item of the grading systems (Table 2). Based on these results, we added supplementary remarks on the interpretation of the criteria and sample CT patterns to facilitate CT grading without confusion or misclassification, as follows: (1) added conjunctions (and/or) to each item of the criteria; (2) confirmed the interpretation of descriptions regarding laterality (i.e.,
each category was interpreted as "either unilateral or bilateral" unless otherwise specified); (3) confirmed the categorization of frequently encountered differential diagnoses of COVID-19 for each criterion (e.g., interstitial pulmonary edema falls into COVID-RADS 1 with only interstitial septal thickening and/or pleural effusion; COVID-RADS 2A when accompanied by peribronchial edema; and COVID-RADS 2B when progressed to pulmonary alveolar edema accompanied by GGO and pleural effusion); and (4) created sample CT patterns of CO-RADS with regard to its categorization of GGO. Randomization was stratified by disease status (patients with COVID-19 vs. patients without COVID-19). All patient information was removed from the data, and observers were blinded to all clinical information, including RT-PCR results. Each observer independently scored the four criteria using the original document of each criterion with the above-described additional notes and recorded all data in a spreadsheet prepared in advance (Microsoft, Redmond, WA, USA). Grading was conducted in four separate sessions, one per criterion, in the order CO-RADS, COVID-RADS, RSNA, and BSTI; in each session, the observers provided a single grading per case (i.e., in the CO-RADS session, the observers provided only CO-RADS gradings). The order of the cases was shuffled between sessions. The AUC was calculated for each set of criteria for each observer. Using the RT-PCR results as the gold standard of COVID-19 diagnosis, the AUC was used to assess the performance of each of the four sets of criteria. Mean AUC across observers and the 95% confidence interval (CI) were calculated. Last, for each criterion, the average percentage of patients assigned to each category, including the 95% CI, was determined.
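Outside the study's own R-based analysis, the per-observer AUC of an ordinal grading scored against a binary RT-PCR reference can be illustrated with a minimal sketch using the Mann-Whitney formulation (equivalent to the trapezoidal area under the empirical ROC curve); the gradings below are hypothetical, not study data:

```python
from itertools import product

def auc_from_ordinal(grades_pos, grades_neg):
    """AUC of an ordinal grading (e.g., CO-RADS 1-5) against a binary
    reference standard: the probability that a randomly chosen positive
    case receives a higher grade than a randomly chosen negative case,
    counting ties as 0.5 (Mann-Whitney U formulation)."""
    wins = 0.0
    for p, n in product(grades_pos, grades_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(grades_pos) * len(grades_neg))

# Hypothetical gradings: RT-PCR-positive cases tend to receive higher scores.
pos = [5, 5, 4, 4, 3, 2]   # grades assigned to COVID-19 cases
neg = [1, 2, 2, 3, 1, 4]   # grades assigned to non-COVID-19 cases
print(round(auc_from_ordinal(pos, neg), 3))  # → 0.847
```

Because the gradings are ordinal with few levels, no explicit ROC-curve construction is needed; the pairwise formulation gives the same value as sweeping a cut-off over all categories.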
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated by setting different cut-off points for each criterion. Interobserver agreement was quantified using three types of kappa values (i.e., Fleiss' kappa, Cohen's kappa, and Light's kappa [10]) calculated across observers. As in the original CO-RADS article, Cohen's kappa values were obtained by comparing the scores of each observer with the median of the remaining seven observers [5]. Overall agreement was quantified using Fleiss' kappa and Light's kappa. The degree of interobserver agreement was interpreted according to conventional kappa thresholds. Statistical analyses were performed using R software (R Foundation for Statistical Computing, Vienna, Austria). Quantitative variables were expressed as mean ± standard deviation (range) or median and interquartile range based on the normality of the data. Categorical variables were presented as percentages of the total. Quantitative variables were compared using the unpaired t-test or Mann-Whitney U-test and categorical data using the Pearson χ² test. AUC and kappa values were compared using one-way repeated-measures analysis of variance (with normality assessed by the Shapiro-Wilk test) and post hoc paired t-tests with family-wise error correction for multiple comparisons. All P values correspond to two-sided tests, and the statistical significance level was set at a Holm-Bonferroni-corrected P < .05. Demographics and clinical characteristics of the study population are summarized in Table 3. The study population comprised 100 patients: 50 patients with COVID-19 (38 men; mean age, 59 years ± 17; range, 18–86) and 50 patients without COVID-19 (30 men; mean age, 65 years ± 24; range, 17–100). There was no statistically significant difference between these groups in age, sex, disease duration, smoking history, or presence of comorbidities. Eight observers scored 100 patients, making a total of 800 observations for each criterion.
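As an illustrative sketch of the overall-agreement metric described in the Methods (the study itself used R), Fleiss' kappa for a fixed panel of raters grading every case can be computed as follows; the demo gradings are hypothetical:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for n subjects, each graded by the same number of
    raters into categorical classes. `ratings` is a list of per-subject
    lists of category labels (here, one list of 8 gradings per case)."""
    n_raters = len(ratings[0])
    # per-subject observed agreement P_i and pooled category counts
    P, counts = [], Counter()
    for row in ratings:
        c = Counter(row)
        counts.update(c)
        P.append((sum(v * v for v in c.values()) - n_raters)
                 / (n_raters * (n_raters - 1)))
    total = len(ratings) * n_raters
    # chance agreement from the marginal category proportions
    Pe = sum((v / total) ** 2 for v in counts.values())
    Pbar = sum(P) / len(P)
    return (Pbar - Pe) / (1 - Pe)

# Toy example: 4 cases, 8 observers each (hypothetical gradings)
demo = [
    [5, 5, 5, 4, 5, 5, 5, 5],
    [1, 1, 2, 1, 1, 1, 1, 1],
    [3, 4, 3, 3, 2, 3, 3, 3],
    [2, 2, 2, 2, 2, 2, 2, 2],
]
print(round(fleiss_kappa(demo), 2))
```

Cohen's kappa, as used in the study, would instead be computed pairwise between one observer's gradings and the median grading of the remaining seven observers.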
The probability of COVID-19 diagnosis for each category of each set of criteria is summarized in Table 4. The level of suspicion of COVID-19 correlated with the RT-PCR positive rate except for the "negative for pneumonia" classification in all the systems (Spearman's coefficient: ρ = 1.0, P < .001 for all the systems). The diagnostic performance and interobserver agreement of each set of criteria are summarized in Table 5. Average AUCs with 95% CI for each criterion were as follows: CO-RADS, 0.84 (0.83–0.85); COVID-RADS, 0.80 (0.78–0.81); the RSNA grading system, 0.81 (0.79–0.82); and the BSTI grading system, 0.84 (0.81–0.86). The AUC values of CO-RADS and the BSTI grading system were significantly higher than those of COVID-RADS (P = .0087 and P = .0033, respectively) and the RSNA grading system (P = .0097 and P = .0019, respectively). AUC values did not differ significantly between CO-RADS and the BSTI grading system or between COVID-RADS and the RSNA grading system. The sensitivity, specificity, PPV, and NPV of each set of criteria at each cut-off are summarized in Table 6. The interobserver variabilities of COVID-19 diagnosis with 95% CI for each grading system are summarized in Table 5. We conducted a comparison study of the published CT grading systems CO-RADS and COVID-RADS and the grading systems based on the RSNA expert consensus statement and the BSTI guidance statement. Although all four sets of criteria were effective in diagnosing COVID-19, with a mean AUC of about 0.80, a salient finding was that CO-RADS and the BSTI guidance statement were significantly better at distinguishing COVID-19 from non-COVID-19 etiologies than COVID-RADS and the RSNA consensus statement. All the sets of criteria were effective in terms of interobserver agreement, with an average Cohen's kappa greater than 0.60. For CO-RADS, interobserver agreement was higher in the current study than in the original one (Fleiss' kappa of 0.56 vs. 0.47) [5].
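The correlation reported above between suspicion level and per-category RT-PCR positive rate can be reproduced in principle with a small Spearman rank-correlation sketch (valid for the no-tie case); the per-category positive rates below are hypothetical, not the study's values:

```python
def spearman_rho(x, y):
    """Spearman rank correlation for the no-tie case, via the classical
    formula rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-category RT-PCR positive rates for an ordinal system
categories = [2, 3, 4, 5]                 # suspicion levels above "negative"
positive_rate = [0.08, 0.25, 0.60, 0.92]  # monotonically increasing
print(spearman_rho(categories, positive_rate))  # monotone ranks → 1.0
```

A perfectly monotone relationship between category and positive rate yields ρ = 1.0, which is the pattern the study observed once the "negative for pneumonia" category was excluded.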
For the RSNA expert consensus statement, the interobserver agreement was also higher in the current study than in a previous one (Cohen's kappa of 0.63 vs. 0.5) [10]. We attribute this to our having held a practice session using sample cases and a rigorous consensus meeting to minimize ambiguity and dependence on subjective clinical judgments and to facilitate consistent image interpretation. Based on this experience, we also added some remarks on the interpretation of these criteria to enhance clarity, as summarized in Table 2, with illustrative cases shown in Figures 2–4. As detailed in Table 2, the sources of grading disagreement common to all the grading systems, and hence candidates for possible revision, were (1) the presence of more than two predominant patterns, (2) the presence of only small lesions, and (3) the presence of co-existing lung disease (e.g., interstitial pneumonia or emphysema). Although some ambiguity in the terminology used may persist (e.g., "(half-)rounded shape," "small," "homogeneous," or "extensive") and some additional modification may be required, further efforts to define the characteristics in ever greater detail may not be worthwhile considering the wide spectrum of radiological presentations and progression of COVID-19. The diagnostic performance of CO-RADS was slightly lower than that noted in the original article (AUC of 0.84 vs. 0.91, this study vs. the CO-RADS original study [5]). This discrepancy is attributed in part to differences between the patient cohorts studied in the original and present works. Reflecting this, as illustrated in Table 4, in this study cohort the CO-RADS 1 category showed a slightly higher RT-PCR positive rate than CO-RADS 2, whereas the original paper obtained the opposite result (with this trend likewise observed in the other sets of criteria).
This may have resulted from differences in the inclusion criteria between the two studies: only patients requiring hospitalization were included in the original article, whereas we adopted broader inclusion criteria, namely all patients irrespective of their observation status. The difference is easy to understand, since patients with COVID-19 have been reported to show different percentages of chest CT positivity in rough parallel with the severity of their symptoms: normal CT findings are observed in 46% of asymptomatic or mildly symptomatic patients and 21% of symptomatic patients [12]. The results of the current study confirmed that the sensitivity of chest CT is still not sufficient for use as a rule-out tool. In contrast, the interobserver agreement was higher in the current study (Fleiss' kappa of 0.56 vs. 0.47, the current study vs. the CO-RADS original study [5]). COVID-RADS is a relatively simple grading system based on a combination of findings with different levels of suspicion that are stratified according to their frequencies in COVID-19. This set of criteria therefore differs from the others in that it does not stratify lesions according to their axial zonal distribution (i.e., central, peripheral, or mixed distribution). However, axial zonal distribution is one of the most conspicuous and specific findings of COVID-19, as documented in previous publications [13–17]. Another limitation of this set of criteria may be the designation of multifocal GGO as a "typical finding." Several other diseases, including viral pneumonia of non-COVID-19 etiology, bronchial pneumonia, nonspecific interstitial pneumonia, acute interstitial pneumonitis (IP), Pneumocystis jirovecii pneumonia, and drug-induced pneumonitis, may show similar multifocal GGOs. In addition, discussion is needed as to whether a variety of COVID-19 presentations and other non-COVID-19 etiologies are best placed in grade 2B.
One example is that lesions with typical findings are downgraded to grade 2B when concomitant minor findings exist (e.g., a small amount of pleural effusion, emphysema, or small nodules such as intrapulmonary lymph nodes). However, in our experience, such co-existence is common and has been pointed out elsewhere as well [18–20]. Compared with the other two sets of criteria based on a systematic grading system, the RSNA expert consensus statement and the BSTI guidance statement rely on constellations of findings, taking lesion type, number, and distribution into consideration together. Based on gestalt interpretation of the CT findings, these systems are easy to understand and put into practice and facilitate communication with physicians in other fields. As detailed in Table 2, one potential limitation of the RSNA consensus statement that may affect its interobserver agreement and diagnostic performance is that it does not address the presence of co-existing lung diseases. As previously reported, COVID-19 pneumonia often mimics acute aggravation of IP or emphysema when superimposed on a background of IP or emphysema, thereby often imposing an additional diagnostic burden (Figure 5). In contrast, the BSTI guidance statement downgrades the characteristic CT patterns ("CLASSIC COVID" or "PROBABLE COVID" classifications) to the "indeterminate" classification when they are accompanied by other cardiopulmonary diseases (e.g., IP) [3]. This may explain the results of this study, in which the RSNA expert consensus statement showed a higher sensitivity than the BSTI guidance statement, whereas the BSTI guidance statement showed a higher specificity than the RSNA consensus statement in diagnosing COVID-19. In comparison with a previous publication on the RSNA grading system, the current study obtained comparable results, with a sensitivity of 73.5% (vs. 71.6%), specificity of 82.8% (vs. 91.6%), and PPV of 81.0% (vs. 87.8%) [10].
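The sensitivity/specificity trade-off discussed above can be sketched by dichotomizing an ordinal grading at a chosen cut-off against the RT-PCR reference; the gradings and reference results below are hypothetical, not study data:

```python
def diagnostic_indices(grades, truth, cutoff):
    """Dichotomize an ordinal grading at `cutoff` (grade >= cutoff counts as
    test-positive) against a binary reference (1 = RT-PCR positive) and
    return (sensitivity, specificity, PPV, NPV). Assumes each cell of the
    2x2 table is non-empty, so no denominator is zero."""
    tp = sum(1 for g, t in zip(grades, truth) if g >= cutoff and t)
    fp = sum(1 for g, t in zip(grades, truth) if g >= cutoff and not t)
    fn = sum(1 for g, t in zip(grades, truth) if g < cutoff and t)
    tn = sum(1 for g, t in zip(grades, truth) if g < cutoff and not t)
    return (tp / (tp + fn), tn / (tn + fp), tp / (tp + fp), tn / (tn + fn))

# Hypothetical gradings (1-5) and RT-PCR results for 8 cases, cut-off at 3
grades = [5, 4, 2, 1, 3, 5, 1, 2]
truth  = [1, 1, 0, 0, 1, 1, 0, 1]
print(diagnostic_indices(grades, truth, 3))
```

Raising the cut-off (e.g., counting only the top category as positive) trades sensitivity for specificity, which is why the study reports these indices at several cut-offs per grading system in Table 6.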
Previous studies also reported a trend similar to that observed in the current study in terms of RT-PCR positive rates for each category of the RSNA grading system, with the "negative for pneumonia" classification showing a higher RT-PCR positive rate than the "atypical" classification [9,10]. This study has several limitations. First, because of its retrospective nature, a selection bias may have been introduced. Second, the interpretation of the criteria may have been affected by our addition of supplementary remarks in an attempt to reduce ambiguity; one concern is that these may not have been consistent with the intent of the researchers who originally proposed the criteria. Third, the impact of this study may also have been limited by its having been conducted at a single institution. However, we included eight readers with varying degrees of experience who practice at five different institutions so as to represent a broad population of readers. Fourth, only symptomatic patients were included, potentially biasing the sensitivity and specificity calculated in this study. Fifth, some patients with a single negative RT-PCR result were deemed negative for COVID-19. Considering the variability in the sensitivity of RT-PCR, which ranges from 67% to 98%, false-negative cases may not have been eliminated by a single negative RT-PCR result [21]. Sixth, we did not keep the interval between sessions constant in the interpretation experiment. Ideally, the interval should have been kept constant between sessions to allow the observers to forget the details of individual cases and thereby avoid bias, but in this study we decided not to do so to facilitate the swiftest possible publication of these findings given the pressing COVID-19 pandemic. In conclusion, CO-RADS, COVID-RADS, the RSNA expert consensus statement, and the BSTI guidance statement provided reasonable performance and interobserver agreement in reporting the CT findings of COVID-19.
Further studies are needed to define the clinical implications of these systems in the diagnosis of COVID-19 in a more diverse population.

Table 2. Summary of reasons for interobserver disagreements of the CT grading systems of COVID-19
- Terms such as "(half-)rounded shape," "small," "homogeneous," or "extensive" may be subjective
- Categorization of GGO might be complex (addition of sample patterns as summarized below would be helpful)

References
- Effect of non-pharmaceutical interventions to contain COVID-19 in China
- Use of Chest CT in Combination with Negative RT-PCR Assay for the 2019 Novel Coronavirus but High Clinical Suspicion
- Thoracic Imaging in COVID-19 Infection, Guidance for the Reporting Radiologist (British Society of Thoracic Imaging)
- Radiological Society of North America Expert Consensus Statement on Reporting Chest CT Findings Related to COVID-19. Endorsed by the Society of Thoracic Radiology, the American College of Radiology, and RSNA
- CO-RADS: A categorical CT assessment scheme for patients with suspected COVID-19: definition and evaluation
- Coronavirus disease 2019 (COVID-19) imaging reporting and data system (COVID-RADS) and common lexicon: a proposal based on the imaging data of 37 studies
- Challenges in the Interpretation and Application of Typical Imaging Features of COVID-19
- Structured reporting of chest CT in COVID-19 pneumonia: A consensus proposal
- Radiological Society of North America Chest CT Classification System for Reporting COVID-19 Pneumonia: Interobserver Variability and Correlation with RT-PCR
- Diagnostic Accuracy of North America Expert Consensus Statement on Reporting CT Findings in Patients with Infection: An Italian Single Center Experience
- Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial
- Chest CT Findings in Cases from the Cruise Ship "Diamond Princess"
- Relation Between Chest CT Findings and Clinical Conditions of Coronavirus Disease (COVID-19) Pneumonia: A Multicenter Study
- Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
- Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients
- Emerging 2019 Novel Coronavirus (2019-nCoV) Pneumonia
- Initial CT findings and temporal changes in patients with the novel coronavirus pneumonia (2019-nCoV): a study of 63 patients in Wuhan, China
- Chest CT Findings in Patients With Coronavirus Disease 2019 and Its Relationship With Clinical Features
- COVID-19 pneumonia manifestations at the admission on chest ultrasound, radiographs, and CT: single-center study and comprehensive

References for figure legends
- Relation Between Chest CT Findings and Clinical Conditions of Coronavirus Disease (COVID-19) Pneumonia: A Multicenter Study

The authors would like to acknowledge Ryohei Terashima for supporting statistical analysis.