key: cord-0691444-2qyyyffm authors: Tsai, Emily B.; Simpson, Scott; Lungren, Matthew P.; Hershman, Michelle; Roshkovan, Leonid; Colak, Errol; Erickson, Bradley J.; Shih, George; Stein, Anouk; Kalpathy-Cramer, Jayashree; Shen, Jody; Hafez, Mona; John, Susan; Rajiah, Prabhakar; Pogatchnik, Brian P.; Mongan, John; Altinmakas, Emre; Ranschaert, Erik R.; Kitamura, Felipe C.; Topff, Laurens; Moy, Linda; Kanne, Jeffrey P.; Wu, Carol C. title: The RSNA International COVID-19 Open Radiology Database (RICORD) date: 2021-01-05 journal: Radiology DOI: 10.1148/radiol.2021203957 sha: 684dce69eb3b28e6b8eb71514e4b259214f483c2 doc_id: 691444 cord_uid: 2qyyyffm The coronavirus disease 2019 (COVID-19) pandemic is a global health care emergency. Although reverse-transcription polymerase chain reaction testing is the reference standard method to identify patients with COVID-19 infection, chest radiography and CT play a vital role in the detection and management of these patients. Prediction models for COVID-19 imaging are rapidly being developed to support medical decision making. However, inadequate availability of a diverse annotated data set has limited the performance and generalizability of existing models. To address this unmet need, the RSNA and Society of Thoracic Radiology collaborated to develop the RSNA International COVID-19 Open Radiology Database (RICORD). This database is the first multi-institutional, multinational, expert-annotated COVID-19 imaging data set. It is made freely available to the machine learning community as a research and educational resource for COVID-19 chest imaging. Pixel-level volumetric segmentation with clinical annotations was performed by thoracic radiology subspecialists for all COVID-19–positive thoracic CT scans. 
The labeling schema was coordinated with other international consensus panels and COVID-19 data annotation efforts, the European Society of Medical Imaging Informatics, the American College of Radiology, and the American Association of Physicists in Medicine. Study-level COVID-19 classification labels for chest radiographs were annotated by three radiologists, with majority vote adjudication by board-certified radiologists. RICORD consists of 240 thoracic CT scans and 1000 chest radiographs contributed from four international sites. It is anticipated that RICORD will lead to prediction models that can demonstrate sustained performance across populations and health care systems. © RSNA, 2021 Online supplemental material is available for this article. See also the editorial by Bai and Thomasian in this issue. open medical imaging data set for COVID-19 thoracic imaging made available initially through The Cancer Imaging Archive (TCIA). Additional data will continue to be collected, annotated, and shared, including additional imaging modalities and accompanying clinical data, as part of an expanded collaborative effort with the American College of Radiology and the American Association of Physicists in Medicine to create a new Medical Imaging and Data Resource Center. These data will continue to be made freely available for general research and education. The primary goal of creating the RICORD data set was to achieve a large and heterogeneous database focused on COVID-19 chest imaging that represents a diversity of variables across patient populations, imaging equipment, and protocols, which introduced many regulatory and logistical challenges. A second goal was to focus on streamlining imaging data de-identification and transmission such that data aggregation was possible.
These goals required custom-built open-source engineering tools to facilitate the process toward successful uniform Digital Imaging and Communications in Medicine (DICOM) study de-identification and submission to a central resource. A third goal of the RICORD data set was to include multimodal imaging, both planar imaging in the form of chest radiography and volumetric data (CT image series). Multimodal images were collected along with relevant clinical and demographic data, presenting unique challenges due to data set complexity. In RICORD, we sought to include expert annotation in the setting of a new disease, with many available and nascent classification schemas requiring both education and several rounds of consensus and coordination with similar data annotation efforts by other large organizations around the world. Moreover, the size and complexity of the data set and use of image series required assembling, coordinating, training, and monitoring a dedicated group of expert annotators to complete the project in a short time frame. Finally, data set hosting and data vending required infrastructure capable of handling large volumes of imaging studies, including experience with public data sets and requisite privacy and security mechanisms. An international task force composed of imaging, biostatistical, and data science experts was assembled to create a staged deployment approach (Fig 1) completed in a relatively short time frame to provide value during the ongoing pandemic. An international call for COVID-19 medical imaging data submission was issued by means of an open survey by the RSNA in April 2020. More than 200 individual responses from 20 countries were received. As part of an aggressive release timeline, four initial sites from four countries were selected for this first release of RICORD based on the availability of curated COVID-19 data and presence of existing data sharing agreements with the RSNA. 
As of this writing, dozens of additional sites are engaged in active data use agreement review with the RSNA. International laws, institutional regulations and permissions, ethics review, and the availability of access to curated The RSNA International COVID-19 Open Radiology Database, or RICORD, is the first multi-institutional, multinational, expert-annotated coronavirus disease 2019 imaging data set made freely available and designed for the machine learning community. infection has nonetheless increased worldwide (13) . In addition, given the prevalence of asymptomatic and presymptomatic infection, as well as early evidence for limited immunity in exposed populations, COVID-19 is likely to be a globally endemic disease in the future. This underscores the importance of identifying outbreaks and public health planning, particularly as patients will have incidentally detected findings at routine imaging that could be attributable to COVID-19 pneumonia and could expose others in the health care system, especially vulnerable populations (1). Since the outbreak of the virus, there have been many publications related to COVID-19 (14) , including dozens of automated imaging analysis models; however, the majority have underperformed in validation experiments (15, 16) . This failure to generalize to new data sets is often multifactorial, owing in part to limited single-site data, lack of expert annotation, and heterogeneous or inconsistent labeling schemas (15, 16) . The shortcomings of these early efforts highlight the importance of shared research and education resources for COVID-19. Large data sharing collaborations are an effective strategy to pool medical data to address shortcomings related to data availability, generalization, and coordinating annotation frameworks (17, 18) . Unfortunately, even in the face of a health care crisis, incentives for collecting, annotating, and sharing medical data, especially imaging data, are not well established. 
Prior experience curating large, annotated, multisite medical imaging data sets has been related primarily to hosting machine learning challenges or as part of clinical trials or individual institution efforts (19) (20) (21) (22) (23) . New initiatives are needed that can better operationalize the collection, preparation, annotation, and access to large medical imaging data sets to address public health crises. The purpose of this work is to describe the RSNA International COVID-19 Open Radiology Database (RICORD) data set as the first multinational, multimodality, expert-annotated Institutional review board (ethics committee) approval was obtained from all four sites for this retrospective study. For the United States site, a waiver of informed consent was obtained, and processes were compliant with the Health Insurance the hospitalization and had a confirmed test available after hospital discharge. For chest CT examinations, the axial series was requested with a section thickness of 2.5-5.0 mm and any protocol or kernel, with preference for soft-tissue kernel. The section thickness was set partly to facilitate the section-by-section segmentation and annotation of data. Prior studies have successfully used varied following: reverse-transcription polymerase chain reaction test, immunoglobulin M antibody test, or clinical diagnosis using hospital-specific criteria. For the purposes of this data set, criteria for positive infection encompassed patients who were diagnosed before hospitalization but remained symptomatic during the hospitalization, patients with a positive test or diagnosis during hospital admission, and patients who were symptomatic during CT image shows bilateral nodular and patchy opacities with peripheral and lower lung predominance involving four lung zones, annotated as typical for COVID-19 with moderate severity. 
(d) Thoracic CT image shows bilateral nodular and patchy opacities with peripheral and lower lung predominance involving more than four lung zones, annotated as typical appearance for COVID-19 and severe lung involvement. (e) Bedside chest radiograph with bilateral patchy and nodular opacities (arrows) with upper lung predominance involving more than four lung zones, annotated as indeterminate appearance for COVID-19 and severe lung involvement. (f) Bedside chest radiograph shows left lower lobe opacities (arrows) with small left pleural effusion involving a single lung zone, annotated as atypical appearance for COVID-19 and mild lung involvement. (g) Bedside chest radiograph shows bilateral patchy and nodular opacities (arrows) with upper lung predominance involving more than four lung zones, annotated as indeterminate appearance for COVID-19 and severe lung involvement. (h) Bedside chest radiograph shows left lower lobe opacity (arrow) with small left pleural effusion involving a single lung zone, annotated as atypical appearance for COVID-19 and mild lung involvement. The RSNA International COVID-19 Open Radiology Database (RICORD) E208 radiology.rsna.org n Radiology: Volume 299: Number 1-April 2021 need to homogenize the de-identification approach as a way to mitigate problematic heterogeneity in de-identification methodologies. Unique pseudonymous identifiers are created for each patient such that subsequent imaging studies for the same patient could be assigned to the same anonymous patient identification number. A customized, free, open-source DICOM de-identification software solution named DICOM Anonymizer was created by experienced DICOM experts at the RSNA based on recognized standards and best practices implementing the RICORD Data De-identification Protocol. 
The anonymizer generates a unique numeric site identification parameter, or SITEID, as a way to distinguish data sets contributed by different sites without exposing the actual identity of the contributing site. The specific elements modified in the anonymizer script and selected relevant sections of the DICOM standard can be found here: https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Covid-19-Deidentification-Protocol.pdf. To facilitate multiple workflows, three options were set up to import DICOM objects into DICOM Anonymizer, including (a) importing from a local storage location, (b) direct picture archiving and communication system query (through a DICOM C-MOVE transfer), and (c) the use of accession numbers to allow a user to enter a list of accession numbers by hand, copy and section thicknesses to train machine learning models (24). For chest radiography, any frontal radiograph and method of acquisition (portable, anteroposterior, posteroanterior) was accepted. The following data elements were obtained for each examination: (a) unique patient identifier, (b) unique imaging examination identifier, (c) age, (d) sex, and (e) COVID-19 testing method (polymerase chain reaction, serological, clinical). These data elements were standardized with the de-identification tool (see following section). In the frequent case that a given patient underwent multiple thoracic imaging examinations, the relative time of acquisition was maintained during de-identification to preserve the timeline in the aggregate data set. COVID-19 testing methods and other supporting clinical information, if available, were provided in a spreadsheet and linked with the patient-unique identification number and imaging study identification number. Contributing sites were responsible for de-identifying imaging examinations and associated clinical information for inclusion in RICORD with expectations for use as a public research and educational database (noncommercial use).
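The two de-identification properties described in the text (a stable pseudonymous identifier per patient, and study dates that are altered while the intervals between a patient's studies are preserved) can be illustrated with a minimal Python sketch. This is not the DICOM Anonymizer's implementation: the keyed-hash approach, the `SECRET_KEY` value, the `ANON-` prefix, the one-year offset range, and all function names are assumptions made purely for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta
from typing import Tuple

# Hypothetical site-held secret; an assumption for this sketch only.
SECRET_KEY = b"replace-with-a-private-site-secret"


def pseudonymous_id(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a real patient identifier to a stable pseudonym (same input, same output)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ANON-" + digest[:12]


def patient_offset_days(patient_id: str, key: bytes = SECRET_KEY, max_days: int = 365) -> int:
    """Derive one fixed date offset (1..max_days) per patient from a keyed hash."""
    message = ("offset:" + patient_id).encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def deidentify_study(patient_id: str, study_date: date) -> Tuple[str, date]:
    """Return (pseudonym, shifted date); every study of a patient gets the same shift."""
    shifted = study_date - timedelta(days=patient_offset_days(patient_id))
    return pseudonymous_id(patient_id), shifted


# Two studies of the same (made-up) patient, 10 days apart.
first = deidentify_study("MRN001", date(2020, 4, 1))
second = deidentify_study("MRN001", date(2020, 4, 11))
interval_days = (second[1] - first[1]).days  # interval is unchanged by the shift
```

Because the offset depends only on the patient identifier, all examinations for one patient move by the same number of days, which keeps the relative timeline intact while hiding the true acquisition dates.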
Although all sites followed their institutional policies and procedures, adhering to locally applicable regulations and best practices, there was a Other lung disease Note.-Image-level annotations were applied as free-form shapes. Examination-level annotations were used in a "choose all that apply" approach such that annotations could include more than one of the labels. COVID-19 = coronavirus disease 2019, IV = intravenous. download. The data are organized as collections, typically images related by a common disease (eg, lung cancer), image modality or type (MRI, CT, digital histopathologic examination, etc), and/or research focus. DICOM is the primary file format used by TCIA for medical imaging. A DICOM file stores the digital image along with a series of tags that contain metadata about the image, such as patient identification number, study identification number, patient weight, and anatomic site. More information about DICOM is available at http://medical.nema.org. Supporting data related to the images, such as patient outcomes, treatment details, genomics, and expert analyses, are also provided when available. This public-facing data hosting and vending workflow enables the wide distribution and use of this imaging resource, and RICORD will be listed as a unique collection within TCIA infrastructure. Any potential user can then access TCIA to search for and download images and associated annotation files. The terms outline that ownership of the submitted data is retained by the submitting institution and, in agreeing to participate, the submitting organization grants the RSNA a nonexclusive, royalty-free, sublicensable, worldwide, perpetual, irrevocable license to the submitted data available for commercial, scientific, and educational purposes, including rights necessary for the RSNA to make the Curated Submission Data available to the public pursuant to the Creative Commons Attribution 4.0 International License.
Each submitting center must attest that they have the authority and rights to grant the RSNA paste a list, or open a text file containing a list and then start the DICOM import process. Each participating site was provided with documentation on how to install and use the DICOM Anonymizer to prepare DICOM objects for submission to RICORD. Because successful use of the DICOM Anonymizer required technical knowledge of the configuration of the picture archiving and communication system and local networks, members of the RSNA COVID-19 task force hosted a webinar consisting of a technical walkthrough demonstration followed by an open question session. A discussion board for users was also made available so that participating sites could more easily obtain troubleshooting help and guidance through asynchronous interactions with RSNA technical staff and volunteers. Over the course of several iterations, a data submission agreement was crafted with the intent of setting the terms by which the RSNA would collect and submit anonymized imaging data from medical centers and research institutions. The agreement describes the collaboration with existing National Institutes of Health resources through TCIA to facilitate the submission, processing, and publication of data. The agreement outlines the process for data set review, confirmation of de-identification, and encrypted transmission. Housed within the National Institutes of Health, TCIA is a service that de-identifies and hosts a large archive of medical images accessible for public utable to COVID-19 pneumonia, including standardized study-level annotation to reduce variability. This multisociety, international, expert annotation consensus will help detail the imaging appearance of COVID-19 pneumonia and aid in further data set enrichment and curation as well as model development across data sets.
Examination-level binary annotations were performed to indicate the presence of a support apparatus (endotracheal tube, central venous catheter, pacemaker, feeding tube) as well as to describe global study image quality (adequate, motion artifact, incomplete). Each examination was classified as having typical, indeterminate, atypical, or negative appearance for COVID-19 pneumonia based on the RSNA consensus statement (25) (Fig 3). Table 1 describes the image-level and examination-level annotation labels. Examples of possible image-level and examination-level annotations are shown in Figure 4 and Figure 5, respectively. Several decisions among a panel of five senior thoracic radiologists (J.P.K., C.C.W., E.B.T., S.S., and G.S.; average experience, 15 years) were made in developing the annotation schema around specific findings and coordinated with leadership in other similar international parallel efforts. Only opacities and nodules greater than 1 cm were annotated, and clinical judgment as to whether a nodule or mass was infectious was made by the annotating specialist radiologist and annotated as "infectious opacity." For infectious micronodules and tree-in-bud, a region of interest was added that a license to publish and otherwise use the Submission Data and Curated Submission Data in accordance with the terms of the agreement. A fully executed signed agreement was required of all participating sites prior to data submission and can be found in its entirety at the link included here: https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Data-Submission-Agreement-Form.pdf. The annotation strategy was focused on balancing the richness of the annotations with the effort required to perform annotations. In aggregate for RICORD, 240 unique chest CT scans and 1000 unique chest radiographs were selected from the initial contributing institutions.
The entire process from data solicitation to collation of the final data set is summarized in the workflow diagram (Fig 2) . Chest CT scan annotation.-For the chest CT segmentation task, a small portion of each data set from contributing institutions was identified for full hand segmentation and annotation by six subspecialist thoracic radiologists (average experience, 6 years). The annotation schema was designed in coordination with Society of Thoracic Radiology and European Society of Medical Imaging Informatics leadership to ensure consistency. Annotation was focused on current understanding and guidance in reporting chest imaging findings potentially attrib- (Fig 5 continues.) encompassed the entire region of nodularity rather than as individual nodules. In studies where atelectasis was associated with infectious opacities, atelectasis was annotated as an "infectious opacity." Alveolar edema was labeled as "other noninfectious opacity." Lymphadenopathy, regardless of location, was defined as lymph nodes larger than 1 cm short axis. Multiple examination-level labels for type of lung disease were allowed if deemed appropriate to provide a differential diagnosis based on clinical judgment. Chest radiograph annotation.-Selection of the annotation schema for chest radiographs was based on a study-level scoring system and suggested reporting language developed by subspecialist thoracic radiologists (J.P.K. and C.C.W.) (26) . Each chest radiograph was classified as typical, indeterminate, atypical, or negative for findings of COVID-19 pneumonia. In all except the negative cases, the number of regions with abnormal opacities were assessed and classified as showing mild, moderate, or severe disease. As there is subjectivity in this labeling schema, triple annotation was performed, and majority consensus was selected as the final label when at least two of three annotators selected the same classifications. 
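The triple-annotation rule for chest radiographs (accept a label when at least two of three annotators agree, otherwise refer the study for adjudication) can be written as a short Python sketch. The function name `majority_label` and the choice of returning `None` to signal that adjudication is needed are illustrative assumptions, not part of the RICORD tooling.

```python
from collections import Counter
from typing import Optional, Sequence

# Study-level appearance categories named in the text.
LABELS = ("typical", "indeterminate", "atypical", "negative")


def majority_label(votes: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the label chosen by at least `min_agree` of three annotators.

    Returns None when no label reaches the threshold, meaning the study
    would need to go to an adjudicator.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    unknown = [v for v in votes if v not in LABELS]
    if unknown:
        raise ValueError("unknown label(s): " + ", ".join(unknown))
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


agreed = majority_label(["typical", "typical", "atypical"])              # two of three agree
needs_review = majority_label(["typical", "indeterminate", "atypical"])  # no majority
```

With three votes, at most one label can reach two votes, so the result is unambiguous whenever it is not `None`.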
In cases without a majority consensus, final adjudication was performed by one of two experienced subspecialist thoracic radiologists (average experience, 15 years). Annotation was performed using a commercial browser-based application (MD.ai). Annotators were solicited from membership of the Society of Thoracic Radiology and the RSNA, and three (J.S., P.R., and M. Hafez, with 5, 7, and 8 years of experience, respectively) were selected based on their performance on previous chest imaging annotation exercises conducted by these organizations. Each of the three annotators was blinded to the others' annotations, but all were able to work on the data at the same time. Adjudicators (J.P.K. and C.C.W.) were able to view the annotations from the three initial readers. characteristics, and other included metadata will be critical when generating cohorts with RICORD, particularly as additional public COVID-19 imaging data sets are made available through complementary and parallel efforts. It is important to emphasize that there are limitations to the clinical ground truth, as severe acute respiratory syndrome coronavirus 2 reverse-transcription polymerase chain reaction tests have widely documented limitations and are subject to both false-negative and false-positive results (1, 4), which impacts the distribution of the included imaging data and may have led to an unknown epidemiologic distortion of patients based on the inclusion criteria. These limitations notwithstanding, RICORD has achieved the stated objectives for data complexity, heterogeneity, and high-quality expert annotations as a comprehensive COVID-19 thoracic imaging data resource. The 240 CT scans and 1000 radiographs represent a subset of images uploaded by the contributing sites. Because of human resource limitations and time constraints, annotation was limited to 120 COVID-19-positive CT scans and 1000 COVID-19-positive chest radiographs. Cases were randomly selected from each site.
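As a rough illustration of random case selection per site, the following Python sketch draws a fixed number of studies from each site's pool. The text does not state how the selection was carried out, so the fixed per-site quota, the seeded generator, the function name `select_cases`, and the made-up study identifiers are all assumptions.

```python
import random
from typing import Dict, List


def select_cases(studies_by_site: Dict[str, List[str]],
                 per_site: int,
                 seed: int = 0) -> Dict[str, List[str]]:
    """Randomly pick up to `per_site` study identifiers from each site's pool."""
    rng = random.Random(seed)  # seeded so the selection can be reproduced
    selected: Dict[str, List[str]] = {}
    for site in sorted(studies_by_site):      # fixed iteration order
        pool = sorted(studies_by_site[site])  # fixed pool order before sampling
        k = min(per_site, len(pool))          # a site may hold fewer studies than the quota
        selected[site] = rng.sample(pool, k)  # sampling without replacement
    return selected


# Hypothetical pools keyed by anonymous site identifier.
pools = {
    "site_1": ["s1_%03d" % i for i in range(50)],
    "site_2": ["s2_%03d" % i for i in range(8)],
}
chosen = select_cases(pools, per_site=10, seed=42)
```

Sorting the sites and pools before sampling makes the result depend only on the seed and the set of identifiers, not on the order in which they were supplied.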
Additional data from these sites and other contributing sites will be included in subsequent releases. Because making COVID-19-related cases available quickly was a high priority, and because several public chest radiograph data sets providing cases of non-COVID-19 pneumonia and other conditions for differential analysis are already available, only COVID-19-positive chest radiography studies have been published initially. There are plans to supplement these with COVID-19-negative chest radiography cases from contributing sites. Users could also train their model on the 2018 RSNA Pneumonia Challenge data set (27). Finally, the Medical Imaging and Data Resource Center project achieved its target goal of providing 10 000 COVID-19-related studies before the end of 2020. This initial RSNA International COVID-19 Open Radiology Database (RICORD) data set is the first of a larger coordinated effort called the Medical Imaging and Data Resource Center, which is an open-source medical imaging database and machine learning effort co-led by the RSNA, the American College of Radiology, the American Association of Physicists in Medicine, and the National Institute of Biomedical Imaging and Bioengineering at the National Institutes of Health. The Medical Imaging and Data Resource Center will create a separate platform to The RICORD data set is composed of 240 thoracic CT scans and 1000 chest radiographs contributed from four international sites (details in Appendix E1 [online]). Care was taken to curate a balanced data set across the contributing sites. Patient demographics collected included sex, age, and COVID-19 testing status and method. Both CT scan and chest radiograph data included multiple studies for some patients, with a higher prevalence of such studies in the chest radiography data set. Date of study was randomized, but intervals between studies for a given patient were maintained. Descriptive statistics and demographics are listed in Tables 2 and 3.
The RSNA International COVID-19 Open Radiology Database is the largest publicly available expert-annotated data set of chest CT scans and chest radiographs formatted in accordance with the Digital Imaging and Communications in Medicine standard. The data sets were acquired from institutions in four countries, representing diverse patient populations and heterogeneous imaging protocols and vendors. The radiologic examinations were performed for various clinical indications related to coronavirus disease 2019 (COVID-19): to obtain diagnostic criteria for COVID-19, to determine disease severity, or to assess treatment response. Hundreds of hours were spent by volunteer radiologists to compile, curate, and annotate this initial working data set, which is made available through The Cancer Imaging Archive, a service sponsored by the National Institutes of Health that hosts a large archive of medical images accessible for public download. Because this is a public data set, RICORD is available for noncommercial use (and further enrichment) by research and education communities, which may include the development of educational resources for COVID-19 and use of RICORD to create artificial intelligence systems for diagnosis and quantification, benchmarking performance for existing solutions, exploration of distributed and/or federated learning, further annotation or data augmentation efforts, and evaluation of the examinations for disease entities beyond COVID-19 pneumonia. Deliberate consideration of the detailed annotation schema, demographic Note.-Data are numbers of studies by contributing site. Collection 1a comprises COVID-19-positive CT scans; collection 1b, COVID-19-negative CT scans; and collection 1c, COVID-19-positive chest radiographs. COVID-19 = coronavirus disease 2019, RT-PCR = reverse-transcription polymerase chain reaction. collect, annotate, store, and share coronavirus disease 2019-related medical images via open-access Gen3 data commons, beginning with RICORD.
Engagement with multiple subspecialty societies to leverage their unique expertise in developing clinical use cases and high-quality annotated data sets is an effective and useful model to follow for future collaborations. This joint effort will allow streamlining and consistency in methods of data collection and provide broader aggregation, better organization, and more convenient access to data for researchers and educators. Transmission, Diagnosis, and Treatment of Coronavirus Disease 2019 (COVID-19): A Review An ounce of public health for COVID-19? Ten recommendations for supporting open pathogen genomic analysis in public health Rapid Scaling Up of Covid-19 Diagnostic Testing in the United States -The NIH RADx Initiative Chest CT Findings in Coronavirus Disease-19 (COVID-19): Relationship to Duration of Infection Time Course of Lung Changes at Chest CT during Recovery from Coronavirus Disease 2019 (COVID-19) CT Imaging Features of 2019 Novel Coronavirus (2019-nCoV) Artificial intelligence-enabled rapid diagnosis of patients with COVID-19 Correlation of Chest CT and RT-PCR Testing for Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy Association between Initial Chest CT or Clinical Features and Clinical Course in Patients with Coronavirus Disease 2019 Pneumonia The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society Preparedness and Best Practice in Radiology Department for COVID-19 and Other Future Pandemics of Severe Acute Respiratory Infection Keep up with the latest coronavirus research Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal Suboptimal Quality and High Risk of Bias in Diagnostic Test Accuracy Studies on Chest Radiography and Computed Tomography 
in the Acute Setting of the COVID-19 Pandemic: A Systematic Review Data sharing in the era of COVID-19 Responding to Covid-19 -A Once-in-a-Century Pandemic? Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge The RSNA Pediatric Bone Age Machine Learning Challenge Challenges Related to Artificial Intelligence Research in Medical Imaging and the Importance of Image Analysis Competitions TCIA: An information resource to enable open science CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy Radiological Society of North America Expert Consensus Statement on Reporting Chest CT Findings Related to COVID-19. Endorsed by the Society of Thoracic Radiology, the American College of Radiology, and RSNA -Secondary Publication Review of Chest Radiograph Findings of COVID-19 Pneumonia and Suggested Reporting Language Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution received or will receive grants from GE and Nuance. Other relationships: disclosed no relevant relationships. E.A. disclosed no relevant relationships. E.R.R. disclosed no relevant relationships. F.C.K. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: received compensation from MD.ai for consulting; is employed by Diagnósticos da América. Other relationships: disclosed no relevant relationships. L.T. disclosed no relevant relationships. L.M. Activities related to the present article: disclosed no relevant relationships.
Activities not related to the present article: received compensation from Lunit and iCAD for board membership; institution received or will receive grant(s) from Siemens; has stock or stock options in Lunit. Other relationships: disclosed no relevant relationships. J.P.K. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: received compensation from Parexel International for consulting. Other relationships: disclosed no relevant relationships. C.C.W. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: institution received or will receive grants. Other relationships: disclosed no relevant relationships.