key: cord-0194696-g2munzuo authors: Sohan, Md Fahimuzzman title: So You Need Datasets for Your COVID-19 Detection Research Using Machine Learning? date: 2020-08-11 journal: nan DOI: nan sha: 76f3664209c6ac6b97684a479ca47c1f7eec2d2f doc_id: 194696 cord_uid: g2munzuo

Millions of people around the world have been infected by coronavirus disease 2019 (COVID-19). Machine Learning (ML) techniques have been used in COVID-19 detection research since the beginning of the epidemic. This article presents detailed information on the datasets frequently used for COVID-19 detection with ML. We investigated 96 papers on COVID-19 detection published between January 2020 and June 2020, extracted the information about the datasets they used, and present it here in one place. This investigation will help future researchers find COVID-19 datasets without difficulty.

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), also known as the 2019 novel coronavirus, causes the disease most commonly referred to as COVID-19; it was first reported in December 2019 in Hubei Province, China [1]. The World Health Organization (WHO) had reported over 0.7 million deaths and around 20 million confirmed cases worldwide due to the epidemic [2]. A vital step in delaying the spread of the epidemic is to identify infected people quickly and isolate them [3]. For COVID-19 identification, ML techniques have been employed to detect the disease through the analysis and classification of various kinds of data [4, 5]. ML-based techniques are automatic and are easy to use and implement in clinical settings [6]. A key component of any ML technique is the dataset used to train and validate the detection model. The literature shows that most COVID-19 detection models were developed using image datasets, including X-ray and Computed Tomography (CT) images. However, identifying and collecting the required dataset is difficult and laborious for researchers new to the field.
To address this issue, we present a large collection of open-access datasets used in previous COVID-19 research studies. We first collected hundreds of articles related to COVID-19 detection, published between January 2020 and June 2020, from several online libraries: Google Scholar, Elsevier, PubMed, and the WHO Database. From these we selected 96 ML-based COVID-19 detection articles for this investigation. The article collection followed the established and widely used Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [7]. We then extracted from each article the information on the datasets used for its detection models and listed the datasets most frequently used across the 96 studies. In the next section we give, for each dataset, detailed information, a link to its openly accessible location, and a link to the related paper where available. This compilation will give future COVID-19 researchers easy and effective access to the datasets.

As Figure 1 shows, the first task of this study was collecting articles from different databases. We used the PRISMA statement to collect all articles relevant to COVID-19 detection research using ML techniques. Article collection with the PRISMA method is described at length in [1, 3, 8]. We used four digital libraries to collect articles: Google Scholar, Elsevier, PubMed, and the WHO Database. At the end of this process we retained 96 research articles on COVID-19 detection using ML techniques. The next step was extracting data from the 96 collected studies. We collected the information on the datasets used from each article individually.

Table 1: Datasets and the number of articles that used each (rows 1 to 3 are missing from this text)

No.  Dataset                     Articles
4    COVID-CT                    13
5    ChestX-ray8                 11
6    Covid-19-database           10
7    COVIDx                      9
8    Mendeley Dataset            9
9    Qatar-Dhaka COVID-19 Data   7
10   NIH Dataset                 5
11   Twitter Data                5
12   Snapshots data              4
13   COVID-CTset                 1
14   IRCCS blood Data            1
15   POCUS Dataset               1
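The extraction step described above amounts to recording, for each article, which datasets it used and then counting how many articles used each dataset. A minimal sketch of such a tally follows. The article identifiers and their dataset lists are hypothetical placeholders, not the records of the 96 reviewed studies, and `count_dataset_usage` is a name introduced here purely for illustration.

```python
from collections import Counter


def count_dataset_usage(article_datasets):
    """Count how many articles used each dataset.

    article_datasets maps an article identifier to the list of dataset
    names extracted from that article. A dataset is counted once per
    article, even if the article lists it more than once.
    """
    usage = Counter()
    for datasets in article_datasets.values():
        usage.update(set(datasets))
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(usage.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical extraction records (illustrative only).
example_records = {
    "article_A": ["COVID-CT", "ChestX-ray8"],
    "article_B": ["COVID-CT"],
    "article_C": ["COVIDx", "ChestX-ray8", "ChestX-ray8"],
}

ranking = count_dataset_usage(example_records)
for rank, (name, n_articles) in enumerate(ranking, start=1):
    print(rank, name, n_articles)
```

Counting each dataset once per article matches the "number of articles" reading of the frequency column; counting every mention instead would inflate datasets that a single article cites repeatedly.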
The list of articles considered and the extracted dataset information are available at https://figshare.com/s/0cee05e6405a9310802e. Finally, the extracted data were summarised, and the results are presented in the next section.

We begin the discussion of the datasets with Table 1, which lists the 15 most commonly used datasets by name together with the number of articles that used each. We further divided the datasets into three categories according to whether they contain COVID-19 positive and negative samples: COVID-19 data, non-COVID-19 data, and COVID-19 and non-COVID-19 data. COVID-19 datasets contain only COVID-19 positive cases, non-COVID-19 datasets contain only negative cases, and COVID-19 and non-COVID-19 datasets contain both. The following subsections and tables describe the datasets in each category, with a short description, the characteristics, and a reference link for each dataset.

Table fragment (dataset characteristics): chest X-ray images; 73 confirmed COVID-19 cases in total (last accessed 22-Jul-2020); link to data.

3.2 Non-COVID-19 Data (Table 3)

1. ChestX-ray8: This dataset consists of a large number of chest X-ray images covering several lung diseases and is known as "ChestX-ray8". The data were collected between 1992 and 2015 from various platforms. The dataset is publicly available for research purposes and is reported to be among the most commonly accessed in medical image investigation. In COVID-19 research, most researchers used only a portion of this large dataset.

2. NIH Dataset: This is a subset of the "ChestX-ray8" dataset; the sample contains 5% of the full version. It was published by the National Institutes of Health (NIH), USA.

3. Mendeley Dataset: This large-scale non-COVID dataset was published in February 2018. The data were collected from a children's medical center in Guangzhou, China. The image repository is freely available for research purposes.

4.
Chest X-Ray Images Pneumonia: This chest X-ray image dataset is known as "Chest X-Ray Images Pneumonia" and is a part of the Mendeley and Cohen JP datasets. Its maintainers prepared the dataset by screening and checking the raw images to ensure quality. The dataset is available online with open access. This is the second version of the dataset, in which more data were added to the previous version. It was developed jointly by the Radiological Society of North America, the US National Institutes of Health, the Society of Thoracic Radiology, and MD.ai. This dataset is also a part of "ChestX-ray8" and is updated continuously.

Table fragment (dataset characteristics): chest X-ray images; 30227 samples in total, 8851 normal and 21376 infected; DICOM file format; link to data.

3.3 COVID-19 and Non-COVID-19 Data (Table 4)

1. Cohen JP Dataset: Known as the "Cohen JP dataset", this widely used dataset is reported to be the very first COVID-19 image dataset. It is open access and contains chest X-ray and CT images collected from hospitals in different countries. All data are released in a GitHub repository and updated continuously by the maintainers.

2. Qatar-Dhaka COVID-19 Data: This dataset was prepared through a collaboration of researchers from Bangladesh, Qatar, Pakistan, and Malaysia; we refer to it as "Qatar-Dhaka COVID-19 Data". It combines data collected from different sources, e.g., the Cohen JP dataset and COVID-19-database. The combined dataset is freely accessible online in the Kaggle repository.

3. COVIDx: Also a hybrid dataset, "COVIDx" combines 5 different publicly available data repositories. It is open access and hosted on GitHub. Three types of data are included, and each type has a different train and test distribution in the dataset.

4. COVID-CT: This dataset is divided into two parts: COVID-19 positive images, known as "COVID-CT", and non-COVID-19 images, known as "non-COVID-19 CT". Data from various sources were combined to prepare the datasets.
The contributors collected 760 COVID-19 research preprints from two different platforms and then extracted the CT images from the PDF documents of the articles. The extracted images were manually pre-processed, and metadata were collected for each image.

5. COVID-CTset: The lung CT scan data were collected from an Iranian medical center. The dataset contains images of both COVID-19 positive and normal patients, together with patient metadata. The data and relevant documents are available online with open access.

6. POCUS Dataset: This dataset was prepared differently: the images were taken from lung point-of-care ultrasound (POCUS) video recordings that are publicly available on the web and in publications. A total of 64 videos from various sources were considered: 39 of COVID-19, 14 of pneumonia, and 11 of healthy patients. On average, 17±6 frames were selected per video at a frame rate of 3 Hz. The dataset is available in a GitHub repository with open access.

7. IRCCS blood data: Known as "IRCCS blood data", this dataset was collected from patients' blood tests. It consists of routine blood exam records of patients admitted to an Italian hospital from the end of February 2020 to mid-March 2020. Several blood-test features were considered, including white blood cell counts and the platelet, CRP, AST, ALT, GGT, ALP, and LDH plasma levels. The rRT-PCR swab test result of every patient is included as the dependent variable. The dataset is freely available in the Zenodo repository.

References

Clinical, laboratory and imaging features of COVID-19: A systematic review and meta-analysis. Travel Medicine and Infectious Disease.
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal.
New machine learning method for image-based diagnosis of COVID-19.
Detection of COVID-19 Infection from Routine Blood Exams with Machine Learning: a Feasibility Study. medRxiv preprint.
COVID-Classifier: An automated machine learning model to assist in the diagnosis of COVID-19 infection in chest x-ray images. medRxiv preprint.
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement.
Coronavirus disease 2019 (COVID-19): a systematic review of imaging findings in 919 patients.

We built a collection of COVID-19 research datasets, considering the datasets most used in COVID-19 detection research articles published between January 2020 and June 2020. We investigated 96 research articles to obtain information on the datasets they used. We described the datasets in three categories: those based on COVID-19 positive samples, those based on negative samples, and those containing both positive and negative samples. For each dataset we gave a short description, its characteristics, and relevant reference links. This article will help future researchers find datasets for COVID-19 detection research easily and save them time.
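The three-way grouping summarised above depends only on whether a dataset contains COVID-19 positive samples, negative samples, or both. A minimal sketch of that rule is given below. The function name `categorise_dataset` and the sample counts in the usage lines are hypothetical, chosen only to exercise each branch; they do not describe any dataset in this article.

```python
def categorise_dataset(n_positive, n_negative):
    """Assign a dataset to one of the three categories used in this article.

    n_positive is the number of COVID-19 positive samples and n_negative
    the number of COVID-19 negative samples in the dataset.
    """
    if n_positive < 0 or n_negative < 0:
        raise ValueError("sample counts must be non-negative")
    if n_positive > 0 and n_negative > 0:
        return "COVID-19 and non-COVID-19"
    if n_positive > 0:
        return "COVID-19"
    if n_negative > 0:
        return "Non-COVID-19"
    raise ValueError("dataset has no samples")


# Hypothetical sample counts (illustrative only).
print(categorise_dataset(50, 0))
print(categorise_dataset(0, 200))
print(categorise_dataset(30, 20))
```

A detection model generally needs both classes, so a researcher starting from a positive-only or negative-only dataset would typically have to pair it with a dataset from the complementary category.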