key: cord-0961100-1mugha0l
authors: Safdari, Reza; Rezayi, Sorayya; Saeedi, Soheila; Tanhapour, Mozhgan; Gholamzadeh, Marsa
title: Using data mining techniques to fight and control epidemics: A scoping review
date: 2021-05-07
journal: Health Technol (Berl)
DOI: 10.1007/s12553-021-00553-7
sha: 3591b1d28fdd1bec9a974fda8bd6d64b20f43ed1
doc_id: 961100
cord_uid: 1mugha0l

The main objective of this survey is to study the published articles to determine the most popular data mining methods and the gaps in knowledge. Since the threat of pandemics has raised concerns for public health, researchers have applied data mining techniques to reveal hidden knowledge. The Web of Science, Scopus, and PubMed databases were selected for systematic searches. All retrieved articles were then screened in a stepwise process according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses checklist to select appropriate articles. All results were analyzed and summarized based on several classifications. Out of 335 retrieved citations, 50 articles were determined to be eligible through a scoping review. The review results showed that the most popular DM technique was natural language processing (22%) and the most commonly proposed approach was revealing disease characteristics (22%). Regarding diseases, the most addressed disease was COVID-19. The studies show a predominance of supervised learning techniques (90%). Concerning healthcare scopes, we found infectious disease (36%) to be the most frequent, followed by the epidemiology discipline. The most common software used in the studies was SPSS (22%) and R (20%). The results revealed that some valuable research has been conducted by employing the capabilities of knowledge discovery methods to understand the unknown dimensions of diseases in pandemics. However, more research is needed in terms of treatment and disease control.

Throughout history, the threat of pandemics has raised concerns for the healthcare community. The possibility that a major infectious disease could spread around the world before anyone is aware of it is a much-debated issue. The prevalence of Severe Acute Respiratory Syndrome (SARS) and various types of influenza in the past has indicated the extent to which a pandemic disease can affect the health systems of countries [1, 2]. Coronavirus disease is the latest in a series of pandemic diseases that have strongly affected the world. COVID-19, or novel coronavirus (2019-nCoV) disease, is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); the outbreak began on December 8, 2019, in Wuhan, China [3, 4]. Since the novel coronavirus (nCoV) is a new strain of the coronavirus family that has not been seen before, the world faces serious challenges in controlling this outbreak [5, 6]. During fierce outbreaks, not only have clinical specialists been trying to develop novel treatments and vaccines, but scientists in the field of data science and technology have also been trying to understand the infection and help control it by applying information-based methods [7, 8].

Nowadays, an extensive amount of health data is collected through patient care from numerous sources due to the digital health revolution [9, 10]. Hence, the modern world of medicine is rich in information but poor in knowledge [11, 12]. Therefore, applying these data to combat this new pandemic and possible future pandemics has become one of the notable concerns of scientists.

In recent decades, some valuable studies have been published regarding pandemics and data mining (DM) techniques [13]. Such studies aimed to better understand, control, and manage pandemics using various data mining methods. Given the importance of fighting the COVID-19 pandemic, a survey of the most popular and efficient data mining methods could have a significant impact on selecting the most effective techniques in pandemic studies. Thus, it can help to reveal the unknown characteristics of the new pandemic and of possible future pandemics. Accordingly, the core objective of this review is to collect, summarize, and analyze the existing articles to help track and analyze the studies that have been published on pandemics and data mining methods. The specific research questions (RQ) of this review concern: (RQ1) determining how many studies were published over the past years and recent months regarding previous pandemics and the COVID-19 outbreak, (RQ2) presenting an overview of the published studies and their characteristics, (RQ3) investigating the published studies with regard to data mining techniques, (RQ4) identifying the sources of data, (RQ5) determining the most popular DM techniques in terms of their frequency and clinical domains, and (RQ6) identifying the main approaches of the published studies.

The present study was completed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist to ensure the inclusion of relevant studies [14]. Next, the eligible articles were synthesized and classified according to their main characteristics.

A systematic search of the Web of Science, Scopus, and PubMed databases from 2010 up to 16 Oct 2020 was completed using "data mining", "prediction model", "data mining techniques", "data mining methods", "pandemics", "pandemic", "COVID-19", "SARS-CoV-2", and "coronavirus disease" as keywords. Boolean search strategies were designed based on these keywords in each database.
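The exact search strings are not reported here, so the following is only a minimal sketch of how a Boolean strategy could be assembled from the listed keywords. The grouping into a "method" block and a "disease" block joined by AND, and the helper name `build_boolean_query`, are assumptions made for illustration; each database also has its own field tags and syntax, which are not shown.

```python
def build_boolean_query(method_terms, disease_terms):
    """Join each keyword group with OR, then combine the two groups with AND.

    The two-group structure is an assumption, not the reported strategy.
    """
    def group(terms):
        # Quote every term so multi-word phrases are searched as phrases.
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

    return group(method_terms) + " AND " + group(disease_terms)


# Keywords taken from the search description above.
METHOD_TERMS = [
    "data mining",
    "prediction model",
    "data mining techniques",
    "data mining methods",
]
DISEASE_TERMS = [
    "pandemics",
    "pandemic",
    "COVID-19",
    "SARS-CoV-2",
    "coronavirus disease",
]

query = build_boolean_query(METHOD_TERMS, DISEASE_TERMS)
print(query)
```

In practice such a generic string would still need to be adapted to each database, for example by restricting it to title, abstract, and keyword fields and adding the 2010 to 2020 date limit.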

Articles were included if they met the following criteria:

1) The study focused on pandemic diseases such as COVID-19.

2) Only articles about using data mining techniques or knowledge discovery methods were included. Due to the variety of methods in this field, the eligible methods were selected based on the study conducted by Patel and Patel [15].

3) Studies were limited to those published in the English language.

Articles were excluded if they met the following criteria: 1) The title, abstract, or full text of the article did not relate to any pandemic or to COVID-19; 2) Book chapters, letters to editors, short briefs, reports, commentaries, technical reports, reviews, and meta-analyses; 3) Non-English papers; 4) Studies based on image processing methods; 5) The full text was not available. To reduce the bias of unavailable full texts, the full texts of non-open-access articles were obtained by contacting the authors. Therefore, the full texts of all articles were retrieved by the researchers.

The search of the scientific databases (Web of Science, Scopus, and PubMed) retrieved 311 articles through the web interfaces of these databases. Inclusion and exclusion criteria were defined for screening the papers. In the first phase, all titles and abstracts of the retrieved articles were examined to select eligible studies. All titles and abstracts were screened by three reviewers (MT, SS, and SR) to find relevant articles. Another reviewer (MG) reviewed a random sample of the studies. The quality of the individual papers was assessed with the Joanna Briggs Institute (JBI) checklist, which provides robust checklists for the appraisal and assessment of most types of studies [16, 17]. Since all types of studies were included in our review, we applied this checklist. Decisions on study eligibility and quality were made by two reviewers; any disagreements were resolved by discussion. The flow of article screening based on the PRISMA method [17] is illustrated in Fig. 1.

Phase three involved full-text screening. In this phase, the full texts of relevant studies were screened thoroughly by four reviewers (MT, SS, SR, and MG). If the authors disagreed on the selection of eligible studies during the full-text review, the final decision was made by RS.

Finally, 50 studies remained as eligible articles. Several classifications were defined to classify and analyze the included studies. The extraction forms were designed by the researchers to manage the reviewed articles. The classification comprises general information and specific information. General information includes author names, publication date, and publisher. Specific information includes the main objective, DM techniques, application of the DM method, health discipline, main outcomes, evaluation results, data sources, sample sizes, applied software, and country. The included articles were analyzed to extract their characteristics based on the predefined classification. All of the extracted information was re-examined by all authors to reach an agreement. A further reviewer (RS) evaluated and validated the results. EndNote X9 was used for resource management, and all qualitative analysis was performed in SPSS v20.

The initial searches in the scientific databases yielded 311 citations. First, 13 articles were excluded in the duplicate removal phase. Next, 82 articles were omitted as irrelevant in the full-text screening stage. All remaining articles met the JBI checklist and could be included in our review. Finally, after the last screening phase, 50 articles were identified as eligible studies.

The eligible papers that met our inclusion criteria comprised 47 journal papers and three conference papers. The distribution of studies by year is described in Table 1. As is apparent, the majority of studies were published in 2020. Thus, the frequency of publication of these articles by month in 2020 was also examined. The monthly trend of articles published in 2020 is shown in Fig. 2. Among the journals, the "International Journal of Environmental Research and Public Health" ranks first with six articles.

A summary of the included articles based on the predefined categories is given in Table 4 in the Appendix. To visualize the words that appeared most frequently in the reviewed articles, all articles are summarized in the word cloud in Fig. 3.
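How the word cloud was generated is not described, so the following is only a minimal sketch of the word-frequency counting that typically underlies such a figure. The tokenization rule, the small stop-word list, and the function name `word_frequencies` are assumptions; the three example titles are taken from this review's reference list.

```python
import re
from collections import Counter

# A deliberately small, illustrative stop-word list (an assumption).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lower-cased tokens across texts, keeping hyphenated terms whole."""
    counts = Counter()
    for text in texts:
        # Tokens are runs of letters/digits, optionally joined by hyphens,
        # so that "COVID-19" and "symptom-based" stay single tokens.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
        counts.update(tok for tok in tokens if tok not in stopwords)
    return counts


titles = [
    "Smell and taste symptom-based predictive model for COVID-19 diagnosis",
    "An interpretable mortality prediction model for COVID-19 patients",
    "Prediction model and risk scores of ICU admission and mortality in COVID-19",
]

freqs = word_frequencies(titles)
print(freqs.most_common(5))
```

A word cloud then simply scales each word by its count; the rendering step is omitted here because it depends on the plotting tool used.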

Out of the 50 studies, only 35 reported their sample size. Because the applied methods and the types of samples varied widely, the range of sample sizes was very wide, from 53 cases to 1,413,297 posts. In total, 35 different data sources were cited in the eligible articles. Social media platforms (n = 10), hospital information sources (n = 7), and World Health Organization data sources (n = 4) were the three most common sources of information.

In terms of country, the articles were published in 14 different countries. Global data on the pandemic were also used. The distribution of articles by country is shown on the world map in Fig. 4. As it turns out, China has the highest frequency among the countries.

All of the articles in this study took a specific approach to fighting pandemic diseases and providing a better understanding of them by applying DM techniques. Based on the survey, we classified all of the articles by their main approaches into 11 categories, which are shown in Table 2. One of the main objectives of the eligible articles is infoveillance. The term infoveillance refers to a type of syndromic surveillance that uses information and online tools in public health domains. Regarding infoveillance, regression was applied to provide new insight into the origins of the outbreak based on the analysis of social media information [18].

As can be seen from Table 2, the largest share of studies (22%) was devoted to disease characteristics. In terms of diseases, the most common use of data mining techniques to fight pandemics was related to the new pandemic, COVID-19 (n = 44). Other diseases such as H1N1 influenza (n = 2), other types of influenza pandemics (n = 2), and SARS (n = 2) were also considered.

Since the main objective of this study was to determine to what extent data mining techniques are employed to fight pandemics, the frequency of applied methods was investigated in this section according to the study conducted by Patel and Patel [15]. Table 3 gives an overview of the distribution of applied data mining methods in the reviewed articles. The analysis showed that all of the applied methods could be classified into 14 main categories. The most popular method in the reviewed articles was natural language processing (NLP) (22%). Logistic regression analysis, used in 20% of studies, ranked second; it determines the association of the independent variables with one dichotomous dependent variable [68]. It should be noted that most studies used more than one data mining technique. Additionally, the distribution of employed DM techniques across the main approaches is illustrated in Fig. 5. The distribution and frequency of employed DM techniques by main approach can provide useful insight for researchers working on pandemics. The numbers in this figure indicate the number of studies per axis. All of the DM techniques are categorized into supervised and unsupervised techniques. In a supervised learning method, the algorithm learns from a labeled dataset to provide an answer, whereas in unsupervised learning techniques patterns are extracted from unlabeled input data [69]. Thus, all of the applied methods in the reviewed articles were divided into three categories: supervised techniques (90%), unsupervised techniques (4%), and a combination of supervised and unsupervised techniques (6%).
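To make the idea of logistic regression as a supervised technique concrete, the following is a minimal sketch that fits one predictor to a dichotomous outcome by gradient descent. It does not reproduce any reviewed study: the toy data are invented for illustration, and the function names, learning rate, and number of epochs are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Invented toy data: one standardized predictor and a labeled binary outcome.
# The labels are what makes this a supervised setting.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
odds_ratio = math.exp(w)  # multiplicative change in odds per unit increase in x
print(f"w={w:.3f}, b={b:.3f}, odds ratio={odds_ratio:.3f}")
print(f"P(y=1 | x=2.0) = {sigmoid(w * 2.0 + b):.3f}")
```

The fitted coefficient expresses the association between the independent variable and the dichotomous outcome; an unsupervised method, by contrast, would receive only `xs` without the labels in `ys`.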

Special tools and a suitable platform are needed to perform data mining methods. In this section, we examined the frequency of the various tools used in these studies. SPSS has the highest percentage (22%) among the tools, R ranks second with 10 papers (20%), and Python follows with nine studies (18%). MATLAB and RapidMiner 

Based on the reviewed studies, all eligible articles in this review can be classified into eight categories according to their clinical discipline. The identified clinical and health disciplines, with their distribution and frequency, are shown in Fig. 6. The chart shows that infectious disease received the most attention, with 18 papers (36%). Epidemiology is the second most frequently considered discipline, with 13 studies (26%).

This analysis can be highly useful to determine literature gaps in terms of health domains.
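The percentages reported for software tools and health disciplines follow directly from the study counts and the total of 50 included articles. A minimal sketch of that arithmetic, using only counts stated in the text (the helper name `share` is an assumption):

```python
TOTAL_STUDIES = 50  # number of eligible articles in this review


def share(count, total=TOTAL_STUDIES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Counts as reported in the text.
reported_counts = {
    "R (software)": 10,
    "Python (software)": 9,
    "Infectious disease (discipline)": 18,
    "Epidemiology (discipline)": 13,
}

for label, n in reported_counts.items():
    print(f"{label}: {n}/{TOTAL_STUDIES} = {share(n)}%")
```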

The main objective of this review was to summarize the studies carried out on the application of data-driven DM methods in pandemics. To this end, 50 articles were selected and analyzed from 311 retrieved studies. The findings and results are discussed in this section. The data sources used in the included studies were very diverse. In terms of country, most studies were conducted in China. This can be explained by the fact that most pandemics began in this country. Nowadays, social media has become a new source of data [70] that can generate more information in a short period than other resources. Since these kinds of data are easier to access than other data sources, a large share of studies was devoted to applying text mining techniques for infoveillance. The qualitative analysis revealed that researchers preferred to use supervised techniques such as regression to produce predictive models for a better understanding of unknown pandemics. All of these methods have been used efficiently in practice in different fields of medicine [71]. Additionally, classification methods were used more often than anticipated. By selecting the best method for implementing accurate prediction models, researchers can discover certain biomarkers in unknown diseases, which can allow them to forecast important outcomes [72, 73]. Therefore, developing prediction models can help not only physicians but also health policymakers and societies.

Since the majority of studies were conducted in China, these models may suffer from overfitting and may not generalize to other populations. None of the studies recommended applying the developed models in real practice; nevertheless, most authors were optimistic about the development of predictive models. Shamsuddin's opinion regarding the development of forecasting models is in line with our study [74]. Wynants et al. conducted a systematic review of predictive models of COVID-19 and concluded that the proposed models are poorly reported and at high risk of bias [75].

The results showed that controlling the transmission of infectious disease is the main concern in pandemic disease [76]. Usually, the nature of a new disease in a pandemic is unknown, and identifying the characteristics of a new disease is one of the most important concerns for scientists. That is why the largest share of studies is devoted to revealing disease characteristics. It can be explained by the fact that scientists should pay more attention to diagnosis than to other tasks in pandemic disease [77]. The next important issue in pandemic diseases is how the disease spreads. Hence, almost 10% of the studies were dedicated to predicting the prevalence of the disease. However, the sample sizes of the datasets are very diverse due to the variety of applied methods. The results showed that most of the studies used various data sources with a limited number of data sets. Using large data sets can strengthen the results and improve the accuracy of the model's predictions [78], which in turn can help scientists to better fight this new disease. Accordingly, researchers are recommended to use large datasets for their studies, including international ones, to achieve better diagnostic and therapeutic decisions.

In terms of diseases, most efforts were made under the heading of COVID-19. In second place, the topics were related to influenza pandemics. This result is expected due to the high prevalence of these two diseases. Using and retrieving large amounts of data provided by electronic systems as a data source can improve access to data [79]. As a result, conducting data-driven studies has become easier in recent years than ever before. The fact that diseases related to other pandemics did not appear in this search may be due to the authors of those articles considering these diseases to be epidemics.

In this study, we encountered some limitations. A vast number of studies on COVID-19 are now published daily. We investigated the literature up to 16 Oct 2020. Therefore, some studies may have been missed by the time this article is published. Consequently, further research is needed to complete our results. Another limitation of this research is that the electronic search was performed in only three databases; other databases were not searched when assessing the journal articles, which can be addressed in future research. The present study gives researchers a useful background for future work to understand the general context of data mining techniques in pandemics and their applications. Further studies could cover data mining applications in a broader sense, or could extend the search strategies to larger databases. Analyzing and incorporating papers written in languages other than English with automatic translation tools could be the subject of a future article. At the least, it would be interesting to compare the number of non-English papers with that of English ones.

This review could help scientists to find published research on DM techniques and fierce pandemics more easily. In this study, we surveyed the data mining techniques utilized in global pandemics; however, most of these techniques were developed in the current context to prevent and predict the COVID-19 epidemic. According to our survey, the foremost objective of DM applications is related to disease characteristics. This review can also help policymakers and decision-makers to make better decisions on managing and preventing major pandemics in their countries.

Funding The author(s) received no financial support for the research or publication of this article.

The study involves only a review of literature without involving humans and/or animals. The authors have no ethical conflicts to disclose.

The authors declare that they have no conflicts of interest.

Planning for large epidemics and pandemics: challenges from a policy perspective

Pandemic Disease: A Past and Future Challenge to Governance in the United States

Di Napoli R. Features, evaluation, and treatment coronavirus

WHO declares COVID-19 a pandemic

Biological and Epidemiological Trends in the Prevalence and Mortality due to Outbreaks of Novel Coronavirus COVID-19

Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis

The Role of Digital Technologies that Could Be Applied for Prescreening in the Mining Industry During the COVID-19 Pandemic

Suggesting a framework for preparedness against the pandemic outbreak based on medical informatics solutions: a thematic analysis. The International Journal of health planning and management

Successful containment of COVID-19: the WHO-Report on the COVID-19 outbreak in China

Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health

COVID-19: what is next for public health? The Lancet

Covid-19 and Health Care's Digital Revolution

Data mining and model-predicting a global disease reservoir for low-pathogenic Avian Influenza (A) in the wider pacific rim using big data sets

Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement

Survey of Data Mining Techniques used in Healthcare Domain

A comparative analysis of three online appraisal instruments' ability to assess validity in qualitative research

Data Mining and Content Analysis of Chinese Social Media Platform Weibo During Early COVID-19 Outbreak: A Retrospective Observational Infoveillance Study

A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients. Expert Systems with Applications

Mining the characteristics of COVID-19 patients in China: Analysis of social media posts

A framework for performance analysis on machine learning algorithms using covid-19 dataset

Derivation and validation of the clinical prediction model for COVID-19

Characteristic of 523 COVID-19 in Henan Province and a Death Prediction Model. Frontiers in Public Health

Prediction Model Based on the Combination of Cytokines and Lymphocyte Subsets for Prognosis of SARS-CoV-2 Infection

An Epidemiological Study on the Prevalence of the Clinical Features of SARS-CoV-2 Infection in Romanian People

Smell and taste symptom-based predictive model for COVID-19 diagnosis

Laboratory findings and a combined multifactorial approach to predict death in critically ill patients with COVID-19: a retrospective study

Modeling Spatiotemporal Pattern of Depressive Symptoms Caused by COVID-19 Using Social Media Data Mining

Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource

Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study

Covid-19 public opinion and emotion monitoring system based on time series thermal new word mining

Using social media to mine and analyze public opinion related to COVID-19 in China

Prediction of number of cases of 2019 novel coronavirus (COVID-19) using social media search index

Predicting COVID-19 Incidence Through Analysis of Google Trends Data in Iran: Data Mining and Deep Learning Pilot Study

Literature-related discovery: potential treatments and preventatives for SARS

Twitter informatics: tracking and understanding public reaction during the 2009 swine flu pandemic

Mining Physicians' Opinions on Social Media to Obtain Insights Into COVID-19: Mixed Methods Analysis

COVID-19 virus outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: A data driven model approach

Intelligent Forecasting Model of COVID-19 Novel Coronavirus Outbreak Empowered with Deep Extreme Learning Machine. Computers, Materials & Continua

Identifying mutation positions in all segments of influenza genome enables better differentiation between pandemic and seasonal strains

The use of twitter as an early warning and risk communication tool in the 2009 swine flu pandemic

A novel simple scoring model for predicting severity of patients with SARS-CoV-2 infection. Transboundary and Emerging Dis

Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Dis

Estimation of effects of nationwide lockdown for containing coronavirus infection on worsening of glycosylated haemoglobin and increase in diabetes-related complications: A simulation model using multivariate regression analysis

Using Machine Learning to Predict ICU Transfer in Hospitalized COVID-19 Patients

Prediction model and risk scores of ICU admission and mortality in COVID-19

Forecasting the spread of the COVID-19 pandemic in Saudi Arabia using ARIMA prediction model under current public health interventions

Data-based analysis, modelling and forecasting of the COVID-19 outbreak

COVID-19 Pandemic Prediction for Hungary; A Hybrid Machine Learning

Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Development and validation a nomogram for predicting the risk of severe COVID-19: A multi-center study in Sichuan

An interpretable mortality prediction model for COVID-19 patients

Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity. Computers, Materials & Continua

The impact of COVID-19 epidemic declaration on psychological consequences: a study on active Weibo users

Predicting Health Care Workers' Tolerance of Personal Protective Equipment: An Observational Simulation Study

Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection

The Exponentially Increasing Rate of Patients Infected with COVID-19 in Iran. Archives of Iranian medicine

Association between short-term exposure to air pollution and COVID-19 infection: Evidence from China

Naive Bayes classifier for predicting the factors that influence death due to covid-19 in China

Optimizing decision tree criteria for predicting COVID-19 mortality in South Korea dataset

Risk factors for myocardial injury in patients with coronavirus disease 2019 in China. ESC Heart Failure

The influences of global geographical climate towards COVID-19 spread and death

Statistical Forecast of Pollution Episodes in Macao during National Holiday and COVID-19

Identifying potential treatments of COVID-19 from Traditional Chinese Medicine (TCM) by using a data-driven approach

Natural Language Processing for Rapid Response to Emergent Diseases: Case Study of Calcium Channel Blockers and Hypertension in the COVID-19 Pandemic

Extending the identification of structural features responsible for anti-SARS-CoV activity of peptide-type compounds using QSAR modelling

Knowledge discovery and sequence-based prediction of pandemic influenza using an integrated classification and association rule mining (CBA) algorithm

Data Mining: Concepts and Techniques

Predicting the future with social media

Application and Exploration of Big Data Mining in Clinical Medicine

Accurate and dynamic predictive model for better prediction in medicine and healthcare

Predictive analytics in health care using machine learning tools and techniques

Can medical practitioners rely on prediction models for COVID-19? A systematic review. Evidence-based dentistry

Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal

Epidemiology, causes, clinical manifestation and diagnosis, prevention and control of coronavirus disease (COVID-19) during the early outbreak period: a scoping review

Importance of diagnostics in epidemic and pandemic preparedness

Predictive Modeling With Big Data: Is Bigger Really Better? Big Data

Data Processing and Text Mining Technologies on Electronic Medical Records: A

Distributional analysis and motif frequencies of compound microsatellite repeats in viral genomes