key: cord-339642-3trpona9
authors: Obeidat, Rand; Alsmadi, Izzat; Bani Bakr, Qanita; Obeidat, Laith
title: Can Users Search Trends Predict People Scares or Disease Breakout? An Examination of Infectious Skin Diseases in the United States
date: 2020-06-08
journal: Infect Dis (Auckl)
DOI: 10.1177/1178633720928356
sha: 
doc_id: 339642
cord_uid: 3trpona9

BACKGROUND: In health and medicine, people heavily use the Internet to search for information about symptoms, diseases, and treatments. As such, information on the Internet can simulate expert medical doctors, pharmacists, and other health care providers. AIM: This article aims to evaluate a dataset of search terms to determine whether search queries and terms can be used to reliably predict skin disease breakouts. Furthermore, the authors propose and evaluate a model to decide when to declare a particular month as Epidemic at the US national level. METHODS: A model was designed to distinguish a breakout in skin diseases based on the number of monthly discovered cases. To apply this model, the authors correlated Google Trends of popular search terms with monthly reported Rubella and Measles cases from the Centers for Disease Control and Prevention (CDC). Regressions and decision trees were used to determine the impact of different terms to trigger the occurrence of epidemic classes. RESULTS: Results showed that the volume of search keywords for Rubella and Measles rises when the volume of those reported diseases rises. Results also implied that the overall process was successful and should be repeated with other diseases. Such a process can trigger different actions or activities to be taken when a certain month is declared as “Epidemic.” Furthermore, this research has shown great interest in vaccination against Measles and Rubella. CONCLUSIONS: The findings suggest that search queries and keyword trends can be reliably used for the prediction of disease outbreaks and some other related knowledge extraction applications. Search-term surveillance can also provide an additional tool for infectious disease surveillance.
Future research needs to re-apply the model used in this article, and researchers need to question whether characterizing the epidemiology of Coronavirus Disease 2019 (COVID-19) pandemic waves in United States can be done through search queries and keyword trends.

Infectious skin diseases encompass a vast array of conditions that range in severity from mild to life-threatening. The clinical presentation of infectious skin diseases varies based on the type of pathogen involved, the skin layers and structures affected, and the underlying medical condition of the patient. Infectious skin diseases represent common diagnoses made by dermatologists, by primary care physicians, and in the emergency room. 1 Rubella, in particular, though a mild, vaccine-preventable skin disease, is of high public health importance owing to the teratogenic effects that can result from congenital rubella infection (CRI), leading to miscarriage, fetal death, or birth of an infant with congenital rubella syndrome (CRS). 2, 3 Rapidly identifying an infectious disease outbreak is critical, both for effective initiation of public health intervention measures and timely alerting of government agencies and the general public. 4 A vast amount of real-time information about infectious disease outbreaks can be found in various forms of Web-based data streams. Studies show that health care providers rely on online search results to obtain more information about diseases, symptoms, drugs, and other related topics. Research also showed that doctors find searching online very helpful for tracking the geographical locations of disease. Google search queries are the most commonly used data source for search studies around the world. For example, Google's search engine has been used to detect influenza epidemics in areas with a large population of web search users because of its high correlation with the percentage of physician visits in which a patient presents with influenza-like symptoms. 5

Research studies have sought to examine the association between disease outbreaks and online search keywords and terms. For example, a study by Polgreen et al 6 examined the relationship between searches for influenza and actual influenza occurrence. Another study by Yom-Tov and Fernandez-Luque 7 collected data from a major Internet search engine on how people seek information about the Measles, Mumps, and Rubella (MMR) vaccine. The authors focused on developing an automated way to score Internet search queries and web pages to examine how people use Internet search engines to garner information on vaccination.

Recognizing the need for up-to-date data to inform researchers, policymakers, public stakeholders, and health care providers whether search queries can be used to reliably predict skin disease breakouts, we correlated Google Trends popular search terms with monthly reported Rubella and Measles cases from 2004 to 2018. This study therefore analyzes and evaluates the association between monthly reported Rubella and Measles cases and Google Trends popular search terms that can be used to predict future outbreaks of infectious skin diseases.

Several research studies used Google Trends to answer research questions within health care domains. 8 Some of these studies examined and confirmed the correlation between disease outbreaks and online search keywords and trends. 9 In 2009, Ginsberg et al 5 stated that the Google Flu Trends predictions were 97% accurate compared with the Centers for Disease Control and Prevention (CDC) data. CDC was also testing Google Flu Trends in the United States, and the preliminary findings suggested that Google Flu Trends can detect regional outbreaks of influenza 7 to 10 days earlier than conventional CDC surveillance. 9, 10 The correlation between Google Trends and disease surveillance was also assessed in several countries such as India, 11 South Korea, 12 South China, 13 and Spain. 14 However, those studies did not propose any unique search terms that can correlate with disease predictions.

Google Trends has been deployed to detect or estimate many disease outbreaks such as influenza, 15, 16 Dengue, 17, 18 Ebola, 19 and Lyme. 20 Few studies analyzed skin-related diseases using Google Trends. Bloom et al 21 extracted data from Google Trends to evaluate whether population inquisitiveness on melanoma and skin cancer was correlated with a lower incidence. They found that the general population's interest in learning about skin cancer increases during the summer months. Hopkins and Secrest 22 used Google Trends data queried using several search terms (sunscreen, sunburn, skin cancer, and melanoma) in the United States. Then, time-matched search term data were correlated with melanoma outcomes data from the Surveillance, Epidemiology, and End Results Program and United States Cancer Statistics. In another study, Hopkins and Secrest 23 explored international trends for several search terms in English-speaking countries (United States, United Kingdom, Canada, Australia, and New Zealand) to better guide skin cancer prevention campaigns. Hopkins and Secrest 23 assessed the correlations between search terms, time, and melanoma outcomes for each country. None of the previous studies correlated Google Trends popular search terms with specific infectious skin diseases, including Rubella and Measles, as reported by the CDC. In this work, Google Trends was used to propose unique search terms that can correlate with Epidemic disease prediction. For this purpose, we collected data and used machine learning methods to evaluate a dataset of search terms to determine whether search queries and terms can be used to reliably predict skin disease breakouts.

In this study, it is important to use different classifiers to have more confidence in the results and compare those different classifiers based on accuracy. Therefore, we have created a new supervised classification model that uses a Support-Vector Machine (SVM) model, linear regression (LR), and Decision Tree (DT) to evaluate each disease breakout prediction. In the first part of this section, we provide an overview of the model we used. Then, we explain the data used in this study. We finish this section by presenting the proposed model for disease epidemic classification and algorithm.

SVM is a supervised machine learning technique that is widely used in classification and regression problems. The main objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points, where N is the number of features. An infinite number of hyperplanes could be found to separate 2 classes of data points; SVM seeks the one that has the maximum margin.

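A separating hyperplane can be written as the following equation:

$W \cdot X + b = 0$,

where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and $b$ a scalar (bias). For example, in two-dimensional (2D) it can be written as:

$w_0 + w_1 x_1 + w_2 x_2 = 0$,

where the bias $b$ is written as the additional weight $w_0$.

The hyperplanes defining the sides of the margin are as follows:

H1: $w_0 + w_1 x_1 + w_2 x_2 \geq 1$ for $y_i = +1$,

H2: $w_0 + w_1 x_1 + w_2 x_2 \leq -1$ for $y_i = -1$.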

Any training tuples that fall on hyperplanes H1 or H2 are support vectors.
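To make the geometry concrete, the following is a minimal sketch of how a point is scored against a separating hyperplane and how support vectors (points on H1 or H2) can be identified. The weights, bias, and points are hypothetical values chosen for illustration, not values fitted in this study, and the helper names are ours.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Return W . X + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Predict +1 or -1 from the side of the hyperplane on which x falls."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def support_vectors(w: Sequence[float], b: float,
                    points: List[Tuple[float, ...]],
                    tol: float = 1e-9) -> List[Tuple[float, ...]]:
    """Return the points lying on H1 or H2, i.e. where |W . X + b| = 1."""
    return [p for p in points
            if abs(abs(decision_value(w, b, p)) - 1.0) <= tol]


def margin_width(w: Sequence[float]) -> float:
    """Distance between H1 and H2, which equals 2 / ||W||."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical 2D hyperplane: x1 + x2 - 3 = 0, so w = (1, 1) and b = w0 = -3.
w, b = (1.0, 1.0), -3.0
points = [(2.0, 2.0), (1.0, 1.0), (3.0, 3.0), (0.0, 0.0)]
print(support_vectors(w, b, points))
```

In this toy setting, the points (2, 2) and (1, 1) fall exactly on H1 and H2, respectively, while the other two points lie beyond the margin.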

We used SVM as a classifier. In the applied SVM model, rows represent months and columns represent relevant Google Trends keywords that were extracted from several cycles. Dates used in SVM were matched for Google Trends keywords and CDC reported diseases. Table 1 shows the last 2 columns of the SVM matrix that we created from the number of reported cases: one column as a continuous value and the last column as a binary target, based on our Epidemic formula.

In LR, given N observations with an output vector Y of dimension N × 1 and p inputs X1, X2, . . ., Xp, each input vector being of dimension N × 1, LR assumes that the regression function E(Y|X) is linear in the inputs. Y is computed based on the following equation:

$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon$,

where $\beta_0$ is the intercept, $\beta_j$ is the coefficient of input $X_j$, and ε is the error term. The LR model is applied to predict diseases, such as Measles, using many search terms (in terms of their significance and estimated values).
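As an illustration of the linear model, the sketch below fits ordinary least squares for the simplest case of a single input (one search term's monthly index predicting monthly case counts). This is a simplification of the multi-term regression used in the study, and the numbers are made up for the example.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_simple_lr(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x by ordinary least squares (one input)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept: float, slope: float, x_new: float) -> float:
    """Predicted output for a new input value."""
    return intercept + slope * x_new


# Hypothetical data: monthly search index (0 to 100) and reported cases.
search_index = [10, 20, 30, 40]
reported_cases = [3, 5, 7, 9]
intercept, slope = fit_simple_lr(search_index, reported_cases)
print(intercept, slope)
```

The slope plays the role of the "estimate value" discussed in the results: a small slope means that even a large change in search interest shifts the predicted case count only slightly.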

Decision Tree is a supervised learning algorithm that has a flowchart-like tree structure. In the DT, each internal node (non-leaf node) represents a test on an attribute. Each branch is an outcome of the tested condition on an attribute, and each leaf node or terminal node holds a class label. The Decision Tree classification model was employed to study the impact of the different terms on disease prediction.

Centers for Disease Control and Prevention data. Data were collected on reported cases for each disease in the United States over the period from 2004 to 2018. Reported cases of diseases by month were collected from public CDC publications. The authors then transformed the data, aggregating them across the United States on a monthly basis to match the Google Trends data.
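The aggregation step can be sketched as follows. The record layout (an ISO date string, a state code, and a case count) is an assumption made for illustration; the actual CDC publication format used by the authors is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (report date "YYYY-MM-DD", state code, cases).
Record = Tuple[str, str, int]


def aggregate_monthly(records: Iterable[Record]) -> Dict[str, int]:
    """Sum case counts over all states and reports within each calendar month."""
    totals: Dict[str, int] = defaultdict(int)
    for date_str, _state, cases in records:
        month_key = date_str[:7]  # "YYYY-MM"
        totals[month_key] += cases
    return dict(sorted(totals.items()))


records = [
    ("2015-01-10", "CA", 40),
    ("2015-01-24", "AZ", 7),
    ("2015-02-07", "CA", 25),
    ("2015-02-14", "NY", 3),
]
monthly = aggregate_monthly(records)
print(monthly)
```

Keying the totals by "YYYY-MM" makes it straightforward to join the case counts with monthly Google Trends values for the same period.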

Google Trends data. An initial set of relevant keywords was created for each disease and used to extract Google Trends data. Specifically, in this article, the authors produced a dataset of popular search terms for 2 infectious skin diseases, Rubella and Measles, that can be used to predict future skin disease breakouts. The search-term data reported by Google were accumulated on a monthly basis.

The proposed model for disease epidemic classification and algorithm. The authors designed an "epidemic" model to distinguish a breakout of diseases based on the number of monthly discovered cases, as well as to decide whether a certain month counts as "Epidemic." The authors observed final values collected from CDC and Google Trends and then decided to make the cut-off in Epidemic class based on the reported values for each disease. Epidemic class value will be 1, or else will be zero (ie, yes, 1 or no, 0). Table 1 shows that the level of increase in a month from previous month was calculated using this formula: (Current month-previous month)/previous month. Transient month increase or decrease was eliminated. As a result, this increase will be considered Epidemic if it occurs in 3 consecutive months. The authors made this algorithm as an approximation of how to flag a month as an epidemic month. This model was applied and validated to distinguish a breakout in Rubella and Measles skin diseases based on the number of monthly discovered cases. The process of the creation of Rubella popular search terms has the following main steps:

• Start by evaluating the single term (for each skin disease) in Google Trends across the available period (from 2004 to 2018), with a total of 168 records. The numbers given for the popular terms are in the form of percentages (ie, from 0 to 100), rather than actual volumes of search terms.
• In the second step of keyword selection, we use Google Trends to expand our search terms. Starting from Step 1, we analyze the "related queries" section in Google Trends and extract the "Top" and "Rising" search terms, as shown in Table 2.
• As shown in Table 2, Google Trends distinguishes between breakout and rising keywords according to how quickly and for how long such terms have been rising in search volume.
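Separately from keyword selection, the Epidemic labelling rule described before these steps can be sketched in a few lines. This is one possible reading of that rule, under stated assumptions: a month-over-month relative increase is computed as (current - previous) / previous, and every month in a run of at least 3 consecutive increases is labelled 1. The parameters `run_length` and `min_increase` are hypothetical names, and the disease-specific cut-off chosen by the authors is not reproduced here.

```python
from typing import List, Optional, Sequence


def relative_increase(counts: Sequence[float]) -> List[Optional[float]]:
    """(current - previous) / previous; None for the first month or when previous is 0."""
    changes: List[Optional[float]] = [None]
    for prev, cur in zip(counts, counts[1:]):
        changes.append(None if prev == 0 else (cur - prev) / prev)
    return changes


def flag_epidemic(counts: Sequence[float], run_length: int = 3,
                  min_increase: float = 0.0) -> List[int]:
    """Label a month 1 when it belongs to a run of >= run_length consecutive increases."""
    changes = relative_increase(counts)
    flags = [0] * len(counts)
    run = 0
    for i, change in enumerate(changes):
        if change is not None and change > min_increase:
            run += 1
        else:
            run = 0  # a transient rise that is not sustained is discarded
        if run >= run_length:
            for j in range(i - run_length + 1, i + 1):
                flags[j] = 1
    return flags


# Hypothetical monthly case counts.
cases = [10, 12, 9, 11, 14, 20, 18]
flags = flag_epidemic(cases)
print(flags)
```

In this example, the single-month rise from 10 to 12 is treated as transient, whereas the three consecutive rises (9 to 11 to 14 to 20) are labelled Epidemic.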

For a search term to be selected in the collected dataset, it should meet the following inclusion criteria: (1) Search terms are extracted only from the "related queries" section in Google Trends results. The term should be listed in the "Top" terms or listed with more than 90% in the "Rising" terms. (2) It should be repeated more than 4 times in the "related queries" from different results (ie, from different initial search terms). This was necessary to eliminate terms that are "outliers" (ie, terms that show up in only one related-queries result).
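The two inclusion criteria can be expressed as a small filter. The input structure below (a dictionary keyed by initial search term, holding a "top" list and a "rising" mapping of term to percentage) is a hypothetical representation of the "related queries" results, not a Google Trends API format.

```python
from collections import Counter
from typing import Dict, List


def select_terms(related_by_seed: Dict[str, dict],
                 min_rising_pct: float = 90,
                 min_repeats: int = 4) -> List[str]:
    """Keep terms that qualify (Top, or Rising above min_rising_pct) for more than min_repeats seeds."""
    counts: Counter = Counter()
    for related in related_by_seed.values():
        qualifying = set(related.get("top", []))
        qualifying.update(term for term, pct in related.get("rising", {}).items()
                          if pct > min_rising_pct)
        counts.update(qualifying)  # each seed contributes at most once per term
    return sorted(term for term, n in counts.items() if n > min_repeats)


# Hypothetical related-queries results for five initial search terms.
seeds = ["rubella", "rubella vaccine", "rubella rash", "german measles", "rubella symptoms"]
related = {}
for k, seed in enumerate(seeds):
    related[seed] = {
        "top": ["rubella pregnancy"] + (["rubella igg"] if k < 4 else []),
        "rising": {"titer": 95, "mmr": 50},
    }
selected = select_terms(related)
print(selected)
```

Here "rubella igg" is dropped because it appears for only 4 initial terms, and "mmr" is dropped because its rising percentage does not exceed 90.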

An SVM model was used to evaluate each disease breakout prediction based on collected features in the different experiments. Correlations (Pearson and Spearman) were used between Google Trends of popular search terms and monthly reported Rubella and Measles cases from CDC. In addition, regressions and DTs were used to determine the impact of different terms to trigger the occurrence of epidemic classes.
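For reference, both correlation coefficients can be computed with the standard library alone. The sketch below implements Pearson's r directly and obtains Spearman's rho as Pearson's r on average ranks (ties share the mean of their positions); the sample series are invented for illustration.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monthly search index and case counts.
search_index = [1, 2, 3, 4, 5]
cases = [1, 4, 9, 16, 25]
print(pearson(search_index, cases), spearman(search_index, cases))
```

Pearson measures linear association, whereas Spearman only requires a monotonic relationship, which is why reporting both is useful when case counts grow nonlinearly with search interest.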

In our SVM model (Table 3, Rubella SVM sample), rows represent months and columns represent relevant Google Trends keywords that were extracted from several cycles. Dates were matched for Google Trends keywords and CDC reported diseases. The count column was retrieved from Google Trends and represents the popularity of those relevant keywords in that particular month. Records represent monthly data for both disease volume and Google Trends selected keywords. For Rubella, the volume of reported cases was small. In addition, our dataset contained no reports for many months (ie, missing values). This impacted overall prediction accuracy. Table 4 shows the results of LR prediction on Rubella SVM. We showed search terms with the lowest P values (ie, significant prediction results). However, their estimate values are low, which indicates a low overall impact on disease prediction.

Accuracy of prediction for Measles LR model is better for many search terms (in terms of their significance and estimate values; Table 5 ). One main reason for such better accuracy is the large number of reported cases for Measles and also the fact that we have much fewer missing values for Measles' case.

For each of the experiments to extract relevant keywords, we evaluated correlations (Pearson and Spearman) between popular terms of Google Trends and Disease arrays. No significant positive or negative correlation was found between the volume of those terms and the volume of cases. However, the keywords with the highest correlation (negative or positive) for Rubella were Titer, Rubeola, CRS (positive), rubella pregnancy, and rubella rash. Because the model has a categorical target class, a Decision Tree classification model was employed to study the impact of the different terms. Figure 1 shows the overall accuracy metrics, with accuracy above 95%. Figure 1 also shows a high true positive (TP) rate and a very low false positive (FP) rate, which implies very acceptable accuracy in all recorded performance metrics (ie, precision, recall, Matthews correlation coefficient [MCC], receiver operating characteristic [ROC] area, and precision-recall curve [PRC] area).
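Several of the metrics named above follow directly from the confusion matrix of a binary classifier. The sketch below computes them for the Epidemic (1) versus non-Epidemic (0) target; the label vectors are made-up values for illustration and do not reproduce the results in Figure 1.

```python
from math import sqrt
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, TP rate (recall), FP rate, precision, and MCC for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,  # also called recall
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical monthly labels: 1 = Epidemic, 0 = not Epidemic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because Epidemic months are typically much rarer than non-Epidemic months, MCC and the TP and FP rates are more informative than accuracy alone.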

Due to size limitations, we show a summary snapshot from the Measles DT in Figure 2. This figure summarizes search terms with a significant impact on our proposed Epidemic class. It also shows the minimum weight at which a search term triggers the occurrence of the Epidemic class. In other words, if the search interest for a particular term exceeds this percentage, then the rise in this disease is significant. The DT shows the keywords that decide the target class (whether a month is an epidemic month or not), their cut-off values at which the target class switches from Yes (epidemic) to No, and how many instances in the dataset fall into each category.
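The idea of a cut-off value for a single keyword can be illustrated with a one-level decision tree (a decision stump). This is a deliberately simplified sketch, not the tree-induction algorithm used to produce Figure 2: it scans candidate thresholds for one keyword and keeps the one that classifies the most months correctly. The data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def best_cutoff(values: Sequence[float],
                labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return (cutoff, accuracy) for the rule: predict Epidemic (1) when value > cutoff."""
    best_cut: Optional[float] = None
    best_acc = -1.0
    for cut in sorted(set(values)):
        correct = sum(1 for v, label in zip(values, labels)
                      if (1 if v > cut else 0) == label)
        acc = correct / len(values)
        if acc > best_acc:  # keep the lowest cutoff among ties
            best_cut, best_acc = cut, acc
    return best_cut, best_acc


# Hypothetical monthly search index for one keyword and the Epidemic label.
keyword_index = [5, 10, 20, 40, 60, 80]
epidemic = [0, 0, 0, 1, 1, 1]
cutoff, accuracy = best_cutoff(keyword_index, epidemic)
print(cutoff, accuracy)
```

A full decision tree repeats this kind of split recursively over many keywords, usually with an impurity criterion such as information gain or Gini rather than raw accuracy.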

This article aimed to evaluate a dataset of search terms to determine whether search queries and terms can be used to reliably predict disease breakouts. A model was proposed and evaluated to decide when to declare a particular month as Epidemic at the US national level. In this study, the authors applied the model to 2 infectious skin diseases, Rubella and Measles.

By using LR as a regression method, we showed that the search terms with the lowest P values have low estimate values, which indicates a low overall impact on disease prediction. By using LR, we also found that the accuracy of prediction for Measles is higher than the accuracy of prediction for Rubella using several search terms, as shown in Tables 4 and 5. In addition, the DT classification model shows that the keyword features can be used to classify whether a month is an epidemic month or not, with more than 95% accuracy.

In this study, we found that people search for Rubella and Measles throughout the year. Results showed that the volume of search keywords for Rubella and Measles rises when the volume of reported diseases rises. Due to the small volume of reported cases for Rubella, the accuracy of prediction for Measles was found to be higher than the accuracy of prediction for Rubella. Despite some challenges related to missing values in certain months, the results implied that the overall process was successful and should be repeated with other diseases. Such a process can trigger different actions or activities to be taken when a certain month is declared as "Epidemic." One interesting observation is that the query volumes vary considerably according to the searched term. In addition, this research has shown great interest in vaccination against Measles and Rubella.

This study has some limitations. First, we weighed whether to use US data at the national level or state by state. However, based on data availability, we reported analysis only at the US national level. In the future, and depending on data availability from the CDC, we will analyze historical data over several years per state. Second, one major limitation of Google Trends is that it aggregates relative, not absolute, data. All data reported are relative (ie, in percentage from 0% to 100%) rather than actual volumes of search terms.
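The consequence of relative scaling can be seen in a short sketch. Assuming, as a simplification of how Google Trends reports interest, that each series is rescaled so that its peak equals 100, two terms with very different absolute volumes can yield identical indices. The volumes below are invented.

```python
from typing import List, Sequence


def to_relative_index(volumes: Sequence[float]) -> List[int]:
    """Rescale a series so that its maximum becomes 100 (simplified assumption)."""
    peak = max(volumes)
    if peak == 0:
        return [0] * len(volumes)
    return [round(100 * v / peak) for v in volumes]


# Hypothetical absolute monthly search volumes for two terms.
rare_term = [50, 100, 200]
popular_term = [5000, 10000, 20000]
print(to_relative_index(rare_term))
print(to_relative_index(popular_term))
```

Both series map to the same index values, so the relative data alone cannot reveal that one term is searched 100 times more often than the other.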

In the era of online information overload, can users' search trends predict disease outbreaks? To address this question, this study evaluated a dataset of search terms from 2004 to 2018 by developing and evaluating a model to decide when to declare a particular month as Epidemic at the US national level. The findings suggest that search queries and keyword trends can be reliably used for the prediction of disease outbreaks, and that search-term surveillance can provide an additional tool for infectious disease surveillance. Future research needs to re-apply the model used in this article, and researchers need to question whether characterizing the epidemiology of Coronavirus Disease 2019 (COVID-19) pandemic waves in the United States can be done through search queries and keyword trends.

Infectious skin diseases: a review and needs assessment

Control of rubella and congenital rubella syndrome (CRS) in developing countries, Part 1: burden of disease from CRS

high burden of congenital infection and spread to Canada

Early detection of disease outbreaks using the Internet

Detecting influenza epidemics using search engine query data

Using Internet searches for influenza surveillance

Information is in the eye of the beholder: seeking information on the MMR vaccine through an Internet search engine

The use of Google Trends in health care research: a systematic review

Google Trends: a web-based tool for real-time surveillance of disease outbreaks

Influenza forecasting with Google flu trends

Google search trends predicting disease outbreaks: an analysis from India

Correlation between national influenza surveillance data and google trends in South Korea

Using Google Trends for influenza surveillance in South China

Diseases tracked by using Google Trends

Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic

Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales

Evaluation of Internet-based dengue query data: Google Dengue Trends

Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance

Assessing Ebola-related web search behaviour: insights and implications from an analytical study of Google Trends-based query volumes

The utility of "Google Trends" for epidemiological research: Lyme disease as an example

Google search trends and skin cancer: evaluating the US population's interest in skin cancer and its association with melanoma outcomes

Public health implications of Google searches for sunscreen, sunburn, skin cancer, and melanoma in the United States

An international comparison of Google searches for sunscreen, sunburn, skin cancer, and melanoma: current trends and public health implications

Data Collection: RO 

Rand Obeidat https://orcid.org/0000-0001-8271-3829