key: cord-0788050-f6yaofyn authors: Kurian, Shyam J.; Bhatti, Atiq ur Rehman; Alvi, Mohammed Ali; Ting, Henry H.; Storlie, Curtis; Wilson, Patrick M.; Shah, Nilay D.; Liu, Hongfang; Bydon, Mohamad title: Correlations Between COVID-19 Cases and Google Trends Data in the United States: A State by State Analysis date: 2020-08-20 journal: Mayo Clin Proc DOI: 10.1016/j.mayocp.2020.08.022 sha: fe14b76ba087e61b52bc169bddd33e73cacfd30b doc_id: 788050 cord_uid: f6yaofyn ABSTRACT BACKGROUND Since January 2020, Coronavirus (COVID-19) cases have risen exponentially in the United States. Accurate data on COVID-19 cases has been difficult to report due to lack of testing as well as the overload of the U.S. healthcare system. This study aims to evaluate whether a digital surveillance model using Google Trends is feasible, and whether accurate predictions can be made regarding new cases. METHODS Data on total and daily new cases in each U.S. state was collected and used in this study from late January to early April. Information regarding ten keywords was collected and correlation analyses were performed for individual states as well as for the United States overall. RESULTS Ten keywords were analyzed from Google Trends. “Face mask”, “Lysol”, and “COVID stimulus check” had the strongest correlations when looking at the United States as a whole, with R values of 0.88, 0.82 and 0.79 respectively. Lag and lead Pearson correlations were assessed for every state and all ten keywords from 16 days before the first case in each state to 16 days after the first case. Strong correlations were seen up to 16 days prior to the first reported cases in some states. CONCLUSION This study demonstrates the feasibility of syndromic surveillance of internet search terms to monitor new infectious diseases such as COVID-19. This information could enable better preparation and planning of healthcare systems. Cases of pneumonia of unknown etiology appeared at the end of 2019 in Wuhan, China. 
1 Further sequencing analysis revealed the involvement of a novel strain of virus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) obtained from the samples of the lower respiratory tract of infected patients. 2 The number of cases quickly accelerated, and eventually the disease spread to the United States, with the first confirmed case announced in January 2020; the World Health Organization labeled the situation a pandemic on March 11th, 2020. Web-based big data analytics has been gaining popularity for its potential to predict the distribution of infectious diseases. 3 Internet usage has revolutionized public access to healthcare knowledge. Monitoring and analysis of internet data falls under the research field known as infodemiology, defined as obtaining data from web-based resources and repurposing it to inform public health and health policymaking. 4 Web-based activity detection tools can play a vital role in early detection of infectious events and help health care systems prepare in time, avoiding the adverse consequences of being caught by surprise. Among these web-based surveillance tools, one of the most prominent is Google Trends, which analyzes internet search behavior by tracking the most searched keywords that reflect concerns of the general public. Google Trends provides valuable insights into community dynamics and health-related problems, particularly in the area of infectious diseases. Big data produced by Google Trends has proven to be valuable for correlation assessments and forecasting models of a number of infectious diseases including influenza, Middle East respiratory syndrome (MERS), Zika, and more; it has also been found to be a useful tool for the assessment of dementia cases in the population. 
[5] [6] [7] [8] Since the first case of COVID-19 appeared in the United States, there has been an exponential increase in the daily number of cases. The United States now has the highest number of cases and deaths in the world. 9 The purpose of this study is to explore whether there is a correlation between certain keywords searched by the general public in Google and the number of COVID-19 cases in the United States on a state by state basis. Significant correlations could support the use of Google Trends to predict new COVID-19 case locations and hotspots. Google Trends processes the magnitude of web searches performed for a specified keyword, amongst other searches, providing the relative search volume (RSV) for each keyword. This standardized value is calculated by dividing the total number of searches for a keyword by the total searches of the geography and time range it represents to compare relative popularity. The resultant number ranges from 0 to 100 and is based on the topic's daily popularity compared to its search popularity over a given timeframe. 10 Trend changes are displayed online for time series of interest. Keywords can be filtered by location (worldwide, country, state, city) and time span. Data is collected in a time series presented on a normalized scale of 0-100, where 0 represents no search and 100 represents the peak search activity for a particular keyword or string. Data can be downloaded as a ".csv" file. Daily Google Trends data was mined in our study from January 22, 2020, to April 6, 2020. In total, ten keywords related to COVID-19 were chosen based on popularity and rising patterns on the internet and Google News in the time period provided above. 
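The RSV scaling described above can be illustrated with a minimal Python sketch. Google does not publish raw search counts or its exact pipeline, so the function name `relative_search_volume` and the raw-count inputs are hypothetical; the sketch only mirrors the two steps stated in the text (share of all searches, then rescaling so the peak day equals 100).

```python
def relative_search_volume(keyword_counts, total_counts):
    """Scale daily keyword search shares to a 0-100 index (peak day = 100).

    keyword_counts and total_counts are hypothetical raw daily counts for one
    geography; Google Trends itself only exposes the scaled values.
    """
    if len(keyword_counts) != len(total_counts):
        raise ValueError("series must have the same length")
    # Step 1: share of all searches made that day in that geography.
    shares = [k / t if t > 0 else 0.0
              for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0 for _ in shares]
    # Step 2: rescale so the most popular day in the window is 100.
    return [round(100 * s / peak) for s in shares]


# Three days with equal total traffic: the busiest day becomes 100.
print(relative_search_volume([10, 50, 25], [1000, 1000, 1000]))
```

Because the index is relative to the peak within the chosen window and geography, RSV values from separate queries (for example, two different states) are not directly comparable in absolute terms.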
Keywords searched include: "COVID symptoms", "Coronavirus symptoms", "Sore throat+shortness of breath+fatigue+cough", "Coronavirus testing center", "Loss of smell", "Lysol" [sanitizer], "Antibody", "Face Mask", "Coronavirus vaccine", and "COVID stimulus check". Keyword categories included disease symptoms, prevention, testing, and possible treatments. Our search methodology was to perform a query for each keyword for each state in the U.S. individually. In total, we obtained data for 50 states for each selected keyword. Data for the daily new and total number of confirmed cases and deaths have been tracked and reported by Johns Hopkins University Center for Systems Science and Engineering. At the time of this study, the data provided included COVID-19 case data on a county by county basis for each of the 50 states. Total U.S. cases reported from January 22nd, 2020 to April 6th, 2020 were available; this is the time frame utilized in this study. County data for each state was combined to create a state by state data set. To assess the relationship between COVID-19 cases and keyword patterns in Google Trends, correlation analysis was performed using R 3.6.2. Ten keywords in Google Trends were searched and data was collected from January 22nd, 2020 to April 6th, 2020. We plotted each keyword's RSV from January to April of 2020. Pearson correlation coefficients were calculated between each keyword's standardized RSV and the number of daily new COVID-19 cases. 95% confidence intervals were also calculated. We used the correlation coefficients of select keywords and daily new COVID-19 cases to create a heat-map for each of the 50 states at time zero (the day of the first case in the state). To study the association between COVID-19 cases and Google search trends for each of the 10 keywords, we created scatterplots showing the number of COVID-19 cases against a standardized daily Google search RSV value. 
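The correlation step described above can be sketched as follows. The study used R 3.6.2; this Python version is only an illustration. The text does not say how the 95% confidence intervals were obtained, so the Fisher z-transformation used here is an assumption (it is a common choice for Pearson coefficients), and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (assumed method)."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite when |r| == 1
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Illustrative values only: r = 0.88 over 76 daily observations
# (the number of days from January 22 to April 6, 2020).
low, high = fisher_ci(0.88, 76)
print(round(low, 2), round(high, 2))
```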
Lag and lead Pearson correlation coefficients were calculated for all 50 states as well as the United States as a whole. The lag/lead window for each state ran from 16 days prior to time zero (day of the first case in that state) to 16 days after time zero. We compared the correlation coefficients for each keyword's RSV and daily new COVID-19 cases between day -16 and day +16 in all 50 states as well as the United States as a whole. Ten keywords in Google Trends were searched and data was collected from January 22nd, 2020 to April 6th, 2020. Keywords generally increased in search popularity over time compared to baseline; some keywords peaked in popularity towards mid-March, such as "COVID symptoms", while others continued to rise in popularity into April, such as "face mask" [Fig. 1]. Correlation coefficients were calculated between each keyword and each of the 50 states' daily new COVID cases as well as the daily new COVID cases in the United States as a whole. When looking at the United States as a whole, keyword correlations ranged from R = 0.06 (coronavirus symptoms) to R = 0.88 (face mask); 6 of the 10 keywords had moderate correlations (R = 0.3-0.7) with daily new COVID cases in the U.S. while 3 out of the 10 keywords had strong correlations (R = 0.7-1) [Table 1]. When looking at correlations on a state by state basis, four keywords with significant correlations nationwide included "COVID symptoms", "coronavirus testing center", "loss of smell", and "face mask." "COVID symptoms" had correlations ranging from 0.37 to 0.80, "coronavirus testing center" had correlations ranging from -0.06 to 0.63, "loss of smell" had correlations ranging from 0.09 to 0.76, and "face mask" had correlations ranging from 0.35 to 0.90 [Supplemental Table 1]. This is further represented in Figure 2 as a United States heat map. Search popularity for each keyword varied with COVID-19 case numbers. 
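The lag and lead analysis described above can be sketched in Python. The text does not spell out exactly how the series were aligned around time zero, so this is one plausible reading rather than the authors' procedure: the RSV series is shifted against the daily case series by each offset from -16 to +16 days, and a Pearson coefficient is computed on the overlapping days. The function names and the sign convention (negative offset means searches precede cases) are assumptions.

```python
import math


def _pearson(x, y):
    """Pearson r, or None when it is undefined (too short or constant)."""
    n = len(x)
    if n < 3:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def lag_lead_correlations(rsv, cases, max_shift=16):
    """Return {shift: r} for shifts in [-max_shift, +max_shift].

    Assumed convention: shift k < 0 pairs the RSV on day t with cases on day
    t + |k| (searches lead cases); k > 0 pairs the RSV on day t + k with
    cases on day t (searches trail cases).
    """
    if len(rsv) != len(cases):
        raise ValueError("series must have the same length")
    n = len(rsv)
    out = {}
    for k in range(-max_shift, max_shift + 1):
        if k < 0:
            xs, ys = rsv[:n + k], cases[-k:]
        else:
            xs, ys = rsv[k:], cases[:n - k]
        r = _pearson(xs, ys)
        if r is not None:  # skip shifts where r is undefined
            out[k] = r
    return out


# Synthetic check: searches that run exactly 3 days ahead of cases.
cases = [t * t for t in range(40)]
rsv = cases[3:] + [0, 0, 0]
result = lag_lead_correlations(rsv, cases)
best = max(result, key=result.get)
print(best, round(result[best], 3))
```

Each shift shortens the overlap between the two series, so coefficients at the largest offsets rest on fewer paired days than the coefficient at shift 0.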
Some keywords such as "antibody" and "Lysol" had higher popularity as COVID-19 cases increased; other keywords such as "COVID symptoms" and "coronavirus vaccine" had higher popularity when COVID-19 case numbers were lower [Figure 3]. To further assess this, lag and lead Pearson correlation coefficients were calculated for all ten keywords and each of the 50 states, along with the United States as a whole. Lag correlations were calculated up to 16 days before the first case, and lead correlations were calculated up to 16 days after the first case. A majority of the keywords had moderate to strong correlations days before the first COVID-19 cases appeared, with diminishing correlations following the first case [Figure 4]. "Coronavirus symptoms", for example, had its strongest correlations 16 days prior to the first case in the United States (R=0.77) and in the majority of the 50 states individually. All calculated lag and lead correlation coefficients for each of the 10 keywords and the 50 states as well as the United States overall are displayed in Supplemental Table 2. When looking at Minnesota, Arizona, Florida, and New York, strong keyword correlations were seen up to 16 days prior to the first reported cases in each of these states. Minnesota, Arizona, and Florida are reported here individually because the authors' institution (Mayo Clinic) has campuses in each of these three states; New York was selected as it was the most strongly impacted area during the beginning of the pandemic in the United States. For Minnesota, the strongest correlations for "COVID symptoms", "coronavirus symptoms", "Lysol", and "coronavirus vaccine" were seen on lag day 8 (R=0.87), lag day 14 (R=0.85), lag day 15 (R=0.70), and lag day 16 (R=0.82) respectively [Table 2A]. 
For Arizona, the strongest correlations for "COVID symptoms", "coronavirus symptoms", "sore throat + shortness of breath + fatigue + cough", "loss of smell", "Lysol", "coronavirus vaccine", and "COVID stimulus check" were seen on lag day 9 (R=0.80), lag day 16 (R=0.82), lag day 11 (R=0.73), lag day 3 (R=0.66), lag day 1 (R=0.73), lag day 14 (R=0.69), and lag day 2 (R=0.84) respectively [Table 2B]. For Florida, nearly every keyword had strong correlations prior to the first case in the state; the strongest correlations for "COVID symptoms", "coronavirus symptoms", "loss of smell", and "coronavirus vaccine" were seen on lag day 10 (R=0.74), lag day 16 (R=0.77), lag day 8 (R=0.70), and lag day 15 (R=0.75) respectively [Table 2C]. For New York, the strongest correlations for "COVID symptoms", "coronavirus symptoms", "coronavirus testing center", "loss of smell", and "coronavirus vaccine" were seen on lag day 6 (R=0.87), lag day 16 (R=0.87), lag day 9 (R=0.76), lag day 5 (R=0.78), and lag day 15 (R=0.80) respectively [Table 2D]. Our study demonstrated moderate to strong correlations between data obtained from searching COVID-19 related keywords in Google Trends and total COVID-19 cases in the United States as obtained from national data aggregators. Strong correlations were seen up to 16 days prior to the first reported cases in some states. This emphasizes the importance of digital surveillance and points to the fact that it can be a useful addition to our toolbelt when trying to monitor new infectious disease outbreaks. Over the years, several studies have pointed to the role of internet surveillance in helping with early prediction of other infectious disease outbreaks; this includes diseases such as dengue fever, Zika, H1N1, influenza, measles, and MERS. 
5, 6, [11] [12] [13] [14] There are several benefits to utilizing internet surveillance methods versus traditional methods, and employing a combination of the two is likely the key to an effective surveillance system. One benefit to an internet model is minimal costs, as all of the data gathered from Google Trends were available for free. Furthermore, the data is made available to the public in real time, with near-instant updates in regard to search results. This is extremely important when attempting to predict outbreaks and new hotspots for a pandemic, as any delay in information could potentially miss the golden window that would allow for preparation prior to an outbreak in a certain location. Several other papers focusing on influenza have emphasized the pitfalls of traditional surveillance, and how CDC surveillance reports were often weeks behind search engine results and estimates, as traditional systems take "1-2 weeks to gather and process surveillance data." 5, 13 This type of lag was further supported in our study of COVID-19, as Google data on search trends predated the first reports of cases on a state by state basis. When Shin et al. published their paper on MERS in 2016, they found a similar lag pattern, with social media and search engine data reflecting disease outbreak earlier than conventional surveillance models. 6 Scientists in China also looked for this data lag with COVID-19 in their country and had similar results. They looked back 14 days prior to the first reported cases, and found that "the peak internet searches and social media data about the coronavirus disease 2019 (COVID-19) outbreak occurred 10-14 days earlier than the peak of daily incidences in China." 15 We suspect that our United States data shows similar lags in traditional surveillance data for a number of reasons. Firstly, hospital reporting can vary from state to state and even county to county. 
Although we try to standardize reporting guidelines, during a time of a pandemic when hospital systems and the country are becoming increasingly stressed, appropriate reporting can break down. In fact, inappropriate reporting can lead to significant inaccuracies when data is released using traditional surveillance models. For example, on April 17th, 2020, China raised its coronavirus death toll by 50% in Wuhan relative to their previously reported numbers. 16 A second important source of data lag using traditional surveillance in the US is the lack of testing required for the current pandemic. Testing is evolving on a day by day basis, and thankfully we are moving in the right direction; however, the United States and the world still have a way to go. Testing capabilities were sparse at the beginning of the United States outbreaks, and many areas were backlogged in their abilities to test for COVID-19. Even if patient samples were available, testing a sample and reporting the diagnosis back to the physician and patient was delayed because testing capabilities were not robust. This, of course, results in a delay in reported cases and is where internet surveillance could add value. As we continue to evolve, the need for quicker testing and an increase in the quantity of testing for COVID-19 is paramount. In an article by Gottlieb et al. regarding the reopening of the United States, the authors state, "We estimate that a national capacity of at least 750,000 tests per week would be sufficient... in conjunction with more widespread testing, we need to invest in new tools to make it efficient for providers to communicate test results and make data easily accessible to public-health officials working to contain future outbreaks." 17 Data accessibility and speed of communication are key; search engine surveillance meets both of these criteria, and thus provides important up-to-date information while traditional models catch up. 
It is important to note that our study looked at ten keywords, and each had varying strengths of correlation with case numbers. If we had looked at 100 keywords, even stronger correlations may have been found. Search terms will also evolve as a pandemic progresses. Furthermore, Google itself is widely used in the United States, which makes it a good candidate for digital surveillance, but this is not the case for every country. For example, Google is not a major search engine in China. 15 It would be important to utilize sites relevant to each country when developing predictive models, and using multiple sites could further improve predictions. Shin et al. utilized Google and Twitter when conducting their study on MERS, and found strong correlations using both sites. 6 One other limitation of Google Trends is the granularity it provides. Although it does provide information on some cities, it does not currently provide a comprehensive town by town breakdown of its data. This would make it difficult to create appropriate forecast models on a town by town basis, and individuals would have to rely on broader state-wide predictions. This study reveals the benefits of internet surveillance models and the use of Google Trends to monitor new infectious diseases such as COVID-19. For the United States, Google Trends data was highly correlated with cases of COVID-19 on a state by state basis and could potentially be used to predict new areas of outbreak and possible high impact zones as the disease progresses. Furthermore, this study demonstrates that there is information present in Google Trends that precedes outbreaks, and this data should be utilized to allow for better resource allocation in regard to tests, personal protective equipment, medication, and more. 
1. Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle
2. Clinical features of patients infected with 2019 novel coronavirus in Wuhan
3. What's trending in the infection prevention and control literature? From HIS 2012 to HIS 2014, and beyond
4. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet
5. Detecting influenza epidemics using search engine query data
6. High correlation of Middle East respiratory syndrome spread with Google search and Twitter trends in
7. Forecasting the Incidence of Dementia and Dementia-Related Outpatient Visits With Google Trends: Evidence From Taiwan
8. Global reaction to the recent outbreaks of Zika virus: Insights from a Big Data analysis
9. COVID-19 United States Cases by County.
10. Google. FAQ about Google Trends data
11. Utilizing Nontraditional Data Sources for Near Real-Time Estimation of Transmission Dynamics During the 2015-2016 Colombian Zika Virus Disease Outbreak
12. Dengue prediction by the web: Tweets are a useful tool for estimating and forecasting Dengue at country and city level
13. Monitoring influenza activity in the United States: a comparison of traditional surveillance systems with Google Flu Trends
14. Digital epidemiology: assessment of measles infection through Google Trends mechanism in Italy
15. Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data
16. China Raises Coronavirus Death Toll by 50% in Wuhan
17. National Coronavirus Response A Road Map To Reopening