key: cord-0515314-5baj4u7y
title: Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal
authors: Pierri, Francesco; Perry, Brea; DeVerna, Matthew R.; Yang, Kai-Cheng; Flammini, Alessandro; Menczer, Filippo; Bryden, John
date: 2021-04-21

Widespread uptake of vaccines is necessary to achieve herd immunity. However, uptake rates have varied across U.S. states during the first six months of the COVID-19 vaccination program. Misbeliefs may play an important role in vaccine hesitancy, and there is a need to understand relationships between misinformation, beliefs, behaviors, and health outcomes. Here we investigate the extent to which COVID-19 vaccination rates and vaccine hesitancy are associated with levels of online misinformation about vaccines. We also look for evidence of directionality from online misinformation to vaccine hesitancy. We find a negative relationship between misinformation and vaccination uptake rates. Online misinformation is also correlated with vaccine hesitancy rates taken from survey data. Associations between vaccine outcomes and misinformation remain significant when accounting for political as well as demographic and socioeconomic factors. While vaccine hesitancy is strongly associated with Republican vote share, we observe that the effect of online misinformation on hesitancy is strongest across Democratic rather than Republican counties. Granger causality analysis shows evidence for a directional relationship from online misinformation to vaccine hesitancy. Our results support a need for interventions that address misbeliefs, allowing individuals to make better-informed health decisions.

The COVID-19 pandemic has killed over 4.9 million people and infected 241 million worldwide as of October 2021 [1]. Vaccination is the lynchpin of the global strategy to fight the SARS-CoV-2 coronavirus [2], [3].
Surveys conducted during February and March 2021 found high levels of vaccine hesitancy, with around 40-47% of American adults hesitant to take the COVID-19 vaccine [4], [5]. However, populations must reach a threshold vaccination rate (i.e., 60-70%) to achieve herd immunity [6]-[8]. Evidence of uneven distributions of vaccinations [9] raises the possibility of geographical clusters of non-vaccinated people [10]. In early July 2021, increased rates of the highly transmissible SARS-CoV-2 Delta variant were recorded in several poorly vaccinated U.S. states [9]. Such localized outbreaks will preclude eradication of the virus and may exacerbate racial, ethnic, and socioeconomic health disparities.

Vaccine hesitancy covers a spectrum of intentions, from delaying vaccination to outright refusal to be vaccinated [11]. Several factors are linked to vaccine hesitancy, with rates in the U.S. highest among three groups: African Americans, women, and conservatives [12]. Other predictors, including education, employment, and income, are also associated with hesitancy [13]. A number of studies discuss the spread of vaccine misinformation on social media [14] and argue that such campaigns have driven negative opinions about vaccines and even contributed to the resurgence of measles [15], [16]. In the COVID-19 pandemic, widely shared misinformation includes false claims that vaccines genetically manipulate the population or contain microchips that interact with 5G networks [17], [18]. Exposure to online misinformation has been linked to increased health risks [19] and vaccine hesitancy [20]. Gaps remain in our understanding of how vaccine misinformation is linked to broad-scale patterns of COVID-19 vaccine uptake.

The Pfizer-BioNTech COVID-19 vaccine was the first to be given U.S. Food and Drug Administration authorization, on December 10th, 2020 [21]. Since then, two other vaccines have been authorized in the U.S.
Initially, vaccines were administered selectively, with nationwide priority given to more vulnerable cohorts such as elderly members of the population. As vaccines have more recently become available to the entire adult population [22], adoption is driven by limits in demand rather than in supply. It is therefore important to study the variability in uptake across U.S. states and counties, as reflected in recent surveys [23], [24].

Here we study relationships between vaccine uptake, vaccine hesitancy, and online misinformation. Leveraging data from Twitter, Facebook, and the Centers for Disease Control and Prevention (CDC), we investigate how online misinformation is associated with vaccination rates and levels of vaccine hesitancy across the U.S. We also use Granger causality analysis to investigate whether there is evidence for a directional association between misinformation and vaccine hesitancy.

Our key independent variable is the mean percentage of vaccine-related misinformation shared via Twitter at the U.S. state or county level. We used 55 M tweets from the CoVaxxy dataset [17], collected between January 4th and March 25th, 2021 via the Twitter filtered stream API with a comprehensive list of keywords related to vaccines (see Supplementary Information). We leveraged the Carmen library [29] to geolocate almost 1.67 M users residing in 50 U.S. states, and a subset of approximately 1.15 M users residing in over 1,300 counties. The larger set of users accounts for a total of 11 M shared tweets. Following a consolidated approach in the literature [25]-[28], we identified misinformation by considering tweets that contained links to news articles from a list of low-credibility websites compiled by a politically neutral third party (see details in the Supplementary Information).
We measured the prevalence of misinformation about vaccines in each region by (i) calculating the proportion of vaccine-related misinformation tweets shared by each geo-located account; and (ii) taking the average of this proportion across accounts within a specific region. The Twitter data collection was evaluated and deemed exempt from review by the Indiana University IRB (protocol 1102004860). Our dependent variables include vaccination uptake rates at the state level and vaccine hesitancy at the state and county levels. Vaccination uptake is measured from the number of daily vaccinations administered in each state during the week of 19-25 March 2021, and measurements are derived from the CDC [9] . Vaccine hesitancy rates are based on Facebook Symptom Surveys provided by the Delphi Group [24] at Carnegie Mellon University. Vaccine hesitancy is likely to affect uptake rates, so we specify a longer time window to measure this variable, i.e., the period Jan 4th-March 25th 2021. We computed hesitancy by taking the complementary proportion of individuals "who either have already received a COVID vaccine or would definitely or probably choose to get vaccinated, if a vaccine were offered to them today." See Supplementary Information for further details. There are no missing vaccine-hesitancy survey data at the state level. Observations are missing at the county level because Facebook survey data are available only when the number of respondents is at least 100. We use the same threshold on the minimum number of Twitter accounts geolocated in each county, resulting in a sample size of N = 548 counties. 
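The two-step prevalence measure above can be sketched with pandas. The column names (`account_id`, `region`, `low_credibility`) are illustrative assumptions about a per-tweet table, not the paper's actual data schema:

```python
# Sketch of the two-step regional prevalence measure: (i) per-account share
# of low-credibility tweets, (ii) average of those shares within a region.
import pandas as pd

def misinformation_prevalence(tweets: pd.DataFrame) -> pd.Series:
    """Mean per-account share of low-credibility tweets, by region."""
    # (i) proportion of each account's vaccine-related tweets that link
    #     to a low-credibility source
    per_account = tweets.groupby(["region", "account_id"])["low_credibility"].mean()
    # (ii) average that proportion over all accounts in each region
    return per_account.groupby(level="region").mean()

# Toy usage with two regions and three accounts
df = pd.DataFrame({
    "region": ["IN", "IN", "IN", "CA"],
    "account_id": [1, 1, 2, 3],
    "low_credibility": [True, False, False, True],
})
prev = misinformation_prevalence(df)
# prev["IN"] == (0.5 + 0.0) / 2 == 0.25; prev["CA"] == 1.0
```

Averaging per-account shares (rather than pooling tweets directly) keeps highly active accounts from dominating a region's score.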
Our multivariate regression models adjust for six potential confounding factors: percentage of the population below the poverty line, percentage aged 65+, percentage of residents in each racial and ethnic group (Asian, Black, Native American, and Hispanic; White non-Hispanic is omitted), rural-urban continuum code (RUCC, county level only), number of COVID-19 deaths per thousand, and percentage Republican vote (in 10-percent units). Other covariates, including religiosity, unemployment rate, and population density, were also considered (full list in Supplementary Table S9). We also conduct a large number of sensitivity analyses, including different specifications of the misinformation variable (with a restricted set of keywords and different thresholds for the inclusion of Twitter accounts) as well as logged versions of misinformation (to correct positive skew). These results are presented in the Supplementary Information (Tables S3-S8).

We conduct multiple regression models predicting vaccination rate and vaccine hesitancy. Both dependent variables are normally distributed, making weighted least squares regression the appropriate model. Data are observed (aggregated) at the state or county level rather than at the individual level. Analytic weights are applied to give more influence to observations calculated over larger samples. The weights are inversely proportional to the variance of an observation, such that the variance of the j-th observation is assumed to be σ²/w_j, where w_j is the weight. The weights are set equal to the size of the sample from which the average is calculated. We estimate the weighted regressions in Stata 16 using analytic weights (aweight). In addition, because counties are nested hierarchically within states, we use cluster-robust standard errors to correct for lack of independence between county-level observations.

We investigate Granger causality between vaccine hesitancy and misinformation by comparing two auto-regressive models.
The first considers daily vaccine hesitancy rates x_t^r at time t in geographical region r (state or county):

x_t^r = \sum_{i=1}^{W} a_i x_{t-i}^r + \epsilon_t^r,

where W is the length of the time window. The second model adds daily misinformation rates per account as an exogenous variable y:

x_t^r = \sum_{i=1}^{W} a_i x_{t-i}^r + \sum_{i=1}^{W} b_i y_{t-i}^r + \epsilon'_t^r.

The variable y is said to be Granger causal [30], [31] on x if, in statistically significant terms, it reduces the error term \epsilon'_t^r, i.e., if Var(\epsilon') < Var(\epsilon), meaning that misinformation rates y help forecast hesitancy rates x. We assume geographical regions to be equivalent and independent in terms of the way misinformation influences vaccine attitudes; thus, we use the same parameters a_i and b_i across all regions. We employ ordinary least squares linear regression (Python statsmodels package, version 0.11.1) to fit both models, standardizing the two variables and removing trends in the time series of each region. We select the time-window length W that maximizes the model fit. For both counties and states, this was W = 6 days, and we present results using this value. We also tested nearby values of W ± 2 to confirm that these provide similar results. We use data points with at least 1 tweet and at least 100 survey responses for every day in the time window for the specified region. The traditional statistic used to assess the significance of Granger causality is the F-statistic [30]. However, in our case, there are several reasons why this is not appropriate. First, we have missing time windows in some of our regions.

Looking across U.S. states, we observe a negative association between vaccination uptake rates and online misinformation (Pearson R = -0.49, p < .001). Investigating covariates known to be associated with vaccine uptake or hesitancy, we find that an increase in the mean amount of online misinformation is significantly associated with a decrease in daily vaccination rates per million (b = -3518.00, p = .009, Fig. 1A; see Methods and Supplementary Information). Republican vote share is also strongly associated with vaccination rate (b = -640.32, p = .004).
These two factors alone explain nearly half the variation in state-level vaccination rates, and are themselves moderately correlated (Supplementary Fig. S1 and Table S1), consistent with prior research [32]. The remaining covariates are non-significant and/or collinear with other variables (i.e., have high variance inflation factors), and were thus dropped for parsimony. To test the robustness of these results, we also consider a more granular level of information by examining county data, computing the same measures at the county level (see Supplementary Information).

Our results so far demonstrate an association between online misinformation and vaccine hesitancy. We investigate evidence for directionality in this association by performing a Granger causality analysis [30], [31]. We find that misinformation helps forecast vaccine hesitancy, weakly at the state level (p = .0519) and strongly at the county level (p < .001; see Methods and Supplementary Tables S10, S11). Analysis of the significant lagged coefficients (Supplementary Table S10) indicates that there is a lag of around 2-6 days from misinformation posted in a county to a corresponding increase in vaccine hesitancy in the same county.

Finally, Figure 3 shows the most shared low-credibility sources. We note the high prevalence of one particular source, Children's Health Defense, an anti-vaccination organization that has been identified as one of the main sources of misinformation on vaccines [34], [35]. We did not observe significant differences in the top sources shared in Republican- vs. Democratic-majority states.

Our results provide evidence for the problem of geographical regions with lower levels of COVID-19 vaccine uptake, which may be driven by online misinformation.
Considering variability across regions with low and high levels of misinformation, the best estimates from our data predict a ~20% decrease in vaccine uptake between states, and a ~67% increase in hesitancy rates across Democratic counties, across the full range of misinformation prevalence. At these levels of vaccine uptake, the data predict that SARS-CoV-2 will remain endemic in many U.S. regions. This suggests a need to counter misinformation in order to promote vaccine uptake.

An important question is whether online misinformation drives vaccine hesitancy. Our analyses alone do not demonstrate a causal relationship between misinformation and vaccine refusal. Our work is at an ecological scale, and vaccine-hesitant individuals are potentially more likely to post vaccine misinformation. However, at the individual level, a recent study [20] found that exposure to online misinformation can increase vaccine hesitancy. Our work serves to provide evidence that those findings, which were obtained under controlled circumstances, scale to an ecological setting. Because vaccine hesitancy and misinformation are socially reinforced, both ecological and individual relationships are important in demonstrating a causal link [36]. However, we are still unable to rule out confounding factors, so uncertainty remains about a causal link and further investigation is warranted.

Public opinion is very sensitive to the information ecosystem, and sensational posts tend to spread widely and quickly [25]. Our results indicate that there is a geographical component to this spread, with opinions on vaccines spreading at a local scale. While social media users are not representative of the general public, existing evidence suggests that vaccine hesitancy flows across social networks [37], providing a mechanism for the lateral spread of misinformation offline among those connected directly or indirectly to misinformation spreading online.
More broadly, our results provide additional insight into the effects of information diffusion on human behavior and the spread of infectious diseases [38]. A limitation of our findings is that we are not measuring exposure, by geographical region, to misinformation on Twitter, but rather the sharing activity of a subset of users. Moreover, our analyses are based on data averaged over geographical regions. To account for group-level effects we present a number of sensitivity analyses, and note that our findings are consistent over two geographical scales. Our source-based approach to detecting misinformation at scale might not capture the totality of misleading and harmful content related to vaccines, and many low-credibility sources publish a mixture of false and true information [39], [40]. Our results are also limited to a short period of time. Finally, other factors might also influence vaccine hesitancy levels, including accessibility of vaccines, changes in COVID-19 infection and death rates, as well as legitimate reports about vaccine safety [41].

Associations between online misinformation and detrimental offline effects, like the results presented here, call for better moderation of our information ecosystem. COVID-19 misinformation is shared overtly by known entities on major social media platforms [42].

The authors declare no competing interests.

All measurements of vaccine uptake and vaccine hesitancy rates, as well as socioeconomic, political, and demographic variables at the state and county level, are publicly available in the online repository associated with this paper [43]. We also provide aggregated measurements of the online misinformation shared by geolocated Twitter accounts.

To define as complete a set as possible of English-language keywords related to vaccines, we employed a snowball sampling methodology in December 2020 (see reference for full details on the data collection pipeline).
The final list contains almost 80 keywords and is accessible in the online repository associated with the reference. As a robustness test, we further performed sensitivity analyses using a restricted set of keywords ("vaccine", "vaccinate", "vaccination", "vax"), which covers almost 95% of the total number of geolocated tweets. Results are equivalent to those presented in the main text and are described in the section "Sensitivity Analyses".

To match Twitter posts with US states and counties, we first identified a collection of Twitter accounts that disclosed a location in their Twitter profile. We then employed the carmen Python library to match each location to US states and counties. We were able to match around 1.67 M users to 50 US states, and a subset of 1.15 M users to over 1,300 US counties; the larger set accounts for a total of almost 11 M shared tweets.

To analyze the spread of low-credibility information, we identified all URLs shared in Twitter posts that originated from a list of low-credibility sources, following a large corpus of literature [5]-[9]. We employ the Iffy+ Misinfo/Disinfo list of low-credibility sources, which is based on information provided by the Media Bias/Fact Check website (MBFC, https://mediabiasfactcheck.com), an independent organization that reviews and rates the reliability of news sources. As defined in the related methodology, political leaning is not a factor for inclusion. The list includes sites labeled by MBFC as having a "Very Low" or "Low" factual-reporting level, as well as those classified as "Questionable" or "Conspiracy-Pseudoscience". The list also includes fake-news websites flagged by BuzzFeed, FactCheck.org, PolitiFact, and Wikipedia, for a total of 674 low-credibility sources.
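The source-matching step can be sketched as a simple domain lookup. The list below is a one-entry stand-in containing only Children's Health Defense, which the paper identifies as the most shared low-credibility source; the actual Iffy+ list has 674 domains:

```python
# Minimal sketch of URL-to-source matching against a low-credibility list.
# Real matching may also need to handle subdomains and URL shorteners.
from urllib.parse import urlparse

# Stand-in for the 674-domain Iffy+ list (single illustrative entry).
LOW_CREDIBILITY = {"childrenshealthdefense.org"}

def is_low_credibility(url: str) -> bool:
    """True if the URL's host (ignoring a leading 'www.') is on the list."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host in LOW_CREDIBILITY

print(is_low_credibility("https://www.childrenshealthdefense.org/news/a-story"))  # True
print(is_low_credibility("https://www.cdc.gov/vaccines"))                         # False
```

A tweet is then flagged as misinformation if any of its shared URLs matches the list, which feeds directly into the per-account proportions described in the Methods.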
Based on this list, we measure the prevalence of low-credibility information about vaccines in each region by (1) calculating the proportion of vaccine-related tweets containing URLs pointing to a low-credibility news website, for each geolocated account; and (2) taking the average of this proportion across all accounts within a specific region. We refer to this average as the state-wide (county-wide) prevalence of misinformation. At the county level, we omit observations without vaccine-hesitancy data (see next section), and we used different thresholds for the minimum number of geolocated accounts: 10, 50, and 100. In the main paper, we present results using 100 as the threshold. We provide sensitivity analyses using versions including counties with at least 10 and 50 Twitter accounts (see the "Sensitivity Analyses" section). The larger threshold is likely to contain less error but also omits more counties.

We use data provided by the MIT Election Lab to extract state-level returns for the 2020 US presidential election. For counties, we use data provided by Fox News, Politico, and the New York Times, publicly available at https://github.com/tonmcg/US_County_Level_Election_Results_08-20.

To compute vaccine hesitancy rates in each state (county), we leverage daily COVID-19 Symptom Survey data provided by the Delphi Group at Carnegie Mellon University (see the main text for details).

We extracted the number of COVID-19 cases and fatalities at the state and county level based on reports made by USAFacts (https://usafacts.org). In particular, we summed the number of daily confirmed COVID-19 cases and fatalities in the period from January 4 to March 25, 2021, referring to these as "recent". We then computed the cumulative number of cases and fatalities on March 25th, referring to these as "total".
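The "recent" and "total" aggregations described above can be illustrated on a toy table of daily counts; the column names are assumptions, not USAFacts' actual schema:

```python
# "Recent" = sum of daily counts inside the study window;
# "total" = cumulative count up to the window's last day.
import pandas as pd

daily = pd.DataFrame({
    "county": ["A"] * 5,
    "date": pd.to_datetime(["2021-01-02", "2021-01-04", "2021-02-01",
                            "2021-03-25", "2021-03-26"]),
    "new_cases": [3, 5, 2, 4, 7],
})

window = (daily["date"] >= "2021-01-04") & (daily["date"] <= "2021-03-25")
recent = daily[window].groupby("county")["new_cases"].sum()            # Jan 4 - Mar 25
total = daily[daily["date"] <= "2021-03-25"].groupby("county")["new_cases"].sum()
# recent["A"] == 11 (5 + 2 + 4); total["A"] == 14 (also includes Jan 2)
```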
To include socioeconomic covariates in our regression model, we use data from the Atlas of Rural and Small-Town America (available at https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/), which includes data at the state and county level from the American Community Survey (ACS), the Bureau of Labor Statistics, and other sources. We employ data last updated on July 2, 2020, which include county population estimates and annual unemployment/employment data for 2019. County-level religious adherence data come from the Association of Religion Data Archives (accessible at https://www.thearda.com/Archive/ChCounty.asp).

Figures S1 and S2 present additional results about correlations between vaccine demand, vaccine hesitancy, political partisanship, and online misinformation at the state and county levels. Table S1 presents results from the weighted (Models 1 and 2) and ordinary (Models 3 and 4) least-squares regressions of state-level vaccine hesitancy and vaccination rate, respectively, on covariates.

We conduct a set of sensitivity analyses to ensure that our findings are robust to alternative variable and model specifications. First, we run standard diagnostics for nonlinearity, skewness, multicollinearity, and heteroskedasticity, correcting any problems we discover. Second, because the misinformation measure at the state level is slightly positively skewed, we conduct a model using a natural logarithmic transformation of mean percent misinformation. Results from these models are consistent with the main findings (Table S2); the untransformed variable has a better model fit (lower BIC). Third, because the effect of misinformation may depend on political partisanship, we test for an interaction between misinformation and the percent of GOP voters. There is no evidence of such an interaction at the state level.
Fourth, we rerun the above models using versions of the mean percentage of vaccine-related misinformation shared by Twitter users computed with a restricted set of keywords (see the previous "Twitter Data" section). As shown in Table S3, findings are consistent and robust to this alternate definition of misinformation sharing.

We also conduct a similar set of sensitivity analyses at the county level. First, we test multiple versions of the misinformation variable, which is highly skewed and zero-inflated at the county level. We use the log-transformed version for the main findings due to the best model fit, but obtain significant results with the untransformed variable, and very similar findings with a polynomial model that also captures the nonlinear relationship between misinformation and vaccine hesitancy. Second, we test for an interaction between misinformation and percent of GOP voters, finding that being in a majority-Republican versus majority-Democratic state moderates the association between misinformation and vaccine hesitancy (Table S4). A scatterplot of Republican- and Democratic-leaning counties confirms the moderation finding (Fig. 2 in the main manuscript). Third, we run models adding the number of tweets per county as a control variable to address variation in the volume of Twitter activity across counties. Adding this covariate did not affect results. Fourth, as at the state level, we generate versions of the vaccine misinformation variable using a restricted set of keywords. Again, these results are consistent with our main findings (Table S5). Fifth, we examine the robustness of the threshold of 100 Twitter accounts per county for inclusion in the analysis, setting thresholds of 50 and 10. These results are similar to the main findings (Tables S6 and S7), demonstrating that results are robust to different variable specifications.
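The interaction test above can be sketched with statsmodels' formula API on toy county-level data; the variable names, the simulated coefficients, and the use of `smf.wls` in place of Stata are illustrative assumptions:

```python
# Weighted regression with a misinformation x partisanship interaction.
# `misinfo * gop_share` expands to both main effects plus the interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "misinfo": rng.uniform(0, 2, n),          # mean % misinformation (toy)
    "gop_share": rng.uniform(2, 8, n),        # GOP vote in 10-percent units
    "weight": rng.integers(100, 2000, n),     # survey sample size per county
})
# Toy outcome in which misinformation matters more in low-GOP counties,
# mirroring the moderation pattern reported in the text.
df["hesitancy"] = (10 + 2 * df["gop_share"] + 5 * df["misinfo"]
                   - 0.6 * df["misinfo"] * df["gop_share"]
                   + rng.normal(0, 1, n))

model = smf.wls("hesitancy ~ misinfo * gop_share", data=df,
                weights=df["weight"]).fit()
print(model.params["misinfo:gop_share"])  # negative interaction recovered
```

A significantly negative interaction term here is the formula-API analogue of the Table S4 moderation test.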
To confirm the relationship between misinformation and GOP vote share, we compute a negative binomial regression model predicting mean percent misinformation (untransformed) at the county level using percent GOP vote and a set of control variables. This multivariate analysis confirms the bivariate correlation, indicating a strong relationship between these factors net of potential confounding variables (Table S8). Table S9 describes all the covariates considered in the regression analyses. Tables S10 and S11 provide results of the OLS regressions for the Granger causality analysis at the county and state levels, respectively.

Notes: Vaccine hesitancy is based on state-level means from Facebook survey data. The vaccination rate is vaccines administered per million (CDC data). For models predicting vaccine hesitancy (i.e., state means), analytic weights based on sample size are applied. Unstandardized betas and standard errors are provided. * p < 0.05, ** p < 0.01, *** p < 0.001

Table S2. Weighted/ordinary least squares regression of state-level percent vaccine hesitancy and daily vaccination rate per million on misinformation (logged) and covariates (N=50 states).

Notes: Vaccine hesitancy is based on county-level means from Facebook survey data. Misinformation is measured using the mean percent of low-credibility tweets for counties with at least 100 Twitter accounts. Analytic weights based on Facebook survey sample size are applied, and models use cluster-robust standard errors to account for counties being nested in states. Unstandardized betas and standard errors are provided. * p < 0.05, ** p < 0.01, *** p < 0.001

Notes: Vaccine hesitancy is based on county-level means from Facebook survey data. Misinformation is measured using the mean percent of low-credibility tweets for counties with at least 100 Twitter accounts. Analytic weights based on Facebook survey sample size are applied, and models use cluster-robust standard errors to account for counties being nested in states.
Unstandardized betas and standard errors are provided. * p < 0.05, ** p < 0.01, *** p < 0.001

Notes: Vaccine hesitancy is based on county-level means from Facebook survey data. Misinformation is measured using the mean percent of low-credibility tweets for counties with at least 10 Twitter accounts. Analytic weights based on Facebook survey sample size are applied, and models use cluster-robust standard errors to account for counties being nested in states. Unstandardized betas and standard errors are provided. * p < 0.05, ** p < 0.01, *** p < 0.001

Notes: Misinformation is measured using the mean percent of low-credibility tweets for counties with at least 100 Twitter accounts. Models use cluster-robust standard errors to account for counties being nested in states. Negative binomial regression is employed due to the zero-inflated Poisson distribution. Unstandardized betas and standard errors are provided. * p < 0.05, ** p < 0.01, *** p < 0.001

References:
- WHO Coronavirus (COVID-19) Dashboard
- Simply put: Vaccination saves lives
- Looking beyond COVID-19 vaccine phase 3 trials
- KFF COVID-19 Vaccine Monitor Dashboard
- Growing Share of Americans Say They Plan To Get a COVID-19 Vaccine - or Already Have
- Herd immunity thresholds for SARS-CoV-2 estimated from unfolding epidemics
- Individual variation in susceptibility or exposure to SARS-CoV-2 lowers the herd immunity threshold
- Data-driven estimate of SARS-CoV-2 herd immunity threshold in populations with individual contact pattern variations
- COVID Data Tracker
- The effect of opinion clustering on disease outbreaks
- Vaccine hesitancy: Definition, scope and determinants
- Correlates and disparities of intention to vaccinate against COVID-19
- COVID-19 Vaccination Hesitancy in the United States: A Rapid National Assessment
- Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate
- Vaccine misinformation and social media
- Social media and vaccine hesitancy
- Data for CoVaxxy: A collection of English-language Twitter posts about COVID-19 vaccines
- Correcting COVID-19 vaccine misinformation: Lancet Commission on COVID-19 Vaccines and Therapeutics Task Force Members
- Assessing the risks of 'infodemics' in response to COVID-19 epidemics
- Measuring the impact of COVID-19 vaccine misinformation on vaccination intent in the UK and USA
- Food & Drug Administration (FDA)
- States Begin Opening COVID-19 Vaccines to All Adults
- The COVID States Project #43: COVID-19 vaccine rates and attitudes among Americans
- Delphi Epidata API
- The science of fake news
- Fake news on Twitter during the 2016 U.S. presidential election
- The spread of low-credibility content by social bots
- Influence of fake news in Twitter during the 2016 US presidential election
- Investigating Causal Relations by Econometric Models and Cross-spectral Methods
- Right and left, partisanship predicts (asymmetric) vulnerability to misinformation
- Fast flow-based algorithm for creating density-equalizing map projections
- How a Kennedy built an anti-vaccine juggernaut amid COVID-19 | AP News
- The Anti-Vaxx Playbook | Center for Countering Digital Hate
- The individualistic fallacy, ecological studies and instrumental variables: a causal interpretation
- The Impact of Social Networks on Parents' Vaccination Decisions
- Modelling the influence of human behaviour on the spread of infectious diseases: a review
- Twitter and Facebook posts about COVID-19 are less likely to spread misinformation compared to other health topics
- Anatomy of an online misinformation network
- Volatility of vaccine confidence
- The COVID-19 Infodemic: Twitter versus Facebook
- Reproducibility code for 'Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal'
- CoVaxxy: A collection of English-language Twitter posts about COVID-19, Proc. Intl. AAAI Conf. on Web and Social Media (ICWSM)
- XSEDE: Accelerating Scientific Discovery

Supplementary references:
- Data for CoVaxxy: A collection of English-language Twitter posts about COVID-19 vaccines
- The science of fake news
- The spread of low-credibility content by social bots
- Fighting misinformation on social media using crowdsourced judgments of news source quality
- Influence of fake news in Twitter during the 2016 US presidential election
- Iffy+ Mis/Disinfo Sites