key: cord-0957421-pbxevpv9
authors: Carrieri, V.; Lagravinese, R.; Resce, G.
title: Predicting vaccine hesitancy from area-level indicators: A machine learning approach
date: 2021-03-09
journal: nan
DOI: 10.1101/2021.03.08.21253109
sha: 8d89d1cd9b494514d5f2343a43e61e25cd6104db
doc_id: 957421
cord_uid: pbxevpv9

Vaccine hesitancy (VH) might represent a serious threat to the next COVID-19 mass immunization campaign. We use machine-learning algorithms to predict communities at a high risk of VH relying on area-level indicators easily available to policymakers. We illustrate our approach on data from child immunization campaigns for seven non-mandatory vaccines carried out in 6408 Italian municipalities in 2016. A battery of machine learning models is compared in terms of area under the Receiver Operating Characteristics (ROC) curve. We find that the Random Forest algorithm best predicts areas with a high risk of VH improving the unpredictable baseline level by 24% in terms of accuracy. Among the area-level indicators, the proportion of waste recycling and the employment rate are found to be the most powerful predictors of high VH. This can support policy makers to target area-level provaccine awareness campaigns.

Vaccine hesitancy (VH) may constitute a serious threat to the mass COVID-19 immunization campaign. Opinion polls in France and US on vaccination intentions suggest that COVID-19 vaccine hesitancy is increasing worldwide (Schwarzinger et al. 2021; Reiter et al. 2020) . A recent survey indicates potential acceptance rates for COVID-19 vaccines ranging from 55% to 90% (Lazarus et al. 2020 ). Beyond COVID-19, VH represents a threat also to the eradication of re-emergent vaccine-preventable diseases like measles and rubella (Horne et al. 2016 ; WHO, 2019; ECDPC, 2019).

Evidence has accumulated that several individual-level factors are strongly associated with VH, such as low income and poor education, certain political or religious orientations, low trust in conventional medicine and institutions, and risk perception (Dincer and Yakub et al., 2014) . However, in practice, health authorities typically do not have such detailed individual-level information available to them in sufficient time to be useful for informing immunization campaigns and/or can be very costly to design ad-hoc surveys for this purpose. Moreover, immunization campaigns are usually organized at the local level (typically municipality level). Thus, to target appropriate interventions, would be extremely useful for policy makers to know on the basis of area-level indicators easily available to them (institutional features, demographic and geographic factors, and socio-economic indicators), the communities more at *Corresponding Author. Department of Law, Economics and Sociology. "Magna Graecia" University of Catanzaro. Viale Europa, 88100 Catanzaro (Italy). e-mail: vincenzo.carrieri@unicz.it . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted March 9, 2021. ; https://doi.org/10.1101/2021.03.08.21253109 doi: medRxiv preprint risk of HV and identify the set of area-level correlates of VH. This is particularly important also to enforce local herd immunity and limit the spreading of a disease to other areas of a country.

In this paper we propose a practical approach to identify "hotspots" for VH relying on lagged municipalitylevel indicators easily available to policy-makers and machine-learning algorithms. The application of machine learning techniques to health issues has increased in the recent times. Applications range from the relationship between height and socio-economic status (Daoud et al. 2019 ) to socio-economic determinants of health (Seligman et al. 2018 ) and key correlates of child marriage (Raj et al. 2020) . A few papers also employed machine learning algorithms to predict VH using patient-level data. For instance, Bell et al (2019) use machine learning approach to identify children at risk of not being vaccinated against Measles, Mumps and Rubella in WHO countries. Oreskovic et al. (2020) adopt Proactive machine-learningbased approaches to VH for a potential SARS-Cov-2 using LASSO logistic regression on a low number of attributes of the child and his or her family and community.

The machine-learning approach offers several policy-relevant advantages in the analysis of VH using arealevel indicators. First, it allows us to overcome the problem of incomplete data on immunization coverage which represents a key issue particularly in low-and middle-income countries (Harrison et al. 2020 ). Second, it allows us to predict areas at high VH risk relying on coverage data from previous immunization campaigns. This might help to get valuable insights also on the local acceptance rates for the next COVID-19 immunization campaign. Last but not least, the identification of the most powerful predictors of VH allows us also to derive a parsimonious set of indicators that can be employed in cases in which only a small set of area-level indicators is available. Again, this is of particular relevance in areas in which data collection is challenging.

We illustrate our approach using real-life data from the child immunization campaigns for seven vaccinepreventable carried out in 6408 municipalities in 2016 in Italy. We focus on child immunization relative to the babies turning 24 months. We only consider the seven vaccines that until 2016 were recommended but not mandatory 1 . Italy represents an interesting setting to study VH because of the high prevalence of novax feelings. The spreading of fake news on the web (Carrieri et al. 2019 ) and the high support to political parties repeatedly campaigned on an anti-vaccination platform (i.e., The Five-Star Movement -M5S) correlated with a recent spread of vaccine-preventable diseases. For instance, among European countries, Italy (4,978 cases) had the highest incidence of measles cases between 1 February 2017 and 31 January 2018 (ECDC, 2019).

We find that the Random Forest algorithm is the best model to make out-of-sample predictions of areas with high VH, with a high true positive rate and low false-positive rate. The final model predicts with high accuracy (0.761), sensitivity (0.623), and specificity (0.850). Compared with the unpredictable baseline level, the accuracy with Random Forest improves by 24%. Remarkably, our best model shows performances in line with recent studies that make use of patient level indicators with a higher granularity (see Bell et al. 2019 ). Among the area-level indicators, the share of waste recycling and the employment rate are found to be the most powerful predictors. This offers strong policy implications for the next COVID-19 immunization campaign.

The rest of paper is structured as follows. Next section presents data and methodology. Section 3 shows results. Last section summarizes and concludes. Table 1 shows a set of municipality-level indicators collected by the Italian National Institute of Statistics (ISTAT) and the Italian Ministry of the Interior. These data are matched with data collected by the Italian Ministry of Health about child immunization campaigns done in 6408 Italian municipalities in 2016 for seven vaccine-preventable diseases (pertussis, measles, haemophilus influenzae type B, meningococcus, pneumococcus, mumps, and rubella).

at year consists of the following group of characteristics: Demographics and geographical ( ), Institutional ( ), and Socio-economic ( ). Every municipality at time has an associated target binary variable (vaccine hesitancy) that takes values "1" (positive sample) if the average child vaccine coverage is under 92% 2 , and value "0" (negative sample) otherwise. The prediction task is formulated as follows: based on the set of features { −1 , −1 , −1 } for municipality , find function (. ) (Machine Learning model) that predicts vaccine hesitancy :

We randomly divide the database as 70 percent for training and 30 percent for out-of-sample testing set (test). The hyper-parameter optimization is only done on the training set. The performance of vaccine hesitancy classification prediction is assessed by the Receiver Operating Characteristics (ROC) curve (Fawcett, 2006) . In our binary classification problem, the positive class is defined as the municipality with high VH risk and the negative class is the municipality with low VH risk. The ROC curve shows the classifier diagnostic ability by plotting the true positive rate (TPR) on the -axis against the false positive rate (FPR) on the -axis since its discrimination threshold is varied (Antulov-Fantulin et al. 2021). The true positive rate is the ratio of municipalities with high VH risk that were 2 In Italy the vaccination coverage has been included in the set of Essential Levels of Health Service ("Livelli Essenziali di Assistenza", LEA). This set of services is defined at national level and provided by the public sector in each region. The recommended acceptable target of vaccination coverage for child immunization has been set to 95%. Below the 92% threshold, the situation is defined as "worrying", and the regions must take action to increase vaccination coverage to avoid sanctions by the National Government.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 9, 2021. correctly categorized as high VH risk (true positive) and the total number of positive samples (high VH risk). The false positive rate is the ratio between the number of municipalities with low VH risk wrongly categorized as high VH risk (false positives) and the total number of actual negative samples (low VH risk). When the classification task is completely unpredictable, the negative class theoretical distribution over feature space coincides with the positive class theoretical distribution over feature space. This implies that the ROC curve would be the diagonal line with an Area Under the Curve (AUC) of 0.5. A perfect classifier has AUC equal to the 1.0, the higher the AUC, the more predictive is the model.

Machine learning models also give information on how useful each feature is in explaining VH. Each model has a different algorithm to estimate importance. In LASSO, feature importance is estimated as the absolute value of the coefficients corresponding to the tuned model (Kuhn, 2020). For RF, feature importance is the mean gain produced by the feature over all the trees where the gain is measured by the Gini index (Liaw, Wiener, 2002) . The feature importance in GBM is the average improvement of the splitting on the features across all the trees generated by the boosting algorithm (Friedman, 2001; Ridgeway, 2007) . NN uses the Garson (1991) algorithm to evaluate relative variable importance. In this case the importance of each variable is determined by identifying all weighted connections between the layers in the network (Venables Ripley, 2002 ). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 9, 2021. ; https://doi.org/10.1101/2021.03.08.21253109 doi: medRxiv preprint

The area under the ROC curve (AUC) in Figure 1 shows that the Random Forest model outperforms all the other models (AUC: RF = 0.836; GBM = 0.775; LASSO = 0.709; NN = 0.677). The best model has accuracy =0.761 (95 per cent CI:0.742, 0.782), sensitivity = 0.623, and specificity =0. 850 (see Table 2 ). Table 3 show that the difference between the ROC curve of the Random Forest model and all the other models is statistically significant (p-value < 0.0005). Figure 2 reports the first ten most relevant features for the Random Forest model. The share of recycling waste and employment rate are the top two area-level features predicting high VH risk. Among the top ten features associated with high VH risk there are also the altitude, poverty rate, the share of families with children, the share of temporary jobs, the divorce rate, the average education level (, i.e. the share of population with a degree or more), population density, and total population.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 9, 2021. ;

Random Forest trained on 70% of observations and tested on the remaining 30%; Resampling: Cross-Validated (10fold, repeated 5 times. Hyper-parameters Best Tune for Random Forest is mtry (Number of variables available for splitting at each tree node) = 15.

The complete set of feature importance for the four models is shown in Table 4 . A high and significant correlation (0.874; 95% CI = 0.743, 0.940) can be noted between the feature importance obtained by RF and GBM (the second-best performing model), while there is no significant correlation between the feature importance obtained between RF and the two worst performing models (LASSO and NN).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 9, 2021. ; In Figure 3 we finally show the Italian municipalities classified on the basis of VH risk. On the left panel, we report classification on the basis of real vaccination coverage data, while on the right panel we report the classification based on the predictions of our algorithm using lagged values of the area-level indicators (i.e. in 2015). A comparison of the two maps reveals the high potential of ML algorithm to predict VH risk. Figure 3 also shows a large territorial heterogeneity, with a higher VH risk found in rural areas far from metropolitan cities and in almost all the Southern regions and in particular in Puglia, Calabria and Sicilia.

Random Forest trained on 70% of observations and tested on the remaining 30%; Resampling: Cross-Validated (10fold, repeated 5 times. Hyper-parameters Best Tune for Random Forest is mtry (Number of variables available for splitting at each tree node) = 15.

This paper proposes a machine-learning approach for the prediction of vaccination hesitancy (VH) risk in an immunization campaign on the basis of area-level administrative data. We illustrate our approach using real-life data from child immunization campaigns done in Italy in 2016 concerning seven vaccinepreventable diseases. We find that the Random Forest algorithm is the best model to make out-of-sample predictions with a high true positive rate and a low false-positive rate. Compared with the unpredictable baseline level, the accuracy with Random Forest improves by 24%. A higher VH risk is found in rural areas far from metropolitan cities and in almost all the Southern regions and in particular in Puglia, Calabria and Sicilia. Moreover, we also find that the level of waste recycling and employment rate are the most important area-level predictors of a high VH risk in a community. The high association between VH risk and the level of waste recycling supports the relevance of pro-social behaviours in immunization choices (Crociata et al, 2015; Ramkissoon, 2020) .

Although our analysis concerned child immunization, it is possible, albeit with the due limitations, to generalize our evidence to better face the mass immunization campaign for COVID-19. In fact, our approach using only lagged administrative data and socio-economic indicators at the municipal level is easy to be implemented in a short time and can be adapted to any vaccination campaign. Never as in COVID-19 vaccination campaigns, time is a crucial factor. Our approach thus offers a practical benchmark that can be used also in other countries to identify areas where the next mass immunization campaign for COVID-19 might experience low acceptance rates. This might allow policymakers to plan area-level pro-vaccine awareness campaigns.

To the best of our knowledge, the use of administrative data for learning and research purposes in vaccine campaigns is a relatively unexplored field. The main strength of administrative data is their immediate availability, at zero additional cost for analysts. Our approach also suggests that this does not come at costs in terms of accuracy, being the performances of our model in line with models using patient-level indicators to predict VH (see Bell et al. 2019 , Oreskovic et al. 2020 ). The administrative data already exist and does not require an ad-hoc collection. The potential of such data is even greater in fragile settings where collecting (individual) data per se often presents several difficulties and there is pressure to act quickly. To prevent these events, it is necessary to act quickly and targeted campaigns using local indicators can certainly favour vaccination and prevent any hesitations.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted March 9, 2021. ; https://doi.org/10.1101/2021.03.08.21253109 doi: medRxiv preprint

Predicting bankruptcy of local government: A machine learning approach

Proactive advising: a machine learning driven approach to vaccine hesitancy

Random forests. Machine learning

The influence of local political trends on childhood vaccine completion in North Carolina

Vaccine hesitancy and (fake) news: Quasi-experimental evidence from Italy

Associations of COVID-19 risk perception with vaccine hesitancy over time for Italian residents

Recycling waste: Does culture matter

Information, education, and health behaviors: Evidence from the MMR vaccine autism controversy

Predicting women's height from their socioeconomic status: A machine learning approach

Comparing the areas under two or more correlated receiver operating characteristic curves: a non-parametric approach

Shelter in place? Depends on the place: Corruption and social distancing in American states

Monthly measles and rubella monitoring report

An introduction to ROC analysis

Greedy function approximation: a gradient boosting machine

Factors limiting data quality in the expanded programme on immunization in low and middle-income countries: A scoping review

Countering antivaccination attitudes

Vaccine hesitancy is strongly associated with distrust of conventional medicine, and only weakly associated with trust in alternative medicine

& El-Mohandes, A. (2020). A global survey of potential acceptance of a COVID-19 vaccine

Knowing less but presuming more: Dunning-Kruger effects and the endorsement of anti-vaccine policy attitudes

Proactive machine-learning-based approaches to vaccine hesitancy for a potential SARS-Cov-2 vaccine

COVID-19 Place confinement, pro-social, pro-environmental behaviors, and residents' wellbeing: A new conceptual framework

Application of machine learning to understand child marriage in India. SSM-population health

Acceptability of a COVID-19 vaccine among adults in the United States: How many people would get vaccinated?

proc: an open-source package for r and s+ to analyze and compare roc curves

Skepticism to vaccines: enough already?

COVID-19 vaccine hesitancy in a representative working-age population in France: a survey experiment based on vaccine characteristics

Machine learning approaches to the social determinants of health in the health and retirement study

Regression shrinkage and selection via the lasso

Modern Applied Statistics with S

Institutional trust and misinformation in the response to the 2018-19 Ebola outbreak in North

The French public's attitudes to a future COVID-19 vaccine: the politicization of a public health issue

Attitudes to vaccination: A critical review