key: cord-0631256-w8mnu9ch authors: Marathe, Aboli; Sakhrani, Harsh; Parekh, Saloni title: Investigating the Relationship Between World Development Indicators and the Occurrence of Disease Outbreaks in the 21st Century: A Case Study date: 2021-09-20 journal: nan DOI: nan sha: 3fdfb95edca792466e34728149a4893bdb55b957 doc_id: 631256 cord_uid: w8mnu9ch

The timely identification of socio-economic sectors vulnerable to a disease outbreak presents an important challenge to the civic authorities and healthcare workers interested in outbreak mitigation measures. This problem was traditionally solved by studying the aberrances in small-scale healthcare data. In this paper, we leverage data driven models to determine the relationship between the trends of World Development Indicators and occurrence of disease outbreaks using worldwide historical data from 2000-2019, and treat it as a classic supervised classification problem. CART based feature selection was employed in an unorthodox fashion to determine the covariates getting affected by the disease outbreak, thus giving the most vulnerable sectors. The result involves a comprehensive analysis of different classification algorithms and is indicative of the relationship between the disease outbreak occurrence and the magnitudes of various development indicators.

Coronavirus has become an unprecedented health crisis and has spread to over 150 countries, severely impacting the world economy and causing social disruption. In a recent study from 2020, it was observed that the COVID-19 outbreak had a significant impact on the Italian economy, eventually tipping it into recession. The impact of this recession fell on the financially weak population, elderly and the working population [Sanfelici, 2020]. Learning from our experiences, we wish to move forward and create robust emergency preparedness measures.
Providing the authorities with the most vulnerable sectors, those which will succumb to disease outbreaks first, will be an invaluable resource for planning and policy-making. But finding these vulnerable sectors presents a challenge, as it requires big data analysis of imperfect data over multiple years of history for a particular region. Furthermore, the vulnerable sectors cannot be directly quantified, so their vulnerability needs to be estimated through indirect measures. We studied such cases of previous disease outbreaks, and propose a method of accurately identifying these vulnerable sectors, including critical sectors like economy, healthcare and safety.

We came across the World Development Indicators (WDI), established by the World Bank, which are a set of indicators collected over time for every country through their individual governments. The indicators cover most sectors of development, including trade and safety markers, and we thought of using these indicators to estimate the vulnerability of different sectors. Some world development indicators get more affected than others, and their identification is a challenging problem for disease outbreak preparedness and planning.

Over the years, researchers have analysed disease outbreaks to determine the risk factors [Anno et al., 2019] and aid disease outbreak surveillance [Allard, 1998]. But the relationship between trends in socio-economic indicators and the occurrence of previous disease outbreaks still remains a mystery. While researchers tend to analyse socio-economic systems in the context of disease outbreaks, we tried understanding the relationship between socio-economic systems and disease outbreaks. Whether metadata could be used to analyze the anomalies and their cause or impacts on a country-per-country basis was our primary research question. We tried to answer this question using a different methodology.

* Equal Contributors
In this paper, we propose an approach that uses data-driven models for disease outbreak identification rather than disease outbreak forecasting. The treatment of this identification as a classification problem is a novel approach that we would like to introduce to this field. We combine the World Development Indicators data [Bank, 2010; Bank, ] provided by the World Bank and the disease outbreaks data by the World Health Organization [Organization, ] to create a dataset. As there was a small degree of uncertainty in the dataset due to missing values, we also make use of statistical data imputation and predictive modelling for data treatment. Lastly, we apply benchmarked classification techniques for disease outbreak identification and CART based feature importance to find the crucial indicators. After finding these indicators, we compare the results with former studies and surveys to validate the performance of this methodology. The verified indicators can be passed on to the authorities for emergency preparedness and planning assistance.

arXiv:2109.09314v2 [cs.LG] 27 Oct 2021

2 Background Work

Due to the recent rise of disease outbreaks, the research community is laying special emphasis on studying epidemiology. Researchers have found correlation of time-series data trends with the presence of disease outbreaks [Richardson et al., 2016; Li et al., 2012] and have also found causal relationships between socio-cultural systems [Davis et al., 2019]. [Farrington and Beale, 1998] worked extensively to model these outbreaks and predict them. The case-study of Salmonella agona in their paper highlighted both the potential and the shortcomings of automated detection procedures, emphasising both their time optimization and less perceptible results. [Heisterkamp et al., 2006] tried another approach, using a hierarchical time series analysis model to detect outbreaks, and found the proposed model to be a reliable tool for Rubella notifications and Salmonella infections.
[Streftaris and Gibson, 2004] considered continuous-time stochastic compartmental models that can be applied in veterinary epidemiology to model the within-herd dynamics of infectious diseases. [Stroup et al., 1993] introduced a statistical method for detection of specific types of aberrations in public health surveillance. [Rohwerder, 2020] summarised the work of multiple authors in an attempt to identify the secondary impacts of these disease outbreaks in certain countries. They analysed the economic, political, social and secondary impacts of the outbreaks, rather than the traditional healthcare impacts, and found some common features among the countries struck by outbreaks. We were inspired by these results and wondered if one methodology, applied on single or multiple datasets, could reproduce these findings in a sufficiently timely fashion to allow interventions to take place.

While most studies try to identify the impact of disease outbreaks using statistical modelling, few try to analyse this problem in the reverse direction, i.e. by identifying socio-economic indicators that could be associated with the occurrence of a disease outbreak. The directionality of this indicator-outbreak network has been considered a problem too vast for any single study, something that we agree with but look forward to solving. In 2011, however, [Unkel et al., 2012] discussed a wide variety of techniques, from regression to ARIMA models to Markov models, along with their possible limitations and advantages, for the identification of unusual patterns in data which may result from infectious disease outbreaks. This study provided a base for our methodology.

World Development Indicators (WDI) is the primary World Bank collection of 143 development indicators for more than 200 economies and 40 country groups. The part of the database that we considered spans the years 2000-2019 [Bank, ]. The disease outbreak data from WHO was extracted separately for individual countries.
[Organization, ] The years in which a disease outbreak occurred or did not occur were labelled 1 or 0, respectively. The basic preprocessing involved encoding categorical features, scaling, normalization and resampling. Robust Scaler was utilized for scaling, since it scales the data according to the quantile range and is insensitive to outliers. A number of other scaling techniques like Min-Max scaler, Standard scaler and our in-house Logarithmic Deviation scaler were also tried, but gave substandard results.

The severely skewed class distribution observed in our dataset posed a challenge for the classification algorithms. Both undersampling and oversampling have known disadvantages: undersampling can throw away potentially useful data, and oversampling can increase the likelihood of overfitting. Hence, a combination of both undersampling and oversampling was used. SMOTE is an oversampling technique that synthesizes new plausible examples of the minority class by interpolating between several minority class examples that lie close together. Tomek Links refers to an undersampling technique that identifies pairs of cross-class nearest neighbors and removes the majority class member of each pair [Batista et al., 2003].

4 Methodology

We employed a number of statistical and inferential data imputation techniques, ranging from simple statistic substitution to complex deep learning based imputation techniques. The techniques that gave us noteworthy results are explained below.

The K-Nearest Neighbors (KNN) algorithm is used to map a point to its k closest neighbors in a multi-dimensional space. The intuition behind using KNN for data imputation is that a missing input variable value can be approximated by the values of the points that are closest to it, and this 'closeness' can be determined on the basis of the other non-missing variables.
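To make the 'closeness' idea concrete, the following is a minimal pure-Python sketch of nearest-neighbour imputation. It is not the implementation used in this study: the function name `knn_impute`, the toy rows, and the choice to average the neighbours' values are illustrative assumptions, and distances are computed only over the variables observed in both rows.

```python
import math


def knn_impute(rows, k=5):
    """Fill None entries with the mean of that column over the k nearest rows.

    Assumption: Euclidean distance over the columns observed in BOTH rows,
    with no rescaling for the number of shared columns.
    """
    imputed = [list(r) for r in rows]  # work on a copy, leave the input intact
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no usable neighbour exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical toy data: two indicator columns for four country-years.
toy_rows = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
filled_rows = knn_impute(toy_rows, k=2)
print(filled_rows[3])
```

In this toy example the missing value is filled from the two rows whose first indicator is most similar, while the distant third row is ignored.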
In our dataset, the missing World Development Indicator values are imputed using this 'closeness', which is usually seen in groups of countries having similar indicator values, or in countries that have had similar development curves in different time frames. After experimenting with a number of parameters, the best results were obtained when a combination of 5 neighbors and Euclidean distance (1) was used.

The intuition behind the MSREG algorithm is to leverage the correlation between the input variables by regressing the missing variable on all the other input variables. We employ the Linear Regression Model to estimate the missing values. For example, in our dataset there is a strong positive correlation between the "Number of Community Health Workers" and the "Current health expenditure" columns. The MSREG algorithm is capable of utilizing such correlations in order to impute the missing variable values. To counter the decrease in the inherent variability of the imputed variable, normally distributed noise with a mean of zero and variance equal to the standard error of regression estimates was introduced. The MSREG method assigns values to each missing element x according to (2), where k is

Random imputation fills in the missing data by picking random observed values of a particular variable. This method was applied to all features with missing data, by selecting random observed values, with the probability of each value being chosen for an imputation being 1/n, where n is the number of present values.

Conventionally, Feature Selection has been used to identify the relevant set of features for which there is a significant increase in the performance of the algorithm. But we attempt to utilize it in an unorthodox fashion. The promising classification scores [1] do show that there is a strong correlation between the World Development Indicators and Disease Outbreaks.
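The regression-based and random imputation schemes described above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the code used in this study: the regression uses a single predictor rather than all other input variables, the noise standard deviation is taken to be the residual standard error, and the names `regression_impute` and `random_impute` are hypothetical.

```python
import random
import statistics


def regression_impute(x, y, rng):
    """Impute None entries of y from a fully observed predictor x.

    Assumptions: one predictor, ordinary least squares, and zero-mean
    Gaussian noise whose standard deviation is the residual standard error.
    The observed x values must not all be identical.
    """
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mean_x = statistics.fmean(a for a, _ in pairs)
    mean_y = statistics.fmean(b for _, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    dof = max(len(pairs) - 2, 1)  # n - 2 degrees of freedom for a line fit
    sigma = (sum(r * r for r in residuals) / dof) ** 0.5
    # Noise restores some of the variability a plain prediction would remove.
    return [b if b is not None else intercept + slope * a + rng.gauss(0.0, sigma)
            for a, b in zip(x, y)]


def random_impute(column, rng):
    """Replace each None with a uniformly drawn observed value (prob. 1/n each)."""
    observed = [v for v in column if v is not None]
    return [v if v is not None else rng.choice(observed) for v in column]


# Hypothetical toy columns standing in for two correlated indicators.
rng = random.Random(0)
health_spend = [1.0, 2.0, 3.0, 4.0]
health_workers = [2.0, 4.0, None, 8.0]
print(regression_impute(health_spend, health_workers, rng))
print(random_impute(health_workers, rng))
```

Passing an explicit `random.Random` instance keeps the imputation reproducible, which matters when several imputation methods are compared on the same data.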
But the crucial question would be to discover the set of unapparent Indicators which get affected by Disease Outbreaks and understand them better, for which we use Feature Selection. RandomForestClassifier's implicit feature selection was used to determine the subset of relevant features. For ensembles of randomized trees, the importance of a variable $X_m$ for predicting $Y$ is calculated by adding up the weighted impurity decreases $p(t)\,\Delta i(s_t, t)$ for all nodes $t$ where $X_m$ is used, averaged over all $N_T$ trees in the forest (3):

$$\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T :\, v(s_t) = X_m} p(t)\, \Delta i(s_t, t) \qquad (3)$$

where $p(t)$ is the proportion $N_t/N$ of samples reaching $t$ and $v(s_t)$ is the variable used in split $s_t$. When using the Gini index as an impurity function, this measure is known as the Gini importance or Mean Decrease Gini [Louppe et al., 2013].

The identification of important features from the imputed dataset was accomplished through the use of benchmarked classification techniques to predict the target variable y, which in our case is the disease outbreak occurrence in a particular year. We applied these techniques to our dataset with a 0.2 train-test split, and compared 3 different methods of imputation: KNN, Random Imputation and MSREG. After analysing the F1-score and the accuracy, hyperparameter optimization was performed to boost our results. This was followed by the usage of the CART based feature selection technique to determine the important features.

Classification: A wide range of state-of-the-art classification techniques were employed, including Bayesian, Tree-based, Ensemble and Deep Learning Algorithms [1].

Feature selection: We applied feature selection for filtering out the input variables strongly correlated with disease outbreak occurrence. By plotting the relative importance of these covariates, we can increase the interpretability of this pipeline, and thus deliver the vulnerable indicators as our final result [1].
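As a worked illustration of the importance measure in (3), the sketch below computes the Mean Decrease Gini for a tiny hand-built forest. Everything in it is an assumption made for illustration: the two feature names, the class counts at each node, and the representation of a tree as a list of splits. It also assumes every tree sees the same N samples, which ignores bootstrap resampling.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Delta i(s_t, t): parent impurity minus the weighted child impurities."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))


def mean_decrease_gini(forest, n_samples):
    """Sum p(t) * Delta i(s_t, t) per feature, averaged over the trees.

    forest: list of trees; each tree is a list of splits of the form
    (feature, parent_counts, left_counts, right_counts).
    """
    totals = defaultdict(float)
    for tree in forest:
        for feature, parent, left, right in tree:
            p_t = sum(parent) / n_samples  # share of samples reaching node t
            totals[feature] += p_t * impurity_decrease(parent, left, right)
    return {f: v / len(forest) for f, v in totals.items()}


# Hypothetical forest of two trees over 10 samples (5 per class).
example_forest = [
    [("tourism", [5, 5], [4, 1], [1, 4]),   # root split of tree 1
     ("health", [4, 1], [4, 0], [0, 1])],   # split of its left child
    [("health", [5, 5], [5, 0], [0, 5])],   # root split of tree 2
]
importances = mean_decrease_gini(example_forest, n_samples=10)
print(sorted(importances.items(), key=lambda kv: -kv[1]))
```

In this toy forest the hypothetical "health" feature produces purer children and is used in both trees, so it receives the larger importance; ranking covariates this way is what allows the most affected indicators to be read off.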
The results were very promising on the imperfect data classification, as we achieved a top accuracy of 94.2% and an F1-score of 0.94 on the dataset using the Random Forest algorithm, MSREG imputation and SMOTE sampling after hyperparameter tuning. The ensemble techniques performed better than both the regression and the deep learning models. We were able to extract the most important features that the algorithm predicted [1]. To interpret our results better, we visualised the frequency of disease outbreaks per country against one of the more important predicted features, "International tourism, expenditures (current US$)", and observed that the two variables were indeed correlated [2]. It is interesting to note how our observations match the results put forward by [Rohwerder, 2020], who found the above features to be strongly affected by disease outbreaks in low and middle income countries through a different methodology and dataset.

The dire impact of disease outbreaks is unequivocally faced by the most vulnerable populations, the healthcare workers and the financially disadvantaged, and our insights could help the authorities increase the accessibility of social services. Our proposed method leverages data-driven models and feature selection for the quick identification of the affected indicators, giving the vulnerable sectors. The results on the imputed datasets, while indicative of potential relationships, cannot tell the whole story on their own. Many critical variables (e.g. competing political priorities, cultural narratives etc.) cannot be completely captured in a large scale analysis, and can be found by comparing public opinion and conflict-related casualties. In future work, these insights can contribute to forming prior knowledge for a knowledge-driven model, providing concrete parameters to help assess and validate the theoretical framework.
Our team has also conducted a study parallel to this work that builds on the dataset and analyzes causal relationships between the features [Marathe et al., 2021]. This research will be useful in the emergency preparedness planning for the developing world.

Figure: Standard deviation of normalised number of International tourism, expenditures (current US$) with the frequency of disease outbreaks per country

References

Use of time-series analysis in infectious disease surveillance
Spatiotemporal dengue fever hotspots associated with climatic factors in Taiwan including outbreak predictions based on machine-learning
Rasaki Stephen Dauda. HIV/AIDS and economic growth: Evidence from West Africa. The International journal of health planning and management
Adjusting outbreak detection algorithms for surveillance during epidemic and non-epidemic periods
Missing value estimation algorithms on cluster and representativeness preservation of gene expression microarray data
Ethnic politics, risk, and policy-making: A cross-national statistical analysis of government responses to HIV/AIDS
The Italian response to the COVID-19 crisis: Lessons learned and future direction in social development