key: cord-0465359-blpomw3s authors: Marathe, Aboli; Parekh, Saloni; Sakhrani, Harsh title: Modelling Major Disease Outbreaks in the 21st Century: A Causal Approach date: 2021-09-15 journal: nan DOI: nan sha: 66063945e6124e5483d07702aa0444cdabe32d03 doc_id: 465359 cord_uid: blpomw3s Epidemiologists aiming to model the dynamics of global events face a significant challenge in identifying the factors linked with anomalies such as disease outbreaks. In this paper, we present a novel method for identifying the most important development sectors sensitive to disease outbreaks by using global development indicators as markers. We use statistical methods to assess the causative linkages between these indicators and disease outbreaks, as well as to find the most often ranked indicators. We used data imputation techniques in addition to statistical analysis to convert raw real-world data sets into meaningful data for causal inference. The application of various algorithms for the detection of causal linkages between the indicators is the subject of this research. Despite the fact that disparities in governmental policies between countries account for differences in causal linkages, several indicators emerge as important determinants sensitive to disease outbreaks over the world in the 21st Century.
Researchers have long studied how major events shape global dynamics, with some focusing on socio-economic shifts, others on healthcare concerns, and yet others on cultural and historical factors. Since the COVID-19 pandemic, the relationship between socio-economic factors and disease outbreaks has become a topic of great interest. It remains unclear how these outbreaks affect countries and which factors influence them. The patterns are not only unpredictable but also vary from country to country due to differences in population, culture, geography, and other factors. For disaster management and outbreak preparedness, investigative analyses that lead to interpretable findings can be very beneficial. We were inspired to perform this research after witnessing the terrible impacts of the COVID-19 pandemic, and we wanted to learn more about the origins and effects of disease outbreaks around the world. We approach the problem as a causal inference task and use statistical techniques to prepare the data and derive conclusions from the indicator dataset. We use interpretable network diagrams to depict the features that have strong causal linkages to the incidence of disease outbreaks. The direction of an edge between two nodes may show whether the outbreak was triggered by, or itself impacted, the preparedness in that sector. The analysis was carried out separately for each country after the missing data were imputed; the findings were then aggregated for the entire world, and the most commonly linked nodes (indicators) were retrieved. These nodes indicate universal indicators linked to disease outbreaks, which authorities can analyze in depth in order to take the necessary steps to support the development sectors they represent.
The World Development Indicators have been very popular among researchers trying to quantify or model the dynamics of global systems. Using these indicators, researchers have examined whether growth and development spur improvement in governance [1], links between population and resources [2], changes in the development outcomes associated with the activities initiated by the MDGs [3], and the relationship between financial development and economic growth [4]. Some papers note shortcomings in obtaining extensive local data but were still able to find distinctive causal chains between the features. In healthcare specifically, there have been many attempts to study the effects of disease burden [5], whether differences in microbial diversity can explain patterns of age-adjusted Alzheimer's disease (AD) rates between countries [6], and how spillovers of zoonotic infectious diseases into the human population will be impacted by global environmental stressors [7]. The recent COVID-19 pandemic saw a rise in research in this area, with many papers attempting to relate the effectiveness of policies to the curve of the pandemic. From the dynamic causal modelling of COVID-19 [8] to the effects of non-pharmaceutical interventions [9], causal inference has been gaining preference for providing interpretable insights. Within the narrower field of disease outbreaks, some researchers have suggested measures for sustainable development [10], forecasted economic trends [11] or studied historical trends [12] and presented their views on planning for better preparedness. Although such works are numerous, the causal relationships between socio-economic factors and disease outbreaks have not, to our knowledge, been analysed at a global scale with a dataset like ours; we present the results of such global network analyses in this work.
The dataset used in this study was created from the World Development Indicators data [13] provided by the World Bank and the disease outbreak occurrence data from the World Health Organization (WHO) [14], combined into a novel dataset for determining the relationship between disease outbreak occurrence and socio-economic factors. World Development Indicators (WDI) is an expanding World Bank collection of development indicators, from which we extracted 141 development indicators for 204 countries over the years 2000-2019. Examples of these indicators include ARI treatment (% of children under 5 taken to a health provider) and Unmet need for contraception (% of married women ages 15-49). The disease outbreak data from WHO was extracted separately for each country. Years with an outbreak were labelled 1 and years without one were labelled 0. Basic preprocessing involved encoding categorical features such as country name, scaling the data and performing normalization. As the average percentage of missing values per column was 24%, data imputation was needed to fill the missing values. We employed several statistical data imputation techniques (KNN imputation, MSREG and Random Imputation), of which MSREG provided the most relevant results for the analysis. We assessed the effectiveness of the imputation algorithms by comparing statistics of the dataset before and after imputation, including variances, covariances and correlations. The Stochastic Multiple Regression Imputation (MSREG) [15] method assigns a value to each missing element according to (1), which is defined in terms of the number of manifest variables used in the model, the number of missing values in the variable being imputed, and a function that returns a different element of a standardized normally distributed random column vector each time it is invoked.
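The idea behind stochastic regression imputation can be illustrated with a short sketch. This is not the MSREG formula from [15], whose exact form is not reproduced here; it is a generic variant in which each missing entry is replaced by an ordinary least-squares prediction from the other columns plus Gaussian noise scaled by the residual standard deviation. The function name, the mean-filling of predictors and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def stochastic_regression_impute(X, seed=0):
    """Fill NaNs column by column: regression prediction plus Gaussian noise.

    Assumes every column has at least one observed value. Illustrative only;
    not the exact MSREG procedure of the cited reference.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Mean-fill a working copy so every regression has complete predictors.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_means, X)
    out = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if not missing.any():
            continue
        # Design matrix: intercept plus all columns except column j.
        others = np.delete(filled, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A[~missing], X[~missing, j], rcond=None)
        residuals = X[~missing, j] - A[~missing] @ beta
        sigma = residuals.std()
        # Prediction plus a random draw keeps the imputed column from
        # looking artificially smooth (variance is not deflated).
        noise = sigma * rng.standard_normal(missing.sum())
        out[missing, j] = A[missing] @ beta + noise
    return out

# Synthetic example with roughly 24% of entries removed at random.
rng = np.random.default_rng(42)
complete = rng.normal(size=(50, 4))
complete[:, 3] = complete[:, 0] + 0.1 * rng.normal(size=50)
holes = rng.random(complete.shape) < 0.24
observed = np.where(holes, np.nan, complete)
imputed = stochastic_regression_impute(observed)

# Compare a summary statistic before and after imputation.
print("variance before:", np.nanvar(observed, axis=0))
print("variance after: ", imputed.var(axis=0))
```

Comparing variances (and, similarly, covariances or correlations) before and after imputation mirrors the kind of check described above for choosing between imputation methods.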
Some features had non-Gaussian distributions before and after imputation; applying an exponential transformation to them brought the dataset closer to a normal distribution. The Shapiro-Wilk test [16] (2), along with histogram visualization, was used to test the normality of the data:

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} \qquad (2)$$

where the $x_{(i)}$ are the ordered sample values and the $a_i$ are constants generated from the means, variances and covariances of the order statistics of a sample of size $n$ from a normal distribution.

After performing the normality test, we tested whether the data was stationary, as the dataset is a time series. For this, we used the augmented Dickey-Fuller (ADF) test statistic [17, 18] (3), which tests the null hypothesis that a unit root is present in a time series sample:

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t, \qquad DF_\tau = \frac{\hat{\gamma}}{SE(\hat{\gamma})} \qquad (3)$$

where $\alpha$ is a constant, $\beta$ is the coefficient on a time trend and $p$ is the lag order of the autoregressive process. Around 20% of the features were found to be non-stationary; we made them stationary by differencing the series twice and then repeated the test. The unit root test is carried out under the null hypothesis $\gamma = 0$ against the alternative hypothesis $\gamma < 0$. Once a value for the test statistic (3) has been obtained, it can be compared to the relevant critical value of the Dickey-Fuller test. If the calculated test statistic is less (more negative) than the critical value, the null hypothesis $\gamma = 0$ is rejected, no unit root is present, and the series is therefore stationary.

Granger's causality test [19] [20] [21] (4) tests the null hypothesis that the coefficients of the past values in the regression equation are zero, which means that the past values of one time series (X) do not cause the other series (Y). If the p-value obtained from the test is less than the significance level of 0.05, we reject the null hypothesis.

$$P\left[Y(t+1) \in A \mid I(t)\right] \neq P\left[Y(t+1) \in A \mid I_{-X}(t)\right] \qquad (4)$$

where $P$ refers to probability, $A$ is an arbitrary non-empty set, and $I(t)$ and $I_{-X}(t)$ respectively denote the information available as of time $t$ in the entire universe, and that in the modified universe in which $X$ is excluded. If the above hypothesis is accepted, we say that $X$ Granger-causes $Y$.
The IC* (Inductive Causation) algorithm [22, 23] can be used to recover an underlying DAG structure from observed associations between variables. The algorithm is implemented as follows: (a) For each pair of variables $a$ and $b$ in $V$, search for a set $S_{ab}$ such that the conditional independence of $a$ and $b$ given $S_{ab}$, $(a \perp b \mid S_{ab})$, holds in the observed distribution. Construct an undirected graph linking the nodes $a$ and $b$ if and only if no such set $S_{ab}$ is found.

Before testing for causal relationships, we explored the distribution, trends and characteristics of the 141 development indicators. To explore the dataset, we calculated a correlation matrix using Pearson's correlation coefficient [24, 25] and plotted the correlations in a heat map. A sample correlation heatmap, for St. Martin (French part), can be seen in Figure 2. Features that were heavily correlated with one another were removed to avoid erroneous connections in the final results. As the data are stationary and approximately normally distributed, they satisfy the assumptions of the causality tests, and we can proceed with the causal analysis. The first step was to use the Granger causality values to construct a network showing predictive causal relationships between the nodes. This statistic captures only temporal relations: one variable preceding another is suggestive of, though not proof of, causation. The Granger causality test checks whether past values of one series X help forecast another series Y, which is interesting to observe in our indicator trends. The resulting linkages are shown in the corresponding graphs. In total, 492 causal relationships were found between the target variable (occurrence of disease outbreaks) and the indicators. Figures 3 and 4 show the Granger causality network graphs for Bulgaria and St. Martin (French part), built from causality matrices of the same form as the sample presented in Table 1.
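A Granger causality network of the kind described above can be sketched with NetworkX by keeping a directed edge from cause to effect whenever the test p-value falls below 0.05. The function name and the p-values below are made up for illustration; they are not results from this study.

```python
import networkx as nx

def build_granger_network(p_values, alpha=0.05):
    """Build a directed graph from pairwise Granger p-values.

    p_values maps (cause, effect) pairs to the p-value of the test that
    `cause` does NOT Granger-cause `effect`; edges with p < alpha are kept.
    """
    graph = nx.DiGraph()
    for (cause, effect), p in p_values.items():
        if cause != effect and p < alpha:
            graph.add_edge(cause, effect, p_value=p)
    return graph

# Hypothetical p-values for a single country (illustrative only).
p_values = {
    ("Disease outbreak", "GDP growth"): 0.01,
    ("Health expenditure", "Disease outbreak"): 0.03,
    ("Internet users", "Disease outbreak"): 0.40,
    ("GDP growth", "Health expenditure"): 0.20,
}

graph = build_granger_network(p_values)
target = "Disease outbreak"
# Edge direction separates indicators that precede the outbreak
# from indicators that follow it.
preceding = sorted(graph.predecessors(target))
following = sorted(graph.successors(target))
print("precede outbreak:", preceding)
print("follow outbreak: ", following)
```

Storing the p-value as an edge attribute keeps the evidence for each link available when the per-country graphs are later compared or aggregated.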
By using this algorithm, we are essentially treating our problem as causal discovery with hidden variables, removing irrelevant connections while keeping the potential causal connections, and thus inferring causal DAGs. Along with the algorithm, a Robust Regression Test [26, 27] was used to identify outliers and minimize their impact on the coefficient estimates; it also checks the independence of the two time series. After applying this technique to each country separately, we observed several causal structures and their corresponding embedded patterns. In total, 234 causal relationships were found between the target variable (occurrence of disease outbreaks) and the indicators. In each graph, every variable is a node (green), and each edge represents statistical dependence between the nodes that cannot be eliminated by conditioning on the variables specified for the search. Edges that also satisfy the local criterion for genuine causation form a network of directed edges, which is isolated in graph 2 of each figure and marked by pink nodes. 11 such relationships of genuine causation were found. After observing the graphs of 204 countries for 141 development indicators, we can clearly see that every country has a distinctive pattern of correlations. The total number of causal relationships between the target variable (occurrence of disease outbreaks) and the indicators was 492 using Granger causality, 234 using IC* statistical dependence and 11 using IC* genuine causation. Of the 234 relationships determined by IC*, only 6 were confirmed by both the Granger and IC* algorithms; these are presented in Table 3. We examined the graphs produced by the algorithms closely and noticed some interesting patterns.
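The first step of the IC* procedure, removing the edge between two variables whenever some conditioning set renders them independent, can be sketched as follows. This covers only the undirected skeleton, not the edge orientation or the genuine-causation criterion. The conditional independence test used here is a Fisher z-test on partial correlations, swapped in for simplicity in place of the robust regression test mentioned above; the function names, the cap on conditioning-set size and the synthetic data are assumptions.

```python
import itertools
import math
import numpy as np

def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond,
    computed from the inverse of the relevant correlation submatrix."""
    idx = [i, j, *cond]
    precision = np.linalg.inv(corr[np.ix_(idx, idx)])
    return -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])

def is_independent(corr, n, i, j, cond, alpha=0.05):
    """Fisher z-test: True if independence given cond is NOT rejected."""
    r = partial_corr(corr, i, j, cond)
    r = min(max(r, -0.999999), 0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha

def skeleton(data, alpha=0.05, max_cond=2):
    """Link i and j only if no conditioning set separates them."""
    n, d = data.shape
    corr = np.corrcoef(data, rowvar=False)
    edges = set()
    for i, j in itertools.combinations(range(d), 2):
        others = [k for k in range(d) if k not in (i, j)]
        separated = any(
            is_independent(corr, n, i, j, cond, alpha)
            for size in range(max_cond + 1)
            for cond in itertools.combinations(others, size)
        )
        if not separated:
            edges.add((i, j))
    return edges

# Synthetic chain x0 -> x1 -> x2: x0 and x2 are dependent marginally
# but independent once x1 is conditioned on.
rng = np.random.default_rng(1)
x0 = rng.normal(size=2000)
x1 = x0 + 0.5 * rng.normal(size=2000)
x2 = x1 + 0.5 * rng.normal(size=2000)
print(skeleton(np.column_stack([x0, x1, x2])))
```

On a chain like this, the edges (0, 1) and (1, 2) should survive, while the (0, 2) edge is usually removed because conditioning on x1 separates the endpoints; at a 5% significance level a spurious edge can still occasionally remain.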
A certain subset of features was consistently found to be related to the target variable, the disease outbreak occurrence, and has potential for genuine causation. These features include indicators such as individuals using the Internet, GDP, employment and health expenditure, which intuitively make sense as factors affected by major disease outbreaks. The features are ranked by frequency in Table 2; the most frequent ones can be given to the authorities as preliminary findings or fed to further network models to gain more comprehensive insights. The main motivation behind this study was to increase interpretability and to trace the common causal relationships occurring in world dynamics over time, which can be seen in the network graphs and ranked features. The findings provide easy-to-understand insights for the many nations included in the worldwide statistics. By aggregating the results gathered from all 204 nations, we can observe global patterns and country-specific trends, and the direction of impact seen in the directed graphs gives insight into the nature of these connections. GDP and healthcare expenditure are strong features that appear frequently among the outbreak-sensitive features and can be targeted by authorities to become more resilient to future outbreaks. One important observation is that the dataset, in spite of containing over 140 indicators, is still not sensitive to the minor events and factors that influence modern countries. For example, the interaction between the employment ratio and pandemic occurrence may also be due to ineffective policies or internal conflicts in the country. In critiquing the methodology, we are aware that Granger causality is not necessarily true causality but can indicate the precedence of variables in the dataset.
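The frequency ranking of indicators across countries can be sketched as a simple count: for each country, collect the indicators linked to the outbreak variable in either direction, then count in how many countries each indicator appears. The function name, the country labels and the edges below are hypothetical and do not reproduce Table 2.

```python
from collections import Counter

def rank_indicators(edges_by_country, target="Disease outbreak"):
    """Count, per indicator, the number of countries in which it is
    linked to the target variable (as cause or as effect)."""
    counts = Counter()
    for edges in edges_by_country.values():
        linked = set()  # count each indicator at most once per country
        for cause, effect in edges:
            if cause == target:
                linked.add(effect)
            elif effect == target:
                linked.add(cause)
        counts.update(linked)
    return counts.most_common()

# Hypothetical per-country edge lists (illustrative only).
edges_by_country = {
    "Country A": [("Disease outbreak", "GDP"), ("Employment", "Disease outbreak")],
    "Country B": [("GDP", "Disease outbreak"), ("Disease outbreak", "Health expenditure")],
    "Country C": [("Disease outbreak", "GDP"), ("GDP", "Employment")],
}

ranking = rank_indicators(edges_by_country)
for indicator, n_countries in ranking:
    print(f"{indicator}: linked in {n_countries} countries")
```

Counting each indicator once per country, rather than once per edge, keeps a single densely connected country from dominating the global ranking.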
Directly utilizing the results of this study without a background verification for the given country may lead to incorrect assumptions about the nature of dynamics and further lead to ethical concerns by policy makers. Using the IC* algorithm to fine tune these results may potentially provide a degree of certainty to our determined causal relationships, but the verification of our results using more complex causal algorithms may be necessary due to the complex nature of the data and randomness in world indicators. This paper presents a new approach towards understanding how disease outbreaks affect development of countries across the world. In the future, we would like to extend this application, integrate more statistical analyses and build a more thorough knowledge framework based on the current dataset, combined with external country specific data sources. We would also like to share our insights with observations from domain experts studying the effects of disease outbreaks and provide better explanations for why each feature appears to have the respective causal relationship with the other features in a connected network. The epidemiological findings may be utilised to build strong emergency preparation systems and plan and assess future development initiatives. We hope that this study will aid researchers in better understanding disease outbreak dynamics and their implications for global development.

[1] Growth and governance: Models, measures, and mechanisms
[2] Population and resources: an exploration of reproductive and environmental externalities
[3] The contribution of millennium development goals towards improvement in major development indicators
[4] Access to finance: An unfinished agenda
[5] Global health and the new bottom billion: what do shifts in global poverty and disease burden mean for donor agencies?
[6] Hygiene and the world distribution of Alzheimer's disease: Epidemiological evidence for a relationship between microbial environment and age-adjusted disease burden
[7] Environmental-mechanistic modelling of the impact of global change on human zoonotic disease emergence: a case study of Lassa fever
[8] Dynamic causal modelling of COVID-19
[9] Using difference-in-differences to identify causal effects of covid-19 policies
[10] Opinion: Sustainable development must account for pandemic risk
[11] Potential economic impact of an avian flu pandemic on Asia
[12] The impacts of the HIV/AIDS pandemic and socioeconomic development on the living arrangements of older persons in sub-Saharan Africa: A country-level analysis
[13] World Bank
[14] WHO Disease outbreaks by countries, territories and areas Data Set
[15] Single missing data imputation in PLS-SEM
[16] An analysis of variance test for normality (complete samples)
[17] Econometric analysis. Pearson Education India
[18] Introduction to Statistical Time Series
[19] Univariate tests for time series models
[20] Investigating causal relations by econometric models and cross-spectral methods
[21] Causality in macroeconomics
[22] Causality: Models, reasoning and inference. Cambridge University Press
[23] A Theory of Inferred Causation
[24] Typical laws of heredity
[25] VII. Note on regression and inheritance in the case of two parents
[26] Modern Methods for Robust Regression
[27] Robust tests for independence of two time series
Financial intermediation and growth: Causality and causes
Detecting causality in complex ecosystems
Misery loves companies: Rethinking social initiatives by business
Natural resources, conflict, and conflict resolution: Uncovering the mechanisms
Foreign direct investment in Africa: The role of natural resources, market size, government policy, institutions and political instability
Distribution of the estimators for autoregressive time series with a unit root
Likelihood ratio statistics for autoregressive time series with a unit root
Introduction to statistical time series
Testing for causality: a personal viewpoint
Essays in econometrics: collected papers of
Exploring network structure, dynamics, and function using NetworkX
statsmodels: Econometric and statistical modeling with python
Graphviz and dynagraph-static and dynamic graph drawing tools
Causal networks: Semantics and expressiveness
Identifying independence in Bayesian networks
An algorithm for deciding if a set of observed independencies has a causal explanation
Statistics: Fourth International Student Edition