key: cord-0159704-yo6ll49h authors: Stojkoski, Viktor; Utkovski, Zoran; Jolakoski, Petar; Tevdovski, Dragan; Kocarev, Ljupco title: Correlates of the country differences in the infection and mortality rates during the first wave of the COVID-19 pandemic: Evidence from Bayesian model averaging date: 2020-04-14 journal: nan DOI: nan sha: dbaf46516fbe0a7d039e0f157c30a1e0d1b9bf26 doc_id: 159704 cord_uid: yo6ll49h

In the initial wave of the COVID-19 pandemic we observed great discrepancies in both infection and mortality rates between countries. Besides the biological and epidemiological factors, a multitude of social and economic factors also influence the extent to which these discrepancies appear. Consequently, there is an active debate regarding the critical socio-economic and health factors that correlate with the infection and mortality rates of the pandemic. Here, we leverage Bayesian model averaging techniques and country level data to investigate the potential of 28 variables, describing a diverse set of health and socio-economic characteristics, to act as correlates of the final number of infections and deaths during the first wave of the coronavirus pandemic. We show that only a few variables robustly correlate with these outcomes. To understand the relationship between the potential correlates in explaining the infection and death rates, we create a Jointness Space. Using this space, we conclude that the extent to which each variable is able to provide a credible explanation for the COVID-19 infections/mortality outcome varies between countries because of their heterogeneous features.

In order to reduce the potentially enormous impact of the spread of the coronavirus disease (COVID-19), most governments implemented social distancing restrictions such as closure of schools, airports, borders, restaurants and shopping malls. In the most severe cases there were even lockdowns: all citizens were prohibited from leaving their homes.
This subsequently led to a major economic downturn: stock markets plummeted, international trade slowed down, businesses went bankrupt and people were left unemployed. While in some countries the implemented restrictions significantly reduced the expected shock from the coronavirus, the extent of the disease spread in the population varied greatly from one economy to another. A multitude of health, social and economic factors have been proposed as potential correlates of the observed variety in the coronavirus outcome, in terms of the number of infections and/or deaths during this first wave of the pandemic. Indeed, numerous studies identify various factors that affect the within-country distribution of infections and deaths (see, for example, Refs. [1-5]). The same debate has been extended to evaluate the between-country discrepancies. In particular, some experts say that the hardest hit countries also had an aging population [6, 7], or an underdeveloped healthcare system [8, 9]. Others emphasize the role of the natural environment [10, 11]. Bearing in mind the ongoing discussion, a comprehensive empirical study of the critical health, social and economic correlates of the country level number of infections and deaths during the first wave of the pandemic could not only help infer whether any general rules govern their potential impact, but would also offer guidance for future policies that aim at preventing future epidemic crises.

To this end, here we perform a detailed statistical analysis on a large set of potential health and socio-economic variables and explore their potential to explain the variety in the observed coronavirus total infections/deaths between countries in the first wave of the virus spread. We focus on COVID-19 data generated only in the first wave of the pandemic, and thus do not account for subsequent waves (we formally define the first wave in the next section).
While this may be seen as a limitation of our analysis, we assert that in each subsequent wave more was known about the spread of the virus and vaccines were available. This significantly affected the way in which the population reacted to its potential susceptibility. Thus, it can be said that each wave has its own health, social and economic characteristics and therefore should be studied separately.

To construct the set of potential correlates we conduct a thorough review of the literature that describes the social and economic factors which contribute to the spread of an epidemic. We identify a total of 28 potential variables that describe a diverse ensemble of factors, including healthcare infrastructure, societal characteristics, economic performance and demographic structure. To investigate the performance of each variable in explaining the coronavirus infections/deaths outcome, we collect a sample of 105 countries, the largest set of countries for which all data were available, and utilize the technique of Bayesian model averaging (BMA). BMA allows us to isolate the most important correlates by calculating the posterior probability that they truly regulate the process. At the same time, BMA provides estimates of the relative impact of the correlates and accounts for the uncertainty in their selection [12] [13] [14]. In this respect, our analysis adds value to a growing body of literature which applies Bayesian methods to investigate the critical factors that drive a certain process, in this particular case the outcome of the COVID-19 pandemic [15].

Based on the studied data, we observe patterns which suggest that during the first wave of the pandemic only a few variables acted as strong and robust correlates of the final number of registered coronavirus infections and deaths in a country. These variables are related to the effect of density in social interactions and the overweight prevalence within the population.
A simple correlation analysis indicates that the heterogeneity between the countries in terms of their health, social and economic nature might be the driver of this conclusion. Thus, the initial BMA results cannot capture (potentially) significant interactions between the correlates that are relevant to a particular country. To deal with this issue, we develop the coronavirus correlates Jointness Space. The Jointness Space models the interrelation between the potential correlates in explaining the coronavirus infections/deaths outcome, and can serve as a statistical foundation for understanding the relationships between variables when developing policy recommendations for preventing future epidemic crises. Using this space, we find that the routes for reducing the potential negative impact of COVID-19 should focus on decreasing the prevalence of overweight people in the population and on a small number of other variables that are relevant to the country that is studied. This would reduce both the registered infections and the observed deaths due to the COVID-19 disease. In the absence of realistic models that adequately cover all relevant aspects, this study provides a first step towards a more comprehensive understanding of the relationship between the socio-economic correlates of the coronavirus pandemic.

In a formal setting, the final number of registered COVID-19 infections per million population (p.m.p.) and the number of total COVID-19 deaths p.m.p. during the first wave of the pandemic are a result of a disease spreading process [16, 17]. The extent to which a disease spreads within a population is uniquely determined by its reproduction number. This number describes the expected number of cases directly generated by one case in a population in which all individuals are susceptible to infection [18, 19].
Obviously, its magnitude depends on various natural characteristics of the disease, such as its infectivity or the duration of infectiousness, and on the social distancing measures imposed by the government. It also depends on an abundance of health and socio-economic factors that govern the behavioral interactions within a population [20, 21]. In general, we never observe the reproduction number, but rather the disease outcome, i.e., the number of infections/deaths. Thus, inferring the reproduction number is mathematically complex and computationally expensive. To circumvent this problem, we utilize its known characteristics and derive a much simpler statistical model for the COVID-19 outcome. Here we choose a specific formulation in which the disease outcome is modeled through the linear regression framework, where the dependent variable is either the log of the accumulated number of registered COVID-19 infections p.m.p. or the log of the accumulated number of COVID-19 deaths p.m.p. of the country at the end of the first wave of the pandemic. For the dependent variable we use registered quantities normalized on a per capita basis instead of raw values, to eliminate the bias in the outcomes arising from the different population sizes of the studied countries. The accumulation of the registered infections and deaths spans from the day of observation of the first infection in the country up until the last day of the first wave of the pandemic in that country. The last day is, in general, different for each country and is inferred on the basis of the level of daily government response. The estimation procedure used to infer the last day of the first wave is discussed in more detail in the next section. The log transformation of the COVID-19 infections/deaths p.m.p. reduces the skewness of the original data and makes the dependent variable real-valued and continuous.
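As an illustration of the formulation described above, the dependent variable and the regression can be written as follows. The symbols are our own notational assumptions and do not appear in the original text: $C_i$ is the accumulated number of registered infections (or deaths) in country $i$ at the end of its first wave, $N_i$ is the population of the country, $\mathbf{x}_i$ is the vector of explanatory variables, and $\varepsilon_i$ is an error term.

```latex
% Dependent variable: log of the accumulated outcome per million population
y_i = \log\!\left( 10^{6} \, \frac{C_i}{N_i} \right)

% Linear regression of the outcome on the explanatory variables
y_i = \alpha + \mathbf{x}_i^{\top} \boldsymbol{\beta} + \varepsilon_i
```

Under this notation, each coefficient in $\boldsymbol{\beta}$ is the marginal effect of the corresponding variable on the log outcome, holding the other variables fixed.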
For such a dependent variable, the linear regression framework is the simplest tool that quantifies the marginal effect of a set of potential independent variables (correlates). Its advantage lies in the efficient and unbiased analytical inference of the strength of the linear relationship. As such, it has been widely used in modeling the outcome of epidemiological phenomena (see, for example, Refs. [22] [23] [24]).

A central question which arises in the model specification is the selection of the independent variables. While a literature review can offer a comprehensive overview of all potential correlates, in reality we are never certain of their credibility. To reduce this uncertainty, we resort to BMA. BMA leverages Bayesian statistics to account for model uncertainty by estimating each possible specification, thereby evaluating the posterior distribution of each parameter value and the probability that a particular model is the correct one [25]. This has allowed the BMA technique to be used in various domains, ranging from studying correlates of economic growth [26] to determinants of innovation processes [27]. Recently, it was even applied to estimating the output losses during the COVID-19 pandemic [28].

The BMA method relies on the estimation of a baseline model that is used for evaluating the performance of all other models. In our case, this is the model which encompasses only variables for the state of the epidemic dynamics within the country and the effect of government policies regarding social distancing, contact tracing and testing procedures. We use two variables to quantify the possibility that countries are in a different state of the disease spreading process. The first variable simply measures the duration of the epidemic in a country as the number of days since the first registered infection. In addition, we evaluate the time which the country had to prepare for the first wave of coronavirus.
This is given as the number of days between the first registered infection worldwide and the first infection in the country. In order to assess the effect of government policies regarding social distancing and testing we construct an aggregated government response index. The index quantifies the average daily variation in government responses to the epidemic dynamics. As a measure of the daily variation, we take the Oxford COVID-19 government response index [29]. The Oxford COVID-19 government response index is a composite measure that combines the daily effect of policies on social distancing, testing and contact tracing in an economy. For each country, we construct a weighted average of the index from all available data since its first registered coronavirus infection up until the end date, i.e., the date when the government response index is at its maximum value. This threshold is chosen as a means to capture the moment when a country gains the ability to control and stabilize the propagation of the disease. To emphasize the effect of policy responses implemented on earlier dates, the weighted average puts a larger weight on those dates. This is because earlier responses are expected to have a bigger impact on preventing the spread of the virus. The procedure implemented to derive the average government response index is described in Section S1 of the Supplementary Information (SI).

Fig. 1 visualizes the results from the baseline model. We observe that the countries which had more detailed response policies also had fewer COVID-19 infections and lower mortality rates, as expected. In addition, the countries with a longer duration of the crisis registered more infections and deaths p.m.p., whereas the countries which had more time to prepare for the crisis had fewer infections and deaths.
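The weighted average of the government response index described above can be sketched as follows. This is only an illustration under stated assumptions: the exact weights are given in SI Section S1 and are not reproduced here, so the linearly decaying weights are a hypothetical choice, as are the function names `first_wave_end` and `weighted_response`. When the maximum is attained on several days, the sketch assumes the first such day is the end date.

```python
def first_wave_end(index_by_day):
    """Position of the end date: the first day on which the index is at its maximum.

    index_by_day[0] is the day of the first registered infection in the country.
    Taking the FIRST day at the maximum is an assumption of this sketch.
    """
    return index_by_day.index(max(index_by_day))


def weighted_response(index_by_day):
    """Weighted mean of the daily index from the first infection to the end date.

    Weights decay linearly with time (hypothetical choice), so that responses
    on earlier dates count more than responses on later dates.
    """
    end = first_wave_end(index_by_day)
    window = index_by_day[: end + 1]
    n = len(window)
    weights = [n - t for t in range(n)]  # n, n-1, ..., 1
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


# Toy series of daily index values (not real data)
daily = [0, 20, 40, 80, 80, 60]
print(first_wave_end(daily))     # 3
print(weighted_response(daily))  # 22.0
```

In the toy series the unweighted mean over the same window is 35, so the early-weighted average (22.0) penalizes a country that reached a high response level only late in the window.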
It is apparent that the baseline model already has a large coefficient of determination (R²) and explains a significant share of the cross-country variation in registered COVID-19 infections/deaths p.m.p. However, there is still a large amount of variation which, we conjecture, can be attributed to various health, social and economic correlates present within a society.

To derive the set of potential health, social and economic correlates of the COVID-19 infection and mortality rates during the first wave of the pandemic we conduct a comprehensive literature review. From the literature review we identify a total of 28 potential correlates, listed in Table 1. For a detailed description of the potential effect of the correlates we refer to the references given in the same table, and the references therein. We point out that the data for each potential correlate corresponds to the last observed value (the value in 2019). This prevents the possible problem of endogenous independent variables in the specification of the regression. In what follows, we only briefly describe the set of potential correlates on the basis of their characteristics.

Healthcare Infrastructure: The healthcare infrastructure essentially determines both the quantity and quality with which health care services are delivered in a time of an epidemic. As measures, we include 2 variables which capture the quantity of hospital beds, nurses and medical practitioners, as well as the quality of the coverage of essential health services. On the one hand, studies report that well-structured healthcare resources positively affect a country's capacity to deal with epidemic emergencies [30] [31] [32] [33] [34] [35] [36] [37]. On the other hand, the healthcare infrastructure also greatly impacts the country's ability to perform testing and reporting when identifying the infected people.
In this regard, economies with better structure are able to perform mass testing and more detailed reporting more easily [38] [39] [40].

National health statistics: The physical and mental state of a person plays an important role in the degree to which the individual is susceptible to a disease. In countries where a significant proportion of the population suffers from diseases highly associated with the spread of an infectious disease, as well as with its fatal outcomes, we would expect more severe consequences of the emergent epidemics [41] [42] [43] [44]. Specifically, metabolic disorders such as diabetes may intensify epidemic complications [45, 46], whereas it has been observed that the susceptibility to various diseases accounts for the majority of deaths in complex emergencies [47]. In addition, there is empirical evidence that adequate hygiene greatly reduces the rate of mortality, whereas overweight or asthma prevalence in the population may increase the fatality of epidemic diseases [48] [49] [50]. To quantify the national health characteristics, we include 6 variables that assess the general health level in the studied countries.

Economic performance: We evaluate the economic performance of a country through 4 variables. This performance often mirrors the country's ability to intervene in the case of a public health crisis [51] [52] [53] [54] [55] [56]. Variables such as GDP per capita have been used in modeling health outcomes, mortality trends, cause-specific mortality estimation and health system performance and finances [57] [58] [59]. For poor countries, economic performance appears to improve health by providing the means to meet essential needs such as food, clean water and shelter, as well as access to basic health care services. However, after a country reaches a certain threshold of development, few health benefits arise from further economic growth.
It has been suggested that this is the reason why, contrary to expectations, the economic downturns during the 20th century were associated with declines in mortality rates [60, 61]. Observations indicate that what drives health in industrialized countries is not absolute wealth or growth but how the nation's resources are shared across the population [62]. A more egalitarian income distribution within a rich country is associated with better health of the population [63] [64] [65] [66].

Societal characteristics: The characteristics of a society often reveal the way in which people interact, and thus spread the disease. In this respect, properties such as education and the degree of digitalization within a society reflect the level of a person's reaction and promotion of self-induced measures for reducing the spread of the disease [67] [68] [69] [70] [71]. Also, the way we mix in society may effectively control the spread of infectious diseases [21] [72] [73] [74] [75]. To measure the societal characteristics, we identify 4 variables.

Demographic structure: Similarly to the national health statistics, the demographic structure may impact the average susceptibility of the population to a disease. Certain demographic groups may simply have weaker defensive health mechanisms to cope with the stress induced by the disease [76] [77] [78] [79]. In addition, the location of living may greatly affect the way in which the disease is spread [80, 81]. To express these phenomena, we collect 7 variables.

Natural environment: Numerous studies discuss a possible correlation between air pollution and COVID-19 infections and mortality rates [11, 82, 83]. In addition, some authors note that countries where natural sustainability has deteriorated are also more vulnerable to epidemic outbreaks [10]. On the other hand, healthy natural environments may attract more tourists, which could drive the disease spread [38].
Finally, weather patterns can also impact the infectiousness of the disease, especially exposure to very cold days in winter and very hot days in summer [84]. We gather the data for 5 variables which capture the essence of this characteristic.

In the estimation procedure we use data on 105 countries. This is the maximal set of countries for which the data on all 28 potential correlates could be obtained. The summary statistics and the data gathering and preprocessing procedures are described in SI Section S2. The mathematical background of BMA together with our inference setup is given in SI Section S3. Fig. 2 displays the respective results. In both cases (infections and deaths), the variables are ordered according to their posterior inclusion probabilities (PIP), given in the second column. PIP quantifies the posterior probability that a given correlate belongs to the linear regression model that best describes the COVID-19 infections/mortality rates. Besides this statistic, we also provide the posterior mean (Post Mean) and the posterior standard deviation (Post Std). Post Mean is an estimate of the average magnitude of the effect of a correlate, whereas Post Std evaluates the deviation from this value.

In the inference procedure (described in SI Section S3) we initially assumed that the linear regression model which best describes the COVID-19 first wave infections and mortality rates consists of the baseline specification and 3 additional variables. Our prior belief stems from the general observation that economies are heterogeneous, and a small number of complementing factors may contribute to the extent of the coronavirus spread, while the other potential correlates may simply behave as substitutes in terms of socio-economic interpretation within a country. Nevertheless, we found that our results do not depend on the prior assumption about the size of the true model.
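To make the quantities above concrete, the following is a minimal sketch of how PIP and Post Mean can be computed by enumerating all models. It is not the inference setup of SI Section S3: the marginal likelihood is approximated with the BIC (a swapped-in simplification), the data are synthetic, the baseline is reduced to an intercept, and the names `ols_rss` and `bma` are hypothetical. The model prior assigns each candidate an independent inclusion probability equal to the expected prior model size divided by the number of candidates.

```python
import itertools
import math
import random


def ols_rss(X, y):
    """OLS fit via the normal equations; returns (residual sum of squares, coefficients)."""
    n, p = len(X), len(X[0])
    # Augmented system [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    rss = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))
    return rss, beta


def bma(Z, X, y, prior_size):
    """Z: baseline columns kept in every model; X: candidate correlates."""
    n, K = len(y), len(X[0])
    theta = prior_size / K  # prior inclusion probability of each candidate
    models = []
    for k in range(K + 1):
        for subset in itertools.combinations(range(K), k):
            design = [Z[r] + [X[r][j] for j in subset] for r in range(n)]
            rss, beta = ols_rss(design, y)
            bic = n * math.log(rss / n) + len(design[0]) * math.log(n)
            log_prior = k * math.log(theta) + (K - k) * math.log(1 - theta)
            # Unnormalized log posterior model probability (BIC approximation)
            models.append((subset, -0.5 * bic + log_prior, beta[len(Z[0]):]))
    top = max(m[1] for m in models)
    weights = [math.exp(m[1] - top) for m in models]
    total = sum(weights)
    pip, post_mean = [0.0] * K, [0.0] * K
    for (subset, _, coefs), w in zip(models, weights):
        for j, b in zip(subset, coefs):
            pip[j] += w / total            # posterior inclusion probability
            post_mean[j] += w / total * b  # coefficient counts as 0 when excluded
    return pip, post_mean


# Synthetic example: only candidate 0 truly affects the outcome
rng = random.Random(0)
n, K = 105, 6
X = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(n)]
Z = [[1.0] for _ in range(n)]  # intercept-only baseline
y = [1.0 * X[r][0] + rng.gauss(0, 0.5) for r in range(n)]
pip, post_mean = bma(Z, X, y, prior_size=1)
print([round(p, 3) for p in pip])
```

With 28 candidates full enumeration would require 2^28 regressions, so practical implementations typically sample the model space instead. With an expected prior model size of 3 out of 28 candidates, the same rule gives a prior inclusion probability of 3/28 ≈ 0.107.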
Altogether, this implies that the prior inclusion probability of each potential correlate is around 0.1. We use this attribute, together with the posterior inclusion probability of each correlate, to divide the correlates into four disjoint groups:

Correlates with strong evidence (PIP > 0.5): The first group contains the correlates whose posterior inclusion probability is far larger than their prior probability, so that there is strong evidence for including them in the true model. We find two such variables for the coronavirus infections: the overweight prevalence in the country and the population density. Both variables are positively related to the number of registered COVID-19 infections p.m.p. For the coronavirus deaths, the overweight prevalence is the only variable with strong evidence, and it has a positive effect.

Correlates with medium evidence (0.5 ≥ PIP > 0.1): There are no variables for which there is medium evidence of being a correlate of the COVID-19 number of infections in the first wave, whereas mortality from non-natural causes is the only variable for which there is medium evidence of being a correlate of the COVID-19 death rate, with a negative effect.

Correlates with weak evidence (0.1 ≥ PIP > 0.05): These are correlates which have a lower posterior inclusion probability than their prior one, but still may account for some of the variation in the COVID-19 infections/deaths. For the infections per million population there are three such correlates: the fraction of elderly population, the number of international tourist arrivals and the mortality from non-natural causes. The elderly population has a positive Post Mean, whereas the other two variables have a negative Post Mean. For the COVID-19 death rate, we find two correlates with weak evidence: the household size and the government health expenditure.
The household size has a positive marginal effect (Post Mean), whereas the government health expenditure shows a negative effect.

Correlates with negligible evidence (PIP ≤ 0.05): All other variables have negligible evidence of being a true correlate of the coronavirus outcome. In total, we find negligible evidence for explaining the coronavirus infections in 23 variables and for explaining the coronavirus deaths in 24 variables.

The division of the variables into groups allows us to assess the robustness of each potential correlate: those belonging to a group with a larger PIP also offer a more credible explanation for the coronavirus infections and death rates. Nonetheless, we point out that although the comparison between posterior and prior inclusion probabilities is a common approach, its interpretation must be taken with care. Concretely, the inhomogeneous nature of the specific features of the countries can drive our results. The presence of this phenomenon in our data can be inferred by conducting a simple correlation analysis between the potential correlates. If the variables are highly correlated with each other, then there is a problem of multicollinearity. Multicollinearity can lead to wider credible intervals that eventually produce less statistically reliable posterior inclusion probabilities in terms of the effect of independent variables in a model. As noted in [26], even if the posterior inclusion probability is lower than the prior inclusion probability for a given variable, this particular variable might still be important to decision makers under certain circumstances.

In SI Section S4.1 we conduct several checks to confirm the robustness of our results. In the first robustness check we investigate the impact of outliers. In particular, there were several countries which were either extremely affected by the coronavirus or displayed great immunity to the epidemic crisis.
To check the robustness of our results against the presence of such data we implement the following strategy. First, we remove a country from the sample. Then, we rerun the BMA procedure with the remaining countries. We repeat this procedure for every country and recover the median results for each potential correlate. The results indicate that the findings presented here are valid even in the presence of outliers. In the same section, we display the economies which contributed most and least to the credibility of a particular variable. These are the countries which, when excluded, lead to the minimum and maximum posterior inclusion probability of the given variable, respectively. The investigation suggests that there are multiple countries which are significant contributors to the PIP value of each correlate, thus further indicating that there is heterogeneity in the health, social and economic features of the countries.

In the second check, we change the end date of the first wave to the first date, after the day on which the daily government response index is at its maximum, on which the index is at least 20% lower than that maximum. This effectively prolongs the duration of the first wave. Nonetheless, it still does not impact the findings. In the third check, we change the dependent variable to be the raw number of infections and deaths at the end of the first wave. In other words, the dependent variable now describes counts, and the linear regression framework is not a suitable model. Instead, for the estimation of the marginal impact we use a quasi-Poisson model, which is the most frequently used procedure when the dependent variable is a count with a large variance [90]. Even in this case, the results do not change. In the final robustness check, we add a spatial weighting matrix to the baseline model in order to account for the potential spatial autocorrelation in the spread of COVID-19.
Multiple studies have indicated that this effect might exist (see, for example, [91]). Again, our findings do not significantly change. Indeed, even if useful for presentation purposes, the mechanical application of a threshold, or a simple comparison between the prior and the posterior, should generally be avoided in practice. Each BMA analysis should be coupled with an investigation of the interrelationships between the variables in explaining the dependent variable. We perform this analysis in the subsequent section.

The next step in deriving the linear regression model that best describes the coronavirus infections/mortality rates is to find its dimension, i.e., the number of explanatory variables included in the model. As a measure of this quantity, BMA provides the posterior model size, formally defined as the posterior belief about the dimension of the model. We find that for the coronavirus infections p.m.p. the posterior model size is 2.21, whereas for the coronavirus deaths p.m.p. it is 1.34. After discovering the model size, we need to specify the explanatory variables. This raises the issue of how to construct the appropriate model. One possible solution is to take the correlates with the highest PIP values and regress the dependent variable on them. However, this neglects the interdependence of inclusion and exclusion of correlates in the same model. A standard approach for resolving this issue is to conduct a statistical jointness test. The concept of jointness has been introduced within the BMA framework with the aim of capturing dependence between explanatory variables in the posterior distribution over the model space [92]. By emphasizing dependence and conditioning on a set of one or more other variables, jointness moves away from marginal measures of variable importance and investigates the sensitivity of posterior distributions of parameters of interest to dependence across regressors.
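As a rough illustration of how a pairwise jointness statistic can be computed from the posterior distribution over the model space, the sketch below uses the plain Yule's Q association coefficient, which lies between −1 and 1. This is an assumption made for illustration: the test actually applied in this paper may use a different or modified statistic. The function name `yules_q` and the toy model probabilities are hypothetical.

```python
def yules_q(models, i, j):
    """Yule's Q association between the inclusion of variables i and j.

    models: list of (set_of_included_variables, posterior_model_probability).
    Returns a value in [-1, 1]; 0.0 is returned when the statistic is undefined.
    """
    a = sum(p for s, p in models if i in s and j in s)          # both included
    b = sum(p for s, p in models if i in s and j not in s)      # only i included
    c = sum(p for s, p in models if i not in s and j in s)      # only j included
    d = sum(p for s, p in models if i not in s and j not in s)  # neither included
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom > 0 else 0.0


# Toy posterior in which x1 and x2 mostly enter or leave the model together
together = [({"x1", "x2"}, 0.4), (set(), 0.4), ({"x1"}, 0.1), ({"x2"}, 0.1)]
print(round(yules_q(together, "x1", "x2"), 3))  # 0.882

# Toy posterior in which x1 and x2 never appear in the same model
apart = [({"x1"}, 0.5), ({"x2"}, 0.5)]
print(yules_q(apart, "x1", "x2"))  # -1.0
```

The statistic is positive when the posterior mass concentrates on models that include or exclude both variables together, and negative when it concentrates on models that contain exactly one of them.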
For example, if two variables are complementary in their posterior distribution over the model space, models that either include or exclude both variables together receive relatively more weight than models where only one variable is present. In our context, jointness tests will allow us to infer whether two variables are complements, i.e., tend to be included together in models with high posterior probability, or substitutes, i.e., models with high posterior probability tend to exclude the joint inclusion of both variables. To better understand the properties of the COVID-19 infection and mortality rates during the first wave, we perform the jointness test developed by Hofmarcher et al. [93] . Using this test we can estimate a metric between each pair of correlates and quantify their relationship in a range between −1 and 1. In the two extremes, −1 indicates that the two correlates behave as perfect substitutes in the true model, whereas 1 indicates that they are included in the true model together. The resulting jointness metric between pairs of correlates can be used to construct a network (graph), which we refer to as the Jointness Space of the COVID-19 correlates. In this network, the nodes are the potential health, social and economic correlates, whereas the jointness values represent the edge weights. In other words, two arbitrary correlates are linked with each other by the posterior belief that both of them belong to the same linear regression model governing the coronavirus infections/mortality rate. In theory, many possible factors may cause complementarity between the variables, such as national culture [94] , the type of healthcare system [95] or political priorities [96] . All of these are a priori notions of what dimension drives the relatedness between the potential correlates and assume that there is little flexibility in choosing the correct model. 
Instead, the Jointness Space follows an agnostic approach and uses a data-driven measure, based on the idea that two correlates which offer contrasting information regarding the coronavirus outcome will tend to be included in the true model in tandem, whereas variables that give similar information are less likely to be included together. Hence, the developed network offers a statistical view of the importance of the social, health and economic correlates when developing policies aimed at reducing the impact of epidemic crises.

The networks depicted in Fig. 3 visualize the Jointness Space of the correlates included in our BMA framework. To emphasize the complementary relationships, we connect only correlates with positive jointness. The full description of the procedure implemented for constructing the Jointness Space is given in SI Section S5. In the networks, the correlates which can be included in multiple models take a more central position, whereas the periphery consists of correlates whose credibility in explaining the coronavirus outcome mostly substitutes the effect of other variables. Interestingly, we observe that the topological form of the Jointness Space is not significantly determined by how we specify the dependent variable. In both cases, there is one large connected component of correlates in which the central role is played by the overweight prevalence. Thus, the obtained maps suggest that the first step in the construction of the linear regression model for the COVID-19 infections/death rate in the first wave is to focus on the fraction of overweight persons in the country. Moreover, almost all other variables belong to the same component. Only when the dependent variable is modelled through the COVID-19 deaths are Life expectancy and Health coverage excluded from the component. Hence, the variables included in our analysis are complements in explaining the COVID-19 infections/death rates.
Based on this finding, we once again assert that the next variables to be included in the model should be specific to the economy that is the subject of the study. Nonetheless, improving the features of the correlates that are located more centrally might yield a synergistic effect, thus significantly reducing the risk of a worse COVID-19 infection/death rate.
In this work, we utilized Bayesian model averaging techniques to provide a comprehensive analysis of the health, social and economic correlates that contributed to between-country differences in the final number of infections and deaths during the first wave of the COVID-19 pandemic. Our findings suggest that government response policies, such as testing procedures, tracking of individuals and social distancing measures, and the state of the dynamics of the disease spread can significantly explain the variety in the coronavirus outcome between the countries. Aside from these variables, only a handful of additional variables are able to robustly explain the extent of the COVID-19 infections/deaths and thus provide general rules for the virus spread. The sole variable strongly related to the coronavirus deaths is the overweight prevalence. Countries with a larger fraction of overweight population also show greater susceptibility to fatal virus outcomes. Interestingly, besides the overweight prevalence, the population density is also a strong correlate of the registered coronavirus infections per million population. More densely populated countries display higher infection rates. Many reasons can be offered as possible interpretations of these results. For instance, it is known that the degree of disease spread scales proportionally with population density [97]. This is because, everything else considered, there is typically more social mixing in denser populations [21]. In a similar fashion, various explanations can be found for the observed effect of overweight prevalence.
In particular, the prevalence of overweight people is closely related to unhealthy habits of living and, hence, larger susceptibility to both disease infections and fatal outcomes. The robustness checks and the performed jointness analysis suggested that the insignificance of the other variables might not be the reason for their low PIP values. Instead, the variables which we studied have a complementary effect in explaining the COVID-19 infections and death rates of the first wave of the pandemic. This led us to suspect that the results are driven by the heterogeneous health, social and economic features of the countries. To this end, an interesting topic for future research would be to explore how the effect of the correlates evolved during the different waves of the pandemic. In the absence of a unifying framework covering the relevant aspects of the interrelation between the potential correlates during the various waves, the jointness analysis performed here (and the resulting Jointness Space) can provide the starting point for the development of a more comprehensive understanding of the factors determining the infection and mortality rates of the pandemic. Moreover, with an improved understanding of the dynamics of the coronavirus pandemic, the insights obtained from this analysis can influence the development of appropriate policy recommendations.
To calculate our government response measure, we make use of Oxford's daily government response index. This index measures, on a scale of 1-100, the variation in daily government responses to COVID-19 by accumulating ordinal data on country social distancing measures: school, workplace and public transport closure; cancellation of public events; restrictions of internal movement; control of international travel; promotion of public campaigns on prevention of coronavirus spread; testing policies; and procedures implemented for tracing contacts of infected individuals.
We refer to [29] for a detailed overview of how the daily index is constructed. To calculate the overall government response index $c_i(d_i^*)$ at the final date $d_i^*$ from the provided daily indexes, we implement the following procedure. Let $C_i(t)$ represent the government response on day $t$, where $t = 1, 2, \ldots, d_i^*$. Our index is then estimated by weighting the daily values $C_i(t)$, where $w_i(s)$ are the weights given to each day $s$ since the first registered case. We use a simple inverse weight procedure that gives larger weights to earlier dates. We choose the last date $d_i^*$ to be the last day at which the daily government response index $C_i(t)$ is at its maximum value.
The data for the dependent variables are taken from the Our World in Data coronavirus tracker. The tracker offers daily coverage of country coronavirus statistics, collecting data mainly from the European Centre for Disease Prevention and Control. Because national aggregates often lag behind the data of regional and local health departments, an important part of the data collection process consists of utilizing thousands of daily reports released by local authorities. The results were produced with data gathered on 13 November 2020. The data used for measuring the possible health, social and economic correlates are gathered from 9 different sources. In particular, the collection is as follows: 19 variables are from the World Bank's World Development Indicators (WDI), 2 variables are from the Our World in Data database, and there is 1 variable each from the World Bank's Environmental, social and governance data (ESG), the Worldometers database (WM), Data For Good (DFG), the State of Global Air (SGA), the Global footprint network (GFN), the United Nations (UN) database and Google. Six of the potential correlates were constructed by deriving our own index with data taken from the described sources. The construction procedures for each of these variables are described in the following subsection.
The full list of sources together with links to their websites is given in Table S1. The data used in the analysis are available at https://github.com/pero-jolak/coronavirus-socio-economic-determinants. To reduce the noise in the data, we use only data for countries with a population above 1 million. In addition, we only use countries for which there is data on all of the potential correlates. Table S2 gives the countries for which all of these data were available. Altogether, we end up with data on 28 variables and 105 countries. Table S3 reports the summary statistics of each variable. We hereby point out that, as a measure of each correlate, the log of the last observed value is taken (the value in 2019), unless otherwise stated in Table S3. This prevents the possible problem of endogenous independent variables in the specification of the regression. In Fig. S1 we plot the correlation matrix between the potential correlates. It can be observed that, in general, the correlation between the variables is large. Out of 378 variable pairs, 102 have a correlation that is either below -0.6 or above 0.6.
Figure S1: Correlation matrix.
Medical resources index: The Medical resources index is estimated as a Principal Component Analysis (PCA) weighted index of the logs of three variables [98]. These are:
- Physicians (per 1,000 people).
- Nurses and midwives (per 1,000 people).
- Hospital beds (per 1,000 people).
Non-natural causes mortality index: The Non-natural causes mortality index is calculated as a Principal Component Analysis (PCA) weighted index of the logs of these four variables found in WDI:
- Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population).
- Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total).
- Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70, female (%).
- Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (per 100,000 population).
Immunization index: The Immunization index is estimated as a Principal Component Analysis (PCA) weighted index of the logs of two variables:
- Immunization, DPT (% of children ages 12-23 months).
- Immunization, measles (% of children ages 12-23 months).
The index describing the burden of disease is a weighted index of thirteen individual measures of the burden of disease, each measured by a metric called 'Disability Adjusted Life Years' (DALYs). These are:
- Neglected tropical diseases and malaria.
- Maternal disorders.
- Neonatal disorders.
- Nutritional deficiencies.
- Neoplasms.
- Cardiovascular diseases.
- Chronic respiratory diseases.
- Cirrhosis and other chronic liver diseases.
- Digestive diseases.
- Neurological disorders.
- Mental and substance use disorders.
- Musculoskeletal disorders.
- Other non-communicable diseases.
The social connectedness measure is based on the Social Connectedness Index (SCI), a measure of the magnitude of Facebook connections between pairs of countries $i$ and $j$. Formally, the $ij$-th index is estimated as
$$\mathrm{SCI}_{ij} = \frac{\text{FB Connections}_{ij}}{\text{FB Users}_i \times \text{FB Users}_j},$$
where $\text{FB Connections}_{ij}$ is the total number of Facebook connections between $i$ and $j$ and $\text{FB Users}_l$ is the number of Facebook users in country $l$. Combining all pairs results in an $N \times N$ matrix. We transform it into a single-country measure by estimating the log of the PageRank (eigenvector centrality) of each country in the original SCI matrix [99].
The index describing access to information and communication technologies is estimated as a Principal Component Analysis (PCA) weighted index of the logs of four variables:
- Individuals using the Internet (% of population).
- Fixed broadband subscriptions (per 100 people).
- Fixed telephone subscriptions (per 100 people).
- Mobile cellular subscriptions (per 100 people).
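The PCA weighted indexes listed above can be illustrated with a minimal sketch. It assumes that each index is the score on the first principal component of the standardised logs of its component variables, which is one common reading of the procedure in [98] and not necessarily the exact construction used here. The function name `pca_index` and the input values are hypothetical.

```python
import math


def pca_index(rows):
    """Return (weights, scores) of the first principal component.

    `rows` holds one list of strictly positive values per country,
    e.g. [physicians, nurses, hospital beds]. Assumption: the index
    is the first-component score of the standardised logs.
    """
    n, k = len(rows), len(rows[0])
    logs = [[math.log(v) for v in row] for row in rows]

    # Standardise each column to zero mean and unit variance.
    means = [sum(r[j] for r in logs) / n for j in range(k)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in logs) / (n - 1))
        for j in range(k)
    ]
    z = [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in logs]

    # Correlation matrix of the standardised columns.
    corr = [
        [sum(r[a] * r[b] for r in z) / (n - 1) for b in range(k)]
        for a in range(k)
    ]

    # Power iteration for the leading eigenvector (the PCA weights).
    w = [1.0 / (j + 1) for j in range(k)]
    for _ in range(1000):
        nxt = [sum(corr[a][b] * w[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        nxt = [x / norm for x in nxt]
        converged = max(abs(x - y) for x, y in zip(nxt, w)) < 1e-12
        w = nxt
        if converged:
            break

    # Fix the arbitrary sign so that the weights sum to a positive number.
    if sum(w) < 0:
        w = [-x for x in w]

    scores = [sum(zi * wi for zi, wi in zip(r, w)) for r in z]
    return w, scores


# Hypothetical values for four countries (not taken from the data set).
example = [[1.2, 3.0, 2.5], [2.8, 7.9, 4.1], [0.4, 1.1, 0.9], [3.9, 11.5, 6.0]]
weights, scores = pca_index(example)
```

The sketch avoids external libraries on purpose; in practice a linear algebra routine would replace the power iteration. Columns with zero variance are not handled.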
We specify our linear regression model $M_m$ for the final number of COVID-19 infections/death rates of the first wave as
$$y_i = \beta_0 + \gamma s_i + \delta d_i + X_i^m \beta^m + u_i,$$
where, for simplicity, we denote both the log of registered COVID-19 infections per million population and the log of COVID-19 deaths per million population of country $i$ as $y_i$. In the equation, $X_i^m$ is a $k_m$-dimensional vector of health, social and economic explanatory variables that determine the dependent variable, $\beta^m$ is the vector describing their marginal contributions, $\beta_0$ is the intercept of the regression, and $u_i$ is the error term. The $s_i$ term controls for the impact of government responses, and $\gamma$ is its coefficient. Finally, we also include the term $d_i$, with $\delta$ capturing its marginal effect, which measures the duration of the pandemic within the economy. This allows us to control for the possibility that the countries are in a different state of the disease spreading process.
BMA leverages Bayesian statistics to account for model uncertainty by estimating each possible model $M_m$, and thus evaluating the posterior distribution of each parameter value and the probability that a particular model is the correct one [25]. More precisely, in BMA, the posterior distribution of the parameters, $g(\beta^m \mid y, M_m)$, is calculated for each model $M_m$, and the resulting estimates are averaged over the model space with the posterior model probabilities as weights, which gives the posterior mean of each coefficient (S10). Since the posterior mean is a point estimate of the average marginal contribution, we use it as our measure of the effect of the correlate on the COVID-19 impact. Another interesting statistic is the posterior inclusion probability $\mathrm{PIP}_h$ of a variable $h$, which measures the posterior probability that the variable is included in the 'true' model.
Mathematically, $\mathrm{PIP}_h$ is defined as the sum of the posterior model probabilities for all of the models that include the variable:
$$\mathrm{PIP}_h = \sum_{m:\, h \in M_m} P(M_m \mid y). \quad \text{(S11)}$$
Posterior inclusion probabilities offer a more robust way of determining the effect of a variable in a model than p-values for the statistical significance of a model coefficient, because they incorporate the uncertainty of model selection. According to equations (S5) and (S6), it is clear that we need to specify priors for the parameters of each model and for the model probability itself. To keep the model simple and easy to implement, we use the most commonly implemented priors. In other words, for the parameter space we elicit a prior on the error variance that is proportional to its inverse, $p(\sigma^2) \propto 1/\sigma^2$, and a uniform distribution on the intercept, $p(\beta_0) \propto 1$, while Zellner's g-prior is used for the $\beta^m$ parameters; for the model space we utilise the Beta-Binomial prior. To estimate the posterior parameters we use a Markov Chain Monte Carlo (MCMC) sampler, and report results from a run with 200 million recorded drawings after a burn-in of 100 million discarded drawings. Finally, before we perform the inference, the data for each variable are transformed into z-scores in order to normalize the measuring unit. The theoretical background behind our setup can be found in Refs. [25, 100-102].
As said in the main text, we check the robustness of our results against the presence of outliers by removing one country from the sample and re-performing the BMA procedure with the remaining countries. We repeat this procedure for every country and recover the median results for each potential correlate. The results can be seen in Fig. S2. They are nearly identical to the ones presented in the main text, thus suggesting that our results are robust to outliers. Table S4 outlines the countries which had the biggest impact on the observed credibility of a given correlate.
We define two types of countries, i) the weakest contributor, this is the country which when excluded from the sample leads to the largest PIP for the studied correlate; and ii) the strongest contributor -i.e., the country which when excluded we observe the lowest PIP for the studied correlate. We find numerous countries which can be significant contributors for each correlate, thus indicating that there is indeed heterogeneity in the socio-economic features of the countries. In this robustness check, we change the end date of the pandemic to be equal to the first date after the day at which the daily government response index is at its maximum and that is at least 20% lower than the daily maximum. This effectively prolongs the duration of the first wave. The results are shown in Fig. S3 . In this case, it appears that there are more variables that are either strong or medium correlates of the COVID-19 infections/death rates. Nonetheless, the variables which were found in the main results, persist in being correlates with strong or medium evidence. In this check, we change the dependent variable to be the raw number of infections and deaths at the end of the first wave. That is, now the dependent variable describes counts and the linear regression framework is not a suitable model. Instead, for the estimation of the marginal impact we use a quasi-Poisson model. This is the most often used procedure when the dependent variable is given as a count that has a large variance [90] . Indeed, the number of COVID-19 infections or deaths has a variance larger than its mean. This is apparently due to the disparate effect the pandemic had throughout the world. The results can be seen in Fig. S4 . Again, the variables that were found to be strong correlates with the COVID-19 infections and mortality rates, remain strong correlates even in this specification. Thus, it can be concluded that our results are robust to a different model specification. 
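The overdispersion argument behind the quasi-Poisson check can be made concrete with a small sketch. A quasi-Poisson model keeps the Poisson mean structure but lets the variance equal a dispersion parameter times the mean; the coefficient estimates coincide with the Poisson ones, and the standard errors are scaled by the square root of the dispersion. The sketch below estimates the Pearson dispersion for an intercept-only model, which is an assumption made purely for brevity (the specification in the text includes the correlates). The function names and example counts are hypothetical.

```python
def pearson_dispersion(counts):
    """Pearson dispersion estimate for an intercept-only Poisson model.

    Values well above 1 indicate overdispersion, i.e. a variance
    larger than the mean.
    """
    n = len(counts)
    mean = sum(counts) / n  # fitted Poisson mean, identical for every observation
    pearson_chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return pearson_chi2 / (n - 1)  # n observations minus one estimated parameter


def quasi_poisson_se(poisson_se, dispersion):
    """Scale a Poisson standard error by the square root of the dispersion."""
    return poisson_se * dispersion ** 0.5


# Hypothetical case counts for five countries (not taken from the data set).
example_counts = [120, 4500, 38, 98000, 760]
phi = pearson_dispersion(example_counts)
```

With counts as disparate as in this hypothetical example, the estimated dispersion is far above one, which is the situation in which quasi-Poisson standard errors are preferred over plain Poisson ones.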
In the last check we add a spatial weighting matrix to the baseline model in order to account for the potential spatial autocorrelation in the spread of COVID-19, using a spatial autoregressive (SAR) model. Multiple studies have indicated that this effect might exist (see, for example, [91]). Again, our findings do not significantly change. The SAR model which we use takes the following matrix form
$$y = \rho W y + \beta_0 \mathbf{1}_n + \gamma s + \delta d + u,$$
where $y$ is the $n$-dimensional vector of infection or mortality rates, $s$ and $d$ contain the baseline independent variables, $\mathbf{1}_n$ is the $n$-dimensional vector of ones, $u$ is the vector of error terms, and $W$ is a known row-standardized spatial weight distance matrix between the studied countries. The parameter $\rho$ is the coefficient on the spatially lagged dependent variable, $Wy$. The spatial weight matrix, $W$, is an $n \times n$ stochastic matrix, where $n$ is the number of countries, with element $w_{ij}$ defining the spatial relation between locations $i$ and $j$. This matrix is constructed through the following steps:
1. Gather data for the latitude and longitude of each country from Google Developers.
2. Calculate the Haversine distance $D_{ij}$ between each pair of countries $i$ and $j$ using the data. This procedure allows us to determine the great-circle distance between any two countries on a sphere given their longitudes and latitudes (see Ref. [103]).
3. Construct a distance matrix, $D = [D_{ij}]$, between the countries using the estimations from the previous step.
4. Row-standardize the distance matrix to obtain the spatial weight matrix, $W$. That is, the $ij$-th entry of $W$ is $w_{ij} = D_{ij} / \sum_{k} D_{ik}$, so that each row of $W$ sums to one.
The baseline SAR results are presented in Table S5. We observe that the spatial autocorrelation coefficient estimate for the SAR model is negative and statistically significant when the dependent variable is the infection rate, indicating the presence of spatial autocorrelation in the regression relationship. The coefficient remains negative when the dependent variable is the mortality rate, though it loses its significance.
Table S5: SAR results.
* indicates significance at α = 0.05.
The results for the BMA after implementing SAR as a baseline model can be seen in Fig. S5. They are nearly identical to the ones presented in the main text, thus suggesting that our results are robust after accounting for spatial autocorrelation.
To construct the coronavirus correlates Jointness Space we utilize a network approach. In this network, the nodes represent the potential health, social and economic correlates, whereas the edge between a pair of correlates is given by a jointness measure of the posterior probability that the pair is included in the same model explaining the COVID-19 infections/mortality rates. As a jointness measure we utilize the Hofmarcher et al. jointness test. This test is a regularised version of the well-known Yule's Q association coefficient and is derived from an augmented contingency table of variable inclusion. The table allows us to avoid the problems that arise due to zero counts [93]. The test statistic $J_{hk}$ between variables $h$ and $k$ is calculated as
$$J_{hk} = \frac{(a + \tfrac{1}{2})(d + \tfrac{1}{2}) - (b + \tfrac{1}{2})(c + \tfrac{1}{2})}{(a + \tfrac{1}{2})(d + \tfrac{1}{2}) + (b + \tfrac{1}{2})(c + \tfrac{1}{2})}, \quad \text{(S13)}$$
where $a$, $b$, $c$ and $d$ are the empirical counts of the MCMC drawings in which, respectively, $h$ and $k$ are included together; $h$ is included and $k$ is excluded; $h$ is not included and $k$ is included; and both $h$ and $k$ are excluded. The main advantage of this test over other jointness measures is that it is appropriately defined as long as one of the studied variables is included in the true model with positive probability. Moreover, it is monotonic, with larger values implying that the two variables are complements; it is commutative, i.e., $J_{hk} = J_{kh}$; it is bounded between −1 and 1; and it has an adequate limiting behavior. To visualize the resulting network we use only the positive links (those that are greater than 0). To set the coordinates of each node we use the Force-Layout drawing algorithm.
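As an illustration of the jointness statistic in equation (S13) and of the positive-link rule used for the network, the following minimal sketch computes the statistic from a list of sampled models. The representation of MCMC drawings as Python sets, the function names and the variable labels are assumptions made for this example only; they are not taken from the original implementation.

```python
from itertools import combinations


def jointness(draws, h, k):
    """Regularised Yule's Q of equation (S13) for variables h and k.

    `draws` is a list of sets; each set holds the variables included
    in the model visited at one MCMC drawing.
    """
    a = sum(1 for m in draws if h in m and k in m)          # both included
    b = sum(1 for m in draws if h in m and k not in m)      # h only
    c = sum(1 for m in draws if h not in m and k in m)      # k only
    d = sum(1 for m in draws if h not in m and k not in m)  # neither
    joint = (a + 0.5) * (d + 0.5)
    disjoint = (b + 0.5) * (c + 0.5)
    return (joint - disjoint) / (joint + disjoint)


def positive_edges(draws, variables):
    """Weighted edges of the network, keeping only positive jointness."""
    edges = {}
    for h, k in combinations(variables, 2):
        j = jointness(draws, h, k)
        if j > 0:
            edges[(h, k)] = j
    return edges


# Hypothetical drawings: two variables always appear together,
# while a third one appears only when they are absent.
example_draws = [
    {"overweight", "density"},
    {"overweight", "density"},
    {"elderly"},
    {"elderly"},
]
example_edges = positive_edges(example_draws, ["overweight", "density", "elderly"])
```

In this toy input the pair that is always included or excluded together receives a jointness close to 1 and is kept as an edge, whereas the pairs that never appear together receive a value close to −1 and are dropped. The added one-half terms keep the statistic defined even when some of the four counts are zero.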
Covid-19: how doctors and healthcare systems are tackling coronavirus worldwide Can developing countries alone face corona virus? an iraqi situation. Public Health in Practice Opinion: Sustainable development must account for pandemic risk Exposure to air pollution and covid-19 mortality in the united states. medRxiv Bayesian model averaging for linear regression models Bayesian model averaging: a tutorial Determinants of long-term growth: A bayesian averaging of classical estimates (bace) approach Bayesian model averaging: A systematic review and conceptual classification Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study Early dynamics of transmission and control of covid-19: a mathematical modelling study The mathematical theory of infectious diseases and its applications. Charles Griffin & Company Ltd, 5a Crendon Street, High Wycombe, Bucks HP13 6LE Further notes on the basic reproduction number Modeling infectious diseases in humans and animals Contacts in context: large-scale setting-specific social mixing matrices from the bbc pandemic project. medRxiv The obesity epidemic in the united states-gender, age, socioeconomic, racial/ethnic, and geographic characteristics: a systematic review and metaregression analysis Germs, social networks and growth A systematic review of population based epidemiological studies in myasthenia gravis Model averaging in economics: An overview Determinants of economic growth: a bayesian panel data approach Robust determinants of companies' capacity to innovate: a bayesian model averaging approach The determinants of output losses during the covid-19 pandemic Variation in government responses to covid-19. 
Blavatnik school of government working paper Socio-economic determinants of hiv/aids pandemic and nations efficiencies Pandemic influenza and critical infrastructure dependencies: possible impact on hospitals Seasonal and pandemic influenza preparedness: a global threat Preparedness for highly pathogenic avian influenza pandemic in africa Relationship between equipment and infrastructure for pandemic influenza and performance in an avian flu drill Mitigating absenteeism in hospital workers during a pandemic. Disaster medicine and public health preparedness Major issues and challenges of influenza pandemic preparedness in developing countries Maternal health care in the time of ebola: a mixed-method exploration of the impact of the epidemic on delivery services in monrovia Predictive power of air travel and socio-economic data for early pandemic spread Health inequalities and infectious disease epidemics: a challenge for global health security Ahmad Reza Hosseinpoor, and Ties Boerma. Monitoring universal health coverage within the sustainable development goals: development and baseline data for an index of essential health services Social determinants of health inequalities. 
The lancet Modelling control measures to reduce the impact of pandemic influenza among schoolchildren The scourge of asian flu in utero exposure to pandemic influenza and the development of a cohort of british children The epidemiology and clinical impact of pandemic influenza The global burden of diabetes and its complications: an emerging pandemic Diabetes and the severity of pandemic influenza a (h1n1) infection Communicable diseases in complex emergencies: impact and challenges Asthma and covid-19 Modification of the risk of mortality from pneumonia with oral hygiene care Oral hygiene reduces the mortality from aspiration pneumonia in frail elders Health, nutrition, and economic development Health and economic growth: findings and policy implications Macroeconomics and health: investing in health for economic development. World Health Organization When does improving health raise gdp? NBER macroeconomics annual Hiv/aids and labor force upgrading in tanzania The effects of employment on influenza rates The changing relation between mortality and level of economic development Developing a comprehensive time series of gdp per capita for 210 countries from 1950 to 2015 The 'heart kuznets curve'? understanding the relations between economic development and cardiac conditions The effect of economic recession on population health The reversal of the relation between economic growth and health progress: Sweden in the 19th and 20th centuries The spirit level. Why equality is better for everyone The reversal of fortunes: trends in county mortality and cross-county mortality disparities in the united states Towards an epidemiological understanding of the effects of long-term institutional changes on population health: a case study of canada versus the usa Income inequality and health: pathways and mechanisms. 
Health services research Income inequality: When wealth determines health: Earnings influential as lifelong social determinant of health Social capital: Measurement and consequences Does "community social capital" contribute to population health? Social science & medicine A comparative analysis of the validity of us state-and countylevel social capital measures and their associations with population health The education effect on population health: a reassessment. Population and development review Socioeconomic inequalities in health in 22 european countries Estimating the impact of school closure on social mixing behaviour and the transmission of close contact infections in eight european countries Social contacts and mixing patterns relevant to the spread of infectious diseases What types of contacts are important for the spread of infections? using contact survey data to explore european mixing patterns Projecting social contact matrices in 152 countries using contact surveys and demographic data Using data on social contacts to estimate age-specific transmission parameters for respiratory-spread infectious agents The spanish influenza pandemic in occidental europe (1918-1920) and victim age. Influenza and other respiratory viruses Trends in infectious disease mortality in the united states during the 20th century The impact of the aids epidemic on the health of older persons in northwestern tanzania Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys The contribution of social behaviour to the transmission of influenza a in a human population Do respiratory epidemics confound the association between air pollution and daily deaths? 
Pollution, infectious disease, and mortality: evidence from the 1918 spanish influenza pandemic Climate and the spread of covid-19 Obesity in patients younger than 60 years is a risk factor for covid-19 hospital admission Obesity a risk factor for severe covid-19 infection: multiple potential mechanisms Obesity and impaired metabolic health in patients with covid-19 Social connectedness: Measurement, determinants, and effects The geographic spread of covid-19 correlates with structure of social networks as measured by facebook Quasi-poisson vs. negative binomial regression: how should we model overdispersed count data? The spatial econometrics of the coronavirus pandemic Jointness of growth determinants Bivariate jointness measures in bayesian model averaging: solving the conundrum Can dimensions of national culture predict cross-national differences in medical communication? 2015 international profiles of health care systems. Canadian Agency for Drugs and Technologies in Health Electoral incentives to combat mosquito-borne illnesses: Experimental evidence from brazil Thresholds for virus spread on networks Constructing socio-economic status indices: how to use principal components analysis. Health policy and planning Some unique properties of eigenvector centrality Model uncertainty in cross-country growth regressions Benchmark priors for bayesian model averaging On the effect of prior assumptions in bayesian model averaging with applications to growth regression The cosine-haversine formula