key: cord-0744626-6frof9sx authors: Charnley, G. E. C.; Yennan, S.; Ochu, C.; Kelman, I.; Gaythorpe, K. A. M.; Murray, K. A. title: Investigating the impact of social and environmental extremes on cholera time varying reproduction number in Nigeria date: 2022-03-21 journal: nan DOI: 10.1101/2022.03.21.22272693 sha: 93ffd73f89a57608a241723edc593f5a5a8f39c7 doc_id: 744626 cord_uid: 6frof9sx

Cholera is reported as endemic in more than 50 countries, many of which are in sub-Saharan Africa. Nigeria currently reports the second highest number of cases, with several risk factors potentially contributing to this, including poverty, water, sanitation and climate. Enteric pathogens have a significant global burden, especially on children and those most vulnerable. Despite this, attention is often drawn away from these diseases, most recently to Ebola and COVID-19. To address the need for more research and focus on cholera, a covariate selection process and machine learning were used. Data for environmental (floods, droughts) and social (conflicts) extremes, along with pre-existing social vulnerabilities, were fit to the time-varying reproduction number (R) in Nigeria. We analysed this both spatially and temporally and used it to create a traffic-light system for cholera transmission, highlighting potential thresholds and triggers for outbreaks. Improved access to sanitation, number of monthly conflict events, Multidimensional Poverty Index and Palmer Drought Severity Index were retained in the best fit model. Varying exposure periods showed that those living in decreased poverty, with more access to sanitation, were not as vulnerable to changes, and that these conditions offset some of the cholera risk caused by extremes. The work presented here shows the need to address these pre-existing vulnerabilities and to pursue sustainable development for disaster prevention and mitigation and to improve health and quality of life.
The results can be used by a wide range of scientific disciplines and organisations to reduce cholera risk in fragile settings.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted March 21, 2022.

In Nigeria, there were 837 and 564 cholera cases for 2018 and 2019, respectively. These were confirmed by either rapid diagnostic tests or culture. The confirmed cases are also shown spatially in an accompanying figure.

Stepping through the model possibilities using hierarchical stepwise analysis, the best fit model was found according to root-mean-square error (RMSE), R² and correlations between the incidence-based and covariate-based R values. The model included number of monthly conflict events, Multidimensional Poverty Index (MPI), Palmer Drought Severity Index (PDSI) and improved access to sanitation, fitted to R values with a serial interval of 5 days (standard deviation: 8 days). The fit of the incidence-based vs covariate-based R values is shown in Fig. 4, along with the measures of model performance used in the hierarchical analysis.

Using the best fit model, R was predicted for the remaining 31 states which did not meet the 40-case threshold and for the remaining dates for the six states which were included. This created estimates of R for all 37 states on a monthly temporal scale for 2018 and 2019.
The predictions suggest that the model fits well and accurately predicts R, as the higher R values are in areas with known elevated cholera burden (northern and northeastern regions) and in the states which only marginally fell below the threshold for R calculations (Fig. 5).

Figure legend (fragment): [...] the five states which met the threshold of 40 or more cases. Covariate-based (purple): the 31 states which did not meet the threshold and had R predicted using the best fit model. State label colour shows which states had an average R of R ≥ 1 (black) and R < 1 (orange).

The 48 historical exposure periods by month and R threshold are shown in [...] ≥ 1 and R < 1, respectively. In contrast, monthly conflict events and PDSI show a less defined relationship, with some months showing a positive and others a negative association. Using the historical median and standard error values for the four covariates as a starting point, three hypothetical exposure periods were created: red (R ≥ 1 values), amber (mid-point between green and red) and green (R < 1 values). R appeared to rise above 1 when improved sanitation access was around 50% or lower and when MPI values were above 0.32. For PDSI and conflict, R values increased above 1 at a PDSI of around -1.1 and at around 1.6 monthly conflict events (Fig. 7).
[Figure panels: Monthly Conflict, MPI; x-axis: Jan 2018 to Jan 2020]

The less defined relationship of conflict and PDSI with R meant further investigation was needed to understand possible spatial differences. To investigate this potential spatial heterogeneity, the data were split for each R threshold by state for the full dataset (2018-2019) (Fig. 8). This showed significant spatial heterogeneity between the two thresholds by state for monthly conflict events and PDSI. Borno and Kaduna were investigated further for their clear relationship with conflict (more conflict = more cholera transmission). Kwara and Nasarawa were analysed further because of the relationship between extreme dryness and higher R values, and Ekiti and Lagos because of the relationship between extreme wetness and higher R values.

[...] access). For Borno, raising monthly conflict events from 1 to 2 increased R above 1, but an increase in access to sanitation from 41% to 46% pushed the R value back below 1. This relationship continued in a stepwise pattern, and in a similar way for MPI but to a lesser degree. This showed that increasing sanitation, and therefore decreasing vulnerability, allowed the states to adapt to increasing conflict and keep the R value below 1 (see Supplementary Figure 2).
For the four states used to investigate the relationship of extreme wetness (Lagos and Ekiti) and extreme dryness (Nasarawa and Kwara) with R values, the analysis and subsequent PDSI hypothetical exposure periods yielded similar results. All four states had low predicted R values and higher ranges (Supplemental Figs. 3 & 4). A potential explanation is the high variable importance of PDSI (Fig. 3), together with the high levels of sanitation and low levels of poverty in all four states, which contribute to overall lower levels of cholera. The model was therefore detecting a signal in only small changes in PDSI, which resulted in changes in R values that were not detected in other states with higher rates of poverty and lower levels of sanitation access. It also helps to highlight the multi-directionality of the relationship between PDSI and cholera transmission, with both extreme wetness and extreme dryness causing increases in R.

Using the best fit model, nowcasting was used to calculate the R values for the remaining thirty-one states which did not meet the threshold. Both historical and hypothetical exposure periods helped to shed light on the thresholds and triggers for raising R values above 1 in Nigeria. MPI and sanitation showed a well-defined relationship with R, with consistently higher access to sanitation and less poverty when R was less than 1. Thresholds which pushed R above 1 included decreasing access to sanitation below 50% and increasing the MPI above 0.32. In contrast, the relationship of R with conflict events and PDSI appeared to vary spatially, with some states showing a negative and some states a positive association.
For these two covariates, the effect on R was largely dependent on the access to sanitation and poverty within the states, with high levels of sanitation and low poverty resulting in a decreased effect of PDSI and conflict. This showed that better sustainable development in the state acted as a buffer to social and environmental extremes and allowed people to adapt to these events better, due to less pre-existing vulnerability. Measuring poverty in monetary terms alone can create issues, because poverty also acts through the risk factors stated; this is an advantage of using the MPI as a poverty indicator. Nigeria's cash transfer scheme has allowed many Nigerians to meet the household income limit for poverty, but there is a case for turning these funds and attention onto structural reform [36]. Nigeria's nationwide average access to sanitation is around 25%; using these funds to increase access to sanitation may therefore significantly improve health [37]. Here we show the need for expansion of sanitation to reduce cholera risks and the shocks of extremes on its transmission. In a recent review on the implementation of non-pharmaceutical cholera interventions, there was generally a high acceptance of several WASH interventions. Despite this, education was key, and building community relationships is needed to achieve this, such as understanding cultural differences and barriers [38]. This is especially important in areas with conflict, where trust between the government and residents [...]

PDSI and several of the other drought indices tested here showed high variable importance; this resulted in only small changes altering R values in some states. When analysing spatial differences between R and PDSI, the relationship appears to be multi-directional.
Furthermore, access to sanitation and poverty were important in how PDSI impacted R, similar to the impacts of conflict. There is significant evidence to show that both droughts [19, 26] [...]

Despite adapting the methodology to account for this, a potential limitation may be lagged effects of the covariates on cholera [46, 47]. Both long-term and short-term changes to the population may take time before changes in cholera transmission are evident. In addition, disasters may be considered slow-onset or rapid-onset, so defining their beginning is subjective. Despite this, the incubation period of cholera is short (<2 hours to 5 days) and previous research has suggested that acute shocks cause increases in cholera cases within the first week of the event [48, 50]. Calculating R on monthly sliding windows and using monthly covariate data helped to reduce potential lagged effects on the R values, which would be captured if the one-week lag estimate is applicable here. Although beyond the scope of the research presented here, the impacts of different lagged periods for several of these covariates on cholera outbreaks is an essential area of future research.

Cholera is considered an under-reported disease, and because many infections are asymptomatic, many cases are likely to be missed. There are also incentives not to report cholera cases, due to travel restrictions and isolations and implications for trade and tourism [51]. However, the robust reporting system in Nigeria suggests that the data used here are the best available for analysis. During times of crisis, by contrast, cholera may be over-reported, or reporting may more accurately represent the cholera burden in the area.
This is due to the presence of cholera treatment centres and external assistance from humanitarian aid and non-governmental organisations, detecting cases that may have been missed [...] shows the importance of doing so to gain a more accurate understanding of disease outbreaks in complex emergencies. Nigeria is currently working towards its ambitious goal of lifting 100 million people out of poverty by 2030 [36]. If it is successful, this could significantly improve health, increase quality of life and decrease the risks of social and environmental extremes.

Cholera data were obtained from NCDC and contained linelist data for 2018 and 2019. The data were age and sex-disaggregated, on a daily temporal scale and to administrative level 4. The data also provided information on the outcome of infection and whether the patient was hospitalised. The data were subset to only include cases which were confirmed either by rapid diagnostic tests or by laboratory culture.

A range of covariates were investigated based on previously understood cholera risk factors. Covariates included conflict (monthly, daily) [53], drought (Palmer Drought Severity Index, Standardised Precipitation Index) [54, 55], IDPs (households, individuals) [56], WASH (improved drinking water, piped water, improved sanitation, open defecation) [57], healthcare (total facilities, facilities per 100,000) [53], population (total, density) [58] and poverty (MPI, headcount ratio in poverty, intensity of deprivation among the poor) [53]. The covariate data were on a range of spatial and temporal scales. Administrative level one (state) was therefore set as the spatial granularity (data on a finer spatial scale were attributed to the upper level) and the smallest temporal scale possible was used for covariate selection (repeating values if data were not available to the lower level).
The datasets and methods used here were approved by Imperial College Research Ethics Committee and are covered by a data sharing agreement between NCDC and the authors.

The 2018 and 2019 positive linelist data were used to calculate incidence. Incidence was calculated on a daily scale by taking the sum of the data entry points by state and date of onset of symptoms. This created a new dataset with a list of dates and corresponding incidence for each state. All analysis was completed in R Studio version 4.1.0 (packages "incidence" [59] and "EpiEstim" [60]).

Using the incidence data, R was calculated with the parametric serial interval method, which uses the mean and the standard deviation of the serial interval (SI). The SI for cholera is well-documented and there are several estimates in the literature [61-63]. The parametric method was used (vs the non-parametric, which uses a discrete distribution), as assumptions can be made here about data distribution and parameters. The SI is the time from illness onset in the primary case to onset in the secondary case and therefore impacts the evolution of the epidemic and the speed of transmission. To account for several reported SI values for cholera, a sensitivity analysis was run using serial intervals of 3, 5 and 8 days, each with a standard deviation of 8 days.

Estimating R too early in the epidemic increases error, as R calculations are less accurate when there are fewer incident cases over the time window. One way to understand how much this impacts R values is to use the coefficient of variation (CV), which is a measure of how spread out the dataset values are relative to the mean.
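The incidence step and the CV check described above can be sketched as follows. This is an illustration in Python, not the authors' R code. It assumes the gamma-posterior framework that underlies the EpiEstim method, in which the posterior CV of R for a window is 1 / sqrt(a + cases in window), with a the prior shape parameter; a = 1 is assumed here. The linelist records and field names are hypothetical.

```python
from collections import Counter
from datetime import date
from math import sqrt


def daily_incidence(linelist):
    """Count confirmed cases per (state, date of symptom onset)."""
    return Counter((case["state"], case["onset"]) for case in linelist)


def posterior_cv(cases_in_window, prior_shape=1.0):
    """Posterior CV of R under a gamma prior (assumed prior shape of 1).

    The posterior shape is prior_shape + cases, and the CV of a gamma
    distribution is 1 / sqrt(shape), so more cases give a lower CV.
    """
    return 1.0 / sqrt(prior_shape + cases_in_window)


# Hypothetical toy linelist: one record per confirmed case.
linelist = [
    {"state": "Borno", "onset": date(2018, 9, 1)},
    {"state": "Borno", "onset": date(2018, 9, 1)},
    {"state": "Borno", "onset": date(2018, 9, 2)},
    {"state": "Kaduna", "onset": date(2018, 9, 1)},
]
incidence = daily_incidence(linelist)
borno_cases = sum(n for (state, _), n in incidence.items() if state == "Borno")
print(borno_cases, round(posterior_cv(borno_cases), 3))
```

Under these assumptions the CV depends only on the number of cases in the window, which is why sparse windows give unstable estimates of R.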
The lower the CV, the lower the degree of variation in the data; the posterior CV threshold was set to 0.3 (or less) as standard, based on previous work [60]. To reach the CV threshold, the calculation start date for each state was altered until the threshold was met. States with <40 cases were removed, as states with fewer cases did not have high enough incidence across the time window to reach the CV threshold. Additionally, the R values were calculated over monthly sliding windows, to ensure sufficient cases in the time window. Daily and two-week sliding windows did not have incidence values sufficient to reach the posterior CV threshold. Monthly data were the most common temporal data granularity used in the model and many of the covariates could have potential lag effects on cholera. Several of the tested covariates may not have impacted cholera immediately, and using monthly data and calculating R over monthly sliding windows helped to account for some of this uncertainty.

This preprint is made available under a CC-BY-ND 4.0 International license.

The covariates listed above (conflict, drought, IDPs, WASH, healthcare, population and poverty) were run through a covariate selection process [64, 65]. The selection process removes covariates not significantly associated with the outcome variable (R) and clusters the remaining covariates based on the degree of correlation between them. The threshold for clustering was set to an absolute pairwise correlation of above 0.75 and the aim was to reduce multicollinearity in the final model.
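The clustering step can be illustrated with a minimal Python sketch. The text does not specify the exact clustering procedure, so this assumes the simplest reading: any two covariates whose absolute Pearson correlation exceeds 0.75 are placed in the same cluster (single linkage). The covariate names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_covariates(data, threshold=0.75):
    """Group covariates linked by |r| > threshold (assumed single linkage)."""
    names = list(data)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(data[a], data[b])) > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical state-level values for three covariates.
covariates = {
    "improved_sanitation": [20, 35, 50, 65, 80],
    "open_defecation": [60, 48, 33, 20, 8],
    "pdsi": [0.5, -1.2, 0.3, -0.8, 1.0],
}
clusters = cluster_covariates(covariates)
print(clusters)
```

One covariate per cluster would then be carried forward, so that strongly correlated covariates do not enter the same model.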
R was chosen as the outcome variable rather than incidence (which has fewer implicit assumptions), as it is more descriptive, providing information on the evolution of the epidemic (e.g., R ≥ 1, cases are increasing) and not a single time point of disaster burden. Using the subset list of covariates, the aim was to fit a model which could accurately predict R values under changing conditions.

Supervised machine learning algorithms, such as decision-making algorithms, are now a widely used method. They work by choosing random data points from a training set and building a decision tree to predict the expected value given the attributes of these points. Transparency is increased by allowing the number of trees (estimators), the number of features at each node split and the resampling method to be specified. Random forest then combines the predictions of several decision trees into one model. This makes it more accurate, while also dealing well with interactions and non-linear relationships [66, 67].

Random forest variable importance was tested on all the covariates which were not removed by the covariate selection process. Variable importance is a measure of the cumulative decrease in mean squared error each time a variable is used as a node split in a tree. The error remaining in predictive accuracy after a node split is known as node impurity, and a variable which reduces this impurity is considered more important.

[...] due to the continuous outcome variable. The parameters for training were set to repeated cross-validation for the resampling method, with ten resampling iterations and five complete sets of folds.
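The node-impurity idea behind variable importance can be made concrete with a small Python sketch. It assumes mean squared error as the impurity measure, which is the usual choice for regression trees, and scores a single candidate split. A random forest would accumulate such decreases over all splits and trees, which is not shown. The sanitation and R values are hypothetical.

```python
def mse(values):
    """Mean squared error of values around their own mean (node impurity)."""
    if not values:
        return 0.0
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values) / len(values)


def impurity_decrease(x, y, split_value):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [yi for xi, yi in zip(x, y) if xi <= split_value]
    right = [yi for xi, yi in zip(x, y) if xi > split_value]
    n = len(y)
    weighted_children = (len(left) / n) * mse(left) + (len(right) / n) * mse(right)
    return mse(y) - weighted_children


# Hypothetical data: improved sanitation access (%) and R for four state-months.
sanitation = [30, 40, 60, 70]
r_values = [1.4, 1.2, 0.8, 0.6]

gain_at_50 = impurity_decrease(sanitation, r_values, 50)
gain_at_30 = impurity_decrease(sanitation, r_values, 30)
print(round(gain_at_50, 4), round(gain_at_30, 4))
```

In this toy case, splitting sanitation at 50% separates the high and low R values cleanly and so removes more impurity than splitting at 30%.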
The model was tuned with the optimal number of predictors at each split set to 2, based on the lowest out-of-bag (OOB) error rate, and the evaluation metric used was RMSE (package "caret" [68]).

A hierarchical stepwise analysis was used to fit the models, taking into consideration the covariate clustering and variable importance. One covariate was selected from each cluster, and different combinations were tested until the best-fit model was found. Models were assessed against each other in terms of predictive accuracy, based upon R² and RMSE. Predictions were then calculated on the testing dataset to compare incidence-based vs covariate-based R values; evaluations were built on multiple metrics including correlation, R² and RMSE.

Although random forest models are accurate and powerful at predicting, they are easily over-fit, and calculating error for the predictions is therefore important. Error was calculated using mean absolute error (MAE),

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - x_i|$

where $y_i$ is the prediction and $x_i$ is the true value, with the total number of data points as $n$.

The best fit model in terms of predictive power, according to the metrics above, was used to predict R for the remaining states which did not have sufficient reported cases to calculate R using incidence or had missing data for certain dates. Data for the best fit model covariates were collected for the states and missing dates from the sources given above. The data for the selected covariates [...]

To understand the conditions needed to raise R values from less than 1 to more than 1, the data were split into exposure periods, to investigate the relationship between the covariates and R in these different time periods. Historical periods were set by splitting the data into monthly periods for 2018 and 2019 and by the R threshold being equal to or more than 1 (R ≥ 1) or less than 1 (R < 1).
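The split into historical exposure periods can be sketched in Python as a grouping by month and R threshold. With 24 months across 2018 and 2019 and two thresholds, this presumably gives the 48 periods referred to in the text. The record structure and field names below are hypothetical.

```python
from collections import defaultdict


def split_exposure_periods(records):
    """Group state-month records by (month, R threshold)."""
    periods = defaultdict(list)
    for rec in records:
        threshold = "R>=1" if rec["R"] >= 1 else "R<1"
        periods[(rec["month"], threshold)].append(rec)
    return dict(periods)


# Hypothetical state-month records with one covariate each.
records = [
    {"state": "Borno", "month": "2018-09", "R": 1.3, "sanitation": 41.0},
    {"state": "Lagos", "month": "2018-09", "R": 0.7, "sanitation": 68.0},
    {"state": "Kaduna", "month": "2018-09", "R": 1.1, "sanitation": 45.0},
]
periods = split_exposure_periods(records)
for key, members in sorted(periods.items()):
    print(key, [m["state"] for m in members])
```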
For each of the 48 historical exposure periods, the mean, standard deviation, median and standard error were calculated for each covariate. This was to understand how the mean and median covariate values changed when R was at or above 1 compared with below 1. The two thresholds were also compared by state to investigate any spatial differences in the historical covariate values.

Hypothetical exposure periods were then created based on these findings, using the median values and standard error for R ≥ 1 and R < 1 as a starting point for the red exposure period and green exposure period, respectively. The amber exposure period was taken as a mid-point between the two. The best fit model was used to predict how this would impact R, and the exposure periods were altered as needed. This provided thresholds and triggers for outbreaks in Nigeria, creating a traffic light system for cholera risk.

To understand spatial differences, six states had additional sub-national analysis: Borno, Kaduna, Nasarawa, Ekiti, Lagos and Kwara. These states were selected based on the relationship of conflict or PDSI with R (clear positive/clear negative relationship). Three hypothetical exposure periods (red, amber and green) for either conflict or PDSI were produced for these states, keeping all the other covariates at the R ≥ 1 mean values. This was to understand spatial differences in these two covariates and understand the threshold needed to push R values [...]
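The summary statistics and the amber mid-point described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code. It assumes the sample standard deviation and defines the standard error as sd / sqrt(n). The sanitation values are hypothetical.

```python
from math import sqrt
from statistics import mean, median, stdev


def summarise(values):
    """Mean, sample standard deviation, median and standard error."""
    sd = stdev(values)
    return {
        "mean": mean(values),
        "sd": sd,
        "median": median(values),
        "se": sd / sqrt(len(values)),
    }


def amber_midpoint(green_value, red_value):
    """Amber exposure value taken as the mid-point of green and red."""
    return (green_value + red_value) / 2


# Hypothetical improved sanitation access (%) in periods with R >= 1 and R < 1.
red_sanitation = [38.0, 42.0, 46.0]
green_sanitation = [60.0, 64.0, 68.0]

red = summarise(red_sanitation)
green = summarise(green_sanitation)
amber = amber_midpoint(green["median"], red["median"])
print(red["median"], amber, green["median"])
```

The three values would then be passed to the fitted model to see which exposure level moves the predicted R across 1.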
Updated global burden of cholera in endemic countries
Inapparent infections and cholera dynamics
Mapping the burden of cholera in sub-Saharan Africa and implications for control: an analysis of data across geographical scales
Descriptive characterization of the 2010 cholera outbreak in Nigeria
The multi-sectorial emergency response to a cholera outbreak in internally displaced persons camps in Borno state
Descriptive epidemiology of a cholera outbreak in Kaduna State
Risk factors associated with cholera outbreak in Bauchi and Gombe States in North East Nigeria
Regional-scale climate-variability synchrony of cholera epidemics in West Africa
Modelling the climatic drivers of cholera dynamics in Northern Nigeria using generalised additive models
Cholera outbreak in a naïve rural community in Northern Nigeria: the importance of hand washing with soap
A large cholera outbreak in Kano City, Nigeria: the importance of hand washing with soap and the danger of street-vended water
United Nations Statistical Division. Millennium Development Goal Indicators
The global burden of cholera
From cholera outbreaks to pandemics: the role of poverty and inequality
A Wake Up Call: Nigeria Water Supply, Sanitation, and Hygiene Poverty Diagnostic
Descriptive epidemiology of cholera outbreak in Nigeria
Exploring relationships between drought and epidemic cholera in Africa using generalised linear models
Using self-controlled case series to understand the relationship between conflict and cholera in Nigeria and the Democratic Republic of Congo
Traits and risk factors of post-disaster infectious disease outbreaks: a systematic review
The exacerbation of Ebola outbreaks by conflict in the Democratic Republic of the Congo
Exploring droughts and floods and their association with cholera outbreaks in sub-Saharan Africa: a register-based ecological study from 1990 to 2010
Environmental factors influencing epidemic cholera
Tackling poverty in multiple dimensions: A proving ground in Nigeria
Is cholera disease associated with poverty
Informal urban settlements and cholera risk in Dar es Salaam
Drought-related cholera outbreaks in Africa and the implications for climate change: a narrative review
Treating cholera in severely malnourished children in the Horn of Africa and Yemen
Health and sustainable development: can we rise to the challenge?
Distribution of impacts of natural disasters across income groups: A case study of New Orleans
Politics of attributing extreme events and disasters to climate change
Feasibility, acceptability, and effectiveness of non-pharmaceutical interventions against infectious diseases among crisis-affected populations: a scoping review
The nature of Nigeria's Boko Haram war
Changes in size of populations and level of conflict since World War II: implications for health and health services
The cholera outbreak in Yemen: lessons learned and way forward
Evaluation of monitoring tools for WASH response in a cholera outbreak in northeast Nigeria
Floods in southern Africa result in cholera outbreak and displacement
Cholera outbreak and spread in Ebonyi state
Climate variability and the outbreaks of cholera in Zanzibar, East Africa: a time series analysis
Local environmental predictors of cholera in Bangladesh and Vietnam
Water supply interruptions and suspected cholera incidence: a time-series regression in the Democratic Republic of the Congo
Waterborne cholera outbreak following cyclone Aila in Sundarban area of West Bengal
Cholera surveillance and estimation of burden of cholera
Global Task Force on Cholera Control
The Humanitarian Data Exchange
High resolution Standardized Precipitation Evapotranspiration Index (SPEI) dataset for Africa
Data Bank Subnational Population
Epidemic curves made easy using the R package incidence
Estimate Time Varying Reproduction Numbers from Epidemic Curves
Urban cholera transmission hotspots and their implications for reactive vaccination: evidence from Bissau city, Guinea bissau
Population-level effect of cholera vaccine on displaced populations
Incubation periods impact the spatial predictability of cholera and Ebola outbreaks in Sierra Leone
Yellow fever in Africa: estimating the burden of disease and impact of mass vaccination from outbreak and serological data
The global burden of yellow fever
Random forests
Analysis of a random forests model
caret: Classification and Regression Training

We would like to thank and acknowledge the Nigeria Centre for Disease Control for providing the data used here and those who work for the NCDC who collected the data in the field. We would also like to thank Anwar Musah (University College London) and Kelly Elimian (Karolinska Institutet) for their guidance on cholera data for Nigeria and facilitating the partnership with NCDC. GECC was part of the study design and conceptualisation of ideas, ran the analysis, wrote and