key: cord-0999786-tku0c0l3
authors: Olisah, Chollette C.; Ilori, Olusoji O.; Adelaja, Kunle; Usip, Patience U.; Uzoechi, Lazarus O.; Adeyanju, Ibrahim A.; Odumuyiwa, Victor T.
title: Data-driven approach to COVID-19 infection forecast for Nigeria using negative binomial regression model
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00002-2
sha: 5a2165d0d4145db92beb746694cf8fb1d343a32c
doc_id: 999786
cord_uid: tku0c0l3

COVID-19, the new wave of a global pandemic, is prompting an increasing number of scientific efforts aimed at enabling governments to make informed decisions. In this paper, we explore the negative binomial regression model from the family of generalized linear models for the prediction of the future infection pattern of COVID-19 in Nigeria. We approached the prediction from a new perspective that is inspired by the transfer learning and feature engineering approaches widely adopted in machine learning. We trained the model to learn COVID-19 pattern cues from other countries (South Africa, Senegal, Slovenia, Australia, Belgium, and Israel) with sufficient recorded infection cases and test counts as baseline data, and created additional features to increase the model's predictive power. With a testing capacity of 2000 persons per day in Nigeria, the cumulative infection counts for 30-04-2020, 15-05-2020, and 22-05-2020 were predicted to rise to 3044, 5622, and 7254, respectively.

Since the first confirmed case of the coronavirus SARS-CoV-2 (COVID-19) in China's city of Wuhan in December 2019, the pandemic has spread across the globe, with over 190 countries affected. In Nigeria, the first confirmed case of COVID-19 was detected on 25th February 2020 in the city of Lagos. Since the index case, the outbreak has spread to 32 states of the country.
The first set of confirmed cases was reported to be from foreigners and nationals who arrived in Nigeria from different countries, mainly Italy, the United Kingdom, and the United States of America. Although the governments put several interventions in place to curb the spread of the virus, as of 28th April 2020 Nigeria had recorded 1532 cases, 44 deaths, and 255 recoveries from the virus [1]. Though these statistics are small compared to the records of most countries in Europe and America, some experts believe that there will still be an exponential rise in infected cases in Nigeria if the measures put in place by the governments are not adequately coordinated. It is, therefore, imperative to bring scientific efforts to bear on its control. One such effort is the prediction of the future infection pattern of the virus, to enable the governments at the Federal and State levels to make informed health-related decisions.

Typically, in epidemiology, from the first instance that a high health-risk disease is identified, researchers often resort to mathematical models to predict the course of the disease over time [2–8]. Mathematical models can be broadly classified into mechanistic models and empirical models [9]. Mechanistic models require detailed knowledge and data on the underlying problem. In places where precise data and computational resources are available, mechanistic models may suffice for predicting coronavirus infection patterns and for modeling the impact of various intervention strategies more accurately to inform policymakers and health workers [8]. In Nigeria, detailed and comprehensive data are not available, and the data that are available are not precise enough for accurate mechanistic modeling. Data collection is constrained by scarce resources, such as testing kits, and by inadequate sampling strategies.
Also, there is the problem of delayed infection case reports and under-reporting. As a result, the data are noisy [10]. Empirical models can be appropriate for extracting patterns in the available data and forecasting coronavirus transmission dynamics. However, the accuracy of any mathematical or statistical model depends heavily on assumptions, parameters, and theory. That is: how good are the assumptions on which the model is based [11–13], how significant are the estimated parameters in modeling a given infectious disease within a geographical region [14], and is the model formulated based on theory?

Alternatively, as the number of real data points increases while the disease progresses, computational models, which are also grounded in mathematical models, can be explored to predict the future growth pattern of the disease. The use of computational models in epidemiology dates back to the 1980s [15], and it is still prevalent in modern epidemiology. Some commonly used models are linear regression (LR) [3, 16], Poisson regression (PR), negative binomial regression (NBR) [17, 18], exponential smoothing (ES) [19], autoregressive integrated moving average [20], and support vector machines. These models can be adapted to a given problem. However, there is no one-size-fits-all model, because each problem's outcome data differ in type and distribution. By outcome data, we mean the data of the dependent variable, and the term will be used in this sense throughout this paper except where otherwise specified.

For the current prediction trend of COVID-19, Petropoulos and Makridakis [21] adopted a computational model from the ES family for the prediction of the global cumulative confirmed cases of COVID-19. Roosa et al. [22] employed and compared the generalized logistic growth model, the Richards growth model, and a subepidemic wave model in terms of their capability to objectively forecast future global cases of COVID-19. Jia et al.
[23] employed and analyzed the logistic model, the Bertalanffy model, and the Gompertz model to fit and analyze the situation of COVID-19. Anastassopoulou et al. [3] compared the predictions of LR alongside the susceptible-infectious-recovered-dead (SIRD) model for the future spread of COVID-19. Because the approaches used to adapt a computational model to the COVID-19 data differ across these studies, their prediction results will not be compared in this work. Rather, we acknowledge the quick prediction response they have made and go on to present our prediction methodology as it contributes to knowledge and facilitates health-care interventions.

In this work, we present an epidemiological prediction that is tailored to Nigeria, though it can be adapted for other countries. The reason for the focus on Nigeria is the gradually growing spread of the virus as reported by the Nigeria Centre for Disease Control (NCDC) [1] and the heedless attitude of Nigerians toward the virus. People frequently cluster to make purchases at malls and markets and to receive food aid, without protective measures in place to prevent the spread of the virus in such gatherings. Yet, of her approximately 195 million people, only about 10,861 persons, constituting about 0.00557% of the population, had been tested as at 28-APR-2020.

Based on the presented case, we make the following assumptions: (1) many more people are carriers of COVID-19 but are not showing symptoms because they are self-medicating or are being treated illegally at unauthorized private hospitals, which is likely to suppress the symptoms and let carriers go unnoticed; (2) because Nigeria is not running as many tests as possible, and given the uncontainable interaction of people in gatherings, carriers further transmit the virus to more persons. Having stated these, we hypothesize that there is a causal relationship between testing and the identification of COVID-19 carriers in Nigeria.
Therefore, we make the following statement: "With an increase in COVID-19 testing relative to the suspected percentage of carriers in Nigeria, the number of infected cases will increase significantly." Subsequently, we identify the predictor variables meaningful to the given case for the prediction of the infection pattern of the virus over time. Unlike previous approaches [3, 21–23], we explore a prediction model from the family of generalized linear models (GLMs) for our prediction.

This section comprises data, preprocessing, and the prediction model.

Before data collection, it is important to identify the outcome and predictor variables to avoid erroneous data collection. Therefore, based on our hypothesis, the outcome variable (infected count) and the predictor variables (number tested, number of deaths, and time) were identified, with no consideration given to selecting the "best predictors." Furthermore, these variables guided the collection of the number-tested data¹ and the data for the remaining variables.² Since computational models for epidemiological prediction usually require historical data collected over a long period for prediction accuracy, we collected the daily incidence of COVID-19 from countries with a sufficient number of test counts to create baseline data. These countries are South Africa, Senegal, Slovenia, Australia, Belgium, and Israel. Because the number-tested predictor variable is included, the records from Nigeria as provided by the NCDC are omitted: there was no accurate daily record of the number tested for the period from March 9, 2020 to April 19, 2020. Also, guided by the trend of infection in Nigeria, we consider records from the USA, UK, and Italy to be too extreme to be included in the data model.

First, the data collected from the various countries mentioned earlier in Section 2.1 are merged into a single dataset.
To enable each sample to still represent an infection count, two indexes are set: one for the record index and the other for the date entry. Second, a few null values appeared in the data. To generate the missing values, we adopted the linear interpolation method, which estimates each null value from the known values closest to it. A sample of the data after processing is shown in Fig. 31.1. Third, based on our earlier assumption that the "percentage of infected" will be a significant factor for identifying new carriers of the virus, we employed a feature engineering approach by creating a percentage-suspected of COVID-19 carriers as an additional predictor variable. This enables us to test whether increasing the testing capacity relative to the growing number of people suspected to be carriers of the virus has an effect on the prediction of the outcome. After the data have been prepared, they are split into training and testing sets with a distribution of 80% and 20%, respectively, that is, 136 observations for the training set and 44 observations for the test set. The data sample is provided in Table 31.1.

Different from other predictions of COVID-19 in the literature [22, 23], our predictive model is trained on global data from the aforementioned countries to extract meaningful cues of the virus spread pattern, and is subsequently adapted to forecasting the future infection spread of the virus in Nigeria. This approach is inspired by transfer learning, which is used when there is a limited number of samples for recognition tasks, as in [24], and is motivated chiefly by the numerous missing values in the number-tested data from the Nigeria COVID-19 reports. It should be noted that the aspect of transfer learning we refer to is the act of using prediction patterns gained from one problem to deduce the pattern of a different but related problem.
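The merge, interpolation, and split steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names (`country`, `date`, `infected`, `number_tested`), the helper functions, the random (rather than chronological) split, and the toy values are all hypothetical. The percentage-suspected feature is omitted because its definition is not given in the text.

```python
import numpy as np
import pandas as pd


def merge_and_interpolate(frames):
    """Stack per-country frames and fill gaps by linear interpolation."""
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.sort_values(["country", "date"]).reset_index(drop=True)
    value_cols = ["infected", "number_tested"]
    # Interpolate inside each country so that values never leak across countries.
    merged[value_cols] = merged.groupby("country")[value_cols].transform(
        lambda s: s.interpolate(method="linear")
    )
    return merged


def split_train_test(data, train_frac=0.8, seed=0):
    """Random split (an assumption; the text does not say how rows were chosen)."""
    train = data.sample(frac=train_frac, random_state=seed)
    test = data.drop(train.index)
    return train, test


# Toy data, invented purely for illustration.
dates = pd.date_range("2020-03-09", periods=5, freq="D")
country_a = pd.DataFrame({
    "country": "A",
    "date": dates,
    "infected": [1.0, 3.0, np.nan, 7.0, 9.0],
    "number_tested": [100.0, 120.0, 140.0, np.nan, 180.0],
})
country_b = pd.DataFrame({
    "country": "B",
    "date": dates,
    "infected": [2.0, np.nan, 6.0, 8.0, 10.0],
    "number_tested": [50.0, 60.0, 70.0, 80.0, 90.0],
})

data = merge_and_interpolate([country_a, country_b])
train, test = split_train_test(data)
print(data)
print(len(train), "training rows,", len(test), "test rows")
```

Interpolating within each country is a design choice made here so that a gap at the end of one country's series is not filled from the start of the next country's series.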
Therefore, in predicting the future COVID-19 infection count in Nigeria, we generated seven test datasets, of which datasets 3–7 are as stipulated:³
(1) data that fit the current daily testing capacity in Nigeria (see Table 31.2);
(2) data that follow an increase of 300 in the testing capacity;
(3) data that meet an expected daily testing capacity of 1500 in Nigeria;
(4) data that meet an expected daily testing capacity of 2000 in Nigeria;
(5) data that meet an expected daily testing capacity of 2500 in Nigeria;
(6) data that meet an expected daily testing capacity of 3500 in Nigeria;
(7) data that meet an expected daily testing capacity of 5000 in Nigeria.

We employ a type of GLM. This family of models is chosen because it is known for its powerful application to prediction problems involving count data, and because it can be used to validate the relationship between variables and so judge the contribution each variable makes to the model performance. From the initial descriptive analysis, the population distribution is observed to be skewed and to approximate the Poisson distribution. However, the mean of the outcome data deviates greatly from its variance. In statistics, this is termed overdispersion: the variance of the data is greater than its mean [25]. This characteristic of the data violates the assumption of the PR model, a commonly used model for fitting epidemiological count data, that the mean and variance are equal. Therefore, we explore the NBR model for fitting the count data. According to the literature, NBR is more appropriate for fitting overdispersed count data [26] and is widely adopted in solving epidemiological problems, as in Refs. [15, 17, 18, 26–28]. With the NBR model, we consider the goal of predicting the outcome variable $y_i$, which is the number of infected cases at observation $i$, given the exposure time $t_i$ and a set of predictor variables $x_{1i}, x_{2i}, \ldots, x_{ki}$ at observation $i$.
Thus, the model is formulated as:

$$P(Y = y_i \mid \mu_i, \alpha) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(\alpha^{-1})\,\Gamma(y_i + 1)} \left(\frac{1}{1 + \alpha\mu_i}\right)^{\alpha^{-1}} \left(\frac{\alpha\mu_i}{1 + \alpha\mu_i}\right)^{y_i} \qquad (31.1)$$

where $\alpha$ is the dispersion parameter, $\Gamma$ is the gamma function, and $\mu_i$ is the expected mean value of $y_i$ per $t_i$, which is given as:

$$\mu_i = \exp\left(\ln(t_i) + \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}\right) \qquad (31.2)$$

where $\beta_0$ is the intercept and the unknown parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are regression coefficients estimated using the maximum likelihood method [25, 29]. Consequently, the future observations of the infection pattern of the virus at time $t + n$, $n > 0$, can be predicted. In applying Eq. (31.1) to COVID-19 data, the full model representation of $\mu_i$ can be specified as:

$$\mu_i = \exp\left(\ln(t_i) + \beta_0 + \beta_1\,\text{number\_tested}_i + \beta_2\,\text{percentage\_suspected}_i + \beta_3\,\text{day}_i + \beta_4\,\text{number\_of\_deaths}_i\right) \qquad (31.3)$$

The prediction model was implemented using the Statsmodels v0.12.0.dev0 (v207) application programming interface in the Python environment. The prediction model results and analysis are presented and discussed in the subsequent sections.

The results are presented as follows: (1) goodness of fit of the model to the data, (2) the effect of the predictor variables on the outcome variable, and (3) model prediction performance on unknown data, the Nigeria data, for one month.

In assessing the fit of the NBR model to the data, we used the Chi-square goodness-of-fit statistical measure as proposed in Ref. [30]. Based on the author's recommendations, we evaluate the deviance value against the model degrees of freedom (see Table 31.3) at a 5% significance level using the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (31.4)$$

where $\chi^2$ is the Chi-square goodness of fit, $O$ is the observed value, and $E$ is the expected value. The P-value determined from the Chi-square distribution calculator, $P(\chi^2) = 0.41144$, is above 0.05, so the Chi-square test is not statistically significant. Therefore, we conclude that the NBR model fits the data well. Furthermore, our claim of the fitness of the proposed model can also be verified using a plot of the deviance residuals against the fitted values, which is illustrated in Fig. 31.1.
As expected, the deviance of the proposed model lies along the zero point and shows no evidence of one-directional bias, either of overestimation or underestimation, given the [...]

Here, we report the model effect of the predictor variables given in Eq. (31.3) and their statistical significance in predicting the outcome count. These results are tabulated in Table 31.4. To interpret Table 31.4, we draw the reader's attention to the coefficient estimates for the model effect and to the P-values for statistical significance. The statistical analysis of the predictor variables using the P-values at the 5% significance level reveals that all predictor variables except the number of deaths are statistically significant to the model outcome. The number_tested and percentage_suspected variables are highly significant, with P-values reported as 0.000, well below the 0.01 significance level. The day variable, with a P-value of 0.030, is also significant. As expected, the number_of_deaths variable does not significantly influence the pattern of infection spread per $t_i$. Because these results are in line with our assumptions, they support the predictive power of the model for the given problem.

The coefficient estimate is not informative by itself. So, we adopted the interpretative strategy for a coefficient estimate provided in Ref. [31], which is given as:

$$H = 100 \times \left(e^{\beta \Delta} - 1\right) \qquad (31.5)$$

where $H$ is the percentage change in the expected mean of $y_i$, $\Delta \equiv 1$ represents the one-unit change, and $\beta$ is the regression coefficient. Applying Eq. (31.5) to the predictor variables number_tested, percentage_suspected, day, and number_of_deaths gives the values 0.05%, 7.9%, 2.2%, and 0.08%, respectively. By interpretation, the exponentiated value of each predictor's regression coefficient indicates how much the mean of $y_i$, that is, $\mu_i$, changes with every one-unit increase in $X$, while holding the other predictors constant.
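The percentage-change interpretation can be made concrete with a small worked example. It assumes that Eq. (31.5) takes the usual form $H = 100 \times (e^{\beta\Delta} - 1)$; the function names are hypothetical, and the coefficient is backed out from the reported 7.9% figure rather than read from Table 31.4.

```python
import math


def percent_change(beta, delta=1.0):
    """Percentage change in the expected count for a change of `delta` in X."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


def coefficient_from_percent(h):
    """Invert the formula: the coefficient implied by a one-unit change of h%."""
    return math.log(1.0 + h / 100.0)


# Coefficient implied by the 7.9% change reported for percentage_suspected.
beta_implied = coefficient_from_percent(7.9)
print(round(beta_implied, 3))                  # about 0.076
print(round(percent_change(beta_implied), 1))  # recovers 7.9

# The effect compounds multiplicatively: a two-unit increase is not 2 * 7.9%.
print(round(percent_change(beta_implied, delta=2.0), 2))
```

The last line shows why the exponentiated coefficient, not the raw coefficient, is the quantity to read: each additional unit multiplies the expected count by the same factor.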
For instance, the percentage_suspected predictor with [...]

An alternative approach to evaluating the relationship between the outcome variable $y$ and a predictor variable $X_k$, conditional on the other predictor variables $X_{-k}$, is through a visual plot of the residuals retrieved by regressing the outcome variable against $X_k$. The partial regression plot available in Statsmodels v0.12.0.dev0 (v207) is used to achieve this goal. From Fig. 31.2, it can be seen that the observations for the predictor variables were consistently close to the trend line, though a compact spread along the trend line is seen for the deaths and test predictors, which results from the presence of an outlier. However, there is a lack of trend for the day predictor, which illustrates that it is not as explanatory as the regression coefficients suggested.

The capability of the predictive model can be seen in Fig. 31.3, which plots the predicted infected cases against the actual infected cases over the dates from 09-03-2020 to 09-04-2020. These dates represent the period in which most countries began recording numerous cases. The distribution of the predicted values resembles that of the actual values, though at some observations (13-03-2020 to 15-03-2020 and 30-03-2020 to 04-04-2020) the model overpredicted. Overall, the predicted infected cases closely resemble the actual observations.

Of particular interest in epidemiological predictions is the ability to project into the future the spread pattern a disease might take over a period of time, to help facilitate health-care decisions. Henceforth, we term this forecasting. We consider the forecast distribution of future observations of the infection pattern of the virus at time $t + n$, $n > 0$, to be of great significance to public health in Nigeria. This is mainly because the infection threat of COVID-19 is still not well understood.
We use the generated datasets 1–7 discussed in subsection 2.2, which represent the variables of Eq. (31.3), to test the predictive power of the model on unseen data. The future cumulative numbers of infected cases for datasets 1–7 are illustrated in Fig. 31.4D and Fig. 31.5D. This is a report on a thirty-days-ahead forecast and an illustration of the effectiveness of the proposed model when applied to COVID-19. We assume that there are more COVID-19 infected cases than are reported; if that is the case, the Government must increase its COVID-19 testing capacity. As observed from Figs. 31.4 and 31.5, an increase in the testing capacity increased the number of infected cases. Though datasets 1–7 are created from the actual COVID-19 Nigeria data, it is interesting to observe the spike on day 30 of the trained model recur in the daily and cumulative plots of the forecast COVID-19 infected cases. Also, we observe that the forecast errors between February 20, 2020 [...]

Interestingly, the cumulative predicted number of infected cases in Nigeria is expected to continue to rise in the coming weeks, as seen from Figs. 31.4D and 31.5D. The growth levels expected on 30-04-2020 for the three scenarios of testing capacity are: (1) 1756 for the scenario of a gradual increase in testing capacity as currently practiced in Nigeria, labeled "initial_test_per_day_capacity"; (2) 1914 for the scenario of a 300 increase in the current testing capacity, labeled "300_increase_test_per_day_capacity"; and (3) 2509 for the scenario of an achieved 1500 daily tests, labeled "1500_test_per_data_capacity". By 15-05-2020 and 22-05-2020, the infected count is expected to rise to 2951 and 3697 for scenario 1, 3280 and 4145 for scenario 2, and 4518 and 5790 for scenario 3. Furthermore, the highest numbers of COVID-19 infected cases in Nigeria are forecast for testing capacities from 2000 up to 5000.
If Nigeria eventually carries out more tests as projected for the coming days, there will be more and more persons in need of health-care facilities. As predicted, about 7254, 9135, 14,639, and 30,244 COVID-19 infected cases might be identified by 22-05-2020 for the scenarios of 2000-test-per-day-capacity, 2500-test-per-day-capacity, 3500-test-per-day-capacity, and 5000-test-per-day-capacity, respectively. Therefore, care should be taken as Nigeria currently considers relaxing the lockdown in the coming weeks without careful deliberation on the potential risk and ways to mitigate it. While the benefits of the lockdown can be observed in the gradual rise of COVID-19 infected cases, we should be wary of the uncontainable interaction of people in markets, malls, and shops around people's homes, and of the contagion risk created when noble Nigerian philanthropists gather clusters of people to give out aid. If these gatherings are not contained, Nigeria should expect a spike that far exceeds the worst case of the 1500-testing-capacity scenario. The predicted cases for testing capacities from 2000 up to 5000 reveal as much.

This paper explored the NBR model from the family of GLMs for the prediction of the future infection pattern of COVID-19 in Nigeria. We approached the prediction from a new perspective that is inspired by the transfer learning and feature engineering approaches widely used in machine learning. We trained the model to learn COVID-19 pattern cues from countries such as South Africa, Senegal, Slovenia, Australia, Belgium, and Israel, with sufficient recorded infection cases and test counts, as baseline data for forecasting infection trends in Nigeria.
The experimental results showed the effectiveness of the proposed approach in predicting the test set of the trained data and in forecasting a rise in the number of COVID-19 infected cases in Nigeria, which closely resembles the actual number of infected cases in Nigeria.

[1] COVID-19 Case Update, 2020. Retrieved from covid19.ncdc.gov.ng [Accessed 28th
[2] CoronaTracker: World-wide COVID-19 outbreak data analysis and prediction
[3] Data-based analysis, modelling and forecasting of the COVID-19 outbreak
[4] Modeling and predictions for COVID-19 spread in India
[5] Mathematical Predictions for COVID-19 as a Global Pandemic. medRxiv
[6] Statistics-based predictions of coronavirus epidemic spreading in mainland China
[7] Epidemic Analysis of COVID-19 in China by Dynamical Modeling. arXiv
[8] Imperial College COVID-19 Response Team. Impact of Non-pharmaceutical Interventions (NPIs) to Reduce COVID-19 Mortality and Healthcare Demand
[9] In All Likelihood: Statistical Modelling and Inference Using Likelihood
[10] COVID-19 is a data science issue
[11] Impact of Nonpharmaceutical Interventions (NPIs) to Reduce COVID-19 Mortality and Healthcare Demand. Imperial College COVID-19 Response Team
[12] Mathematical modelling of infectious diseases
[13] Correction: appropriate models for the management of infectious diseases
[14] Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm
[15] On the use of the negative binomial in epidemiology
[16] Linear Regression Models in Epidemiology. Institute of Industrial Ecology, the Urals Branch of the Russian Academy of Sciences
[17] Variable selection and regression analysis for the prediction of mortality rates associated with foodborne diseases
[18] Estimating risk of mortality from cardiovascular diseases using negative
[19] Applications and comparisons of four time series models in epidemiological surveillance data
[20] Using autoregressive integrated moving average (ARIMA) models to predict and monitor the number of beds occupied during a SARS outbreak in a tertiary hospital in Singapore
[21] Forecasting the novel coronavirus COVID-19
[22] Short-term forecasts of the COVID-19 epidemic in Guangdong and Zhejiang
[23] Prediction and Analysis of Coronavirus Disease
[24] Transferability of artificial neural networks for clinical document classification across hospitals: a case study on abnormality detection from radiology reports
[25] Regression models for count data based on the negative binomial(p) distribution
[26] Analysis of overdispersed count data: application to the human papillomavirus infection in men (HIM) study
[27] Using a negative binomial regression model for early warning at the start of a hand foot mouth disease epidemic in Dalian
[28] Application of negative binomial modeling for discrete outcomes: a case study in aging research
[29] Applied Regression Analysis
[30] Handbook of Biological Statistics
[31] Tutorial on using regression models with count outcomes using R

We would like to express our sincere gratitude to the Association of Massachusetts Institute of Technology Trained African Universities Lecturers (AMTAUL) for providing the platform for this research collaboration. We acknowledge the role of Professor Akintayo Akinwande (Program Coordinator) and TOTAL Nigeria for supporting the program.