key: cord-0724430-58tbv7th
authors: Gning, Lucien; Ndour, Cheikh; Tchuenche, J. M.
title: Modelling COVID-19 daily cases in Senegal using a generalized Waring regression model
date: 2022-03-17
journal: Physica A
DOI: 10.1016/j.physa.2022.127245
sha: c22dc0f1f8916fc1c1a6b13c19c974f9e697307e
doc_id: 724430
cord_uid: 58tbv7th

The rapid spread of the COVID-19 pandemic has triggered substantial economic and social disruptions worldwide. The number of infection-induced deaths in Senegal in particular, and in West Africa in general, is low compared with the rest of the world. We use count regression (statistical) models such as the generalized Waring regression model to forecast the daily confirmed COVID-19 cases in Senegal. The generalized Waring regression model has an advantage over other models such as the negative binomial regression model because it considers factors that cannot be observed or measured, but that are known to affect the number of daily COVID-19 cases. Results from this study reveal that the generalized Waring regression model fits the data better than most of the usual count regression models, and could better explain some of the intrinsic characteristics of the disease dynamics.

COVID-19 is an acute and severe viral respiratory infection, with high fever, fatigue, cough and shortness of breath as the most common symptoms. The ongoing pandemic has affected more than 184 million people in the world (Figure 1), causing a severe public health and economic/financial crisis globally (Bubar, K., Kissler, Lipsitch, Cobey, Grad and Larremore (2021)), and countries such as Senegal were compelled to adopt highly restrictive preventive measures such as lockdowns and the closure or limited opening of shops and schools (Haug (2020)).
As of July 9, 2021, World Health Organization (WHO) statistics put the death rate among detected COVID-19 cases at 2.16 percent on average (range 0.1% to 9.3%), depending on age group and on other factors such as health status and immune system (https://www.who.int/). However, given the persistent issue of under-reporting and of asymptomatic or untested cases, one would expect the actual case fatality rate to be (perhaps much) lower. In Senegal, the third COVID-19 wave tends to have a higher peak than the first and second waves, with the number of deaths following the same pattern. Mathematical modelling is a powerful tool for mitigating the spread and potential impact of the pandemic: it is used to forecast the long-term dynamical spread of the disease (Dietz and Heesterbeek (2002)) and to support, using data, the effective allocation and mobilization of scarce public health resources (Ajbar, Alqahtani and Boumaza (2021), Shan and Zhu (2021)). From the early days of the epidemic to its current state as a pandemic, numerous mathematical models have been developed. These include (1) compartmental models (Chowell, Hyman, Bettencourt and Castillo-Chavez (2009), Martcheva (2015)) of the SIR (Susceptible-Infected-Recovered) type and its variants (Lopez and Rodo (2021), Saikia, Bora and Bora (2021), Venables and Ripley (2015)), and (2) statistical models (Chan, Chu, Zhang and Nadarajah (2021), Odhiambo, Okungu and Mutuura (2020)), in particular count data regression models, used for example to model and forecast the number of new daily COVID-19 cases reported in various countries (Chan et al. (2021), Odhiambo et al. (2020)). We aim to investigate the spread of COVID-19 by fitting a generalized Waring regression model (GWRM) (Rodriguez-Avi, Conde-Sanchez, Saez-Castillo, Olmo-Jiménez and Martinez-Rodriguez (2009)) to the number of new daily cases in Senegal, and to make short-term predictions.
The application of this model to COVID-19 is, to the best of our knowledge, relatively new. Next, we compare the performance of this model with that of the Poisson regression model (PRM) and the negative binomial regression model (NBRM). The univariate generalized Waring distribution (UGWD) (Harris, Hilbe and Hardin (2014), Irwin (1968), Rodriguez-Avi, Conde-Sanchez, Saez-Castillo, Olmo-Jiménez and Martinez-Rodriguez (2007), Xekalaki (1983)) is an extension of the negative binomial distribution that allows three sources of variation to be distinguished (Rodriguez-Avi et al. (2009)), while the GWRM is a regression model with a UGWD as its underlying distribution. One main advantage of the GWRM over traditional models such as the NBRM is that it distinguishes individual heterogeneity due to intrinsic (internal) factors from heterogeneity due to external factors (e.g., covariates that influence the variability of the data but are not included in the model because they cannot be observed or measured). The outline of the rest of this paper is as follows. Section 2 describes our dataset. In Section 3, we detail the different models used. Section 4 provides a practical application, with a discussion of the results. A conclusion is provided in Section 5.

Our dataset consists of the historical daily new confirmed COVID-19 cases in Senegal. Cases during the first wave of the outbreak were collected from the Senegalese Ministry of Health website (https://www.sante.gouv.sn/) and are available at the GitHub repository https://github.com/senegalouvert/COVID-19. The observations run from 2 March 2020 to the end of 22 October 2020. The daily reported number of new cases and the cumulative number of reported cases are depicted in Figure 2, while the ratio of the number of reported cases to the number of screening tests is depicted in Figure 3.
To test the predictive power of each model, we simulate near real-time prediction during an actual outbreak by using 90% of the data for estimation and the remaining 10% for evaluation. The other variables in the data are the following:
• the number of daily screening tests.
• the number of daily contact COVID-19 cases. A contact case is a case in which the person was infected by someone with whom he or she was in contact, such as a family member, a friend or a neighbour; that is, the person knows exactly who infected him or her and where the infection occurred.
• the daily number of community COVID-19 cases. A community case is a case in which the person does not know where he or she was infected.
• the number of daily imported COVID-19 cases (from foreign countries or from outside the community).
• the total number of recovered cases.
• the total number of deaths due to COVID-19.
• the total number of cases evacuated to foreign countries.
• the number of daily critical cases.
The descriptive statistics of all variables used in this paper are summarized in Table 1.

In our framework, $Y_t$ denotes the newly confirmed COVID-19 cases on day $t$, $t = 1, \dots, n$. In what follows, $\mathcal{F}_{t-\tau}$ denotes the history up to day $t-\tau$, $1 \le \tau < t$, $x_t' = (1, x_{1t}, \dots, x_{pt})$ is the vector of covariates on day $t$, and $\beta' = (\beta_0, \beta_1, \dots, \beta_p)$ is the vector of regression parameters on the covariates.

In the Poisson regression model (PRM), we assume, for the conditional distribution of $Y_t$ given $\mathcal{F}_{t-\tau}$, that:
$$Y_t \mid \mathcal{F}_{t-\tau} \sim \mathrm{Poisson}(\mu_{t-\tau}), \qquad \mu_{t-\tau} = \exp(x_{t-\tau}'\beta).$$
It follows that:
$$E(Y_t \mid \mathcal{F}_{t-\tau}) = \mathrm{Var}(Y_t \mid \mathcal{F}_{t-\tau}) = \mu_{t-\tau}.$$
The conditional mean is thus equal to the conditional variance (equidispersion) in this model, which is a major shortcoming for its application to count data.

In the negative binomial regression model (NBRM), we make the following distributional assumption for $Y_t$ given $\mathcal{F}_{t-\tau}$:
$$Y_t \mid \mathcal{F}_{t-\tau}, \lambda_{t-\tau} \sim \mathrm{Poisson}(\lambda_{t-\tau}),$$
where $\lambda_{t-\tau}$ is a gamma distributed variable:
$$\lambda_{t-\tau} \sim \mathrm{Gamma}(a, v_{t-\tau}).$$
Then, using straightforward calculations, we obtain (see, for example, Cameron and Trivedi (2013), Rodriguez-Avi et al. (2009) or Vilchez-Lopez, Saez-Castillo and Olmo-Jimenez (2016)):
$$Y_t \mid \mathcal{F}_{t-\tau} \sim \mathrm{NB}(a, p_{t-\tau}),$$
where $p_{t-\tau} = 1/(1 + v_{t-\tau})$. It follows, for the NBRM, that the conditional mean is given by:
$$E(Y_t \mid \mathcal{F}_{t-\tau}) = \mu_{t-\tau} = a\, v_{t-\tau}.$$
Note that this model is a Poisson-gamma mixture model and that, for the gamma distribution, the parameter $a$ does not depend on the covariates, contrary to $v_{t-\tau}$. The model parametrized in this way is referred to as the NegbinII model in the literature (see Cameron and Trivedi (2013)). The variance of the model is now given by:
$$\mathrm{Var}(Y_t \mid \mathcal{F}_{t-\tau}) = \mu_{t-\tau} + \frac{\mu_{t-\tau}^2}{a}.$$
This variance decomposition is interpreted as follows: the first term represents the variability due to the randomness inherent in any random phenomenon, and the second the variability due to differences between the days.

The GWRM is an extension of the NBRM. Unlike the NBRM, it makes it possible to distinguish the variability due to individual differences related to the covariates included in the model from the variability due to individual differences not related to those covariates (Rodriguez-Avi et al. (2009), Vilchez-Lopez et al. (2016)). Next, we present a brief background on the genesis of the GWRM and some properties of the model. In this model we assume that $Y_t$ given $\mathcal{F}_{t-\tau}$ has a univariate generalized Waring distribution:
$$Y_t \mid \mathcal{F}_{t-\tau} \sim \mathrm{UGWD}(a_{t-\tau}, k, \rho).$$
This implies that the probability mass function of $Y_t \mid \mathcal{F}_{t-\tau}$ is given by (Irwin (1968)):
$$P(Y_t = y \mid \mathcal{F}_{t-\tau}) = \frac{\Gamma(a_{t-\tau}+\rho)\,\Gamma(k+\rho)}{\Gamma(\rho)\,\Gamma(a_{t-\tau}+k+\rho)} \, \frac{(a_{t-\tau})_y\,(k)_y}{(a_{t-\tau}+k+\rho)_y} \, \frac{1}{y!}, \qquad y = 0, 1, 2, \dots,$$
where $a_{t-\tau}, k, \rho > 0$ and $(\alpha)_r = \Gamma(\alpha + r)/\Gamma(\alpha)$ for $\alpha > 0$, and $\Gamma$ is the gamma function defined by $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} \exp(-u)\, du$, for $\alpha > 0$.
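As a purely illustrative check of the probability mass function above, the following minimal Python sketch evaluates the UGWD on the log scale and verifies numerically that it sums to one. The symbols `a`, `k` and `rho` follow the usual parametrization of Irwin (1968) and Rodriguez-Avi et al. (2009), which is assumed here; the function names and the parameter values are hypothetical and are not taken from the paper. The sketch also compares the numerical mean with the standard UGWD result $ak/(\rho-1)$, valid for $\rho > 1$.

```python
import math

def ugwd_log_pmf(y, a, k, rho):
    """log P(Y = y) for a UGWD(a, k, rho) variable, with a, k, rho > 0."""
    lg = math.lgamma
    # Normalising constant: Gamma(a+rho) Gamma(k+rho) / (Gamma(rho) Gamma(a+k+rho))
    log_const = lg(a + rho) + lg(k + rho) - lg(rho) - lg(a + k + rho)
    # Pochhammer symbols (alpha)_y = Gamma(alpha + y) / Gamma(alpha)
    log_poch_a = lg(a + y) - lg(a)
    log_poch_k = lg(k + y) - lg(k)
    log_poch_den = lg(a + k + rho + y) - lg(a + k + rho)
    # lg(y + 1) is log(y!)
    return log_const + log_poch_a + log_poch_k - log_poch_den - lg(y + 1)

def ugwd_pmf(y, a, k, rho):
    return math.exp(ugwd_log_pmf(y, a, k, rho))

# Hypothetical parameter values, chosen only for illustration.
a, k, rho = 2.0, 3.0, 6.0

# Truncate the infinite support; the tail decays like y**(-(rho + 1)).
support = range(5000)
probs = [ugwd_pmf(y, a, k, rho) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))

print("sum of probabilities:", total)
print("numerical mean:", mean, "closed form a*k/(rho-1):", a * k / (rho - 1.0))
```

Working with `math.lgamma` rather than with the gamma function itself avoids overflow for large counts, which matters for daily case counts in the hundreds.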
Then, the conditional mean is given by:
$$E(Y_t \mid \mathcal{F}_{t-\tau}) = \mu_{t-\tau} = \frac{a_{t-\tau}\, k}{\rho - 1}, \qquad \rho > 1.$$
In addition, linking the conditional mean with the explanatory variables as $\mu_{t-\tau} = \exp(x_{t-\tau}'\beta)$, so that $a_{t-\tau} = \mu_{t-\tau}(\rho - 1)/k$, we have, for $\rho > 2$:
$$\mathrm{Var}(Y_t \mid \mathcal{F}_{t-\tau}) = \mu_{t-\tau} + \mu_{t-\tau}\,\frac{k+1}{\rho-2} + \mu_{t-\tau}^2\,\frac{k+\rho-1}{k(\rho-2)}.$$
This variance decomposition is interpreted as follows:
• the first term, referred to as randomness, represents the variability inherent in any random phenomenon;
• the second one, called liability, represents the variability due to individual differences related to the explanatory variables;
• the third one, called proneness, is the variability due to individual differences not related to the explanatory variables included in the model.

Each of the three models considered above, namely the PRM, the NBRM and the GWRM, was fitted by maximum likelihood estimation, that is, by maximizing the log-likelihood of the model [...] (2020)). Note that, because the PRM and the NBRM are special cases of the generalized linear model, we estimate their parameters using the iteratively re-weighted least squares (IRLS) algorithm (for more details see Hilbe (2011)). The PRM is implemented in the glm function of the package stats, while the NBRM is implemented in the glm.nb function of the package MASS.

The number of new daily COVID-19 cases is treated as a discrete count variable with a tendency to over-dispersion. Compared with the Poisson model, several causes might be responsible for this extra variability. First, there are external observable factors that could significantly influence the number of daily COVID-19 cases, such as:
• $x_1$: the number of daily contact COVID-19 cases.
• $x_2$: the daily number of community COVID-19 cases.
• $x_3$: the number of daily imported COVID-19 cases.
Second, there are missing external covariates which would affect the response variable, and the GWRM could provide more information on the data by quantifying the effect and the sources of variability. Finally, the number of daily COVID-19 screening tests has been included as an offset in the models.
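To make the role of the offset and of IRLS concrete, here is a minimal Python sketch of a Poisson regression with a log link, one covariate and a log-exposure offset. It is not the authors' implementation (the paper relies on the R functions glm and glm.nb); the function name `fit_poisson_offset`, the toy data and the coefficient values are hypothetical. The response is set to its noise-free expectation, an assumption made only so that the fit can be checked against the generating coefficients.

```python
import math

def fit_poisson_offset(x, y, exposure, n_iter=50, tol=1e-12):
    """IRLS for the model log E[y_i] = log(exposure_i) + b0 + b1 * x_i."""
    offset = [math.log(e) for e in exposure]
    b0 = math.log(sum(y) / sum(exposure))  # start from the pooled rate
    b1 = 0.0
    for _ in range(n_iter):
        eta = [o + b0 + b1 * xi for o, xi in zip(offset, x)]
        mu = [math.exp(e) for e in eta]
        # Working response, with the offset removed from the linear predictor.
        z = [e - o + (yi - m) / m for e, o, yi, m in zip(eta, offset, y, mu)]
        # Weighted least squares of z on (1, x) with weights mu (2x2 system).
        s_w = sum(mu)
        s_wx = sum(m * xi for m, xi in zip(mu, x))
        s_wxx = sum(m * xi * xi for m, xi in zip(mu, x))
        s_wz = sum(m * zi for m, zi in zip(mu, z))
        s_wxz = sum(m * xi * zi for m, xi, zi in zip(mu, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Toy data: exposure plays the role of the number of daily screening tests.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
exposure = [200.0, 350.0, 500.0, 420.0, 610.0, 800.0, 750.0, 900.0]
true_b0, true_b1 = -3.0, 0.15
y = [e * math.exp(true_b0 + true_b1 * xi) for e, xi in zip(exposure, x)]

b0_hat, b1_hat = fit_poisson_offset(x, y, exposure)
print("estimated coefficients:", b0_hat, b1_hat)
```

Because the offset enters the linear predictor with a coefficient fixed at one, the remaining terms describe the log of the expected number of cases per screening test.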
This means that the ratio of the number of new daily COVID-19 cases to the number of daily COVID-19 screening tests is used as the response variable for all models. The 14-day COVID-19 incubation period from exposure to symptom onset is denoted by the parameter $\tau$; the median time is about 5 days (Zaki and Mohamed (2021)). From a machine learning point of view, $\tau$ can be considered a hyperparameter; in the following, we therefore use a grid search to automate the tuning of $\tau$. In Figure 4, we plot the Root Mean Square Error (RMSE) of the models as functions of $\tau$, and it can be observed that $\tau = 3$ is the optimal value.

We fit the models from Section 3 to the Senegal data. The values of the log-likelihood function shown in Table 2 allow us to evaluate the accuracy of the model fit, together with the Akaike and the Bayesian information criteria. In addition, for each model, using a cross-validation procedure, we report in Table 2 the values of the RMSE, the Mean Absolute Error (MAE), and the Mean Error (ME). Our results show that the best model out of the three is the generalized Waring regression model (GWRM). In Table 3, we display the maximum likelihood parameter estimates for the GWRM (the best fitted model), their respective standard errors, and the associated partial Wald tests (statistics and p-values). The estimates of $k$ and $\rho$ are respectively 53.68050 and 12.85094, with standard errors of 46.012067 and 4.668843. Note that the GWRM includes only the daily number of COVID-19 contact cases as a significant covariate at the 5% significance level.

Now, we focus on how well the GWRM fits the observed data. We investigate this issue by carrying out a visual residual analysis. In Figures 5, 6 and 7, we depict the quantile-quantile plots (QQ-plots) with simulated envelopes for the deviance, Pearson and response residuals, respectively (for more details see Atkinson (1985), Vieira, Hinde and Demetrio (2000), Vilchez-Lopez et al. (2016)).
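The idea behind a simulated envelope can be sketched as follows in Python. This is a deliberately simplified illustration and not the procedure used for Figures 5 to 7: it assumes a Poisson model with known fitted means instead of the GWRM, it does not refit the model on each simulated sample as the full method of Atkinson (1985) does, and all names and values (`rpois`, `simulated_envelope`, the means `mu`, the seeds) are hypothetical.

```python
import math
import random

def rpois(mu, rng):
    """Draw one Poisson(mu) variate with Knuth's method (fine for small means)."""
    limit = math.exp(-mu)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def pearson_residuals(y, mu):
    """Sorted Pearson residuals (y - mu) / sqrt(mu) under a Poisson model."""
    return sorted((yi - m) / math.sqrt(m) for yi, m in zip(y, mu))

def simulated_envelope(mu, n_sim=99, seed=1):
    """Pointwise min and max of the ordered residuals over n_sim simulated samples."""
    rng = random.Random(seed)
    sims = [pearson_residuals([rpois(m, rng) for m in mu], mu) for _ in range(n_sim)]
    lower = [min(s[i] for s in sims) for i in range(len(mu))]
    upper = [max(s[i] for s in sims) for i in range(len(mu))]
    return lower, upper

# Hypothetical fitted means and a hypothetical observed sample.
mu = [2.0 + 0.5 * i for i in range(30)]
obs_rng = random.Random(2024)
y_obs = [rpois(m, obs_rng) for m in mu]

lower, upper = simulated_envelope(mu)
res_obs = pearson_residuals(y_obs, mu)
n_outside = sum(1 for r, lo, hi in zip(res_obs, lower, upper) if r < lo or r > hi)
print("ordered residuals outside the envelope:", n_outside, "of", len(mu))
```

In a QQ-plot, the ordered observed residuals and the two envelope curves are drawn against the same theoretical quantiles; points lying outside the band are the ones that call the fitted model into question.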
For the deviance and Pearson residuals, all the plotted points fall within the boundaries of the envelope, which indicates a good accuracy of the fitted model. For the response residuals, a few points fall outside the envelope, which may signal some lack of accuracy; overall, however, there is no strong evidence against the adequacy of the fitted model. Thus, the GWRM is appropriate to fit the data.

We observe that, on average, the proneness represents 50.18 percent of the variability of the model, while the liability represents 41.57 percent and the randomness 8.25 percent. This means that about 50% of the variation of the model is due to external factors that are unknown because they are difficult to observe or to measure. Figure 8 depicts the relationship between the proportion of each component of the variance and the daily new COVID-19 cases. It can be seen that, on average, the shares of randomness and liability decrease as the number of new daily COVID-19 cases increases, while the share of proneness increases. Figure 9 shows the proportion of the variance related to randomness, liability and proneness as a function of the number of screening tests. In general, the variability in the number of new daily COVID-19 cases due to randomness and liability decreases as the number of screening tests increases, whereas the variability due to proneness increases. We can therefore deduce that increasing the number of screening tests emphasizes the role of the daily characteristics as a cause of differences between days in the number of new daily COVID-19 cases, whereas randomness and liability become somewhat less relevant. Figures 10, 11 and 12 illustrate the relationships between the model's covariates and the variance related to randomness, liability and proneness.
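The way these shares depend on the conditional mean can be illustrated with a short Python sketch. It assumes the three-term variance decomposition of Rodriguez-Avi et al. (2009), namely $\mu + \mu(k+1)/(\rho-2) + \mu^2(k+\rho-1)/(k(\rho-2))$, and it assumes that the two estimates reported with Table 3 (53.68050 and 12.85094) correspond to $k$ and $\rho$ in that parametrization. The function names and the two values of the mean are hypothetical.

```python
def gwrm_variance_components(mu, k, rho):
    """Randomness, liability and proneness terms of the variance (requires rho > 2)."""
    randomness = mu
    liability = mu * (k + 1.0) / (rho - 2.0)
    proneness = mu ** 2 * (k + rho - 1.0) / (k * (rho - 2.0))
    return randomness, liability, proneness

def gwrm_variance_shares(mu, k, rho):
    """Each component expressed as a fraction of the total variance."""
    parts = gwrm_variance_components(mu, k, rho)
    total = sum(parts)
    return tuple(p / total for p in parts)

# Assumed to be the reported estimates of k and rho.
k_hat, rho_hat = 53.68050, 12.85094

# Two hypothetical conditional means, one low and one high.
shares_low = gwrm_variance_shares(5.0, k_hat, rho_hat)
shares_high = gwrm_variance_shares(100.0, k_hat, rho_hat)

for label, shares in (("mu = 5", shares_low), ("mu = 100", shares_high)):
    print(label, "randomness, liability, proneness:", [round(s, 4) for s in shares])
```

Under this decomposition the ratio of liability to randomness is the constant $(k+1)/(\rho-2)$, while the proneness term grows with the square of the mean, so its share necessarily rises as the expected number of cases increases.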
In general, the variability in the number of daily new COVID-19 infections due to randomness and liability decreases as the number of daily contact cases or the number of daily community cases increases, whereas the variability due to proneness increases. Thus, an increase in the daily number of cases emphasizes the role of proneness, whereas randomness and liability become less relevant. In contrast, the variability in the number of new daily infections varies randomly as the number of daily imported cases increases (Figure 12). However, the fraction of proneness is globally more important than the other sources of variation. It is important to note that the imported cases represent 2.24% of the total COVID-19 cases in Senegal during the first stage of the outbreak, with a peak of 27 cases on June 13, 2020, and that the GWRM does not include this variable as a significant covariate at the 5% significance level.

The rapid spread of the COVID-19 pandemic has triggered substantial economic and social disruptions worldwide. Several mathematical studies have been proposed to investigate and forecast the long-term dynamics of the disease. We use count regression models such as the generalized Waring regression model to forecast the daily confirmed new COVID-19 cases in Senegal. We fitted the generalized Waring regression model (GWRM) to the Senegal data (daily number of confirmed new COVID-19 infections). Results from this study reveal that the GWRM fits the data better than most of the usual count data models, such as the negative binomial regression model (NBRM). Therefore, the former model can explain specific characteristics of the studied phenomenon. We also analyze the variance decomposition for each day. An advantage of the GWRM is that it accounts for factors that cannot be observed or measured but are known to affect the number of daily COVID-19 cases.
This study is not exhaustive, and future work can explore several other avenues, such as modelling the daily mortality due to COVID-19 using a GWRM, or investigating (1) the impact of testing capacity on the number of detected daily cases (Zhan, Chen and Zhang (2021)) and (2) the effect of various prevention and therapeutic intervention measures on COVID-19 dynamics. One can also try to model the daily number of COVID-19 cases using a generalized Waring auto-regressive process in order to better understand the COVID-19 dynamics. A possible issue with daily confirmed case data is the potential for weekly seasonality effects: fewer tests are performed over weekends, hence fewer cases and oscillations in the data. In a future study, we plan to investigate whether there are any such issues in the Senegal data and how to correct for them, and whether the generalized Waring regression model could capture the multiple waves of the disease.

Dynamics of an SIR-based COVID-19 model with linear incidence rate, nonlinear removal rate, and public awareness
Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis
Model-informed COVID-19 vaccine prioritization strategies by age and serostatus
Regression Analysis of Count Data
Count regression models for COVID-19
Mathematical and statistical estimation approaches in epidemiology
A statistical analysis of the novel coronavirus (COVID-19) in Italy and Spain
Modelling count data with generalized distributions
Ranking the effectiveness of worldwide COVID-19 government interventions
Modeling Count Data
The generalized Waring distribution applied to accident theory
A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: Simulating control scenarios and multi-scale epidemics
An Introduction to Mathematical Epidemiology
Stochastic modeling and prediction of the COVID-19 spread in Kenya
A new generalization of the Waring distribution
A generalized Waring regression model for count data
COVID-19 outbreak in India: an SEIR model-based analysis
Bifurcations and complex dynamics of an SIR model with the impact of the number of hospital beds
R: A language and environment for statistical computing. R Foundation for Statistical Computing
Modern Applied Statistics with S. Fourth edition
Zero-inflated proportion data models applied to a biological control assay
An R package for identifying sources of variation in overdispersed count data
The univariate generalized Waring distribution in relation to accident theory: Proneness, spells or contagion?
The estimations of the COVID-19 incubation period: A scoping reviews of the literature
An investigation of testing capacity for evaluating and modeling the spread of coronavirus disease

The authors would like to thank the editor and the referees for their careful reading and for comments which improved the paper.