key: cord-319885-8qyavs7m
authors: Chan, Stephen; Chu, Jeffrey; Zhang, Yuanyuan; Nadarajah, Saralees
title: Count regression models for COVID-19
date: 2021-02-01
journal: Physica A
DOI: 10.1016/j.physa.2020.125460
sha: 
doc_id: 319885
cord_uid: 8qyavs7m

At the end of 2019, the current novel coronavirus emerged as a severe acute respiratory disease that has now become a worldwide pandemic. Future generations will look back on this difficult period and see how our society as a whole united and rose to this challenge. Many reports have suggested that this new virus is becoming comparable to the Spanish flu pandemic of 1918. We provide a statistical study on the modelling and analysis of the daily incidence of COVID-19 in eighteen countries around the world. In particular, we investigate whether it is possible to fit count regression models to the number of daily new cases of COVID-19 in various countries and make short term predictions of these numbers. The results suggest that the biggest advantage of these methods is that they are simple and straightforward, allowing us to obtain preliminary results and an overall picture of the trends in the daily confirmed cases of COVID-19 around the world. The best fitting count regression model for the number of new daily COVID-19 cases in all countries analysed was shown to be a negative binomial model with a log link function. Whilst the results cannot solely be used to determine and influence policy decisions, they provide an alternative to more specialised epidemiological models and can help to support or contradict results obtained from other analyses.

The novel coronavirus disease (COVID-19), first identified in Wuhan, the capital of Hubei, China, in December 2019, is a severe acute respiratory disease that has now become a worldwide pandemic. Fig. 1 shows the extent to which this pandemic has spread across the world, with the total number of confirmed global cases exceeding 700,000 as of 30th March 2020 and still increasing exponentially. The total number of cases worldwide has already surpassed the number due to the severe acute respiratory syndrome (SARS) in the early 2000s. Many reports have suggested that this new virus is becoming comparable to the Spanish flu pandemic of 1918.

The most common symptoms of COVID-19 are almost identical to those of the flu, e.g. high fever, fatigue, cough and shortness of breath. Individuals have been required to self-isolate if they believe that they are exhibiting these symptoms. The most severe symptoms have been linked to pneumonia, multi-organ failure, and death. Other symptoms of COVID-19 include the loss of sense of smell (anosmia), and in some cases individuals may display no symptoms at all but still carry the virus. The global effects of COVID-19 have led many countries, for example China, the UK, Italy, Spain, France and many others, to lock down their international borders, cities and towns for extended periods. Hence, many are fearful that another global recession is on the horizon. On the other hand, after a two month national lock-down, China has shown the world that a strict lock-down can contribute to a reduction in the number of new cases and deaths from COVID-19, with the number of new cases recently decreasing to zero.

The current literature relating to COVID-19 is limited and the majority of the known work focuses exclusively on China, where the first major outbreaks occurred. The existing research has focused on topics such as determining the population who are most at risk, the factors increasing the risk of infection, the medical properties of those who become infected, the factors that can improve clinical outcomes and reduce the spread of the virus, the biological properties of the virus, and many others. See, for example, [1-8].

More specifically, the literature relating to COVID-19 analysis outside of China has been limited. Since these countries are lagging behind China in terms of the overall spread of the disease, much of the literature has been focused on modelling and predicting the disease in the early stages of the outbreak, particularly the daily incidence (number of new confirmed cases per day) and the basic reproductive number. Examples include Italy [9, 10], France [11], and Japan [12-14], to name but a few.

Thus far, a wide range of statistical and predictive methods have already been applied to the analysis of COVID-19 in China, for example traditional epidemic models, such as the SIR model [15, 16] and the basic reproductive number [17]; neural networks [18]; regression models [19, 20]; experimental frameworks [21]; and correlation analysis [22].

From the literature, it is evident that the majority of the analyses on COVID-19 are limited to China, and a limited number of countries in Asia and Europe. These are arguably the countries that first identified known cases of COVID-19. However, we should note that since this is an ongoing situation, new research is being published daily and therefore the literature is being updated continuously. Hence, our main motivation is to provide a statistical analysis in modelling and analysing the number of confirmed cases of COVID-19 in eighteen countries around the world. The main contributions of this paper are: (i) to provide a statistical analysis of COVID-19 worldwide; (ii) to investigate whether it is possible to utilise count regression models for fitting and predicting the number of daily confirmed cases due to COVID-19 globally.

The contents of this paper are organised as follows. Section 2 describes the data used in our analysis. In Section 3, we detail the methodology and models used. Section 4 outlines the results, and provides a discussion of these results. Section 5 provides a conclusion and summary of our results.

The data we analyse consist of the historical daily new confirmed cases of COVID-19 in eighteen different countries worldwide (China, Denmark, Estonia, France, Germany, Italy, Malaysia, Philippines, Qatar, South Korea, Sri Lanka, Sweden, Taiwan, Thailand, UAE, UK, USA, Vietnam), listed on the EU Open Data Portal from 31st December 2019 to 25th March 2020. These countries were chosen because they were the earliest countries to detect COVID-19 infections.

The data were downloaded from the website of the European Centre for Disease Prevention and Control (ECDC), which sources its data from the WHO, and our analysis is limited to the data available at the time of writing. The eighteen countries were chosen based on their ranking in terms of the highest numbers of cases; thus we believe that the data obtained give a satisfactory representation of the main countries affected by the virus at these times.

In epidemiology and the study of infectious diseases, count-based data related to incidence are commonplace. In particular, data such as the daily incidence (number of cases) relating to an infectious disease can be modelled and predicted using a wide variety of methods, including compartmental (or deterministic) models such as the SIR and SEIR models, and stochastic models such as discrete time and continuous time Markov chains, and stochastic differential equations. In this study, we apply discrete time count regression models with the aim of modelling and predicting the daily incidence of COVID-19 across the world. Such models are preferred because they provide an appropriate, rich, and flexible modelling environment for non-negative integers. In addition, the models are robust for estimating constant relative policy effects, and when applied to policy evaluations, such models can move beyond the consideration of mean effects and determine the effect on the entire distribution of outcomes instead [23]. Poisson count regression models are part of the family of generalised linear models that are commonly used in epidemiological studies [24]. The Poisson and negative binomial regression models are widely used for modelling discrete count data where the count takes a non-negative integer value with no upper limit and the data are highly skewed. The negative binomial regression has the added advantage of being able to deal with the problem of overdispersion, that is, a variance larger than the mean [25].
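As a rough illustration of what overdispersion means in practice, the sketch below computes the ratio of sample variance to sample mean for a count series. The function name `dispersion_ratio` and the data are hypothetical, invented for illustration only. A Poisson distribution has a ratio of one, so values well above one hint that a negative binomial model may be more suitable; note that a strong trend in the series also inflates this crude ratio, so it is only a first check.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Return sample variance divided by sample mean of a count series."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return variance(counts) / m


# Hypothetical daily counts, invented for illustration only.
daily_cases = [0, 1, 3, 2, 8, 15, 30, 42, 77, 120]
ratio = dispersion_ratio(daily_cases)
print(f"variance / mean = {ratio:.1f}")
```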

The four models below are due to Christou and Fokianos [26, 27], Fokianos and Fried [28], Fokianos et al. [29] and Fokianos and Tjostheim [30]. The models due to Christou and Fokianos [26] are based on the negative binomial distribution. The models due to the others are based on the Poisson distribution. Both Poisson and negative binomial distributions are commonly implemented when dealing with count data and observations occurring at a specific rate.

Let Z_t denote the number of newly confirmed cases in a country on day t, t = 1, ..., T. In other words, Z_t is the change in the cumulative confirmed cases from day t − 1 to day t. For each of the eighteen countries selected, the following four regression models were fitted to the corresponding daily incidence data:

where F_{t−1} denotes the history up to day t − 1, α represents the intercept parameter, and β is the slope parameter.
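As a hedged sketch of the general shape of the four specifications (two distributions combined with two link functions), the block below writes out the Poisson and negative binomial conditional distributions in the parametrisation used by the tscount package, together with the identity and log links. The symbols λ_t (conditional mean), φ (dispersion) and x_t are assumptions introduced here for illustration: x_t is a placeholder for the regressor multiplied by the slope, which in this framework is typically a function of past observations, and its exact form should be taken from the cited references.

```latex
% Conditional mean: \lambda_t = E(Z_t \mid \mathcal{F}_{t-1}).

% Poisson conditional distribution:
P(Z_t = z \mid \mathcal{F}_{t-1})
  = \frac{\lambda_t^{z} e^{-\lambda_t}}{z!}, \qquad z = 0, 1, \ldots

% Negative binomial conditional distribution with dispersion \phi > 0,
% so that Var(Z_t \mid \mathcal{F}_{t-1}) = \lambda_t + \lambda_t^{2}/\phi:
P(Z_t = z \mid \mathcal{F}_{t-1})
  = \frac{\Gamma(\phi + z)}{\Gamma(z + 1)\,\Gamma(\phi)}
    \left( \frac{\phi}{\phi + \lambda_t} \right)^{\phi}
    \left( \frac{\lambda_t}{\phi + \lambda_t} \right)^{z},
    \qquad z = 0, 1, \ldots

% Link functions (x_t is a placeholder regressor, see the lead-in):
\lambda_t = \alpha + \beta x_t \quad \text{(identity link)}, \qquad
\log \lambda_t = \alpha + \beta x_t \quad \text{(log link)}.
```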

Each of the four models was fitted by the method of maximum likelihood. That is, by maximising

respectively, with respect to α, β and φ. We shall denote the maximum likelihood estimates by α̂, β̂ and φ̂, respectively.
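To make the objective functions concrete, here is a minimal sketch of the Poisson and negative binomial log-likelihoods, assuming the mean-dispersion parametrisation of the negative binomial distribution (variance equal to mean plus mean squared over φ). The function names and the generic regressor `xs` are hypothetical; the sketch only evaluates the log-likelihood for given parameter values and leaves the maximisation to an optimiser of the reader's choice.

```python
import math


def means_log_link(alpha, beta, xs):
    """Conditional means under a log link: log(lambda_t) = alpha + beta * x_t."""
    return [math.exp(alpha + beta * x) for x in xs]


def poisson_loglik(counts, means):
    """Poisson log-likelihood of the counts given their conditional means."""
    return sum(z * math.log(lam) - lam - math.lgamma(z + 1)
               for z, lam in zip(counts, means))


def negbin_loglik(counts, means, phi):
    """Negative binomial log-likelihood with dispersion parameter phi > 0."""
    total = 0.0
    for z, lam in zip(counts, means):
        total += (math.lgamma(phi + z) - math.lgamma(z + 1) - math.lgamma(phi)
                  + phi * math.log(phi / (phi + lam))
                  + z * math.log(lam / (phi + lam)))
    return total
```

As φ grows large the negative binomial log-likelihood approaches the Poisson one, which is a convenient sanity check on any implementation.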

For a more in-depth discussion of the four regression models we refer the readers to the literature cited above. The models were fitted using the command tsglm in the R package tscount [31]. For each of the fitted models, we computed the Akaike information criterion (AIC), Bayesian information criterion (BIC) and associated p-values obtained by re-sampling. The AIC values for the four models were computed as

The BIC values for the four models were computed as
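Assuming the standard definitions of the two criteria were used, they take the form sketched below. Here ℓ denotes the maximised log-likelihood of the model in question, k the number of estimated parameters, and T the number of daily observations; the symbols ℓ and k are introduced here for illustration, and the effective sample size used by the fitting software may differ slightly from T.

```latex
\mathrm{AIC} = 2k - 2\,\ell(\hat{\theta}), \qquad
\mathrm{BIC} = k \log T - 2\,\ell(\hat{\theta}),

% with k = 2 and \hat{\theta} = (\hat{\alpha}, \hat{\beta})
% for the two Poisson models, and
% k = 3 and \hat{\theta} = (\hat{\alpha}, \hat{\beta}, \hat{\phi})
% for the two negative binomial models.
```

Smaller values of either criterion indicate a better trade-off between fit and complexity; the BIC penalises the extra dispersion parameter more heavily whenever log T exceeds 2.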

The values are given in Table 1. According to AIC and BIC values, the best model out of the four is the negative binomial model with a logarithmic link function. Table 2 gives the estimates of the intercept and slope parameters along with their corresponding standard errors for this model. Also given in Table 2 are the p-values quantifying the significance of the slope parameter. In line with standard significance levels, if the p-value is less than 0.05 then the slope estimate is deemed to be significant.

We applied the models specified in Section 3 and fitted them to our data on the number of new daily cases of individuals infected with COVID-19 from eighteen different countries worldwide. According to Table 2, the majority of p-values corresponding to the best fitting model (negative binomial model with a logarithmic link function) for each country's data are smaller than 0.05, indicating significance of the slope coefficient estimates at the 5 percent significance level. However, a particular exception is that of China, whose p-value is well above 0.05. This result is, perhaps, not surprising as China was the first country to be majorly affected by COVID-19, and by the time most other countries started to see significant increases in new numbers of cases, its numbers had already peaked and new cases in China were being confirmed at a slower rate.

Among the countries where the model appears to show a reasonable fit, the slope estimate was positive in all cases, indicating that the number of new cases confirmed each day is expected to increase with respect to time. In particular, the UK and Vietnam have the largest and smallest slope estimates, respectively; hence the rate of increase in new daily COVID-19 cases with time is the highest for the UK and the lowest for Vietnam.

The predicted median at time t, say M(t), was computed as the solution of

The predicted 95 percent confidence interval at time t, say [L(t), U(t)], was computed as the solutions of
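One plausible reading of these definitions, sketched below under stated assumptions, is that M(t), L(t) and U(t) are the 0.5, 0.025 and 0.975 quantiles of the fitted negative binomial distribution at time t. The equal-tailed interval, the convention of taking the smallest count whose cumulative probability reaches the target, and all function names are assumptions made for illustration rather than the procedure confirmed by the text.

```python
import math


def negbin_pmf(z, lam, phi):
    """Negative binomial probability of count z with mean lam and dispersion phi."""
    log_p = (math.lgamma(phi + z) - math.lgamma(z + 1) - math.lgamma(phi)
             + phi * math.log(phi / (phi + lam))
             + z * math.log(lam / (phi + lam)))
    return math.exp(log_p)


def negbin_quantile(p, lam, phi, max_count=1_000_000):
    """Smallest count z whose cumulative probability is at least p."""
    cumulative = 0.0
    for z in range(max_count + 1):
        cumulative += negbin_pmf(z, lam, phi)
        if cumulative >= p:
            return z
    raise RuntimeError("quantile not reached; increase max_count")


def median_and_interval(lam, phi, level=0.95):
    """Return (median, lower, upper) using an equal-tailed interval."""
    tail = (1.0 - level) / 2.0
    return (negbin_quantile(0.5, lam, phi),
            negbin_quantile(tail, lam, phi),
            negbin_quantile(1.0 - tail, lam, phi))


# Hypothetical fitted mean and dispersion, for illustration only.
m, lower, upper = median_and_interval(lam=2.0, phi=1.0)
print(m, lower, upper)
```

In use, `lam` would be the fitted conditional mean for the day being predicted and `phi` the estimated dispersion; because counts are discrete, the interval generally covers slightly more than the nominal level.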

The actual number of new cases falls within the 95 percent confidence intervals for each of the eighteen countries (for 7 days, 10 days and 15 days), suggesting that the fitted model is robust in spite of being simple. For some countries, such as Denmark, Malaysia, and the Philippines, the actual and predicted values are reasonably close. On the other hand, for many countries (Estonia, France, Germany, Italy, etc.) the predicted values overestimate the actual number of new cases. However, in a few instances, e.g. Qatar and the United Arab Emirates, the actual number of new daily cases starts to outgrow the predicted values in the latter half of the 10 days (the same was observed for 7 days and 15 days). Note that these countries do not appear to share a common connection. Although the regression model accounts for the historical number of daily cases (and the average rate of new daily cases), a possible explanation for why it may under- or overestimate the true number of new daily cases is that it does not take into account many other factors that can influence the spread of infectious diseases, such as the behaviour of individuals (e.g. social, travel, etc.), government action, and economic policies. Whilst this method has the advantages of being simple, straightforward and yet robust, the results should be interpreted with caution. They allow us to capture the general trend of the new daily cases in each country and generate some basic predictions in the short term. However, arguably, this approach misses key factors that are accounted for in other types of available models. Therefore, it would not be wise to use the results presented here alone to make policy decisions; rather, these results should be used in conjunction with those from other analyses, which they can help to support or contradict.
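The two checks described above, whether the actual values fall inside the prediction intervals and whether the predictions tend to over- or underestimate, can be sketched as follows. The function names and all numbers are hypothetical and invented for illustration; they are not taken from the paper's figures.

```python
def coverage(actual, lower, upper):
    """Fraction of days on which the actual count lies within [lower, upper]."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


def mean_error(actual, predicted):
    """Average of predicted minus actual; positive values mean overestimation."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical 3-day horizon, invented for illustration only.
actual = [5, 10, 20]
predicted = [6, 12, 30]
lower = [0, 8, 25]
upper = [9, 15, 40]
print(coverage(actual, lower, upper), mean_error(actual, predicted))
```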

Furthermore, we do not consider here the historical daily mortality due to COVID-19 as there exist many dependent factors that should be considered when modelling these numbers. Examples include available treatments, susceptible population, hospital capacity, transmission rate, location and elevation risk, socio-economic factors and many more. Such data can often be limited or hard to obtain due to restrictions such as data privacy or unreliable reporting. For further information we refer the readers to Booth and Tickle [32].

Finally, we check the robustness of the (log, negative binomial) model. We fitted all four models ((identity, Poisson), (log, Poisson), (identity, negative binomial) and (log, negative binomial)) to the two halves of the data set. The first half was taken as the data from 31 December 2019 to 11 February 2020. The second half was taken as the data from 12 February 2020 to 25 March 2020. The values of AIC and BIC for the four models for each half are given in Tables 3 and 4. We see that the (log, negative binomial) model gives the smallest values for each country and for each half.
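As a small check on the split used for this robustness exercise, the sketch below enumerates the calendar days in each half from the dates stated above. The helper name `day_range` is hypothetical; the dates themselves are those given in the text.

```python
from datetime import date, timedelta


def day_range(first, last):
    """List every calendar day from first to last, both inclusive."""
    n_days = (last - first).days + 1
    return [first + timedelta(days=i) for i in range(n_days)]


first_half = day_range(date(2019, 12, 31), date(2020, 2, 11))
second_half = day_range(date(2020, 2, 12), date(2020, 3, 25))

# Both halves have the same length, so the comparison is balanced.
print(len(first_half), len(second_half))  # prints: 43 43
```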

We have provided a statistical study on the modelling and analysis of the daily incidence of COVID-19 in eighteen countries around the world. In particular, we have investigated whether it is possible to fit count regression models to the number of daily new cases of COVID-19 in various countries and make short term predictions of these numbers. The results suggest that the biggest advantage of these methods is that they are simple and straightforward, allowing us to obtain preliminary results and an overall picture of the trends in the daily confirmed cases of COVID-19 in different countries.

The best fitting count regression model for the number of new daily COVID-19 cases in all countries was shown to be a negative binomial model with a log link function. The best fitting model was robust in that the 95 percent confidence intervals for prediction contained the actual number of new cases for each country. However, the model was not able to predict the trends of new daily cases well for China. We believe that this could be related to the fact that China was the first country to be significantly affected, and by the time other countries started to be affected by COVID-19, China had already reached its peak in confirmed cases and its confirmed cases had dramatically declined. These results suggest that this model may be more useful for modelling the early stages of an outbreak, when the number of new cases is increasing. More specifically, they suggest that a count regression model is better suited for modelling new daily cases when the trend is increasing linearly, semi-exponentially, or exponentially. Among the countries that fit well with this model, the slope estimate was positive in all cases, indicating that the number of new cases being confirmed each day is expected to increase with respect to time. The UK and Vietnam have the largest and smallest slope estimates, respectively; hence the rate of the daily increase in COVID-19 cases is highest for the UK and lowest for Vietnam. The model is beneficial for short term predictions in order to see the short term trend and the rate of growth of new cases when no intervention measures are taken. In addition, the results could be useful in contributing to health policy decisions or government intervention, but more importantly, these results should be used in conjunction with the results from other mathematical models that are more specific to epidemiology.

Direct extensions to the current work could include modelling the daily mortality due to COVID-19. Such models could incorporate dependent factors that influence the mortality rate, such as available treatments, susceptible population, hospital capacity, transmission rate, location and elevation risk, socio-economic factors and many more. A further extension is to seek models that are theoretically motivated for COVID-19 data.

Stephen Chan: Introduction, Motivation, Data. Jeffrey Chu: Analysis, Discussion. Yuanyuan Zhang: Analysis, Discussion. Saralees Nadarajah: Methods, Fitting.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study

Early prediction of disease progression in 2019 novel coronavirus pneumonia patients outside Wuhan with CT and clinical characteristics

Epidemiological and transmission patterns of pregnant women with 2019 coronavirus disease in China

Estimation of the transmission risk of the 2019-nCoV and its implication for public health interventions

Modelling the epidemic trend of the 2019 novel coronavirus outbreak in China

The 2019 new coronavirus epidemic: Evidence for virus evolution

Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data

The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19)-China

Modelling and predicting the spread of Coronavirus (COVID-19) infection in NUTS-3 Italian regions

COVID-19 and Italy: What next?

Mechanistic-statistical SIR modelling for early estimation of the actual number of cases and mortality rate from COVID-19

Prediction of the epidemic peak of coronavirus disease in Japan

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis

Network-based prediction of the 2019-nCoV epidemic outbreak in the Chinese province Hubei

Epidemic analysis of COVID-19 in China by dynamical modeling

The reproductive number of COVID-19 is higher compared to SARS coronavirus

Finding an accurate early forecasting model from small dataset: A case of 2019-nCoV novel coronavirus outbreak

Prediction and analysis of coronavirus disease

Prediction of the number of new cases of 2019 novel coronavirus (COVID-19) using a social media search index

Prediction of epidemic spread of the 2019 novel coronavirus driven by spring festival transportation in China: A population-based study

Temporal relationship between outbound traffic from Wuhan and the 2019 coronavirus disease (COVID-19) incidence in China

Evidence-Based Policy Making in Labor Economics: The IZA World of Labor Guide

Poisson regression analysis in clinical research

Comparison of statistical approaches to evaluate factors associated with metabolic syndrome

Quasi-likelihood inference for negative binomial time series models

Estimation and testing linearity for non-linear mixed Poisson autoregressions

Interventions in log-linear Poisson autoregression

Poisson autoregression

Log-linear Poisson autoregression

R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing

Mortality modelling and forecasting: A review of methods

The authors would like to thank the Editor and the three referees for careful reading and comments which greatly improved the paper.