key: cord-032332-bfggtolv
authors: Liu, Zhe
title: Uncertain growth model for the cumulative number of COVID-19 infections in China
date: 2020-09-19
journal: Fuzzy Optim Decis Making
DOI: 10.1007/s10700-020-09340-x
sha:
doc_id: 32332
cord_uid: bfggtolv

COVID-19, a coronavirus disease, has spread quickly to the majority of countries worldwide and seriously threatens human health and security. This paper aims to depict cumulative numbers of COVID-19 infections in China using the growth model chosen by cross validation. The residual plot does not look like a null plot, so we cannot find a distribution function for the disturbance term that is close enough to the true frequency. Therefore, the disturbance term cannot be characterized as a random variable, and stochastic regression analysis is invalid in this case. To better describe this pandemic automatically, this paper first employs uncertain growth models with the help of uncertain hypothesis tests to detect and modify outliers in data. The forecast value and confidence interval for the cumulative number of COVID-19 infections in China are provided.

Regression analysis estimates relationships among variables. Although stochastic regression analysis has a long history of development, it has been studied entirely under the framework of probability theory. However, the premise of probability theory, i.e., that the estimated distribution is close enough to the true frequency, cannot be satisfied in many cases. Motivated by this, Liu (2007, 2009) founded uncertainty theory based on the normality, duality, subadditivity, and product axioms to better address the inaccuracy of the human system. Currently, this theory has been successfully applied in uncertain statistics (Liu 2010). For example, Yang and Liu (2019) first presented uncertain time series analysis to predict future values based on imprecise observations.
Following that, further studies (Yang and Ni 2020; Liu and Yang 2020b; Tang 2020) examined and extended this topic. In particular, Ye and Yang (2020) applied uncertain time series to modelling cumulative numbers of COVID-19 infections.

Uncertain regression analysis has been developed to address relationships between variables under the framework of uncertainty theory. Yao and Liu (2018) proposed least squares estimation for the unknown parameters in uncertain multiple regression models. Motivated by this, researchers considered other estimators such as least absolute deviations estimation (Liu and Yang 2020a), Tukey's biweight estimation (Chen 2020), and maximum likelihood estimation (Lio and Liu 2020). In addition, Lio and Liu (2018) provided the confidence interval for the response variable. To evaluate different uncertain regression models, cross-validation methods (Liu and Jia 2020; Liu 2019) have attracted some scholars' attention. Later, estimators for the unknown parameters in uncertain multivariate regression models were derived (Ye and Liu 2020b; Zhang et al. 2020). The appropriateness of the estimates for unknown parameters in uncertain regression analysis can be tested using the uncertain hypothesis test (Ye and Liu 2020a).

There is no doubt that emerging infectious diseases seriously threaten human health and security. Coronavirus disease 2019 (abbreviated "COVID-19"), an emerging respiratory infectious disease, is caused by severe acute respiratory syndrome coronavirus 2 and has been characterized as a "pandemic" by the World Health Organization. Reported illnesses of COVID-19 have ranged from very mild to severe, including illness resulting in death. In a short time, the outbreak has rapidly worsened and has received considerable global attention.
After initial hesitation in December 2019 and January 2020, the Chinese government took a series of multifaceted public health interventions, such as home confinement, traffic restrictions, and centralized quarantine, to strengthen control of the COVID-19 outbreak. In contrast, many other countries are in the acceleration phase of the pandemic. According to the World Health Organization (2020), the number of confirmed cases worldwide reached 332,930 by March 23, 2020.

The relationship between the cumulative number of COVID-19 infections and time is a core issue reflecting the severity of COVID-19. Growth models, a type of nonlinear regression model, represent how a particular quantity increases over time, and have been used to describe previous epidemic growth patterns (Chowell et al. 2016). To guide prevention and mitigation plans, this paper aims to characterize the cumulative number of COVID-19 infections in China excluding imported cases (hereinafter referred to as cumulative numbers of COVID-19 infections in China) using growth models from February 13 to March 23, 2020. The data before February 13 are not used because they are not real-time data due to limited testing capacity.

Section 2 chooses the best S-shaped growth model among several growth models using cross validation, and gives the results obtained with stochastic growth models. Section 3 introduces some fundamental knowledge about the uncertain growth model. The uncertain hypothesis test and data modification are introduced in Sect. 4. Section 5 presents an algorithm. Following that, Sect. 6 analyzes cumulative numbers of COVID-19 infections in China using uncertain growth models and uncertain hypothesis tests. Section 7 compares the results obtained from the stochastic growth model and the uncertain growth model. Finally, Sect. 8 concludes this paper.
Before February 13, cumulative numbers of COVID-19 infections in China are not real-time data due to limited testing capacity. Thus we focus on cumulative numbers of COVID-19 infections in China from February 13 to March 23, 2020, which are listed in Table 1. As Fig. 1 shows, in this period the pandemic growth pattern in China coincides with an S-shaped growth curve owing to the urgent and ambitious measures taken by the government. For example, from January 23 the government blocked all outbound transportation from Wuhan and banned public transit. A series of social distancing measures, such as compulsory mask wearing and cancellation of social gatherings, were also implemented. On February 17, the government initiated door-to-door symptom screening for all residents.

Table 2 Different S-shaped growth models and average testing error (ATE) with 5-fold cross validation (columns: Form, ATE)

To select the model with the best generalization ability among several S-shaped growth models, we apply v-fold cross validation. Data are partitioned into v subsets of almost equal size, and the data in each subset are treated in turn as future values to calculate the squared testing error, with parameters estimated using the data in the other subsets. Then, the model with the smallest average testing error is selected. We use different S-shaped growth models to investigate how the cumulative number of COVID-19 infections in China increases over time. Average testing errors for the different S-shaped growth models are shown in Table 2. Based on these errors, we choose the logistic growth model

$$y = \frac{\beta_0}{1 - \beta_1 \exp(-\beta_2 x)} + \varepsilon$$

where the response variable $y$ represents the cumulative number of COVID-19 infections in China, the explanatory variable $x$ represents the day after February 12, 2020, and $\varepsilon$ is a disturbance term. From January 20 to March 23, 2020, the explanatory variable $x$ takes the integer values $-23, -22, \ldots, 40$; that is, $x = -23$ on January 20, $x = -22$ on January 21, and so on up to $x = 40$ on March 23, 2020.

With cumulative numbers of COVID-19 infections in China from February 13 ($x = 1$) to March 23, 2020 ($x = 40$), the estimates $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ of the unknown parameters $(\beta_0, \beta_1, \beta_2)$ are computed using the function "lsqnonlin" in MATLAB Version 8.6.0.267246. In the resulting stochastic growth model (1), the disturbance term $\varepsilon$ has an estimated expected value $\hat e = -0.3274$ and an estimated variance $\hat\sigma^2 = 256.24^2$.

To evaluate this model, we consider the coefficient of determination ($R^2$), which represents the proportion of the variance of the dependent variable that is explained by the variables in a regression model. With data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, the total sum of squares is $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar y)^2$ where $\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i$, the residual sum of squares is $SS_{res} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$ where $\hat y_i$ is the fitted value, and the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$. The $R^2$ for the stochastic growth model (1) is 0.9963, against a best possible value of $R^2 = 1$.

Stochastic regression analysis requires that the residual plot look like a null plot, i.e., one with constant mean, constant variance, and no separated points. However, the residual plot for the stochastic growth model (1) shown in Fig. 2 does not appear to be a null plot. In this situation, any distribution function that stochastic regression analysis assigns to the disturbance term is not close enough to its real frequency. Therefore, the disturbance term cannot be characterized as a random variable. As a result, stochastic regression analysis is invalid in this case, which is why we turn to uncertain regression analysis next.

In this section, we introduce some fundamental knowledge about the uncertain growth model, including parameter estimation, residual analysis, and prediction.
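As a point of reference for the residual analysis, the coefficient of determination and the residual moments used for the stochastic model above can be computed in a few lines. The following is a minimal Python sketch, not the author's MATLAB code: the function names `r_squared` and `residual_moments` and the toy numbers are illustrative assumptions, the fitted values are taken as given rather than produced by a logistic fit, and the variance is assumed to be the population form (division by n).

```python
from typing import Sequence, Tuple


def r_squared(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    n = len(y)
    y_bar = sum(y) / n
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


def residual_moments(y: Sequence[float],
                     y_hat: Sequence[float]) -> Tuple[float, float]:
    """Mean and variance (divide-by-n, an assumption) of y_i - y_hat_i."""
    res = [yi - fi for yi, fi in zip(y, y_hat)]
    n = len(res)
    e_hat = sum(res) / n
    var_hat = sum((r - e_hat) ** 2 for r in res) / n
    return e_hat, var_hat


# Toy data (hypothetical, not taken from Table 1).
y_obs = [10.0, 20.0, 30.0, 40.0]
y_fit = [12.0, 18.0, 31.0, 39.0]
print(r_squared(y_obs, y_fit))
print(residual_moments(y_obs, y_fit))
```

A high value of `r_squared` alone does not make the residual plot a null plot, which is the point made above: the residual moments summarize the disturbance term, but its pattern over time still has to be inspected.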
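The uncertain logistic growth model is

$$y = \frac{\beta_0}{1 - \beta_1 \exp(-\beta_2 x)} + \varepsilon \qquad (2)$$

where $y$ represents the cumulative number of COVID-19 infections in China, $x$ represents the day after February 12, 2020, and $\varepsilon$ is a disturbance term characterized as an uncertain variable. With data $(x_i, y_i)$, $i = 1, 2, \ldots, n$ for the uncertain growth model (2), the least squares estimates (Yao and Liu 2018) $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ of $(\beta_0, \beta_1, \beta_2)$ solve the minimization problem

$$\min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^{n} \left( y_i - \frac{\beta_0}{1 - \beta_1 \exp(-\beta_2 x_i)} \right)^2 \qquad (3)$$

and the fitted growth model is

$$y = \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x)}.$$

In the uncertain growth model (2), the disturbance term $\varepsilon$ represents the difference between the predicted value and the true value of the response variable. The estimated expected value and variance of $\varepsilon$ (Lio and Liu 2018) are suggested as

$$\hat e = \frac{1}{n} \sum_{i=1}^{n} \tilde\varepsilon_i \qquad (4)$$

and

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\tilde\varepsilon_i - \hat e)^2 \qquad (5)$$

respectively, where $\tilde\varepsilon_i$ are the $i$-th residuals defined as

$$\tilde\varepsilon_i = y_i - \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x_i)} \qquad (6)$$

for $i = 1, 2, \ldots, n$. If we further assume that the disturbance term $\varepsilon$ is a normal uncertain variable, then the uncertainty distribution of $\varepsilon$ can be suggested as

$$\Phi(z) = \left( 1 + \exp\left( \frac{\pi (\hat e - z)}{\sqrt{3}\, \hat\sigma} \right) \right)^{-1}.$$

For a new explanatory variable $x$, the forecast uncertain variable (Lio and Liu 2018) of $y$ is

$$\hat y = \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x)} + \varepsilon, \quad \varepsilon \sim N(\hat e, \hat\sigma) \qquad (7)$$

and the forecast value of $y$ is

$$\hat\mu = \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x)} + \hat e. \qquad (8)$$

According to the operational law for calculating inverse uncertainty distributions (Liu 2007), the inverse uncertainty distribution of $\hat y$ is

$$\hat\Phi^{-1}(\alpha) = \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x)} + \Phi^{-1}(\alpha) \qquad (9)$$

with

$$\Phi^{-1}(\alpha) = \hat e + \frac{\hat\sigma \sqrt{3}}{\pi} \ln \frac{\alpha}{1 - \alpha}.$$

Then the $\beta$ ($0 < \beta < 1$) confidence interval of $y$ (Lio and Liu 2018) is

$$\hat\mu \pm b \qquad (10)$$

where $b$ is the minimum value such that $\hat\Phi(\hat\mu + b) - \hat\Phi(\hat\mu - b) \ge \beta$.

In this section, we apply the uncertain hypothesis test (Ye and Liu 2020a) to test the appropriateness of the estimated expected value $\hat e$ and the estimated variance $\hat\sigma^2$ of the disturbance term $\varepsilon$ in the uncertain growth model (2), and to modify outliers. Suppose that the disturbance term $\varepsilon$ in the uncertain growth model (2) is a normal uncertain variable with expected value $e$ and variance $\sigma^2$, i.e., $\varepsilon \sim N(e, \sigma)$.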
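Under this normality assumption, the uncertainty distribution, its inverse, and the confidence-interval half-width b all have closed forms. The following is a minimal Python sketch of these three quantities; the function names are illustrative, the closed form for b is derived here from the normal uncertainty distribution rather than quoted from the paper, and the values e = 0 and sigma = 200 are hypothetical placeholders, not estimates from Table 1.

```python
import math


def normal_cdf(z: float, e: float, sigma: float) -> float:
    """Uncertainty distribution of a normal uncertain variable N(e, sigma)."""
    return 1.0 / (1.0 + math.exp(math.pi * (e - z) / (math.sqrt(3.0) * sigma)))


def normal_inv(a: float, e: float, sigma: float) -> float:
    """Inverse uncertainty distribution of N(e, sigma), valid for 0 < a < 1."""
    return e + sigma * math.sqrt(3.0) / math.pi * math.log(a / (1.0 - a))


def ci_half_width(beta: float, sigma: float) -> float:
    """Smallest b with Phi(e + b) - Phi(e - b) >= beta for N(e, sigma)."""
    return sigma * math.sqrt(3.0) / math.pi * math.log((1.0 + beta) / (1.0 - beta))


# Hypothetical disturbance term: e = 0, sigma = 200 (placeholders).
e, sigma = 0.0, 200.0
b = ci_half_width(0.95, sigma)
coverage = normal_cdf(e + b, e, sigma) - normal_cdf(e - b, e, sigma)
print(b, coverage)
```

Because the fitted value only shifts the distribution, the same b serves as the half-width of the confidence interval around the forecast value. With these formulas, the quantiles at 0.005 and 0.995 lie about 2.918 sigma on either side of e, which is where the cut-offs for a significance level of 0.01 fall.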
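Then for the two-sided hypothesis $H_0: e = \hat e$ and $\sigma = \hat\sigma$ versus $H_1: e \ne \hat e$ or $\sigma \ne \hat\sigma$, the uncertain hypothesis test at significance level $\alpha$ is

$$W = \left\{ (\tilde\varepsilon_1, \ldots, \tilde\varepsilon_n) : \text{at least a proportion } \alpha \text{ of the indexes } i,\ 1 \le i \le n, \text{ satisfy } \tilde\varepsilon_i < \Phi^{-1}(\alpha/2) \text{ or } \tilde\varepsilon_i > \Phi^{-1}(1 - \alpha/2) \right\} \qquad (11)$$

where $\tilde\varepsilon_i$ are defined in (6), and

$$\Phi^{-1}(\alpha) = \hat e + \frac{\hat\sigma \sqrt{3}}{\pi} \ln \frac{\alpha}{1 - \alpha}.$$

The estimated expected value $\hat e$ and estimated variance $\hat\sigma^2$ of the disturbance term $\varepsilon$ in the uncertain growth model (2) are appropriate if $(\tilde\varepsilon_1, \tilde\varepsilon_2, \ldots, \tilde\varepsilon_n) \notin W$. If $\tilde\varepsilon_i < \Phi^{-1}(\alpha/2)$ or $\tilde\varepsilon_i > \Phi^{-1}(1 - \alpha/2)$, then $(x_i, y_i)$ is called an outlier. We modify $y_i$ as

$$y_i = \frac{\hat\beta_0}{1 - \hat\beta_1 \exp(-\hat\beta_2 x_i)}$$

and get the corresponding modified data $(x_i, y_i)$.

In this section, we present an algorithm that summarizes the previous sections.

Step 1 (Parameter estimation). With the data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, compute the least squares estimates $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ of $(\beta_0, \beta_1, \beta_2)$ in the uncertain logistic growth model $y = \beta_0/(1 - \beta_1 \exp(-\beta_2 x)) + \varepsilon$, the $i$-th residuals $\tilde\varepsilon_i$ ($i = 1, 2, \ldots, n$), the estimated expected value $\hat e$, and the estimated variance $\hat\sigma^2$ according to Eqs. (3), (6), (4), and (5), respectively.

Step 2 (Uncertain hypothesis test). For a given significance level $\alpha$ (e.g. $\alpha = 0.01$), construct the uncertain hypothesis test $W$ defined in (11).

Step 3 (Data modification). If $(\tilde\varepsilon_1, \tilde\varepsilon_2, \ldots, \tilde\varepsilon_n) \notin W$, go to Step 4. Otherwise, for each $i$ ($i = 1, 2, \ldots, n$) such that $(x_i, y_i)$ is an outlier, set $y_i = \hat\beta_0/(1 - \hat\beta_1 \exp(-\hat\beta_2 x_i))$. Go to Step 1.

Step 4 (Forecast). For a new explanatory variable $x$, calculate the forecast uncertain variable, the forecast value, and the $\beta$ (e.g. $\beta = 0.95$) confidence interval of $y$ according to Eqs. (7), (8), and (10), respectively.

With cumulative numbers of COVID-19 infections in China from February 13 to March 23, 2020 shown in Table 1, the least squares estimates $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ of $(\beta_0, \beta_1, \beta_2)$ are computed according to Eq. (3), and the uncertain hypothesis test is applied to the residuals. As shown in Fig. 3, the data point on February 18 does not pass the test. We modify the data point on February 18 and re-estimate the unknown parameters in the logistic growth model. It follows from Eq. (3) that the uncertain least squares estimates give the refitted uncertain growth model (13). Similarly, the estimated expected value and estimated variance of the disturbance term $\varepsilon$ in uncertain growth model (13) are $\hat e = 0.1196$ and $\hat\sigma^2 = 208.28^2$, respectively, according to Eqs. (4) and (5).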
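The repeated fit, test, and modify rounds described here follow the algorithm of Sect. 5, which can be sketched as a short loop. The following Python sketch is illustrative only and rests on stated assumptions: the rejection rule is taken to be "the number of flagged residuals is at least alpha times n", the fitter is a pluggable function, and `fit_line` (an ordinary least squares straight line) is a hypothetical stand-in for the logistic least squares fit, which the paper performs with MATLAB's "lsqnonlin". All function names and the toy data are invented for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def normal_inv(a: float, e: float, sigma: float) -> float:
    """Inverse uncertainty distribution of a normal uncertain variable."""
    return e + sigma * math.sqrt(3.0) / math.pi * math.log(a / (1.0 - a))


def residual_moments(res: Sequence[float]) -> Tuple[float, float]:
    """Estimated expected value and standard deviation of the residuals."""
    n = len(res)
    e_hat = sum(res) / n
    sigma_hat = math.sqrt(sum((r - e_hat) ** 2 for r in res) / n)
    return e_hat, sigma_hat


def outlier_indexes(res: Sequence[float], alpha: float) -> List[int]:
    """Indexes whose residual falls outside the two-sided cut-offs."""
    e_hat, sigma_hat = residual_moments(res)
    lo = normal_inv(alpha / 2.0, e_hat, sigma_hat)
    hi = normal_inv(1.0 - alpha / 2.0, e_hat, sigma_hat)
    return [i for i, r in enumerate(res) if r < lo or r > hi]


def fit_test_modify(xs: Sequence[float], ys: Sequence[float], fit: Fitter,
                    alpha: float = 0.01, max_rounds: int = 50):
    """Steps 1 to 3: refit until the residuals fall outside W."""
    ys = list(ys)  # work on a copy; the caller's data stay untouched
    for _ in range(max_rounds):
        predict = fit(xs, ys)                          # Step 1
        fitted = [predict(x) for x in xs]
        res = [y - f for y, f in zip(ys, fitted)]
        bad = outlier_indexes(res, alpha)              # Step 2
        if len(bad) < alpha * len(res):                # not in W: stop
            break
        for i in bad:                                  # Step 3
            ys[i] = fitted[i]
    return predict, ys, residual_moments(res)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Stand-in fitter (assumption): least squares straight line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


# Toy data: a noisy line with one spiked observation at index 7.
xs = list(range(20))
ys = [2.0 * x + 1.0 + (0.1 if x % 2 == 0 else -0.1) for x in xs]
ys[7] += 10.0
predict, cleaned, (e_hat, sigma_hat) = fit_test_modify(xs, ys, fit_line)
print(cleaned[7], e_hat, sigma_hat)
```

Step 4 would then use the returned predictor together with `e_hat` and `sigma_hat` to form the forecast value and confidence interval. Swapping `fit_line` for a nonlinear least squares fit of the logistic form leaves the loop unchanged, which is the reason for passing the fitter as an argument.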
Assume that the disturbance term $\varepsilon$ is a normal uncertain variable with expected value 0.1196 and variance $208.28^2$, i.e., $\varepsilon \sim N(0.1196, 208.28)$. As shown in Fig. 4, the data point on February 17 does not pass the test. We modify the data point on February 17 and re-estimate the unknown parameters in the logistic growth model. According to Eq. (3), the uncertain least squares estimates give the fitted uncertain growth model (14). The estimated expected value and estimated variance of the disturbance term $\varepsilon$ in uncertain growth model (14) are computed according to Eqs. (4) and (5); the estimated variance is $\hat\sigma^2 = 183.82^2$. As shown in Fig. 5, all data pass the test.

The fitted uncertain growth model (14) and the cumulative numbers of COVID-19 infections in China from January 20 to March 23, 2020, are shown in Fig. 6. Forecast values from model (14) are larger than the data in Table 1 from January 20 to February 12. This result supports the view that data before February 13 are not real-time data due to limited testing capacity.

Next, we predict the cumulative number of COVID-19 infections in China on March 24, 2020, for which the actual value is 80744. According to Eqs. (7) and (8), the forecast uncertain variable and the forecast value are obtained from model (14). It follows from Eq. (9) that the inverse uncertainty distribution of $\hat y$ is the fitted value plus the inverse uncertainty distribution of the disturbance term, and the confidence interval then follows from Eq. (10).

In this section, we compare the stochastic growth model (1) and the uncertain growth model (14). Because the residual plot for the stochastic growth model (1) shown in Fig. 2 does not look like a null plot, we cannot find a distribution function for the disturbance term that is close enough to the true frequency. Therefore, the disturbance term cannot be characterized as a random variable, and stochastic regression analysis is invalid. Moreover, as Table 3 shows, the variance of the disturbance term in the stochastic growth model (1), i.e., $\hat\sigma^2 = 256.24^2$, is larger than that in the final uncertain growth model (14), i.e., $\hat\sigma^2 = 183.82^2$. With uncertain hypothesis tests, we detected and modified outliers in the cumulative number of COVID-19 infections in China. As a result, the disturbance term in the final uncertain growth model (14) has a smaller estimated variance.
This shows that the uncertain logistic growth model, combined with uncertain hypothesis tests, is more suitable for handling the cumulative number of COVID-19 infections in China.

The emerging infectious disease COVID-19 seriously threatens human health and security worldwide. In this paper, the logistic growth model chosen by cross validation was used to depict the cumulative number of COVID-19 infections in China from February 13 to March 23, 2020. The residual plot did not appear to be a null plot, so we could not find a distribution function for the disturbance term that is close enough to the true frequency. As a result, the disturbance term cannot be characterized as a random variable, and stochastic regression analysis is invalid in this case. This paper first employed the uncertain growth model with the help of uncertain hypothesis tests to detect and modify outliers in data, and produced a better result automatically. The forecast value and confidence interval for the cumulative number of COVID-19 infections in China on March 24, 2020 were given. In future work, better ways to handle outliers and better fitting models may be considered.

Tukey's biweight estimation for uncertain regression model with imprecise observations
Mathematical models to characterize early epidemic growth: A review
Residual and confidence interval for uncertain regression model with imprecise observations
Uncertain maximum likelihood estimation with application to uncertain regression analysis
Uncertainty theory
Some research problems in uncertainty theory
Uncertainty theory: A branch of mathematics for modeling human uncertainty
Leave-p-out cross validation test for uncertain Verhulst-Pearl model with imprecise observations
Cross-validation for the uncertain Chapman-Richards growth model with imprecise observations
Least absolute deviations estimation for uncertain regression with imprecise observations. Fuzzy Optimization and Decision Making
National Health Commission of the People's Republic of China
Uncertain vector autoregressive model with imprecise observations
Coronavirus disease 2019 (COVID-19) situation report-99
Uncertain time series analysis with imprecise observations. Fuzzy Optimization and Decision Making
Least-squares estimation for uncertain moving average model
Uncertain regression analysis: An approach for imprecise observations
Uncertain hypothesis test with application to uncertain regression analysis
Multivariate uncertain regression model with imprecise observations
Analysis and prediction of confirmed cases of COVID-19 in China by uncertain time series model. Fuzzy Optimization and Decision Making
Least absolute deviations for uncertain multivariate regression model

Acknowledgements This work was supported by National Natural Science Foundation of China (No. 11771241).