key: cord-0998384-a3u9ssyt authors: Goo, Taewan; Apio, Catherine; Heo, Gyujin; Lee, Doeun; Lee, Jong Hyeok; Lim, Jisun; Han, Kyulhee; Park, Taesung title: Forecasting of the COVID-19 pandemic situation of Korea date: 2021-03-25 journal: Genomics Inform DOI: 10.5808/gi.21028 sha: c254ee0c4cdc8906e07753424e2878a1b183f6f2 doc_id: 998384 cord_uid: a3u9ssyt For the novel coronavirus disease 2019 (COVID-19), predictive modeling, in the literature, uses broadly susceptible exposed infected recoverd (SEIR)/susceptible infected recoverd (SIR), agent-based, curve-fitting models. Governments and legislative bodies rely on insights from prediction models to suggest new policies and to assess the effectiveness of enforced policies. Therefore, access to accurate outbreak prediction models is essential to obtain insights into the likely spread and consequences of infectious diseases. The objective of this study is to predict the future COVID-19 situation of Korea. Here, we employed 5 models for this analysis; SEIR, local linear regression (LLR), negative binomial (NB) regression, segment Poisson, deep-learning based long short-term memory models (LSTM) and tree based gradient boosting machine (GBM). After prediction, model performance comparison was evelauated using relative mean squared errors (RMSE) for two sets of train (January 20, 2020‒December 31, 2020 and January 20, 2020‒January 31, 2021) and testing data (January 1, 2021‒February 28, 2021 and February 1, 2021‒February 28, 2021) . Except for segmented Poisson model, the other models predicted a decline in the daily confirmed cases in the country for the coming future. RMSE values’ comparison showed that LLR, GBM, SEIR, NB, and LSTM respectively, performed well in the forecasting of the pandemic situation of the country. A good understanding of the epidemic dynamics would greatly enhance the control and prevention of COVID-19 and other infectious diseases. Therefore, with increasing daily confirmed cases since this year, these results could help in the pandemic response by informing decisions about planning, resource allocation, and decision concerning social distancing policies. The novel coronavirus disease 2019 (COVID- 19) presents an important and urgent threat to global health. Since the outbreak in early December 2019 in the Hubei province of the People's Republic of China, the number of patients confirmed to have the disease has exceeded 118 million as the disease spread globally, and the number of people infected is probably much higher [1] . More than 2.6 million people have died from COVID-19 (up to 11 March 2021) [2] . Despite public health responses aimed at containing the disease and delaying the spread [3, 4] , several countries have been confronted with a critical care crisis, and more countries could follow [5] . To mitigate and suppress the burden of COVID-19 on the healthcare system, while also protecting the general public, especially the highly susceptible group of people, ro-bust models that predict the prognosis of COVID-19 were urgently needed to support decisions about shielding, hospital admission, treatment, and population level interventions [6] . In this situation, prediction tools can help project different scenarios such as (1) number of possible confirmed (new) cases, (2) number of possible hospitalized cases, (3) number of possible death cases and so forth. As a consequence, prediction tools are useful for several different purposes [7] . Other features, such as social distancing, stay-at-home orders, use of facemasks or self-quarantine, travel restriction, and contact tracing could help predict what comes next. For better understanding, prediction models are important for better estimation about the disease and its possible threats such as the number of cases based on the level of severity can help determine the need of numbers of ventilators and other sophisticated medical equipment. Furthermore, countries need to shape their health system responses in accordance with the need [8] . Therefore, access to accurate outbreak prediction models is essential to obtain insights into the likely spread and consequences of infectious diseases. Governments and other legislative bodies rely on insights from prediction models to suggest new policies and to assess the effectiveness of the enforced policies [9] . For COVID-19, predictive modeling, in the literature, uses broadly susceptible exposed infected recoverd (SEIR/SIR), agentbased, curve-fitting models. Besides, machine-learning models that are built on statistical tools have widely been used too [7] . Here, we employ statistical models; segmented Poisson, negative binomial (NB), and local likelihood regression (LLR), mathematical model SEIR, deep-learning based model long short-term memory (LSTM), and tree based gradient boosting machine (GBM) for prediction of future COVID-19 pandemic situation of Korea. The COVID-19 daily confirmed cases of Korea was divided into two regions: capital area (Seoul metropolitan area) and non-capital area (non Seoul metropolitan area). Domestic which is the sum of Capital and non-capital areas was also analyzed (Fig. 1) . The daily confirmed cases of these regions were then split into train ( January 20, 2020-December 31, 2020 and January 20, 2020-January 31, 2021) and test (January 1, 2021-Febraury 28, 2021 and Febraury 1, 2021-February 28, 2021) datasets. The prediction performance of the above models were tested using relative mean square error (RMSE). RMSE takes the total squared error and normalizes it by dividing by the total squared error of a simple predictor. Thus, the smaller the RMSE value, the better the prediction performance of the model. Therefore, with increasing daily confirmed cases since the beginning of 2021, in Korea and elsewhere, such models could help in the response to pandemic by informing decisions about planning, resource allocation, and decision of the social distancing. The daily series of confirmed cases of COVID-19 for South Korea from January 20, 2020 to Febraury 28, 2021 was obtained from where c i (i = 1,2,3) are breakpoints. NB model is defined as [12] ; where λ t is the conditional expectiation of Y t given F t-1 as the history of the joint process {Y t ,λ t :t∈ℕ}. Conditional mean and variance of Y t are defined as; where ϕ is the dispersion parameter. And overdispersion parameter σ 2 is defined as σ 2 = 1/ϕ. NBdistribution is defined as; where y = 0,1,…n. For estimating λ t , l = {1,7,21} were used as lagged confirmed cases and log(λ t ) = β 0 + β i * log(Y t-li + 1), were used for the model. For NB model, package 'tscount' were used to analyze the confirmed cases of South Korea as time series count data [13] . Our LLR model is based on Poisson model which previously mentioned. For this model, local quadratic approximation is fitted within a smoothing window of bandwidth h, which is the number of the nearest past observations to be used in the local fit. We use tricube kernel of weight W(u) = (1-|u| 3 ) 3 for each point. The local quadratic log-likelihood is defined as; Kaggle (from January 20 to June 30, 2020) [10] and Korea public data portal of the Ministry of Health and Welfare (from July 1, 2020 to February 28, 2021) [11] . The combined data was divided into two regions: capital or Seoul Metropolitan area (Capital; Seoul, Incheon, and Gyeonggi-do) and non-capital or non-Seoul Metropolitan area (non-capital; other cities beside Seoul, Incheon, and Gyeonggi-do). The analysis was conducted on the Domestic area (capital and non-capital), Seoul Metropolitan area, and non-Seoul Metropolitan area data, respectively. The data was split into two subsets. First subset is composed of training ( January 20, 2020-December 31, 2020) and test data (the last 59 days, January 1, 2021-Febraury 28, 2021). And second subset is composed of training ( January 20, 2021-January 31, 2021) and test data (the last 28 days, February 1, 2021-Febraury 28, 2021) for downstream analysis with the test data used for prediction analysis. As one model may not give the best prediction of the COVID-19 situation of Korea, we present prediction results estimated by different models that can apply the above data. Many models are available and have been implemented for forecasting the pandemic situation of many countries and look at afew. In this section, we introduce segmented Poisson model, LLR model, deep-learning based LSTM model, NB model, SEIR model, and GBM model used for predicting the COVID-19 situation of Korea. Here, we regarded the confirmed cases as a function of time t based on a segmented Poisson model. Let Y t be the confirmed cases at day t which is the number of days since the first case occurred. Poisson model is defined as; where μ t is the expectation of Y t with segments. Breakpoints were considered in the daily confirmed cases during the analysis by splitting the daily confirmed cases into segments (Supplemantary Table 1 ). These breakpoints were decided using some of the aforementioned significant events linked to the spread of COVID-19 in South Korea. Since there are three breakpoints, four segments are defined as follows: log(μ t ) = β 0 + β 11 t + β 21 log (t + 1), (t = 0, 1, ..., c 1 -1) where w i (t) = W( ) and l are the log-likelihood function based on Poisson distribution assumption. The local likelihood estimate is made by maximizing over the parameter a = (a 0 , a 1 , a 2 ) t . We utilize a rolling origin cross validation to select optimal bandwidth of the smoothing window [14] . Validation sets are divided at the local peaks of counts. Validation MSE is cumulated by each validation set's counts being predicted using past validations sets. The bandwidth with the smallest validation MSE is selected as optimal bandwith. And then using optimal bandwith we finally fitted LLR model. For LLR model [15] , package 'locfit' was used [16] . Here, LSTM network is considered [17] . Let the input data X t be as a set of vector consisting of Y t-h to Y t-1 according to day t, where h is the bandwidth having values {7, 14, 21, 28, 35, 42, 49, 56}. Among these, the optimal h is selected using validation set, which is last 7 days of the training period. The data is normalized using minmax normalizer to transform data to be in the range of 0 to 1. The LSTM architecture is described in Fig. 2 [18] . Each blocks in the model use current input value X t with C t-1 and h t-1 to be trained. The C t-1 and h t-1 are the state and output of the last block, respectively. We assume four blocks with 64 units, each with a 0.2 dropout layer. The optimization is held using the adam optimizer to minimize the MSE during the training process. Once the optimal model is built, it is applied to the test data for prediction. The prediction is performed sequentially by using the current output as the part of input of the next prediction. The analysis was performed using Python version 3.7.6, and 'keras' library. The infectious disease dynamic can be formulated with a mathematical model. We consider the SEIR model to fit the dataset of COVID-19 daily confirmed cases and predict the incidence of COVID-19 epidemic in Korea. In SEIR model, population is divided in four groups: susceptible (S), exposed (E), symptomatic and infectious (I) and recovered (R) individuals. This model includes the spread of infection during the latent period. The latency of COVID-19 infection is biologically realistic. The SEIR model is defined by the following the system of ordinary differential equations [19] [20] [21] [22] : where β is the transmission rate, γ is the recovery rate, and 1/κ is the average incubation period. The initial condition of this model S(0), E(0), I(0), R(0) must satisfy the condition S(0) + E(0) + I(0) + R(0) = N, where N is the total population size. In data fitting, the unknown parameters in model were estimated by a least squares algorithm. The numerical simulation and analysis were performed in MATLAB 2020a. GBM is a tree based machine learning algorithm that can be used for regression and classification problems. GBM consist of weak regression learner and decision trees. The decision tree uses the input value to determine which regression learner is best to make predictions. Based on adaptive boosting algorithm, GBM can build a strong regression learner by iteratively combining a set of weak regression leaners. GBM use gradient descent for minimizing loss function of a strong regression learner. Like other boosting algorithms, GBM adds models into the tree using greedy style [23] : where F m is the updated model, the newly added model. h m is the trained base learner which minimizes the loss function L and ρ is the multiplier which is found by solving one dimensional optimization problem. To build GBM, 'LightGBM' library was used [24] . To evaluate the above models, RMSEs for the train and test datasets for each of the fitted models were calculated as follows: where n is the number of data points, y t is the observed values, μ t is the predicted values from a fitted model and y³ is the mean of observed values. To compare models predicting different regions, having different scale of confirmed cases, RMSE measure was chosen. The COVID-19 daily confirmed cases of the country were divided into two regions (non-capital and capital) with the total being do-mestic and analysed using the above models. The data was split into two subsets and used in the training and prediction analysis of the models. As for model evaluation, in Table 1 , for comparison of models in the whole country and the two regions, we observe that the train RMSE is always lower than the test RMSE, with the domestic region producing the highest RMSE values for all models. Also, the segmented A look that the prediction plots of the these models shows that the daily COVID-19 confirmed cases will decline in the country (domestic), Seoul metropolitan (capital) and non-Seoul metropolitan (non-capital) areas using LLR, and NB models using the first data subset, while it will increase and stay constant using the segmented Poisson and LSTM models, respectively (Supplementary Figs. 1-3) . With the second data subset, daily COVID-19 confirmed cases will decline in the three regions as predicted by NB, segmented Poisson and LSTM models, while it will increase in the country and non metropolitan areas but will decline in the metropolitan areas, using LLR model . The SEIR and GBM models shows a decrease in daily confirmed cases in the country and the two regions for all data subsets (Supplementary Figs. 7 and 8). The objective of our analysis was to predict the future COVID-19 situation of South Korea using daily confirmed cases. We employed six different models in this analysis and all models gave some different prediction results for different data subsets and regions. The evaluation of the models using RMSE showed that local likelihood rgression, GBM, SEIR and NB models had the lowest RMSE values, making them the best models, though LSTM gave better RMSE values compared to segmented Poisson model. LLR, GBM, SEIR, NB and LSTM models mainly predicted a decline in COVID-19 daily confirmed cases in the country and the two regions of Korea. We can reasonably take that these results portray the future situation of the country. With the first dataset, NB, SEIR, and LLR, respectively showed the best test performance in domestic, capital, and non-capital areas, while with second dataset, GBM showed the best test performance for all regions. In case of NB model, we found that the coefficient of confirmed cases of the day before had the largest value. This means that confirmed cases of the day before can affect the prediction of most future confirmed cases. In case of GBM, we could obtain feature importance plots of the model (Supplementary Fig. 9 ). We discovered that the confirmed cases of the last day was the most important feature for all regions. Thus, the models using the confirmed cases of the past days seemed to perform better than other models without using such data. The parameters of the mathematical model SEIR can be easily interpreted as a rate or transition parameter. However, the local regression does have too many parameters to assign meaningful interpretation. Note that the number of daily confirmed cases varied across regions of Korea, so we first fitted these models for each of the regions and compared them with the number of observed cases. However, the comparison of prediction models based on regional data was not convincing due to the small sample sizes. Instead of fitting the models for each region, we considered combining the two regions of capital and non-capital areas, which provided enough sample sizes. We also tried predicitng the number of deaths due to COVID-19. However, the data was not large enough (with only 1,669 deaths as of March 14, 2021) [25] to provide reliable fitted results from the models. In our study, there is one challenge that predictions made reflect interventions in place at the time the model was developed. So, one can argue the influence of government intervention policies in the above observed results. However, our comparison result is still valid because all models reflected the same intervention effects. Actually, the Korean government has maintained a high level of social distancing with a ban in gatherings of more than five persons [26] , in their efforts to lower the triple-digit number of daily confirmed cases that has been observed since the start of this year. According to Heo et al. 's study [27] on the COVID-19 situation of Korea, the Korean government social distancing policy was predicted to lead to a decrease in the daily confirmed cases observed in the country but with only segmented Poisson model which according RMSE value, did not perform well as compared to the other models. In future, we hope to control for the influence of government interventions using the other models, to give a whole picture of the future COVID-19 situation of the country. A good understanding of the epidemic dynamic would greatly enhance the control and prevention of COVID-19 as well as other infectious diseases. Therefore, taking precaution when using prediction to support a decision, for example, return to work or lowering of the social distancing level, is highly encouraged too. COVID-19: asymptomatic carrier transmission is an underestimated problem United States: coronavirus cases. Worldometer, 2021. Accessed The effect of large-scale anti-contagion policies on the COVID-19 pandemic Ranking the effectiveness of worldwide COVID-19 government interventions Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal Prediction models for COVID-19 outcomes COVID-19 prediction models and unexploited data Division of Viral Diseases. Atlanta: Centers for Disease Control and Preventions, 2020. Accessed COVID-19 outbreak prediction with machine learning DS4C: data science for COVID-19 in South Korea Ministry of Health and Wel-fareCorona 19 City/ProvinceStatus Reading: Addison-Wesley tscount: analysis of count time series Out-of-sample tests of forecasting accuracy: an analysis and review Smoothing by local regression: principles and methods locfit: local regression, likelihood and density estimation Time series forecasting of COVID-19 transmission in Canada using LSTM networks Deep learning via LSTM models for COVID-19 infection forecasting in India The Mathematical Theory of Infectious Disease and Its Applications The mathematics of infectious diseases Modeling Infectious Diseases in Humans and Animals Mathematical Tools for Understanding Infectious Disease Dynamics Prediction of COVID-19 confirmed cases using gradient boosting regression method Gradient Boosting Machine. San Francison: GitHub, 2021. Accessed Cheongju: Korea Disease Control and Prevention Agency Yonhap News Agency. (3rd LD) S. Korea to impose nationwide ban on gatherings of 5 or more people in virus fight: PM. Seoul: Yonhap News Agency Statistical estimation of effects of implemented government policies on COVID-19 situation in South Korea Taewan Goo: https://orcid.org/0000-0001-9427-2290 Catherine Apio: https://orcid.org/0000-0003-0240-794X Gyujin Heo: https://orcid.org/0000-0003-3499-2848 Doeun Lee: https://orcid.org/0000-0002-2791-8431 Jong Hyeok Lee: https://orcid.org/0000-0002-3573-1853 Jisun Lim: https://orcid.org/0000-0002-1058-3292 Kyulhee Han: https://orcid.org/0000-0002-7243-5711 Taesung Park: https://orcid.org/0000-0002-8294-590X Taesung Park serves as an editor of the Genomics and Informatics, but has no role in the decision to publish this article. All remaining authors have declared no conflicts of interest. Supplementary data can be found with this article online at http:// www.genominfo.org.