key: cord-0958789-6k4irh9d authors: Suzuki, Y.; Suzuki, A. title: Machine learning model estimating number of COVID-19 infection cases over coming 24 days in every province of South Korea (XGBoost and MultiOutputRegressor) date: 2020-05-14 journal: nan DOI: 10.1101/2020.05.10.20097527 sha: 22f90a4f33351758ca66c939c7ec723eea6214fa doc_id: 958789 cord_uid: 6k4irh9d
We built a machine learning model (ML model) that takes as input the number of daily infection cases and other information related to COVID-19 over the past 24 days in each of the 17 provinces of South Korea, and outputs the total increase in the number of infection cases in each of the 17 provinces over the coming 24 days. The ML model combines XGBoost with MultiOutputRegressor. For each province, we evaluate a binary classification: whether our ML model can identify the provinces where the total number of infection cases over the coming 24 days is more than 100. The result is Sensitivity = 3/3 = 100%, Specificity = 11/14 = 78.6%, False Positive Rate = 3/14 = 21.4%, Accuracy = 14/17 = 82.4%. Sensitivity = 100% means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, among the provinces where the actual number of new COVID-19 infection cases was less than 100, the proportion that our ML model estimated correctly (Specificity) was 78.6%, which is relatively high. These results suggest that our ML model may be able to support the following four points. (1) Promotion of behavior modification of residents in dangerous areas, (2) Assistance for decisions to resume economic activities in each province, (3) Assistance in determining infectious disease control measures in each province, (4) Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases.
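As a rough illustration of the setup described above, the sketch below builds one training example from a 24-day history and fits one XGBoost regressor per province through scikit-learn's MultiOutputRegressor. The way the inputs are flattened into a single feature vector, the helper names (`make_example`, `fit_model`), and the hyperparameters are assumptions made for illustration; the source does not specify them.

```python
N_PROVINCES = 17   # provinces of South Korea, as in the source
WINDOW = 24        # days of history used as input and days summed as output


def make_example(daily_cases, daily_flags, t):
    """Return (features, targets) for the window split at day index t.

    daily_cases[d][p] is the number of new cases on day d in province p.
    daily_flags[d] is a list of 0/1 labels for day d (e.g. measures, election).
    The feature layout is an assumption; the source does not describe it.
    """
    if t < WINDOW or t + WINDOW > len(daily_cases):
        raise ValueError("need WINDOW days before and after t")
    past = daily_cases[t - WINDOW:t]
    future = daily_cases[t:t + WINDOW]
    # Flatten the past daily counts, then append the past binary labels.
    features = [c for day in past for c in day]
    features += [f for day in daily_flags[t - WINDOW:t] for f in day]
    # One target per province: total new cases over the coming WINDOW days.
    targets = [sum(day[p] for day in future) for p in range(N_PROVINCES)]
    return features, targets


def fit_model(X, Y):
    """Fit one XGBoost regressor per output column (hypothetical settings)."""
    # Imported here so make_example can be used without these packages.
    import numpy as np
    from sklearn.multioutput import MultiOutputRegressor
    from xgboost import XGBRegressor

    model = MultiOutputRegressor(XGBRegressor(n_estimators=100, max_depth=3))
    model.fit(np.asarray(X, dtype=float), np.asarray(Y, dtype=float))
    return model
```

MultiOutputRegressor simply trains an independent copy of the base regressor for each output column, so the 17 provincial targets are estimated by 17 separate XGBoost models that share the same input features.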
In Table 1, we introduce existing works on ML models that predict the spread of COVID-19 infection. We also show the novelty of our ML model compared with these works.
Table 1 (entries as recovered):
[5] To reveal correlates and patterns of COVID-19 disease outbreak in sub-Saharan Africa (SSA). ◼ MSE of adaptive lasso: 2.72×10^-28 ◼ R-squared of adaptive lasso: 1.0000
M. A. M. T. Baldé, 2020 [6] To predict the future evolution of COVID-19 in Senegal. There was no clear numerical description.
A. Kumar et al., 2020 [7] To forecast the possible rise in the number of COVID-19 infection cases by considering the daily data of new infection cases.
To predict the number of confirmed cases, recovered cases, and death cases of COVID-19 infection. The paper describes its method only as a "machine learning approach" and gives no further detail.
M. Paggi, 2020 [13] To propose a methodological contribution based on machine learning to foster the use of epidemiological models over pure data-driven best-fitting approaches and to assess the reliability of their predictions.
To define data-driven clusters of countries. Unsupervised machine learning algorithms (k-means). There was no clear numerical description.
Comparing these existing works with our ML model, we found the following novelty points. The following two additional items were input to our ML model as binary labels (0 or 1). The details are described in Supplementary Table 2.
(1) Special measures taken by the South Korean government to prevent the spread of COVID-19 infection
(2) The date of the South Korean legislative election
By using XGBoost in combination with MultiOutputRegressor, multiple objective variables, i.e., the number of COVID-19 infection cases in each of the 17 provinces, can be output.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted May 14, 2020.
The information in Supplementary Tables 1 and 2 is input to our ML model. In addition to the two tables, the number of daily infection cases in each province of South Korea over the past 24 days (e.g., the red box in Fig. 1) is input.
For the test set, the number of daily infection cases and the other information related to COVID-19 over the past 24 days (Feb. 15th - Mar. 9th) in each of the 17 provinces is input. Then, the total increase in the number of infection cases over the coming 24 days (Mar. 10th - Apr. 2nd) in each of the 17 provinces is estimated. The estimation performance of our ML model for the test set is shown in Fig. 3, where the black bars are the true values and the white bars are the estimated values.
The accuracy of the binary classification, i.e., whether our ML model can identify the provinces where the total number of infection cases over the coming 24 days is more than 100, is as follows.
Sensitivity = 3/3 = 100%
Specificity = 11/14 = 78.6%
False Positive Rate = 3/14 = 21.4%
Accuracy = 14/17 = 82.4%
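The binary-classification figures can be checked from confusion counts. The sketch below assumes TP = 3, FN = 0, TN = 11, FP = 3, which are inferred from the fractions 3/3, 11/14 and 14/17 given in the abstract; the function names are illustrative and not taken from the source.

```python
def confusion_counts(true_increase, estimated_increase, threshold=100):
    """Count (tp, fn, tn, fp) for the 'more than threshold cases' label."""
    tp = fn = tn = fp = 0
    for truth, estimate in zip(true_increase, estimated_increase):
        actual = truth > threshold
        predicted = estimate > threshold
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def binary_metrics(tp, fn, tn, fp):
    """Standard rates; assumes both classes are present (no zero division)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts implied by the reported fractions 3/3, 11/14 and 14/17.
reported = binary_metrics(tp=3, fn=0, tn=11, fp=3)
for name, value in reported.items():
    print(f"{name}: {value:.1%}")
```

Note that the false positive rate is the complement of specificity, so both share the denominator of 14 provinces whose actual increase did not exceed 100.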
Sensitivity = 100% in the binary classification means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, among the provinces where the actual number of new COVID-19 infection cases is less than 100, the proportion that our ML model estimates correctly (Specificity) is 78.6%, which is relatively high.
Next, we evaluate the accuracy of our ML model as a regression task rather than a binary classification. The proportion of provinces for which our ML model estimates the increase in COVID-19 infection cases over the coming 24 days within a maximum permissible error of 100 infection cases is as follows.
Accuracy when the maximum permissible error is set to 100 infection cases = 12/17 = 70.6%
These results suggest that our ML model may be able to support the following four points.
(1) Promotion of behavior modification of residents in dangerous areas
(2) Assistance for decisions to resume economic activities in each region
(3) Assistance in determining infectious disease control measures in each region
(4) Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases.
It has been pointed out that the actual number of positive cases may be higher than in this dataset because PCR testing may be insufficient. If so, the performance of our ML model (sensitivity = 100%, specificity = 78.6%, false positive rate = 21.4%) may change.
The current input may also lack important information. For example, the population, population density, temperature, humidity, weather, regulation of economic activity, and land type (urban area, depopulated area, industrial area, forest area) of each province are not input in this method. Adding this information might improve the estimation performance.
In this paper, both the input and output periods of our ML model are set to 24 days, but this is not necessarily optimal. Performance may change if the number of days is changed.
Sensitivity = 100% means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, among the provinces where the actual number of new COVID-19 infection cases was less than 100, the proportion that our ML model estimated correctly (Specificity) was 78.6%, which is relatively high. These results suggest that our ML model may be able to support the following four points.
(1) Promotion of behavior modification of residents in dangerous areas
(2) Assistance for decisions to resume economic activities in each region
(3) Assistance in determining infectious disease control measures in each region
(4) Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases
Yoshiro Suzuki managed the project, analyzed the results, and wrote the paper. The other authors helped design this research project. Their contributions are almost equal to each other.
Declaration of competing interest: On behalf of all authors, the corresponding author states that there is no conflict of interest.
Supplementary Table 2 (entries as recovered). Each of the following measures and events is input as a label (0 or 1):
Korean government raised the national alert level to Yellow (level 2). Jan. 20 - Jan
Korean government declared 'Special Management Region
Social distancing approaches have been launched.
Pilot operation of COVID-19 Epidemiological Investigation Support System was implemented
Special entry procedure expanded to all incoming travelers
Testing all incoming travelers from Europe.
South Korean legislative election was held
References
COVID-19 Pandemic Prediction for Hungary: A Hybrid Machine Learning Approach
Using Supervised Machine Learning and Empirical Bayesian Kriging to reveal Correlates and Patterns Disease outbreak in sub-Saharan Africa: Exploratory Data Analysis
Fitting SIR model to COVID-19 pandemic data and comparative forecasting with machine learning
Preparedness and Mitigation by projecting the risk against COVID-19 transmission using Machine Learning Techniques
Risk Estimation of SARS-CoV-2 Transmission from Outbreak Trends of Coronavirus Disease-2019 in India: A Prediction
First-principles machine learning modelling of COVID-19
A machine learning methodology for forecasting of the COVID-19 cases in India
Analysis of COVID-19 spreading in South Korea using the SIR model with time-dependent parameters and deep learning
Simulation of Covid-19 epidemic evolution: are compartmental models really predictive?
Predictive Analytics of COVID-19 Using Information, Communication and Technologies
COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms
A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models
Prediction of COVID-19 Disease Progression in India: Under the Effect of National Lockdown
Analysis of the COVID-19 pandemic by SIR model and machine learning technics for forecasting
Detecting Suspected Epidemic Cases Using Trajectory Big Data
Forecasting the dynamics of COVID-19 Pandemic in Top
Using country-level variables to classify countries according to the number of confirmed COVID-19 cases: An unsupervised machine learning approach