key: cord-0958789-6k4irh9d
authors: Suzuki, Y.; Suzuki, A.
title: Machine learning model estimating number of COVID-19 infection cases over coming 24 days in every province of South Korea (XGBoost and MultiOutputRegressor)
date: 2020-05-14
journal: nan
DOI: 10.1101/2020.05.10.20097527
sha: 22f90a4f33351758ca66c939c7ec723eea6214fa
doc_id: 958789
cord_uid: 6k4irh9d

We built a machine learning model (ML model) that takes as input the number of daily infection cases and other information related to COVID-19 over the past 24 days in each of the 17 provinces of South Korea, and outputs the total increase in the number of infection cases in each of the 17 provinces over the coming 24 days. The ML model is a combination of XGBoost and MultiOutputRegressor. To evaluate it, we performed a binary classification that tests whether the ML model can identify the provinces where the total number of new infection cases over the coming 24 days is more than 100. The result is Sensitivity = 3/3 = 100%, Specificity = 11/14 = 78.6%, False Positive Rate = 3/14 = 21.4%, Accuracy = 14/17 = 82.4%. Sensitivity = 100% means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, for the provinces where the actual number of new COVID-19 infection cases was less than 100, the ratio that our ML model estimated correctly (Specificity) was 78.6%, which is relatively high. These results suggest that our ML model has the potential to support the following four points: (1) Promotion of behavior modification of residents in dangerous areas, (2) Assistance for decision to resume economic activities in each province, (3) Assistance in determining infectious disease control measures in each province, (4) Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases.


In Table 1, we introduce existing works on ML models that predict the spread of COVID-19 infection.

Also, we show the novelty points of our ML model compared to the existing works.

[5] To reveal correlates and patterns of COVID-19 disease outbreak in sub-Saharan Africa (SSA).

◼ MSE of Adaptive lasso: 2.72×10^-28
◼ R-squared of Adaptive lasso: 1.0000

M. A. M. T. Baldé, 2020 [6] To predict the future evolution of COVID-19 in Senegal.

There was no clear numerical description.

A. Kumar et al., 2020 [7] To forecast the possible rise in the number of COVID-19 infection cases by considering the daily data of new infection cases.

To predict the number of confirmed cases, recovered cases, and death cases of COVID-19 infection.

In this paper, their method is stated as a "machine learning approach" and there is no more detailed information.

M. Paggi, 2020 [13] To propose a methodological contribution based on machine learning to foster the use of epidemiological models over pure data-driven best-fitting approaches and assess the reliability of their predictions.

To define data-driven clusters of countries.

Unsupervised machine learning algorithms (k-means)

There was no clear numerical description.

Comparing the above existing works with our ML model, we identified the following novelty points of our ML model.

The following two additional items were input to our ML model as binary labels (0 or 1). The details are described in Supplementary Table 2.

(1) special measures taken by South Korean government to prevent the spreading of COVID-19 infection

(2) date of South Korean legislative election

By using XGBoost in combination with MultiOutputRegressor, multiple objective variables, i.e., the number of COVID-19 infection cases in each of the 17 provinces, can be output.

medRxiv preprint, this version posted May 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Supplementary Tables 1 and 2. In addition to the two tables, the number of daily infection cases in each province of South Korea over the past 24 days (e.g., the red box in Fig. 1) is input.

The total increase in the number of infection cases over the coming 24 days (Mar. 10th - Apr. 2nd) in each of the 17 provinces is estimated. The estimation performance of our ML model for the test set is shown in Fig. 3.


The number of daily infection cases and the other information related to COVID-19 over the past 24 days (Feb. 15th - Mar. 9th) in each of the 17 provinces is input. Then, the total increase in the number of infection cases over the coming 24 days (Mar. 10th - Apr. 2nd) in each of the 17 provinces is estimated. The black bar graph is the true value, and the white bar graph is the estimated value.
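The pairing of a 24-day input window with the total increase over the following 24 days can be illustrated with a small sketch. The function name `make_sample`, the province names "A" and "B", and the case counts are hypothetical and are used only for illustration. The two binary labels and the other input information are omitted here.

```python
def make_sample(daily_cases, start, n_in=24, n_out=24):
    """Build one (features, targets) pair from per-province daily case counts.

    daily_cases: dict mapping province name -> list of daily new cases.
    start: index of the first day of the input window.
    """
    features = []
    targets = {}
    for province in sorted(daily_cases):
        series = daily_cases[province]
        if start + n_in + n_out > len(series):
            raise ValueError("series too short for the requested window")
        # Input: daily counts over the past n_in days.
        features.extend(series[start:start + n_in])
        # Output: total increase over the coming n_out days.
        targets[province] = sum(series[start + n_in:start + n_in + n_out])
    return features, targets


# Hypothetical counts for two made-up provinces over 48 days.
cases = {
    "A": [1] * 24 + [2] * 24,
    "B": [0] * 24 + [5] * 24,
}
features, targets = make_sample(cases, start=0)
print(len(features), targets)
```

Changing `n_in` and `n_out` gives other window lengths than the 24 days used in this paper.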

The results of the binary classification, which tests whether our ML model can identify the provinces where the total number of new infection cases over the coming 24 days is more than 100, are as follows.

Sensitivity = 3/3 = 100%, Specificity = 11/14 = 78.6%, False Positive Rate = 3/14 = 21.4%, Accuracy = 14/17 = 82.4%

Sensitivity = 100% means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, for the provinces where the actual number of new COVID-19 infection cases is less than 100, the ratio that our ML model estimates correctly (Specificity) is 78.6%, which is relatively high.
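The reported rates can be recomputed from the confusion-matrix counts implied by the fractions in this paper: 3 true positives and 0 false negatives (Sensitivity = 3/3), and 11 true negatives and 3 false positives (Specificity = 11/14). The helper name `binary_metrics` is an assumption introduced for this sketch.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return standard binary-classification rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts implied by the reported fractions for the 17 provinces.
m = binary_metrics(tp=3, fn=0, tn=11, fp=3)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
# sensitivity: 100.0%
# specificity: 78.6%
# false_positive_rate: 21.4%
# accuracy: 82.4%
```

Note that the false positive rate is the complement of the specificity, so both share the denominator 14.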

Next, we evaluate the accuracy of our ML model from the perspective of the regression task, not the binary classification.

When the maximum permissible error is set to 100 infection cases, the ratio of provinces for which our ML model estimates the increase of COVID-19 infection cases over the coming 24 days within that error is as follows.

Accuracy when the maximum permissible error is set to 100 infection cases = 12/17 = 70.6%
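The regression-style accuracy above can be sketched as a count of provinces whose absolute estimation error stays within the permissible error. The function name `within_tolerance` and the numbers in the example are hypothetical. Whether an error of exactly 100 counts as a hit is not stated in the paper, so the inclusive comparison below is an assumption.

```python
def within_tolerance(true_values, estimates, max_error=100):
    """Count estimates whose absolute error is at most max_error."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    # Assumption: an error exactly equal to max_error is counted as a hit.
    hits = sum(
        1 for t, e in zip(true_values, estimates) if abs(t - e) <= max_error
    )
    return hits, len(true_values)


# Hypothetical true and estimated increases for four made-up provinces.
true_values = [120, 30, 500, 10]
estimates = [90, 45, 350, 200]
hits, total = within_tolerance(true_values, estimates)
print(f"{hits}/{total} = {hits / total:.1%}")
```

With the paper's reported counts, 12 hits out of 17 provinces gives 70.6%.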

These results suggest that our ML model has the potential to support the following four points.

(1) Promotion of behavior modification of residents in dangerous areas

(2) Assistance for decision to resume economic activities in each region

(3) Assistance in determining infectious disease control measures in each region

(4) Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases.

It has been pointed out that the actual number of positives may be higher than in this dataset because PCR testing may be insufficient. If this is correct, the performance of our ML model (Sensitivity = 100%, Specificity = 78.6%, False Positive Rate = 21.4%) may change.

The current input may lack important information. For example, in this method, the population, population density, temperature, humidity, weather, regulation of economic activity, and type of land (urban area, depopulated area, industrial area, forest area) of each province are not input. Inputting this information might improve the estimation performance.

In this paper, the input and output periods of our ML model are both set to 24 days, but this is not necessarily optimal. Performance may change if the number of days is changed.

Sensitivity = 100% means that we did not overlook any of the three provinces where the number of COVID-19 infection cases increased by more than 100. In addition, for the provinces where the actual number of new COVID-19 infection cases was less than 100, the ratio that our ML model estimated correctly (Specificity) is 78.6%, which is relatively high. These results suggest that our ML model has the potential to support the following four points.

(1) Promotion of behavior modification of residents in dangerous areas

(2) Assistance for decision to resume economic activities in each region

The other authors helped design this research project. Their contributions are almost equal to each other.

Fig. 1 are input to our ML model.

Label (0 or 1)

Korean government raised the national alert level to Yellow (level 2).

Jan. 20 -Jan


Search for factors that are highly correlated with the future increase in the number of COVID-19 infection cases

COVID-19 Pandemic Prediction for Hungary: A Hybrid Machine Learning Approach

Using Supervised Machine Learning and Empirical Bayesian Kriging to reveal Correlates and Patterns of COVID-19 Disease outbreak in sub-Saharan Africa: Exploratory Data Analysis

Fitting SIR model to COVID-19 pandemic data and comparative forecasting with machine learning

Preparedness and Mitigation by projecting the risk against COVID-19 transmission using Machine Learning Techniques

Risk Estimation of SARS-CoV-2 Transmission from

Outbreak Trends of Coronavirus Disease-2019 in India: A Prediction

First-principles machine learning modelling of COVID-19

A machine learning methodology for forecasting of the COVID-19 cases in India

Analysis of COVID-19 spreading in South Korea using the SIR model with time-dependent parameters and deep learning

Simulation of Covid-19 epidemic evolution: are compartmental models really predictive?

Predictive Analytics of COVID-19 Using Information, Communication and Technologies

COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms

A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models

Prediction of COVID-19 Disease Progression in India: Under the Effect of National Lockdown

Analysis of the COVID-19 pandemic by SIR model and machine learning technics for forecasting

Detecting Suspected Epidemic Cases Using Trajectory Big Data

Forecasting the dynamics of COVID-19 Pandemic in Top

Using country-level variables to classify countries according to the number of confirmed COVID-19 cases: An unsupervised machine learning approach

Declaration of competing interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Yoshiro Suzuki managed the project, analyzed the results, and wrote the paper.

Korean government declared 'Special Management Region

Social distancing approaches have been launched. After Mar

After Mar

After Mar

Pilot operation of COVID-19 Epidemiological Investigation Support System was implemented

1 Except for the above

Special entry procedure expanded to all incoming travelers

Testing all incoming travelers from Europe. After Mar

After Mar

South Korean legislative election was held