title: COVID-19 County Level Severity Classification with Class Imbalance: A NearMiss Under-sampling Approach
authors: Oladunni, T.; Tossou, S.; Haile, Y.; Kidane, A.
date: 2021-05-25
DOI: 10.1101/2021.05.21.21257603

The COVID-19 pandemic that broke out in late 2019 has spread across the globe. The disease has infected millions of people, and thousands of lives have been lost. The momentum of the disease has been slowed by the introduction of vaccines; however, some countries are still recording high numbers of casualties. The focus of this work is to design, develop, and evaluate a machine learning county-level COVID-19 severity classifier. The proposed model predicts the severity of the disease in a county as low, moderate, or high. Policy makers will find the work useful in the distribution of vaccines. Four learning algorithms (two ensembles and two non-ensembles) were trained and evaluated. Class imbalance was addressed using NearMiss under-sampling of the majority classes. The result of our experiment shows that the ensemble models outperformed the non-ensemble models by a considerable margin.

I. INTRODUCTION

Since the outbreak of the coronavirus pandemic, the Centers for Disease Control and Prevention (CDC) has recorded close to 30 million cases, and thousands of lives have been lost to COVID-19 [1]. While the United States and other developed countries have been able to bend the curve on the fatality rate, emerging evidence suggests that the disease is just taking root in some countries. As of May 19, 2021, Mexico topped the fatality rate at 9.3%, with Peru a distant second at 3.5%. Italy and Iran came third and fourth with 3% and 2.8%, respectively. The origin of this pandemic is an ongoing research question; however, most scientists believe that it originated from a bat in Wuhan, China. The question now is: how do we categorize the severity of COVID-19 fatality in a county? We answered this question by building a machine learning classifier using the fatality rate dataset from the 3,006 counties in the US. The dataset was obtained from the Johns Hopkins University repository. Machine learning algorithms have been shown to have the capability to learn patterns and discover knowledge from a dataset. They have been used in image recognition, fraud detection, voice recognition, malware detection, etc. Since the outbreak of the coronavirus pandemic, several studies have used machine learning algorithms to understand the pandemic and provide strategies to reduce its spread. The authors of [2] proposed a quantitative model to predict vulnerability to COVID-19 using genomes. Neural networks and Random Forests were used as the learning algorithms, and the result of the study confirmed previous work on phenotypic comorbidity patterns in susceptibility to COVID-19. In another study, Kexin studied nineteen risk factors associated with COVID-19 severity; the results suggested that severity relates to an individual's characteristics, disease factors, and biomarkers [3]. Hina et al. proposed a model to predict patient COVID-19 severity in Pakistan. Seven learning algorithms were trained and evaluated, and Random Forest had the best performance with 60% accuracy. While there are several studies on COVID-19 severity, there seems to be a gap in the machine learning literature on the imbalanced classification of COVID-19 severity at the county level.
Therefore, the focus of this study is the algorithmic imbalanced classification of COVID-19 severity in a county as low, moderate, or high. We hypothesized that ensemble learning, in conjunction with under-sampling of the majority classes of an imbalanced COVID-19 dataset, has a superior capability of predicting the severity of COVID-19 at the county level. We tested our hypothesis by experimenting with ensemble and non-ensemble learning algorithms.

II. METHODOLOGY

Severity of COVID-19 was measured using the fatality rate as the response variable. The fatality rate as recorded in the dataset was a continuous variable, and categorization or grouping is crucial for the classification of continuous variables. Therefore, counties were split into 3 groups based on the following criterion: counties with fatality rates less than or equal to 1 were categorized as low (0 < x ≤ 1); the moderate class comprises counties with fatality rates greater than 1 but less than or equal to 2 (1 < x ≤ 2); and the high class comprises counties with fatality rates greater than 2 but less than or equal to 4 (2 < x ≤ 4). The above categorization resulted in a skewed class distribution. This skewness of the class distribution is referred to as class imbalance: an imbalanced dataset has one or more classes with few records (minority classes) and one or more classes with many records (majority classes). Class imbalance has been shown to have a considerable negative impact on the effectiveness of a learning algorithm.

Near Miss Under-sampling (NMU) Approach

The question is: how do we balance the dataset? An imbalanced dataset can be balanced by oversampling the minority class or under-sampling the majority class. In the oversampling approach, more data are created to increase the size of the minority class records to equal the majority class records; however, this approach carries the risk of overfitting. In under-sampling, on the other hand, the size of the majority class is reduced to balance the class distribution. We believe this is the better approach. Therefore, in this study, we used the Near Miss Under-sampling (NMU) strategy. NMU selection is based on the distance of majority class records to minority class records; it is a k nearest neighbor approach, with distance measured by the Euclidean metric. NMU has three versions. Version 1 keeps the majority class records with the smallest average distance to the three closest records of the minority class. Version 2 keeps the majority class records with the smallest average distance to the three farthest records of the minority class. Lastly, in version 3, a given number of majority class records is selected for each closest example in the minority class. In this study, version 1 is used, and the result of our experiment shows the effectiveness of this strategy. The NearMiss function from the imblearn.under_sampling library was used.

We trained and evaluated 2 ensemble learning algorithms (Random Forest and Boosting) and 2 non-ensembles (Logistic Regression and K Nearest Neighbors). The dataset was split into 90% and 10% for training and testing, respectively. Performance evaluation was based on precision, recall, accuracy, and F1 score; to compare the models, we used accuracy, recall, and F1 score as our factors of comparison.
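As a concrete illustration, the sketch below shows how the categorization and NMU steps described above could be implemented with pandas and imbalanced-learn. It is a minimal sketch under stated assumptions: the input file name (counties.csv) and the fatality_rate/feature column layout are hypothetical placeholders, not the paper's actual dataset schema.

```python
import pandas as pd
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per county, a continuous fatality rate,
# and numeric predictor columns (file name and schema are placeholder
# assumptions, not the paper's actual dataset).
df = pd.read_csv("counties.csv")

# Bin the continuous fatality rate into the three severity classes:
# low (0, 1], moderate (1, 2], high (2, 4].
df["severity"] = pd.cut(
    df["fatality_rate"],
    bins=[0, 1, 2, 4],
    labels=["low", "moderate", "high"],
).astype(str)

X = df.drop(columns=["fatality_rate", "severity"])
y = df["severity"]

# NearMiss version 1: keep the majority-class records with the smallest
# average distance to the three closest minority-class records.
X_res, y_res = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)

# 90/10 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.1, random_state=42, stratify=y_res
)
```

The resampled, split data from this sketch is reused in the model sketches that follow.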
Accuracy

Accuracy is defined as the percentage of correct predictions on the test data. It is calculated by dividing the number of correct predictions by the total number of predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

Precision

Precision is a metric that quantifies the proportion of positive predictions that are correct. It is calculated using the following formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall

Recall is a metric that quantifies the proportion of actual positives that are correctly predicted:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F-Measure

The F-measure combines precision and recall into a single measure that captures both properties. Its formula is:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

We trained and evaluated the performances of 4 learning algorithms.

K-Nearest Neighbors (KNN)

Given a dataset with response variable y and feature vectors X, a KNN learning algorithm identifies the K points in the training dataset that are closest to a new test point x0 and estimates the probability of each class j as the fraction of those neighbors whose label equals j:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$

where j is the estimated response, yi is the target (label), and N0 is the set of K points. In our experiment, 5 was selected as the value of K. In addition, we used MixedMeasures for the measure types, with Euclidean distance as the distance metric [5]:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where d represents the distance and x and y are 2 data points. The performance of the KNN learning algorithm is shown in Table 1. On all evaluation criteria, the results suggest that the moderate class has the lowest prediction performance. The accuracy score was approximately 0.61.

Logistic Regression

Logistic regression is a supervised learning algorithm for predicting the likelihood of a target variable. In a two-class problem, the target or dependent variable is dichotomous, which implies there are just two potential classes [6]. The logistic function produces output between 0 and 1:

$$p(x) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$

where b0 is the bias or intercept term and b1 is the coefficient for the single input value (x). L2 regularization was used to control overfitting, the tolerance for the stopping criterion was 1e-4, and optimization was based on lbfgs. Table 2 shows the result of the Logistic Regression.

Table 2. Logistic Regression performance.

As shown in Table 2, the performance of Logistic Regression is worse than that of KNN.

Random Forest

Random forest is a supervised learning algorithm that is used for classification as well as regression. A forest comprises trees, and more trees suggest a more robust forest. Aggregating decision trees in ensemble learning produces better performance. Essentially, the Random Forest algorithm builds decision trees on bootstrapped training data samples, obtains a prediction from each of them, and then chooses the final answer through voting [7]. It is an ensemble method that is superior to a single decision tree because it reduces overfitting by averaging the outcomes. Feature importance is averaged over the trees:

$$\mathrm{RFfi}_i = \frac{\sum_{j=1}^{T} \mathrm{normfi}_{ij}}{T}$$

where RFfi_i is the importance of feature i calculated from all T trees, and normfi_ij is the normalized feature importance for feature i in tree j.

Table 3. Random Forest performance.

Table 3 shows that the Random Forest model outperformed the KNN and Logistic Regression models.
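Continuing the sketch, the three models above and the per-class metrics in Tables 1-3 could be reproduced along the following lines with scikit-learn. The KNN and Logistic Regression settings follow the text (K = 5, Euclidean distance; L2 penalty, tolerance 1e-4, lbfgs); the Random Forest tree count is an assumption, since the paper does not report it, and scikit-learn has no direct equivalent of the MixedMeasures option.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

models = {
    # K = 5 with the Euclidean distance metric, as stated above.
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    # L2 regularization, tolerance 1e-4, lbfgs solver, as stated above.
    "Logistic Regression": LogisticRegression(
        penalty="l2", tol=1e-4, solver="lbfgs", max_iter=1000
    ),
    # n_estimators is an assumption; the paper does not report it.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"=== {name} ===")
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
    # Per-class precision, recall, and F1, as reported in Tables 1-3.
    print(classification_report(y_test, y_pred))
```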
Boosting Tree

Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. Unlike Random Forest, which builds its trees independently, boosting fits weak models in series. First, a model is built from the training data; then a second model is constructed that attempts to correct the errors of the first. This procedure continues, and models are added until either the training data is predicted accurately or the maximum number of models is reached [8]. Our implementation used 100 trees, a maximal depth of 5, a minimum of 10 rows, a minimum split improvement of 1.0E-5, 20 bins, a learning rate of 0.01, and a sample rate of 1.

Table 4. Boosting Tree performance.
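The boosting settings above (minimum rows, minimum split improvement, number of bins) read like H2O- or RapidMiner-style GBM parameters and do not all map one-to-one onto scikit-learn. A rough, assumption-laden translation to GradientBoostingClassifier might look as follows: min rows is mapped to min_samples_leaf, sample rate to subsample, and the bin and split-improvement settings are omitted because they have no direct counterpart here.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Approximate translation of the stated settings: 100 trees, maximal
# depth 5, learning rate 0.01, sample rate 1.0, and min rows 10 mapped
# to min_samples_leaf. "Number of bins" (20) and "min split improvement"
# (1.0E-5) come from the original tooling and are omitted here.
boosting = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.01,
    subsample=1.0,
    min_samples_leaf=10,
    random_state=42,
)
boosting.fit(X_train, y_train)
print("Boosting Tree accuracy:", round(boosting.score(X_test, y_test), 3))
```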
III. MODEL COMPARISON

The accuracies of the models were compared. For each model, we also took the average of the precision, recall, and F1 score. Table 5 shows the comparison.

IV. CONCLUSION

In this study we have designed, developed, and evaluated a COVID-19 severity classifier using an imbalanced class dataset. The proposed model has the capability of predicting the severity level of COVID-19 in a given county. The dataset was obtained from the JHU COVID-19 repository, and the COVID-19 severity level was based on the fatality rates of all 3,006 counties of the US. For classification purposes, the fatality rate was categorized into low, moderate, and high. Class imbalance was addressed using the Near Miss Under-sampling (NMU) approach. Ensemble and non-ensemble learning algorithms were trained and evaluated: the ensemble models were Random Forest and Boosting Trees, while KNN and Logistic Regression were used as the non-ensemble models. The result of our experiment suggests that the ensemble models are the most effective in building a COVID-19 severity classifier at the county level using an imbalanced dataset. Thus, we do not have sufficient evidence against our hypothesis, and we contend that ensemble learning in conjunction with under-sampling of the majority classes of an imbalanced COVID-19 dataset has a superior capability of classifying the severity of COVID-19 at the county level.

V. REFERENCES

CDC
Predictions of COVID-19 Infection Severity Based on Co-associations between the SNPs of Co-morbid Diseases and COVID-19 through Machine Learning of Genetic Data
Risk factors and indicators for COVID-19 severity: Clinical severe cases and their implications to prevention and treatment
Comparative analysis of k-nearest neighbor and modified k-nearest neighbor algorithm for data classification
Logistic Regression Model Optimization and Case Analysis
Analysis model of the most important factors in Covid-19 through data mining, descriptive statistics and random forest
Predictive Modeling of Hospital Mortality for Patients With Heart Failure by Using an Improved Random Survival Forest
Gaps in knowledge about COVID-19 among US residents early in the outbreak
Support vector machines, import vector machines and relevance vector machines for hyperspectral classification
Random-Forest-Bagging Broad Learning System with Applications for COVID-19 Pandemic

ACKNOWLEDGEMENT

This work is funded by the National Science Foundation grant number 2032345.