key: cord-0829424-r0flr4qr authors: Zoha, Naurin; Ghosh, Sourav Kumar; Arif-Ul-Islam, Mohammad; Ghosh, Tusher title: A Numerical Approach to Maximize the Number of Testing of COVID-19 using Conditional Cluster Sampling Method date: 2021-02-17 journal: Inform Med Unlocked DOI: 10.1016/j.imu.2021.100532 sha: d8c31886acdfe8b4d21e1eed8da0cc04ee14c209 doc_id: 829424 cord_uid: r0flr4qr
The COVID-19 pandemic is the defining health crisis of 2020, and it has affected the world economy as well. Bangladesh is one of the impacted countries and needs to conduct sufficient tests to identify patients and adopt measures that limit a massive outbreak of this viral infection. However, owing to economic constraints and the unavailability of testing equipment, Bangladesh lags critically behind in the number of tests. This study presents a pool testing method named Conditional Cluster Sampling (CCS) that uses soft computing and data analysis techniques to reduce the expense of testing equipment, and it demonstrates the effectiveness of the method compared with the traditional individual testing method. First, patients are classified into four classes (Minor, Moderate, Major, and Critical) according to their symptoms and the severity of their conditions. A Random Forest Classifier (RFC) is then used to predict the class. Next, random sampling is done from each class according to CCS. Finally, Monte Carlo Simulation (MCS) over 100 cycles is used to demonstrate the effectiveness of CCS for different probability levels of infection. The results show that the CCS method can save up to 22% of the test kits, which saves a considerable amount of money as well as testing time.
The worldwide COVID-19 pandemic, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), is now a serious difficulty for humanity. Bangladesh is one of the many countries affected by COVID-19. The first positive patient was identified in Bangladesh on March 7, 2020.
The transmission was minimal through the month of March but has risen steeply since April 2020, and as of 24 June 2020 Bangladesh ranked 17th in the world in the total number of infected patients [1]. An exponential increase in the number of samples tested could help beat the transmission, but as of 24 June 2020 Bangladesh stood at the 147th position in the world in the number of tests per million population, which is the lowest position in the South Asia region. Not only in Bangladesh but even in the most developed countries, reverse transcription-polymerase chain reaction (RT-PCR) testing, which involves swab testing for the virus' genetic material and is currently the standard test, is severely constrained. This is due to shortages in key supplies, such as reagents, and a limit to the number of tests that can be performed per day using existing equipment [2]. Germany and India have already adopted pool testing methods to increase their number of tests while consuming a limited number of test kits [3]. This study presents a Conditional Cluster Sampling (CCS) method in which patients are tested in pools instead of individually. It uses a numerical method that combines machine learning and statistical data, and upon obtaining the result of each pool, a decision is taken whether to continue the testing iteration or to terminate it. The prime motives of this study are to:
1. Classify the patients on the basis of the severity of their conditions.
2. Apply CCS to decrease the expense of test kits.
The paper is organized as follows. Section 2 describes the related recent literature. In section 3, the basic methodologies along with the soft computing methods are discussed.
First, data on symptom details are collected from a population of potential patients, and according to those details the patients are classified into four classes (Minor, Moderate, Major, and Critical) based on the opinions of specialist frontline doctors. Then, using a machine learning algorithm, the Random Forest Classifier (RFC), each patient is classified into one of these classes, and the patients are tested after the CCS method is implemented. Afterward, Monte Carlo Simulation (MCS) is used to demonstrate the efficacy of the method for different probability levels of infection. Section 4 covers the results and discussion of the methodologies, including the data description as well as a comparative analysis of the traditional method and the proposed method. Finally, section 5 provides conclusions, together with the limitations and assumptions of the paper and its future directions.
Underdeveloped countries in Africa encounter greater limitations in testing resources, leaving them ill-equipped to react to the pandemic [4]. Rapid detection tests (RDT) using kits based on antibody detection are less reliable than PCR-based tests, so they can be replaced with the rapid microfluidic RT-PCR method to ensure accuracy, which is a very sensitive issue for the spread of this virus [5]. A modified stacked autoencoder for modeling the transmission dynamics was proposed to predict the confirmed cases of COVID-19 in China [8]. A robust Weibull model based on iterative weighting was used to predict the number of active cases of COVID-19 in countries worldwide [9]. The COVID-19 outbreak was predicted with different mathematical evolutionary algorithms and two distinct Machine Learning (ML) techniques. Of the two ML techniques, the artificial neural network (ANN) outperformed the adaptive neuro-fuzzy inference system (ANFIS) [10].
Nine different ML algorithms were employed to estimate the new cases of the COVID-19 outbreak in 10 densely populated countries worldwide and to find the best-fitted model for each country [11]. The autoregressive integrated moving average (ARIMA) and least square support vector machine (LS-SVM) models were employed to predict the confirmed cases of COVID-19 in five countries. Both models showed good results; however, the accuracy of the LS-SVM model was better than that of the ARIMA model [12]. A support vector regression model was proposed to forecast the death and active cases of COVID-19 in India for the period of 1st March to 30th April 2020 [13]. Country-based prediction models for the COVID-19 pandemic were proposed and built with multi-gene genetic programming (MGGP) [14]. A deterministic mathematical model based on susceptible, exposed, infectious, and recovered (SEIR) persons was developed to predict the COVID-19 outbreak. This model considers the effect of lockdown to estimate the number of affected people in Saudi Arabia [15].
A review of group testing found that group testing can reduce the constraints of the available testing methods for SARS-CoV-2 [2]. The probability of a person being COVID-19 positive was predicted with a neural network. Cluster sampling was formed based on that prediction, and it was posited that 73% of tests can be avoided [16]. It was proposed that pools of 30 samples can improve test capacity with existing test kits and identify positive samples with adequate diagnostic accuracy [17]. A single positive sample can be detected in pools of up to 32 samples with 90% accuracy. With certain cycle amplification, the pool size may be increased up to 64 samples with a minimum error rate [18]. The optimum pool size was calculated based on the prevalence of positive tests.
If the pool is positive, all of its samples are tested individually, whereas for a negative result the whole pool is considered unaffected [19]. It was found that the pool testing method depends on the infection rate: if the infection rate is high, the pool size should be small. It was proposed that for 30.78% positive tests the optimal pool size should be 3, whereas a pool size of 25 is considered for a 0.18% infection rate [20].
This research focuses on conditional cluster sampling (CCS) for COVID-19 patients based on the health condition of patients. The work is divided into four major steps as follows:
1. Collecting the patient database.
2. Applying Random Forest Classifier (RFC) to classify each patient's condition.
3. Implementing the CCS method based on the condition of the patient.
4. Applying Monte Carlo Simulation (MCS) at different levels of probability for several cycles to estimate the total number of tests.
Step 1: Collecting the patient database
The data was collected for patients across different age groups. Symptoms of individual patients are collected through the survey. This information is sent to frontline doctors directly involved in the treatment of COVID-19 patients so that they can analyze the patients' conditions, and the database is completed based on the doctors' reports.
Step 2: Applying Random Forest Classifier (RFC) to classify each patient's condition
Random Forest (RF) is a supervised machine learning algorithm that can be used for both classification and prediction; however, it is mainly applied to classification. A forest is made of trees, and the more trees it has, the more robust the forest is. In the random forest classification method, the model creates different decision trees based on data samples, and when a new data point is presented for class prediction, each decision tree gives one prediction and the final class is selected by voting. For an input vector (x), each decision tree will give a vote.
Then, $\hat{C}_{rf}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_{1}^{B}$, where $\hat{C}_b(x)$ is the class prediction of the $b$-th random-forest tree and $\hat{C}_{rf}^{B}(x)$ is the final prediction using the majority vote [25]. The main concept behind this model is simple but powerful. The reason for this effect is that the trees protect each other from their individual errors. There are many attribute selection methods, but the most frequently used attribute selection measures in decision tree induction are the gain ratio criterion [26] and the Gini Index [27]. RFC uses the Gini Index for attribute selection, which measures the impurity of an attribute with respect to the classes. For a given training set P, selecting a sample case at random and predicting that it belongs to class $C_i$, the Gini index can be written as
$$\sum\sum_{j \neq i} \left(\frac{f(C_i, P)}{|P|}\right)\left(\frac{f(C_j, P)}{|P|}\right)$$
Here, $f(C_i, P)/|P|$ is the probability that a selected case belongs to the class $C_i$.
To generate a prediction model through RFC, two parameters are required: the number of classification trees and the number of predicting variables used at each node to grow the trees. The selected features are expanded at each node, and in this way N decision trees are grown, where N is a user-defined number of trees. When new data points are introduced, they are passed down all of those trees, and the class is chosen by the maximum votes out of N votes.
For this research, input data with various features and an output attribute with different levels are split into two datasets: a training dataset and a testing dataset. Then bootstrap aggregating and attribute bagging are used to form randomly selected decision trees while minimizing the misclassification rate. Finally, the testing dataset is examined to predict the class. 90% of the data is used as training data [28], while the rest is assigned as testing data to classify the patient's condition.
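The Gini index and the majority vote described in this step can be illustrated with a short sketch. It is written in Python purely for illustration (the study itself reports using R), and the function names `gini_index` and `majority_vote` are hypothetical. The sketch relies on the identity that the double sum of $p_i p_j$ over $j \neq i$ equals $1 - \sum_i p_i^2$ when the class probabilities sum to one.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a list of class labels.

    Equals the sum over i != j of p_i * p_j, computed as 1 - sum(p_i ** 2),
    where p_i is the fraction of cases that belong to class i.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def majority_vote(votes):
    """Return the class with the most votes (ties go to the class seen first)."""
    return Counter(votes).most_common(1)[0][0]


# Illustrative labels only; these are not data from the study.
node = ["Minor", "Minor", "Critical", "Critical"]
print(gini_index(node))  # 0.5, an evenly mixed two-class node

# One hypothetical vote per tree for a single patient.
tree_votes = ["Moderate", "Major", "Major", "Critical", "Major"]
print(majority_vote(tree_votes))  # Major
```

A pure node (all cases in one class) gives an index of 0, so a split that lowers the index produces child nodes that are more homogeneous in severity class.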
Step 3: Implementing the CCS method based on the condition of the patient
Conditional Cluster Sampling is a technique that stratifies the cluster sample based on the condition of the patient. For better test accuracy, the maximum cluster size chosen is 64 [18]. The sample size is inversely proportional to the severity, which means that more severe cases are clustered into smaller samples. The main reason behind this is that the probability of a critical patient being positive for COVID-19 is higher. Here, the severity level is taken to increase with higher classes; for instance, the class-4 patients are the critical patients.
The key symptoms of COVID-19 found in the literature are classified into eight types with different levels. Thereafter, the Random Forest Classifier (RFC) is used to predict patient condition, with 90% of the data used for training and the remainder for testing in R studio (version 3.6.3). The model was tuned programmatically, and the best model was found with ntree = 100 (the number of trees in the random forest) and mtry = 2 (the number of attributes randomly sampled as candidates at each split during attribute bagging). The accuracy of the model is 96%, which indicates that the training dataset is well constructed. A random tree and the confusion matrix are depicted in figure-7 and table-2, respectively.
For a patient, if symptom-1 (fever) persists for 2 days or more, the tree checks whether it has persisted for 5 days or more. If so, the patient is considered to be a critical patient. Otherwise, it further checks the level of symptom-4 (difficulty in breathing). If there is no breathing problem, the patient is taken as a moderate patient. Otherwise, it examines the level of symptom-5 (pain in chest). If the level is minor or moderate, it looks at symptom-2 (cough). If there is no cough, the patient is a moderate patient; otherwise, it examines symptom-3 (sore throat).
If there is a mild or moderate sore throat, the patient is minor; otherwise, it checks this symptom again. If the sore throat is severe, the patient is a critical patient; otherwise, the patient is a moderate patient.
The predicted data is then used in CCS, implemented in R studio (version 3.6.3), to find the total number of tests needed. Up until 24 June 2020, 18% of the total tests performed in Bangladesh were COVID-19 positive [1]. In this study, the results are presented for two ranges of the probability of a patient testing positive. In the first case, the maximum probability of a patient testing positive is assumed to be 25% (Table-3), and in the second case, the maximum probability is assumed to be 20% (Table-4).
The compiled results, seen in the negative slope of both curves, show that as the severity class of the patients advances, the test kit saving per 100 MCS cycles decreases in both cases, which means that CCS behaves more like individual testing as the patients' severity increases. Also, the CCS method relies mainly on symptoms to classify the patients, so the method is not effective for asymptomatic patients.
Bangladesh is undergoing community transmission of COVID-19, and to address this, the initial focus has been on case identification. Case identification is currently very low due to a shortage of testing kits. This study suggests a means to mitigate this issue by using conditional cluster sampling. It incorporates a numerical method, probabilistic sampling, and health science to arrive at a systematic cluster specimen testing method, which is the CCS method. The accuracy of RFC in predicting a patient's class is 96%. The CCS method is repeated for 100 cycles according to MCS, which resulted in a saving of 12% of the test kits for higher probabilities of positive case detection and 22% for lower probabilities of positive case detection.
The saving in test kits will save both time and money and allow test reports to be obtained rapidly. The CCS method is beneficial for mass specimen testing. However, this study has some limitations:
1. The probability ranges are selected based on current statistics and infection patterns. The probability is contingent upon different infection patterns and situations.
2. The test data set contains only 399 patients. Testing on a larger population will most likely yield a more accurate scenario.
3. The model does not consider asymptomatic patients.
4. For computational simplicity, only 100 cycles of simulation are conducted in MCS. An increase in the number of cycles is likely to deliver a more precise result.
This study can also be explored using other, more intricate tools in the future. This research uses RFC to classify the test data, which could also be done using Deep Learning or a Deep Neural Network algorithm to add more dimensions.
Bangladesh Coronavirus: 122,660 Cases and 1,582 Deaths - Worldometer
Group Testing for SARS-CoV-2: Forward to the Past?
Covid-19 update: How pool testing can enhance speed, scale
Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study
Review and analysis of current responses to COVID-19 in Indonesia: Period of
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
Coronavirus disease 2019 (COVID-19): A literature review
Artificial Intelligence Forecasting of Covid-19 in China
Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning
Study of ARIMA and least square support vector machine (LS-SVM) models for the prediction of SARS-CoV-2 confirmed cases in the most affected countries
A python based support vector regression model for prediction of COVID19 cases in India
COVID-19 outbreak: Application of multi-gene genetic programming to country-based prediction models
Impact of lockdowns on the spread of COVID-19 in Saudi Arabia
A combination of 'pooling' with a prediction model can reduce by 73% the number of COVID-19 (Corona-virus) tests
Pooling of samples for testing for SARS-CoV-2 in asymptomatic people
Evaluation of COVID-19 RT-qPCR test in multi-sample pools
Optimization of group size in pool testing strategy for SARS-CoV-2: A simple mathematical model
Is Pool Testing Method Employed in Germany and India Effective?
A classifier prediction model to predict the status of Coronavirus CoVID-19 patients in South Korea
Random forests
Book review: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993
Measuring the accuracy of species distribution models: A review
Ovarian cancer data classification using bagging and random forest
Introduction to Monte Carlo Simulation