key: cord-0127545-v7b6z86y authors: Avila, Eduardo; Dorn, Marcio; Alho, Clarice Sampaio; Kahmann, Alessandro title: Hemogram Data as a Tool for Decision-making in COVID-19 Management: Applications to Resource Scarcity Scenarios date: 2020-05-10 journal: nan DOI: nan sha: 8cea14f271e041c2fca0cfd013d22293386fa890 doc_id: 127545 cord_uid: v7b6z86y

The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential services breakdown and collapse of health care structure. A critical element involves essential workforce management, since current protocols recommend release from duty for symptomatic individuals, including essential personnel. Testing capacity is also problematic in several countries, where diagnosis demand outnumbers available local testing capacity. This work describes a machine learning model derived from hemogram exam data obtained from symptomatic patients and how it can be used to predict qRT-PCR test results. Methods: A Naïve Bayes model for machine learning is proposed for handling different scarcity scenarios, including managing symptomatic essential workforce and absence of diagnostic tests. Hemogram result data were used to predict qRT-PCR results in situations where the latter was not performed or its results are not yet available. Adjustments to the assumed prior probabilities allow fine-tuning of the model according to the actual prediction context. The proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity. Data assessment can be performed on an individual or simultaneous basis, according to the desired outcome. Based on hemogram data and background scarcity context, resource distribution is significantly optimized when model-based patient selection is used, compared to random choice. The model can help manage testing deficiency and other critical circumstances.
Machine learning models can be derived from widely available, quick, and inexpensive exam data in order to predict qRT-PCR results used in COVID-19 diagnosis. These models can be used to assist strategic decision-making in resource scarcity scenarios, including personnel shortage, lack of medical resources, and testing insufficiency.

information is significant. All variables were normalized to maintain anonymity and remove scale effects. No missing data imputation was performed during model generation, to avoid bias. Considering the significant amount of missing data, only 510 patients presented values for all 15 parameters evaluated in hemogram results, comprising the following cell counts or hematological measures: hematocrit, hemoglobin, platelets, mean platelet volume, red blood cells, lymphocytes, leukocytes, basophils, eosinophils, monocytes, neutrophils, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and red blood cell distribution width (RDW). Data for the above parameters were used in model construction, along with qRT-PCR COVID-19 test results. The full dataset is available at https://www.kaggle.com/einsteindata4u/covid19.

Machine learning (ML) is a field of study in computer science and statistics dedicated to the execution of computational tasks through algorithms that do not require explicit instructions but instead rely on learning patterns from data samples to automate inferences [19]. These algorithms can infer input-output relationships without explicitly assuming a pre-determined model [10, 11]. There are two learning paradigms: supervised and unsupervised. Supervised learning is a process in which predictive models are constructed from a set of observations, each associated with a known outcome (label).
In contrast, in unsupervised learning one does not have access to the labels; it can be viewed as the task of "spontaneously" finding patterns and structures in the input data.

Our objective with this study is to predict in advance the results of the qRT-PCR test with machine learning models, using data from hemogram tests performed on symptomatic patients. The main process can be divided into four steps: (1) pre-processing of the data, (2) selection of an appropriate classification algorithm, (3) model development and validation, i.e., the process of using the selected characteristics to separate the two groups of subjects (positive for COVID-19 vs. negative for COVID-19 in the qRT-PCR test), and (4) testing of the generated model with additional data. The steps are detailed as follows:

Data Pre-processing: Samples presenting a missing value in any of the 15 evaluated features were removed. A total of 510 patients (73 positives for COVID-19 and 437 negatives) presented complete data and were considered for the model construction.

Classification Algorithm: In this work, we use the Naïve Bayes (NB) classifier, a probabilistic machine learning model used for classification tasks. The main reasons for choosing this classifier are its low computational cost and clear interpretation. In medicine, the first computer-based attempts at decision support relied mainly on the Bayes theorem, in order to aggregate data information with physicians' previous knowledge [18]. The NB method combines the previous probability of an event (also called prior probability, or simply prior) with additional evidence (for example, a set of clinical data from a patient) to calculate a combined, conditional probability that includes the prior probability given the extra information. The result is the posterior probability of an outcome, or simply posterior.
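The combination of a prior with hemogram evidence can be illustrated with a minimal sketch in plain Python. This is not the authors' Scikit-Learn code: it assumes Gaussian class-conditional densities for each feature, and the function names (`fit_gaussian_nb`, `posterior_positive`), the two normalized features, and all numeric values are hypothetical.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate a (mean, variance) pair per class and per feature."""
    params = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Small constant keeps every variance strictly positive.
        params[label] = [(mean(c), pvariance(c) + 1e-9) for c in zip(*rows)]
    return params

def log_likelihood(x, feature_params):
    """Sum of per-feature Gaussian log-densities (the 'naive' independence assumption)."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v, (mu, var) in zip(x, feature_params)
    )

def posterior_positive(x, params, prior_pos):
    """P(positive | x) for a user-supplied prior probability of a positive result."""
    log_odds = (
        math.log(prior_pos / (1 - prior_pos))
        + log_likelihood(x, params[1])
        - log_likelihood(x, params[0])
    )
    # Clamp before exponentiating to avoid overflow on extreme evidence.
    log_odds = max(min(log_odds, 700.0), -700.0)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical normalized values for two features; label 1 = positive, 0 = negative.
X = [[-0.8, -0.6], [-0.5, -0.9], [-0.7, -0.4],
     [0.4, 0.5], [0.9, 0.2], [0.1, 0.8], [0.6, -0.1]]
y = [1, 1, 1, 0, 0, 0, 0]

params = fit_gaussian_nb(X, y)
patient = [-0.6, -0.5]  # hypothetical new patient
for prior in (0.05, 0.3, 0.9):
    print(prior, posterior_positive(patient, params, prior))
```

For the same evidence, a larger prior always yields a larger posterior, which is the property the model relies on when priors are tuned to a given context.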
This classifier is called "naïve" because it considers each exam result (variable) to be independent of the others. Since this assumption is not realistic in medicine, the model should not be interpreted [23]. Despite this drawback, it can outperform more robust alternatives in classification tasks, and since it reflects the uncertainty involved in the diagnosis, Bayesian approaches are more suitable than deterministic techniques [8, 11].

Model Development and Validation: A classifier is an estimator with a predict method that takes an input array (test) and makes predictions for each sample in it. In supervised learning estimators (our case), this method returns the predicted labels or values computed from the estimated model (positive or negative for COVID-19). Cross-validation is a model evaluation method that allows one to evaluate an estimator on a given dataset reliably. It consists of iteratively fitting the estimator on a fraction of the data, called the training set, and testing it on the left-out unseen data, called the test set. Several strategies exist to partition the data. In this work, we used Leave-one-out (LOO) cross-validation, as in Chang et al. [4]. The data were split N times, where N is the number of samples: in each split, the model was trained on all the data except one point, and a prediction was made for that point. The proposed approach was implemented in Python v.3 (https://www.python.org) using Scikit-Learn v. 0.22.2 [21] as a backend.

Model Test: In order to evaluate the adequacy and generalization power of the proposed model, a set of 92 samples (10 positives for COVID-19 and 82 negatives) was extracted from the patient database. Those samples were not initially employed in model delineation, since each of them presents a single missing value among the 15 employed hemogram parameters. Missing data for this test set were imputed using the average value of the missing parameter within the corresponding result group (positive or negative).
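The leave-one-out procedure described under Model Development and Validation can be sketched as follows. This is a standard-library illustration rather than the Scikit-Learn code used in the study; the small Gaussian NB helper, the function names, and the toy data are assumptions introduced only to make the loop runnable.

```python
import math
from statistics import mean, pvariance

def fit(X, y):
    """Per-class list of (mean, variance) pairs, one pair per feature."""
    params = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Small constant keeps every variance strictly positive.
        params[label] = [(mean(c), pvariance(c) + 1e-9) for c in zip(*rows)]
    return params

def predict(x, params, prior_pos=0.5):
    """Return 1 (positive) when the posterior log-odds are non-negative, else 0."""
    def loglik(feature_params):
        return sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, feature_params)
        )
    log_odds = (math.log(prior_pos / (1 - prior_pos))
                + loglik(params[1]) - loglik(params[0]))
    return 1 if log_odds >= 0 else 0

def leave_one_out(X, y, prior_pos=0.5):
    """Train on N-1 samples and predict the held-out one, for each of the N samples."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        params = fit(X_train, y_train)
        predictions.append(predict(X[i], params, prior_pos))
    return predictions

# Hypothetical normalized values for two features; label 1 = positive, 0 = negative.
X = [[-0.8, -0.6], [-0.5, -0.9], [-0.7, -0.4],
     [0.4, 0.5], [0.9, 0.2], [0.1, 0.8], [0.6, -0.1]]
y = [1, 1, 1, 0, 0, 0, 0]

predictions = leave_one_out(X, y)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(predictions, accuracy)
```

Each sample is predicted by a model that never saw it, so the resulting accuracy is an out-of-sample estimate. With very few samples per class, as in this toy data, the variance estimates are unstable and individual predictions should not be over-interpreted.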
The test set was submitted to the previously generated model in order to evaluate classification performance.

For data description, the probability density functions (PDF) of all 15 hemogram parameters were estimated from the original sample using a kernel density estimator. Some hemogram parameters present notable differences between the distributions of positive and negative results, mainly regarding their modal value (distribution peak value) and variance (distribution width). Differences are summarized in Table 1. Regarding basophil, eosinophil, leukocyte and platelet counts, the qRT-PCR positive group distribution shows lower modal value and lower variance. On the other hand, monocyte count displays the opposite behavior, since higher modal value and variance are observed for the qRT-PCR positive group. Lower variance may depict a condition pattern; it is therefore expected that negative cases present higher variance, since this group may contain a higher variety of conditions (reasons for symptom presence). The remaining hemogram parameters did not show a notable difference between negative and positive groups. PDF analysis results are presented in Supplementary Material Figure S1.

Table 1. Descriptive analysis of hemogram parameters used in the present study (columns: modal value, variance).

An NB classifier based on training set hemogram data was developed. Under the model, the complete range of prior probabilities (from 0.0001 to 0.9999 in 0.0001 increments) was scrutinized, and the posterior probability of each class was computed for the different prior conditions. A posterior probability value of 0.5 was defined as the classification threshold for assigning a sample to the positive or negative predicted group. The resulting model showed good predictive power for the qRT-PCR test result based on hemogram data. Figure 1 shows the accuracy, sensitivity, and specificity curves derived from the model for different prior probabilities of each class (positive or negative for COVID-19).
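The prior scan described above can be sketched in a few lines of Python. The sketch relies on the fact that, for a Naïve Bayes model, the posterior reaches 0.5 exactly when the log-likelihood ratio of the evidence plus the log-odds of the prior is non-negative. The per-sample log-likelihood ratios (`llrs`) and labels below are hypothetical values, not results from the study, and the function names are illustrative.

```python
import math

def classify(llr, prior_pos, threshold=0.5):
    """Label a sample positive (1) when its posterior reaches the threshold."""
    log_odds = llr + math.log(prior_pos / (1 - prior_pos))
    # Clamp before exponentiating to avoid overflow on extreme values.
    log_odds = max(min(log_odds, 700.0), -700.0)
    posterior = 1 / (1 + math.exp(-log_odds))
    return 1 if posterior >= threshold else 0

def metrics_for_prior(llrs, labels, prior_pos):
    """Return (sensitivity, specificity, accuracy) for one prior value."""
    preds = [classify(llr, prior_pos) for llr in llrs]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg, (tp + tn) / len(labels)

def sweep_priors(llrs, labels, n_steps=10000):
    """Scan priors 1/n_steps, 2/n_steps, ..., (n_steps - 1)/n_steps."""
    return {i / n_steps: metrics_for_prior(llrs, labels, i / n_steps)
            for i in range(1, n_steps)}

# Hypothetical log-likelihood ratios log[p(x | positive) / p(x | negative)].
llrs = [3.2, 1.1, -0.4, 0.6, -2.5, -1.0, -3.1, 0.2, -1.8, -0.7]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

curves = sweep_priors(llrs, labels)
for prior in (0.0001, 0.5, 0.9999):
    print(prior, curves[prior])
```

Plotting the three values in `curves` against the prior would give curves of the kind reported in Figure 1, with sensitivity rising and specificity falling as the prior of a positive result increases.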
Reported prior probabilities refer to the positive COVID-19 condition. When the prior probability was set to the maximum defined value (0.9999), the NB classifier correctly identified all PCR positive cases. On the other hand, this configuration improperly predicted 77.3% of negative PCR results as positive. At the lowest possible prior probability setting, the model does not classify a single observation as positive. This result can be explained by the unbalanced number of observations in each class, which tends to over-classify samples into the class with more observations, i.e. negative results. This characteristic can also be noticed in the overall accuracy, which rises as the prior decreases, since the classifier then tends to assign all observations to the dominant class (negative), consequently raising the total of correctly classified samples. The break-even point is met when the prior probability is set to 0.2933. Under this condition, all metrics are approximately 76.6%. Regarding model sensitivity, the rate of positive samples correctly classified is over 85% within the 0.999 to 0.5276 range, with a small decrease as the prior probability of a positive result is reduced within this range. When the prior is set below 0.0606, the number of samples predicted as positive decreases rapidly, yielding lower sensitivity. Specificity grows linearly as the tested priors decrease. Ultimately, the accuracy profile is similar to that of specificity, due to the dominance of negative patients.

As mentioned above, the choice of prior probability has critical relevance in the use of the proposed model. When extreme values of positive probability are applied (close to 0 or 1), a specific class (positive or negative qRT-PCR test result prediction) is favoured, increasing the ability to detect it correctly.
As an example, when a value of 0.9999 is set for the prior probability of a positive result, an increase in misclassification of negative class results is observed. At the same time, it is possible to properly identify samples where hemogram evidence strongly indicates a negative result, according to the model. This is because the evidence used in the model (in the present case, hemogram data) must strongly support the reduction of the posterior probability of disease to values under 0.5 in order to lead to a negative result. This logic can be applied to fine-tune the prior probability used in the model, in order to improve correct classification of the positive or negative group. Examples of how to use this feature are provided in the "Discussion" section.

Test samples (n=92, including 10 qRT-PCR positives) were used to test the proposed model. Figure 2 presents results obtained from applying the model to the test dataset.

Laboratory findings can provide vital information for pandemic surveillance and management [16]. Hemogram data have been previously proposed as useful parameters in the diagnosis and management of viral pandemics [25]. In the present work, an analysis of hemogram data from symptomatic patients suspected of COVID-19 infection was carried out. A machine learning model based on the Naïve Bayes method is proposed in order to predict actual qRT-PCR results for such patients. The presented model can be applied to different situations, aiming to assist medical practitioners and management staff in key decisions regarding this pandemic. Figure 3 summarizes model construction and application. Predictions are not intended to be used as a diagnostic method, since this technique was designed to anticipate qRT-PCR results only. As such, it is highly dependent on factors affecting qRT-PCR efficiency, and its prediction capacity depends on the sensitivity, accuracy, and specificity of the original laboratory exam [24].
Descriptive analysis of hemogram clinical findings shows differences in blood cell counts and other hematological parameters between COVID-19 positive and negative patient results. Differences are conspicuous for three measures (leukocytes, monocytes and platelets) and more subtle for two additional ones (basophils and eosinophils). It is possible that differences are also present across the complete data spectrum, even though they are not clearly visible in the PDF data. These results are in accordance with previous reports of changes in laboratory findings in COVID-19 infected patients, where conditions such as leukopenia, lymphocytopenia and thrombopenia were reported [6]. It is important to highlight that this data analysis is not sufficient to characterize clinical hematological alterations in the evaluated patients (when compared to demographic hematologic parameter data), since data were normalized for the evaluated sample set only. However, even within this particular portion of the population (individuals presenting COVID-19-like symptoms), differences were found between individuals presenting negative and positive qRT-PCR COVID test results. The proposed NB-ML model can be helpful in accessing different levels of information from hemogram results, by inferring non-evident patterns and parameter relationships from these data.

Bayesian techniques are based on the choice of a prior probability of an event (in the present case, a positive result for the qRT-PCR test). The method considers actual evidence (hemogram data) to produce a posterior probability of the outcome (prediction of a positive result). By changing the selected prior probability, we can derive an uncertainty analysis of the model to understand its distribution. This uncertainty can then be used to adapt the classifier to a particular ongoing context. This option allows the evaluation of different decision-making scenarios concerning diverse aspects of pandemic management.
During a crisis situation, measures should be taken seeking to maximize benefits and achieve a fair resource allocation [5]. To illustrate the model's flexibility and how it can help in this matter, a general framework of application is proposed, followed by a simulation of four scenarios where resource scarcity is assumed.

The proposed NB model can be applied in two distinct situations. When clinical data are available for a particular patient, it is highly recommended that medical staff determine the prior probability on a case-by-case basis. When no clinical or medical data are available, or when decisions regarding resource management involving multiple symptomatic patients are necessary, the model can be applied to multiple individuals simultaneously, aiming to identify those with higher probabilities of presenting positive qRT-PCR results.

Individual risk management and personal evaluation are essential for COVID-19 response [7]. Individuals presenting COVID-19 symptoms are medically evaluated where no COVID-19 test is available for appropriate diagnosis confirmation. Medical practitioners can determine a probability of disease based on anamnesis, symptoms, clinical exams, laboratory findings and other available data. This probability of infection, as determined by the physician or medical team, can be considered as the prior probability. Given the hemogram data as input and the prior probability of COVID-19 based on medical findings, the model returns a posterior probability, which can be higher or lower than the original one depending on the hemogram alterations caused by the virus infection. It is important that hemogram data not be included in the original medical assessment and prior determination, in order to avoid bias and reduce model overfit.

The model can also be used in situations where decisions are necessary for resource management including multiple individuals.
A target group (positive or negative qRT-PCR result prediction) should first be defined. The model can then be applied to multiple individuals simultaneously, with the choice of prior probability carefully adjusted to result in a specific number of predicted individuals from the target group, according to the desired outcome. This method increases the correct selection of candidates belonging to the target group when compared to random selection. When additional clinical data are available, or become available later, patients selected during bulk evaluation should be reassessed individually as proposed in the general framework, in order to reduce misclassifications.

Examples of proposed model use are presented for some specific scarcity scenarios in Table 2. As can be seen, the model sensitivity can be adjusted by selecting the prior probability employed, according to the desired outcome or interest group. Prior selection should be carefully decided based on the current context or situation, and must consider the classification group where higher accuracy is intended.

High accuracy in qRT-PCR result prediction is achieved based on hemogram information only. Further analysis performed on the original data (not shown) suggests that additional clinical results can improve prediction efficiency. This conclusion is in accordance with previous findings suggesting that biochemical and immunological abnormalities, in addition to hematologic alterations, can be caused by COVID-19 disease [12]. In this context, the relevance of the data employed to generate ML models is emphasized. The use of large and comprehensive datasets, containing as much information as possible regarding clinical and laboratory findings, symptoms, disease evolution, and other relevant aspects, is crucial in devising useful and adequate models.
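The bulk evaluation described earlier in this section, where the prior is adjusted until a specific number of individuals is predicted as positive, can be sketched as follows. Because the prior only shifts every patient's log-odds by the same amount, choosing the prior is equivalent to ranking patients by their log-likelihood ratio. The values in `llrs` are hypothetical, and `prior_for_target_count` and `predicted_positive` are illustrative names, not part of the authors' tool.

```python
import math

def prior_for_target_count(llrs, k):
    """Prior for which exactly k individuals reach a posterior of 0.5 or more."""
    if not 0 < k < len(llrs):
        raise ValueError("k must be between 1 and len(llrs) - 1")
    ranked = sorted(llrs, reverse=True)
    if ranked[k - 1] == ranked[k]:
        raise ValueError("tie at the cut-off: no prior separates exactly k individuals")
    # Place the decision boundary halfway between the k-th and (k+1)-th patient.
    cut = (ranked[k - 1] + ranked[k]) / 2
    # Posterior >= 0.5 iff llr + logit(prior) >= 0, so logit(prior) = -cut.
    return 1 / (1 + math.exp(cut))

def predicted_positive(llrs, prior_pos):
    """Indices of individuals whose posterior is at least 0.5 under this prior."""
    shift = math.log(prior_pos / (1 - prior_pos))
    return [i for i, llr in enumerate(llrs) if llr + shift >= 0]

# Hypothetical log-likelihood ratios for six symptomatic individuals.
llrs = [3.2, 1.1, -0.4, 0.6, -2.5, -1.0]

prior = prior_for_target_count(llrs, k=2)  # e.g. only two tests are available
print(prior, predicted_positive(llrs, prior))
```

When the target group is the negative class instead (for example, selecting staff most likely to return to duty), the same ranking can be read from the bottom of the list.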
The development of nationwide or regional databases based on local data is essential in order to capture epidemiological idiosyncrasies associated with such populations [28]. Also, natural differences in hemogram results from distinct demographic groups (as seen in reference values

Despite having high overall accuracy, the performance metrics obtained with the proposed model show unequal ability to predict positive and negative results. This situation is caused by a significant imbalance in the number of samples belonging to each of these qRT-PCR result groups in the original data. The use of balanced data in machine learning model design is important to assure high prediction quality [15]. The option of maintaining the original data in model construction was adopted, since it better represents actual COVID-19 prevalence among symptomatic patients and therefore seems to represent a more realistic situation. Additional simulations applying a balanced model (data not shown), using oversampling of the positive group to compensate for its insufficiency in the original data, produced alternative models with superior prediction power. Alternative balanced model results are presented in Supplementary Material Figure S2. Therefore, additional positive samples will be added to the data and used in future model versions. As a perspective, collection of hemogram results from asymptomatic patients (in addition to symptomatic individuals) can be used to evaluate the utility of this approach for the detection of asymptomatic infections, in order to provide alternatives in diagnostics, especially in a context of testing deficiency.

A web-based application was developed by the authors, in which hemogram data can be entered for a single individual, along with the prior probability of infection, based on the data used to generate the present model. The online tool is available at http://sbcb.inf.ufrgs.br/covid.
Future implementation will allow the upload of multiple patients simultaneously, and construction or testing of user data-derived models. This service will allow easy access and practical application of the proposed model.

Figure S2. Performance metrics of the alternative balanced Naïve Bayes model. In this case, random oversampling of positive results was employed until the number of samples in each class was identical. Prior probabilities are presented in reference to positive qRT-PCR prediction. Confusion matrices (left to right) are presented for 0.9999, 0.2237 and 0.0001 prior probabilities, respectively. Sensitivity=True Positive Ratio; Specificity=True Negative Ratio. Random seed was set to 0 for replication purposes.

[1] Epidemiology, causes, clinical manifestation and diagnosis, prevention and control of coronavirus disease (covid-19) during the early outbreak period: a scoping review
[2] How will country-based mitigation measures influence the course of the covid-19 epidemic? The Lancet
[3] Covid-19: the case for health-care worker screening to prevent hospital transmission
[4] Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer
[5] Fair allocation of scarce medical resources in the time of covid-19
[6] Blood and blood product use during covid-19 infection
[7] Individual risk management strategy and potential therapeutic options for the covid-19 pandemic
[8] Experience with a model of sequential diagnosis
[9] Clinical characteristics of coronavirus disease 2019 in china
[10] Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems
[11] The elements of statistical learning: data mining
[12] Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (covid-19): a meta-analysis
[13] Clinical features of patients infected with 2019 novel coronavirus in wuhan, china. The Lancet
[14] Health security capacities in the context of covid-19 outbreak: an analysis of international health regulations annual report data from 182 countries
[15] Learning from imbalanced data: open challenges and future directions
[16] The critical role of laboratory medicine during coronavirus disease 2019 (covid-19) and other viral outbreaks
[17] Defining the epidemiology of covid-19 - studies needed
[18] Clinical versus actuarial prediction in the differential diagnosis of jaundice. A study of the relative accuracy of predictions made by physicians and by a statistically derived formula in differentiating parenchymal and obstructive jaundice
[19] Machine Learning
[20] The socio-economic implications of the coronavirus and covid-19 pandemic: A review
[21] Scikit-learn: Machine learning in python
[22] Critical supply shortages - the need for ventilators and personal protective equipment during the covid-19 pandemic
[23] Computer-assisted decision support for the diagnosis and treatment of infectious diseases in intensive care units
[24] Interpreting Diagnostic Tests for SARS-CoV-2
[25] Clinical utility of the full blood count in identifying patients with pandemic influenza a (h1n1)
[26] Real-time rt-pcr in covid-19 detection: issues affecting the results
[27] Covid-19: how doctors and healthcare systems are tackling coronavirus worldwide
[28] Hematological findings and complications of covid-19
[29] The epidemiological and clinical features of covid-19 and lessons from this global infectious public health event
[30] Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal