key: cord-0873831-0j7xxy4b
authors: Bezzan, Vitor P.; Rocco, Cleber D.
title: Using bi-dimensional representations to understand patterns on COVID-19 blood exam data
date: 2021-12-30
journal: Inform Med Unlocked
DOI: 10.1016/j.imu.2021.100828
sha: 8c6c71a96091b1d290d7c76816255f80bbe235ce
doc_id: 873831
cord_uid: 0j7xxy4b

Blood tests play an essential part in everyday medicine and are used by doctors in several diagnostic procedures. Still, these data are multivariate, and some diseases, like COVID-19, can have different symptom manifestations and outcomes. This study proposes a method for extracting useful information from blood tests using the UMAP technique (Uniform Manifold Approximation and Projection for Dimension Reduction) combined with DBSCAN clustering and statistical approaches. The analysis performed here indicates several clusters with infection prevalence varying between 2% and 37%, meaning that our procedure is indeed capable of finding different patterns. A possible explanation is that COVID-19 is not just a respiratory infection but a systemic disease with critical hematological implications, primarily on white-cell fractions, as indicated by the p-values of the relevant statistical tests, in the range of 0.03–0.1. The novel analysis procedure proposed could be adopted for other datasets and different illnesses, helping researchers to discover new patterns in data that could be used in various diseases and contexts.

still witnessing death and hospitalization rates which were high at the beginning of 2021 (especially in the North region), according to the Johns Hopkins Coronavirus Resource Center [20]. More than ever we should use all the tools available to understand the infection scientifically in a precise and structured way.
In this study, we propose an exploratory and descriptive data analysis using a bidimensional representation generated by UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) [28], followed by DBSCAN clustering and the subsequent use of statistical tests on the clusters obtained to reveal (non-causal) relationships between different parameters of blood-test data and their diagnostic counterparts. As with any study that aims only to describe data (and therefore does not make any predictions or offer any recommendations for action), we do not split our dataset into training and test samples. Data were obtained from the Albert Einstein Hospital in Sao Paulo, Brazil [18]. The data consist of patients' blood tests, together with information about whether or not a given patient has COVID-19 and whether the patient needed special care (hospitalization in standard, semi-intensive, or intensive care units). The main purpose of this article is to show that techniques previously considered to belong to the theoretical world of data science can be successfully applied to blood test data. Furthermore, we want to open a path where techniques like these can be used more frequently by medical science researchers. We believe that studies with a greater amount of data (both in number of people and in number of variables), more representative samples and greater variability can benefit from what we discuss here and use the proposed procedures with little modification relative to the original experiments. We also believe that in the future our methodology can be easily extended to data that do not come from blood samples. This article is organized as follows: in Section 2, we examine the most up-to-date literature on machine learning applied to blood test results and explain its findings. In Section 3 we present the method used to perform the two data experiments, whose results are reported in Section 4.
In Sections 5 and 6, we summarize the results and discuss the limitations and possible implications of this study.

In this section, we present some studies from the literature that inspired and laid the foundations for our analysis and perspective. Some interesting results were obtained without using machine learning (ML), and such approaches should be encouraged as a first-line, open-access tool available to most researchers. In [14], statistically significant differences were found using two-way tables based on blood test data from a hospital in Italy, which is a quick and cheap way to detect infections. A newer study is [24], which focuses much more on hematological data and sheds light on statistically significant differences and possible risk factors associated with different patients. One meta-analysis, including the results of 35 other studies [5], indicated the factors that contribute the most to non-severe patients developing severe disease. Using blood tests with machine learning seems to have gained traction since the beginning of the pandemic. Theoretical justification and groundwork for supervised ML techniques can be found in several articles. In [9], attention is paid to possible combinations of models, with results varying between 0.6 and 0.9 for the area under the Receiver Operating Characteristic (ROC) curve when detecting infected patients. In [31], similar results were obtained using the same dataset we adopted in this study. Building on the results of these articles and some others, [2] uses ensembles and achieves 99.88% accuracy in predicting infections. Other articles with similar but not identical purposes are available. [8] uses several ML models on a dataset provided by the Sírio Libanês Hospital, in Brazil, to predict special-care probability and the number of days under special care, obtaining a value of 0.94 for the area under the ROC curve for the first target.
In [15], we see a prime example of how a system could be implemented to detect COVID-19 in a given patient. This study also stands out as it uses a small sample and optimization techniques to find the most important variables for the problem. As an example of unsupervised learning techniques, we can mention [22], which uses a model to predict infection and compares COVID-19 manifestations with other diseases using t-distributed stochastic neighbor embedding (whose purpose is similar to that of UMAP), concluding that the blood parameters of those affected by severe COVID-19 resemble bacterial infections more than viral ones, which was a very surprising result.

To put in context what will be shown in this article, our main contribution is the use of a set of computational techniques to discover hidden patterns in blood test data, combining a well-known clustering technique with a very recent dimensionality reduction technique that has gained adherents in several applied areas. This dimensionality reduction, studied primarily as a purely mathematical exercise by the authors of the original article, has been pivotal in discovering previously unknown patterns in several areas, and we believe that the methods presented here can be extended with broad generality to other investigations in medicine and the biological sciences. The key difference between this study and the others mentioned above is that we are not pursuing the creation of a fully supervised model. Instead, we aim to test the "manifold hypothesis" on these data to check for the existence of different groups in which the manifestations of the disease could differ, providing researchers with a whole new set of techniques to apply to other datasets in a similar context.
Table 1 shows some articles using UMAP as the basis for dimensionality reduction in several different contexts related to medicine and biology over the last few years, demonstrating the versatility and power of the technique we propose to use. The use of clustering techniques is widespread in the medical sciences. In a first class of articles, patient characteristics are used to unveil hidden data structure for diagnosing a disease or understanding its progression, such as [29] and [35]. Another class of studies tends to use more comprehensive statistical analysis with clustering to separate manifestations and possible patterns arising in a more specific group of patients, as in [33] and [1].

Table 1. Selected references for UMAP usage in medicine and biology.

Year  Reference  Application/Usage
2019  [7]        Single-cell visualization using UMAP
2019  [10]       Population patterns in genomic cohorts
2021  [11]       UMAP in population genetics
2021  [4]        Artifacts in microbiome data
2021  [25]       Transfer learning on molecular fingerprints
2021  [38]       Molecular dynamics simulations

Although this study offers non-causal inference, it is relevant to point out sources like [32] that combine causal inference and clustering in a medical setting, something we believe should be further explored if another dataset allows us to do so.

The procedure behind our analysis consists of two phases. In the first phase, we project high-dimensional laboratory exam data into a two-dimensional subspace using UMAP (tuning two hyperparameters), making the dataset more amenable to clustering techniques. In the second phase, we cluster this representation using DBSCAN [34] to find any patterns that may arise. The number of clusters obtained is a consequence of the hyperparameter tuning method used. We chose DBSCAN because it does not require the number of clusters to be specified upfront.
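The second phase can be illustrated with a minimal, pure-Python sketch of density-based clustering. It is not the implementation used in this study: the function names `region_query` and `dbscan`, the toy points and the values of `eps` and `min_samples` are assumptions made only for illustration. The point to notice is that the number of clusters is an output of the procedure rather than an input.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Toy two-dimensional points (illustrative only, not data from the study):
# two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.3, min_samples=3)
n_clusters = len(set(labels) - {NOISE})
print(labels, n_clusters)
```

On this toy input the sketch finds two groups and leaves the isolated point unassigned, without the number of groups ever being supplied.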
By not fixing the number of clusters in advance, we assume more neutrality when analyzing the data structure. The "overall quality" of fit for a specific combination of hyperparameters is measured without resorting to the target's value, using the silhouette coefficient for a given arrangement [21]. We then compare different arrangements using this metric, selecting the one with the maximum value overall. Table 2 summarizes all hyperparameters used in the cluster tuning procedure. The parameter values obtained will be discussed in Section 4. As is known in data science, high-dimensional data often have fewer degrees of freedom than one might initially assume, which is known as the "Manifold Hypothesis". [13] presents a complete description of the hypothesis and several demonstrations on the subject. In Appendix A we present a small application of UMAP to a dataset well known to the general public to demonstrate the expected results for the type of analysis we conduct in this study. The hypothesis and the dimensionality reduction provided by UMAP allow us to analyze blood test data from a new perspective: different groups with different manifestations of the disease could be traced using this technique, as these groups will tend to cluster together in the low-dimensional representation. Moreover, the more significant factors could give us some clues about the disease and its progression.

We therefore propose two "experiments". In the first, we analyze data from all patients in our dataset with blood test measurements (red and white series) and then apply the procedure outlined above. In the second, we filter our patient data, keeping only those with confirmed COVID-19, and compare the results using the targets for both situations. It is worth mentioning at this point that none of our analyses aims to be causal. The study was not conceived in this way, and the data are observational.
For causal analysis, we suggest using Causal Forests [39], which can deal with observational data and make a satisfactory causal inference provided that the number of samples is high enough, as the method needs several data splits. Figure 1 summarizes all the steps in both experiments. The silhouette coefficient is used to select the number of clusters for our experiments prior to statistical analysis.

Dimensionality reduction techniques have already become commonplace in the data science community. Among the most usual, we can mention statistical techniques already considered "classic", such as PCA [37] and ICA [19]. More recent developments include t-SNE [26] and the aforementioned UMAP, both of which belong to manifold learning. We expect data from blood tests to be highly non-linear, and their lower-dimensional representations to be dominated by complex terms. PCA and ICA are therefore set aside for dimensionality reduction in these cases, as they are based on linear relationships/filters between variables. As for t-SNE, we discard it because of its limited scalability as the number of samples grows, since we want our methodology to be applicable to datasets of arbitrary size. Another major disadvantage of this method is that it does not allow the fitted transformer object to be reused on datasets other than the initial one. UMAP is based on assumptions from Riemannian geometry (uniform distribution, locally constant metric tensor and local connectivity). It models the data using a fuzzy topological structure. The mathematics behind the method is fairly advanced and is not discussed in this article. We refer the reader to [28] for more details.

Various clustering techniques have become widespread in the data science field in recent decades, with one of the first examples being discussed since the 1960s.
Among the examples most used by the community, we can cite k-means [27], fuzzy c-means [12], OPTICS [3] and DBSCAN [34]. The main differences among these techniques lie in scalability (both in number of clusters and in number of samples), the capacity to detect clusters of different shapes and whether outlier detection is built into the procedure. We expect that our data will not present trivial geometry after the dimensionality reduction process, which rules out techniques such as k-means, which tend to separate clusters more uniformly and with more "circular" geometry. Likewise, we want our techniques to be applicable to new datasets in a scalable way and with a high degree of reproducibility. This rules out the fuzzy c-means technique, which needs additional components that introduce unwanted "degrees of freedom" into the method. Furthermore, we want each element to belong to a single cluster. DBSCAN was therefore selected due to its scalability, its detection of clusters with geometries that are not necessarily circular and its ability to filter outliers in different contexts. Furthermore, the main programming languages already have the algorithm available in their packages, which facilitates the implementation and dissemination of the concepts presented in this article. The technique is based on the assumption that clusters are regions with a higher density of points, separated by regions of lower density. The minimum density requirements are encoded in the parameters shown in Table 2, considering Euclidean distances.

The data contain anonymized information about 598 patients admitted to the Albert Einstein Hospital during the COVID-19 pandemic. 81 patients tested positive for infection (13%) and 128 patients needed special care (21%, not only related to COVID-19). Parameters related to red and white cell counts are available for each patient, all of them normalized by the mean and standard deviation (z-scores).
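As an illustration of the normalization mentioned above, the following sketch converts raw measurements into z-scores. The input values are hypothetical, the function name `z_scores` is illustrative, and the use of the population standard deviation is an assumption, since the dataset description does not state which estimator was used.

```python
import statistics


def z_scores(values):
    """Return (v - mean) / std for every value in the list."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / std for v in values]


# Hypothetical raw measurements for one blood parameter (not from the study).
raw = [150.0, 200.0, 250.0, 300.0, 350.0]
normalized = z_scores(raw)
print([round(z, 3) for z in normalized])
```

After this transformation every parameter has zero mean and unit standard deviation, so parameters measured on different scales contribute comparably to Euclidean distances.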
Table 3 summarizes all the variables used in the study. To further describe the data, Figure 2 presents the white cell distribution for all 598 patients (blue dots are negative for infection, whereas orange dots are positive). No univariate pattern was observed in the data, which leads us to use a multivariate technique. As mentioned above, two data experiments were performed. The first experiment includes all 598 patients and tries to determine whether there are groups with high prevalence (greater than the average of the dataset) and to point out the main characteristics of these groups. In the second experiment, the focus is on patients with a confirmed COVID-19 diagnosis, aiming to discover any groups with greater special-care needs than the whole dataset.

In the first analysis, after performing the aforementioned dimensionality reduction with UMAP and clustering the resulting two-dimensional representation, we obtained a value of 0.12 for the silhouette coefficient (the clusters obtained are tightly packed together). Overall, 7 clusters were obtained, with COVID-19 prevalence in the range of 3–35%. 29 patients did not meet any of the DBSCAN similarity criteria and were not assigned to any cluster; they were therefore removed from the analysis. A close inspection of Tables 4 and 5 reveals that the most extreme white-cell count values reside in the first two clusters. This fact could be interpreted in two ways. On the one hand, patients could have comorbidities and be more susceptible to infection by COVID-19, thus having greater white-cell counts, as pointed out by [36]. On the other hand, COVID-19 could be responsible for the values themselves. Another observation concerns the number of platelets, which is very low, much in line with the findings of [16], [17], [6] and [30].
No extreme values were found in red cell samples for high COVID-19 prevalence clusters, but close observation of the tables regarding the prevalence and the number of people in each cluster may help to "name" each cluster, a common practice when clustering is applied in several contexts. For example, cluster 1 could be named "Non-symptomatic patients", although more data are needed to support such a statement.

In the second analysis, we obtained a value of 0.40 for the silhouette coefficient (the clusters obtained appear well separated, as shown in Figure 4). Overall, two clusters were obtained, with COVID-19 prevalence in the range of 7–61%. Every patient was assigned to a cluster in this analysis. The number of clusters obtained allows us to go one step further in the analysis. We conducted two-sample one-sided (lower) t-tests and Kolmogorov–Smirnov (KS) tests. Tables 6 and 7 each show the p-values associated with one of these tests for every parameter. The result is very similar to Experiment I. Red cell components do not display any statistical differences between the two groups, whereas white cell components do. Again, platelets appear as a significant factor, once again indicating a relationship between coagulation factors, COVID-19 and a possible patient prognosis.

This study has two main limitations. The first is data: more variables to be analyzed ("wider": more columns) and a larger number of patients ("taller": more rows) could lead to a substantial improvement in the outcomes achieved so far, allowing us to separate the clusters better. More variables for each patient also mean that different representations could be obtained. In medical terms, more complex relationships could be extracted. Restricting ourselves to blood exams, C-reactive protein, AST, ALT, GGT, and LDH could be excellent additions to the analysis.
Other data sources could be leveraged: social and economic data could help to trace relationships between infection severity and social strata. Genetic markers could help to understand whether some populations are more susceptible to infections than others. Medical imaging data could help to associate blood parameters with physiological changes in organs and tissues, and so on. The second limitation is the non-causality of the analysis. None of this study's conclusions are causal, for two reasons: the data are observational, and the number of patients and parameters is not large. This reveals an excellent opportunity for researchers because the procedure applied here could be used on data from controlled experiments without any modifications. There are some studies in the literature that combine cluster analysis with causal inference, but they are still very sparse [32]. Statistically significant samples and more parameters could help to create groups of patients for which a treatment (or protective measures) could be tailored. Other diseases could also benefit from the same approach presented here. Considering the nature of this research, other epidemics (e.g. Dengue fever, Zika Virus, Ebola) could be an excellent investigation opportunity, as the primary source of data used here is inexpensive and could be collected even in developing and emerging countries.

Using only data science methods, we were able to demonstrate that different prevalence subgroups exist, and that these groups have different medical interpretations that make sense. This study opens a window of opportunity for those with access to individual and more granular blood data for patients, paving the way for a more comprehensive analysis with more factors to be analyzed.
Moreover, we aim to help demonstrate that COVID-19 is not only "a simple flu" with only respiratory effects but a more complex disease with several potential implications and outcomes, particularly hematological ones, as described by the relevant statistical tests. Special implications in platelets (which control coagulation), eosinophils and monocytes (related to infection control and adaptive immunity) further indicate that COVID-19 is a multi-systemic, multi-implication disease that must be analyzed from a multi-disciplinary perspective. The clusters found can be a first indication that several approaches must be taken by medical staff, policymakers and governments. In the future, we can use similar techniques with augmented data to address different problems related to COVID-19 such as vaccine distribution, field hospital construction, disease spread analysis and other issues. The approach presented here can also be easily adapted to other diseases.

see that similar points tend to cluster closely, and non-similar digits tend to be more distant. The overall distance is controlled by the parameters neighbors and spread in Table 2. We selected here a specific two-dimensional representation of our data for viewing purposes, knowing that higher-dimensional representations could be necessary to deal with very high-dimensional data or complex behaviors.
References:

[1] The application of unsupervised clustering methods to Alzheimer's disease
[2] Ensemble learning model for diagnosing COVID-19 from routine blood tests
[3] OPTICS: ordering points to identify the clustering structure
[4] Uniform manifold approximation and projection (UMAP) reveals composite patterns and resolves visualization artifacts in microbiome data
[5] Comparative analysis of laboratory indexes of severe and non-severe patients infected with COVID-19
[6] COVID-19 concerns aggregate around platelets
[7] Dimensionality reduction for visualizing single-cell data using UMAP
[8] Predicting special care during the COVID-19 pandemic: A machine learning approach
[9] Detection of COVID-19 infection from routine blood exams with machine learning: A feasibility study
[10] UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts
[11] A review of UMAP in population genetics
[12] A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters
[13] Testing the manifold hypothesis
[14] Routine blood tests as a potential diagnostic tool for COVID-19
[15] Heg.ia: an intelligent system to support diagnosis of COVID-19 based on blood tests
[16] The impact of COVID-19 disease on platelets and coagulation
[17] Effect of COVID-19 on platelet count and its indices
[18] HIAE: Diagnosis of COVID-19 and its clinical spectrum
[19] Independent component analysis: Algorithms and applications
[20] JHCRC: Johns Hopkins Coronavirus Resource Center
[21] Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
[22] COVID-19 diagnosis by routine blood tests using machine learning
[23] MNIST handwritten digit database
[24] Haematological characteristics and risk factors in the classification and prognosis evaluation of COVID-19: a retrospective cohort study
[25] Should we embed in chemistry - a comparison of unsupervised transfer learning with PCA, UMAP, and VAE on molecular fingerprints
[26] Visualizing high-dimensional data using t-SNE
[27] Some methods for classification and analysis of multivariate observations
[28] UMAP: Uniform manifold approximation and projection for dimension reduction
[29] Cluster analysis and related techniques in medical research
[30] Thrombocytopenia and thrombosis in hospitalized patients with COVID-19
[31] COVID-19 diagnosis prediction in emergency care patients: a machine learning approach
[32] Use of clustering analysis in randomized controlled trials in orthopaedic surgery
[33] Clustering medical data to predict the likelihood of diseases
[34] DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN
[35] Multivariate methods to identify cancer-related symptom clusters
[36] Epidemiological and clinical characteristics of the COVID-19 epidemic in Brazil
[37] Mixtures of probabilistic principal component analysers
[38] UMAP as a dimensionality reduction tool for molecular dynamics simulations of biomacromolecules: A comparison study
[39] Estimation and inference of heterogeneous treatment effects using random forests

Acknowledgments: The authors would like to thank the anonymous referees for their useful comments. This project partially benefited from the CNPq grant (process number: 400868/2016-4).

A good way to visualize the dimensionality reduction performed by UMAP is by comparing Figures 5 and 6. Figure 5 shows elements of the so-called MNIST dataset [23], which is composed of 28x28 pixel images (784 dimensions) of thousands of handwritten digits. In Figure 6, after the UMAP algorithm, we can