key: cord-0587514-pqrxjc6c authors: Karisani, Negin; Platt, Daniel E.; Basu, Saugata; Parida, Laxmi title: Inferring COVID-19 Biological Pathways from Clinical Phenotypes via Topological Analysis date: 2021-01-19 journal: nan DOI: nan sha: 1186d2910092265ddc200a9394e164cc9456575c doc_id: 587514 cord_uid: pqrxjc6c

COVID-19 has caused thousands of deaths around the world and has also resulted in a large international economic disruption. Identifying the pathways associated with this illness can help medical researchers better understand the properties of the condition. This process can be carried out by analyzing medical records. It is crucial to develop tools and models that can aid researchers with this process in a timely manner. However, medical records are often unstructured clinical notes, and this poses significant challenges to developing automated systems. In this article, we propose a pipeline to aid practitioners in analyzing clinical notes and revealing the pathways associated with this disease. Our pipeline relies on topological properties and consists of three steps: 1) pre-processing the clinical notes to extract the salient concepts, 2) constructing a feature space of the patients to characterize the extracted concepts, and finally, 3) leveraging the topological properties to distill the available knowledge and visualize the result. Our experiments on a publicly available dataset of COVID-19 clinical notes show that our pipeline can indeed extract meaningful pathways.

Since the early stages of the COVID-19 pandemic, the scientific community has made a tremendous effort to address the clinical course of the virus. However, there is still a lot to reveal about COVID-19. For instance, most people who contract COVID-19 develop mild to moderate symptoms (WHO 2020), some may show no symptoms, while for others the disease can be fatal. To better understand different strains of COVID-19, one approach is to study the underlying pathways.
The aim of this study is to investigate the application of topological properties in automatically inferring candidate pathways. We use unstructured clinical notes as the source of information to automatically extract phenotypes to be used in our topological model. Phenotypes are the symptoms and signs that reflect the presence of a disease; in what follows, we refer to them as symptoms.

Advances in technology have helped scientists to gather large amounts of biomedical data. This has provided the community with unprecedented opportunities to study and better understand the spread of diseases. However, this burst of information has posed significant challenges to traditional data analysis and visualization techniques. Traditional infographics, such as Venn diagrams, which are still widely used to compare and contrast sets of symptoms, fail to aid practitioners in analyzing large sets of symptoms. Thus, tools that can effectively employ techniques from other scientific communities to facilitate this process are of great value.

In this article, we rely on concepts from Topological Data Analysis and propose a pipeline to automatically extract candidate pathways associated with COVID-19 from clinical notes. Our pipeline, which is based on the notion of redescriptions (Parida and Ramakrishnan 2005; Mullins et al. 2006; Platt et al. 2016), consists of three steps: 1) preprocessing the notes and identifying the candidate symptoms, 2) mapping the symptoms to the space of the patients, and finally, 3) extracting the topological properties and visualizing them. We have evaluated our pipeline on a publicly available dataset of COVID-19 clinical notes. The results show that our model is able to extract meaningful pathways. For example, in Section 6.1 we demonstrate that there are potentially distinctive pathways between coughers and non-coughers.

The remainder of this article is organized as follows: in Section 2 we provide an overview of the concepts that we use.
In Section 3 we present our pipeline, and in Section 4 we discuss the implementation details. In Section 5 we explain the details of our experiments, and in Section 6 we present and discuss the results. Finally, in Section 7 we conclude the article.

In this section, we review the concepts used in the remainder of the paper.

Redescriptions are used to identify phenomena that occur in different ways. The concept was first introduced in (Ramakrishnan et al. 2004), and was later generalized in (Parida and Ramakrishnan 2005) to a framework called redescription mining, for which the authors present some applications in the Genome Ontology database. Redescriptions are mathematically formalized using Boolean algebra, which is also used to model the cause-effect relationships among the symptoms. Two different sets of symptoms that correspond to the same group of patients are an example of a redescription. More specifically, suppose $s_1$ and $s_2$ are two symptoms, and $P_1$ and $P_2$ are their respective sets of patients. If the presence of symptom $s_1$ implies the presence of symptom $s_2$, then $P_1 \subseteq P_2$. If we consider the combination of the symptoms (i.e., $s_1 \wedge s_2$), then the group of patients who experience both symptoms is $P_1 \cap P_2 = P_1$, which is the same group of patients that we obtain by considering only the symptom $s_1$. Redescriptions, the combinations of symptoms that give rise to the same group of patients, can reveal logical associations among symptoms. They can highlight the underlying pathways and are commonly used to derive rules in the pathways (Mullins et al. 2006).

Over the past two decades, Topological Data Analysis (TDA), which arose from algebraic topology, has found its way into real-world applications. In (Dagliati et al. 2019), TDA is used to model disease progression by inferring temporal phenotypes. More recently, TDA and machine learning techniques have been used to investigate genome mutations of SARS-CoV-2.
Some other examples in biology include the analysis of brain neural activities (Dabaghian et al. 2012; Nasrin et al. 2019) and cancer genomics (Nicolau, Levine, and Carlsson 2011; Rabadán et al. 2020). In this section, we aim to provide a brief overview of the primary TDA concept, i.e., persistent homology. We avoid the mathematical details, which are beyond the scope of this article. For a thorough description see (Edelsbrunner and Harer 2008; Wasserman 2018).

Let $M$ be a continuous space equipped with a metric $\delta$. The topological invariants of $M$ are defined as the properties that do not change under continuous deformation (i.e., twisting but not tearing). The invariants of $M$ in the lower dimensions are usually referred to as the connected components, the holes, and the voids, in dimensions 0, 1, and 2 respectively; in higher dimensions, they are understood as $k$-dimensional holes. The number of $k$-dimensional holes in $M$ is called the $k$-th Betti number. Given data points $X$, regarded as points sampled from $M$, and a distance function $\delta$, the goal is to compute the topological invariants of the structure underlying $X$ (i.e., the space $M$).

A common approach to accomplish this is to construct $k$-simplexes over $X$. Intuitively, one can think of a $k$-simplex as the convex hull of $k + 1$ points. A collection of simplexes glued together, subject to some conditions, is called a simplicial complex. Since it is not reasonable to begin with all the possible $k$-simplexes over the data points in $X$, the technique is to add the simplexes in a sequence of steps. First, a parameter is selected and the initial simplicial complex $S$ is set to be the collection of points in $X$, taken as the 0-simplexes. Then the parameter is increased so that at each step only a specific set of simplexes, those that satisfy some conditions, can be added to $S$. This procedure creates a filtration of simplicial complexes on $X$, which is then analyzed.
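The step-by-step construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the selection rule used here (a simplex is added once all of its vertices are pairwise within the current threshold) is one possible choice of condition, the Euclidean distance stands in for $\delta$, and the function names `rips_complex`, `filtration`, and `betti0` as well as the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps, max_dim=2):
    """Return all simplexes (as index tuples) whose vertices are pairwise within eps."""
    n = len(points)
    simplexes = [(i,) for i in range(n)]  # every point is a 0-simplex
    for size in range(2, max_dim + 2):  # size vertices span a (size - 1)-simplex
        for combo in combinations(range(n), size):
            if all(dist(points[a], points[b]) <= eps
                   for a, b in combinations(combo, 2)):
                simplexes.append(combo)
    return simplexes


def filtration(points, thresholds):
    """Build one complex per threshold, in increasing order of the parameter."""
    return [(eps, rips_complex(points, eps)) for eps in sorted(thresholds)]


def betti0(n_points, simplexes):
    """Count connected components by merging the endpoints of every 1-simplex."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for simplex in simplexes:
        if len(simplex) == 2:
            parent[find(simplex[0])] = find(simplex[1])
    return len({find(i) for i in range(n_points)})


# Hypothetical sample: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
steps = filtration(square, [0.5, 1.0, 1.5])
for eps, simplexes in steps:
    print(eps, len(simplexes), betti0(len(square), simplexes))
```

In this toy example the complexes are nested: at threshold 0.5 there are only four isolated points, at 1.0 the four sides of the square appear and enclose a hole, and at 1.5 the diagonals and triangles are added and the hole is filled in.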
The conditions used to select a subset of simplexes at each step give rise to different types of simplicial complexes.

Figure 1: Recovering topological properties using simplicial complexes

An example of a simplicial complex is the Čech complex. Figure 1 shows a simple illustrative example. The goal is to recover the topological invariants of the space in Figure 1(a). It is clear that the 0-th Betti number is one, since there is only one connected component, and the 1-st Betti number is two, since there are two holes; the higher-dimensional Betti numbers are zero. The dataset $X$ is given by the six sampled points in Figure 1(b). To construct the Čech complex over $X$, we begin with the points in $X$ as 0-simplexes. In order to construct the higher-dimensional simplexes, we start growing a ball at each point, as in Figure 1(c); at this stage, the 0-th Betti number is six and the 1-st Betti number is zero. As the radius increases, some of the balls start to overlap each other; for each set of $k + 1$ overlapping balls we insert a $k$-simplex. Figure 1(d) shows a collection of 1-simplexes, i.e., line segments joining two points, created by the pairwise overlap of their corresponding balls. This simplicial complex clearly recovers the topological properties of the underlying structure in Figure 1(a). If we increase the radius further, the three balls at the top begin to overlap each other, hence we can add a 2-simplex (a filled-in triangle) as in Figure 1(e). Therefore, the hole that was created in Figure 1(d) disappears when the radius is increased in Figure 1(e), and as a result that topological property is lost. The same could eventually happen to the second hole as we continue to increase the radius and add more simplexes. It is important to notice that the topological properties that persist for a longer period before they disappear best represent the properties of the underlying structure.
This characteristic is the basic principle of the persistent homology method, which was first formalized in (Edelsbrunner, Letscher, and Zomorodian 2002). A diagram known as a barcode is commonly used to keep track of the lifetime of topological properties. In the barcode, each topological property is represented as a horizontal line segment. The line segments span the period during which the corresponding topological properties exist, along the parameter axis (i.e., the radius). We use the barcode in Section 6.1 (see Figure 2).

In this section, we introduce our pipeline. The first step is to extract structured data (the set of symptoms and their corresponding patients) from the unstructured clinical records. The next step is to define the feature space, the sampling strategy, and the metric used to measure the similarity between the data points. Finally, the topological properties are extracted and visualized.

Concept extraction: We carry out the concept extraction in three steps: 1) We parse the clinical notes and map the biological terms to the concepts in a medical ontology. 2) Since the clinical notes are written in informal language, their parsing can be noisy. Thus we ask the user to curate the candidate relations and resolve the inconsistencies. 3) We use the health records to construct an association matrix between the patients and the extracted concepts. Natural language processing techniques are widely used to analyze biomedical documents (Demner-Fushman, Chapman, and McDonald 2009; Koleck et al. 2019). Despite the significant advances in neural text processing over the last decade, we found that the existing methods are not adequate to effectively parse the medical records. Thus, to reduce the noise and ensure that the extracted terms are indeed valid medical concepts, we use manual supervision to validate the automatic process.
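As a minimal sketch of the third concept-extraction step, the association matrix between patients and concepts could be built as below. The function name `build_association_matrix`, the toy patient records, and the choice of a plain list-of-lists representation are assumptions made for illustration; the article does not specify how the matrix is stored.

```python
def build_association_matrix(patient_codes, concepts):
    """Return a 0/1 matrix with one row per patient and one column per concept."""
    column = {code: j for j, code in enumerate(concepts)}
    matrix = [[0] * len(concepts) for _ in patient_codes]
    for i, codes in enumerate(patient_codes):
        for code in codes:
            if code in column:  # codes outside the retained concept list are ignored
                matrix[i][column[code]] = 1
    return matrix


# Hypothetical input: validated ICD-10CM codes per patient after manual curation.
concepts = ["R05", "R09.3", "R50.9"]
patient_codes = [
    {"R05", "R50.9"},
    {"R09.3"},
    {"R05", "R09.3", "R50.9", "R51"},  # R51 is not in the retained list here
    set(),
]

matrix = build_association_matrix(patient_codes, concepts)
for row in matrix:
    print(row)
```

Each column of this matrix is the indicator vector of the set of patients who show one concept, which is the raw material for the feature space described next.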
Feature space construction: In our model, features correspond to the patients, and the data points correspond to combinations of concepts, which we call patterns. Given a feature vector (i.e., a data point), a feature is set to 1 if the corresponding patient shows all the symptoms associated with the data point. Thus, a data point is understood as a cluster of patients who share the same set of symptoms (i.e., the same pattern). As mentioned in Section 2.1, pathways can be inferred by identifying the redescriptions, i.e., patterns that have the same group of patients. However, in order to make inferences about the underlying pathways, it is important to analyze the patterns whose clusters are statistically significant. Tests such as the binomial test are not successful in separating higher-order correlations, which can distinguish groups of patients that identify disease processes, from the impact of pairwise correlations; we therefore use cumulant correlation expansions. In quantum field theory, they emerge as connected Feynman diagrams (1-particle irreducible). These are multivariate moments related to the cumulants appearing in statistics. In that context, their generating functions factor according to partitions of sets. Let $G_\bullet$ represent the moments in the moment generating function $E\left[\exp\left(\sum_i x_i J_i\right)\right]$, where the $J_i$ are the conjugate variables, and let $\Gamma_\bullet$ represent the higher-dimensional cumulants; e.g., for the symptoms $x_i$, $x_j$, $x_k$, we have $G_{ij} = E(x_i x_j)$ and $G_{ijkk} = E(x_i x_j x_k^2)$, and $\Gamma_{ij}$ and $\Gamma_{ijkk}$ are the corresponding cumulants. The factorizations are as follows:

$$E\left[\exp\left(\sum_i x_i J_i\right)\right] = A \exp\left(\sum_i \Gamma_i J_i + \frac{1}{2!}\sum_{i,j} \Gamma_{ij} J_i J_j + \frac{1}{3!}\sum_{i,j,k} \Gamma_{ijk} J_i J_j J_k + \cdots\right),$$

where $A$ is nominally 1, seen by setting the $J_i = 0$. The power series in the $J_i$ then require

$$G_i = \Gamma_i, \qquad G_{ij} = \Gamma_{ij} + \Gamma_i \Gamma_j, \qquad G_{ijk} = \Gamma_{ijk} + \Gamma_{ij}\Gamma_k + \Gamma_{ik}\Gamma_j + \Gamma_{jk}\Gamma_i + \Gamma_i \Gamma_j \Gamma_k, \qquad \ldots$$

We apply the above factorization to the clusters, and shuffle the symptoms to test significance by constructing variances and null hypotheses.

To search for the redescriptions, we need to investigate the cause-effect relationships among the selected patterns.
However, because patients are often misclassified, e.g., through a wrong diagnosis, the set inclusion property does not hold exactly in the data. Therefore, the exact equality of sets must be approximated. This can be done with the Jaccard distance, which measures the dissimilarity between sets. For two sets $A$ and $B$, the Jaccard distance is defined by

$$d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$

For the example in Section 2.1, when $P_1 \subseteq P_2$, the Jaccard distance is $d(P_1 \cap P_2, P_1) = 0$; otherwise, if $P_1 \not\subseteq P_2$, then $0 < d(P_1 \cap P_2, P_1) \le 1$, which can be interpreted as the probability that a subject picked from the union of the two sets is not shared by both. Hence, we use the Jaccard distance to measure the distances between the sampled data points.

Topological analysis and visualization: To explore the structure of the space created in the previous step, Vietoris-Rips (VR) complexes are employed to construct the filtration. The VR complex is an abstract simplicial complex whose 0-simplexes are the data points; for a fixed $r$, a $k$-simplex is created for any $k + 1$ points whose pairwise distances are at most $2r$. The initial simplicial complex is a collection of 0-simplexes that correspond to the sampled data points (i.e., the clusters of patients selected in the previous step), and the Jaccard distance is used as the filtration parameter to construct the VR complexes. Finally, the barcode is generated and representative cycles of the bars are retrieved for further analysis.

To parse the clinical notes and extract the biomedical terms we used Amazon Comprehend Medical (ACM), an online proprietary NLP programming interface for analyzing unstructured clinical notes. For technical details regarding ACM see (Jin et al. 2018; Bhatia, Busra Celikkaya, and Khalilia 2020). We also used the International Classification of Diseases (ICD-10CM) to select the concepts, which ACM maps to the extracted terms.
ICD is a medical ontology published by the World Health Organization to classify diseases, symptoms, and other medical conditions. In the TDA step, we used the Dionysus package for the construction of simplicial complexes and for visualization. We also incorporated the Cyclonysus implementation to retrieve the representative cycles of the 1-dimensional topological properties.

We begin this section by describing the dataset, then we discuss the steps of the experiment.

We used the dataset introduced in (Xu et al. 2020). The dataset is continually updated with the available records of confirmed COVID-19 patients. We used the version published on June 8, 2020. Among the available records in the dataset we retained all the records whose "symptom" field was non-empty; this amounted to 1,545 patients. This field, which is a textual feature, is a clinical note describing the patient's medical state. ACM associates a list of ICD-10CM codes with each extracted medical condition, ordered by their confidence scores; we retained the code with the highest confidence score. We only considered medical conditions that at least 0.3 percent of patients experienced. If the ICD-10CM codes associated with a medical condition were at the same level of the hierarchical ontology and ACM assigned high confidence scores to all of them, we considered them as one class. An example is R53. = {R53.1: Weakness, R53.81: Malaise, R53.83: Other fatigue}. We retained the data corresponding to thirty-one ICD-10CM codes. Based on the data, Fever, Cough, and Fatigue are the most common symptoms among the COVID-19 patients. Table 1 presents the list of selected classes and their number of patients, and Table 2 provides the number of patients who experienced k medical conditions.

In the second step of the pipeline, we selected 632 data points with patterns corresponding to subsets of the thirty-one ICD-10CM codes.
To construct the VR filtration, we set the threshold of the filtration parameter to 0.5.

In this section, we report the main result and discuss its significance.

We obtained topological properties of dimensions 0 and 1; there was no topological property of higher dimensions. We report one striking 1-dimensional property. As mentioned in Section 3, we used the Jaccard distance. Therefore, for any two data points, the lower the distance, the more similar their sets of patients are. It follows that the topological properties whose 1-simplexes correspond to low distances are of interest. Figure 2 shows the barcode of the 1-dimensional topological properties whose lifetime is within the interval (0, 0.5). The horizontal axis corresponds to the parameter of the filtration (the Jaccard distance) and the vertical axis corresponds to the number of properties. In light of the previous paragraph, what stands out in the diagram is the first bar, annotated by the circled line, which spans between 0.23 and 0.34. Since the 1-dimensional topological properties are understood as holes made up of points and 1-simplexes, a cycle generating the annotated bar is shown in Figure 3. Data points are illustrated by their associated combination of ICD-10CM codes along with the number of patients who experienced them, and the 1-simplexes joining the data points are labeled by the Jaccard distance between the respective sets of patients. Therefore, as an example, the label (R05 ∩ R09. ... Lower values of the Jaccard distances imply stronger associations among the respective clusters, which are important for identifying the redescriptions. In particular, this cycle suggests that among the subjects in R09.3 (Abnormal sputum) there does not appear to be a particular interaction between subjects in R05 (Cough) and subjects in R50.9 (Fever). This raises the question of whether there is a distinctive signature showing alternative pathways to disease among non-coughers compared to coughers.
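To make the edge labels of such a cycle concrete, the following sketch computes the Jaccard distances around a four-node cycle of patterns. The patient records are hypothetical and are not taken from the dataset used in this article, so the resulting numbers do not reproduce Figure 3; the helper names `patients_with_pattern` and `jaccard_distance` are likewise illustrative.

```python
def patients_with_pattern(records, pattern):
    """Return the IDs of patients whose codes include every code in the pattern."""
    return {pid for pid, codes in records.items() if pattern <= codes}


def jaccard_distance(a, b):
    """Compute 1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


# Hypothetical records (patient ID -> ICD-10CM codes), for illustration only.
records = {
    1: {"R09.3"},
    2: {"R05", "R09.3"},
    3: {"R09.3", "R50.9"},
    4: {"R05", "R09.3", "R50.9"},
    5: {"R09.3", "R50.9"},
    6: {"R05"},
}

sputum = frozenset({"R09.3"})
cough_sputum = frozenset({"R05", "R09.3"})
sputum_fever = frozenset({"R09.3", "R50.9"})
all_three = frozenset({"R05", "R09.3", "R50.9"})

# The four edges of the cycle, listed in order around it.
cycle = [
    (sputum, cough_sputum),
    (cough_sputum, all_three),
    (all_three, sputum_fever),
    (sputum_fever, sputum),
]

edge_lengths = [
    jaccard_distance(patients_with_pattern(records, p),
                     patients_with_pattern(records, q))
    for p, q in cycle
]
print([round(length, 3) for length in edge_lengths])
```

Because every pattern in the cycle refines the previous one, each pair of patient sets is nested, and the distance on an edge reduces to one minus the ratio of the smaller cluster size to the larger one.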
To interpret the relationships between the symptoms in Figure 3 we rely on the Jaccard distance. Since the equivalence of sets of subjects matching different patterns produces logical constraints determined by biological processes, multiple pathways connecting phenotypes to disease may yield information about multigenic complex diseases marked by multiple pathways leading to disease. However, phenotype definitions are prone to misclassification for a number of reasons. Therefore, equivalence may be meaningfully characterized by the chance that a subject in one or the other of two phenotype clusters is not in both of them, which is the Jaccard distance described above. In the case of Figure 3, there are two paths leading from R09.3 (abnormal sputum) to R05 ∩ R09.3 ∩ R50.9, one passing through R09.3 ∩ R50.9 and the other through R05 ∩ R09.3, where R05 is cough and R50.9 is fever. In both pathways, the distance between sputum and cough is larger than that between sputum and fever. So coughing is not as strongly associated with abnormal sputum production as fever is. In this case, the relationship between sputum and fever is independent of coughing, since the cycle appears to be a parallelogram. So a coughing symptom is independent of fever among sputum-productive subjects. This suggests that the paths are independent predictors of severe disease.

In this study we investigated the application of topological properties in extracting candidate COVID-19 pathways from clinical notes. We also proposed a pipeline to preprocess the data, extract the salient concepts, construct a feature space, and visualize the results. We evaluated our pipeline on a set of 1,545 patients and showed that it can extract meaningful associations between symptoms and reveal intriguing candidate pathways. One limitation of our study is the reliance on human validation.
As we mentioned in Section 3, the available text processing tools were not able to effectively parse the notes and extract the relevant medical concepts. To resolve this shortcoming, we plan to exploit the structured data that accompanies the medical records to cluster the patients and automatically filter out the improbable associations.

End-to-End Joint Entity Extraction and Negation Detection for Clinical Text
A Topological Paradigm for Hippocampal Spatial Map Formation Using Persistent Homology
Inferring Temporal Phenotypes with Topological Data Analysis and Pseudo Time-Series
Biomedical Natural Language Processing
Edelsbrunner; Letscher; and Zomorodian. 2002. Topological Persistence and Simplification
Persistent homology-a survey
Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning
Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review
Data mining and clinical data repositories: Insights from a 667,000 patient data set
Bayesian Topological Learning for Brain State Classification
Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival
Redescription Mining: Structure Theory and Algorithms
Characterizing redescriptions using persistent homology to isolate genetic pathways contributing to pathogenesis
Identification of relevant genetic alterations in cancer using topological data analysis
Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions
Decoding asymptomatic COVID-19 infection and transmission
Topological Data Analysis
WHO. 2020. Coronavirus: symptoms
Epidemiological data from the COVID-19 outbreak, real-time case information