key: cord-0793517-s747wouu authors: Mukherjee, Soham; Wethington, Darren; Dey, Tamal K.; Das, Jayajit title: Determining clinically relevant features in cytometry data using persistent homology date: 2021-04-27 journal: bioRxiv DOI: 10.1101/2021.04.26.441473 sha: c926ce8e7355aa1e36a543c6e710e173a4a90edf doc_id: 793517 cord_uid: s747wouu Cytometry experiments yield high-dimensional point cloud data that is difficult to interpret manually. Boolean gating techniques coupled with comparisons of relative abundances of cellular subsets is the current standard for cytometry data analysis. However, this approach is unable to capture more subtle topological features hidden in data, especially if those features are further masked by data transforms or significant batch effects or donor-to-donor variations in clinical data. We present that persistent homology, a mathematical structure that summarizes the topological features, can distinguish different sources of data, such as from groups of healthy donors or patients, effectively. Analysis of publicly available cytometry data describing non-naïve CD8+ T cells in COVID-19 patients and healthy controls shows that systematic structural differences exist between single cell protein expressions in COVID-19 patients and healthy controls. Our method identifies proteins of interest by a decision-tree based classifier and passes them to a kernel-density estimator (KDE) for sampling points from the density distribution. We then compute persistence diagrams from these sampled points. The resulting persistence diagrams identify regions in cytometry datasets of varying density and identify protruded structures such as ‘elbows’. We compute Wasserstein distances between these persistence diagrams for random pairs of healthy controls and COVID-19 patients and find that systematic structural differences exist between COVID-19 patients and healthy controls in the expression data for T-bet, Eomes, and Ki-67. Further analysis shows that expression of T-bet and Eomes are significantly downregulated in COVID-19 patient non-naïve CD8+ T cells compared to healthy controls. This counter-intuitive finding may indicate that canonical effector CD8+ T cells are less prevalent in COVID-19 patients than healthy controls. This method is applicable to any cytometry dataset for discovering novel insights through topological data analysis which may be difficult to ascertain otherwise with a standard gating strategy or in the presence of large batch effects. Author summary Identifying differences between cytometry data seen as a point cloud can be complicated by random variations in data collection and data sources. We apply persistent homology used in topological data analysis to describe the shape and structure of the data representing immune cells in healthy donors and COVID-19 patients. By looking at how the shape and structure differ between healthy donors and COVID-19 patients, we are able to definitively conclude how these groups differ despite random variations in the data. Furthermore, these results are novel in their ability to capture shape and structure of cytometry data, something not described by other analyses. the host by pathogens (e.g., a virus) or due to the presence of tumors which usually 10 result in changes in the 'shape' of point cloud data measured in cytometry 11 experiments [3] [4] [5] . Cytometry data analysis techniques commonly rely on Boolean 12 gating and calculation of relative proportions of resulting populations as a method to 13 compare datasets across control/healthy and experimental/diseased conditions. In 14 recent years, state-of-the-art analyses based on sophisticated machine learning 15 algorithms capable of mitigating batch effects, ad hoc gating assumptions, and 16 donor-donor variability have been developed [6, 7] . However, these methods are not 17 designed to quantitatively characterize shape features (e.g., connected clusters, cycles) 18 in high dimensional cytometry datasets that can contain valuable information regarding 19 unique co-dependencies of specific proteins in diseased individuals compared to healthy 20 subjects. 21 Topological Data Analysis (TDA) aims to capture the underlying shape of a given 22 dataset by describing its topological properties. Unlike geometry, topological features 23 (e.g., the hole in a doughnut) are invariant under continuous deformation such as 24 rotation, bending, twisting but not tearing and gluing. One of the tools by which TDA 25 describes topological features latent in data is persistent homology [8, 9] . For example, 26 for a point cloud data, persistent homology captures the birth and death of topological 27 features (e.g., 'holes') in a dataset after building a scaffold called a simplicial complex 28 out of the input points. This exercise provides details regarding topological features 29 that 'persist' over a range of scale and thus contain information regarding the shape 30 topology at different length scales (see Fig S1 for details) . Persistent homology has been 31 applied to characterize shapes and shape-function relationships in a wide variety of 32 biological systems including skin pattern formation in zebra fish [10] , protein structure, 33 and pattern of neuronal firing in mouse hippocampus [11] . TDA has additionally 34 previously been applied to identify immune parameters associated with transplant 35 complications for patients undergoing allogenic stem cell transplant using populations of 36 immune cell types assayed via mass cytometry [12] . However, this work did not use 37 persistent homology or expression levels of proteins in their analysis, leaving the shape 38 of cytometry data uncharacterized. Another work focuses on the use of TDA as a data 39 reduction method for single-cell RNA sequencing data [13] , but again do not attempt to 40 characterize how topologies derived from point clouds differ among disparate data 41 sources such as healthy and diseased individuals. 42 The challenges of directly applying current persistence methodologies to cytometry 43 data to characterize distinguishing features between healthy and diseased states are the 44 following: 1. Features that separate healthy from diseased state can pertain to the 45 change in density of points in a region in point cloud data -therefore, the information of 46 local density should be incorporated in persistent homology methods, in particular in 47 the filtration step that brings in sequentially the simplices connecting the points. In 48 commonly used Rips filtration [14] the density of points is not included. 2. There can 49 be shape changes giving a different length scale in the point cloud data, such as 50 formation of an elbow, in a diseased condition. 3. There can be systematic differences 51 between healthy and diseased states across batch effects and donor-donor 52 variations.Topological features should capture these global differences being oblivious to 53 the local variations caused by measurement noise. 54 We address the above challenges by developing an appropriate filtration function to 55 compute persistence and applying the method to characterize distinguishing features of 56 non-naïve CD8+ T cells between healthy and SARS-CoV-2 infected patients. resilience to local perturbations [15] . It is this property of persistent homology that 61 motivates us to use TDA in distinguishing clinically relevant features in flow cytometry 62 data in COVID-19 patients. Persistent Homology: Persistent homology builds on an algebraic structure called homology groups graded by its dimension i and denoted by H i . Intuitively, they describe the shape of the data by 'connectivity' at different levels. For example, H 0 describes the number of connected components, H 1 describes the number of holes, and, H 2 describes the number of enclosed voids apparently present in the shape that the dataset represents. Three and higher dimensional homology groups capture analogous higher ( 3) dimensional features. A point cloud data (henceforth abbreviated as PCD) itself does not have much of a 'connected structure'. So, a scaffold called a simplicial complex is built on top of it. This simplicial complex, in general, is made out of simplices of various dimensions such as vertices, edges, triangles, tetrahedra, and other higher dimensional analogues. Given a growing sequence of such complexes called filtrations, a persistence algorithm tracks information regarding the homology groups across this sequence. In our case, these complexes can be restricted only to vertices and edges. With the restriction that both vertices of an edge appear before the edge, we get a nested sequence of graphs Persistence Diagram: Appearance ("birth") and disappearance ("death") of 65 topological features, that is, cycles whose classes constitute the homology groups, can 66 be captured by persistence algorithms [8, 16] . These "birth" and "death" events are 67 represented as points in the so-called persistence diagram. If a topological feature is 68 born at filtration step b and dies at step d, we represent this by persistence pair (b, d) since edge e 0 at filtration step 6 kills the component created by v 1 at step 1. Similarly, 80 we obtain pairs (2, 7), (4, 8) , (5, 9) , and (3, 11) . These points, tracking the 'birth' and 81 'death' of components, produce the persistence diagram for the 0-th homology group H 0 82 and hence we refer to it as H 0 -persistence diagram. Note that the edge e 4 creates a 83 cycle (yellow) that never dies. In such cases, i.e. when a topological feature never dies, 84 we pair it with 1. For the edge e 4 , we obtain a persistence pair (10, 1). But, this Computing persistent homology for cytometry datasets: Our datasets consist 101 of cytometry data for non-naïve CD8+ T cells. Given protein expressions (real values) 102 for d proteins in such a single cell, we can represent it as a d-dimensional point in R d . Considering a population of single cells, we get a point cloud (PCD) in R d . Now, we 104 study the shape of this PCD using the persistence framework that we describe above. 105 We compute persistence diagrams for the PCDs generated with protein expressions from 106 different individuals and compare them. It turns out that, for computational purposes, 107 we need a limit on the dimension d for PCD which means we need to choose carefully 108 the proteins that differentiate effectively the subjects of our interest, namely the healthy 109 individuals, COVID-19 patients, and recovered patients. We typically choose 3 110 (sometimes 2) protein expressions to generate the PCD and call it a PCD in the P1, P2, 111 P3 space if it is generated by proteins P1, P2, and P3 respectively. Flow cytometry data for non-naïve CD8+ T cells in Mathew et al. [3] show 113 generation of CD8+ T cells with larger abundances of the proteins CD38 and HLA-DR 114 (CD38+HLA-DR+ cells) for some COVID-19 patients, forming an "elbow" in the two 115 dimensional PCD with CD38 and HLA-DR protein expressions (see Fig S2) . Moreover, 116 there is an increase in the local density of the points (or single CD8+ T cells) in the 117 "elbow" region. This suggests that, to study the PCD generated by the protein 118 expressions by persistence framework, we need to choose a filtration that is able to 119 capture such geometric shapes and variations in the local density. 120 We briefly describe our choice of filtration by considering the example of a point noteworthy to mention that f v (p) is the distance-to-measure defined in [17] and 129 captures the density distribution of points whereas f e (e) captures the inter-point 130 distances between the points in the given point cloud. 131 The persistence algorithm processes each vertex and edge in the order of their 132 appearance in the filtration. We execute it using a threshold value from 1 to 1 133 and generate the persistence diagram accordingly. Intuitively, as is increased from Our aim is to find out systematic differences in topological features extracted from 159 cytometry data for healthy individuals and COVID-19 patients. Ideally one would like 160 to compute persistence diagrams for all 25 proteins that were measured in single CD8+ 161 T cells, however, this task encounters two major problems. First, as we mentioned 162 before taking the full 25 dimensional PCD introduces the curse of dimensionality [18] 163 making it computationally infeasible to produce the persistence diagrams. The second 164 one is more subtle. In order to measure how the density of data differs from a healthy 165 to infected person in a quantitative way, we need to ensure that the number of points in 166 each PCD, to be analyzed by persistent homology, is the same. Cytometry data usually 167 contain different numbers of single cells in datasets obtained from different donors or 168 replicates. To address the curse of dimensionality we use a classifier (XGBoost) that construct persistence diagrams for each dataset. To quantify the structural differences 176 in the datasets as captured by the corresponding persistence diagrams, we compute the 177 Wasserstein distance [19] between persistence diagrams from randomly selected pairs of 178 either two healthy donors (H⇥H) or a healthy donor and an infected patient (H⇥P) and 179 compute distributions of the Wasserstein distances for a large number of (H⇥H) and 180 (H⇥P) pairs. The comparison of these distributions provides information regarding the 181 systematic differences in shape features in the CD8+ T cell cytometry data across 182 healthy individuals and COVID-19 patients. The computational pipeline is summarized 183 in (Fig 3) . Below we describe results from the application of our computational pipeline 184 to the CD8+ T cell cytometry data in Mathew et al. [3] Fig 5 (Fig S4) . The difference between H⇥H and H⇥R 227 distributions of distances in the T-bet, Eomes, and Ki-67 space are less prominent 228 (Fig S5) . We further test if such systematic differences are present for proteins that are 229 A readily apparent difference between the resulting persistence diagrams is given by 241 the lower birth times in H 1 of the COVID-19 patient compared to the healthy control 242 (Fig 6e, 6f) . This result indicates that the length scale of the data is smaller in the 243 COVID-19 patient, which can be visually confirmed in the scatter plots of the data (Fig 244 6a, 6b) . Specifically, the single cell abundances of T-bet and Eomes in CD8+ T cells are 245 clustered significantly tighter around the origin for the COVID-19 patients than for the 246 healthy controls. Similar manual inspection of other H⇥P pairs that generate large 247 Wasserstein distances between their persistence diagrams confirms that this trend is not 248 limited to this pair alone. Additionally, the points in the H 0 -persistence diagram are spread out more widely 250 for the healthy control than the COVID-19 patient (Fig 6c, 6d) . A wider distribution 251 of births and deaths in the 0-th homology H 0 implies that there are regions of disparate 252 densities. This suggests that the densities in the protein expressions of T-bet and Eomes 253 are more homogeneous in the PCD in the COVID-19 patient than in the healthy control. 254 The structural change in the PCD for CD8+ T cells in the T-bet/Eomes plane that 255 occurs during COVID-19 infection implies that T-bet and Eomes expression should be 256 downregulated (decreased) in non-naïve CD8+ T cells. This result is consistent with 257 analysis of clusters of CD8+ T cells by Mathew et al. [3] that shows that clusters high 258 in T-bet and/or Eomes are downregulated in COVID-19 patients. The downregulation 259 of T-bet and Eomes in response to viral infections is not well documented, as CD8+ T 260 cells commonly differentiate into phenotypes with, high T-bet, high Eomes, or both in 261 response to infections [20, 22] . We developed a persistent homology based approach to determine topological features 264 hidden in point cloud data representing single cell protein abundances measured in 265 cytometry data. In particular, we characterized the number of connected components or 266 H 0 , and the number of holes or H 1 in our persistence calculations, and showed that our 267 approach is able to determine systematic shape differences in the cytometry data for 268 CD8+ T cells obtained from healthy individuals and COVID-19 patients. Therefore, the 269 approach is able to successfully determine systematic shape differences that exist in the 270 presence of batch effect noise and donor-donor variations in the cytometry data. Furthermore, our approach does not use data transformations (e.g., arc-sinh 272 transformation) or any ad-hoc subtype gating to determine these systematic differences, 273 thus we expect persistent homology based approaches will be especially useful in 274 identifying high-dimensional structural trends hidden in cytometry data. 275 We determined structural changes in T-bet and Eomes abundances in single CD8+ 276 T cells in COVID-19 patients that can be summarized as downregulation. This result is 277 non-intuitive as previous findings show that T-bet and Eomes protein abundances are immunophenotype (i.e.,Eomes+, T-bet+, CD8+ T cells) is systematically less prevalent 285 in COVID-19 patients than in healthy controls. The ability of our approach to identify 286 non-intuitive shape features such as the above without any 'supervision' (e.g., specific 287 gating) of the cytometry data shows that it can potentially determine more complicated 288 immunologically relevant shape features. For instance, although we did not identify 289 differing shapes or voids in the COVID-19 data in our current approach, our method 290 would be able to detect these in a different dataset that contains such characteristics. Currently, the limitations are mostly centered around computational complexity and 301 the curse of dimensionality. The computational resources necessary to calculate a KDE 302 are great enough that extensions beyond three dimensions are not feasible. However, 303 improving or circumventing the KDE calculation step will significantly increase the 304 dimensionality of data that can be considered. Since we are computing pairwise 305 distances between datapoints to obtain the persistence diagram (Section SIB), 306 computation time will linearly increase as the dimension of data increases. Our 307 methodology on our computational cluster resources currently takes about 40 minutes. 308 This is comparable to other data science applications to large datasets, but can be a 309 barrier to those without access or experience with computational clusters. Additionally, 310 it is unclear how adding more dimensions will impact the statistical properties of the 311 data and interpretability of the results. To expand into many (i.e. 25) dimensions, 312 computational interpretation and validation tools will be necessary. tree based classifier, and as a byproduct we get feature scores that correspond directly 325 to each feature's importance in the classification. The higher the score for a protein, the 326 more important it is for the classifier's decision. After our classifier orders the proteins 327 by their scores, we take first r proteins to construct the point-cloud on which persistence 328 diagrams are computed. We set r = 3 for all our analysis reported here. We used data 329 from 56 healthy individuals and 108 COVID-19 patients for our feature selection. Estimation. The estimated probability distribution function was used to draw 20,000 337 points corresponding to each PCD that were further analyzed using persistent homology. 338 We used KDE estimated probability distributions instead of random selection of 20,000 339 datapoints in three dimensions, because the random selection of data points do not 340 appropriately sample the distribution of the data points for the relatively smaller 341 sample size (20, 000 data points). KDE was implemented using scikit-learn [24] package. 342 The kernel function was chosen to be 'gaussian'. For now we have tuned the bandwidth 343 parameter based on manual observation and to remain consistent we have fixed the 344 bandwidth at 0.2 for each dataset. Details of persistent homology computation: As mentioned before (section 2.1), computation of persistence diagrams needs a filtration. We set the filtration induced by the function f = {f v , f e } where f v (p) computes an "average" Euclidean distance between the vertex p and its k neighbors according to Eq. (1) and f e (e) computes the length of the edge e according to Eq. (2). kp q i k 2 , p 2 P, and q i 2 k-Nearest Neighbors of p. . The term kp q i k in the above equation is the Euclidean distance between the vertices p and q i . The function value f e (e) for an edge e = (p, q) is given by the Euclidean distance between p and q. For the experiments, the number of nearest neighbors is fixed to k = 40. f e (e) = kp qk , 8p, q 2 P and p 6 = q We begin with computing Kernel Density Estimation (KDE) for every cytometry 346 PCD c i and take n(= 20, 000) samples from each KDE. We do this to make c i uniform 347 w.r.t. number of data points (single CD8+ T cells). We compute a complete weighted 348 graph G(V, E) with vertices in the sampled data. This complete graph G is a key-step 349 that enables us to compute the persistence diagram, Dgm(c i ) of the dataset c i , w.r.t. the filtration function f . We show the algorithm (Algorithm 2) that executes this step 351 in detail in the supplementary material. Notice that the graph G is weighted in the 352 sense that each vertex v 2 V and edge e 2 E carries a weight of f v (v) and f e (e) 353 respectively. Observe that f : V [ E ! R constitutes a valid filtration of G. 354 We compute persistence diagrams for each c i 2 D according to Algorithm 3. The 355 next step involves comparing the persistence diagrams. We do this by computing the 356 Wasserstein distance between persistence diagrams and plotting their distributions. We 357 take two persistence diagrams of randomly selected healthy individuals and compute the 358 Wasserstein distance between them with the help of Gudhi [19, 25] and scikit-learn 359 Python library [24] . Similarly, we compute Wasserstein distance between persistence 360 diagrams of a healthy and an infected individual (both are randomly drawn from the 361 collection). We plot the resulting distances. We do this for 108 pairs to obtain two 362 distributions. Note that, results described in Section 2.1 still holds for 200 pairs 363 ( Fig S4) . Intuitively, a large Wasserstein distance between two persistence diagrams 364 implies the datasets on which they were constructed are structurally very different while 365 a small distance implies they are structurally similar. Available code: Our current code is available at 377 https://github.com/soham0209/TopoCytometry and will be updated for ease-of-use 378 and performance enhancements. Mass Cytometry: Single Cells, Many Features Mass cytometry: a powerful tool for dissecting the immune landscape Deep immune profiling of COVID-19 patients reveals distinct immunotypes with therapeutic implications Human NK cell repertoire diversity reflects immune experience and correlates with viral susceptibility Monitoring immune responses in the tumor microenvironment Immunophenotype Discovery, Hierarchical Organization, and Template-Based Classification of Flow Cytometry Samples optimalFlow: optimal transport approach to flow cytometry gating and population matching Computational topology: an introduction Topological data analysis. Advances in applied and computational topology Topological data analysis of zebrafish patterns Quantitative firing pattern phenotyping of hippocampal neuron types Mass cytometry and topological data analysis reveal immune parameters associated with complications after allogeneic stem cell transplantation Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development Efficient and robust persistent homology for measures Stability of Persistence Diagrams Persistence barcodes for shapes Declutter and resample: Towards parameter free denoising Dynamic Programming Geometry Helps to Compare Persistence Diagrams Cutting Edge: IL-12 inversely regulates T-bet and eomesodermin expression during pathogen-induced CD8+ T cell differentiation The Ki-67 protein: From the known and the unknown Characterization of T-Bet and Eomes in Peripheral Human Immune Cells A Scalable Tree Boosting System API design for machine learning software: experiences from the scikit-learn project GUDHI User and Reference Manual. 3.4.1 ed. GUDHI Editorial Board Computational Topology for Data Analysis: Notes from Book Acknowledgements: This work is supported by the NIH awards R01-AI 143740 and R01-AI 146581 to JD, and by the Research Institute at the Nationwide Children's Hospital, and NSF awards CCF 1839252 and 2049010 to TD. The authors would like to thank John Wherry and members of his lab for depositing their data and helping us to understand it.