key: cord-0855126-52quljiv authors: Chung, Moo K.; Ombao, Hernando title: Lattice Paths for Persistent Diagrams date: 2021-05-01 journal: 4th International Workshop on Interpretability of Machine Intelligence in Medical Image Computing, iMIMIC 2020 and 1st International Workshop on Topological Data Analysis and Its Applications for Medical Data, TDA4MedicalData 2021 held in conjunction with 24th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2021 DOI: 10.1007/978-3-030-87444-5_8 sha: df7023e0d6c43ffd9e479ac801ce856a87619f29 doc_id: 855126 cord_uid: 52quljiv Persistent homology has undergone significant development in recent years. However, one outstanding challenge is to build a coherent statistical inference procedure on persistent diagrams. In this paper, we first present a new lattice path representation for persistent diagrams. We then develop a new exact statistical inference procedure for lattice paths via combinatorial enumerations. The lattice path method is applied to the topological characterization of the protein structures of the COVID-19 virus. We demonstrate that there are topological changes during the conformational change of spike proteins. Despite its rigorous mathematical foundation developed for two decades starting with study [13] , persistent homology still suffers from numerous statistical and computational problems. It has not yet become a standard tool in medical imaging. Persistent homology has been applied to a wide variety of data including brain networks [8] , protein structures [15] , RNA viruses [5] and molecular structures [19] . However, most of these methods only serve as exploratory tools that provide descriptive summary statistics rather than formal inference. The main difficulty is due to the heterogeneous nature of topological features, which do not have a one-to-one correspondence across persistent diagrams. Motivated by these challenges, we propose a more principled topological inference procedure through lattice paths. Lattice paths are widely studied algebraic objects in combinatorics and may have potential applications in persistent homology [2, 8, 21, 23] . Here, we propose to use the lattice path approach in computing probabilistic statements about the similarity of two persistent diagrams. This is often needed to produce some baseline quantitive measure, such as the p-value, commonly used in biomedical research [7, 8] . Existing methods for computing p-values usually rely on approximate time consuming resampling techniques: jackknife, bootstrap and the permutation test [1, 8] . However, our approach is analytic and thus compute the exact probability without computational burden. The main contributions of this paper are the following: (1) a new data representation via Dyck and lattice paths; (2) the analytic approach for computing probabilities without resampling and significantly reducing run time; (3) the first topological study on the shape of COVID-19 virus spikes proteins. The proposed lattice path method was used in differentiating the conformational changes of the COVID-19 virus spike proteins that is needed for the virus to penetrate host cells ( Figure 1 ). This demonstration is particularly relevant due to the potential for advancing vaccine development and the current public health concern [25,4]. Simplicial homology. High dimensional objects, such as proteins and molecules, can be modeled as a point cloud data V consisting of p number of points (atoms) indexed as V = {1, 2, · · · , p}. Suppose that the distance ρ ij between points i and j satisfies the metric properties. For proteins, we can simply use the Euclidean distance between atoms in a molecule. Then X = (V, ρ), ρ = (ρ ij ) is a metric space where we can build a filtration necessary for persistent homology. If we connect points following some criterion on the distance, they will form a simplicial complex which will follow the topological structure of the molecule [12, 18, 28] . The k-simplex is the convex hull of k + 1 points in V . A simplicial complex is a finite collection of simplices such as points (0-simplex), lines (1-simplex), triangles (2-simplex) and higher dimensional counterparts. In particular, the Rips complex X is a simplicial complex, whose k-simplices are formed by (k + 1) points which are pairwise within distance [16] . The Rips complex induces a hierarchical nesting structure called the Rips filtration X 0 ⊂ X 1 ⊂ X 2 ⊂ · · · for filtration values 0 = 0 < 1 < 2 < · · · . The filtration is quantified through k-cycles where 0-cycles are the connected components, 1-cycles are loops while 2cycles are 3-simplices (tetrahedron) without interior. During the Rips filtration, the i-th k-cycles are born at filtration value b i and die at d i . The collection of all the paired filtration values {(b 1 , d 1 ), · · · , (b q , d q )} displayed as scatter points in 2D plane is called the persistent diagram. Dyck paths. The first step in the proposed lattice path method is to sort the set of all the birth and death values in the filtration as order statistics c : c (1) < c (2) < · · · < c (2q) , where c (i) is one of the birth or death values. The subscript (i) denotes the i-th smallest value. We will simply call such sequence as the birth-death process. Every possible valid sequence of birth and death values can be viewed as forming a probability space, where each valid sequence is likely to happen with equal probability. During the filtration, the sequence of birth and death occurs somewhat randomly but still maintains a specific pairing structure. There exists a one-to-one relation between the ordering information and Dyck paths if we identify births with and deaths with [2, 21] . If we trace the arrows, we obtain the Dyck path ( Figure 2 ) [23] . A valid Dyck path always starts at y = 0 and ends at y = 0. At any moment during the filtration, a Dyck path cannot go below y = 0. The total number of Dyck paths is called the Catalan number κ p = 1 q+1 2q q . The first few Catalan numbers are κ 1 = 1, κ 2 = 2, κ 3 = 5 and κ 4 = 14. More rapid changes in the direction of Dyck paths imply more fleeting fluctuations which are indicative of smaller topological signals. Less fluctuations indicate larger persistence and thus larger topological structures. The first path in Figure 2 has larger persistence while the last path has smaller persistence. Lattice paths. If we rotate the Dyck paths clockwise at 45 • and flip vertically, we obtain equivalent monotone lattice paths consisting of a sequence of → (uparrow) and ↑ (downarrow). Figure 2 displays corresponding monotone lattice paths between (0, 0) and (q, q) where the path does not pass above the diagonal line y = x [23] . During the filtration, there cannot be more deaths than births and thus the path must lie below the diagonal line. The total area below Dyck paths can be used to quantify the Dyck paths [6] . Since the area below a Dyck path is equivalent to q 2 /2 subtracted by the total area of boxes below the corresponding lattice path, we will simply use lattice paths for quantification. If we tabulate how the area of boxes change over the x-coordinate in a lattice path, it is monotone. In the first path in Figure 2 , the number of boxes below the first and the last lattice paths are (0, 0, 0, 0) and (0, 1, 2, 3). The area below the path is related to persistence. A barcode with smaller persistences (last path in Figure 2 ) will have more boxes (dark gray boxes) while longer persistences will have fewer boxes (first path in Figure 2 ). Given the sequence of heights of piled-up boxes, we can recover the corresponding lattice path by tracing the outline of boxes. We can further recover the original pairing information about births and deaths. In the Rips filtration for 0-cycles, persistent diagrams line up vertically as (0, d (i) ). We simply augment them as ((i − 1)δ, d (i) ) for sufficiently small δ. The lattice and Dyck path representations only encode the ordering information about how births and deaths are paired, and do not encode the actual filtration values. This is remedied by adaptively weighting the length of arrows in lattice paths. We sort the set of birth values b i and death values d i as the order statistics: We start at origin (0, 0). When we encounter a birth b (i) , we take the horizontal step to b (i) . When we encounter a death d (i) , take the vertical step to d (i) ( Figure 2 ). The weighted lattice paths contains the same topological information as the original persistent diagram. Exact topological inference. Using the weighted lattice paths, we can provide the probabilistic statement about how close two birth-death processes are, which can be used for topological inference. For this, we need the transformation φ: Theorem 1. There exists a one-to-one map from a birth-death process to a monotone function φ with φ(0) = 0 and φ(1) = q. Proof. We explicitly construct such a function φ. Consider the sequence of areas of boxes as we traverse the weighed lattice path: h : is the area of i-th box with h 1 = 0. The areas h may not strictly increase ( Figure 3 ). If births occurs r times sequentially in the birth-death process, h will have r repeated identical areas h i , · · · , h i as a subsequence. To make the subsequence strictly increasing, we simply add a sequence of strictly increasing small numbers δ(0, 1, 2, · · · , r−1) to the repetition with δ ≤ 1 r (Figure 3 ). Denote the transformed sequence as h : h 1 < · · · < h q . Then φ(t) is given as a step function . From φ(t), the original sequence h and the original birth-death process can be recovered exactly. Such map from a birth-death process to φ is one-to-one. Note h − h 2 → 0 as δ → 0. So by making δ as small as possible, we can construct a strictly monotone h to be arbitrarily close to h. The normalized step function φ(t)/q can be viewed as an empirical cumulative distribution and many statistical tools for analyzing distributions can be readily applied. Figure 4 -bottom displays the lattices paths and the normalized step functions of 1-cycles corresponding to the spike proteins used in the study. With monotone function φ, we are ready to test the topological equivalence of two birth-death processes: Let φ 1 and φ 2 be the step functions corresponding to C 1 and C 2 . The topological distance will be used as the test statistic for testing the equivalence of C 1 and C 2 . The normalizing denominators q 1 and q 2 ensures that the value of step functions are in [0, 1]. The statistic D(φ 1 , φ 2 ) is the upper bound of area difference under φ 1 (t)/q 1 and φ 2 (t)/q 2 : Theorem 2. Under the null hypothesis of the equivalence of C 1 and C 2 , Proof. The statement can be proved similarly as the combinatorial construction of the Kolmogorov-Smirnov test [3, 17, 8] . First, we combine monotonically increasing sequences C 1 and C 2 and sort them into a bigger monotone sequence of size q 1 + q 2 . Then, we represent the combined sequence as the sequence of → and ↑ respectively depending on if they are coming from C 1 or C 2 . Under the null, there is no preference and they equally likely come from C 1 or C 2 . If we follow the sequence of arrows, it forms a monotone lattice path from (0, 0) to (q 1 , q 2 ). In total, there are q1+q2 q1 possible equally likely lattice paths that forms the sample space. From Theorem 1, the values of φ 1 (t) and φ 2 (t) are integers from 0 to q. Then it follows that where A u,v is the total number of valid paths from (0, 0) to (u, v) within dotted red lines in Figure 3 . with the boundary condition A u,0 = A 0,v = 1 for all u and v. Computing A q1,q2 iteratively requires at most q 1 · q 2 operations while the permutation test will cause a computational bottleneck for large q 1 and q 2 . Thus, the proposed lattice path method computes the exact p-value substantially faster than the permutation test. Since most protein molecules consist of thousands of atoms, q 1 and q 2 should be sufficiently large to apply the asymptotic [17, 22, 10] : Subsequently, the p-value under null is given by where d o is the observed value of q1q2 q1+q2 D. Computing the p-value through Theorem 3 mainly requires sorting, which has the runtime of O q log q for q = q 1 = q 2 . On the other hand, the traditional permutation test requires computing the distance for 2q q possible permutations, which is asymptotically O(4 q / √ πq) [11, 14] . For thousands of atoms, the total number of permutations is too large to compute. Thus, only a small fraction of randomly generated permutations are used in the traditional permutation test [10, 9, 11, 20, 24, 27] . Even if we use hundreds of thousands permutations, the traditional permutation test still takes a significant computational effort. Further, as an approximation procedure, the standard permutation test does not perform better than the exact topological inference, which gives the mathematical ground truth. This is demonstrated in Table 1 in the simulation study [9] . The proposed lattice path method is used to study the topological structure of the severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2), which is often called COVID-19. Since the start of the global pandemic (approximately December 2019), COVID-19 has already caused 3.85 million deaths in the world [4] . In this study, we analyzed the spikes of three different coronaviruses identified as 6VXX, 6VYB [25] and 6JX7 [26] . The 6VXX and 6VYB are respectively the closed and open states of SARS-Cov-2 from human while 6JX7 is feline coronavirus (Figure 1 ). All the domains of 6VXX have exactly 7604 atoms and are expected to be topologically identical. Applying the lattice path method to 1-cycles, we tested the topological equivalence of the B-domain and the A-and C-domains within 6VXX. The normalized step functions are almost identical and the observed topological distances are 0.0090 and 0.0936, which give the p-value of 1.00 each. As expected, the method concludes that they are topologically equivalent. The domain B of 6VXX is also compared against the domain B of feline coronavirus 6JX7 consisting of 9768 atoms. Since 6JX7 is not from human, it is expected that they are different. The topological distance is 0.9194 and p-values is 0.00×10 −38 confirming that the topological nature of spikes are different. This shows the biggest difference among all the comparisons done in this study. The Matlab codes and data used for the study are available at http://www.stat. wisc.edu/~mchung/TDA. In this paper, we proposed a new representation of persistent diagrams using lattice paths. The novel representation enable us to perform the statistical inference combinatorially by enumerating every possible valid lattice paths analytically. The proposed lattice path method is subsequently used to analyze the coronavirus spike proteins. The normalized step functions φ(t)/q for all the spike proteins show fairly stable consistent global monotone pattern but with localized differences. We demonstrated the lattice path method has the ability to statistically discriminate between the conformational changes of the spike protein that are needed in the transmission of the virus. We hope that the our new representation enables scientists in their effort to automatically identify the different types and states of coronaviruses in a more principled manner. Local persistent homology based distance between maps Geometry of the space of phylogenetic trees A Kolmogorov-Smirnov test for r samples Distinct conformational states of SARS-CoV-2 spike protein Topology of viral evolution Moments of dyck paths. Discrete mathematics On the bootstrap for persistence diagrams and landscapes Exact topological inference of the resting-state brain networks in twins Exact combinatorial inference for brain images. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) Exact topological inference for paired brain networks via persistent homology. Information Processing in Medical Imaging (IPMI) Rapid acceleration of the permutation test via transpositions Computational topology: An introduction Topological persistence and simplification An introduction to probability theory and its applications A topological measurement of protein compressibility Barcodes: The persistent topology of data Nonparametric Statistical Inference Computational topology for shape modeling Persistent spectral-based machine learning (PerSpect ML) for protein-ligand binding affinity prediction Nonparametric permutation tests for functional neuroimaging: A primer with examples Noncrossing partitions Estimate of deviation between empirical distribution functions in two independent samples Genetic influences on brain structure Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Smile-GANs: Semi-supervised clustering via GANs for dissecting brain disease heterogeneity from medical images Whole-brain anatomical networks: Does the choice of nodes matter? Topology for computing The illustration of COVID-19 virus (Figure 1