key: cord-0615144-c1aeqvtf authors: Reinauer, Raphael; Caorsi, Matteo; Berkouk, Nicolas title: Persformer: A Transformer Architecture for Topological Machine Learning date: 2021-12-30 journal: nan DOI: nan sha: b0f21dfc286258b092e099e991af583926e0cdcb doc_id: 615144 cord_uid: c1aeqvtf

One of the main challenges of Topological Data Analysis (TDA) is to extract features from persistence diagrams that are directly usable by machine learning algorithms. Indeed, persistence diagrams are intrinsically (multi-)sets of points in R^2 and cannot be regarded in a straightforward manner as vectors. In this article, we introduce Persformer, the first Transformer neural network architecture that accepts persistence diagrams as input. The Persformer architecture significantly outperforms previous topological neural network architectures on classical synthetic benchmark datasets. Moreover, it satisfies a universal approximation theorem. This allows us to introduce the first interpretability method for topological machine learning, which we explore in two examples.

Topological Data Analysis (TDA) is a rapidly growing field of data science that incorporates methods to estimate the topology of a dataset within machine learning pipelines. The most common descriptors in the field are the so-called persistence diagrams, which are further used in TDA pipelines for classification and regression tasks; the applications span a wide variety of scientific areas, such as material science [NHH+15, LBD+17], neuroscience [RNS+17], cancer biology [ACC+20], the comprehension of deep learning architectures [NZL20, LIU21], and the analysis of COVID-19 propagation [DR20].

Persistence diagrams are subsets of R^2 whose points correspond to topological features of the dataset (such as connected components, loops, voids, ...), with coordinates encoding a notion of size of the feature. They are usually compared using either the bottleneck distance or a Wasserstein-type distance. We refer the reader to one of the many introductory TDA textbooks [DW21, Oud15, Ghr14] for the details of these definitions. The main challenges for incorporating persistence diagrams into a machine learning pipeline are twofold. Firstly, the data structure underlying persistence diagrams is intrinsically a set. Secondly, whatever the type of distance considered (bottleneck or Wasserstein), the space of persistence diagrams cannot be isometrically embedded into a Hilbert space [MV21, Theorem 4.3]. To overcome these issues, the TDA community has developed several vectorization methods that associate to a set of persistence diagrams a set of vectors in a Hilbert space. These methods are primarily of two types: one either defines the vectorization map a priori, as for persistence landscapes [Bub15], or one learns it through a trainable architecture, such as a neural network [CCI+20, HKN19].

One of the major changes of paradigm in neural network architectures over the last five years is the introduction of the transformer architecture [VSP+17], which incorporates a self-attention mechanism. In short, transformer models process a datum as a whole, exploit long-distance relationships between the elements of a datum, and avoid recursion altogether. This is why transformer architectures achieve state-of-the-art performance on a variety of tasks. In Natural Language Processing (NLP), large pre-trained language models like BERT [DCLT19] and GPT-3 [BMR+20] achieve state-of-the-art results on various NLP benchmarks.
For computer vision tasks, Vision Transformer models [ZKHB21] achieve state-of-the-art results on a variety of benchmark datasets.

In this work, we introduce Persformer, a transformer neural network architecture designed for analyzing datasets of persistence diagrams, making the power and versatility of the transformer architecture available for topological machine learning. We compare our model with the existing neural network architectures handling persistence diagrams, PersLay [CCI+20] and PLLay [KKZ+20], exceeding the test accuracy of previous state-of-the-art models by 4.0% and 4.1%, respectively, on benchmark datasets. The fact that our architecture does not make use of any handcrafted vectorization of persistence diagrams, as is the case for existing methods [CCI+20, KKZ+20], allows us to adapt a well-known interpretability method for neural networks to Persformer. We define saliency maps for Persformer, whose value on a given point of a persistence diagram quantifies the importance of this point for the output of the Persformer. In particular, we recover with saliency maps the observation made in [BHPW20] that the "small bars" of persistence diagrams detect curvature. These results lead us to conclude that it is too restrictive for topological machine learning tasks to assume that "small bars" are mere representations of noise. Since "small bars" are unstable (i.e., they change easily from one realization of a datum to another), we also conclude that it is too restrictive to impose stability of the topological features extracted by our Persformer.

As described in the introduction, one of the main challenges of topological data analysis is that the space of persistence diagrams, equipped with either the bottleneck or the Wasserstein distance, cannot be isometrically embedded into a Hilbert space [MV21, Theorem 4.3]. Since most machine learning methods assume that the input dataset is a subset of a Hilbert space, they cannot be directly applied to datasets of persistence diagrams. To overcome this issue, considerable effort has been made in the TDA community to define vectorizations of the space of persistence diagrams, that is, to define a Hilbert space H together with a continuous map φ : D → H, with D the space of persistence diagrams endowed with either the bottleneck or the Wasserstein distance. These methods are primarily of two types.

Prescribed vectorization methods. This approach defines a Hilbert space H and a continuous map φ : D → H that are independent of the machine learning task one tries to solve. Among others, there are the persistence scale-space kernel [RHBK15], persistence landscapes [Bub15], the weighted Gaussian kernel [KHF16], and the sliced Wasserstein kernel [CCO17a].

Learnable vectorization methods. Another approach is to learn the "best" vectorization for a fixed data analysis task among a family of vectorization maps. More precisely, assume we consider a fixed learning task (such as supervised classification) on a Hilbert space H, and a family of vectorizations φ_θ : D → H. This approach then consists in learning the best value of the parameter θ, according to an optimization criterion provided by the learning task (typically a loss function associated with the classification process in H).
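To make the prescribed approach concrete, the sketch below computes persistence diagrams of a batch of point clouds and vectorizes them with persistence landscapes. It is a minimal illustration only; it assumes the giotto-tda package cited later in the paper, and the class names `VietorisRipsPersistence` and `PersistenceLandscape` with the parameters shown follow that library's documented interface rather than anything specified by the authors.

```python
# Minimal sketch of a prescribed vectorization pipeline (assumes giotto-tda;
# class names and parameters follow its documented API and may differ by version).
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceLandscape

# Toy data: 10 point clouds with 100 points each in R^2.
point_clouds = np.random.default_rng(0).random((10, 100, 2))

# Point clouds -> persistence diagrams in homology dimensions 0 and 1.
diagrams = VietorisRipsPersistence(homology_dimensions=[0, 1]).fit_transform(point_clouds)

# Persistence diagrams -> vectors in a Hilbert space via persistence landscapes,
# a vectorization chosen independently of any downstream learning task.
features = PersistenceLandscape(n_layers=5, n_bins=100).fit_transform(diagrams)
print(features.shape)  # one landscape vector per sample (exact shape depends on the library version)
```

A learnable vectorization replaces the fixed landscape map by a parametrized family φ_θ whose parameters are optimized together with the downstream classifier.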
The first article introducing learnable vectorization methods of persistence diagrams is [HKNU17], where the φ_θ are given by sums of two-dimensional Gaussian functions (with means and standard deviations determined by θ) evaluated on the points of persistence diagrams. In [HKN19], the authors introduce the first neural network architecture able to accept persistence diagrams as input. Furthermore, they elaborate on the observation (initially made in another context [ZKR+17, Theorem 2]) that any real-valued Hausdorff-continuous function on the set of persistence diagrams contained in a fixed compact subset of R^2 with exactly n points can be approximated arbitrarily well by expressions of the form D ↦ ρ(Σ_{x∈D} φ(x)), for certain functions φ : R^2 → R^p and ρ : R^p → R. They introduce specific classes of functions for ρ and φ, which have then been extended in [CCI+20] by the PersLay architecture.

Persistence diagrams are the most commonly used descriptors developed by the TDA community. They come in two main flavors: ordinary and extended persistence diagrams. We refer to the textbook [DW21] for an extended exposition of the mathematical background of this section.

Ordinary persistence diagrams track the evolution of topological features of dimension i (connected components for i = 0, holes for i = 1, cavities for i = 2, ...) in nested sequences of topological spaces (or simplicial complexes) (X_t)_{t∈R}, where X_a ⊂ X_b whenever a ≤ b. The variable t ∈ R is called the filtration value and intuitively corresponds to the time of the evolution of the topological features. Thus, we will personify topological features and give them birth and death dates. If a topological feature of dimension i is born at time b ∈ R in the filtration and dies at time d ∈ R ∪ {+∞}, it gives rise to a point with coordinates (b, d) in the i-th persistence diagram of the filtration. Therefore, ordinary persistence diagrams are multi-sets (sets where elements can have multiplicity) of points (b, d) with b < d in the subset R × (R ∪ {+∞}) of the extended plane.

Extended persistence diagrams generalize ordinary persistence, and encode the size and type of the topological features of the fibers of a continuous map f : X → R. In practice, extended persistence diagrams are defined for real-valued functions on the 0-simplices of a simplicial complex. For each dimension i, the i-dimensional topological features of the fibers of f are encoded with a birth b ∈ R, a death d ∈ R, and one of the following four possible types: Ordinary, Relative, Extended+ or Extended- [CSEH09]. Extended persistence is a strict generalization of ordinary persistence, since the latter is contained in the former. Furthermore, it has the computational advantage of containing only points with finite coordinates.

It is possible to compare persistence diagrams using various distances, all defined as the infimum cost of a partial matching problem between the points of two persistence diagrams. The two most common choices are the bottleneck distance [DW21, Definition 3.9], which is well understood theoretically but is not robust to outliers in the dataset, and the class of Wasserstein distances [DW21, Definition 3.10], whose theoretical study is still ongoing but which have good empirical robustness properties. Both persistence diagrams and bottleneck/Wasserstein distances can be efficiently computed by software such as Giotto-tda [TLT+21], Gudhi [MBGY14] or Dionysus [Mor].

We recall a useful approximation result for functions on sets.

Theorem 1 (Theorem 9 of [ZKR+17]). Let X be a compact subset of R^d, and let M be a positive integer.
Let 2^X_M ⊂ 2^X denote the set of subsets of X with exactly M elements, equipped with the Hausdorff metric. For any Hausdorff-continuous function L : 2^X_M → R and any ε > 0, there exist a natural number p > 0 and two continuous functions φ : X → R^p and ρ : R^p → R such that

    |L(S) − ρ(Σ_{x∈S} φ(x))| < ε   for all S ∈ 2^X_M.

Using the universal approximation theorem for neural networks [Hay10], the statement of Theorem 1 can be extended as follows: every Hausdorff-continuous function L on uniformly bounded finite subsets S of R^2 of cardinality M can be approximated arbitrarily well by

    ρ(Σ_{x∈S} φ(x)),   (1)

where φ and ρ are neural networks with a finite number of hidden layers containing a finite number of neurons, with a non-constant, bounded, and nondecreasing continuous activation function such as ReLU. A neural network of the form (1) is called a Deep Set [ZKR+17].

The encoder part of the transformer architecture introduced in [VSP+17], without positional encoding and with multi-head attention pooling (see Section 4.1), composed with a fully connected neural network, is at least as expressive as a Deep Sets model, see [LLK+19]. Hence, this architecture satisfies the universal approximation theorem of set transformers. As noted in [LLK+19], the self-attention mechanism enables explicit pairwise interactions between the instances of a set and, by stacking multiple layers, also higher-order interactions. The authors further show state-of-the-art performance of this architecture on various set-based datasets.

This section is devoted to introducing the Persformer architecture in detail. We refer to [VSP+17] for a detailed introduction of the self-attention mechanism. The Persformer architecture is shown in Figure 1 (input embedding of the persistence diagram, stacked self-attention blocks with layer normalization and position-wise feed-forward layers, multi-head attention pooling, and a decoder). It consists of an embedding layer, which is a trainable position-wise fully connected layer R^2 → R^d, followed by stacked self-attention layers consisting of a multi-head self-attention block and a fully connected feed-forward layer. Finally, the multi-head attention pooling layer provides a vector representation of the persistence diagram, and a fully connected neural network computes the final class prediction. The individual building blocks of the architecture are explained in more detail in the following sections.

The self-attention mechanism aims to model pairwise interactions of the elements of a sequence. For this purpose, three families of vectors Q, K, V ∈ R^{N×d}, called query, key, and value vectors, are computed from a sequence X ∈ R^{N×d} of N d-dimensional vectors. They are linear transformations of the input sequence X by trainable matrices W_Q, W_K, W_V ∈ R^{d×d}. For each query vector Q_i, a similarity score with all key vectors is computed as the scalar product scaled by a factor of 1/√d, and the similarity scores are then normalized using the softmax function to obtain the attention scores AttentionScore(Q_i, K) = softmax(Q_i K^T / √d). The output for the i-th element of the sequence is the weighted sum of the value vectors, Σ_j AttentionScore(Q_i, K)_j V_j.

Position-wise Feed-Forward Network. The position-wise feed-forward network is a fully connected neural network with two layers that is applied to each vector of the sequence.

The multi-head attention pooling layer is a variation of the multi-head attention layer with a single trainable query vector Q ∈ R^{1×d}, and with key and value vectors given by linear transformations of the input sequence [LLK+19]. The output MultiHead(Q, K, V) ∈ R^d is a single vector that does not depend on the order of the input sequence.
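As a minimal illustration of these two building blocks, the PyTorch sketch below implements scaled dot-product self-attention over a sequence of embedded diagram points and attention pooling with a single trainable query vector. It is our own sketch, not the authors' implementation: layer normalization, residual connections, the position-wise feed-forward network, and the decoder are omitted, and all names are ours.

```python
# Minimal sketch (not the authors' code): self-attention and attention pooling
# for a persistence diagram embedded as a sequence X of shape (N, d).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, X):                        # X: (N, d)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        scores = Q @ K.T / math.sqrt(self.d)     # (N, N) similarity scores
        attn = torch.softmax(scores, dim=-1)     # attention scores per query
        return attn @ V                          # weighted sum of value vectors

class AttentionPooling(nn.Module):
    """Single trainable query -> one output vector, independent of input order."""
    def __init__(self, d: int, n_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d))
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, X):                        # X: (N, d)
        out, _ = self.mha(self.query, X.unsqueeze(0), X.unsqueeze(0))
        return out.squeeze(0).squeeze(0)         # (d,)

d = 32
X = torch.randn(17, d)                           # 17 embedded diagram points
pooled = AttentionPooling(d)(SelfAttention(d)(X))
print(pooled.shape)                              # torch.Size([32])
```

Because the pooling query does not depend on the input order, permuting the rows of X leaves the pooled vector unchanged, which is the permutation-invariance property used below.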
Residual connections. Since the points in a persistence diagram may be very close to each other, it may be difficult for the encoder to separate them. This difficulty causes the gradients of the attention blocks to be very small at the beginning of training, which complicates the training of this architecture, especially when many self-attention layers are stacked [GBC16]. To overcome this problem, we added residual connections between the self-attention layers. To our knowledge, this architectural choice has not been made so far in the literature. Empirically, we observed that it solved the vanishing gradient problem and allowed us to stack significantly more self-attention layers while substantially speeding up the training.

On a high level, the Persformer architecture consists of an encoder (φ), a pooling layer (p), and a decoder (ρ) that computes the final class prediction. As input, we take the points of the persistence diagram together with their one-hot encoded homology dimensions. The encoder consists of stacked attention layers. Each attention layer maps a sequence of vectors to a sequence of vectors and is permutation-equivariant, i.e., for every sequence of vectors (x_1, . . . , x_n) and every permutation σ of the set {1, . . . , n} we have φ(x_{σ(1)}, . . . , x_{σ(n)}) = σ(φ(x_1, . . . , x_n)), where σ permutes the order of the vectors in the sequence φ(x_1, . . . , x_n). The attention pooling maps a sequence of vectors to a single vector in a permutation-invariant way, i.e., the output vector does not depend on the order of the sequence. Combining the permutation-equivariance of the encoder with the permutation-invariance of the attention pooling, we get a permutation-invariant map ρ ∘ p ∘ φ that maps a sequence of vectors to a single vector.

The universal approximation theorem of set transformers can be directly extended to Persformers by using the fact that residual networks satisfy the universal approximation theorem [TG20]: every Hausdorff-continuous function on the set of persistence diagrams contained in a fixed compact subset of R^2 with exactly n points can be approximated arbitrarily well by a Persformer model with ReLU activation function.

Training specification. Unlike the optimization of other neural network architectures, a learning-rate warm-up stage is crucial to achieving good results for transformer architectures [PB18]. Empirically, we found that this results in an increase of about 2 percentage points in test accuracy for the Persformer model. We used the AdamW optimizer with a weight decay of 1e-4 and a maximum learning rate of 1e-3, a batch size of 32, and a cosine-with-hard-restarts learning-rate scheduler with 10 warm-up epochs, 3 cycles, and a total of 1,000 epochs. Using grid search, we found that a dropout of 0.2 in the decoder part of the Persformer and no dropout in the encoder part gave the best results.

To demonstrate the performance of our models, we consider the ORBIT5k dataset, which is a standard benchmark for methods vectorizing persistence diagrams [AEK+17, KKZ+20, CCI+20]. The dataset consists of subsets of size 1,000 of the unit square [0, 1]^2 generated by a dynamical system that depends on a parameter ρ > 0. To generate a point cloud, an initial point (x_0, y_0) ∈ [0, 1]^2 is chosen uniformly at random, and a sequence of points (x_n, y_n) for n = 0, 1, . . . , 999 is then generated recursively by

    x_{n+1} = (x_n + ρ y_n (1 − y_n)) mod 1,
    y_{n+1} = (y_n + ρ x_{n+1} (1 − x_{n+1})) mod 1.

For every parameter value ρ = 2.5, 3.5, 4.0, 4.1 and 4.3, we generated 1,000 orbits, obtaining 5,000 orbits in total.
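The generation procedure above is easy to sketch in code. The recursion used below is the standard ORBIT5k recursion as reconstructed in the text; it should be checked against the original references [AEK+17, CCI+20], and the random seed is only illustrative.

```python
# Sketch of the ORBIT5k generation procedure (standard recursion from the
# literature; verify against [AEK+17, CCI+20]).
import numpy as np

def generate_orbit(rho: float, n_points: int = 1000, rng=None) -> np.ndarray:
    """One orbit of the dynamical system on [0, 1]^2 with parameter rho."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.empty((n_points, 2))
    pts[0] = rng.random(2)                       # random initial point (x0, y0)
    for n in range(n_points - 1):
        x, y = pts[n]
        x_next = (x + rho * y * (1.0 - y)) % 1.0
        y_next = (y + rho * x_next * (1.0 - x_next)) % 1.0
        pts[n + 1] = (x_next, y_next)
    return pts

rhos = [2.5, 3.5, 4.0, 4.1, 4.3]
rng = np.random.default_rng(0)
# 1,000 orbits per parameter value -> 5,000 labelled point clouds.
dataset = [(rho, generate_orbit(rho, rng=rng)) for rho in rhos for _ in range(1000)]
print(len(dataset), dataset[0][1].shape)         # 5000 (1000, 2)
```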
The shape of the point cloud depends heavily on the parameter ρ. The orbits are transformed into persistence diagrams by computing the alpha complex filtration in dimensions 0 and 1 [CCI+20]. The classification problem is to recover the parameter ρ ∈ {2.5, 3.5, 4.0, 4.1, 4.3} from the given persistence diagrams. In addition to the ORBIT5k dataset, we consider the ORBIT100k dataset, containing 20,000 orbits per parameter instead of 1,000, for a total of 100,000 orbits. Both datasets are split in a ratio of 70:30 into training and validation sets.

Besides classical kernel methods, we compare the performance of our Persformer model with the current state-of-the-art models: the PersLay model [CCI+20] and the PLLay model [KKZ+20]. Since PLLay uses both the persistence diagrams and the raw point clouds as input, we also trained a Persformer-like model using only the raw point clouds as input, achieving a test accuracy of 99.1%, an increase of 4.1 percentage points compared to PLLay. We repeated all experiments five times and report the average performance on a holdout test dataset. A detailed comparison with all the models is given in Table 1.

Compared to previous methods [CCI+20, KKZ+20], our model does not make use of a handcrafted vectorization of persistence diagrams in the first layer of the neural network, which allows Persformer to satisfy the universal approximation property in the sense of Section 3.2. This allows us to identify the points that are important for the classification by means of saliency maps, an already well-known interpretability method. We show with concrete datasets (ORBIT5k, ORBIT100k and a curvature dataset [BHPW20]) that, contrary to the common view, "small bars", which are unstable features of persistence diagrams with respect to the bottleneck or Wasserstein distances, are also essential predictors for classification and regression problems, as already identified in [BHPW20].

Saliency Maps. The Persformer model for a classification problem is an almost everywhere differentiable function F : D → R^m, where m is the number of classes and D is the space of persistence diagrams. It maps a persistence diagram to the logarithm of the class probabilities. Let d be the maximum homology dimension under consideration, let x = (x_k)_{k∈{1,...,n}} ∈ (R^{2+d})^n be a persistence diagram, and let i(x) = argmax_j F(x)_j. The first two coordinates of x_k ∈ R^{2+d} are the birth and death coordinates, and the last d coordinates are the one-hot encoded homology dimension. The saliency map of F on x is defined as

    S_F(x) := ( ‖ ∂F(x)_{i(x)} / ∂x_k ‖_2 )_{k∈{1,...,n}} ∈ R_{≥0}^n.

Therefore, S_F assigns to each point of a persistence diagram a non-negative real value indicating how important this point is for the classification.

Experiments. For the classification problem on ORBIT5k described in Section 4.5, we largely observe the traditional TDA motto that the important features of a persistence diagram are the most persistent ones: most points close to the diagonal have an almost zero saliency value, see Figure 2. What stands out is that many stable features have a high saliency map score, which explains the good performance of the classical vectorization methods. However, there are also points in the persistence diagram close to the diagonal with a high saliency map score, i.e., they correspond to "short bars" but are very influential for the class prediction. Therefore, these points are important for the classification and explain why our model outperforms previous ones.
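Because the saliency score defined above is simply the norm of the gradient of the predicted class score with respect to each diagram point, it can be computed directly with automatic differentiation. The sketch below is ours, not the authors' code: `model` stands for any trained Persformer-like network that maps a batch of diagrams of shape (1, n, 2 + d) to class log-probabilities, and that interface is an assumption.

```python
# Minimal sketch of the saliency map computation for a trained model.
# `model` is a placeholder for a Persformer-like network taking a tensor of
# shape (1, n, 2 + d) -- (birth, death, one-hot homology dimension) per point --
# and returning class log-probabilities of shape (1, n_classes).
import torch

def saliency_map(model: torch.nn.Module, diagram: torch.Tensor) -> torch.Tensor:
    """Return one non-negative saliency score per point of `diagram` (shape (n, 2 + d))."""
    x = diagram.detach().clone().unsqueeze(0).requires_grad_(True)  # (1, n, 2 + d)
    log_probs = model(x)                              # F(x), shape (1, n_classes)
    i_x = log_probs.argmax(dim=-1).item()             # predicted class i(x)
    log_probs[0, i_x].backward()                      # gradient of F(x)_{i(x)} w.r.t. x
    return x.grad.squeeze(0).norm(dim=-1)             # norm per diagram point, shape (n,)
```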
It is also remarkable that extreme points of the persistence diagram have high saliency map scores, while H_0 features have very low saliency map scores.

In [BHPW20], the authors consider the following regression problem: sample 1,000 points uniformly at random on a surface of constant curvature K and predict K from the Vietoris-Rips persistence diagrams of the resulting point cloud. They give mathematical evidence that small bars in these persistence diagrams should be good predictors of the curvature, and they are able to train different machine learning models, such as support vector regression, with a good R^2-score. We reproduced the dataset of [BHPW20] and trained a regression Persformer model with the mean squared error loss to predict the curvature of a point cloud using the H_1 persistence diagrams as predictors. As a result, we obtain an R^2-score of 0.94, compared to an R^2-score of 0.78 using support vector regression [BHPW20]. We calculate the saliency map scores in the same way as for the previous dataset, see Figure 3. Some points in the persistence diagram that correspond to small bars have a very high saliency map score, which confirms the results of [BHPW20]. An interesting observation is that the saliency map score also seems to increase with the birth time.

The importance of "short bars" for the prediction is much more pronounced in this case than for the ORBIT5k dataset. Moreover, we can even show that the saliency map score decreases with the distance to the diagonal, see Figure 4 and Figure 5. To visualize the dependence of the saliency map score on the distance to the diagonal, we normalize the lifetimes and the saliency map scores per persistence diagram to the interval [0, 1]. We then distribute the points of the persistence diagram into the bins [0, 0.1], . . . , [0.9, 1.0] with respect to their lifetime, and consider the maximum and the sum of all saliency map scores per bin. Finally, we average the scores per bin over the entire test dataset. As one can clearly see, the saliency map score decreases with the distance to the diagonal, demonstrating that "short bars" are important for estimating the curvature of a dataset.

In this work, we have introduced Persformer, the first transformer architecture for learning on datasets of persistence diagrams. We showed that this architecture significantly outperforms previous ones on the usual synthetic benchmark datasets. In addition, Persformer is the first neural network architecture for persistence diagrams that is shown to satisfy a universal approximation theorem, enabling us to adapt saliency maps, a well-known interpretability method, to Persformer. To the best of our knowledge, this is the first interpretability method for topological machine learning. We expect it to be helpful as an exploratory method for understanding mechanisms where the shape of a dataset is essential, such as biology or material science.

Appendix A: Numerical instability of the ORBIT5k dataset. When preparing the benchmark datasets, we realised that the non-linear dynamical system used in the ORBIT5k dataset exhibits chaotic behaviour. In particular, the numerical error in the generation of the ORBIT5k dataset increases exponentially, see Figure 6 and Figure 7. Hence the topology of the orbits depends heavily on the floating point precision used.
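The precision sensitivity described in this appendix can be reproduced with a few lines: iterating the orbit recursion (as reconstructed in Section 4) from the same initial point in float32 and float64 makes the two trajectories drift apart. The parameter values and step counts below are illustrative only.

```python
# Sketch: sensitivity of the orbit recursion to floating point precision
# (recursion as reconstructed in the main text; values are illustrative).
import numpy as np

def orbit(rho, x0, y0, n_steps, dtype):
    one, rho, x, y = dtype(1.0), dtype(rho), dtype(x0), dtype(y0)
    traj = np.empty((n_steps, 2), dtype=dtype)
    for i in range(n_steps):
        x = (x + rho * y * (one - y)) % one
        y = (y + rho * x * (one - x)) % one
        traj[i] = (x, y)
    return traj

rho, x0, y0, n_steps = 4.1, 0.123, 0.456, 200
t32 = orbit(rho, x0, y0, n_steps, np.float32)
t64 = orbit(rho, x0, y0, n_steps, np.float64)
gap = np.abs(t32.astype(np.float64) - t64).max(axis=1)
for i in (10, 50, 100, 199):
    # The gap grows with the number of iterations until the trajectories decorrelate.
    print(f"step {i:3d}: max coordinate gap = {gap[i]:.3e}")
```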
We created a dataset with arbitrary precision, meaning a floating point precision high enough that a further increase in precision does not change the point cloud, and trained a model on a dataset created with float64 precision. The model generalized well to the dataset generated with arbitrary precision, with only a small difference (2.1 percentage points of accuracy) in test performance.

Acknowledgements. The authors would like to thank Kathryn Hess Bellwald for the many fruitful discussions and valuable comments. This work was supported by the Swiss Innovation Agency (Innosuisse project 41665.1 IP-ICT).

References
Persistent Homology Based Characterization of the Breast Cancer Immune Microenvironment: A Feasibility Study
Persistence images: A stable vector representation of persistent homology
Persistent homology detects curvature
Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners
Statistical topological data analysis using persistence landscapes
PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures
Sliced Wasserstein kernel for persistence diagrams
Sliced Wasserstein kernel for persistence diagrams
Extending Persistence Using Poincaré and Lefschetz Duality
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Visualising the Evolution of English Covid-19 Cases with Topological Data Analysis Ball Mapper
Computational Topology for Data Analysis
Deep learning
Neural Networks and Learning Machines. Pearson Education India
Learning Representations of Persistence Barcodes
Deep Learning with Topological Signatures
Kernel method for persistence diagrams via kernel embedding and weight factor
Persistence weighted gaussian kernel for topological data analysis
PLLay: Efficient Topological Layer based on Persistent Landscapes
Quantifying similarity of pore-geometry in nanoporous materials
Topological Uncertainty: Monitoring trained neural networks through persistence of activation graphs
Set transformer: A framework for attention-based permutation-invariant neural networks
Persistence fisher kernel: A Riemannian manifold kernel for persistence diagrams
The gudhi library: Simplicial complexes and persistent homology
Dionysus 2: a software to compute persistent homology
The space of persistence diagrams on n points coarsely embeds into Hilbert space
Escolar, and Yasumasa Nishiura. Persistent homology and many-body atomic structure for medium-range order in the glass
Topology of deep neural networks
Persistence Theory: From Quiver Representations to Data Analysis
Training Tips for the Transformer Model
Pointnet: Deep learning on point sets for 3d classification and segmentation
A stable multi-scale kernel for topological machine learning
Cliques of Neurons Bound into Cavities Provide a Missing Link between Structure and Function
Universal Approximation Power of Deep Neural Networks via Nonlinear Control Theory. CoRR, abs
Anibal Medina-Mardones, Alberto Dassatti, and Kathryn Hess. giotto-tda: A topological data analysis toolkit for machine learning and data exploration
Attention is All You Need
Scaling vision transformers
Deep sets