Elastic $k$-means clustering of functional data for posterior exploration, with an application to inference on acute respiratory infection dynamics

Xiao Zang, Sebastian Kurtek, Oksana Chkrebtii, J. Derek Tucker

November 24, 2020

Abstract. We propose a new method for clustering of functional data using a $k$-means framework. We work within the elastic functional data analysis framework, which allows for decomposition of the overall variation in functional data into amplitude and phase components. We use the amplitude component to partition functions into shape clusters using an automated approach. To select an appropriate number of clusters, we additionally propose a novel Bayesian Information Criterion defined using a mixture model on principal components estimated using functional Principal Component Analysis. The proposed method is motivated by the problem of posterior exploration, wherein samples obtained from Markov chain Monte Carlo algorithms are naturally represented as functions. We evaluate our approach using a simulated dataset, and apply it to a study of acute respiratory infection dynamics in San Luis Potosí, Mexico.

Bayesian inference is central in many modern applications, including geological engineering (Iglesias et al., 2014), biochemical kinetics (Wilkinson, 2011), studying the patterns of animal movement (McDermott et al., 2017), and many more. An important benefit of the Bayesian approach is that it flexibly accounts for uncertainty from different sources, where subjective belief about a set of unknown model components conditioned on observed data is expressed via posterior probabilities. Posterior uncertainty is often highly structured, such as when probable values of the unknown object lie on a low-dimensional manifold, exhibit complex correlation structure, or form multiple disjoint clusters.
Processing and visualizing such structures is critical to assessing our understanding of unknown model components, be they vectors of model parameters, functions over time, or surfaces and shapes. When inference is to be made on a limited number of scalar parameters and the probability mass of the posterior distribution is concentrated within a tractable domain, direct visualization can be achieved by density plots and contour plots using a grid approximation (Kruschke, 2014; Gelman et al., 2013). Most often, posterior samples are drawn by Markov chain Monte Carlo (MCMC) methods, and visualization is then done either through kernel density estimates (see, e.g., Rasmussen et al. (2014) and Rosales et al. (2004)) or via standard graphical representations of univariate and multivariate data, including one- and two-dimensional histograms and scatter plots (see, e.g., Zhu et al. (2018), Eisenkolb et al. (2019) and Baetica et al. (2016)). These visualization techniques help identify correlations and clusters in the samples. In addition, summaries such as the mean or quantiles, which can convey useful information about the posterior distribution such as its center and spread, are frequently estimated by the corresponding statistics computed from MCMC samples. For higher-dimensional parameter spaces, visualization becomes more difficult. Common practice in the Bayesian analysis literature is to apply the aforementioned visualization tools marginally, usually up to two dimensions at a time. However, it has been pointed out that this approach may fail to identify structures such as multimodality (Venna and Kaski, 2003). One can foresee that exploration of vector-valued MCMC samples can potentially benefit from more sophisticated techniques for visualizing multivariate data, many of which are surveyed in Liu et al. (2017) and Grinstein et al. (2001).
But, in modern application scenarios, the inferential object is often a function, such as the temporal evolution of a variable or an unknown initial condition for a partial differential equation model, where the infinite-dimensionality of the object under study poses an even greater challenge for posterior visualization and summarization. Unlike the large repertoire of methods for multivariate data, techniques to visualize a sample of functions are relatively limited. One possible approach is to map the data into a finite-dimensional space using dimensionality reduction, such as functional Principal Component Analysis (fPCA), and analyze the resulting fPCA coefficient vectors. Visualizations can then be mapped back to the original space to provide better interpretation in terms of the original functions (Hyndman and Shang, 2010; Tucker et al., 2013). Instead of resorting to dimension reduction, Sun and Genton (2011) construct functional boxplots using the notion of data depth for functional observations (López-Pintado and Romo, 2009). An alternative boxplot-type graphic that decomposes variability in observed functions into various components (not related to fPCA) was proposed by Xie et al. (2017). When dealing with samples of functions, even simple measures of center and dispersion are not trivial to obtain. For example, pointwise means, quantiles and standard deviations, despite being prevalent approaches to summarizing posterior sample functions, suffer from a commonly encountered issue in functional data analysis: the misalignment of features such as peaks and valleys. This is caused by two sources of variability that often jointly contribute to the overall variation of observed functions: vertical variability along the y-axis, referred to as amplitude variability, and lateral displacement along the x-axis, called phase variability (Marron et al., 2015).
The latter can alternatively be viewed as the variability induced by warping of the time axis.

Acute respiratory infections (ARI) are a public health concern around the world.

[Figure 1 caption fragment: (1)(b)-(c) Two clusters (single-mode cluster in blue and cluster with two modes in red) of functions that were optimally aligned within each cluster (solid black functions are estimated cluster templates). The number of clusters was automatically determined via a proposed modification of the Bayesian Information Criterion (BIC). (2)(a)-(c) Pointwise summaries (mean ± two standard deviations) for the functions shown in (1)(a)-(c), respectively.]

Key features of interest in this setting include (i) transmission rates, (ii) characteristics of the interaction between pathogens circulating in the same year, such as cross-immunity, and, importantly, (iii) the dynamics of the epidemic, such as the occurrence of epidemic peaks and their relative order during the year. The first two features correspond to multivariate inputs in commonly-used circulation models, and thus posterior visualization can be conducted using existing approaches. On the other hand, the time-evolution of infections has a functional structure and requires the development of appropriate posterior visualization techniques. A popular model structure for the evolution of infection states is the Susceptible-Infected-Recovered (SIR) framework, which describes transitions between infection states. Although a deterministic compartmental model can approximate reality well in some situations, stochastic SIR modeling better reflects the stochastic nature of disease transmission events (Allen, 2017). [...] San Luis Potosí, Mexico. More information about the model, data and statistical analysis is provided in García et al. (2017). It seems possible that the posterior sample is made up of at least two qualitatively different outbreak patterns, one with a single peak and one with two. These patterns, however, are not clearly discernible in the spaghetti plot due to the presence of overlapping misaligned functions.
The lower panel of Figure 2(a) shows pointwise summaries of the infection trajectories, directly applied to the posterior sample (top of panel (a)), where the possible presence of multiple posterior clusters is obscured, and a moderate level of distortion is present due to misalignment. The result given by our proposed approach, which is outlined in subsequent sections, is shown in panels (b) and (c). We identify two distinct clusters in the posterior sample corresponding to different temporal infection patterns. An additional benefit of such clustering is that alignment within each cluster results in improved pointwise summarization of the sample. More generally, similar issues arise for Bayesian models with function-valued unknowns that are not fully identified given the data. It is therefore important to devise a systematic analysis technique for functional posterior samples that can identify and distinguish disjoint clusters of trajectories while also resolving misalignment. For a sample of functions, such as ones generated via MCMC, we aim to achieve the following objectives prior to summarization and visualization of variability:

1. Identify distinct functional shapes within the sample;
2. Partition the sample into such shape clusters using an automated approach; and
3. Align functions within each cluster to separate amplitude and phase variabilities.

Visualization and pointwise summarization within each shape cluster provides a natural strategy for exploring posterior structure and variability. With this in mind, the present paper makes the following contributions to the literature:

1. We introduce a novel algorithm, referred to as elastic k-means, for clustering functions based on their shape;
2. We propose a criterion-based approach for automatic selection of the number of clusters in a sample of functions;
3.
We employ the proposed methodology to interpret the results of a Bayesian analysis of the time evolution of acute respiratory infections in San Luis Potosí, Mexico from García et al. (2017). The performance of the proposed approach is also evaluated via multiple simulation studies. The rest of the paper is organized as follows. Section 2 briefly reviews background material on elastic functional data analysis and discusses existing methods for functional data clustering. Section 3 outlines the proposed elastic k-means approach and defines a model-based criterion for automatic selection of the number of clusters. Section 4 describes simulation studies that validate the proposed approach and compare its performance to that of other applicable methods. The application of our approach to the aforementioned ARI posterior samples is presented in Section 5. Section 6 provides a short discussion and describes directions for future work. We begin with a brief description of function alignment, and a technical review of existing methods for clustering of functional data. Under the assumption that no clustering structure exists in a sample of functions, the three objectives outlined in Section 1.2 reduce to the single task of function alignment, also commonly referred to as registration. Thus, we begin by formally defining the problem of function alignment, and highlight a framework that is later used to define the proposed elastic k-means clustering approach. Given two functions $f_1$ and $f_2$ belonging to a function space $\mathcal{F}$ equipped with a distance metric $d(\cdot,\cdot)$, pairwise alignment seeks an optimal warping $\gamma^*$, within some set of warping functions $\Gamma$, that minimizes this distance, i.e., $\gamma^* = \arg\min_{\gamma \in \Gamma} d(f_1, f_2 \circ \gamma)$, where $f \circ \gamma$ denotes function composition. For the alignment problem to be well-defined, the triple $(\mathcal{F}, d, \Gamma)$ must satisfy the isometry property: for any $f_1, f_2 \in \mathcal{F}$ and $\gamma \in \Gamma$, $d(f_1, f_2) = d(f_1 \circ \gamma, f_2 \circ \gamma)$.
That is, the distance between two functions should be unchanged by simultaneous warping. A common approach in the literature uses $\mathcal{F} = L^2([0,1])$, $d(f_1, f_2) = \|f_1 - f_2\| := \left(\int_0^1 |f_1(t) - f_2(t)|^2\,dt\right)^{1/2}$, and $\Gamma = \{\gamma: [0,1] \to [0,1] \mid \gamma(0) = 0,\ \gamma(1) = 1,\ \dot\gamma > 0\}$, where $\dot\gamma$ is the derivative of $\gamma$. Unfortunately, these choices do not result in isometry under warping, leading to the so-called pinching effect and registration asymmetry (Marron et al., 2015; Srivastava and Klassen, 2016). Examples of methods that use the $L^2$ metric for registration include Ramsay and Li (1998) and Liu and Müller (2004). Alternatively, Srivastava, Klassen, Joshi, and Jermyn (2011) and Srivastava, Wu, Kurtek, Klassen, and Ma (2011) consider $\mathcal{F} = \{f: [0,1] \to \mathbb{R}^m \mid f \text{ is absolutely continuous}\}$ equipped with an elastic Riemannian metric closely related to the well-known Fisher-Rao (FR) metric for probability density functions (Rao, 1945), and $\Gamma$ defined as in the previous paragraph. While these choices result in isometry under warping, the resulting distance is difficult to use in practice and requires computationally intensive numerical algorithms for registration (Mio et al., 2007). Instead, one can simplify the problem via a transformation of the original data as follows. Define the square-root velocity function (SRVF) as $q := \dot f / \sqrt{|\dot f|}$, where $|\cdot|$ denotes the $\ell^2$-norm ($q := 0$ when $|\dot f| = 0$). The space of SRVFs corresponding to $\mathcal{F}$ is $\mathcal{Q} = L^2([0,1])$, and the aforementioned complicated elastic Riemannian metric on $\mathcal{F}$ simplifies to the standard $L^2$ metric on $\mathcal{Q}$, while retaining all of its theoretical properties. Consequently, the distance between two functions $f_1$ and $f_2$ can be computed via the $L^2$ distance between their SRVFs $q_1$ and $q_2$. A warping of a function $f$, $f \circ \gamma$, results in a transformation of its SRVF $q$ given by $(q, \gamma) := (q \circ \gamma)\sqrt{\dot\gamma}$. Let $[q] = \{(q, \gamma) \mid \gamma \in \Gamma\}$ denote the orbit of an SRVF under warping.
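The SRVF transform and the resulting $L^2$ comparison are easy to sketch numerically. The following is a minimal numpy illustration, not the paper's implementation: finite differences stand in for the derivative, and the two example functions are our own.

```python
import numpy as np

def srvf(f, t):
    """Square-root velocity function q = f' / sqrt(|f'|), with q = 0 where f' = 0.
    Derivatives are approximated by finite differences on the grid t."""
    fdot = np.gradient(f, t)
    mag = np.abs(fdot)
    # guard the division; where mag == 0 the SRVF is defined to be 0
    return np.where(mag > 0, fdot / np.sqrt(np.maximum(mag, 1e-12)), 0.0)

t = np.linspace(0.0, 1.0, 501)
f1 = np.exp(-0.5 * ((t - 0.4) / 0.05) ** 2)
f2 = np.exp(-0.5 * ((t - 0.6) / 0.05) ** 2)   # same shape, shifted peak

q1, q2 = srvf(f1, t), srvf(f2, t)
# L2 distance between the SRVFs (simple Riemann-sum quadrature)
d = np.sqrt(np.sum((q1 - q2) ** 2) * (t[1] - t[0]))
```

Because the two functions differ only by a shift, this unwarped SRVF distance is strictly positive; registration over $\Gamma$ (next section) is what removes the phase difference.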
Pairwise registration refers to minimizing $\|q_1 - (q_2, \gamma)\|$ over $\gamma \in \Gamma$, which leads to a distance between two orbits, $[q_1]$ and $[q_2]$, defined as $d_a([q_1], [q_2]) := \min_{\gamma \in \Gamma} \|q_1 - (q_2, \gamma)\|$. Such a distance provides a measure of similarity between the shapes of two functions: the orbit defines the function's amplitude, and the distance is then the amplitude distance. The optimal warping $\gamma^*$ can be identified using the dynamic programming algorithm (Robinson, 2012) or Riemannian optimization (Huang et al., 2016). We use $f_2^* = f_2 \circ \gamma^*$ to denote the function $f_2$ after optimally aligning it to $f_1$ (corresponding to $q_2^* = (q_2, \gamma^*)$). Multiple alignment is generally formulated with respect to a template, usually defined as the sample mean. Let $f_1, \dots, f_N$ denote the observed functions, and $q_1, \dots, q_N$ their SRVFs. Then, under the SRVF representation, the sample mean amplitude orbit is defined via the amplitude distance as $[\bar q] := \arg\min_{[q]} \sum_{i=1}^N d_a([q], [q_i])^2$; note that the amplitude distance involves pairwise registration of each function in the sample to the estimated template. In practice, one must choose a single representative element in this orbit, $\bar q \in [\bar q]$. This is done via a centering step, which ensures that the average of warping functions estimated using pairwise alignment of the data to $\bar q$ is the identity warping $\gamma_{id}(t) = t$. Thus, the algorithm for multiple alignment can be viewed as a two-step process: (1) estimate the sample mean amplitude and select a representative element in its orbit, and (2) perform pairwise alignment of each function in the data to this representative element. The entire procedure results in the template $\bar f$ (corresponding to $\bar q$), optimal warping functions $\{\gamma_1^*, \dots, \gamma_N^*\}$, and registered functions $\{f_1^*, \dots, f_N^*\}$. Clustering of functional data has received great interest in recent years, and many different methods have been developed for this task; see the survey by Jacques and Preda (2014) and references therein.
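Exact pairwise registration uses dynamic programming over all of $\Gamma$. As a hedged illustration of the group action $(q, \gamma) = (q \circ \gamma)\sqrt{\dot\gamma}$ and of the minimization, the sketch below searches only a one-parameter exponential family of warpings; this simplification is our own, not the paper's algorithm.

```python
import numpy as np

def srvf(f, t):
    fdot = np.gradient(f, t)
    mag = np.abs(fdot)
    return np.where(mag > 0, fdot / np.sqrt(np.maximum(mag, 1e-12)), 0.0)

def l2(q, t):
    """L2 norm on [0, 1] via a Riemann sum."""
    return np.sqrt(np.sum(q ** 2) * (t[1] - t[0]))

def register(q1, q2, t, avals=np.linspace(-3.0, 3.0, 121)):
    """Crude stand-in for dynamic programming: minimize ||q1 - (q2, gamma)||
    over the family gamma_a(t) = (e^{a t} - 1) / (e^a - 1), identity at a = 0."""
    best_d, best_gamma, best_q = np.inf, t, q2
    for a in avals:
        if abs(a) < 1e-8:
            gamma, gdot = t, np.ones_like(t)
        else:
            gamma = (np.exp(a * t) - 1.0) / (np.exp(a) - 1.0)
            gdot = a * np.exp(a * t) / (np.exp(a) - 1.0)   # positive for all a
        q2w = np.interp(gamma, t, q2) * np.sqrt(gdot)      # group action (q2, gamma)
        d = l2(q1 - q2w, t)
        if d < best_d:
            best_d, best_gamma, best_q = d, gamma, q2w
    return best_d, best_gamma, best_q

t = np.linspace(0.0, 1.0, 401)
f1 = np.exp(-0.5 * ((t - 0.45) / 0.06) ** 2)
f2 = np.exp(-0.5 * ((t - 0.55) / 0.06) ** 2)
q1, q2 = srvf(f1, t), srvf(f2, t)

d_id = l2(q1 - q2, t)                      # distance without warping
d_star, gamma_star, _ = register(q1, q2, t)
```

Since the identity is in the search family, $d^*$ can never exceed the unwarped distance, and here the warp that moves 0.55 toward 0.45 makes it strictly smaller.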
The most naive approach is to view $f_i([t])$, the values of a function $f_i$ evaluated at $T$ points along its domain, $[t] = (t_1, \dots, t_T)^\top$, as multivariate data. Then, one can directly apply standard methods for clustering vector-valued observations; many such methods exist in the literature (Xu and Tian, 2015; Saxena et al., 2017), including k-means clustering (Forgy, 1965), hierarchical clustering (Ward, 1963), and Gaussian mixture models (GMM) (Banfield and Raftery, 1993), among others. However, treating discretized functions as vectors ignores the temporal dependence inherent in functional data. Such an approach is also inappropriate when the recorded time points differ across functions, which is the case in many applications. An alternative approach is to reduce the dimension by representing functional data using basis function coefficients. One example is Abraham et al. (2003), who performed k-means clustering on B-spline basis coefficients estimated separately for each function via least squares. James and Sugar (2003) also utilized basis functions, but proposed a model-based framework assuming that the basis coefficients come from a GMM, with each $f_i([t])$, observed on a sparse grid of time points and conditioned on the basis coefficients, following a multivariate Gaussian distribution. Distance-based clustering methods generally do not rely on basis expansions, and can be easily applied to cluster functional data. An example is Tarpey and Kinateder (2003), who used the $L^2$ distance in a functional k-means clustering algorithm. None of the aforementioned approaches consider separation of amplitude and phase variabilities in functional data, i.e., they assume that the data is already aligned or that phase variability is negligible. However, in many applications, as extensively evidenced in Srivastava and Klassen (2016), functional data analysis benefits from separation of amplitude and phase.
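As a point of reference for the naive approach above, treating discretized functions as row vectors and running plain k-means looks as follows. This is a self-contained Lloyd's-algorithm sketch with random restarts; the two shape groups are illustrative, and no alignment is performed.

```python
import numpy as np

def kmeans(X, K, n_iter=50, n_restarts=10):
    """Plain Lloyd's algorithm on row vectors, with random restarts;
    the partition with the lowest within-cluster cost is retained."""
    best_cost, best_labels = np.inf, None
    for r in range(n_restarts):
        rng = np.random.default_rng(r)
        centers = X[rng.choice(len(X), K, replace=False)].copy()
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(1)
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(0)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        cost, labels = d2.min(1).sum(), d2.argmin(1)
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels

# two shape groups, discretized on a common grid and treated as vectors
rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)
phi = lambda mu, s: np.exp(-0.5 * ((t - mu) / s) ** 2)
single = np.array([rng.normal(1.0, 0.05) * phi(0.5, 0.08) for _ in range(10)])
double = np.array([rng.normal(1.0, 0.05) * (phi(0.3, 0.06) + phi(0.7, 0.06))
                   for _ in range(10)])
X = np.vstack([single, double])
labels = kmeans(X, K=2)
```

With well-separated, pre-aligned shapes this works; the text's point is that it degrades once phase variability mixes the discretized vectors.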
Compared to the abundance of methods for clustering functional data without alignment, the literature on methods that find amplitude clusters through function alignment is relatively scarce. Liu and Yang (2009) include time shifts as incomplete data in model-based clustering, but applications of this method are limited due to a very simple warping structure. Sangalli et al. (2010) present the first attempt to extend k-means clustering to the problem of finding amplitude clusters. The function space considered in their paper is $\mathcal{F} = \{f \in L^2(\mathbb{R}) \mid \dot f \in L^2(\mathbb{R}),\ |\dot f| \neq 0\}$, together with an associated metric [formula lost in extraction]. In order to satisfy the isometry property under the chosen metric, their approach restricts warpings to the set of strictly increasing affine transformations, and their algorithm minimizes a k-means-type criterion over the orbits $[f]$ of functions under the action of $\Gamma$ on $\mathcal{F}$, with respect to cluster template functions $\mu := (\mu_1(t), \dots, \mu_K(t))$ and cluster assignments $\delta := (\delta_1, \dots, \delta_N)$, where $\delta_i \in \{1, \dots, K\}$. Since this approach is most closely related to the proposed elastic k-means algorithm, we use it as a state-of-the-art benchmark comparison in Section 4. Clustering via k-means with elastic alignment, using the triple $(\mathcal{F}, d, \Gamma)$ as specified by Srivastava and Klassen (2016), can be viewed as a generalization of multiple function alignment (Section 2.1.2) that allows $K$ different templates. Formally, we aim to minimize the cost function $L(\eta, \delta) := \sum_{i=1}^N d_a([\eta_{\delta_i}], [q_i])^2$ over cluster templates $\eta_k$ (represented using the SRVF), $k = 1, \dots, K$, and cluster assignments $\delta_i \in \{1, \dots, K\}$, $i = 1, \dots, N$, where $d_a$ denotes the amplitude distance, i.e., the $L^2$ distance between SRVF orbits, as defined in Section 2.1.1. The proposed clustering algorithm is built in similar fashion to that of Sangalli et al. (2010). In particular, it iterates through four main steps: (1) alignment, (2) assignment of each function to a cluster, (3) centering within orbits for each cluster, and (4) estimation of cluster templates.
The alignment step performs pairwise registration of each function in the given data to each cluster template using the elastic distance. The assignment step assigns each function to a cluster based on the minimum amplitude distance between that function and the cluster templates. As in multiple alignment, the centering step ensures that the average of warping functions estimated by pairwise alignment of the data within each cluster to the estimated cluster template is the identity warping $\gamma_{id}(t) = t$. Finally, template estimation for each cluster is straightforward, as it is based on the cross-sectional mean of the SRVFs assigned to that cluster. The details of the full elastic k-means algorithm are as follows; we assume that the observed data $f_1, \dots, f_N$ has first been transformed into the corresponding SRVFs $q_1, \dots, q_N$. [The algorithm listing did not survive extraction; iterations terminate when the decrease in the cost function falls below a small threshold $\epsilon > 0$.] In each iteration, Step 2(a) finds all pairwise amplitude distances between orbits of cluster templates and orbits of observed functions. Then, Step 2(b) minimizes $L(\eta, \delta)$ with respect to $\delta$. Step 2(c) does not change the value of the cost function, and simply ensures that the cluster templates are identifiable. At the end, Step 2(d) further reduces the cost function; here $|M_k^{(n)}|$ denotes the number of functions in cluster $k$ after iteration $n$. Therefore, each full iteration, composed of Steps 2(a)-(d), decreases the cost function, which is bounded below by zero; thus, the proposed elastic k-means algorithm is guaranteed to converge. To avoid empty clusters after Step 2(b), the proposed algorithm uses linear programming when determining cluster assignments (Bradley et al., 2000). The elastic k-means clustering method, like most other k-means algorithms, requires that the number of amplitude clusters $K$ be known a priori.
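Putting the four steps together, a deliberately simplified sketch of one elastic k-means run is shown below. The crude one-parameter registration replaces dynamic programming, templates are initialized deterministically, and the centering step and linear-programming safeguard are omitted; all of these are our own simplifications for illustration.

```python
import numpy as np

def srvf(f, t):
    fdot = np.gradient(f, t)
    mag = np.abs(fdot)
    return np.where(mag > 0, fdot / np.sqrt(np.maximum(mag, 1e-12)), 0.0)

def register(q_ref, q, t, avals=np.linspace(-2.0, 2.0, 81)):
    """Crude stand-in for dynamic programming over the warping family
    gamma_a(t) = (e^{a t} - 1) / (e^a - 1); returns (distance, warped SRVF)."""
    best = (np.inf, q)
    for a in avals:
        if abs(a) < 1e-8:
            gamma, gdot = t, np.ones_like(t)
        else:
            gamma = (np.exp(a * t) - 1.0) / (np.exp(a) - 1.0)
            gdot = a * np.exp(a * t) / (np.exp(a) - 1.0)
        qw = np.interp(gamma, t, q) * np.sqrt(gdot)
        d = np.sqrt(np.sum((q_ref - qw) ** 2) * (t[1] - t[0]))
        if d < best[0]:
            best = (d, qw)
    return best

def elastic_kmeans(F, t, K, n_iter=5):
    """One simplified run: (a) align every function to every template,
    (b) assign by minimum amplitude distance, (d) update each template as the
    cross-sectional mean of its aligned SRVFs. Step (c) is omitted here."""
    Q = np.array([srvf(f, t) for f in F])
    templates = Q[[0, len(Q) - 1]].copy() if K == 2 else Q[:K].copy()
    for _ in range(n_iter):
        D = np.empty((len(Q), K))
        A = np.empty((len(Q), K, len(t)))
        for i in range(len(Q)):
            for k in range(K):
                D[i, k], A[i, k] = register(templates[k], Q[i], t)
        labels = D.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                templates[k] = A[labels == k, k].mean(0)
    return labels

# single-peak vs. two-peak shapes with small random time shifts
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 201)
phi = lambda mu: np.exp(-0.5 * ((t - mu) / 0.05) ** 2)
F = [phi(0.5 + rng.uniform(-0.03, 0.03)) for _ in range(8)]
F += [phi(0.3 + rng.uniform(-0.03, 0.03)) + phi(0.7 + rng.uniform(-0.03, 0.03))
      for _ in range(8)]
labels = elastic_kmeans(np.array(F), t, K=2)
```

Even this stripped-down version separates the two shape classes, because the amplitude distance between a single-peak and a two-peak SRVF stays large under any warping in the search family.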
However, the choice of $K$ is difficult in most scenarios, including the motivating application of exploring variability in Bayesian posterior sample functions. To alleviate this issue, we propose a model-based method for selecting $K$ based on the Bayesian Information Criterion (BIC). In particular, we use a combination of dimension reduction, a Gaussian mixture model (GMM) and the BIC. For clarity, we describe our approach for one-dimensional functional data only, and note that the case of higher-dimensional data can be handled with minor adjustments. We begin by applying elastic k-means clustering to the given data for $K \in \{1, \dots, K_{max}\}$, where $K_{max}$ is the maximum number of clusters; the choice of $K_{max}$ is user and application specific. For the result of elastic k-means clustering with the number of clusters set to $K$, let $q^*_{Ki}$ denote the SRVF of the $i$th function aligned to its corresponding cluster template, and $\delta_{Ki} \in \{1, \dots, K\}$ denote its cluster membership. Then, $M_{Kk} := \{i \in 1, \dots, N \mid \delta_{Ki} = k\}$ is the set of indices of the functions belonging to the $k$th cluster. Within each cluster, we perform fPCA on the aligned functions (amplitude component) $\{q^*_{Ki}\}_{i \in M_{Kk}}$ as follows (Tucker et al., 2013). Let $q^*_{Ki}([t])$ denote a vector (of size $1 \times T$) of evaluations of the function $q^*_{Ki}$ at $T$ equally-spaced time points; the width of the time intervals used for discretization will be denoted by $\Delta t$. Then, $Q_{Kk}$ is a $|M_{Kk}| \times T$ matrix whose rows are the centered vectors $q^*_{Ki}([t])$, $i \in M_{Kk}$. From its singular value decomposition $Q_{Kk} = U_{Kk} S_{Kk} V_{Kk}^\top$ with singular values $s_{Kkj}$, the quantity $s_{Kkj}^2 \Delta t / (|M_{Kk}| - 1)$ is the variance of the $j$th PC. Also, the $j$th column of $W_{Kk} = \frac{1}{\sqrt{\Delta t}} V_{Kk}$, denoted by $w_{Kkj}$, is the discretized weight function corresponding to the $j$th PC. Then, the dimension $d$ (number of PCs) used for specifying the GMM and computing the BIC is set to the minimum number of PCs needed to explain at least $(\rho \times 100)\%$ of variability in any of the clusters across all values of $K$. Note that this choice ensures that the dimension is the same across all $K = 1, \dots, K_{max}$.
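The clusterwise fPCA step can be sketched with a plain SVD. In the snippet below, the synthetic "aligned SRVFs" are our own illustrative data, and the variance scaling follows the reconstruction of the definitions above.

```python
import numpy as np

# synthetic "aligned SRVFs" for one cluster: n functions on T grid points,
# built from two basis shapes so the true dimension is 2
rng = np.random.default_rng(1)
n, T = 40, 200
dt = 1.0 / (T - 1)
t = np.linspace(0.0, 1.0, T)
scores = rng.standard_normal((n, 2)) * np.array([3.0, 1.0])
basis = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
Q = scores @ basis + 0.01 * rng.standard_normal((n, T))

Qc = Q - Q.mean(0)                          # center within the cluster
U, s, Vt = np.linalg.svd(Qc, full_matrices=False)
pc_var = s ** 2 * dt / (n - 1)              # variance carried by each PC
W = Vt.T / np.sqrt(dt)                      # discretized weight functions (columns)
coeff = Qc @ W * dt                         # PC coefficients <q_i - qbar, w_j>

# choose d: smallest number of PCs explaining at least rho of the variance
rho = 0.95
cum = np.cumsum(pc_var) / pc_var.sum()
d = int(np.argmax(cum >= rho)) + 1
```

With a roughly 90/10 split of variance across the two components, the first PC alone falls short of the 95% threshold, so `d` comes out as 2, and the coefficient columns are uncorrelated by construction.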
To reduce dimension, we represent each function using the vector of its first $d$ PC coefficients, $c_{Ki} \in \mathbb{R}^d$. We model these coefficient vectors using a $K$-component GMM with density $\sum_{k=1}^K \alpha_{Kk}\, \phi_d(c;\, \mu_{Kk}, \Sigma_{Kk})$, where $\alpha_{Kk}$, $\mu_{Kk}$ and $\Sigma_{Kk}$ are the mixture probability, mean vector and covariance matrix of the $k$th cluster, respectively. The log-likelihood is given by $\ell_K = \sum_{i=1}^N \log\left(\sum_{k=1}^K \alpha_{Kk}\, \phi_d(c_{Ki};\, \mu_{Kk}, \Sigma_{Kk})\right)$. Since PC coefficients are uncorrelated within each cluster, the clusterwise covariance matrices are diagonal. The Maximum Likelihood Estimates (MLEs) of all parameters are available in closed form given the cluster assignments: $\hat\alpha_{Kk} = |M_{Kk}|/N$, $\hat\mu_{Kkj}$ is the sample mean of the $j$th PC coefficient in cluster $k$ (where $\mu_{Kkj}$ is the $j$th element of $\mu_{Kk}$), and the diagonal entries of $\hat\Sigma_{Kk}$ are the corresponding sample variances. This results in $(2d+1)K - 1$ estimated parameters. Therefore, the BIC for the number of clusters $K$, based on this GMM, is given by
$$\mathrm{BIC}(K) = -2\,\hat\ell_K + \big((2d+1)K - 1\big)\log N. \quad (1)$$
The number of clusters used in the final analysis is then chosen as the value of $K$ that yields the lowest BIC. To evaluate the performance of the proposed elastic k-means approach, we simulate a series of functional samples with known ground truth cluster partitions. We assess clustering performance via the popular Rand index (Hubert and Arabie, 1985). The index is bounded above by 1, which indicates a perfect match between generated and ground truth cluster partitions, while a value close to 0 indicates a near-random partitioning. In the first set of simulations, we consider one-dimensional functional data. We define each simulated function $g_i(t)$ as the sum of $p_i$ Gaussian kernels with random amplitude variability induced by $z_{ij} \overset{iid}{\sim} N(1, \tau)$:
$$g_i(t) = \sum_{j=1}^{p_i} z_{ij}\, \phi(t;\, \mu_j, \sigma),$$
where $\phi(t; \mu, \sigma) = \exp\left(-\frac{1}{2\sigma^2}(t - \mu)^2\right)$ and $p_i = k$ with probability $\frac{1}{K^*}$ for $k = 1, \dots, K^*$. Here, $K^*$ denotes the total number of clusters. Intuitively, when $K^* = 1$ the sample consists only of single-peak functions, when $K^* = 2$ it contains single-peak and two-peak functions, and so on. The functions $f_i$ used for clustering are generated by applying a random warping to each $g_i$. We compare the proposed approach to (1) the k-means alignment (KMA) method of Sangalli et al. (2010), (2) k-means applied to the unaligned, discretized functions, and (3) k-means applied to discretized functions aligned using the method described in Section 2.1.2. Note that approaches (2) and (3) essentially view the functional data after discretization as multivariate data.
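Returning to the selection criterion, a sketch of the BIC in Eq. (1) is given below, assuming hard cluster labels and the closed-form clusterwise MLEs; the well-separated toy coefficient vectors are our own.

```python
import numpy as np

def gmm_bic(C, labels, K):
    """BIC of a diagonal-covariance GMM whose parameters are set to the
    clusterwise MLEs implied by the (hard) cluster labels."""
    N, d = C.shape
    logpdf = np.empty((N, K))
    for k in range(K):
        Ck = C[labels == k]
        alpha = len(Ck) / N
        mu = Ck.mean(0)
        var = Ck.var(0) + 1e-10               # MLE variances (jittered for stability)
        logpdf[:, k] = (np.log(alpha)
                        - 0.5 * np.sum(np.log(2.0 * np.pi * var))
                        - 0.5 * (((C - mu) ** 2) / var).sum(1))
    m = logpdf.max(1, keepdims=True)          # log-sum-exp over mixture components
    loglik = float((m[:, 0] + np.log(np.exp(logpdf - m).sum(1))).sum())
    n_params = (2 * d + 1) * K - 1            # (K-1) weights + dK means + dK variances
    return -2.0 * loglik + n_params * np.log(N)

# two well-separated clusters of PC coefficients in d = 2 dimensions
rng = np.random.default_rng(5)
C = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
truth = np.repeat([0, 1], 100)

bic1 = gmm_bic(C, np.zeros(200, dtype=int), K=1)
bic2 = gmm_bic(C, truth, K=2)
```

For this data the likelihood gain from the two-component model dwarfs the extra-parameter penalty, so the criterion correctly prefers $K = 2$.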
For KMA, we use the function kma() in the R package fdakma with default parameter settings. For elastic k-means, and methods (2) and (3), we repeat the clustering procedure 10 times with random initializations, and retain the result with the lowest value of the cost function. For KMA, we performed the clustering only once due to its high computational cost. Table 1 reports clustering accuracy for each considered method; we report the average Rand index (with standard deviations in parentheses) across 50 replicates. In this simulation, elastic k-means perfectly matched the ground truth partitions for almost all of the simulation settings and replicates, outperforming the other methods tested. The accuracy of KMA was poor overall, with performance similar to standard k-means on the unaligned, discretized functions. Interestingly, k-means applied to the aligned, discretized functions performed well when within-cluster amplitude variability was small, and performed poorly otherwise. We examined the results in detail for one of the simulated datasets with $N = 120$, $\tau = 0.1$ and $K^* = 3$. Figure 4 shows the observed functions; it is difficult to extract any meaningful information by visual inspection of the spaghetti plot. Figure 5 displays clustering results computed using the four aforementioned methods. It is clear that the poor clustering accuracy of KMA is largely attributable to poor alignment, due to the lack of flexibility in the warping functions considered by that method. Applying k-means to the unaligned, discretized functions resulted in mixed clusters, as expected, while the partition obtained by k-means on the aligned, discretized functions was likewise consistent with the accuracies reported in Table 1. The results reported so far were computed under the assumption that the true number of clusters $K^*$ is known. However, in practice, the number of clusters must also be inferred from the given data. For the elastic k-means approach, we use the proposed BIC criterion to select the number of clusters.
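The one-dimensional data-generating process used in these simulations (sums of Gaussian kernels with $N(1, \tau)$ amplitudes followed by a random warping) can be sketched as follows; the kernel centers and widths and the warping family are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 200)
phi = lambda tt, mu, s: np.exp(-0.5 * ((tt - mu) / s) ** 2)

def simulate(N, K_star, tau, sigma=0.05):
    """Sample N functions: p_i ~ Uniform{1, ..., K*} Gaussian kernels with
    N(1, tau) amplitudes, then a random warping from an exponential family."""
    centers = np.linspace(0.25, 0.75, K_star) if K_star > 1 else np.array([0.5])
    F, n_peaks = [], []
    for _ in range(N):
        p = int(rng.integers(1, K_star + 1))           # p_i = k w.p. 1/K*
        z = rng.normal(1.0, tau, size=p)               # random amplitudes
        g = sum(z[j] * phi(t, centers[j], sigma) for j in range(p))
        a = rng.uniform(-1.0, 1.0)                     # random warping strength
        gamma = t if abs(a) < 1e-8 else (np.exp(a * t) - 1.0) / (np.exp(a) - 1.0)
        F.append(np.interp(gamma, t, g))               # f_i = g_i o gamma
        n_peaks.append(p)
    return np.array(F), np.array(n_peaks)

F, truth = simulate(N=120, K_star=3, tau=0.1)
```

The ground-truth partition is given by the number of peaks `truth`, against which the Rand index of any clustering can be computed.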
Thus, we next assess the effectiveness of this criterion; to reduce dimension when computing the BIC, we set $\rho = 0.95$. The minimum and maximum numbers of clusters allowed are set to 1 and 6, respectively. The data-generating process in this simulation is the same as in the previous one. Figure 7 shows the average BIC for different true values of $K^*$ across 50 replicates (with standard deviations shown as error bars). When the true number of clusters $K^* > 1$, the proposed criterion always chooses the correct number of clusters. The corresponding error bars are also narrow, providing further confidence in the results. The task becomes more challenging when $K^* = 1$, i.e., when there is no clustering structure in the data. [...] The proposed approach is also applicable to clustering of vector-valued functional data. To illustrate its performance relative to existing approaches, a similar simulation study was conducted for two-dimensional functional data, as described in Supplementary Material Section ??. Another simulation study, which is more closely linked to our motivating application, is presented in Supplementary Material Section ?? and relates to posterior visualization of dynamical system states. We applied elastic k-means to the problem of clustering posterior sample states of the FitzHugh-Nagumo system of ordinary differential equations. Although in this case a ground truth clustering is not available, the performance of the proposed approach was assessed visually relative to clustering obtained for a model parameter. [...] One cluster contains a single infection peak, while the other shows a first major peak followed by a smaller second peak (Figures 2, 9). García et al. (2017) jointly modelled the interaction of influenza and RSV infection states, resulting in posterior infection trajectories that could be disaggregated by pathogen.
We therefore inspected these disaggregated trajectories in each cluster (Figure 9) and found that the initial major ARI outbreak in both clusters could be attributed to a peak of RSV infection early in the epidemic year. The minor second ARI peak in the two-outbreak cluster was due either to a small influenza infection peak that occurred much later in the epidemic year and is well-separated from the RSV infection peak, or to an influenza peak that is relatively close to the RSV peak in timing but larger in magnitude. Importantly, this two-cluster structure could be identified neither visually from the spaghetti plot, nor from the pointwise summaries of functions computed without clustering and alignment (lower panel of Figure 2). [...] In two of these clusters, influenza and RSV peaks occur at slightly different times with large, but partial, overlap and with different relative magnitudes. In the remaining cluster, the peak infections are completely synchronized, with the RSV peak being much higher than the influenza peak (Figure 12). [...] For year 2002-03, only a single cluster was found in the disaggregated influenza-RSV trajectories (Supplementary Figures ??, ??), which may seem discordant with the two distinct overall ARI clusters shown in Figures 2 and 9. The lack of clustering structure in the disaggregated functions is largely due to the continuous spectrum of differences between the influenza and RSV infection peaks. Aggregation of those two dimensions, however, leads to some aggregated ARI trajectories having two distinct peaks while others have a single peak. The elastic functional data analysis framework has been shown to be effective for the problems of functional alignment and shape registration (Srivastava and Klassen, 2016).
In this paper, we showed that adopting the same framework for functional k-means greatly improves clustering accuracy compared to KMA, which shares the same algorithmic design, by virtue of the flexible group of warping functions that leads to better alignment quality within clusters. We additionally proposed a model-based approach for choosing the optimal number of clusters by minimizing a BIC computed via cluster-wise fPCA. Together, elastic k-means, this selection criterion, and pointwise summaries applied to aligned functions within each cluster form a pipeline that produces more informative summaries of Bayesian posterior sample functions. Since elastic k-means effectively performs multiple alignment within each cluster, the within-cluster amplitude variability is encoded by the aligned functions, while the within-cluster phase variability is encoded by the optimal warpings. In the results, we only visualized the amplitude variability, but one can certainly visualize the phase variability by applying similar pointwise summaries to the warping functions. Our method resolves the distortion of shape and the exaggeration of amplitude variability caused by misalignment and multiple clusters. A remaining criticism of pointwise summaries is that they ignore the underlying functional structure; alternative visualizations of within-cluster amplitude and phase variability that preserve the functional structure, such as the functional boxplot-type display devised by Xie et al. (2017), could be considered for further improvement. As noted earlier, elastic k-means is an amplitude clustering method that treats phase variability as irrelevant to clustering, as can be seen from the cost function, which depends on the amplitude distance but not on a phase distance.
Therefore, it may not be suitable for some applications, especially when phase plays an important role in the clustering structure of interest. For example, in the Berkeley Growth Study (Jones and Bayley, 1941), one would expect phase to be a pivotal factor distinguishing the growth velocity curves of the two genders, as growth spurts occur at different time points in males and females. If we use only the amplitude component for clustering, it would not be surprising for the cluster labels to match poorly with gender. In the future, we plan to develop a clustering approach that incorporates the phase distance into the cost function and gives users the freedom to control the relative weight of amplitude versus phase. When the functions are vector-valued, an implicit assumption of elastic k-means is that phase is synchronized across the multiple dimensions, as we always apply the same warping to each dimension. Thus, another interesting research direction is to allow different warpings across dimensions, possibly with a penalty imposed on the asynchrony.

We present an additional simulation as an extension of the simulation carried out in the paper. In this simulation, we consider samples of 2-dimensional vector-valued functions f : [0, 1] → R^2, where the i-th function before warping is g_i(t) = (g_{i1}(t), g_{i2}(t)). Let b_{kl} denote the number of peaks in g_{il} for functions in the k-th cluster, and set b_{11} = 2, b_{12} = 1, b_{21} = 1, b_{22} = 2, b_{31} = 2, b_{32} = 2. The components g_{il} are then constructed as in Simulation 1, where φ and p_i are defined in the same way as in Simulation 1 in the paper, and z_{ijl} iid∼ N(1, τ). The observed f_i is obtained by applying the same random warping to g_i as in Simulation 1.

[Supplementary Table 1 caption: clustering accuracies, computed across 50 replicates, for (a) elastic k-means, (b) KMA, (c) k-means on unaligned, discretized functions, and (d) k-means on aligned, discretized functions; best performance is highlighted in bold.]
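A data-generating mechanism of this type can be sketched as follows. The exact forms of φ, p_i, and the pre-warped components from Simulation 1 are not reproduced here, so this sketch substitutes squared-sine bumps with b_{kl} peaks, multiplicative amplitude noise z ~ N(1, τ), and a power-law warping γ(t) = t^a shared across both dimensions; all of these specific choices are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)

# number of peaks b_kl per cluster k and dimension l (from the supplement):
# b_11=2, b_12=1, b_21=1, b_22=2, b_31=2, b_32=2
B = {1: (2, 1), 2: (1, 2), 3: (2, 2)}

def simulate_function(k, tau=0.05):
    """One 2-d vector-valued function from cluster k (illustrative stand-in)."""
    # one random warping per function, applied identically to both dimensions
    a = np.exp(rng.normal(0.0, 0.5))
    gamma = t ** a                           # gamma(0)=0, gamma(1)=1, increasing
    dims = []
    for b in B[k]:
        z = rng.normal(1.0, tau)             # stand-in for z_ijl ~ N(1, tau)
        g = z * np.sin(b * np.pi * t) ** 2   # b peaks on [0, 1]
        dims.append(np.interp(gamma, t, g))  # f_il = g_il composed with gamma
    return np.column_stack(dims)             # shape (len(t), 2)

# a sample of N = 120 functions, 40 per cluster
sample = [simulate_function(k) for k in (1, 2, 3) for _ in range(40)]
```

Applying the same γ to both coordinates mirrors the synchronized-phase assumption of elastic k-means discussed above; allowing a different γ per dimension would produce the asynchronous setting proposed as future work.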
We vary the true number of clusters K* from 1 to 3, and again choose sample sizes N of either 120 or 240 and τ of either 0.05 or 0.1. Supplementary Table 1 reports the results; as in Simulation 1, the accuracy of k-means applied to unaligned, discretized functions is closer to that of random clustering. We also notice that the accuracy of the k-means algorithm applied to aligned, discretized functions does not show as large a decrease in performance as in Simulation 1 in the main article when the within-cluster variability increases.

[Supplementary Figure 1 caption: the simulated functions before (first column) and after clustering and alignment via elastic k-means (right three columns); the first and second rows correspond to the first and second dimensions of the functions, respectively.]

2 Clustering Posterior Trajectories of the FitzHugh-Nagumo System

We present an additional simulation that considers variation in neural data. In particular, we consider the FitzHugh-Nagumo system (FHN; Fitzhugh, 1961; Nagumo et al., 1962). The FHN system is a simplification of the Hodgkin-Huxley model (Hodgkin and Huxley, 1952) for the excitation of neurons. The model is formulated via the system of ordinary differential equations
$$\dot{v} = c\left(v - \frac{v^3}{3} + r\right), \qquad \dot{r} = -\frac{1}{c}\,(v - a + br),$$
where v, which stands for voltage, may be identified with membrane potential. Although r may be interpreted as the recovery force that gradually diminishes v, it does not correspond to any specific physiological quantity that can be measured. The FHN model has been widely used to illustrate parameter estimation for dynamical systems from noisy observations (see, e.g., Ramsay et al. (2007)). We consider a somewhat artificial setting where both v and r are observed.

We used the proposed BIC to estimate the number of clusters in the posterior sample, trying K = 1, ..., 6 with ρ = 0.95. After a plateau at K = 1, 2, 3, the BIC decreases as K increases. Although the decrease in BIC from K = 5 to K = 6 is small, we choose K = 6 as the optimal number of clusters in this case. The within-cluster alignments and pointwise summaries are shown in Supplementary Figure 6 (each column corresponds to one of the six clusters).
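FHN trajectories of the kind clustered here can be generated by numerical integration. The sketch below uses SciPy's general-purpose ODE solver with the standard FHN right-hand side; the parameter values a = 0.2, b = 0.2, c = 3.0 and the initial condition (v, r) = (-1, 1) are illustrative assumptions, not values taken from the paper's posterior.

```python
import numpy as np
from scipy.integrate import solve_ivp

def fhn(t, y, a=0.2, b=0.2, c=3.0):
    """FitzHugh-Nagumo right-hand side: voltage v and recovery r."""
    v, r = y
    dv = c * (v - v**3 / 3.0 + r)
    dr = -(v - a + b * r) / c
    return [dv, dr]

# integrate over t in [0, 20] on a uniform evaluation grid
t_eval = np.linspace(0.0, 20.0, 401)
sol = solve_ivp(fhn, (0.0, 20.0), [-1.0, 1.0],
                t_eval=t_eval, rtol=1e-8, atol=1e-8)
v, r = sol.y  # the (v, r) trajectories; with these parameters they oscillate
```

In a posterior-exploration setting, one would solve the system once per MCMC draw of the parameters and then cluster the resulting (v, r) curves, which is where the nonlinear relationship between c and the phase of the trajectories becomes visible.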
[Figure caption: (1)(b)-(c) two clusters (blue and red) of functions optimally aligned within each cluster (estimated cluster templates in black); the number of clusters was automatically determined via the proposed BIC; the blue cluster appears to have a higher mode than the red cluster. (2)(a)-(c) pointwise summaries (mean ± two standard deviations) for the functions shown in (1)(a)-(c), respectively.]

Elastic k-means separated the (v, r) functions into clusters characterized by different numbers of peaks and valleys; this is especially evident in the first five clusters. We also identify a cluster, shown in the last column of the figure, characterized by function shapes very different from those in the first five clusters. Although the parameter c is associated with the phase of the states v and r, the association is not linear, as evidenced by the assignment of functions corresponding to similar c values to different clusters. This suggests that functional clustering cannot be replaced by simply clustering this one-dimensional parameter. In particular, we discover much more structure in the posterior sample by clustering the (v, r) functions than by clustering the parameter c alone.

References

Unsupervised curve clustering using B-splines.
A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis.
A Bayesian approach to inferring chemical signal timing and amplitude in a temporal logic gate using the cell population distributional response. bioRxiv.
Model-based Gaussian and non-Gaussian clustering.
Constrained k-means clustering.
Modeling of biocatalytic reactions: A workflow for model calibration, selection and validation using Bayesian statistics.
Cluster analysis of multivariate data: Efficiency versus interpretability of classifications.
Identifying individual disease dynamics in a stochastic multi-pathogen model from aggregated reports and laboratory data. arXiv:1710.10346v1.
Bayesian data analysis.
High-dimensional visualizations.
Riemannian optimization for registration of curves in elastic shape analysis.
Comparing partitions.
Rainbow plots, bagplots, and boxplots for functional data.
Well-posed Bayesian geometric inverse problems arising in subsurface flow.
Functional data clustering: A survey.
Clustering for sparsely sampled functional data.
The Berkeley growth study.
Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.
Visualizing high-dimensional data: Advances in the past decade.
Functional convex averaging and synchronization for time-warped random curves.
Simultaneous curve registration and clustering for functional data.
On the concept of depth for functional data.
Functional data analysis of amplitude and phase variation.
Hierarchical nonlinear spatiotemporal agent-based models for collective animal movement.
On shape of plane elastic curves.
Curve registration.
Information and the accuracy attainable in the estimation of statistical parameters.
Phylodynamic inference for structured epidemiological models.
Functional data analysis and partial shape matching in the square root velocity framework.
Calcium regulation of single ryanodine receptor channel gating analyzed using HMM/MCMC statistical methods.
K-mean alignment for curve clustering.
A review of clustering techniques and developments.
Shape analysis of elastic curves in Euclidean spaces.
Functional and shape data analysis.
Registration of functional data using Fisher-Rao metric.
Functional boxplots.
Clustering functional data.
Generative models for functional data using phase and amplitude separation.
Visualizing high-dimensional posterior distributions in Bayesian modeling.
Hierarchical grouping to optimize an objective function.
Stochastic Modelling for Systems Biology.
A geometric approach to visualization of variability in functional data.
A comprehensive survey of clustering algorithms.
A new moving strategy for the sequential Monte Carlo approach in optimizing the hydrological model parameters.
Impulses and physiological states in theoretical models of nerve membrane.
A quantitative description of membrane current and its application to conduction and excitation in nerve.
An active pulse transmission line simulating nerve axon.
Parameter estimation for differential equations: A generalized smoothing approach.

Acknowledgments

The authors gratefully acknowledge Drs. Yury García