key: cord-0648698-apdptdf1
authors: Cremona, Marzia A.; Chiaromonte, Francesca
title: Probabilistic $K$-mean with local alignment for clustering and motif discovery in functional data
date: 2018-08-14
journal: nan
DOI: nan
sha: 70c4f76a31637003f60bba9baffe9f7d3e9db692
doc_id: 648698
cord_uid: apdptdf1

We develop a new method to locally cluster curves and discover functional motifs, i.e. typical"shapes"that may recur several times along and across the curves capturing important local characteristics. In order to identify these shared curve portions, our method leverages ideas from functional data analysis (joint clustering and alignment of curves), bioinformatics (local alignment through the extension of high similarity seeds) and fuzzy clustering (curves belonging to more than one cluster, if they contain more than one typical"shape"). It can employ various dissimilarity measures and incorporate derivatives in the discovery process, thus exploiting complex facets of shapes. We demonstrate the performance of our method with an extensive simulation study, and show how it generalizes other clustering methods for functional data. Finally, we apply it to the discovery of functional motifs in"Omics"signals related to mutagenesis and genome dynamics.

Given a set of curves, we consider the problem of discovering functional motifs inside them, i.e. typical "shapes" that may recur within each curve, or across several curves in the set. Some of these motifs may be present in most of the curves, but in different positions. Conversely, other motifs may characterize subgroups of curves -and thus differentiate among them based on local shape similarities. We provide a novel method for functional motif discovery that aligns curves locally to identify their shared portions, employing different definitions of (dis)similarity. Importantly, neither the motifs nor their lengths or their number need to be known in advance.

During the last two decades the analysis of curves has received increasing attention and interest in the statistical literature. Indeed, several functional data analysis (FDA) methods have been developed and applied in many fields (see, e.g., Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012) . In particular, many algorithms have been proposed to cluster aligned functional data (reviewed in Jacques and Preda, 2014) . Since functional data are very often misaligned, algorithms have also been proposed to simultaneously cluster and align curves (Liu and Yang, 2009; Sangalli et al., 2010; Park and Ahn, 2017) . All these methods consider the curves globally, over their entire domain of definition. However, in many applications, separation in groups may occur only on a small portion of the domain; this type of clustering structure might be missed by methods that consider curves in their entirety. The multivariate counterpart of this "domain selection" problem in functional data is in fact feature selection for clustering, and has been widely studied. Two examples are the clustering objects on subsets of attributes (COSA) method (Friedman and Meulman, 2004) , and the sparse clustering method (Witten and Tibshirani, 2010) . In the FDA framework, Fraiman et al. (2016) and Floriello and Vitelli (2017) recently proposed procedures to cluster curves while performing feature (i.e. domain) selection. Notably though, while these procedures consider the curves locally, they are not able to deal with misaligned data.

The problem of functional motif discovery we tackle here is more general -and to the best of our knowledge, it has never been studied in the statistical literature. In order to identify motifs, we define clusters locally on portions of the misaligned curves, and allow each cluster to contain multiple portions of the same curve (i.e. multiple instances of the same functional motif). In addition, we allow each curve to belong to multiple clusters (i.e. to comprise multiple functional motifs). This problem is the continuous version of sequence motif discovery, which is ubiquitous in bioinformatics and "Omics" sciences (see, e.g., Bailey et al., 2006) and consists of searching for highly similar patterns in a set of DNA or protein sequences. While these are discrete sequences of 4 or 20 symbols (4 nucleotides, or 20 amino acids), we consider curves that can attain any real values -and can in fact be multivariate (i.e. take values in a generic R d ). A similar problem for time series has been addressed by the data mining community (Lin et al., 2002; Mueen et al., 2009; Yeh et al., 2016 Yeh et al., , 2018 defining a motif as a pattern repeated multiple times within a single time series. Available tools generally employ the Euclidean distance or the correlation between portions of the time series. They usually require as input the length and the number of motifs to be found, although Linardi et al. (2018) recently introduced an algorithm that finds all motifs in a given range of lengths. Importantly, these tools require a user-specified minimum distance within which two portions of the time series are considered the same motif (i.e. a "motif radius"), and this distance is the same across motifs.

We embed the problem of functional motif discovery in a full-blown FDA framework, which allows us to capture complex shape characteristics by incorporating derivatives in the discovery process. The FDA framework also allows us to rigorously define variability within each motif, and to naturally reduce noise in the curves through smoothing. Our novel method, probabilistic K-mean with local alignment (probKMA), leverages ideas from FDA, bioinformatics and fuzzy clustering in order to identify K shared curve portions, which represent K candidate functional motifs in the set of curves under consideration. Indeed, similar to the K-mean with (global) alignment of Sangalli et al. (2010) , we perform simultaneously clustering and alignment of curves. However, we employ local alignment in place of their global alignment. Also, similar to BLAST-type algorithms in bioinformatics (Altschul et al., 1990) , we perform local alignments through the extension of high similarity "seeds". Finally, similar to fuzzy clustering in which points can belong to multiple clusters (Bezdek, 1981; Bezdek et al., 1984) , our curves can be associated with more than one cluster if they contain more than one typical "shape". As with every K-mean algorithm, the output of probKMA depends on its initialization. Since our aim is motif discovery, we run the algorithm several times, using different initializations and parameters, in order to obtain an adequate pool of candidate functional motifs. We then implement a post-processing phase to filter out low quality candidate motifs, merge the ones that have been found multiple times and finally locate the best functional motifs along the curves. This last step is actually a motif search algorithm and produces groups of curves where overlaps capture the multiplicity of shared motifs.

The article is organized as follows. In Section 2 we present the theoretical setting of probKMA, formulate it as an optimization problem, derive necessary conditions for its solution (Subsection 2.1) and describe its algorithmic implementation (Subsection 2.2). In Section 3 we discuss evaluation of the clusters produced by the algorithm (Subsection 3.1) and identification of the resulting "discovered" motifs (Subsection 3.2). In Section 4 we provide an extensive simulation study to evaluate probKMA and compare it to other approaches in the computer science and statistics literature. In Section 5 we present an application to mutagenesis data. We provide concluding remarks in Section 6.

We consider a set of N (d-dimensional) curves x i : R −→ R d , i = 1, . . . , N . Our goal is to identify K (d-dimensional) cluster centers of potentially different lengths c 1 , . . . , c K , defined as v k : (0, c k ) −→ R d , k = 1, . . . , K. These are "patterns" to which the curves are, locally, highly similar with respect to a distance d(·, ·). Since we are interested in local similarity (i.e. similarity between portions of curves) and we define the cluster centers only in the intervals (0, c k ), we allow each curve to be aligned to each cluster center as to minimize their distance. Alignment to each cluster center v k is performed composing each curve x i with a warping function h k,i : R −→ R from a class W . Here we consider shifts W = {h : t −→ t + s; s ∈ R}, but our method can be generalized to other warping functions that are commonly employed in the FDA literature (see, e.g., Ramsay and Silverman, 2005) .

Because of the focus on local similarity, a curve can belong to more than one cluster; that is, different portions of a curve can be similar to portions of other curves. Hence, mimicking fuzzy clustering (see, e.g., Bezdek, 1981; Bezdek et al., 1984) , we assign to each curve x i a probability p k,i to be a member of each cluster k. In particular, we define a membership function p k : {x 1 , . . . , x N } −→ [0, 1] for each k = 1, . . . , K, with p k (x i ) = p k,i , requiring that K k=1 p k,i = 1 for all i = 1, . . . , N , and that N i=1 p k,i > 0 for all k = 1, . . . , K. Each membership probability p k,i corresponds to a particular shift s k,i of the curve x i ; namely, the one that minimizes the distance between x i and v k given all constraints. We summarize these shifts in a matrix S = [s k,i ] and the membership probabilities in a matrix P = [p k,i ].

For a fixed number of clusters K and fixed lengths c 1 , . . . , c K , probKMA can be formulated as the following optimization problem: find K cluster centers v 1 , . . . , v K , membership probabilities P and shifts S that minimize the generalized least-squares functional

under the constraints p k,i ∈ [0, 1], ∀i, k; K k=1 p k,i = 1, ∀i; and N i p k,i > 0, ∀k. Here m > 1 is a fixed weighting parameter controlling the degree of "fuzziness", andx i,s k,i = x • h k,i are the shifted curves. Necessary conditions for (P,Ŝ,v 1 , . . . ,v K ) to be a (local) minimizer of (1) are that each ofP,Ŝ andv 1 , . . . ,v K minimizes (1) fixing all the other variables. We prove two key results (see Supplementary Material). The first provides an explicit solution for the optimal membership probabilities given shifts and centers. Importantly, this result holds for any distance d(·, ·) and does not rely on any regularity assumption on the curves or the cluster centers.

Proposition 1. Fix a shift matrixŜ and K cluster centersv 1 , . . . ,v K . Let

under the constraints K k=1 p k,i = 1, ∀i and N i=1p k,i > 0, ∀k. ThenP = [p k,i ] is a global minimizer of (2) if and only if

for all i ∈ R andp

If the i-th curve has positive distance from all cluster centers, (3) states that its probability of belonging to cluster k is inversely proportional to its distance from the k-th cluster center. Equation (4) tackles the extreme case of a curve with distance 0 from one or more cluster centers; in this case the functional minimum is attained by setting the membership probabilities to 0 for all clusters from which the curve has positive distance. Note that if a curve has distance 0 to more than one cluster (4) defines an uncountable set of minimizers. However, the non-uniqueness is trivial.

The second result concerns updates of the cluster centers and depends on the distance employed. Some regularity assumptions are required for the distance to be well defined, as we specify later. This provides an explicit solution for optimal centers given shifts and probabilities (note that the formula holds almost everywhere), for any distance d α (·, ·) defined as

where w j > 0 is the weight of the j-th component of a (multivariate) curve, indicated by (j) , indicates the weak derivative, and α ∈ [0, 1] is a parameter that defines the relative weight of the curve's levels and derivatives. Selecting α = 0 we obtain an L 2 -like distance d 0 (·, ·) that focuses exclusively on the levels. In this case we only need assumptions on the curves and centers, and in particular we require x i ∈ L 2 (R, R d ) and v k ∈ L 2 ((0, c k ), R d ) Conversely, the choice of α = 1 leads to an L 2 -like pseudo-distance d 1 (·, ·) that uses only weak derivative information, hence focusing on curve variations (their slopes or trends). Finally, α ∈ (0, 1) properly defines a Sobolev-like distance d α (·, ·) that allows one to highlight more complex features of curve shapes, taking into account both levels and variations. When α > 0 we need assumptions on the curves and centers, and also on their weak derivatives. In particular, we require that they belong to Sobolev spaces:

Proposition 2. Fix a membership matrixP and a shift matrixŜ. Consider the distance

if and only ifv

For α = 1,v k is defined by (7) up to an additive constant.

Equation (7) defines the k-th cluster center has a weighted average of the shifted curves in the interval (0, c k ). Weights are determined by membership probabilities: the contribution of a curve to the computation of the k-th cluster center is directlt proportional to its probability of belonging to cluster k.

Propositions 1 and 2 suggest to numerically minimize (1) through an iterative procedure that alternates: i. identification of cluster centers with equation (7), ii. curve alignment (warping function selection), and iii. computation of membership probabilities using equations (3)-(4). We propose the following algorithm for probKMA.

Initialization Fix the number of clusters K and the cluster center lengths c 1 , . . . , c K . Consider an initial membership matrix P (0) such that K k=1 p (0)

k,i > 0, ∀k (all clusters are non-degenerate), and an initial shift matrix S (0) formed selecting K portions of curves of length c 1 , . . . , c K in each curve x i ;

Iteration For it = 1, 2, . . . , iterate the following three steps until convergence:

i. Identification of cluster centers. For each k, compute the k-th cluster center v k,i . Stopping criterion At each iteration, the convergence of the probabilistic clustering is evaluated by means of the Bhattacharyya distance between the membership matrices P (it) and P (it−1) . In particular, for each cluster k we consider p

A global Bhattacharyya distance is then computed as the maximum, mean, or order q quantile of the Bhattacharyya distances for all clusters k. Steps i-iii are repeated until the global Bhattacharyya distance reaches a given tolerance.

Remark 1. We observe that steps i and iii are analogous to the steps of a fuzzy K-mean algorithm (Bezdek et al., 1984) , or to the steps of an EM algorithm for mixture models (Dempster et al., 1977) . Also notably, steps i and ii correspond to the functional K-mean with (global) alignment algorithm (Sangalli et al., 2010) .

Every iteration can be written in functional form as

For each initialization, the algorithm generates a sequence of iterations

Below we show that the functional (1) is continuous and descends along (9). This is an important result, which mimics the one in Hathaway et al. (1987) for fuzzy K-mean (its proof is provided in the Supplementary Material).

i.e. J m is a descent functional for T m . Moreover, J m descends strictly along the iterations (the inequality in (10) 

Remark 2. Although the previous result is not enough to guarantee that every sequence of iterations (9) converges to a minimizer of the functional J m , it is a necessary condition for convergence, and a desirable property for the algorithm.

In the previous theoretical results and algorithm, the lengths c 1 , . . . , c K of the cluster centers can differ among clusters, but remain fixed along iterations. However, we seek to identify local similarities even when the lengths of the "matching" curve portions are not known a priori. This problem has been already tackled by local sequence alignment methods in bioinformatics, whose goal is to find similar stretches of unknown lengths within a collection of nucleotides or amino acids sequences. In this context, one of the most widely used algorithms is BLAST (Altschul et al., 1990) . BLAST starts by finding short stretches shared by the sequences, and uses them as seeds. It then extends the seeds on both sides in order to construct larger local alignments, stopping when the similarity score drops below a given threshold. Borrowing this logic, we add a center elongation step to our algorithm. This step is performed only when the algorithm is reaching convergence, to guarantee that we do not extend low-quality cluster centers. We attempt elongation of both the left and the right of each center, generating the elongated center with equation (7) applied to the correspondingly elongated set of curves. For elongation to be acceptable, we require that the corresponding objective function

decreases or that it increases less than a given threshold ∆ J m,k .

ProbKMA lacks the ability to distinguish the case of a curve x i 1 that matches all K cluster centers (i.e. d(

. In both cases equation (3) leads to the same membership probabilities p k,i 1 = p k,i 2 = 1/k, k = 1, . . . K. To overcome this issue, we perform a cluster cleaning step when the algorithm is near convergence. The membership matrix P is dicotomized, transforming all membership probabilities into either 0 or 1. We consider the quantile q 1 K of order 1 K of all distances d(x i , v k ); membership probabilities p k,i corresponding to distances lower than q 1 K are set to 1, while all others are set to 0. Note that, in this way, each curve x i is allowed not to belong to any cluster, as well as to more than one cluster. This step distinguishes among the two extreme cases above (as long as all or most curves are not extreme cases), leading to "clean" membership probabilities p k,i 1 = 1 and p k,i 2 = 0, k = 1, . . . , K.

As shown in Subsection 2.1, from a theoretical point of view the input curves are required to satisfy only mild regularity conditions that are typical in FDA. In practice, as for almost all FDA methods, probKMA works best if the curves are reasonably smooth. Curve smoothing can address this need and highly improve results. Moreover, in real applications each functional datum must be created from discrete evaluations -possibly available on datumspecific and/or irregular grids, with some measurements missing relative to other data. This too is tackled with smoothing, and other straightforward pre-processing steps to fill small gaps. However, in some applications input curves present also large gaps, i.e. miss entire subregions that cannot be meaningfully imputed by smoothing (see, for instance, our application to mutagenesis data in Section 5). FDA methods that consider the curves globally are not appropriate for this type of data. On the contrary, our method can tolerate large gaps because it exploits the functional data locally. We formalize this situation allowing the i-th input curve x i : R −→ R d to have a domain D i ⊆ R defined as a finite union of intervals. The distance in (5) is thus generalized as

where D is the domain of x. While the distance in (5) is computed on the whole interval (0, c) where the cluster center v is defined, the dissimilarity in (15) is computed only on the portion of (0, c) that intersects the domain of x, and is well defined only if this intersection is not empty, i.e. |(0, c) ∩ D| > 0. Based on (15), the equation to update the k-th cluster center becomesv

a.e. on (0,

Remark 3. Equations (16) and (7) are very similar: the k-th cluster center is still a weighted average of the shifted curves, but the contribution of a curve now depends on its domain, in addition to its probability of belonging to cluster k. When no large gaps are present in the input curves, (16) reduces to (7).

In this section we describe how the output of probKMA is evaluated and used to identify functional motifs. More detail on some of the post-processing steps described below are provided in the Supplementary Material.

To evaluate a clustering produced by probKMA, we develop a generalized silhouette index, similar to the one used in classic clustering (Rousseeuw, 1987) . Our index, which is defined for portions of curves, measures how well each portion fits in the cluster it was assigned to. By aggregating over portions assigned to the same cluster, we create measures of the quality of each cluster -and can discern if all K clusters have similar quality, or if the algorithm produced some very good clusters and some very poor ones. Finally, by aggregating over all K clusters, we quantify the overall quality of a clustering -and can thus compare results corresponding to different choices for the number of clusters K and the center lengths c 1 , . . . , c K .

First, we compute the clean membership matrix P (see Subsection 2.4) and identify the clean cluster centers. Next, from every curve x i , we extract the portions that correspond to at least one among the K clusters (i.e. for which the clean membership probability is p k,i = 1). Then, for each of the extracted portions, say j = 1, . . . , J, we compute its average distance d j (k) from cluster k as the mean of the distances between j itself and all the portions attributed to cluster k. We define the intra-cluster distance a j as the average distance of portion j from the cluster k j it belongs to, i.e. a j = d j (k j ), and the inter-cluster distance b j as the minimum of the average distances of portion j from all the other clusters, i.e. b j = min k =k j d j (k). The generalized silhouette index for portion j is a number in [−1, 1] defined as

.

Large values of s j indicate that j is appropriately assigned to its cluster, while low values indicate bad assignments. In particular, negative values signify that portion j is closer to a cluster different from the one it was assigned to. For each cluster k, we then compute the cluster average silhouette index S k as the average silhouette index across all the portions assigned to k. This measures the compactness of the cluster and hence its quality. Finally, the overall average silhouette index S, i.e. the average of all S k 's, measures the overall quality of the clustering. Similar to classic clustering, silhouettes for portions, clusters and overall clustering are visualized in a silhouette plot that facilitates their interpretation.

Likewise other K-mean algorithms, probKMA finds a local minimum of the functional (1) and its output depends heavily on initialization. Since our goal is functional motif discovery, we run the algorithm multiple times with different initializations, and form the set of candidate motifs taking the union of the resulting solutions. Moreover, since we do not know a priori motif frequencies and lengths in the input curves, we run probKMA multiple times changing K and the minimum lengths c 1 , . . . , c K . Once again, we accumulate solutions in a union set. We then "clean up" this set of candidate motifs through two main procedures. First, we use generalized silhouette indices and number of occurrences to prune the candidate motif list. Specifically, we retain only motifs (cluster centers) such that both the average cluster silhouette index and the number of curve portions assigned to the cluster are larger than fixed thresholds (see Subsection 3.1). Second, we merge candidate motifs that are very similar to each other, as they may in fact correspond to the same "true motif" identified by multiple runs of probKMA (this is likely to happen when a motif appears in the input curves with a large signal-to-noise ratio).

Finally, given the resulting set of discovered motifs, we utilize a motif search algorithm to locate all their instances in the input curves. For each motif, we map all portions of the curves with distance lower than a given radius R from the motif.

Merging and motif search can be tackled with many different strategies, each with its pros and cons. To save computation, we devised implementations that take advantage of the large amount of information gained from the multiple runs of probKMA -in particular, of the distances that have been computed between each candidate motif and all input curves (both the ones that contain the motif, and the ones that do not). Full details are provided in the Supplementary Material. Briefly, for merging we group similar candidate motifs benchmarking their pairwise distance against the distances between all motifs and curves. In each group, we select one representative motif based on its number of occurrences and the average distance between the motif and its occurrences in the curves. If the candidate motifs in a group have very different lengths, we can select more than one representative. This allows us to retain both a long motif and a shorter motif -which may have a larger number of occurrences, or a smaller average distance. In addition, we estimate the variability within each group of motifs, and we use it to define the radius R used in the motif search step. This is group-specific and it allows us to map instances of each motif based on a data-driven evaluation of its own signal-to-noise ratio.

In this section we present simulation experiments that illustrate the proposed methodology. For simplicity, we generate d = 1 dimensional functional data without any missing subregions. Subsection 4.1 describes the flexible model we employ to generate curves comprising functional motifs. This task is non trivial, since we require motifs to be smoothly embedded in curves while allowing them to occur with noise. In Subsection 4.2 we employ this model to evaluate the performance of our method in discovering functional motifs. In particular, we examine the effects of increasing curve length (and thus the amount of "background data" among which the algorithm must identify motifs) and the noise level comprised in motif instances. Finally, we compare probKMA to a time series motif discovery tool in Subsection 4.3, and to other FDA clustering methods in Subsection 4.4.

In order to generate smooth curves embedding functional motifs, we take advantage of the flexibility provided by B-splines. We consider a B-spline basis {Φ l } L l=1 of order n, with equally spaced knots t 1 , . . . , t L−n+2 , and define each curve as

where c l ∈ R, l = 1, . . . , L are coefficients to be chosen. The order n controls smoothness and complexity of x (x is a curve of class C n−2 and a piece-wise polynomial of degree n − 1). Higher orders provide more degrees of freedom, allowing one to generate curves with more complex shapes, and smoother at the knots. Each basis function Φ l has compact support; specifically, it is 0 outside an interval of length nT , where T is the distance between two subsequent knots. As a consequence, we can define a functional motif of length T fixing the values of n coefficients c m,i , . . . , c m,i+n−1 and repeating them multiple times within the same curve or across different curves. Longer motifs of length 2T, 3T, . . . -that may result in more complex shapes -can be created in a similar way, fixing the values of n + 1, n + 2, . . . subsequent coefficients. Since a single curve can embed more than one functional motif, as well as more than one occurrence of the same motif, we require motifs to be separated by at least one sub-interval (t i , t i+1 ) as not to be artificially merged (i.e. we require at least n background coefficients between them). Motif occurrences that are "the same", both in shape and level, are generated adding Gaussian noise to the corresponding coefficients:c m,j = c m,j + j , j iid ∼ N (0, σ 2 ). Motif occurrences that are "the same" in shape but have different levels are obtained adding a constant d m to all the coefficients that define a single occurrence, choosing different constants for different occurrences:c m,j = c m,j + d m + j , j iid ∼ N (0, σ 2 ). Background coefficients c bg ∈ [a, b] are randomly generated as (c bg − a) /b iid ∼ Beta(0.45, 0.45). The Beta distribution allows us to create reasonably different backgrounds for both the curve x(t) and its derivative x (t).

With this flexible model we can generate functional data in several complex scenarios, varying curve and motif lengths, as well as variability, frequencies and positions of the occurrences of each motif.

The aim of our simulations is to demonstrate the performance of probKMA in discovering functional motifs embedded in a set of curves, and to examine the effects of increasing curve length and the noise level comprised in motif occurrences. We consider two different scenarios, with sets of curves embedding (1) motifs that share both shapes and levels; or (2) motifs that share shapes but have different levels.

In scenario (1), we consider a set of 20 curves embedding two functional motifs -each with 12 occurrences (see Fig. 1 and Supplementary Material). In particular, 12 curves contain only one occurrence of a motif (6 curves for each of the two motifs), 4 curves contain two occurrences of a motif (2 curves for each of the two motifs), 2 curves contain one occurrence of each of the two motifs, and 2 curves contain no motif occurrences at all. We generate data according to the model in Subsection 4.1, using a B-spline basis of order 3, knots at distance 10 and motifs of length 60. Coefficients defining the two motifs are randomly generated from a Beta(0.45, 0.45) distribution rescaled to [−15, 15] . In this scenario, we consider four different curve lengths l = 200, 300, 400, 500 and four levels of noise σ = 0.1, 0.5, 1, 2creating a total of 16 simulated datasets. In order to maximize the consistency among these datasets and thus highlight the effects of different l and σ values, we place motif occurrences within the leftmost sub-interval of length 200 of each curve -that is common to all datasets -utilizing the same motif positions and background in all 16 cases. Curves are sampled on a grid of points at distance 1, so that each motif corresponds to 61 points.

For each combination of l and σ, we run our probKMA-based functional motif discovery with Sobolev-like distance d 0.5 (this takes into consideration both curve level and curve variation). We evaluate the number of motifs found, the distance between true and estimated motifs, the estimated lengths of motifs, and the number of true and false positives. ProbKMA is run for K = 2, 3, minimum motif lengths c = 40, 50, 60 and 20 random ini- Simulation results for l = 200 can be found in Fig. 2 and show very good performance for our method. As expected, performance slightly declines when more noise is introduced in the motif instances: some occurrences can be missed, and/or false positives can be included. However, results remain satisfactory even when σ = 2. Results for other curve lengths are shown in the Supplementary Material. They suggest the same behavior as the noise level increases and, interestingly, they appear rather robust across lengths. The only effect of increasing the ratio between "background" curve portions and curve portions occupied by motifs is a slight increase in false positives -which occurs exclusively when also the noise level is high.

In scenario (2) we consider exactly the same curves and motifs as in scenario (1), but allow motif occurrences to have different levels (see Supplementary Material). In particular, a random value d m ∼ U (−10, 10) is added to the coefficients defining an individual motif occurrence. For each combination of l and σ, we run our probKMA-based discovery with the L 2 -like pseudo-distance d 1 (this lets us focus on curve variation). Parameters are the same as in scenario (1), and Figures detailing the results are provided in the Supplementary Material. We find again that our method has good performance -and that this performance is affected (as expected) by the noise level, but not much by the length of the curves. In some cases, especially when the curves are very long, we actually "discover" motifs that were not embedded in the simulated data. Note that, strictly speaking, these motifs are not all together "false". As one elongates the background portions of the curves, it is possible to generate by chance a few patterns that recur often enough to be identified by our algorithm. In our experiments, these additional motifs are noisier and have fewer occurrences than the two motifs originally embedded in the data.

To validate the results described above, we repeat simulations in both scenarios 10 times, considering 10 different randomly generated pairs of motifs and re-generating the curves' background (see Supplementary Material) . In all cases, our method shows good performance and similar behaviors as curve length and noise level change. This is evidence that its effectiveness does not depend on on the specific shapes of the motifs embedded in the curves.

Finally, we consider the 10 simulations for the first scenario and examine how results change with the number of random initializations used to run probKMA for each (K, c) pair. To do this, we subsample the probKMA runs from the analysis already conducted, which employed 20 random initializations -using only 5, 10 or 15 initializations for each (K, c) pair. We re-run the post-processing steps and compare results with those previously obtained with all 20 initializations. Reassuringly, our method is pretty robust to the number of initializations employed, and retains its good performance even when probKMA is run only 5 times for each choice of (K, c) (see Supplementary Material).

Next, we compare our probKMA-based functional motif discovery to time series motif discovery. As noted in the Introduction, the problem of motif discovery in a time series is similar to that of functional motif discovery. However, some important differences exist. First, the definition of motif is different. While our functional motifs recur across a set of curves, and possibly within individual curves in the set, time series motifs are defined as recurring within a single time series -usually starting from pairs of highly similar subseries (see the Supplementary Material). Second, time series motif similarity is based on Euclidean distance or correlation, while functional motif similarity is more general, based on a distance between functions that can incorporate derivatives. Third, and perhaps most importantly, available time series motif discovery tools are not statistical in nature. As a consequence, they do not estimate the level of noise associated with each motif. The user needs to provide one, and usually only one, motif radius as input to the discovery procedure. On the contrary, probKMA-based discovery "learns" an appropriate radius for each motif from the data (see Subsection 3.2).

Among existing tools for time series motif discovery, we consider the recent Matrix Profile (Yeh et al., 2016 . This tool discovers the top motif pairs in a time series and, for each of these pairs, provides all the "neighboring" subseries, i.e. all the subseries with distance less than R from the motif pair. The same radius R is used for all the top motif pairs. We consider again the two simulation scenarios introduced in Subsection 4.2 (see Fig. 1 and Supplementary Material), focusing on two specifications: the simple case of short curves and low noise level (l = 200 and σ = 0.1), and the complex case of long curves and high noise level (l = 500 and σ = 2). For probKMA-based discovery, we use the same parameters as in Subsection 4.2 (K = 2, 3, minimum motif lengths c = 40, 50, 60, and 20 random initializations for each (K, c)). We run Matrix Profile on one "time series" obtained by concatenating the 20 curves one after the other, and using different choices of radius (from R = 1 to R = 150). Since the tool requires also the motif length as input, we use the true motif length c = 60 (a newer implementation of Matrix Profile, introduced in Linardi et al., 2018, can find all motif pairs in a given range of lengths). Table 1 reports results for simulation scenario (1). Both probKMA-based motif discovery and Matrix Profile discover the two motifs in the simple case (l = 200, σ = 0.1). However, when curves are longer and motifs noisier (complex case, l = 500, σ = 2), Matrix Profile fails to find Motif 1, and includes many false positives in Motif 2. When the radius is large (R > 100), it does correctly identify a small number of occurrences of Motif 1, but it also reports a very large number of false positives (for both motifs). On the contrary, even in this complex case, probKMA-based motif discovery remains robust to noise level and curve length, and is able to identify both motifs with a very small number of false positives. Similar observations can be made for simulation scenario (2) (see Supplementary Material).

Here we perform simulation experiments to compare probKMA, meant as a clustering method and separate from its motif discovery purpose, to other functional clustering methods. In particular, we include in the comparison the standard functional K-means (Tarpey and Kinateder, 2003) and K-mean with (global) alignment (KMA Sangalli et al., 2010)which are both non-sparse methods -and a recent sparse clustering technique (Floriello and Vitelli, 2017) . We consider 2 clusters and generate 9 curves for each cluster, in the four different scenarios depicted in Fig. 3 : (a) curves in the two clusters are aligned, and they differ on the entire domain; (b) curves in the two clusters are misaligned, and they differ on the entire domain; (c) curves in the two clusters differ on a portion of the domain, and this portion is aligned; (d) curves in the two clusters differ on a portion of the domain, and this portion is misaligned. These four scenarios can be seen as special cases of the more general functional In each scenario, we run all four clustering methods with Euclidean distance and using the true number of clusters (K = 2). For the sparse clustering method, we also take the sparsity parameter (i.e. the minimum length of the unselected part of the domain) as known, setting it to the curve length minus the motif length. We set the motif length parameter in Table 2 : Comparison of probKMA with non-sparse and sparse functional clustering methods in four simulation scenarios. We report means (and standard deviations) of classification error rates across 10 repetitions. probKMA in the same way. In KMA and probKMA, we consider only shift alignments. We then evaluate clustering results by means of a classification error rate (1 minus the Rand index; Rand, 1971 ) that is equal to 0 if every curve is correctly classified and (since K = 2) is equal to 0.5 if the classification is as good as random. Also, since probKMA produces a probabilistic clustering, we compute the classification error rate after assigning each curve to the cluster with highest membership probability.

Results are shown in Table 2 . We observe that all four methods correctly classify all curves in scenario (a). K-means only works in this scenario, while KMA performs well in scenarios (a) and (b), and sparse clustering performs well in scenarios (a) and (c). Interestingly, when the noise level is small, sparse clustering also achieves a good performance in scenario (b). ProbKMA performs very well in all four scenarios.

To further illustrate the proposed method, we apply it to a mutagenesis dataset adapted from Kuruppumullage Don et al. (2013) . Mutagenesis comprises all the processes by which mutations are generated in DNA, it is one of major evolutionary forces, and is central to causing many human diseases (e.g. cancer). Understanding mutagenesis and how it is influenced by the genomic landscape is key to shedding light on genome dynamics (Makova and Hardison, 2015) . Kuruppumullage Don et al. (2013) estimated different types of mutation rates in non-overlapping neutrally evolving regions (broken into windows) along the human genome, employed Hidden Markov Models to define six divergence states and segmented the genome in different regions belonging to these states. One of the states is of particular interest: it comprises "hot regions" with very high rates for substitutions, small insertions and deletions, which are associated with high GC content, early replication timing and open chromatin. Since these results are obtained at a rather large scale (1-Mb windows), investigating rates at finer resolution within the hot regions may reveal more specific trends and patterns of variation.

Estimating high-resolution mutation rates in 1-kb windows within each hot region (with the same pipeline as in Kuruppumullage Don et al., 2013, see Supplementary Material) we generate a dataset of 43 curves, varying in length from 1 Mb (corresponding to a grid of 1 000 points) to 22 Mb (22 000 points). The curves are highly noisy, and contain several missing or inaccurate values -due to the fact that in many 1-kb windows the information needed to estimate rates is scarce. In particular, substitution rates can be reliably estimated only in 60% of the 1-kb windows. After pre-processing with stochastic regression imputation and local smoothing, missing values are reduced to 17% of the windows (see Supplementary  Material) . In this application we do not consider insertion and deletion rates, because their estimates are even noisier and less accurate than for substitution rates.

We employ our probKMA-based functional motif discovery on the 43 curves, using the Sobolev-like distance d 0.5 to quantify similarities. We look for motifs with minimum lengths c = 40, 50, 60, 70 (maximum length c max = 150), and we run probKMA for K = 2, 3, 4, 5 using 10 random initialization for each (K, c) pair. We employ our generalized silhouette index to evaluate each probKMA run and to filter the set of candidate motifs (see Fig. 4(a) and Supplementary Material). Post-processing parameters are tuned based on probKMA results (see Fig. 4 (b)-(c) and Supplementary Material).

We identify 13 functional motifs that differ substantially in length (40 to 104 kb), levels and shapes (see Fig. 5(a) ). The motifs also differ substantially in frequency (i.e. number of occurrences in the data) and level of variability (see Table 3 ). This highlights the advantage of employing a motif discovery methodology able to learn motif-specific lengths, frequencies and variabilities based on the data.

At least four of the motifs found are of biological interest: Motif 12 corresponds to eight long sub-regions (about 100 kb) with extremely high substitution rates -an elevation of 10% to 20% relative to the mean level across all hot regions, which is already elevated in Number 19 12 27 37 63 12 72 47 14 11 9 8 6 Mean dist 1.9 1.9 3.5 5.1 6.5 3.0 8.7 7.4 5.2 3.0 5.5 18.5 17.0 comparison to the genome at large. Motif 4 and Motif 8 also present very high substitution rates, and opposite patterns. In Motif 4, rates are about 10% above the overall hot regions mean for the initial ∼20 kb's, and then decrease. In Motif 8 rates increase and then stabilize at about 10% above the mean for ∼20 kb's. The two motifs have similar variability and are both very frequent (37 and 47 occurrences, respectively). Finally, Motif 13 corresponds to six long sub-regions with a substitution rate 20-30% below the mean. These portions of the hot regions are in fact not hot; substitutions rates are similar to those of the rest of the genome.

To investigate the genomic landscape of the motifs found, we consider a set of 35 genomic features -measured in each of the 1-kb windows constituting the hot regions. These features represent biological contexts that have an interplay with mutagenesis, such as DNA conformation, DNA sequence, replication, recombination, chromatin openness and modifications Figure 5 : ProbKMA-based functional motif discovery in substitution rate curves. (a) Motifs found, plotted as percent changes with respect to the mean substitution rate across all hot regions; (b) Genomic landscape of the motifs, with color intensity proportional to the significance (− log 10 (p)) of a mean difference two-sided test contrasting motif occurrences and hot regions at large -red, blue and white represent positive, negative and non significant (p > 0.1) differences, respectively.

(see Supplementary Material). We then compare, independently for each genomic feature and each motif, the mean of the measurements in motif occurrences with the mean across all hot regions. In particular, we perform a simulation-based two-sided test for mean difference, where the empirical null distribution is obtained from 1000 datasets generated by randomly relocating motif occurrences within the set of curves. Fig. 5(b) shows the results of this analysis. We observe that each motif has a characteristic genomic landscape, which helps in its biological interpretation. For example, occurrences of Motif 13 are enriched in exons and conserved elements compared to hot regions in general -their lowered substitution rates may indeed correlate with such enrichments.

This article, for the first time to the best of our knowledge, tackles the problem of functional motif discovery from a statistical perspective. We proposed probKMA for discovering candidate motifs in a set of curves, incorporating ideas from functional data analysis, bioinformatics and fuzzy clustering. In addition, we proposed a generalized silhouette index to evaluate probKMA results, and implemented a post-processing stage for merging candidate motifs and searching motif occurrences along the curves. Although many alternative strategies can be employed in post-processing, each with pros and cons, results on simulated and real data suggest that our implementation is effective in a range of scenarios. ProbKMA employs a flexible definition of curve similarity, which incorporates both levels and derivatives. In addition, similarity is defined locally, in a way that tolerates large gaps in the curves. This broadens the application scope of our methodology. ProbKMA can also be applied to multivariate curves and, importantly, does not require the user to specify motif lengths or variability levels at the outset. These are learned from the data, substantially improving performance with respect to approaches where lengths and/or radii are fixed.

In our experience, motif discovery with probKMA can fail when motifs are too similar to one another or when they are too similar to background portions of the curves. This can happen by chance when motifs are very noisy. Relatedly, simulations show that, when motifs are very noisy and/or dispersed in very long curves, our method can identify motifs that were not intentionally introduced in the data -but rather randomly created when generating background portions of the curves. In a way, these additional motifs may be considered as "unintentional" and yet true (as opposed to false) positives; they do recur in the curves in a way that is detectable by the algorithm. Nevertheless, in our simulations they are noisier and have fewer occurrences. This observation underscores the need for further work addressing the statistical significance of motifs found by probKMA. Notably, the flexible model based on B-splines that we introduced to generate simulation data may play an important role in this context, providing a way to estimate the likelihood of discovering motifs in background curves.

Separately from its motif discovery purpose, probKMA can also be employed for probabilistic clustering of misaligned functional data based on local similarities. In this respect, it also represents a generalization of sparse clustering procedures recently proposed in functional data analysis (Fraiman et al., 2016; Floriello and Vitelli, 2017) . In the limit, when the minimum motif length is close to the length of the curves under consideration, probKMA becomes a probabilistic version of K-mean with (global) alignment (Sangalli et al., 2010) .

Code implementing our methodology in R is available upon request. An R package is in preparation.

Basic local alignment search tool

MEME: discovering and analyzing dna and protein sequence motifs

Pattern recognition with fuzzy objective function algorithms

FCM: the fuzzy c-means clustering algorithm

Maximum likelihood from incomplete data via the EM algorithm

Nonparametric functional data analysis: theory and practice

Sparse clustering of functional data

Feature selection for functional data

Clustering objects on subsets of attributes

An improved convergence theory for the fuzzy isodata clustering algorithms

Inference for functional data with applications

Functional data clustering: a survey

Segmenting the human genome based on states of neutral genetic divergence

Finding motifs in time series

Matrix profile X: VALMOD -scalable discovery of variable-length motifs in data series

Simultaneous curve registration and clustering for functional data

The effects of chromatin organization on variation in mutation rates in the genome

Exact discovery of time series motifs

Clustering multivariate functional data with phase variation

Functional data analysis

Objective criteria for the evaluation of clustering methods

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

K-mean alignment for curve clustering

Clustering functional data

A framework for feature selection in clustering

Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile

Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets

We thank Matthew Reimherr and Piercesare Secchi for discussions about functional data methodology; Kateryna D. Makova and Di (Bruce) Chen for help with the mutagenesis application; Valeria Vitelli and Davide Floriello for their sparse functional clustering code.

This work was partially funded by the Eberly College of Science, the Institute for Cyberscience and the Huck Institutes of the Life Sciences (Penn State University); NSF award DMS-1407639; and Tobacco Settlement and CURE funds of the PA Department of Health.

The online Supplement contains proofs and additional details on post-processing, simulations and analysis of the mutagenesis data.