key: cord-0225715-c73xaebv
title: Differentiable Segmentation of Sequences
authors: Scharwächter, Erik; Lennartz, Jonathan; Müller, Emmanuel
date: 2020-06-23

Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models into state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 by segmented Poisson regression, perform logistic regression on Fashion-MNIST with artificial concept drift, and demonstrate its capacities for phoneme segmentation.

Non-stationarity is a classical challenge in the analysis of sequential data. One source of non-stationarity is the presence of change points, where the data-generating process switches its dynamics from one regime to another. In some applications, the detection of change points is of primary interest, since they may indicate important events in the data [40, 7, 6, 35, 33, 3, 45]. Other applications require models for the dynamics within each segment, which may yield more insights into the phenomenon under study and enable predictions. A plethora of segmented models for regression analysis [37, 22, 32, 5, 38, 1] and time series analysis [21, 11, 4, 13] have been proposed in the literature, where the segmentation materializes either in the data dimensions or in the index set. We adhere to the latter approach and consider models of the following form.

Let x = (x_1, ..., x_T) be a sequence of T observations, and let z = (z_1, ..., z_T) be an additional sequence of covariates used to predict these observations. Observations and covariates may be scalars or vector-valued. We refer to the index t = 1, ..., T as the time of observation. The data-generating process (DGP) of x given z is time-varying and follows a segmented model with K ≪ T segments on the time axis. Let b_k and e_k denote the beginning and end of segment k, respectively. We assume that

    x_t ∼ DGP(z_t; θ_k)   for b_k ≤ t ≤ e_k,   k = 1, ..., K,    (1)

where the DGP in segment k is parametrized by θ_k. This scenario is typically studied for non-stationary time series [20, 8, 26, 11, 42, 44], but also captures predictive models with concept drift [17]. For example, in a segmented Gaussian autoregressive process of order h, the vector of covariates is z_t = [x_{t−h}, ..., x_{t−1}, 1] and the DGP is the normal distribution N(z_t θ_k, σ²). In a segmented generalized linear model (GLM), the DGP is a probability distribution with conditional expectation E[x_t | z_t] = g(z_t θ_k), where the linear predictor is transformed by a link function g.
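To make Equation 1 concrete, here is a minimal NumPy simulation sketch of the segmented Gaussian autoregressive example above, for order h = 1 and two segments. The segment parameters, noise level, and boundary are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal simulation sketch of Equation 1 for a segmented Gaussian AR(1) process
# with two segments. All numbers below are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
T, boundary, sigma = 200, 100, 0.5
theta = np.array([[0.9, 0.0],    # segment 1: AR coefficient 0.9, bias 0.0
                  [0.2, 1.0]])   # segment 2: AR coefficient 0.2, bias 1.0
x = np.zeros(T)
for t in range(1, T):
    k = 0 if t < boundary else 1              # hard segment assignment ζ(t)
    z_t = np.array([x[t - 1], 1.0])           # covariates z_t = [x_{t-1}, 1]
    x[t] = rng.normal(loc=z_t @ theta[k], scale=sigma)   # x_t ~ N(z_t θ_k, σ²)
```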
We express the segmentation of the time axis by a segmentation function ζ : {1, ..., T} → {1, ..., K} that maps each time point t to a segment identifier k. The segmentation function is order-preserving, with boundary constraints ζ(1) = 1 and ζ(T) = K. We denote all segment-wise parameters by θ = (θ_1, ..., θ_K). The ultimate goal is to find a segmentation ζ as well as segment-wise parameters θ that minimize a loss function L(ζ, θ), for example, the negative log-likelihood of the observations x.

Existing approaches exploit the fact that model estimation within a segment is often straightforward when the segmentation is known. These approaches decouple the search for an optimal segmentation ζ algorithmically from the estimation of the segment-wise parameters θ:

    min_ζ min_θ L(ζ, θ).    (2)

Various algorithmic search strategies have been explored for the outer minimization over ζ, including grid search [32], dynamic programming [22, 5], hierarchical clustering [37] and other greedy algorithms [1], some of which come with provable optimality guarantees; a minimal sketch of this decoupled scheme for two segments is given below. These algorithms are often tailored to a specific class of models like piecewise linear regression, and do not generalize beyond it. Moreover, the use of non-standard optimization techniques in the outer minimization hinders the integration of such models with deep learning architectures, which usually perform joint optimization of all model parameters with gradient descent.

In this work, we provide a continuous and differentiable relaxation of the segmented model from Equation 1 that allows joint optimization of all model parameters, including the segmentation function, using state-of-the-art gradient descent algorithms. Our formulation is inspired by the learnable warping functions proposed recently for sequence alignment [34, 50]. In a nutshell, we replace the hard segmentation function ζ with a soft warping function γ. An optimal segmentation can then be found by optimizing the parameters of the warping function. We propose a novel class of piecewise-constant warping functions based on the two-sided power (TSP) distribution [47, 28] that can represent segmentation functions exactly. TSP-based warping functions are appealing because they are differentiable, easy to evaluate analytically with closed-form expressions, and their parameters have a one-to-one correspondence with segment boundaries.

Although our notation in Equation 1 implies a probabilistic DGP, our formalism also applies to fully deterministic models: we can replace ∼ with = in Equation 1 and proceed analogously. Moreover, the segmented model may be part of a larger model architecture, where the covariates z_t and the parameters θ_k come from some upstream computational layer, and the outputs x_t are passed on to the next computational layer with an arbitrary downstream loss function. The interpretation of z_t as covariates and θ_k as parameters is merely for consistency with prior work on segmented models. It is more accurate to interpret z_t as temporal variables that differ for every time step t, and θ_k as segmental variables that differ for every segment k. The DGP combines the information from both types of variables to produce an output for every time step.
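For concreteness, the sketch below instantiates the decoupled scheme of Equation 2 in its simplest form, with K = 2 segments and a least-squares inner fit. The function name and the ordinary-least-squares inner solver are our own illustrative choices rather than the algorithm of any specific cited work.

```python
import numpy as np

def grid_search_two_segments(x, z):
    """Decoupled estimation in the spirit of Equation 2 for K = 2 segments:
    the outer loop grid-searches the boundary, the inner step fits each segment
    by ordinary least squares. x: (T,) observations, z: (T, D) covariates."""
    T = len(x)
    best_loss, best_boundary, best_theta = np.inf, None, None
    for boundary in range(2, T - 1):          # first time index of the second segment
        theta, loss = [], 0.0
        for seg in (slice(0, boundary), slice(boundary, T)):
            coef, *_ = np.linalg.lstsq(z[seg], x[seg], rcond=None)
            theta.append(coef)
            loss += np.sum((z[seg] @ coef - x[seg]) ** 2)
        if loss < best_loss:
            best_loss, best_boundary, best_theta = loss, boundary, theta
    return best_boundary, best_theta, best_loss
```

The outer search grows combinatorially with the number of segments, which is exactly the dependence on non-standard optimization that the relaxation developed next avoids.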
The main idea of our work is to relax the segmented model formulation from Equation 1 and the optimization problem from Equation 2 by replacing the hard segmentation function ζ with a soft parametric warping function γ that can be estimated effectively with gradient descent. We now describe step by step how this relaxation is implemented.

We first rewrite the model definition to

    x_t ∼ DGP(z_t; θ̂_t),    (3)

where we substitute the actual parameter θ_k of the DGP at time step t in segment k by the predictor θ̂_t. We allow the predictor θ̂_t to take on values between two parameters θ_k and θ_{k+1} if there is ambiguity as to whether time step t belongs to segment k or segment k+1. More precisely, we define the predictor as a linear interpolation [23, 50, 34] between the parameters of two consecutive segments,

    θ̂_t = (1 − λ_t) θ_⌊ζ̂_t⌋ + λ_t θ_⌈ζ̂_t⌉,   with λ_t = ζ̂_t − ⌊ζ̂_t⌋.    (4)

The interpolation depends on a continuous predictor ζ̂_t ∈ [1, K] for the value of the segmentation function ζ(t). If ζ̂_t = ζ(t) ∈ {1, ..., K} for all t, no interpolation takes place, and the novel formulation is fully equivalent to Equation 1. Non-integer values of ζ̂_t encode ambiguity in the segment assignment and lead to interpolated parameters θ̂_t. Ideally, the predictors take on integer values as often as possible and are ambiguous only near the segment boundaries. The predictors can be transformed into a hard segmentation function by rounding to the closest integers. An implementation sketch of Equations 4 and 5 is given at the end of this part.

For consistency, the predictors ζ̂_t must be order-preserving, ζ̂_1 ≤ ... ≤ ζ̂_T, and satisfy the boundary constraints ζ̂_1 = 1 and ζ̂_T = K of the segmentation function. We obtain such predictors from warping functions. Warping functions describe order-preserving alignments between closed continuous intervals [41]. Formally, the function γ : [0, 1] → [0, 1] is a warping function if it is monotonically increasing and satisfies the boundary constraints γ(0) = 0 and γ(1) = 1. We transform a warping function γ into a predictor ζ̂_t by sampling γ at T evenly-spaced grid points on the unit interval [0, 1] and rescaling the result to [1, K]. Let u_t = (t − 1)/(T − 1) for t = 1, ..., T be such a unit grid. We define

    ζ̂_t = 1 + (K − 1) γ(u_t).    (5)

The predictor ζ̂_t exactly represents the segmentation function ζ if

    ζ̂_t = ζ(t)   for all t = 1, ..., T,    (6)

which is satisfied only for piecewise-constant warping functions with a step-like shape. An example segmentation function and three warping functions are shown in Figure 1. The problem of searching for a segmentation function has thus changed to that of estimating a suitable warping function:

    min_{γ, θ} L(γ, θ),

where γ ranges over a parametric family of warping functions. Several families of warping functions have been proposed, based on trigonometric functions [2], spline basis functions [41, 19, 16], warplets [10], continuous piecewise-affine (CPA) velocity fields [15, 12, 50], and nonparametric approaches [31, 34]. None of them contains piecewise-constant warping functions, since these families are based on strictly increasing functions for invertibility. As a result, no member of these families can represent a segmentation function exactly. Moreover, these families are more expressive than necessary for the segmentation task, which makes estimation unnecessarily hard. Below, we define a novel class of piecewise-constant warping functions that represents any segmentation function exactly, with only one parameter per segment boundary.

Warping functions have some similarity with cumulative distribution functions (cdfs) of random variables [34]. Cdfs are monotonically increasing, right-continuous, and normalized over their domain [49]. If their support is bounded to [0, 1], they satisfy the same boundary constraints as warping functions.
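Before turning to concrete families of warping functions, the following PyTorch sketch shows how Equations 4 and 5 can be translated into code, assuming the segment parameters are stored as a K × D tensor. The function names are ours and are not taken from the authors' released code.

```python
import torch

def predictor_from_warping(gamma_values, K):
    """Equation 5: rescale warping values γ(u_t) ∈ [0, 1] to predictors ζ̂_t ∈ [1, K]."""
    return 1.0 + (K - 1) * gamma_values

def interpolate_parameters(zeta_hat, theta):
    """Equation 4: linear interpolation between consecutive segment parameters.
    zeta_hat: (T,) predictors in [1, K]; theta: (K, D) segment-wise parameters."""
    idx = zeta_hat - 1.0                                  # 0-based continuous index
    lower = idx.floor().long().clamp(max=theta.shape[0] - 1)
    upper = (lower + 1).clamp(max=theta.shape[0] - 1)
    lam = (idx - lower.to(idx.dtype)).unsqueeze(-1)       # λ_t = ζ̂_t − ⌊ζ̂_t⌋
    return (1.0 - lam) * theta[lower] + lam * theta[upper]
```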
This correspondence allows us to exploit the vast literature on statistical distributions to define and characterize families of warping functions. Surprisingly, there are only a few established distributions for random variables with continuous bounded support [28]. The most prominent example is the beta distribution, whose cdf has no closed-form expression and requires numerical approximations. Our family of warping functions is instead based on the two-sided power (TSP) distribution [47, 28].

The TSP distribution has been proposed recently to model continuous random variables with bounded support [a, b] ⊂ R. It generalizes the triangular distribution and can be viewed as a peaked alternative to the beta distribution [24]. In its most illustrative form, its probability density function (pdf) is unimodal with power-law decay on both sides, but it can yield U-shaped and J-shaped pdfs as well, depending on the parametrization. Formally, the pdf is given by

    f_TSP(u; a, m, b, n) = (n / (b − a)) ((u − a) / (m − a))^(n−1)   for a ≤ u ≤ m,
    f_TSP(u; a, m, b, n) = (n / (b − a)) ((b − u) / (b − m))^(n−1)   for m ≤ u ≤ b,

with a ≤ m ≤ b. The parameters a and b define the boundaries of the support, m is the mode (anti-mode) of the distribution, and n > 0 is the power parameter that tapers the distribution. The rectangular distribution is the special case with n = 1, and the triangular distribution the special case with n = 2. In the following, we restrict our attention to the unimodal regime with a < m < b and n > 1. In this case, the cdf is given by

    F_TSP(u; a, m, b, n) = ((m − a) / (b − a)) ((u − a) / (m − a))^n   for a ≤ u ≤ m,
    F_TSP(u; a, m, b, n) = 1 − ((b − m) / (b − a)) ((b − u) / (b − m))^n   for m ≤ u ≤ b.

For convenience, we introduce a three-parameter variant of the TSP distribution with support restricted to subintervals of [0, 1] located around the mode. It is fully specified by the mode m ∈ (0, 1), the width w ∈ (0, 1] of the subinterval, and the power n > 1. Depending on the mode and the width, the distribution is symmetric or asymmetric. Illustrations of the pdf and cdf of the three-parameter TSP distribution for various parametrizations can be found in Figure 2. We denote the three-parameter TSP distribution as TSP(m, w, n) and write f_TSP(u; m, w, n) and F_TSP(u; m, w, n) for its pdf and cdf, respectively. The original parameters a and b are obtained from m and w via

    a = max(0, min(m − w/2, 1 − w)),   b = a + w,

and yield a unimodal regime. Intuitively, the three-parameter TSP distribution describes a symmetric two-sided power kernel of window size w that is located at m and becomes asymmetric only if a symmetric window would exceed the domain [0, 1]. An advantage of the TSP distribution over the beta distribution is that its pdf and cdf have closed-form expressions that are easy to evaluate computationally. Moreover, they are differentiable almost everywhere with respect to all parameters.

We define the TSP-based warping function γ_TSP : [0, 1] → [0, 1] for a fixed number of segments K as a mixture distribution of K − 1 three-parameter TSP distributions. The motivation is that mixtures of unimodal distributions have step-like cdfs that approximate segmentation functions. We use uniform mixture weights and treat the width w and power n of the TSP component distributions as fixed hyperparameters. The components differ only in their modes m = (m_1, ..., m_{K−1}):

    γ_TSP(u) = (1 / (K − 1)) Σ_{k=1}^{K−1} F_TSP(u; m_k, w, n).

We constrain the modes to be strictly increasing, so that γ_TSP is identifiable. If the windows around two consecutive modes m_{k−1} and m_k are non-overlapping, γ_TSP is constant at level (k − 1)/(K − 1) between these windows. It is also constant at level 0 for all points that come before the first window and constant at level 1 for all points that come after the last window. Therefore, the family of TSP-based warping functions contains piecewise-constant functions. In fact, the functions γ_1 and γ_2 in Figure 1 are examples of TSP-based warping functions.
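The following PyTorch sketch implements the three-parameter TSP cdf and the resulting warping function γ_TSP. The mapping from (m, w) to (a, b) encodes the symmetric window shifted into [0, 1] as described above; variable and function names are our own, and the snippet is an illustration rather than the reference implementation.

```python
import torch

def tsp_cdf(u, m, w, n):
    """Cdf of the three-parameter TSP distribution TSP(m, w, n) on [a, b] ⊆ [0, 1].
    u: (T,) grid points; m: (K-1,) modes; w, n: scalar hyperparameters."""
    a = torch.clamp(m - w / 2, min=0.0).clamp(max=1.0 - w)   # shift the window into [0, 1]
    b = a + w
    u = u.unsqueeze(-1)                                       # broadcast to shape (T, K-1)
    left = (m - a) / w * ((u - a).clamp(min=0.0) / (m - a)) ** n
    right = 1.0 - (b - m) / w * ((b - u).clamp(min=0.0) / (b - m)) ** n
    return torch.where(u <= m, left, right).clamp(0.0, 1.0)

def gamma_tsp(u, modes, w=0.5, n=20.0):
    """TSP-based warping function: uniform mixture of K-1 TSP cdfs."""
    return tsp_cdf(u, modes, w, n).mean(dim=-1)
```

With non-overlapping windows around the modes, gamma_tsp is exactly piecewise-constant between them, matching the discussion above.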
Lemma 1. For every segmentation function ζ, there is a TSP-based warping function γ_TSP such that the predictor ζ̂_t from Equation 5 represents ζ exactly, in the sense of Equation 6.

Proof. We place the K − 1 modes m_k on the segment boundaries (projected to the unit grid) and choose a window size w not larger than the resolution of the grid. The power n > 1 can be chosen freely. Formally, let ζ(t) = k and ζ(t + 1) = k + 1 be the k-th segment boundary. We set m_k := (u_{t+1} + u_t)/2 for all segment boundaries k and w := 1/(T − 1) to obtain the result.

In practice, the segmentation function ζ is unknown and the modes m = (m_1, ..., m_{K−1}) must be estimated in an unsupervised way. To simplify the estimation problem, we rewrite the modes as

    m_k = (Σ_{j=1}^{k} exp(µ_j)) / (Σ_{j=1}^{K} exp(µ_j)),   k = 1, ..., K,

with unconstrained real parameters µ = (µ_1, ..., µ_K). This transformation guarantees that the modes are strictly increasing and come from the interval (0, 1), with a bogus mode m_K = 1 that can be ignored. The warping function is now overparametrized, since the transformation is invariant to additive shifts of the parameters µ. This issue can be resolved by enforcing µ_1 := 0.

We have described all components of the relaxed segmented model architecture. It can use any family of warping functions to approximate a segmentation function. An overview of our model architecture with TSP-based warping functions is given in Figure 3. The learnable parameters of this architecture are θ = (θ_1, ..., θ_K) for the DGP and µ = (µ_1, ..., µ_K) for the warping function. The hyperparameters are the number of segments 1 < K ≪ T, as well as the window size w ∈ (0, 1] and the power n > 1 of the TSP distributions. This architecture is a concatenation of simple functions that are either fully differentiable or differentiable almost everywhere. Therefore, all parameters can be learned jointly using gradient descent. We implemented the model in Python (https://python.org/) using the PyTorch library (https://pytorch.org/). Source codes can be found in the supplementary material.

For effective training of the segmentation parameters µ with gradient descent, the window size of the TSP components should be chosen larger than the sampling resolution of the unit grid, 1/(T − 1), to allow the loss to backpropagate across segment boundaries. The window size can be interpreted as the receptive field of the individual TSP components. In all our experiments, we use a large window size of w = 0.5 combined with a high power of n = 20 to obtain functions that are close to piecewise-constant. The width can be tapered down to w ≤ 1/(T − 1) over the training epochs to obtain a warping function that is truly piecewise-constant and exactly represents a segmentation function. An alternative strategy that works for any family of warping functions is to replace the linear interpolation from Equation 4 by integer interpolation for a few epochs at the end of training. We applied the latter strategy for simplicity and consistency across all families of warping functions. The learning problem may be non-convex, and training may converge to local optima that are not global optima. Therefore, it is advisable to train the model multiple times with randomized initial parameters.
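To tie the pieces together, here is a compact end-to-end training sketch on a synthetic piecewise linear series. It reuses the functions predictor_from_warping, interpolate_parameters, and gamma_tsp from the earlier sketches, adds a cumulative-softmax reparametrization of the modes, and assumes a simple linear DGP with a mean squared error loss. All settings are illustrative, and the overparametrization in µ is left unresolved for brevity.

```python
import torch

def modes_from_unconstrained(mu):
    """Strictly increasing modes in (0, 1) via a cumulative softmax of µ; the last
    entry equals the bogus mode m_K = 1 and is dropped."""
    return torch.cumsum(torch.softmax(mu, dim=0), dim=0)[:-1]

# Synthetic piecewise linear series with one change point (illustrative values).
T, K = 1000, 2
z = torch.linspace(-1.0, 1.0, T).unsqueeze(-1)          # (T, 1) scalar covariates
x = torch.cat([-z[:500, 0], z[500:, 0]])                # slope −1, then slope +1

theta = torch.randn(K, 1, requires_grad=True)           # segment-wise DGP parameters
mu = torch.zeros(K, requires_grad=True)                 # unconstrained warping parameters
u = torch.linspace(0.0, 1.0, T)                         # unit grid u_t = (t-1)/(T-1)
opt = torch.optim.Adam([theta, mu], lr=0.01)

for epoch in range(200):
    zeta_hat = predictor_from_warping(gamma_tsp(u, modes_from_unconstrained(mu)), K)
    theta_hat = interpolate_parameters(zeta_hat, theta)  # (T, 1) interpolated parameters
    x_hat = (z * theta_hat).sum(dim=-1)                  # linear DGP: θ̂_t z_t
    loss = torch.mean((x_hat - x) ** 2)                  # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()
```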
First, we analyze how well our relaxed model identifies simple and complex functions. All experiments in this and the following sections can be reproduced with the codes in the supplementary material. We generate a piecewise linear time series of length T = 1,000 with a single change point at t = 500, where the slope changes from −1 to 1. We use the linear DGP f_Linear(z_t, θ_k) = θ_k z_t with scalar covariates z_t spaced evenly within [−1, 1]. We experiment with three different families of warping functions: nonparametric (NP) [34], CPA-based (CPAb) [50], and our TSP-based functions (TSPb). We minimize the mean squared error over 200 epochs of ADAM [25] with learning rate η = 0.01. We explore two training strategies to obtain hard segmentations at the end of training: (i) 160 epochs with linear interpolation followed by 40 epochs of integer interpolation, denoted (160, 40), and (ii) 200 epochs with integer interpolation only, denoted (0, 200). Convergence plots, averaged over 100 restarts, are shown in Figure 4 (left block). Our relaxed segmented model architecture easily identifies the segment boundary and the DGP with all families of warping functions and both training strategies: the average loss after training is around 0.01 for all approaches, with standard deviations of 0.01-0.02. The reason is that the loss function is close to convex near the optimal solution.

We repeat the experiment with sinusoidal time series using the DGP f_Sin(z_t, θ_k) = sin(θ_k z_t), with covariates z_t spaced evenly within [0, T]. At the change point, the frequency parameter changes from 0.2 to 0.1. This task has a highly non-convex loss function with many local optima and cannot be solved as reliably as the piecewise linear task. We observe that the mixed strategy yields slightly better losses than the pure integer strategy.

Next, we fit our relaxed segmented model to COVID-19 [51] case numbers. Exploratory work [30, 39] has applied segmented Poisson regression [36, 38] to identify change points in the pandemic. We check whether our approach finds change points consistent with these works. We follow Küchenhoff et al. [30] and model daily time series of newly reported cases. We obtained official data for Germany from the Robert Koch Institute. Figure 5 (bars) reveals non-stationary growth rates and weekly periodicity in the reported data. We use time and a day-of-week indicator as covariates. We tie the day-of-week coefficients across all segments, while the daily growth rates and the bias terms differ in every segment. We minimize the negative log-likelihood (NLL) loss with TSPb warping functions (K = 5) using ADAM (η = 0.01, (15000, 5000) training strategy, 10 restarts). In the model with the lowest loss, the four change points are located at 2020-03-16, 2020-03-29, 2020-04-09, and 2020-04-28. Although the studies are based on different data, the change points at 2020-03-16, 2020-03-29, and 2020-04-09 are consistent with Küchenhoff et al. [30] in that they lie within their reported 95% confidence intervals. Muggeo et al. [39] do not report confidence bands, but find nearby change points at 2020-03-17 (+1 day), 2020-03-29 (±0), and 2020-04-06 (−3 days). Predictions from the model are shown in Figure 5 (blue line). We also provide smoothed predictions in which the average day-of-week (dow) effect is incorporated into the bias term to highlight the change of the growth rate from segment to segment, see Figure 5 (purple line).
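For the Poisson regression above, the loss can be written directly in terms of the interpolated parameters θ̂_t. The sketch below is our reconstruction of such a negative log-likelihood with a log link; the exact covariate coding used in the paper (day-of-week indicators with tied coefficients) is not reproduced here.

```python
import torch

def segmented_poisson_nll(theta_hat, z, x):
    """Poisson negative log-likelihood (up to an additive constant) with log link:
    rate λ_t = exp(z_t θ̂_t). theta_hat: (T, D) interpolated parameters,
    z: (T, D) covariates, x: (T,) observed counts."""
    log_rate = (z * theta_hat).sum(dim=-1)               # linear predictor z_t θ̂_t
    return (torch.exp(log_rate) - x * log_rate).mean()   # drops the log(x!) constant
```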
We now demonstrate that our model can be combined with deep architectures for feature learning. We designed a segmented logistic regression model where the covariates of the segmented model are the output of a stack of convolutional layers. The feature transformation is shared across all segments, while the parameters of the final classifier change. We use the Fashion-MNIST dataset [52] to simulate a sequential binary classification task with concept drift [17]. We generate a segmented sequence of labeled instances from two classes and change the class associations +1 and −1 from segment to segment. Results are visualized in Figure 6. The true segment boundaries are located at t = 200 and t = 700 and are detected by our model at t = 201 and t = 699. The model fit on the training data is almost perfect, with a single misclassified instance near the first segment boundary (training accuracy 0.999). To verify that the model has learned different classifiers in the three segments, we apply it to three sequences of test instances, where each sequence contains 1,000 examples from a single task only. The classifiers in every segment indeed perform best on the tasks they specialized on.

Finally, we apply our model to unsupervised phoneme segmentation. We assume that the speech signal, represented by a sequence of 12-dimensional MFCC vectors, is piecewise constant within a phoneme. We model it by a minimal DGP with no covariates that simply copies the 12-dimensional parameter vectors to the output. We fit the model to a single utterance ("choreographer") from the TIMIT corpus [18] with ground-truth segment labels by minimizing the mean squared error loss with ADAM (η = 0.01, (160, 40) training strategy, 10 restarts). The resulting fit shows the observed and the predicted sequence, each overlaid with the true phoneme boundaries. Although the simple DGP does not capture all dynamics of the speech signal, 7 out of 9 phoneme boundaries were correctly identified with a time tolerance of 20 ms. A baseline detector that predicts segment boundaries from a uniform distribution was as good or better in only 69 out of 10,000 runs (< 1%). This minimal experiment suggests that relaxed segmented models, when combined with more powerful DGPs, may be useful for discrete representation learning [43, 46, 14], in particular for learning segmental embeddings [27, 48, 9, 29]. We consider this a fruitful direction for future work.

We have described a novel approach to learn models for non-stationary sequential data with discrete change points. Our relaxed segmented model formulation is highly versatile and can use any family of warping functions to approximate a hard segmentation function. If the family of warping functions is differentiable, our model can be trained with gradient descent. We have introduced the novel family of TSP-based warping functions designed specifically for the segmentation task: it is differentiable, contains piecewise-constant functions that exactly represent segmentation functions, its parameters directly correspond to segment boundaries, and it is simple to evaluate computationally. While our simulations did not show significant differences between our family and existing families in terms of final loss, they provide evidence for more robust convergence. We believe that this robustness will translate to improved results when our model is embedded within larger model architectures. Finally, the experiments on diverse real datasets demonstrate the modeling capacities of our approach.

Broader impact and ethical considerations. The application of our model to COVID-19 case numbers must be interpreted with care, as the analysis is only explorative. In particular, the reported change points and model predictions should not be used (unless further validation is performed) to draw conclusions on the efficacy of containment strategies implemented in Germany at specific points in time.
Apart from that, this work does not present any foreseeable societal consequence.

References

[1] Fast algorithms for segmented regression
[2] Speech recognition using time-warping neural networks
[3] A kernel multiple change-point algorithm via model selection
[4] Structural breaks in time series
[5] Computation and analysis of multiple structural change models
[6] Detection of Abrupt Changes: Theory and Application
[7] A Change in Level of a Non-Stationary Time Series
[8] A Markov Model of Switching-Regime ARCH
[9] Unsupervised Neural Segmentation and Clustering for Unit Discovery in Sequential Data
[10] A multiresolution approach to time warping achieved by a Bayesian prior-posterior transfer fitting strategy
[11] Structural Break Estimation for Nonstationary Time Series Models
[12] Deep Diffeomorphic Transformer Networks
[13] Multiple Change Point Analysis: Fast Implementation and Strong Consistency
[14] SOM-VAE: Interpretable Discrete Representation Learning on Time Series
[15] Highly-Expressive Spaces of Well-Behaved Transformations: Keeping It Simple
[16] Joint probabilistic curve clustering and alignment
[17] A survey on concept drift adaptation
[18] TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1
[19] Self-modelling warping functions
[20] Event detection from time series data
[21] Analysis of time series subject to changes in regime
[22] Point Estimation of the Parameters of Piecewise Regression Models
[23] Spatial Transformer Networks
[24] Continuous Univariate Distributions
[25] Adam: A Method for Stochastic Optimization
[26] A dynamic HMM for on-line segmentation of sequential data
[27] Segmental Recurrent Neural Networks
[28] Beyond Beta: Other Continuous Families of Distributions with Bounded Support and Applications
[29] Phoneme Boundary Detection Using Learnable Segmental Features
[30] Analyse der Epidemischen Covid-19 Kurve in Bayern durch Regressionsmodelle mit Bruchpunkten
[31] Signal estimation under random time-warpings and nonlinear signal alignment
[32] Fitting Segmented Regression Models by Grid Search
[33] M-Statistic for Kernel Change-Point Detection
[34] Temporal Transformer Networks: Joint Learning of Invariant and Discriminative Time Warping (CVPR)
[35] A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
[36] Generalized Linear Models
[37] Piecewise regression
[38] Estimating regression models with unknown break-points
[39] Modelling COVID-19 outbreak: Segmented Regression to Assess Lockdown Effectiveness
[40] Continuous Inspection Schemes
[41] Curve registration
[42] Non-stationary dynamic Bayesian networks (NIPS)
[43] Discrete variational autoencoders
[44] The Segmented iHMM: A Simple, Efficient Hierarchical Infinite HMM
[45] Two-Sample Testing for Event Impacts in Time Series
[46] Neural discrete representation learning
[47] The standard two-sided power distribution and its properties: With applications in financial engineering
[48] Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection
[49] All of Statistics: A Concise Course in Statistical Inference
[50] Diffeomorphic Temporal Alignment Nets
[51] A new coronavirus associated with human respiratory disease in China
[52] Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms (arXiv)