title: Spatio-temporal mixture process estimation to detect dynamical changes in population
authors: Pruilh, Solange; Jannot, Anne-Sophie; Allassonnière, Stéphanie
date: 2022-02-23; journal: Artif Intell Med; DOI: 10.1016/j.artmed.2022.102258

Population monitoring is a challenge in many areas such as public health and ecology. We propose a method to model and monitor population distributions over space and time, in order to build an alert system for spatio-temporal data changes. Assuming that mixture models can correctly model populations, we propose a new version of the Expectation-Maximization (EM) algorithm to better estimate the number of clusters and their parameters at the same time. This algorithm is compared to existing methods on several simulated datasets. We then combine the algorithm with a temporal statistical model, allowing for the detection of dynamical changes in population distributions, and call the result a spatio-temporal mixture process (STMP). We test STMPs on synthetic data, considering several different behaviors of the distributions. Finally, we validate STMPs on a real data set of patients with a positive diagnosis of coronavirus disease 2019. We show that our pipeline correctly models evolving real data and detects epidemic changes. The rapid growth of health information systems has led to the availability of real-time spatio-temporal follow-up of patients affected by a given disease. A remaining challenge is to develop methods that use these data to improve public health strategies and transform them into actionable decision support systems. Spatial models are based on the characterisation of individuals by their geographical location (place of birth, place at the time of diagnosis, place of residence, etc.). Taken together, these individuals form a population.
Concurrently, the temporal component is essential in disease monitoring, which requires considering the population distribution as evolving over time. Mixture models offer several advantages for modelling such populations. First, they are flexible, as one can set the probability distribution function (pdf) of each cluster depending on the type of observations (scalars, vectors, positive measures, etc.). Second, the results are interpretable, because subjects can be attributed to estimated classes a posteriori, which enables one to distinguish homogeneous groups in the whole set. Third, unlike scan statistics methods, they do not rely on an estimate of a population reference distribution: they only rely on the distribution of cases. Last, these mixture models are parametric and well understood. When data are multivariate real-valued observations, the customary probability density for each cluster is the multivariate Gaussian distribution. This is particularly relevant when considering geographical data (mapped as lying on the real plane). The use of a multivariate Gaussian distribution inside a mixture model gives a Gaussian Mixture Model (GMM). To estimate the parameters of a Gaussian mixture, given the number of clusters (i.e. classes), the classic algorithm is the Expectation-Maximization (EM) algorithm introduced by Dempster et al. [7]. The estimate obtained with the EM algorithm is deterministic and highly dependent on the initialization step. Moreover, the construction of the sequence ensures that the critical points are maxima, but they may be either global or local ones. To avoid sensitivity to initial values and the selection of a wrong local maximum, several strategies rely on repeated random initialisation or on initialisation with the K-means algorithm [8]. Recently, Lartigue et al. introduced an annealed E-step to better explore the support and become almost independent of the initialization [9]. However, this method requires setting a temperature profile, which may be time-consuming.
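As an illustration of the classic estimation strategy discussed above (not the authors' code), the following sketch fits a GMM with EM, using several repeated k-means initialisations to reduce the sensitivity to the starting point. It relies on scikit-learn's `GaussianMixture`; the data set is a synthetic stand-in for geographical observations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian clusters, standing in for geographical data.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2)),
])

# n_init repeats the k-means initialisation + EM run and keeps the best fit,
# mitigating the dependence on the initialization step described above.
gmm = GaussianMixture(n_components=2, n_init=10, init_params="kmeans",
                      random_state=0).fit(X)
print(np.round(sorted(gmm.weights_), 2))  # proportions close to [0.5, 0.5]
```

Note that `n_components` must be supplied in advance here; removing that requirement is precisely the motivation for the dynamical algorithms discussed next.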
Finding an optimal number of components K is not an objective directly included in the original EM algorithm. This objective is often addressed by a model selection step, which requires a collection of estimated models [10-12]. Well-known criteria for model selection include information criteria such as AIC and BIC [10, 11]. In this work, we propose an improvement of the Robust EM algorithm [18]. We suggest changes that yield a more automatic algorithm and avoid the overlapping components observed with the Robust EM algorithm in tests on real data. This modified version of the Robust EM algorithm is compared to the original one (and to other model selection criteria) to show that there is no loss of performance on synthetic data, and that we outperform the state-of-the-art algorithm on real data. To finish designing our STMP, we perform experiments on synthetic data and study the behaviour of our pipeline in different situations so as to produce robust monitoring. Applying our process to a dataset of COVID-19 cases from the Paris area, we demonstrate the adequacy of a mixture model evolving over time and the consistency of the alert response to epidemic changes in the population. We assume, for our applications in Sections 4 and 5, that the population is generated from a Gaussian Mixture Model (GMM). In this section, we first recall the GMM definition. Then we detail the Robust EM algorithm, one of the algorithms used to fit GMMs. These methods are the basic elements on which we build our STMP pipeline, described in Section 3. To describe a Gaussian Mixture Model, we consider a set of observations denoted x = (x_1, …, x_n), with x_i ∈ ℝ^d, and let N(x | μ_k, Σ_k) denote the d-dimensional Gaussian density of component k, with mean μ_k and covariance matrix Σ_k. The EM algorithm [7] was introduced for the purpose of estimating GMMs and has remained the most popular choice.
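The "model collection + selection" strategy mentioned above can be sketched as follows: fit one GMM per candidate K and keep the model minimising an information criterion (here BIC, via scikit-learn). This is illustrative of the baseline approach, not of the authors' dynamical algorithm.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated clusters in the plane; the true K is 3.
X = np.vstack([rng.normal([0.0, 0.0], 0.4, (150, 2)),
               rng.normal([3.0, 0.0], 0.4, (150, 2)),
               rng.normal([0.0, 3.0], 0.4, (150, 2))])

# Fit a collection of models, one per candidate number of components.
candidates = {k: GaussianMixture(k, n_init=5, random_state=0).fit(X)
              for k in range(1, 7)}
# Model selection: BIC penalises model complexity, lower is better.
best_k = min(candidates, key=lambda k: candidates[k].bic(X))
print(best_k)  # 3
```

The drawback motivating the dynamical alternatives is visible in the code: six full EM fits are needed to select a single model.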
The general principle of this algorithm is to produce a sequence of parameters (θ^(p))_p that increases the likelihood at each iteration. As the EM algorithm presents several drawbacks, detailed in Section 1, and as we expect our framework to estimate the data distribution at a given time step in a single run, we turn to the more "dynamical" algorithms, where estimation and selection of the model are performed at the same time [15-19]. In the next part, we detail a recent dynamical algorithm proposed by Yang et al. [18], which addresses almost all of these issues and is the basis of our proposition. As mentioned previously, the unknown number of clusters in a GMM is a main issue. The authors of [18] dynamically search for the best number of components in the mixture. Their Robust EM adjusts the EM mixture objective function by adding a criterion based on the entropy of the mixture proportions π_k. Non-informative proportions yield a high entropy. Consequently, the penalty added to the likelihood is the negative entropy. Starting from the complete log-likelihood log L(θ; x, z), the objective function to maximize in the M-step with this entropy-based penalty is therefore

J(θ) = log L(θ; x, z) + β n Σ_{k=1}^{K} π_k log π_k, with β ≥ 0.    (3)

With this new criterion to maximise, the update equation of the component proportions π inside the EM algorithm becomes

π_k^(p+1) = π_k^(EM) + β π_k^(p) ( log π_k^(p) − Σ_{j=1}^{K} π_j^(p) log π_j^(p) ),

where π_k^(EM) is the usual EM update of the proportions. As we can see, a new hyperparameter β comes as a penalty weight in Eq. (3). It helps to control the competition between clusters. Acting on the evolution of the proportions through β enables one to check at each iteration that all the component proportions are above a given threshold, and therefore to delete those whose proportion falls below it. This is the annihilation part of their process. A specific dynamic is imposed on β: this parameter is set to zero when the cluster number K is stable, i.e. not decreasing for a time period p_min. This is important to avoid oscillating parameters, and thus to reach a maximum.
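The entropy-penalised proportion update and the annihilation step can be sketched as follows. This is our reading of the mechanism of [18], written as a minimal numpy function, not a verbatim reimplementation; the threshold value is illustrative.

```python
import numpy as np

def update_proportions(resp, pi_old, beta, threshold):
    """resp: (n, K) responsibilities; pi_old: (K,) current proportions."""
    pi_em = resp.mean(axis=0)                       # plain EM update
    # Entropy-based competition term from the penalised objective.
    entropy_term = np.log(pi_old) - np.sum(pi_old * np.log(pi_old))
    pi_new = pi_em + beta * pi_old * entropy_term
    keep = pi_new >= threshold                      # annihilation of weak clusters
    pi_new = pi_new[keep]
    return pi_new / pi_new.sum(), keep

# Toy responsibilities: one dominant cluster, two weak ones.
resp = np.array([[0.9, 0.05, 0.05]] * 100)
pi, keep = update_proportions(resp, np.array([0.8, 0.1, 0.1]),
                              beta=0.5, threshold=0.08)
print(int(keep.sum()))  # 1 : the two weak components are annihilated
```

The example shows the intended dynamic: the entropy term boosts large proportions and depresses small ones, so under-threshold components get deleted and K decreases over the iterations.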
A limitation is that they fixed this time limit to p_min = 60 iterations, without any attempt to adapt it to different use cases. The algorithm is nevertheless robust to initialization since, to start with, each data point is the center of its own component, which sets the initial number of classes K_0 to n, the sample size. Although efficient, the entropy-based penalisation [18] does not prevent having several components with similar parameters, meaning that two clusters may be superimposed. In the Robust EM algorithm [18], the competition and instability of the component proportions do not avoid ending up in a local maximum of this type. The coefficient β is usually not high enough to trigger the removal of one of the superimposed clusters. As the competition is not guaranteed at each iteration, we suggest improvements of the Robust EM algorithm in the next section. We also present a temporal process which, combined with estimation algorithms, provides efficient detection of dynamical changes in populations.

Method: Spatio-temporal mixture model with efficient estimation algorithms for dynamical change detection

In this section, we describe our general pipeline, named STMP, for modelling the temporal evolution of a population, including distribution change detection. Then, we introduce modifications of the Robust EM algorithm to escape local maxima characterised by "overlapping clusters". Finally, we detail another adaptation of the EM algorithm that constrains the estimation of the GMM parameters. This makes it possible to propose a close estimate of a given distribution while taking new samples into account. The STMP pipeline and the estimation algorithms are generic enough to apply to different mixture models by using different estimation algorithms. We consider that the time period is discretized and that the time steps are given by t = 1, …, T.
At each time step t, denote the data vector X^(t). Note that this pipeline is very versatile with respect to the distributions chosen in the mixture model, as well as the estimation algorithms used in the first step. Depending on the dataset, the model is able to handle any type of pdf. We now describe the two algorithms that we use to perform the candidate and alternative model estimations. In Section 2, we highlighted two weaknesses of the Robust EM algorithm of [18]. First, the minimal number of iterations (named p_min) before setting β = 0 is too small, which means that the algorithm is prematurely stopped in its exploration. Second, the algorithm gets stuck in local maxima as soon as the convergence condition on the change of the parameters (with ε as a threshold) is satisfied, which stops the algorithm too early and reveals aberrant clusters. These aberrant clusters are superimposed clusters, which means that at least two clusters share very similar (or exactly equal) parameter values. This corresponds to local maxima which can be analysed only by post-processing the results, and it is particularly observable on real and scattered data. To avoid this local maximum issue inside the estimation algorithm (and avoid post-processing analysis), we propose slight modifications of the Robust EM algorithm, incorporating an online verification step for superimposed clusters. We consider that two clusters are superimposed when their parameters are nearly identical. We name Constrained EM (C-EM) a slight variation of the original EM algorithm [7] in which the parameters are restricted to a neighbourhood of a given parameter vector denoted θ_0. In particular, we introduce constraints on the estimated component proportions; moreover, when the cluster means are involved, restrictions are also put on these means. The initialization of our C-EM algorithm is given by the parameter vector θ_0 as well. The idea behind the C-EM algorithm is to obtain estimated parameters highly driven by the initial parameter vector θ_0 but updated on the data X.
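The online verification step for superimposed clusters can be sketched as a pairwise check on the component parameters. The tolerances and function name below are hypothetical, chosen for illustration; the paper's actual criterion may differ.

```python
import numpy as np

def superimposed_pairs(means, covs, tol_mean=1e-2, tol_cov=1e-2):
    """Return index pairs (i, j), i < j, of near-identical components."""
    pairs = []
    K = len(means)
    for i in range(K):
        for j in range(i + 1, K):
            # Two clusters count as superimposed when both their means
            # and their covariance matrices are (nearly) equal.
            if (np.linalg.norm(means[i] - means[j]) < tol_mean and
                    np.linalg.norm(covs[i] - covs[j]) < tol_cov):
                pairs.append((i, j))
    return pairs

means = [np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2), 2 * np.eye(2)]
print(superimposed_pairs(means, covs))  # [(0, 1)]
```

Running such a check at each iteration (rather than as post-processing) is what allows one of the duplicated components to be removed before the algorithm converges to the degenerate local maximum.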
Because the parameters of our dynamical model are estimated empirically, the estimation suffers from sampling uncertainty. This means that the estimated parameters at time t − 1 may not be a perfect description of the data set, and a new independent sample from the same ground-truth distribution would lead to a slightly different estimated parameter vector and a slightly different likelihood. Therefore, we consider that a newly estimated independent mixture and the given estimated one may both come from the same ground truth. For this reason, the C-EM enables us to give the previously estimated model a chance to explain the data distribution. EM implementations commonly regularise the estimated covariance matrices in order to avoid singular ones. As we want to determine when the estimated candidate model does not correspond to the data, we remove this regularisation from the C-EM algorithm: instead, we raise an alert when one or more covariance matrices become singular. We add this condition as an alert in the STMP detailed in Subsection 3.1, before the calculation of the ratio r_t (Eq. (5)). In addition, as the covariance matrices are not constrained in the C-EM algorithm, we introduce a condition to check these parameters a posteriori. In the C-EM algorithm, the covariance matrices are freely estimated, but they can evolve far away from the initial covariance matrices Σ_k^0, thus losing the temporal link. We therefore introduce an existing similarity measure between the final covariance matrix estimated by C-EM, Σ_k^c, and the initial one, Σ_k^0. We use the cosine similarity, introduced as the correlation matrix distance by [22] for correlation matrices; we adopt their formulation and apply it to covariance matrices instead. Bounded between 0 and 1, this coefficient measures the orthogonality between two matrices and is useful to evaluate whether the spatial structure of the clusters has significantly changed.
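The similarity check above can be written in a few lines. The correlation matrix distance of [22], applied here to covariance matrices as the text describes, is d(A, B) = 1 − tr(AB) / (‖A‖_F ‖B‖_F), bounded in [0, 1]; this is a minimal sketch with illustrative matrices.

```python
import numpy as np

def cov_matrix_distance(A, B):
    """Correlation-matrix-distance formula applied to covariance matrices."""
    return 1.0 - np.trace(A @ B) / (np.linalg.norm(A, "fro") *
                                    np.linalg.norm(B, "fro"))

A = np.array([[2.0, 0.0], [0.0, 0.5]])
print(cov_matrix_distance(A, A))   # 0.0 : identical spatial structure
B = np.array([[0.5, 0.0], [0.0, 2.0]])   # principal axes swapped
print(round(cov_matrix_distance(A, B), 3))  # 0.529 : structure has changed
```

A value near 0 means the cluster's spatial structure is preserved between time steps, while a value approaching 1 indicates (near-)orthogonal structures and can be used to trigger an alert.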
Low values reflect high similarity, while high values reflect orthogonality, and thus dissimilarity. In the STMP, θ_0 is the parameter vector estimated at the previous time step t − 1 of the pipeline, which corresponds to θ^(t−1). We thus obtain at time t an estimated parameter depending on the estimated parameter at time t − 1, while allowing some adaptation of the model to the newly observed data X^(t). Finally, we should not forget that the C-EM is constrained by the initial parameters θ_0, including a fixed number of clusters K_0. It is not possible in C-EM to merge clusters based on their properties, as this would violate the imposed constraints. If the model estimated by C-EM does not fit the data X^(t) correctly, this will be detected inside the STMP. To conclude this section, our new process, combining the elements above, is fully described in Algorithm 1. As the following applications only consider geographical data, in ℝ², we use Gaussian Mixture Models to represent these data. The GMM parameters are estimated with the presented algorithms, and the likelihoods are computed with Eq. (2). Recall that the pseudo-code of Algorithm 1 shows the STMP with all our propositions, which could be used with different mixture models. The adaptation of the estimation algorithms may also be used to fit other distributions. We compare here our Modified Robust EM with several EM-based methods mentioned in Subsection 1.1.2. First, we run the original EM algorithm, which is based on a priori knowledge of the number of clusters. As in practice we do not know the true number of components, we estimate several models and select the best one based on either the Bayesian Information Criterion (BIC) [11] or the Integrated Completed Likelihood (ICL) [23]; we also compare with the original Robust EM algorithm (REM) [18] and the method of Figueiredo and Jain [15] (called the FJ method from here on).
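A condensed sketch of one STMP time step follows. For simplicity the candidate model is approximated here by the previous model itself (rather than a full C-EM re-fit) and the alternative model by a free fit on the new data; the variable names and the simulated drift are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X0 = rng.normal([0.0, 0.0], 0.5, (200, 2))   # data at time t-1
X1 = rng.normal([3.0, 3.0], 0.5, (200, 2))   # shifted data at time t

prev = GaussianMixture(1, random_state=0).fit(X0)   # model kept from t-1
alt = GaussianMixture(1, random_state=0).fit(X1)    # freely fitted alternative

# score() is the mean per-sample log-likelihood; the per-sample
# likelihood ratio compares how well each model explains X1.
r = np.exp(alt.score(X1) - prev.score(X1))
tau = 1.1                                    # alert threshold from the text
print(r > tau)  # True : the distribution moved, so an alert is raised
```

When the population has not moved, the two scores are close, r stays near one, and no alert fires; the threshold τ = 1.1 trades off false positives against missed slight displacements, as studied in the experiments below.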
The FJ method requires fixing an initial number of clusters K_initial. Originally, K_initial was chosen "far from the true number of components" but not too high in the experiments of [15]; here, in order to approach the behavior of the Robust EM algorithm, we set it higher. Then, we compare the precision of the estimated parameters over the runs with a successful component estimation. For each of the two defined mixtures, we computed the relative distance between the true and the estimated parameters; the mean relative errors are reported in Table 1 and Table 2.

Table 1: Mean (standard deviation) relative errors for the estimated parameters of the GMM for the dataset of Fig. 1a. The absolute value is used for proportions, the Euclidean norm for means, and the Frobenius norm for covariances.

The number of iterations is higher with our method than with REM because we put a "soft" condition on convergence to stop the algorithm. The number of iterations is very high for the FJ method because of the initial number of components, which is high here; it was originally fixed to a lower, arbitrarily chosen value by the authors of [15]. Note that we provided a narrow range of values, including the correct one, for the model selection criteria BIC and ICL. The EM algorithm failed with a higher number of components, as the algorithm tended to remove one cluster by cancelling its proportion and degenerating its covariance matrix. Our Modified Robust EM shows no loss of performance compared to the Robust EM on synthetic data, and solves the problems of the Robust EM on real data, as we will see later. All the experiments are done on a two-time-step configuration (only t = 0 and t = 1). We consider a Gaussian mixture distribution with three clusters and represent all the cases in Table 9. In addition, Figure 2 and Figure 3 give a simple representation of Cases I. to IV. and of Setups F., M. and C. involved in Cases V. to IX., with an example of sampled data sets.
We obtain Gaussian mixture distributions with parameters θ^(0) and θ^(1) from the described cases. For each of Cases I. to IX., given the two distributions, we sample n_0 = n_1 = n points, which form our data sets X^(0) and X^(1). The sampling step, for any of the cases presented above, is executed S times, each execution being followed by a run of our STMP on the sampled data. This produces S different resulting processes for each experiment (see Table 9), which enables us to analyze the behavior of the STMP and of the likelihood ratio across runs and evolution cases. This explains why we obtain 100% correct alerts in the majority of the case experiments (Fig. 4): the computed theoretical likelihood ratios are much higher than the tested values of the threshold. The critical cases for the choice of the threshold are Case VII. and Case IX. (Fig. 4). They imply slight differences between the distributions at t = 0 and t = 1, so the theoretical likelihood ratio values stay relatively close to one. Therefore, correct alerts are not raised for a threshold above about 1.06 when considering experiments with datasets of size n = 400 points (Fig. 4a) for these two cases. With n = 100 points, we clearly see that the theoretical likelihood ratios are globally lower: for the same threshold value, the number of false negative alerts increases (Fig. 4b). This gives us an intuition about the level of variation that our model can detect. On the estimated ratios, the same holds for Case IV., and later for Cases II. and III. Case VIII., which corresponds to a move from Setup F. to Setup C., is affected earlier by the likelihood ratio threshold, as the proximity of two clusters (Figure 3c) affects the estimation of the mixture parameters and thus the likelihood ratio. This leads us to set a threshold relatively close to one. As in the theoretical likelihood ratio study, we observe here that the slight movements corresponding to Case VII. and Case IX. lead to incorrect alerts for a threshold over one.
The estimated likelihood ratio values stay relatively close to one because the model M can adapt to the data X^(1), so the evolving distributions are not detected. Therefore, when applied to a specific problem, one has to know that the relocation of one cluster may be detected only if it is large relative to the variance of the estimated clusters. Otherwise, these displacements may be considered as normal variability of the discretization of the distributions. Note that this alert criterion may be adapted to a given problem through the constraints imposed on the candidate model. Finally, we see from the analysis of the theoretical and estimated ratios that we need to make a compromise. The optimistic theoretical likelihood ratios would lead us to take a threshold very close to one, but the values obtained with the estimated models contain more uncertainty, which we cannot ignore and which requires selecting a larger threshold. To obtain good performance of our pipeline, we fix the threshold to τ = 1.1. We thus obtain a balance between false negative and false positive alerts, both of which we want to keep as low as possible, considering all possible situations. We present here the results of the estimation of the GMM parameters with the Modified Robust EM. Cases I. to IV. give high rates, explained by the correct separation of the clusters. This score is also high for the displacement from Setup F. to Setup M. (Case VII.). When we consider moving clusters which get too close, this score decreases. The global score of STMP executions involving at least one Setup C. distribution is affected by the superposition of two clusters: the proportions of correct estimations are no higher than 54%. Looking at the estimated K^(0) and K^a in Table 3, the Modified REM algorithm estimates two classes in at least 30 out of 100 runs on samples from the Setup C. distribution.
Although these estimates are incorrect, they lead to understandable results, as samples from the two left-hand-side clusters can be confused (see Figure 3c). An example of wrongly estimated parameters for Setup C. is presented in Figure 6a, which confirms the interpretability of the results.

Table 3: Proportion of correctly estimated numbers of components among S = 100 runs. At each execution, the estimation is considered correct iff the estimated numbers of components match the true ones given in Table 9.

Thereafter, we compute the estimation errors for the means and covariance matrices in the experiments with a correctly estimated number of components K (see Table 4). This confirms that these Gaussian mixtures are correctly estimated by the Modified Robust EM inside our STMP pipeline. We also notice a poorer average estimate of the GMM parameters for datasets from Setup C.; as said previously, this parametrization implies that two clusters are mixed up.

Table 4: Mean (standard deviation) relative errors (expressed as a percentage) for the estimated means and covariance matrices within each case, over all runs having a correctly estimated K inside the STMP. The Euclidean norm is used for means, and the Frobenius norm for covariances.

We defined in the previous subsection the threshold used to alert the user that there may be a population dynamical change between two given time points. As explained in Subsection 3.3, an alert is also raised when a component tries to disappear (leading to a degenerate covariance matrix) during estimation by the C-EM, and when the covariance matrices estimated by the C-EM are too different from those of the previous step. With these warning systems, we have defined a whole pipeline, named STMP, to monitor the dynamics of the population and raise alerts when meaningful changes occur. We now demonstrate the performance of the STMP.
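The relative errors reported in the tables can be computed as sketched below: Euclidean norm for means, Frobenius norm for covariance matrices, each normalised by the norm of the true parameter. The function name and the toy values are illustrative.

```python
import numpy as np

def relative_errors(mu_true, mu_hat, cov_true, cov_hat):
    """Relative errors with the norms used in Tables 1, 2 and 4."""
    err_mu = np.linalg.norm(mu_hat - mu_true) / np.linalg.norm(mu_true)
    err_cov = (np.linalg.norm(cov_hat - cov_true, "fro")
               / np.linalg.norm(cov_true, "fro"))
    return err_mu, err_cov

e_mu, e_cov = relative_errors(np.array([2.0, 0.0]), np.array([2.1, 0.0]),
                              np.eye(2), 1.1 * np.eye(2))
print(round(e_mu, 3), round(e_cov, 3))  # 0.05 0.1
```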
Using an alert threshold of τ = 1.1, we obtain the following alert rates, which can be retrieved from Figure 5. For Case I., Case V. and Case VI., we obtain 98% true negative alerts. For Cases II. to IV., we obtain a true positive alert rate of 100%, detecting all changes in the population distribution with our STMP. The STMP does not raise an alert when the distributions barely differ in time; this is due to our likelihood ratio threshold fixed at τ = 1.1. The true positive alert rate is 2% for Case VII. and 1% for Case IX. By contrast, the bigger movement in Case VIII. leads to a true positive rate of 29%. This brings us to the problem that the STMP cannot raise an alert when GMMs are hard to estimate correctly, as here: this experiment involves Setup C., which is complex for EM algorithms to estimate. Last but not least, our proposed method is computationally efficient, with a very low computational time. All experiments on datasets of size n = 400 run with an average execution time of 1.33 s. We report the average execution time by case type in Table 5. Fast execution was also a criterion guiding the construction of our method, and is satisfying for our future applications.

Table 5: Average (and standard deviation) computation time of the different case experiments, with n_0 = n_1 = 400 points.

In the experiments on synthetic data described above, we fixed the data set size to n = 400. We now study the impact of the number of points for Cases I. to IX., with n ∈ {100, 200, 400}. With the same true distributions as in Figures 2 and 3, we perform S = 100 runs of our process with data samples of size n = 200 and n = 100 at each time step. As expected, decreasing n decreases the proportion of correctly estimated K over the 100 runs and, inherently, the quality of the parameter estimates (Table 6). With the smallest sample size, it becomes complicated to properly estimate a GMM even with well-defined clusters: the best alert rate is 61% and the worst is 8%.
Therefore, we must be aware that decreasing the number of samples decreases the proportion of correctly estimated K and, inherently, the quality of the parameter estimates θ in our Modified Robust EM. But overall, the STMP performance is less affected than the Modified Robust EM by changes in data set size. When two Gaussian components are hardly separable (Fig. 3c), the pipeline will estimate one cluster for the two components and raise an alert, as the data evolve away from the distribution estimated at t = 0 (which could be Setup F. or M.).

Table 6: Proportion of correctly estimated numbers of components among S = 100 runs. At each execution, the estimation is correct iff K̂^a = K^(1) and K̂^(0) = K^(0). (For instance, for Case IX. the proportions are 54%, 34% and 29%.)

Even if the Modified Robust EM becomes less accurate with smaller data sets, our pipeline still produces interpretable and meaningful results. The problem of decreasing performance on small data sets should be solved directly at the level of the Modified Robust EM. In this section we demonstrate the relevance of the STMP with GMMs on real epidemiological data from COVID-19 in Paris, France.

Table 7: Distribution of people with a positive COVID-19 diagnosis over the weeks.

The aim here is to underline the presence or absence of temporal constancy in the COVID-19 data. A temporal constancy would suggest that the population distribution was stable at the peak of the pandemic. This is in line with epidemiological studies that showed a "peak" around these weeks, after the first propagation phase (weeks 9 to 12) (see the weekly reports of Public Health France). There is no likelihood ratio value for the last week: this ratio is incalculable due to the "empty class phenomenon". The model M tries to remove a component, which leads to an early stop of the estimation process for this model. This triggers the inevitable choice of the alternative model and raises an alert.
We have proposed a complete and generic pipeline for modeling the evolution of a population distribution and detecting abnormal changes in this distribution. This STMP was combined with new EM algorithm variants. Our application to public health data shows that this STMP models population distributions well and raises meaningful alerts. The STMP for monitoring population distributions and the algorithms used to estimate the models are two independent objects. This enables future directions for our work, such as integrating covariates following non-Gaussian distributions in the mixture: we will still be able to use our proposed algorithms, as they are blind to the distributions in the mixture. On the other hand, the performance of the EM algorithms depends on the data set sizes. In future work we will try to introduce a temperature parameter in the Modified Robust EM, as proposed by [25], to improve estimation in unstable situations. Finally, the decision rule was empirically fixed in this work. In future work this decision rule will be modeled as an acceptance probability, taking advantage of Markov chain Monte Carlo theory. The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Table 11: Parameters estimated by the Robust EM [18] and the Modified Robust EM on week 13 of the COVID-19 dataset. These estimations were performed independently of the previous time steps.
• Spatio-temporal process to monitor population dynamics
• Markovian modeling of mixture models

References
- Advances in spatial epidemiology and geographic information systems
- A spatial scan statistic
- Evaluating cluster alarms: a space-time scan statistic and brain cancer
- A space-time permutation scan statistic for disease outbreak detection
- Spatial epidemiology: current approaches and future challenges
- Bias in analytic research
- Maximum likelihood from incomplete data via the EM algorithm
- EM for mixtures: initialization requires special care
- Deterministic approximate EM algorithm
- Information theory and an extension of the maximum likelihood principle
- Estimating the dimension of a model. The Annals of Statistics
- Minimal penalties for Gaussian model selection. Probability Theory and Related Fields
- Slope heuristics: overview and implementation
- Clustering and model selection via penalized likelihood for different-sized categorical data vectors
- Unsupervised learning of finite mixture models
- Estimation for the number of components in a mixture model using stepwise split-and-merge EM algorithm
- Competitive EM algorithm for finite mixture models. Pattern Recognition
- A robust EM clustering algorithm for Gaussian mixture models
- Simultaneous feature selection and clustering using mixture models
- Estimation and inference by compact coding
- Minimum message length and Kolmogorov complexity
- Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels. IEEE 61st Vehicular Technology Conference
- Assessing a mixture model for clustering with the integrated classification likelihood
- A new class of EM algorithms. Escaping local minima and handling intractable sampling

The authors thank the EDS APHP Covid consortium integrating the APHP Health Data Warehouse team, as well as all the APHP staff and volunteers who contributed to the implementation of the EDS-COVID database and operating solutions for this database.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by a grant from Région Île-de-France. This study was approved by the institutional review board (authorization number IRB 00011591) of the Scientific and Ethical Committee of the AP-HP.

The EM algorithm alternates between the two following steps until a convergence criterion is met. At the p-th iteration of the algorithm, the update equations are given by:

E-step: compute the responsibilities, for i = 1, …, n and k = 1, …, K:
γ_ik^(p) = π_k^(p) N(x_i | μ_k^(p), Σ_k^(p)) / Σ_{j=1}^{K} π_j^(p) N(x_i | μ_j^(p), Σ_j^(p)).

M-step: update the parameter estimates:
π_k^(p+1) = (1/n) Σ_{i=1}^{n} γ_ik^(p),
μ_k^(p+1) = Σ_{i=1}^{n} γ_ik^(p) x_i / Σ_{i=1}^{n} γ_ik^(p),
Σ_k^(p+1) = Σ_{i=1}^{n} γ_ik^(p) (x_i − μ_k^(p+1)) (x_i − μ_k^(p+1))^T / Σ_{i=1}^{n} γ_ik^(p).
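The two steps above can be written out directly in numpy. This is a minimal illustrative sketch of one E-step followed by one M-step for a GMM, not the paper's code; it uses `scipy.stats.multivariate_normal` for the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, cov):
    """One EM iteration for a GMM: returns updated (pi, mu, cov)."""
    n, K = len(X), len(pi)
    # E-step: responsibilities gamma_ik, normalised over components.
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates of proportions, means and covariances.
    Nk = gamma.sum(axis=0)
    pi_new = Nk / n
    mu_new = (gamma.T @ X) / Nk[:, None]
    cov_new = [(gamma[:, k, None] * (X - mu_new[k])).T @ (X - mu_new[k]) / Nk[k]
               for k in range(K)]
    return pi_new, mu_new, cov_new

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (100, 2)),
               rng.normal([4.0, 4.0], 0.5, (100, 2))])
pi, mu, cov = em_step(X, np.array([0.5, 0.5]),
                      np.array([[0.5, 0.5], [3.5, 3.5]]),
                      [np.eye(2), np.eye(2)])
print(np.round(pi, 2))  # proportions stay near [0.5, 0.5]
```

Iterating `em_step` until the parameter change falls below a tolerance reproduces the classic EM loop that the algorithms discussed in this paper build upon.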