title: Constrained parsimonious model-based clustering
authors: García-Escudero, Luis A.; Mayo-Iscar, Agustín; Riani, Marco
date: 2021-11-20  journal: Stat Comput  DOI: 10.1007/s11222-021-10061-3

Abstract: A new methodology for constrained parsimonious model-based clustering is introduced, in which tuning parameters control the strength of the constraints. The methodology includes, as limit cases, the 14 parsimonious models that are often applied in model-based clustering when assuming normal components. This is done in a natural way, by filling the gap among models and providing a smooth transition among them. The methodology leads to mathematically well-defined problems and is also useful to prevent spurious solutions. Novel information criteria are proposed to help the user in choosing the tuning parameters. The interest of the proposed methodology is illustrated through simulation studies and a real-data application on COVID data.

In this work, we use the notation φ(·; μ, Σ) for the probability density function of the p-variate normal distribution with mean μ and covariance matrix Σ. Given a sample of p-dimensional observations {x_1, ..., x_n}, the classification likelihood approach searches for a partition {H_1, ..., H_k} of the indices {1, ..., n}, mean vectors μ_1, ..., μ_k in R^p, symmetric positive semidefinite p × p scatter matrices Σ_1, ..., Σ_k and positive weights π_1, ..., π_k with Σ_{j=1}^k π_j = 1, which maximize

$$\sum_{j=1}^{k}\sum_{i\in H_j}\log\big(\pi_j\,\varphi(x_i;\mu_j,\Sigma_j)\big). \qquad (1)$$

Alternatively, the mixture likelihood approach seeks the maximization of

$$\sum_{i=1}^{n}\log\Big(\sum_{j=1}^{k}\pi_j\,\varphi(x_i;\mu_j,\Sigma_j)\Big). \qquad (2)$$

An important problem when maximizing (1) and (2) is that these two target likelihood functions are unbounded (Kiefer and Wolfowitz 1956; Day 1969). Another important issue is the typically large number of local maxima. In the mixture likelihood case, the existence of a sequence of local maxima converging to the true mixture parameters is guaranteed as the sample size n increases; however, it is not obvious how to choose among local maxima in practical applications. In fact, many local maxima associated with very high likelihood values are known to be clearly non-interesting and are often referred to as "spurious solutions" (see, e.g., Chapter 3.10 in McLachlan and Peel (2000)). In these cases, the fitted components are essentially defined by a few, almost collinear, observations. The algorithms applied for maximizing the target likelihoods (CEM algorithms when maximizing (1) and EM algorithms when maximizing (2)) can be affected by unboundedness and can be trapped in sub-optimal or non-interesting local maxima. This is even more problematic when applying well-known information criteria (such as BIC and ICL), since these criteria are based on penalized versions of the target likelihood values, and spurious solutions or the unboundedness issue can result in artificially large likelihood values. Note also that, when choosing k, it is necessary to fit models with a larger number of components than actually needed, where all the previously mentioned problems are even more likely to appear. These problems with local maxima can in principle be solved by carefully exploring and analyzing all possible local maxima (McLachlan and Peel 2000).
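As an illustration of the two target functions, the following sketch (ours, not the authors' code) evaluates (1) and (2) for given parameters using SciPy; the small loop at the end also exemplifies the unboundedness issue, since shrinking one component's covariance around a single observation makes (2) arbitrarily large.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik(X, weights, means, covs):
    """Mixture log-likelihood (2): sum_i log sum_j pi_j phi(x_i; mu_j, Sigma_j)."""
    dens = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=S)
                            for w, m, S in zip(weights, means, covs)])
    return np.log(dens.sum(axis=1)).sum()

def classification_loglik(X, labels, weights, means, covs):
    """Classification log-likelihood (1): each observation contributes only
    through the component H_j it is assigned to."""
    labels = np.asarray(labels)
    total = 0.0
    for j, (w, m, S) in enumerate(zip(weights, means, covs)):
        Xj = X[labels == j]
        if len(Xj):
            total += np.sum(np.log(w * multivariate_normal.pdf(Xj, mean=m, cov=S)))
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pi = [0.5, 0.5]
mu = [X[0], X.mean(axis=0)]
# Shrinking one component's covariance around a single data point makes (2) blow up,
# which is the unboundedness problem discussed above.
for eps in (1.0, 1e-2, 1e-4):
    covs = [eps * np.eye(2), np.cov(X.T)]
    print(eps, mixture_loglik(X, pi, mu, covs))
```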
Although some interesting procedures have been introduced in the direction of exploring all possible local maxima (see, e.g., Gallegos and Ritter (2018)), this approach is not straightforward and is certainly time consuming. Another widely applied remedy consists in initializing the algorithms adequately, so that the iterations return good local maxima. It is well known that EM and CEM algorithms are highly dependent on their initialization, but it is also true that adequate initialization strategies (for instance, appropriate hierarchical model-based clustering initializations) often result in sensible local maxima. However, theoretical guarantees about the correctness of initializations are difficult to establish, and the final fitted model may inherit significant drawbacks from the initializing procedure. Additionally, if two different initializations provide different final results, it is difficult to justify not choosing the one with the larger likelihood value without any further analysis. In fact, some initialization procedures are clearly aimed at searching directly for the largest values (see, e.g., Biernacki et al. 2003).

It is also common to enforce constraints on the Σ_j scatter matrices when maximizing (1) or (2). Among them, the use of "parsimonious" models (Celeux and Govaert 1995; Banfield and Raftery 1993) is one of the most popular and widely applied approaches in practice. These parsimonious models follow from a decomposition of the Σ_j scatter matrices as

$$\Sigma_j = \lambda_j\,\Omega_j\,\Gamma_j\,\Omega_j', \qquad (3)$$

with λ_j = |Σ_j|^{1/p} (volume parameters), Γ_j = diag(γ_{j1}, ..., γ_{jl}, ..., γ_{jp}) with det(Γ_j) = ∏_{l=1}^p γ_{jl} = 1 (shape matrices), and Ω_j (rotation matrices) with Ω_j Ω_j' = I_p. Different constraints on the λ_j, Ω_j and Γ_j elements are considered across components to obtain 14 parsimonious models (coded with a combination of three letters). These models notably reduce the number of free parameters to be estimated, thus improving efficiency and model interpretability. Moreover, many of them turn the constrained maximization of the likelihoods into well-defined problems and help to avoid spurious solutions. Unfortunately, the problems remain for models with unconstrained λ_j volume parameters, which are coded with a V as the first letter (V** models). Aside from relying on good initializations, it is common to consider an early stopping of the iterations when approaching scatter matrices with very small eigenvalues or when detecting components accounting for a reduced number of observations. A not fully iterated solution (or no solution at all) is then returned in these cases. This idea is known to be problematic when dealing with (well-separated) components made up of a few observations.

Starting from a seminal paper by Hathaway (1985), an alternative approach is to constrain the Σ_j scatter matrices by specifying some tuning constants that control the strength of the constraints. A fairly comprehensive review of this approach can be found in García-Escudero et al. (2018). For instance, a recent proposal following this idea is the "deter-and-shape" one in García-Escudero et al. (2020). The maximal ratio among the λ_j terms is there constrained to be smaller than a fixed constant and, additionally, the maximal ratio γ_{jl}/γ_{jl'} within each Γ_j shape matrix is also constrained to be smaller than another fixed constant. In this work, we will refer to this second type of constraint as "shape-within", since it controls the relative sizes of the shape elements "within" each shape matrix individually.
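The decomposition (3) can be obtained from any covariance matrix via its spectral decomposition; a minimal illustrative sketch (ours, with hypothetical names) is:

```python
import numpy as np

def volume_shape_rotation(Sigma):
    """Decompose Sigma = lam * Omega @ Gamma @ Omega.T as in (3):
    lam = |Sigma|^(1/p) (volume), Gamma diagonal with det 1 (shape), Omega orthogonal (rotation)."""
    p = Sigma.shape[0]
    eigval, eigvec = np.linalg.eigh(Sigma)
    order = np.argsort(eigval)[::-1]          # non-increasing shape elements
    eigval, Omega = eigval[order], eigvec[:, order]
    lam = np.prod(eigval) ** (1.0 / p)        # lambda_j = |Sigma_j|^(1/p)
    Gamma = np.diag(eigval / lam)             # unit-determinant shape matrix
    return lam, Gamma, Omega

Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])
lam, Gamma, Omega = volume_shape_rotation(Sigma)
print(lam, np.diag(Gamma))
print(np.allclose(lam * Omega @ Gamma @ Omega.T, Sigma))   # True: reconstruction check
```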
When this second ("shape-within") constant is set equal to 1, since all the γ_{jl} are then equal to 1, we are imposing spherical components. In this work, we introduce new "shape-between" constraints where the maximal ratio γ_{jl}/γ_{j'l} is controlled across components, for each fixed l = 1, ..., p. Figure 1 shows a summary of the two types of constraints considered on the shape elements. Notice that we have γ_{jl} = γ_{j'l} for every j and j' in the most constrained case, but the fitted components are not necessarily spherical. Therefore, these new constraints are better suited to control differences among shape matrices without assuming sphericity. The new constraints can be easily combined with others typically imposed on the Ω_j rotation matrices.

The main contributions of this work are the following:

(a) The proposed constraints yield well-defined problems, and it is not necessary to include the specification of any particular initialization strategy. An underlying population (theoretical) problem can thus be defined. Section 4 shows existence results for both the sample and the population problems.

(b) The new constraints allow us to achieve, as limit cases, the 14 parsimonious models which are commonly applied in model-based clustering. These popular parsimonious models cannot be obtained as limit cases when only considering "deter-and-shape(within)" constraints or other constraints such as the ones based on eigenvalue ratios (García-Escudero et al. 2015). However, contrary to what happens with V** models, the associated likelihood maximization problems are always well defined. It is perhaps too extreme to choose only among strongly constrained models (maximal ratios exactly equal to 1) and the fully unconstrained ones (maximal ratios taking arbitrarily large values). Sometimes it is clear that the data do not suggest considering the most constrained models, but leaving them fully unrestricted may cause estimation instabilities and the detection of spurious solutions. A smooth transition between these extreme cases can be obtained with the proposed methodology. An interesting connection between the two types of constraints (between and within) on the shape matrix elements is given in Sect. 2, together with some practical consequences.

(c) Although 14 different algorithms are often employed to estimate the classical parsimonious models, a unifying algorithm is proposed in Sect. 3, which includes all the 14 classical parsimonious models as limit cases.

(d) Some general guidelines about how to choose the tuning parameters are provided. In fact, the smooth transition among models turns out to be useful to introduce novel information criteria, inspired by those in Cerioli et al. (2018). These criteria penalize high likelihood values resulting from unnecessary model complexity associated with the constraints. Model complexity here does not necessarily correspond to the total number of parameters; it simply means that more flexibility in the constraints allows us to fit more varied models. This proposal can be seen as a first step in order to obtain a reduced list of "sensible" cluster solutions, as done in Cerioli et al. (2018).

Some simulations and a real data example are provided in Sects. 6 and 7 to illustrate the interest of the proposed methodology. We do not claim that the well-established and widely applied proposals considered for comparison are useless; they have amply demonstrated their usefulness.
However, we illustrate that the proposed methodology can also be very useful and that there is room for further investigation of this new proposal. Concluding remarks and open research directions are given in Sect. 8.

We impose three different types of constraints on the Σ_j matrices, which depend on three constants c_det, c_shw and c_shb, all of them greater than or equal to 1. The first type of constraint serves to control the maximal ratio among determinants and, consequently, the maximum allowed difference between component volumes ("deter"):

$$\frac{\max_{j=1,\dots,k}|\Sigma_j|}{\min_{j=1,\dots,k}|\Sigma_j|}\le c_{\mathrm{det}}. \qquad (4)$$

The second type of constraint controls departures from sphericity "within" each component ("shape-within"):

$$\frac{\max_{l=1,\dots,p}\gamma_{jl}}{\min_{l=1,\dots,p}\gamma_{jl}}\le c_{\mathrm{shw}}\quad\text{for } j=1,\dots,k. \qquad (5)$$

This provides a set of k constraints that, in the most constrained case c_shw = 1, imposes Γ_1 = · · · = Γ_k = I_p, where I_p is the identity matrix of size p, i.e., sphericity of the components. Constraints (4) and (5) were the basis for the "deter-and-shape" constraints in García-Escudero et al. (2020). These two constraints resulted in mathematically well-defined constrained maximizations of the likelihoods in (1) and (2). However, although highly operative in many cases, they do not include, as limit cases, all the already mentioned 14 parsimonious models. For instance, we may be interested in the same (or not very different) Γ_j or Σ_j matrices for all the mixture components, and these cannot be obtained as limit cases from the "deter-and-shape" constraints. In this work, we introduce a third type of constraint that serves to control the maximum allowed difference between shape elements "between" components ("shape-between"):

$$\frac{\max_{j=1,\dots,k}\gamma_{jl}}{\min_{j=1,\dots,k}\gamma_{jl}}\le c_{\mathrm{shb}}\quad\text{for } l=1,\dots,p. \qquad (6)$$

This new type of constraint allows us to impose "similar" shape matrices for the components and, consequently, to enforce Γ_1 = · · · = Γ_k in the most constrained c_shb = 1 case.

Additionally, another type of constraint on the rotation matrices Ω_j can be combined with the previous ones. Three different constraints rot on the rotation matrices can be considered, coded with the letters E, I and V. If rot=E, then we are assuming the same rotation matrices Ω_1 = · · · = Ω_k for all the components. If rot=I, then we are assuming Ω_1 = · · · = Ω_k = I_p, i.e., axes parallel to the coordinate axes (conditional independence within cluster components). Finally, rot=V leaves the rotation matrices Ω_j fully unconstrained. In this third case of fully unconstrained rotation matrices rot=V, we choose the diagonal elements of Γ_j (by choosing the appropriate rotation matrices Ω_j) such that the shape elements appear in non-increasing order:

$$\gamma_{j1}\ge\gamma_{j2}\ge\dots\ge\gamma_{jp}\quad\text{for } j=1,\dots,k. \qquad (7)$$

This ordering makes sense since adequate rotations (in the rot=V case) can be performed such that ordered elements within each shape matrix are achieved.

The following lemma shows an interesting connection between the two different types of constraints on the shape matrices.

Lemma 1: If the "shape-within" constraints (5) are satisfied for a constant c_shw ≥ 1, then

$$\frac{\gamma_{jl}}{\gamma_{j'l}}\le c_{\mathrm{shw}}^{(p-1)/p} \qquad (8)$$

for any j, j' ∈ {1, . . . , k} and l = 1, . . . , p. The proof of this technical lemma is given in Appendix A.
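For concreteness, the following small sketch (illustrative only; names are ours) computes the maximal ratios appearing in (4), (5) and (6) for a set of candidate scatter matrices, so that one can check which values of c_det, c_shw and c_shb would make them feasible:

```python
import numpy as np

def constraint_ratios(Sigmas):
    """Return the maximal ratios controlled by the deter (4), shape-within (5)
    and shape-between (6) constraints for a list of k scatter matrices."""
    p = Sigmas[0].shape[0]
    dets = np.array([np.linalg.det(S) for S in Sigmas])
    # shape elements: eigenvalues scaled to unit determinant, sorted as in (7)
    shapes = []
    for S in Sigmas:
        ev = np.sort(np.linalg.eigvalsh(S))[::-1]
        shapes.append(ev / np.prod(ev) ** (1.0 / p))
    shapes = np.array(shapes)                                  # k x p matrix of gamma_{jl}
    r_det = dets.max() / dets.min()                            # ratio in (4)
    r_shw = (shapes.max(axis=1) / shapes.min(axis=1)).max()    # ratio in (5)
    r_shb = (shapes.max(axis=0) / shapes.min(axis=0)).max()    # ratio in (6)
    return r_det, r_shw, r_shb

Sigmas = [np.diag([4.0, 1.0]), np.diag([2.0, 2.0]), np.diag([9.0, 1.0])]
r_det, r_shw, r_shb = constraint_ratios(Sigmas)
print(r_det, r_shw, r_shb)   # feasibility requires r_det <= c_det, r_shw <= c_shw, r_shb <= c_shb
```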
Taking into account the definition of the "shape-between" constraints in (6) as a maximal ratio, Lemma 1 implies that the choice of c_shw in (5) modifies the effect of c_shb in (6), and that there is no point in considering values of c_shb not obeying

$$c_{\mathrm{shb}} \le c_{\mathrm{shw}}^{(p-1)/p}.$$

For instance, this implies that c_shb ≤ √c_shw when we are in dimension p = 2, and that we are obviously assuming c_shb = 1 whenever we set c_shw = 1. An important consequence is that, although we potentially have 2^3 × 3 = 24 different extreme models (appearing when c_det, c_shw and c_shb are chosen equal to 1 or ∞, combined with the three possible constraints rot on the rotations), not all these 24 models are feasible (because c_shw = 1 necessarily implies c_shb = 1), and only the 14 well-known parsimonious models make sense. Table 1 shows how these 14 limit models are derived from different combinations of constraints (this table only includes 14 rows). Table 1 helps to understand the smooth transition among all these 14 models when the constants c_det, c_shw and c_shb are moved in a controlled fashion. This smooth transition is useful for introducing the novel information criteria in Sect. 5. The "deter-and-shape" constraints also appear as a limit case when c_shb tends to ∞ and rot=V is chosen. Notice that Lemma 1 implies that, when c_shw is chosen close to 1 in the "deter-and-shape" approach, we are also (implicitly) assuming that c_shb is close to 1. On the contrary, a large c_shw is still compatible with a c_shb as close to 1 as desired. In fact, choosing moderate values for c_det and c_shb (but not exactly equal to 1) and fixing a very large c_shw value, together with rot=V, turns out to be convenient and advisable, providing a very competitive procedure in many cases.

The constrained maximizations of (1) and (2) are addressed through an algorithm that relies on the "optimal truncation" operator (Fritz et al. 2013), which has been widely applied in constrained model-based clustering approaches. For the sake of completeness, this operator is reviewed in Appendix B. The proposed algorithm follows analogous steps to the EM and CEM algorithms, but iterative procedures may be needed at some points. In the t-th step, the constrained scatter matrices are updated as

$$\Sigma_j^{(t)} = d_j\, R_j\, D_j\, R_j',$$

for some d_j > 0, diagonal matrices D_j with |D_j| = 1 and orthogonal matrices R_j. All these d_j, R_j and D_j elements are determined through iterative procedures where, roughly speaking, these elements are sequentially improved in turn by optimally updating one of them (in the sense of increasing the target likelihood and fulfilling the constraints) conditionally on the others. Further iterations are sometimes required for D_j and R_j within that iterative procedure. Iterations can be stopped when reaching a maximal number of iterations, and we use the notation "iter.max.***" when referring to the maximal number of allowed iterations. Additionally, it is useful to monitor "relative changes" in the updated parameters and stop iterations whenever relative changes are smaller than some pre-specified small tolerances. We use the simplified notation Δb for the relative change of a parameter b in the h-th iteration, i.e., the change of b^(h) with respect to b^(h−1) in the previous (h − 1)-th step (we simplify the notation by deleting the dependence on the index h). All these small tolerances are denoted as "tol.***". More details on these aspects are provided in Remark 1. Additionally, nstart controls the total number of random initializations. We denote the parameters at the t-th step of the proposed algorithm by θ^(t) = (π_1^(t), ..., π_k^(t), μ_1^(t), ..., μ_k^(t), Σ_1^(t), ..., Σ_k^(t)).

1. Initialization: The procedure is initialized nstart times by randomly selecting different initial sets of parameters θ^(0) = (π_1^(0), ..., π_k^(0), μ_1^(0), ..., μ_k^(0), Σ_1^(0), ..., Σ_k^(0)).
A simple strategy for this initialization is to randomly select k × (p + 1) observations and use them, after splitting them into k groups, to compute k initial centers μ_j^(0) and k initial scatter matrices Σ_j^(0). It may happen that the initial Σ_j^(0) matrices do not satisfy the required constraints, but the constraints will be imposed in the following iterative step.

2. Iterative step: t ← t + 1

2.1. Computing observation weights: From θ^(t−1), observation weights are computed. When maximizing (1), each observation is assigned to the component attaining the highest value of π_j^(t−1) φ(x_i; μ_j^(t−1), Σ_j^(t−1)), yielding sets H_j^(t) and weights w_{ij} = 1 if i ∈ H_j^(t) and w_{ij} = 0 otherwise. On the other hand, when maximizing (2), the observation weights are the posterior probabilities

$$w_{ij} = \frac{\pi_j^{(t-1)}\,\varphi(x_i;\mu_j^{(t-1)},\Sigma_j^{(t-1)})}{\sum_{l=1}^{k}\pi_l^{(t-1)}\,\varphi(x_i;\mu_l^{(t-1)},\Sigma_l^{(t-1)})}.$$

2.2. Updating component weights: From these observation weights, we define n_j = Σ_{i=1}^n w_{ij}, and the component weights are updated as π_j^(t) = n_j/n.

2.3. Updating location parameters: Location parameters are updated as the weighted means μ_j^(t) = n_j^{−1} Σ_{i=1}^n w_{ij} x_i.

2.4. Updating scatter matrices: Updating the scatter matrices Σ_j^(t) is not so straightforward. As previously commented, the updated scatter matrices are obtained as Σ_j^(t) = d_j R_j D_j R_j', where the d_j, D_j and R_j terms have to be obtained through iterations. Our starting point is the set of k weighted sample covariance matrices

$$S_j = \frac{1}{n_j}\sum_{i=1}^{n} w_{ij}\,(x_i-\mu_j^{(t)})(x_i-\mu_j^{(t)})'.$$

2.4.1 Initialization: u ← 0. Initially set d_j = |S_j|^{1/p} and the R_j as follows: for rot=V, take R_j as the matrix whose columns are the eigenvectors of S_j associated with its eigenvalues in decreasing order; for rot=I, take R_1 = · · · = R_k = I_p; for rot=E, take R_1 = · · · = R_k = R, where R is the matrix whose columns are the eigenvectors of the "pooled" scatter matrix.

2.4.2 Iterative updates: starting from the current d_j and R_j, the shape elements {e_j1, ..., e_jp} are iteratively improved as follows.
(ii) Take s ← s + 1 and apply the "within" constraints in (5) to the shape elements of each component j, so that max_l e_jl / min_l e_jl ≤ c_shw.
(iii) Normalize the elements in {e_j1, ..., e_jp} in order to get unit-determinant shape matrices, by taking e_jl ← e_jl / (∏_{l=1}^p e_jl)^{1/p} for l = 1, ..., p and j = 1, ..., k.
(iv) rot=V case: consider a permutation σ_j which serves to sort the previous elements in decreasing order as e_{jσ_j(1)} ≥ · · · ≥ e_{jσ_j(p)}, and take e_jl ← e_{jσ_j(l)} for l = 1, ..., p.
(v) Apply the "between" constraints in (6), so that max_j e_jl / min_j e_jl ≤ c_shb for each fixed l.
These steps are repeated while the iteration counter is below iter.max.D and ΔD(·) > tol.D; otherwise, the iterations are concluded and the shape matrices are finally updated as D_j = diag(e_j1, ..., e_jp).

Then compute ν_1, ..., ν_k and update the d_j volume parameters accordingly, so that the "deter" constraint in (4) is fulfilled. Finally, the rotation matrices are updated: for rot=V, no change is needed in the R_j matrices; for rot=I, nothing has to be done because R_j = I_p; for rot=E (which needs iterations through R^(r)), iterations are run while the counter is below iter.max.R and ΔR(·) > tol.R, and the common rotation is finally updated as R_1 = · · · = R_k = R^(r).

Remark 1: Relative changes Δb are computed after converting matrices into vectors with vec(·) when needed. The only exception is when monitoring the iterated rotation matrices R^(h), for which a specific measure of change between successive rotation matrices is used.

Remark 2: Trivial computational short-cuts can be introduced in the limit cases, where any of the constants c_det, c_shw and c_shb is chosen equal to 1 or to ∞. In those cases, many iterative steps can be avoided. It is also worthwhile to notice that the most computationally demanding version of the algorithm occurs when rot=E.

The rationale behind all the steps in this algorithm follows from the adaptation of algorithms in the literature. We always try to improve a subset of parameters conditionally on the remaining ones and, of course, fulfilling the required constraints. For instance, the algorithm of Browne and McNicholas (2014) for estimating common principal components is applied when updating the common rotation matrix in the rot=E case, while the volume and "within"-shape updates follow from García-Escudero et al. (2020). The "optimal truncation" operator is also used to impose the novel "between" constraint in (6).
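A minimal sketch of the unconstrained core of one such iteration (posterior weights and weighted moments for the mixture target (2)) is given below; the constraint-enforcement steps 2.4.1–2.4.2 are left as a placeholder callback, and this is not the authors' MATLAB implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def one_iteration(X, weights, means, covs, enforce_constraints=None):
    """One EM-style iteration for the mixture target (2): posterior weights,
    weighted means and weighted covariances; constraint enforcement is delegated
    to `enforce_constraints` (placeholder for steps 2.4.1-2.4.2 of the algorithm)."""
    n, p = X.shape
    k = len(weights)
    dens = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=S)
                            for w, m, S in zip(weights, means, covs)])
    tau = dens / dens.sum(axis=1, keepdims=True)          # posterior weights (step 2.1)
    nj = tau.sum(axis=0)                                  # n_j (step 2.2)
    new_weights = nj / n                                  # pi_j
    new_means = (tau.T @ X) / nj[:, None]                 # mu_j (step 2.3)
    S = []
    for j in range(k):
        Xc = X - new_means[j]
        S.append((tau[:, j, None] * Xc).T @ Xc / nj[j])   # weighted covariances S_j (step 2.4)
    if enforce_constraints is not None:
        S = enforce_constraints(S, nj)                    # constrained update of the S_j
    return new_weights, new_means, S
```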
If we assume that {x_1, . . . , x_n} is a sample from a theoretical probability distribution P, a population version of the constrained parsimonious methodology can be defined, and existence and consistency results can be proved whenever finite restriction constants c_det, c_shw and c_shb are considered. Given θ = (π_1, · · · , π_k, μ_1, · · · , μ_k, Σ_1, · · · , Σ_k), we introduce the functions W_j(x; θ) = π_j φ(x; μ_j, Σ_j) and W(x; θ) = max{W_1(x; θ), . . . , W_k(x; θ)}, and the set

Θ_[c_det, c_shw, c_shb] = {θ : Σ_1, . . . , Σ_k satisfy (4), (5) and (6) for c_det, c_shw and c_shb}.

Theorem 1 provides existence results (for both the theoretical and the sample problems) under finite second-order moment conditions: if E_P‖·‖² < ∞, then there exists θ ∈ Θ_[c_det, c_shw, c_shb] at which the maximum of the corresponding population target function, denoted by (10) in the classification likelihood case and by (11) in the mixture likelihood case, is achieved when θ is constrained to be in Θ_[c_det, c_shw, c_shb].

Let us consider an i.i.d. sample {x_1, . . . , x_n} from the underlying distribution P, and let P_n = (1/n) Σ_{i=1}^n δ_{x_i} be the associated empirical distribution. The maximizations of (1) and (2) under the constraints (4), (5) and (6) when P = P_n reduce exactly to the methodology just presented in Sect. 2. Consequently, Theorem 1 also guarantees the existence of a solution of the empirical problem. Moreover, a consistency result can be proven for the sequence of empirical maximizers toward the maximizer of the theoretical problem, if the latter is unique (up to a relabelling). Let θ_n = (π_1^n, · · · , π_k^n, μ_1^n, · · · , μ_k^n, Σ_1^n, · · · , Σ_k^n) ∈ Θ_[c_det, c_shw, c_shb] denote the sequence of empirical maximizers for the sequence of empirical distributions {P_n}_{n=1}^∞ from P. With this notation, Theorem 2 presents the consistency result.

Theorem 2: Assume that P is not concentrated at k points, that E_P‖·‖² < ∞, and that θ_0 ∈ Θ_[c_det, c_shw, c_shb] is the unique constrained maximizer of (10), resp. (11), up to a relabelling of the parameters corresponding to each of the k components, for P. If {θ_n}_{n=1}^∞ is a sequence of empirical maximizers of (1), resp. (2), with θ_n ∈ Θ_[c_det, c_shw, c_shb], then θ_n → θ_0 almost surely.

The proofs of Theorems 1 and 2 derive from similar results in García-Escudero et al. (2020), given that, trivially, we have Θ_[c_det, c_shw, c_shb] ⊂ Θ_{c_1, c_2} for c_1 = c_det and c_2 = c_shw, with the same notation for Θ_{c_1, c_2} as in García-Escudero et al. (2020). The results in that previous work were, in turn, based on more general theoretical results in García-Escudero et al. (2015) that had been proved for eigenvalue-ratio constraints. It can also be proven that finite c_det and c_shb values are already enough for the existence and consistency results if P satisfies E_P‖·‖² < ∞ and P is not concentrated at k hyperplanes.

In this section, we introduce novel information criteria intended to automatically choose the number of mixture components k and pars = [c_det, c_shw, c_shb, rot]. The proposal is to choose [k̂, parŝ] as the minimizer of a BIC-type criterion

$$\mathrm{BIC}(k,\mathrm{pars}) = -2\,L_k^{\mathrm{pars}} + v_k^{\mathrm{pars}}\,\log n, \qquad (12)$$

where L_k^pars is the maximum value achieved in the constrained maximization of (2) under the constraints defined by pars, and where v_k^pars is a penalty term accounting for the model complexity allowed by pars and k. Notice that larger values of c_det, c_shw and c_shb yield less restricted Σ_j scatter matrices, since more complex models are allowed to be fitted. It is important to note that "model complexity" here does not necessarily correspond to an increased number of parameters, and so "smaller complexity" does not necessarily mean that there are fewer parameters to be interpreted. We also consider a source of complexity for the Σ_j matrices which depends on the allowed rotations through rot. The proposal follows a similar philosophy to that introduced in Cerioli et al. (2018).
The same arguments in terms of "relative volumes" (as those in Theorem 3.1 in that paper) have been taken into account to derive the expression for v_k^pars. It is easy to see that v_k^pars exactly coincides with the number of free parameters for the classical 14 parametrizations which appear as limit cases (restriction constants equal to 1 or tending to ∞) reviewed in Table 1. Moreover, the proposal also coincides with the BIC proposal for the "deter-and-shape" constraints previously introduced in García-Escudero et al. (2020) when the constraint (6) is removed by taking c_shb → ∞. In that case, the contribution to v_k^pars due to the parameters associated with the "shape elements" is just k(p − 1)(1 − 1/c_shw) and k(rot) = k (no constraints on the rotation matrix). In this work, we have focused on the BIC proposal, which is an extension of the MIX-MIX approach in Cerioli et al. (2018). A classification likelihood approach can also be applied by replacing the target function (2) with (1), to define extensions of the MIX-CLA and CLA-CLA approaches (in the spirit of the ICL criterion in Biernacki et al. (2000)).

The minimization of the criterion (12) over all the possible combinations of k and pars is not an easy task. In order to circumvent this problem, we only consider powers of 2 for the restriction constants and propose the following procedure:

1. Fix K as an upper bound for the maximum number of components, and C such that 2^C is large enough for the enforced constraints not to be very strict.

2. We first obtain initial values [k*, c*_det, c*_shw, c*_shb, rot*] by optimizing the criterion with the restriction constants only allowed to take the values 1 or 2^{C−1} (the maximization referred to as (13)). This implies applying the proposed methodology to K × 14 slightly constrained models (which also guarantees that numerical issues due to singularities are avoided), in correspondence with all the feasible models in Table 1. These K × 14 models also need to be fitted initially when other BIC proposals for parsimonious models are applied. All the other intermediate models to be evaluated are included within those models fitted with restriction constants equal to 2^{C−1}. This maximization directly provides our final choice for the number of clusters k* and for the chosen rotation rot*. Moreover, it also returns our final choices for c_det, c_shw and c_shb whenever the corresponding c*_det, c*_shw or c*_shb takes the value 1.

3. The constants c_det, c_shw and c_shb need to be refined, because only the two extreme values allowed so far have been considered. To perform these refinements, we obtain [c**_det, c**_shw, c**_shb] by optimizing the criterion, with k = k* and rot = rot*, over restriction constants taking values in grids of powers of 2 (the maximization referred to as (14)). We take advantage of the information about the parameters resulting from the previous step, and apply Lemma 1, to notably reduce the number of configurations to be tested.

4. After this process, we finally consider [k*, c**_det, c**_shw, c**_shb, rot*] as a suggestion for [k̂, parŝ].
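A sketch of this coarse-then-refine search over powers of 2 is given below. It assumes a hypothetical fitting routine `fit_constrained_mixture` returning the maximized log-likelihood and the penalty v_k^pars, and it uses the standard BIC-type form −2L + v log n for (12); both are assumptions for illustration, not the authors' code.

```python
import itertools
import numpy as np

def bic_model_search(X, fit_constrained_mixture, K=5, C=8):
    """Coarse-then-refine search over k, rot and restriction constants (powers of 2),
    mirroring the procedure of Sect. 5.  `fit_constrained_mixture(X, k, c_det, c_shw,
    c_shb, rot)` is a hypothetical routine returning (loglik, penalty)."""
    n, p = X.shape
    bic = lambda ll, pen: -2.0 * ll + pen * np.log(n)          # BIC-type criterion (12)

    def feasible(c_shw, c_shb):
        # Lemma 1: values of c_shb beyond c_shw**((p-1)/p) have no additional effect
        return (not (c_shw == 1 and c_shb > 1)) and c_shb <= max(c_shw ** ((p - 1) / p), 1)

    # Coarse step: constants restricted to {1, 2**(C-1)}; a single rot representative
    # when shapes are spherical (rotation is then irrelevant) -> 14 models per k.
    coarse = []
    for k in range(1, K + 1):
        for c_det, c_shw, c_shb in itertools.product([1, 2 ** (C - 1)], repeat=3):
            if c_shw == 1 and c_shb > 1:
                continue
            for rot in (("I",) if c_shw == 1 else ("E", "I", "V")):
                ll, pen = fit_constrained_mixture(X, k, c_det, c_shw, c_shb, rot)
                coarse.append((bic(ll, pen), k, c_det, c_shw, c_shb, rot))
    _, k_s, cd, cw, cb, rot_s = min(coarse)

    # Refinement step: constants fixed at 1 stay at 1; the others range over powers of 2
    grid = lambda c: [1] if c == 1 else [2 ** e for e in range(C)]
    refined = []
    for c_det, c_shw, c_shb in itertools.product(grid(cd), grid(cw), grid(cb)):
        if not feasible(c_shw, c_shb):
            continue
        ll, pen = fit_constrained_mixture(X, k_s, c_det, c_shw, c_shb, rot_s)
        refined.append((bic(ll, pen), k_s, c_det, c_shw, c_shb, rot_s))
    return min(refined)
```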
In this section, we show the advantage of our constrained proposals using simulated datasets. We first consider examples where the number of clusters k is assumed to be known (Sect. 6.1). We then show the effectiveness of the novel information criteria for choosing models (Sect. 6.2). As mentioned in the introduction, we do not claim that the well-established and widely applied proposals considered for comparison in this section are useless, since they have long proven their validity in many scenarios and real data applications. We show examples that highlight the usefulness of the proposed methodology in achieving extra stability.

We first compare the performance of the proposed methodology with well-known implementations of the 14 parsimonious model-based clustering methods, when assuming that the number of components and the needed parametrization (the VVV parametrization in all cases) are known. The first example is based on k = 3 normally distributed components, where the first two coordinates (X_1, X_2) of these components are generated from bivariate normals with mean vectors (0, 0)', (2, 6)' and (6, 0)', and covariance matrices diag(2, 2), diag(3, 1) and diag(1, 2), respectively. A third independent coordinate X_3 is generated from a univariate normal with mean 0 and variance 100, i.e., X_3 ∼ N(0, 100). This simulation scheme is denoted as the "lower p" case, but we also add an independent fourth coordinate X_4 ∼ N(0, 100) to generate the "higher p" case. In the simulation study, we take random samples with n_1 = 50, n_2 = 20 and n_3 = 20 from each component in the "lower n" case, and doubled sizes n_1 = 100, n_2 = 40 and n_3 = 40 in the "higher n" case.

To explore the effect of the restriction constants, always applying the rot=V case, a three-letter notation is used when summarizing the simulation results. The first letter corresponds to the restriction constant chosen for c_det, the second letter to the constant for c_shw and the third letter to the constant for c_shb. In these three letters, we use "C" when the constant defining the constraint is chosen exactly equal to the maximal ratio computed from the true model generating the data. On the other hand, we use "U" when this constant is chosen so high that the procedure is almost unrestricted, by fixing it equal to 10^10 (just to avoid very extreme numerical issues). Letter "D" is used when we double the value of the constant fixed in C, letter "E" when we multiply the constant in C by 2^2, letter "F" when we multiply it by 2^4, and letter "G" when we multiply it by 2^8. We always consider the case k = 3 and rot=V, and apply the algorithm in Sect. 3 with nstart = 1000 and iter.max = 100. We have included in the simulation study two particular cases, namely the CCU case (which corresponds to the "deter-and-shape" proposal with a large c_shb = 10^10) and the CUC case (corresponding to a large c_shw = 10^10).

The results obtained are compared with those coming from the mixture (Browne et al. 2018) and Mclust (Scrucca et al. 2016) packages in R, when using the VVV parametrization in both cases and when searching for k = 3 components. For the mixture package we consider two available initialization options: one based on the k-means method ("mix_km") and one based on 1000 random starts ("mix_rs").
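The data-generating scheme just described can be reproduced with a few lines (parameter values taken from the text; the function name is ours):

```python
import numpy as np

def simulate_k3_example(rng, higher_n=False, higher_p=False):
    """One data set from the k = 3 scheme above: bivariate normal clusters plus
    one (or two) independent N(0, 100) noise coordinates."""
    sizes = [100, 40, 40] if higher_n else [50, 20, 20]
    means = [np.array([0.0, 0.0]), np.array([2.0, 6.0]), np.array([6.0, 0.0])]
    covs = [np.diag([2.0, 2.0]), np.diag([3.0, 1.0]), np.diag([1.0, 2.0])]
    blocks, labels = [], []
    for j, (nj, m, S) in enumerate(zip(sizes, means, covs)):
        blocks.append(rng.multivariate_normal(m, S, size=nj))
        labels += [j] * nj
    X = np.vstack(blocks)
    extra = 2 if higher_p else 1                      # X3 (and X4) ~ N(0, 100), std = 10
    noise = rng.normal(0.0, 10.0, size=(X.shape[0], extra))
    return np.hstack([X, noise]), np.array(labels)

rng = np.random.default_rng(1)
X, y = simulate_k3_example(rng)
print(X.shape)   # (90, 3) in the "lower p"/"lower n" case
```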
Figure 2 shows the values of the ARI (Adjusted Rand Index; Hubert and Arabie 1985) for the obtained partitions, with respect to the true classification, on the same 100 simulated data sets for each of the four possible scenarios resulting from the two possible dimensions and the two possible sample sizes. In order to see the effect on the estimation of the parameters, Figure 3 shows the Euclidean distances when estimating the true mean vectors μ_1 = (0, 0, 0)', μ_2 = (2, 6, 0)' and μ_3 = (6, 0, 0)' (lower p) and μ_1 = (0, 0, 0, 0)', μ_2 = (2, 6, 0, 0)' and μ_3 = (6, 0, 0, 0)' (higher p). Relabelling has been applied to match the true mean vectors with the corresponding estimated location parameters. Figure 4 finally shows the maximum values achieved when maximizing the defining target likelihood function in (2).

Fig. 2: ARI values in the comparative study for the k = 3 components example.
Fig. 3: Sum of the Euclidean distances when estimating the true mean vectors in the comparative study for the k = 3 components example.
Fig. 4: Maximum values achieved when maximizing the target likelihood function for the k = 3 components example.

As expected, the "higher n" cases exhibit clearly better performances. We can also see in Figs. 2 and 3 that the constrained approaches seem to provide higher ARI values and lower estimation errors than their competitors, and that those including letters C and D exhibit the most accurate results among them. We can also see that the least constrained approaches (including letters E, F, G and U) do not provide good results (because stability seems to be lost when increasing the restriction constants), although values of these constants greater than the true ones in C are in general not excessively detrimental. On the other hand, the unconstrained case UUU gives the worst performance. We can also see a similar unsatisfactory behavior to the UUU case when applying the VVV models with the mixture and Mclust packages. These approaches are also based, theoretically, on a fully unrestricted approach, as in the UUU case. However, despite this lack of constraints on the scatter matrices, we observe that the performance of mixture and Mclust depends heavily on the initializing procedure. In this regard, we can see that the initialization based on k-means is not satisfactory due to the particular data generation scheme, which is clearly not appropriate for k-means. Even though "mix_rs" is also based on 1000 random initializations, as in our constrained proposals, we can see in Fig. 4 that it does not reach target likelihood values as high as those obtained in the UUU case (values that could perhaps be even greater if these constants were chosen above 10^10). Therefore, the type of initializations considered in Step 1 of our algorithm seems to explore the parametric space better than the initializations in "mix_rs". The same happens with the initializations provided by k-means or by the initializing procedure based on hierarchical model-based clustering applied by Mclust. The initializations can be useful to avoid spurious solutions, but it is also important to note that they are not striving to maximize the target likelihood function and they clearly influence the performance of the methodology. Figure 4 also shows how the target likelihoods steadily increase when increasing the values of the restriction constants (C, D, E, F, G and U); this could serve to understand the degree of stability provided by the constraints and help us to achieve smooth transitions between models.

We briefly present another example with a higher number of components, k = 6. The example starts from a bivariate mixture of six spherical normally distributed components with the same scatter and component sizes equal to n_1 = 23, n_2 = 36, n_3 = 93, n_4 = 38, n_5 = 123 and n_6 = 12, with the following mean vectors: (−4.5, 3.6)', (0.40, 3.6)', (−4.4, −1)', (9.2, −1)', (0.4, −1)' and (9.2, 3.6)'. We add a third dimension by using an independent variable X_3 ∼ N(0, 100), which makes these clusters become elongated. A simulation study completely analogous to the one previously described is considered.
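The partition comparisons throughout this section are based on the Adjusted Rand Index; a small self-contained sketch of its computation (the standard Hubert–Arabie formula, not tied to any particular package) is:

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie 1985) between two partitions."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes_a, classes_b = np.unique(a), np.unique(b)
    # contingency table between the two partitions
    table = np.array([[np.logical_and(a == i, b == j).sum() for j in classes_b]
                      for i in classes_a])
    sum_cells = sum(comb(int(nij), 2) for nij in table.flatten())
    sum_rows = sum(comb(int(ni), 2) for ni in table.sum(axis=1))
    sum_cols = sum(comb(int(nj), 2) for nj in table.sum(axis=0))
    total = comb(len(a), 2)
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: same partition, relabelled
```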
Figure 5 provides the Euclidean distances when estimating the corresponding six means of the components individually. We can see that the constrained approach seems to provide better results also in this k = 6 example. We also compare the performance of the novel information criteria introduced in Sect. 5 with respect to the BIC procedures resulting from the application of the mixture package (Browne et al. 2018) . With this aim in mind, we simulate 100 samples of size n = 200 in dimension p = 10 for each of the 14 classical parsimonious models. To be more precise, each sample is generated from a k = 3 components mixture where μ 1 , μ 2 and μ 3 and Σ 1 , Σ 2 and Σ 3 are randomly generated parameters in such a way that the covariance matrices satisfy the specific model constraints and also a prefixed overlap rate equal to 0.05. That overlap is achieved by applying the extension of the MixSim method of Maitra and Melnykov (2010) given in Riani et al. (2015) . Given two clusters j and l obtained from normal densities φ(·; μ j , Σ j ) and φ(·; μ l , Σ l ), with probabilities of occurrence π j and π l , the overlap between groups j and l is defined as the sum of the two misclassification probabilities w jl = w j|l + w l| j where w j|l = P[π l φ(X ; μ l , Σ l ) < π j φ(X ; μ j , Σ j )]. The average overlap is the sum of the off-diagonal elements of the matrix of the misclassification probabilities w j|l divided by k(k − 1)/2. Note that when we say that the covariance matrices satisfy the model constraints, we mean that we ensure that the Σ 1 , Σ 2 and Σ 3 matrices do exactly satisfy the constrained models with the values of c det , c shw and c swb as in Table 1 but replacing the values of "∞" in that table by values equal to 100 in the case of c det or c shw and by a value equal to 10 in the case of c swb . The mixture components weights are always π 1 = π 2 = π 3 = 1/3. Figure 6 shows the ARI indexes between the true partition and the partition obtained from the fitted mixture suggested by the BIC-type information criterion. In this figure, we use the notation "new" for the results associated with the new proposed methodology and "mix_km" for those with the mixture package when initialized with k-means and "mix_rs" when initialized by using random starts (the same number of random starts nstart as in "new" are considered). Figure 7 shows the proportion of times that the true number of components, k = 3, is determined by the BIC-type information criteria. We can see in these two figures that the BIC methodology in the mixture package is reasonably able to recover the right number of components and the true data generating mechanism. However, the comparison is clearer in Fig. 7 when looking at the proportion of times that the true number of components is detected. We see that better results are obtained when applying the new constrained approaches. Of course, those differences are not so noticeable for the most constrained (VVI, EVI, …, EII) models, where no great advantages can be achieved by restricting even more. The improvement is more clearly seen in Fig. 7 than in Fig. 6 , perhaps because spurious components, made up of few observations, do not significantly modify the ARI, even though they change the number of components detected. This wrong determination of the number of components may of course be problematic, when interpreting results. Note also that constrained approaches seem to avoid partitions exhibiting very low ARI values (outliers outside whiskers in Fig. 6 ). 
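The pairwise overlap used to calibrate these simulations can be approximated by straightforward Monte Carlo; the sketch below is ours (the actual simulations rely on the MixSim extension available in the FSDA library, Riani et al. 2015), and the function name is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def pairwise_overlap(pi_j, mu_j, S_j, pi_l, mu_l, S_l, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the overlap w_jl = w_{j|l} + w_{l|j}, where
    w_{j|l} = P[ pi_l phi(X; mu_l, S_l) < pi_j phi(X; mu_j, S_j) ] for X ~ N(mu_l, S_l)."""
    rng = np.random.default_rng(seed)
    def misclass(pi_from, mu_from, S_from, pi_to, mu_to, S_to):
        X = rng.multivariate_normal(mu_from, S_from, size=n_mc)
        d_from = pi_from * multivariate_normal.pdf(X, mean=mu_from, cov=S_from)
        d_to = pi_to * multivariate_normal.pdf(X, mean=mu_to, cov=S_to)
        return np.mean(d_from < d_to)
    return (misclass(pi_l, mu_l, S_l, pi_j, mu_j, S_j) +
            misclass(pi_j, mu_j, S_j, pi_l, mu_l, S_l))

# two spherical components two units apart: overlap is roughly 2 * P(Z > 1) ~ 0.32
print(pairwise_overlap(0.5, np.zeros(2), np.eye(2), 0.5, np.array([2.0, 0.0]), np.eye(2)))
```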
The example is inspired by the analysis of a real data set on SARS-CoV-2 symptoms kindly provided to us by the ASL3 Genovese Hospital. Measurements on six variables are available for 79 COVID patients and 77 non-COVID ones:

x_1 = heart rate (the number of heart beats per minute);
x_2 = Oxygen Uptake Efficiency Slope (an index of functional reserve derived from the logarithmic relation between oxygen uptake and minute ventilation during incremental exercise);
x_3 = watts reached by the patient during the stress test on a cycle ergometer (stationary bike) at the aerobic threshold, that is, when the patient "begins to struggle";
x_4 = watts peak (watts reached at maximum effort during the exercise test on the exercise bike);
x_5 = value of the maximum repetition (maximum force of muscle contraction of the quadriceps femoris of the dominant limb, expressed in kg);
x_6 = the previous variable corrected for the patient's weight.

Figure 8a shows the (supposedly) true classification, as provided by the doctors. The data have been collected by the "Post-COVID Outpatient Rehabilitation Center ASL3 Liguria Region Health System" and approved by the Ethics Committee of the Liguria region (Italy). We apply the (unsupervised) constrained model-based clustering approach to see if something close to the doctors' classification is achieved, using the modified BIC approach described in Sect. 5 to determine the underlying number of clusters and the set of constraints to be imposed.

The proposed BIC approach described in Sect. 5 is applied with K = 5 and C = 8 (i.e., 2^{C−1} = 128). In the maximization of (13), after fitting 5 × 14 models, we obtain k* = 2, c*_det = 1, c*_shw = 128, c*_shb = 1 and rot* = E. Afterwards, we perform the maximization in (14), where the grid of constants is {1} × {2^0, 2^1, . . . , 2^7} × {1}. This means that we only need to obtain c**_shw by maximizing over c_shw ∈ {2^0, 2^1, . . . , 2^7} (as we directly have c**_det = 1 and c**_shb = 1). Our BIC proposal suggests [k̂, parŝ] = [2, 1, 128, 1, E]. This is a quite constrained solution where only the "within" shape elements of the components are left notably unrestricted (the value of c_shw is such that only a slightly constrained ratio is considered). The associated partition is shown in Fig. 8b and exhibits an ARI, with respect to the "true" doctors' classification, equal to 0.5891.

On the other hand, the BIC approach implemented through the function gpcm() in the mixture package in R (Browne et al. 2018) suggests k = 2 and the EEE parameterization, which results in an ARI equal to 0.0117 with respect to the doctors' suggested partition. This result is obtained when considering a k-means type initialization, but the results do not seem to improve when considering random initializations. The results for this particular data set do not improve either when we apply the BIC criterion provided by the Mclust package (Scrucca et al. 2016), which suggests only k = 1 component. The VEE parameterization is suggested when considering Mclust's BIC criterion restricted to models with k = 2, which yields an ARI = 0.0414, notably smaller than the 0.5891 achieved when applying the proposed methodology with the new BIC proposal.

Fig. 8: (a) Doctor-based "true" classification for the COVID data set, with COVID patients denoted by C symbols and non-COVID patients by N symbols, observations being represented in the first two principal components. (b) Clustering results of the constrained parsimonious model-based clustering proposal with parameters chosen from the new BIC procedure.
We have introduced a new methodology for constrained parsimonious model-based clustering that depends on three restriction constants c_det, c_shw and c_shb and on fixing a particular type rot of constraint on the rotations. The methodology provides a smooth transition among the well-known 14 parsimonious models that are commonly applied in model-based clustering when assuming normality for the components. The proposed constraints result in mathematically well-defined problems and provide extra control on the covariance matrices of the fitted components. Novel information criteria have been introduced to help the user in making sensible choices for all the tuning decisions.

There are many open research lines related to this new approach. For instance, further work on computational aspects could be needed to speed up the procedures. Although MATLAB code for its practical application is already available, we are developing a more dedicated and easy-to-apply implementation within the FSDA MATLAB toolbox (Riani et al. 2012). This implementation will hopefully include more elaborate graphical and numerical tools to help determine and explore the solutions obtained when moving all the involved parameters, in the spirit of Cerioli et al. (2018). With that aim, stability and ARI distances among partitions could be taken into account in order to derive a reduced (and ranked) list of sensible partitions, together with graphical summaries such as the "car-bike plots". The methodology can also be adapted to include "trimming", in order to introduce new robust model-based clustering approaches.

Funding: Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A (Proof of Lemma 1): Let us consider ξ_{j,l} = log γ_{jl} such that ξ_{j,1} ≥ · · · ≥ ξ_{j,l} ≥ · · · ≥ ξ_{j,p}, as in (7), under the constraint

$$\sum_{l=1}^{p}\xi_{j,l}=0 \qquad (15)$$

(given that ∏_{l=1}^p γ_{jl} = 1). Constraints (5), after log-transformation, reduce to ξ_{j,l} − ξ_{j,l'} ≤ log c_shw for any l, l' ∈ {1, . . . , p}, for a fixed j. It is not difficult to see that the maximal possible difference is achieved with repeated ξ_{j,l} values and setting that maximal difference equal to log c_shw. In that case, the most extreme difference between components j and j' occurs when ξ_{j,1} = · · · = ξ_{j,l} = ξ_{j,p} + log c_shw and ξ_{j,l+1} = · · · = ξ_{j,p}, while ξ_{j',1} = · · · = ξ_{j',l−1} = ξ_{j',p} + log c_shw and ξ_{j',l} = · · · = ξ_{j',p}. By taking into account the zero-sum condition (15), we have 0 = p ξ_{j,p} + l log c_shw and 0 = p ξ_{j',p} + (l − 1) log c_shw for these two configurations. Therefore,

$$p(\xi_{j,l}-\xi_{j',l}) = \big(p\,\xi_{j,l}-(p\,\xi_{j,p}+l\log c_{\mathrm{shw}})\big)-\big(p\,\xi_{j',l}-(p\,\xi_{j',p}+(l-1)\log c_{\mathrm{shw}})\big) = (p-1)\log c_{\mathrm{shw}},$$

where we have used that ξ_{j,l} − ξ_{j,p} = log c_shw and that ξ_{j',l} − ξ_{j',p} = 0. Consequently, ξ_{j,l} − ξ_{j',l} = ((p − 1)/p) log c_shw for that most extreme possible difference, and (8) follows after exponentiation.

Appendix B (The "optimal truncation" operator): For the sake of completeness, we review the "optimal truncation" procedure (Fritz et al. 2013) that is extensively used in the algorithm in Sect. 3. Given a value d ≥ 0, a threshold m > 0 and a fixed restriction constant c ≥ 1, the m-truncated value of d is defined as

$$t_m(d)=\begin{cases} m & \text{if } d<m,\\ d & \text{if } m\le d\le c\,m,\\ c\,m & \text{if } d>c\,m.\end{cases}$$

Obtaining the optimal threshold value only requires the optimization of a real-valued function, and m_opt can be efficiently obtained by performing only 2 · J · L + 1 evaluations (Fritz et al. 2013) of the objective function (16), through a procedure which can be fully vectorized.
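A brute-force sketch of this truncation step is given below. The objective function used here is a plausible stand-in for (16), based on the usual normal-likelihood form found in the constrained-clustering literature, and is an assumption rather than the authors' exact formulation.

```python
import numpy as np

def truncate(d, m, c):
    """m-truncated value: d is pushed into the interval [m, c*m]."""
    return np.clip(d, m, c * m)

def optimal_truncation(d, n_weights, c):
    """Brute-force version of the optimal truncation operator: choose the threshold m
    among the 2*J*L candidate values {d_jl, d_jl/c} (plus the unconstrained case) that
    minimizes an assumed normal-likelihood-type objective,
    sum_j n_j * sum_l [log t_m(d_jl) + d_jl / t_m(d_jl)]."""
    d = np.asarray(d, dtype=float)                  # J x L matrix of nonnegative values
    n_weights = np.asarray(n_weights, dtype=float)  # weights n_j of the J groups
    if d.max() / max(d.min(), 1e-300) <= c:         # constraint already satisfied
        return d
    candidates = np.concatenate([d.ravel(), d.ravel() / c])
    best_m, best_obj = None, np.inf
    for m in candidates:
        t = truncate(d, m, c)
        obj = np.sum(n_weights[:, None] * (np.log(t) + d / t))
        if obj < best_obj:
            best_m, best_obj = m, obj
    return truncate(d, best_m, c)

d = np.array([[10.0, 0.1], [1.0, 0.5]])
print(optimal_truncation(d, n_weights=np.array([30, 70]), c=4.0))
```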
References

Model-based Gaussian and non-Gaussian clustering.
Assessing a mixture model for clustering with the integrated completed likelihood.
Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models.
Estimating common principal components in high dimensions.
mixture: mixture models for clustering and classification.
Gaussian parsimonious clustering models.
Finding the number of normal groups in model-based clustering via constrained likelihoods.
Estimating the components of a mixture of normal distributions.
A fast algorithm for robust constrained clustering.
Probabilistic clustering via Pareto solutions and significance tests.
Avoiding spurious local maximizers in mixture modeling.
Eigenvalues and constraints in mixture modeling: geometric and computational issues.
Model-based clustering with determinant-and-shape constraint.
A constrained formulation of maximum likelihood estimation for normal mixture distributions.
Comparing partitions.
Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters.
Simulating data to study performance of finite mixture modeling and clustering algorithms.
Finite Mixture Models. Wiley Series in Probability and Statistics.
Maximum likelihood estimation via the ECM algorithm: a general framework.
FSDA: a MATLAB toolbox for robust analysis and interactive data exploration.
Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library.
mclust 5: clustering, classification and density estimation using Gaussian finite mixture models.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.