key: cord-0480294-6pw563gf
authors: Pirkovi'c, Bogdan A.; Laketa, Petra N.; Nasti'c, Aleksandar S.
title: On Generalized Random Environment INAR Models of Higher Order: Estimation of Random Environment States
date: 2021-09-01
journal: nan
DOI: nan
sha: ce2cd5e22b0a0a77060cabd6ed841e29ba854f8e
doc_id: 480294
cord_uid: 6pw563gf

The behavior of a generalized random environment integer-valued autoregressive model of higher order with geometric marginal distribution {and negative binomial thinning operator} (abbrev. $RrNGINAR(mathcal{M,A,P})$) is dictated by a realization ${z_n}_{n=1}^infty$ of an auxiliary Markov chain called random environment process. Element $z_n$ represents a state of the environment in moment $ninmathbb{N}$ and determines three different parameters of the model in that moment. In order to use $RrNGINAR(mathcal{M,A,P})$ model, one first needs to estimate ${z_n}_{n=1}^infty$, which was so far done by K-means data clustering. We argue that this approach ignores some information and performs poorly in certain situations. We propose a new method for estimating ${z_n}_{n=1}^infty$, which includes the data transformation preceding the clustering, in order to reduce the information loss. To confirm its efficiency, we compare this new approach with the usual one when applied on the simulated and the real-life data, and notice all the benefits obtained from our method.

Integer-valued autoregressive (IN AR) models appeared for the first time in McKenzie (1985) and Al-Osh and Alzaid (1987) . Over the time, they showed to be a very useful tool for describing the integer-valued data. One may, for example, apply IN AR models to describe the monthly number of rainy days, crime cases, newborn individuals of one species. It becomes clear that such data may be found in any area. For that reason, numerous IN AR models have been proposed and studied in the literature. IN AR models are based on so called thinning operator, which to a given integer-valued random variable X assigns the sum of X independent identically distributed random variables. The distribution of the auxiliary random variables determines the type of the thinning operator. Some of the models with different thinning operators may be found in Aly and Bouzar (1994) , Latour (1998) , Zheng et al. (2006 Zheng et al. ( , 2007 and Ristić et al. (2009) . Another variety of IN AR models arise from considering different marginal distributions, see for example McKenzie (1986) , Al-Osh and Aly (1992) , Alzaid and Al-Osh (1993) and Bakouch and Ristić (2010) .

We focus on the recent random environment IN AR models that appeared for the first time in Nastić et al. (2016) and are flexible towards the environment conditions changes. The behavior of these models is ruled by a Markov chain {Z n } ∞ n=1 , called random environment process. The elements of the random environment process are also called (random) environment states. To apply a random environment IN AR model, one must first estimate the environment states. In Nastić et al. (2016) this was done using clustering methods, in particular K-means introduced by Hartigan and Wong (1979) . However, using K-means for this purpose induces a certain loss of information, as we discuss below, and may lead to poor performance of the model. In order to estimate environment states as accurate as possible, we propose a new random environment estimation (abbrev. RENES ) method. To avoid the confusion, it is important to emphasize that we do not introduce a new method for estimating the parameters of random environment IN AR models, but only a new method for estimating {z n } ∞ n=1 . However, the estimators of random environment IN AR models parameters are defined under the assumption that {z n } ∞ n=1 is known in advance, meaning that a different approach for estimating {z n } ∞ n=1 will for sure imply the difference in the parameter estimates.

We need to provide more details on the random environment IN AR models. We begin with the first order random environment IN AR model with geometric marginals (RrN GIN AR(1)) introduced in Nastić et al. (2016) . The marginal distribution of the RrN GIN AR(1) time series in moment n is determined by the realization of the random environment process z n recorded in the same moment -for this reason we write X n (z n ). Moreover, the distribution of X n (z n ) is geometric with expectation µ zn ∈ {µ 1 , µ 2 , . . . µ r }. The recursive relation that defines RrN GIN AR(1) model is given by X n (z n ) = α * X n−1 (z n−1 ) + ε n (z n , z n−1 ),

where, α * : X → X i=1 U i denotes the negative binomial thinning operator, that to each integervalued random variable X assigns the sum of X independent random variables having geometric distribution with mean value α. In order to measure the goodness of fit of such defined model, corresponding environment states z n for all observations must be estimated. This is where the K-means clustering method took place. The predefined number of clusters was chosen to be the number of environment states r registered in the observed phenomenon. Each cluster was assigned to one state and each sample element was assigned to a state depending on the cluster it felt into. In particular, if two different process elements X n (z n ) and X m (z m ) belong to the same cluster, then it was assumed that z n = z m . In that way, the sequence of random environment states was fully determined. However, this approach shows a serious shortcomings, which are consequence of the fact that only data point value was taken into account in K-means clustering method. Once the K-means is performed, graphed representation of the database will be divided by horizontal lines into strips, as shown in the right-hand panel of Figure 1 . Each strip corresponds to one cluster. This entails that all high values in the database must be located in the same cluster. Similar to this, all low values must be located in the same cluster. This is, however, not the case with the data simulated from RrN GIN AR(1) process. The left-hand panel of Figure 1 shows the data simulated from R2N GIN AR(1) model. As we can see, it is possible that high data values appear also in the environment conditions different than those assumed for the high data values, so the data is no longer divided by a horizontal strip. K-means totally rules out this possibility.

As an example, consider the number of the new COVID-19 cases per day. As it is known, the weather conditions significantly affected the spread rate, so we use RrN GIN AR(1) model with r = 2 different environment states: summer and winter. However, there are some other circumstances undetected or not measurable, that can affect the number of new cases per day, for example public demonstrations, unallowed gatherings of people during vacations or emergence of the new virus strain. In these situations, it would be ideal for clustering method to recognize specific circumstances and keep high values (detected in summertime) in 'summer' cluster. However, standard K-means is incapable to do so. Observing only numerical value of the process realization, K-means might recognize high summertime values as winter occasions, and locate those realizations in the wrong cluster. The same holds for all K-means adaptations familiar so far. Obviously, an improved random environment estimation method is needed in order to solve such problem.

In years that followed, few more sophisticated IN AR models appeared. Nastić et al. (2017) defined random environment IN AR models of higher order. Beside the marginal distribution parameter, authors assumed here that the order of the model is also determined by the environment state in particular moment. Another step ahead was made by Laketa et al. (2018) . Beside all the assumptions mentioned above, authors additionally assumed that the thinning parameter value α zn in moment n depends on the environment state z n in the same moment. According to Laketa et al. (2018) , {X n (z n )} ∞ n=1 is called a generalized random environment IN AR model of higher order with geometric marginals and negative binomial thinning operator (RrN GIN AR(M, A, P)) if its element X n (z n ) at moment n ∈ N is determined by the recursive relation

w.p. φ zn 1,Pn , α zn * X n−2 (z n−2 ) + ε n (z n , z n−2 ) w.p. φ zn 2,Pn , . . . α zn * X n−Pn (z n−Pn ) + ε n (z n , z n−Pn ) w.p. φ zn Pn,Pn ,

Sets M = {µ 1 , . . . , µ r }, A = {α 1 , . . . , α r }, P = {p 1 , . . . , p r } contain model parameter values -µ zn is the mean of the marginal geometric distribution of X n (z n ), α zn is the thinning parameter value and p zn represents the maximal value that order P n may take for a fixed state z n ∈ {1, . . . , r}.

There are two different RrN GIN AR(M, A, P) models, depending on the way sequence {P n } ∞ n=1 is defined. One of them, RrN GIN AR max (M, A, P) is constructed so that for each n ∈ N it holds P n = max{p * n , p zn }, while for the other one, RrN GIN AR 1 (M, A, P) we have P n = 1 if p * n < p zn and P n = p zn otherwise. Here p * n = max{i ≥ 1 : z n−i = · · · = z n−1 } represents the number of predecessors of z n that are mutually equal.

Beside the shortcomings mentioned above, we detected another difficulty when applying Kmeans to a generalized random environment IN AR process of higher order, as it was done in Laketa et al. (2018) . To explain the difficulty, consider the simplest case with r = 2 environment states and suppose similarity between mean values within states, µ 1 ≈ µ 2 . In other words, the observations inside states are not that much different and are accumulated around parallel horizontal lines that are close to each other. In this situation, it is reasonable to expect the existence of a strip in which points from both environment states will be mixed. The border between states won't be a straight line, but a wavy and jagged line. Taking into account the fact that K-means method separates clusters by straight horizontal lines, it becomes obvious that some improvements are necessary.

In this paper we introduce a new RENES method for estimation of {z n } ∞ n=1 , that will eliminate disadvantages mentioned above. The idea of RENES method is to transform the data sample that corresponds to the generalized random environment IN AR model of higher order before applying clustering. As previously mentioned, all the parameter values µ zn , α zn and P n carry the information about z n . To prevent the information loss, the main goal is to form a three-dimensional sequence, based on real-life data realizations, that mimics the behavior of {(µ zn , α zn , P n )} ∞ n=1 . Finally, the K-means algorithm will be applied on such obtained three-dimensional data sequence. It is possible to apply RENES method to any other generalized random environment IN AR model of higher order (to those that have different marginal distribution and thinning operator), but we focus on RrN GIN AR(M, A, P) models, as it was the case in Laketa et al. (2018) . In order to confirm the efficiency of RENES, corresponding simulated data sequences are created. Observing the simulations, we can examine whether changes in the number of states and parameter values affect the efficiency of RENES method.

The structure of this paper is as follows. In Section 2, a construction of the new random environment estimation (RENES ) method, which overcomes problems mentioned above, is presented. Section 3 provides description of simulations and their properties. Cases with 2 and 3 different environment states are described. An extensive simulation study for newly proposed RENES method is given in Section 4. Results of applying the RENES method to the real-life data are given in Section 5. based only on the realized sample, without knowing the random environment sequence {z n } N n=1 . Such obtained three-dimensional sequence {(μ n ,α n ,P n )} N n=1 is supposed to mimic the behavior of the model parameters over time. Then, clustering the three-dimensional data {(μ n ,α n ,P n )} N n=1 would produce better estimation of {z n } N n=1 than clustering of the starting sequence {x n } N n=1 , since the information loss is prevented.

As mentioned before, our goal is not to define new estimators of model parameters, but to improve the estimation of {z n } ∞ n=1 . Given sequence of so-called 'pre-estimators' {(μ n ,α n ,P n )} N n=1 is just a a helpful tool to estimate {z n } ∞ n=1 , and it doesn't represent any kind of alternative estimates of model parameters. Model parameters have already been successfully estimated in Laketa et al. (2018) , and we rely on those results in evaluating our RENES method. For an illustration, see Figure 2 .

Although it is possible to implement clustering of three-dimensional data {(μ n ,α n ,P n )} N n=1 , method can be improved even more, by considering trimmed (truncated) means. To that purpose, for a given sequence a 1 , a 2 , . . . , a N and vector c = (c 0 , c 1 , . . . , c k ) let us define a function

Elements of c are decreasing c 0 ≥ c 1 ≥ · · · ≥ c k , so that T (a i , c) represents a trimmed mean affected the most by the current value a i . The effect of the k neighboring elements of a i decreases when moving away from a i . If z n−k = · · · = z n+k for some n ∈ {1, . . . , N }, it will make sense to use T (μ n , c) instead ofμ n when estimating the environment state z n , because in this case all the elements X n−k , . . . , X n+k carry information about z n . As mentioned in Laketa et al. (2018) , RrN GIN AR(M, A, P) model shows bad performances in case when environment states are changing rapidly. Its application makes sense only if the probability of remaining in the same state is big enough. Hence, for k small enough, one may assume that 2k+1 neighboring elements of {z n } N n=1 are equal with high probability in all the situations of practical importance. Thus, it sounds reasonable to replace {(μ n ,α n ,P n )} N n=1 in clustering procedure with another three-dimensional data sequence {(T (μ n , c m ), T (α n , c a ), T (P n , c p ))} N n=1 , for some vectors c m , c a and c p . Theoretically speaking, 5 lengths of these vectors do not have to be equal. The upper limit of the vector's length k might be discussed as well. Higher values of k give better pre-estimates, provided all of observations X n−k , . . . , X n+k correspond to the same state. Otherwise, pre-estimates might be even worsened.

To reconcile these two opposing facts, we are not going to discuss vectors c m , c a , c p of length higher than 4.

If we want all the coordinates of (T (μ n , c m ), T (α n , c a ), T (P n , c p )) to have equal impact on the clustering, it is necessary to scale them. Thus, we define a function that to a given element a n of a sequence {a n } N n=1 assigns properly scaled (normed) value of T (a n , c) given with

.

By introducing three more parameters C m , C a , C p ∈ R, it becomes possible to control the level of impact each coordinate has on the clustering procedure. Finally, by using standard K-means we cluster the three-dimensional data vector (C m S(μ n , c m ), C a S(α n , c a ), C p S(P n , c p )).

(4)

The only left is to define starting pre-estimators (μ n ,α n ,P n ), n = 1, 2, . . . , N , in a reasonable way, by looking at the construction of RrN GIN AR(M, A, P) models. Bearing in mind the fact that parameters µ i , i = 1, 2, . . . , r, represent means within clusters, it would be reasonable to set for any n = 1, 2, . . . , N thatμ

Taking into account the fact that the partial auto-correlation function is used to determine the order of the time series, estimation of the sequence {P n } N n=1 will take place as follows. If p zn is the maximal order allowed for particular element in the state z n , than we havẽ

where pacf K is the partial auto-correlation function at lag K and d p ∈ N. The function (6) works well if all 2d p + 1 elements of the sequence {X n } N n=1 involved inP n correspond to the same state z n . However, this requirement is not demanding, since the application of RrN GIN AR(M, A, P) model is reasonable only in the case when the probability of remaining in the same state is higher than the probability of changing the state.

To predict the thinning parameter value in moment n, we use the known property of the negative binomial thinning operator, that E (α * X|X) = αX. This motivates us, by looking at (2), to define

Here (x) + = max{x, 0} represents the positive part of x ∈ R. Since such obtained thinning parameter value might be greater than 1, we finally havẽ

To apply new RENES method, one should choose the values of parameters d p , c m , c a , c p , C m , C a and C p (called in sequel RENES method parameters). We will try to make an optimal choice based on RrN GIN AR max (M, A, P) and RrN GIN AR 1 (M, A, P) simulations. It is important here to distinguish RENES method parameters from the model parameters. In Section 3 we give details about our choice of model parameters, while in Section 4 we give results of the simulation study with such choice of model parameters and discuss how to choose RENES method parameters.

We simulated RrN GIN AR(M, A, P) time series of length N = 500. In the sequel we consider all sequences of the same length, so instead of {·} 500 n=1 we write shortly {·}. Properties of simulations are such that they make difficult to apply standard K-means. The case with r = 2 different environment states is presented within the section. On the other hand, the case with r = 3 environment states can be found in Appendix A. For each of the cases, two different combinations of model parameters are observed. Further, each combination of parameters will generate two different replications of the corresponding RrN GIN AR(M, A, P) time series. One of them will be used to obtain the values of d p , c m , c a , c p , C m , C a , C p . With the help of such obtained RENES method parameters, the other replication will be reconstructed in order to evaluate efficiency of the new RENES method. Furthermore, both versions of the model, RrN GIN AR max (M, A, P) and RrN GIN AR 1 (M, A, P) will be analyzed simultaneously. For more information about these models, see Laketa et al. (2018) .

The random environment process, being a Markov chain, has parameters p vec -a vector containing initial probabilities of being in certain state and p mat -transition probability matrix that in the intersection of i-th row and j-th column contains the probability P (Z n = i|Z n−1 = j) for any i, j ∈ {1, . . . r}. Another remark about the notation is that we write M, A and P as vectors, even though they are introduced as sets. We do so to eliminate the ambiguity, preserving the order of the states.

In order to create R2N GIN AR(M, A, P) simulations, the following combinations of parameters are given.

1. First of all, we are going to create time series with similar means within states, while other model parameters will differ significantly. Surrounding like this would make K-means useless. Hence, we choose M = (1, 1.5). On the contrary to that, thinning parameters, as well as maximal orders within states, should differ significantly. Hence, we choose A = (0.05, 0.6) and P = (2, 4). Regarding the choice of α j , j = 1, 2, one of them is chosen to be very small, while the other is chosen to be close to its upper limit. Furthermore, probabilities φ k i,j corresponding to the R2N GIN AR max (M, A, P) simulation are chosen to be

Probabilities corresponding to the R2N GIN AR 1 (M, A, P) simulation are located in last rows of these matrices. An initial state is nearly fair, due to the value of its distribution p vec = (0.6, 0.4). In order to have long arrays of elements corresponding to the same state within the simulated R2N GIN AR(M, A, P) time series, transition probabilities outside the main diagonal are significantly smaller than those located on the main diagonal. Thus, transition probability matrix is of the form

2. The other combination of parameters is characterized by a great similarity between thinning parameters. Beside that, the mean values will be similar enough to to make it difficult to use the standard K-means method. Orders of the model will be the only values on the basis of which it is possible to determine the environment states of realizations. That will be a good test for our new approach. To that purpose, we have that M = (3, 5), A = (0.4, 0.5) and P = (2, 5). Further, in the case of R2N GIN AR max (M, A, P),

Last rows of these matrices contain probabilities corresponding to the R2N GIN AR 1 (M, A, P) simulation. An initial state is fair, since p vec = (0.5, 0.5). Finally, the transition probability matrix provides long arrays of elements corresponding to the same state, that is,

In this section we seek for optimal RENES method parameters, based on RrN GIN AR(M, A, P) simulations with r = 2 environment states. The choice of corresponding model parameters is given in the previous section. In order to improve the readability of the manuscript, only one parameters combination will be discussed in detail, with fully exposed procedure of obtaining d p , c m , c a , c p , C m , C a and C p . As for the second combination, the procedure will be omitted and only final results will be provided. Corresponding discussion regarding optimal RENES method parameters in the case of RrN GIN AR(M, A, P) simulations with r = 3 environment states is provided in Appendix B.

Using the first combination of the model parameters, corresponding R2N GIN AR max (2, 4) and R2N GIN AR 1 (2, 4) simulations were created, two replications of each. The first replication of each pair was used to provide the parameters of RENES method. The procedure starts with the determination of {μ n } using (5). In order to improve {μ n }, vector c m has to be provided. For k small enough, we have already assumed that all X n−k , . . . , X n+k correspond to the same state. Thus, allμ n−k , . . . ,μ n+k can have similar contribution to T (μ n , c m ). In other words, we can choose coordinates of the vector c m to be as equal as possible. Due to the fact that it multiplies the middle realization x n , the value of c 0 may eventually be a bit higher. Figure 3 shows sequences of pre-estimates {T (μ n , c m )}, obtained for various selections of c m . There we show only the first 200 elements, to increase readability of the plot.

As Figure 3 shows, the usage of c m results in much more accurate pre-estimates of the sequence {µ n } in both cases. Using this technique, we managed to trim peaks that deviate significantly form the real mean values. Obviously, the best result is obtained for c m = (0.16, 0.14, 0.14, 0.14) in the case of R2N GIN AR max (2, 4) simulation. Speaking of R1N GIN AR 1 (2, 4) simulation, similar results are obtained in the case of c m = (0.2, 0.2, 0.2) and c m = (0.16, 0.14, 0.14, 0.14). The second option is selected.

We determine {P n } in two steps. The first step provides a determination of the parameter d p , given in (6). In order to obtain the optimal value of d p , let us denote by ∆ p the root mean square of differences between correct orders P n , n = 1, 2, . . . , 500, and corresponding estimated order values P n , 1, 2, . . . , 500, obtained by (6). The error ∆ p is calculated for various choices of d p and the results are presented in Table 1 . The smallest value of ∆ p will reveal the optimal value of parameter d p . Having a brief look at Table 1 , we conclude that, in the case of R2N GIN AR max (2, 4) simulation, the smallest value of ∆ p is obtained for d p = 8 (∆ p = 1.439). Similarly, the smallest ∆ p value in the case of R2N GIN AR 1 (2, 4) simulation is obtained for d p = 15 (∆ p = 1.372). The second step involves determination of the corresponding vector c p for fixed optimal value of d p . Similarly as for c m , we assume that X n−k , . . . , X n+k all correspond to the same state. Thus, P n−k , . . . ,P n+k are all assumed to have similar contribution to T (P n , c p ). Hence, coordinates of c p are chosen to be as similar as possible. Sequences {T (P n , c p )} obtained for various selections of c p are shown in Figure 4 and compared to the exact order sequence {P n }.

In order to interpret Figure 4 , one fact needs to be clarified. Namely, the goal is to choose order pre-estimate which provides the highest probability of placing corresponding observations in correct clusters. In other words, the best pre-estimate of {P n } is not necessarily the one that most often matches the exact order value, but the one that is close enough in most of the cases. In this respect, the best result in both cases (R2N GIN AR max (2, 4) and R2N GIN AR 1 (2, 4)) is obtained for k = 4, ie. for c p = (0.16, 0.14, 0.14, 0.14). Although this pre-estimate struggle to reach maximal orders, in most of the cases it stays close enough to the correct order values and do not make large mistakes.

Finally, having calculated {T (μ n , c m )} and {P n }, we are able to calculateα n , n = 1, 2, . . . , N, using (7). Following the same reasons as before, coordinates of c a are assumed to be as similar as possible. Regarding the length of c a , several options are tested. Sequences {T (α n , c a )} obtained for various selections of c a are given in Figure 5 and compared to the real sequence {α n }.

According to the figure, vectors c a = (0.2, 0.2, 0.2) and c a = (0.16, 0.14, 0.14, 0.14) provide more accurate pre-estimates then c a = 1 or c a = (0.4, 0.3). Namely, sequences obtained for k = 3 and k = 4 do not show sudden and sharp ups and downs so often. The most of their values stay in a strip between α 1 and α 2 , which is an expected behavior for a fine sequence of pre-estimates. It is hard to choose the better one, but it seems that the plot line obtained for c a = (0.16, 0.14, 0.14, 0.14) stays a bit closer to the real parameter values. This conclusion holds for both, R2N GIN AR max (2, 4) and R2N GIN AR 1 (2, 4) models, so the same c a = (0.16, 0.14, 0.14, 0.14) is chosen in both cases.

To summarize, optimal values of d p , c m , c a and c p , involved in RENES method are provided in Table 2 . Note that in both cases all three vectors are chosen to be of the same length k, which is not surprising. Recall that k depends on the probabilities of staying in the same state. If we chose smaller diagonal values of p mat , the optimal length k would certainly be smaller. (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) R2N GIN AR1(2, 4) dp cm ca cp 15 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14)

. Now, one can provide 3-dimensional sequences {T (μ n , c m )}, {T (α n , c a )}, {T (P n , c p )}, and after that {S(μ n , c m )}, {S(α n , c a )}, {S(P n , c p )}. It is left to determine parameters C m , C a and C p given in (4). To that purpose, a modification of the procedure used to determine d p is applied. More precise, for each C m = i, C a = j, C p = l, i, j, l = 1, 2, . . . , 10, the clustering of the three dimensional a) R2N GIN ARmax(2, 4) model with dp = 8 b) R2N GIN AR1(2, 4) model with dp = 15 data sequence {(C m S(μ n , c m ), C a S(α n , c a ), C p S(P n , c p ))} is performed. In that way, a thousand different estimates of the environment state sequence {z n } are provided. To select the best one, estimates thus obtained are compared with the sequence of exact states. The highest number of exactly estimated states will reveal the best combination of parameters C m , C a , C p . In the case of R2N GIN AR max (2, 4) simulation, the best result in random environment estimation is obtained for C m = 6, C a = 2, C p = 9, having 328 estimated states equal to corresponding exact states. On the other hand, result obtained by standard K-means managed to have 301 exactly estimated states. A comparative overview of exact states, states obtained by standard K-means and states obtained by usage of RENES method is provided by Figure 6 . And yet again, only first 200 states are given in each figure.

Beside the higher number of the exactly estimated states, two more improvements are achieved by usage of RENES method. According to the figures, the RENES method produces much longer data series that correspond to the same state. Bearing in mind the fact that random environment models show poor performances when environment states are changing rapidly, mentioned improvement seems very convenient. Further, RENES method doesn't make a crisp data division using a horizontal line, as K-means does. The possibility of obtaining a high data value in the environment conditions different from those assumed for the high data values is not ruled out this time. In other words, RENES method allows data elements with high values to belong to the cluster with predominantly low values, and vice versa. This property makes RENES method more suitable for clustering the data where, beside one detected predominant environment condition, some hidden circumstances also have an impact on the time series realizations.

Similar holds in the case of simulated R2N GIN AR 1 (2, 4) time series. For each C m = i, C a = j, C p = l, i, j, l = 1, 2, . . . , 10, the clustering of three dimensional data sequence {(C m S(μ n , c m ), C a S(α n , c a ), C p S(P n , c p ))} is performed. The best result is obtained for C m = 8, C a = 2, C p = 3, having 326 estimated states equal to the corresponding exact states. On contrary to that, the standard K-means method managed to have 309 exactly estimated states. And yet again, comparative overview of the exact states, states obtained by standard K-means method and states obtained by new RENES method is provided by Figure 7 . The plots undoubtedly show dominance of RENES method comparing to the standard K-means. Achievements mentioned in the case of R2N GIN AR max (2, 4) simulation, also hold here.

Unused replications are suitable here to check the efficiency of our RENES method. First of all, these replications were observed as real-life data sequences. Further, environment state estimation via K-means method and via RENES method took place. The same holds for both, R2N GIN AR max (2, 4) and R2N GIN AR 1 (2, 4) simulations. Having results of both random environment estimation methods given above, unknown model parameters were estimated for each clustering result by usage of conditional maximum likelihood (CM L) procedure.

A data sequences reconstruction by corresponding R2N GIN AR max (2, 4) or R2N GIN AR 1 (2, 4) model may happen now for each clustering result. RM S of differences between simulated data and 14 Exact states of R2N GIN ARmax(2, 4) simulation States obtained by standard K-means clustering method States obtained by RENES method for dp = 8, cm = (0.16, 0.14, 0.14, 0.14), ca = (0.16, 0.14, 0.14, 0.14), cp = (0.16, 0.14, 0.14, 0.14), Cm = 6, Ca = 2, Cp = 9. States obtained by RENES method for dp = 15, cm = (0.16, 0.14, 0.14, 0.14), ca = (0.16, 0.14, 0.14, 0.14), cp = (0.16, 0.14, 0.14, 0.14), Cm = 8, Ca = 2, Cp = 3. Figure 7 : The environment states of R2N GIN AR1(2, 4) simulation their reconstructions will represent the measure of the fitting quality. Results of the modeling obtained after applying standard K-means and RENES method are provided in Table 3 . Dominance of RENES method is noticeable. RM S values obtained after applying standard K-means method are unexpectedly high (RM S = 1.989 in the case of R2N GIN AR max (2, 4) model and RM S = 1.836 in the case of R2N GIN AR 1 (2, 4) model). This confirms the hypothesis given in the introduction that K-means is not a useful tool for clustering the data corresponding to the RrIN AR(M, A, P) process with similar means within states. On the other hand, RM S values obtained after applying RENES method are much more acceptable (RM S = 1.529 in the case of R2N GIN AR max (2, 4) model and RM S = 1.478 in the case of R2N GIN AR 1 (2, 4) model). Using the second combination of the model parameters, corresponding R2N GIN AR max (2, 5) and R2N GIN AR 1 (2, 5) simulations are created, two replications of each. The first one is used to obtain the optimal values for d p , c m , c a , c p , C m , C a and C p . The same procedure as the one presented in the case of R2N GIN AR(2, 4) simulations is preformed, and thus obtained optimal values are given in Table 4 .

Further, unused replications were observed as real-life data sequences. After the environment state estimation happened via K-means method and via RENES method, those replications were reconstructed by R2N GIN AR max (2, 5) or R2N GIN AR 1 (2, 5) model for each clustering result. Modeling results thus obtained are given in Table 5 .

Generally, much higher RM S values were detected in the case when simulations were dictated by the second combination of model parameters. This is understandable, given that the realization Table 4 : Values of the constant dp and vectors cm, ca, cp, in the case of simulated R2N GIN AR(2, 5) time series R2N GIN ARmax(2, 5) dp cm ca cp Cm Ca Cp 17 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.4,0.3) 4 2 3 R2N GIN AR1(2, 5) dp cm ca cp Cm Ca Cp 9 (0.2,0.2,0.2) (0.16,0.14,0.14,0.14) (0.4,0.3) 9 6 7 . values are much higher in this case. As a consequence, a benefit obtained in RM S values is higher as well. Comparing the corresponding parameter estimates, we see that the estimates of means are much more accurate after application of the RENES method. Estimates of thinning parameters are also a bit more accurate in this case. More accurate parameter estimates ultimately led to a significant differences in RM S values. Based on all the above, it can be concluded that the new RENES method successfully sorts the realizations in corresponding clusters and thus contributes to a more efficient application of the R2N GIN AR(M, A, P) models.

In order to confirm its efficiency, we tested our RENES method on the data that has been very popular in recent months. From the web site Data Europa (http://www.data.europa.eu) we chose the time series that represent the number of new COVID-19 cases on daily basis detected on the island of Mauritius between March 18, 2020 and April 25, 2021. The plot of a given series is provided in Figure 8 . As can be noticed, the number of newly detected cases was kept under control most of the time. The most frequent number of newly infected inhabitants was 0, with occasional and isolated jumps. However, in two time intervals (form March 22, 2020 to April 9, 2020 and from March 6, 2021 to April 9, 2021), strange results emerged. During those periods, the number of newly infected inhabitants oscillated dramatically, with sharp and frequent ups and downs. In other words, very high values began to appear, followed by sudden decrements and vice versa. All mentioned here indicates that environment state changes might occurred. The plot of the autocorrelation function given in Figure 9 shows that all orders up to order 5 are significant. Since the influence of our RENES method on modeling by R2N GIN AR(2, 4) and R2N GIN AR(2, 5) models has already been examined in the previous section, the same models are going to be observed here as well. Now, we can estimate random environment sequence {z n } using standard K-means and RENES method. All RENES method parameters are the same as in the previous section. Obtained clustering results are provided in Figure 10 . Unlike the standard K-means, RENES method had success in recognizing the atypical behavior of the time series and managed to place in a separate cluster almost all values that were realized during two mentioned time intervals. This conclusion points to the fact that the application of the selected As a final step in proving the supremacy of the RENES method, the fitting quality of given R2N GIN AR(2, 4) and R2N GIN AR(2, 5) models is examined for each clustering result. As a measure of the goodness of fit we use the root mean squares (RM S) of differences between the observations and their predicted values. Table 6 contains results obtained using R2N GIN AR max (2, 4), R2N GIN AR 1 (2, 4), R2N GIN AR max (2, 5) and R2N GIN AR 1 (2, 5) models for each clustering result. Obviously, there is a big difference in fitting quality, depending on the choice of method by which the realizations are distributed into clusters. According to Table 6 , all selected models show much lower RM S values after applying the RENES method. In that way, usefulness of the RENES method is definitely proved and the benefits of its use are confirmed. If the reader possibly wants to compare results of modeling given in Table 6 with the results obtained using various models with stationary or non-stationary nature, he can take a look at Appendix C.

In this article, the new method (RENES ) for estimating random environment process {z n } in RrN GIN AR (M, A, P) models is defined. The standard K-means clustering method, that was used before, showed poor performances. Taking into account only the values of the data elements, application of the K-means on RrN GIN AR(M, A, P) data sequences leads to the loss of information about random environment. Otherwise, by applying K-means on previously transformed data, the loss of information is significantly reduced. This happens because the method follows the behavior of all parameters of the model, which also carry information about belonging to the particular environment state. Hence, RENES method leads to a more natural interpretation of the time series, since there is a possibility of finding extremely high or low values in any state. Because of all mentioned above, RENES method is more suitable for fine clusterings, where small differences (distances) between means within clusters occur, ie. where boundaries between states are not straight lines, but wavy or jagged lines. First, the theoretical review of the method is given, with all necessary discussions and clarifications. Appropriate simulated RrN GIN AR(M, A, P) time series are created. Application of RENES method on the simulated data is implemented. Finally, the supremacy of this new approach over standard K-means is confirmed on popular real-life data. 2. The second combination of model parameters will also create an interesting challenge for RENES method, since some states have only one pair of parameters which are significantly different. Namely, we have that M = (2, 4, 6), A = (0.2, 0.3, 0.6) and P = (2, 4, 5). Beside that, we have As we can see, means within states grow progressively, although the jumps are not too high. The first and the second state have similar thinning parameters, while corresponding orders differ significantly. On the other hand, the second and the third state have similar orders, while corresponding thinning parameters differ significantly. Finally, all parameters of the first and the third state differ significantly. In order to be able to place the realization at moment n in the appropriate cluster, it is crucial for the clustering method to possess information about the behavior of all parameters of the model at the same moment. An initial state has the distribution p vec = (0.35, 0.35, 0.3) and the transition probability matrix is of the form Simulation study follows the same path as it was the case with testing on simulated data with 2 environment states. After creating R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2) simulations (two replications of each) using the first combination of model parameters, one may start with determination of RENES method parameters. The sequence {μ n } is again obtained by usage of (5). To improve such obtained sequence of pre-estimates, optimal shape of the vector c m is of interest. Sequences {T (μ n , c m )} obtained for various selections of c m are shown in Figure 11 and compared to the real parameter values {µ n }.

For both simulations, R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2), the best pre-estimate result is obtained in the case of c m = (0.16, 0.14, 0.14, 0.14). The ability of trimming the high peaks is noticed. In particular, this pre-estimate shows remarkable potential to assess means within the second (middle) state.

Further, a determination of the sequence {P n } takes place in two steps. The first step towards that goal is to determine the value of parameter d p . Calculations of ∆ p for various selections of d p are performed and corresponding results are given in Table 7 . Optimal d p values are 17 and 18 (for R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2) respectively).

The second step in determination of {P n } is to provide corresponding vector c p for fixed optimal value of d p . Sequences {T (P n , c p )} obtained for various selections of c p are shown in Figure 12 and compared to the exact order sequence {P n }. According to figure, sequences obtained in case when k = 2, k = 3 and k = 4 behave practically the same. On the other hand, the sequence obtained for k = 1 has frequent and sharp ups and downs, which lead to the erroneous clustering result. The same conclusion holds for both simulations, R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2). Due to the simplicity of the model, we prefer to take k = 2 and c p = (0.4, 0.3).

We are now able to determineα n , n ∈ N, using (7). To improve such obtained pre-estimates, determination of c a took place. Sequences {T (α n , c a )} obtained for various selections of c a are given in Figure 13 and compared to the real sequence {α n }. As figure shows, sequence of pre-estimates obtained for k = 4 is in advantage in regard to other sequences. For c a = (0.16, 0.14, 0.14, 0.14), just a few steep jumps are located on the plot curve. Pre-estimates are rarely beyond the greatest 14, 0.14, 0.14). Table 7 : Values of the error ∆p for various selections of dp R3N GIN ARmax(2, 4, 2) R3N GIN AR 1 (2, 4, 2) R3N GIN ARmax(2, 4, 5) R3N GIN AR 1 (2, 4, 5) dp ∆p dp ∆p dp ∆p dp ∆p 5

1 thinning parameter value, and even if something like that happens, the overdrafts are generally not large. Most of the time, this sequence of pre-estimates keeps oscillating between α 1 and α 3 , with particulary god assessment of α 2 . The same conclusion holds for both, R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2) simulation. Thus, in both cases we have c a = (0.16, 0.14, 0.14, 0.14).

Previous results regarding d p , c m , c a and c p are summarized in Table 8 . To determine C m , C a and C p , the clustering of {(C m S(μ n , c m ), C a S(α n , c a ), C p S(P n , c p ))} is performed for each C m = i, C a = j, C p = l, i, j, l = 1, 2, . . . , 10, and thousand different estimates of {z n } are provided. The best result is obtained for C m = 9, C a = 7 and C p = 2 in the case of simulated R3N GIN AR max (2, 4, 2) time series, having 209 estimated states which are equal to corresponding exact states. On contrary to that, the standard K-means method managed to have only 155 exactly estimated states, which doesn't seem acceptable at all. A comparative overview of exact states, states obtained by standard K-means method and states obtained by usage of new RENES method is provided by Figure 14 . The same procedure is applied in the case of simulated R3N GIN AR 1 (2, 4, 2) time series. The highest number of exactly estimated states is obtained for C m = 6, C a = 1 and C p = 8, with 216 elements which estimated states are equal to corresponding exact states. The standard K-means method managed to have only 153 exactly estimated states. A comparative overview of exact states, states obtained by standard K-means method and states obtained by usage of RENES method is provided by Figure 15 .

Same as it was the case with 2 environment states simulations, several improvements are noticeable here as well. Beside higher number of exactly estimated states, the RENES method produces much longer sequences of consecutive elements in each of 3 given states. In general, this improvement enables more successful application of random environment IN AR models of higher order. Further, the possibility of finding extremely high or low values in any of three given states is perceptible here. Furthermore, the possibility of having equal elements in different states is 14, 0.14, 0.14). Table 8 : Values of the constant dp and vectors cm, ca, cp, in the case of simulated R3N GIN AR(2, 4, 2) time series R3N GIN ARmax(2, 4, 2) dp cm ca cp 17 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.4,0.3) R3N GIN AR1(2, 4, 2) dp cm ca cp 18 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.4,0.3)

.

Exact states of R3N GIN ARmax(2, 4, 2) simulation States obtained by standard K-means clustering method States obtained by RENES method for dp = 17, cm = (0.16, 0.14, 0.14, 0.14), ca = (0.16, 0.14, 0.14, 0.14), cp = (0.4, 0.3), Cm = 9, Ca = 7, Cp = 2 Figure 14 : The environment states of R3N GIN ARmax(2, 4, 2) simulation also detected. Since this often occurs in generalized random environment IN AR time series of higher order with similar means within states, the RENES method seems more applicable than the standard K-means.

The amount of benefit one gets by applying the RENES method is measured on unused replications of the simulated R3N GIN AR max (2, 4, 2) and R3N GIN AR 1 (2, 4, 2) time series. Having results of both environment state estimation methods mentioned earlier, a valid reconstruction of given simulations is performed by appropriate R3N GIN AR max (2, 4, 2) or R3N GIN AR 1 (2, 4, 2) model for each clustering result singularly. RM S of differences between simulated data and their reconstructions should indicate whether there is any truly benefit from applying the RENES method. Results of modeling obtained after applying the standard K-means method and after applying the RENES method are provided in Table 9 . Although a way smaller than in the case of simulations with 2 environment states, the benefit of applying the RENES method still exists. Reconstructions after K-means clustering produced the following RM S values: RM S = 1.285 in the case of R3N GIN AR max (2, 4, 2) simulation and RM S = 1.471 in the case of R3N GIN AR 1 (2, 4, 2) simulation. On contrary to that, reconstructions after RENES method produced the following: RM S = 1.149 in the case of R3N GIN AR max (2, 4, 2) simulation and RM S = 1.370 in the case of R3N GIN AR 1 (2, 4, 2) simulation. .

In order to confirm additionally the effectiveness of the RENES method, two replications of the 32 simulated R3N GIN AR max (2, 4, 5) and R3N GIN AR 1 (2, 4, 5) time series were created, dictated by the second combination of model parameters. Optimal values of d p , c m , c a c p , C m , C a and C p were obtained, based on the first replications of each pair. The procedure presented in previous cases was followed this time as well. Results thus obtained are presented in Table 10 . Table 10 : Values of the constant dp and vectors cm, ca, cp, in the case of simulated R3N GIN AR(2, 4, 5) time series R3N GIN ARmax(2, 4, 5) dp cm ca cp Cm Ca Cp 12 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.4,0.3) 10 3 1 R3N GIN AR1(2, 4, 5) dp cm ca cp Cm Ca Cp 11 (0.16,0.14,0.14,0.14) (0.16,0.14,0.14,0.14) (0.4,0.3) 7 5 2 .

Further, the standard K-means and the RENES method are performed on unused replications. Furthermore, those replications are reconstructed by corresponding R3N GIN AR max (2, 4, 5) or R3N GIN AR 1 (2, 4, 5) model for each clustering result, and modeling results such obtained are provided in Table 11 . As one may notice, RM S values are not that large in general, bearing on mind relatively high realization values. Hence, the amount of benefit detected after application of the RENES method is really satisfactory. The benefits are mainly generated by a more accurate estimates of the mean values. As for the other model parameters, corresponding estimates are of the same level.

8.3. Appendix C. Application of various models with stationary or non-stationary nature Beside mentioned R2N GIN AR(2, 4) and R2N GIN AR(2, 5) models, several models with stationary or non-stationary nature are considered here. The following stationary models are taken into account: IN AR(1) model with Poisson marginals (P oIN AR(1)) from Al-Osh and Alzaid (1987) , quasi-binomial IN AR(1) model with generalized Poisson marginals (GP QIN AR(1)) presented in Alzaid and Al-Osh (1993) , geometric IN AR(1) model (GIN AR(1)) provided by Alzaid and Al-Osh (1988) , new geometric IN AR(1) model (N GIN AR(1)) defined by Ristić et al. (2009) , combined geometric IN AR(p) model (N GIN AR(p)) given in Nastić et al. (2012) , p = 2, 3, 4, 5, and random coefficient IN AR(1) model with negative binomial marginals (N BRCIN AR(1)) defined by Zheng et al. (2007) . As for the non-stationary models, we consider the following: a 2 state random environment N GIN AR(1) model presented in Nastić et al. (2016) and random environment models of higher order (R2N GIN AR(p)) described in Nastić et al. (2017) , where p = 2, 3, 4, 5.

Corresponding modeling results are given in Table 12 and Table 13 . Table 12 contains results obtained by applying stationary models. In addition, a modeling result obtained by applying the R2N GIN AR(1) model is also placed in this table. Further, Table 13 contains results obtained by applying R2N GIN AR max (p) and R2N GIN AR 1 (p) models of various orders. Based on the tables, the following conclusions can be drawn. The weakest results are obtained by applying stationary models. Involving the concept of random environment with 2 different environment states into the modeling procedure brings significant improvement. This confirms the hypothesis that the time series really took place in two different environment states. Taking into account results given in Table 6 , R2N GIN AR(M, A, P) models are the best for selected real-life data among all models 33 in random environment. In other words, selected real-life data may be observed as a realization of the generalized random environment IN AR time series of higher order. Therefore, it makes sense to test the effectiveness of the new RENES method on selected real-life data. 

First order autoregressive time series with negative binomial and geometric marginals

First-order integer-valued autoregressive (INAR(1)) process

On Some Integer-Valued Autoregressive Moving Average Models

First-order integer-valued autoregressive (INAR(1)) process:distributional and regression properties

Some autoregressive moving average processes with generalized Poisson marginal distributions

Zero Truncated Poisson Integer Valued AR(1) Model

Algorithm AS 136: A K-Means Clustering Algorithm

Crossed Bivariate Integer-valued Autoregressive process based on bivariate Random Environment process

Generalized Random Environment INAR Models of Higher Order

Existence and stochastic structure of a non-negative integer-valued autoregressive process

Some simple models for discrete variate time series

Autoregressive moving-average processes with negative binomial and geometric distributions

A combined geometric IN AR(p) model based on negative binomial thinning

Random Environment Integer Valued Autoregressive process

Random Environment INAR models of higher order

4, 2) model with dp = 17

b) R3N GIN AR1(2, 4, 2) model with dp = 18

4, 2) models: green diamond-exact order sequence {Pn}; regular black line-sequence {T (Pn, cp)} for cp = 1; thick blue line-sequence {T (Pn, cp)} for cp = (0.4, 0.3); dashed black line-sequence {T (Pn, cp)} for cp = (0.2, 0.2, 0.2), dashed red line-sequence {T (Pn, cp)} for cp

4, 2) simulation by RENES method for dp = 18

The environment states of R3N GIN AR1