Title: A rough set based algorithm for updating the modes in categorical clustering
Authors: Semeh Ben Salem, Sami Naouali, Zied Chtourou
Date: 2021-03-27. Journal: Int J Mach Learn Cybern. DOI: 10.1007/s13042-021-01293-w

Abstract: The categorical clustering problem has attracted much attention, especially in recent decades, since many real-world applications produce categorical data. The k-modes algorithm, proposed in 1998, and its many variants have been widely used in this context. However, they suffer from a major limitation related to the update of the modes in each iteration: the mode in the last step of these algorithms is randomly selected, although it is often possible to identify several candidate ones. In this paper, a rough density mode selection method is proposed to identify the adequate modes among a list of candidates in each iteration of the k-modes. The proposed method, called Density Rough k-Modes (DRk-M), was evaluated using real-world datasets extracted from the UCI Machine Learning Repository, the Global Terrorism Database (GTD) and a set of collected Tweets. The DRk-M was also compared to many state-of-the-art clustering methods and has shown great efficiency.

Cluster analysis, also known as unsupervised learning, is a branch of Machine Learning with applications in various domains, including financial fraud detection, medical diagnosis, image processing, information retrieval and bioinformatics [1]. Clustering is used to unlock initially hidden and undetectable patterns by identifying groupings within the dataset under investigation in an unsupervised manner: no labels (or classes) are provided as input parameters. Identifying the clusters without prior knowledge of the outcomes makes this task difficult, challenging and prone to errors. Clustering methods can be classified into five types: hierarchical [2, 3], partitional [4-16], density-based [17, 18], grid-based [19] and model-based methods [20]. The aim of cluster analysis is to partition a dataset composed of N observations embedded in a d-dimensional space into k distinct clusters, such that data points within the same cluster have more similar characteristics than observations in other clusters according to a specific measure called a distance metric [21].

Clustering qualitative data is problematic because such data lack geometric properties. For example, categorical attributes are unordered, and it is inappropriate to use traditional numerical distance functions to capture the resemblance between categorical values. To overcome this limitation, many methods were proposed to deal with categorical data, such as the k-modes and its variants [4-16]. These methods start with K initial centroids and use the alternating minimization method to solve a non-convex optimization problem. In the k-modes, the simple matching dissimilarity measure is used to compare two categorical values: the comparison yields zero for two identical values and one otherwise. However, for all these methods, one main issue is the identification of the initial number of clusters [22].
The mode, representing the most frequent patterns (the most frequent modality in each attribute) of the cluster, is randomly selected in the last step of the clustering process in all these methods. However, it is possible to identify more than one mode depending on the modalities' frequencies in the attributes. In the iterative process of k-modes-type clustering algorithms, the mode is also identified when moving from the ith iteration to the (i+1)th iteration. The selection of this mode is essential and directly influences the formation of the final clusters. Random mode selection, widely used in the literature, may cause the clustering process to terminate in a locally optimal solution and therefore cannot ensure high performance. Tackling the random selection of the mode during the clustering process was not considered in previous methods and is thus an interesting issue to address.

As is widely known, the border of real data is hard to partition, as there is often no sharp boundary between clusters. Most formal modeling tools are deterministic and precise, which does not fit real-world situations that are very often non-deterministic and cannot be described precisely. This calls for integrating uncertainty-based models into the clustering process. The first attempts to handle uncertainty were based on fuzzy theory, such as the fuzzy k-means [23], the fuzzy k-modes [24] and their variants. In fuzzy clustering, each object can have membership degrees in more than one cluster instead of the hard assignment used in the k-modes-based methods. However, fuzzy algorithms have the same limitations as hard algorithms, i.e. they require multiple runs with different centroid initializations to ensure stability. Fuzzy methods also need to tune a control parameter for the membership fuzziness to obtain better solutions, which is a complex task even with extensive experiments. Another attempt to handle uncertainty while avoiding the limitations of fuzzy sets was the Rough Set Theory (RST), introduced by Pawlak [25]. One main reason for the success of the RST is that no additional information, such as thresholds or expert knowledge in a particular domain, is required to start the clustering process. In recent years, RST has attracted much attention in a variety of fields [26] such as computer vision [27, 28], biomedical engineering [29] and economy and finance [30].

In this paper, we propose to tackle mode identification in categorical clustering using an uncertainty-based model. The proposed method, called Density Rough k-Modes (DRk-M), aims to select the most appropriate modes in each iteration of the clustering process. This method can be implemented either for the k-modes or for any of its variants. The rough mode selection identifies the most central centroid for each cluster and thus ensures better clustering performance. To illustrate this notion and the main differences between the DRk-M and state-of-the-art methods, Fig. 1 is provided.

Fig. 1 The k-modes algorithm and its variants compared to the DRk-M

In Fig. 1, the input is a categorical dataset composed of N observations described by d attributes. Although many previous studies considered the selection of the initial number of clusters [22], these methods do not fall within the scope of this study. Fig. 1 describes the clustering process of the k-modes and its variants. In the first phase, K initial centroids are randomly selected.
These centroids are the starting point of the partitioning process. However, this random selection may lead to incorrect results, since elements that are supposed to be part of the same cluster can be selected as centroids and thus placed in different clusters. Many initialization methods were proposed in the literature to choose the most representative initial centroids [8, 35-37, 42]. In the second step of the k-modes, the (N − K) remaining observations are assigned to the clusters according to the similarity between these points and the centroids, computed with the simple matching dissimilarity metric. For this step, some variants of the k-modes proposed other distance metrics to enhance the algorithm [7, 10, 24, 38]. Once all the observations are assigned to their corresponding clusters, the centroids are updated in the last step of the algorithm and a new mode is computed for each group. The DRk-M tackles this last centroid updating step. As shown in Fig. 1, no previous method in the literature investigated this issue.

The contributions and characteristics of the DRk-M can be summarized as follows:

• The DRk-M is a categorical clustering approach. Categorical clustering is a tricky subject since it deals with categorical data characterized by their complex form.
• The random selection of the mode is a critical issue in categorical clustering. Avoiding this limitation permits obtaining more accurate and stable results.
• Since more than one possible mode can be generated in a given cluster in each iteration, the RST can be efficiently used in this context to bring more certainty and accuracy to the final results.
• The convergence, performance and scalability of the DRk-M under the new mode selection method were investigated.

This paper is organized as follows. In Sect. 2, categorical clustering is detailed, and some state-of-the-art categorical clustering methods are presented and classified according to their contribution relative to the k-modes. In Sect. 3, the DRk-M is detailed together with its underlying theories and concepts; the algorithm and a complexity analysis are also given. In Sect. 4, an experimental analysis is provided using several datasets, and the DRk-M is compared to many state-of-the-art algorithms. Finally, in the last section, conclusions and perspectives are provided.

A cluster is commonly characterized by a centroid that measures its centrality. Clustering methods based on prototypes are usually called partitional methods. The number of clusters K is required as an input parameter, and the centroids are iteratively updated until a stop criterion is reached. Among the prototype-based algorithms proposed in the literature are the k-means and its multiple variants [5, 6]. However, because large categorical datasets arise in many applications, such as environmental data analysis [31], market basket data analysis [32], DNA or protein sequence analysis [33] and text mining [34], the k-means cannot be applied directly in such settings. Thus, developing more appropriate algorithms to handle categorical clustering has been an active subject over the last twenty years. Partitional methods consecutively divide the dataset until K clusters are found in a static configuration where no further splitting is possible.
The partitional process can be described as follows. Let CEN = {Cen_1, Cen_2, …, Cen_K} be the set of centroids of the clusters C_j, j = 1, …, K, and X = {obs_1, obs_2, …, obs_N} the initial dataset of size N. For each observation obs_i ∈ X:

Step 1: assign obs_i to the cluster C_j whose centroid verifies Cen_j = argmin_{j=1,…,K} d(obs_i, Cen_j), where d is a similarity metric.
Step 2: update the centroid Cen_j according to a procedure that depends on the data type involved (mean, median, mode).
Step 3: if the stopping criteria are met, the process ends; otherwise repeat from Step 1.

A categorical information system can be described as a quadruple IS = (U, A, V, f), where:

(1) U = {x_1, x_2, ..., x_N} is the nonempty set of N data points, called a universe;
(2) A = {a_1, a_2, ..., a_d} is the nonempty set of d categorical attributes;
(3) V = ∪_{j=1}^{d} V_j is the union of the attribute domains, where V_j = {a_j^1, a_j^2, ..., a_j^{n_j}} is the value domain of the categorical attribute a_j and is finite and unordered, i.e., for any 1 ≤ p ≤ q ≤ n_j, either a_j^p = a_j^q or a_j^p ≠ a_j^q. Here, n_j is the number of categories of attribute a_j for 1 ≤ j ≤ d;
(4) f: U × A → V is an information function such that f(x_i, a_j) ∈ V_j for every x_i ∈ U and a_j ∈ A.

The k-modes [4] is based on optimizing the cost function given in Eq. (1):

F(W, Q) = Σ_{l=1}^{K} Σ_{i=1}^{N} w_li D(obs_i, cen_l)    (1)

Q = {cen_1, cen_2, …, cen_K} represents the set of cluster modes, and W = [w_li] is a {0,1} matrix that encodes the current membership of an observation obs_i in a given cluster. This matrix verifies the rules given in Eqs. (2) and (3):

w_li ∈ {0, 1}, 1 ≤ l ≤ K, 1 ≤ i ≤ N    (2)
Σ_{l=1}^{K} w_li = 1, 1 ≤ i ≤ N    (3)

The k-modes permits clustering categorical datasets through three modifications:

(i) The simple matching dissimilarity measure D is used to evaluate the similarity between obs_i and cen_l, as given in Eqs. (4) and (5):

D(obs_i, cen_l) = Σ_{j=1}^{d} δ(obs_ij, cen_lj)    (4)
δ(cen_lj, obs_ij) = 0 if cen_lj = obs_ij, and 1 if cen_lj ≠ obs_ij    (5)

(ii) The centroids are called modes, where a mode of a categorical dataset U described by d attributes is a vector Q = [q_1, q_2, …, q_d] that minimizes the quantity defined in Eq. (6):

D(U, Q) = Σ_{i=1}^{N} D(x_i, Q)    (6)

The mode Q is the vector of the most frequent modalities in each attribute.

(iii) A frequency-based method is used to update the modes during the clustering process in order to minimize the cost function.
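To make these definitions concrete, the following minimal Python sketch (an illustration under our reading of Eqs. (4)-(6), not the authors' implementation; the sample cluster is hypothetical) computes the simple matching dissimilarity and the frequency-based mode of a cluster:

from collections import Counter

def simple_matching(obs, cen):
    # Simple matching dissimilarity (Eqs. (4)-(5)): count the attributes on
    # which the observation and the mode disagree.
    return sum(1 for o, c in zip(obs, cen) if o != c)

def mode_of(cluster):
    # Frequency-based mode (Eq. (6)): the most frequent modality in each
    # attribute; ties are broken arbitrarily in this sketch.
    d = len(cluster[0])
    return [Counter(obs[j] for obs in cluster).most_common(1)[0][0]
            for j in range(d)]

# Tiny hypothetical categorical cluster with three attributes
cluster = [["a", "e", "l"], ["d", "e", "l"], ["a", "f", "l"]]
print(mode_of(cluster))                                    # ['a', 'e', 'l']
print(simple_matching(["d", "f", "l"], mode_of(cluster)))  # 2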
Many variants of the k-modes were proposed to improve its efficiency and scalability from several perspectives. In the rest of this section, these methods are detailed in order to present the main research axes.

Most partitional categorical clustering methods, such as the k-modes and its variants, require pre-defining the number of clusters K as well as selecting the initial centroids, which is a serious limitation. The initialization of the centroids can have a high impact on the final clustering results, and different initializations can lead to different output clusters. The initial centroids are usually selected at random, which is also problematic since centroids with similar characteristics may be chosen. Due to its simplicity, random initialization has been widely used; in order to counteract its negative effects, these algorithms generally need to be rerun many times with different initializations [8, 35-37].

In [8], the authors proposed an initialization method based on density and distance measures. In this method, the initial dataset is split into several subsets based on its attributes, which makes it possible to discard some data points from the potential set of initial centroids. Then, the most frequent attribute value is spotted in each attribute domain to compose a representative point and generate the centroid. The method's computational cost is O(2Nm|V| + |V| + mK²|V|), where |V| = Σ_{j=1}^{m} n_j, m is the number of categorical attributes and n_j the corresponding number of modalities; this is linear with respect to the number of data points. However, the method is not appropriate when the number of clusters is large.

In [36], the authors proposed an advanced method to identify prominent attributes, i.e., the dataset's most relevant attributes. Multiple clusterings of the data based on these attributes are then performed to spot interesting initial centroids. The method performs multiple clusterings on different attributes in the original data space and uses the distinct values of an attribute as cluster labels. These multiple views provide new insights into the data's hidden structures, help find a consistent cluster structure and aid in computing better initial centroids. Three approaches were presented to select the attribute spaces that generate the different clustering views:

• Vanilla approach: this method considers all the attributes (m) present in the dataset.
• Prominent attributes: only a few attributes may be useful to generate multiple clustering views.
• Significant attributes: a set of attributes generated from the prominent attributes is retained.

In [37], the initialization of the centroids was considered from the viewpoint of outlier detection. Two different initialization algorithms were proposed within the RST: a distance-based technique called Ini_Distance and an entropy-based outlier detection technique called Ini_Entropy. These two measures are used to calculate the degree of outlierness of each object. Their complexities are as follows:

• The complexity of Ini_Distance is O(m × N²), which makes it unsuitable for large datasets.
• The complexity of Ini_Entropy is O(KmN + m²N), which makes it unsuitable for high-dimensional datasets.

In [7], the authors proposed an enhanced version of the k-modes by integrating a between-cluster similarity term into the optimization function. S(Z_l) denotes the similarity between the lth cluster, represented by z_l, and the other clusters, and Σ_{i=1}^{N} w_li is its weight, i.e., the number of objects in the lth cluster. It thus becomes possible to simultaneously minimize the within-cluster dispersion and enhance the between-cluster separation. In the corresponding objective function, a parameter γ maintains the balance between the effect of the within-cluster information and that of the between-cluster information on the minimization process. In [7], the authors applied this enhancement to three methods: Huang's version of the k-modes, Ng's version of the k-modes and the weighted k-modes.

In [38], a new dissimilarity measure was defined for the k-modes. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies of the values they share, where f_r(A_j = q_lj | X_l) is the relative frequency of q_lj in cluster X_l.
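Since the exact formula of [38] did not survive extraction here, the following sketch only illustrates the stated idea under our assumptions: a match on an attribute is discounted by the relative frequency of the shared value in the cluster, while a mismatch costs 1. The function and variable names are hypothetical.

def freq_dissimilarity(obs, mode, cluster):
    # A match costs (1 - relative frequency of the shared value in the
    # cluster), so frequent shared values make the object look closer to the
    # mode; a mismatch costs 1, as in the simple matching measure.
    n = len(cluster)
    total = 0.0
    for j, (o, m) in enumerate(zip(obs, mode)):
        if o == m:
            freq = sum(1 for x in cluster if x[j] == m) / n
            total += 1.0 - freq
        else:
            total += 1.0
    return total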
One of the first attempts to handle uncertainty was based on fuzzy set theory. As an extension of the fuzzy k-means, the fuzzy k-modes [24] were proposed and many variants were developed [10]. In this algorithm, each pattern or object can have membership degrees in all clusters rather than a strict membership in exactly one cluster. In the fuzzy k-modes, the objects of the universe U are put into k clusters by finding W and CEN that minimize the objective function given in Eq. (11):

F(W, CEN) = Σ_{l=1}^{k} Σ_{i=1}^{N} w_li^α d(cen_l, obs_i)    (11)

where α > 1 is the fuzziness exponent, w_li is the membership degree of obs_i in the lth cluster, CEN = {cen_1, cen_2, …, cen_k}, cen_l = [cen_l1, cen_l2, …, cen_lm] is the lth cluster prototype with categorical attributes a_1, a_2, ..., a_m, and d(cen_l, obs_i) is the simple matching dissimilarity measure as defined by Huang. The method finds fuzzy cluster modes when a simple matching dissimilarity measure is used for categorical objects.

In [10], the authors proposed a fuzzy categorical clustering algorithm in which the fuzzy k-modes' objective function was modified by adding a between-cluster information term. This makes it possible to simultaneously minimize the within-cluster dispersion and enhance the between-cluster separation. To obtain locally optimal solutions of the modified objective function, the corresponding update formulas of the membership matrix and the cluster prototypes were derived. The authors integrated the within-cluster and between-cluster information to update the membership matrix and cluster prototypes, which can effectively produce clustering results with high within-cluster similarity and low between-cluster similarity. By assigning confidences to objects in different clusters, the core and boundary objects of the clusters can be identified, which provides useful information when dealing with boundary objects. However, the final fuzzy clustering outputs are still influenced by the mode initialization and the processing order of the objects in the datasets. Furthermore, these methods need to tune a control parameter for the membership fuzziness, and in applications it is not clear how to find its optimal value: it is often selected based on the decision makers' prior knowledge of the domain, their intuition, or proposed criteria.
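For reference, here is a minimal sketch of the membership update commonly used for the fuzzy k-modes. It assumes the standard formula from the literature rather than reproducing [24] itself, and it treats objects that coincide with one or more modes by splitting a hard membership among them:

def simple_matching(obs, cen):
    # Simple matching dissimilarity, as in the earlier sketch.
    return sum(1 for o, c in zip(obs, cen) if o != c)

def fuzzy_memberships(obs, modes, alpha=1.1):
    # Standard update (assumed): w_li = 1 / sum_h (d_l / d_h)^(1/(alpha-1)).
    dists = [simple_matching(obs, cen) for cen in modes]
    if 0 in dists:
        hits = dists.count(0)
        return [1.0 / hits if d == 0 else 0.0 for d in dists]
    exp = 1.0 / (alpha - 1.0)
    k = len(modes)
    return [1.0 / sum((dists[l] / dists[h]) ** exp for h in range(k))
            for l in range(k)]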
In [41] , the authors proposed an algorithm based on fusing rough set and fuzzy set theories. The proposed rough fuzzy clustering method was used sequentially to integrate different measures to enhance the clustering performance. Thus, pure classified, semi rough and pure rough points are identified. After that, the Random Forest can be used in an incremental manner to classify these semi and pure rough points using pure classified points to yield better clustering results. In [42] , the authors addressed the issue of outlier detection as an initialization method to select the best centroids when starting the clustering process. The uncertainty regarding the clustering process is addressed by considering a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets. Most of the clustering approaches based Rough Set consider two techniques: (i) introducing a decision attribute based on which the dataset will be divided to partition the objects [39, 40, 43, 44] or, (ii) evaluating the lower, upper and quality of approximations of the a dataset [42, 45] . The selection of the most appropriate centroids when initializing the clustering process has been considered in many types of research since it may heavily impact the final results resulting from the partitioning procedure. However, this issue was only considered in the algorithm's first step and not in all the consecutive iterations within the process. Even when executing the updating of the modes in each iteration, multiple modes can be proposed. It is crucial to identify the most appropriate when to consider instead of automatically using the random selection method. Let IS = (U,A,V,f) be a categorical information system, and P a subset of the descriptive attributes A of the universe U (P ⊆ A). The objective of the clustering is to find the set of observations OBS = {obs 1 , obs 2 , …, obs N } and the set of centroids CEN = {cen 1 , cen 2 ,…,cen K } that minimize the same cost function given in Eq. (1). The same constraints given in Eqs. (2) and (3) will also be considered. The DRk-M will implement the simple matching dissimilarity measure defined in Eqs. (4) and (5) during the assignments step of the observations to their closest clusters [4, 24] . The process of optimization can be described as follows: • Step 1. Choose K distinct objects cen 1 , cen 2 ,...,cen K from the universe U as the initial set of modes (t = 1) CEN (t=1) = {cen 1 ,cen 2 ,...,cen K } ∈ U k . Determine W (1) such that F (W,cen (1) ) is minimized. D stands for the simple matching dissimilarity measure. The W = [ ̄ li ] matrix is a NxK binary matrix that reports whether an observation is part of a given cluster or not. To better illustrate the notion of this matrix, let's consider the example given in Fig. 2 . In the given example, the dataset is composed of seven observations to be put in three clusters. Once the clustering is terminated, the W matrix will be generated as given in Table 1 . If an observation obs i belongs to the cluster Cl l then W il = 1 otherwise W il = 0. The W matrix is used to determine to which cluster, each observation belongs. Theorem 2 let cen lj = [cen l1 , cen l2 ,...,cen ld ] be the mode of For a given object obs i = [obs i1 , obs i2 ,…,obs id ]. 
Theorem 2. Let cen_l = [cen_l1, cen_l2, ..., cen_ld] be the mode of cluster Cl_l. For a given matrix W, the quantity Σ_{i=1}^{N} w_li D(obs_i, cen_l) is minimized iff f_r(A_j = cen_lj | Cl_l) ≥ f_r(A_j = a_j^t | Cl_l) for every category a_j^t, where 1 ≤ t ≤ n_j and 1 ≤ j ≤ d.

In other words, for a given matrix W we have

F(W, CEN) = Σ_{l=1}^{K} Σ_{j=1}^{d} D_lj, with D_lj = n − |{ i : cen_lj = obs_ij, w_li = 1 }|,

where n represents the cardinality of the dataset, and thus minimizing each D_lj corresponds to maximizing the number of matches between the mode and the observations of the cluster. According to Theorem 2, minimizing the cost function F(W, CEN) corresponds to minimizing all the inner sums D_lj, which are nonnegative and independent. Each inner sum is minimized iff every term n − |{ i : cen_lj = obs_ij, w_li = 1 }| is minimal, which requires maximizing the cardinality of the sets where cen_lj = obs_ij, i.e., choosing in each attribute the most frequent modality of the cluster.

Proof of Theorem 3. Only a finite number of possible sets of cluster modes CEN = {cen_1, cen_2, …, cen_K} exist. We show that each set of modes can occur at most once in the clustering process; the last occurrence of a repeated set of modes corresponds to the final iteration and the stop criterion of the DRk-M. Suppose, to the contrary, that there exist two distinct iterations t_1 ≠ t_2 such that CEN^(t1) = CEN^(t2). According to the first Theorem, the proposed clustering algorithm using the simple matching dissimilarity measure computes the minimizers W^(t1) and W^(t2) for CEN = CEN^(t1) and CEN = CEN^(t2), respectively, which implies F(W^(t1), CEN^(t1)) = F(W^(t2), CEN^(t2)). On the other hand, the sequence F(W^(t), CEN^(t)) generated by the DRk-M using the simple matching dissimilarity measure is strictly decreasing, which is not compatible with the previous equality. □

In order to illustrate the notion of the mode, an example is provided in Table 2. Consider a cluster composed of seven observations {obs_1, obs_2, obs_3, obs_4, obs_5, obs_6, obs_7} described by four categorical attributes {a_1, a_2, a_3, a_4}. According to Table 2, six candidate modes can be defined for this cluster:

Q_1 = (a, e, l, y), Q_2 = (d, f, l, y), Q_3 = (a, g, l, y), Q_4 = (d, e, l, y), Q_5 = (a, f, l, y), Q_6 = (d, g, l, y)

In the original version of the k-modes, the mode is selected randomly among such candidates and, to our knowledge, no previously defined method identifies the most appropriate mode, which is a restriction that limits the performance of the clustering method.
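The candidate modes can be enumerated mechanically, as in the following sketch (an illustration, not the authors' code): in each attribute, every modality whose frequency ties for the maximum is kept, and the Cartesian product across attributes yields the candidates.

from collections import Counter
from itertools import product

def candidate_modes(cluster):
    # In each attribute keep all modalities tied for the maximum frequency,
    # then take the Cartesian product across attributes.
    per_attr = []
    for j in range(len(cluster[0])):
        counts = Counter(obs[j] for obs in cluster)
        best = max(counts.values())
        per_attr.append([v for v, c in counts.items() if c == best])
    return [list(q) for q in product(*per_attr)]

# With ties a/d in attribute 1 and e/f/g in attribute 2, as in the example
# around Table 2, six candidate modes are produced.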
The RST can deal with imperfect, vague and imprecise data based on the notion of indiscernibility between observations. In this study, it is used to identify the most suitable modes among a list of candidates. With any rough set, a pair of precise sets, called the lower and upper approximations, is associated: the lower approximation consists of all objects that surely belong to the set, while the upper approximation contains all objects that possibly belong to the set [40, 46].

Let IS = (U, A, V, f) be a categorical information system and B ⊆ A a subset of attributes. A binary relation IND(B), called the indiscernibility relation, between two observations obs_i and obs_j of U is defined as:

IND(B) = {(obs_i, obs_j) ∈ U × U : f(obs_i, a) = f(obs_j, a) for all a ∈ B}

Thus, every subset of attributes B ⊆ A generates an indiscernibility relation: two observations are indiscernible with respect to a set of attributes if they have the same values for those attributes.

The lower approximation. The lower approximation of a subset X ⊆ U with respect to B ⊆ A, denoted B_*(X) or B̲(X), is defined as:

B_*(X) = {x ∈ U : [x]_B ⊆ X}

The upper approximation. The upper approximation of a subset X ⊆ U with respect to B ⊆ A, denoted B^*(X) or B̄(X), is defined as:

B^*(X) = {x ∈ U : [x]_B ∩ X ≠ ∅}

where [x]_B denotes the equivalence class of x under IND(B). In order to better understand the notion of indiscernibility in a dataset, an example related to the Covid-19 pandemic is provided in Table 3. Covid-19 is a strain of coronavirus that first broke out in Wuhan, China in December 2019 and has since become a global pandemic.

The DRk-M can be seen as a generalization of Huang's definition of the mode. The approach defines the list of all candidate modes, then generates a sublist of the most plausible ones, called rough modes. The rough mode is the closest element to all the observations of the cluster, i.e., it minimizes the quantity D(Cl_l, Q) = Σ_{obs_i ∈ Cl_l} d(obs_i, Q), where d is the simple matching dissimilarity measure. Minimizing this quantity is the key to determining the rough modes.

Theorem 4. D(Cl_l, Q) is minimized iff f_r(A_j = q_j | Cl_l) ≥ f_r(A_j = c_kj | Cl_l) for every q_j ≠ c_kj and all j = 1, …, d, where f_r(A_j = c_kj | Cl_l) = n_ckj / N is the relative frequency of the kth category c_kj in attribute A_j and n_ckj is the number of objects having the kth category c_kj in attribute A_j. In other words, D(Cl_l, Q) is minimized by composing the rough mode from the most frequent modalities in each attribute.

Proof of Theorem 4. Let f_r(A_j = c_kj | Cl_l) = n_ckj / N be the relative frequency of the kth category c_kj in attribute A_j, where N is the total number of observations and n_ckj the number of objects having the category c_kj. We have

D(Cl_l, Q) = Σ_{j=1}^{d} Σ_{i=1}^{N} δ(x_ij, q_j) = Σ_{j=1}^{d} (N − n_qj) = Σ_{j=1}^{d} N (1 − f_r(A_j = q_j | Cl_l)),

where δ(x_ij, q_j) is the simple matching dissimilarity, taking either 1 or 0 with a maximum sum of N, and n_qj represents the number of cases where x_ij = q_j. D(Cl_l, Q) is therefore minimized iff every term N(1 − f_r(A_j = q_j | Cl_l)) is minimal, i.e., iff every f_r(A_j = q_j | Cl_l) is maximal. □

Theorem 4 is used to compute the rough modes, which correspond to the list of all possible modes within the cluster. In all cases, a rough mode is a vector containing the most frequent modalities in each attribute of the cluster observations; it may be an element of the cluster or a synthetic vector generated during the process. To select the best mode in the set of potential centroids, we do not only consider the distance between objects but also the average density of the modes. If the distance between an object and the already existing cluster centers were the only factor considered, an outlier could be taken as a new cluster center. Similarly, if only the density of the object were taken into account, several cluster centers could end up located around a single center. To avoid these problems, the distance between objects and the density of the object are combined to measure the possibility of an object being a cluster center.

To better illustrate the notion of rough mode, Fig. 3 shows the clustering process of the DRk-M.

Fig. 3 The DRk-M clustering process

The DRk-M investigates the step where the modes are updated in each iteration of the process, which corresponds to the third step. The simple matching dissimilarity measure is used as a distance metric, and no centroid initialization is incorporated in the first step. In step 3, the DRk-M considers that more than one mode can be identified, and this number can vary from one cluster to another. The mode with the highest density value has the highest probability of being selected as the centroid of that cluster and is thus put in the upper approximation.

The algorithm proceeds as follows. In the first step of the DRk-M, K initial observations are randomly selected as cluster modes; this initial random selection is the same approach used in the k-modes, although many previous methods proposed initialization techniques to select the most appropriate initial modes [6, 8, 35-37], which could be integrated into the DRk-M in future work. In the second step, the simple matching dissimilarity measure is used to assign the observations to their closest clusters, minimizing the cost function given in Eq. (1). In the third step, all candidate modes are computed for each obtained cluster from the modality frequencies of each attribute and put either in the lower or the upper approximation.
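Putting the three steps together, the following self-contained sketch outlines one plausible reading of the DRk-M loop. The density criterion is an assumption on our part: the paper combines distance and density without giving a closed formula in this text, so the sketch simply scores each candidate mode by its total simple matching distance to the cluster members (the lowest total = the most central, "densest" mode).

import random
from collections import Counter
from itertools import product

def simple_matching(obs, cen):
    # Simple matching dissimilarity (Eqs. (4)-(5)).
    return sum(1 for o, c in zip(obs, cen) if o != c)

def candidate_modes(cluster):
    # All candidate modes: Cartesian product of the tied most frequent
    # modalities of each attribute (Theorem 4).
    per_attr = []
    for j in range(len(cluster[0])):
        counts = Counter(obs[j] for obs in cluster)
        best = max(counts.values())
        per_attr.append([v for v, c in counts.items() if c == best])
    return [list(q) for q in product(*per_attr)]

def densest_mode(cluster, candidates):
    # Assumed density criterion: pick the candidate with the smallest total
    # simple matching distance to the cluster members.
    return min(candidates,
               key=lambda q: sum(simple_matching(x, q) for x in cluster))

def drk_m(data, K, max_iter=100):
    modes = random.sample(data, K)              # step 1: random initial modes
    clusters = [[] for _ in range(K)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(K)]
        for obs in data:                        # step 2: closest-mode assignment
            l = min(range(K), key=lambda i: simple_matching(obs, modes[i]))
            clusters[l].append(obs)
        new_modes = [densest_mode(c, candidate_modes(c)) if c else modes[i]
                     for i, c in enumerate(clusters)]
        if new_modes == modes:                  # stop criterion: modes stable
            break
        modes = new_modes
    return modes, clusters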
The DRk-M is scalable compared to the standard k-modes, since it does not change the clustering paradigm but only introduces a new approximation step for identifying the most suitable centroid in each cluster. Assessing the scalability of the DRk-M requires analyzing the computational cost of each step of the clustering process. As in the standard k-modes, assigning the N observations to K clusters over t iterations costs O(NKdt). Then, in each iteration, computing the candidate modes costs O(NKtd·n_ckj), where n_ckj is the number of objects having the category c_kj. Finally, assigning the identified rough modes to the upper or lower approximation of the rough sets costs O(|C_k|pKtd), where |C_k| is the cardinality of the considered cluster and p the number of candidate modes. The overall complexity of the DRk-M is therefore O(NKdt + NKtd·n_ckj + |C_k|pKtd). Under the approximation that K, t, d, |C_k|, p and n_ckj are all ≪ N, the overall time complexity of the algorithm is O(N).

In this section, the clustering performance and scalability of the DRk-M are evaluated. The algorithm is compared to many state-of-the-art algorithms, including Huang's k-modes (1998) [4], Ng's k-modes (2007) [50], Cao's dissimilarity (2012) [11], the improved Huang's k-modes (2014) [7], the weighted k-modes (2014) [7], the improved weighted k-modes (2014) [7], the improved Ng's k-modes (2014) [7], Bai's fuzzy k-modes (2013) [10], Khan's initialization method (2013) [36] and the fuzzy k-modes (2013) [10]. Various experimental datasets are used with several testing configurations in terms of the number of observations N, clusters K and dimensions d. The efficiency of the DRk-M is validated using several well-known evaluation metrics. The algorithms were coded in Java, and the experiments were executed on an Intel Core i7 3.5 GHz machine with 16 GB of memory.

The accuracy is used to quantify the correctly classified cases. To compute the accuracy, each cluster is assigned to the most frequent pattern in the cluster according to the modalities in the attributes. The accuracy of this assignment is then measured by counting the number of correctly assigned observations and dividing it by the total number of observations N, according to Eq. (13):

AC = (1/N) Σ_{k=1}^{K} max_j |𝕂_k ∩ c_j|    (13)

where 𝕂 = {𝕂_1, 𝕂_2, …, 𝕂_K} is the set of clusters and ℂ = {c_1, c_2, …, c_J} is the set of classes identified from the patterns.
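A short sketch of this purity-style accuracy (an illustration; the label encodings are arbitrary):

from collections import Counter

def clustering_accuracy(cluster_labels, class_labels):
    # Each cluster votes for its most frequent true class; correct votes are
    # summed and divided by N (Eq. (13)).
    clusters = {}
    for cl, y in zip(cluster_labels, class_labels):
        clusters.setdefault(cl, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1]
                  for ys in clusters.values())
    return correct / len(class_labels)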
The accuracy is always a positive value ranging from 0 to 1, where a higher value depicts a better clustering. The entropy is used to measure the disorder in a distribution of objects; its smallest value is 0, and an increasing value indicates a worse clustering. This metric, denoted H, is defined in Eq. (14):

H = − Σ_{k} P_k Σ_{j} P(c_j | k) log P(c_j | k)    (14)

where P_k and P_cj are the probabilities of an observation being in cluster k and in class c_j, respectively, and P(c_j | k) is the proportion of class c_j within cluster k.

The mutual information (MI) of two random variables measures their mutual dependence. The Normalized Mutual Information (NMI) metric ranges from 0 to 1, and a higher value indicates a better clustering. The NMI is obtained by normalizing the mutual information I by the entropies of the cluster and class partitions (Eq. (15)), typically NMI = I / √(H(𝕂) H(ℂ)), where I is the mutual information defined in Eq. (16):

I = Σ_k Σ_j P(k ∩ c_j) log [ P(k ∩ c_j) / (P_k P_cj) ]    (16)

and P(k ∩ c_j) corresponds to the probability of an observation being in the intersection of cluster k and class c_j.

Three other evaluation metrics are also used in this study: precision, recall and the F1-score. These metrics can be directly computed from the confusion matrix. To understand how they are computed, consider the example given in Table 4, which concerns a classification over a dataset composed of two classes: Cancer = YES and Cancer = NO. The goal is to predict whether a patient has cancer, for a total of 100 patients. The confusion matrix crosses the predicted results (returned by the model) with the actual (real) results for each of the two classes, making it possible to identify the observations that were correctly classified and those that were not. The obtained confusion matrix can be interpreted as follows:

• 25 patients that have cancer were correctly classified by the system as True Positives (TP), i.e. they truly have cancer and were predicted with cancer by the system.
• 65 patients that do not have cancer were also correctly classified by the system as True Negatives (TN).
• Only 10 (5 + 5) patients were wrongly classified, either having cancer and being spotted as cancer-free or vice versa. These two groups correspond respectively to False Negatives (FN) and False Positives (FP).

The precision measures the correctly classified positive cases among all the predicted positive cases; it is useful when the cost of False Positives is high:

Precision = TP / (TP + FP)

The recall measures the correctly identified positive cases among all the actual positive cases; it is important when the cost of False Negatives is high:

Recall = TP / (TP + FN)

Another way to compute the accuracy, using the confusion matrix, is to apply the formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The F1-score is the harmonic mean of the precision and recall and gives a better measure of the incorrectly classified cases than the accuracy:

F1 = 2 · Precision · Recall / (Precision + Recall)

For the considered cancer dataset, according to the values presented in the confusion matrix, Precision = 25/30 ≈ 0.83, Recall = 25/30 ≈ 0.83, Accuracy = 90/100 = 0.90 and F1 ≈ 0.83.
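These four formulas can be checked directly on the cancer example (TP = 25, TN = 65, FP = 5, FN = 5):

def confusion_metrics(tp, tn, fp, fn):
    # Precision, recall, accuracy and F1 from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

print(confusion_metrics(25, 65, 5, 5))  # (0.833..., 0.833..., 0.9, 0.833...)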
The Silhouette score is a statistical interpretation and validation of clustering results that measures how well a data point is classified when assigned to a cluster; it can thus be used as a quality measure to validate the clustering results according to the number of clusters in our case. The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Consider a data observation obs_i ∈ C_j classified in cluster C_j. The mean distance between obs_i and all the other data points of the same cluster is:

a(obs_i) = (1 / (|C_j| − 1)) Σ_{obs ∈ C_j, obs ≠ obs_i} d(obs_i, obs)

where d is the distance used, such as the Euclidean distance. This quantity qualifies how well obs_i is assigned to its cluster: the smaller the value of a, the better the assignment. The mean dissimilarity of obs_i to another cluster C_k (C_k ≠ C_j) is defined as the mean distance from obs_i to all the points of C_k, and we set:

b(obs_i) = min_{C_k ≠ C_j} (1 / |C_k|) Σ_{obs ∈ C_k} d(obs_i, obs)

i.e., the smallest mean distance of obs_i to the points of any cluster of which obs_i is not a member. The cluster realizing this smallest mean dissimilarity is called the "neighboring cluster" of obs_i, because it is the next best fit for obs_i. The silhouette value of the data point obs_i is then:

s(obs_i) = (b(obs_i) − a(obs_i)) / max(a(obs_i), b(obs_i))

For s(obs_i) to be close to 1, we need a(obs_i) ≪ b(obs_i). As a(obs_i) measures how dissimilar obs_i is to its own cluster, a small value means it is well matched; furthermore, a large b(obs_i) implies that obs_i is badly matched to its neighboring clusters. Thus, an s(obs_i) close to one means that the data point is appropriately clustered; a value close to −1 means it would be better to classify obs_i in the neighboring cluster; a value near zero means the point is on the border of two natural clusters.

In order to assess the efficiency of the DRk-M, the algorithm was tested on five datasets extracted from the UCI Machine Learning Repository. These datasets were widely used in the literature to evaluate state-of-the-art methods and are described as follows:

• Mushroom data: the dataset includes descriptions of hypothetical samples corresponding to 22 species of gilled mushrooms in the Agaricus and Lepiota families. It consists of 8124 objects and 23 categorical attributes. Each object belongs to one of two classes: edible (4208 objects) and poisonous (3916 objects).
• Breast cancer data: the dataset was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. It consists of 699 data objects and 9 categorical attributes, with two clusters: Benign (458 objects) and Malignant (241 objects).
• Credit approval data: the dataset contains data from a credit card organization, where customers are divided into two classes. It is a mixed dataset with eight categorical and six numeric features, containing 690 objects belonging to two classes: negative (383 objects) and positive (307 objects). In the tests, only the categorical attributes of the dataset are considered.
• Zoo data: the dataset contains 101 elements described by 17 Boolean-valued attributes classified into seven classes.
• Lung cancer data: the dataset was used by Hong and Young to illustrate the power of the optimal discriminant plane even in ill-posed settings. It has 32 instances described by 56 categorical attributes and contains three classes.

In the experiments, the DRk-M is first tested using the Mushroom dataset in terms of its dimensionality, with three evaluation metrics: the accuracy, the entropy and the NMI.
The experiments were conducted by varying the number of dimensions d (4 → 24), and the obtained results are given in Fig. 4. According to the results, the DRk-M provided better results in 86% of the total cases, which makes it more accurate than the k-modes. The accuracy values range from a = 0.5437 for d = 9 to a = 0.7368 for d = 14, with the DRk-M providing better results in 16 cases (80% of the total cases); the accuracy curves are shown at the bottom of Fig. 4. For the NMI, represented by the curves in the middle of Fig. 4, the DRk-M provided values ranging from NMI = 1.1288 for d = 7 to NMI = 2.4387 for d = 12, performing better in 14 cases (70% of the total cases). For the entropy, the values computed for the DRk-M range from e = 2.3259 for d = 6 to e = 4.2498 for d = 15, with the DRk-M outperforming the k-modes in 16 cases (80% of the total cases). These experiments are interesting since they allowed studying the effect of various dimensionalities on the performance of the DRk-M compared with the k-modes. It is important to mention that the number of clusters here is K = 2, in concordance with the ground truth of the Mushroom dataset, which is composed of two classes as mentioned in its description; different results would have been obtained with another value of K.

The breast cancer dataset was also used to assess the efficiency of the DRk-M. The DRk-M and the k-modes were compared for various numbers of clusters K (6 → 10), and the corresponding results are provided in Table 5, using the accuracy, entropy and NMI as evaluation metrics. For the breast cancer dataset, the DRk-M provided better clustering results in 12 cases.

In the second part of the experiments, more UCI datasets are used to compare the DRk-M to several state-of-the-art enhancements of the k-modes. Since most of these methods suffer from stability issues, 100 runs of the DRk-M were carried out with various initial modes; this technique was also used in many previous studies to ensure stable results. The comparison results are given in Tables 6, 7, 8, 9 and 10, where each value is the average over 100 runs (values written in bold mark the metrics where the proposed algorithm performed better than the state-of-the-art methods; Table 9, for instance, reports the accuracy (AC) and F1-score computed over 100 runs for the credit approval dataset).

The tables also include comparisons of the DRk-M with some fuzzy k-modes methods as reported in [10]. In this case, the fuzziness parameter was fixed to α = 1.1: according to the explanation provided in [10], several values of the fuzziness parameter were tested, and α = 1.1 yielded the lowest value of the cost function to be minimized, i.e. the best results. Besides, in all the experiments, the number of clusters is set equal to the number of classes of each dataset in order to respect the ground truth. Two metrics were used in these experiments: the accuracy and the F1-score.
According to Tables 6, 7, 8, 9 and 10, the DRk-M provided better clustering results in terms of the accuracy and F1-score for all the considered datasets, except for the Mushroom dataset when comparing against Khan's initialization method [36]. The state-of-the-art methods include many variants of the k-modes, obtained either by improving the simple matching dissimilarity measure, by using fuzzy methods or by implementing an initialization method to select the most accurate initial centroids. In all these cases, the DRk-M with the proposed rough mode selection provided more accurate results. The obtained results confirm the advantage of using the DRk-M for categorical clustering and of implementing the RST to update the modes during the segmentation process.

In this section, two datasets collected from Twitter using Python were considered. The targeted Twitter accounts correspond to profiles related to terrorist groups, and the datasets are described as follows:

• Dataset 1: this dataset contains 1803 instances described by 13 categorical attributes ["month of the tweet", "tweet_id", "source", "device", "in_reply_to_status_id", "in_reply_to_user_id", "in_reply_to_screen_name", "user_tweet_id", "user_tweet_name", "user_tweet_screen_name", "user_tweet_location" and "language"]. The tweets collected correspond to specific keywords related to cyber terrorism: Islamic State, caliphate editions, state of the caliphate, daesh, Battalion Okba-Ibn-Nafaa, African media.
• Dataset 2: this second dataset contains 284 instances described by 10 categorical attributes ["tweet_date", "screen name", "tweet_id", "in_reply_to_status_id", "retweeted_status_id", "reply_to_user_ID", "user_verified", "retweeted", "user_tweet_location", "hashtags"]. The tweets were collected from two specific user pages: "Gzrawi" (131 tweets), for tweets posted by terrorist groups affiliated with "daech", and "Daesh_Online_01" (153 tweets).

In this study, a categorical clustering algorithm is experimented with, and Table 11 gives an example of how to evaluate the clustering results using the accuracy. The number of clusters K is initially defined, then the clustering process is launched. The class label of the groups was set to the attribute location, which corresponds to the place where the tweet was posted. A total of 15 labels were identified, corresponding to various countries: Tunisia, Egypt, Algeria, Libya, Morocco, Yemen, London, Iraq, KSA, Lebanon, Turkey, Kuwait, Syria, Yemen and NULL if no country is identified. As an example, Table 11 shows the clusters obtained for K = 5:

Table 11 Clusters resulting from the segmentation process for K = 5

Label     Cl_1  Cl_2  Cl_3  Cl_4  Cl_5  Total
Tunisia     10     1   190     5     5    221
Egypt        5   160     9    10     5    189
Algeria    165     2     3     2     1    173
Libya        2     4     2    48     3     59
Morocco      2     1     6     1    72     83
max        165   160   190    48    72    635/725 = 0.87

In Table 11, the most frequent label in each cluster is identified in the last line (max). These values are then summed and used to compute the accuracy: a = 635/725 = 0.87. Using the two datasets described above, the accuracy of the DRk-M and the k-modes was computed for various numbers of clusters K (5 → 10), and the obtained results are reported in Table 12. According to these results, the DRk-M provided better results than the k-modes for the two datasets. For dataset 1, better results were obtained in all the experiments except for K = 7, where the two algorithms reached the same accuracy, which can also be considered an acceptable result. For dataset 2, better results were obtained in all cases except for K = 5 and K = 10.
Out of 24 experiments, the DRk-M provided better results in 18 cases (75%); in the other cases, the same accuracy was computed. In order to test the efficiency of the DRk-M for large datasets, the cardinality of dataset 2 (initially composed of 284 observations) was increased by replicating the data, generating two datasets with cardinalities N = 10^3 and N = 15 × 10^3. In this part of the experiments, the entropy and the NMI were used as evaluation metrics to compare the DRk-M with the k-modes. These experiments were conducted for various K (3 → 15), and the results are reported in Figs. 5 and 6:

• For N = 10^3, in almost 46% of the cases, the DRk-M provided higher entropy than the k-modes.
• For N = 15 × 10^3, the DRk-M provided better clustering results, with higher NMI values than the k-modes, in 52% of the cases.

In order to statistically validate the obtained results, additional experiments were conducted on the same datasets using the silhouette evaluation metric. The silhouette is a method of interpretation and validation of consistency within clusters of data; it indicates how well each object has been classified by measuring how similar an object is to its own cluster (cohesion) compared to the other clusters (separation). In Table 13, the average accuracy computed over 50 runs of the DRk-M and the k-modes is reported for the two datasets and various numbers of clusters K (5 → 10). The standard deviation was also computed for each set of 50 runs to quantify the confidence in the average accuracy. A higher value of the silhouette indicates more compact and separated clusters; according to the results in Table 13, the clusters generated by the DRk-M are, in all the experiments, more compact and isolated than those generated by the k-modes.

In Fig. 7, the silhouette score is given on the y axis; each red point corresponds to the silhouette score computed for a given clustering. The k-modes and the DRk-M were executed 50 times using the Twitter dataset (N = 10^3 and K = 5 → 10). The mean silhouette SC was also computed as SC = max_k s̃(k), where s̃(k) is the mean silhouette over the dataset for k clusters. The comparison between the resulting clusters focuses on identifying the highest average silhouette score over the 50 runs. For this purpose, a box plot representation was used: box plots permit studying the distributional characteristics of a group of scores as well as their level, and the mean silhouette of each set of experiments corresponds to the (+) sign. According to the results, the DRk-M provided a higher silhouette average than the k-modes, closer to one, which indicates that its performance in producing compact and isolated clusters is higher than that of the k-modes.

In this section, the Global Terrorism Database was used to assess the scalability of the DRk-M in terms of execution time and accuracy. The tests were conducted for several dataset cardinalities N (500 → 25 × 10^3) and two numbers of clusters, K = 8 and K = 10. The execution time was computed and reported in Fig. 8.
According to the results given in Fig. 8, the DRk-M required higher computational time than the k-modes for large datasets, while for small datasets the two algorithms had almost the same running performance. This overhead is due to the time required to compute all the candidate modes of each cluster in the DRk-M, which implies scanning the whole dataset multiple times for each attribute. The run time could be improved by using more powerful machines and resources, in terms of memory or CPU. On the other hand, to statistically validate the obtained results, the clustering outcomes were evaluated using the average accuracy over 50 runs of the two algorithms. Two large datasets (N = 2 × 10^4 and N = 25 × 10^3) were considered for this purpose, and the experiments were conducted for various numbers of clusters K (3 → 8). Each algorithm was executed 50 times with various initial centroids in order to deal with stability issues; the average accuracy, STD and silhouette were then computed over these 50 runs, and the obtained results are reported in Table 14.

Table 14 provides the average accuracy computed over 50 runs of the DRk-M and the k-modes for several numbers of clusters. The accuracy reports how well the observations are arranged in their corresponding clusters; according to the computed values, the DRk-M provided better results than the k-modes in all cases, which makes it more effective and efficient. Besides, in order to statistically characterize the accuracy values, the standard deviation (STD) is used to evaluate the overall spread of the 50 values calculated for each case: a lower STD indicates values closer to the mean and thus better results. According to Table 14, the DRk-M provides better results in almost all cases, except for K = 5 and K = 7, where the values computed for the k-modes were better. The silhouette was also used to characterize the compactness and density of the generated clusters by measuring the distances between the observations arranged in them; a value closer to 1 indicates more compact and dense clusters. Once again, based on the results reported in Table 14, the DRk-M provided more accurate results than the standard k-modes.

Categorical clustering has gained great interest since the development of the k-modes algorithm. This algorithm has a major limitation related to the update of the modes in the last step of the process. In this paper, the RST was used as an uncertainty-based model to identify the most accurate modes in a list of candidates when performing categorical clustering. This avoids the random selection of the modes previously used in all state-of-the-art k-modes-based methods. The DRk-M is based on computing the density of each candidate mode, which characterizes the number of observations that are as close to it as possible; modes with high density have a higher probability of being selected as centroids of their cluster. In the experiments, multiple datasets with various configurations were used to assess the efficiency of the DRk-M, and it was experimentally demonstrated that the DRk-M provides promising results.
However, one main limitation of the DRk-M is its computational time, which is considerable compared with the k-modes, since more arithmetic computations are necessary to build the list of all the candidate modes of each cluster. Since the DRk-M is more flexible and less exclusive than the k-modes in the selection of the centroids of each cluster, the algorithm provides more efficient results and the accuracy is boosted.

References

1. Cost-sensitive dual-bidirectional linear discriminant analysis
2. Efficient agglomerative hierarchical clustering
3. Hierarchical clustering multi-task learning for joint human action grouping and recognition
4. Extensions to the k-means algorithm for clustering large data sets with categorical values
5. The K-means-type algorithms versus imbalanced data distributions
6. An initialization method for the k-Means algorithm using neighborhood model
7. The k-modes type clustering plus between-cluster information for categorical data
8. An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data
9. A novel attribute weighting algorithm for clustering high-dimensional categorical data
10. A novel fuzzy clustering algorithm with between-cluster information for categorical data
11. A dissimilarity measure for the k-Modes clustering algorithm
12. A modified Fuzzy k-Partition based on indiscernibility relation for categorical data clustering
13. A weighting k-modes algorithm for subspace clustering of categorical data
14. A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach
15. Clustering categorical data using the k-means algorithm and the attribute's relative frequency
16. A computational cost-effective clustering algorithm in multidimensional space using the Manhattan metric: application to the Global Terrorism Database
17. A fast density and grid based clustering method for data with arbitrary shapes and noise
18. Distance and density based clustering algorithm using Gaussian kernel
19. GRIDEN: an effective grid-based and density-based spatial clustering algorithm to support parallel computing
20. Model-based clustering
21. A survey of distance/similarity measures for categorical data
22. Determining the number of clusters using information entropy for mixed data
23. An ensemble clusterer of multiple fuzzy k-means clusterings to recognize arbitrarily shaped clusters
24. Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data
25. Rough sets
26. Hierarchical clustering algorithm for categorical data using a probabilistic rough set model
27. Image change detection based on an improved rough fuzzy c-means clustering algorithm
28. Rough-fuzzy clustering and multiresolution image analysis for text-graphics segmentation
29. Segmentation of brain MR images using rough set based intuitionistic fuzzy clustering
30. Rough sets in economy and finance. Transactions on Rough Sets XVII
31. Data mining and linked open data - new perspectives for data analysis in environmental research
32. Comparing unsupervised probabilistic machine learning methods for market basket analysis
33. Mapping the DNA of urban neighborhoods: clustering longitudinal sequences of neighborhood socioeconomic change
34. Scalable k-NN based text clustering
35. k-PbC: an improved cluster center initialization for categorical data clustering
36. Cluster center initialization algorithm for K-modes clustering
37. Initialization of k-modes clustering using outlier detection techniques
38. Improving k-Modes algorithm considering frequencies of attribute values in mode
39. Rough set approach for clustering categorical data using information-theoretic dependency measure
40. A rough set approach for selecting clustering attribute
41. Ensemble based rough fuzzy clustering for categorical data
42. Detecting outliers in categorical data through rough clustering
43. SDR: an algorithm for clustering categorical data using rough set theory
44. Incremental fuzzy cluster ensemble learning based on rough set theory