Expert Systems With Applications 105 (2018) 23–33

Robust Semi-Supervised Growing Self-Organizing Map

Ali Mehrizi, Hadi Sadoghi Yazdi*, Amir Hossein Taherinia
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
* Corresponding author. E-mail addresses: mehrizi@um.ac.ir (A. Mehrizi), h-sadoghi@um.ac.ir (H. Sadoghi Yazdi), taherinia@um.ac.ir (A.H. Taherinia).

Article history: Received 14 October 2017; Revised 22 March 2018; Accepted 23 March 2018; Available online 26 March 2018.

Keywords: Semi-supervised learning; Online learning; Dynamic self-organization network; Adaptive learning; Half quadratic.

Abstract

The Semi-Supervised Growing Self-Organizing Map (SSGSOM) is one of the best methods for online classification with partially labeled data. Many parameters affect the performance of this method; the structure of the GSOM network, the activation degree, and the learning approach are the most important factors in SSGSOM. In this paper, a comprehensive robust mathematical formulation of the problem is proposed and half-quadratic (HQ) optimization is then used to solve it. Furthermore, an adaptive method is proposed to adjust the activation degree optimally and improve the performance of SSGSOM. Results are reported on a variety of synthetic and UCI datasets, including noisy conditions, and show the superiority and robustness of the proposed method compared with state-of-the-art approaches.

© 2018 Elsevier Ltd. All rights reserved. https://doi.org/10.1016/j.eswa.2018.03.046

1. Introduction

Extracting knowledge from unlabeled data, also known as unsupervised learning, improves the performance of systems. However, as the volume of imported data increases, unsupervised methods require a large amount of memory and computation time and therefore become unusable or very inefficient. If a small portion of the data carries side information such as labels, it can increase the accuracy and speed of learning. The use of labeled data together with clustering methods is called semi-supervised learning. In other words, learning algorithms that combine unsupervised and supervised learning are called semi-supervised learning algorithms (Zhu & Goldberg, 2009). Semi-supervised learning has been applied to many problems, such as those reported in (Cong, Liu, Yuan, & Luo, 2013; Lu, Wang, Xue, & Pan, 2014; Chen, Li, Su, Cao, & Ji, 2014).

Considering the data importing method, learning algorithms are divided into two categories: online learning and batch learning. In batch learning, all the training data are available at the beginning of the process. In contrast, in online learning, none of the data exists at the beginning and data are logged gradually. Although online learning adapts better to a varying environment, it lacks the accuracy of batch learning; one of the major reasons for the decreased accuracy of online algorithms is their dependency on the sequence of data entry. In recent years, attention to online learning has been growing, because as the volume of data increases, learning methods with online capability and rapid response become inevitable. Accordingly, many researchers have focused on eliminating the defects of online learning methods to achieve better accuracy in a reasonable time.
Some promising online clustering methods are online k-means (Alpaydin, 2004), the neural gas network (Martinetz, Berkovich, & Schulten, 1993), and the self-organizing network (Kohonen, 1990). Among these methods, self-organizing networks and neural gas networks have been the most useful because they are inherently online and have properties such as topology preservation (Alahakoon, Halgamuge, & Srinivasan, 2000). Since the majority of studies on semi-supervised learning have addressed offline methods, in the following we first review the learning methods and then discuss studies on online semi-supervised learning.

Farid et al. (2013) presented a semi-supervised algorithm based on ensemble learning, an offline approach that determines the label of a datum via voting among several classifiers. They used decision trees for data classification. An online algorithm was then obtained by modifying the structure of the decision tree: in the learning phase, a random weight is allocated to every datum and the decision tree is built for them; afterward, the allocated weights are modified based on the training data.

Leite and colleagues reported that the features and the underlying distribution of data change over time in non-stationary environments (Leite, Costa, & Gomide, 2010). In machine learning terminology, this change is called "concept drift". For example, in classification a concept can be either a class of data or the boundary of the data, and the drift is a data boundary that changes over time. Drifts can be gradual or sudden, contracting or expanding, and definite/random or periodic changes of concept. The main requirement of online learning is to identify and deal with the drift. They proposed a semi-supervised method based on a granular neural network to deal with data streams. This model has five layers: the first and fifth layers of the neural network are the input and output layers, respectively; the second layer performs the data-clustering task; the third layer performs the merge operation; and the fourth layer makes the decisions about the data class.

Macario and de Carvalho (2012) proposed a semi-supervised FCM algorithm that adds a supervised ability to the original FCM algorithm. The main idea of their paper is that data of the same class have similar degrees of membership. Therefore, if data belong to a certain class, the differences between their cluster membership degrees are low and the data have stable degrees of membership in all clusters. They also proposed an adaptive method to prevent uncontrolled growth of a cluster.

For the first time, an online constrained clustering algorithm was presented by Halkidi, Spiliopoulou, and Pavlou (2012) in order to find clusters that are centralized and have constraints entering as a stream. The Halkidi algorithm was based on the MPCK-Means (metric pairwise constrained K-means) clustering algorithm. MPCK-Means is an offline algorithm and cannot be used for online data; therefore, Halkidi considered the data as chunks that enter the algorithm over time, and MPCK-Means clustering is performed offline on each small chunk of data. This strategy can also be used to reduce the computational memory.

Cong et al. (2013) introduced a self-supervised online metric learning method with a low-rank constraint for scene categorization.
Their work used a two-sided graph to evaluate the similarity between data and metric learning. The labeled and unlabeled data with high scores were used to update the metric learning. The algorithm proposed by Cong et al. is an online algorithm.

Li and colleagues introduced a semi-supervised learning algorithm based on extreme learning combined with cooperative learning (Li, Zhang, Xu, Luo, & Li, 2013). The main idea of cooperative learning is to use data from one learner's testing phase as training data for other learners. Looping in the learning process is the disadvantage of cooperative learning, which leads to a decrease in learning speed and to propagation of learning errors to the other learners. In their paper, extreme learning was proposed to deal with these problems because it is very fast and can cover the shortcomings of cooperative learning (Huang, Zhu, & Siew, 2004; Huang, Li, Chen, & Siew, 2008).

Zhang and colleagues proposed a semi-supervised algorithm based on co-training (Zhang, Wen, Wang, & Jiang, 2014). This algorithm applies co-training to select the most reliable instances, according to the two criteria of high confidence and nearest neighbor, for boosting the classifier, and it also exploits the most informative instances with human annotation to improve the classification performance (Zhang et al., 2014).

Dornaika and El Traboulsi (2016) proposed a graph-based semi-supervised method and its kernelized version for generic classification. Although this method is flexible, it is not online.

Beyer and Cimiano (2012) proposed a semi-supervised method based on neural gas. Their work extends the offline Growing Neural Gas classifier to an online method that predicts labels for unlabeled examples and incorporates these labeled examples into the model on the fly (Beyer & Cimiano, 2012).

Maximo, Quiles, and Nascimento (2014) proposed a new semi-supervised growing neural gas (GNG) model in which, instead of assigning a single scalar label to each neuron, a vector containing the representativeness level of every class is associated with each neuron. To propagate the labels among the neurons, the Consensus-Based Semi-Supervised GNG employs a consensus approach (Maximo et al., 2014).

Fritzke (1994) proposed the growing self-organizing map (GSOM), which is able to find a suitable network structure and model size automatically. This researcher then proposed a supervised learning method that results from combining the above-mentioned self-organizing network with the radial basis function (RBF) approach.

Hsu and Halgamuge (2008) improved Fritzke's method and presented a semi-supervised algorithm based on GSOM. Their algorithm provides a two-layer model for online data classification with partial labels: in the first layer, GSOM is used for data clustering, and when labeled data are logged, the labels are used for classification. In this method, as in k-means and fuzzy C-means, there are repetitive operations, with the difference that in each step only one datum is randomly selected and entered into the GSOM.

Shen, Yu, Kamiya, and Hasegawa (2010) offered a three-layer architecture that represents the topological structure of the training data (with a SOM), learns node labels, and constructs the classifier.
In our previous work, we proposed an online constrained semi-supervised learning method based on GSOM (Allahyar, Yazdi, & Harati, 2015), built on a two-layer model that developed the Hsu model further (Hsu & Halgamuge, 2008).

The Self-Organizing Map (SOM) is a suitable learning method for data clustering (Kohonen, 1990). This method is based on artificial neural networks (ANNs) and its structure is formed online; accordingly, the network is suitable for online learning. The SOM was first proposed in 1990 by Kohonen (1990). The dimensions of the SOM are a parameter that strongly influences the final result of clustering (Alahakoon et al., 2000). In Kohonen's model, the user sets this parameter manually. To resolve this problem, Alahakoon presented the Growing Self-Organizing Map (GSOM) (Alahakoon et al., 2000), in which the map dimensions are expanded automatically depending on the type and structure of the data (Alahakoon et al., 2000).

Hsu and Halgamuge (2008) presented the Semi-Supervised Growing Self-Organizing Map (SSGSOM) by adding a supervised layer to GSOM. The Hsu model is an online semi-supervised model that can perform clustering and classification of data very well (Hsu & Halgamuge, 2008). Their model uses labeled data to control the expansion of nodes in the SOM, and it employs a heuristic method to determine the effective parameters in the objective function (Hsu & Halgamuge, 2008). In the present paper, we redefine the objective function and then discuss the determination of the optimum effective parameters.

Table 1 summarizes the significant semi-supervised methods. For each algorithm, the table lists its type, advantages, and unresolved problems.

The main contributions of the proposed method are:
• redefining a robust cost function for online semi-supervised GSOM through half-quadratic optimization;
• proposing an adaptive online semi-supervised GSOM algorithm.

The remainder of this paper is organized as follows. In Section 2, the primary concepts behind the proposed algorithm are explained. In Section 3, the proposed algorithm is presented. In Section 4, the results of the experiments performed on different data are shown. In the final section, the conclusion and guidelines for future work are given.

Table 1. Summary of semi-supervised algorithms.

Proposed algorithm | Advantages | Unsolved problems
Farid et al. | Semi-supervised learning | Slow execution speed because of creating the decision tree
Leite et al. | Fuzzy; resistant to different data types | Slow execution speed; does not improve clustering using the feedback of labeled data
Macario et al. | Fuzzy clustering; adaptive distance | Offline method
Shen et al. | Online; based on SOM | Not adaptive; not robust
Halkidi et al. | Data binding | Receives data in chunk format
Cong et al. | Metric learning | Receives data in chunk format; slow execution speed
Li et al. | Cooperative extreme learning | Not suitable for online data; slow execution speed because of cooperative learning
Zhang et al. | Co-training | Offline; slow (uses active learning to determine appropriate data for labeling)
Dornaika et al. | Flexible method; graph-based | Offline; many parameters to adjust
Beyer et al. | Based on neural gas; online | Not adaptive; not robust
Maximo et al. | Based on neural gas; online | Not adaptive; not robust
Allahyar et al. | Online; based on GSOM | Not adaptive; not robust
Hsu et al. | Online; based on GSOM; improves clustering using the feedback of labeled data | Determines the parameters of the GSOM heuristically; not robust
Proposed method of this paper | Online; based on GSOM; adaptive; robust | -

2. Preliminary concepts

In this section, the preliminary concepts of semi-supervised GSOM are presented. GSOM is a method that is able to reveal the structure of data (Fritzke, 1994). Hsu and colleagues added the semi-supervised ability by placing a supervised layer on top of this network (Hsu & Halgamuge, 2008). Fig. 1 shows the structure of this model.

Fig. 1. Semi-supervised GSOM structure (Hsu & Halgamuge, 2008).

In this semi-supervised model, the number of nodes in the upper layer equals the number of classes (M) and the number of nodes in the bottom layer represents the number of clusters (C). There is a full connection of weights between the two layers, denoted w_{c,m}. The goal is to train these weights so that when a datum enters the cluster layer, the nearest node is activated and the suitable class is then determined for it. The weights between the two layers are trained with the labeled data according to Eq. (1) for each data sample (Hsu & Halgamuge, 2008):

\Delta w_k^{c,m} = \beta \, (\zeta_k^m - PS_k^m) \, o_k^c    (1)

In Eq. (1), \zeta_k^m is the actual label of datum k with respect to class m and PS_k^m is the value estimated by the semi-supervised GSOM for datum k and class m; \zeta_k^m and PS_k^m are defined by Eqs. (2) and (3), respectively (Hsu & Halgamuge, 2008):

\zeta_k^m = \begin{cases} 0, & l_k \neq \hat{m}_k \ \text{or missing label} \\ 1, & l_k = \hat{m}_k \end{cases}    (2)

PS_k^m = \sum_{c=1}^{C} w_k^{c,m} \, o_k^c, \qquad m \in \{1, 2, \ldots, M\}    (3)

In Eqs. (1) and (3), o_k^c determines the similarity of the entered datum to cluster node c and is known as the activation degree. The activation degree of each node c is calculated with the Gaussian function given in Eq. (4) (Hsu & Halgamuge, 2008):

o_k^c = \exp\left(-\frac{\|x_k - v_c\|^2}{(\delta_k^c)^2}\right), \qquad c \in \{1, \ldots, C\}    (4)

where \delta_k^c is the average distance from node c to all of its neighboring nodes for the k-th sample and is calculated by Eq. (5), in which |Ne(c)| is the number of neighbors of node c (Hsu & Halgamuge, 2008):

\delta_k^c = \frac{\sum_{j \in Ne(c)} \|v_j - v_c\|}{|Ne(c)|}    (5)

More information about how the GSOM grows can be found in (Hsu & Halgamuge, 2008).
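For reference, below is a minimal NumPy sketch of the SSGSOM forward pass and supervised update of Eqs. (1)-(4). It assumes the GSOM cluster layer (node positions V and activation widths delta, obtained for instance via Eq. (5)) has already been grown; the array names and the function decomposition are ours and are not taken from the original implementation.

```python
import numpy as np

def activation_degrees(x, V, delta):
    """Eq. (4): Gaussian activation degree of sample x for every cluster node.
    V is a (C, d) array of node positions, delta a (C,) array of widths."""
    return np.exp(-np.sum((x - V) ** 2, axis=1) / delta ** 2)

def class_scores(o, W):
    """Eq. (3): estimated class values PS^m from activation degrees o (C,) and weights W (C, M)."""
    return o @ W

def supervised_update(x, label, V, delta, W, beta=0.1):
    """Eq. (1): update the cluster-to-class weights for one labeled sample."""
    o = activation_degrees(x, V, delta)             # o_k^c
    ps = class_scores(o, W)                         # PS_k^m
    zeta = np.zeros(W.shape[1]); zeta[label] = 1.0  # Eq. (2): one-hot actual label
    W += beta * np.outer(o, zeta - ps)              # Delta w_{c,m} = beta (zeta_m - PS_m) o_c
    return W
```

Classification of an unlabeled sample then amounts to taking the argmax over the scores returned by class_scores.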
3. Proposed method

Since the number of labeled data may be small, optimal operation of semi-supervised algorithms is very important. Although the method proposed by Hsu is semi-supervised and online, it has the following problems:

• The objective function is optimized only with respect to the weights of the classifier layer, whereas it also depends on the shape and location of the nodes in the clustering layer and on the activation degree.
• The growth of the GSOM is not optimized. The activation function plays an important role in how similar a datum is to each node, but the Hsu algorithm adjusts this parameter with the average distance to the neighboring nodes. This heuristic solution is local and is not optimal when the data change.
• It is unstable in the presence of noisy data.

Accordingly, to overcome these problems, the objective function is redefined and solved in the present work.

3.1. Definition of the problem

According to Section 2, the semi-supervised GSOM error for each datum k is defined as

e_k = \sum_{m=1}^{M} (l_k^m - ps_k^m)    (6)

and the cost function, based on the expectation of the error function, is defined as

g(e_k) = \exp(-\eta e_k^2)    (7)

J = \max_{w, \delta, v} \sum_{k=1}^{N} \exp(-\eta e_k^2)    (8)

The half-quadratic (HQ) technique (Rockafellar, 1970; He, Hu, Zheng, & Guo, 2010; Zhang et al., 2014) is often used to solve this kind of nonlinear optimization problem (He, Hu, Zheng, & Kong, 2011). Therefore, we derive an algorithm that solves (8) based on the half-quadratic approach. Based on the theory of convex conjugate functions, (8) can be solved using the following proposition.

Proposition 1. There exists a convex conjugate function \varphi of g(e) such that

g(e) = \sup_{p} \big( \eta e^2 p - \varphi(p) \big)    (9)

where the convex function \varphi(p) is defined as \varphi(p) = -p \log(-p) + p.

Therefore, J is rewritten as the augmented objective

\hat{J}(w, \delta, v, p) = \sum_{k=1}^{N} \big( \eta e_k^2 p_k - \varphi(p_k) \big)    (10)

According to Proposition 1, when [w, \delta, v] is fixed the following equation holds:

\max_{w,\delta,v} J(w, \delta, v) = \max_{w,\delta,v,p} \hat{J}(w, \delta, v, p)    (11)

An alternating solution is therefore used:

step 1: \max_{p} \hat{J}(w, \delta, v, p) \Rightarrow p_k = -\exp(-\eta e_k^2)

step 2: \max_{w,\delta,v} \hat{J}(w, \delta, v, p) \Rightarrow \max_{w,\delta,v} J = \max_{w,\delta,v} \sum_{k=1}^{N} \eta e_k^2 p_k

By defining q_k = -p_k = \exp(-\eta e_k^2), step 2 becomes the weighted minimization

\min_{w,\delta,v} J = \min_{w,\delta,v} \sum_{k=1}^{N} \eta e_k^2 q_k    (12)

Assuming \eta is constant and expanding (12) with Eqs. (6), (3), and (4) gives

J = \sum_{k=1}^{N} \left( \sum_{m=1}^{M} (l_k^m - ps_k^m) \right)^2 q_k = \sum_{k=1}^{N} \left( \sum_{m=1}^{M} \sum_{c=1}^{C} \left( l_k^m - w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) \right) \right)^2 q_k    (13)

Therefore,

\min_{w,\delta,v} J(w, \delta, v) = \min_{w,\delta,v} \sum_{k=1}^{N} \left( \sum_{m=1}^{M} \sum_{c=1}^{C} \left( l_k^m - w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) \right) \right)^2 q_k    (14)

Based on (14), the effective parameters are the weights of the classifier layer (w), the shape and location of the nodes in the clustering layer (v), and the width of the activation-degree function (δ). The parameter δ is clearly important because it affects w directly. Moreover, computing the optimal value of the parameter v from the objective function is practically impossible and would be very time consuming. Therefore, if the optimal value of the parameter δ can be determined, the other parameters will also be near-optimal. In conclusion, we use the approximation

J(w, \delta, v) \cong J(w, \delta)    (15)

Next, we use the gradient search method in the following form:

(w_d)_{k+1} = (w_d)_k - \mu \, \hat{\nabla}(w_d)_k    (16)

Eq. (16) is applied in two steps: one step adjusts the clustering layer and the other adjusts the classification layer. For the clustering layer,

\delta_{k+1} = \delta_k - \mu_\delta \, \hat{\nabla}\delta_k    (17)

and for the classification layer,

w_{k+1} = w_k - \mu_w \, \hat{\nabla} w_k    (18)

By differentiating the cost function with respect to each variable, it is minimized with respect to that variable using the step sizes \mu_\delta and \mu_w.
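To make the alternation concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of the two HQ steps: with the model fixed, the optimal auxiliary variable yields the sample weights q_k = exp(-η e_k²) of Eq. (12); with q fixed, a weighted cost is minimized. The toy loop estimates a location parameter under the same loss, where step 2 has a closed form, to show how samples with large errors are suppressed; in the paper, step 2 is instead carried out by the gradient updates (16)-(18).

```python
import numpy as np

def hq_weights(errors, eta=1.0):
    """HQ step 1 (Eqs. 9-12): for fixed model parameters the optimal auxiliary
    variable is p_k = -exp(-eta*e_k^2), i.e. q_k = -p_k = exp(-eta*e_k^2).
    Samples with large errors receive weights near zero, which gives robustness."""
    return np.exp(-eta * np.asarray(errors, dtype=float) ** 2)

# Toy illustration (assumed example): robust estimate of a location parameter mu.
data = np.array([0.9, 1.1, 1.0, 0.95, 8.0])   # the last sample is an outlier
mu = data.mean()                              # plain mean, badly biased by the outlier
for _ in range(20):
    q = hq_weights(data - mu)                 # step 1: fix mu, update the weights q
    mu = np.sum(q * data) / np.sum(q)         # step 2: fix q, minimize sum_k q_k e_k^2
print(round(mu, 2))                           # about 0.99: the outlier is effectively ignored
```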
3.2. Parameter adaptation

To obtain the adaptive value of δ, we use a least-mean-type update for Eq. (17):

\hat{\nabla}\delta_k = \frac{\partial J_k}{\partial \delta_k} = 2 e_k \frac{\partial e_k}{\partial \delta_k}    (19)

The required derivative follows from differentiating the objective function (14) with respect to δ:

\frac{\partial e_k}{\partial \delta_k} = \frac{\partial}{\partial \delta_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} \left( l_k^m - w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) \right) q_k \right) = - \frac{\partial}{\partial \delta_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) q_k \right)    (20)

If c is replaced with the best matching node (BMN, the winning node), (20) is rewritten as

\frac{\partial e_k}{\partial \delta_k} = - \frac{\partial}{\partial \delta_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} w_k^{BMN,m} \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) q_k \right) = - \sum_{k=1}^{N} \sum_{m=1}^{M} 2\, w_k^{BMN,m} \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) \frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^3}\, q_k    (21)

So, according to Eq. (19), δ is updated as

\delta_{k+1} = \delta_k + 4 \mu_\delta \sum_{k=1}^{N} \sum_{m=1}^{M} (l_k^m - PS_k^m)\, w_k^{BMN,m} \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) \frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^3}\, q_k    (22)

The coefficient \mu_\delta affects the convergence speed of δ. For simplicity, this learning rate can be taken as a constant positive value that is reduced over time and whenever the confidence boundary is exceeded.

To obtain the adaptive w, we proceed in the same way:

\hat{\nabla} w_k = \frac{\partial J_k}{\partial w_k} = 2 e_k \frac{\partial e_k}{\partial w_k}    (23)

where

\frac{\partial e_k}{\partial w_k} = \frac{\partial}{\partial w_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} \left( l_k^m - w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) \right) q_k \right) = - \frac{\partial}{\partial w_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) q_k \right)    (24)

If c is replaced with the best matching node, (24) is rewritten as

\frac{\partial e_k}{\partial w_k} = - \sum_{k=1}^{N} \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) q_k    (25)

Therefore, w is updated as

w_{k+1} = w_k + 2 \mu_w \sum_{k=1}^{N} \sum_{m=1}^{M} (l_k^m - ps_k^m) \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) q_k    (26)

3.3. Parameter adaptation through the back-propagation method

In addition to the approach of Section 3.2, the back-propagation method can also be used for error estimation. In the back-propagation method, the error is defined as in Eq. (27):

e_k = \sum_{m=1}^{M} (l_k^m - ps_k^m)\, w_k^m    (27)

Hence, (21) is rewritten as

\frac{\partial e_k}{\partial \delta_k} = \frac{\partial}{\partial \delta_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} \left( l_k^m - w_k^{c,m} \exp\left(-\frac{\|x_k - v_c\|^2}{\delta_c^2}\right) \right) w_k^{c,m}\, q_k \right)

and after simplification

\frac{\partial e_k}{\partial \delta_k} = - \sum_{k=1}^{N} \sum_{m=1}^{M} 2\, (w_k^{BMN,m})^2 \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) \frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^3}\, q_k \;\Rightarrow\; \nabla\delta = 4 \mu_\delta \sum_{k=1}^{N} \sum_{m=1}^{M} (w_k^{BMN,m})^2 (l_k^m - ps_k^m) \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) \frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^3}\, q_k    (28)

Also, for w,

\frac{\partial e_k}{\partial w_k} = \frac{\partial}{\partial w_k} \left( \sum_{k=1}^{N} \sum_{m=1}^{M} \sum_{c=1}^{C} (l_k^m - PS_k^m)\, w_k^m\, q_k \right)    (29)

and after simplification

\frac{\partial e_k}{\partial w_k} = -2 \sum_{k=1}^{N} \sum_{m=1}^{M} w_k^{BMN,m} \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) q_k \;\Rightarrow\; \nabla w = 4 \mu_w \sum_{k=1}^{N} \sum_{m=1}^{M} (l_k^m - ps_k^m)\, (w_k^{BMN,m})^2 \exp\left(-\frac{\|x_k - v_{BMN}\|^2}{\delta_{BMN}^2}\right) q_k    (30)

3.4. η tuning

In this section, we discuss the adaptation of the η parameter. We define a constraint on η through the minimization problem

\min_{\eta_i} \sum_{i=1}^{N} \eta_i^2 e_i^2 q_i, \qquad \text{s.t.} \ \sum_i \eta_i = p, \quad \eta_i \in [0, 1]    (31)

Forming the Lagrangian by adding the penalty term,

l = \sum_{i=1}^{N} \eta_i^2 e_i^2 q_i - \lambda \left( \sum_i \eta_i - p \right)    (32)

Then \eta_i is obtained as

\frac{\partial l}{\partial \eta_i} = 0 \Rightarrow \eta_i = \frac{\lambda}{2 e_i^2 q_i}, \qquad \sum_i \eta_i = p \Rightarrow \eta_i = \frac{p}{\sum_{j=1}^{N} \frac{e_i^2 q_i}{e_j^2 q_j}}    (33)

To avoid division by zero, a small parameter ξ is added to the denominator:

\eta_i = \frac{p}{\sum_{j=1}^{N} \frac{e_i^2 q_i}{e_j^2 q_j} + \xi}    (34)
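As a compact summary of the update rules in Sections 3.2 and 3.4, the sketch below applies the δ and w updates restricted to the best matching node, in the spirit of Eqs. (22) and (26), for a single incoming labeled sample, together with the per-sample η of Eq. (34). It is an illustrative single-sample form with our own variable names and step sizes; the paper states the rules as sums over samples.

```python
import numpy as np

def adaptive_step(x, zeta, V, delta, W, q, mu_w=0.05, mu_d=0.01):
    """One adaptive update restricted to the best matching node (BMN),
    following the form of Eqs. (22) and (26) for a single labeled sample.
    x: sample (d,), zeta: one-hot label (M,), V: node positions (C, d),
    delta: activation widths (C,), W: cluster-to-class weights (C, M), q: HQ weight."""
    dist2 = np.sum((x - V) ** 2, axis=1)
    o = np.exp(-dist2 / delta ** 2)                # activation degrees, Eq. (4)
    err = zeta - o @ W                             # per-class error l^m - ps^m
    b = int(np.argmin(dist2))                      # BMN (winning node)
    W[b] += 2 * mu_w * q * err * o[b]                                          # Eq. (26)
    delta[b] += 4 * mu_d * q * (err @ W[b]) * o[b] * dist2[b] / delta[b] ** 3  # Eq. (22)
    return W, delta

def eta_weights(e, q, p=1.0, xi=1e-8):
    """Per-sample eta from Eq. (34); xi guards against division by zero
    (it is also added inside the sum here, an extra safeguard of this sketch)."""
    s = np.asarray(e, dtype=float) ** 2 * np.asarray(q, dtype=float)
    return p / (s * np.sum(1.0 / (s + xi)) + xi)
```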
3.5. Structure of the proposed method

As shown in Fig. 2, the proposed algorithm receives labeled and unlabeled data continuously. Depending on the type of the received datum, one of two events happens:

• An unlabeled datum is received: GSOM performs clustering and updates the weights of the nodes in the clustering layer. In this study, we assume that most of the data are unlabeled.
• A datum with a known label is received: First, the winner node in the clustering layer is identified and the label of the datum is then estimated by the system based on this node. The system error is calculated from the difference between the actual label and the label estimated by the system. This error is applied to update the weights between the clustering and classification layers and to determine the width of the activation degree of each node.

Fig. 2. The proposed model for creating the adaptive semi-supervised GSOM.

4. Performance evaluation and tests

In this section, the efficiency of the proposed algorithm is evaluated. The datasets used for the evaluation include synthetic datasets and datasets from the UCI repository (http://archive.ics.uci.edu/ml/), which are detailed in the next sections. Three online algorithms were selected from among the state-of-the-art methods mentioned in Section 1. Table 2 shows the titles and abbreviations of the two proposed methods and the state-of-the-art methods.

Fig. 3. Synthetic evaluated datasets: (a) four cluster, (b) two moon, (c) spiral, (d) two ring.

Table 2. Abbreviations and titles of the proposed and state-of-the-art methods.

Abbreviated name | Title of method
ASSGSOM | Adaptive Semi-Supervised GSOM
BASSGSOM | Back-propagation Adaptive Semi-Supervised GSOM
SSGSOM | Hsu's Semi-Supervised GSOM (Hsu & Halgamuge, 2008)
CS2GS | Constrained Semi-Supervised GSOM (Allahyar et al., 2015)
HSS | Halkidi's semi-supervised stream algorithm (Halkidi et al., 2012)

The measures considered for assessing the proposed method and the methods from the literature are Accuracy and F-measure. Accuracy is calculated based on (35), in which TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively:

Acc_q = \frac{TP + TN}{TP + TN + FP + FN}    (35)

Precision and Recall are defined as

PR_q = \frac{TP}{TP + FP}    (36)

RE_q = \frac{TP}{TP + FN}    (37)

and the F-measure, which is based on precision and recall, is defined as

FM = \frac{2 \times PR_q \times RE_q}{PR_q + RE_q}    (38)
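The following small helper computes these four measures from binary confusion-matrix counts (a sketch with our own naming; Eq. (35) is implemented with the standard accuracy definition (TP + TN) / (TP + TN + FP + FN)).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure as in Eqs. (35)-(38),
    computed from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (35)
    pr = tp / (tp + fp) if (tp + fp) else 0.0             # Eq. (36)
    re = tp / (tp + fn) if (tp + fn) else 0.0             # Eq. (37)
    fm = 2 * pr * re / (pr + re) if (pr + re) else 0.0    # Eq. (38)
    return acc, pr, re, fm

# example: 90 TP, 80 TN, 10 FP, 20 FN  ->  accuracy 0.85, F-measure about 0.857
print(classification_metrics(90, 80, 10, 20))
```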
For all algorithms that are based on GSOM (i.e., SSGSOM, CS2GS, and both proposed methods), the scale factor (SF) is taken as a constant value and the growing threshold (GT) of the GSOM is calculated from it as in (39), where d is the data dimension:

GT = -d \times \log(SF)    (39)

Furthermore, the regularization parameters of the GSOM layer and the classification layer strongly influence the results, so in all of the experiments constant values of 0.4 and 0.6 were used for the regularization parameters of the GSOM and classification layers, respectively. The initial value of w (the weights of the classification layer), v (the weights of the GSOM nodes), and the other settings related to GSOM are selected randomly and are the same for all algorithms. The order of data entry, which affects the performance of the algorithms, is also the same for all of them.

The test scenario is as follows. First, each dataset is divided into train and test subsets using 10-fold cross-validation. Then, 5% of the training data is labeled randomly and all the methods are trained using them. Next, the performance of each algorithm is evaluated on the test data. This process is replicated 5 times and the average result is reported. Subsequently, the proportion of labeled training data is increased by 5% and the test scenario is repeated. This procedure continues until the proportion of labeled data reaches 30%.

4.1. The evaluation results on synthetic datasets

In this section, the performances of the proposed methods and the state-of-the-art methods are shown on synthetic datasets. The structures of the synthetic datasets are shown in Fig. 3. All these datasets have two classes, and the numbers of data points in the two moon, four cluster, two ring, and spiral sets are 258, 160, 237, and 3000, respectively.

Figs. 4-7 present the results obtained on the synthetic datasets in terms of the F-measure.

Fig. 4. Result on the Two Moon dataset. Fig. 5. Result on the Four Cluster dataset. Fig. 6. Result on the Two Ring dataset. Fig. 7. Result on the Spiral dataset.

As shown in Figs. 4-7, the ASSGSOM method is superior to the other methods, and the results closest to it are those obtained by the SSGSOM method. Moreover, a statistical t-test was conducted to determine whether the difference between these methods is statistically significant. Table 3 presents the t-test comparison between ASSGSOM and SSGSOM; in this test, 15% of the data is labeled and the number of iterations is 35.

Table 3. The t-test results between ASSGSOM and SSGSOM.

Dataset | Method | Accuracy (%) | F-Measure (%) | Accuracy t-test | F-Measure t-test
Two Moon | ASSGSOM | 96 | 95 | 1 | 1
Two Moon | SSGSOM | 88 | 87 | |
Four Cluster | ASSGSOM | 94 | 94 | 1 | 1
Four Cluster | SSGSOM | 86 | 85 | |
Two Ring | ASSGSOM | 98 | 98 | 1 | 1
Two Ring | SSGSOM | 93 | 92 | |
Spiral | ASSGSOM | 88 | 87 | 1 | 1
Spiral | SSGSOM | 86 | 85 | |

In Table 3, a t-test value of 1 means that the method with the higher average is statistically superior. For example, on the Two Ring dataset the Accuracy t-test value for ASSGSOM is 1: the accuracy of ASSGSOM is statistically superior to that of SSGSOM because the average accuracy of the ASSGSOM algorithm (98%) is higher than that of SSGSOM (93%), so ASSGSOM is the superior algorithm.
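The paper does not state which t-test variant or significance level is used, so the sketch below is only one plausible reading: a two-sample t-test over the per-iteration scores at the 5% level, with the 0/1 indicator reported as in Table 3.

```python
import numpy as np
from scipy import stats

def ttest_indicator(scores_a, scores_b, alpha=0.05):
    """Return 1 if the difference between two sets of per-run scores is significant
    at level alpha, else 0 (two-sample t-test; the variant and alpha are assumptions)."""
    _, p_value = stats.ttest_ind(scores_a, scores_b)
    return int(p_value < alpha)

# hypothetical per-iteration accuracies (35 runs) of ASSGSOM vs. SSGSOM on one dataset
rng = np.random.default_rng(0)
assgsom = rng.normal(0.98, 0.01, 35)
ssgsom = rng.normal(0.93, 0.01, 35)
print(ttest_indicator(assgsom, ssgsom))   # -> 1 when the gap is significant
```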
4.2. The evaluation results on UCI datasets

In this section, the performance of the proposed methods on UCI datasets is evaluated. Several datasets from the UCI repository (Blake & Merz, 1998) were selected to evaluate the proposed methods; Table 4 shows the details of each dataset.

Table 4. Details of the UCI datasets used.

Dataset name | Number of classes | Number of features | Number of data
Soybean | 4 | 35 | 47
Iris | 3 | 4 | 150
Wine | 3 | 13 | 178
Sonar | 2 | 60 | 208
E.coli | 6 | 6 | 332
Ionosphere | 2 | 34 | 351
WDBC | 2 | 30 | 569
Scale | 3 | 4 | 625
Vehicle | 4 | 18 | 846
Vowel | 11 | 10 | 990

In Figs. 8-13, results on a subset of the UCI datasets are shown, and detailed results are reported in Table 5.

Fig. 8. Result on the Spiral dataset. Fig. 9. Result on the Iris dataset. Fig. 10. Result on the Wine dataset. Fig. 11. Result on the E.coli dataset. Fig. 12. Result on the Vehicle dataset. Fig. 13. Result on the Vowel dataset.

The results in Figs. 8-13 indicate the relative superiority of ASSGSOM over the state-of-the-art methods. As can be seen, when the number of labeled data increases, the proposed method fluctuates less and becomes more stable.

The results on the different UCI datasets are summarized in Table 5. The evaluation is done under the condition that the labeled data amount to 30% of the total data in each dataset. The results in Table 5 show that when the number of attributes is high (as in Soybean and Sonar), BASSGSOM is better. Except for the results on the Iris dataset, ASSGSOM and SSGSOM are superior to the other methods; the results of these two methods are very close to each other, but ASSGSOM is superior overall.

Table 5. The results of the proposed methods and the state-of-the-art methods on UCI datasets.

Dataset | Measure | ASSGSOM (%) | BASSGSOM (%) | SSGSOM (%) | CS2GS (%) | HSS (%)
Soybean | F-Measure | 94 | 96 | 93 | 85 | 82
Soybean | Accuracy | 87 | 89 | 85 | 70 | 80
Iris | F-Measure | 94 | 90 | 97 | 90 | 90
Iris | Accuracy | 94 | 90 | 97 | 90 | 89
Wine | F-Measure | 70 | 63 | 68 | 70 | 67
Wine | Accuracy | 71 | 63 | 70 | 71 | 67
Sonar | F-Measure | 60 | 67 | 57 | 57 | 51
Sonar | Accuracy | 61 | 67 | 58 | 58 | 51
E.coli | F-Measure | 83 | 73 | 81 | 70 | 64
E.coli | Accuracy | 84 | 75 | 82 | 72 | 67
Ionosphere | F-Measure | 90 | 85 | 88 | 86 | 80
Ionosphere | Accuracy | 91 | 87 | 89 | 87 | 81
WDBC | F-Measure | 90 | 87 | 88 | 87 | 80
WDBC | Accuracy | 90 | 87 | 88 | 87 | 80
Scale | F-Measure | 75 | 70 | 71 | 68 | 55
Scale | Accuracy | 84 | 76 | 84 | 69 | 56
Vehicle | F-Measure | 51 | 50 | 52 | 45 | 42
Vehicle | Accuracy | 52 | 50 | 51 | 45 | 42
Vowel | F-Measure | 88 | 71 | 88 | 64 | 60
Vowel | Accuracy | 88 | 72 | 87 | 65 | 61

4.3. The effect of noise

In this section, the effect of noise on the proposed methods is evaluated. Fig. 14 presents the results for the datasets contaminated with noise. Eq. (40) is used for calculating the signal-to-noise ratio (SNR); in this formula, D is the domain (amplitude range) of the signal:

SNR_{dB} = 20 \log_{10} \left( \frac{D_{signal}}{D_{noise}} \right)    (40)

Fig. 14. Effect of the noise on the four selected datasets.

4.4. Resistance analysis

Resistance analysis shows the stability of an algorithm under different conditions (Geng, Zhan, & Zhou, 2005). Eq. (41) shows how the resistance is calculated over the Q compared methods, and Fig. 15 exhibits the resistance analysis for the evaluated datasets:

RA_q = \frac{Acc_q}{\max_{j \in \{1, 2, \ldots, Q\}} Acc_j}    (41)

Fig. 15. Resistance analysis of the accuracy on all evaluated datasets.
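As an illustration of how such an experiment can be set up, the sketch below adds noise at a target SNR following Eq. (40) and computes the resistance score of Eq. (41). The zero-mean Gaussian noise model scaled from the signal's amplitude range is an assumption of this sketch; the paper only specifies the SNR definition.

```python
import numpy as np

def add_noise(X, snr_db, rng=np.random.default_rng()):
    """Contaminate data so that 20*log10(D_signal / D_noise) = snr_db (Eq. 40),
    where D is the per-feature amplitude range. Gaussian noise is an assumption."""
    d_signal = X.max(axis=0) - X.min(axis=0)                 # domain of the signal
    d_noise = d_signal / (10.0 ** (snr_db / 20.0))           # required noise domain
    noise = rng.standard_normal(X.shape) * (d_noise / 6.0)   # ~99.7% of draws stay inside the range
    return X + noise

def resistance(accuracies):
    """Eq. (41): RA_q = Acc_q / max_j Acc_j over the Q compared methods."""
    accuracies = np.asarray(accuracies, dtype=float)
    return accuracies / accuracies.max()

print(resistance([0.98, 0.93, 0.88]))   # -> [1.0, 0.949, 0.898]
```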
5. Conclusion and future works

In this paper, a comprehensive robust mathematical formulation of the semi-supervised GSOM was proposed, and half-quadratic (HQ) optimization was used to solve it. Furthermore, an adaptive method was proposed to adjust the activation degree optimally in order to achieve a better performance of the semi-supervised GSOM. Evaluation in noisy conditions and on different datasets suggests that the proposed method is superior to and more stable than the state-of-the-art methods.

In the future, we will add a regularization term to the cost function to avoid over-fitting on complex datasets. We also plan to develop a robust active version of SSGSOM; using active learning, we can determine which data are most important for increasing the accuracy, so a higher degree of accuracy can be achieved with a smaller number of labeled data.

References

Alahakoon, D., Halgamuge, S. K., & Srinivasan, B. (2000). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), 601-614.
Allahyar, A., Yazdi, H. S., & Harati, A. (2015). Constrained semi-supervised growing self-organizing map. Neurocomputing, 147, 456-471.
Alpaydin, E. (2004). Introduction to machine learning. MIT Press.
Beyer, O., & Cimiano, P. (2012). Online semi-supervised growing neural gas. International Journal of Neural Systems, 22(05), 1250023.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: Department of Information and Computer Science, University of California. http://www.ics.uci.edu/~mlearn/MLRepository.html
Chen, S., Li, S., Su, S., Cao, D., & Ji, R. (2014). Online semi-supervised compressive coding for robust visual tracking. Journal of Visual Communication and Image Representation, 25(5), 793-804.
Cong, Y., Liu, J., Yuan, J., & Luo, J. (2013). Self-supervised online metric learning with low rank constraint for scene categorization. IEEE Transactions on Image Processing, 22(8), 3179-3191.
Dornaika, F., & El Traboulsi, Y. (2016). Learning flexible graph-based semi-supervised embedding. IEEE Transactions on Cybernetics, 46(1), 206-218.
Farid, D. M., Zhang, L., Hossain, M. A., Rahman, C. M., Strachan, R., Sexton, G., et al. (2013). An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications, 40(15), 5895-5906.
Fritzke, B. (1994). Growing cell structures: A self-organizing network for unsupervised and supervised learning. Neural Networks, 7(9), 1441-1460.
Geng, X., Zhan, D.-C., & Zhou, Z.-H. (2005). Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(6), 1098-1107.
Halkidi, M., Spiliopoulou, M., & Pavlou, A. (2012). A semi-supervised incremental clustering algorithm for streaming data. In Advances in knowledge discovery and data mining (pp. 578-590). Springer.
He, R., Hu, B.-G., Zheng, W.-S., & Guo, Y. (2010). Two-stage sparse representation for robust recognition on large-scale database. In AAAI: Vol. 10 (p. 1).
He, R., Hu, B.-G., Zheng, W.-S., & Kong, X.-W. (2011). Robust principal component analysis based on maximum correntropy criterion. IEEE Transactions on Image Processing, 20(6), 1485-1494.
Hsu, A., & Halgamuge, S. K. (2008). Class structure visualization with semi-supervised growing self-organizing maps. Neurocomputing, 71(16), 3124-3130.
Huang, G.-B., Li, M.-B., Chen, L., & Siew, C.-K. (2008). Incremental extreme learning machine with fully complex hidden nodes. Neurocomputing, 71(4), 576-583.
Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2004). Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the IEEE international joint conference on neural networks.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
Leite, D., Costa, P., & Gomide, F. (2010). Evolving granular neural network for semi-supervised data stream classification. In Proceedings of the international joint conference on neural networks (IJCNN).
Li, K., Zhang, J., Xu, H., Luo, S., & Li, H. (2013). A semi-supervised extreme learning machine method based on co-training. Journal of Computational Information Systems, 9(1), 207-214.
Lu, K., Wang, Q., Xue, J., & Pan, W. (2014). 3D model retrieval and classification by semi-supervised learning with content-based similarity. Information Sciences, 281, 703-713.
Macario, V., & de Carvalho, F. d. A. (2012). An adaptive semi-supervised fuzzy clustering algorithm based on objective function optimization. In Proceedings of the IEEE international conference on fuzzy systems (FUZZ-IEEE).
Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). 'Neural-gas' network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4), 558-569.
Maximo, V. R., Quiles, M. G., & Nascimento, M. C. (2014). A consensus-based semi-supervised growing neural gas. In Proceedings of the international joint conference on neural networks (IJCNN).
Rockafellar, R. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Shen, F., Yu, H., Kamiya, Y., & Hasegawa, O. (2010). An online incremental semi-supervised learning method. Journal of Advanced Computational Intelligence and Intelligent Informatics, 14(6), 593-605.
Zhang, Y., Wen, J., Wang, X., & Jiang, Z. (2014). Semi-supervised learning combining co-training with active learning. Expert Systems with Applications, 41(5), 2372-2378.
Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-130.