key: cord-0045470-6j92piho
authors: Gómez-Flores, Wilfrido; Sossa-Azuela, Juan Humberto
title: Towards Dendrite Spherical Neurons for Pattern Classification
date: 2020-04-29
journal: Pattern Recognition
DOI: 10.1007/978-3-030-49076-8_2
sha: 275933f9d7db97d865ea03df4a810f3b3f91202f
doc_id: 45470
cord_uid: 6j92piho

This paper introduces the Dendrite Spherical Neuron (DSN) as an alternative to the Dendrite Ellipsoidal Neuron (DEN), in which hyperspheres group the patterns from different classes instead of hyperellipses. The reasoning behind DSN is to simplify the DEN architecture: in DEN, each dendrite is parameterized by a centroid and a covariance matrix, whereas in DSN the covariance matrix is replaced by a radius. This modification avoids singular covariance matrices, which are problematic because DEN must measure the Mahalanobis distance to classify patterns. The DSN training consists of determining the centroids of the dendrites with the k-means algorithm, calculating the radius of each dendrite as the mean distance to its two nearest centroids, and finally learning the weights of a softmax function at the output of the neuron with Stochastic Gradient Descent. In addition, Simulated Annealing automatically determines the number of dendrites that maximizes the classification accuracy. The DSN is applied to synthetic and real-world datasets. The experimental results reveal that DSN is competitive with Multilayer Perceptron (MLP) networks while using less complex architectures. Also, DSN tends to outperform the Dendrite Morphological Neuron (DMN), which uses hyperboxes. These findings suggest that the DSN is a potential alternative to MLP and DMN for pattern classification tasks.

Artificial Neural Networks (ANN) are mathematical models inspired by the biological neurons in the nervous system of animals, which can be described as mapping an input space to an output space [7]. The Multilayer Perceptron (MLP) is probably the most common ANN used in practice for pattern classification tasks. MLP training requires adjusting the synaptic weights of each neuron by minimizing a loss function (e.g., cross-entropy), for which the backpropagation algorithm is often used. The inner product between the neuron inputs and the synaptic weights produces a linear combination that is modified by a nonlinear activation function (e.g., the sigmoid function). Thus, the MLP divides the input space with a hypersurface, which is built by combining the responses of several neurons distributed in one or more hidden layers. When the classes are not linearly separable, the MLP may require a complex architecture to separate the input space accurately.

The Dendrite Morphological Neuron (DMN) is an alternative technique that reduces the complexity of classification models, since nonlinear classification problems can be solved with a single neuron. The morphological processing involves minimum and maximum operations, which can generate complex nonlinear decision boundaries [8]. A typical DMN has dendrites defined as hyperboxes in R^D, where D is the dimensionality of the input space. A set of hyperboxes models each class, where the minimum and maximum operations determine whether an input pattern lies inside a hyperbox; the input pattern is then assigned to the class of the most active dendrite.
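As a point of reference for this description, the following is a minimal NumPy sketch of a hyperbox dendrite response and the resulting class assignment. The function names and the exact min/max formulation are illustrative assumptions; they are consistent with the behavior described above but are not necessarily the exact expressions used in [8, 9].

    import numpy as np

    def hyperbox_response(x, w_min, w_max):
        # Morphological dendrite response for the hyperbox [w_min, w_max]:
        # positive when x lies inside the box, zero on its boundary,
        # negative outside (illustrative min/max formulation).
        return np.min(np.minimum(x - w_min, w_max - x))

    def dmn_classify(x, boxes, box_classes):
        # Assign x to the class of the most active dendrite (hyperbox).
        responses = [hyperbox_response(x, lo, hi) for lo, hi in boxes]
        return box_classes[int(np.argmax(responses))]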
The DMN training consists of distributing the hyperboxes over the input space such that every class pattern is covered accurately; heuristic methods [8], evolutionary computation [3], and stochastic gradient descent (SGD) [9] have been used for this purpose. Because DMN uses hyperboxes, the produced decision boundaries are complex piecewise linear functions. In order to obtain smoother decision boundaries, it is feasible to replace the hyperboxes with other geometrical shapes. In this context, Arce et al. [2] proposed a neuronal model called Dendrite Ellipsoidal Neuron (DEN), where an input pattern is assigned to the class of the dendrite (i.e., hyperellipse) with the minimum Mahalanobis distance. A hyperellipse is defined by two parameters: a centroid and a covariance matrix. In DEN, the centroid positions within the input space are defined by the k-means algorithm, in which k is the number of dendrites within a class. Next, to obtain rotated hyperellipses, the covariance matrix of each cluster is calculated. Note that a class is modeled by k dendrites, where each dendrite clusters only a fraction of the samples from the class. Therefore, as the value of k increases, the number of samples per dendrite decreases, so there could be variables with zero variance, producing a singular covariance matrix. Consequently, the Mahalanobis distance cannot be computed for that dendrite, since its covariance matrix must be inverted. To overcome this drawback, we propose a simplification of the DEN model that uses spheres instead of ellipses, yielding a new neuronal model called Dendrite Spherical Neuron (DSN), in which the full covariance matrix of a dendrite is replaced by a radius that depends on the closeness among centroids. Moreover, the computation of the DSN response is more straightforward than that of DEN because covariances and matrix inversions are no longer necessary.

Figure 1 shows the neural architecture of a DSN, in which each class is represented by a cluster of dendrites, that is, a set of hyperspheres in R^D. The DSN output is the linear combination of the dendrite cluster responses d_j, for j = 1, ..., C classes, and the softmax function gives the probability that the input pattern x = [x_1, ..., x_D]^T belongs to the jth class. Thus, the assigned class is given by the maximum probability rule [6]:

\mathrm{class}(\mathbf{x}) = \arg\max_{j=1,\dots,C} \{ z_j \},    (1)

where z_j is the response of the jth output node, defined as

z_j = \sigma\left( \sum_{k=1}^{C} w_{k,j}\, d_k(\mathbf{x}) + w_{0,j} \right),    (2)

where \sigma(\cdot) is the softmax function, w_{k,j} is a weight value connecting the kth cluster to the jth output node, w_{0,j} is the bias, and d_j is the output of the jth dendrite cluster:

d_j(\mathbf{x}) = \max_{i} \{ h_{i,j}(\mathbf{x}) \},    (3)

where h_{i,j} is the output of the ith dendrite for the jth class:

h_{i,j}(\mathbf{x}) = r_{i,j} - \lVert \mathbf{x} - \mathbf{c}_{i,j} \rVert,    (4)

where \lVert\cdot\rVert is the Euclidean norm, \mathbf{c}_{i,j} \in \mathbb{R}^D is the centroid of the dendrite, and r_{i,j} > 0 is its corresponding radius.

Figure 2 illustrates the three possible responses of a dendrite given in Eq. 4. A dendrite obtains its maximum response when x = c, that is, h(x) = r. As x moves away from the centroid, the dendrite response decreases to zero on the boundary and becomes negative outside of the dendrite region. Thus, the most active dendrite cluster can be identified with Eq. 3.

Figure 2: A hypersphere in 2D generated by its dendrite parameters c and r. The response is positive when the pattern x is inside the hypersphere, zero when x is on the hypersphere boundary, and negative when x is outside of the hypersphere.

Algorithm 1 shows the pseudocode for training a DSN based on the k-means algorithm and SGD.
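As a concrete illustration of Eqs. 1-4, the following is a minimal NumPy sketch of the DSN forward pass for a single input pattern; the function name and data layout are illustrative assumptions, and the centroids, radii, and weights would be produced by the training procedure of Algorithm 1.

    import numpy as np

    def dsn_forward(x, centroids, radii, W, b):
        # centroids[j]: (l_j, D) centroids of the jth dendrite cluster
        # radii[j]:     (l_j,) radii of the corresponding hyperspheres
        # W:            (C, C) weights connecting cluster responses to output nodes
        # b:            (C,) bias terms w_{0,j}
        C = len(centroids)
        # Eqs. 4 and 3: response of the most active dendrite in each cluster
        d = np.array([np.max(radii[j] - np.linalg.norm(x - centroids[j], axis=1))
                      for j in range(C)])
        # Eq. 2: linear combination of cluster responses followed by softmax
        z = W @ d + b
        p = np.exp(z - z.max())
        p /= p.sum()
        # Eq. 1: maximum probability rule
        return int(np.argmax(p)), p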
Algorithm 1 (fragment):
    ...
    Get patterns of the jth class from X to obtain the subset X_j
    ...
    Save centroids: DSN.c_{i,j} ← c_{i,j}, ∀i = 1, ..., l_j
 8: end for
 9: Measure the pairwise distance between centroids in C
10: Calculate the mean distance of c_{i,j} to the 2-nearest centroids to get its corresponding radius r_{i,j}, ∀i, j
11: Save radii: DSN.r_{i,j} ← r_{i,j}, ∀i, j
12: Obtain the cluster dendrite responses d_j(x), j = 1, ..., C, for all samples in X to obtain Y = {y_1, ..., y_N}
13: Calculate the softmax weights W = [w_1, ..., w_C] with SGD and cross-entropy from the tuple (Y, t)
14: Save weights: DSN.W ← W
15: return DSN

First, the centroids of the dendrites are calculated with the k-means algorithm (lines 3-8), where the parameter k = l_j is the number of hyperspheres in a class. Next, to reduce the overlap between dendrite regions, the mean distance to the two nearest centroids determines the radius of each dendrite (lines 9-10). Finally, the cluster dendrite responses are calculated from the entire training set (Eqs. 3 and 4) and are used to obtain the weights of the softmax function by minimizing the cross-entropy loss function with SGD (lines 12-13).

Notice that Algorithm 1 requires the number of dendrites per class, which is problem-dependent and typically not known a priori. A DSN configuration with a reduced number of dendrites and a high classification rate is desirable. This goal can be achieved with an optimization procedure that maximizes the classification accuracy with few dendrites. Herein, the Simulated Annealing (SA) algorithm is employed to tune the DSN configuration automatically. SA is a stochastic local search method for global combinatorial optimization, which allows gradual convergence to a near-optimal solution. SA performs a sequence of moves from a current solution to a better one according to specific transition rules, while occasionally accepting some uphill solutions in order to guarantee diversity in the domain exploration and to avoid getting caught at local optima. The optimization process is managed by a cooling schedule that controls the number of iterations [1]. Thus, SA is useful for finding the combination of the number of dendrites per class that results in the best classification rate in a finite number of iterations.

Algorithm 2 shows the pseudocode for DSN tuning with the SA algorithm.

Algorithm 2 (fragment):
    ...
    Evaluate the accuracy f(z) with the validation set (X, t)
    ...
    z* ← z_t
19: until cooling condition is reached
20: return z*

In line 3, the number of dendrites per class is randomly initialized in the range [1, N_j], where N_j is the number of patterns in the jth class. In line 10, the neighborhood structure generates a new solution by randomly moving (or not) the number of dendrites per class backward or forward. In lines 13-16, a DSN solution is accepted if its accuracy is higher than that of the previous solution; otherwise, a probability of acceptance criterion is applied, which depends on the current temperature. With this scheme, the current solution may be accepted even if it is worse than the previous solution, which is useful to avoid local optima. Finally, in lines 16-17, the best solution is updated if its accuracy is lower than that of the current solution, or if both solutions have the same accuracy and the current solution has fewer dendrites than the current best solution.

For evaluating the classification performance of the DSN approach, synthetic and real-world datasets are considered.
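Before describing the datasets, the following is a minimal Python sketch of the training procedure outlined in Algorithm 1; the number of dendrites per class passed to it would be selected by the SA tuning of Algorithm 2. It uses scikit-learn's KMeans for the centroids and, for brevity, plain batch gradient descent on the cross-entropy loss in place of SGD; all names and hyperparameters are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_dsn(X, t, dendrites_per_class, epochs=200, lr=0.1):
        classes = np.unique(t)
        C = len(classes)
        # Lines 3-8: k-means centroids per class (k = l_j hyperspheres in class j)
        centroids = [KMeans(n_clusters=dendrites_per_class[j], n_init=10)
                     .fit(X[t == cls]).cluster_centers_
                     for j, cls in enumerate(classes)]
        # Lines 9-11: radius of each dendrite = mean distance to its two nearest centroids
        all_c = np.vstack(centroids)
        radii = []
        for cj in centroids:
            dist = np.linalg.norm(all_c[None, :, :] - cj[:, None, :], axis=2)
            radii.append(np.sort(dist, axis=1)[:, 1:3].mean(axis=1))  # skip distance to itself
        # Line 12: cluster responses d_j(x) for every training sample (Eqs. 3 and 4)
        D = np.stack([[np.max(radii[j] - np.linalg.norm(x - centroids[j], axis=1))
                       for j in range(C)] for x in X])
        # Line 13: softmax weights by minimizing the cross-entropy (batch gradient descent)
        W, b = np.zeros((C, C)), np.zeros(C)
        T = np.eye(C)[np.searchsorted(classes, t)]          # one-hot targets
        for _ in range(epochs):
            Z = D @ W.T + b
            P = np.exp(Z - Z.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)
            G = (P - T) / len(X)                            # gradient of the loss w.r.t. Z
            W -= lr * (G.T @ D)
            b -= lr * G.sum(axis=0)
        return centroids, radii, W, b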
The synthetic data comprise three didactic 2D datasets for illustrating the nonlinear boundaries generated by a DSN trained with Algorithm 2. The ten real-world datasets were obtained from the UCI Machine Learning Repository [5], and their characteristics are summarized in Table 1. These datasets were also previously used to evaluate the DMN and DEN approaches [2, 9]. For comparison purposes, the real-world datasets are also classified by an MLP with one hidden layer, a DMN initialized with the dHpC method and trained with SGD [9], and a DEN trained with a hill-climbing algorithm that determines the number of dendrites per class [2]. In order to find statistical differences between methods, the Kruskal-Wallis test (α = 0.05) is used to evaluate whether the medians of the compared approaches differ, under the assumption that the shapes of the underlying distributions are the same. The correction for multiple testing on the same data is made with the Bonferroni method. The k-fold cross-validation method (with k = 10) is used to build the training and test sets to measure the classification accuracy (i.e., the hit rate) of the neural models. Moreover, in Algorithm 2, the training set is partitioned again into two parts to create the training (80%) and validation (20%) sets. It is worth mentioning that a procedure of grid search with k-fold cross-validation (with k = 5) determines the number of hidden neurons that maximizes the accuracy of the MLP network, where the number of hidden neurons is increased from 5 to 100 in steps of 5 [4].

Figure 3 shows the distribution of hyperspheres per class (i.e., dendrites) for each synthetic dataset. The SA algorithm determined the number of dendrites that maximized accuracy. For instance, it is notable that only three dendrites (one per class) are required to correctly classify all the patterns of the Ring dataset (Fig. 3(b)). The corresponding decision regions obtained by the DSN approach are also illustrated in Fig. 3. Notice that nonlinear decision boundaries are built, which are capable of modeling complex class distributions.

In the case of the real-world datasets, Fig. 4 shows the accuracy results obtained by the MLP, DMN, DEN, and DSN neural models. Remarkably, DSN outperformed the DEN approach in eight of ten datasets and obtained competitive results in relation to the DMN and MLP methods. Moreover, the multiple comparisons with the Kruskal-Wallis test and Bonferroni correction determined that DSN did not present statistically significant differences with MLP (p = 0.7035) and DMN (p = 0.3037), whereas DSN and DEN were statistically significantly different (p < 0.0001). In addition, for all the datasets, the DSN presented a simpler structure than MLP and DEN. For instance, for the Thyroid Gland dataset (D9), MLP obtained an accuracy of 97.2% with 41 hidden neurons, whereas DSN reached an accuracy of 96.8% with four dendrites. Also, for the Page Blocks dataset (D6), the accuracy of DEN is 91.1% with 71 dendrites, whereas DSN used 59 dendrites to attain an accuracy of 93.8%.

This paper presented the theoretical basis of the Dendrite Spherical Neuron (DSN) for pattern classification. The DSN belongs to the family of neuronal models with dendritic processing, like Dendrite Morphological Neurons (DMN) and Dendrite Ellipsoidal Neurons (DEN). These neural models are an alternative to the Multilayer Perceptron (MLP) for solving classification problems with a simple architecture.
Moreover, the DSN model can be viewed as a simplification of the DEN model in which the covariance matrix is constrained to be diagonal with all its elements equal. DSN overcomes potential issues found in DEN, such as singular covariance matrices when a dendrite clusters a small number of patterns, while maintaining the smoothness of the decision boundaries.

The number of dendrites in DSN is a free parameter that should be tuned adequately. Thus, we proposed an optimization procedure based on the Simulated Annealing (SA) algorithm, in which the classification accuracy is maximized while the number of dendrites is used as a constraint. Notice that this scheme does not guarantee the minimum number of dendrites, which represents a limitation of the proposed method. Hence, the problem of DSN tuning can be further extended to multiobjective optimization, in which the accuracy is maximized and the number of dendrites is minimized.

The experiments with real-world datasets revealed that DSN tended to reach better accuracy results than DMN. This behavior occurs because the DMN strongly depends on its initial solution (here, the dHpC method was used), which is refined by Stochastic Gradient Descent (SGD); therefore, the error in the initial solution is carried to the final solution. Unlike DMN, DSN does not refine an initial solution but uses the k-means algorithm to distribute the dendrites over the input space, while SGD is used to train the weights of the softmax function at the output of the neuron. On the other hand, DEN obtained the lowest classification performance. This behavior occurs because its dendrites are created independently for each class, without considering the interaction between dendrites of different classes, causing overlaps in the transition regions between classes. This drawback is addressed in the DSN model by considering the closeness between dendrites when calculating the radius of the hyperspheres.

DSN obtained competitive results with respect to MLP for classifying real-world datasets. However, MLP produced more complex structures than DSN; that is, MLP usually requires more hidden neurons than DSN requires dendrites to model the same classification problem. Therefore, DSN can potentially be used for classification problems where computational resources are limited.

Future work involves a study of the effect of the number of dendrites on the DSN classification performance. Also, an extensive study with larger datasets and other kinds of classifiers is pending. Besides, the accuracy of DSN may be improved by using other distance metrics to measure the closeness between patterns, as well as by incorporating momentum and adaptive learning rate mechanisms into SGD.

References
[1] Multiobjective simulated annealing: principles and algorithm variants
[2] Dendrite ellipsoidal neurons based on k-means optimization
[3] Differential evolution training algorithm for dendrite morphological neural networks
[4] Finding optimal model parameters by deterministic and annealed focused grid search
[5] UCI machine learning repository
[6] Pattern Classification
[7] Artificial Neural Networks: An Introduction (SPIE Tutorial Texts)
[8] Morphological neural networks with dendritic processing for pattern classification
[9] Dendrite morphological neurons trained by stochastic gradient descent

Acknowledgements. We want to express our sincere appreciation to CINVESTAV-IPN and the Instituto Politécnico Nacional for the support provided to carry out this research. This research was supported by a Fondo SEP-Cinvestav 2018 grant (No. 145), and by projects SIP-IPN 20190007 and SIP-IPN 20200630.