key: cord-0531875-k58mjoro
authors: Angelov, Plamen; Soares, Eduardo
title: Towards Explainable Deep Neural Networks (xDNN)
date: 2019-12-05
journal: nan
DOI: nan
sha: 5660c5ec41ab8e76b21f3ffafa5f4f663914ac7f
doc_id: 531875
cord_uid: k58mjoro

In this paper, we propose an elegant solution that directly addresses the bottlenecks of the traditional deep learning approaches and offers a clearly explainable internal architecture that can outperform the existing methods, requires very little computational resources (no need for GPUs) and has short training times (in the order of seconds). The proposed approach, xDNN, uses prototypes. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution called typicality as well as of the data density. This generative model is identified in a closed form and equates to the pdf but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. We tested it on some well-known benchmark data sets such as iRoads and Caltech-256. xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train, and offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record.

Deep learning has demonstrated the ability to achieve highly accurate results in different application domains such as speech recognition [2], image recognition [3], language translation [4] and other complex problems [5]. It has attracted the attention of the media and the wider public [6].
It has also proven to be very valuable and efficient in automating the usually laborious and sometimes controversial preprocessing stage of feature extraction. The main criticism of deep learning is usually related to its 'black-box' nature and its requirements for a huge amount of labeled data, computational resources (GPU accelerators as a standard), long training times (hours), and high power and energy consumption [7]. Indeed, a traditional deep learning algorithm (e.g. a convolutional neural network) involves hundreds of millions of weights/coefficients/parameters that require iterative optimization procedures. In addition, these hundreds of millions of parameters are abstract and detached from the physical nature of the problem being modelled. However, the automated way to extract them is very attractive in high-throughput applications of complex problems like image processing, where human expertise may simply be unavailable or very expensive. Feature extraction is an important pre-processing stage, which defines the data space and may influence the accuracy of the end result. Therefore, we consider this a very useful property of traditional deep learning and build on it, combined with another important recent result in the deep learning domain, namely transfer learning. This concept postulates that knowledge in the form of a model architecture learned in one context can be re-used and be useful in another context [8]. Transfer learning helps to considerably reduce the amount of time used for training. Moreover, it may also help to improve the accuracy of the models [9].
Building on the two main achievements of deep learning, namely top accuracy combined with an automatic approach to feature extraction for complex problems such as image classification, we try to address its deficiencies, such as the lack of explainability [7], the computational burden, the power and energy resources required, and the limited ability to self-adapt and evolve [10]. Interpretability and explainability are extremely important for high-stakes applications, such as autonomous cars, medical or court decisions, etc. For example, it is extremely important to know the reasons why a car took some action, especially if this car is involved in an accident [11]. The state-of-the-art classifiers offer a choice between higher explainability at the price of lower accuracy or vice versa (Figure 1).

Before deep learning [12], machine learning and pattern recognition required substantial domain expertise to design a feature extractor that could transform the raw data into a feature vector, which defines the data space within which the learning subsystem could detect or classify data patterns [4]. Deep learning offers a new way to extract abstract features automatically. Moreover, pre-trained structures can be reused for different tasks through the transfer learning technique [8]. Transfer learning helps to considerably reduce the amount of time used for training; moreover, it may also help to improve the accuracy of the models [9].

In this paper, we propose a new approach, xDNN, that offers a high level of explainability combined with top accuracy. The proposed approach, xDNN, offers a new deep learning architecture that combines reasoning and learning in a synergy. It is based on prototypes and the data density [13] as well as typicality, an empirically derived pdf [14]. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users.
We tested it on some well-known benchmark data sets such as iRoads [15] and Caltech-256 [16]. xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train; moreover, it offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record [1].

The remainder of this paper is organized as follows. The next section introduces the proposed explainable deep learning approach. The experimental data employed in the analysis and the results are presented in the results section. The discussion is presented in the last section of this paper.

The proposed explainable deep neural network (xDNN) classifier is formed of several layers with a very clear semantic and functional meaning. In addition to its internal clarity and transparency, it also offers a set of prototype-based IF...THEN rules that is very clear from the user's point of view. Prototypes are selected data samples (images) that the user can easily view and understand, and whose similarity to other validation images the user can appreciate. xDNN offers a synergy between statistical learning and reasoning, bringing both together. In most other approaches there is a dichotomy and a preference for one over the other. We advocate and demonstrate that learning and reasoning can work together in a synergy and produce very impressive results. Indeed, the proposed xDNN method outperforms all published results [15], [1], [17] in terms of accuracy. Moreover, it is also far ahead in terms of training time, computational simplicity, and the low power and energy required. The proposed approach can be described as a feedforward neural network with an incremental learning algorithm that autonomously self-develops and evolves its structure, adding new prototypes to reflect the possibly changing (dynamically evolving) data pattern [10].
As shown in Figure 3, xDNN is composed of the following layers:
1) Features descriptor layer;
2) Density layer;
3) Typicality layer;
4) Prototypes layer;
5) MegaClouds layer.

1) Features descriptor layer: (Defines the data space) The features descriptor layer is the first phase of the proposed xDNN method. This layer is in charge of extracting a global features vector from the images. This first layer can be formed by more traditional 'handcrafted' methods such as GIST [19] or HoG [20]. Alternatively, it can be formed by the fully connected layer (FCL) of pre-trained convolutional neural network approaches such as AlexNet [21], VGG-VD-16 [18], and Inception [22], residual neural networks such as Resnet [3] or Inception-Resnet [23], etc. Using a pre-trained deep neural network allows automatic extraction of more abstract and discriminative high-level features. In this paper, the pre-trained VGG-VD-16 DCNN is employed for feature extraction. According to [24], VGG-VD-16 has a simple structure and can achieve better performance in comparison with other pre-trained deep neural networks. The first fully connected layer of VGG-VD-16 provides a 1 × 4096 dimensional vector. The values are then standardized using equation (1), where x̂ denotes the normalized value of the features vector. For clarity, in the rest of the paper we will use x instead of x̂.

Meta-parameters for the xDNN are initialized with the first observed data sample (image). The proposed algorithm works per class; therefore, all the calculations are done for each class separately. In this initialization, µ denotes the global mean of the data samples of the given class, and P is the total number of prototypes identified from the observed data samples (images).
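The standardization step of the features descriptor layer can be illustrated with a minimal sketch. The exact form of equation (1) is not reproduced in this text, so the sketch assumes a per-vector z-score followed by scaling to unit Euclidean norm; both choices, the function names `standardize` and `unit_norm`, and the four-element toy vector are illustrative assumptions, not the authors' implementation.

```python
import math

def standardize(x):
    """Z-score a feature vector (assumed form of the standardization)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        # A constant vector carries no information after centering.
        return [0.0] * n
    return [(v - mean) / std for v in x]

def unit_norm(x):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return list(x)
    return [v / norm for v in x]

# Toy stand-in for the 1 x 4096 vector produced by the first
# fully connected layer of VGG-VD-16.
features = [0.5, 2.0, 0.0, 1.5]
x = unit_norm(standardize(features))
```

Unit-length vectors are convenient here because the distance between two of them then depends only on the angle between them, which matches the angle-based reasoning used for the radius of a data cloud.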
Each class C is initialized by the first data sample of that class, where p_1 is the vector of features that describes the prototype Î of C_1; Î is the identified prototype; Support_1 is the corresponding support (number of members) associated with this prototype; and r_1 is the corresponding radius of the area of influence of C_1. In this paper, we use r* = 2 − 2cos(30°), the same as [13]; the rationale is that two vectors for which the angle between them is less than π/6, or 30°, are pointing in close/similar directions. That is, we consider two feature vectors to be similar if the angle between them is smaller than 30 degrees. Note that r* is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data through equation (5).

2) Density layer: The density layer defines the mutual proximity of the images in the data space defined by the features from the previous layer. If the Euclidean form of distance is used, the data density has a Cauchy form (6) [13], where D is the density, µ is the global mean, and σ² is the variance. The reason it is Cauchy is not arbitrary [13]. It can be demonstrated theoretically that if Euclidean or Mahalanobis types of distance in the feature space are considered, the data density reduces to the Cauchy type referred to in equation (6). Density can also be updated online [25], where µ_i and the scalar product term can be updated recursively. Data samples (images) that are closer to the global mean have higher density values. Therefore, the value of the data density indicates how strongly a particular data sample is influenced by other data samples in the data space due to their mutual proximity.

3) Typicality layer: Typicality is an empirically derived form of probability distribution function (pdf). Typicality τ is given by equation (10). The value of τ even at the point x = p_i is much less than 1; the integral ∫_{−∞}^{∞} τ dx = 1 [13].
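The density layer and its online update can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the Cauchy-type expression D(x) = 1 / (1 + ||x − µ||² / σ²), with σ² obtained as the mean squared norm of the samples minus the squared norm of their mean. That expression, the class name `RunningDensity`, and the toy samples are assumptions made for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

class RunningDensity:
    """Recursive mean and mean squared norm for one class."""

    def __init__(self, first):
        self.n = 1
        self.mean = list(first)
        self.mean_sq_norm = sum(v * v for v in first)

    def update(self, x):
        # Each new sample gets weight 1/n; the old statistics keep (n-1)/n.
        self.n += 1
        w = 1.0 / self.n
        self.mean = [(1.0 - w) * m + w * v for m, v in zip(self.mean, x)]
        self.mean_sq_norm = (1.0 - w) * self.mean_sq_norm + w * sum(v * v for v in x)

    def variance(self):
        # Mean of squared norms minus squared norm of the mean.
        return self.mean_sq_norm - sum(m * m for m in self.mean)

    def density(self, x):
        var = self.variance()
        if var <= 1e-12:
            # Degenerate case: all samples seen so far coincide.
            return 1.0 if sq_dist(x, self.mean) == 0.0 else 0.0
        return 1.0 / (1.0 + sq_dist(x, self.mean) / var)

rd = RunningDensity([1.0, 0.0])
for sample in ([0.0, 1.0], [1.0, 1.0], [0.0, 0.0]):
    rd.update(sample)

d_at_mean = rd.density([0.5, 0.5])   # the global mean has the highest density
d_at_corner = rd.density([1.0, 0.0])
```

Under these assumptions the density equals 1 at the global mean and decays with distance from it, which agrees with the statement that samples closer to the global mean have higher density values. No sample needs to be stored, since only the running mean and the running mean squared norm are kept.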
4) Prototypes layer: The prototypes identification layer is the core of the proposed xDNN classifier. This layer is responsible for providing the clearly explainable model. The xDNN classifier is free from prior assumptions about the data distribution type, as well as about the random or deterministic nature of the data. Instead, it extracts the actual distribution empirically from the data samples (images), bottom up [13]. The prototypes are independent of each other. Therefore, one can change the structure by adding a new prototype without influencing the already existing prototypes. In other words, the proposed xDNN is highly parallelizable and suitable for evolving forms of application where new prototypes may be added (if the data pattern requires this). The proposed xDNN method is trained per class, forming a set of prototypes per class. Therefore, all the calculations are done for each class separately. Prototypes are the local peaks of the data density (and typicality) identified in the previous layers/stages of the algorithm from the images of the corresponding class, based on their feature vectors. The prototypes can be used to form linguistic logical IF...THEN rules of the following form:

IF (I ∼ p) THEN (class c)

where ∼ stands for similarity and can also be seen as a fuzzy degree of membership; p is the identified prototype; P is the number of identified prototypes; c is the class, c = 1, 2, ..., C; and I denotes an image. One rule per prototype can be formed. All rules per class can be combined using logical OR, also known as disjunction or S-norm:

IF (I ∼ p_1) OR (I ∼ p_2) OR ... OR (I ∼ p_P) THEN (class c)

Figure 4 illustrates the area of influence of the identified prototypes. These areas around the identified prototypes are called data clouds [13]. Thus, each prototype defines a data cloud. We call all data points associated with a prototype a data cloud because their shape is not regular (e.g., hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical or geometric mean, but an actual image [13].
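The way per-prototype rules are combined into one rule per class can be shown with a small sketch. The helper names `prototype_rule` and `class_rule`, the textual rule layout, and the image file names are hypothetical and serve only to illustrate the disjunction described above.

```python
def prototype_rule(prototype_id, class_name):
    """One rule per prototype: IF (I ~ prototype) THEN (class)."""
    return f"IF (I ~ {prototype_id}) THEN ({class_name})"

def class_rule(prototype_ids, class_name):
    """Combine all prototypes of a class with logical OR (disjunction)."""
    clauses = " OR ".join(f"(I ~ {pid})" for pid in prototype_ids)
    return f"IF {clauses} THEN ({class_name})"

# Hypothetical prototype images for a hypothetical class label.
night_prototypes = ["night_017.jpg", "night_203.jpg"]
single = prototype_rule(night_prototypes[0], "Night")
combined = class_rule(night_prototypes, "Night")
print(combined)
```

Because each clause names an actual training image, a user can inspect the rule by looking at the listed images rather than at abstract weights.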
The algorithm absorbs the new data samples one by one by assigning them to the nearest (in the feature space) prototype. If the condition from [13] is met, x_i is out of the influence area of p_j. Therefore, the vector of features x_i becomes a new prototype of a new data cloud with meta-parameters initialized by equation (13). Add a new data cloud: P ← P + 1; C_P ← x_i; p_P ← I_i; Support_P ← 1. Otherwise, the data cloud parameters are updated online by equation (14). It has to be stressed that all calculations per data cloud are performed on the basis of the data points associated with that data cloud only (i.e. locally, not globally on the basis of all data points). The xDNN learning procedure can be summarized by the following algorithm.

xDNN: Learning Procedure
1: Read the first feature vector sample x_i representing the image I_i of the class c;
2: Set i ← 1; n ← 1; P_1 ← 1; p_1 ← x_i; µ ← x_1; Support ← 1; r_1 ← r_0; Î_1 ← I_1;
3: FOR i = 2, ...
4: Read x_i;
5: Calculate D(x_i) and D(p_j) (j = 1, 2, ..., P) according to equation (9);
6: IF Equation (12)

5) MegaClouds layer: In the MegaClouds layer, the clouds formed by the prototypes in the previous layer are merged if the neighbouring prototypes have the same class label. In other words, they are merged if they belong to the same class. MegaClouds are used to facilitate human interpretability. Figure 5 illustrates the formation of the MegaClouds. In the rules of the MegaClouds layer, MC are the MegaClouds, or the areas formed from the merging of the clouds, and mc is the number of identified MegaClouds. Multimodal typicality, τ, can also be used to illustrate the MegaClouds, as shown in Figure 6.

The architecture for the validation process of the proposed xDNN method is illustrated by Figure 7. The validation process of xDNN is composed of the following layers:
1) Features descriptor layer;
2) Similarity layer (density);
3) Local decision-making.
4) Global decision-making.

These layers are described in detail as follows:

1) Features descriptor layer: The same as the features descriptor layer described in the training process.

2) Prototypes layer: In this layer, the degrees of similarity to the nearest prototypes (per class) are extracted for each unlabeled (new/validation) data sample/image I_i, where S denotes the similarity degree.

3) Local decision-making: Local (per class) decision-making is based on the 'winner-takes-all' principle.

4) Global decision-making layer: The global decision-making layer is in charge of forming the decision by assigning labels to the validation images based on the degree of similarity to the prototypes obtained by the prototype identification layer, as illustrated by Figure 7, and determining the winning class. In order to determine the overall degree of satisfaction, the maximum of the local, per-class winners is applied. The label is obtained by equation (18).

III. EXPERIMENTAL DATA

We validated our proposed approach, xDNN, using several complex, well-known image classification benchmark datasets (iRoads and Caltech-256). The iROADS dataset [15] was considered in the analysis first. The dataset contains 4,656 image frames recorded from moving vehicles on a diverse set of road scenes, recorded by day and by night under various weather and lighting conditions, as described below:
• Daylight - 903 images
• Night - 1050 images
• Rainy day - 1049 images
• Rainy night - 431 images
• Snowy - 569 images
• Sun strokes - 307 images
• Tunnel - 347 images

Caltech-256 has 30,607 images divided into 257 object categories (one of which is the background) [16].

The performance of the classification methods is usually evaluated based on their accuracy index, which is defined as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively.
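The learning and validation procedures described earlier, together with the accuracy index, can be sketched end to end on toy data. This is a simplified illustration under stated assumptions, not the authors' algorithm: a sample founds a new data cloud when its Euclidean distance to the nearest cloud centre of its class exceeds a fixed radius (the density-based condition is omitted), the radius is taken as the distance between unit-norm vectors 30° apart, and the similarity is assumed to be 1 / (1 + squared distance). All function names are hypothetical.

```python
import math

# Assumed radius: distance between two unit-norm vectors 30 degrees apart.
R0 = math.sqrt(2.0 - 2.0 * math.cos(math.radians(30.0)))

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def new_cloud(x, index):
    # The prototype is the index of the actual sample that founded the cloud.
    return {"center": list(x), "prototype": index, "support": 1}

def fit(samples, labels, radius=R0):
    """Single pass over the data; clouds are built separately per class."""
    clouds = {}
    for index, (x, label) in enumerate(zip(samples, labels)):
        class_clouds = clouds.setdefault(label, [])
        if not class_clouds:
            class_clouds.append(new_cloud(x, index))
            continue
        nearest = min(class_clouds, key=lambda c: dist(x, c["center"]))
        if dist(x, nearest["center"]) > radius:
            class_clouds.append(new_cloud(x, index))  # outside the area of influence
        else:
            s = nearest["support"] + 1                # local, recursive mean update
            nearest["center"] = [m + (v - m) / s for m, v in zip(nearest["center"], x)]
            nearest["support"] = s
    return clouds

def similarity(x, center):
    return 1.0 / (1.0 + dist(x, center) ** 2)         # assumed Cauchy-type similarity

def predict(x, clouds):
    # Local decision: best cloud per class. Global decision: best class overall.
    local = {label: max(similarity(x, c["center"]) for c in cs)
             for label, cs in clouds.items()}
    return max(local, key=local.get)

def accuracy(y_true, y_pred):
    # Fraction of correct labels; for two classes this equals
    # (TP + TN) / (TP + FP + TN + FN).
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def unit(deg):
    a = math.radians(deg)
    return [math.cos(a), math.sin(a)]

train_x = [unit(0), unit(10), unit(90), unit(180), unit(190)]
train_y = ["a", "a", "a", "b", "b"]
clouds = fit(train_x, train_y)

test_x = [unit(5), unit(80), unit(185)]
test_y = ["a", "a", "b"]
pred = [predict(x, clouds) for x in test_x]
acc = accuracy(test_y, pred)
```

In this toy run, class "a" ends up with two clouds (one around 0° to 10°, one at 90°) and class "b" with one. Each sample is visited exactly once and only the nearest cloud of its own class is touched, which reflects the non-iterative, per-class and local character of the procedure.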
All the experiments were conducted with MATLAB 2018a using a personal computer with a 1.8 GHz Intel Core i5 processor, 8 GB of RAM, and the MacOS operating system. The classification experiments were executed using 10-fold cross validation under the same ratio of training-to-testing (80% to 20%) sample sets.

Computational simulations were performed to assess the accuracy of the proposed explainable deep learning method, xDNN, against other state-of-the-art approaches. Table I shows that the proposed xDNN method provides the best result in terms of classification accuracy as well as time/complexity and simplicity of the model structure (number of parameters/prototypes). The number of model parameters for xDNN (and DRB) is, strictly speaking, zero, because the 2 parameters (mean, µ, and standard deviation, σ) per prototype (data cloud) are derived from the data and are not algorithmic or user-defined parameters. For the kNN method, one can argue that the number of parameters is the number of data samples, N. The proposed explainable DNN surpasses in terms of accuracy the state-of-the-art VGG-VD-16 algorithm, which is a well-established convolutional deep neural network. Moreover, the proposed xDNN has at its top layer a very small number of MegaClouds (27 or, on average, 4 MegaClouds per class), which makes it very easy to explain and visualize. For comparison, our earlier version of deep rule-based models, called DRB [17], also produced a high accuracy and was trained a bit faster, but ended up with 521 prototypes (on average 75 prototypes per class) [26]. With xDNN we generate meaningful IF...THEN rules as well as an analytical description of the typicality, which is the empirically derived pdf in a closed form that lends itself to further analysis and processing.
Table I
Method              Accuracy   Time     Number of parameters
[26]                99.51%     836.28   Not reported
DRB [26]            99.02%     2.95     521
SVM [26]            94.17%     5.67     Not reported
KNN [26]            93.49%     4.43     4656
Naive Bayes [26]    88.35%     5.31     Not reported

MegaClouds generated by the proposed xDNN model can be visualized in terms of rules, as illustrated by Figure 8. Voronoi tessellation can also be used to visualize the resulting MegaClouds, as illustrated by Figure 9. Typicality for the classes 'night scene' and 'snow scene' is given in Figure 10. Typicality can also be used for interpretability and explainability, as it corresponds to the pdf. One can use the typicality to represent the likelihood that an image represents a specific type of driving conditions. For a given image, a vector of features x ∈ R^4096 can be extracted, which can be standardized and normalized and used to demonstrate the likelihood of a certain type of driving condition, as shown in Figure 10.

Results for Caltech-256 are presented in Table II.

Table II
Method              Accuracy
[27]                24.6%
SVM(2) [27]         39.6%
SVM(3) [27]         46.0%
SVM(4) [27]         51.3%
SVM(5) [27]         65.6%
SVM(7) [27]         71.7%
Softmax(5) [27]     65.7%
Softmax(7) [27]     74.2%

The results presented in Table II demonstrate that the proposed xDNN approach obtains the best classification accuracy reported so far worldwide for this complex problem, namely 75.41%. The proposed approach surpassed all of the competitors, offering the highest accuracy as well as a clearly explainable model. xDNN produced on average 3 MegaClouds per class (a total of 721), which are clearly explainable. Rules have the following format:

Experiments have demonstrated that the proposed xDNN approach is able to produce highly accurate results, surpassing state-of-the-art methods for different challenging datasets. Moreover, xDNN presents highly interpretable results that can be presented in the form of IF...THEN logical rules, Voronoi tessellations, and/or typicality (an empirically derived form of pdf) in a closed analytical form allowing further analysis.
Because of its recursive, non-iterative and non-parametric form, it allows computationally very efficient implementations to be realized.

In this paper we propose a new method, the explainable deep neural network (xDNN), that directly addresses the bottlenecks of the traditional deep learning approaches and offers a clearly explainable internal architecture that can outperform the existing methods. The proposed xDNN approach requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, is prototype-based. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution called typicality as well as of the data density. This generative model is identified in a closed form and equates to the pdf but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. Results for some well-known benchmark data sets such as iRoads and Caltech-256 show that xDNN outperforms the other methods, including state-of-the-art deep learning approaches (VGG-VD-16), in terms of accuracy and time to train, and offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record [1].

Future research will concentrate on the development of a tree-based architecture, synthetic data generation, and local optimization in order to improve the proposed deep explainable approach.
[1] Spatial pyramid pooling in deep convolutional networks for visual recognition
[2] The Microsoft 2017 conversational speech recognition system
[3] Deep residual learning for image recognition
[4] Deep learning
[5] Deep learning
[6] The deep learning revolution
[7] Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
[8] Deep transfer metric learning
[9] Supervised representation learning: Transfer learning with deep autoencoders
[10] Novelty detection and learning from extremely weak supervision
[11] Towards a rigorous science of interpretable machine learning
[12] Deep learning in neural networks: An overview
[13] Empirical approach to machine learning
[14] A generalized methodology for data analysis
[15] Vehicle detection based on multi-feature clues and Dempster-Shafer fusion theory
[16] Caltech-256 object category dataset
[17] Deep rule-based classifier with human-level performance and characteristics
[18] Very deep convolutional networks for large-scale image recognition
[19] Classifying web videos using a global video descriptor
[20] Architectural study of HOG feature extraction processor for real-time object detection
[21] ImageNet classification with deep convolutional neural networks
[22] Going deeper with convolutions
[23] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[24] Object detection networks on convolutional feature maps
[25] Autonomous learning systems: from data streams to knowledge in real-time
[26] Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios
[27] Visualizing and understanding convolutional networks