Title: A deep learning model for classifying human facial expressions from infrared thermal images
Authors: Ankan Bhattacharyya, Somnath Chatterjee, Shibaprasad Sen, Aleksandr Sinitca, Dmitrii Kaplun, Ram Sarkar
Date: 2021-10-19. Journal: Sci Rep. DOI: 10.1038/s41598-021-99998-z

The analysis of human facial expressions from thermal images captured by Infrared Thermal Imaging (IRTI) cameras has recently gained importance over images captured by standard cameras using light in the visible spectrum. This is because infrared cameras work well in low-light conditions, and the infrared spectrum captures the thermal distribution, which is very useful for building systems such as robot interaction systems, quantifying cognitive responses from facial expressions, disease control, etc. In this paper, a deep learning model called IRFacExNet (InfraRed Facial Expression Network) is proposed for facial expression recognition (FER) from infrared images. It utilizes two building blocks, namely the Residual unit and the Transformation unit, which extract dominant features from the input images specific to the expressions. The extracted features help to detect the emotion of the subjects under consideration accurately. The Snapshot ensemble technique is adopted with a Cosine annealing learning rate scheduler to improve the overall performance. The performance of the proposed model has been evaluated on a publicly available dataset, namely the IRDatabase developed by RWTH Aachen University. The facial expressions present in the dataset are Fear, Anger, Contempt, Disgust, Happy, Neutral, Sad, and Surprise. The proposed model produces 88.43% recognition accuracy, better than some state-of-the-art methods considered here for comparison. Our model provides a robust framework for the detection of accurate expressions in the absence of visible light. The major contributions of this work are as follows: 1. IRFacExNet can classify the facial expressions from the thermal images more accurately. 2. The applied snapshot ensemble technique (which is based on cosine annealing) can enhance the prediction capability. 3. The model outperforms many existing FER methods on the IRDatabase. The rest of the paper is organized as follows: the "Literature survey" section discusses recent works related to FER on both thermal and normal images. The "Data analysis" section provides a brief discussion of the IRDatabase 12 used in the current experimentation. The "Proposed methodology" section elaborates the proposed methodology used for FER, with details of the architecture, snapshot ensembling, and the training process. The "Results and discussion" section describes the outcomes obtained by the proposed model with exhaustive analysis. Finally, the "Additional experiments" section concludes the work with the pros and cons of the proposed architecture and also highlights the future scope of the work. As stated earlier, research on building FER systems from images captured by cameras using visible light has been popular due to the easy availability and low cost of such cameras. Hammal et al. 13 classified facial expressions from videos by fusing facial deformations using a rule-based decision with the help of a framework known as the transferable belief model (TBM), which considers simple distance coefficients based on simple facial features like the eyes, mouth, and eyebrows. Ojo et al. 14 classified two facial expressions, fear and sadness, using a local binary pattern histogram on two databases: the Japanese Female Facial Expression (JAFFE) database and the Cohn-Kanade database.
Kyperountas et al. 15 developed a multi-step, two-class classification scheme for facial expressions and reported results on the JAFFE and MMI databases. During each step of the process, the best classifier is identified among many two-class classifiers, which helped the authors to come up with a better FER system. Ali et al. 16 proposed a two-step method for classifying facial expressions: first, the facial features were extracted using a histogram of oriented gradients (HOG), and then a sparse representation classifier (SRC) was used to classify the facial expressions. Bartlett et al. 17 used a combination of AdaBoost and Support Vector Machine (SVM) to recognize facial expressions for use in human-robot interaction assessment. With the advent of deep learning architectures, manual feature extraction is no longer necessary, as techniques like convolution and max-pooling can be stacked in layers as part of a model to extract features and feed them into classifiers. The authors in 18 have shown a DL-based method that exploits the spectral correlation and the spatial context to extract more relevant features used in their experiment. Rodriguez et al. 19 proposed a DL-based architecture called Deep Pain to detect pain automatically by classifying facial expressions. Here, instead of manually extracting the facial features, the face is directly fed into a Convolutional Neural Network (CNN) linked to a Long Short-Term Memory (LSTM) network to model long-term dependencies. From the classified facial expressions, the model can predict the type of pain a patient experiences. One of the simplest processes is to detect FACS-AUs by passing facial features to models like the Hidden Markov Model (HMM) to decode the emotions. Similar work has been performed by Lien et al. 20, who extracted facial features using three modules and fed them into a discriminant classifier or HMM to classify them into FACS-AUs. As stated earlier, due to other objects and conditions, it becomes very difficult to classify facial expressions: there is an overhead for face detection at different poses, and the qualitative nature of FACS-AUs introduces ambiguities, as expressions vary from person to person depending on the angle of viewing. So, to make only the face part visible, thermal images are used. Thermal images expose only the human skin when the emissivity value is set close to 1 21. Moreover, in thermal images, the thermal distribution in the facial muscles is detected. This allows better facial expression classification and leaves no room for ambiguity, as it does not depend on external factors like viewing with the naked eye or inconvenient lighting conditions. On the downside, IRTI cameras are less affordable for common people. The mechanism proposed in 30 combines Gaussian, Bilateral, and Non-local means denoising filters, which helps to increase the performance of the CNN, achieving 85% and 95% accuracies on the FER2013 and LRFE datasets, respectively. Reddy et al. 31 proposed a technique that combines hand-crafted and deep learning features to recognize facial expressions on the AffectNet database.
The authors extracted hand-crafted features from facial landmark points, and XceptionNet was used for the extraction of deep features. They achieved 54%, 58%, and 59% accuracies using their proposed model on three data distributions (Imbalanced set, Down-sampled set, and Up-sampled set). Looking into the various applications of facial expression recognition and the volume of works presented in the literature, it is clear that there is great scope to work in this field. For the current work, we have used the database published by Kopaczka et al. 12. The images of this database were recorded using an Infratec HD820 high-resolution thermal infrared camera with a 1024 × 768 pixel microbolometer sensor with a thermal resolution of 0.03 K at 30 °C, equipped with a 30-mm f/1.0 prime lens. Subjects were filmed while sitting at a distance of 0.9 m from the camera, resulting in a spatial resolution of the face of approximately 0.5 mm per pixel. A thermally neutral backdrop was used for the recordings to minimize background variation. To build the database, video recordings of the subjects acquired at a frame rate of 30 frames/s were manually screened and images were extracted. The database contains a total of 1782 sample images with eight classes of expressions 12, namely Fear, Anger, Contempt, Disgust, Happy, Neutral, Sad, and Surprise. The number of sample images belonging to each class is shown in Table 1. A few sample images from the considered database are shown in Fig. 1. In the proposed work, as a pre-processing step, we have converted the input images into grayscale images and reshaped them to a uniform size of 200 × 200 pixels before feeding them to the network.
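As a minimal sketch of this pre-processing step (the grayscale conversion and the 200 × 200 target size come from the paper; the use of OpenCV, the [0, 1] intensity scaling, and the trailing channel axis are illustrative assumptions):

```python
import cv2
import numpy as np

def preprocess_thermal_image(path, size=(200, 200)):
    """Load one thermal face image, convert it to grayscale, and resize it."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # single-channel image
    img = cv2.resize(img, size)                   # uniform 200 x 200 network input
    img = img.astype(np.float32) / 255.0          # assumed scaling to [0, 1]
    return img[..., np.newaxis]                   # shape (200, 200, 1)
```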
The present study deals with Human Facial Expression Recognition (HFER) using thermal infrared images acquired from the database mentioned in 12. Several studies exist in the literature for processing facial expressions when the image is taken in the visible spectrum. However, these systems are not directly applicable to infrared facial processing; rather, we can utilize their ideas after a certain amount of tailoring to obtain competent results for infrared images 33. It is to be noted that there exist some limitations when dealing with infrared images for HFER, including the lack of colors for distinguishing facial features and the obscure skin folds and wrinkles formed on the forehead or cheeks that indicate specific expressions like happiness or anger. A hotspot, a blurry bright spot in the center of the image, is a relatively common problem with infrared imaging, and its presence makes the expression detection task more challenging. Recently, it has been observed that the rise of CNNs has greatly influenced the building of efficient HFER systems; CNNs have proved extraordinarily successful when working with RGB images. Here we try to exploit the strengths of a deep CNN architecture for feature extraction, which in turn helps to recognize different facial expressions more accurately. The proposed IRFacExNet architecture is specially designed to work with infrared images. This network efficiently extracts useful information from the input images for the detection of facial expressions. To make the predictions more accurate, an efficient ensemble strategy (known as Snapshot Ensemble) 34 has been adopted. It is to be noted that this ensemble technique helps to achieve better performance with no computational overhead 35. During the training process, any CNN model converges to several local minima along the optimization path. In this approach, we save the corresponding models at regular intervals (known as snapshots). At a later stage, these snapshots are combined by an aggregation rule to create an ensemble classifier. Proposed architecture. Instead of using traditional convolution units, the proposed IRFacExNet utilizes depth-wise convolutions, where each input channel is convolved with its own filter channel separately. This is followed by a point-wise convolution with a 1 × 1 window to merge the channel-wise extracted features. The standard convolution does channel-wise and spatial convolution in one step. On the other hand, the depth-wise convolution used in the proposed model splits this into two steps, which helps to lower the number of trainable parameters and thus reduces the chances of overfitting. These depth-wise convolutions have been used with varying dilation rates for a global perspective and to capture the spatial and cross-channel correlations of the inputs. The proposed IRFacExNet model shown in Fig. 2 is divided into two structural units, namely the Residual Unit and the Transformation Unit. Dilated convolution is the primary innovation incorporated in these units to extract variegated features from the entire image. The working procedures of dilated convolutions, the Residual Unit, and the Transformation Unit are described below. For accurate classification of facial expressions, it is important to extract distinctive features that are unique to an expression while covering the complete image from different perspectives. Dilated convolutions broaden the receptive field and help to bring diversity to the feature maps with no increase in trainable parameters 34. Figure 3 shows the working procedure of dilated convolutions. In the proposed structural units, an input image first goes through a point-wise convolution, which projects the inter-channel information into a broader space. It is then propagated through several depth-wise convolutions with dilation rates varying from 1 to m, which broaden the receptive field for feature extraction and help to find localized features as well as features distributed over a wide area. Thus it helps to find spatial and cross-channel features from different perspectives. Finally, these varied features are merged using point-wise convolutions. In the Residual Unit, the output of the mapping scheme is merged with the input features themselves, known as residual learning, proposed by He et al. 36. If the input is termed U and the residual mapping function R(·), then the output can be defined as F(U) = U + R(U). This allows us to build a deeper network without overfitting and helps the dominant features to propagate deeper into the network without much distortion. It also avoids vanishing-gradient problems, and the skip-connection technique helps to build a much deeper network that is easier to train. After all the internal processing inside the residual block, the output shape is kept equal to the input shape. The basic block diagram of the Residual Unit is shown in Fig. 4.
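A minimal PyTorch sketch of a Residual Unit following this point-wise, dilated depth-wise, point-wise pattern is given below. The paper does not publish exact layer widths, nor whether the dilated depth-wise convolutions run sequentially or in parallel; the parallel branches, the 4× channel expansion, and the normalization placement here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Point-wise expansion -> parallel dilated depth-wise convolutions
    (rates 1..m) -> point-wise merge, with the skip connection F(U) = U + R(U).
    Channel widths and the branch layout are illustrative assumptions."""

    def __init__(self, channels, m=3, expansion=4):
        super().__init__()
        mid = channels * expansion
        self.expand = nn.Sequential(          # point-wise: broaden the channel space
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.branches = nn.ModuleList([       # one depth-wise 3x3 conv per dilation rate
            nn.Conv2d(mid, mid, kernel_size=3, padding=rate, dilation=rate,
                      groups=mid, bias=False)
            for rate in range(1, m + 1)
        ])
        self.merge = nn.Sequential(           # point-wise: fuse the dilated features
            nn.Conv2d(mid * m, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, u):
        x = self.expand(u)
        x = torch.cat([branch(x) for branch in self.branches], dim=1)
        r = self.merge(x)                     # R(U): same shape as the input
        return self.act(u + r)                # residual learning: U + R(U)
```

As the unit preserves the input shape, `ResidualUnit(32)(torch.randn(1, 32, 50, 50))` returns a tensor of shape (1, 32, 50, 50), so blocks can be stacked to depth d without dimensional bookkeeping.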
The Transformation Unit can be considered a substitute for the pooling layers. Here, we have tried to avoid the greedy technique of max-pooling because it usually causes the loss of diffused features and positional information. The Transformation Unit has been designed similarly to the Residual Unit, i.e., point-wise, depth-wise, and point-wise convolutions. Here, the first point-wise convolution expands the number of channels to four times the input to complement the depth-wise convolutions. Next, the spatial dimensions are halved using strided depth-wise convolutions. This helps to eliminate unnecessary details and makes the output suitable for further operations. The depth of the output feature map is doubled in the final point-wise convolution to increase the filtering operations in later stages. The basic block diagram of the Transformation Unit is shown in Fig. 5. In a nutshell, the working strategy of the proposed IRFacExNet model is as follows. The initial input is processed by some traditional convolutions with larger-sized kernels to produce an adequate number of channels and spatial dimensions. The major part of the network is stacked with residual blocks of depth (d) to produce a deeper network, followed by transformation blocks, which are introduced to bring dimensional transformations and to improve the generalization performance. The dilation rate (m) depends on the input size of the feature map: for processing larger-sized feature maps, a higher dilation rate is preferred, as it makes the receptive field wider, which helps to extract both generic and complex features. Finally, the feature maps are passed through a global average pooling layer followed by some fully connected layers and the classification layer. All the layers are ReLU (Rectified Linear Unit) activated to bring in non-linearity. Batch normalization is used to make the convergence faster and to provide a modest generalization effect. It also helps to decrease the internal covariate shift 37.
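To make the two structural units and their assembly concrete, here is a matching sketch of the Transformation Unit and of the overall stacking just described. It reuses the ResidualUnit from the previous sketch; the kernel sizes, the stem, the stage count, and the base width are assumptions, since the paper reports only the pattern (4× point-wise expansion, strided depth-wise halving, 2× output depth) and the stacking of residual and transformation blocks:

```python
import torch.nn as nn

class TransformationUnit(nn.Module):
    """Pooling substitute: point-wise (4x channels) -> strided depth-wise
    (halves H and W) -> point-wise (2x the input depth)."""

    def __init__(self, channels):
        super().__init__()
        mid = channels * 4
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),      # expand 4x
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, stride=2, padding=1,
                      groups=mid, bias=False),                        # halve H and W
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels * 2, kernel_size=1, bias=False),  # double the depth
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def build_irfacexnet(d=2, base=32, num_classes=8):
    """Assumed assembly: stem convolution, then d ResidualUnits and one
    TransformationUnit per stage, then global average pooling and the head."""
    layers = [nn.Conv2d(1, base, kernel_size=7, stride=2, padding=3),  # larger-kernel stem
              nn.ReLU(inplace=True)]
    channels = base
    for _ in range(3):                                                 # three stages (assumed)
        layers += [ResidualUnit(channels) for _ in range(d)]
        layers.append(TransformationUnit(channels))
        channels *= 2
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),                  # global average pooling
               nn.Linear(channels, num_classes)]                       # 8 expression classes
    return nn.Sequential(*layers)
```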
Snapshot ensemble. An ensemble-based classification model has been adopted to leverage the strengths of several classifiers to achieve better predictions. It helps to build a robust model with better generalization performance, higher accuracy, and a lower error rate 35. There exist several rules for aggregating the predictions of different classifiers, and the performance of the ensemble model varies depending on the aggregation rule adopted. Some commonly used techniques to combine the predictions of the base classifiers are the Horizontal Voting Ensemble, Weighted Average Ensemble, Stacked Generalization, etc. All these methods work well when the base classifiers have learned from the training data in different ways and thus have different generalization errors. However, everything has a cost: traditional ensemble strategies incur the additional computational overhead of training several base learners from scratch with parameter tweaking to achieve the best performance. As opposed to the usual approach of training different base learners, which demands additional computational power and time, Huang et al. 35 proposed a method, known as Snapshot ensemble, to produce many base learners having different generalization errors from a single training run. Stochastic Gradient Descent (SGD) and its extensions are widely used for optimizing neural networks. SGD is capable of avoiding and escaping saddle points and poor local minima, which is a significant advantage, and it is advantageous to converge the models to local minima with flat basins. Different models that have converged to different local minima can have very similar error rates but tend to make different mistakes, which is exactly what is required for creating a robust ensemble classification model. With an increase in the number of trainable parameters, the number of possible local minima also increases exponentially. Therefore, it is not surprising that two networks having the same architecture but different initializations and mini-batch orderings will converge to different solutions. These models with variegated convergence can be exploited by ensembling, in which multiple neural networks trained from different initializations are combined with horizontal voting. Instead of building M models from scratch, the SGD optimizer is allowed to converge to a distinct local minimum M times, and the corresponding models, known as snapshots, are saved. After saving a snapshot model, the learning rate is increased to bring the optimizer out of the basin and converge it again to another local minimum. Note that the model effectively has a different initialization each time the learning rate is raised again, and so it reaches a different local minimum. Repeating these steps M times provides M base learners having different biases. This cyclic behavior of the learning rate can be achieved by the method known as Cosine Annealing, proposed by Loshchilov et al. 38, in which the learning rate is abruptly raised and then quickly lowered following a cosine function. Training process. When experimenting with numerous hyperparameters for training a deep neural network, it is necessary to check the effect of the tuning and to justify the resulting performance. However, a poorly chosen configuration might fail to fit the training data and thus cannot extract useful information from the input data. To cope with this, we have performed the experiment with architecture depth (d) ranging from 1 to 4 and assessed the performance of the model. The architecture depth (d) can be understood as the number of Residual Units stacked on each other to make the proposed network deep or shallow. We achieved the best result for d = 2 and similar results for the others. To reduce the complexity of the model, a depth of 2 units is used for all the residual blocks. Training a DCNN with limited data and without overfitting can be considered a very challenging task. To address this, Dropout and Batch Normalization are used throughout the network, which helps to decrease the misclassification rate. Grayscale images of 200 × 200 resolution are fed to the network for training purposes. The details of the database have already been mentioned in the "Data analysis" section. 85% of the sample images from this database have been used for training the network, and the performance is evaluated on the remaining 15%. First, the proposed IRFacExNet is trained using the Adam optimizer for 100 epochs with a batch size of 16 samples. Then the Adam optimizer is replaced by SGD, as it works best with a cyclic learning rate scheduler. An efficient Snapshot ensemble requires an aggressive cyclic learning rate scheduler so that the optimizer converges to different minima even for minor fluctuations in the learning rate. We have utilized Cosine Annealing as the scheduler, with a slight modification, to vary the learning rate by following a cosine function.
Instead of using an abrupt restart after reaching a minimum, we have applied a gradual rise in the learning rate that allows the optimizer to explore around the local minima and find a better optimization path. Eq. (1) gives the mathematical expression of the scheduler:

α(e) = α_0 + (Δα/2) · (1 + cos(πe/h)),  (1)

where α_0 is the lower bound of the learning rate (10−5), Δα is the difference between the upper and lower bounds of the learning rate (10−2 − 10−5), e is the current epoch number (1–500), and h is the range of a half-cycle of the cosine function (100 epochs). This scheduler considers parameters like the total training epochs, the maximum learning rate, the minimum learning rate, the range of the half-cycle, and the epoch number. Figure 6 shows the variability of the learning rate over the training span of 500 epochs. In the present study, we have used 10−2 and 10−5 as the upper and lower bounds of the learning rate for the scheduler, respectively. The range of the half-cycle has been set to 100 epochs. A snapshot of the model is taken at an interval of 100 epochs, thus producing a total of 5 snapshot models throughout the training process. Due to the cyclic nature of the learning rate, at each snapshot taken at a learning-rate minimum the model reaches a local minimum with a lower error rate, fitting the training data closely (meaning it has a very high variance). Conversely, when the model is saved during the rise of the learning rate, it has a comparatively higher error rate and is biased. Such diversity among the models brings better generalization performance on unseen data. These five snapshot models are combined in all possible combinations for testing purposes, and the performance has been estimated on the test data. Such an ensemble helps to build a robust system and to produce better recognition accuracy. A summary of the hyperparameters used for training the model can be found in Table 2. HFER is considered one of the challenging research problems in the field of computer vision because of the close similarity of the faces humans make for correlated expressions (like Disgust and Anger). The naive model (without the snapshot ensemble technique) successfully recognizes human facial expressions from thermal infrared images with 82.836% classification accuracy, as shown in Table 3. To improve the overall performance of the prediction model, we have employed the ensemble strategy known as Snapshot Ensemble, as stated earlier in the "Training process" section. This technique uses the strengths of different snapshot models having different biases to predict the final results, which improves the performance of the system on unseen data. As stated earlier, during the training of the Snapshot ensemble model, an aggressive learning rate scheduler has been used that allows the model to reach a new local minimum each time there is a drop in the learning rate. As it approaches each local minimum, it extracts variegated information from the training data that is dissimilar to that of other local minima, a property well-suited to building an efficient ensemble classifier. We have saved five snapshots, namely S1, S2, S3, S4, and S5, during the complete training process. The snapshots were combined using the voting aggregation rule to generate various ensemble classifiers. For example, snapshot models S1 and S2 are combined to generate (S1, S2), models S1, S2, and S3 are combined as (S1, S2, S3), and so on.
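A minimal, self-contained sketch of this schedule and of the voting aggregation follows. The bounds, half-cycle, and snapshot interval come from the paper; the exact coded form of Eq. (1) is a reconstruction from the stated definitions, and hard majority voting is one plausible reading of the stated "voting aggregation rule":

```python
import math
import numpy as np

def snapshot_lr(epoch, lr_min=1e-5, lr_max=1e-2, half_cycle=100):
    """Cosine-annealed learning rate with a gradual rise instead of an
    abrupt restart, as in Eq. (1): a(e) = a0 + (da/2)(1 + cos(pi*e/h))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / half_cycle))

# Snapshots S1..S5 are saved every 100 epochs, alternating between
# learning-rate minima (low-error, high-variance models) and maxima
# (higher-error, biased models).
snapshot_epochs = [100, 200, 300, 400, 500]

def majority_vote(per_model_labels):
    """Hard-voting aggregation: per_model_labels has shape (n_models, n_samples)
    and holds non-negative integer class labels; returns the most frequent
    label per sample."""
    votes = np.asarray(per_model_labels)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

For instance, `snapshot_lr(0)` returns the upper bound 10−2, `snapshot_lr(100)` the lower bound 10−5, and `snapshot_lr(200)` the upper bound again, matching the 100-epoch half-cycle described above.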
The performance of the individual snapshot models (S1 to S5) and of all possible combinations is evaluated on the test database. The observed outcomes are reported in Table 3 (Fig. 7). During the training of the model with the cosine learning rate scheduler, the first half-cycle of the cosine function is completed in 100 epochs (Fig. 6), and the first snapshot (S1) is generated. During this phase, there is a gradual decrease in the learning rate that helps the optimizer to find a local minimum. This makes the model perform better on data similar to the training samples; however, it may not provide good predictions on unfamiliar images. Hence, the learning rate starts increasing gradually and reaches its upper bound, which takes another 100 epochs and produces the second snapshot, S2. This allows the optimizer to explore around the minimum and rise along hills having a gentle slope. The variation of the loss during training of the model can be seen in Fig. 8. This helps to decrease the variance of the model and thus improves the overall predictive performance. Therefore, the alternating nature of the learning rate allows the SGD optimizer to produce models having different properties. The ensemble works best if the base models incorporated in it have low test errors and do not overlap in the set of examples they fail to classify 35. The cyclic learning rate scheduler helps the optimizer to visit several local minima before converging to a final solution. At intervals of 100 epochs, the snapshots (S1 to S5) are successively taken and used for the development of the ensemble classifier in the current experimentation. The change in accuracy during the training process can be seen in Fig. 9. All possible combinations of the snapshot models are assessed on the test data for performance evaluation. Among all possible combinations, the best performance has been achieved by the ensemble system when S1, S2, S3, and S5 are combined, producing 88.433% recognition accuracy, as shown in Table 3 and Fig. 7. The likely reason for this best recognition performance is that these snapshot models have low test errors and do not overlap in the set of examples they fail to classify. The naive model has a recognition accuracy of 82.836%; after adopting the ensemble strategy, the model produces 88.433% recognition accuracy, thereby increasing the prediction accuracy by 5.597%. In this section, we have also tried to interpret the working procedure of the model used for facial expression classification. Interpretation of the model is required because facial expressions depend on different combinations of facial landmarks like the eyebrows, nose, and lips. Interpretation helps to verify whether the model is attending to the correct areas of an image before making a decision. Selvaraju et al. 39 introduced the Gradient-Weighted Class Activation Mapping (Grad-CAM) technique, which helps to visualize the explainability of any deep learning model. Figure 10 pictorially demonstrates the functionality of Grad-CAM applied to our model. The process takes an image as input, and the detection technique is applied to the image using the proposed model. After the successful calculation of the predicted class, Grad-CAM can be applied to any of the convolution layers; in our case, we have considered the last layer for Grad-CAM analysis.
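Below is a minimal PyTorch sketch of this Grad-CAM computation (the equations it implements are detailed in the following paragraphs); the hook-based mechanics and the final normalization are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Coarse Grad-CAM heat map for a single input of shape (1, C, H, W).
    `target_layer` is the model's last convolution layer; forward/backward
    hooks capture its activations A^k and the gradients of the class score y^c."""
    acts, grads = {}, {}
    fwd = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    scores = model(image)                          # pre-softmax class scores y^c
    if class_idx is None:
        class_idx = scores.argmax(dim=1).item()    # explain the predicted class
    model.zero_grad()
    scores[0, class_idx].backward()                # d y^c / d A^k via backprop
    fwd.remove()
    bwd.remove()

    alpha = grads['g'].mean(dim=(2, 3), keepdim=True)  # Eq. (2): GAP of the gradients
    cam = F.relu((alpha * acts['a']).sum(dim=1))       # Eq. (3): ReLU of the weighted sum
    return cam / (cam.max() + 1e-8)                    # normalized heat map, shape (1, h, w)
```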
The gradient information flowing into the last convolution layer of the proposed deep learning model is used by Grad-CAM. This information is useful for interpreting the decision of each neuron, which in turn helps to interpret the whole deep learning model. To calculate the class-discriminative localization map of width w and height h for any class c, we first compute the output score y^c of the class c, calculated before the softmax function. Then the gradient of y^c is calculated with respect to A^k, the feature maps of a convolution layer. These gradients are global-average-pooled to obtain the neuron importance weights α_k for the target class, as shown in Eq. (2):

α_k = (1/Z) Σ_i Σ_j ∂y^c / ∂A^k_ij,  (2)

where Z is the number of spatial locations in the feature map. After calculating α_k for the target class c, a weighted combination of the activation maps is performed, followed by an activation function, ReLU (Rectified Linear Unit), as shown in Eq. (3):

L^c = ReLU(Σ_k α_k A^k).  (3)

ReLU is applied to the linear combination because we are only interested in visualizing the features having a positive influence on the class of interest; without ReLU, the class activation map would also highlight features contributing negatively to the prediction and hence achieve low localization performance. This results in a coarse heatmap of the same size as the convolutional feature maps. Figure 11 illustrates the Grad-CAM outputs for different expressions in IR facial images. It can be observed that for each facial expression, the Grad-CAM output highlights the facial structure most prominently. This visualization confirms that the features extracted by IRFacExNet are prominent and focus on various facial landmarks; using their combination, the model classifies the facial expressions. In the expression of Fear, the two eyebrows stand out while the space around the face remains blue, which means that the eyebrow features are the most important. Similarly, for the Disgust expression, all the facial landmarks, including the nose, eyebrows, and mouth, have been used as the most important features to classify the expression as Disgust (Fig. 12). From the confusion matrix shown in Fig. 13, it can be observed that the model gets confused by the similar faces a human makes while feeling different emotions. For example, the highest misclassification has been observed between the expressions Disgust and Fear, which are strongly correlated with each other. These expressions are genuinely difficult to distinguish. The second-highest misclassification has been found between the expressions Anger and Disgust. Interestingly, the expression Fear has the lowest recognition rate, 76.47%, and is mostly detected as the expressions Anger, Contempt, Disgust, and Sadness. Figure 12 highlights the most confusing pairs of facial expressions, owing to which they get misclassified. As discussed above, humans often make indistinguishable faces while expressing these emotions, and this misleads the model. Also, it is to be noted that the dataset we consider here for evaluation is comparatively new, and hence we have not found many works which also considered this dataset.
However, the recognition rate still shows a large improvement over the classification rate reported by Kopaczka et al. 12. Table 4 compares the efficiency of the proposed method with some previously developed methods on the considered database 12. It can be seen that our technique outperforms the previous recognition systems by achieving 88.433% classification accuracy. The key reasons for the higher accuracy are:
• The proposed IRFacExNet model is very deep, which helps to extract the dominant features from the input images, leading to better classification accuracy.
• Due to the presence of depth-wise convolution operations in the model, it extracts features over a wider area.
• The inclusion of the snapshot ensemble strategy helps to improve the performance and robustness of the model.
• Besides, the IRDatabase used is fairly balanced, which ensures that the model is not biased towards a particular class.
The confusion matrix obtained from the proposed methodology, when compared with the confusion matrix (Fig. 14) obtained by Kopaczka et al. 12, indicates improvements in recognition accuracy both class-wise and overall. This rise in recognition accuracy is the result of using DL methods. In the case of the Neutral and Surprise expressions, the precision is improved by almost a factor of 2. The lowest recognition accuracy reported there was for the expression Contempt; in our case, it is for Fear. These two expressions are highly correlated, due to which two input samples of Contempt are misclassified as Fear in our study. Benchmark results on this database are still scarce. HFER is an active research area in the domain of computer vision. In this paper, we have proposed an efficient deep learning model, called IRFacExNet, for the recognition of human expressions from thermal images. In line with the current research trend, we have relied on a DCNN architecture to develop this model. We have used two structural units, namely the Residual Unit and the Transformation Unit, each with its distinct strengths; these are able to extract useful features from human faces for the detection of the various expressions. This naive model is able to achieve a recognition rate of 82.836%. To make the recognition system more robust, we have employed the Snapshot ensemble technique, trained using a cyclic learning rate scheduler. The predictions of the multiple models obtained from the Snapshot ensemble are combined to achieve a better prediction model. The ensemble system is able to achieve a state-of-the-art accuracy of 88.433% on the considered database. Although the proposed ensemble model produces good recognition accuracy, there is some scope for further improvement. The training of the IRFacExNet model is computationally expensive compared to the naive model. The long training time can be reduced by architectural optimization of the model, which may reduce its complexity. Apart from that, an attention mechanism can be added to the CNN architecture, which might help to reduce the misclassification among similar types of expressions. To enhance the robustness of the proposed model, the recognition framework can be aided with multimodal sensors that may help to find facial features 43, 44. Again, towards making the model more explainable, several strategies can be adopted to decipher the decision-making process of deep learning models 45.
It might help one to make high-level decisions with confidence, as the decisions are driven by combinations of data features specifically selected to achieve the desired results. To obtain the most relevant features, some feature selection algorithms can also be employed, which might decrease the computational overhead. Apart from that, the proposed model can be evaluated on other large-sized databases, which would help to ensure its robustness.

Received: 2 July 2021; Accepted: 5 October 2021

References
Communication without words
Facial Action Coding System
What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS)
3-d model-based synthesis of facial expressions and shape deformation
An application of optical flow-extraction of facial expression
Recognition of facial expression from optical flow
Recognition of facial expressions using potential net and KL expansion
Analysis of neural network recognition characteristics of 6 basic facial expressions
Thermal infrared imaging-based affective computing and its application to facilitate human robot interaction: A review
Emotion analysis in children through facial emissivity of infrared thermal imaging
Infrared thermography as a measure of emotion response
A fully annotated thermal face database and its application for thermal facial expression recognition
Facial expression classification: An approach based on the fusion of facial deformations using the transferable belief model
Improved model for facial expression classification for fear and sadness using local binary pattern histogram
Salient feature and reliable classifier selection for facial expression classification
An approach for facial expression classification
Towards social robots: Automatic evaluation of human-robot interaction by face detection and expression classification
A spatio-spectral hybrid convolutional architecture for hyperspectral document authentication
Deep pain: Exploiting long short-term memory networks for facial expression classification
Detection, tracking, and classification of action units in facial expression
Face identification using thermal image processing
Facial expression recognition using thermal image processing and neural network
Facial expression recognition using thermal image processing and neural network
Human emotion recognition using thermal image processing and eigenfaces
Facial expression recognition from infrared thermal videos
Visual and thermal image processing for facial specific landmark detection to infer emotions in a child-robot interaction
Automated facial expression classification and affect interpretation using infrared measurement of facial skin temperature variations
Thermal facial expression recognition using modified resnet152
An automated and efficient convolutional architecture for disguise-invariant face recognition using noise-based data augmentation and deep transfer learning
Facial expression recognition for low resolution images using convolutional neural networks and denoising techniques
Facial expression recognition in the wild, by fusion of deep learnt and hand-crafted features
Deep facial expression recognition: A survey
Multi-scale context aggregation by dilated convolutions
Snapshot ensembles: Train 1, get m for free
Deep residual learning for image recognition
Batch normalization: Accelerating deep network training by reducing internal covariate shift
SGDR: Stochastic gradient descent with warm restarts
Grad-CAM: Visual explanations from deep networks via gradient-based localization
A modular system for detection, tracking and analysis of human faces in thermal infrared recordings
A comprehensive database for benchmarking imaging systems
A deep learning approach for thermal face emotion recognition
A review on automatic facial expression recognition systems assisted by multimodal sensor data
Facial expression recognition system using multimodal sensors
Neuro-symbolic visual reasoning for multimedia event processing: Overview, prospects and challenges

Competing interests: The authors declare no competing interests.
Correspondence and requests for materials should be addressed to D.K.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.