key: cord-0804932-e5fbu3ct
authors: Sreevidya, P.; Veni, S.; Ramana Murthy, O. V.
title: Elder emotion classification through multimodal fusion of intermediate layers and cross-modal transfer learning
date: 2022-01-18
journal: Signal Image Video Process
DOI: 10.1007/s11760-021-02079-x
sha: 61688ca3eea7e50a245adf428d24506dca1e8c10
doc_id: 804932
cord_uid: e5fbu3ct

The objective of this work is to develop an automated emotion recognition system targeted specifically at elderly people. A multimodal system is developed that integrates information from the audio and video modalities. The database selected for the experiments is ElderReact, which contains 1323 video clips of 3 to 8 s duration of people above the age of 50. All six available emotions (disgust, anger, fear, happiness, sadness and surprise) are considered. To develop an automated emotion recognition system for aged adults, different modeling techniques are attempted. Features are extracted, and neural network models trained with backpropagation are developed. For the raw video model, transfer learning from pretrained networks is applied; convolutional neural network and long short-term memory-based models maintain the temporal continuity between frames while capturing the emotions. For the audio model, cross-modal transfer learning is applied. The two models are combined by fusing their intermediate layers, which are selected through a grid-based search algorithm. The accuracy and F1-score show that the proposed approach outperforms the state-of-the-art results. Classification results across all the emotions show a minimum relative improvement of 6.5% for happiness and a maximum of 46% for sadness over the baseline results.

Automated emotion recognition finds application in advertisement, automated job interviews, interactive voice assistants, human-assistive robotics, etc. Emotions can be identified from different modalities such as audio, video, images, gestures/poses, text or physiological signals. Multimedia signals, in particular, are a noninvasive, information-rich medium that can be explored with these methods. Handcrafted features or deep features can be used in machine learning-based methods for emotion classification [1]. Deep learning models trained on established datasets such as ImageNet [3], CIFAR-10 [2] and COCO [4] are available for image classification, while audio models are trained with datasets such as AudioSet. There are tailor-made multimodal datasets such as IEMOCAP [5], EmoReact [6], AffectNet [7] and EMOTIC [8] for emotion recognition.

When the problem of emotion recognition is addressed, research results show a marked difference between the display of emotions in younger and elderly people [9]. According to the meta-analysis by Hayes et al. [10], older adults identify facial expressions of sadness, fear and anger less accurately than younger people. The effect is smaller for surprise and happiness, and disgust is identified as accurately as by the young. This implies that custom-made automated systems need to be developed to address the emotion-recognition requirements of aged adults. These findings were the motivation behind this work. The proposed work attempts to address the following issues:
1. To identify suitable methods for developing audio and video models for emotion recognition in aged individuals.
2. To suggest a suitable multimodal fusion technique for an emotion recognition system for aged adults.
3. To compare the performance of the results obtained in the various experiments conducted.

The proposed model is a fusion of the audio and video modalities. The audio models are developed using cross-modal transfer learning techniques in addition to a feature-based approach. We use pretrained Inception networks, trained on ImageNet, to transfer knowledge to audio spectrograms. The video signals are sampled, and a convolutional neural network-long short-term memory (CNN-LSTM) network is used to develop the model. Finally, fusion between the two modalities is performed. Experiments were conducted to analyze the significance of customized datasets for people above the age of 60.

The remaining sections of this paper are arranged as follows. In Sect. 2, the state of the art in emotion recognition and multimodal approaches is discussed. In Sect. 3, the proposed framework is presented, and Sect. 4 describes the different experiments conducted; the dataset for emotion recognition in elderly people is also discussed there. The analysis of the results is given in Sect. 5, which is followed by the conclusion and future work.

Here, the state-of-the-art techniques for emotion recognition in the wild are discussed. Since the work addresses the problem of identifying emotions in elderly people, the available datasets for this purpose are investigated. The ElderReact [11] dataset provides emotion-reaction video clips featuring only elderly adults as actors. FACES and Lifespan are other datasets that contain emotion annotations for elderly people. For emotion recognition from audio signals, [12] suggested a cross-modal transfer learning framework [13] that transfers knowledge from AlexNet, which is trained on ImageNet; it can thus be concluded that large-scale image classification benchmarks can help audio classification. Similarly, in [14], spectrogram-based CNN models for speech emotion recognition are implemented on the Berlin dataset [15]. According to [16, 17], the techniques of neural style transfer [18] can be applied to spectrograms, as they are two-dimensional representations of audio frequencies with respect to time. In [19], Poorna et al. applied a multistage learning network for classifying speech emotions in an Arabic-speaking community. Kwon et al. [20] proposed an artificial intelligence-assisted deep stride convolutional neural network architecture using the plain-nets strategy to learn discriminative and salient features from spectrograms of speech signals that are enhanced in prior steps. Boateng et al. [21] applied a transfer learning technique using the YAMNet CNN for classifying emotions in elderly individuals; YAMNet is a pretrained network with 1024 embeddings, based on MobileNet. Experiments on FER-2013 and AffectNet were conducted in [22] to show the combined effect of handcrafted and deep features. In EmotiW 2019, Zhou and his team [23] adopted a feature fusion strategy for the classification of emotions. Zadeh et al. [24] introduced a multimodal dictionary to better capture the interaction between facial gestures and spoken words (essentially positive, negative and neutral expressions) when expressing sentiment. In [25], a late fusion network was used for sentiment classification on the MOUD dataset, and [26] used the fusion of text and speech for emotion classification on the eNTERFACE dataset. Multimodal clues from videos were separated into different modalities and explored in [27, 28].
The hybrid deep learning framework introduced in these works includes static spatial appearance information, motion patterns within a short time window, audio information, as well as long-range temporal dynamics: three CNN models operate on static frames, and temporal relations are extracted through two LSTM networks. Huang et al. [29] used a transformer model and an LSTM model for classifying the audio and video modalities; a multi-head attention mechanism produces multimodal emotional intermediate representations from a common semantic feature space after encoding the audio and visual modalities.

The proposed framework incorporates a multimodal interaction between the audio and video modalities. The audio model is developed by combining spectrogram features as well as handcrafted features. In the video modality, CNN-based networks are incorporated to learn the information from the videos; we also tried a CNN network and an LSTM network for modeling the raw video data. The input data were given to the network after performing the necessary preprocessing steps. Further, feature-level and hybrid approaches are adopted to develop the final model, as shown in Fig. 1.

A major limitation in classifying the emotions of elderly people is the lack of suitable datasets for conducting the experiments. ElderReact, a dataset that contains emotion annotations only for people above the age of fifty, is selected for the experiments; it is one of the largest datasets available for emotion recognition in aged individuals. It contains 1323 video clips from 46 elderly people, divided into 615 clips for training, 353 clips for testing and 355 clips for validation [11]. The videos were collected from YouTube channels, and the dataset was annotated manually for the six basic emotions along with valence and gender using Amazon Mechanical Turk. The emotions considered in this work are anger, fear, disgust, happiness, surprise and sadness, along with the valence and gender information. Cropped face samples of aged people from the database are shown in Fig. 2, and the distribution of annotations across the train, validation and test segments is shown in Fig. 3. The two other datasets considered here for comparison purposes are EmoReact and RAVDESS [30].

Emotion classification based on the audio signals is carried out with both 1-D and 2-D approaches. First, audio features such as prosody, spectral coefficients and voice quality features (tenseness, creakiness, etc.) are extracted; 72 features are selected. The features are extracted using the open-source tool COVAREP [31] with a frame length of 10 ms. The model based on the handcrafted features is a deep neural network with two 1-D CNN layers and two dense layers. The numbers of filters are 64 and 32, respectively, and the dense layers have 256 and 128 neurons. The mean square error is monitored for convergence, and the network is optimized with the Adam optimizer with a learning rate of 0.001.

Further, a spectrogram model was developed. The spectrogram is a 2-D representation of the audio signal [32]; it captures the instantaneous frequency content of the audio feed, with amplitudes mapped into intensity levels. To generate the spectrograms, the audio files are segmented uniformly. Each audio sample is sampled at 44.1 kHz. The images constitute patches of 20 ms with an overlap of 75%. The short-time Fourier transform (STFT) is applied to the original signal; a sketch of this extraction step is given below.
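The spectrogram extraction can be sketched with standard audio tooling. The following is a minimal sketch assuming librosa is used to produce the images (librosa is used for the spectrogram experiments reported later [37]); the FFT length, output image size and color map below are illustrative assumptions rather than values reported in the paper.

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

def audio_to_spectrogram(wav_path, out_path,
                         sr=44100,      # sampling rate used in the paper
                         win_ms=10.0,   # Hann window length (10 ms)
                         overlap=0.75): # 75% overlap between frames
    """Convert an audio clip into a pseudo-color-mapped spectrogram image."""
    y, sr = librosa.load(wav_path, sr=sr)
    win_length = int(sr * win_ms / 1000)          # samples per window
    hop_length = int(win_length * (1 - overlap))  # 25% hop size
    stft = librosa.stft(y, n_fft=1024, win_length=win_length,
                        hop_length=hop_length, window="hann")
    # Map magnitudes to a dB (intensity) scale before rendering.
    db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    fig, ax = plt.subplots(figsize=(2, 2), dpi=100)
    librosa.display.specshow(db, sr=sr, hop_length=hop_length,
                             cmap="viridis", ax=ax)  # pseudo-color mapping
    ax.set_axis_off()
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```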
A Hann window of length 10 ms was selected, which hops along the signal while adjusting the weighting factor. The spectrogram images obtained were pseudo-color-mapped, and these images were standardized before being fed to the proposed model. The choice of window type decides the sidelobe suppression, and the hop size determines the time-frequency smearing: the Hann window ensures a smooth transition from the main lobe to the sidelobes and avoids discontinuities due to windowing, and the 25% hop size was suitable for improving the time resolution. The resulting model is shown in Fig. 4. The idea here is to utilize the rich set of weights of the pretrained models: models trained on ImageNet are retrained with the training data of the dataset. Inception-V2 was identified as a suitable pretrained network [33] because of the separable convolutions in its inception units; the filter banks in the network make the modules wider rather than just deeper, and an internal regularization prevents overfitting.

For the video model, features were first extracted from the visual modality through OpenFace [34], and face-only frames were selected from the extracted frames. A set of 178 features was selected based on gaze, head pose, facial action units and non-rigid shape parameters, as these are the prominent visual indicators of the emotion that the participant is displaying [11]. A CNN model was developed and trained to classify the emotions from these handcrafted features. A batch normalization layer [35] is incorporated, which prevents the model from overfitting by reducing the internal covariate shift among minibatches.

The raw video data were sampled, and frames containing only face images were selected. During preprocessing, established face recognition algorithms were executed to obtain cropped face-only images [36]; the multi-task cascaded convolutional neural network (MTCNN) algorithm was applied to select face-only frames from the video data. As a result, 90 face-only images of size 160 × 160 are stacked into a single folder per clip. A CNN network was then developed and pretrained with the FER-2013 database, which contains a versatile set of images covering complex and subtle emotions.

Figure 3 shows that the dataset is imbalanced except for the emotions happiness and surprise, so at the preprocessing stage the dataset is balanced with resampling and subsampling methods. The data are normalized between the minimum and maximum of the available feature values. The models are developed using the extracted features, and the results are tabulated in Table 1, which presents both accuracy and F1-score; a good F1-score above 70% is obtained for happiness. Further, for the feature-based model in the visual modality, a 1-D model was again developed. The 1-D CNN layers have 256 and 128 filters, respectively, each followed by dropout layers of 0.2 and 0.5. No regularization was used other than batch normalization. Here again, the Adam optimizer is used with a learning rate of 0.001. The results are tabulated in Table 2. It was observed that increasing the depth of the model resulted in overfitting.

The features are then concatenated, and fusion of the two modalities is attempted: the embeddings from intermediate layers of the two models are taken and fused together, as sketched below.
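The following Keras sketch illustrates this intermediate-layer fusion, assuming a TensorFlow/Keras implementation (the paper does not name its framework). The layer names, head widths, dropout rates, loss and the decision to freeze the unimodal branches are illustrative assumptions; the actual configuration of the fused model is described in the next paragraph.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fused_model(audio_model, video_model,
                      audio_layer="audio_emb",   # hypothetical layer names;
                      video_layer="video_emb"):  # the paper selects them empirically
    """Concatenate intermediate-layer embeddings of two trained unimodal models."""
    # Sub-models exposing the chosen intermediate activations.
    audio_emb = Model(audio_model.input, audio_model.get_layer(audio_layer).output)
    video_emb = Model(video_model.input, video_model.get_layer(video_layer).output)
    audio_emb.trainable = False  # assumption: unimodal branches kept frozen
    video_emb.trainable = False

    a_in = tf.keras.Input(shape=audio_model.input_shape[1:])
    v_in = tf.keras.Input(shape=video_model.input_shape[1:])
    a = layers.Flatten()(audio_emb(a_in))  # flatten in case the embedding is not 1-D
    v = layers.Flatten()(video_emb(v_in))
    x = layers.Concatenate()([a, v])

    # Dense head with decreasing width, dropout and batch normalization,
    # mirroring the fused classifier described in the text (sizes assumed).
    for units, rate in [(256, 0.3), (128, 0.3), (64, 0.2)]:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(rate)(x)
    out = layers.Dense(2, activation="softmax")(x)  # per-emotion presence/absence

    model = Model([a_in, v_in], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",  # loss is an assumption
                  metrics=["accuracy"])
    return model
```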
The fused model has three dense layers with a decreasing number of neurons. Each layer is followed by a dropout rate chosen carefully to avoid overfitting, and batch normalization is applied to normalize the minibatches before the classifier. The softmax activation function is used in the last layer. Hyperparameter tuning was carried out with the Adam optimizer, with a learning rate of 0.001, and early stopping was applied with a patience of 10. The results are tabulated in Table 3. There is a significant improvement in all results except for disgust, especially in terms of F1-score: the F1-score of happiness increased by 6.8%, that of surprise increased by 4%, and the fear class also improved. For comparison, classical machine learning algorithms such as random forest, naive Bayes, SVM and XGBoost were applied to the dataset, and the results are given in Table 4. Except for happiness, none of the classes performed well with these algorithms, whereas the results of the proposed model were strongly consistent. The comparison between the combined model and the unimodal models is shown in Fig. 6: the fusion model performs better across all classes, with only a slight decrease of 0.3% in the F1-score for anger.

The raw video models were developed with convolutional neural network-based approaches. The model has two convolutional neural network (CNN) layers followed by a flattening layer; the 2-D CNN layer has 32 filters of size 3 × 3. Three dense layers of 512, 256 and 128 neurons, respectively, are then added, with dropouts of 0.1 and batch normalization for normalizing the extracted features. Instead of the rectified linear unit (ReLU) activation function, leaky ReLU was used; its small negative slope helps incorporate negative values as well. The input to this model is an embedding of size 90 × 128, extracted from a CNN model pretrained on the FER-2013 database, so the model starts from weights learned on that database, and the embeddings of the preprocessed image frames are predicted from this pretrained CNN model. The results obtained through this approach are tabulated in Table 5. The video model gives the best result for happiness, and the failure to distinguish between disgust and anger is the reason for the lower F1-score for anger (56.1%).

The next experiment was to develop a model based on the raw spectrogram images generated through librosa [37], adopting the cross-modal transfer learning technique. It was experimentally determined to take the output from the 'mixed9' layer of the pretrained Inception model, a high-dimensional layer of 2048 embeddings in a 2-D setting. Three Conv2D layers with 512 filters of size 3 × 3 are applied on top; the layers are then flattened, and two more dense layers are added. The model was optimized through hyperparameter tuning; Nadam [38] was found to give the best results, with a learning rate of 0.00001. Table 6 shows the accuracy and F1-score obtained with this customized model for the elder emotion recognition task; a sketch of the transfer setup is shown below.
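The following Keras sketch illustrates this cross-modal transfer setup, again assuming a TensorFlow/Keras implementation. Keras ships InceptionV3, which exposes a layer named 'mixed9' with 2048 channels, and it is used here as a stand-in for the Inception backbone referenced in the paper; the input size, dense head widths and loss are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_spectrogram_transfer_model(input_shape=(299, 299, 3), n_classes=2):
    """Cross-modal transfer: reuse ImageNet weights on audio spectrogram images."""
    # InceptionV3 is the Keras backbone that provides a 'mixed9' layer; the
    # paper itself refers to an Inception-V2 backbone.
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=input_shape)
    features = backbone.get_layer("mixed9").output  # 2-D map with 2048 channels

    x = features
    for _ in range(3):  # three Conv2D layers with 512 filters of size 3 x 3
        x = layers.Conv2D(512, (3, 3), padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # head sizes are assumptions
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-5),
                  loss="categorical_crossentropy",  # loss is an assumption
                  metrics=["accuracy"])
    return model
```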
Once the raw video model and the spectrogram model were developed, the next step was to evaluate the fusion model, whose structure is shown in Fig. 1. The embeddings are taken from intermediate layers of both the audio and video models; the layers are selected through a grid-based search, and the embeddings are concatenated. Table 7 shows the accuracy and F1-score of the fusion model. The selected model has 1-D CNN layers with 256 filters, followed by dense layers. As an alternative to the CNN model, an LSTM model was also applied: it has two LSTM layers of 128 units, a dropout of 0.1, and finally a dense layer. Its results were better than those of the CNN model and are also given in Table 7.

A comparison between the spectrogram-based audio model and the feature-based audio model is given in Fig. 7. The feature-based model performs better than the spectrogram-based model for all emotions with respect to F1-score, although the accuracy of the spectrogram model is higher for sadness and surprise. For the video model, on the other hand, the F1-scores for disgust and anger are better with the deep-feature model than with the handcrafted-feature model.

In the next step, the spectrogram model was applied to the EmoReact database; the results were comparable for the positive emotions only. Then, the audio feature model was applied in a generalized audio emotion classification setup trained on the RAVDESS dataset, and the same pattern was observed. These results are tabulated in Tables 7 and 8. Further, the video models are compared in Fig. 8: for anger and disgust, the model based on deep features is better, while for fear, happiness, sadness and surprise the first model performs slightly better.

With the rapid developments in machine intelligence and deep learning techniques, automated emotion recognition systems are being developed, but customized systems for emotion recognition based on age or sex are still required. This work proposes automated emotion classification for aged people above 60. Various unimodal and fusion models were tried, and the feature-based fusion model was found to give the best results. Compared to the generalized datasets, the ElderReact dataset gives better consistency across all results for the proposed models. More databases tailor-made for elder emotions are needed so that more sophisticated systems can be developed. The results are also in accordance with the meta-analysis on the effect of age on the expression of emotions.
References
Local learning with deep and handcrafted features for facial expression recognition
Learning multiple layers of features from tiny images
ImageNet classification with deep convolutional neural networks
Microsoft COCO: common objects in context
IEMOCAP: interactive emotional dyadic motion capture database
EmoReact: a multimodal approach and dataset for recognizing emotional responses in children
AffectNet: a database for facial expression, valence, and arousal computing in the wild
Context based emotion recognition using EMOTIC dataset
Effects of age on the identification of emotions in facial expressions: a meta-analysis
Task characteristics influence facial emotion recognition age-effects: a meta-analytic review
ElderReact: a multimodal dataset for recognizing emotional response in aging adults
Cross-domain transfer learning for complex emotion recognition
Cross-modal generalization: learning in low resource modalities via meta-alignment
Speech emotion recognition from spectrograms with deep convolutional neural network
A database of German emotional speech
Image style transfer using convolutional neural networks
Demystifying neural style transfer
Neural style transfer for audio spectrograms
Multistage classification scheme to enhance speech emotion recognition
A CNN-assisted enhanced audio signal processing for speech emotion recognition
Speech emotion recognition among elderly individuals using multimodal fusion and transfer learning
Local learning with deep and handcrafted features for facial expression recognition
Exploring emotion features and fusion strategies for audio-video emotion recognition
Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages
Sentiment analysis by deep learning approaches
Hybrid approach for emotion classification of audio conversation based on text and speech mining
Modeling multimodal clues in a hybrid deep learning framework for video classification
A new hybrid deep learning model for human action recognition
Multimodal transformer fusion for continuous emotion recognition
Emotions understanding model from spoken language using deep neural networks and mel-frequency cepstral coefficients
COVAREP: a collaborative voice analysis repository for speech technologies
Amplitude-frequency analysis of emotional speech using transfer learning and classification of spectrogram images
Inception-v4, Inception-ResNet and the impact of residual connections on learning
OpenFace: an open source facial behavior analysis toolkit
Batch normalization: accelerating deep network training by reducing internal covariate shift
Performance evaluation and comparison of software for face recognition, based on Dlib and OpenCV library
Incorporating Nesterov momentum into Adam