title: Fusing traditionally extracted features with deep learned features from the speech spectrogram for anger and stress detection using convolution neural network
authors: Kapoor, Shalini; Kumar, Tarun
date: 2022-04-08
journal: Multimed Tools Appl
DOI: 10.1007/s11042-022-12886-0

Stress and anger are two negative emotions that affect individuals both mentally and physically, and they need to be addressed as early as possible. Automated systems are therefore highly desirable for monitoring mental states and detecting early signs of emotional health issues. In the present work a convolutional neural network is proposed for anger and stress detection using handcrafted features and deep learned features from the spectrogram. The objective of using a combined feature set is to gather information from two different representations of the speech signal, obtaining more prominent features and boosting recognition accuracy. The proposed method of emotion assessment is more computationally efficient than similar approaches used for emotion assessment. Preliminary results obtained on experimental evaluation of the proposed approach on three datasets, the Toronto Emotional Speech Set (TESS), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Berlin Emotional Database (EMO-DB), indicate that categorical accuracy is boosted and cross-entropy loss is reduced to a considerable extent. The proposed convolutional neural network (CNN) obtains training (T) and validation (V) categorical accuracy of T = 93.7%, V = 95.6% for TESS, T = 97.5%, V = 95.6% for EMO-DB, and T = 96.7%, V = 96.7% for the RAVDESS dataset.

Stress is a psychological state in which the body's physical and mental balance is disturbed due to physical, mental, or emotional strain [6]. Anger and stress are feelings that erupt due to displeasure, hurting an individual's personal life, health, and interpersonal relations. Today stress is a global concern, with more and more people experiencing stress in their daily lives. This necessitates a system that could continuously monitor and highlight our affective states such as anger and stress. Alerts generated by such systems would enable us to take prompt and timely action to avoid the associated risks. Nowadays there has been a rapid increase in the demand for smart, voice-driven interfaces due to the comfort and convenience they offer. Voice is expected to be a preferred medium for interacting with next-generation user interfaces, besides gesture, brain-computer, augmented reality, and other interfaces.

According to Low et al. [15], speech is not mere communication of messages but also carries a wealth of information related to mood, stress, psychological behavior, and mental health [12]. There are many cues indicating that objectively measurable speech parameters can reflect the emotional state of a person [7]. Emotions are expressed in speech through different speaking styles, tone of voice, rate of speech, intonation, etc. Speech is composed of two important components: the linguistic component, which is what is said, and the paralinguistic component, which refers to how it is said. Speech not only contains the message but also information related to the speaker's affective state, such as feelings and emotions.
Even after the extensive research carried out in the field of speech emotion recognition (SER), researchers still face many challenges, and we are still quite far from the desirable performance and accuracy. An emotional speech dataset that is ethnically diverse, phonetically and prosodically balanced, and big enough to cover possible variations related to age, gender, culture, style, and language is required, and we still lack such a database. Annotation of emotion in natural speech corpora is a time-consuming and difficult task, so a robust mechanism to annotate natural speech corpora needs to be developed. To date, there is no consensus regarding the optimal size of the unit of analysis or the best set of features. Most of the studies carried out in this field are based on isolated speech, while real-time SER systems require continuous speech processing. Often the perception of emotions is entirely different from internal feelings, so a mechanism to gauge internal emotions through deeper insight into the cognitive processes is required. Models that could differentiate between expressed and experienced emotions, using information gathered from other sources that provide deeper insight into the cognitive process, need to be developed. We still lack a compact feature set that could differentiate the emotional patterns associated with particular emotions [2].

According to past research, two approaches have been used for SER: traditional and deep learning. The traditional approach follows a pipeline architecture with steps such as feature extraction, feature selection, and classification. Each step needs to be optimized separately, which is quite time-consuming and cumbersome. Optimal feature selection is a crucial step in SER with a high impact on the accuracy and performance of the SER system, and it requires expertise and domain knowledge. Deep neural network (DNN) classifiers have provided an elegant solution to the problem of optimal feature selection by bypassing it. The concept is to employ an end-to-end network that receives raw data as input and outputs a class label. It is not necessary to compute hand-crafted features or to figure out which parameters are best for categorization; the network takes care of everything. Specifically, during the training phase, the network parameters (i.e., the weights and bias values of the network nodes) are tuned to behave as features that efficiently split the input into the required categories. In comparison to conventional classification, this otherwise straightforward technique comes with substantially higher needs for labeled data samples.

The adoption of deep learning (DL) techniques was a major turning point in SER. In a wide range of classification applications, supervised DL neural network models have been shown to outperform classical techniques, with image classification being particularly successful [22]. DL is also becoming more important in the early detection of the new coronavirus (COVID-19). In many hospitals throughout the world, DL has become the primary technique for automatic COVID-19 categorization and detection using chest X-ray pictures or other types of imaging [25]. It is also used in the detection of skin disease, plant diseases [4], cervical cancer [10], disease classification [17], etc.
Due to the advent of CNNs, various studies have shown the feasibility of using spectrograms for SER. The spectrogram is a 2D representation of the speech signal carrying useful patterns related to the speaker's affective state and, importantly, it maintains the signal in its entirety. Both traditional and deep learning approaches offer their own advantages and disadvantages. Using the traditional approach for SER, there is the possibility of biased feature selection or of missing important features. Deep learning algorithms show good performance on the training dataset but sometimes show poor performance on other datasets. DNNs require the configuration of millions of parameters, each with intricate interrelationships; they are referred to as black boxes because the model's behavior is difficult to explain even when the structure and weights are visible. A crucial impediment to fully exploiting the promise of DNNs for emotion identification is the unavailability of a sufficient number of emotion-labeled speech datasets, which makes training a deep network from scratch difficult. Traditional and deep learning-based techniques have clear trade-offs: traditional algorithms are well-known, transparent, and optimized for performance and power efficiency, whereas DL provides higher accuracy and variety at the expense of a significant amount of computing resources. Hybrid approaches combine traditional and deep learning techniques to get the best of both worlds.

In the proposed work we have used a hybrid feature set, obtained by fusing handcrafted and deep learned features, for emotion assessment. We have tried to assess three affective states: anger, stress, and neutral. The datasets used in the proposed study are TESS, EMO-DB, and RAVDESS; the aim of using three different datasets is to validate the proposed speech emotion recognition model cross-culturally. Feature fusion helps in gathering complementary information and boosting the accuracy of recognition. An emotion change detection (ECD) module is integrated with the emotion assessment module. ECD helps in localizing the emotion change points in continuous speech signals and triggering the emotion recognition algorithm for accurate emotion classification. Most of the studies in the field of SER have used equal-size segments for emotion assessment, while in real life emotions persist for different periods. Emotion change detection marks the boundaries between different emotions based on how long they persist; these boundaries further help in speech segmentation for emotion assessment. Glottal features are used for emotion change detection. The ECD module has considerably reduced the processing time and computational resources required to implement the proposed algorithm, compared with an always-on speech emotion recognition system: as most of a continuous speech stream is neutral, only the segments of speech where an emotion change is observed are used for emotion assessment. The emotion assessment module is implemented using the fused feature set. The feature vector used for classification is composed of handcrafted pitch, spectral, and prosodic features derived from the raw audio signal and deep learned features derived from spectrogram images. Feature fusion has boosted the accuracy of emotion assessment due to the complementary information generated from different representations of the speech signal.

The flow of the proposed work is as follows: In Section 2, we review the related works. In Section 3, we discuss the proposed methodology.
Section 4 presents the results and discussion, and Section 5 contains the conclusion and future work.

After reviewing the work done in the field of speech emotion recognition, it is found that a lot of variation exists in terms of approaches, emotion representation models, units of analysis, choice of features, feature selection algorithms, and classifiers [24]. Two different models, discrete and continuous-dimensional, have been used for emotion representation. In the discrete model, emotions are categorized into six basic emotions such as neutral, anger, disgust, fear, happiness, and sadness. In the continuous-dimensional model, emotions are represented along two axes: arousal (or activation) and valence (or positivity).

Emotion recognition using a conventional pipeline requires feature extraction, feature selection, and emotion classification. A variety of features such as glottal, spectral, prosodic, and voice quality features, and the Teager energy operator, have been used for speech emotion recognition [13]. During heightened emotion, a change in the vibration of the vocal folds is observed, which further leads to variation in the glottal flow. Glottal features capture the characteristics of the sound source and glottal flow and are among the prominent features used for speech emotion recognition; they are independent of language and robust against noise. According to Nooteboom [23], prosodic features refer to the variations in melody, intonation, pauses, stresses, intensity, vocal quality, and accents of speech; these features are based on human perception. Utterance-level prosodic features are widely used for speech emotion recognition [11]. Spectral features are extracted from the frequency domain and represent the characteristics of the vocal cords. Mel-frequency cepstrum coefficient (MFCC) features represent the spectral properties of the speech signal. Recent studies have shown that spectral features also contain rich information about expressivity and emotion; their popularity lies in the fact that vocal cord characteristics can be observed through them [26]. Voice quality features used for SER are jitter, shimmer, and harmonics-to-noise ratio; they provide insight into the variations occurring in vocal tract characteristics during heightened emotions. Teager features are widely used for stress recognition; they help in detecting stresses occurring in the muscles of the vocal tract due to changes in mental state. Spectral features provide frequency-domain metrics on the data.

Both local and global features have been used in emotion classification. Features extracted from frames are called local features, while global features are extracted from the complete utterance. Local features are good at capturing the local dynamic information of emotion in speech, whereas global features take statistics of the features over a whole utterance for speech emotion classification. Features used for speech emotion recognition have been extracted from a frame, a word, or a complete sentence. Feature extraction is followed by feature selection; in this step, features that are highly correlated or redundant are removed. Different algorithms have been used for feature selection, such as principal component analysis (PCA), linear discriminant analysis (LDA), extreme learning machines, swarm intelligence, etc.
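As a small illustration of this kind of feature reduction, the sketch below applies PCA to a matrix of handcrafted speech features. The feature matrix, the 95% explained-variance threshold, and the scaling step are assumptions made for the example; they are not settings taken from the paper.

```python
# Illustrative sketch: reducing a handcrafted speech-feature matrix with PCA.
# X is assumed to be an (n_samples, n_features) array of handcrafted features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 11))                 # placeholder for real handcrafted features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                   # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print("original dims:", X.shape[1], "-> reduced dims:", X_reduced.shape[1])
```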
The classification algorithms used for SER include linear regression, K-nearest neighbors (KNN), support vector machines (SVM), Gaussian mixture models (GMM), hidden Markov models, artificial neural networks, etc. A lot of time, effort, and domain knowledge is required to generate an optimal feature set, and there is still no accepted feature set that can distinguish different emotions. Currently, deep learning algorithms have replaced conventional methods for speech emotion recognition, as they are capable of automatically generating optimal features from training data without human intervention. Deep learning algorithms have shown good performance in several tasks such as image classification, object recognition, speaker recognition, speech and handwriting recognition, etc. Different variants of deep learning architectures, such as convolutional neural networks, long short-term memory networks, autoencoders, adversarial neural networks, and RNNs, have been employed for speech emotion recognition and have shown marked improvement in accuracy and performance compared to conventional approaches.

Several researchers have recently used speech spectrograms or sub-bands of spectrograms for speech emotion recognition using CNNs; as a result, SER is reduced to an image classification problem. Zhang et al. (2015) proposed CNNs with different configurations and different features for speech emotion recognition. The prosody features, usually focused on fundamental frequency (F0), speaking rate, duration, and intensity, are not able to confidently differentiate angry and happy emotions from each other. [18] proposed a deep convolutional neural network (CNN) for speech emotion recognition using features extracted from spectrograms; rectangular kernels of varied sizes were proposed to learn discriminative features from the spectrogram. The method was evaluated on the Emo-DB and Korean datasets and showed better performance in comparison to other state-of-the-art techniques proposed for SER. Lu et al. [16] proposed a combined long short-term memory (LSTM) and convolutional neural network (CNN) model for SER: the LSTM extracts the temporal context features of the speech signals and the CNN extracts high-level emotional features from low-level features for emotion classification. The method obtained accuracies of 49.15%, 85.38%, and 37.90% on the eNTERFACE'05, RML, and AFEW6.0 databases. Zhang et al. [29] integrated distributed-gender and gender-driven features with spectrogram features for speech emotion recognition using a convolutional neural network (CNN) and bi-directional long short-term memory (BLSTM) and reduced the recognition error rate to 45.74%. Zhang et al. [30] proposed low-level descriptors and spectrographic features for speech emotion recognition using CNN and reduced the error rate to 36.91%. Guo et al. [8] used amplitude and phase information for speech emotion recognition and reduced the recognition error to 33%; the dataset used for the experiments was EMO-DB. Hajarolasvadi and Demirel [9] proposed a 3D CNN for speech emotion recognition using Mel-frequency cepstral coefficients (MFCC), pitch, and intensity features together with features extracted from spectrograms derived from the same frames; the datasets used for experimentation were the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE'05 databases, and the results obtained were superior to state-of-the-art methods reported in the literature. [19] proposed a novel architecture of ADRNN networks for speech emotion recognition.
The local correlation features and global contextual information learned from the Mel-spectrogram are passed as input to a BiLSTM for speech emotion recognition. Accuracies of 90.78% and 85.39% and unweighted accuracies of 74.96% and 69.32% were obtained in the speaker-dependent and speaker-independent settings on the Berlin EMO-DB and IEMOCAP datasets. Badshah et al. [2] proposed a convolutional neural network using affect-salient features for SER. Anvarjon et al. [1] proposed frequency features extracted from the speech data to predict emotions; accuracies of 92.02% and 77.01% were obtained on the IEMOCAP and Emo-DB datasets, respectively. Mustaqeem et al. [21] proposed a computationally efficient method for SER in which, instead of the entire utterance, only key segments of the utterance are used for speech emotion analysis with a CNN; the method achieved accuracies of 72.25%, 85.57%, and 77.02% on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively. [20] proposed a one-dimensional dilated convolutional neural network (DCNN) to learn spatially salient emotional features and long-term contextual dependencies from the speech signals. [27] proposed low-level descriptors (LLD), segment-level features extracted from speech Mel-spectrograms, and features extracted from the complete utterance; in that speech emotion recognition model three classifiers, a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN), were integrated. The integrated model achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3% on the IEMOCAP dataset, significantly better than the individual classifiers. Zhang et al. [28] proposed a multi-CNN to learn multimodal audio features from spontaneous speech for emotion recognition; through the fusion of different features, complementary information could be generated, which significantly improved the accuracy of emotion recognition. The model was evaluated on the AFEW5.0 and BAUM-1s datasets.

Motivated by the well-established success of pitch and spectral features for SER, as well as by the recent popularity of spectrogram images, we propose a new SER approach that concatenates pitch and spectral features with the spectrogram image features extracted by a CNN.

The convolutional neural network is the most popular and established deep learning algorithm. CNNs have shown exceptionally good performance in several tasks across a variety of domains, with benchmark performance in image classification. A CNN loosely mimics the structure of the human brain: it takes an input such as an image and systematically applies filters to the input data to learn discriminatory features. The filters applied to the input help to learn abstract concepts such as boundaries and edges. Although filters could be handcrafted, the advantage of using a CNN is that it automatically learns the filters during the training process in the context of the problem at hand. CNNs require less pre-processing for image classification compared to other classification algorithms, and they automatically extract features from the input, a task that was earlier performed by humans in traditional algorithms. The major advantages of CNNs are that no prior knowledge is needed for feature engineering and that they are highly scalable to massive datasets. The building blocks of a CNN are convolution layers and fully connected layers.
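To make these building blocks concrete, the following minimal Keras sketch stacks convolution, pooling, and fully connected layers in the typical arrangement described next. The layer sizes, the 128x128 single-channel spectrogram input, and the three output classes are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of CNN building blocks (convolution -> pooling -> fully connected).
# Input shape and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),             # e.g. a single-channel spectrogram image
    layers.Conv2D(16, (3, 3), activation="relu"),  # filters learn local patterns (edges, ridges)
    layers.MaxPooling2D((2, 2)),                   # pooling keeps dominant activations, shrinks the map
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # feature vector fed to the classifier head
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),         # e.g. anger / stress / neutral
])
model.summary()
```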
A typical CNN architecture is composed of multiple convolution and pooling layers stacked over each other, followed by one or more fully connected layers. The two important operations of a CNN are convolution and pooling. The convolution operation is the application of filters to an input such as an image, which results in an activation. The repeated application of filters to the input produces an activation map indicating the strength and location of the detected feature in the input. The pooling operation reduces the number of features by extracting dominant ones, which further reduces the processing time and computational complexity of the algorithm. The fully connected layer contains a feature vector which is further used for classification.

Spectrograms are a time-frequency visual representation of a signal produced by a short-time Fourier transform (STFT) [2]. Spectrograms carry both vocal tract and excitation information, as well as amplitude and phase information. They are critical for analyzing speech signals in both the time and frequency domains. Although there are other representations of the speech signal, such as the wavelet transform and the Wigner-Ville distribution, spectrograms offer distinct advantages and are preferred in practical applications. As a result of increasingly complex modeling approaches and large datasets, their value is growing, even if more stringent feature extraction could potentially worsen overall performance. The spectrogram is the image of a sound, representing the changing frequency content of the signal over time. In the 2D spectrogram the frequency of the signal is represented on the y-axis and time on the x-axis; the lowest frequencies are shown at the bottom of the spectrogram and the higher frequencies at the top. Colors are used to represent the amount of energy in the signal: higher-energy regions, caused by events such as vocal fold closures, formants, and harmonics, are shown in darker colors, while lighter colors such as white represent regions of little energy, such as silence. Spectrograms are of two types, narrowband and wideband. A narrowband spectrogram is used to show the characteristics of the source, such as the vibration of the vocal folds, while a wideband spectrogram is used to investigate the characteristics of the vocal tract, such as the vocal tract resonances (formants). To generate a spectrogram of a sound signal, the signal is first divided into smaller overlapping segments called frames, followed by a short-time Fourier transform:

X(m, ω) = Σ_n x(n) w(n − m) e^(−jωn)    (1)

where x(n) denotes the input signal at time n and w(n) denotes the R-length window function. Figure 1 shows spectrograms for the emotions anger, stress, and neutral.

The proposed methodology for emotion recognition (anger, stress, and neutral) consists of three important steps, pre-processing, feature extraction, and classification, and the block diagram in Fig. 2 shows the details of each step used in the proposed methodology for anger and stress recognition. To implement the proposed work, three publicly available datasets, the Toronto Emotional Speech Set (TESS), the Berlin Emotional Database (EmoDB), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), were downloaded. The details of each dataset are given in Section 4.1. After downloading the datasets, the corpora related to the emotions anger, stress, and neutral were extracted and stored separately. Then each speech signal is segmented into shorter overlapping frames of equal length, with each frame having a 50% overlap with the previous one; a minimal framing and spectrogram sketch is given below.
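The following sketch shows one way to carry out this pre-processing step with librosa: splitting a speech file into equal-length frames with 50% overlap and computing a log-magnitude STFT spectrogram for one frame. The file path, sampling rate, frame length, and FFT size are assumptions made for illustration; they are not values reported in the paper.

```python
# Sketch: frame a speech signal with 50% overlap and compute a spectrogram per frame.
# Path, sampling rate, frame length, and FFT size are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)   # hypothetical input file

frame_len = int(0.5 * sr)        # 0.5 s analysis frames (assumed length)
hop_len = frame_len // 2         # 50% overlap between consecutive frames
frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)  # (frame_len, n_frames)

def frame_spectrogram(frame, n_fft=512):
    """Log-magnitude STFT spectrogram of one analysis frame."""
    stft = librosa.stft(frame, n_fft=n_fft, hop_length=n_fft // 2, window="hann")
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)

spec = frame_spectrogram(frames[:, 0])
print("frames:", frames.shape[1], "| spectrogram shape (freq bins x time steps):", spec.shape)
```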
This step results in the division of each speech signal into n frames. These frames are further analyzed using glottal features to identify the frames that show glottal asymmetry, and only the frames where glottal asymmetry is observed are used for emotion assessment. During normal phonation the left and right vocal folds show symmetric oscillation; the oscillatory pattern becomes asymmetric under the influence of emotion or pathological conditions. Glottal symmetry is therefore an important feature for determining emotional state and voice abnormalities due to pathological conditions.

For estimation of the glottal signal we have used inverse filtering; the glottal signal is obtained using Eq. 2. The shape of the glottal pulse is characterized by To, the opening phase of the pulse; Tc, the closing phase of the pulse; Uo, the peak volume velocity of the glottal pulse, which occurs at tp; and FG, the glottal frequency of oscillation. Tc and To are calculated using Eqs. 3 and 4, respectively. After determining To and Tc, the glottal symmetry of the pulse is determined using Eq. 5: glottal symmetry is defined as the ratio of the closing phase to the opening phase, i.e. Tc / To.

Several samples of neutral speech were used to extract the Tc and To values of neutral speech. The recorded Tc and To values are used to calculate the glottal symmetry of each recorded sample, and the mean value of glottal symmetry is obtained from the collected data. A control chart is then used to identify the parts of speech where glottal asymmetry is observed. The control chart shown in Fig. 3 is a diagram used to monitor variations in a particular characteristic of a process over time. After analyzing a sufficient number of points, the mean value of the process is calculated and used to compute the control limits: the upper control limit (UCL) is obtained by adding 3σ to the mean and the lower control limit (LCL) by subtracting 3σ from the mean, where σ denotes the standard deviation (Fig. 3 describes the implementation of control charts for emotion change detection). For the process to be stable, values must lie between the lower and upper control limits; a value that lies above the upper or below the lower control limit indicates process instability. If the glottal symmetry value obtained at the current position goes beyond the upper or lower control limit of the reference neutral speech, an emotion assessment event is triggered. A minimal sketch of this control-chart check is given below.
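As a minimal sketch of this emotion change detection step, the code below derives 3σ control limits from glottal symmetry values of reference neutral speech and flags frames whose symmetry falls outside those limits. The per-frame symmetry values (Tc/To) are assumed to have been computed already, e.g. by the inverse-filtering step described above; the numbers used here are placeholders.

```python
# Sketch: control-chart style emotion change detection on glottal symmetry (Tc/To).
# Symmetry values are assumed to be precomputed per frame; numbers below are placeholders.
import numpy as np

def control_limits(neutral_symmetry):
    """Mean +/- 3*sigma limits from glottal symmetry of reference neutral speech."""
    mean = np.mean(neutral_symmetry)
    sigma = np.std(neutral_symmetry)
    return mean - 3 * sigma, mean + 3 * sigma

def detect_emotion_change(frame_symmetry, lcl, ucl):
    """Indices of frames whose glottal symmetry leaves the neutral control band."""
    frame_symmetry = np.asarray(frame_symmetry)
    return np.flatnonzero((frame_symmetry < lcl) | (frame_symmetry > ucl))

# Placeholder data: reference neutral symmetry values and symmetry of incoming frames.
neutral = np.array([0.62, 0.60, 0.61, 0.63, 0.59, 0.61, 0.62, 0.60])
incoming = np.array([0.61, 0.60, 0.74, 0.75, 0.62])   # frames at indices 2 and 3 deviate

lcl, ucl = control_limits(neutral)
flagged = detect_emotion_change(incoming, lcl, ucl)
print("control band:", (round(lcl, 3), round(ucl, 3)), "| frames triggering assessment:", flagged)
```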
In the proposed approach, two sets of features are used for the assessment of the emotions anger, stress, and neutral; both are derived from the frames that show glottal asymmetry.

The first feature vector is composed of 11 handcrafted features: RMS, ZCR, spectral centroid, spectral entropy, spectral roll-off, mean pitch, max pitch, min pitch, tempo, low energy, and spectral irregularity, extracted from each frame of speech showing glottal asymmetry. The RMS, ZCR, spectral centroid, spectral entropy, and spectral roll-off are derived using Eqs. 6, 7, 8, 9, and 10, respectively.

- RMS is a measurement of the energy in the signal.
- ZCR, the zero-crossing rate, indicates the frequency at which the signal crosses the zero amplitude level, where M is the total number of samples in the processing window and X(m) is the value of the mth sample.
- The spectral centroid indicates the frequency at which the energy of the spectrum is centered, where S(k) is the spectral magnitude at frequency bin k and f(k) is the frequency at bin k.
- Spectral entropy measures the irregularity of the signal, where Pn(fi) represents the probability of the ith frequency component and SE corresponds to the frequency range [f1, f2].
- Spectral roll-off represents the frequency below which the total spectral energy is concentrated, where Mt[n] is the magnitude of the Fourier transform at frame t and frequency bin n.
- Pitch statistics: mean pitch, max pitch, and min pitch.
- Tempo represents how fast or slow the sound is.
- Low energy measures the number of frames whose RMS energy is less than a threshold.

The second set of features is automatically extracted from the spectrograms derived from the speech frames where glottal asymmetry was observed, using the proposed CNN architecture shown in Fig. 4. The spectrogram is a visual representation of the STFT in which the horizontal axis represents time and the vertical axis represents the frequency of the signal in that short frame. In a spectrogram, at a particular time point and a particular frequency, dark colors illustrate frequencies of low magnitude, whereas light colors show frequencies of higher magnitude. Spectrograms are well suited to a variety of speech analyses, including SER [18].

For emotion classification, both sets of features are concatenated, followed by batch normalization. The batch normalization layer standardizes and normalizes the input coming from the previous layer; it smoothens the loss function, which in turn helps optimize the model parameters and speeds up the training process. The normalized features are then passed to a fully connected layer, and the final fully connected layer provides a vote for each emotion class. A minimal sketch of this fusion architecture is given below.
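The following Keras sketch illustrates the fusion idea: a small convolutional branch extracts deep features from the spectrogram image, the 11-dimensional handcrafted feature vector enters through a second input, and the two are concatenated, batch-normalized, and passed to fully connected layers with a softmax output over the three classes. The layer counts, filter sizes, and the 128x128 spectrogram input are illustrative assumptions; the paper's exact architecture (Figs. 4 and 5) is not reproduced here.

```python
# Sketch of the fused-feature classifier: CNN branch on spectrogram images
# concatenated with an 11-dimensional handcrafted feature vector.
# Layer sizes and input shapes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fusion_model(spec_shape=(128, 128, 1), n_handcrafted=11, n_classes=3):
    # Deep-learned branch: convolution + pooling over the spectrogram image.
    spec_in = layers.Input(shape=spec_shape, name="spectrogram")
    x = layers.Conv2D(16, (3, 3), activation="relu")(spec_in)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)

    # Handcrafted branch: RMS, ZCR, spectral and pitch statistics, etc.
    hand_in = layers.Input(shape=(n_handcrafted,), name="handcrafted")

    # Fusion: concatenate, batch-normalize, then fully connected layers.
    fused = layers.Concatenate()([x, hand_in])
    fused = layers.BatchNormalization()(fused)
    fused = layers.Dense(64, activation="relu")(fused)
    out = layers.Dense(n_classes, activation="softmax", name="emotion")(fused)

    return Model(inputs=[spec_in, hand_in], outputs=out)

model = build_fusion_model()
model.summary()
```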
The details of the datasets used are as follows. TESS [5] is an emotional speech dataset of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). The dataset was recorded in English, with two actresses aged 26 and 64 years speaking a set of 200 words; an audiometric test was carried out to keep their hearing thresholds within the normal range. The Berlin Emotional Database (EmoDB) of Burkhardt et al. [3] is a publicly available emotional speech database containing utterances by 10 German speakers covering seven emotions (anger, fear, disgust, boredom, happiness, neutral, and sadness); the dataset contains about 500 utterances. RAVDESS is an emotional speech dataset in the English language consisting of 8 emotions (happy, disgust, anger, surprise, calm, fear, sad, and neutral). The 8 emotions were recorded by 24 actors, 12 male and 12 female, and the dataset contains 1440 recorded utterances as wav files at a sampling rate of 48,000 Hz.

The evaluation metrics used to evaluate the performance of the proposed model are the confusion matrix, categorical accuracy, and cross-entropy loss. The data samples correctly and incorrectly classified can be shown in terms of an N*N matrix called the confusion matrix, where N denotes the number of classes. The correctly classified samples form the diagonal elements of the confusion matrix, while the off-diagonal elements depict misclassified samples; the higher the diagonal values, the higher the classification accuracy. Categorical accuracy is calculated by dividing the number of correctly classified samples by the total number of samples; the accuracy obtained on the training set is called training accuracy and the accuracy obtained on the validation set is called validation accuracy. Cross-entropy loss measures how far the predicted class probabilities, which lie between 0 and 1, diverge from the true labels, and it is the quantity optimized when training the deep learning model.

Three experiments were carried out to evaluate the proposed architecture. The Peltarion platform was used to perform the experiments. The platform is GUI-based and can be used for developing, deploying, and managing deep learning models; it is a cloud-based platform capable of advanced modeling, data access, data pre-processing, training deep learning models, evaluating the models, and visualizing results. To perform an experiment, the dataset is uploaded to the platform, followed by model architecture design, hyper-parameter setting, model training, and model evaluation. The convolutional neural network model shown in Fig. 5 was designed, implemented, and evaluated using the Peltarion platform.

Experiment-1 was carried out using the Toronto Emotional Speech Set (TESS) dataset. The confusion matrix obtained after implementing the proposed model on the TESS dataset is shown in Fig. 6. The classification accuracy obtained for class anger is 90.9%, with 6.1% and 3% misclassified as normal and stress, respectively. The classification accuracy obtained for class normal is 96.8%, with 3.2% misclassified as stress. The classification accuracy for class stress is 100%. The categorical accuracy and cross-entropy loss for Experiment-1 are summarized in Table 1. Figure 7 shows the training and validation cross-entropy loss plot for Experiment-1 on the TESS database, and Fig. 8 shows the training and validation categorical accuracy plot for Experiment-1 on the TESS database.

Experiment-2 was carried out using the EMO-DB dataset. The confusion matrix obtained using the proposed model is shown in Fig. 9. The classification accuracy for class anger is 93.9%, with 6.1% of data samples misclassified as normal. The classification accuracy for class normal is 93.5%, with 6.5% of data samples misclassified as anger. The classification accuracy for class stress is 100%, so the highest classification accuracy is obtained for class stress and the lowest for class anger. The results for Experiment-2 are summarized in Table 2. Figure 10 shows the training and validation cross-entropy loss plot for Experiment-2 on the Emo-DB database, and Fig. 11 shows the training and validation categorical accuracy plot for Experiment-2 on the Emo-DB database.

Experiment-3 was carried out using the RAVDESS dataset; its results are summarized in Table 3. Figure 13 shows the training and validation cross-entropy loss plot for Experiment-3 on the RAVDESS database, and Fig. 14 shows the training and validation categorical accuracy plot for Experiment-3 on the RAVDESS database.

The proposed model was trained on 600 speech files from each dataset, with 200 files belonging to each category (anger, stress, and neutral). From each speech file the features proposed in the previous section were extracted, and each file was also transformed into spectrograms. To evaluate the performance of the proposed architecture on each dataset, the datasets were first split into training and validation sets, with 80% of the data used for training and 20% for validation. The total number of epochs used for training the network is 100, the batch size is 64, the initial learning rate is set to 0.00025, and the optimizer used is Adam with exponential decay rates β1 = 0.9 and β2 = 0.999. A minimal sketch of this training configuration is given below.
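For illustration, the sketch below compiles and trains a model with this reported hyper-parameter setting: Adam with a learning rate of 0.00025, β1 = 0.9, β2 = 0.999, categorical cross-entropy loss, categorical accuracy as the metric, batch size 64, 100 epochs, and an 80/20 train-validation split. The tiny dense model and the random data are placeholders standing in for the paper's network and extracted features; only the training configuration itself is taken from the text.

```python
# Sketch: training setup with the reported hyper-parameters
# (Adam, lr=0.00025, beta1=0.9, beta2=0.999, batch 64, 100 epochs, 80/20 split).
# The tiny model and random data below are placeholders for the paper's network and features.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(11,)),                 # stand-in input: an 11-dim feature vector
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),     # anger / stress / neutral
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",           # cross-entropy loss used in the paper
    metrics=["categorical_accuracy"],          # categorical accuracy reported in the paper
)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 11)).astype("float32")                        # placeholder features
y = tf.keras.utils.to_categorical(rng.integers(0, 3, 600), num_classes=3)

history = model.fit(X, y, batch_size=64, epochs=100, validation_split=0.2)  # 80/20 split
```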
The hyper-parameter values were initialized heuristically. The softmax activation function is used at the output layer and all other layers use the ReLU activation function; the probability associated with each output class is determined using the softmax function. The summary of the hyper-parameter settings is given in Table 4. The performance of the proposed method is further compared with state-of-the-art techniques used for speech emotion recognition in Table 5.

We are always in search of methods to improve the performance of SER systems. The traditional methods used for speech emotion detection rely on a pipeline architecture in which each module has to be optimized separately. In the proposed work we have used an end-to-end approach, replacing the pipeline architecture with a single convolutional neural network. The system extracts useful features from spectrogram images, and along with these we also pass handcrafted features such as pitch, spectral, and energy-related features to gather complementary information. We have introduced an emotion change detection module before emotion assessment, which has considerably reduced the computational complexity and boosted the performance of the proposed algorithm by processing only those segments of speech that differ from neutral speech patterns. Similar earlier work has used pretrained networks for emotion classification; these are heavy architectures requiring more memory and more processing time, so we have proposed our own CNN architecture with fewer layers and feature fusion. The proposed research provides insight to researchers regarding the relevance of the emotion change module in emotion assessment, as well as how a hybrid feature set can boost the accuracy of the system. It also highlights that lighter CNN architectures should be preferred over heavy pretrained architectures to reduce computational complexity: methods with high accuracy, fewer resources, and less processing time are always preferred. Response time is a very important parameter for systems used to measure affective states, such as those used in health care for monitoring mental health, customer perception, product marketing, education, etc.

A limitation of the proposed study is that the selection of handcrafted features is based solely on their popularity and use in past studies. Other features can be further explored for emotion change detection and emotion assessment to further boost the accuracy of emotion recognition. There is still a lot of scope for improvement; the contribution of different features to specific emotions has not received enough attention and is yet to be explored. The architecture of the CNN used in this study is self-developed and could be further optimized using different hyper-parameter settings, different numbers of layers, or different sizes of spectrogram images. Instead of the spectrogram, other time-frequency representations such as Mel-spectrograms, different frequency bands of spectrograms, and single-frequency spectrograms could be used for further analysis. Future work in this direction could explore different sets of handcrafted features; features extracted from different representations of the speech signal, such as Mel-spectrograms and different bands of spectrograms, could be fused to see how they affect the accuracy of the proposed method.

Authors' contributions The methodology was performed by Shalini Kapoor. The literature survey was done jointly by all authors.
Data availability The data, materials, and software applications were developed by the present authors. Code availability The code was developed by Shalini Kapoor. Conflict of interest The authors have no competing interests to declare that are relevant to the content of this article. Ethical approval All authors ensure ethical approval. Consent to participate All authors have consented to participate.

References
[1] Deep-Net: A lightweight CNN-based speech emotion recognition system using deep frequency features
[2] Deep features-based speech emotion recognition for smart affective services
[3] A database of German emotional speech
[4] A survey of deep convolutional neural networks applied for prediction of plant leaf diseases
[5] Toronto Emotional Speech Set (TESS) | TSpace Repository
[6] Stress, definitions, mechanisms, and effects outlined: Lessons from anxiety
[7] Speech emotion recognition method using time-stretching in the preprocessing phase and artificial neural network classifiers
[8] Speech emotion recognition by combining amplitude and phase information using convolutional neural network
[9] 3D CNN-based speech emotion recognition using k-means clustering and spectrograms
[10] Data-driven cervical cancer prediction model with outlier detection and over-sampling methods
[11] Speech emotion recognition using emotion perception spectral feature
[12] Speech emotion analysis
[13] Comparison of glottal closure instants detection algorithms for emotional speech
[14] The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English
[15] Automated assessment of psychiatric disorders using speech: A systematic review
[16] Speech emotion recognition based on long short-term memory and convolutional neural networks. Nanjing Youdian Daxue Xuebao (Ziran Kexue Ban)/J Nanjing Univ Posts Telecommun
[17] A tri-stage wrapper-filter feature selection framework for disease classification
[18] Learning salient features for speech emotion recognition using convolutional neural networks
[19] Speech emotion recognition from 3D log-Mel spectrograms with deep learning network
[20] MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach
[21] Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM
[22] Intelligent system for COVID-19 prognosis: A state-of-the-art survey
[23] The prosody of speech: melody and rhythm
[24] Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends
[25] Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM
[26] Classifying females' stressed and neutral voices using acoustic-phonetic analysis of vowels: an exploratory investigation with emergency calls
[27] Speech emotion recognition using fusion of three multi-task learning-based classifiers
[28] Learning deep multimodal affective features for spontaneous speech emotion recognition
[29] Convolutional neural network with spectrogram and perceptual features for speech emotion recognition
[30] Gender-aware CNN-BLSTM for speech emotion recognition