key: cord-0838980-sphtbf5r
authors: Patel, Nivedita; Patel, Shireen; Mankad, Sapan H.
title: Impact of autoencoder based compact representation on emotion detection from audio
date: 2021-03-03
journal: J Ambient Intell Humaniz Comput
DOI: 10.1007/s12652-021-02979-3
sha: 32b999a5ba807c5f5e9c3533b8cd632f6f507a2b
doc_id: 838980
cord_uid: sphtbf5r

Emotion recognition from speech has a wide range of applications, and consequently extensive research has been done in this field over the past few years. However, many of the existing solutions are not yet ready for real-time applications. In this work, we propose a compact representation of audio using conventional autoencoders for dimensionality reduction, and test the approach on two benchmark publicly available datasets. Such compact and simple classification systems, where the computing cost is low and memory is managed efficiently, may be more useful for real-time applications. The system is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). Three classifiers, namely support vector machines (SVM), a decision tree classifier, and convolutional neural networks (CNN), have been implemented to judge the impact of the approach. The results obtained by attempting classification with AlexNet and ResNet50 are also reported. The observations show that introducing autoencoders can indeed improve the classification accuracy of the emotion in the input audio files. It can be concluded that, in emotion recognition from speech, the choice and application of dimensionality reduction of audio features impacts the results that are achieved; therefore, by working on this aspect of the general speech emotion recognition model, it may be possible to make significant improvements in the future.

Speech is one of the major communication methods used by humans (Mustaqeem and Kwon 2019). Emotions are a form of human expression, and emotion is therefore naturally used in everyday speech to express sentiments clearly. Speech contains both linguistic and non-linguistic information (Mansour et al. 2019). A speech signal carries information such as the intended message, the speaker's identity, and the speaker's emotional state (Bhaykar et al. 2013). Efficient communication through language and speech has enabled people to share ideas, messages, and perceptions with one another. In voice-based signals, two factors are of primary importance: acoustic variation and the words that are spoken. Acoustic features such as the pitch, timing, voice quality, and articulation of the speech signal correlate highly with the underlying emotion due to the effects of arousal in the nervous system, increased heart rate, and so on. The variation of these features forms the basis of emotion recognition in speech. Speech emotion recognition is the task of extracting the emotions of the speaker from his or her speech signal. Detecting these emotions provides insight into deeper complexities that helps systems navigate real-time situations. Emotion recognition from speech is one of the major challenges in the field of human-computer interaction. The formulation of powerful emotion recognition systems is thus beneficial, and the objective of a good emotion recognition system is to mimic human perception, that is, the way humans are able to detect emotions such as anger, sadness, and happiness while talking to one another (Basu et al. 2017).
Despite extensive research in emotion recognition from speech, there are still several challenges, such as imperfect databases, low quality of recorded utterances, cross-database performance, and difficulties with speaker-independent recognition, since each person has a different way of speaking. Speech emotion recognition systems are pattern recognition systems and are generally composed of three parts: (1) speech signal acquisition, (2) feature extraction, and (3) emotion recognition through the use of classifiers. Speech features can be broadly categorized into four types: continuous features, qualitative features, spectral features, and Teager energy operator (TEO)-based features. In this work, three classifiers are implemented to demonstrate the impact of the proposed model: a decision tree classifier, support vector machines, and a convolutional neural network. Figure 1 describes a typical speech based emotion recognition system. Applications of audio based emotion recognition systems provide aid in mental health assessments. Speech processing technology can help diagnose disorders and also detect their severity (Low et al. 2020). Additionally, it can aid speech therapy, which aims to help people with speech impairments (Schipor et al. 2014). Moreover, applications like health care and counseling can benefit the most from such automated systems. Speech emotion recognition systems are particularly useful where man and machine interaction (MMI) is required, such as in web movies, computer tutorial applications, online learning (Cen et al. 2016), call center communications, and mobile communications, because the response in such systems is dependent on the sentiment of the user. The objective of such systems in the case of call center communications would be to detect the emotional state and urgency of the caller (Bojanić et al. 2020). This would help to improve the functionality of call centers, especially those giving health care support to elderly people, and of emergency call centers. It is also used in car systems to detect the mental state of drivers, which is directly correlated with the probability of rash driving and subsequent accidents (Kamaruddin and Wahab 2010). These systems can help ensure the safety of drivers, passengers, and people on roads. It can also be used for the purpose of lie detection in criminal and forensic investigation. Furthermore, research shows its application in the detection of school violence based on children's speech (Han et al. 2018). Lastly, it can improve other artificial intelligence applications like playing customized music based on the emotion of the speaker on a call, marketing, and intelligent toys. The rest of this paper is organized as follows. Section 2 discusses the previous works that have been used for emotion recognition from speech. Section 3 describes the proposed methodology for emotion classification. The experimental work and results are discussed in Sects. 4 and 5 respectively. Section 6 concludes the paper with observations and future remarks. A substantial amount of research has been carried out in the field of speech emotion recognition (SER). In this section, we present a brief review of the work done on emotion detection from audio. Many of the current research methodologies are based on two different classification approaches.
The first is the use of classical classifiers such as SVM and artificial neural networks (ANN), and the second is the use of classifiers based on deep learning such as convolutional neural networks (CNN) and deep neural networks (DNN) (Akçay and Oğuz 2020). Using both linguistic (probabilistic and entropy-based models of words and phrases) and acoustic (pitch, loudness, spectral characteristics) feature modeling, SVM was used as the classifier in Polzehl et al. (2011). This study showed that acoustic modeling outperforms linguistic modeling; accuracies of 75% for the WoZ database and approximately 79% for the IVR datasets were achieved. In Zhang et al. (2016), four models were implemented for the binary classification problem: the simple model, the single-task (ST) model, the multi-task feature selection/learning (MTFS/MTFL) model, and the group multi-task feature selection/learning (GMTFS/GMTFL) model. Acoustic low-level descriptors (LLDs) were extracted and the four models were then used for each emotion classification. The approach was tested on the RAVDESS dataset and the maximum accuracy achieved was 64.29%. Support vector machines have been used as a classification technique by many researchers. Feature extraction using MFCCs, spectral centroids, and delta and delta-delta MFCCs, along with a bagging ensemble with SVM as the classifier, was used for speech emotion recognition on three different datasets, namely IITKGP-SEHSC, RAVDESS, and Berlin EMO-DB; an accuracy of 75.69% was obtained on the RAVDESS dataset using the proposed methodology (Bhavan et al. 2019). Another study, Tomba et al. (2018), aimed to detect stress through speech analysis using mean energy, mean intensity, and MFCC features. Using SVM and neural networks on the RAVDESS dataset, accuracies of 78.75% and 89.16% were achieved. In Deb and Dandapat (2016), a relatively new feature, the residual sinusoidal peak amplitude (RSPA), was selected for emotion classification. The RSPA feature is evaluated from the LP residual of the speech signal using a sinusoidal model. Again, an SVM classifier was used and evaluated on the EMO-DB dataset, giving a maximum accuracy of 74.4%. Furthermore, architectures such as convolutional neural networks (CNN) and long short-term memory (LSTM) have also been used to test the emotion capturing capability of various standard speech representations such as the mel spectrogram, the magnitude spectrogram, and mel-frequency cepstral coefficients (MFCCs). A bidirectional long short-term memory network and a convolutional neural network were used in Pandey et al. (2019), and the best accuracy of 82.35% was achieved for the CNN + BLSTM architecture with MFCC as input on EMO-DB. A convolutional neural network model was evaluated on RAVDESS in Jannat et al. (2018), but the accuracy of the audio-only tests was comparatively low at 66.41%. In Zhao et al. (2019), one 1D CNN LSTM network and one 2D CNN LSTM network were constructed to learn local and global emotion-related features from speech and log-mel spectrograms respectively. Accuracies of 95.33% and 95.89% were achieved on Berlin EMO-DB for speaker-dependent and speaker-independent experiments, and of 89.16% and 52.14% on the IEMOCAP database for speaker-dependent and speaker-independent experiments, respectively. Recurrent neural network (RNN) architectures have also been used for the purpose of SER, achieving 63.5% accuracy on the IEMOCAP corpus in Mirsamadi et al. (2017). Popova et al.
(2018) used a fine-tuned DNN to classify the mel spectrograms obtained from the speech samples of the RAVDESS dataset. The authors obtained an accuracy of 71% using the VGG-16 network as a classifier. A sparse autoencoder method for feature transfer learning for speech emotion recognition was proposed in Deng et al. (2013); average accuracies of 51.6% (original) and 59.9% (reconstructed) were achieved on the datasets. To learn from labelled and unlabelled data, the semi-supervised autoencoder (SS-AE) was introduced in Deng et al. (2018); it extends a popular unsupervised deep denoising autoencoder. A variant of SS-AE called SS-AE-Skip, which introduces skip connections from the lower layers to the upper ones, was also implemented. SS-AE and SS-AE-Skip obtain an average UAR of 42.7% and 42.8%, respectively. Deng et al. (2017) also introduce Universum learning into a deep autoencoder, reducing the inherent mismatch between the training and test data by simultaneously learning common knowledge from labelled and unlabelled data. The Universum autoencoder achieves an accuracy of 59.3%, compared with a UAR of 54.1% for the SVM. In Aouani and Ben Ayed (2018), the model implements stacked and simple autoencoders after MFCC feature extraction. The experimental results show that the DSVM method outperforms the standard SVM, with classification rates of 69.84% and 68.25% respectively using 39 MFCCs. Additionally, the autoencoder method outperforms the standard SVM, yielding a classification rate of 73.01%. A brief review of the work done on emotion detection from audio is presented in Table 1. In this work we attempt to use both traditional classifiers and deep learning classifiers with the addition of an autoencoder, which is a deep learning based enhancement technique. The results achieved with the implementation of some state-of-the-art classifiers such as AlexNet and ResNet50 are also presented. To the best of our knowledge, our model outperforms, in terms of accuracy, all current works that have been evaluated using the same datasets, namely RAVDESS and TESS, with the exception of Tomba et al. (2018). However, our model may be comparable in terms of simplicity and reliability, as the implementation in this work consists of simple autoencoders along with some classical classifiers. The highest accuracy we report is 96%, obtained by evaluating the CNN on the TESS dataset, which also outperforms classifier models that have been tested on other datasets. Any SER system consists of two components: a processing unit that extracts the appropriate features from the speech data, and a classifier that ultimately decides the emotion of the underlying speech utterance. In this section, the methodology used for feature extraction, dimensionality reduction, and classification in the proposed model is presented. Also, the use of autoencoders for the purpose of dimensionality reduction and its impact on classification are discussed. The first step is preprocessing, which includes the extraction and selection of a set of specific acoustic features as well as normalization, noise reduction, etc. In some works, basic acoustic features like pitch-related, intensity-related, and duration-related features have been extracted (Chen et al. 2012). Feature extraction is an important stage of the recognition process.
There are many kinds of feature extraction methods, and some common parametric representations are mel-frequency cepstrum coefficients (MFCC), linear-frequency cepstrum coefficients (LFCC), linear prediction coefficients (LPC), and reflection coefficients (RC). MFCC based features are very common and are used in many SER models to this day, such as Likitha et al. (2017) and Sowmya and Rajeswari (2020). MFCCs represent audio based on human perception, with their frequency bands logarithmically positioned, and capture the power spectrum and the unique characteristics of human speech. The main steps of MFCC feature extraction are pre-emphasis, frame-blocking, the fast Fourier transform (FFT), mel frequency warping, and the discrete cosine transform (DCT) (Muljono et al. 2019). Pre-emphasis is a filtering process that is applied to the signal before performing feature extraction on it. Framing consists of splitting the signal into several frames. The FFT stage then converts each frame from the time domain to the frequency domain; the FFT is a fast algorithm for computing the discrete Fourier transform (DFT). In the mel scaling stage, the spectrum is measured on the 'mel' scale, which is a linear frequency scale below 1000 Hz and a logarithmic scale above 1000 Hz. Mel scaling is performed as shown in Eq. (1):

$$\text{mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right). \tag{1}$$

At the discrete cosine transform (DCT) stage, the mel spectrum coefficients are converted back into the time domain; the result is the MFCC. Figure 2 explains the MFCC extraction process from an audio signal. MFCC has numerous advantages such as simple calculation, good discriminative ability, and high robustness to noise. We have used MFCC features to represent audio samples in this work. Dimensionality reduction is defined as the process of reducing the number of features that describe some data. It is a necessary approach to downsize data. There are many methodologies that can be used to reduce the dimensionality of data, such as principal component analysis (PCA), linear discriminant analysis (LDA), and random forests. PCA seems to be one of the most popular methodologies when it comes to SER; it is a linear preprocessing transformation technique. Chen et al. (2012) describe principal component analysis (PCA), which is used to find a subspace whose basis vectors correspond to the maximum-variance directions in the original space. They also describe linear discriminant analysis (LDA), which selects those vectors that best discriminate among classes, and discuss how these methods may be selected for application to speech features. Further, they present an independent, comparative analysis of PCA, LDA, and PCA + LDA used in speech emotion recognition. It is found that none of the three algorithms is the state of the art for all emotion categories, and a new integrated approach was also introduced. Furthermore, Daneshfar and Kabudian (2019) propose a system that is based on a modified quantum-behaved particle swarm optimization (QPSO) algorithm for feature-vector dimension reduction. The proposed method improves the accuracy of the SER system compared to classical methods such as PCA, LDA, and standard QPSO. Autoencoders can also be used for dimensionality reduction. Deep autoencoders have already proved to be effective tools for denoising (Xia et al. 2014) and classification (Cibau et al. 2013) in SER, and they are being extended to the process of dimensionality reduction. The autoencoder is an unsupervised learning technique that does not require external labels.
The autoencoder algorithm belongs to a special family of dimensionality reduction methods that is implemented using artificial neural networks. It aims to learn a compressed representation of an input while simultaneously minimizing its reconstruction error (Wang et al. 2014). For example, Zabalza et al. (2016) propose the use of a stacked autoencoder for dimensionality reduction and feature extraction in hyperspectral imaging. Stacked autoencoders are an extension of the autoencoder framework, as they contain several layers between the input and the output; the final features are therefore obtained through progressive levels of abstraction. Variational autoencoders, which use variational inference to generate a latent representation of the data, have also been used for the task of dimensionality reduction (Martin et al. 2019). Finally, Sahay et al. (2019) suggest the use of a cascaded autoencoder that can perform both denoising and dimensionality reduction. Thus, autoencoders prove to be a useful tool for dimensionality reduction, with added benefits over traditional methods such as PCA: they remove the need to select meaningful features from the entire list of components, reducing subjectivity and significant human interaction in the analysis (Thomas et al. 2016). In addition, depending on the size of the dataset and the application, autoencoders have often been shown to perform better than principal component analysis (Wang et al. 2012). Unlike PCA, there are no firm guidelines for choosing the size of the bottleneck layer in an autoencoder. Because the autoencoder encodes the information of the original data set into the reduced layer, the decoder is in turn better equipped to reconstruct the original data set, which makes the representation more optimized than that of PCA. The drawbacks of using an autoencoder for dimension reduction include the need for greater computation and tuning, but the trade-off provides higher accuracy. In this paper too, an autoencoder has been used for the purpose of dimensionality reduction before attempting to classify the data. The most basic autoencoder architecture has the same number of dimensions in the input layer as in the output layer, while the hidden layer has fewer dimensions; this is where the dimension reduction occurs. A general representation of an autoencoder with a single hidden layer is depicted in Fig. 3; the hidden layer contains the learned information of the input data in a compressed form. As with other neural networks, there is a lot of flexibility in how autoencoders can be constructed, including variation in the number of hidden layers and the number of nodes in each. As shown in Fig. 3, the encoder takes an input $x_i \in \mathbb{R}^{d_x}$ and reduces it to a representation $y_i \in \mathbb{R}^{d_y}$ in the hidden layer through a function $f(\cdot)$, which is a standard activation function: either an identity function for a linear projection, or a sigmoid function $f(x) = \frac{1}{1 + e^{-Wx}}$ for a non-linear mapping, where $W$ is a $d_y \times d_x$ weight matrix. Ignoring the bias of the neural network, the encoding process is represented as

$$y_i = f(W x_i).$$

$W^{T}$ represents another weight matrix, of size $d_x \times d_y$, and the decoding process is represented by the following equation:

$$x'_i = g(W^{T} y_i).$$

Here, $g(\cdot)$ is either a sigmoid function for non-linear reconstruction or an identity function for linear reconstruction, similar to $f(\cdot)$; the function $g(\cdot)$ thus represents the decoding process.
$d_x$ refers to the dimension of the inputs and $d_y$ to the dimension of the output after dimensionality reduction; $\mathbb{R}^{d_y}$ represents the set of $d_y$-dimensional output data vectors and $\mathbb{R}^{d_x}$ the set of $d_x$-dimensional input data vectors. For each input $x_i$, the decoder reconstructs a set of instances indexed by a neighbourhood set $\Omega_i$ with specific weights $S_i = \{s_{ij}\}$, giving a weighted reconstruction error

$$e_i = \sum_{j \in \Omega_i} s_{ij}\, \lVert x_j - x'_i \rVert^2 .$$

The total weighted reconstruction error over all $n$ input samples to the autoencoder is

$$E = \sum_{i=1}^{n} e_i .$$

A general autoencoder iteratively computes and updates the values of $\Omega_i$ and $S_i$ using an algorithm such as the k-nearest neighbour algorithm. Furthermore, using the concept of stochastic gradient descent, the autoencoder minimizes the total weighted reconstruction error and updates the parameters $W$ and $W^{T}$. This is done iteratively until the convergence point is reached. Each classifier has a unique set of advantages and limitations, and therefore the performance may vary with each classifier. The objective of this section is to provide an overview of the classifiers used in this work. Generally, SVM is used as a binary classifier; however, it can also be used as a multi-class classifier. It is a highly effective machine learning tool and is widely used in all types of pattern recognition problems. Especially in cases of limited training data availability, it has been known to outperform other classifiers. SVM is essentially designed around the use of kernel functions to non-linearly map the original features to a high-dimensional space, where the data can then be well classified using a linear manifold. It has been used extensively for classification, especially image classification, and has proven successful in applications such as thyroid disease detection (Shankar et al. 2020), classification of mammograms in breast cancer detection (Vijayarajeswari et al. 2019), and even the determination of poverty (Naviamos and Niguidula 2020). SVM has shown superior performance for emotion recognition in comparison to linear discriminant classifiers and nearest neighbor classifiers. It has been used as a classifier for sound based emotion recognition (Sonawane et al. 2017) and has shown good accuracies. Furthermore, deep support vector machines have been tested for SER (2019) and gave better performance than previous studies. A decision tree SVM model with Fisher feature selection for speech emotion recognition was also implemented and achieves an accuracy as high as 98.29% (Sun et al. 2019). Therefore, in this work we decided to implement SVM as a classifier, as it can be considered one of the most successful classifiers. Decision trees are widely used for the purpose of classification and regression. They are tree-like structures consisting of three types of components, namely the root node, internal nodes, and terminal nodes, as shown in Fig. 4, which gives a flowchart of the decision tree algorithm (Pantazi et al. 2020). Each internal and terminal node present in the tree has a parent node which denotes its data source, and at least two child nodes are created from each parent node depending on decision rules that might differ for different scenarios (Pantazi et al. 2020). Decision tree classifiers have been applied in diverse areas such as agile management systems (AMS), intelligent data mining in agriculture (Pantazi et al. 2020), microscopic image analysis, character recognition, speech recognition, and radar signal classification.
Decision trees are able to break a complex decision making process into simpler decisions in a hierarchical manner, which makes them easier to interpret. Decision tree classifiers have high adaptability and effective features, making them capable of extracting decision making knowledge from the given data. The convolutional neural network (CNN) is one of the most popular deep learning methods, manifested in areas such as face recognition, handwriting recognition, and many other processing and recognition problems. Recently, CNNs have also been applied to COVID-19 related applications in order to facilitate screening approaches during this pandemic. In a recent study, COVID-Net, an open source deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images, was introduced (Wang and Wong 2020); the promising results achieved by COVID-Net on the COVIDx test dataset are credible. Similarly, a deep CNN called decompose, transfer, and compose (DeTraC) was adopted for the classification of COVID-19 chest X-ray images, and accuracies up to 95.12% were achieved (Abbas et al. 2020). Also, three different convolutional neural network based models (ResNet50, InceptionV3 and Inception-ResNetV2) have been proposed for the detection of coronavirus pneumonia infected patients using chest X-ray radiographs in Narin et al. (2020); it is observed that the pre-trained ResNet50 model provides the highest classification accuracy at 98%. Apart from computer vision, CNNs have also been used specifically for the task of speech emotion recognition. The authors in Barra et al. (2020) present a study that exploits an ensemble of CNNs, trained over Gramian angular field (GAF) images, for financial market forecasting and trend analysis in the US. A multiresolution imaging approach is used to feed each CNN, which enables the analysis of different time intervals for a single observation. A method for speech emotion recognition using spectrograms and a deep convolutional neural network (CNN) is capable of predicting emotions accurately and efficiently (Badshah et al. 2017); spectrograms generated from the speech signals are input to the deep CNN. This study also investigates the effectiveness of transfer learning for emotion recognition using a pre-trained AlexNet model, but concludes that the results are not satisfactory. Zheng et al. (2018) proposed an SER model based on CNN feature extraction followed by random forest classification; CNN can therefore be used in multiple hybrid approaches as well. Huang et al. (2014) achieved results using CNN by trying to learn salient feature maps using an auto-encoder. A similar work applied deep convolutional neural networks, but failed to get an accuracy of more than 40%. One-dimensional CNN has also been used successfully to produce an accuracy of about 80% (Basu et al. 2017). Therefore, CNN models have extensive applications; they are well known and proven in use. Deep CNNs have two essential ingredients: a rectified linear unit (ReLU), defined as the univariate nonlinear function

$$\sigma(u) = \max(u, 0),$$

and a sequence of convolutional filter masks $w = \{w^{(j)}\}_j$ inducing sparse convolutional structures (Zhou 2020). A filter mask $w = (w_k)_{k=-\infty}^{\infty}$ defines a sequence of filter coefficients, where the filter length is a fixed integer $s \geq 2$ in order to control the sparsity, and it has been assumed that $w^{(j)}_k \neq 0$ only when $0 \leq k \leq s$.
When a filter mask $w$ is convolved with $v = (v_0, \ldots, v_D)$, we get a new sequence defined as

$$(w * v)_i = \sum_{k=0}^{D} w_{i-k}\, v_k .$$

This generates a $(D + s) \times D$ Toeplitz-type convolutional matrix $T$ that has constant diagonals; the matrix has a larger number of rows than columns. The $k$th output feature map of a convolutional layer is then computed as

$$h_k = \sigma\!\left(\sum_{q} W_k * x_q + b_k\right),$$

where $(h_k)_{ij}$ is the $(i, j)$th element of the $k$th output feature map, $W_k$ is the $k$th filter, $b_k$ is the $k$th bias, $q$ indexes the input feature maps $x_q$, and $*$ denotes the 2D spatial convolution operation. A pooling layer generally follows, which reduces the number of parameters and hence the complexity of the model. Max pooling and average pooling are the two types mainly used: max pooling chooses the maximum value from the specified window, whereas average pooling calculates the average of the specified window. In addition to convolutional and pooling layers, the CNN model also consists of dropout layers, dense layers, and a last fully connected layer which is responsible for generating the output for regression or classification tasks. The proposed system is evaluated on two datasets: the RAVDESS dataset and the Toronto emotional speech set. Feature extraction from the raw audio files is done with the help of MFCC. Further, the extracted features are fed into an autoencoder model for the purpose of dimension reduction. Then, the newly reconstructed data is used as input for the SVM model, the decision tree classifier, and the CNN, and the performance is compared on the basis of different evaluation measures. Figure 5 illustrates the system model that has been proposed in this paper. The three classifying models have been evaluated on two publicly available speech emotion datasets: (1) the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone and Russo 2018), and (2) the Toronto Emotional Speech Set (TESS) (Pichora-Fuller and Dupuis 2020). The RAVDESS dataset contains a complete set of 7356 files (24.8 GB) of audio and video, speech and song. It is a dynamic and multimodal set consisting of facial and vocal expressions in North American English. The database consists of 24 actors who vocalize two lexically matched statements in a neutral North American accent. Neutral, calm, happy, sad, angry, fearful, disgust, and surprised are the eight speech emotions. Each expression is produced at two levels of intensity, strong and normal, with the neutral expression as an addition. All data are available in three modality formats: audio-video, audio only, and video only. However, for this work we use only the audio files, which make up 1440 files (24 actors * 60 trials per actor). Tables 2 and 3 represent the distribution of wave files of the RAVDESS dataset and the filename identifiers respectively, and Fig. 6 depicts the file naming convention used in this dataset for each audio file. The TESS dataset, summarized in Table 4, consists of recordings of two women, aged 26 and 64 years, portraying seven emotions: anger, happiness, disgust, fear, sadness, neutral, and pleasant surprise. 200 target words were spoken in the carrier phrase "Say the word _" by each woman for all seven emotions. The two women chosen for the recordings were from the Toronto area; both have received musical training, are university educated, and speak English as their first language. Mel-frequency cepstral coefficients (MFCCs) have been used for the purpose of feature extraction on both the RAVDESS and TESS datasets. Since MFCCs have been used extensively to extract features from audio signals while discarding unnecessary background noise, we incorporated MFCC in our approach for feature extraction. The MFCC features are calculated for the speech files with the default sliding window settings of the feature extraction library.
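As an illustration, the MFCC extraction step can be sketched with the librosa library; librosa is not named in the paper, and the time-averaging of 128 coefficients into a single fixed-length vector is only an assumption made here to match the (128,) input shape used later.

```python
# A minimal sketch of MFCC-based feature extraction (assumptions: librosa as the
# extraction library, and time-averaging 128 MFCCs to a 128-dimensional vector).
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=128):
    """Load an audio file and return a fixed-length MFCC feature vector."""
    signal, sr = librosa.load(path, sr=None)                      # keep native sampling rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return np.mean(mfcc, axis=1)                                  # average over frames -> (n_mfcc,)

# Example usage (files and labels are placeholders for the RAVDESS/TESS lists):
# X = np.stack([extract_mfcc(f) for f in files])
# y = np.array(labels)
```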
For our study, we used a simple autoencoder based on a fully connected neural layer as the encoder and another as the decoder. The proposed autoencoder architecture is the same for both datasets. The encoding dimension taken is 64, with the input shape being (128,). A single layer each is taken for the encoder and the decoder, with the 'encoded' label storing the encoded representation of the input and the 'decoded' label representing the lossy reconstruction of the input. Separate encoder and decoder models are also built. In order to train our autoencoder to reconstruct the audio features, the implemented model is first configured to use binary cross-entropy as the loss function and the AdaDelta optimizer to optimize it. Further, we train our autoencoder for 100 epochs with 256 as the batch size. For visualizing the encoded representations and the reconstructed inputs, the Matplotlib library in Python was used. As a mean squared error (MSE) based metric deals with the difference between the predicted output and the real target, we used MSE as the error function, because the model tries to reconstruct the input; the Adam optimizer was used to compute the gradients. Figures 7 and 8 give a graphical representation of the performance of the autoencoder on the RAVDESS and TESS datasets respectively; the visual representation shows the original input and the reconstructed input of the procedure. As the encoded representations have an encoding dimension of 64 and the input is 128 floats, the input is compressed by a factor of 2 (i.e., 128/64); hence we get 64-dimensional encoded representations. The decoded output produced after passing the encoded representation to the decoder layer has the same size as the original input, i.e., shape (128,), but is a lossy reconstruction of it. The accuracies of the reconstructed input after applying our proposed autoencoder model to the TESS and RAVDESS datasets are 91.92% and 90.58% respectively. The audio dataset has been split into training and testing sets using the train_test_split() function in scikit-learn. The training set is used to build the model, and the testing set is used to evaluate the performance of the model on unknown data. Here, we have kept the testing set size at 33% of the total dataset, and a random state seed of 42 is used to perform a random split; hence the size of the training set is 67%. The data is then fed into the different classifiers described in this section.
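For concreteness, the autoencoder configuration described above can be expressed in Keras roughly as follows. This is a sketch, not the authors' exact code: the ReLU/sigmoid activations are assumptions (the paper mentions identity and sigmoid mappings), and the binary cross-entropy/AdaDelta setting is used here, with the reported MSE/Adam variant noted in a comment.

```python
# A minimal Keras sketch of the single-layer autoencoder described above:
# 128-dimensional input, 64-dimensional bottleneck, binary cross-entropy loss
# and the AdaDelta optimizer (the paper also reports an MSE/Adam variant).
# Variable names (x_train, x_test) are illustrative placeholders.
from tensorflow.keras import layers, models

input_dim, encoding_dim = 128, 64

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)   # compressed representation
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)  # lossy reconstruction

autoencoder = models.Model(inputs, decoded)
encoder = models.Model(inputs, encoded)                           # separate encoder model

autoencoder.compile(optimizer="adadelta", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=100, batch_size=256,
#                 shuffle=True, validation_data=(x_test, x_test))
# x_reconstructed = autoencoder.predict(x_train)   # the data fed to the classifiers
```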
The CNN architecture used for the RAVDESS dataset has two convolutional layers followed by pooling, dropout, and flatten layers, in addition to fully connected dense layers with a softmax layer, as depicted in Fig. 9. We have made use of a 1D CNN model, as our data consists of fixed-length audio feature vectors and a 1D CNN is very effective for extracting features from a fixed-length segment of the entire dataset. The input shape being 128 × 1, the output of the first convolutional layer will be 128 × 32, because the first convolutional layer has 32 parallel feature maps and a kernel size of 5. The next convolutional layer has 64 parallel feature maps with a kernel size of 5 and ReLU as the activation function, giving an output of size 128 × 64. ReLU has been used as the activation function so as to increase the non-linearity applied to our audio features. We have used two CNN layers in order to help the model learn features from the input. Next is the pooling layer of size 8, giving an output of size 16 × 64, thus reducing the number of learned features and keeping only the important elements. It is followed by a dropout layer with a rate of 0.25 that keeps the size of the output matrix the same. The dropout layer is needed to slow down the learning process of the model and in turn avoid overfitting. After this, the matrix is flattened to get a height of 1024, forming a matrix of size 1 × 1024. After a dense layer and one more dropout layer with rate 0.5, the final dense layer and the softmax function reduce the vector of height 1024 to a vector of 8, as in RAVDESS we are to predict from eight classes of emotions. Adadelta, a variant of the gradient descent algorithms, has been used as the optimizer, and sparse categorical cross-entropy as the loss function, as our targets are in integer format and our model is built for multi-class classification. The training uses a batch size of 128 along with 1000 epochs; the loss function and optimizer defined are the same as those defined for the TESS CNN model. The CNN architecture used for the TESS dataset consists of a 1D convolutional layer with a pooling layer, dropout and flatten layers, along with fully connected dense layers and a softmax layer as the output, as shown in Fig. 10. The shape of the input layer is 128 × 1 and the kernel size and number of feature maps are taken as 5 and 32 respectively, thus giving an output of size 128 × 32 after passing through the CNN layer. A pooling layer of size 8 is used afterwards in order to reduce the complexity of the output and prevent overfitting of the data. The dropout layer follows, which is used to increase accuracy on unseen data as it randomly assigns zero weights to neurons in the network, hence making the network less sensitive to small variations in the data. For our model, since we have chosen a rate of 0.25, 25% of the neurons will receive a zero weight. The size of the output matrix remains the same, i.e., 16 × 32. Now, the output matrix of the dropout layer is flattened to get an output of height 512, forming a 1 × 512 matrix. Next, passing the 1 × 512 matrix to the dense layer reduces the size of the matrix to 1 × 128, which is again given to a dropout layer of rate 0.5. The final fully connected dense layer with softmax used as the activation function reduces the vector of height 128 to a vector of 7, since we need to make predictions over seven classes of emotions. Then, the model is trained with a batch size of 128 and 70 epochs. The Adadelta optimizer with a default learning rate of 0.001 and sparse categorical cross-entropy as the loss function have been used for the training of our model. After this, the model is evaluated on the test dataset.
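The 1D CNN described for the TESS dataset could be sketched in Keras as follows. The 'same' padding and the ReLU on the dense layer are assumptions made so that the stated 128 × 32 and 512 sizes work out; the paper does not spell these details out.

```python
# A minimal Keras sketch of the 1D CNN described for the TESS dataset:
# Conv1D(32, kernel 5) -> MaxPooling1D(8) -> Dropout(0.25) -> Flatten
# -> Dense(128) -> Dropout(0.5) -> Dense(7, softmax).
from tensorflow.keras import layers, models

def build_1d_cnn(num_classes=7, input_length=128):
    model = models.Sequential([
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu",
                      input_shape=(input_length, 1)),             # -> (128, 32)
        layers.MaxPooling1D(pool_size=8),                         # -> (16, 32)
        layers.Dropout(0.25),
        layers.Flatten(),                                         # -> 512
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# For RAVDESS, a second Conv1D layer with 64 filters is added before pooling
# and num_classes becomes 8.  Example (TESS settings):
# model = build_1d_cnn(num_classes=7)
# model.fit(x_train[..., None], y_train, batch_size=128, epochs=70)
```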
In our proposed methodology, along with the SVM, decision tree, and 1D CNN techniques, we also tried to implement AlexNet and ResNet50 models on our two datasets, RAVDESS and TESS. As the architectures of AlexNet and ResNet50 have been defined for image classification, we converted the audio files into spectrograms, followed by conversion of the spectrograms into RGB images. An audio signal is represented in the time domain and is converted into the frequency domain to be represented as a spectrogram. The fast Fourier transform (FFT) is a mathematical tool that analyses the frequency content of audio, and it is calculated over a series of overlapping window segments. In spectrograms, the y-axis, which represents frequency, is converted to a logarithmic scale and mapped onto the mel scale to get mel spectrograms; in our approach we have used mel spectrograms. The size of the FFT has been taken as 1024, which also defines the window length, and the hop length, which defines the step between windows, is 100. Then the amplitude is transformed into decibels to get a logarithmic scale, and the spectrograms created are saved in a specific folder. Once we have spectrograms of all our audio files, we need to convert them into RGB images, so that each image is represented by 3 channels. As the AlexNet architecture requires the images to be of size 227 × 227, and the ResNet50 model takes input of size 224 × 224, we need to resize all images to the respective formats. Thus, all 1440 spectrograms of the RAVDESS dataset are resized to (227, 227, 3) to be fed into the AlexNet model, and to (224, 224, 3) to be fed into the ResNet50 model. Similarly, all 2880 spectrograms of the TESS dataset are resized to (227, 227, 3) for AlexNet, and (224, 224, 3) for ResNet50. The images are then shuffled in order to prevent any bias during training of the model. The next step is to normalize the data to ensure that all the input images have a common scale. The input is now split into training and testing sets using the train_test_split() function; the test size taken is 20%, with a random state of 42. The AlexNet model has 8 layers, i.e., 5 convolutional and 3 fully connected layers. The first convolutional layer has 96 kernels of size (11, 11) with stride 4, followed by a max pooling layer of size (3, 3). The second convolutional layer consists of 256 kernels of size (5, 5) with stride 1, followed by another max pooling layer of size (3, 3) with a stride of 2. The next three convolutional layers are connected directly and have 384, 384, and 256 kernels respectively; these layers have a kernel size of (3, 3) and a stride of 1. The max pooling layer after the fifth convolutional layer feeds its output into a series of two fully connected dense layers, whose output is then passed on to a third fully connected layer with the softmax function. The AlexNet model uses ReLU as the activation function to improve the non-linearity of the model. Adam has been used as the optimizer and sparse categorical cross-entropy as the loss function. The model is trained on the training set for 100 epochs with a batch size of 128 for the TESS dataset, and for 100 epochs with a batch size of 32 for the RAVDESS dataset. In our methodology, we have also used the pre-trained ResNet50 model built into Keras, which has been trained on the ImageNet data; we apply the model to our inputs of size 224 × 224 × 3. The layers have been initialized with ImageNet weights. The pre-trained model is then followed by a classifier using the softmax activation function. Adam is used as the optimizer, and categorical cross-entropy as the loss function. Next, the model is trained on the training set for 20 epochs with a batch size of 64 for the RAVDESS dataset, and for 20 epochs with a batch size of 128 for the TESS dataset.
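A minimal sketch of the ResNet50 transfer-learning setup described above is given below, assuming the Keras applications API. Freezing the pre-trained base and using global average pooling are assumptions; the paper does not state whether the base network is fine-tuned.

```python
# Sketch: Keras ResNet50 with ImageNet weights, topped by a softmax classifier,
# trained with Adam and categorical cross-entropy as described in the text.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # assumption: keep pre-trained features frozen

outputs = layers.Dense(8, activation="softmax")(base.output)   # 8 classes for RAVDESS, 7 for TESS
model = models.Model(base.input, outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train_onehot, epochs=20, batch_size=64)   # batch size 128 for TESS
```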
Speech based emotion recognition is a classification problem, and the following performance parameters have been used in this work to assess the system's efficiency. Classification accuracy is defined as the percentage of test samples predicted correctly by the classifier; this measure gives an overall success rate of the classifier. Precision ($P_r$) is the ratio of correctly predicted positive samples to all predicted positive samples, and recall ($R$) is the ratio of correctly predicted positive samples to all actual positive samples:

$$P_r = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives respectively. The F1-score is the harmonic mean of precision and recall for a specific class. Macro precision and recall values represent the averages of the per-class precision and recall values, respectively, and the macro-F1 score is computed from them. These measures are typically used in multi-class classifier settings. An assumption of uniform weights is typically made while calculating macro-average values of these measures. However, if each class is weighted by its number of samples during the calculation, then we obtain weighted average precision and recall values. The performance of the systems implemented in this work is compared on the two datasets using different evaluation measures; this section describes the results and some observations. In the two sets of experiments, the numbers of training and testing samples taken are 1929 and 951, and 1876 and 924, respectively. Table 8 presents the comparison between the performance of the three classifiers before and after applying the autoencoder. The CNN model used for the original data, before reconstructing the input files, has the same architecture as the CNN model implemented for the reconstructed input, with a batch size of 128 and 70 epochs; similarly, the SVM and decision tree classifiers have been applied in the same way to the original data as to the reconstructed data. CNN achieves the best accuracy, i.e., 94% on the original data and 96% on the reconstructed data. Tables 9 and 10 represent the classification results on the TESS dataset; there is no significant improvement in the case of the CNN classifier, except for the surprise and sad classes, after applying the autoencoder. After looking at the performance of SVM, the decision tree classifier, and the 1D CNN, let us discuss the results obtained by implementing the AlexNet and ResNet50 models. The accuracies obtained after evaluating the performance of the trained AlexNet model on the testing sets of RAVDESS and TESS were 54.17% and 82.32% respectively; however, after incorporating the autoencoder for dimensionality reduction, the respective accuracies obtained were 21.18% and 43.03%. When the ResNet50 model was applied and tested on the RAVDESS and TESS testing sets, we obtained accuracies of 15.97% and 13.03% respectively on the original data, and 12.84% and 15.71% respectively on the reconstructed data. When we compare the performance of the AlexNet and ResNet50 models with SVM, decision tree, and 1D CNN as reported in the previous sections, we find that the maximum accuracy is obtained with the 1D CNN for both the TESS and RAVDESS datasets, and not with either of the deep pre-trained models, i.e., AlexNet and ResNet50. These state-of-the-art approaches, AlexNet and ResNet50, have a high computational cost and high processing delays in addition to low performance as measured on the RAVDESS and TESS datasets. We tried to use the AlexNet and pre-trained ResNet50 models for the SER problem; however, the results were not satisfactory.
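For reference, the SVM and decision tree baselines and the macro/weighted measures described above could be computed with scikit-learn roughly as follows; the kernel and other hyperparameters are assumptions, as the paper does not specify them, and the dummy data stands in for the reconstructed feature vectors and labels.

```python
# Sketch: scikit-learn baselines and macro/weighted precision, recall, F1.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholders: in practice X holds the (reconstructed) 128-dimensional feature
# vectors and y the integer emotion labels prepared as in Sect. 4.
X = np.random.rand(200, 128)
y = np.random.randint(0, 8, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

for name, clf in [("SVM", SVC(kernel="rbf")), ("Decision tree", DecisionTreeClassifier())]:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro", zero_division=0)
    wp, wr, wf1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted", zero_division=0)
    print(f"{name}: accuracy={acc:.3f}  macro P/R/F1={p:.3f}/{r:.3f}/{f1:.3f}  "
          f"weighted P/R/F1={wp:.3f}/{wr:.3f}/{wf1:.3f}")
```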
In this paper, we have presented contributions towards increasing the accuracy of speech emotion recognition compared to the state of the art, and towards reducing the computational complexity of the presented SER model, which has been achieved using SVM, decision tree, and 1D CNN classifiers. The authors therefore focus mainly on SVM, decision tree, and 1D CNN, as the architectures of these techniques are compact and simple in structure, cost-effective, and memory efficient. It can be seen from Tables 5 and 8 that there is a significant improvement in the performance of SVM and CNN after using the autoencoder for dimensionality reduction. However, the decision tree classifier does not show any improvement in its accuracy: for the TESS dataset it remains at 90% in both scenarios, while for RAVDESS it goes from 77% to 76%. Observations from Fig. 11, which plots the performance of the (a) SVM, (b) decision tree, and (c) CNN classifiers on both datasets, indicate that the RAVDESS dataset is more challenging, as the system achieved lower performance across all classifiers in comparison to the TESS dataset. Another conclusion is that the decision tree based classifier is mostly invariant to the proposed method, i.e., the compression has not affected its performance much. However, the other two classifiers show promising performance with the compact representation. This indicates that the efficiency of the system is data-driven and classifier-dependent. In this paper, we demonstrated the impact of an autoencoder based compact representation of audio data for recognizing human emotions. An improvement was observed with the aid of this compact representation on two benchmark datasets. The relative improvements varied according to the type of classifier used as well as according to the dataset used for demonstration. The average relative improvement was 4.66% for the RAVDESS dataset and 2.616% for the TESS dataset. To the best of our knowledge, this is the first attempt to exploit autoencoders directly on audio files for audio emotion detection, achieving a highest accuracy of 96% on the TESS dataset. For future work, we would suggest replacing the decision tree classifier with other classifiers such as long short-term memory (LSTM) networks or their combination with CNN. Further, in the proposed model a simple autoencoder was employed, but improvement of the results is likely with more advanced autoencoder variants.

Author contributions: NP: Writing-Original Draft; SP: Conceptualization, Methodology, Formal Analysis, Data Curation, Writing-Original Draft; SM: Conceptualization, Data Curation, Supervision, Writing-Review and Editing.

References
Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network (arXiv preprint)
Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers
Deep support vector machines for speech emotion recognition
Emotion recognition in speech using MFCC with SVM, DSVM and auto-encoder
Speech emotion recognition from spectrograms with deep convolutional neural network
Deep learning and time series-to-image encoding for financial forecasting
A review on emotion recognition using speech
Bagged support vector machines for emotion recognition from speech
Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM
Call redistribution for a call center based on speech emotion recognition
Chapter 2 - A real-time speech emotion recognition system and its application in online learning
Speech emotion recognition: features and classification models
Speech emotion recognition using a deep autoencoder
Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm
Emotion classification using residual sinusoidal peak amplitude
Sparse autoencoder-based feature transfer learning for speech emotion recognition
Universum autoencoder-based domain adaptation for speech emotion recognition
Semisupervised autoencoders for speech emotion recognition
Emotion recognition and school violence detection from children speech
A research of speech emotion recognition based on deep belief network and SVM
Ubiquitous emotion recognition using audio and video data
Driver behavior analysis through speech emotion understanding
Speech based human emotion recognition using MFCC
Low DM, Bentley KH, Ghosh SS (2020) Automated assessment of psychiatric disorders using speech: a systematic review
Emotional speaker recognition in real life conditions using multiple descriptors and I-vector speaker modeling technique
Deep variational auto-encoders: a promising tool for dimensionality reduction and ball bearing elements fault diagnosis
Automatic speech emotion recognition using recurrent neural networks with local attention
Speech emotion recognition of Indonesian movie audio tracks based on MFCC and SVM
A CNN-assisted enhanced audio signal processing for speech emotion recognition
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
A study on determining household poverty status: SVM based classification model
Deep learning techniques for speech emotion recognition: a review
Intelligent data mining and fusion systems in agriculture
Anger recognition in speech using acoustic and linguistic cues
Emotion recognition in sound
Combatting adversarial attacks through denoising and dimensionality reduction: a cascaded autoencoder approach
Improving computer assisted speech therapy through speech based emotion recognition
Optimal feature-based multi-kernel SVM approach for thyroid disease classification
Sound based human emotion recognition using MFCC and multiple SVM
Speech emotion recognition for Tamil language speakers
Decision tree SVM model with Fisher feature selection for speech emotion recognition
Dimensionality reduction of mass spectrometry imaging data using autoencoders
Stress detection through speech analysis
Classification of mammogram for early detection of breast cancer using SVM classifier and Hough transform
COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
A folded neural network autoencoder for dimensionality reduction
Generalized autoencoder: a neural network framework for dimensionality reduction
Modeling gender information for emotion recognition using denoising autoencoder
Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging
Cross-corpus acoustic emotion recognition from singing and speaking: a multi-task learning approach
Speech emotion recognition using deep 1D & 2D CNN LSTM networks
Speech emotion recognition based on convolution neural network combined with random forest
Universality of deep convolutional neural networks

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.