Literature Review in Speech Emotion Recognition



Speech is one of the fastest and most natural means of communication between humans. The speech signal carries not only the linguistic message but also additional information such as the emotional state of the speaker [1]. Speech emotion recognition (SER) is a research field that builds on speech recognition but focuses on recognizing the emotional state of the speaker.

Speech emotion recognition has applications in natural human-machine interaction, such as web movies and computer tutorial applications, and in in-car board systems, where information about the mental state of the driver can be provided to the system. It can also be used as a diagnostic tool for therapists, and it may be useful in automatic translation systems [2,3,4].

This literature review addresses three important aspects of a speech emotion recognition system. The first is the proper preparation of an emotional speech database for evaluating system performance. The second is the selection of suitable features for speech representation, and the third is the design of an appropriate classifier.


The database is an important part of any speech emotion recognition study. A relatively large number of databases for speech emotion recognition exist, many of them created by researchers in the field. Nevertheless, the number of public databases is low [2]. The databases differ in several respects: the language of the speech, the number of emotions recorded, the total size of the database and the source of the recordings.

A database commonly used in SER studies is the Berlin Emotional Database [5]. It was developed at the Technical University of Berlin and is publicly available to the research community. Because it is public, many studies use it to recognize emotions from speech [3, 4, 6]. It contains about 800 utterances by professional actors and covers seven emotional states: anger, joy, sadness, fear, disgust, boredom and neutral.

Another database is BHUDES [7], which was created by Beihang University. It contains Mandarin utterances covering six emotions: sadness, fear, happiness, surprise, anger and disgust. In total, the database contains 5400 utterances spoken by 15 nonprofessional actors.

Vera am Mittag (VAM) [8] is a database recorded from a German TV talk show of the same name. It contains not only audio signals but also video signals and face images. The audio data comprises 1018 utterances in total, annotated along three emotion dimensions rather than discrete emotion categories: valence, activation and dominance.

The Audiovisual Thai Emotion Database [9] contains speech data from 6 students who were asked to read 972 Thai words. It covers six emotions: happiness, sadness, surprise, anger, fear and disgust.

The Interactive Emotional Dyadic Motion Capture database (IEMOCAP) [10] contains audiovisual data from 10 professional actors. Because it includes the actors' speech, it can be used for studies in SER [11].


Energy is one of the most important speech features that indicate emotion; its analysis relies on short-term energy and short-term average amplitude [4]. Because the arousal level of an emotion is correlated with short-term speech energy, energy can be used for emotion recognition. Speech features are commonly classified into two main categories: long-term and short-term features.
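The two energy measures named above can be sketched as follows. This is a minimal illustration, assuming a rectangular window and a plain list of samples; the function names, the frame length and the hop size are illustrative choices and are not taken from the cited studies.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_term_energy(frame):
    """Sum of squared samples in one frame (rectangular window assumed)."""
    return sum(x * x for x in frame)


def short_term_avg_amplitude(frame):
    """Mean absolute sample value in one frame."""
    return sum(abs(x) for x in frame) / len(frame)


# Toy signal, chosen only to keep the arithmetic easy to follow.
signal = [0, 1, -1, 2, -2, 3]
frames = frame_signal(signal, frame_len=4, hop=2)
print([short_term_energy(f) for f in frames])         # [6, 18]
print([short_term_avg_amplitude(f) for f in frames])  # [1.0, 2.0]
```

In practice the frames would typically cover a few tens of milliseconds of audio, and the per-frame values would be summarized (for example by their mean or range) to obtain utterance-level features.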

Prosodic features

Prosodic features are used in most SER studies. They are based on pitch, energy, intensity, speaking rate and fundamental frequency. Prosodic features provide a reliable indication of the emotion. However, there are contradictory reports on the effect of emotions on prosodic features [2].

Spectral features

Mel Frequency Cepstral Coefficients (MFCC) are the features used most often in SER studies [3, 4, 6, 11, 12]. MFCCs are simple to calculate, discriminate well between classes and are robust to noise [3].
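A full MFCC computation involves framing, a Fourier transform, a mel-spaced filterbank, a logarithm and a discrete cosine transform. Only the mel-scale warping step is sketched below, assuming one commonly used form of the Hz-to-mel formula; the cited studies may use a different variant, and the function names are illustrative.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(low_hz, high_hz, n_filters):
    """Centre frequencies (Hz) of n_filters bands equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]


# Ten filter centres between 0 Hz and 8 kHz (an assumed band for illustration).
for centre in mel_filter_centers(0.0, 8000.0, 10):
    print(round(centre, 1))
```

The centres are equally spaced in mels but increasingly far apart in Hz, which gives finer resolution at low frequencies than at high frequencies.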

Linear Prediction Coefficients (LPC) are also extracted from speech, serving as an alternative choice of short-term spectral features for comparison [4]. LPC describes the characteristics of the vocal tract channel of an individual speaker. Because these channel characteristics change with different emotions, LPC features can be used to extract the emotion in speech.
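As a rough sketch of how such coefficients can be obtained, the code below uses the autocorrelation method with the Levinson-Durbin recursion. This is one standard way to compute LPC and is not necessarily the procedure used in [4]; the function names and the sign convention (error filter A(z) = 1 + a1 z^-1 + ...) are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Return (coefficients, prediction_error) via Levinson-Durbin.

    coefficients[0] is always 1.0; the predicted sample is
    -(a[1]*x[n-1] + ... + a[order]*x[n-order]).
    Assumes the frame is not all zeros (error would be 0).
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= (1.0 - k * k)
    return a, err


# A decaying sequence x[n] = 0.5 * x[n-1]; a first-order predictor should
# recover a coefficient close to -0.5 under the sign convention above.
frame = [0.5 ** n for n in range(50)]
coeffs, error = lpc(frame, order=1)
print(coeffs, error)
```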

Classifier Selection

In speech emotion recognition, various types of classifiers are used to recognize the emotion in a speaker’s utterance. A speech emotion recognition system consists of a front-end processing unit that extracts the appropriate features from the speech data, and a classifier that decides the underlying emotion of the speech. After the features are calculated, the best features are provided to the classifier. Commonly used classifiers include the Gaussian Mixture Model (GMM), K-nearest neighbors (KNN), Hidden Markov Model (HMM), Support Vector Machine (SVM) and Artificial Neural Network (ANN).

Based on several studies, HMM appears to be the most used classifier in emotion classification, probably because it is widely used in almost all speech applications. While HMM is the most widely used classifier in automatic speech recognition (ASR), GMM is considered the state-of-the-art classifier for speaker identification and verification. Each classifier has some advantages and limitations over the others.

Hidden Markov Model (HMM)

The HMM classifier has been extensively used in speech applications because it is physically related to the production mechanism of the speech signal. It is a doubly stochastic process consisting of a first-order Markov chain whose states are hidden from the observer. There are many design issues regarding the structure and the training of the HMM classifier.

The topology of the HMM may be a left-to-right topology, as in most speech recognition applications, or a fully connected topology [2]. The left-to-right topology explicitly models the advance in time. However, this assumption may not be valid in speech emotion recognition, because in this case the HMM states correspond to emotional cues such as pauses. In general, the classification accuracy of the HMM classifier is comparable to that of other well-known classifiers.
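To make the topology distinction concrete, the sketch below evaluates the likelihood of a short observation sequence with the forward algorithm for a toy two-state, left-to-right HMM with discrete emissions. All probabilities are made-up illustrative values, and discrete emissions are assumed only for brevity; SER systems more often use continuous emission densities.

```python
def is_left_to_right(trans):
    """True if no transition goes back to an earlier state."""
    return all(trans[i][j] == 0.0
               for i in range(len(trans)) for j in range(i))


def forward_likelihood(init, trans, emit, observations):
    """P(observations | model) computed with the forward algorithm.

    init[s]     : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n_states = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
                 for s in range(n_states)]
    return sum(alpha)


# Hypothetical left-to-right model: state 0 can stay or advance, state 1 is final.
init = [1.0, 0.0]
trans = [[0.5, 0.5],
         [0.0, 1.0]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

print(is_left_to_right(trans))                         # True
print(forward_likelihood(init, trans, emit, [0, 1]))   # approximately 0.405
```

A typical way to use such models for classification is to train one HMM per emotion and pick the emotion whose model gives the highest likelihood; a fully connected topology would simply have non-zero entries below the diagonal of `trans`.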

Gaussian mixture models (GMM)

A Gaussian mixture model is a probabilistic model for density estimation that uses a convex combination of multivariate normal densities. It can be considered a special continuous HMM that contains only one state. Moreover, GMMs are very efficient in modeling multi-modal distributions, and their training and testing requirements are much lower than those of a general continuous HMM [2, 13]. Therefore, GMMs are more appropriate for speech emotion recognition when only global features are to be extracted from the training utterances. The main problem of the GMM is that it cannot model the temporal structure of the training data, since all the training and testing equations are based on the assumption that all vectors are independent.
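The convex-combination definition and the independence assumption can be illustrated with a small sketch. It assumes one-dimensional features for brevity (the text refers to multivariate densities), and the mixture parameters are arbitrary illustrative values.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def gmm_pdf(x, weights, means, variances):
    """Convex combination of component densities; weights must sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def gmm_log_likelihood(xs, weights, means, variances):
    """Log-likelihood of a sequence, treating every vector as independent."""
    return sum(math.log(gmm_pdf(x, weights, means, variances)) for x in xs)


# Hypothetical two-component mixture.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
sequence = [-1.2, 0.3, 0.9, 1.5]

print(gmm_log_likelihood(sequence, weights, means, variances))
print(gmm_log_likelihood(sequence[::-1], weights, means, variances))
```

Both calls give the same value up to rounding: because the log-likelihood is a sum over independent frames, reordering the frames does not change the score, which is exactly why the model cannot capture temporal structure.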

Neural networks

The artificial neural network (ANN) classifier has some advantages over GMMs and HMMs. ANNs are more effective in modeling nonlinear mappings. Also, their classification performance is usually better than that of HMMs and GMMs when the number of training examples is relatively low. Almost all ANNs can be categorized into three basic types: multilayer perceptron (MLP), recurrent neural networks (RNN), and radial basis function (RBF) networks. The last is rarely used in speech emotion recognition.

On the other hand, MLP neural networks are relatively common in speech emotion recognition, owing to their ease of implementation and a well-defined training algorithm once the structure of the ANN is completely specified. However, ANN classifiers have many design parameters. Therefore, in some speech emotion recognition systems, more than one ANN is used, and an appropriate aggregation scheme combines the outputs of the individual ANN classifiers. The classification accuracy of ANNs is fairly low compared to other classifiers.

Support vector machine (SVM)

SVM classifiers are mainly based on the use of kernel functions to nonlinearly map the original features to a high-dimensional space where the data can be well classified using a linear classifier [2]. SVM classifiers are widely used in many pattern recognition applications and have been shown to outperform other well-known classifiers.
They have some advantages over GMM and HMM, including the global optimality of the training algorithm and the existence of excellent data-dependent generalization bounds [13]. Their main disadvantage is that there is no systematic way to choose the kernel function, and hence separability of the transformed features is not guaranteed.
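The role of the kernel can be seen in a small sketch of an SVM decision function. It assumes a radial basis function (RBF) kernel, which is only one of several possible choices, and uses made-up support vectors and coefficients instead of a trained model; the function names are illustrative.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias, where coef_i = alpha_i * y_i.

    The sign of f(x) gives the predicted class in a two-class problem.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical model: one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 0.0]]
dual_coefs = [1.0, -1.0]

for point in ([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]):
    print(point, svm_decision(point, support_vectors, dual_coefs, bias=0.0, gamma=1.0))
```

The point midway between the two support vectors gets a score of zero, so it lies on the decision boundary. Changing `gamma` or swapping in a different kernel changes the shape of that boundary, which is why the kernel choice matters and usually has to be made empirically.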

Multiple classifier systems (MCS)

Multiple classifier systems (MCS) have recently been proposed to deal with the large computational requirements for training that highly complex classifiers may impose. There are three approaches for combining classifiers: hierarchical, serial, and parallel. In the hierarchical approach, classifiers are arranged in a tree structure where the set of candidate classes becomes smaller as we go deeper in the tree. At the leaf-node classifiers, only one class remains after the decision. In the serial approach, classifiers are placed in a queue where each classifier reduces the number of candidate classes for the next classifier [2]. In the parallel approach, all classifiers work independently and a decision fusion algorithm is applied to their outputs.
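The parallel and serial schemes can be sketched as follows. Majority voting is used here as the decision fusion rule purely as an assumption; other fusion rules exist. The stage functions are hypothetical stand-ins for trained classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Parallel fusion: return the label predicted by most classifiers.

    Ties are resolved in favour of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def serial_cascade(sample, candidates, stages):
    """Serial scheme: each stage narrows the candidate classes for the next."""
    for stage in stages:
        candidates = stage(sample, candidates)
        if len(candidates) <= 1:
            break
    return candidates


# Parallel: three independent classifiers have already produced labels.
print(majority_vote(["anger", "joy", "anger"]))  # anger

# Serial: hypothetical stages that prune the candidate set step by step.
def drop_low_arousal(sample, candidates):
    return [c for c in candidates if c not in ("boredom", "sadness")]

def keep_first(sample, candidates):
    return candidates[:1]

emotions = ["anger", "boredom", "joy", "sadness"]
print(serial_cascade("utterance-1", emotions, [drop_low_arousal, keep_first]))  # ['anger']
```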


Speech emotion recognition is a field that tries to recognize human emotions from speech. Achieving this requires a sequence of steps. This work presented the steps that are important and commonly used in SER studies.

Despite the relatively good performance reported by some articles in recognizing emotions from speech, some issues remain. Firstly, the limited number of publicly available databases leads researchers to evaluate their methodology on a very small number of databases. Publicly available databases also have issues in their design. The language used and the actors that make up a database are very significant. A benchmark database may solve the problem of each study using a different database or its own.

The next significant element of an SER system is the set of features used. The number and the type of features seem to have an effect on SER. Prosodic features are used in many studies, but MFCC features also seem to play an important role in recognizing emotions from speech.

Finally, the classifier selection is another important part of SER. HMM, SVM and ANN have been widely used in such studies. Every classifier has strengths and weaknesses relative to the others. From the review of previous research, it is evident that the recognition rate depends on the features, the data and the classification method used. Apart from that, combined features tend to give a better recognition rate than a single feature. Many hybrid feature combinations remain unstudied.


[1] Koolagudi, S. G., & Rao, K. S. (2012). Emotion recognition from speech: a review. International journal of speech technology, 15(2), 99-117.

[2] El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572-587.

[3] Pan, Y., Shen, P., & Shen, L. (2012). Speech emotion recognition using support vector machine. International Journal of Smart Home, 6(2), 101-107.

[4] Wu, S., Falk, T. H., & Chan, W. Y. (2011). Automatic speech emotion recognition using modulation spectral features. Speech communication, 53(5), 768-785.

[5] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. Proceedings of Interspeech.

[6] Seehapoch, T., & Wongthanavasu, S. (2013, January). Speech Emotion Recognition Using Support Vector Machines. In Knowledge and Smart Technology (KST), 2013 5th International Conference on (pp. 86-91). IEEE.

[7] Mao, X., Chen, L., & Fu, L. (2007). Mandarin speech emotion recognition based on a hybrid of HMM/ANN. Int. J. Comput., 1(4), 321-324.

[8] Grimm, M., Kroschel, K., & Narayanan, S. (2008, June). The Vera am Mittag German audio-visual emotional speech database. In Multimedia and Expo, 2008 IEEE International Conference on (pp. 865-868). IEEE.

[9] Stankovic, I., Karnjanadecha, M., & Delic, V. (2011, December). Improvement of Thai speech emotion recognition by using face feature analysis. In Proceedings of the Nineteenth IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS2011), Chiang Mai, Thailand, December 7-9 (pp. 87).

[10] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4), 335-359.

[11] Han, K., Yu, D., & Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of INTERSPEECH, ISCA, Singapore, 223-227.

[12] Chen, L., Mao, X., Xue, Y., & Cheng, L. L. (2012). Speech emotion recognition: Features and classification models. Digital Signal Processing, 22(6), 1154-1160.

[13] Ingale, A. B., & Chaudhari, D. S. (2012). Speech emotion recognition. International Journal of Soft Computing and Engineering (IJSCE) ISSN, 2231-2307.

Disclaimer: The present content may not be used for training artificial intelligence or machine learning algorithms. All other uses, including search, entertainment, and commercial use, are permitted.