Recognition of Human Emotion in Speech Using Modulation Spectral Features and Support Vector Machines

Wu, Siqing
Emotion recognition, Speech modulation, Spectro-temporal representation, Affective computing
Automatic recognition of human emotion in speech aims to identify the underlying emotional state of a speaker from the speech signal. The area has received rapidly increasing research interest over the past few years. However, designing powerful spectral features for high-performance speech emotion recognition (SER) remains an open challenge. Most spectral features employed in current SER techniques convey only short-term spectral properties and omit useful long-term temporal modulation information. In this thesis, modulation spectral features (MSFs) are proposed for SER, with support vector machines used for machine learning. By employing an auditory filterbank and a modulation filterbank for speech analysis, an auditory-inspired long-term spectro-temporal (ST) representation is obtained, which captures both acoustic frequency and temporal modulation frequency components. The MSFs are then extracted from the ST representation, thereby conveying information that is important for human speech perception but missing from conventional short-term spectral features (STSFs). Experiments show that the proposed features outperform features based on mel-frequency cepstral coefficients and perceptual linear predictive coefficients, two commonly used STSFs. The MSFs also yield a substantial improvement in recognition performance when used to augment the widely used prosodic features, with recognition accuracy above 90% achieved when classifying seven emotion categories. Moreover, the proposed features in combination with prosodic features attain estimation performance comparable to human evaluation for recognizing continuous emotions.
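The two-stage analysis described in the abstract (an acoustic-frequency decomposition followed by a modulation-frequency decomposition of each band's temporal envelope) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not specify the filterbanks, so the sketch assumes crude stand-ins, namely uniform bands of a short-time DFT in place of the auditory filterbank and rectangular DFT bands in place of the modulation filterbank. All function names, band edges, and frame sizes are hypothetical choices for illustration, and the support vector machine stage is omitted.

```python
import cmath
import math


def band_envelopes(signal, fs, frame_len, hop, acoustic_edges):
    """Stand-in for an auditory filterbank plus envelope detection (assumption)."""
    bin_hz = fs / frame_len
    # Map each acoustic band (Hz) to a half-open range of DFT bins.
    bands = [(round(lo / bin_hz), round(hi / bin_hz))
             for lo, hi in zip(acoustic_edges, acoustic_edges[1:])]
    # Hann window to limit leakage between acoustic bands.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    twiddle = [[cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)] for k in range(n_bins)]
    envelopes = [[] for _ in bands]
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        power = [abs(sum(x * w for x, w in zip(frame, row))) ** 2
                 for row in twiddle]
        # One envelope sample per band per frame: band magnitude.
        for b, (lo, hi) in enumerate(bands):
            envelopes[b].append(math.sqrt(sum(power[lo:hi])))
    return envelopes


def modulation_energies(envelope, env_rate, mod_edges):
    """Stand-in for a modulation filterbank: DFT energy per modulation band."""
    n = len(envelope)
    energies = [0.0] * (len(mod_edges) - 1)
    for k in range(1, n // 2 + 1):  # skip DC (the envelope mean)
        freq = k * env_rate / n
        coeff = sum(e * cmath.exp(-2j * math.pi * k * m / n)
                    for m, e in enumerate(envelope))
        for j, (lo, hi) in enumerate(zip(mod_edges, mod_edges[1:])):
            if lo <= freq < hi:
                energies[j] += abs(coeff) ** 2
    return energies


def st_representation(signal, fs, frame_len=64, hop=32,
                      acoustic_edges=(0, 500, 1000, 1500, 2000),
                      mod_edges=(1, 2, 4, 8, 16, 32)):
    """Return E[acoustic_band][modulation_band] for the whole signal."""
    envelopes = band_envelopes(signal, fs, frame_len, hop, acoustic_edges)
    env_rate = fs / hop  # envelope sampling rate in Hz
    return [modulation_energies(env, env_rate, mod_edges) for env in envelopes]


# Synthetic test signal: a 1250 Hz carrier amplitude-modulated at 5 Hz.
fs = 4000
tone = [(1 + 0.8 * math.cos(2 * math.pi * 5 * n / fs))
        * math.sin(2 * math.pi * 1250 * n / fs) for n in range(fs)]
E = st_representation(tone, fs)
```

For this synthetic signal, the largest entry of `E` should fall in the acoustic band covering 1000 to 1500 Hz and the modulation band covering 4 to 8 Hz, which is the kind of joint acoustic and modulation frequency information that short-term spectral features do not expose. Features in the spirit of the MSFs could then be computed from `E` and passed to a classifier, but the specific feature definitions are given in the thesis itself, not in this sketch.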