
MSP-Podcast corpus:

A large naturalistic speech emotional dataset

We are building the largest naturalistic speech emotional dataset in the community. The MSP-Podcast corpus contains speech segments from podcast recordings, which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.1 of the corpus contains 22,630 speaking turns:

  • Test set: We use segments from 50 speakers (25 female, 25 male) - 7,181 segments
  • Development set: We use segments from 20 speakers (10 female, 10 male) - 2,614 segments
  • Train set: We use the remaining speech samples - 12,835 segments
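
For illustration, the following sketch checks that a partition like the one above is speaker-independent, i.e., that no speaker contributes segments to more than one set. It assumes a hypothetical metadata file with segment_id, speaker_id, and split columns; the actual release format of the corpus may differ.

    # Minimal sketch: verify that a train/development/test partition is speaker-independent.
    # Assumes a hypothetical CSV with columns segment_id, speaker_id, split
    # (the actual corpus release format may differ).
    import csv
    from collections import defaultdict

    def load_split_speakers(metadata_path):
        """Collect the set of speaker IDs that appear in each partition."""
        speakers = defaultdict(set)
        with open(metadata_path, newline="") as f:
            for row in csv.DictReader(f):
                speakers[row["split"]].add(row["speaker_id"])
        return speakers

    def check_speaker_independence(speakers):
        """Return True if no speaker appears in more than one partition."""
        splits = list(speakers)
        for i, a in enumerate(splits):
            for b in splits[i + 1:]:
                if speakers[a] & speakers[b]:
                    return False
        return True

    if __name__ == "__main__":
        speakers = load_split_speakers("msp_podcast_metadata.csv")
        print("Speaker-independent split:", check_speaker_independence(speakers))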

[Image: spontaneous speech emotional data]

This corpus is annotated with emotional labels using attribute-based descriptors (activation, dominance and valence) and categorical labels (anger, happiness, sadness, disgust, surprise, fear, contempt, neutral and other). To the best of our knowledge, this is the largest speech emotional corpus in the community.

After the podcasts are downloaded, the recordings are formatted and named following a predefined protocol. They are then automatically segmented and analyzed so that only clean speech segments are retained. We do not want to include background music, overlapped speech, or voices recorded over a telephone, where the bandwidth is limited to 4 kHz, discarding spectral components that can be useful features for speech emotion recognition. Using existing algorithms, the recordings are segmented into speaking turns. The approach considers voice activity detection, speaker diarization, and music/speech discrimination. The code also includes a module that estimates the noise level of the recordings using automatic algorithms (e.g., WADA-SNR).
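
As a rough illustration of this filtering stage, the sketch below strings together the checks described above. The detector functions are placeholders standing in for the existing algorithms mentioned in the text (they are not a specific library's API), and the SNR threshold is an assumed value rather than the one used for the corpus.

    # Sketch of the segment-filtering stage described above.
    # The three detectors are placeholders for existing algorithms
    # (music/speech discrimination, speaker diarization, WADA-SNR);
    # they are not part of any specific library.

    def contains_music(audio, sample_rate):
        """Placeholder for a music/speech discriminator."""
        raise NotImplementedError

    def is_single_speaker(audio, sample_rate):
        """Placeholder for a check based on speaker diarization output."""
        raise NotImplementedError

    def estimate_wada_snr(audio, sample_rate):
        """Placeholder for a WADA-SNR noise-level estimate in dB."""
        raise NotImplementedError

    MIN_SNR_DB = 15.0  # assumed threshold; the value used for the corpus may differ

    def keep_segment(audio, sample_rate):
        """Keep a candidate speaking turn only if it looks like clean, single-speaker speech."""
        if contains_music(audio, sample_rate):
            return False
        if not is_single_speaker(audio, sample_rate):
            return False
        if estimate_wada_snr(audio, sample_rate) < MIN_SNR_DB:
            return False
        return True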

The size of this naturalistic corpus provides unique opportunities to explore machine learning algorithms that require large corpora for training (e.g., deep learning). Another interesting feature of this corpus is that its emotional content is balanced, providing enough samples across the valence-arousal space. Emotion recognition systems trained with existing emotional corpora are used to retrieve candidate samples predicted to have a target emotional content. These samples are then emotionally annotated using crowdsourcing.
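
The retrieval step can be pictured with the sketch below: a regressor trained on existing emotional corpora scores candidate segments, and only segments predicted to fall in an underrepresented region of the arousal-valence space are forwarded to crowdsourced annotation. The predictor interface and the attribute range in the example are assumptions for illustration; they do not describe the actual retrieval system.

    # Sketch of retrieving emotionally balanced candidates for annotation.
    # `predict_arousal_valence` is any model trained on existing corpora that
    # maps acoustic features to (arousal, valence) scores; it is an assumed
    # interface, not part of the corpus release.

    def retrieve_candidates(segments, predict_arousal_valence, target_region, limit=100):
        """Select segments whose predictions land inside a target arousal-valence box."""
        (a_lo, a_hi), (v_lo, v_hi) = target_region
        selected = []
        for seg_id, features in segments:
            arousal, valence = predict_arousal_valence(features)
            if a_lo <= arousal <= a_hi and v_lo <= valence <= v_hi:
                selected.append(seg_id)
                if len(selected) >= limit:
                    break
        return selected

    # Example: request candidates likely to be low arousal / negative valence,
    # a region that is typically scarce in found data (ranges are illustrative).
    # candidates = retrieve_candidates(candidate_segments, model.predict,
    #                                  target_region=((1.0, 3.0), (1.0, 3.0)))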

The MSP-Podcast corpus is being recorded as part of our NSF project "CAREER: Advanced Knowledge Extraction of Affective Behaviors During Natural Human Interaction" (NSF IIS: 1453781). For further information on the corpus, please read:

  1. Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, to appear, 2019. [pdf] [cited] [bib]

We plan to share this corpus with the research community in the future.

Some of our Publications using this Corpus:

  1. Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, to appear, 2019. [pdf] [cited] [bib]
  2. Mohammed Abdelwahab and Carlos Busso, "Domain adversarial for acoustic emotion recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423-2435, December 2018. [pdf] [cited] [ArXiv] [bib]
  3. Reza Lotfian and Carlos Busso, "Curriculum learning for speech emotion recognition from crowdsourced labels," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 815-826, April 2019. [pdf] [cited] [ArXiv] [bib]
  4. Kusha Sridhar and Carlos Busso, "Speech emotion recognition with a reject option," in Interspeech 2019, Graz, Austria, September 2019. [soon cited] [soon pdf] [bib]
  5. Mohammed Abdelwahab and Carlos Busso, "Active learning for speech emotion recognition using deep neural network," in International Conference on Affective Computing and Intelligent Interaction (ACII 2019), Cambridge, UK, September 2019. [soon cited] [soon pdf] [bib]
  6. John Harvill, Mohammed AbdelWahab, Reza Lotfian, and Carlos Busso, "Retrieving speech samples with similar emotional content using a triplet loss function," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, May 2019, pp. 3792-3796. [soon cited] [pdf] [bib] [poster]
  7. Srinivas Parthasarathy and Carlos Busso, "Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve predictions of emotional attributes," in Interspeech 2018, Hyderabad, India, September 2018, pp. 3698-3702. [pdf] [cited] [ArXiv] [bib] [poster]
  8. Kusha Sridhar, Srinivas Parthasarathy and Carlos Busso, "Role of regularization in the prediction of valence from speech," in Interspeech 2018, Hyderabad, India, September 2018, pp. 941-945. [pdf] [cited] [bib] [slides]
  9. Srinivas Parthasarathy and Carlos Busso, "Preference-learning with qualitative agreement for sentence level emotional annotations," in Interspeech 2018, Hyderabad, India, September 2018, pp. 252-256. [pdf] [cited] [bib] [poster]
  10. Reza Lotfian and Carlos Busso, "Predicting categorical emotions by jointly learning primary and secondary emotions through multitask learning," in Interspeech 2018, Hyderabad, India, September 2018, pp. 951-955. [pdf] [cited] [bib] [slides]
  11. Mohammed Abdelwahab and Carlos Busso, "Study of dense network approaches for speech emotion recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, April 2018, pp. 5084-5088. [pdf] [cited] [bib] [poster]
  12. Reza Lotfian and Carlos Busso, "Formulating emotion perception as a probabilistic model with application to categorical emotion classification," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017, pp. 415-420. [pdf] [cited] [bib] [slides]
  13. Srinivas Parthasarathy and Carlos Busso, "Predicting speaker recognition reliability by considering emotional content," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017, pp. 434-436. [pdf] [cited] [bib] [poster]
  14. Srinivas Parthasarathy and Carlos Busso, "Jointly predicting arousal, valence and dominance with multi-task learning," in Interspeech 2017, Stockholm, Sweden, August 2017, pp. 1103-1107. [pdf] [cited] [bib] [slides]
    Nominated for Best Student Paper at Interspeech 2017!
  15. Srinivas Parthasarathy, Chunlei Zhang, John H.L. Hansen, and Carlos Busso, "A study of speaker verification performance with expressive speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 5540-5544. [pdf] [cited] [bib] [poster]

This material is based upon work supported by the National Science Foundation under Grant IIS-1453781. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

© Copyright. All rights reserved.