
MSP-Podcast corpus:

A large naturalistic emotional speech dataset

We are building the largest naturalistic emotional speech dataset in the community. The MSP-Podcast corpus contains speech segments from podcast recordings, which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100 hours), divided into the following partitions:

  • Test set 1: We use segments from 60 speakers (30 female, 30 male) - 12,902 segments
  • Test set 2: We randomly select 3,521 segments from 100 podcasts. Segments from these podcasts are not included in any other partition.
  • Development set: We use segments from 44 speakers (22 female, 22 male) - 7,538 segments
  • Train set: We use the remaining speech samples - 38,179 segments
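
As a quick sanity check, the four partitions above add up to the 62,140 speaking turns in version 1.7. The snippet below is only a minimal bookkeeping sketch of that split; it does not reflect any official file or label format of the release.

    # Partition sizes for version 1.7, as listed above. This is an illustrative
    # summary only; the official release defines the partitions in its own files.
    PARTITION_SIZES = {
        "Test set 1": 12902,    # 60 speakers (30 female, 30 male)
        "Test set 2": 3521,     # segments from 100 held-out podcasts
        "Development": 7538,    # 44 speakers (22 female, 22 male)
        "Train": 38179,         # remaining speech samples
    }

    # The four partitions together cover all 62,140 speaking turns.
    assert sum(PARTITION_SIZES.values()) == 62140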

Spontaneous emotional speech data

This corpus is annotated with emotional labels using attribute-based descriptors (activation, dominance, and valence) and categorical labels (anger, happiness, sadness, disgust, surprise, fear, contempt, neutral, and other). To the best of our knowledge, this is the largest emotional speech corpus in the community.
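
For concreteness, a speaking-turn label combining these two types of descriptors could be represented as in the sketch below. The field names, segment ID format, and attribute scale are illustrative assumptions, not the official annotation format of the release.

    from dataclasses import dataclass

    # Categorical and attribute-based labels assigned to each speaking turn, as
    # described above. The container, ID format, and 1-7 attribute scale are
    # illustrative assumptions, not the official MSP-Podcast label format.
    CATEGORIES = ["anger", "happiness", "sadness", "disgust", "surprise",
                  "fear", "contempt", "neutral", "other"]

    @dataclass
    class SpeakingTurnLabel:
        segment_id: str      # hypothetical identifier for the speaking turn
        category: str        # one of CATEGORIES
        activation: float    # attribute-based descriptors (e.g., on a 1-7 scale)
        dominance: float
        valence: float

    example = SpeakingTurnLabel("segment_000001", "happiness", 5.5, 4.0, 6.0)
    assert example.category in CATEGORIES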

Upon downloading the podcasts, the recordings are formatted and named using a predefined protocol. Then, they are automatically segmented and analyzed so that the resulting segments contain clean speech. We do not want to include background music, overlapped speech, or voices recorded over a telephone, where the bandwidth is limited to 4 kHz, discarding spectral components that can be useful features for speech emotion recognition. Using existing algorithms, the recordings are segmented into speaking turns. The approach considers voice activity detection, speaker diarization, and music/speech recognition. The code also includes a module that estimates the noise level of the recordings using automatic algorithms (e.g., WADA SNR).
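
The sketch below illustrates how such a segmentation pipeline fits together, based only on the description above. The helper callables (run_vad, diarize, detect_music, wada_snr) and the SNR threshold are hypothetical placeholders, not the actual tools used to build the corpus.

    # Schematic sketch of the segmentation step described above. The callables
    # run_vad, diarize, detect_music, and wada_snr stand in for whatever voice
    # activity detection, speaker diarization, music/speech detection, and SNR
    # estimation tools are actually used; they are assumptions for illustration.
    def extract_clean_turns(recording, run_vad, diarize, detect_music, wada_snr,
                            min_snr_db=20.0):
        """Return speaking turns with a single speaker, no music, and enough SNR."""
        turns = []
        for segment in run_vad(recording):        # keep speech regions only
            if len(diarize(segment)) != 1:        # drop overlapped/multi-speaker speech
                continue
            if detect_music(segment):             # drop segments with background music
                continue
            if wada_snr(segment) < min_snr_db:    # drop noisy segments (WADA SNR estimate)
                continue
            turns.append(segment)
        return turns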

The size of this naturalistic collection provides unique opportunities to explore machine learning algorithms that require large corpora for training (e.g., deep learning). Another interesting feature of this corpus is that the emotional content is balanced, providing enough samples across the valence-arousal space. Emotion recognition systems trained with existing emotional corpora are used to retrieve samples that are predicted to have certain emotional content. These samples are then emotionally annotated using crowdsourcing.
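
As a rough illustration of this retrieval step, the sketch below scores candidate segments with an emotion recognizer trained on existing corpora and keeps those predicted to lie away from the neutral region of the valence-arousal space, which are then sent for crowdsourced annotation. The feature extractor, predictor, and distance threshold are assumptions for illustration, not the actual retrieval models.

    # Illustrative sketch of emotion-based retrieval: a recognizer trained on
    # existing emotional corpora scores each candidate, and segments predicted
    # to be far from neutral in the valence-arousal plane are kept for
    # crowdsourced annotation. extract_features, predict_valence_arousal, and
    # the threshold are assumptions, not the models used to build the corpus.
    def select_candidates(segments, extract_features, predict_valence_arousal,
                          min_distance=1.0, neutral=(0.0, 0.0)):
        """Keep segments whose predicted (valence, arousal) lies far from neutral."""
        selected = []
        for seg in segments:
            valence, arousal = predict_valence_arousal(extract_features(seg))
            dist = ((valence - neutral[0]) ** 2 + (arousal - neutral[1]) ** 2) ** 0.5
            if dist >= min_distance:
                selected.append(seg)  # likely emotional: send to crowdsourcing
        return selected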

Key features of this corpus:

  • We download publicly available audio recordings released under Creative Commons licenses. We only use the podcasts that have less restrictive licenses, so we can modify, sell, and distribute the corpus (you can use it for commercial products!).
  • Most of the segments in a regular podcast are neutral. We use machine learning techniques trained with available data to retrieve candidate segments. These segments are then emotionally annotated with crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
  • We annotate categorical emotions and attribute-based labels at the speaking-turn level.
  • This is an ongoing effort; we currently have 62,140 speaking turns (100 hours). We collect approximately 10,000-13,000 new speaking turns per year. Our goal is to reach 400 hours.

The MSP-Podcast corpus is being collected as part of our NSF project "CCRI: New: Creating the largest speech emotional database by leveraging existing naturalistic recordings" (NSF CNS: 2016719). For further information on the corpus, please read:

  1. Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471-483, October-December 2019. [pdf] [cited] [bib]

Release of the Corpus: Academic License

The corpus is now available under an Academic License (free of cost). Please download this pdf. The process requires your institution to sign the agreement.

Instructions:

  1. It has to be signed by someone with signing authority on behalf of the university (usually someone from the sponsored research office).
  2. The license is a standard FDP data transfer form. It should be easy for you to obtain a signature.

We emphasize that the form should only be signed by someone with signature authority at the academic institution; otherwise the agreement is not valid. If the group leader or laboratory director happens to have that authority, it is fine for them to sign, but this is rarely the case. This is a legally binding document. Send the signed form to Prof. Carlos Busso -

Release of the Corpus: Commercial License

Companies interested in this corpus can obtain a commercial license from UT Dallas. The commercial license is US$10,000, which gives you access to the corpus. After a year, we provide the opportunity to renew the agreement for US$5,000. If you renew the agreement, we give you the current version of the corpus with all the updates made during the year (again, this is an ongoing project). If you do not renew the agreement, you can keep using the version that you received without any restriction; however, you will not receive new updates to the corpus unless you enter into a separate agreement. Please contact Carlos Busso if you are interested. -

Some of our Publications using this Corpus:

  1. Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471-483, October-December 2019. [pdf] [cited] [bib]
  2. Mohammed Abdelwahab and Carlos Busso, "Domain adversarial for acoustic emotion recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423-2435, December 2018. [pdf] [cited] [ArXiv] [bib]
  3. Reza Lotfian and Carlos Busso, "Curriculum learning for speech emotion recognition from crowdsourced labels," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 815-826, April 2019. [pdf] [cited] [ArXiv] [bib]
  4. Wei-Cheng Lin and Carlos Busso, "An efficient temporal modeling approach for speech emotion recognition," in Interspeech 2020, Shanghai, China, October 2020. [soon cited] [pdf] [bib]
  5. Kusha Sridhar and Carlos Busso, "Ensemble of students taught by probabilistic teachers to improve speech emotion recognition," in Interspeech 2020, Shanghai, China, October 2020. [soon cited] [pdf] [bib]
  6. Kusha Sridhar and Carlos Busso, "Modeling uncertainty in predicting emotional attributes from spontaneous speech," in IEEE international conference on acoustics, speech and signal processing (ICASSP 2020), Barcelona, Spain, May 2020, pp. 8384-8388. [soon cited] [pdf] [bib] [slides]
  7. Kusha Sridhar and Carlos Busso, "Speech emotion recognition with a reject option," in Interspeech 2019, Graz, Austria, September 2019, pp. 3272-3276 [pdf] [cited] [bib] [poster]
  8. Mohammed Abdelwahab and Carlos Busso, "Active learning for speech emotion recognition using deep neural network," in International Conference on Affective Computing and Intelligent Interaction (ACII 2019), Cambridge, UK, September 2019, pp. 441-447. [pdf] [cited] [bib] [slides]
  9. Michelle Bancroft, Reza Lotfian, John Hansen, and Carlos Busso, "Exploring the intersection between speaker verification and emotion recognition," in International Workshop on Social & Emotion AI for Industry (SEAIxI), Cambridge, UK, September 2019, pp. 337-342. [soon cited] [pdf] [bib] [slides]
  10. John Harvill, Mohammed AbdelWahab, Reza Lotfian, and Carlos Busso, "Retrieving speech samples with similar emotional content using a triplet loss function," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, May 2019, pp. 3792-3796. [pdf] [cited] [bib] [poster]
  11. Srinivas Parthasarathy and Carlos Busso, "Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve predictions of emotional attributes," in Interspeech 2018, Hyderabad, India, September 2018, pp. 3698-3702. [pdf] [cited] [ArXiv] [bib] [poster]
  12. Kusha Sridhar, Srinivas Parthasarathy and Carlos Busso, "Role of regularization in the prediction of valence from speech," in Interspeech 2018, Hyderabad, India, September 2018, pp. 941-945. [pdf] [cited] [bib] [slides]
  13. Srinivas Parthasarathy and Carlos Busso, "Preference-learning with qualitative agreement for sentence level emotional annotations," in Interspeech 2018, Hyderabad, India, September 2018, pp. 252-256. [pdf] [cited] [bib] [poster]
  14. Reza Lotfian and Carlos Busso, "Predicting categorical emotions by jointly learning primary and secondary emotions through multitask learning," in Interspeech 2018, Hyderabad, India, September 2018, pp. 951-955. [pdf] [cited] [bib] [slides]
  15. Najmeh Sadoughi and Carlos Busso, "Expressive speech-driven lip movements with multitask learning," in IEEE Conference on Automatic Face and Gesture Recognition (FG 2018), Xi'an, China, May 2018, pp. 409-415. [pdf] [cited] [bib] [poster]
  16. Mohammed Abdelwahab and Carlos Busso, "Study of dense network approaches for speech emotion recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, April 2018, pp. 5084-5088. [pdf] [cited] [bib] [poster]
  17. Reza Lotfian and Carlos Busso, "Formulating emotion perception as a probabilistic model with application to categorical emotion classification," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017, pp. 415-420. [pdf] [cited] [bib] [slides]
  18. Srinivas Parthasarathy and Carlos Busso, "Predicting speaker recognition reliability by considering emotional content," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017, pp. 434-436. [pdf] [cited] [bib] [poster]
  19. Srinivas Parthasarathy and Carlos Busso, "Jointly predicting arousal, valence and dominance with multi-task learning," in Interspeech 2017, Stockholm, Sweden, August 2017, pp. 1103-1107. [pdf] [cited] [bib] [slides]
    Nominated for Best Student Paper at Interspeech 2017!
  20. Srinivas Parthasarathy, Chunlei Zhang, John H.L. Hansen, and Carlos Busso, "A study of speaker verification performance with expressive speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 5540-5544. [pdf] [cited] [bib] [poster]

This material is based upon work supported by the National Science Foundation under Grant IIS-1453781. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

(c) Copyright. All rights reserved.