Capturing emotions in voice: A comparative analysis of methodologies in psychology and digital signal processing

  • Daniela Hekiert, SWPS University of Social Sciences and Humanities, Warsaw
  • Magdalena Igras-Cybulska, AGH University of Science and Technology, Cracow
Keywords: emotional vocalizations; emotional prosody; vocal bursts; process of encoding and decoding


People use their voices to communicate not only verbally but also emotionally. This article presents theories and methodologies concerning emotional vocalizations at the intersection of psychology and digital signal processing. Specifically, it describes the encoding (production) and decoding (recognition) of emotional sounds, and it reviews and compares strategies for database design, parameterization, and classification. Whereas psychology predominantly focuses on the subjective recognition of emotional vocalizations, digital signal processing relies on automated and thus more objective measures of vocal affect. The article aims to compare these two approaches and to suggest ways of combining them to gain a more comprehensive insight into the vocal communication of emotions.

