The Hoosier Vocal Emotions Corpus: A validated set of North American English pseudo-words for evaluating emotion processing

Isabelle Darcy¹ & Nathalie M. G. Fontaine²
¹ Indiana University, Bloomington, IN, USA
² University of Montreal, Montreal, Quebec, Canada
Correspondence: Isabelle Darcy, idarcy@indiana.edu

© The Psychonomic Society, Inc. 2019

Abstract
This article presents the development of the "Hoosier Vocal Emotions Corpus," a stimulus set of recorded pseudo-words based on the pronunciation rules of English. The corpus contains 73 controlled audio pseudo-words uttered by two actresses in five different emotions (i.e., happiness, sadness, fear, anger, and disgust) and in a neutral tone, yielding 1,763 audio files. In this article, we describe the corpus as well as a validation study of the pseudo-words. A total of 96 native English speakers completed a forced choice emotion identification task. All emotions were recognized better than chance overall, with substantial variability among the different tokens. All of the recordings, including the ambiguous stimuli, are made freely available, and the recognition rates and the full confusion matrices for each stimulus are provided in order to assist researchers and clinicians in the selection of stimuli. The corpus has unique characteristics that can be useful for experimental paradigms that require controlled stimuli (e.g., electroencephalographic or fMRI studies). Stimuli from this corpus could be used by researchers and clinicians to answer a variety of questions, including investigations of emotion processing in individuals with certain temperamental or behavioral characteristics associated with difficulties in emotion recognition (e.g., individuals with psychopathic traits); in bilingual individuals or nonnative English speakers; in patients with aphasia, schizophrenia, or other mental health disorders (e.g., depression); or in training automatic emotion recognition algorithms. The Hoosier Vocal Emotions Corpus is available at https://psycholinguistics.indiana.edu/hoosiervocalemotions.htm.

Keywords: Vocal emotions · Forced choice identification · Emotion perception · Speech corpus · Validation · English · Pseudo-words · Emotion stimulus set

The ability to process salient emotional and social cues is critical for adaptive behavior. A failure to process expressions of emotion adequately can have important negative and long-term effects on social behavior and can be a risk factor for adaptation problems, including aggressive and antisocial behavior (Herba & Phillips, 2004). The majority of studies on emotion processing have focused on facial expressions of emotion (e.g., Pollak & Sinha, 2002; Tottenham et al., 2009). There is less research on vocal expressions of emotion, notably because of the difficulty in obtaining naturalistic recordings of vocal expressions of specific emotions (Scherer, Banse, Wallbott, & Goldbeck, 1991). Still, vocal cues play an important role in the expression of emotions. By "vocal," we refer to "everything that remains present in a spoken message after lexical and syntactic information has been removed" (van Bezooijen, 1984, p. 1).
A growing number of studies conducted in the past decade have indicated that humans, across languages and cultures, can infer emotion from vocal expression alone because of differential acoustic patterns (e.g., Banse & Scherer, 1996; Bänziger, Mortillaro, & Scherer, 2012; Castro & Lima, 2010; Juslin & Laukka, 2003; Liu & Pell, 2012; Livingstone & Russo, 2018; Pell, Paulmann, Dara, Alasseri, & Kotz, 2009; Sauter, Eisner, Ekman, & Scott, 2010; Scherer, Banse, & Wallbott, 2001). A number of emotion corpora have been produced (see Scherer, Clarke-Polner, & Mortillaro, 2011; Ververidis & Kotropoulos, 2006, for reviews). They all have their particular features and are composed of diverse vocal stimuli. Table 1 presents a sample of data collections of vocal expressions of emotion.

Table 1 Sample of data collections of vocal expressions of emotion (emotions are listed in the terms used by the authors)

Bänziger et al. (2012), Geneva Multimodal Emotion Portrayals Core Set (GEMEP-CS). Language: nonlanguage (pseudo-speech sentences and a nonverbal vocalization, "aaa," by French speakers). Speakers: 5 women and 5 men (professional French-speaking theater actors). Stimuli: 145 emotion expressions (pseudo-speech sentences); acted speech. Emotions: 17 emotions (e.g., amusement, despair, hot anger, fear/panic, joy/elation, sadness, contempt, disgust, surprise). Other perceptual modalities: video (presentation of dynamic picture without sound) and audio–video (presentation of dynamic picture and sound).

Belin, Fillion-Bilodeau, and Gosselin (2008), Montreal Affective Voices (MAV). Language: nonverbal affect bursts using the French vowel "ah." Speakers: 10 different actors (5 women and 5 men). Stimuli: 90 nonverbal affect bursts; acted speech. Emotions: anger, disgust, pain, sadness, surprise, happiness, pleasure, and neutral.

Burkhardt, Paeschke, Rolfes, Sendlmeier, and Weiss (2005), Berlin Emotional Speech Database (EMO-DB). Language: German. Speakers: 5 women and 5 men. Stimuli: 10 meaningful sentences by 6 emotions (plus the neutral state) by 10 actors, in addition to some second versions (n = about 800 sentences); acted speech. Emotions: anger, fear, joy, sadness, disgust, boredom, and neutral.

Castro and Lima (2010), set of Portuguese sentences and pseudosentences. Language: European Portuguese. Speakers: 2 women. Stimuli: 16 Portuguese sentences and 16 pseudosentences by 6 emotions (plus the neutral state), mean length = 8 syllables (range 6–11); acted speech. Emotions: happiness, sadness, anger, fear, disgust, surprise, and neutral.

Costantini, Iadarola, Paoloni, and Todisco (2014), EMOVO Corpus. Language: Italian. Speakers: 6 actors (3 women and 3 men). Stimuli: 14 sentences by 6 emotions (plus the neutral state) by 6 actors (588 sentences); acted speech. Emotions: disgust, joy, fear, anger, surprise, sadness, and neutral.

Laukka et al. (2010), Vocal Expressions of Nineteen Emotions across Cultures (VENEC). Language: English. Speakers: 100 professional actors from 5 English-speaking cultures (USA, India, Kenya, Singapore, and Australia; 50% women). Stimuli: about 6,500 vocal expressions (mainly short phrases with emotionally neutral content, expressed in three levels of intensity); acted speech. Emotions: 19 emotions (e.g., amusement, anger, contempt, disgust, distress, fear, guilt, happiness, shame) and neutral.

Liu and Pell (2012), a database of Chinese vocal emotional stimuli. Language: pseudo-sentences (semantically meaningless and relatively plausible as Chinese sentences). Speakers: 10 native Mandarin speakers (5 women and 5 men). Stimuli: 35 pseudo-sentences by 6 emotions (plus the neutral state); acted speech. Emotions: anger, disgust, fear, sadness, happiness, pleasant surprise, and neutral.

Lima, Castro, and Scott (2013), a corpus of nonverbal vocalizations. Language: nonverbal vocalizations by European Portuguese native speakers. Speakers: 4 speakers (2 women and 2 men) who did not have formal acting training. Stimuli: 121 sounds (no guidance was provided as to the specific kind of sounds the speakers had to make); acted speech. Emotions: 4 positive states (achievement/triumph, amusement, sensual pleasure, and relief) and 4 negative states (anger, disgust, fear, and sadness).

Livingstone and Russo (2018), The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Language: English. Speakers: 24 North American English-speaking professional actors (12 women and 12 men). Stimuli: English sentences (total of 7,356 recordings); acted speech and song. Emotions: speech: calm, happy, sad, angry, fearful, surprise, and disgust; song: calm, happy, sad, angry, and fearful; each expression was produced at two levels of emotional intensity, with an additional neutral expression. Other perceptual modalities: face and voice, face only.

Parsons, Young, Craske, Stein, and Kringelbach (2014), Oxford Vocal Sounds database (OxVoc). Language: nonverbal sounds. Speakers: infant vocalizations (4 girls and 5 boys); adult vocalizations (19 clips by women only for distress vocalizations, 15 women and 15 men for laughter vocalizations, and 15 women and 15 men for neutral vocalizations); animal vocalizations (pet cats and dogs). Stimuli: total of 173 stimuli; infants: cry vocalizations (n = 21), laughter vocalizations (n = 18), neutral babbles (n = 25); adults: distress vocalizations (n = 19), laughter (n = 30), neutral (n = 30); animals: distress (n = 30). Kind of speech: infants: sounds from video recordings of infants filmed in their own homes; adults and animals: sounds found from online resources. Emotions: happy (laughter vocalizations), sad (cry and distress vocalizations), and neutral.

Rigoulot, Wassiliwizky, and Pell (2013), database of emotionally inflected pseudo-utterances. Language: pseudo-utterances by native speakers of Canadian English. Speakers: 4 speakers (2 women and 2 men). Stimuli: 120 pseudo-utterances (7 syllables in length); acted speech. Emotions: anger, disgust, fear, happiness, sadness, and neutral.

Wendt et al. (2003), Wendt and Scheich (2002), Magdeburger Prosodie-Korpus. Language: German. Speakers: 2 actors (a woman and a man). Stimuli: linguistically meaningful words (n > 3,000) and disyllabic pseudo-words (n = 200); acted speech. Emotions: anger, disgust, fear, happiness, sadness, and neutral.

We developed and validated a set of pseudo-words based on the phonology and pronunciation rules of North American English, which we aim to make available to the research and clinical communities. The corpus, named the Hoosier Vocal Emotions Corpus (HVEC), includes important unique characteristics. First, it focuses on disyllabic pseudo-words, rather than meaningful words or sentences, to remove the semantic meaning and allow the speech prosody to become the central attribute of emotion processing (Wendt et al., 2003; Wendt & Scheich, 2002). To our knowledge, only one other corpus (the Magdeburger Prosodie-Korpus, a set of stimuli respecting the phonotactic and phonetic rules of the German language) includes isolated pseudo-words (Wendt et al., 2003; Wendt & Scheich, 2002). Our corpus's main features are based on this German corpus. Other corpora of vocal emotions contain pseudo-sentences (e.g., Castro & Lima, 2010; Liu & Pell, 2012). However, experimental paradigms can require shorter stimuli, which would be difficult to extract manually from sentences and subsequently validate separately. In addition, Rigoulot et al. (2013) demonstrated in a gating paradigm study that the length of the stimuli matters for the time course of emotion recognition, and that full sentences are recognized much more easily than truncated ones. Other corpora use affect bursts (e.g., "ah") or emotional sounds such as screams or laughter (e.g., Belin et al., 2008; Parsons et al., 2014). Although such stimuli are highly effective at conveying specific emotions, they are not necessarily suitable for experimental paradigms requiring controlled stimuli with medium or normal emotional intensity.
The Hoosier Vocal Emotions Corpus includes 73 controlled audio pseudo-words, uttered twice apiece by two actresses in five different positive or negative emotions (i.e., happiness, sadness, fear, anger, and disgust) and in a neutral tone, yielding 1,763 stimuli (some of the stimuli were pronounced more than two times). We selected the emotions on the basis of the basic emotions identified by Ekman (1992), except for surprise, because this emotion can have any valence (it can be neutral, positive, or negative). In addition, surprise utterances can be difficult to simulate experimentally (Pell et al., 2009). Although concerns have been raised about the use of acted rather than natural stimuli (Bachorowski & Owren, 2008), there are also arguments suggesting that actors can produce realistic portrayals and valid instances of vocal expressions of emotion (Ververidis & Kotropoulos, 2006). One important argument is that much of our verbal communication is subject to sociocultural censure and involves making impressions on others (Bachorowski & Owren, 2008; Banse & Scherer, 1996). Therefore, having people utter an emotion as if they were experiencing it may not be significantly different from a real-life communicative situation.

Two female voices were preferred over having one male and one female voice, mainly for reasons of comparability and homogeneity in acoustic dimensions such as pitch range, and to facilitate their use in experimental paradigms requiring tight control of the acoustic parameters of stimuli, such as event-related potential (ERP) studies. In this article, we describe the structure of the Hoosier Vocal Emotions Corpus, as well as the validation of the pseudo-words in terms of the emotion they portray. We also discuss potential applications of this set of stimuli.

Method

Creation of the stimuli

The stimulus set is composed of pseudo-words based on real English words. These pseudo-words were created by selecting common English disyllabic words using the COBUILD frequency information (per million) from the CELEX English Wordforms database (Baayen, Piepenbrock, & Gulikers, 1995), and manipulating the order of segments within the word (see Wendt & Scheich, 2002, or Castro & Lima, 2010, for a similar procedure). For example, the pseudo-word "elby" was constructed from the noun belly. As a result, there is no clear phonetic relationship between the pseudo-words and their originals, but they are matched in terms of number of syllables and phonemes. Care was taken to ensure that the pseudo-words were phonotactically legal—that is, that the sequences of phonemes were permitted and easily pronounceable in English. Similarly, slight phonetic adjustments were made to comply with English pronunciation rules. For example, the pseudo-word "domner," based on modern, did not retain the flapped /d/ found in the North American English pronunciation of modern, since the flap is not found in word-initial position in English. Pseudo-words that were too clearly reminiscent of their original or of other real words were excluded. A final list of 73 pseudo-words was generated (see Table 2).
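To make the segment-reordering idea concrete, the following minimal Python sketch generates candidate pseudo-words by permuting a word's segments. It is an illustration only: the helper names and the phonotactic filter are hypothetical placeholders, and the corpus authors applied their selection and legality criteria by hand rather than with such a script.

import itertools

def candidate_pseudowords(segments):
    """Yield reorderings of a word's segments, excluding the original order.

    `segments` is a list of phonemes (or letters, for a rough orthographic
    sketch), e.g. ["b", "e", "l", "y"] for "belly"."""
    original = tuple(segments)
    seen = set()
    for perm in itertools.permutations(segments):
        if perm == original or perm in seen:
            continue
        seen.add(perm)
        yield "".join(perm)

def is_phonotactically_legal(form):
    """Hypothetical placeholder for a real phonotactic check (e.g., no illegal
    onsets, no word-initial flap); here it simply accepts every candidate."""
    return True

if __name__ == "__main__":
    # "elby" (derived from "belly" by reordering its segments) appears among the outputs.
    for pw in candidate_pseudowords(["b", "e", "l", "y"]):
        if is_phonotactically_legal(pw):
            print(pw)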
Stress always fell on the first syllable, but the vowel in the second syllable was not always fully reduced (indicated by the International Phonetic Alphabet [IPA] symbols in Table 2, where only schwa [ə] represents a reduced vowel). The transcriptions provided in Table 2 closely reflect the actual pronunciation of most of the stimuli by both actresses. Since each actress pronounced a given pseudo-word 12 times (2 × 6 emotions), there are essentially 24 pronunciations of the same pseudo-word, thus displaying some variation from one token to the next. The transcription here reflects the most common pronunciation of the stimuli, and there might be some variation across specific stimuli, especially in terms of the vowels. Table 2 is provided to give further guidance to researchers about the possible variations in pronunciation for the same pseudo-word, but we encourage researchers and clinicians who need exact control of sound properties to check each stimulus they plan to use.

Table 2 List of the 73 pseudo-words included in the corpus, in the Roman alphabet and in IPA transcription. Main stress (marked in the IPA transcriptions) always falls on the first syllable.

1  nervack  /ˈnɜɹvæk/        38  vigging   /ˈvɪgɪŋ/
2  lorack   /ˈloɹæk/         39  voker     /ˈvoʊkəɹ/
3  lairet   /ˈlɛɹət/         40  vokered   /ˈvoʊkəɹd/
4  vokered  /ˈvoʊkəɹd/       41  volers    /ˈvoʊləɹs/
5  tairack  /ˈtɛɹək/         42  winnith   /ˈwɪnɪθ/
6  domner   /ˈdɑmnəɹ/        43  ziddy     /ˈzɪdi/
7  nammy    /ˈnæmi/          44  zilard    /ˈzɪləɹd/
8  tannock  /ˈtænək/         45  vercoed   /ˈvɜɹkoʊd/
9  agerth   /ˈægəɹθ/         46  forny     /ˈfɔɹni/
10 armidge  /ˈɑɹmɪdʒ/        47  admage    /ˈædmɪdʒ/
11 burish   /ˈbʊɹɪʃ/         48  affning   /ˈɑfnɪŋ/
12 dernom   /ˈdɜɹnəm/        49  elby      /ˈɛlbi/
13 revo     /ˈɹɛvoʊ/         50  ervy      /ˈɜɹvi/
14 fingill  /ˈfɪŋgəl/        51  infess    /ˈɪnfɛs/
15 jouless  /ˈdʒoʊlɛs/       52  youssle   /ˈjusəl/
16 lebby    /ˈlɛbi/          53  kervo     /ˈkɜɹvoʊ/
17 lowmen   /ˈloʊmən/        54  kervoed   /ˈkɜɹvoʊd/
18 madage   /ˈmædədʒ/        55  larpy     /ˈlɑɹpi/
19 menno    /ˈmɛnoʊ/         56  leknodge  /ˈlɛknədʒ/
20 merrus   /ˈmɛɹəs/         57  modner    /ˈmɔdnəɹ/
21 mowan    /ˈmoʊwən/        58  mokers    /ˈmoʊkəɹs/
22 nabick   /ˈnæbɪk/         59  musser    /ˈmʌsəɹ/
23 nemmy    /ˈnɛmi/          60  naffing   /ˈnæfɪŋ/
24 nidder   /ˈnɪdəɹ/         61  nifish    /ˈnɪfɪʃ/
25 nillen   /ˈnɪlən/         62  nipher    /ˈnɪfəɹ/
26 nomel    /ˈnɔməl/         63  othening  /ˈɔθ(ə)nɪŋ/
27 nomey    /ˈnoʊmi/         64  rackies   /ˈɹækiːz/
28 ramidge  /ˈɹæmɪdʒ/        65  scopies   /ˈskoʊpiːz/
29 shavil   /ˈʃævɪl/         66  shifin    /ˈʃɪfɪn/
30 shibur   /ˈʃɪbəɹ/         67  vackner   /ˈvæknəɹ/
31 slover   /ˈsloʊvəɹ/       68  vashil    /ˈvæʃɪl/
32 terrel   /ˈtɛɹəl/         69  vishal    /ˈvɪʃəl/
33 thager   /ˈθægəɹ/         70  wedick    /ˈwɛdɪk/
34 thomer   /ˈθoʊməɹ/        71  winthy    /ˈwɪnθi/
35 valish   /ˈvælɪʃ/         72  youshing  /ˈjuːʃɪŋ/
36 venner   /ˈvɛnəɹ/         73  zuber     /ˈzubəɹ/
37 verney   /ˈvɜɹni/

Elicitation and recording procedures

Two actresses were recruited to record the 73 pseudo-words in a neutral tone as well as in five different modal emotions: happiness, sadness, fear, anger, and disgust. Female voices were recorded as the basis of another experiment (i.e., an electroencephalography [EEG] paradigm involving young children; Hoyniak et al., 2018). Both actresses were native speakers of Midwestern United States English (North Midland dialect region; Clopper & Pisoni, 2004) and had lived exclusively in that region prior to the recording. They reported no fluency in any language other than English and had not lived abroad. They were students in the Department of Theatre and Drama at a large Midwestern higher education institution (Indiana University, Bloomington, IN) and were 18 and 20 years old, respectively, at the time of the recordings. Both actresses were paid and gave consent to share the recordings in a publicly accessible database. Each actress (henceforth, A.G. and K.M.) was recorded individually in a single session of approximately 1.5 to 2 h.
The experimenter first briefly explained the general procedures to each actress, who was also given time to familiarize herself with the list of stimuli. Pronunciation of the pseudo-words was clarified as needed. The different emotions were discussed and explained. The stimuli were elicited using a short sentence preceding the pseudo-word: "It starts like /word/, I say /pseudo-word/, I say /pseudo-word/ again" (see Table 3). This was done to help maintain consistent pronunciation of the pseudo-words and to enhance fluent delivery and more natural-sounding speech. In addition, this form of elicitation was chosen to enable a similar delivery context for each pseudo-word across emotions and to ensure high comparability. Each pseudo-word was thus pronounced at least twice (two times per carrier sentence). For each actress, at least 146 stimuli were pronounced for each emotion, yielding a total of at least 876 stimuli per actress. However, some stimuli were pronounced more than two times, when an actress chose to reattempt the emotion portrayal for a given carrier sentence, resulting in a total of 876 pseudo-words for A.G. and 887 pseudo-words for K.M., for a grand total of 1,763 audio files.

Table 3 Example of the materials used to elicit the pseudo-words for each emotion
/ˈsloʊvəɹ/: It starts like "slow," I say slover, I say slover again
/ˈloɹæk/: It starts like "lord," I say lorack, I say lorack again

The stimuli are overall similar in terms of duration (M = 613 ms, median = 608 ms, SD = 132 ms) and intensity (M = 62.29 dB, median = 62.23 dB, SD = 3.849 dB).

The actresses were allowed to choose the order in which they preferred to utter each emotion. They were then seated in a recording booth, wearing Sennheiser HD515 Dynamic Stereo headphones, and before recording a set they were shown a short presentation of pictures and auditory examples of (non-English) pseudo-words spoken in the corresponding emotion (Wendt & Scheich, 2002). The pictures depicted situations in which examples of the specific emotion to be uttered were displayed. For example, various clip art pictures of angry individuals, arguing friends, and knit eyebrows were shown to illustrate anger and to clarify a general mood for each emotion. The experimenter demonstrated a few items in their carrier sentences (without modeling a particular emotion), to help with pronunciation of the stimuli (fluency) and overall rhythm. The actresses were also encouraged to imagine situations or scenarios according to the emotion to be expressed. They were given as much time as they needed to "get into the character" of the emotion before proceeding with the recordings. The experimenter also instructed the actresses not to exaggerate their expressions of the emotions, but to achieve a "normal" rather than a "strong" level of emotional intensity (see Livingstone & Russo, 2018).

The stimuli were recorded in a noise-isolated recording booth, at a sampling rate of 44,100 Hz with 16-bit resolution on a mono channel, using a Sennheiser e835 dynamic cardioid microphone and an Edirol UA25 USB stereo audio interface. The distance and orientation of the actresses with regard to the microphone were held as constant as possible. Each stimulus (pseudo-word) was then manually cut from its sentence context and saved separately in .wav format for presentation in the subsequent evaluation procedures.

We conducted a validation study with approximately 25 participants rating each sound file of the Hoosier Vocal Emotions Corpus, to estimate to what extent each recorded stimulus represents an acceptable rendition of the intended emotion. We included stimuli from both actresses in the corpus validation, that is, a total of 1,763 audio files.
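Researchers who want to verify or extend the duration and intensity measurements on the distributed .wav files (for instance, to match stimuli for an ERP design) can compute them directly from the audio. The following minimal sketch uses only the Python standard library and NumPy; the file name is hypothetical, and the dB value is relative to digital full scale rather than a calibrated level, so it will not reproduce the corpus's absolute dB figures.

import wave
import numpy as np

def duration_and_rms_db(path):
    """Return (duration in ms, RMS level in dBFS) for a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        n_frames = w.getnframes()
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(n_frames), dtype=np.int16)
    duration_ms = 1000.0 * n_frames / rate
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    db_fs = 20.0 * np.log10(rms / 32768.0)  # dB relative to full scale
    return duration_ms, db_fs

# Hypothetical file name; the actual names are listed in the corpus database.
print(duration_and_rms_db("AG_happiness_slover_01.wav"))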
Given the large number of audio files, the time required for a single listener to evaluate all of them would have been prohibitively long. We therefore divided the files into four stimulus lists, which were presented to listeners for evaluation. All emotions were equally balanced in each list. However, we decided against mixing the two voices in each list (see Castro & Lima, 2010, for a similar design). Each list contained stimuli from only one speaker (Lists 1 and 2: A.G.; Lists 3 and 4: K.M.). This was done in order to reduce comparison between voices and to enhance reliance on the actual acoustic properties of the stimuli. An additional consideration was the cognitive load of this task, which is demanding for participants. Each participant rated only one list. The dataset accompanying the corpus contains ratings for each audio file from about 25 persons (see below for the method details). All procedures were approved by the Indiana University Institutional Review Board.

Validation of the stimuli

Procedure

To validate the stimuli of the corpus, we opted for a forced choice identification task similar to the one used by van Bezooijen (1984) or Castro and Lima (2010). The stimuli were presented to listeners via headphones, using the Praat software (version 5.4.04; Boersma & Weenink, 2014) on computers running under Windows 7. Participants were tested individually and were seated at a computer station in a partitioned computer lab, wearing high-quality Sanako over-the-ear headphones at a self-chosen comfortable listening level. Their task was to listen to each sound file and identify what emotion they thought the speaker intended to convey. They were asked to choose one out of six possible emotions and indicate their choice by clicking on the correspondingly labeled button on the screen. The labels were "neutral," "happy," "sad," "fear," "angry," and "disgust." There was no "other/none of the above" option (Livingstone & Russo, 2018). Participants were also asked to indicate how confident they were in their choice by clicking on a number on a scale ranging from 1 (not sure) to 5 (very sure). The instructions were displayed on the screen as follows:

This is a judgment experiment about how actors convey emotions. You will hear an actress say non-words and your task is to choose what emotion you think it conveys. (Some non-words might be repeated a few times). Please don't spend too much time on each non-word. Try to do it using your intuition. In addition, we ask that you indicate how confident you are with your choice on a scale of 1 (not sure) to 5 (very sure). There are several breaks. If you have questions, please ask now.

The buttons appeared as rectangles on a single line in the middle of the screen, and their order was randomly varied across lists (but kept constant for any given participant) to avoid preference effects. The task was not timed, and listeners could replay the sound up to eight times by clicking on a repeat button (Fig. 1). The presentation order of the sound files was randomized for each participant, and the script implemented a break after every 50 stimuli. No stimulus file was repeated. The average duration of the identification task was about 45 min.

Fig. 1 Screenshots of the Praat script interface for the recognition task. The top panel shows the first screen in a trial, where the emotion labels are highlighted (clickable). The bottom panel shows the second screen in a trial, with the confidence scale now also highlighted. The respondent's choices appear highlighted in red, and a "next" button is displayed for participants to move to the next trial. The task was self-paced; up to eight replays were allowed.
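As an illustration of how such speaker-pure, emotion-balanced lists could be assembled from the corpus metadata, here is a minimal sketch; the column names follow the accompanying database (voice, emotion, file_name), but the splitting rule itself is a hypothetical reconstruction, not the authors' exact procedure.

from collections import defaultdict

def build_lists(records):
    """Split (file_name, voice, emotion) records into four lists: two per speaker,
    alternating within each speaker-emotion cell so that every list contains only
    one voice and all six emotions in similar numbers."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["voice"], rec["emotion"])].append(rec["file_name"])

    lists = {1: [], 2: [], 3: [], 4: []}
    for (voice, _emotion), files in sorted(cells.items()):
        base = 1 if voice == "AG" else 3  # Lists 1-2: A.G.; Lists 3-4: K.M.
        for i, fname in enumerate(sorted(files)):
            lists[base + (i % 2)].append(fname)
    return lists

# Hypothetical records; the real ones come from the corpus spreadsheet.
demo = [{"file_name": f"AG_anger_{i}.wav", "voice": "AG", "emotion": "anger"} for i in range(4)]
print(build_lists(demo))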
As explained above, the sound files were divided into four lists to keep the duration manageable for a single participant. Each of the four lists contained roughly the same number of stimuli: Lists 1 and 2 (A.G.) each contained 438 sound files, List 3 contained 443 sound files, and List 4 contained 444 sound files (K.M.). Participants were randomly assigned to one list upon arrival in the testing room. All participants also filled out a sociodemographic questionnaire (notably to assess their age, sex, and languages spoken) administered through the Qualtrics survey software.

Participants

In all, 102 participants were tested. The testing took place between February 2016 and December 2016. Six participants were excluded for various reasons (not native speakers of English or did not grow up in the United States, multiple neurocognitive issues reported, incomplete dataset, technical failure, or more than twice the average time needed to complete the task). In total, data from 96 participants (67% female), who were between 18 and 38 years old (M = 21.09, SD = 3.21), were included in the analysis (List 1, N = 24; List 2, N = 25; List 3, N = 24; List 4, N = 23). Most of the participants were college students, and they were predominantly Caucasian. Only one participant reported not knowing any language other than English. Twelve of the participants reported growing up bilingually using English and another language. About half of the participants (53.1%) reported knowledge of Spanish, 21.9% of French, and 6.3% of German, with 13 other languages mentioned by fewer than 4% of the participants (e.g., Japanese, 3.1%). A total of 34.4% of the participants reported knowing two languages besides English, and 12.5% reported knowing three languages besides English. Two of the participants reported knowing four or more languages besides English. Aside from the early bilinguals, three participants reported high proficiency in other languages learned after the first. None reported having any kind of uncorrected speech or hearing disorder. We recruited the participants using flyers posted in public areas (e.g., various departments at Indiana University) and word of mouth. Participants were compensated for their time.

Results

To ascertain the validity of the corpus, we used two dependent variables: emotion identification accuracy rates and confidence scores (how confident the participants were in their choices). Response times (RTs) were collected on each trial but were not analyzed as a dependent variable, given that the task was not speeded. Because there were six choice options on each trial, a random selection would yield an overall accuracy of 16.7%. The data were submitted to a chi-square analysis to estimate whether or not the participants were equally likely to choose among the six possibilities for a given stimulus. Table 4 provides the overall confusion matrix across both speakers and reveals that emotion portrayals were generally recognized accurately. Figure 2 shows the overall median accuracy in emotion identification by the 96 participants, separated by speaker; the random performance level (~16%) is indicated by the dotted line.

Fig. 2 Overall median accuracy of emotion identifications by the 96 participants, separated by speaker (A.G., K.M.); error bars show 95% confidence intervals, and the random performance level (~.16) is indicated by a dotted line.

Figure 2 suggests that participants were able to identify each stimulus' intended emotion above chance.¹ The mean recognition accuracy was 45%. Sadness was recognized most accurately (M = 59%), followed by neutral (M = 51%), fear (M = 50%), disgust (M = 43%), and anger (M = 38%).
The emotion that was recognized least accurately was happiness (M = 31%). All emotions were recognized better than chance for both stimulus sets, except for happiness for the K.M. stimuli, which was misidentified as neutral more often than it was identified as happiness (see Table 6 below).

¹ The pattern of accuracy remained the same even after removing very slow and very fast trials (RT outliers, defined as data points that were more than 2.5 SDs beyond all participants' mean RT, or faster than 100 ms; 3.24% of the data were removed). The slow RTs on some trials were likely the result of the option of listening to the stimuli multiple times and of the fact that the task was not speeded.

A global chi-square analysis on the chosen response categories over all data points (across emotions and speakers) was significant [χ²(25) = 29,429.29, p < .001, Cramer's V = .37]. This suggests that for each emotion, respondents did not randomly choose among the six options.

Before evaluating whether this pattern holds for each emotion separately, we first examined whether there was a difference in accuracy between speakers, as suggested by Fig. 2. A one-way analysis of variance (ANOVA) comparing accuracy for each speaker (K.M., A.G.) revealed that mean recognition accuracy was significantly higher for A.G. (M = 48%, 95% CI = 45–50) than for K.M. (M = 43%, 95% CI = 40–45), F(1, 574) = 8.26, p = .004. This significant effect of speaker indicates that raters were overall slightly more accurate at recognizing emotions portrayed by one speaker (A.G.) than by the other (K.M.). However, such differences are to be expected among voice actors, and this is unlikely to reflect an inherent difference between our listener groups. If one group of listeners had been systematically less attentive or accurate during the task, we would expect this difference to hold across the emotions for a given speaker.
Table 4 Classification counts of vocal emotion portrayals by the participants' responses, and the proportions of responses (%) within each portrayed emotion, across both speakers (n = 42,306 data points). The modal response for each portrayed emotion (boldface in the published table) lies on the diagonal.

Portrayed    Response:  Anger   Disgust  Fear    Happiness  Neutral  Sadness   Total
Anger        count      2,662   1,026    563     876        1,413    515       7,055
             %          37.7    14.5     8.0     12.4       20.0     7.3       100.0
Disgust      count      1,209   3,021    247     466        1,244    821       7,008
             %          17.3    43.1     3.5     6.6        17.8     11.7      100.0
Fear         count      479     168      3,563   762        977      1,129     7,078
             %          6.8     2.4      50.3    10.8       13.8     16.0      100.0
Happiness    count      852     591      509     2,159      2,003    894       7,008
             %          12.2    8.4      7.3     30.8       28.6     12.8      100.0
Neutral      count      862     719      465     409        3,664    1,030     7,149
             %          12.1    10.1     6.5     5.7        51.3     14.4      100.0
Sadness      count      98      188      927     217        1,428    4,150     7,008
             %          1.4     2.7      13.2    3.1        20.4     59.2      100.0

To verify this, a mixed-effects model with speaker and emotion as fixed factors (and participants as a random factor) was conducted in SPSS 25. Multiple comparisons were adjusted with the Sidak correction. The Type III tests of fixed effects showed a main effect of speaker [F(1, 94) = 8.6, p = .004], a main effect of emotion [F(5, 470) = 37.7, p < .001], and, crucially, a significant interaction between the two factors [F(5, 470) = 33.4, p < .001]. The interaction and pairwise comparisons revealed that for all emotions except disgust and neutral, A.G.'s portrayals were recognized significantly more accurately than K.M.'s; conversely, K.M.'s portrayals of disgust and neutral were recognized significantly more accurately than A.G.'s. The presence of an interaction suggests that it is unlikely that the K.M. listeners were systematically less accurate than the A.G. listeners (otherwise, one would have expected an absence of interaction).

Tables 5 and 6 provide the confusion matrices obtained for our stimulus set (emotion portrayals by participants' choices; n = 42,306 data points), separated by speaker. Given the significant effect of speaker and the speaker-by-emotion interaction, we further conducted a series of chi-square analyses (nonparametric goodness-of-fit tests) in SPSS 25 for each speaker and emotion separately, which confirmed the global analysis. The results of the tests for each emotion and each speaker are provided in Tables 5 and 6. They show that the tests were significant for all speakers and all emotions, indicating that listeners were not responding randomly.

Table 5 Confusion matrix for the A.G. stimuli, with a chi-square goodness-of-fit test per emotion (responses: A = anger, D = disgust, F = fear, H = happiness, N = neutral, S = sadness)

Emotion      A       D       F       H       N       S       Total    Goodness-of-fit test
Anger        1,564   559     268     267     624     295     3,577    χ²(5) = 2,089, p < .001
Disgust      496     1,168   227     266     724     696     3,577    χ²(5) = 1,021, p < .001
Fear         362     62      2,092   599     279     183     3,577    χ²(5) = 4,779, p < .001
Happiness    217     195     329     1,313   896     627     3,577    χ²(5) = 1,645, p < .001
Neutral      539     294     398     262     1,550   534     3,577    χ²(5) = 1,944, p < .001
Sadness      42      99      238     55      550     2,593   3,577    χ²(5) = 8,328, p < .001

Table 6 Confusion matrix for the K.M. stimuli, with a chi-square goodness-of-fit test per emotion (responses: A = anger, D = disgust, F = fear, H = happiness, N = neutral, S = sadness)

Emotion      A       D       F       H       N       S       Total    Goodness-of-fit test
Anger        1,098   467     295     609     789     220     3,478    χ²(5) = 925, p < .001
Disgust      713     1,853   20      200     520     125     3,431    χ²(5) = 4,033, p < .001
Fear         117     106     1,471   163     698     946     3,501    χ²(5) = 2,664, p < .001
Happiness    635     396     180     846     1,107   267     3,431    χ²(5) = 1,124, p < .001
Neutral      323     425     67      147     2,114   496     3,572    χ²(5) = 4,870, p < .001
Sadness      56      89      689     162     878     1,557   3,431    χ²(5) = 3,052, p < .001

Note. For the intended happy stimuli (the value underlined in the published table), happiness was not the modal response; neutral was the most frequently chosen response.
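The per-emotion goodness-of-fit statistics in Tables 5 and 6 test the observed response counts against a uniform distribution over the six options. As a sanity check, the following sketch reproduces the Anger row of Table 5 with SciPy (scipy.stats.chisquare defaults to equal expected frequencies), yielding a statistic of roughly 2,089, as reported.

from scipy.stats import chisquare

# Response counts for A.G.'s anger portrayals (Table 5):
# anger, disgust, fear, happiness, neutral, sadness
observed = [1564, 559, 268, 267, 624, 295]

# With no expected frequencies given, chisquare assumes a uniform
# distribution over the six response options (df = 5).
stat, p = chisquare(observed)
print(f"chi2(5) = {stat:.0f}, p = {p:.3g}")  # ~2089, p < .001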
Examination of the patterns of misidentifications in Tables 5 and 6 revealed the following tendencies. For A.G., all emotions except fear were most often misidentified as neutral, which represents the second-highest proportion of choices in these cases. In the case of fear, items were misidentified most often as happiness. However, even though, for instance, happiness was misinterpreted as neutral in 25% of the cases for A.G., the reverse was not true: Neutral items were misinterpreted as happiness in only 7% of cases, and were more commonly misinterpreted as anger or sadness, each in roughly 15% of cases (see Table 5). The error patterns for the K.M. stimuli stand out, in that happiness stimuli were most often recognized as neutral, which is the dominant, modal response: happiness choices were given in 25% of cases, and neutral choices in 32%. For the other emotions, unlike for the A.G. stimuli, neutral was the second choice after the correct identification only for anger and sadness. Disgust was misidentified as anger in 21% of cases, more often than as neutral, and fear was confused with sadness in 27% of cases (see Table 6).

This overall high proportion of neutral choices is possibly due to the fact that the stimuli were created at a medium/normal intensity level, without emotional exaggeration, rendering the identification task potentially more difficult. To help researchers evaluate how ambiguous a given recording is, we also provide the full confusion matrix for each stimulus in the database (see Bänziger et al., 2012, supplemental materials, for a similar approach). Some items were identified at very high accuracy rates by all participants who rated them; conversely, others were almost never identified correctly. Figure 3 shows the accuracy variance obtained for each stimulus (each sound file in the corpus is represented by one dot). The boxplots in the top and bottom panels of the figure show the distribution and median accuracy for each emotion (top, A.G.; bottom, K.M.). The figure reveals that a proportion of items (particularly for happiness) fell below the random performance level (i.e., 16.7%), suggesting that these particular stimuli are ambiguous and not ideal representations of the intended emotion, at least for the participants who rated the stimuli.

Fig. 3 Box plots with overlaid dot plots for each emotion's identification accuracy, for the A.G. stimuli (top panel) and the K.M. stimuli (bottom panel). Each dot represents one stimulus (i.e., one sound file in the corpus). Horizontal lines represent the medians, boxes show the interquartile range (IQR) covering 50% of the cases, and whisker bars extend to 1.5 times the IQR; the random performance level is also indicated.

We also obtained confidence ratings for each stimulus rated (i.e., how confident the participants were in their choices). Figure 4 shows the correlations (Pearson's r) between identification accuracy and the confidence ratings of the participants for each emotion and each speaker separately (the r values are reported at the top of each panel of Figure 4). Only one relationship (neutral for A.G.) was not significant.

Fig. 4 Correlations between the mean identification accuracy for each stimulus and the mean confidence ratings, by emotion and speaker (top panels, A.G.; bottom panels, K.M.). A.G.: anger r = .539, disgust r = .306, fear r = .587, happiness r = .651, neutral r = .051 (n.s.), sadness r = .754 (all other ps < .001). K.M.: anger r = .337, disgust r = .595, fear r = .606, happiness r = .497, neutral r = .269 (p < .01), sadness r = .355 (all other ps < .001).
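These accuracy–confidence correlations can be recomputed directly from the accompanying database, which contains one row per sound file with accuracy_mean and confidence_mean columns (see the Appendix). The sketch below is an illustration only and assumes a hypothetical CSV export of the spreadsheet named hvec_database.csv.

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical export of the corpus spreadsheet; column names follow the Appendix.
df = pd.read_csv("hvec_database.csv")

for (voice, emotion), group in df.groupby(["voice", "emotion"]):
    r, p = pearsonr(group["accuracy_mean"], group["confidence_mean"])
    print(f"{voice:>2} {emotion:<9} r = {r:+.3f}  p = {p:.3g}")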
Discussion

The goal of this project was to create a corpus of auditory pseudo-words uttered in different emotions. The corpus includes 73 controlled audio pseudo-words uttered by two actresses in five different emotions (i.e., happiness, sadness, fear, anger, and disgust) and in a neutral tone, yielding at least 876 stimuli per actress. In addition, the pseudo-words are based on the pronunciation rules of North American English, and they are not caricatures or exaggerations of the emotions portrayed. Each recording has been validated by native English listeners in terms of recognition accuracy of the intended emotion portrayal.

Overall, the emotions were recognized at accuracy levels that were clearly higher than chance (M = 45% across emotions and speakers, against a chance level of about 16%). The recognition proportions obtained for our data were most accurate for fear, neutral, and sadness, and least accurate for happiness and disgust, consistent with previous data from other languages (Banse & Scherer, 1996; Castro & Lima, 2010; Liu & Pell, 2012; Pell et al., 2009; Scherer et al., 1991; van Bezooijen, 1984). The one exception is anger, which was recognized with surprisingly low accuracy in our stimuli (38%), even though it is often among the best-recognized emotions (e.g., Bänziger et al., 2012; Scherer et al., 2011; Wendt & Scheich, 2002). This effect was possibly due to the fact that our pseudo-words were produced at a medium/normal emotional intensity level, possibly making them more confusable with neutral stimuli. Indeed, for both speakers (and particularly for K.M.), anger was most often confused with neutrality. The resulting accuracy in our dataset was globally similar to the levels reported in previous studies on vocal emotion (hovering in the 40%–60% range; see Scherer et al., 2011), in particular among the studies that used similar stimuli (words or short sentences, such as Rigoulot et al., 2013) and a similar number of response options.

The audio stimuli were created as high-quality recordings in the .wav format, which allows experimenters to run more detailed acoustic analyses in order to match stimuli for specific experimental purposes. For instance, intensity (as loudness, in decibels) and duration measurements (in milliseconds) are provided in the corpus database, but other acoustic parameters can be extracted, such that matched stimuli could be selected to meet the needs of, for example, an EEG study. The corpus is available at https://psycholinguistics.indiana.edu/hoosiervocalemotions.htm. The website provides basic information about the corpus and how to request access to the sound files and the database. For each item, recognition accuracy and confusion patterns, as well as speaker, filename, and a number of acoustic details, are provided in an accompanying database, in order to allow researchers to select items specifically for their needs. The list of attributes provided for each sound file in the corpus is detailed in the Appendix.

A number of methodological issues need to be considered. First, the validation of the stimuli was based on data collected in a laboratory setting using a forced choice methodology with six response alternatives. Even though this methodology is commonly used across studies, its ecological validity for real-time interactions in social situations remains limited. It is unclear to what extent these results would generalize to real-life situations outside the laboratory, or to experimental paradigms in which a given stimulus was presented only once without any available "categorization labels," because forced choice procedures produce better performance than free-choice tests (see Bachorowski & Owren, 2008).
Second, similar considerations apply to the specific linguistic context in which an emotion is heard and the type of linguistic materials used. Hearing a short (two-syllable) pseudo-word in order to identify an emotion is likely much more difficult than identifying it via a longer, meaningful sentence (see Rigoulot et al., 2013), and is likely to lead to lower recognition accuracy overall. Similarly, medium/normal emotional intensity (as opposed to high intensity, such as in affect bursts) is likely to make emotion recognition less straightforward. Taken together, the identification accuracy we obtained in our study was the product of the forced choice methodology, as well as of the medium/normal intensity of the stimuli, the fact that they are pseudo-words presented in isolation, and the context-free format of their presentation in the recognition task.

Third, the corpus includes a limited set of emotions (i.e., happiness, sadness, fear, anger, and disgust) and a neutral tone. Other emotions could have been included (e.g., surprise and contempt). We selected the emotions to be included in the corpus on the basis of the basic emotions identified by Ekman (1992) and of whether they have a clear positive or negative valence. Therefore, surprise was not included, because it can have any valence (it can be neutral, positive, or negative), and also because this emotion can be difficult to simulate in the laboratory (Pell et al., 2009). Researchers and clinicians should consider this limitation when selecting this corpus for their work, as well as the fact that only one positive emotion (i.e., happiness) is included, which would impede systematic analyses of valence effects and the examination of different positive emotions.

Fourth, researchers should also consider that the corpus contains pseudo-words uttered by two female speakers only (i.e., it does not include male voices).
Finally, because the validation of the stimuli was based on a between-subjects design (i.e., each participant rated the pseudo-words from one actress only), it is difficult to establish differences in the validation between the two speakers. Although this design, adopted for logistical reasons, could be seen as a limitation, the information we provide in the corpus should enable researchers and clinicians to make informed decisions as to which stimuli to select for their work.

Potential applications of the Hoosier Vocal Emotions Corpus

The Hoosier Vocal Emotions Corpus was specifically developed for the requirements of EEG research on emotion processing. The stimuli from this corpus were first used in a study on the neural responses (using EEG techniques) to vocal emotion processing and their associations with temperamental traits and behavioral problems in young children (Hoyniak et al., 2018). The corpus has unique characteristics that are useful for experimental paradigms requiring controlled stimuli (e.g., EEG or fMRI studies); namely, the stimuli are disyllabic pseudo-words (i.e., short stimuli without a semantic meaning) that are overall similar in terms of duration and loudness, and that represent medium/normal emotional intensity. To the best of our knowledge, the Magdeburger Prosodie-Korpus (Wendt et al., 2003; Wendt & Scheich, 2002) is the only other corpus that includes isolated disyllabic pseudo-words. However, that corpus is composed of stimuli that respect the phonotactic and phonetic rules of the German language. Although there are data suggesting that emotions can be recognized across languages and cultures, there is still an in-group advantage in the processing of emotional vocalizations (Sauter et al., 2010). We therefore developed new emotional vocalizations based on the phonology and pronunciation rules of North American English, for research and clinical work requiring English-based stimuli. The use of the corpus need not be limited to English speakers, however. For instance, studies of emotion or prosodic processing in monolingual or multilingual individuals, or in nonnative English speakers, could easily be conducted using stimuli from this corpus (e.g., Dewaele, 2004; Min & Schirmer, 2011; Paulmann & Uskul, 2014).

Stimuli from the corpus could also be used to investigate emotion processing in individuals with certain temperamental or behavioral characteristics associated with difficulties in emotion recognition (e.g., individuals with psychopathic traits or alexithymia). In addition, the stimuli could be used to study the extent to which patients with aphasia, schizophrenia, or other mental disorders (e.g., depression) are able to process prosodic/vocal emotion information.
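Because every sound file in the accompanying database carries its recognition accuracy, confusion percentages, duration, and intensity (see the Appendix), selecting stimuli for a particular design can be scripted. The following sketch assumes a hypothetical CSV export named hvec_database.csv and hypothetical level codes for voice and emotion; it keeps only well-recognized, duration-matched items for two emotions from one speaker.

import pandas as pd

# Hypothetical CSV export of the corpus spreadsheet; column names follow the Appendix.
df = pd.read_csv("hvec_database.csv")

# Example selection: clearly recognized happy and sad items from one speaker,
# restricted to a narrow duration band so the two conditions stay comparable.
selection = df[
    (df["voice"] == "AG")                                  # hypothetical speaker code
    & (df["emotion"].isin(["happiness", "sadness"]))       # hypothetical emotion labels
    & (df["accuracy_mean"] >= 0.70)                        # well above the 16.7% chance level
    & (df["duration_ms"].between(500, 700))                # rough duration matching
]

print(selection[["file_name", "emotion", "accuracy_mean", "duration_ms"]])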
The Hoosier Vocal Emotions Corpus's short, disyllabic pseudo-words, which are acoustically more homogeneous than longer sentences, can also be useful to researchers performing acoustic analyses. Investigations that seek to characterize the prosodic and acoustic features of different emotions would benefit from this kind of tightly controlled, non-exaggerated material, since it can help isolate specific acoustic parameters for emotion recognition more precisely. Also, the fact that our stimuli were produced with normal emotional intensity (as opposed to high intensity, such as in affect bursts) contributes to creating more ambiguity in the corpus and makes emotion recognition not only less straightforward, but possibly also more ecologically valid. Ambiguous or subtle acoustic characteristics can be studied with a corpus like ours, which preserves this variability, and because we provide the full confusion matrix for each stimulus, researchers seeking to determine the acoustic parameters of various emotions will have a large range of clear, ambiguous, and misclassified stimuli to choose from. This variability and the range of stimulus uncertainty could also be very useful for the field of automatic emotion recognition. Training paradigms could first use the nonambiguous stimuli (see Brendel, Zaccarelli, Schuller, & Devillers, 2010) and progressively incorporate more subtle stimuli, ultimately leading to robust recognition scores.

Finally, the neutral-tone stimuli can be used on their own for research applications other than emotional processing. For instance, they could be used for pseudo-word or voice recognition tasks in investigations of individual differences in auditory, phonetic, or phonological processing or learning.

Conclusion

In this article, we have presented the Hoosier Vocal Emotions Corpus, a set of controlled disyllabic pseudo-words spoken in five basic emotions and in a neutral tone. This corpus is one of the few databases of pseudo-word vocal expressions for North American English. The corpus consists of 1,763 high-quality audio recordings by two female speakers at a medium/normal emotional intensity level. The validation of the corpus with a forced choice recognition paradigm revealed high rates of emotional validity. The recognition accuracy for each item and the full confusion matrix are provided in an accompanying database, which will allow researchers to explore the full range of stimulus uncertainty. Despite some of the limitations discussed above, this corpus presents a valuable resource for a wide variety of researchers and clinicians.

Acknowledgments We gratefully acknowledge the Department of Criminal Justice and the Office of Women's Affairs (Women in Science Program) at Indiana University for their financial support (grant to N.M.G.F.), as well as the participants involved in this study. We also thank Gabriela Cepeda, Franziska Krüger, Pyoung-Hwa Han, Trisha Thomas, Chung-Lin Yang, and Joshua Lee for their assistance with the pseudo-word creation, acoustic analyses, participant testing, and data analysis. We are indebted to the actresses who took part in the recording sessions. We are further grateful to Beate Wendt for sharing some of her German stimuli, which were extremely useful in constructing our corpus. N.M.G.F. is a Research Scholar, Junior 1, Fonds de recherche du Québec–Santé.

Compliance with ethical standards
Conflict of interest: None.
Open practices statement: The validation study was not preregistered.
The data and materials for all experiments are available at https://psycholinguistics.indiana.edu/hoosiervocalemotions.htm.

Appendix

The following outlines the structure of the corpus database (see https://psycholinguistics.indiana.edu/hoosiervocalemotions.htm). The spreadsheet (Excel file or comma-delimited csv) uses two header rows: the first gives the column name listed below, and the second, where applicable, the emotion sub-columns (A, D, F, H, N, S). Each entry below is a column header, followed by a brief outline of the column content.

ipa: International Phonetic Alphabet transcription
spelling: Item in the English Roman alphabet
item: Item number
token: Token number
file_name: Audio file name with extension
duration_ms: File duration in milliseconds
intensity_average_dB: Average intensity in dB
intensity_min: Minimum intensity
intensity_max: Maximum intensity
voice: Speaker
list: List number
n_listeners: Number of listeners who rated this list
emotion: Emotion
accuracy_mean: Mean accuracy over all trials
confidence_mean: Mean confidence score over all trials
confusion_matrix_cnt (sub-columns A, D, F, H, N, S): Confusion matrix; raw count of trials in which each emotion was chosen, over all trials
confusion_matrix_prct (sub-columns A, D, F, H, N, S): Confusion matrix; % of trials in which each emotion was chosen, over all trials
accuracy_mean_validrt: Mean accuracy over selected trials only (RT outliers removed)
confidence_mean_validrt: Mean confidence score over selected trials only (RT outliers removed)
confusion_matrix_cnt_validrt (sub-columns A, D, F, H, N, S): Confusion matrix; raw count of trials in which each emotion was chosen, over selected trials only
confusion_matrix_prct_validrt (sub-columns A, D, F, H, N, S): Confusion matrix; % of trials in which each emotion was chosen, over selected trials only
rt_mean: Mean RT over all trials
rt_median: Median RT over all trials
rt_mean_validrt: Mean RT over selected trials (RT outliers removed)
rt_median_validrt: Median RT over selected trials (RT outliers removed)

A, anger; D, disgust; F, fear; H, happiness; N, neutral; S, sadness.

References

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX Lexical Database (Release 2, CD-ROM). Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania.
Bachorowski, J.-A., & Owren, M. J. (2008). Vocal expressions of emotion. In M. Lewis, J. M. Haviland-Jones, & L. Feldman Barrett (Eds.), Handbook of emotions (3rd ed., pp. 196–210). New York, NY: Guilford Press.
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614–636.
Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception. Emotion, 12, 1161–1179.
Belin, P., Fillion-Bilodeau, S., & Gosselin, F. (2008). The Montreal Affective Voices: A validated set of nonverbal affect bursts for research on auditory affective processing. Behavior Research Methods, 40, 531–539. https://doi.org/10.3758/BRM.40.2.531
Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.3.35) [Computer program]. Retrieved from www.praat.org
Brendel, M., Zaccarelli, R., Schuller, B., & Devillers, L. (2010). Towards measuring similarity between emotional corpora.
In Proceedings of the International Conference on Language Resources and Evaluation (pp. 58–64). Luxembourg City: European Language Resources Association.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of Interspeech (pp. 1517–1520). Lisbon, Portugal.
Castro, S. L., & Lima, C. F. (2010). Recognizing emotions in spoken language: A validated set of Portuguese sentences and pseudosentences for research on emotional prosody. Behavior Research Methods, 42, 74–81.
Clopper, C. G., & Pisoni, D. B. (2004). Some acoustic cues for the perceptual categorization of American English regional dialects. Journal of Phonetics, 32, 111–140.
Costantini, G., Iadarola, I., Paoloni, A., & Todisco, M. (2014). EMOVO corpus: An Italian emotional speech database. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, . . . S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (pp. 3501–3504). Luxembourg City: European Language Resources Association.
Dewaele, J.-M. (2004). The emotional force of swearwords and taboo words in the speech of multilinguals. Journal of Multilingual and Multicultural Development, 25, 204–222.
Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6, 169–200. https://doi.org/10.1080/02699939208411068
Herba, C., & Phillips, M. (2004). Annotation: Development of facial expression recognition from childhood to adolescence: Behavioral and neurological perspectives. Journal of Child Psychology and Psychiatry, 45, 1185–1198.
Hoyniak, C. P., Bates, J. E., Petersen, I. T., Yang, C.-L., Darcy, I., & Fontaine, N. M. G. (2018). Reduced neural responses to vocal fear: A potential biomarker for callous-uncaring traits in early childhood. Developmental Science, 21, e12608.
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814. https://doi.org/10.1037/0033-2909.129.5.770
Laukka, P., Elfenbein, H. A., Chui, W., Thingujam, N. S., Iraki, F. K., Rockstuhl, T., & Althoff, J. (2010). Presenting the VENEC corpus: Development of a cross-cultural corpus of vocal emotion expressions and a novel method of annotating emotion appraisals. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 53–57). Luxembourg City: European Language Resources Association.
Lima, C. F., Castro, S. L., & Scott, S. K. (2013). When voices get emotional: A corpus of nonverbal vocalizations for research on emotion processing. Behavior Research Methods, 45, 1234–1245. https://doi.org/10.3758/s13428-013-0324-3
Liu, P., & Pell, M. D. (2012). Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44, 1042–1051. https://doi.org/10.3758/s13428-012-0203-3
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13, e0196391. https://doi.org/10.1371/journal.pone.0196391
Min, C. S., & Schirmer, A. (2011). Perceiving verbal and vocal emotions in a second language. Cognition and Emotion, 25, 1376–1392.
Parsons, C. E., Young, K. S., Craske, M. G., Stein, A. L., & Kringelbach, M. L. (2014).
Introducing the Oxford Vocal (OxVoc) Sounds database: A validated set of non-acted affective sounds from human infants, adults, and domestic animals. Frontiers in Psychology, 5, 562. https://doi.org/10.3389/fpsyg.2014.00562
Paulmann, S., & Uskul, A. K. (2014). Cross-cultural emotional prosody recognition: Evidence from Chinese and British listeners. Cognition and Emotion, 28, 230–244.
Pell, M. D., Paulmann, S., Dara, C., Alasseri, A., & Kotz, S. A. (2009). Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics, 37, 417–435.
Pollak, S. D., & Sinha, P. (2002). Effects of early experience on children’s recognition of facial displays of emotion. Developmental Psychology, 38, 784–791.
Rigoulot, S., Wassiliwizky, E., & Pell, M. D. (2013). Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition. Frontiers in Psychology, 4, 367. https://doi.org/10.3389/fpsyg.2013.00367
Sauter, D. A., Eisner, F., Ekman, P., & Scott, S. K. (2010). Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sciences, 107, 2408–2412.
Scherer, K. R., Banse, R., & Wallbott, H. G. (2001). Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology, 32, 76–92.
Scherer, K. R., Banse, R., Wallbott, H. G., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15, 123–148.
Scherer, K. R., Clarke-Polner, E., & Mortillaro, M. (2011). In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. International Journal of Psychology, 46, 401–435.
Tottenham, N., Tanaka, J., Leon, A. C., McCarry, T., Nurse, M., Hare, T. A., . . . Nelson, C. A. (2009). The NimStim set of facial expressions: Judgments from untrained research participants. Psychiatry Research, 168, 242–249.
van Bezooijen, R. A. M. G. (1984). Characteristics and recognizability of vocal expressions of emotion. Dordrecht, The Netherlands: Foris. Retrieved on January 30, 2018, from https://repository.ubn.ru.nl/handle/2066/114117
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48, 1162–1181.
Wendt, B., Hufnagel, K., Brechmann, A., Gaschler-Markefski, B., Tiedge, J., Ackermann, H., & Scheich, H. (2003). A method for creation and validation of a natural spoken language corpus used for prosodic and speech perception. Brain and Language, 87, 187. https://doi.org/10.1016/S0093-934X(03)00263-3
Wendt, B., & Scheich, H. (2002). The “Magdeburger Prosodie-Korpus.” In Proceedings of the Speech Prosody 2002 Conference (pp. 699–701).
Aix-en-Provence, France: ISCA.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
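As promised in the Appendix, the following is a minimal sketch (not part of the corpus release) of how the comma-delimited spreadsheet could be loaded and screened for well-recognized stimuli. Only the column names (emotion, file_name, duration_ms, accuracy_mean_validrt, confidence_mean_validrt) come from the Appendix; the local filename hvec.csv, the use of pandas, the single header row, the lowercase emotion label "anger", the proportion coding of accuracy, and the 80% cutoff are all illustrative assumptions that should be adjusted to the actual file.

```python
# Minimal sketch: select well-recognized HVEC stimuli from the corpus spreadsheet.
# Assumptions (not specified by the corpus documentation): the csv is saved locally
# as "hvec.csv" with a single header row, emotions are coded by full lowercase names,
# and accuracy is a proportion between 0 and 1 (use 80 instead of 0.80 if in percent).
import pandas as pd

hvec = pd.read_csv("hvec.csv")

# Keep anger tokens recognized on at least 80% of the outlier-trimmed trials.
anger = hvec[(hvec["emotion"] == "anger") & (hvec["accuracy_mean_validrt"] >= 0.80)]

# Rank the retained tokens by listeners' mean confidence on those trials.
anger = anger.sort_values("confidence_mean_validrt", ascending=False)

# Show the best candidates with their durations, e.g., for matching stimulus lengths.
print(anger[["file_name", "duration_ms",
             "accuracy_mean_validrt", "confidence_mean_validrt"]].head())
```

The same pattern extends to the other documented columns; for instance, the confusion-matrix columns could be used to exclude tokens that were systematically confused with a competing emotion, and duration_ms could be used to match stimulus lengths across conditions in EEG or fMRI designs.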