Title: From Assistants to Friends: Investigating Emotional Intelligence of IPAs in Hindi and English
Authors: Subramanian, Mallika; Sehgal, Shradha; Rangaswamy, Nimmi
Date: 2021-12-07

Intelligent Personal Assistants (IPAs) like Amazon Alexa, Apple Siri, and Google Assistant are increasingly becoming a part of our everyday lives. As IPAs become ubiquitous and their applications expand, users turn to them not just for routine tasks but also for intelligent conversations. In this study, we measure the emotional intelligence (EI) displayed by IPAs in the English and Hindi languages; to our knowledge, this is a pioneering effort in probing the emotional intelligence of IPAs in Indian languages. We pose utterances that convey the Sadness or Humor emotion and evaluate IPA responses. We build on previous research to propose a quantitative and qualitative evaluation scheme encompassing new criteria from social science perspectives (display of empathy, wit, understanding) and IPA-specific features (voice modulation, search redirects). We find that the EI displayed by Google Assistant in Hindi is comparable to the EI it displays in English, with the assistant employing both voice modulation and emojis in text. However, we also find that IPAs are unable to understand and respond intelligently to all queries, sometimes even offering counter-productive and problematic responses. Our experiment offers evidence and directions to augment the potential for EI in IPAs.

An Intelligent Personal Assistant (IPA) is an artificially intelligent system that can perform tasks or answer queries based on commands and questions [7]. As IPAs gain popularity, users turn to them not only for regular tasks, but also to express emotion. Therefore, many IPAs are marketed as a user's companion with a human-like personality, rather than just a task-doer.
Whilst much research has gone into studying the abilities of an IPA to understand and perform tasks [16], the emotional intelligence displayed by IPAs remains a relatively under-explored research area. The study of IPAs becomes important today as their usage has seen a sharp increase during the COVID-19 pandemic 1. Apart from being a hands-free technology, users are increasingly using IPAs as a "frustration outlet" 2. In this context, researching the emotional intelligence displayed by IPAs acquires renewed relevance.

Emotional Intelligence for humans is defined as the ability to sense, understand, value and effectively apply the power of emotions as a source of human energy, information, trust, creativity and influence [6]. As machines cannot sense and feel emotion, we measure whether an IPA can 'apply the power of emotion' in its responses [19], i.e. whether the response entails a recognition of emotion and displays empathy, as a human's response would. For humans, EI is tested in three major forms: self-reported, reported by others, and ability tests. As IPAs cannot self-report, we conduct ability tests based on the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) [3]. To evaluate EI in an IPA, we check for 'portrayal' of emotion in the responses, rather than feeling of emotion.

In this work, we explore the 'Sadness' and 'Humor' emotions in depth, in order to test two contrasting aspects of EI. In the 'Sadness' category, we include prompts where the user has a sad tone. We test for empathy and understanding displayed by an IPA and check if it offers ways to uplift one's mood. On the other hand, the 'Humor' category affords a more 'light-hearted' context in which to test whether an IPA displays wit and can partake in a cheerful conversation. We propose a nuanced evaluation scheme with quantitative features tailored to each category, 'Sadness' and 'Humor', by extending previous approaches (Table 1).
To the best of our knowledge, ours is the first work to explore the emotional intelligence displayed by IPAs in Indian languages. Many technology companies have announced their commitment to language diversity in IPAs 3, with Apple announcing support for 9 Indian languages on Siri during WWDC 2021 4. As support for Hindi in IPAs is relatively new, it is critical to study EI performance in the Hindi language and how it compares with performance in English. We study EI along the 'Sadness' and 'Humor' categories by posing a myriad of queries to different IPAs, in Hindi and English. We annotate 402 such question-response pairs along a new quantitative and qualitative metric. We make our dataset public for research. Our main contributions are: (1) A comparison of the EI relating to the 'Sadness' and 'Humor' emotions displayed by IPAs (Siri, Google Assistant, and Alexa), along with key visualizations to understand and interpret the results. (2) The first evaluation of emotional intelligence in the Hindi language. (3) A rigorous qualitative and quantitative evaluation scheme to test EI ability.

Studying EI in IPAs is an emerging research direction in Human-Computer Interaction. Previous studies have explored the applications of EI in IPAs; the role of IPAs in combating isolation and strengthening bonds among the elderly is one of them [14]. Users of IPAs often attribute personality dimensions and anthropomorphic features to virtual assistants [1, 10]. A user's perceptions about the humanness of an IPA are determined by the extent to which the agent is capable of showing meaningful emotions in its responses [5, 15]. The introduction of human-like qualities like voice modulation induces a sense of anthropomorphism of machines and agents, which makes human users ascribe EI qualities, like trust, empathy, and support, to IPAs [2, 17]. We harness the above to build our evaluation metric.
Exploring the evaluation of EI in IPAs, a study of humorous interactions in English [9] recruited participants to conduct week-long interactions with an IPA (one of Apple Siri, Amazon Alexa, or Google Assistant). The users then rated the humor level of IPA responses on a 1-5 Likert scale, where over 50% of the agent's utterances were rated as funny. In addition to the notion and connotation of an IPA's response, underlying latent semantic structures (such as ambiguity, interpersonal effect, and phonetic style) also played an important role in characterizing its EI in humorous scenarios [18]. In order to analyze EI for mental and physical health related queries, a study [13] characterized the responses of IPAs based on various metrics: 1. "recognition", whether the IPA was able to identify the user query; 2. "respect", based on clinical experience with respectful language; and 3. "reference", whether or not the IPA refers to a helpline or contact number for the emergency situation. Research has also suggested improvements to the EI displayed by IPAs; for example, [12] proposes improving an IPA's perceived EI through personality-driven expression of emotions while completing user tasks.

Our paper takes inspiration from Yang et al.'s research on measuring EI in virtual assistants [20]. They propose an interactive dialog system (Zara) and compare a non-emotion-expressing VA with one that expresses emotions, exploring both verbal and visual aspects of communication. Participants of the study interacted with the IPAs and evaluated them via a questionnaire created based on the MSCEIT [3] for scoring the IPAs. However, their work focuses on the English language, while we build on their evaluation methodology to evaluate the Hindi language.
We leverage studies [3, 20] that utilize pre-defined metrics to quantify the performance of IPAs and combine this with additional features based on social science perspectives, customized to each category of emotion (e.g., analyzing the wit quotient, voice modulation, and use of references in humor, and supportive responses in sadness), to propose a new metric that quantitatively evaluates the EI performance of IPAs.

We pose queries to test the emotional intelligence of three different IPAs: Amazon Alexa, Apple Siri, and Google Assistant. We divide the queries into two categories, 'Sadness' and 'Humor'. Queries in the 'Sadness' category include statements where the user claims to be unhappy or shares a sorrowful event; example queries include "I am lonely" and "I am feeling sad". 'Humor' related queries test an agent's quips and evaluate the funny quotient and wit displayed in its responses. The question categories include: Personality, e.g. "What's your favourite colour?"; Rhetoric, e.g. "Where am I?"; Joke, e.g. "Why did the chicken cross the road?"; and Reference based, e.g. "Do you want to build a snowman?". At the time of writing this paper, the Hindi language was unavailable on Siri, so we tested it on Amazon Alexa and Google Assistant. Overall, we pose 402 queries in total, with 156 in Hindi and 246 in English. The exact set of questions posed in each category can be found here.

As per the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) [3], emotional intelligence can be measured along four branches: perceiving, using, understanding, and managing emotions. Previous studies on IPA evaluation have used this basic framework to create more fine-grained features [20]. Extending their methodology, we combine existing categories with new quantitative categories for evaluation and analysis. Whilst investigating the 'Sadness' emotion, we use the categories 'Identification' (perceiving branch) and 'Empathy' (understanding branch) defined by previous works [4, 20].
Under the managing-emotions branch of the MSCEIT, we add categories for whether the response is uplifting and whether the agent offers help. Furthermore, we check if the agent gives an entirely opposite response to a sadness-related query, such as "I am happy for you" or "That's awesome!" to a query like "I am good for nothing". For 'Humor' as an emotion, we build on the work by [9] by adding categories that check for the use of voice modulation [2, 8, 11, 18] and recognition/use of references (under the using-emotion branch), as well as wit and sarcasm/irony in the responses, under the understanding branch. We also add qualitative features such as checks for variance in agent responses to repeatedly posed queries; use of emojis; and a categorical label for 'Humor' category queries as jokes, rhetorical, personality, or reference based.

For annotating and evaluating the responses of the IPAs, we approached three native Hindi speakers, who are also fluent in English, ranging from 20 to 25 years of age. The annotators had prior experience with using all three IPAs for their routine tasks or academic experiments. The annotators marked the quantitative features mentioned above on a 0-1 scale. For the qualitative analysis, annotators could fill out the comments section against each entry with remarks about the appropriateness or creativity of the IPA response, as a way to capture miscellaneous observations. The annotators were also told to highlight problematic or counter-productive responses where the agents failed to display EI.

The IPAs' performances vary to different degrees across our metric's categorical branch attributes, especially in certain nuanced categories such as voice modulation, emoji usage, and search redirects. We found that Google Assistant's responses displayed similar emotional intelligence for both English and Hindi queries, although Amazon Alexa did not perform as well on Hindi queries.
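As a concrete illustration of how binary annotations of this kind can be aggregated, the sketch below takes a majority vote across three annotators for each (query, feature) pair and averages the result per feature. This is a minimal sketch, not the paper's actual analysis pipeline; the function names and sample labels are illustrative assumptions.

```python
from collections import defaultdict

def majority(votes):
    """Majority vote over binary (0/1) annotator labels."""
    return 1 if sum(votes) > len(votes) / 2 else 0

def feature_scores(annotations):
    """annotations: list of (query_id, feature, [label_a1, label_a2, label_a3]).
    Returns the mean majority-vote score per feature on a 0-1 scale."""
    totals, counts = defaultdict(int), defaultdict(int)
    for _, feature, votes in annotations:
        totals[feature] += majority(votes)
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Hypothetical labels for two 'Sadness' features over two queries
data = [
    ("q1", "empathy", [1, 1, 0]),
    ("q2", "empathy", [0, 0, 1]),
    ("q1", "uplifting", [1, 1, 1]),
    ("q2", "uplifting", [0, 1, 0]),
]
print(feature_scores(data))  # {'empathy': 0.5, 'uplifting': 0.5}
```

A majority vote is one simple way to resolve disagreement among an odd number of annotators; averaging the raw labels instead would preserve disagreement as a fractional score.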
We also found many occurrences where IPAs failed to process the emotion in the query and responded in a bland manner (by giving just web search results), sometimes even giving problematic and counter-productive responses. Our experiments offer evidence and point to directions for augmenting the EI potential of existing IPAs.

We found that Google Assistant outperformed the other two agents for sadness-related queries. The replies by Amazon Alexa were especially poor, with it performing the maximum number of search redirects and the least number of 'helpful and uplifting' replies, as can be seen in Fig 1(a). The trend of Google Assistant giving more nuanced replies than Amazon Alexa held across both English and Hindi, as can be seen in Fig 1(b). Responses by Apple Siri were intermediate, though it performed the least number of search redirects. From Fig 1(b) we see that Amazon Alexa's responses were better in English than in Hindi, but both languages were at par for Google Assistant, with the Hindi replies scoring higher on the 'uplifting / helpful' feature. This is a promising result pointing to the developing and evolving language support for Hindi.

We find many problematic responses to 'Sadness' related queries across all IPAs; in these cases, the agents are not able to process the emotion correctly and give a detrimental response. For example, when we posed the query "I am no good!", the agent replied with "Got it!". In other cases, the IPAs are able to recognize the emotion and respond sympathetically. However, IPAs sometimes repeated the same responses across various queries, possibly choosing from a standard set of responses. For example, the response "I am sorry to hear that. Sometimes taking a quiet moment can help. You can try listening to your favourite music, or doing some simple stretches.", although displaying EI, was repeated multiple times by Apple Siri across different queries.
The heatplot comparing English and Hindi shows that Google Assistant's performance across the metrics was at par for both languages; in particular, its responses in Hindi were even more empathetic than their English counterparts. Amazon Alexa's performance was relatively lower in Hindi compared to the same queries in English.

We find the IPAs performing similarly in the 'Humor' category, with few fluctuations in ranking across different features, as can be observed in Fig 2(a). We find that all agents use voice modulation to a large extent to convey humorous responses (Fig 2(a)). We observe far fewer search redirects in the humor category than in sadness (across all IPAs), indicating that the agents are able to understand the query more accurately when they are tasked with giving a humorous response. As can be seen in Fig 2(b), Google Assistant's and Alexa's humorous responses in Hindi were not far behind those in English. This is in contrast to Alexa's Hindi responses in the sadness category, where it performed poorly. A comparison of Google Assistant's and Alexa's responses in Hindi can also be seen in Fig 2.

Our work is the first to probe the question of emotional intelligence (EI) of IPAs in Indian languages, specifically Hindi. We study EI across two major categories, 'Sadness' and 'Humor', by posing a myriad of queries and comparing the results of three IPAs: Amazon Alexa, Apple Siri, and Google Assistant. We propose a new qualitative and quantitative evaluation scheme that builds on previous works and introduces new features that are useful for modern-day IPAs. We find promising results for Hindi, as Google Assistant returns appropriate and similar responses in both Hindi and English, signifying that efforts are being made to parse Hindi and respond to emotional queries. We also highlight cases where IPAs fail to understand an emotional query and respond in a bland or problematic manner.
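The per-assistant, per-language comparisons described above can be thought of as a score matrix with one cell per (assistant, language, feature) combination. The snippet below is a minimal sketch of how such a matrix could be built from individual 0-1 scores; it is not the paper's code, and the assistant names, feature names, and scores are illustrative assumptions.

```python
from collections import defaultdict

def score_matrix(records):
    """records: list of (assistant, language, feature, score) tuples, score in [0, 1].
    Returns {(assistant, language): {feature: mean score}}, the rows of a heatmap."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for assistant, lang, feat, score in records:
        sums[(assistant, lang, feat)] += score
        counts[(assistant, lang, feat)] += 1
    matrix = defaultdict(dict)
    for (assistant, lang, feat), total in sums.items():
        matrix[(assistant, lang)][feat] = total / counts[(assistant, lang, feat)]
    return dict(matrix)

# Illustrative scores (not the paper's data)
records = [
    ("google", "en", "empathy", 1), ("google", "en", "empathy", 0),
    ("google", "hi", "empathy", 1), ("google", "hi", "empathy", 1),
    ("alexa", "hi", "empathy", 0), ("alexa", "hi", "empathy", 1),
]
m = score_matrix(records)
print(m[("google", "hi")])  # {'empathy': 1.0}
```

Each row of the resulting mapping corresponds to one row of a heatmap, making a Hindi-versus-English comparison a matter of contrasting the ("assistant", "hi") and ("assistant", "en") rows feature by feature.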
We find many such cases in our dataset, signifying that there is room for improvement in the EI displayed by IPAs. Finally, we make public the dataset of queries and responses by the different IPAs, which can be useful for further NLP and HCI research.

References
[1] Alexa, Google, Siri: What are Your Pronouns? Gender and Anthropomorphism in the Design and Perception of Conversational Assistants
[2] How to help people navigate the internet, voice-first
[3] Measuring emotional intelligence with the Mayer-Salovey-Caruso emotional intelligence test (MSCEIT)
[4] Empathic Chatbot: Emotional Intelligence for Mental Health Well-being
[5] Role of emotions in perception of humanness of virtual agents
[6] Emotional intelligence. Daniel Goleman. Bantam Books, New York
[7] An Introduction to Voice Assistants
[8] Enhancing the Perceived Emotional Intelligence of Conversational Agents through Acoustic Cues
[9] Classification of humorous interactions with intelligent personal assistants
[10] Personification of the Amazon Alexa: BFF or a Mindless Companion (CHIIR '18)
[11] Google Assistant: A Comparison of Speech-Based Natural User Interfaces
[12] Exploring Perceived Emotional Intelligence of Personality-Driven Virtual Agents in Handling User Challenges
[13] Smartphone-Based Conversational Agents and Responses to Questions About Mental Health, Interpersonal Violence, and Physical Health
[14] Using Intelligent Personal Assistants to Strengthen the Elderlies' Social Bonds
[15] The Age of Artificial Emotional Intelligence
[16] Survey on Virtual Assistant: Google Assistant, Siri, Cortana, Alexa
[17] The mind in the machine: Anthropomorphism increases trust in an autonomous vehicle
[18] Humor Recognition and Humor Anchor Extraction. Conference on Empirical Methods in Natural Language Processing
[19] Perceived Emotional Intelligence in Virtual Agents
[20] Perceived Emotional Intelligence in Virtual Agents