title: Preliminary Results of a Systematic Review: Quality Assessment of Conversational Agents (Chatbots) for People with Disabilities or Special Needs
authors: de Filippis, Maria Laura; Federici, Stefano; Mele, Maria Laura; Borsci, Simone; Bracalenti, Marco; Gaudino, Giancarlo; Cocco, Antonello; Amendola, Massimo; Simonetti, Emilio
date: 2020-08-10
journal: Computers Helping People with Special Needs
DOI: 10.1007/978-3-030-58796-3_30

People with disabilities or special needs can benefit from AI-based conversational agents, which are used in competence training and well-being management. Assessing the quality of interaction with these chatbots is key to reducing dissatisfaction with them and to understanding their potential long-term benefits. This in turn helps to increase adherence to their use, thereby improving the quality of life of the large population of end-users they can serve. We systematically reviewed the literature on methods of assessing the perceived quality of interaction with chatbots, and identified only 15 of 192 papers on this topic that included people with disabilities or special needs in their assessments. The results also highlight the lack of a shared theoretical framework for assessing the perceived quality of interaction with chatbots. Systematic procedures based on reliable and valid methodologies are still needed in this field. The current lack of reliable tools and systematic methods for assessing chatbots for people with disabilities and special needs is concerning, and may lead to unreliable systems entering the market with disruptive consequences for users. Three major conclusions can be drawn from this systematic analysis: (i) researchers should adopt consolidated and comparable methodologies to rule out risks in use; (ii) satisfaction and acceptability are different constructs and should be measured separately; (iii) dedicated tools and methods for assessing the quality of interaction with chatbots should be developed and used to enable the generation of comparable evidence.

Chatbots are intelligent conversational software agents that can interact with people using natural-language, text-based dialogue [1]. They are extensively used to support interpersonal services, decision making, and training in various domains [2-5]. There is broad consensus on the effectiveness of these AI agents, particularly in the field of health, where they can promote recovery, adherence to treatment, and training [6, 7], both for developing different competencies and for maintaining well-being [3, 8, 9]. In view of this, evaluating the perceived quality of engagement with chatbots is key to reducing dissatisfaction, facilitating possible long-term benefits, and increasing loyalty, thereby improving the quality of life of the large population of end-users that chatbots are able to serve.

Chatbots are interactive systems, and irrespective of their domain of application, their output in terms of quality of interaction should be planned and measured in conjunction with their users, rather than by applying a system-centric approach [1]. A recent review by Abd-Alrazaq and colleagues [6] found that in the field of mental health, researchers typically only test chatbots in a randomized controlled trial.
The efficiency of interaction is seldom assessed, and when it is, assessment generally relies on non-standardized aspects of interaction and qualitative measures that do not support comparison. This unreliable way of testing the quality of interaction with these devices or applications, through a wide and varied range of variables, is endemic in every field that uses chatbots, and makes it difficult to compare the results of these studies [1, 10, 11]. While some qualitative guidelines and tools have emerged [1, 12], it is still hard to find agreement on which factors should be tested. As argued by Park and Humphry [13], the implementation of these innovative systems should be based on a common framework for assessing perceived interaction quality, in order to prevent chatbots from being regarded by their end-users as merely another source of social alienation and from being discarded like any other unreliable assistive technology [14, 15]. A common framework and guidelines on how to determine the perceived quality of chatbot interaction are therefore required.

From a systems perspective, a subjective experience of consistency arises from the interaction between the user and the program under specific conditions and contexts. Subjective experience cannot be measured by simply assuming that optimal system performance, as perceived by the user, is the same thing as a good user experience [16]. The need to quantify the objective and subjective dimensions of experience in a reliable and comparable manner is a lesson that has been learned in the field of human-computer interaction, but has yet to be learned in the field of chatbots, as outlined by Lewis [17] and Bendig and colleagues [18].

In the absence of a common assessment framework that specifies comparable evaluation criteria, chatbot developers are forced to rely on the umbrella framework provided by the International Organization for Standardization (ISO): ISO 9241-11 [19] for assessing usability, and ISO 9241-210 [20] for assessing user experience (UX). These two standards define the key factors of interaction quality: (i) effectiveness, efficiency, and satisfaction in a specific context of use (ISO 9241-11); and (ii) control (where possible) of expectations over time concerning use, satisfaction, perceived acceptability, trust, usefulness, and all the other factors that ultimately lead users to adopt and keep using a tool (ISO 9241-210). Although these standards have not yet been updated to meet the specific needs of chatbots and conversational agents, the two aspects of usability and UX are essential to the perceived quality of interaction [21]. Until a framework has been developed and broad consensus reached on assessment criteria, practitioners may benefit from assessing chatbots against these ISO standards, as they allow for an evaluation of the interactive output of these applications.

This paper examines how aspects of perceived interaction quality are assessed in studies of AI-based agents that support people with disabilities or special needs. Our systematic literature review was conducted in accordance with the PRISMA reporting checklist and covered journal articles investigating the relationship between chatbots and people with disabilities or special needs over the last 10 years.
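As an illustration only (the field names and grouping below are our own and are not defined by either standard), an evaluation record that keeps the ISO 9241-11 usability factors separate from ISO 9241-210 UX factors might look like the following sketch; structuring study data this way is one possible route toward the comparable evidence this review calls for.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UsabilityMeasures:
    """ISO 9241-11 key factors: effectiveness, efficiency, satisfaction."""
    effectiveness: Optional[float] = None  # e.g., task completion rate in [0, 1]
    efficiency: Optional[float] = None     # e.g., mean time on task (seconds)
    satisfaction: Optional[float] = None   # e.g., score from a standardized scale

@dataclass
class UXMeasures:
    """A subset of ISO 9241-210 user-experience factors named in the text."""
    acceptability: Optional[float] = None
    trust: Optional[float] = None
    usefulness: Optional[float] = None
    intention_to_keep_using: Optional[float] = None

@dataclass
class ChatbotEvaluation:
    """One study's interaction-quality record, kept comparable across studies."""
    study_id: str
    context_of_use: str  # population, tasks, and environment of the evaluation
    usability: UsabilityMeasures = field(default_factory=UsabilityMeasures)
    ux: UXMeasures = field(default_factory=UXMeasures)

# Hypothetical example record (values invented for illustration):
example = ChatbotEvaluation(
    study_id="example-study",
    context_of_use="health coaching for adults with a motor disability",
    usability=UsabilityMeasures(effectiveness=0.85, efficiency=240.0, satisfaction=72.5),
)
```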
To determine whether and how the quality of interaction with chatbots was evaluated in line with the ISO standards for usability (ISO 9241-11) and UX (ISO 9241-210), this review sought to answer the following research questions:

R1. How are the key factors of usability (efficiency, effectiveness, and satisfaction) measured and reported in evaluations of chatbots for people with disabilities or special needs?

R2. How are factors relating to UX measured and reported in assessments of chatbots?

We included in our review studies that: (i) referred to chatbots or conversational interfaces/agents for people with disabilities or special needs in the title, abstract, keywords or main text; and (ii) included empirical findings and discussions of theories (or frameworks) of factors that could contribute to the perceived quality of interaction with chatbots, with a focus on people with various types of disability.

We excluded records that did not include at least one group of end-users with a disability in either the testing or the design of the interaction, as well as studies that focused on: (i) testing emotion recognition during the interaction exchange, or assessing applications for detecting the development of disability conditions or disease; (ii) chatbots supporting people with alcoholism, anxiety, depression or traumatic disorders; (iii) assessing end-user compliance with clinical treatment, or assessing the clinical effectiveness of using AI agents as an alternative to standard (or other) forms of care without considering the interaction exchange with the chatbot; and (iv) the ethical and legal implications of interacting with AI-based digital tools.

Records were retrieved from Scopus and the Web of Science using the Boolean operators AND/OR to combine the following keywords: chatbot*, conversational agent*, special needs, disability*. We searched only for English-language articles.

A total of 147 items were retrieved from Scopus and Web of Science. A further 53 records were added based on a previous review of chatbots in mental health [6]. After removing eight duplicates, the remaining 192 records were screened by title and abstract by two authors (MLDF, SB). Articles whose stated scope included the assessment of interactions between chatbots or conversational agents and people with various types of intellectual disabilities or special needs were retained. The full text of 68 records was then scanned for mentions of methods and factors for assessing the interactions of people with disabilities or special needs with chatbots. The final list consisted of 15 documents [3, 8, 9, 22-33], 80% of which had already been discussed, for different purposes, in previous work by Abd-Alrazaq et al. [6].

Of the 15 records that matched our criteria, 80% examined AI agents supporting people with autism and (mild to severe) mental disabilities, while the other 20% focused on testing applications to support the general health or training of people with a wide range of disabilities. The main goal of 66.6% of the applications was to support health and rehabilitation, while the remaining studies focused on solutions to support learning and training for people with disabilities.
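As a worked check of the screening flow just described (our own illustration: the record counts are taken from the text, but the exact query syntax submitted to Scopus and Web of Science is not reported in the source and is only approximated here):

```python
# Approximate reconstruction of the search string; the keyword combination is as reported,
# but the exact field codes and per-database syntax are an assumption.
query = '("chatbot*" OR "conversational agent*") AND ("special needs" OR "disability*")'

retrieved_scopus_wos = 147   # records retrieved from Scopus and Web of Science
added_from_review = 53       # records added from the earlier mental-health review [6]
duplicates_removed = 8

screened_title_abstract = retrieved_scopus_wos + added_from_review - duplicates_removed
assert screened_title_abstract == 192  # records screened by title and abstract

full_text_assessed = 68      # records whose full text was scanned
included = 15                # final set of included documents

print(f"query: {query}")
print(f"screened: {screened_title_abstract}, full text: {full_text_assessed}, included: {included}")
```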
In terms of their approach to assessment, 46.7% of the studies used surveys or questionnaires, 26.7% applied a quasi-experimental procedure, and the remaining 26.7% tested chatbots using randomized controlled trials (i.e., comparing use of the agent against standard practice with a between-subjects design) that assessed several aspects of the quality of the interaction. Factors relating to usability (i.e., effectiveness, efficiency, and satisfaction) were only partly assessed: 80% of the studies reported measures of effectiveness, 26.7% measures of efficiency, and 20% measures of satisfaction. In terms of UX, acceptability was the most frequently reported measure (26.7% of cases), while a few other factors (e.g., engagement, safety, helpfulness) were measured using various approaches. The results suggest that the main focus of studies of chatbots for people with disabilities or special needs is the effectiveness of such apps, compared with standard practice, in supporting adherence to treatment.

The results can be summarized in accordance with our research questions as follows:

R1. A total of 80% of the studies [3, 8, 9, 23, 25, 27, 30-33] tested the effectiveness of chatbots according to the ISO standard [19], i.e., the ability of the app to perform correctly, allowing users to achieve their goals. Only 26.7% of the studies [9, 25, 26, 32] also investigated efficiency, by measuring performance in terms of time or other resources invested by participants to achieve their goals. Only 20% [9, 22, 23] reported an intention to gather data on user satisfaction in a structured way, and only one study [23] used a validated scale (e.g., the System Usability Scale or the Usability Metric for User Experience [34]). In another, practitioners adapted a standardized questionnaire without clarifying the changes made to the items [22], and a qualitative scale was used in a further study [9].

R2. Acceptability was identified as an assessment factor in 26.7% of the studies [9, 22, 24, 25]. Despite the popularity of the technology acceptance model [35, 36], acceptability was measured in a variety of ways (e.g., as a lack of complaints [25]) or treated as a measure of satisfaction [24]. A total of 53% of the studies used various other factors to assess the quality of interaction, such as overall experience, safety, acceptability, engagement, intention to use, ease of use, helpfulness, enjoyment, and appearance. Most used non-standardized questionnaires to assess the quality of interaction. Even when a factor such as safety was identified as a reasonable form of quality control, in compliance with ISO standards for medical devices [37] and risk analysis [38], the way it was measured in these studies was questionable, e.g., judging a product to be safe based on the absence of adverse events [9].

The results of the present study suggest that informal and untested measures of quality are often employed when evaluating user interactions with AI agents. This is particularly relevant in the domain of health and well-being, where researchers set out to measure the clinical validity of tools intended to support people with disabilities or special needs. The risk is that shortcomings in these methods could significantly compromise the quality of chatbot use, ultimately leading to the abandonment of applications that could otherwise have a positive impact on their end-users.
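For reference, the proportions reported in the results above can be mapped back onto study counts out of the 15 included records; the short check below is our own arithmetic and is not reported in the source.

```python
# Convert the reported percentages back into study counts (n = 15 included records).
n_included = 15

reported_percentages = {
    "used surveys or questionnaires": 46.7,
    "applied a quasi-experimental procedure": 26.7,
    "used randomized controlled trials": 26.7,
    "reported measures of effectiveness": 80.0,
    "reported measures of efficiency": 26.7,
    "reported measures of satisfaction": 20.0,
    "reported acceptability": 26.7,
}

for label, pct in reported_percentages.items():
    count = round(n_included * pct / 100)  # 46.7% -> 7, 26.7% -> 4, 20% -> 3, 80% -> 12
    print(f"{label}: {pct}% of {n_included} = {count} studies")
```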
Three major findings can be identified from this systematic analysis.

(i) Researchers tend to treat a lack of complaints as an indirect measure of the safety and acceptability of these tools. Safety and acceptability should instead be assessed with consolidated and comparable methodologies, in order to rule out risks in use [37-39].

(ii) Satisfaction, understood as a usability metric, is a different construct from acceptability, and the two should be measured separately with available standardized questionnaires [39, 40].

(iii) Although dedicated tools and methods for assessing the quality of interaction with chatbots are lacking, reliable methods and measures for assessing interaction are available [17, 19, 21, 37], and these should be adopted to enable the generation of comparable evidence on the quality of conversational agents.
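As one concrete illustration of what a standardized instrument's scoring looks like (findings (ii) and (iii) call for exactly this kind of comparable measure), the sketch below applies standard System Usability Scale scoring to a set of invented item responses; it is not drawn from any of the reviewed studies.

```python
def sus_score(responses):
    """Compute a System Usability Scale score (0-100) from ten item responses (1-5).

    Odd-numbered items are positively worded (contribution = response - 1);
    even-numbered items are negatively worded (contribution = 5 - response).
    The summed contributions (0-40) are multiplied by 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i = 0 corresponds to item 1 (odd-numbered)
        for i, r in enumerate(responses)
    )
    return total * 2.5

# Invented example responses for a single participant:
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```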
References

Evaluating quality of chatbots and intelligent conversational agents
Music, search, and IoT: how people (really) use voice assistants
Getting ready for adult healthcare: designing a chatbot to coach adolescents with special health needs through the transitions of care
Emotional storytelling using virtual and robotic agents
AutoTutor and affective AutoTutor: learning by talking with cognitively and emotionally intelligent computers that talk back
An overview of the features of chatbots in mental health: a scoping review
Assistive conversational agent for health coaching: a validation study
Using virtual interactive training agents (ViTA) with adults with autism and other developmental disabilities
Feasibility of a virtual exercise coach to promote walking in community-dwelling persons with Parkinson disease
Assessing user satisfaction with information chatbots: a preliminary investigation
Chatbots' perceived usability in information retrieval tasks: an exploratory analysis
Exclusion by design: intersections of social, digital and data exclusion
Providing assistive technology in Italy: the perceived delivery process quality as affecting abandonment
Why people use and don't use technologies: introduction to the special issue on assistive technologies for cognition/cognitive support technologies
Measuring usability as quality of use
Usability: lessons learned… and yet to be learned
The next generation: chatbots in clinical psychology and psychotherapy to foster mental health - a scoping review
Ergonomic Requirements for Office Work with Visual Display Terminals - Part 11: Guidance on Usability
Human-Centred Design for Interactive Systems
Shaking the usability tree: why usability is not a dead end, and a constructive way forward
A virtual conversational agent for teens with autism: experimental results and design lessons
Assessing the usability of a chatbot for mental health care
Using affective avatars and rich multimedia content for education of children with autism
Design of a virtual reality based adaptive response technology for children with autism
A fully automated conversational agent for promoting mental well-being: a pilot RCT using mixed methods
Development of a virtual agent based social tutor for children with autism spectrum disorders
The LISSA virtual human and ASD teens: an overview of initial experiments
Job offers to individuals with severe mental illness after participation in virtual reality job interview training
Virtual reality job interview training for individuals with psychiatric disabilities
Embodied conversational agents for multimodal automated social skills training in people with autism spectrum disorders
Usability assessment of interaction management support in Louise, an ECA-based user interface for elders with cognitive impairment
Virtual reality job interview training in adults with autism spectrum disorder
Assessing user satisfaction in the era of user experience: comparison of the SUS, UMUX and UMUX-LITE as a function of product experience
User acceptance of information technology: toward a unified view
Ambient assistive technology for people with dementia: an answer to the epidemiologic transition
Medical Devices - Application of Risk Management to Medical Devices
Short scales of satisfaction assessment: a proxy to involve disabled users in the usability testing of websites
Is the lite version of the usability metric for user experience (UMUX-LITE) a reliable tool to support rapid assessment of new healthcare technology?