key: cord-0783309-jjujqd6u
authors: Dennis, Alan R; Kim, Antino; Rahimi, Mohammad; Ayabakan, Sezgin
title: User reactions to COVID-19 screening chatbots from reputable providers
date: 2020-07-06
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocaa167
sha: 035eeb6bc88aa2b097564756179ca250f8df2594
doc_id: 783309
cord_uid: jjujqd6u

OBJECTIVE: The objective was to understand how people respond to COVID-19 screening chatbots.

MATERIALS AND METHODS: We conducted an online experiment with 371 participants who viewed a COVID-19 screening session between a hotline agent (chatbot or human) and a user with mild or severe symptoms.

RESULTS: The primary factor driving user response to screening hotlines (human or chatbot) is perceptions of the agent's ability. When ability is the same, users view chatbots no differently than, or more positively than, human agents. The primary factor driving perceptions of ability is the user's trust in the hotline provider, with a slight negative bias against chatbots' ability. Asians perceived higher ability and benevolence than Whites.

CONCLUSION: Ensuring that COVID-19 screening chatbots provide high-quality service is critical, but not sufficient for widespread adoption. The key is to emphasize the chatbot's ability and assure users that it delivers the same quality as human agents.

Many people are seeking information in response to the COVID-19 pandemic [1]. Individuals with various symptoms and conditions are looking for guidance on whether to seek medical attention for COVID-19. Providing accurate, timely information is crucial to help those with COVID-19, as well as those without it, make good decisions. The sudden, unprecedented demand for information is overwhelming resources [2, 3]. One solution is the deployment and use of technologies such as chatbots [3, 4]. Chatbots have the potential to relieve the pressure on contact centers [3, 5].

Chatbots are software applications that conduct an online conversation in natural language via typed text or voice commands (e.g., Siri) [6]. Chatbots are scalable, so they can meet an unexpected surge in demand when there is a shortage of qualified human agents [7]. They can provide round-the-clock service at a low operational cost [7]. They are consistent in quality, in that they always provide the same results in response to the same inputs, and they are easily retrained in the face of rapidly changing information [8]. Chatbots are also non-judgmental; they make no moral judgments about the information provided by the user, so users may be more willing to disclose socially undesirable information [9].

As chatbots increase in quality, their use is expanding. For example, chatbots are already asking patients a series of clearly defined questions and determining a risk score [9, 14]. Chatbots can help call centers triage patients and advise them on the most appropriate actions to take, which may be to do nothing because the patient does not present symptoms that warrant immediate medical care [14]. Despite all the potential benefits, like any other technology-enabled service, chatbots will help only if people use them and follow their advice [11, 15].

In this paper, we examine whether people will use high-quality chatbots provided by reputable organizations. We control for chatbot quality by examining a chatbot that provides the exact same service as a human agent. COVID-19 screening is based on a very specific set of criteria, so a well-designed chatbot can perform at close to a trained human level [16].
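To make concrete what such criteria-based screening involves, the sketch below shows a minimal, rule-based triage step of the kind a screening chatbot could implement. It is purely illustrative: the symptom lists, thresholds, and recommendations are hypothetical and are not the CDC protocol or the script used in this study's vignettes.

```python
# Hypothetical rule-based triage step for a screening chatbot.
# The questions, symptom names, and thresholds below are illustrative only;
# they are not the CDC protocol or the vignette script used in this study.

EMERGENCY_SIGNS = {"trouble breathing", "persistent chest pain", "bluish lips"}
COMMON_SYMPTOMS = {"fever", "cough", "fatigue", "loss of taste or smell"}

def triage(symptoms: set, close_contact: bool) -> str:
    """Map a user's reported symptoms to a coarse recommendation."""
    if symptoms & EMERGENCY_SIGNS:
        return "Seek emergency care immediately."
    score = len(symptoms & COMMON_SYMPTOMS) + (1 if close_contact else 0)
    if score >= 2:
        return "Contact your healthcare provider and self-isolate."
    if score == 1:
        return "Self-isolate, monitor symptoms, and call back if they worsen."
    return "No action needed now; practice social distancing and monitor symptoms."

# Example: a user reporting fever and cough with no known exposure.
print(triage({"fever", "cough"}, close_contact=False))
```

Because each answer maps deterministically to the next question or to a recommendation, a chatbot executing this kind of fixed protocol produces the same output as a human agent following the same script, which is the premise of the study design described below.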
Trust is an important factor that influences the use of chatbots [11], as well as patient compliance [17, 18]. Users will be reluctant to use chatbots if they do not trust them [11]. Trust in humans is influenced by three primary factors [19] that also have parallels for trust in technology [20]. The first is ability: the agent, human or chatbot, must be competent within the range of actions required of it [19]. The agent must have the knowledge and skills needed to make a correct diagnosis. Second is integrity: the agent must do what it says it will do [19]. For example, if the agent says the user's information is private and will not be disclosed, the information must truly be private. In an era when data breaches are common [21], do users believe that technology has integrity? Finally, benevolence: the agent must have the patient's best interests in mind and not be guided by ulterior motives, such as increasing profits [19].

The underlying trust factors of ability, integrity, and benevolence play important roles in the use of technology, and of technology providing recommendations in particular [22-24]. Ability and integrity are typically more important for instrumental outcomes associated with transactions (e.g., purchasing), because users are most concerned with whether the technology will work as intended to complete the transaction [22-24]. Affect and other perceptual outcomes (e.g., satisfaction) are often influenced more by benevolence, because they are based more on the relational aspects of technology use [22-24]. Accordingly, we examine ability, integrity, and benevolence as potential factors that drive trust in chatbots and, subsequently, influence patients' intentions to use chatbots and comply with their recommendations.

We conducted a 2×2 between-subjects online experiment, crossing two agent types (human vs chatbot) with two patient symptom-severity levels (mild vs severe), in which subjects were randomly assigned to view a video vignette of a COVID-19 screening hotline session between an agent and a patient. The online setting is appropriate because screening services can be provided via various online channels [10, 13]. Vignettes have been commonly used to study human behavior [25], technology use [26], and trust [27] because they provide excellent experimental control [28]. Research shows that reading or watching a vignette triggers the same attitudes as actually engaging in the behaviors shown in the vignette [25]; meta-analyses have shown no significant differences in conclusions between vignette studies and studies of actual behavior, although effect sizes in vignette-based studies tend to be slightly lower [25, 26].

In April 2020, we recruited 402 participants from Amazon Mechanical Turk, following the usual protocols to ensure data quality [29]. Participants were paid $2.00. Thirty subjects failed one or more of the six attention checks and one did not report gender; these were removed, leaving 371 participants for analysis. About half were female (188); 83% were White, 8% Asian, 6% Black, and 3% other (individuals selecting multiple ethnicities and individuals selecting "other").

Participants watched a 2½-minute video vignette of a fictitious text chat between an agent at a COVID-19 screening hotline and a user with possible COVID-19 symptoms. We designed two vignettes, in which the user reported either mild or severe symptoms. We developed our vignettes based on our experiences using four COVID-19 chatbots [13] and the screening questions recommended by the CDC.
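Returning to the experimental design described above, the following sketch illustrates how random assignment to the four cells of a 2×2 between-subjects design works. It is a generic illustration, not the authors' actual survey-platform code; the participant identifiers are invented.

```python
# Minimal sketch (not the authors' survey-platform code) of random assignment
# in a 2x2 between-subjects design: agent type (human vs chatbot) crossed
# with symptom severity (mild vs severe).
import itertools
import random

AGENT_TYPES = ["human", "chatbot"]
SEVERITY_LEVELS = ["mild", "severe"]

# The four experimental cells of the 2x2 design.
CONDITIONS = list(itertools.product(AGENT_TYPES, SEVERITY_LEVELS))

def assign_condition(participant_id: str) -> dict:
    """Randomly assign one participant to one of the four cells."""
    agent, severity = random.choice(CONDITIONS)
    return {"participant": participant_id, "agent": agent, "severity": severity}

# Example: assign three hypothetical participants.
for pid in ["p001", "p002", "p003"]:
    print(assign_condition(pid))
```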
Participants were informed that the agent in the video was either a human or a chatbot (randomly assigned), but the videos were identical across the two conditions to control for quality differences between the human and the chatbot. Thus, the study compares a chatbot with capabilities identical in quality to those of a human agent. Participants were informed that the hotline was provided by the Centers for Disease Control and Prevention (CDC) and were informed of the deception at the end of the study. Thus, any differences between the chatbot and the human agent are due to human bias, because participants saw the exact same vignette in both conditions.

We used established measures of ability, integrity, benevolence, and trust, and of the control factors of disposition to trust and personal innovativeness with information technology (PIIT). We adapted prior measures for satisfaction, persuasiveness, likelihood of use, and likelihood of following up on the diagnosis of the agent. All measures used 1-7 scales, and all scales proved reliable (Cronbach's alpha > .80). All demographic items were categorical variables. More details on the items and reliabilities are provided in the Supplementary Materials. The experimental materials were pilot tested with 100 undergraduate students at the first author's university prior to the study.

The first part of our analysis shows that participants perceived the chatbot to have significantly less ability, integrity, and benevolence (see Table 1). Severity of symptoms influenced the perceptions of ability and integrity, but not benevolence. The effect sizes for the models as a whole (R²) were what Cohen [30] calls medium, or small to medium. The individual effect sizes of the chatbot (partial eta²) for ability and integrity were between what Cohen [30] terms small (.01) and medium (.06), while the effect size for benevolence was medium. The primary factor influencing perceptions of ability was trust in the provider (i.e., the CDC), with the type of agent (human or chatbot) being a secondary factor. For integrity, both trust in the provider and the type of agent were primary factors. For benevolence, the primary factor was the type of agent, with trust secondary. We also controlled for gender, age, and ethnicity. Gender had no significant effect, but, compared with Whites, individuals of Asian ethnicity perceived the agent to have significantly higher ability and benevolence. Age was significant for benevolence, but there was no pattern to its effects.

In the second part of our analysis, we examined five outcomes, after controlling for the effects of ability, integrity, and benevolence: (i) persuasiveness, (ii) satisfaction, (iii) likelihood of following the agent's advice, (iv) trust, and (v) likelihood of use (see Table 2). The effect sizes for the models as a whole (R²) were large. The dominant factor across all five outcomes was perceived ability (very large effect sizes), with agent type (chatbot) a secondary factor having a medium-sized positive effect on persuasiveness and small to medium positive effects on satisfaction, likelihood of following the agent's advice, and likelihood of use. Lastly, severity of the condition did not directly affect the outcomes, nor did it moderate the relationship between agent type and the outcomes. The control variables (gender, age, and ethnicity) had no significant effects on the outcome variables. Simply put, the results show that the primary factor driving patient response to COVID-19 screening hotlines (human or chatbot) is users' perceptions of the agent's ability.
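As a point of reference for the effect-size statistics reported above, partial eta squared for a factor is that factor's sum of squares divided by the sum of that quantity and the residual sum of squares. The sketch below, using invented data and variable names rather than the study's dataset or analysis script, shows how a model of this kind (agent type × severity with trust in the provider as a covariate) and its partial eta² values could be computed with statsmodels.

```python
# Hypothetical sketch of the kind of model reported above: a 2x2 design
# (agent type x severity) with trust in the provider as a covariate,
# predicting perceived ability. Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 371
df = pd.DataFrame({
    "agent": rng.choice(["human", "chatbot"], n),
    "severity": rng.choice(["mild", "severe"], n),
    "provider_trust": rng.uniform(1, 7, n),
})
# Simulated outcome: mostly driven by provider trust, small chatbot penalty.
df["ability"] = (2 + 0.6 * df["provider_trust"]
                 - 0.3 * (df["agent"] == "chatbot")
                 + rng.normal(0, 1, n))

model = smf.ols("ability ~ C(agent) * C(severity) + provider_trust", data=df).fit()
aov = anova_lm(model, typ=2)

# Partial eta squared: SS_effect / (SS_effect + SS_residual).
ss_resid = aov.loc["Residual", "sum_sq"]
aov["partial_eta_sq"] = aov["sum_sq"] / (aov["sum_sq"] + ss_resid)
print(aov[["sum_sq", "F", "PR(>F)", "partial_eta_sq"]])
print(f"Model R-squared: {model.rsquared:.3f}")
```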
A secondary factor for persuasiveness, satisfaction, likelihood of following the agent's advice, and likelihood of use was the type of agent, with participants reporting that they viewed chatbots more positively than human agents, which is good news for healthcare organizations struggling to meet user demand for screening services. This positive response may arise because users feel more comfortable disclosing information, especially socially undesirable information, to a chatbot that makes no judgment [9]. The CDC, the World Health Organization (WHO), UNICEF, and other health organizations caution that the COVID-19 outbreak has provoked social stigma and discriminatory behaviors against people of certain ethnic backgrounds, as well as against those perceived to have been in contact with the virus [31, 32]. This is truly an unfortunate situation, and perhaps chatbots can assist those who are hesitant to seek help because of the stigma.

The primary factor driving perceptions of ability was the user's trust in the provider of the screening hotline. Our results show a slight negative bias against chatbots' ability, perhaps due to recent press reports [13]. Therefore, proactively informing users of the chatbot's ability is important; users need to understand that chatbots use the same up-to-date knowledge base and follow the same set of screening protocols as human agents.

Developing a high-quality COVID-19 screening chatbot, one as qualified as a trained human agent, will help alleviate the increased load on COVID-19 contact centers staffed by human agents. When chatbots are perceived to provide the same service quality as human agents, users are more likely to see them as persuasive, be more satisfied, and be more likely to use them. A user's tech-savviness (PIIT) has only a small effect, so these results apply both to those with deep technology experience and to those with little. Yet therein lies the rub: there is a gap between how users perceive chatbots' and human agents' abilities. Therefore, to offset users' biases [33], a necessary component in deploying chatbots for COVID-19 screening is a strong messaging campaign that emphasizes the chatbot's ability.

Follow Advice (Source: [14]) (1-7 scale)
- How likely would you be to take advice from Robin?
- How likely would you be to take advice from Robin again in the future (if a similar situation took place)?
- How likely would you be to follow up/carry through with the next steps proposed by Robin?
- How soon would you be willing to carry through with the next steps proposed by Robin?

Likelihood of Use / Intention to Use (Source: [15]) (1-7 scale)
Please indicate whether you agree or disagree with the following statements.
- If I was faced with a similar situation, I would interact with Robin.
- If I was faced with a similar decision in the future, I would contact Robin.
- If a similar need arises in the future, I would feel comfortable contacting Robin to meet my needs.
- If I had problems like this, I would contact Robin.

What follows is the list of questions and statements used as attention checks in the survey.

What follows is a description of the two agent types (i.e., human Robin and chatbot Robin) as they appear in the survey.

What follows is a conversation between a user and Robin, a chatbot at the Covid-19 Screening Hotline.
- Chatbot is an example of Conversational AI; it is a piece of software that conducts conversations.
- This hotline is the first line of response for users who suspect that they might have contracted Covid-19, and it guides them through the next recommended steps based on the screening results.

What follows is a conversation between a user and Robin, an agent at the Covid-19 Screening Hotline.
Notes:
- Robin is a qualified staff member trained to be the first line of response for users who suspect that they might have contracted Covid-19, and Robin guides them through the next recommended steps based on the screening results.

Item Reliability for Disposition to Trust
Item Reliability for Trust - Integrity
(item-level reliability output omitted)

Video for Severe Symptoms: https://youtu.be/E-YOMDjJlVo
Data S1. Data File: https://iu.box.com/v/JAMIADataExport

Led by COVID-19 surge, virtual visits will surpass 1B in 2020: report. Becker's Hospital Review.
Surge in patients overwhelms telehealth services amid coronavirus pandemic.
Report: Implementation of a Digital Chatbot to Screen Health System Employees during the COVID-19 Pandemic.
Chatbots in the fight against the COVID-19 pandemic.
The pandemic is emptying call centers. AI chatbots are swooping in.
TinyMuds, and the Turing Test: Entering the Loebner Prize Competition. Proceedings of the Eleventh National Conference on Artificial Intelligence.
A Comparative Study of Chatbots and Humans.
Usefulness, localizability, humanness, and language-benefit: additional evaluation criteria for natural language dialogue systems.
Chatbots for customer service: user experience and motivation.
Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: a mixed-methods study.
Chatbots in Healthcare: Status Quo, Application Scenarios for Physicians and Patients, and Future Directions. European Conference on Information Systems.
I asked eight chatbots whether I had Covid-19. The answers ranged from 'low' risk to 'start home isolation'.
Facilitating User Symptom Check Using a Personalised Chatbot-Oriented Dialogue System.
Trust and TAM in Online Shopping: An Integrated Model.
Towards a human-like open-domain chatbot.
Relationship Between Internet Health Information and Patient Compliance Based on Trust: Empirical Study.
Physician-Patient Relationship and Medication Compliance: A Primary Care Investigation.
An integrative model of organizational trust.
Rethinking Trust in Technology.
Data breaches of protected health information in the United States.
Trust in and adoption of online recommendation agents.
Effects of rational and social appeals of online recommendation agents on cognition- and affect-based trust.
Do different kinds of trust matter? An examination of the three trusting beliefs on satisfaction and purchase behavior in the buyer-seller context.
To justify or excuse? A meta-analysis of the effects of explanations.
Seeing the Forest and the Trees: A Meta-Analysis of the Antecedents to Information Security Policy Compliance.
How the packaging of decision explanations affects perceptions of trustworthiness.
Best Practice Recommendations for Designing and Implementing Experimental Vignette Methodology Studies.
Data Collection in the Digital Age: Innovative Alternatives to Student Samples.
Statistical Power Analysis for the Behavioral Sciences.
Reducing Stigma. 2020.
Social stigma associated with the coronavirus disease (COVID-19).
Trust and distrust in information systems at the workplace.
Trust Management: An Information Systems Perspective.
Not so different after all: A cross-discipline view of trust.
Trust in information technology.
Trust in a specific technology: An investigation of its components and measures.
An empirical examination of individual traits as antecedents to computer anxiety and computer self-efficacy.
A theoretical assessment of the user-satisfaction construct in information systems research.
The meaning and measurement of user satisfaction: A multigroup invariance analysis of the end-user computing satisfaction instrument.
Measurement of user satisfaction with web-based information systems: An empirical study.
A short-form measure of user information satisfaction: a psychometric evaluation and notes on use.
Adoption of ICT in a government organization in a developing country: An empirical study.
Research note: Trust is in the eye of the beholder: A vignette study of postevent behavioral controls' effects on individual trust in virtual teams.
Perceived effectiveness rating scales applied to insomnia help-seeking messages for middle-aged Japanese people: a validity and reliability study.
Collaborative business engineering with animated electronic meetings.
System design features and repeated use of electronic data exchanges.

(item-level reliability output for the follow-up items omitted)
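The item-reliability figures referenced throughout (Cronbach's alpha > .80 for every scale) are the standard internal-consistency statistic for multi-item scales such as those listed above. The snippet below is a generic illustration of how such a value could be computed for a four-item, 1-7 scale; the item responses shown are invented, not data from this study (see Data S1 above for the actual data file).

```python
# Generic Cronbach's alpha computation for a multi-item scale.
# The item responses below are fabricated for illustration; they are not
# the study's data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scale scores (e.g., 1-7)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of summed scale
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Five hypothetical respondents answering a four-item, 1-7 scale.
responses = np.array([
    [6, 7, 6, 6],
    [5, 5, 6, 5],
    [7, 7, 7, 6],
    [3, 4, 3, 4],
    [6, 6, 5, 6],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```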