key: cord-0673223-ybxgawnj
title: Explain yourself! Effects of Explanations in Human-Robot Interaction
authors: Ambsdorf, Jakob; Munir, Alina; Wei, Yiyao; Degkwitz, Klaas; Harms, Harm Matthias; Stannek, Susanne; Ahrens, Kyra; Becker, Dennis; Strahl, Erik; Weber, Tom; Wermter, Stefan
date: 2022-04-09
sha: ae1237288b8a959f8a8b2c6a8ffab1726c17e79b
doc_id: 673223
cord_uid: ybxgawnj

Recent developments in explainable artificial intelligence promise the potential to transform human-robot interaction: explanations of robot decisions could affect user perceptions, justify the robots' reliability, and increase trust. However, the effects on human perceptions of robots that explain their decisions have not been studied thoroughly. To analyze the effect of explainable robots, we conduct a study in which two simulated robots play a competitive board game. While one robot explains its moves, the other robot only announces them. Providing explanations for its actions was not sufficient to change the perceived competence, intelligence, likeability, or safety ratings of the robot. However, the results show that the robot that explains its moves is perceived as more lively and human-like. This study demonstrates the need for and potential of explainable human-robot interaction and the wider assessment of its effects as a novel research direction.

Explainable Artificial Intelligence (XAI) [1] promises to provide humanly understandable explanations of the actions, recommendations, and underlying causes of Artificial Intelligence (AI) techniques. Explanations of an algorithm's decisions can increase transparency and build trust in users of artificial intelligence and robotic systems [2, 3]. In Human-Robot Interaction (HRI), robots that explain their actions can increase users' confidence and trust and reduce safety risks in human-robot cooperative tasks [4, 5], whereas miscommunication can create confusion and mistrust [6].

While the field of XAI has the potential to enhance human-robot interaction, there is still a lack of research on the effects and perception of robots that explain their actions to users. Specifically, due to the currently limited functionality and the challenge of generating humanly understandable explanations from deep neural networks, research on the interaction of humans and explainable robots is in its early stages of development [7]. Despite the increasing attention on the application of XAI and its benefits, only a small number of user studies have been conducted [7, 8]. A considerable part of research in the field focuses on the design of theoretical frameworks for human-robot interaction studies that leverage XAI [9, 10].

To measure the effect of XAI in human-robot interaction, we conduct a study in which two simulated robots compete in a board game. While one of the robots explains the reasoning behind each of its game-play decisions, the other is constrained to only announcing each move without any explanation. For the game scenario, the game Ultimate Tic-tac-toe was selected. The game is similar in rules to regular Tic-tac-toe but is less well-known, more challenging to play, and its game states are difficult to analyze [11]. For measuring the effect of a robot explaining its actions, the Godspeed series [12], perceived competence, and participants' predictions about which robot will win are assessed. While the Godspeed questionnaire measures the general robot perception, perceived competence measures an underlying dimension of forming trust.
The technical components of the study are implemented in robot simulation software and can be transferred to real robots (see Figure 1), enabling the study to be conducted as a lab experiment in the future. Due to the COVID-19 pandemic, the study was conducted entirely online.

Research on explainable artificial intelligence reaches back to rule-based expert systems in the 1970s [13, 14]. With the introduction of increasingly performant models, the importance of providing explanations and justifications for their actions was stressed early on [15, 16]. The recent success of deep learning [17] and the ubiquitous use of such artificial intelligence systems, which are largely considered to be black boxes, has considerably intensified the discussion and the need for explainable AI. Thereupon, explainable artificial systems have been declared an economic priority, prompting governmental regulations and technical requirements [18, 19]. For example, citizens of the European Union who are affected by automated decision-making systems have the right to an explanation of that decision [20]. Such regulations enforce that AI systems are transparent, explainable, and accountable for safety-critical applications [21, 22, 23]. Consequently, there has been a surge in research on explainable systems under various names such as comprehensible, understandable, and explainable AI [24].

Various attempts at taxonomizing the methods for explaining neural networks and other machine learning algorithms have been proposed recently. The properties of these methods, such as their dependence on the employed model, their degree of faithfulness to the prediction process, and their means of communication, vary considerably. We direct the reader to the corresponding surveys and taxonomies for a comprehensive overview [1, 24, 25, 26]. The majority of the proposed methods generate explanations that are understandable only for experts rather than potential users [27]. On the other hand, in the field of autonomous robotics [28, 29] and in human-robot interaction [7, 30], explaining actions to a user rather than an expert is particularly important. This is especially pronounced in human-robot cooperative tasks [31]. In this context, explanations provide new information to the user, thereby assisting in understanding the reasoning behind actions, as well as providing an intuition about the robot's capability [10, 32].

Trust has been identified as the foundation of human-robot interaction for collaborative and cooperative tasks [33, 34]. Explanations can be considered a human-like attribute, and research has shown that humans expect explanations in conjunction with a robot's actions [35]. An accurate understanding of the robot's intent, which is derived from explanations of the robot's decision-making, can consequently increase trust [5]. Additionally, research suggests that a robot's display of human-like attributes can foster trust as well [36]. However, trust can be lost quickly in the event of inconsistent actions [37] or robot failure [38]. One dimension of human-robot trust is the robot's competence to achieve the user's desired goal [39]. The perception of a robot as competent is a requirement for the development of trust [40] and changes the social human-robot relationship [41]. The robot's competence has been suggested as one of the main factors determining the preference of one robot over another in HRI scenarios [42].
Concurrently, users tend to over- or underestimate a robot's competence based on their perception and the observed robot behavior [43]. Despite the significance of perceived competence for successful human-robot interaction and its link to trust, there has been, to our knowledge, little work directly assessing how explanations can impact perceived competence.

To assess the perception of robots, remote [44, 45, 46] and video-based studies [47, 48, 49] have been successfully conducted. Prior work found that video-based HRI studies generally achieve results comparable to their live counterparts [50]. However, it has been shown that certain aspects, such as empathy for a robot and willingness to cooperate, can be greater in physical interaction with a robot as opposed to interacting with the robot remotely [51, 52]. Due to the COVID-19 pandemic, researchers further explored the possibilities of remote HRI studies and noticed that this type of experiment can increase the participants' effort and their frustration [53]. In contrast to live HRI studies, video-based studies have the advantage of reaching more participants and of consistent experiment conditions among participants [54].

Data for the study were collected from February 19th, 2021, to March 3rd, 2021. For the study evaluation, only the data of participants who left a remark about the study's objective and required at least 20 minutes to complete the study are considered. This left a total of 92 participants for evaluation. The majority of participants were female, students, and had no prior experience with the game Ultimate Tic-tac-toe. Further demographic information is shown in Table I.

The selected study design aims to provide a plausible scenario in which two robots interact and one explains its actions. The setting is intended to be natural and simple to understand, while being difficult enough to analyze that the objective of the experiment remains obscured. To this end, we chose a game scenario in which two robots play a board game and one explains its moves. The goal of the participant is to bet on the winner of each game after observing a game-play snippet. Prior work showed that the use of a game scenario can increase participant engagement [55].

According to the aforementioned requirements, the two-player board game Ultimate Tic-Tac-Toe (UTTT) was selected. UTTT is a relatively unknown game with rules that are quick to understand, but the game is challenging to play due to its large state space [56, 57]. The game board consists of nine nested Tic-tac-toe boards that compose a larger 3x3 board. Each smaller board is called a local board, and the player has to win three local boards to form a line on the global board, just like in regular Tic-tac-toe. Thus, it is non-trivial for the participants to analyze the robots' game-play.

To examine the influence of XAI, we use two humanoid robots with different behaviors. Both robots have the same capability to win, as they use the same reinforcement learning algorithm (see Section III-C.1) to play UTTT. Each turn, the respective robot provides a verbal cue in addition to its game-play. Both robots can play UTTT against each other autonomously, and simulation software was used to record videos of complete games. Figure 2 shows both robots in the simulation environment, sitting at a table across from each other with the UTTT board game in front of them.
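To make the nested board structure concrete, the following minimal Python sketch shows one possible representation of a UTTT position and a check for the global winner. This is an illustration under our own assumptions, not the implementation used in the study; the encoding (1 and -1 for the two players, 0 for empty) is chosen purely for readability.

```python
# Minimal sketch of an Ultimate Tic-Tac-Toe position: nine local 3x3 boards
# arranged on a 3x3 global board. Winning three local boards in a line on the
# global board wins the game, as described above. Not the study's code.
import numpy as np

LINES = [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)],
         [(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)], [(0, 2), (1, 2), (2, 2)],
         [(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)]]  # rows, columns, diagonals

def local_winner(board3x3):
    """Return 1 or -1 if a player has three in a row on a 3x3 board, else 0."""
    for line in LINES:
        values = [board3x3[r, c] for r, c in line]
        if abs(sum(values)) == 3:
            return values[0]
    return 0

def global_winner(board):
    """board has shape (3, 3, 3, 3): global row/column, then local row/column."""
    # Reduce each local board to its winner, then check the resulting global 3x3 board.
    global_board = np.array([[local_winner(board[gr, gc]) for gc in range(3)]
                             for gr in range(3)])
    return local_winner(global_board)

# Usage: an empty game has no winner yet.
empty = np.zeros((3, 3, 3, 3), dtype=int)
assert global_winner(empty) == 0
```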
In the study, participants are shown short sequences of pre-recorded games without knowledge about the robots' capabilities or the outcome of the game. Each short game-play video shows a sequence of three moves by each robot, in which both robots comment on their moves via speech output. While one robot merely announces its moves, the robot using XAI provides an elaboration of the selected move. To actively involve the participants, they have to bet on the robot they believe will win the game. During the pilot study, participants reported difficulties distinguishing the robots in the questionnaire assessment. As a result, the robots were colored differently and were given clearly distinguishable voices.

To enable two humanoid robots to play against each other and to conduct the study, a variety of components are required. At the center of the experiment are two Neuro-Inspired Companion (NICO) robots [58], which are child-sized humanoid robots. The modules that allow these two robots to play UTTT in the simulation software are implemented in the Robot Operating System (ROS) [59]. For simulation and game-play recording, the robot simulation software CoppeliaSim [60] is used. The voices for both robots are created using Google Cloud Text-To-Speech. The experiment was conducted using LimeSurvey [61], which was used to provide the participants with the game-play videos and to assess the questionnaires and the bets on the robots.

1) Reinforcement Learning: To present the study participants with non-trivial and unscripted game-play, a reinforcement learning agent was trained to play the game of UTTT. The reinforcement learning agent's neural network architecture is based on the model presented by Mnih et al. [62], where it is originally used to play Atari games. In contrast to Atari games, where each pixel has to be processed, in UTTT only the game state has to be considered. The current board state, estimated by a computer vision model, serves as the agent's input, and the agent returns a probability distribution over the board for the best move. The agent was trained for 80,000 games against different types of simple strategy agents and itself. Initially, the agent was bootstrapped with 20,000 games against a strategy that selects a random move unless it can win a local board. Afterward, the agent was trained with 40,000 games against a strategy that makes a random move unless it can win a local board or block the opponent from winning a local board. Finally, the agent was trained for 20,000 games against a random mix of the previous strategy (30%) and self-play (70%).

2) XAI-Algorithm: For generating explanations of the reinforcement learning agent's moves, a post-hoc approach is utilized. The XAI-algorithm processes the board and the agent's next move using a set of predefined rules to justify the move. The derived explanations are human-understandable justifications of the robot's actions. Overall, 12 rules were implemented in the XAI-algorithm, which are applied to generate the explanation. As an example of a generated explanation, consider the following scenario: the agent should not play in a field that would give its opponent the chance to win with the next move, and the agent's immediate goal is to force the opponent to play on a different local board. In this case, the explanation will be: "I did not play in row x and column y because my opponent could win the game if I send him to the corresponding local board."
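The 12 rules themselves are not published in full, so the following hedged Python sketch only illustrates the general pattern of such a post-hoc, rule-based explainer: inspect the board and the chosen move, return the canned justification of the first rule that fires, and fall back to a plain announcement otherwise. The helper predicate and the toy board dictionary are hypothetical; the one property the rule relies on is that, in UTTT, a move in cell (r, c) sends the opponent to the local board (r mod 3, c mod 3).

```python
# Hedged sketch of a rule-based, post-hoc explanation step (not the authors'
# implementation). A rule maps the board state and the chosen/avoided moves to
# a canned natural-language justification.

def opponent_could_win_global(board, target_local):
    """Hypothetical predicate: True if sending the opponent to the local board
    `target_local` would let them win the game."""
    # Placeholder for illustration only; a real check would analyze the board.
    return target_local in board.get("dangerous_local_boards", set())

def explain_move(board, chosen_move, avoided_moves):
    """Return a spoken justification for the chosen move, or a plain
    announcement when no explanation rule fires."""
    for a_row, a_col in avoided_moves:
        # Rule (mirrors the example above): avoid cells that would send the
        # opponent to a local board from which they could win the game.
        if opponent_could_win_global(board, (a_row % 3, a_col % 3)):
            return (f"I did not play in row {a_row + 1} and column {a_col + 1} "
                    "because my opponent could win the game if I send him to "
                    "the corresponding local board.")
    row, col = chosen_move
    # No rule fired: announce the move, like the non-XAI robot does.
    return f"I play in row {row + 1} and column {col + 1}."

# Toy usage: local board (0, 0) is assumed dangerous, so the avoided move
# (0, 0) triggers the rule while the agent actually plays the center cell.
toy_board = {"dangerous_local_boards": {(0, 0)}}
print(explain_move(toy_board, chosen_move=(4, 4), avoided_moves=[(0, 0)]))
```

The blocking rule described in the next paragraph would be implemented as another predicate of the same form.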
A second situation arises when the agent blocks the opponent on a local board where the opponent has already placed two marks in a row. Given this event, the generated explanation is: "I have blocked my opponent here, so he won't be able to win this local board."

For the evaluation of the empirical study, the Godspeed questionnaire, perceived competence ratings, and bets on the winning robot are collected. The Godspeed questionnaire measures the participant's overall perception of the robot along the dimensions of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety [12]. The Perceived Competence Scale (PCS) [63, 64] is applied to measure the robot's competence as perceived by the participant. It is a 4-item questionnaire that was initially used in Self-Determination Theory [65]. The questions are adjusted to the subject of the study by converting their perspective from "me" to "robot". The first dimension considers the participant's belief that the robot will "make the right move", whereas the second dimension assesses their belief that the robot can "win a game". As a direct measure of the robots' perceived competence, the participants are requested to place a bet on the robot they believe will win the game. These bets are assessed after each game-play video.

The study is constructed as a between- and within-subjects design to investigate how the robot's explanation of its game-play impacts the participant's perception of the robot. Specifically, two test groups and one control group are utilized. In the test group conditions, either the red or the blue robot explains its moves during the game. This is in contrast to the control group, where neither robot utilizes XAI and both only announce their moves. The experiment conditions can be summarized as follows:
• XAI red: The red robot (on the left) explains its moves, while the blue robot only announces its moves.
• XAI blue: The blue robot (on the right) explains its moves, while the red robot only announces its moves.
• Control: Both robots only announce their moves.

For recruitment, participants are provided with a link to the survey and are randomly assigned via LimeSurvey to one of the three conditions (XAI red, XAI blue, control). The study begins with a general introduction to the experimental procedure. Afterward, the participants receive detailed introduction videos for the rules of UTTT and the experiment setup. Then, they are provided with an example game-play snippet and an introduction to the betting procedure. After confirming that they have understood the study procedure, the experiment begins. The participants watch three game-play videos of their respective experiment condition in random order. Directly after each video, they place their bet on the robot they believe will win the game. After the experiment, the Godspeed questionnaire, the perceived competence questionnaire, and demographics are assessed. An overview of the experiment procedure is illustrated in Figure 3.

To measure the internal consistency of the Godspeed and perceived competence questionnaires, we calculate Cronbach's alpha [66] for each dimension. Each dimension of both questionnaires shows good internal consistency, except for the safety measure of the Godspeed questionnaire. The safety dimension consists of only three items, and its Cronbach's alpha indicates adequate consistency [67]. An overview of the estimated Cronbach's alpha values is shown in Table II.
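As a reference for the internal-consistency check described above, Cronbach's alpha for one questionnaire dimension can be computed as in the following minimal sketch; it uses NumPy and made-up ratings rather than the study data.

```python
# Cronbach's alpha for one questionnaire dimension:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of the sum score).
import numpy as np

def cronbach_alpha(scores):
    """scores: array of shape (n_participants, n_items) with Likert ratings."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustration only: 5 participants rating a 4-item dimension on a 1-7 scale.
ratings = [[5, 6, 5, 6], [3, 3, 4, 3], [6, 7, 6, 6], [4, 4, 5, 4], [2, 3, 2, 3]]
print(round(cronbach_alpha(ratings), 2))
```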
For the evaluation of the study data, the dimensions of the questionnaires and the robots are separated for each experiment condition (control: 31 participants, XAI red: 33 participants, XAI blue: 28 participants). This results in the evaluation of six different robots. An overview of the Godspeed dimensions with estimated means and standard errors is shown in Figure 4. The individual dimensions of the Godspeed questionnaire are evaluated independently utilizing the Kruskal-Wallis test [68]. In case of significant differences among the robots, we utilize the pairwise Wilcoxon rank sum test as a post-hoc analysis to locate the differences between individual robots [69].

The Kruskal-Wallis test suggests a difference in anthropomorphism among the robots (p = 0.038). However, the pairwise Wilcoxon test does not reveal any significant difference. To infer whether there is a difference in anthropomorphism between the XAI and non-XAI robots, the anthropomorphism measures of both robots in the XAI experiment conditions are compared (Table III). Likewise, for the dimension of likeability, the Kruskal-Wallis test shows a difference among the robots (p = 0.003); the results of the post-hoc pairwise Wilcoxon test are reported in Table IV.

The perceived competence questionnaire is evaluated identically to the previous procedure, and an illustration of the results of that questionnaire is shown in Figure 5. The Kruskal-Wallis test for the "make the right move" dimension does not suggest a difference among the robots (p = 0.225). Correspondingly, there appears to be no difference in the perceived ability of the robots to "win a game" (p = 0.205).

The participants' betting behavior is a direct measure of the robots' perceived ability to win the game. For the evaluation, the bets among the three games are aggregated for each experiment condition. For each experiment condition, a binomial test is used to estimate whether the red or the blue robot was favored during the experiment. The findings are summarized in Table V, and the betting behavior is illustrated in Figure 6. The results show that, in the control group, the participants significantly preferred to bet on the red robot despite both robots behaving identically.

The evaluation of the Godspeed questionnaire shows that the XAI robots, which explain their moves to the participant, are perceived as more human-like and lifelike than their non-explaining counterparts. Specifically, the behavior of explaining the reasoning behind a move in a board game was perceived as more human-like. Analogously, a significant increase in animacy ratings was observed. This is not surprising, since the XAI robot was communicating more than the non-XAI robot, which only announced its moves during the experiment. In contrast to the increase in these dimensions, the XAI robot was not perceived as more intelligent, likeable, or safe than its counterpart. The robot's explanations might not have been precise or convincing enough to affect the perception of the participants, in spite of the majority not having prior experience with the game UTTT. Furthermore, the provided game-play might have shown too few moves to convince the participants of the robot's capability. The similarity in safety might arise because both robots are identical except for one explaining its moves, or because the study was conducted online. Surprisingly, the perceived competence questionnaire did not reveal a difference in competence between the XAI and non-XAI robots.
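For reference, the statistical procedure described above (a Kruskal-Wallis test across the six robots, pairwise Wilcoxon rank-sum post-hoc tests, and a binomial test on the aggregated bets) corresponds to the following hedged SciPy sketch; the group scores and bet counts are made-up illustration data, not the study's results.

```python
# Hedged sketch of the analysis pipeline on synthetic data (not study data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustration only: one Godspeed dimension score per participant and robot,
# for six robot/condition groups.
groups = [rng.normal(loc, 0.5, size=30) for loc in (3.0, 3.2, 3.5, 3.1, 3.4, 3.0)]

# Kruskal-Wallis test across all groups.
h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.3f}")

if p_value < 0.05:
    # Pairwise Wilcoxon rank-sum (Mann-Whitney U) tests as post-hoc analysis.
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            p_pair = stats.mannwhitneyu(groups[i], groups[j]).pvalue
            print(f"groups {i} vs {j}: p = {p_pair:.3f}")

# Binomial test on betting behavior against a 50/50 null hypothesis,
# e.g. 70 of 93 aggregated bets placed on the red robot in one condition.
print(stats.binomtest(70, n=93, p=0.5).pvalue)
```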
Despite the XAI robot being perceived as more human-like and lively, the perception of the robot's competence was unaffected. This finding suggests that, instead of a relationship between perceived competence and the anthropomorphism or animacy of the robot, there might be a link to perceived intelligence, which was similarly unaffected by the XAI robot. A reason for not observing a difference in perceived competence could be that the effect of the manipulation, the robot explaining its moves, was too small. Alternatively, the participants might have analyzed only the board state and game-play to estimate which of the robots would win the game and place their bet. This could have led to the suspicion that both robots have equal capabilities, which could diminish the effect of the manipulation.

After each short game-play sequence, the participants had to decide which robot would win the game. This betting behavior is a direct measure of confidence in the robot's ability. However, the results show that the XAI robot was not preferentially selected as the winner. In the control group, the participants favored the red robot despite both robots exhibiting the same behavior. This might have had an effect on the study, since it appears that, overall, the participants favored the red robot over the blue robot. Preferring the red robot could be a result of participants perceiving the red robot as female and the blue robot as male. While both robots had a female voice according to Google's Text-To-Speech service, the blue robot's voice was slightly lower. In conjunction with the blue color of the robot, this might have affected the participants' perception of the robot. For example, it has been shown that gendering a robot's voice affects the perception of the robot in terms of stereotypes, preference, and trust [70, 71]. The preference for the red robot is also reflected in the assessment of likeability, where the red non-XAI robot was perceived as significantly more likeable than the blue non-XAI robot. Remarkably, providing explanations seems to cancel this effect, as the blue XAI robot was not perceived as significantly less likeable than any of the red robots.

The results demonstrate that explanations for actions influence the perception of the robot; however, there are several limitations that need to be addressed in future work. Since the game of UTTT does not allow symmetric game-play, both robots have to make different moves. Therefore, the selected game sequence might affect the participants' betting behavior and perceived competence ratings. A game that allows identical moves might be preferable for evaluating the influence of XAI in human-robot interaction. Further, the study was conducted online, which allows for little control over the engagement of the participants. Participants could have been distracted during the study, could have consciously evaluated the objective of the study, or might have been confused by the wording in the study without the possibility to ask for clarification. Finally, a higher number of participants for each experiment condition would permit a more accurate estimation of the difference in perception between the robots.

In this study, a human-robot interaction scenario was designed that explores the effect of robots that utilize XAI to explain their actions. Two robots played the board game UTTT, in which one robot explained the reasoning behind its moves, while the other robot simply announced the next move.
The study was conducted online by presenting three short sequences of game-play. After each sequence, the participants had to bet on the winning robot. The experiment concluded with the assessment of several traits that each participant assigned to the robots. In our findings, we could not show that a robot that explains the motivation behind its actions increases its perceived competence. On the one hand, humans might still have limited trust in a robot's ability to perform a specific task despite the provided explanations. On the other hand, establishing the concept of perceived competence in the participants' minds might require more time and more evidence of the robot's capability and skill than was provided in the short sequences of game-play. However, a robot that explained its moves was perceived as more human-like and lively than a robot that only announced its moves. This demonstrates that robots that provide reasoning about their actions influence human perception. Such methods could be utilized to increase trust in robots, especially since robots and artificial intelligence are often perceived as black boxes. The observed effect might be more pronounced in a human-robot cooperative task. All software components of this study are implemented in robot simulation software, which allows the NICO robot to play UTTT autonomously and react to a player's moves. In future work, a human could play against the robot, and the effects on the perception of an XAI and a non-XAI opponent could be evaluated. This study illustrates the potential and effects of XAI in human-robot interaction, and demonstrates that a robot explaining its behavior can be perceived as more lively and human-like.

References
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
Explaining collaborative filtering recommendations
Explanation and Justification in Machine Learning: A Survey
The Role of Frustration in Human-Robot Interaction - What Is Needed for a Successful Collaboration?
Trust calibration within a human-robot team: Comparing automatically generated explanations
Relationships between robots' self-disclosures and humans' anxiety toward robots. ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011
Explainable agents and robots: Results from a systematic literature review
Automated rationale generation: A technique for explainable AI and its effects on human perceptions
An abstract framework for agent-based explanations in AI
Trust considerations for explainable robots: A human factors perspective
Group actions on winning games of super tic-tac-toe
Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots
A model of inexact reasoning in medicine
A historical perspective of explainable Artificial Intelligence
XPLAIN: a system for creating and explaining expert consulting programs
Explanations in Knowledge Systems: The Role of Explicit Representation of Design Knowledge
Deep learning
Preparing for the future of Artificial Intelligence
Challenges towards production-ready explainable machine learning
Robustness and Explainability of Artificial Intelligence - From technical to policy solutions
Stakeholders in explainable AI
European Union regulations on algorithmic decision making and a "right to explanation"
Transparent, explainable, and accountable AI for robotics
Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)
Explainable artificial intelligence: A survey
Towards explainable artificial intelligence
Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences
Explainable autonomous robots: A survey and perspective
Explainable Artificial Intelligence (XAI): An Engineering Perspective
Explainable robotics in human-robot interactions
Artificial cognition for social human-robot interaction: An implementation
The structure and function of explanations
A meta-analysis of factors affecting trust in human-robot interaction
Modeling trust in human-robot interaction: A survey
The need for verbal robot explanations and how people would like a robot to explain itself
Effects of anthropomorphism and accountability on trust in human robot interaction
"I Don't Believe You": Investigating the Effects of Robot Trust Violation and Repair
Impact of robot failures and feedback on real-time trust
Multifaceted trust in tourism service robots
Can Robots Earn Our Trust the Same Way Humans Do? A Systematic Exploration of Competence, Warmth, and Anthropomorphism as Determinants of Trust Development in HRI
A taxonomy of social errors in human-robot interaction
Warmth and competence to predict human preference of robot behavior in physical human-robot interaction
Perceived robot capability
Investigating human perceptions of robot capabilities in remote human-robot team tasks based on first-person robot video feeds
Disentangling the effects of robot affect, embodiment, and autonomy on human team members in a mixed-initiative task
Anthropomorphic interactions with a robot and robot-like agent
Evaluating the robot personality and verbal behavior of domestic robots using video-based studies
Physiologically inspired blinking behavior for a humanoid robot
Video prototyping of dog-inspired non-verbal affective communication for an appearance constrained robot
Comparing human robot interaction scenarios using live and video based methods: Towards a novel methodological approach
The benefits of interactions with physically present robots over video-displayed agents
Poor Thing! Would You Feel Sorry for a Simulated Robot?: A comparison of empathy toward a physical and a simulated robot
Remote-HRI: a pilot study to evaluate a methodology for performing HRI research during the COVID-19 pandemic
Evaluation of methodologies and measures on the usability of social robots: A systematic review
Engagement in digital entertainment games: A systematic review
At most 43 moves, at least 29: Optimal strategies and bounds for ultimate tic-tac-toe
A game based implementation of minimax algorithm using AI agents
NICO - Neuro-Inspired Companion: A developmental humanoid robot platform for multimodal interaction
Robotic operating system
CoppeliaSim (formerly V-REP): a versatile and scalable robot simulation framework
LimeSurvey: An Open Source survey tool
Playing Atari with deep reinforcement learning
Supporting autonomy to motivate patients with diabetes for glucose control
Internalization of biopsychosocial values by medical students: a test of self-determination theory
Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being
Coefficient alpha and the internal structure of tests
The Use of Cronbach's Alpha When Developing and Reporting Research Instruments in Science Education
Use of ranks in one-criterion variance analysis
The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test
Robots and Gender
Trusting RoboCop: Gender-Based Effects on Trust of an Autonomous Robot

A. Adjusted Perceived Competence Scale

1) "Make the right move":
• I feel confident in the <blue|red> robot's ability to make the right move.
• The <blue|red> robot is capable of making the right move. (The <blue|red> robot has the possibility of doing this in the future.)
• The <blue|red> robot is able to make the right move. (The <blue|red> robot can currently do that.)
• I feel that the <blue|red> robot is able to meet the challenge of choosing the right move.

2) "Win a game":
• I feel confident in the <blue|red> robot's ability to win a game.
• The <blue|red> robot is capable of winning a game. (The <blue|red> robot has the possibility of doing this in the future.)
• The <blue|red> robot is able to win a game. (The <blue|red> robot can currently do that.)
• I feel that the <blue|red> robot is able to meet the challenge of winning the game.