Learning on the Job: Long-Term Behavioural Adaptation in Human-Robot Interactions
Francesco Del Duchetto; Marc Hanheide
2022-03-20

In this work, we propose a framework for allowing autonomous robots deployed for extended periods of time in public spaces to adapt their own behaviour online from user interactions. The robot behaviour planning is embedded in a Reinforcement Learning (RL) framework, where the objective is maximising the level of overall user engagement during the interactions. We use the Upper-Confidence-Bound Value-Iteration (UCBVI) algorithm, which gives a helpful way of managing the exploration-exploitation trade-off for real-time interactions. An engagement model trained end-to-end generates the reward function in real-time during policy execution. We test this approach in a public museum in Lincoln (UK), where the robot is deployed as a tour guide for the visitors. Results show that after a couple of months of exploration, the robot policy learned to maintain the engagement of users for longer, with an increase of 22.8% over the initial static policy in the number of items visited during the tour and a 30% increase in the probability of completing the tour. This work is a promising step toward behavioural adaptation in long-term scenarios for robotics applications in social settings.

The ability to maintain user engagement during an interaction is essential for a robot designed to be deployed in a social scenario. In order to do so, a robot should be able to assess the users' state and to learn from experience how the actions it takes affect that state and the ongoing interaction. Reinforcement Learning (RL) techniques are a special case of Machine Learning algorithms that deal with exactly those scenarios where the "goodness" of the actions an agent can take is not known in advance and exploration is required. The goal is to find the best sequence of actions to maximise a certain objective, which is manifested through rewards. However difficult this may seem for scenarios where the goal is well defined, it becomes even more challenging for social scenarios, where the objective is expressed by the users' internal state and can only be estimated from sensors or proxy variables (like the duration of the interaction). In this paper, we approach this problem by enabling our robot Lindsey to estimate the users' engagement during the interaction and by allowing it to learn, through RL, the actions that maximise such engagement. We build on our previous work [1], where we described and analysed the long-term deployment, still ongoing with the present work, of our robot in a public museum where it serves as a tour guide to the visitors. For detecting the users' engagement we use our regression model, proposed in [2], which provides a single scalar engagement value from standard video streams obtained from the point of view of the interacting robot. Figure 1 shows Lindsey, the robot, interacting with users in the museum and depicts the concept of our proposed learning framework. By attempting to learn in a long-term scenario using only the users' engagement as guidance, we provide a proof-of-concept for a framework to enable behavioural adaptation in social robots.
Given that the learning is guided by the users' state, manifested by their expressed engagement, this will allow the robot to adapt to the users' preferences, which cannot be known or programmed in advance. Finally, with our proposed framework we set out to test the hypotheses of whether the engagement value estimated during the interactions is a sufficient feedback signal for driving the in-situ learning of the robot's social behaviours, and whether it leads to more sustained interactions with the users.

Previous work has addressed the issues of deploying robots in public environments for long periods of time while taking into account the necessity of interacting with people. Past long-term deployments in public spaces [3], [4] have identified interaction with humans as a necessity for recovering from failures and for performing tasks that the robot was not able to complete on its own. Similarly, in a deployment with the SCITOS G5 robot, which travelled more than 160 km within the STRANDS project [5], the authors report the need for a way to manage failures and for a better understanding of human activities. Hanheide et al. [6] propose a spatio-temporal model to learn when, where and how users interacted with the robot info-terminal during a long-term deployment. They found they could improve the efficiency and usefulness of the system by proposing the right content at the right time and place. Building on these works, we designed a system that is able to interact autonomously in a public environment for years while learning from the interactions with humans. A survey on long-term interaction between users and robots [7] raises the issue that memory and adaptation remain nearly unexplored in the field. Similarly, Kunze et al. [8] explore the state of the art in Artificial Intelligence (AI) techniques for long-term autonomy, recognising that interactions in this context can be exploited to improve a robot's behaviour. With the learning algorithm described in this work, we lay the foundations for a framework that allows a robot to exploit human feedback during interactions in order to optimise its social abilities.

Previous works have featured robots deployed in museum environments. The robot Rhino [9] was deployed in a museum in Germany for 6 days, guiding hundreds of visitors. At the time, the main issues and the focus of the work were navigation and obstacle avoidance. The Minerva robot [10] traversed more than 44 km and interacted with more than 50k people. Moreover, it was able to display mood (i.e. happy or angry) and, more importantly for our work, used an RL approach to learn the best actions to engage visitors. In [11], four robots were deployed over five years. Focusing on interactivity and education, the authors learned that long and non-interactive presentations are guaranteed to drive the audience away. In the present work, building on our ongoing long-term study of a museum robot [1], we plan to address some of these issues by enabling the robot to directly optimise the users' engagement during the interaction through RL.

Previous work has shown how it is possible to incorporate user feedback, with a focus here on user engagement, to modify the robot's behaviour or to influence the users' own engagement. Sidner et al. [12] explored how the use of gazing and gestures positively affects the users' perception of the robot, increasing their engagement.
Similarly, Holroyd [13] defines policies with the goal of increasing user engagement and shows that a robot equipped with these policies is perceived to be more human-like, to behave more fluently, and that users reciprocate more of the robot's cues. Recent works aimed at learning these social behaviours typically use (Deep) RL techniques to exploit the real-world interaction experiences a robot can collect. Qureshi et al. [14], [15] proposed end-to-end models to teach a robot the most appropriate action for approaching humans and starting an interaction; the reward signal was triggered by successful/unsuccessful handshakes. Lathuilière et al. [16] use Deep RL to learn a gaze policy from an intrinsic reward function based on the audiovisual position of people with respect to the robot camera's field of view. Gao et al. [17] learn a robot policy for approaching groups of people by maximising a group formation score and minimising the displacement of the other participants in the group when the robot approaches. Also in a museum context, Meng et al. [18] propose an RL approach where the users' engagement during group interactions with an interactive sculpture is used as the reward to learn engaging interactive behaviours. In this work, we use a state-of-the-art RL approach [19] in order to improve our robot's social behaviour, in particular its choice of actions during the guided tours. We employ the human engagement level as an internal reward for the robot to maximise.

Lindsey, the robot used in this long-term study, is a Scitos G5 robot manufactured by MetraLabs GmbH. In order to sense the environment, the robot has a laser scanner with a 270° scan angle on its base and an Asus Xtion depth camera mounted on a pan-tilt unit above its head. The interactions with the visitors are mediated through a touch screen, two speakers, a microphone and a head with two eyes that can move with five degrees of freedom to provide human-like expressions. To ensure safe operation in public environments, the robot is equipped with an array of bumpers around its circular base with sensors to detect collisions, and two easily reachable emergency buttons that, when activated, cut the power to the motors. The software framework is based on ROS and uses the STRANDS project [5] core modules for topological navigation, people tracking, task scheduling and data collection.

The scenario of this project is The Collection museum in Lincoln, UK. The museum is freely accessible to all members of the public from 10 AM to 4 PM, 5 days a week, although it used to be open 7 days a week before the COVID lockdown. Lindsey, the robot, is deployed as a tour guide in the archaeological section of the museum, which displays local findings dating from the stone age to the early modern age. Working with the museum's staff, four different guided tours have been devised for the robot. Figure 2 shows the positions of the items visited by the robot during the tours; different colour markers correspond to items in different tours. Figure 3 depicts the items shown in each tour. The robot is free to roam around during idle times in search of people to interact with. When people enter the gallery and come close to the robot, they are greeted by it. Using the robot's touchscreen, they can choose to start one of the tours or to be guided to the location of a specific item and receive a description of it. For the purpose of this study, only the guided tour interactions are considered.
The museum visitors are not instructed on how to behave with the robot or what to expect during the interaction; therefore the interactions themselves are unstructured, with users being free to interrupt the tours or simply leave at any moment. Moreover, both the robot and the users move in the environment during the guided tours, which often limits the robot's perception of the users from the on-board cameras. In a previous analysis of the interactions with our tour guide robot [1], we report that Lindsey has been very popular among the museum's visitors, totalling more than 2300 guided tours in just 103 days of operation. However, observing the duration of these tours, one realises that most interactions are actually very short and that only about 18% of them are completed by the user. These observations evidence the need for adaptation and motivate us to study, with the present work, how the users' feedback can be taken into account in such a long-term scenario to improve the robot's ability to maintain engagement over time.

In this study we combine several methodologies into a unified framework that allows the robot to explore and learn online, without the need for separate data collection and learning phases. The robot behaviours are specified as conditional plans using the Petri Net Plans (PNP) formalism [20]. The formalism facilitates action abstraction and reusability in the specification of plans; moreover, it allows monitoring the execution and dealing with failure situations through Execution Rules (ERs) [21]. Each action used in this study is a conditional sequence of lower-level actions with ERs allowing robust execution. More information about how robustness during autonomous operations is assured can be found in our previous work [1]. The duration of each action can vary from seconds to minutes and is not known in advance, with some actions requiring navigating some distance around the museum or describing items with different amounts of words. Moreover, each action is interruptible and can potentially terminate the tour before the end, for example in case of failure or of users stopping the guided tour.

The scenario requires that certain actions are performed exclusively in a specific sequence. For example, it would be wrong to start describing an exhibit's item before going to its location, or to describe the theme of the tour only at the end of the interaction. Moreover, the gotoExhibit* actions cannot be executed more than once, because it would not make sense to bring people to the same item in the museum multiple times during a tour. To enforce these constraints at execution time, we created subsets of all possible actions C_s ⊂ A for all s ∈ S, from which the policy can choose the next action. The action space with the associated successive-action constraints is shown in Table I.
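To make the constraint mechanism concrete, the sketch below derives an allowed-action subset C_s from the current state. It is a minimal illustration under assumptions, not the deployed code: the gotoExhibit* and describeMoreExhibit action names come from the text, while the endTour action name and the exact rules are our own simplifications (the real constraints are those listed in Table I).

```python
from typing import List, Set

# Action names from the text; "endTour" is a hypothetical closing action.
GOTO_ACTIONS = [f"gotoExhibit{i}" for i in range(1, 7)]
DESCRIBE_MORE = "describeMoreExhibit"
END_TOUR = "endTour"


def allowed_actions(visited: List[int], prev_action: str) -> Set[str]:
    """Return C_s, the subset of actions the policy may pick in state s.

    visited[k] == 1 means the (k+1)-th tour item has already been shown,
    so the corresponding gotoExhibit action is no longer available.
    """
    allowed = {goto for k, goto in enumerate(GOTO_ACTIONS) if not visited[k]}
    if prev_action in GOTO_ACTIONS:
        # An extra description only makes sense right after reaching an exhibit.
        allowed.add(DESCRIBE_MORE)
    if not allowed:
        # Every item has been visited and described: only option left is to end.
        allowed = {END_TOUR}
    return allowed


# Example: items 1 and 3 already visited, robot has just arrived at item 3.
print(allowed_actions([1, 0, 1, 0, 0, 0], "gotoExhibit3"))
```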
The state vector used for learning is composed as s = (v_1, ..., v_6, n, a_p, e, t), where v_k ∈ {0, 1} for k = 1, ..., 6 indicates whether the k-th item in the tour has been visited, n ∈ {death, tools, religion, art} is the name of the tour, a_p ∈ A is the previous action, e ∈ {LOW, MEDIUM, HIGH} is the average engagement level during the execution of a_p, and t ∈ {none, ended, stopped, abandoned} indicates the terminal state.

The engagement of users who interact with Lindsey at the museum is detected using the engagement model presented in [2], which runs in real-time on the robot's GPU. The model is an end-to-end regression model that takes the camera feed as input and provides a holistic engagement measure in the interval [0, 1] as output for each image frame. It has been validated on HRI data sets as an accurate measure of engagement. Figure 4 depicts 3 frames of a scene during a tour with the continuous engagement signal provided by the model.

Fig. 4: Examples of continuous engagement predictions from the robot head camera using the model from [2]. The engagement prediction value is LOW in the left frame (the girl is looking elsewhere), MEDIUM in the centre (the man is taking a picture of the robot) and HIGH on the right (the man is looking at the robot screen). Written consent was obtained from the users for taking and reusing these pictures.

For the purpose of this work, we want to use the engagement of users as an evaluation of the robot behaviours to drive learning. Therefore, in our RL scenario the reward observed at each state is derived from the engagement collected during the execution of the last action. The engagement scalar values are provided by the model at a frequency of about 1 Hz; they are then averaged over the entire duration of the action and discretised into the variables LOW, for values in (0, 1/3], MEDIUM, for values in (1/3, 2/3], or HIGH, for values in (2/3, 1]. Finally, the reward function r(s) assigns a positive scalar value to the engagement level observed in state s. We provide the scalar value of engagement as reward in order to guide the learned policy toward actions that are expected to elicit higher future engagement in the users. Also, given that r(s) > 0 for all s ∈ S, there is an implicit effect of favouring longer tours.
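As a concrete illustration of the reward computation just described, the following sketch averages the roughly 1 Hz engagement predictions collected during one action and maps them to the discretised levels. The text only states that r(s) > 0 for every state, so the specific numeric reward assigned to each level below is an assumption for illustration, not the values used on the robot.

```python
from statistics import mean
from typing import List


def discretise_engagement(frame_values: List[float]) -> str:
    """Average the per-frame engagement over one action and bin it.

    Thresholds as in the text: (0, 1/3] -> LOW, (1/3, 2/3] -> MEDIUM,
    (2/3, 1] -> HIGH.
    """
    avg = mean(frame_values)
    if avg <= 1 / 3:
        return "LOW"
    if avg <= 2 / 3:
        return "MEDIUM"
    return "HIGH"


# Illustrative positive rewards; the text only requires r(s) > 0 for all s.
REWARD = {"LOW": 1 / 3, "MEDIUM": 2 / 3, "HIGH": 1.0}


def reward(engagement_level: str) -> float:
    return REWARD[engagement_level]


# Example: predictions collected while the robot described one exhibit.
frames = [0.42, 0.55, 0.61, 0.70, 0.66]
level = discretise_engagement(frames)   # -> "MEDIUM" (average is about 0.59)
print(level, reward(level))
```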
When training an agent to learn in an unknown environment, the exploration-exploitation dilemma quickly arises: the agent needs to decide whether to keep exploring new actions or to execute the one that has returned the highest rewards so far. This trade-off is particularly relevant in robotic applications like ours, where the amount of exploration one can perform is costly in terms of time and resources and limited by the dynamics of the real world. In this work, for online learning, we use a Reinforcement Learning algorithm based on the "optimism in the face of uncertainty" principle. This principle prescribes that when the model of the environment is uncertain, one should consider the best possible world; if the model is correct, you have no regrets (exploitation); otherwise, you have effectively learned something new about the world (exploration). This general principle gives an almost optimal solution for the stochastic multi-armed bandit problem [22] and for episodic [19] and ergodic [23] RL problems. In this work, we are dealing with an episodic RL problem where every guided tour given by the robot is an episode. For improving the robot policy we use the Upper Confidence Bound Value Iteration (UCBVI) algorithm [19], which allows balancing the exploration of novel state-action pairs with the exploitation of already explored ones. In our setting this is particularly important because a part of the state-action space was explored very well during the initial deployment with the static policy. UCBVI gives a natural way of incorporating previously accumulated data without biasing the policy to the point that it does not explore new actions.

Algorithm 1 outlines our implementation of the algorithm. As in standard model-based Value Iteration, the algorithm is composed of 3 phases. First, the value function is computed from the model and an initial policy is generated; then the policy is used for acting in the environment while collecting a new episode; finally, the model is improved based on the updated dataset of episodes. The loop repeats, in our case, for every new guided tour requested by the users. UCBVI favours exploration of novel state-action pairs thanks to a bonus function that increases the Q-value for pairs that have been scarcely visited. The bonus decreases toward zero as more data is obtained, and the Q-value tends to the real value. Algorithm 2 shows the bonus, inspired by the Chernoff-Hoeffding bonus proposed in [19]. The speed with which the bonus tends to zero is set by a free parameter.

In our implementation, it is important to notice that the Q-function as defined favours the exploration of poorly explored areas over areas that have never been observed. The effect is that the policy sticks to choosing the same, poorly explored, actions over consecutive episodes and starts exploring a new one only after a certain number of visitations, rather than continuously selecting different actions to explore. This effect, which we found quite useful for maintaining a more consistent behaviour over time in our online application, could be eliminated by optimistically initialising Q_h(s, a) = H - h for all (s, a) ∈ S × A and h = 1, ..., H.

In order to validate the hypothesis that optimising a social robot policy to maximise user engagement leads to more sustained human-robot interactions, we have performed a long-term study in the museum with our tour guide robot Lindsey. Since 2019, Lindsey has been delivering the guided tours depicted in Figure 3 to the visitors in a "static" way. With its static policy, during each tour the robot would guide the users to the exhibits always in the same order, and the choice of whether to give a more detailed description of each item was left to the user. Motivated by the fact that users displaying different perceived engagement have a different willingness to continue the interaction [24], as also shown by our data in Figure 8(a), where, for example, a perceived low engagement at the first stop of the tour means a 20% decrease in the chances of users continuing the interaction, we attempt to employ the users' engagement value as a factor to select different actions during the interactions. In our learning scenario, we allow the policy to: 1) freely choose the order of the items visited in the tours; and 2) decide whether to provide additional information about each specific item, rather than asking the user. The model of our scenario, namely the visitation frequency table N_h(s, a, s'), is initialised with the transitions from the thousands of episodes collected with the static policy. During the learning phase, our optimistic learning algorithm directs the exploration toward the other, less explored areas. The reward function is computed online from the engagement values detected by our engagement model, as described in Section IV-B. The robot deployment, interrupted in 2020 because of the COVID lockdown, was resumed in December 2021 and the learning phase began. During the following 8 weeks, the robot behaviour was driven by our online learning algorithm, which explored the different actions it could now take while exploiting the model to increase the users' willingness to continue.
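The per-episode loop that drove the learning phase can be sketched as follows. This is a simplified, tabular rendition of Algorithms 1 and 2 under stated assumptions: the horizon H, the bonus constant, and the default value for unseen pairs are illustrative choices, and the bonus is a reduced form of the Chernoff-Hoeffding bonus of [19] rather than the exact expression used on the robot.

```python
import math
from collections import defaultdict

H = 8          # assumed horizon: maximum number of actions in one tour
C_BONUS = 1.0  # free parameter controlling how quickly the bonus vanishes

# Empirical model accumulated over all recorded tours (started empty here;
# on the robot it is seeded with the episodes from the static policy).
N_sa = defaultdict(int)      # visit counts N(s, a)
N_sas = defaultdict(int)     # transition counts N(s, a, s')
R_sum = defaultdict(float)   # accumulated engagement rewards per (s, a)


def bonus(n: int) -> float:
    """Simplified Chernoff-Hoeffding-style exploration bonus (cf. Algorithm 2)."""
    return C_BONUS * H * math.sqrt(1.0 / max(n, 1))


def plan(states, allowed):
    """Phase 1: optimistic value iteration over the empirical model."""
    V = {h: defaultdict(float) for h in range(1, H + 2)}   # V_{H+1} = 0
    Q = {h: defaultdict(float) for h in range(1, H + 2)}
    for h in range(H, 0, -1):
        for s in states:
            for a in allowed(s):
                n = N_sa[(s, a)]
                if n == 0:
                    # Never observed: left at the default, so poorly explored
                    # pairs are preferred over unseen ones (see the discussion
                    # of optimistic initialisation above).
                    Q[h][(s, a)] = 0.0
                    continue
                r_hat = R_sum[(s, a)] / n
                ev = sum(N_sas[(s, a, s2)] / n * V[h + 1][s2] for s2 in states)
                Q[h][(s, a)] = min(r_hat + ev + bonus(n), H - h + 1)
            V[h][s] = max((Q[h][(s, a)] for a in allowed(s)), default=0.0)
    return Q


def run_episode(Q, start_state, step, allowed):
    """Phases 2-3: act greedily w.r.t. the optimistic Q and update the model."""
    s = start_state
    for h in range(1, H + 1):
        actions = list(allowed(s))
        if not actions:
            break
        a = max(actions, key=lambda act: Q[h][(s, act)])
        s2, r, done = step(s, a)   # execute the action, observe engagement reward
        N_sa[(s, a)] += 1
        N_sas[(s, a, s2)] += 1
        R_sum[(s, a)] += r
        s = s2
        if done:
            break
```

In this rendition, plan would be re-run before every tour and run_episode corresponds to delivering that tour, with the counts seeded from the static-policy episodes rather than starting empty.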
In order to assess more rigorously the effect of our learned policy on the tour success and on the users' engagement, we perform a 2-week verification phase, during which the robot behaviour is again driven by the original static policy. This validation was necessary to observe whether the effects on our results during the learning phase were actually caused by our learned policy or by other spurious factors, like people being willing to spend more time at the museum after the COVID lockdowns or the wearing of face masks.

In this section we report the results of our evaluation to assess the performance of the learned robot behaviour. In Figure 5 we report the evolution of different metrics over the entire deployment (omitting the periods during which the robot was not operational). The values shown are collected and averaged per week, with the number of tours for each data point reported in Figure 5(a) for significance. During the whole deployment period the museum scenario did not change in any major way that could affect the robot tour guides. Similarly, the robot structure and touch screen interface had only minor interventions, like the replacement of an RGB-D sensor. Data reported for the weeks from (2019, 21) to (2020, 10) and after (2022, 5) are results of the "static" tours, while the data from (2021, 50) to (2022, 5) result from the learned policy. In Figure 5(b) we report the average number of stops visited (exhibits described) in the tours. During the learning phase we observe an increase of 22.8% over the previous period, with the verification phase bringing the values back in line with the previous data for the static tour. Similar results are obtained for the tour success rate, i.e. the rate of tours that terminate after visiting all the stops, with an overall increase of over 30% with respect to the static tour, as reported in Figure 5(c). Notice that the nominal number of items is 6 for the Death tour and 5 for the rest; therefore a lower number of items means that the users have stopped or abandoned the interaction before the end. Together, these results suggest that our learned policy can indeed lead users to keep a more sustained engagement during the interactions in the museum, confirming the hypotheses that motivate this work. In addition, we study the average users' engagement detected by our model over the whole duration of the tours and the average change in engagement between consecutive actions executed. The plots in Figure 5(d)-(e) suggest that during the learning phase there was a general decrease in detected engagement with respect to the previous and the verification periods. Also, we can observe that the learned policy had the effect of keeping the users' engagement stable over time almost 50% of the time, unlike the previous period, where there was an equal chance of the engagement increasing, decreasing or remaining stable.

In this section we analyse how the guided tour has changed after the learning phase with respect to the original static tour. One of the effects of allowing the learning algorithm to explore different actions was that the robot could now rearrange the sequence of items in the tour, guiding people to the different places in a different order. In Figure 6 we show, as an example of this change, how the final learned policy would conduct the art tour at different levels of engagement detected during execution. The figure evidences how different users' engagement levels generate different tours.
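An analysis like the one in Figure 6 can be reproduced by rolling out the greedy policy while holding the observed engagement level fixed. The sketch below assumes a Q-table indexed as in the earlier UCBVI sketch and a hypothetical state layout (visited flags, tour name, previous action, engagement level); it only shows how the per-level tour orderings can be extracted, not the exact analysis code.

```python
def greedy_tour_order(Q, tour_name, engagement, allowed, horizon=8):
    """Roll out the greedy policy at a fixed engagement level and return the
    sequence of gotoExhibit* actions, i.e. the order in which items are shown."""
    visited = (0, 0, 0, 0, 0, 0)
    prev_action, order = "startTour", []      # "startTour" is a hypothetical name
    for h in range(1, horizon + 1):
        s = (visited, tour_name, prev_action, engagement)
        actions = list(allowed(visited, prev_action))
        if not actions:
            break
        a = max(actions, key=lambda act: Q[h][(s, act)])
        if a.startswith("gotoExhibit"):
            k = int(a[len("gotoExhibit"):]) - 1
            visited = visited[:k] + (1,) + visited[k + 1:]
            order.append(a)
        prev_action = a
    return order


# Example (reusing allowed_actions from the earlier sketch): compare orderings
# produced for the art tour at different fixed engagement levels.
# low_order  = greedy_tour_order(Q, "art", "LOW", allowed_actions)
# high_order = greedy_tour_order(Q, "art", "HIGH", allowed_actions)
```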
Additionally, the different engagement levels had the effect of changing the amount of information given to the users at each exhibit. In Figure 7 we report how many times an additional description is given to the users, by performing the describeMoreExhibit action, for each stop number of the tour at different engagement levels. We observe that the robot learned to give it more often at the beginning of the tour, while for later stops it does so only when the engagement is MEDIUM or HIGH. Finally, to verify whether a different engagement level correlates with a different willingness to stay in the interaction, we study how many times a tour is stopped or abandoned for varying engagement values at each tour stop. The results in Figure 8 show that users are more likely to disengage at the earlier stops, after the initial tour description, and when they already show a LOW engagement level, consistent with our intuition that poorly engaged users are less willing to continue an interaction in the first place. We observe an overall decrease in disengagement when using the learned policy compared to the static policy, but no other significant difference is present between the two conditions.

The UCBVI algorithm used in this work gives a simple yet efficient way of exploring new unseen states while exploiting the best performing actions, as described in Section IV-C. In this section we study how the algorithm has effectively explored the state-action space in our scenario. Figure 9 shows the amount of exploration that the robot has performed, reported per week. As per our expectations, when using the static tour little to no exploration is performed, given that the robot always chooses the same actions at the same points in the tour. Note that in this condition a small amount of exploration is still ongoing, but it is entirely governed by the users' behaviour, through their choices during the tour, i.e. asking for more information or not, and through the different levels of engagement they manifest. The adoption of our learning algorithm brought an initial week of high exploration of many new state-action transitions that were previously forbidden, and kept a steady exploration rate in the subsequent weeks. The exploration is smaller after the initial period, given that the policy starts to exploit the new actions that have proved most promising in terms of future users' engagement, as evidenced also by the analysis of the learning outcomes in Sections V-B and V-A. The exploration achieved as of the beginning of 2022 covers only about 13% of the entire state-action space. We expect that as the learning phase is resumed after the verification period, with more exploration the robot will achieve higher performance in terms of retaining users' interest in the tours over time. However, we hypothesise that the longer we continue to act and explore, the more difficult it will become to find new states to explore. As mentioned in the previous paragraph, some parts of the state-action space are not directly explorable by the robot; they can only be observed indirectly when the users' behaviour allows it. To validate this hypothesis, we have performed a simulation of the next 5 weeks into the future, where we keep using and improving our learned policy and we simulate the next observed states by sampling uniformly from the set of all the possible states that could be observed at each point in the tour.
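The coverage figure and the forward simulation can be sketched as follows. This is an illustration under assumptions: it takes the visit counts, the set of valid state-action pairs, a per-step list of states that could possibly be observed, and the current policy, and it ignores early stops and abandonments exactly as described above, which is why it over-estimates real exploration.

```python
import random


def coverage(N_sa, valid_state_action_pairs):
    """Fraction of the valid state-action space visited at least once
    (about 13% at the beginning of 2022 in our deployment)."""
    visited = sum(1 for sa in valid_state_action_pairs if N_sa.get(sa, 0) > 0)
    return visited / len(valid_state_action_pairs)


def simulate_weeks(N_sa, policy, possible_states, valid_pairs,
                   tours_per_week, n_weeks, horizon):
    """Upper-bound simulation of future exploration: the policy picks the action,
    while the next observed state is sampled uniformly from all states that
    could occur at that point in the tour (no early terminations)."""
    weekly_coverage = []
    for _ in range(n_weeks):
        for _ in range(tours_per_week):
            for h in range(1, horizon + 1):
                s = random.choice(possible_states[h])   # states possible at step h
                a = policy(h, s)                         # action chosen by the policy
                N_sa[(s, a)] = N_sa.get((s, a), 0) + 1
        weekly_coverage.append(coverage(N_sa, valid_pairs))
    return weekly_coverage
```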
The simulated exploration, shown in Figure 9, is an upper bound on the empirical exploration we can obtain in real life, because it does not simulate the possibility of stopping or abandoning the interaction, which would ultimately prevent exploring further into the tour sequence. However, the data shows that as time passes, the number of new states that can be explored by the algorithm decreases.

In this work, we proposed a Reinforcement Learning approach to in-situ behavioural adaptation for social robots. The framework exploits a long-term museum deployment of a tour guide robot to learn, from experience, which actions the robot should perform during the guided tours to sustain the users' interaction with the robot for longer. The users' engagement is detected from the robot's own camera in real-time and is used both as a part of the state, to plan actions that are more appropriate for the displayed engagement, and as a continuous reward for the robot's actions, to encourage it to execute actions that lead to higher future engagement. With our experimental validation, we observe that the adaptation framework leads the robot to perform longer tours and the users to stop, or abandon, the interaction less frequently than before with a static policy. Even though only a fraction of the entire state-action space has been explored, complete exploration cannot be practically achieved in our real-world scenario. The UCBVI algorithm implemented is capable of exploiting the most promising actions explored so far, hence maintaining good overall performance, while exploring new states and actions, as confirmed by our empirical results. Our study presents results that span a period of almost 3 years, including the COVID pandemic, which could very well encompass many factors influencing the users' behaviour over time. In fact, when reporting the users' detected engagement, we observe an overall decrease since the museum reopening after lockdown, which could be caused by mask wearing affecting the ability of our model to detect engagement. However, the improved results of our learned policy over the static one, further confirmed by the two-week verification period, show that the overall learning framework maximising detected engagement is able to produce more sustained human-robot interactions. A natural extension of this work is to study how our learning framework would allow the adaptation of the robot behaviour over time as people's preferences change. The UCBVI bonus of the present algorithm favours exploration of poorly explored areas, but it would not recognise whether the estimates for well explored areas are still relevant or whether they need more experience because of a shift in users' behaviour. A general idea to integrate this would be to have the bonus factor in the divergence between the values of newly explored state-action pairs and their estimation from past data.

References
[1] Lindsey the tour guide robot - usage patterns in a museum long-term deployment.
[2] Are You Still With Me? Continuous Engagement Assessment From a Robot's Point of View.
[3] Long term autonomy in office environments.
[4] The 1,000-km challenge: Insights and quantitative and qualitative results.
[5] The STRANDS project: Long-term autonomy in everyday environments.
[6] The when, where, and how: An adaptive robotic info-terminal for care home residents.
[7] Social robots for long-term interaction: a survey.
[8] Artificial intelligence for long-term robot autonomy: A survey.
[9] The interactive museum tour-guide robot.
[10] Minerva: A second-generation museum tour-guide robot.
[11] The mobot museum robot installations: A five year experiment.
[12] Explorations in engagement for humans and robots.
[13] Generating engagement behaviors in human-robot interaction.
[14] Robot gains social intelligence through multimodal deep reinforcement learning.
[15] Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network.
[16] Deep reinforcement learning for audio-visual gaze control.
[17] Learning socially appropriate robot approaching behavior toward groups using deep reinforcement learning.
[18] Learning to Engage with Interactive Systems.
[19] Minimax regret bounds for reinforcement learning.
[20] Petri net plans: a formal model for representation and execution of multi-robot plans.
[21] A practical framework for robust decision-theoretic planning and execution for service robots.
[22] Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
[23] Near-optimal regret bounds for reinforcement learning.
[24] A model of attention and interest using gaze behavior.