Motivating Physical Activity via Competitive Human-Robot Interaction
Boling Yang, Golnaz Habibi, Patrick E. Lancaster, Byron Boots, Joshua R. Smith
2022-02-14

Abstract: This project aims to motivate research in competitive human-robot interaction by creating a robot competitor that can challenge human users in scenarios such as physical exercise and games. With this goal in mind, we introduce the Fencing Game, a human-robot competition used to evaluate both the capabilities of the robot competitor and the user experience. We develop the robot competitor through iterative multi-agent reinforcement learning and show that it can perform well against human competitors. Our user study additionally found that our system was able to continuously create challenging and enjoyable interactions that significantly increased human subjects' heart rates. The majority of human subjects considered the system to be entertaining and desirable for improving the quality of their exercise.

Our first user study further showed that subjects who continuously explored different strategies earned better rewards in the long run. A second user study demonstrated that an RL-trained policy made it significantly more challenging for subjects to make quantitative improvements over a long sequence of games compared to a carefully designed heuristic baseline policy. The RL-trained agent also appeared more intelligent to the human subjects.

Figure 1: Competitive fencing games between a PR2 robot and human subjects. The detailed game rules are described in Sec. 2. Please refer to this link for example gameplay videos.

In this section, we discuss the significance of competitive interaction and justify why we believe it deserves increased attention in the robotics community. We first discuss how competition can influence people positively from a psychological perspective. Afterward, we discuss the technical challenges in competitive-HRI tasks and propose the Fencing Game as the main task that this project focuses on.

Positive Influences of Competitive Interaction. Competition between players provides motivation and can foster improvement in performance at a given task. Plass et al. [15] compared competitive and cooperative interactions in an educational mathematics video game. The results revealed that, compared to working individually, subjects performed significantly better when working competitively. In particular, competitive players demonstrated higher effectiveness in problem solving than non-competitive players. This study also observed that subjects experienced a higher level of enjoyment via an increased tendency to engage in the game [16]. Furthermore, they also displayed higher situational interest, i.e., subjects paid more attention and interacted more during the game [17]. Viru et al. [18] showed that competitive exercise can improve athletic performance in a treadmill running test, increasing subjects' average running duration by 4.2%. During cycling and planking exercises, Feltz et al. [4] were likewise able to elicit higher performance from subjects by placing them in competition with a manipulated virtual partner. Inspired by these studies, we envision that a personal robot can become a competitive partner that provides enjoyment, increases motivation, and fosters improvement in activities such as physical exercise.
For this reason, we initiate our competitive-HRI research by focusing on creating a physical exercise companion.

Technical Challenges. Creating an actual robot that can compete with a human physically is challenging. The robot needs to constantly reason about the human's intent via their actions, and strategically control its high degree-of-freedom body to counteract the opponent's adversarial behavior and maximize its own return. Therefore, a big part of our competitive-HRI research focuses on solving these technical challenges to create a robotic system with real-time decision-making capability and body agility comparable to that of humans.

The Fencing Game. Based on the expected technical challenges, we designed a two-player, zero-sum, physically interactive game. The Fencing Game is an attack-and-defense game in which the human player is the antagonist, whose goal is to maximize their game score. The robot is the protagonist, who aims to minimize the antagonist's score. Fig. 1 shows three images of human subjects playing the game, and Algo. 2 in Appendix A.1 summarizes the scoring mechanism for this game. The orange spherical area located between the two players denotes the target area of the game. The antagonist on the right earns 1 point for every 0.01 seconds that their bat is placed within the target area without contacting the opponent's bat. The antagonist loses 10 points if the antagonist's bat is placed within the target area and simultaneously makes contact with the protagonist's bat. Moreover, the antagonist gains 10 points if the protagonist's bat is placed within the target area, waiting for the antagonist to attack, for more than 2 seconds. Each game lasts 20 seconds. The observation space for both agents includes the Cartesian pose and velocity of the two bats, as well as the game time in seconds. For the sake of simplicity, both agents in this project are non-mobile. Yet, mobility can be easily integrated into future iterations of this game.
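To make the scoring mechanism concrete, the following is a minimal sketch of a per-timestep scoring function as we read the rules above (Algo. 2 in Appendix A.1 gives the authors' version). The function and variable names, the per-step handling of contact penalties, and the assumption that the protagonist's camping timer resets after a penalty are ours, not the paper's.

```python
# Minimal sketch of the Fencing Game scoring logic described in Sec. 2.
# Names and the camping-timer reset are assumptions; see Algo. 2 for the authors' version.

DT = 0.01           # scoring timestep in seconds
GAME_LENGTH = 20.0  # each game lasts 20 seconds
CAMP_LIMIT = 2.0    # protagonist may not wait in the target area longer than this


def score_step(ant_in_target, prot_in_target, bats_in_contact, prot_camp_time):
    """Return (antagonist reward for this 0.01 s step, updated protagonist camping time)."""
    reward = 0
    if ant_in_target:
        # +1 per 0.01 s in the target area without contact, -10 on simultaneous contact.
        reward += -10 if bats_in_contact else 1
    if prot_in_target:
        prot_camp_time += DT
        if prot_camp_time > CAMP_LIMIT:
            reward += 10           # reward the antagonist when the protagonist camps
            prot_camp_time = 0.0   # assumption: the camping timer resets after the penalty
    else:
        prot_camp_time = 0.0
    return reward, prot_camp_time
```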
Reinforcement Learning in Competitive Games. Competitive games have been used as benchmarks to evaluate the ability of algorithms to train an agent to make rational and strategic decisions [19, 20, 21]. Multi-agent reinforcement learning methods allow agents to learn emergent and complex behavior by interacting with each other and co-evolving together [22, 23, 24, 25, 26, 27]. Many recent efforts have used multi-agent RL methods to learn continuous control policies for high-complexity tasks. Bansal et al. [28] created control policies for simulated humanoid and quadrupedal robots to play competitive games such as soccer and wrestling. Lowe et al. [29] extended DDPG [30] to multi-agent settings by employing a centralized action-value function.

Human-Robot Competition. There are a few studies focusing on human-robot interaction in competitive games. For instance, Kshirsagar et al. [31] studied the effect of "co-worker" robots on human performance when both are performing in the same environment and competing for a monetary prize. This study showed that humans were slightly discouraged when competing against a high-performing robot. On the other hand, humans exhibited a positive attitude towards the low-performing robot. Mutlu et al. [32] showed that male subjects were more engaged in competitive video games when they played with an ASIMO robot. However, the majority of the subjects preferred cooperative games when they played with this robot. Short et al. [33] analyzed the "rock-paper-scissors" game and found that human subjects were more socially and mentally engaged when the robot cheated during the game.

Robots in Physical Training. In the context of using robots to assist humans in physical exercise, a robot developed by Fasola and Mataric [34] was able to provide real-time coaching and encouragement for a seated arm exercise. Süssenbach et al. [35] developed a motivational robot for indoor cycling exercise that employed a set of communication techniques in response to the subject's physical condition. The results showed that the robot substantially increased the users' workout efficiency and intensity. Sato et al. [36] created a system capable of imitating the motion and strategy of top volleyball blockers to assist volleyball training. These works are typically limited to a few competitive scenarios that require simple and repetitive motions. In this work, we leverage reinforcement learning to create a robotic system that can potentially play various physically competitive games against human players.

Humans are highly efficient at recognizing patterns and learning skills from just a small number of examples [37]. Existing research also shows that human subjects can adapt to robots and improve their performance in just a few trials in physical HRI tasks [38, 39]. However, games against a robot competitor are less challenging if a human player can easily predict its behavior and quickly find an optimal counter-strategy. We hypothesized that human players can quickly learn to improve their performance against a given robot policy, but that a change in the robot's gameplay style could interrupt this learning effect and keep the games challenging. A gameplay style is characterized by patterns in the agent's end-effector motion trajectories. For example, one antagonist may prefer stabbing movements, while another may prefer slashing movements. Therefore, two primary objectives govern the generation of robot control policies. First, a policy should allow the robot to play the Fencing Game well enough that the games are intense and engaging to human users. Second, we aim to obtain three policies with unique gameplay styles for our user study. A multi-agent reinforcement learning method is used to create the robot control policies employed in the user study. We designed and implemented the physical system based on a PR2 robot. Further discussion of the physical system implementation and the technical details of the learning algorithm can be found in Appendix A.2.

To generate robot control policies that comply with the two aforementioned requirements, we formulate the Fencing Game as a multi-agent Markov game [40]. Both the antagonist agent and the protagonist agent are represented by a PR2 robot model in a MuJoCo simulation environment, and both agents are trained in a co-evolving manner by playing games against each other. We break down the multi-agent proximal policy optimization method proposed by Bansal et al. [28] into two separate training processes, which reduces the computation needed to obtain multiple pairs of agents with acceptable performance. As a result, we were able to complete all sampling and training on a desktop computer (CPU: 1 × i7, GPU: 1 × GTX 970). Our version of multi-agent PPO, the two-phase iterative co-evolution algorithm, is presented in Algo. 1 and sketched below.
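The following illustrative sketch summarizes the two-phase iterative scheme; it is not the authors' code. `rollout`, `update`, and `converged` stand in for standard PPO machinery, and the reward switch (game score plus a continuous exploration reward in phase one, game score only in phase two) and opponent sampling (latest opponent in phase one, a randomly chosen historical opponent in phase two) follow the descriptions given below.

```python
# Illustrative sketch of the two-phase iterative co-evolution scheme (Algo. 1).
# The rollout/update/convergence functions are placeholders supplied by the caller.
import random
from typing import Callable, List


def co_evolve(env,
              antagonist, protagonist,   # policy objects exposing .clone()
              rollout: Callable,         # (env, learner, opponent, reward_mode) -> trajectories
              update: Callable,          # PPO-style policy update from trajectories
              converged: Callable,       # stopping test (convergence or timeout)
              n_iters: int = 2,
              phase: int = 1):
    """Alternately train each agent against a frozen opponent policy."""
    ant_hist: List = [antagonist.clone()]
    prot_hist: List = [protagonist.clone()]
    reward_mode = "score+continuous" if phase == 1 else "score"
    for _ in range(n_iters):
        # Phase one always uses the opponent's latest policy; phase two samples from history.
        opp = prot_hist[-1] if phase == 1 else random.choice(prot_hist)
        while not converged(antagonist):
            update(antagonist, rollout(env, antagonist, opp, reward_mode))
        ant_hist.append(antagonist.clone())

        opp = ant_hist[-1] if phase == 1 else random.choice(ant_hist)
        while not converged(protagonist):
            update(protagonist, rollout(env, protagonist, opp, reward_mode))
        prot_hist.append(protagonist.clone())
    return antagonist, protagonist, ant_hist, prot_hist
```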
Learning to Move and Play. The first phase of training can be seen as a pre-training process, which aims to allow both agents to quickly learn the motor skills required for joint control and the rules of the game. At each timestep, the agents are rewarded by a weighted sum of a continuous reward and the game score. The continuous reward encourages the agents' exploration of the task space. The antagonist's policy μ with parameters θ^μ_i is first trained by collecting trajectories that result from playing against the protagonist's most recent policy. This process continues until timeout or until μ has converged. The protagonist's policy ν with parameters θ^ν_i is then trained against the antagonist's most recent policy. We acquired robot policies that exhibited competent (but still imperfect) gameplay by running this training sequence only twice (N_iter = 2). The pair of policies resulting from phase one will be called the warm-start policies.

Creating Characterized Policies. It has been shown that by using different random seeds to guide policy optimization toward different local optima, an agent can learn different behaviors and approaches to complete the same task [41, 26]. The highly variable nature of multi-agent systems enhances this random effect. Agents in multi-agent systems are more likely to learn drastically different strategies and emergent behaviors when they continuously learn by competing with each other [28, 42, 43, 44]. The second phase of training generates a policy with random characteristics by exploiting this fact. Both agents are initialized with their warm-start policies from phase one and trained in the same iterative scheme shown in Algo. 1. Now the agents are rewarded solely by the game scores. In phase two, when training each agent, instead of having it play against the opponent's latest policy, one of the previous versions of the opponent's policy in the history is randomly selected. We created a policy library that contains six pairs of randomly characterized policies resulting from six different rounds of phase-two training. Fig. 2 shows how the game scores change during the two-phase training process.

Algorithm 1: Iterative Co-evolution. Input: environment E; stochastic policies μ and ν; instantaneous reward function r(·). Initialize: parameters θ^μ_0 for μ and θ^ν_0 for ν.

In order to identify the most distinctive protagonist policies in the library described in the last subsection, each agent's gameplay style needs to be quantified and compared. We first generate trajectories for all protagonist policies in a tournament [45], where each agent plays 100 games with each of the six opponents. Eight end-effector trajectory features are selected to quantify agents' gameplay styles: total displacement change on the x, y, and z axes, average velocity, average acceleration, average jerk, total kinetic energy, and trajectory smoothness. These features are calculated for each of the games played by each of the protagonists. The quantified style of a protagonist agent is the average of its features across 600 games. The features of all protagonists are then compared via their three most significant principal components (less than 2% information loss) [46], and the three most separable policies are selected for the user study. The protagonist gameplay styles for the warm-start policy and the three selected characterized policies are visualized in Fig. 3.
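The feature extraction and principal-component comparison could be implemented roughly as follows. The exact smoothness metric, the mass used for kinetic energy, and all names here are our assumptions rather than the authors' definitions.

```python
# Illustrative computation of the eight end-effector style features and the PCA comparison.
import numpy as np
from sklearn.decomposition import PCA


def style_features(pos, dt=0.01, mass=1.0):
    """pos: (T, 3) end-effector positions for one game."""
    vel = np.diff(pos, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    disp = np.abs(np.diff(pos, axis=0)).sum(axis=0)      # total |dx|, |dy|, |dz|
    speed = np.linalg.norm(vel, axis=1)
    return np.array([
        disp[0], disp[1], disp[2],
        speed.mean(),                                    # average velocity
        np.linalg.norm(acc, axis=1).mean(),              # average acceleration
        np.linalg.norm(jerk, axis=1).mean(),             # average jerk
        0.5 * mass * (speed ** 2).sum() * dt,            # total kinetic energy (unit mass assumed)
        -np.mean(np.linalg.norm(jerk, axis=1) ** 2),     # smoothness proxy (negative mean squared jerk)
    ])


def compare_styles(per_agent_trajectories):
    """per_agent_trajectories: one list of (T, 3) arrays per protagonist policy."""
    styles = np.array([np.mean([style_features(t) for t in trajs], axis=0)
                       for trajs in per_agent_trajectories])
    # The paper reports <2% information loss with three principal components.
    return PCA(n_components=3).fit_transform(styles)
```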
Due to the kinematic and dynamic mismatch between simulation and reality, policies trained solely on simulated data can perform poorly in the real world. We use a combination of a Jacobian-transpose end-effector controller and a system identification (systemID) process to address this problem. Instead of specifying torque values for each joint, the policy outputs an offset from the current end-effector pose, and the resulting desired pose is executed by the end-effector controller. We used the CMA-ES algorithm to optimize the following objective over the parameter space of both the controller and the simulated robot model:

min_{θ_m, θ_c} Σ_t ‖ s^r_t − s^s_t ‖,

where θ_m represents the simulated robot model parameters (damping, armature, and friction loss), and θ_c represents the proportional and derivative gains of the end-effector controller. T^r and T^s are trajectories sampled from the real robotic system and the simulation, respectively, that result from the same control sequence, and s^r_t ∈ T^r and s^s_t ∈ T^s are the robot's end-effector poses in reality and in simulation at time t. As a result, the difference in end-effector dynamics between the simulated robot and the real PR2 is reduced. The maximum controller output is bounded conservatively to prevent possible human injury.
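As an illustration, this optimization could be set up with the open-source `cma` package roughly as follows; `simulate_rollout`, the parameter layout, and the recorded data are placeholders, not the authors' implementation.

```python
# Sketch of the systemID objective above, optimized with CMA-ES.
import numpy as np
import cma


def make_objective(real_traj, control_seq, simulate_rollout, n_model_params):
    """real_traj: (T, d) end-effector poses recorded on the real PR2 for control_seq."""
    def objective(theta):
        theta_m = theta[:n_model_params]   # simulated model params (damping, armature, friction loss)
        theta_c = theta[n_model_params:]   # proportional/derivative gains of the end-effector controller
        sim_traj = simulate_rollout(theta_m, theta_c, control_seq)  # (T, d) simulated poses
        return float(np.sum(np.linalg.norm(real_traj - sim_traj, axis=1)))
    return objective


# Possible usage, given a user-supplied simulator and recorded trajectory data:
# es = cma.CMAEvolutionStrategy(x0=np.ones(n_params), sigma0=0.5)
# es.optimize(make_objective(real_traj, control_seq, simulate_rollout, n_model_params))
# theta_star = es.result.xbest
```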
We performed two in-lab user studies in this work. The first user study broadly explored the idea of competitive-HRI in the Fencing Game setting. Sixteen human subjects were asked to play five games with each of the four RL-trained policies resulting from Sec. 4. Subjects' game scores, heart rates, arm movements, and their responses to a modified technology acceptance model (TAM) [47] were used to evaluate our system from three perspectives: (1) Is a competitive robot accepted by human users in physical game and exercise scenarios? (2) Can our system effectively create challenging and intense gameplay experiences? (3) Can the robot interrupt the human learning effect by switching its gameplay style?

The second user study compared characterized policy 1 with a carefully designed heuristic-based policy. By placing its bat between the target area and the point on the human's bat that is closest to the target area, the robot exploits embedded knowledge of the game's rules in order to execute a strong baseline heuristic policy. Action noise was added to the heuristic policy to create randomness in the robot's behavior. Ten human subjects were asked to play 10 games with each robot policy. This experiment compared the two policies via game scores, subjects' TAM responses, and subjects' perception of difficulty, enjoyment, and robot intelligence. Details about the heuristic baseline policy design, experiment procedures, subjective question design, and the demographic information for both user studies are discussed in Appendix A.3.

Our first experiment demonstrates that participants readily accept the use of a competitive robot as an exercise partner. The majority of the subjects considered competitive games with our robot to be useful, entertaining, desirable, and motivating. Our system was able to provide a challenging and intense interactive experience that significantly increased subjects' heart rates. While competing against our RL-trained robot, most subjects struggled to significantly improve their performance over time. Yet, a subset of subjects who constantly explored different strategies achieved higher scores in the long run.

User Acceptance. Table 1 summarizes the subjects' responses to the technology acceptance model. The majority of the subjects (68.75%) agreed that a competitive robot could improve the quality of their physical exercise. Moreover, 87.5% of subjects agreed that competitive human-robot interactive games are entertaining, and 81.25% of subjects agreed that a competitive robot exercise companion would be desirable in the future. Interestingly, the intention to use (62.5% agree + strongly agree) and increased engagement (56.25% agree + strongly agree) metrics are not as high as the broad agreement on perceived enjoyment and desirability. Some subjects who did not have a strong intention to use our system explained their reasons in the open-ended question. They stated that it is not immediately clear how competitive robots can play a role in their routine exercises, such as jogging and weight training. On the other hand, most subjects (71%) who exercise less than three hours per week agreed that a competitive robot partner would increase their engagement with physical exercise. Therefore, our competitive robot is more effective in motivating people who do relatively little exercise. Future research should explore how to effectively apply competitive-HRI to common exercises.

Increased Heart Rate. Most of the gameplay with our competitive robot increased the human subjects' heart rates significantly. Subjects' peak heart rates were higher than their resting heart rates in more than 99% of the games, and their peak heart rates in 92.6% of the games were higher than their walking baseline heart rates. Since the subjects were asked to keep their feet planted on the ground during a game, their body maneuverability was very limited. Despite this movement limitation, subjects' peak heart rates were significantly higher (by 20% to 58%) than their walking baseline in 29% of the games. We found that, beyond physical effort, subjects' cognitive effort and reported emotions also corresponded to a rise in heart rate. Fig. 4b shows that sections with higher average heart rates corresponded to subjects describing the games as cognitively demanding, motivating, frustrating, and intimidating. In short, playing competitive games with our robot can be cognitively demanding and can trigger noticeable emotional reactions.

The Human Learning Effect. As mentioned in Sec. 4, we wanted to test whether changing the robot's gameplay style interrupts the human learning process and keeps the interaction challenging throughout the whole experiment. This hypothesis was based on our assumption that most human players can make significant improvements within five games against a fixed robot policy. Surprisingly, this assumption was invalidated by our user study data. Fig. 4c shows the average game scores and standard deviations of all subjects in the five consecutive games against each of the four robot policies. An analysis of variance (ANOVA) suggests that subjects showed no significant performance increase across the five games for a given policy (Warm-start Policy: p = 0.96, Characterized Policy 1: p = 0.74, Characterized Policy 2: p = 0.85, Characterized Policy 3: p = 0.43). The red horizontal dashed line in each subplot of Fig. 4c represents the best mean score achieved by a subject, which approximates the performance of an observed best human policy. The average performance of all subjects was lower (much lower in most cases) than the performance of the best human policies, and most subjects still had much room for performance improvement. Our assumption about human learning was based on evidence from non-competitive HRI experiments. In contrast, our competitive setting created a much more dynamic environment and resulted in a more challenging learning problem for the human. Since no significant performance variance was observed across the four sections either (ANOVA F = 1.51, p = 0.26), our system was able to remain challenging to users throughout the whole experiment. However, more studies are needed to analyze how changes in gameplay style interrupt human learning.
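For reference, the per-policy learning-effect test reported above amounts to a one-way ANOVA over game index (five games per policy, with subjects as samples). A minimal sketch with SciPy is shown below; the score matrix and file name are hypothetical.

```python
# Sketch of the per-policy learning-effect test: one-way ANOVA over game index.
import numpy as np
from scipy.stats import f_oneway

scores = np.loadtxt("policy1_scores.csv", delimiter=",")  # hypothetical (n_subjects, 5) score matrix
f_stat, p_value = f_oneway(*[scores[:, g] for g in range(scores.shape[1])])
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```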
Human Performance. Although no significant learning effect was found in the subject population, we observed that some subjects tried multiple strategies in the experiment, which could be interpreted as exploration within a learning framework. This observation motivated us to analyze the relationship between strategy exploration and performance, which we quantified first as variance in game scores and then via the featurized gameplay styles from Sec. 4. As shown in Fig. 4a, a larger variance in score corresponds to better performance in terms of maximum and mean scores. We then compared the variance of gameplay style between the five subjects with the highest maximum scores and the five subjects with the lowest maximum scores. For simplicity, we only compared the end-effector displacement change on the x, y, and z axes and the average velocity. The variance of each selected feature for high-performing subjects was at least 29% higher than for low-performing subjects. This suggests that subjects with high variance in score tend to take risks by constantly exploring different strategies, resulting in better rewards in the long run. Future research could examine whether a robot can help human users achieve better performance by verbally or implicitly encouraging them to explore different strategies.

This experiment compared an RL-trained policy (characterized policy 1) to a strong heuristic baseline policy in a long gameplay sequence setting (10 games per policy). Compared to the baseline, the RL-trained policy achieved slightly better game-score performance. In contrast to the first experiment, a human learning effect was observed for both policies. However, the RL-trained policy was significantly better at preventing human subjects from learning to make progress, even without switching gameplay styles during the experiment. While the responses to the TAM questionnaire were very similar for both policies, the subjects considered the RL policy to be more intelligent because of its "defensive" and "diverse" behavior.

Game Scores: When playing against the baseline policy, the subject population's average game score (Baseline mean: 383.5, RL mean: 349.1), maximum game score (Baseline max: 929.0, RL max: 744.0), and minimum game score (Baseline min: -291.0, RL min: -320.0) were all higher than those against the RL policy. Instead of switching policies every five games, we used a longer sequence of gameplay to evaluate each policy in this study. We found a positive correlation between game scores and the amount of gameplay experience against a specific policy. A linear regression over the baseline-policy data (slope = 30.0, correlation coefficient = 0.56, p-value = 9.3e-10) has a larger positive slope, a stronger correlation coefficient, and a smaller p-value than that of the RL policy (slope = 15.7, correlation coefficient = 0.34, p-value = 0.0005).
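Such a regression can be computed with SciPy's `linregress`; the sketch below uses hypothetical data arrays to illustrate how the slope, correlation coefficient, and p-value reported above would be obtained.

```python
# Sketch of the score-vs-experience regression: scores are regressed on the
# within-policy game index. Data arrays and file names are hypothetical.
import numpy as np
from scipy.stats import linregress

game_idx = np.tile(np.arange(1, 11), 10)    # game 1..10 for each of 10 subjects
scores = np.loadtxt("baseline_scores.csv")  # hypothetical length-100 score vector
fit = linregress(game_idx, scores)
print(f"slope = {fit.slope:.1f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.1e}")
```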
Subjective Responses: Subjects' responses to most of the modified TAM questions for the two policies are very similar, as shown in Table 3. However, 70% of the population considered the RL policy to be more intelligent. In the responses to short questions 2, 3, and 4 in Table 2, the baseline policy was described as "fast" by 2 subjects, as "follows my movement" by 4 subjects, and as having "repetitive/predictable" behavior by 4 subjects. Meanwhile, the RL policy was considered "defensive" by 5 subjects, "strategic" by 2 subjects, and described as having "diverse behavior" by 2 subjects.

This work motivated research in competitive-HRI by discussing how competition can benefit people and the technical challenges that competitive-HRI tasks present. The Fencing Game, a physically interactive zero-sum game, is proposed to evaluate system capability in certain competitive-HRI scenarios. We created a competitive robot using an iterative multi-agent RL algorithm, which is able to challenge human users in various competitive scenarios, including the Fencing Game. Our first user study found that human subjects are very accepting of a competitive robot exercise companion. Our competitive robot provides entertaining, challenging, and intense gameplay experiences that significantly increase subjects' heart rates. In our second user study, one of the policies resulting from the proposed RL method was compared to a strong heuristic baseline policy. The RL-trained policy was significantly better at suppressing the human learning effect, and appeared to be more intelligent to 70% of the population.

COVID-19 Effects. We were fortunate to be able to perform an in-person experiment with a real-world robot and human subjects, yet the subject recruiting process was particularly challenging during the current COVID pandemic. With limited access to potential experiment participants, we were only able to run a pilot test with two people prior to the actual experiment reported here. From the pilot data alone, we were unable to rule out our eventual experimental assumption that there would be a significant learning effect among participants even within only five games. Instead, as described in Sec. 5.1, it was not until we obtained our full experimental results that we were able to discard this assumption, owing to the large variance in the degree of human learning across participants. Future research that aims to understand the human learning effect in competitive-HRI tasks should collect more data for each individual and facilitate the participant's ability to learn. In particular, increased learning can be achieved by reducing environmental complexity, such as decreasing the robot's joint velocity, reducing the manipulator's reachability in task space, using a robot with fewer degrees of freedom (DoF), and so on.

Algo. 2 summarizes the scoring mechanism for the Fencing Game described in Sec. 2 [48].

Hardware Details. The PR2 robot is a popular general-purpose robotic platform with two 7-DoF arms, and its overall form factor is similar to that of a human adult [49, 50, 51]. It is comparable to a human player in a competitive game in terms of body size and arm flexibility.
The human player's bat is attached to an infrared photo-diode array tracker, and its position and orientation are perceived by the robot via two tracking base stations. An audible scoring feedback system reports the scoring situation of each game in real time: participants hear a higher-frequency (440 Hz) tone when scoring and a lower-frequency (300 Hz) tone when receiving a penalty from the robot. The system implementation pipeline is shown in Fig. 5.

Algorithm Details. This paragraph provides extra technical details on the two-phase iterative co-evolution algorithm described in Sec. 4. There are two major differences between phase-one and phase-two training in Algo. 1. (1) Phase-one training uses a continuous reward to facilitate the agents' development of basic motor skills, whereas in phase-two training the agents are rewarded solely by the game scores. (2) In phase-one training, each agent always learns against the latest version of the opponent, but in phase-two training, an agent repeatedly and randomly loads a previous version of the opponent's policy from the history after a short period of training. The continuous reward encourages the robot to quickly explore the task space, which makes phase-one training well suited to quickly initializing the policies of the antagonist and protagonist. However, continued use of the phase-one iteration strategy over a long sequence of training would likely trap both agents in a low-quality local equilibrium and/or leave them chasing each other in circles in parameter space [52]. In contrast, the reward and iteration mechanisms of phase-two training create a high-variance learning environment that effectively mitigates this circling problem. In addition, because of the high-variance nature of phase-two training, agents are more likely to learn emergent behaviors and converge to more sophisticated policies.

Training Details. As shown in Fig. 2, at the beginning of training (phase 1, iteration 1) the game scores tend to be heavily biased toward whichever agent is currently learning. This is because both agents' policies are simple at this early stage, and the learning agent can easily find a counter-strategy that dominates the game. After the warm-start training (phase 1, iteration 2), the two agents converged to a region of their policy space where both agents challenge each other without completely dominating the game.

Figure 5: A block diagram of the pipeline of the proposed robotic system. Human motion tracking is achieved via an HTC VIVE VR system.

Demographics. Sixteen human subjects (10 male, 6 female; age M = 28.8 years, SD = 5.56) were recruited for the first human-subject study. Nine of the 16 subjects reported doing more than three hours of physical exercise per week, and seven subjects reported less than three hours of weekly exercise. Jogging, walking, cardio, and weight training were the most common exercises among all subjects. Ten subjects from this population participated in the second user study.

Before the Experiment. Before the experiment, each subject was asked to sit for 3 minutes and then walk for 1 minute to record two average heart rate baseline values. In the experiment, both the robot and the subject held a polystyrene bat to play the games. A safety line was drawn on the ground, and every subject stood behind it to prevent a potential collision. Since the robot was not mobile, each subject also kept his/her feet planted on the ground during a game.
The target area was not directly visible; an audible scoring feedback system notified a subject when his/her bat was placed within the target area. A high-frequency signal sound indicated that the human player was scoring, and a low-frequency signal sound indicated that the human player was being penalized. Before the experiment, a subject had 5 minutes to explore this target area with their bat, so that the target area's location and the scoring mechanism were clear to the subject. Afterward, the subject played two warm-up games with the robot to become further familiarized with the system. In these warm-up games, the robot's actions were slowed down and no data was collected.

Experiment Procedure for User Study One. The experiment contains four sections. In each section, a subject plays five consecutive games with the robot (with a fixed robot policy) and rests for approximately 30 seconds between games. Subjects' heart rates are recorded during each 20-second game. Due to the short duration of the games, our discussion in Sec. 5.1 uses peak heart rate as a summary statistic of this recorded data. In the first section of the experiment, the robot uses the warm-start policy resulting from phase-one training. For the remaining sections, the robot loads one of the three selected characterized policies in each section, following a random order. This ensures that all subjects first learn to play with the robot at a regular speed, and then play with the robot with a new gameplay style in each subsequent section. After each section, a subject is asked to describe the interaction in the last five games by selecting one or more of the following adjectives: 'Exciting', 'Joyful', 'Frustrating', 'Motivating', 'Amusing', 'Intimidating', 'Physically Demanding', 'Cognitively Demanding', 'Boring', 'Others (please describe: )'. Finally, when a subject finishes all four sections of games, they complete the last part of the questionnaire, which assesses their acceptance of the competitive robot and their subjective feelings towards the games. We modified the technology acceptance model (TAM) [47] and created the questions in Table 2. We introduced two extra questions (DE and IE) to understand the human subjects' acceptance and desire for a competitive robot companion in the future, and whether a competitive robot can motivate them to engage in physical exercise more frequently. Other than the open-ended question, all TAM questions were measured on a 5-point scale where 1 = "Strongly Disagree," 3 = "Neutral," and 5 = "Strongly Agree". At the end of each gameplay section, a subject is also asked to compare both the enjoyability and the difficulty of the games across the finished sections. One section can be rated as equally, less, or more enjoyable and difficult than another. By the end of the experiment, this yields two rankings of the four sections based on perceived enjoyment and difficulty.

Experiment Procedure for User Study Two. This experiment contains two sections. In each section, a subject is asked to play 10 consecutive games with the same robot policy and rest for approximately 15 seconds between games. The order in which each subject plays against the baseline policy and the characterized RL policy is randomized. After each section, a subject is asked to answer the modified TAM questions. After the final section, a subject is also asked to answer short questions 2, 3, and 4 in Table 2.
Baseline Heuristic Policy. We aimed to design a strong baseline heuristic policy to create an intense human-robot gameplay experience. Given an observation of the world, the robot orients its bat perpendicular to the human's bat with random angular offsets drawn uniformly from -25 to 25 degrees on the x, y, and z axes. In order to ensure that the robot is always executing a competitive defense, the policy commands the robot to position the center of its bat between the target area and the point on the human's bat that is closest to the target area:

b_p = t_ar + (h_close − t_ar) · uniform(0.5, 1),

where b_p, t_ar, h_up, and h_low represent the position of the robot's bat frame, the center of the target area, the upper end of the human's bat, and the lower end of the human's bat, respectively. h_close indicates the point on the human's bat that is closest to the center of the target area, and L_sword indicates the length of a bat. The function uniform(0.5, 1) randomly determines how far the robot's bat should be from the human's bat. In addition, there is a 50% chance that the robot executes the desired bat position calculated at the last time step instead of the latest desired pose. These added uncertainties introduce randomness into the robot's behavior. This heuristic allows the robot to dominate the fencing game when it can move faster than or as fast as the antagonist. However, human subjects are able to move slightly faster than our PR2 robot, which leaves room for them to discover counter-strategies.
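A minimal sketch of this heuristic is given below. The geometry helper, the omitted perpendicular-orientation computation, and all names are our own simplifications of the description above.

```python
# Sketch of the heuristic baseline: place the bat between the target area and the
# closest point on the human's bat, with randomized offsets and command reuse.
import numpy as np


def closest_point_on_bat(h_low, h_up, target):
    """Project the target-area center onto the segment [h_low, h_up] of the human's bat."""
    seg = h_up - h_low
    s = np.clip(np.dot(target - h_low, seg) / np.dot(seg, seg), 0.0, 1.0)
    return h_low + s * seg


def heuristic_action(target, h_low, h_up, prev_cmd=None, rng=None):
    rng = rng or np.random.default_rng()
    h_close = closest_point_on_bat(h_low, h_up, target)
    # b_p = t_ar + (h_close - t_ar) * uniform(0.5, 1)
    bat_pos = target + (h_close - target) * rng.uniform(0.5, 1.0)
    # Random angular offsets of up to +/-25 degrees about each axis; the underlying
    # perpendicular orientation relative to the human's bat is omitted here.
    angular_offsets = np.deg2rad(rng.uniform(-25.0, 25.0, size=3))
    cmd = (bat_pos, angular_offsets)
    # With 50% probability, re-issue the previous desired pose instead of the new one.
    if prev_cmd is not None and rng.random() < 0.5:
        return prev_cmd
    return cmd
```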
All heart rate data were recorded by a Polar OH1+ optical heart rate sensor. Fig. 4b compares the subjective descriptions between four groups of gameplay sections with different levels of average human heart rate. For each section in the user study, we first calculate the average peak heart rate over the corresponding five games. A section's heart rate level l is calculated by dividing the section's average peak heart rate by the corresponding user's walking baseline heart rate, which yields a percentage describing how much higher or lower the average section heart rate is compared to the baseline. The low, medium, high, and ultra-high heart rate groups contain the sections with l ≤ 100%, 100% < l ≤ 120%, 120% < l ≤ 140%, and l > 140%, respectively.

Table 2: Modified TAM questions and short questions used in the user studies.
Perceived Usefulness (PU): Having a competitive robot companion would improve the quality of my physical exercise.
Perceived Ease of Use (PEOU): Learning to earn a higher score (make progress) in the games with a competitive robot would be easy for me.
Attitude (ATT): Using a competitive robot exercise partner to improve my exercise quality is a good idea.
Intention to Use (ITU): Assuming I had access to a competitive robot for exercise, I would intend to use it.
Perceived Enjoyment (PENJ): I would find competitive human-robot gameplay entertaining.
Desirability (DE): Based on your experience today, future physical exercises and games with a competitive robot will be desirable.
Increased Engagement (IE): Having a competitive robot companion would make me more likely to engage in physical exercise.
Short Question 1 (used in study one): Is there anything you would like to change to improve the interaction experience?
Short Question 2 (used in study two): Which robot (Section 1, Section 2, or equally) do you think is more challenging/difficult to play against, and why?
Short Question 3 (used in study two): Which robot (Section 1, Section 2, or equally) do you think is more enjoyable/fun to play against, and why?
Short Question 4 (used in study two): Which robot (Section 1, Section 2, or equally) do you think is more intelligent, and why?

Perceived Ease of Use. Interestingly, although we did not observe significant performance improvement within each section or between sections, 62.5% of subjects perceived that it was easy for them to make progress when playing against the robot. Furthermore, some participants expressed a desire to beat the best score of previous participants. It is possible that some subjects focused only on attaining their own notion of a "high" score in a small number of games rather than maintaining good performance across all 20 games. On the other hand, in a competitive game setting, we are still not sure exactly what it means for people to feel that getting a better score would be easy. In our experiment, it probably suggests that most participants considered the robot opponent to be surmountable, because only a very small number of participants described their gameplay experience as "Frustrating", "Intimidating", or "Boring". However, future work could study how perceived ease of use affects participants' effort in competitive games.

Enjoyment and Difficulty. We found that perceived enjoyment and difficulty do not vary significantly as the amount of player experience increases, but the amount of data used to train the corresponding policy does have an effect. Fig. 6 shows the average ranking comparison for enjoyability and difficulty across both policies and experiment sections. Among the utilized policies, no significant variance was observed across the three characterized policies for either enjoyability (p = 0.40) or difficulty (p = 0.35). Paired t-tests show that the warm-start policy is less enjoyable and less difficult than the characterized policies (p < 0.01), except for the enjoyability comparison between the warm-start policy and policy 2 (p = 0.13). The result is similar in the time domain: no significant variance was found across the last three sections, in which the characterized policies were randomly ordered (enjoyability: p = 0.5, difficulty: p = 0.42). Section one, which only uses the warm-start policy, is less enjoyable and less difficult than the other sections (p < 0.05). Across all sections, perceived difficulty is positively correlated with perceived enjoyment, with a moderate coefficient of 0.6.
References
Evolution and competition
Competition-driven evolution of organismal complexity
The relational view: Cooperative strategy and sources of interorganizational competitive advantage
Cyber buddy is better than no buddy: A test of the Köhler motivation effect in exergames
Sport psychological constructs related to participation in the 2009 world masters games
Is more autonomy always better? Exploring preferences of users with mobility impairments in robot-assisted feeding
Less is more: Rethinking probabilistic models of human behavior
Trust-aware decision making for human-robot collaboration: Model learning and planning
Human-robot interaction for cooperative manipulation: Handing objects to one another
Unsupervised early prediction of human reaching for human-robot collaboration in shared workspaces
Socially assistive robotics for post-stroke rehabilitation
Gender differences in motivation and barriers for the practice of physical exercise in adolescence
Forced use of the upper extremity in chronic stroke patients: Results from a single-blind randomized clinical trial
Competitive orientations and motives of adult sport and exercise participants
The impact of individual, competitive, and collaborative mathematics game play on learning, performance, and motivation
Rules of play: Game design fundamentals
The four-phase model of interest development
Competition effects on physiological responses to exercise: Performance, cardiorespiratory and hormonal factors
Deep Blue
Mastering the game of Go without human knowledge
Co-evolving parasites improve simulated evolution as an optimization procedure
Competitive coevolution through evolutionary complexification
Opponent modeling in deep reinforcement learning
Multiagent cooperation and competition with deep reinforcement learning
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
Learning with opponent-learning awareness
Emergent complexity via multiagent competition
Multi-agent actor-critic for mixed cooperative-competitive environments
Continuous control with deep reinforcement learning
Monetary-incentive competition between humans and robots: Experimental results
Perceptions of ASIMO: An exploration on co-operation and competition with humans and humanoid robots
No fair!! An interaction with a cheating robot
Robot exercise instructor: A socially assistive robot system to monitor and encourage physical exercise for the elderly
A robot as fitness companion: Towards an interactive action-based motivation model
Development of a block machine for volleyball attack training
The importance of shape in early lexical learning
Telemanipulation with chopsticks: Analyzing human factors in user demonstrations
Game-theoretic modeling of human adaptation in human-robot collaboration
Markov games as a framework for multi-agent reinforcement learning
Emergence of locomotion behaviours in rich environments
Emergent tool use from multi-agent autocurricula
Emergent coordination through competition
Emergent behaviors and scalability for multiagent reinforcement learning-based pedestrian models. Simulation Modelling Practice and Theory
A generalised method for empirical game theoretic analysis
Principal component analysis. Chemometrics and Intelligent Laboratory Systems
Older adults' acceptance of a robot for partner dance-based exercise
Competitive physical human-robot game play
Benchmarking robot manipulation with the Rubik's cube
Pre-touch sensing for sequential manipulation
Contact-less manipulation of millimeter-scale objects via ultrasonic levitation
Cycles in adversarial regularized learning

Acknowledgments. This work was supported in part by NSF awards EFMA-1832795 and CNS-1305072. This study (IRB ID: STUDY00012211) has been approved by the University of Washington Human Subjects Division.