A General, Evolution-Inspired Reward Function for Social Robotics
Thomas Kingsford
2022-02-01

Abstract: The field of social robotics will likely need to depart from a paradigm of designed behaviours and imitation learning and adopt modern reinforcement learning (RL) methods to enable robots to interact fluidly and efficaciously with humans. In this paper, we present the Social Reward Function as a mechanism to provide (1) a real-time, dense reward function necessary for the deployment of RL agents in social robotics, and (2) a standardised objective metric for comparing the efficacy of different social robots. The Social Reward Function is designed to closely mimic the genetically endowed social perception capabilities of humans, in an effort to provide a simple, stable and culture-agnostic reward function. Presently, datasets used in social robotics are either small or significantly out-of-domain with respect to social robotics. The use of the Social Reward Function will allow larger in-domain datasets to be collected close to the behaviour policy of social robots, which will in turn allow further improvements both to reward functions and to the behaviour policies of social robots. We believe this will be the key enabler for developing efficacious social robots in the future.

Social robots aim to achieve tasks by interaction and collaboration with humans on a social level. This requires an understanding of social skills. But what are social skills? Human social skills are a means to an end that facilitates communal and societal cooperation, and they have enabled humans to become the dominant species on Earth.

The field of social robotics has yet to leverage the fast-paced advancement that both broader robotics and machine learning have benefited from in recent years. We propose there are two missing pieces:

1. Reinforcement learning. The first missing piece is the realisation that social robotics is a sequential decision-making problem, and should be addressed using modern reinforcement learning approaches. This allows us to escape the limitations of systems designed by hand or by mimicry.

How do humans learn social skills? Humans can be thought of as a meta-learning system. Evolution (as a learning algorithm) has endowed individual humans with the ability to learn from experience. This is also true of humans' social skills. There is clearly a poverty of stimulus: humans are not merely general learners who are dropped into society as infants and learn everything from scratch. There is clearly a prior, or a bias, that is used to bootstrap learning. Humans have intrinsic desires which encourage learning early in life. Some of these desires change as the human infant develops, and others remain. It is these intrinsic desires which enable human learning, including the learning of social skills, and which may be thought of as a reward function.

The difficulty here is, of course, that it is difficult for robots to directly learn social behaviours. A significant contributor to this difficulty is that simulation is largely infeasible in a social robotics context: while social robots themselves can be simulated, their environment (the humans they interact with) cannot be. We propose that, if it is possible for robots to learn social skills, it will be possible only by providing them with the same genetically endowed social cues as humans.
With any less, the problem is too sample-inefficient to be tractable. With any more, the problem is likely to be over-prescriptive of a particular solution, with no guarantee of the degree of optimality of that solution. Moreover, it will be important to provide robots with the opportunity to learn in long-term real-world social deployments so that sufficient data can be accrued.

If we wish for autonomous systems, such as social robots, to interact collaboratively and fluidly with humans in a social context, we have three options:

1. Intricately design their interactions, based on engineers' understanding.
2. Learn social behaviour as a byproduct of optimising with respect to collaborative tasks. This is akin to robotic learning taking on the learning responsibilities of both the individual human and evolution at large.
3. Bootstrap by providing fundamental social capabilities as a reward signal in training, thus mimicking the learning environment genetically endowed upon humans, then learn. This is akin to robotic learning being restricted to the learning responsibilities of the individual human only, but not evolution.

We propose that rapid development in this field is feasible only by adopting the third approach. By treating social robotics as a sequential decision-making problem, solving it with modern reinforcement learning algorithms, but starting with capabilities as close to those genetically endowed upon humans as possible, we believe social robotics can begin to experience the rapid advancement seen in other fields of machine learning and robotics.

2. Objective evaluation. The second missing piece is the development of objective evaluation mechanisms, which has been exceedingly hard in social robotics. Such mechanisms would enable both standardised evaluation of systems (there is currently no agreement on how to objectively and quantitatively evaluate a social robot's performance) and reinforcement learning. They must be automatic and low-latency.

We propose the Social Reward Function, a real-time, real-valued reward function designed to mimic, to the best of our abilities, those fundamental social capabilities endowed upon humans by evolution, and which can be used for both learning and evaluation of social robotic systems. An implementation is made publicly available as a Python library.

The design problem is to produce a sufficiently dense reward function based on our best guess of genetically endowed social cues. In principle, we want to avoid over-engineering the reward function, as this amounts to "assuming a solution" and might result in behaviour that is optimal with respect to the designed reward function but insufficiently socially efficacious. At the same time, we need to ensure there is sufficient information content in the reward signal that learning is tractable.

So, how do we determine the appropriate genetically endowed capabilities of humans? Two general approaches can be taken. In the first, the development of humans in pre-adulthood is studied. Since infants have been exposed to minimal experience, it is likely that the capabilities they demonstrate are largely evolutionarily endowed. Human infants are born in a significantly immature state and do not reach physical and intellectual maturity for more than two decades, in stark contrast to many other animal species. This presents a risk: social skills which are truly genetically endowed may be indistinguishable from those which are learned as biological maturation occurs in concert with experiential learning.
Nonetheless, if we focus on studying the social skills of humans in infancy rather than childhood and beyond, the effect of experiential learning can be minimised and observed effects can be assumed to be attributable directly to evolution. In the second, humans are studied across cultures for common social skills. It is likely that culture-agnostic social skills are either endowed by evolution or are learned but so general that their incorporation in a reward function does not present a risk of over-engineering.

[1] concludes that the following social capabilities are likely to be genetically endowed in humans:

1. "Face and face-like object detection [2, 3], including facial expression recognition [4]. The generation of certain facial expressions is likely innate [5] and could suggest the detection of those facial expressions, but not others, is innate."
2. "Eye-like object detection and limited gaze following [6]."
3. "Proprioceptive mimicry [2, 6]."
4. "Biological motion detection [6]."
5. "Prenatal maternal/foetal physiological bond yielding vocal emotion recognition in the neonate [7]."

For a more thorough review of social cognition in the Psychology literature and its relevance to social robotics, the reader is referred to [1]. For the purposes of designing the Social Reward Function, the following capabilities are considered evolutionarily endowed in humans, amenable to implementation in a robot, and able to be described by a reward function:

1. Simple facial expressions (more advanced facial expressions are likely learned and may be culture-dependent)
2. Emotion in voice (although this is likely learned in the womb through the physiological connection with the mother)

We have designed the Social Reward Function to incorporate simple facial expressions and interaction principally, with voice emotion secondarily. Modern machine learning models for Facial Emotion Recognition (FER) are robust to real-world situations, while models for Speech Emotion Recognition (SER) often struggle to generalise across domains. This is likely because the dominant SER datasets suffer from cultural homogeneity and are collected from actors in laboratory settings. It is hoped that the collection and annotation of more realistic datasets (perhaps facilitated by widespread deployments of social robots) will lead to the development of more robust SER models.

It is likely that touch can be incorporated in future works. Touch has been used by prior art as a component of the reward signal in RL for social robots [8]. However, it is presently unclear how general this is (both in terms of situations and robot morphology) and more data must be collected. For instance, particular scenarios or robot morphologies might unfairly encourage touch, from which inappropriately inflated social value scores might be inferred. It is likely that a specification for touch sensors and robot morphology will need to accompany the inclusion of touch in this reward function.

A survey of FER and SER models with publicly available implementations was conducted. Those that had reasonable performance on their datasets and passed subjective assessments of robustness were included in the Social Reward Function. A summary of the included models is presented in Table 1.

The included models are combined to produce two emotion perception matrices, $X_{\mathrm{FER}} \in \mathbb{R}^{n_{\mathrm{FER}} \times k}$ and $X_{\mathrm{SER}} \in \mathbb{R}^{n_{\mathrm{SER}} \times k}$, where $n$ models estimate $k$ emotions for each modality. For each modality, the per-model vectors are averaged to produce a per-modality vector. The reward is then calculated at each time step from these per-modality vectors, where $k_{\mathrm{FER}}$, $k_{\mathrm{SER}}$, $w_{\mathrm{FER}}$ and $w_{\mathrm{SER}}$ are design parameters.
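As a minimal illustrative sketch of one such combination (not the published library's API): assume $k_{\mathrm{FER}}, k_{\mathrm{SER}} \in \mathbb{R}^{k}$ are per-emotion valence weight vectors and $w_{\mathrm{FER}}, w_{\mathrm{SER}}$ are scalar modality weights, so the per-step reward is a weighted sum of the two modality valence scores. The emotion ordering, function names and default weights below are hypothetical, and the presence-based component discussed later is omitted for brevity.

```python
import numpy as np

# Hypothetical emotion ordering shared by the FER and SER models (k = 7).
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def modality_vector(per_model_probs: np.ndarray) -> np.ndarray:
    """Average an (n_models, k) emotion-probability matrix into a single k-vector."""
    return per_model_probs.mean(axis=0)

def social_reward(x_fer: np.ndarray, x_ser: np.ndarray,
                  k_fer: np.ndarray, k_ser: np.ndarray,
                  w_fer: float = 1.0, w_ser: float = 0.5) -> float:
    """One plausible per-step combination: a weighted sum of per-modality
    valence scores, where k_fer/k_ser assign a signed value to each emotion."""
    x_bar_fer = modality_vector(x_fer)  # averaged FER emotion probabilities
    x_bar_ser = modality_vector(x_ser)  # averaged SER emotion probabilities
    return w_fer * float(k_fer @ x_bar_fer) + w_ser * float(k_ser @ x_bar_ser)

# Example: positive valence for happy/surprise, negative for angry/disgust/fear/sad.
valence = np.array([-1.0, -1.0, -1.0, 1.0, -1.0, 0.5, 0.0])
x_fer = np.random.dirichlet(np.ones(7), size=1)  # output of a single FER model
x_ser = np.random.dirichlet(np.ones(7), size=3)  # outputs of three SER models
print(social_reward(x_fer, x_ser, k_fer=valence, k_ser=valence))
```

Down-weighting the SER modality (here via a smaller default $w_{\mathrm{SER}}$) is one way to reflect the weaker robustness of SER models noted above; the actual parameter values remain design choices.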
Two approaches are appropriate for evaluating the correctness of the Social Reward Function: direct and indirect evaluation.

A direct indicator of correctness is the correlation between the function's assessment of social situations and that of a human observer acting as an oracle. This approach allows large amounts of data to be collected by annotating previously collected scenes across a variety of scenarios. The downside is that the scenarios are not constrained to the human-robot interaction domain and hence might not accurately characterise correctness in that domain.

An indirect indicator is to expose human participants to a selection of robot agents and observe their interactions. It is important that such an experiment be focused on human-robot interaction, but that the participants be blinded (i.e. the human participants should not be aware that their responses to the robot are being observed). If the experiment is not blinded, it is likely the participants will modify their reactions either consciously or subconsciously. If the experiment is not focused on human-robot interaction, the data can be considered out-of-domain and might not generalise to the prediction of human emotions in a human-robot interaction setting. In such an experiment, (non-participant) human observers and the Social Reward Function both evaluate the emotional reactions of the human participants, and the consistency of these evaluations is assessed. Such an evaluation is highly specific to the human-robot interaction domain, but will contain relatively limited quantities of data.

In both cases it is notable that humans are often fallible witnesses and often not consciously aware of their true emotional responses to social situations, hence their responses might contain inaccuracies. Moreover, these inaccuracies might be systematic, appearing across different instances of similar scenarios and potentially across different human observers.

Before evaluating the Social Reward Function as a whole, it is worthwhile to characterise the performance of the component models on relevant FER and SER datasets. To allow a direct comparison of model architectures without the effects of different datasets, we re-train each model on FER2013 only (for FER models) or RAVDESS only (for SER models). Specifically, datasets were split into training and test sets by partitioning on actors, not on individual samples. In this way, the test set provides a more challenging but more realistic test of the models' ability to generalise to unseen persons, as would be required in a robotics application. Since it is possible that pre-trained models were exposed to test-set data, which would invalidate test-set results, models were trained from random initial parameters on the training set only and then evaluated on both the training and test sets. Detailed results are presented in Appendix A.
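As a sketch of the actor-wise partitioning described above (the `actor_of` helper and the splitting heuristic are illustrative assumptions; the exact training pipeline is not reproduced here):

```python
import random
from collections import defaultdict

def split_by_actor(samples, actor_of, test_fraction=0.2, seed=0):
    """Partition samples so that no actor appears in both training and test sets.

    `samples` is any list of dataset items; `actor_of` is a (hypothetical)
    callable returning an actor/subject identifier for an item.
    """
    by_actor = defaultdict(list)
    for s in samples:
        by_actor[actor_of(s)].append(s)

    actors = sorted(by_actor)
    random.Random(seed).shuffle(actors)
    n_test = max(1, int(len(actors) * test_fraction))
    test_actors = set(actors[:n_test])

    train = [s for a in actors if a not in test_actors for s in by_actor[a]]
    test = [s for a in test_actors for s in by_actor[a]]
    return train, test
```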
The FER model performs well, producing a top-1 accuracy of 69%. Unfortunately, the SER models all produce a top-1 accuracy of approximately 33%. This is likely due to the small dataset size (both in terms of number of samples and diversity of actors and scenarios), which makes generalisation difficult.

We can see from Fig. 3 that the FER model exhibits a good level of accuracy, with a true positive rate of at least 60% for every emotion. Moreover, many of the incorrect predictions are in fact reasonable: for instance, in 16% of cases sadness is classified as fear, in 15% of cases fear is classified as sadness, and in 20% of cases disgust is classified as anger. It is rare for a wholly incorrect classification to occur; perhaps the worst such occurrence is happiness being classified as sadness in 3% of cases.

We can see from Fig. 5 that the SER models exhibit only moderate accuracy. Many of the erroneously classified emotions are reasonable; for instance, MevonAI classifies fear as anger 64% of the time. However, unlike FER, there is a significant amount of unreasonable erroneous classification: MSER classifies calm as disgust 23% of the time, and ERUS classifies happiness as fear and as anger 22% of the time in both cases. This demonstrates a high error rate for the SER models, and hence those models should be carefully integrated into the Social Reward Function to ensure it is improved, and not degraded, by the addition of SER predictions.

A dataset of human interactions was formed by scraping YouTube results for the following keywords: crying, debate, interview, scene, and senate hearing. Ultimately, a dataset of some 437 10-second clips was generated, each with an associated human-generated label indicating how desirable it would have been for a robot to have caused the scene to occur. The results of the Social Reward Function relative to this dataset form the basis of the direct evaluation of the system, from which we infer that the system does produce a valid reward function for use in social robotics. The dataset and results are discussed in detail in Appendix B. The Social Reward Function produced a Pearson correlation coefficient, r, of 0.491 (Table 8) with respect to human-provided labels. This can be interpreted as a moderate, but not strong, correlation.
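A minimal sketch of this clip-level protocol, assuming the per-step rewards for each clip and the integer labels (−2 to +2) are already available (function and variable names are illustrative; `scipy.stats.pearsonr` computes the correlation):

```python
import numpy as np
from scipy.stats import pearsonr

def clip_level_correlation(per_step_rewards, labels):
    """Average the per-step reward over each 10-second clip and correlate the
    clip means with the human-provided labels.

    `per_step_rewards` is a list of 1-D arrays (one per clip); `labels` is a
    list of integers of the same length.
    """
    clip_means = np.array([np.mean(r) for r in per_step_rewards])
    r, p_value = pearsonr(clip_means, np.array(labels, dtype=float))
    return r, p_value
```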
From Figure 6, we can see a clear positive relationship between median predicted reward and ground truth labels. The median predicted reward for the strongly and slightly negative labels is below the 25th percentile of the positive labels; conversely, the median predicted reward for the strongly and slightly positive labels is above the 75th percentile of the negative labels. This is a good indication that the Social Reward Function is able to distinguish positive and negative situations, assigning them positive and negative rewards respectively.

The most significant errors of the Social Reward Function occur for scenes with neutral ground truth reward. These errors are primarily due to the model failing to distinguish arousal from emotion class. For instance, it is often observed that high arousal but neutral emotion yields a misclassification of the emotion as fearful or angry. This presents an opportunity for improvement in future iterations, perhaps by introducing arousal as a predicted metric, or by increasing the diversity of data used to train FER/SER models such that high-arousal situations are correctly classified.

Unfortunately, due to COVID restrictions, it was not feasible for our lab to conduct social robotics experiments sufficient for an indirect evaluation of the Social Reward Function to be performed. A survey of social robotics datasets published by other research groups was conducted. These datasets were filtered for suitability, and a summary is presented in Table 2. Unfortunately, no suitable datasets were identified. Most datasets were eliminated as they do not involve human-robot interaction. AIR-Act2Act contained human-robot interaction with elderly participants, but did not contain audio data and hence cannot be used to evaluate the Social Reward Function. Human participants in the JPL-Interaction dataset were aware of the experiment, and hence their responses are likely to be consciously or subconsciously modified, which undermines the use of this dataset.

| Dataset | Year | Usability |
| --- | --- | --- |
| AIR-Act2Act [18] | 2020 | No audio |
| NTU RGB+D 120 [19] | 2019 | No HRI |
| DeepMind Kinetics [20] | 2017 | No HRI |
| ShakeFive2 [21] | 2016 | No HRI |
| K3HI [22] | 2013 | No HRI |
| JPL-Interaction [24] | 2013 | Experiment not blinded |
| SBU Kinect Interaction [23] | 2012 | No HRI |
| UT-Interaction [25] | 2010 | No HRI |
| TV Human Interaction [26] | 2010 | No HRI |
| Hollywood2 [27] | 2009 | No HRI |

Standard evaluation benchmarks have been the cornerstone of progress in various fields of machine learning [28, 29, 30, 31]. Such benchmarks allow researchers to isolate as many variables as possible when developing novel algorithms, and to compare their results directly to prior art. Without such benchmarks in fields like supervised learning and reinforcement learning, it is likely that progress would have been stunted. According to [30], benchmarks in reinforcement learning should:

1. "Be composed of tasks that reflect challenges in real-world applications... of RL"
2. "Be widely accessible for researchers and define clear evaluation protocols for reproducibility"
3. "Contain a range of difficulty to differentiate between algorithms"

The field of social robotics struggles to define such an evaluation due to the difficulty of defining clear evaluation protocols. A clear evaluation protocol that enables reproducibility has two components: a valid evaluation metric that quantifies task success, and a reproducible set of scenarios or environments in which to embed social robots. Fortunately, the Social Reward Function provides a valid evaluation metric for use in a benchmark. Notably, in the field of robotics (excluding social robotics), the problem of producing sets of reproducible scenarios or environments is relatively easy (albeit with some notable challenges) and amounts to defining and characterising a task to solve [32]. In social robotics, the necessity of the presence of, and interaction with, human participants in experiments makes such standardisation difficult. Fortunately, we can look to the field of Psychology, which has encountered similar problems. In that field, standard batteries of experiments are defined which allow different research groups in different geographies to reproduce prior works. A direction for future work is to establish an appropriate, broad battery of social experiments and couple it with the Social Reward Function to produce a benchmark for social robotics.

As in many supervised learning problems there is, unfortunately, no impartial oracle that can provide ground truth. Instead, humans must agree on label definitions and then assign labels to samples with respect to these definitions. In the emotion detection domain, both the definitions and the assessment of samples are incredibly difficult and subjective. Emotions are an abstract concept describing the internal state of humans and hence are not directly observable. Humans, as a social species, have developed strong capabilities for inferring the emotional states of others by observation, but this is far from infallible. Moreover, the presentation of emotions often requires significant context to infer underlying emotional states.
Some illustrative examples include:

• Consider an actor playing a role and expressing an emotion. A human observer can use their contextual knowledge that the actor is not truly present in the scene, and hence is not truly sharing the experience of the character.
• Consider a comedian getting angry as part of a set. The comedian may or may not be truly experiencing anger, and human observers can often assess this based on their understanding of the individual and the situation the comedian is in.
• Consider a person exercising intensely. That person might exhibit discomfort through facial expression and speech, but may themselves describe the activity as either enjoyable in itself or unenjoyable but beneficial (in this latter case, it may be considered strictly correct to classify the circumstance as having negative reward and to rely on compensatory positive long-term reward resulting from health and wellbeing benefits).

Finally, the model itself is limited in how it defines a scenario. It assumes the scene contains only one entity, which can display a combination of seven emotions; it does not fully consider that there may be multiple entities present, each expressing different emotions. It also does not consider arousal, only the presentation of the seven basic emotions, which can lead to definitional issues when annotating scenes. Nor does it consider more complex contexts, such as a voice-over of a pre-recorded scene, or individuals present in the scene who are not interacting with the scenario (i.e. persons in the background).

Complexities of model expressiveness (multiple participants, arousal, etc.) can be addressed through improvements to the Social Reward Function over time. Other complexities are addressed implicitly by a core thesis of this work: it is fundamentally beneficial for good emotions to be experienced most of the time. Even though bad emotions are sometimes useful (e.g. unwanted discomfort when exercising, or aggression in a debate), they are only useful insofar as they improve emotional outcomes over a longer time frame. It is thus hypothesised that the Social Reward Function is not only allowed, but encouraged, to provide negative reward for such situations, and that optimal policies will still be encouraged to produce them when they yield greater time-discounted sums of future rewards. It is left to future works to determine to what extent the credit assignment problem can be solved end-to-end, and to what degree reward engineering is required to learn optimal behaviours over long time frames.
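Concretely, in standard RL notation, the quantity an agent maximises is the expected discounted return

\[ G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad \gamma \in [0, 1), \]

where $r_{t+k}$ is the per-step Social Reward Function output. A transiently negative reward (e.g. during a heated but ultimately productive exchange) can therefore still be consistent with an optimal policy, provided it is followed by sufficiently larger rewards later in the trajectory.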
The constituent models of the Social Reward Function are trained on datasets that are at times significantly out-of-domain in the context of social robotics. FER2013 [10] is an FER dataset comprising approximately 30,000 greyscale images of various facial expressions, annotated for fundamental emotions; it was constructed by conducting Google searches for images of faces by keyword. RAVDESS [12] is an SER dataset collected in a laboratory setting, comprising audio and video capture of 24 actors speaking in a neutral North American accent. Each actor is directed to speak and sing a scripted sentence in each of eight emotions, at a normal and a strong intensity, producing a total of 7,356 sentences. EMODB [15] is an SER dataset collected in a laboratory setting, of 10 actors speaking 10 German sentences in seven emotions, producing approximately 800 sentences. TESS [14] is an SER dataset collected in a laboratory setting, comprising two female actors speaking 200 target words in the carrier sentence "Say the word ...", in each of seven emotions, producing 2,800 sentences in total.

Anecdotal evaluation of the constituent models on a limited set of representative situations suggests that FER models are more robust than SER models. This is unsurprising, as the breadth of FER datasets (e.g. FER2013) is much greater than that of SER datasets (e.g. RAVDESS, EMODB, and TESS), owing to the ability to find a large breadth of facial images on the internet, whereas speech data is less abundant and must be collected in a laboratory environment. Moreover, the distribution of facial expressions present in human-robot interaction is likely to be well captured by sampling those present on the internet. Since speech data is not as ubiquitous, it must be collected from actors in a laboratory setting, and the ability of actors to produce natural speech excerpts that also have good coverage of speech in the human-robot interaction domain is likely to be low.

The hope is that the first iteration of the Social Reward Function can begin to be used in social robotics experiments involving natural interactions (ideally blinded), leading to the in-domain collection of data which can then be annotated and used to improve recognition models for the social robotics domain. Moreover, due to sample complexity, it is likely that the field of social robotics will need to leverage techniques from offline RL, in which policies are learned from prior interactions with the environment, without additional interactions [33]. Both to improve the efficacy of perception (FER, SER) and to enable offline RL, it will be desirable for the field to begin collecting large and diverse datasets of robot social interactions, including video and audio of human participants and proprioceptive/control data from the robots themselves. Such data collection efforts are occurring in other fields of robotics (e.g. RoboNet [34], RoboTurk [35]). We encourage the community to make such data public.

Although the Social Reward Function was designed to support use cases in social robotics, it is interesting to consider its broader utility. It is a common issue in many domains of AI that the objectives adopted by a domain are a proxy for human satisfaction, but do not directly measure it. Since the Social Reward Function aims to faithfully measure human satisfaction, we explore its applicability to addressing this problem.

The alignment problem refers to the problem of ensuring that the objective function an ML system is optimising is in fact congruent with human values, particularly at its extremes [36]. This may seem an easy problem to solve, but there are numerous real-world examples, both in ML and more broadly, of failures. Examples of observed failures include the exploitation of bugs in simulation [37, 38], exploitation of errors in the specification of reward [39, 40, 41], and exploitation of artefacts present in a training dataset that are not present in the target domain [42, 43]. An example of a hypothesised failure is instrumental convergence, in which intelligent agents with seemingly innocuous but unbounded goals can produce harmful behaviour.
The canonical thought experiment exploring instrumental convergence is the paperclip maximiser, in which an agent tasked with the control of a factory and the goal of maximising the factory's production rate of paperclips determines that the optimal policy is to turn all matter in the world either into paperclips or into factories for producing paperclips [44].

Specification of rewards is notoriously difficult and typically results in a proxy that is hoped to align with the designer's understanding of human values in a domain. The difficulty lies in the need for a quantifiable and perceptible goal that can be realised in a practical system. Consider the rewards in recommendation systems (RSs), typically click-through rate and user likes/dislikes. It has been shown that such objectives fail to maximise the long-term wellbeing of users and can lead to issues such as addiction and the formation of echo chambers [45]. At its core, the alignment problem stems from the use of a proxy reward (that is, a reward function we believe to be highly consistent with human values) in place of a reward function directly capturing human values, because such a reward function is unobservable. It is interesting to consider whether the Social Reward Function could be used more generally as a mechanism to mitigate the aforementioned issues stemming from the use of a proxy reward.

Human and great ape societies are able to form stable structures in which individuals cede immediate self-benefit for the benefit of a second party [46], and this is based on the ability to infer second-party values by observation (a necessary component in the development of primary/secondary/tertiary intersubjectivity and Theory of Mind [47]). Hence human cooperation and morality are grounded in the ability to estimate others' emotional states by external perception, and this has led to the formation of stable human societies exhibiting significant cooperation. Although external perception of human emotional state is a proxy for the true internal state, the fact that humans have evolved to produce stable cooperative societies suggests that this externally observable state is in fact sufficient as a foundation for collaborative behaviour. Since the Social Reward Function aims to faithfully mimic this capability, it is reasonable to suspect that this might be an avenue to overcoming the alignment problem.

Consider a Markov Decision Process (MDP). We could modify the MDP such that learned policies are sampled at random intervals and their trajectories presented to human observers as part of a natural social interaction with the agent. Notably, it is important that participants not be aware of their function as oracles, to ensure their reactions are natural. The Social Reward Function could be used to assess whether such trajectories are pleasing or displeasing to humans, and in the case of a displeasing trajectory a large negative reward could be generated. In such a formulation, undesirable extrema of the (proxy) reward function of the original MDP could be disincentivised. Both theoretical and empirical findings on whether such a formulation yields convergence and mitigates the alignment problem are left as areas for future research.
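A minimal sketch of this formulation, assuming a Gym-style `reset`/`step` environment interface returning `(obs, reward, done, info)` and a hypothetical `social_reward_of` oracle that scores a sampled trajectory (e.g. via the Social Reward Function applied to observing humans); the class, audit probability and penalty magnitude are illustrative assumptions:

```python
import random

class SociallyAuditedEnv:
    """Wrap an environment whose reward is a proxy objective.

    With probability `audit_prob`, the current trajectory is (conceptually)
    presented to human observers as part of a natural interaction, scored by
    the hypothetical `social_reward_of` oracle, and a large penalty is added
    if the trajectory is judged displeasing.
    """

    def __init__(self, env, social_reward_of, audit_prob=0.01, penalty=-100.0):
        self.env = env
        self.social_reward_of = social_reward_of
        self.audit_prob = audit_prob
        self.penalty = penalty
        self.trajectory = []

    def reset(self):
        self.trajectory = []
        return self.env.reset()

    def step(self, action):
        obs, proxy_reward, done, info = self.env.step(action)
        self.trajectory.append((obs, action))
        reward = proxy_reward
        if random.random() < self.audit_prob:
            if self.social_reward_of(self.trajectory) < 0.0:  # displeasing to observers
                reward += self.penalty
        return obs, reward, done, info
```

How often to audit and how large the penalty should be are free design choices; whether this actually disincentivises undesirable extrema is the open convergence question noted above.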
While the Social Reward Function aims to faithfully capture the human ability to perceive second-party emotional state, it is of course still a proxy. Nonetheless, it does seem to be an interesting research direction, and it is worth considering whether the difference between the Social Reward Function and the perception of true human values can be reduced with additional data.

Ultimately, this regresses to an economic problem of sorts: how do we balance the needs of the many against the needs of the few? A more principled approach can now be taken, since satisfaction can now be (effectively) directly measured, and a function combining the satisfaction of many individuals into a single objective can be reasoned about. Consider the following definitions, where $r_{i,t}$ denotes the Social Reward Function output for individual $i$ at time $t$ and $\gamma \in [0, 1)$ is a discount factor:

Social Reward for Individual $i$: $r_{i,t}$
Social Return for Individual $i$: $R_i = \sum_{t} \gamma^{t} r_{i,t}$

Social return can be considered an estimate of individual satisfaction. Society might decide, for instance, that increasing the satisfaction of an individual at the 10th decile by one unit is less desirable than increasing the satisfaction of an individual at the 1st decile. Implicit in this formulation is some sort of zero-sum game, and hence some cost to one individual associated with increasing the return of another individual. In this context, increasing the satisfaction of one individual can be said to have associated externalities. Hence we may wish to internalise these externalities by applying a monotonic function $f : \mathbb{R} \to \mathbb{R}$:

Internalisation Function: $f : \mathbb{R} \to \mathbb{R}$, monotonically increasing
Internalised Social Return for Individual $i$: $f(R_i)$
Internalised Population Return: $R = \sum_{i} f(R_i)$

As an example of $f$, consider the function illustrated in Figure 1, which is linear near the origin, increasing at a diminishing rate for increasingly positive inputs (to incentivise the equitable sharing of satisfaction among individuals), and decreasing at an accelerating rate for negative inputs (to incentivise the alleviation of absolute suffering). Society could focus discussions of the treatment of the "middle class" on the region near the origin, the treatment of the very poor on the third quadrant, and the treatment of the wealthy elite on the upper reaches of the first quadrant. Moreover, such discussions are naturally and correctly grounded in satisfaction rather than material wealth, and so would yield a correct allocation of resources to, for instance, those members of society who are clinically depressed despite being wealthy. In such a formulation, societies can ensure their values are reflected through discourse regarding the form of $f$, and reinforcement learning systems can safely optimise with respect to the Internalised Population Return, $R$.

It is important to ensure the Social Reward Function is not gameable but truly reflects social merit. In other words, we need to ensure there do not exist high-reward policies with respect to the Social Reward Function which fail to optimise human satisfaction. This ensures RL algorithms are not able to, for example, continuously surprise people, generate nervous laughter, or exhibit other behaviours which maximise reward only due to limitations in the design of the reward function. If the Social Reward Function is to be used as an evaluation metric, we also need to ensure that researchers are not able to (consciously or subconsciously) artificially enhance reward by choosing particular scenarios, environments and participants so as to game the reward. It is likely that testing will involve multiple modalities.
The collection of larger in-domain datasets will help to ensure the perception models comprising the Social Reward Function are robust and do not contribute to gameable regions of the reward function due to poor out-of-domain estimation. Moreover, the use of the Social Reward Function in both online and offline reinforcement learning applications will likely highlight gameable regions of the reward function, due to the incentive of agents to find such regions. Lastly, adversarial agents [48, 49], directly incentivised to find regions of high reward under the Social Reward Function which yield low reward as defined by human oracles, may be used. Adversarial robustness is left as an area for future research.

Social robotics needs to move from a paradigm of designed behaviours and imitation learning to reinforcement learning in order to enable optimal and fluid behaviours to be displayed. This requires 1) standardised mechanisms for the evaluation of the social merit of social robots, and 2) an online reward signal to support reinforcement learning. It can be seen from the poverty of stimulus and the presence of social capabilities in early life that humans have some genetically endowed social cueing. Furthermore, cultural differences and observations of infants and children imply that humans also have some learned social cueing. The goal of this work is to capture the genetically endowed social cueing provided to humans, for two reasons: 1) to support objective evaluation of social robots (as an alternative to subjective methods such as participant questionnaires), and 2) to provide a dense online reward function to enable reinforcement learning to be applied to social robotics. It is important that only genetically endowed social cueing is captured, to decrease the risk of errors being present in the perception of social cues which would then compromise the capabilities of learned behaviour policies. Errors can arise because social cues may be culture-, age- and context-specific, and because human-learned social cues are very highly complex and hence infeasible to capture completely. The Social Reward Function is proposed, which combines FER-, SER- and presence-based rewards to achieve these aims. It is hoped that this provides a stepping stone for future research in RL applied to social robotics, including an improved ability to compare the results of different labs.

Residual Masking Network (RMN) [9] was originally trained on the FER2013 dataset [10]. The model was re-trained, and those results are presented here. Training- and test-set confusion matrices can be found in Figs. 2a and 3a, respectively. Top-k accuracy for both the training and test sets can be found in Table 3.

Emotion Recognition Using Speech (ERUS) [13] was originally trained on the RAVDESS dataset [12]. The model was re-trained, and those results are presented here. Training- and test-set confusion matrices can be found in Figs. 4a and 5a, respectively. Top-k accuracy for both the training and test sets can be found in Table 4.

MevonAI [11] was originally trained on the RAVDESS dataset [12]. The model was re-trained, and those results are presented here. Training- and test-set confusion matrices can be found in Figs. 4b and 5b, respectively. Top-k accuracy for both the training and test sets can be found in Table 5.

Multi-Modal Speech Emotion Recognition (MSER) [16] was originally trained on the IEMOCAP dataset [17]. The model was re-trained on RAVDESS [12], and those results are presented here. Training- and test-set confusion matrices can be found in Figs. 4c and 5c, respectively.
Top-k accuracy for both the training and test sets can be found in Table 6.

After filtering, the dataset contains 75 videos. These videos are truncated at 10 minutes' duration and sliced into 10-second segments. Each segment is labelled with one of the following six categories, based on the question "to what degree should a robot be rewarded for having caused this interaction to have occurred?"

• +2 strongly positive
• +1 slightly positive
• 0 neutral
• −1 slightly negative
• −2 strongly negative

The Social Reward Function was exposed to each of the clips, producing a reward signal through time. Cumulative reward, as well as facial-only, speech-only and presence-only rewards and FER/SER class probabilities, are recorded. For the purposes of evaluation, these are averaged across each 10-second clip to allow direct comparison with the clip's ground-truth annotation.

The Pearson correlation coefficient, r, is a measure of the linear correlation between two variables. It is defined as the covariance of the two variables divided by the product of their standard deviations. r lies in the range [−1, 1], with r = 1 corresponding to a perfect positive linear relationship, r = −1 to a perfect negative linear relationship, and r = 0 implying the two variables are linearly uncorrelated.

Referring to Table 8, the Social Reward Function demonstrates good, but not excellent, correlation with ground truth annotations. Notably, despite the shortcomings of SER discussed previously, the inclusion of SER in the Social Reward Function does improve correlation with ground truth labels.

Referring to Figure 6 and Table 9, we can see a clear positive relationship between median predicted reward and ground truth labels. The median predicted reward for the strongly and slightly negative labels is below the 25th percentile of the positive labels; conversely, the median predicted reward for the strongly and slightly positive labels is above the 75th percentile of the negative labels. This is a good indication that the Social Reward Function is able to distinguish positive and negative situations, assigning them positive and negative rewards respectively.

Again referring to Figure 6 and Table 9, we can see that the neutral class presents significant difficulty to the Social Reward Function. This is due to significant misclassification of neutral scenes as negative. By sampling several neutral scenes at and around the 0th and 25th percentiles of predicted reward, we make the observations summarised in Table 7.

| Percentile | Assessment | Comment |
| --- | --- | --- |
| 25th | Incorrect classification (high arousal) | Interview with a male who is perhaps defensive or assertive in an argument about animal rights |
| 25th | Incorrect classification | Jewish rabbi talking at a podium |
| 25th | Incorrect classification (high arousal) | Chris D'Elia (comedian) talking on his podcast. His delivery is quite aggressive and is predicted as frustration/anger, despite being neutral. |
| 0th | Incorrect annotation (not neutral) | A news contributor says "you can't have any respect for someone who acts like that" |
| 0th | Incorrect classification (high arousal) | Robert Gates (ex US Secretary of Defense) speaking with neutral valence |
| 0th | Incorrect annotation (difficult definition) | Split-screen news. Male anchor is speaking with neutral valence; female contributor is not speaking but has a displeased expression |

We can see from Table 7 that many clips demonstrating neutral emotion are misclassified due to inherent ambiguity in the definition of displayed emotions.
A component of this is the difficulty in distinguishing arousal (the strength of an emotion) from classes of emotion. For instance, a neutral but aroused emotional state is sometimes misclassified as angry/frustrated. This may suggest that future works should include component models providing arousal in addition to emotion classification.

Referring to Figure 7e, we see that the presence reward is strongly left-skewed. This is expected, as the dataset is filtered to contain only human interactions. It is likely important that the model be verified in situations where humans are not present, as this is likely to occur frequently in social robotics experiments. A dataset without humans present is cheap to obtain for a particular experiment, but expensive to obtain in general (since there is extreme diversity in the types of environments a robot might be deployed to). Hence it is considered more desirable that this validation be conducted by end-users in their particular deployment environment, rather than in general as part of this work.

References

Robot non-verbal communication and machine learning: A survey and critical discussion
Early social cognition: Understanding others in the first months of life
Social cognition in infancy: A critical review of research on higher order abilities
The perception of facial expressions in newborns
Emotional expressions of young infants and children. Infants & Young Children
Development of social skills in children: Neural and behavioral evidence for the elaboration of cognitive models
Prenatal experience and neonatal responsiveness to vocal expressions of emotion
Robot gains social intelligence through multimodal deep reinforcement learning. CoRR, abs/1702.07492
Facial expression recognition using residual masking network
MevonAI speech emotion recognition
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English
Emotion recognition using speech
Toronto Emotional Speech Set (TESS)
A database of German emotional speech
Multimodal speech emotion recognition and ambiguity resolution
IEMOCAP: Interactive emotional dyadic motion capture database
AIR-Act2Act: Human-human interaction dataset for teaching non-verbal social behaviors to robots
NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding
The Kinetics human action video dataset
Spatio-temporal detection of fine-grained dyadic human interactions
Efficient interaction recognition through positive action representation
Two-person interaction detection using body-pose features and multiple instance learning
First-person activity recognition: What are they doing to me?
UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)
High Five: Recognising human interactions in TV shows
Actions in Context
D4RL: Datasets for deep data-driven reinforcement learning. CoRR
Papers with Code: Browse the state-of-the-art in machine learning
How to train your robot with deep reinforcement learning: Lessons we have learned
Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR
RoboNet: Large-scale multi-robot learning
RoboTurk: A crowdsourcing platform for robotic skill learning through imitation
Unsolved problems in ML safety
On the importance of hyperparameter optimization for model-based reinforcement learning
World Models. CoRR, abs/1803.10122
CycleGAN, a master of steganography
Is deep reinforcement learning really superhuman on Atari? Leveling the playing field
The first level of Super Mario Bros. is easy with lexicographic orderings and time travel... after that it gets a little tricky
Neural modularity helps organisms evolve to learn new skills without forgetting old skills
End-to-end robotic reinforcement learning without reward engineering
Artificial intelligence may doom the human race within a century, Oxford professor says
Degenerate feedback loops in recommender systems
Origins of human cooperation and morality
A developmental perspective for promoting theory of mind
Adversarial policies: Attacking deep reinforcement learning