title: A trained humanoid robot can perform human-like crossmodal social attention and conflict resolution
authors: Fu, Di; Abawi, Fares; Carneiro, Hugo; Kerzel, Matthias; Chen, Ziwei; Strahl, Erik; Liu, Xun; Wermter, Stefan
date: 2021-11-02

Due to the aging population and the digitalisation of daily life, humanoid robots could be seen as potential community resources to accompany the elderly, support remote work, and improve individuals' mental and physical health. To enhance human-robot interaction, it is essential for robots to become more socialised by processing multiple social cues in a complex real-world environment. To this end, our study adopted the neurorobotic paradigm of gaze-triggered audio-visual crossmodal conflict resolution to make an iCub robot express human-like social attention responses. For the human study, a behavioural experiment was conducted with 37 participants. To improve ecological validity, we designed a round-table meeting scenario with three animated avatars. Each avatar wore a medical mask covering the facial cues of the nose, mouth, and jaw. The central avatar was capable of gaze shifting, while the two peripheral avatars were capable of sound generation. The gaze direction and the sound location were either congruent or incongruent. We observed that the central avatar's dynamic gaze could trigger crossmodal social attention, with better human performance in the audio-visual congruent condition than in the incongruent condition. For the robot study, our saliency prediction model was trained to implement social cue detection, audio-visual saliency prediction, and selective attention. After training, the iCub robot was exposed to laboratory conditions similar to those of the participants. While overall human performance was superior, our trained model demonstrates that it can replicate human-like attention responses with respect to performance in the congruent and incongruent conditions.

Robots are increasingly becoming an integral part of daily life. For instance, robots play an assistive role in mitigating the secondary effects of COVID-19 outbreaks by improving mental health, supporting remote education and work during social distancing, and boosting manufacturing and economic recovery [61, 51, 18]. The need for such solutions encourages the design of socially functional robots that can meet increasingly demanding challenges. For example, researchers have developed robots that perform viral nucleic acid testing [36], provide hotel services [31], recognise masked faces [47], offer companionship during social distancing [2], and more. Robots are no longer seen as ordinary machines but as social actors capable of processing and responding to multimodal social cues in the human world. Computational models trained on human fixation datasets can already integrate multiple social cues and predict saliency in natural human interaction scenarios [1, 13]. However, practical and commercial applications in humanoid robots are still scarce, making this study an important step towards building humanoid agents that can process information within a complex social environment as humans do. Moreover, to better understand people's intentions, it is also crucial to explore how humans process information and the cognitive mechanisms underlying human behaviour.
The current study adopts a dynamic gaze-cueing paradigm for testing the attentional orienting effect of eye gaze on auditory target detection. Pursuing realistic experimental scenarios, we use the avatar stimuli and meeting scenario published in our previous work on audio-visual crossmodal spatial attention for sound localisation [44, 23]. In these two studies, a 4-avatar round-table meeting scenario experiment was conducted with human participants. During the task, lip movement and arm movement were used as visual cues, which were spatially congruent or incongruent with the auditory target. Our previous findings indicate that lip movement is more salient than arm movement, showing a stronger visual bias on auditory target localisation. This is due to the strong natural association between mouth movements and human voice production. Furthermore, previous research also reveals that head orientation is a crucial social cue for triggering the reflexive attention of the observer [34]. In our current study, to balance ecological validity against the control of confounding variables, we adopt a new 3-avatar round-table meeting scenario. The central avatar shifts its eyes with a slight tilt of the head and upper body towards the direction of gaze. To avoid the distraction of lip movement, all three avatars wear medical masks that hide their nose, mouth, and jaw movements. This task design is also inspired by current social norms: in multiperson social contexts during the COVID-19 pandemic, the use of medical masks is unavoidable. Research shows that wearing masks decreases both adults' and children's face recognition abilities [56, 26]. As a result, humans have to rely on gaze cues to compensate for the lack of visible lip movement when identifying social intentions [16].

For the robotic experiment in this work, an iCub head is used to simulate human social attention in the audio-visual crossmodal meeting scenario described above. The modified GASP (Gated Attention for Saliency Prediction) model [1] runs on the iCub head to predict crossmodal saliency. GASP can detect social cues from eye gaze information and sound source localisation, producing feature maps for each of these cues. By prioritising some social cues over others through a directed attention module and then sequentially integrating them, GASP provides a fixation density map that resembles human social attention, allowing the iCub head to produce human-like responses in our experimental scenario.

The goals of our current study are twofold. First, we aim to measure human responses in an audio-visual crossmodal social attention scenario with life-like stimuli to determine the orienting effect of eye gaze on sound localisation. Second, we aim to make humanoid robots simulate human behavioural patterns by training the GASP model on human behaviour and testing the robot under similar laboratory conditions. Thus, comparisons between human and iCub responses are performed using congruent and incongruent audio-visual spatial localisation conditions in the gaze-cueing task. In this study, the Stimulus-Response Compatibility (SRC) effect is measured to assess the conflict resolution ability of the participants and the iCub robot. This effect occurs when stimulus and response in an SRC paradigm are spatially incongruent: participants show inferior performance (e.g., lower accuracy and slower responses) in incongruent conditions compared with congruent conditions [6].
Larger SRC effects indicate weaker conflict resolution ability [37]. By enabling robots to process human social cues and interpret human intentions, our study aims to help humanoid robots collaborate more closely with humans and better serve society. In line with these research goals, the current study proposes the following hypotheses: H1: Eye gaze exerts a reflexive attentional orienting effect on the auditory target for both humans and the robot, resulting in better performance in the congruent condition than in the incongruent one, with shorter reaction times and lower error rates. H2: The robot performs worse than humans and shows larger Stimulus-Response Compatibility effects, because the task is considerably complex for a neurorobotic model, given its crossmodality and the use of synthetic, although realistic, inputs [44].

For humans, social attention is a fundamental function for sharing and conveying information with other agents, contributing to the development of social cognition [39]. Social attention allows humans to quickly capture and analyse others' facial expressions, voice, gestures, and social behaviour so that they can participate in social interaction and adapt within society [34, 35]. Furthermore, this social function enables the recognition of others' intentions and the capture of relevant occurrences in the environment (e.g., frightening stimuli, novel stimuli, reward, etc.) [43]. The neural substrates underlying social attention are brain regions responsible for processing social cues and encoding human social behaviour, including the orbital frontal and middle frontal gyrus, superior temporal gyrus, temporal horn, amygdala, anterior precuneus, temporoparietal junction, anterior cingulate cortex, and insula [43, 5]. From a developmental perspective, infants' attention to social cues helps them quickly learn how to interact with others, learn language, and build social relationships [55]. However, dysfunctional social attention may lead to mental disorders [4] and poor social skills [27]. For example, infants with Autism Spectrum Disorder (ASD) are born with less attention to social cues, an inability to follow the gaze of others, and a fear of looking directly at human faces [52]. This results in their failure to understand others' intentions and to engage in typical social interactions. At present, research on the developmental mechanisms of social attention is still in its early stages. Exploring these scientific questions will be significant for understanding the mechanisms of interpersonal social behaviour and for developing clinical interventions for individuals diagnosed with ASD.

One of the most critical manifestations of social attention is the ability to follow others' eye gaze and respond accordingly [53]. Eye gaze has been shown to have higher social saliency and to be prioritised over other social cues [34], since it reveals the direction of another person's attention and intention [22]. Gaze following is thought to be the foundation of more sophisticated social and cognitive functions shaped by evolution, such as theory of mind, social interaction, and survival strategies [9, 34]. Even infants can track their parents' eye gaze before they are able to speak or walk [19, 29]. Moreover, gaze following contributes significantly to the language development of infants [12].
Psychological studies use a modified Posner cueing task [46], also known as the gaze-cueing task [20], to study the reflexive attentional orienting generated by eye gaze. During the task, an eye gaze is presented as the visual cue in the middle of the screen, followed by a peripheral target, which can be spatially congruent (e.g., a rightward gaze shift followed by a square frame or a Gabor patch shown on the right side of the screen) or incongruent. However, studying the visual modality alone is not enough to reveal how humans can quickly recognise social and emotional information conveyed by others in an environment full of multimodal information [10]. Being able to select information from the environment across different sensory modalities allows humans to detect crucial information such as life threats and survival-relevant events [40, 24]. Therefore, several studies using a crossmodal gaze-cueing task have demonstrated the reflexive attentional effect of a visual cue on an auditory target [17, 38]. The majority of these studies rely on images of gaze shifts as visual cues to trigger the observers' social attention [40, 42]. However, these images are not dynamic and lack ecological validity.

A total of 37 participants (female = 20) took part in this experiment. Participants were aged from 18 to 29 years, with a mean age of 22.89 years. All participants reported that they had no history of neurological conditions (seizures, epilepsy, stroke, etc.) and had either normal or corrected-to-normal vision and hearing. This study was conducted in accordance with the principles expressed in the Declaration of Helsinki. Each participant signed a consent form approved by the Ethics Committee of the Institute of Psychology, Chinese Academy of Sciences.

Virtual avatars were chosen over recordings of real people, as the experiment requires strict control over the behaviour of the avatars in terms of both timing and exact motion. By using synthetic data as the experimental stimuli, it can be ensured, for instance, that looking to the left and to the right are exactly symmetrical motions, thus avoiding any possible bias. Moreover, the use of three identical avatars that differ only in clothing colour also alleviates biases towards individual persons that could arise in a real setting. The static basis for the highly realistic virtual avatars was created in MakeHuman. Based on these avatar models, a data generation framework for research on shared perception and social cue learning with virtual avatars [30] (realised in Blender and Python) creates the animated scenes with the avatars, which are used as the experimental stimuli in this study. The localised sounds are created from a single sound file using a head-related transfer function that modifies the left and right audio channels to simulate different latencies and damping effects for sounds coming from different directions. In our 3-avatar scenario, the directions are frontal left and frontal right at 60 degrees, corresponding to the positions where the peripheral avatars stand.
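The actual stimuli were produced with a proper head-related transfer function; as a rough, hedged illustration of the underlying idea (delaying and damping one channel to lateralise a source), the sketch below pans a mono recording to ±60 degrees. The function name, the Woodworth ITD approximation, and the 6 dB interaural level value are our assumptions, not the generation pipeline used for the experiment.

```python
import numpy as np

def binauralise(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0, ild_db_at_90=6.0):
    """Crude interaural time/level difference approximation (illustration only).

    mono        : 1-D array with the monaural "hello" recording.
    sr          : sample rate in Hz.
    azimuth_deg : source azimuth; +60 corresponds to the right avatar, -60 to the left.

    This is NOT the head-related transfer function used for the real stimuli;
    it only shows how the two channels can be delayed and damped differently.
    """
    az = np.deg2rad(azimuth_deg)
    itd = (head_radius / c) * (az + np.sin(az))        # Woodworth ITD approximation (seconds)
    delay = int(round(abs(itd) * sr))                  # delay in samples for the far ear
    gain_far = 10 ** (-(ild_db_at_90 * abs(azimuth_deg) / 90.0) / 20.0)

    near = mono
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * gain_far
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right], axis=1)             # shape: (n_samples, 2)

# usage (assuming a mono WAV loaded with any audio library):
# import soundfile as sf
# mono, sr = sf.read("hello_mono.wav")
# sf.write("hello_right.wav", binauralise(mono, sr, azimuth_deg=60), sr)
```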
During the experiment, the participants sit at a desk, positioned 55 centimetres from the desktop screen, and wear headphones, as depicted in Figure 1a. In each trial, a fixation cross first appears in the middle of the screen for a duration of 100-300 milliseconds, chosen with equal probability. Next, a visual cue is displayed for 400 milliseconds, consisting of an eye gaze shift and a synchronised slight head and upper body shift from the central avatar. In each trial, the central avatar randomly chooses to look at the avatar on the right, at the one on the left, or directly towards the participant, meaning no eye gaze shift at all. Afterwards, either the left or the right avatar says "hello" with a human male voice as the auditory target. This step lasts for 700 milliseconds. Finally, another fixation cross is shown at the centre of the screen for 700, 800, or 900 milliseconds, with equal probability, until the end of the trial (cf. Figure 1c for a schematic representation of the trial). During the task, the participants are asked to determine as quickly and precisely as possible whether the auditory stimulus originated from the avatar on the left or the one on the right. The participants make their decisions by pressing the keys "F" and "J" on the keyboard, corresponding to the left and right avatar, respectively. The participants' responses during the display of the auditory target and the second fixation are recorded. The stimulus display and the response recording are both controlled by E-prime 2.0. In the current study, all participants reported perceiving the simulated masks as real ones.

In the experimental design, there are three directions of the visual cue (left, right, and central) and two directions of the auditory target localisation (left, right). Thus, there are three audio-visual crossmodal congruency conditions: congruent, incongruent, and neutral. The participants begin the experiment with 30 practice trials and enter the formal test once their accuracy on the practice trials reaches 90%. Each condition is repeated 96 times, with a total of 288 trials separated into 4 blocks. There is a 1-minute rest between blocks. The duration of each trial is 1900-2300 milliseconds, and the formal test lasts for 12 minutes.

Reaction time (RT) and error rates (ER) are analysed as human response indices. For the RT analysis, error trials, trials with RTs shorter than 200 milliseconds, and trials with RTs more than 3 standard deviations above or below the mean were excluded, which corresponds to 2.42% of the data being removed. To examine the Stimulus-Response Compatibility effects of the audio-visual crossmodal conflict task, a one-way repeated measures analysis of variance (ANOVA) is used to test differences in the participants' responses under the three congruency conditions (congruent, incongruent, and neutral). All post hoc tests in the current study use Bonferroni correction.

A repeated measures ANOVA with a Greenhouse-Geisser correction shows that the participants' RT differs significantly between congruency conditions, F(2, 34) = 24.19, p < .001, ηp² = .40 (see Figures 2a and 2b). Post hoc tests show that the participants responded significantly faster under the congruent condition (mean ± SE = 466.25 ± 14.92 ms) than under both the incongruent condition (mean ± SE = 485.12 ± 14.82 ms, p < .001) and the neutral condition (mean ± SE = 485.11 ± 14.80 ms, p < .001). However, the difference between the incongruent and neutral conditions was not significant, p > .05. A repeated measures ANOVA with a Greenhouse-Geisser correction shows that the participants' ER differs significantly between congruency conditions, F(2, 34) = 5.69, p < .05, ηp² = .14 (see Figures 2c and 2d). Post hoc tests show that the participants presented significantly lower ER under the congruent condition (mean ± SE = .02 ± .002) than under the incongruent condition (mean ± SE = .03 ± .004), p < .01. However, there was no statistically significant difference between the neutral condition (mean ± SE = .02 ± .003) and either of the other congruency conditions, p > .05 in both cases.
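A hedged sketch of this preprocessing and analysis pipeline is given below (not the authors' scripts). It assumes a hypothetical long-format trial log human_trials.csv with columns participant, congruency, rt, and correct, applies the exclusion criteria described above (here per participant), and runs a Greenhouse-Geisser-corrected repeated-measures ANOVA with Bonferroni-corrected post hoc tests using the pingouin package.

```python
import pandas as pd
import pingouin as pg  # assumed to be installed; provides GG-corrected rm-ANOVA

# Hypothetical long-format trial log with one row per trial and columns:
# participant, congruency ('congruent'/'incongruent'/'neutral'), rt (ms), correct (0/1).
trials = pd.read_csv("human_trials.csv")

# RT preprocessing as described above: drop error trials, RTs below 200 ms,
# and RTs beyond 3 standard deviations of the mean (applied per participant here).
rt = trials[(trials.correct == 1) & (trials.rt >= 200)].copy()
z = rt.groupby("participant").rt.transform(lambda x: (x - x.mean()) / x.std())
rt = rt[z.abs() <= 3]

# Participant-level condition means, then a repeated-measures ANOVA with
# Greenhouse-Geisser correction and Bonferroni-corrected post hoc comparisons.
cell_means = rt.groupby(["participant", "congruency"], as_index=False).rt.mean()
print(pg.rm_anova(data=cell_means, dv="rt", within="congruency",
                  subject="participant", correction=True))
print(pg.pairwise_tests(data=cell_means, dv="rt", within="congruency",
                        subject="participant", padjust="bonf"))  # pairwise_ttests in older pingouin
```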
The process of predicting saliency is divided into two stages. The first, Social Cue Detection (SCD), is responsible for extracting social cue feature maps (FM) from a given image. Figure 3a depicts the architecture of the SCD stage. Given a sequence of images and their corresponding high-level feature maps, the second stage, GASP, then predicts the corresponding saliency region. The architecture of GASP is shown in Figure 3b. Originally, SCD comprised four modules, each responsible for extracting a specific social cue [1]. Those modules encompassed gaze following (GF), gaze estimation (GE), facial emotion recognition (FER), and audio-visual saliency prediction (AVSP). For the current task, however, the FER module is not employed, since facial expressions are largely obscured by the medical masks. Also, in order to better replicate the experiments conducted with the participants, the iCub robot receives auditory stimuli through both ears. Originally, AVSP was implemented to work with monaural stimuli. To operate on binaural stimuli, the AVSP module is replaced by a sound source localisation module (SSL), denoted as "SSL model" in Figure 3a, whose architecture is shown in Figure 3c.

Video streams used as input are split into frames and their corresponding audio signals. For every video frame and corresponding audio signal, SCD extracts the aforementioned high-level feature maps, which are then sent as input to GASP. A directed attention module (DAM) weighs the FM channels so as to emphasise those which are more unexpected. Convolutional layers further encode those weighted FM channels; in Figure 3b, they are denoted as "Enc." (for encoder). Note that there are as many encoders as there are FM channels, and that each encoder is responsible for encoding a particular feature map. The encoded feature maps of all video frames are then finally integrated via a recurrent extension of the convolutional variant of the gated multimodal unit (GMU) [8]. The GMU's mechanism weighs the features of its inputs. The addition of a convolutional aspect accounts for the preservation of the spatial properties of the input features, and the recurrent property of the integration unit considers the whole sequence of frames by performing the gated integration at every timestep. For the purpose of this work, the LARGMU (Late Attentive Recurrent Gated Multimodal Unit) is used because of its high performance compared to other GMU-based models [1]. Since LARGMU is based on the convolutional GMU, it preserves the input spatial features, and its recurrent nature allows it to integrate those features sequentially. The addition of a soft-attention mechanism based on the convolutional Attentive Long Short-Term Memory (ALSTM) [15] prevents gradients from vanishing as feature sequences become sufficiently large. As the name implies, LARGMU is a late fusion unit, which means that the gated integration is performed after the input channels are concatenated and, in sequence, propagated to the ALSTM.
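To make the gating idea concrete, the following is a minimal PyTorch sketch of a two-input convolutional GMU; the class name is hypothetical, and the actual LARGMU additionally wraps such gating in an attentive convolutional LSTM so that whole frame sequences are integrated recurrently.

```python
import torch
import torch.nn as nn

class ConvGMU2(nn.Module):
    """Minimal two-input convolutional gated multimodal unit (sketch only).

    Each modality is first transformed with a tanh convolution; a sigmoid
    gate computed from both inputs then decides, per spatial location and
    channel, how much each modality contributes to the fused feature map.
    """
    def __init__(self, channels):
        super().__init__()
        self.h1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.h2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x1, x2):
        h1 = torch.tanh(self.h1(x1))                               # candidate features, modality 1
        h2 = torch.tanh(self.h2(x2))                               # candidate features, modality 2
        z = torch.sigmoid(self.gate(torch.cat([x1, x2], dim=1)))   # per-pixel, per-channel gate
        return z * h1 + (1 - z) * h2                               # gated fusion keeps the spatial layout

# usage: fuse a gaze-following map with a sound-localisation map (both B x C x H x W)
# fused = ConvGMU2(channels=32)(gaze_fm, ssl_fm)
```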
The SSL model builds on DAVE, an audio-visual saliency prediction model that receives as input a video stream and an audio stream, which are projected into a feature space by 3D-ResNets (one for each input stream). Its encoder is followed by a convolutional saliency decoder that upscales the latent representation and provides the corresponding saliency map. For our current work, DAVE is extended to accept binaural input; this binaural extension follows a rationale similar to that of the monaural DAVE (see Figure 3c). The main difference is the use of two 3D-ResNets to process the auditory modality, whose output features are concatenated and then encoded and downsampled by a two-dimensional convolutional layer. This layer guarantees that the dimension of the feature produced by this part of the architecture matches that of the feature produced by DAVE's original audio-stream 3D-ResNet. DAVE's pretrained weights are transferred to its binaural extension accordingly: the pretrained weights of DAVE's original audio-stream 3D-ResNet are cloned to both 3D-ResNets associated with binaural DAVE's left and right audio channels, and the 1 × 1 convolutional layer that encodes the concatenated audio features is initialised using the normal initialisation method of He et al. [28]. During model training, all model parameters are optimised except for those of the video-stream 3D-ResNet, which are kept frozen throughout the training procedure.

The binaural DAVE is trained on a subset of the FAIR-Play dataset [25], comprising 500 randomly chosen videos. The FAIR-Play dataset consists of 1,781 video clips of individuals playing instruments in a room. Auditory inputs are binaural, and the source location of the sound is provided in the dataset. Even though the trials presented to the iCub have a speech signal as auditory input, given that the task tackled in this research relates to sound localisation, we hypothesised that the sound source should still be correctly identified regardless of the domain. Similarly to its monaural counterpart, training binaural DAVE consists of minimising the Kullback-Leibler divergence between the predicted and ground-truth fixation maps. The Adam optimiser is used for the minimisation over 50 epochs with mini-batches of 64 dataset entries.
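As a hedged sketch of what such a binaural audio branch and training objective could look like (the module and function names are ours, the encoder is assumed to return spatial feature maps, and this is not the authors' implementation):

```python
import copy
import torch
import torch.nn as nn

class BinauralAudioBranch(nn.Module):
    """Sketch of the binaural extension described above.

    Two copies of the pretrained monaural audio encoder process the left and
    right channels; their feature maps are concatenated and reduced by a
    He-initialised 1x1 convolution so that the output matches the shape
    expected by the rest of the (monaural) architecture.
    """
    def __init__(self, mono_encoder: nn.Module, feat_channels: int):
        super().__init__()
        self.left = mono_encoder                    # pretrained monaural weights
        self.right = copy.deepcopy(mono_encoder)    # cloned weights for the other ear
        self.reduce = nn.Conv2d(2 * feat_channels, feat_channels, kernel_size=1)
        nn.init.kaiming_normal_(self.reduce.weight, nonlinearity="relu")  # He et al. init
        nn.init.zeros_(self.reduce.bias)

    def forward(self, left_audio, right_audio):
        fl = self.left(left_audio)                  # B x C x H x W features, left ear (assumed shape)
        fr = self.right(right_audio)                # B x C x H x W features, right ear
        return self.reduce(torch.cat([fl, fr], dim=1))

def kl_saliency_loss(pred, target, eps=1e-8):
    """KL divergence between predicted and ground-truth fixation maps,
    each normalised to sum to one over the image (the training objective)."""
    p = pred.flatten(1) / (pred.flatten(1).sum(dim=1, keepdim=True) + eps)
    q = target.flatten(1) / (target.flatten(1).sum(dim=1, keepdim=True) + eps)
    return (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1).mean()

# training sketch: loss = kl_saliency_loss(predicted_map, fixation_map), optimised
# with torch.optim.Adam over 50 epochs, mini-batches of 64, video 3D-ResNet frozen.
```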
The GASP architecture used in the experimental setup consists of the pretrained GASP ablated by removing the FER input stream. In addition, the input streams originally provided by SCD's AVSP model are substituted by those provided by the trained binaural DAVE acting as a sound source localisation model. Even though one input stream was removed and another one was substituted, there was no need to fine-tune GASP [1] due to its robustness. This is supported by experiments indicating that the performance of GASP with the addition of a given pretrained saliency prediction model M was higher than that of M alone [1]. GASP receives four sequences of data as input: one sequence of consecutive frames of the original video and three sequences of feature maps, one for each model in the SCD part. For this experimental work, sequences of 10 frames are used (cf. timesteps t0 to t9 in Figure 3b). The number of frames received as input by each model in SCD varies due to the dissimilar nature of their original specifications. The SSL model receives a sequence of 16 frames as input, and the GE and GF models receive a sequence of 7 frames each. A more detailed explanation of how the frames are selected based on the timestep being processed is provided in [1]. Regarding the audio input, a whole one-second chunk is offered as input to each of the audio 3D-ResNets of the SSL model regardless of the timestep. In this experiment, GASP is embedded in an iCub robot, which, in turn, is subjected to the same series of one-second videos as the participants.

In summary, the one-second chunk used as input to the SSL model's audio 3D-ResNets consists of the entire audio of an event. Different streams of audio are sent as input to each of the audio 3D-ResNets depending on where the audio signal originates. After the iCub acquires the visual and auditory inputs, the social cue detectors and the sound source localisation model extract features from those audio-visual frames. Following the detection and generation of the feature maps, those are propagated to GASP, which, in turn, predicts a fixation density map (FDM) F: Z² → [0, 1], which is displayed in the form of a saliency map for a given frame. The FDM peak (x_F, y_F) is determined by calculating

(x_F, y_F) = argmax_{x,y} F(x, y).  (1)

The values of x_F and y_F, originally in pixels, are then normalised to scalar values x̂_F and ŷ_F within the [−1, 1] range, such that

x̂_F = 2 x_F / l_x − 1,  ŷ_F = 2 y_F / l_y − 1,  (2)

where l_x and l_y are the lengths of the FDM in the horizontal and vertical axes, i.e., its width and height measured in pixels, respectively. A value of x̂_F = −1 represents the leftmost point and x̂_F = 1 the rightmost one. Regarding the vertical axis, ŷ_F = −1 represents the highest point and ŷ_F = 1 the lowest one. The robot is actuated to look towards the FDM peak. For simplicity, eye movement is assumed to be independent of the exact camera location relative to the playback monitor. For all experiments, only the iCub eyes were actuated, disregarding microsaccadic movements and vergence effects. The positions the iCub should look at are expressed in Cartesian coordinates, assuming the monitor to be at a distance of d = 30 centimetres from the image plane. To limit the viewing range of the eyes, x̂_F and ŷ_F are scaled down by a factor of 0.3. Those Cartesian coordinates are then converted to spherical coordinates θ and φ via

θ = arctan(0.3 x̂_F / d),  φ = arctan(0.3 ŷ_F / d),  (3)

where θ and φ are the yaw and pitch angles, respectively. These angles are used to actuate the eyes of the iCub such that they pan by at most approximately 27° and tilt by at most approximately 24°.

To replicate the experiments with participants as closely as possible for the iCub head, some technical adjustments proved necessary. First, the iCub head was placed at a distance of approximately 30 centimetres from a 24-inch monitor (1920 × 1200 pixel resolution), as depicted in Figure 1b. This distance is shorter than the 55 centimetres at which the participants sat from the desktop screen. The distance reduction was performed so that the iCub's field of vision covers a larger portion of the monitor: since the robot lacks foveated vision, its attention is distributed uniformly over all visible regions, which would otherwise cause the robot to attend to irrelevant environmental changes or visual distractors. Second, the previous robot eye fixation position needed to be retained as a starting point for the next trial, as a means to provide scenery variations to the model. This approach closely resembles the human experiment by intending to simulate a memory mechanism like the one observed in humans. Direct light sources also needed to be switched off to avoid glare. Once the experimental setup was ready, the pipeline started the video playback in fullscreen mode, simultaneously capturing a 30-frame segment of the video using a single iCub camera (http://wiki.icub.org/wiki/Cameras) along with one-second audio recordings from each microphone (http://wiki.icub.org/wiki/Microphones) mounted on the iCub's ears.
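A small NumPy sketch of the peak-to-gaze mapping described above is given below; the function name, the arctangent conversion, and the sign conventions are our assumptions rather than the iCub's actual gaze controller.

```python
import numpy as np

def fdm_to_eye_angles(fdm, d=0.3, scale=0.3):
    """Map a fixation density map to eye yaw/pitch angles (sketch only).

    fdm   : 2-D array with one saliency value per pixel (values in [0, 1]).
    d     : assumed distance from the eyes to the monitor plane, in metres.
    scale : factor limiting the viewing range, as described in the text.

    Returns (theta, phi) in degrees; the angle convention is an assumption.
    """
    ly, lx = fdm.shape
    y_pk, x_pk = np.unravel_index(np.argmax(fdm), fdm.shape)   # FDM peak, cf. eq. (1)
    x_hat = 2.0 * x_pk / lx - 1.0                              # -1 = leftmost, +1 = rightmost
    y_hat = 2.0 * y_pk / ly - 1.0                              # -1 = top, +1 = bottom
    theta = np.degrees(np.arctan2(scale * x_hat, d))           # yaw
    phi = np.degrees(np.arctan2(scale * y_hat, d))             # pitch
    return theta, phi

# usage on a toy map whose peak sits in the right half of the image:
# fdm = np.zeros((120, 192)); fdm[60, 150] = 1.0
# print(fdm_to_eye_angles(fdm))   # positive yaw -> look to the right
```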
In the current study, the iCub head responded to the auditory target by shifting its eyes, which differs from how the participants responded. This difference could lead to systematic differences in RT; therefore, the RT of the robot was neither measured nor analysed. Nevertheless, it is worth noting that even though humans and the robot respond differently to a trial, the task they perform is essentially the same, so ER can be properly measured and analysed as the robot's response index. A one-way repeated measures ANOVA is used to test the SRC effects of the robot's responses under the three congruency conditions (congruent, incongruent, and neutral). All post hoc tests in the current study use Bonferroni correction. Additionally, an independent t-test is conducted to compare the SRC effects between humans and the robot. The SRC effect is measured by subtracting congruent responses from incongruent responses.

A repeated measures ANOVA with a Greenhouse-Geisser correction showed that the robot's ER differed significantly between congruency conditions, F(2, 34) = 8.02, p < .01, ηp² = .18 (see Figures 2e and 2f). Post hoc tests showed that the robot presented significantly lower ER under the congruent condition (mean ± SE = .37 ± .01) than under the incongruent condition (mean ± SE = .41 ± .01), p < .01. However, there was no statistically significant difference between the neutral condition (mean ± SE = .38 ± .01) and either of the other congruency conditions, p > .05 in both cases. The t-test showed that the robot exhibited a significantly larger SRC effect (mean ± SE = .04 ± .001) than humans (mean ± SE = .01 ± .01), t(72) = 2.35, p < .05 (see Figure 5).
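A hedged sketch of this comparison, with placeholder data standing in for the real per-participant and per-run error rates (group sizes of 37 are assumed from the reported degrees of freedom, t(72)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder error rates standing in for the real data; shapes and values are illustrative only.
human_er_con, human_er_inc = rng.normal(0.02, 0.01, 37), rng.normal(0.03, 0.01, 37)
robot_er_con, robot_er_inc = rng.normal(0.37, 0.05, 37), rng.normal(0.41, 0.05, 37)

# SRC effect = incongruent minus congruent error rate, computed per participant / run.
human_src = human_er_inc - human_er_con
robot_src = robot_er_inc - robot_er_con

# Independent-samples t-test comparing the size of the SRC effect between
# the robot and the humans, mirroring the analysis reported in the text.
t, p = stats.ttest_ind(robot_src, human_src)
print(f"SRC(robot) = {robot_src.mean():.3f}, SRC(human) = {human_src.mean():.3f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```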
Our current neurorobotic study investigated human attentional responses and modelled human-like responses with the humanoid iCub robot head in an audio-visual crossmodal social attention meeting scenario. In line with the research goals, the main findings of the current study are also twofold. First, consistent with previous crossmodal social attention research [41, 54], our study shows that the direction of the visual cue enhances the detection of a subsequent auditory target occurring in the same direction, although from a different modality. Notably, in the current study, a dynamic gaze shift with corresponding head and upper body movements is used as the visual cue stimulus. This replicates previous findings from studies using static eye gaze [42, 27], showing a robust reflexive attentional orienting effect. More specifically, the participants show longer RT and higher ER under the audio-visual incongruent condition than under the congruent one. Some previous research found that eye gaze has a stronger attentional orienting effect than simple experimental stimuli (e.g., arrows) [21, 48]. Although our current study does not include any conditions using arrows as visual cues, we demonstrate for the first time that realistic and dynamic social cues can have a similar effect in a human crossmodal social attention behavioural study. Second, the results of the iCub responses demonstrate a successful human-like simulation: with the GASP model, the iCub robot exhibited attentional patterns similar to those of humans, even in a complex crossmodal scenario. Lastly, the statistical comparison of the SRC effects between humans and the iCub shows that the robot experienced a larger conflict effect than humans.

In the human experiment, in line with our hypotheses, the social cues that triggered social attention were extended to multiple modalities. Our results support the social nature of gaze cueing and the view that stimulus-driven shifts of auditory attention might be mediated by information from other modalities [53]. Furthermore, unlike previous gaze-cueing experiments [41], we added a neutral condition to make the meeting scenario more realistic. At the same time, the neutral condition served as a baseline for comparison with the other conditions. In the neutral condition, the participants only saw a static meeting scenario without any dynamic social visual cues before the auditory target appeared. Since there was no conflict between audio-visual stimuli in the neutral condition, we assumed there should be no difference between the congruent and neutral conditions. However, contrary to our expectation, the experimental results show that the participants had significantly longer RT under the neutral condition than under the congruent condition, but no significant difference in RT between the neutral and incongruent conditions was found. These counterintuitive results could be due to the participants' expectations about the direction of the visual cue during each trial of the experiment. The ER show a reasonable pattern, in that the neutral condition had higher ER than the congruent condition and lower ER than the incongruent condition, but without any significant differences. Taken together, a neutral condition without a gaze cue lowers the response speed while not facilitating the detection of the following auditory target.

The iCub experiment reveals that, similarly to humans, the robot's response accuracy is significantly better (p < .01) in the congruent condition than in the incongruent one. This similarity is further corroborated by the lack of a significant difference (p > .05) in both the humans' and the robot's ER in the neutral condition compared to either of the other conditions (cf. Figures 2c and 2e). Notably, in the current study, we did not compare the ER of humans and the robot directly under each condition, because robots do not respond as accurately as humans and a lower accuracy is to be expected for robots [58]. However, it is still important to find that the relative differences between the incongruent and congruent conditions are closely related for humans and the robot. Although the robot shows significantly larger SRC effects than the humans, it is reasonable for the robot's responses to carry more variance and uncertainty than those of the humans. The iCub's ego noise, though very low, still makes audio localisation more challenging than for a human; moreover, the participants could adjust to the visuals of the avatars in the practice trials, whereas the iCub could rely solely on its pretrained model. Besides, it is worth noting that even though the participants responded to the stimuli by pressing the corresponding keys on the keyboard while the iCub robot responded by shifting its eyes, the SRC effects were still replicated in the ER performance. The robot provides a fixation density map representing the most likely region on which a human would tend to fix their attention in an audio-visual crossmodal social attention meeting scenario. By assigning different degrees of attention to each modality, while guaranteeing that all of them are considered for the determination of the fixation density map, the neurorobotic model is capable of mimicking human crossmodal attention behaviour.
The possibility of making a humanoid robot mimic human attention behaviour is an essential step towards creating robots that can understand human intentions, predict human behaviour, and behave naturally and smoothly in human-robot interaction. The current study could inspire future studies from multiple areas and perspectives. For example, eye-tracking techniques could be used to collect humans' eye movement responses, e.g., the vestibulo-ocular reflex (VOR), visual fixations, saccades, etc., during the social attention task, since these are more intuitive than a keyboard response in an attention task. In addition, the GASP model could also be trained on human eye movement data, which may lead to better accuracy in modelling human behaviour in sound source localisation; the GASP model was originally trained on human eye movements recorded while watching natural videos [1]. To make the experimental design more diverse and realistic, future studies could utilise other social cues from the avatars' faces and bodies. Besides, the experimental design could be enhanced by considering additional factors, such as emotion and other identity features of the avatars. This could be helpful for target speaker detection, emotion recognition, and sound localisation in future robotic studies. Considering that speaking activity is a key feature in determining which people to look at [60], it is crucial to take it into account when creating robots that mimic human attention behaviour. Also, the high performance of the most recent in-the-wild active speaker detection models [49, 14, 33] indicates their reliability in providing accurate attention maps.

Our current work and findings can be applied to build social robots that play with children who have ASD or other mental disorders. Previous research has shown that children with ASD avoid mutual gaze and other social interaction with humans but not with humanoid robots [50]; this is explained by their fear of human faces with eye gaze but not of humanoid robot faces. Thus, it is possible and meaningful for social robots to help children with ASD improve their social functions. Finally, the current experiment could be extended to a human-robot interaction scenario, such as replacing the avatars with real humans or robots and evaluating the responses of both participants and robots [7]. There have been several human-robot interaction studies on how humans react to a robot's eye gaze [45, 3, 59] and on the effect of mutual gaze on human decision-making [32, 11]. Building on our study, a possible but challenging extension is to make robots learn multiperson eye gaze and detect the important target person in real time during a collaborative task or social scenario with humans.

In conclusion, our interdisciplinary study provides new insights into how social cues trigger social attention in a complex multisensory scenario with realistic and dynamic social cues and stimuli. We also demonstrated that, by predicting the fixation density map, the GASP model enabled the iCub robot to produce human-like responses and similar socio-cognitive functions, resolving sensory conflicts within a high-level social context. By combining stimulus-driven information with internal targets and expectations, we hypothesise that these aspects of multisensory interaction should enable current computational models of robot perception to yield robust and flexible social behaviour during human-robot interaction.
We especially thank Dr. Cornelius Weber, Dr. Zhong Yang, Dr. Guochun Yang, and Guangteng Meng for their help in improving the experimental design and the manuscript.

Conflict of Interest: The authors declare that they have no conflict of interest.

Ethical Standard: All procedures performed in studies involving participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed Consent: Informed consent was obtained from all participants included in the study.

References
GASP: Gated attention for saliency prediction
Novelty experience in prolonged interaction: A qualitative study of socially-isolated college students' in-home use of a robot companion animal
Social eye gaze in human-robot interaction: a review
Gaze-triggered orienting is reduced in chronic schizophrenia
Unilateral amygdala lesions hamper attentional orienting triggered by gaze direction
Spatial stimulus-response compatibility and affordance effects are not ruled by the same mechanisms
Do I have a personality? Endowing care robots with context-dependent personality traits
Gated multimodal networks
Mindblindness: An essay on autism and theory of mind
Coordinating attention requires coordinated senses
Mutual gaze with a robot affects human neural activity and delays decision-making processes
The development of gaze following and its relation to language
FaVoA: Face-Voice association favours ambiguous speaker detection
Predicting human eye fixations via an LSTM-based saliency attentive model
Face masks do not alter gaze cueing of attention: Evidence from the COVID-19 pandemic
Cross-modal cueing effects of visuospatial attention on conscious somatosensory perception
Using robot animal companions in the academic library to mitigate student stress
Gaze following in newborns
The eyes have it! Reflexive orienting is triggered by nonpredictive gaze
Attentional effects of counterpredictive gaze and arrow cues
Gaze cueing of attention: visual attention, social cognition, and individual differences
Assessing the contribution of semantic congruency to multisensory integration and conflict resolution
What can computational models learn from human selective attention? A review from an audiovisual unimodal and crossmodal perspective
2.5D visual sound
Masking emotions: face masks impair how we read emotions
Abnormal alpha modulation in response to human eye gaze predicts inattention severity in children with ADHD
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
Unconscious discrimination of social cues from eye whites in infants
Towards a data generation framework for affective shared perception and social cue learning using virtual avatars
Preference for robot service or human service in hotels? Impacts of the COVID-19 pandemic
It's in the eyes: The engaging role of eye contact in HRI
How to design a three-stage architecture for audio-visual active speaker detection in the wild
Do the eyes have it? Cues to the direction of social attention
Cortical processing of head- and eye-gaze cues guiding joint social attention
Clinical application of an intelligent oropharyngeal swab robot: implication for the COVID-19 pandemic
Neurodevelopment of conflict adaptation: Evidence from event-related potentials
Directing eye gaze enhances auditory spatial cue discrimination
Attention, joint attention, and social cognition
Social gaze cueing to auditory locations
Joint attention: Inferring what others perceive (and don't perceive)
When one sees what the other hears: Crossmodal attentional modulation for gazed and non-gazed upon auditory targets
Neural mechanisms of social attention
A neurorobotic experiment for crossmodal conflict resolution in complex environments
An operational model of joint attention-timing of gaze patterns in interactions between humans and a virtual human
Attention and performance X: Control of language processes
Real-time multiview face mask detector on edge device for supporting service robots in the COVID-19 pandemic
Attentional control and reflexive orienting to gaze and arrow cues
AVA-ActiveSpeaker: An audiovisual dataset for active speaker detection
Robots for use in autism research
The potential of socially assistive robots during infectious disease outbreaks
Atypical eye contact in autism: models, mechanisms and development
Following gaze: Gaze-following behavior as a window into social cognition
Spatial orienting of tactile attention induced by social cues
Early alterations of social brain networks in young children with autism
Face masks disrupt holistic processing and face perception in school-age children
Deep audio-visual saliency: Baseline model and data
Binaural sound localization based on deep neural network and affinity propagation clustering in mismatched HRTF condition
Robot faces that follow gaze facilitate attentional engagement and increase their likeability
Find who to look at: Turning from action to saliency
Combating COVID-19 - The role of robotics in managing public health and infectious diseases

Example experimental stimuli and videos for both the human and the iCub robot data collection can be viewed at this link: https://www.youtube.com/watch?v=bjiYEs1x-7E. The human and iCub response data can be found at this link: https://osf.io/fbncu/.