key: cord-0317676-2j3w609j authors: Hamzeloo, Mohammad; Kvasova, Daria; Soto-Faraco, Salvador title: Association Between Action Video Game Playing Experience and Visual Search in Real-life Multisensory scenes date: 2021-10-05 journal: bioRxiv DOI: 10.1101/2021.10.04.462750 sha: 77dd22fe40294388a5758543013cc7549bc73329 doc_id: 317676 cord_uid: 2j3w609j Prior studies investigating the effects of playing action video games on attentional control have demonstrated improvements on a variety of basic psychophysical tasks. However, as of yet, there is little evidence indicating that the cognitive benefits of playing action video games generalize to naturalistic multisensory scenes, a fundamental characteristic of our natural, everyday life environments. The present study addressed the generalization of attentional control enhancement due to AVGP experience to real-life like scenarios by comparing the performance of action video-game players (AVGPs) with non-players (NVGPs) on a visual search task using naturalistic, dynamic audio-visual scenes. To this end, a questionnaire collecting data on gaming habits and sociodemographic variables, as well as a visual search task, were administered online to a gender-balanced sample of 60 participants aged 18 to 30 years. In line with the standard hypothesis, AVGPs outperformed NVGPs in the search task overall, showing faster reaction times without sacrificing accuracy. In addition, replicating previous findings, semantically congruent cross-modal cues benefited performance overall. However, despite the overall advantage in search and the multisensory congruence benefit, AVGPs did not exploit multisensory cues more efficiently than NVGPs. Exploratory analyses with gender as a variable indicated that generalizing the advantage of AVG experience to both genders should be done with caution.
Over the last decade, the rapid development of smart technologies has produced an explosion of mobile gaming, consequently pushing video-game playing to the forefront of scientific attention. Video game playing is associated with improvements in a variety of cognitive and perceptual skills (Hisam et al., 2018; A. C. Oei & Patterson, 2015; Powers, Brooks, Aldrich, Palladino, & Alfieri, 2013; Strobach, Frensch, & Schubert, 2012), and a significant body of work in this domain has established a causal relation between playing action video games and attentional control (Bavelier & Green, 2019). It has been proposed that the attentional control improvement due to playing action video games reflects training-related transfer between different perceptual and cognitive abilities (Ducrocq, Wilson, Vine, & Derakshan, 2016; C. Green, Gorman, & Bavelier, 2016). However, it remains unknown whether the attentional control improvement in AVGPs generalizes beyond simplified laboratory conditions to tasks with the complexity of realistic, multisensory conditions. We set out to address this question. The literature defines action video games as a type of game that requires the processing of a large amount of visual information, presented rapidly over a wide field of view, and often requires the simultaneous tracking of multiple targets under high attention demands (C. Shawn Green & Bavelier, 2006). These include titles in the so-called first- or third-person shooter category or adventure games, such as Medal of Honor (Electronic Arts Inc.), Call of Duty (Activision Publishing Inc.), or Grand Theft Auto (Rockstar Games). Attentional control is a key mechanism of adaptive behavior and can be defined as our ability to stay focused on task-relevant information while resisting distraction, and to remain responsive to changes in the environment that require efficient re-orienting to new sources of relevant information (Engle, 2002).
Thus, attentional control not only enables selective attention (focusing on spatial locations or objects that are goal-relevant while minimizing task-irrelevant information) but also the capacity to shift between selective and divided attention to allow consistent monitoring of changes in the environment (Corbetta & Shulman, 2002). One of the first scientific studies of visual attention in Action Video Game Players (AVGPs), compared to Non-Video Game Players (NVGPs), reported that the cost associated with targets appearing at low-probability positions in a stimulus detection paradigm was lower for AVGPs (Greenfield, DeWinstanley, Kilpatrick, & Kaye, 1994). Based on this finding, Greenfield et al. suggested that playing action video games can foster the skills of allocating and dividing attention, thereby improving visual search performance. Over the past decade, researchers have taken an interest in visual search advantages in AVGPs. For example, AVGPs outperform NVGPs on traditional visual search tasks (Castel, Pratt, & Drummond, 2005; Hubert-Wallander, Green, Sugarman, & Bavelier, 2011), flanker/load tasks (Matthew W. G. Dye, C. Shawn Green, & Daphne Bavelier, 2009; C. Shawn Green & Bavelier, 2003; C. Shawn Green & Bavelier, 2006; Irons, Remington, & McLean, 2011; Xuemin & Bin, 2010), distraction-based tasks (Joseph D. Chisholm, Hickey, Theeuwes, & Kingstone, 2010; Rupp, McConnell, & Smither, 2016), and change detection tasks (Clark, Fleck, & Mitroff, 2011; Durlach, Kring, & Bowens, 2009). These findings suggest that action video game experience enhances various aspects of top-down attention, and the effect can be seen both in cross-sectional and in intervention studies (Bediou et al., 2018; Joseph D. Chisholm & Kingstone, 2015a; Schubert et al., 2015).
Although it has often been proposed that these improvements result from changes in selective attention, more recently, enhancements in various additional aspects of top-down attention have been considered; for example, changes in attentional control, or in the capacity to swiftly switch between attentional modes based on task demands (Bavelier & Green, 2019). The vast majority of studies in this area have compared AVGP and NVGP performance on traditional laboratory protocols with basic visual stimuli. What is more, only a handful of studies have included other sensory modalities in their experimental designs. In a cross-modal study, Donohue, Woldorff, and Mitroff (2010) compared AVGPs and NVGPs in an audio-visual simultaneity judgment task. The results showed that AVGPs were more accurate at distinguishing whether simple visual and auditory pairs occurred in synchrony or were slightly offset in time. They also revealed an enhanced ability to determine the temporal order of the different modalities in cross-modal stimuli. In a study measuring auditory decision making, AVGPs were found to be faster than NVGPs at indicating the ear to which a sound was presented, especially at low signal-to-noise ratios (C. S. Green, Pouget, & Bavelier, 2010). In a recent study using a highly demanding auditory discrimination task, AVGPs managed to detect auditory targets and to distinguish them from auditory non-target standards more accurately than NVGPs (Föcker, Mortazavi, Khoe, Hillyard, & Bavelier, 2019). Despite the growing body of research on the benefits of action video game experience, there have been few studies investigating whether the cognitive benefits observed under laboratory conditions can also be seen in more complex, naturalistic contexts.
This step would be important to understand how the putative superiority of AVGPs in certain tasks transfers to real-life environments that are complex, multisensory, and often semantically meaningful. Recent studies on cross-modal interactions in attention orienting have highlighted that in real-life scenarios, not only do temporal and spatial congruence between stimuli across modalities play a functional role in the control of attention, but semantic correspondence can also facilitate detection and recognition performance (Roberts & Hall, 2008; Spagna, Mackie, & Fan, 2015). Cross-modal semantic facilitation has been shown in a variety of tasks, including audio-visual matching tasks (Chen & Spence, 2010; Hein et al., 2007; Laurienti et al., 2003), visual awareness (Chen, Yeh, & Spence, 2011; Cox & Hong, 2015; Hsiao, Chen, Spence, & Yeh, 2012), spatial attention (List, Iordanescu, Grabowecky, & Suzuki, 2014; Mastroberardino, Santangelo, & Macaluso, 2015; Pesquita, Brennan, Enns, & Soto-Faraco, 2013), and object search in real-life scenes (Kvasova, Garcia-Vernet, & Soto-Faraco, 2019). Although evidence that benefits of action video game experience can be found in real-life performance is very limited, C. Shawn Green and colleagues suggest that action video game experience induces a form of 'learning to learn'. This metalearning would enable players to learn to suppress sources of noise or distraction efficiently while extracting task-relevant information faster and more precisely. Based on this hypothesis, one outstanding question is whether the 'learning to learn' ability that putatively emerges as a result of action video game experience can be observed in real-life search; that is, for example, when looking for an object in complex, dynamic, and naturalistic scenes.
Because naturalistic environments are multisensory, a second, interrelated question is whether AVGPs can benefit more efficiently than NVGPs from cross-modal semantic congruence between visual events and their associated sounds when searching for an object in real-life scenes. To answer the two questions posed above, we conducted a study using a visual search task on realistic multisensory scenes. In the present visual search task, targets consisted of common everyday objects embedded in video-clip fragments of natural scenes. For example, participants were asked to look for a dog in a city street scene. A characteristic object sound (consistent with the search target, with a distractor object present in the scene, or unrelated to any of the objects in the scene) was presented at stimulus onset, mixed with ambient noise. Reaction times as well as visual search accuracy were measured. We hypothesized that, if AVGPs benefit from more efficient attentional control to direct their attention to target-relevant information while minimizing target-irrelevant information in multisensory environments, they would outperform NVGPs in the task of searching for objects in real-life scenarios. We assumed that this advantage would be observed as faster reaction times and/or more accurate responses overall in the task. In addition, if AVGPs have learned to use environmental multisensory cues more efficiently, then we expected a larger cross-modal advantage for AVGPs; that is, in AVGPs the improvement in reaction times in the cross-modal congruent condition with respect to the incongruent one would be proportionally larger than that of the NVGPs. We expected this because previous studies revealed that AVGPs can resist distraction more efficiently in high-load perceptual scenarios (M. W. G. Dye, C. S. Green, & Bavelier, 2009; C. Shawn Green & Bavelier, 2003), whilst NVGPs show a reduction in the magnitude of the flanker effect (Bavelier & Green, 2019).
The hypotheses, methods, and analysis pipeline were pre-registered prior to data analysis. The pre-registration can be found at https://osf.io/sgy5a/. The results of the planned analyses are presented separately from the exploratory ones in the results section, below. The design includes two independent variables: Video-game experience (between subjects) and sound-target relation (within subjects). The first variable, Video-game experience, was measured using the Bavelier Lab Gaming Questionnaire, version November 2019. Based on this questionnaire, adapted from C. S. Green et al. (2017), participants were classified into two groups according to their experience: AVGPs and NVGPs. An AVGP is a person who plays 5+ hours per week of games in the first-/third-person shooter and/or action-RPG/adventure genres. An NVGP is a person who plays at most 1 hour per week of any genre of video games. This categorization of video game playing habits is based on Li, Polat, Makous, and Bavelier (2009). The other variable in the design, Sound-target relation, manipulated as an independent within-subjects factor, relates to the semantic relationship between the visual search target in the task and the object sound on each trial. This variable had 3 levels: target-consistent sound, distractor-consistent sound, and neutral sound. In the target-consistent condition, the sound matched the target object. In the distractor-consistent condition, the sound matched the distractor (a non-target object present in the scene), and in the neutral condition, the object sound was not semantically congruent with any object in the scene. In order to measure false-alarm rates and to balance the response types (50% yes, 50% no), the task included additional catch trials in which targets were not present (hence, the correct response was 'NO').
Within these catch trials, the object sound had 3 levels, as in target-present trials: target-consistent sound (here, the sound is consistent with the designated search object, which is not present in the video clip), distractor-consistent sound, and neutral sound. The dependent variable of our design was visual search performance, which was measured with reaction times of correct responses in target-present trials, and with d' using the hit rate in target-present conditions and the false-alarm rate from target-absent (catch) trials. We used G*Power to estimate the sample size for the study, assuming a repeated-measures ANOVA with within-between interactions, a medium effect size f = 0.25, and α level = 0.05. The total sample size for a statistical power of 0.95 was estimated at N = 54 (Tomczak, Tomczak, Kleka, & Lew, 2014). Allowing for drop-out and the inclusion criteria, we enrolled 60 individuals through the online platform Prolific.co. Most of the participants were from European countries; the others were from South American countries, the US, Canada, and one from South Africa. According to their responses to a questionnaire about video game play habits across different genres during the previous 12 months (Bavelier Lab Gaming Questionnaire), we had 30 AVGPs and 30 NVGPs. The mean age was 23.78 years (SD = 3.12). There was no significant age difference between the two groups (t = 0.115, df = 28, p = 0.91), and the two groups were gender-balanced (15 males and 15 females within each group). Otherwise, the general inclusion criteria were: (1) normal or corrected-to-normal vision and hearing, (2) a good-quality internet connection and access to the experiment from a laptop or personal computer, (3) a false-alarm rate in catch trials (trials in which the search target was not present) below 15%, and (4) an average accuracy above 70% in the three target-present experimental conditions.
The materials for the visual search task (target objects, sounds, and video clips) in naturalistic environments were selected from those used in Kvasova et al. (2019) (see Fig. 1 for an example). There was a set of 156 different video clips extracted from movies, TV shows, advertisements, and others recorded by Kvasova et al. from everyday-life scenes. Video clips were recorded/played in color, at 30 fps and 1024 × 768 pixels resolution. All video clips were edited into 2-s fragments, without fade-in or fade-out. Twelve videos were used for training trials, 72 videos were used for the three target-present experimental conditions (24 each; see Table 1), and 72 further videos were used for catch trials in which the search target object was not present (again, 24 videos per condition). The original sounds of the videos had been replaced with background noise created by the superposition of various everyday-life sounds matched to the context of the video. Each video clip in target-present trials contained at least two visual objects with a familiar characteristic sound (such as musical instruments, animals, or tools). One of these objects was designated as the search target and the other as the distractor. The choice of target/distractor objects followed two criteria, checked during video selection by at least two judges: (1) the objects were visible but not part of the main action in the scene. For instance, if a person is talking on a cellphone as the main action of the scene, the cellphone could not be a target object. (2) Both target and distractor objects were present throughout the video clip. The assignment of the two designated objects of each video as target or distractor was counterbalanced across the experimental design to compensate for potential biases related to specific objects. To this end, each video was used in all three conditions of the target-present trials.
The three experimental conditions were created by combining each video with a target-consistent, distractor-consistent, or neutral sound, while designating one of the two candidate objects as the search target before the video, yielding six equivalent versions of each target-present trial (2 target objects × 3 conditions). Each video clip in catch trials contained at least one visual object with a characteristic sound, which was used as the distractor object. We created 3 equivalent versions of each catch trial in the same way, so that each catch video appeared in all three conditions across different versions of the experiment. With 6 versions of target-present trials and 3 versions of catch trials, we obtained six versions of the task of the same length. Each version was presented to 10 participants, with the videos in a different random order. Characteristic sounds were semantically compatible with the target/distractor object, depending on the condition, and were presented centrally, providing no cue to object location. All sounds were normalized at origin to an equivalent SPL and were presented for 600 ms.
Figure 1. (A) Sequence of events in a typical experimental trial. The trial started with the presentation of the target word for 2000 ms, followed by the auditory cue and the video clip (see times in figure). There was no time limit for the participant's response. 200 ms after the participant had responded, a new target word was presented, beginning a new trial. (B) Example of conditions. In this example stimulus (illustrated with a snapshot of the video clip), the possible target (in target-present conditions) is the smartphone. In the target-consistent condition the sound will match the target (e.g., a ring tone), in the distractor-consistent condition the sound will match the distractor object (in this case, the car), and in the catch trial the sound will not match any object of the scene (e.g., a dog barking).
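The counterbalancing scheme described above (six task versions, with each video appearing once per version and rotating through its object × sound combinations across versions) can be sketched as a simple Latin-square-style rotation. This is a minimal illustration under our own assumptions, not the authors' code; the function name, trial-dictionary fields, and video identifiers are invented for the example.

```python
from itertools import product

SOUNDS = ["target-consistent", "distractor-consistent", "neutral"]

def build_versions(n_target_present=72, n_catch=72):
    """Build 6 task versions: each video appears once per version, and the
    (target object, sound condition) pairing rotates across versions so that
    conditions stay balanced within each version (24 trials per sound)."""
    tp_combos = list(product([0, 1], SOUNDS))  # 2 target objects x 3 sounds = 6
    versions = []
    for k in range(6):
        trials = []
        for i in range(n_target_present):
            obj, sound = tp_combos[(i + k) % 6]  # rotate pairing per version
            trials.append({"video": f"tp_{i:02d}", "target_obj": obj,
                           "sound": sound, "target_present": True})
        for i in range(n_catch):
            trials.append({"video": f"catch_{i:02d}",
                           "sound": SOUNDS[(i + k) % 3], "target_present": False})
        versions.append(trials)
    return versions
```

Because 72 is a multiple of 6, every version contains each of the six pairings 12 times, so each sound condition occurs exactly 24 times among target-present trials, matching the design described in the text.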
The image is a frame of a video clip filmed by the research group.
Procedure
The coronavirus outbreak led us to run the experiment online. We built the experiment using Builder in the PsychoPy package (v2020.1.3) and ran it on the Pavlovia.org platform. PsychoPy is also the only package with online reaction-time precision under 4 ms (Sauter, Draschkow, & Mack, 2020). We recruited participants through the Prolific.co platform. Each participant was asked to read the informed consent form and confirm their agreement to participate in the experiment voluntarily. The study was approved by the UPF Institutional Committee for Ethical Review of Projects (CIREP-UPF). Participants could exit the experiment at any moment by pressing the 'Escape' key on their keyboard. After giving consent, they filled in a demographic form including age and gender (selecting M for male, F for female, or O for other) and completed the Video Game Questionnaire in the first part of the study. If they fell into either group category (AVGP or NVGP), they received an invitation to the second part of the study. On clicking the invitation link, the instructions for the main task appeared on the screen, and participants were asked to do the task indoors under dim illumination and to turn their phones off in order to avoid distractions during the task. They were also asked to raise the volume of their device to about 80%, or to a level at which they could easily hear the sounds. By pressing the space bar, they entered a 12-trial training block before the beginning of the experiment. Feedback on the participant's response was provided in the training trials to make sure that participants understood the task. No feedback was provided during the experimental blocks. Each trial started with a cue word designating the search target object, printed in the middle of the screen for 2000 ms. This cue was followed by the video clip, with the corresponding sound.
Auditory cues (target-consistent, distractor-consistent, or neutral sound) began slightly ahead of the video, by 100 ms, and lasted 600 ms. The video was shown for 2000 ms. The participant judged whether or not the target object was present in the video clip, pressing the Y key ('Yes') with the left index finger when they found the target object in the video, or the N key ('No') with the right index finger when the target object was not present. Participants were instructed to respond as fast as possible, but there was no time limit for the trial response. A question mark was presented after the video offset until a response was made. There was a 200-ms blank screen between trials. When they finished the task, participants were returned to the Prolific website, where a successful submission appeared in their account awaiting the researcher's approval. If their accuracy was above 85% on catch trials and above 70% on target-present trials, the submission was approved and they were paid £2.50 as compensation for their participation. We collected 85 individual datasets, but 25 of them (17 NVGPs and 8 AVGPs) were excluded from the analyses because of the accuracy criteria. We continued running the experiment until we had 60 valid individual datasets to enter into the analyses. To eliminate processes outside the interest of the study, including fast lucky guesses, delayed responses due to the subject's inattention, or guesses based on the subject's failure to reach a decision, we applied an outlier filter to RTs of ±2 SD around the mean of each condition for each subject: neither RT nor accuracy data were analyzed for these outlier trials. In some of the within-subject analyses, where indicated, we used normalized RTs.
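The ±2 SD trimming rule described above can be expressed as a short helper applied per subject and condition. This is a sketch under our own assumptions (the function name is ours, not the study's); it assumes a plain list of RTs for one subject in one condition.

```python
from statistics import mean, stdev

def filter_outliers(rts, n_sd=2.0):
    """Keep only RTs within mean +/- n_sd * SD of this subject's
    RT distribution for one condition; trimmed trials are dropped
    from both RT and accuracy analyses."""
    if len(rts) < 2:
        return list(rts)  # SD undefined for a single trial; keep as-is
    m, s = mean(rts), stdev(rts)
    return [rt for rt in rts if m - n_sd * s <= rt <= m + n_sd * s]
```

For example, a 3000-ms delayed response among RTs clustered around 500 ms falls outside the ±2 SD band and is removed, while the remaining trials are kept.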
Data from neutral trials were used to normalize the data from the conditions of interest across participants, in order to reduce inter-individual differences and concentrate on the effects of interest. The normalization was done according to Equation (1) for each subject (i) and condition of interest (j): (1) where NormRT_j_i is the normalized RT for subject i in condition j, RT_j_i is the mean RT for subject i in condition j, and NeutralRT_i is the mean RT in the neutral-sound condition for subject i. Signal Detection Theory was used to measure the precision of responses in the conditions of interest across participants by calculating d' according to Equation (2) for each subject (i) and condition of interest (j): d'_j_i = z(hit_rate_j_i) − z(fa_rate_j_i), (2) where d'_j_i is the d prime for subject i in condition j, hit_rate_j_i is yes responses / total responses for subject i in condition j of target-present trials, fa_rate_j_i is yes responses / total responses for subject i in condition j of catch trials, and z(hit_rate_j_i) and z(fa_rate_j_i) are the corresponding standardized z-scores. If hit_rate or fa_rate was equal to 0, it was replaced with 1/total_responses; if hit_rate or fa_rate was equal to 1, it was replaced with (total_responses − 1)/total_responses. Additionally, data from neutral trials were used to explore whether any congruence effects seen in the conditions of interest were due to cross-modal benefit in target detection, cross-modal interference from distractors, or both. To evaluate our first hypothesis, whether the benefit of action video game experience transfers to search in real-life scenarios, we addressed whether there is a superiority of AVGPs in RTs and/or precision (d'). To test this, we used the inter-subject averages per group (AVGP, NVGP), pooled across all other conditions, and performed a one-sided t-test on RTs of correct responses in target-present trials, and on d' scores.
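The d' computation of Equation (2), including the stated correction for extreme hit and false-alarm rates, can be reproduced with the standard library's normal quantile function. The normalization helper below is only our assumed reading of Equation (1), whose displayed formula did not survive text extraction (a relative difference from the neutral baseline); all function names here are illustrative, not the authors'.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse standard-normal CDF (z-score)

def corrected_rate(yes_count, total):
    """Proportion of 'yes' responses, replacing 0 with 1/N
    and 1 with (N - 1)/N, as described in the text."""
    rate = yes_count / total
    if rate == 0.0:
        return 1 / total
    if rate == 1.0:
        return (total - 1) / total
    return rate

def d_prime(hits, n_target_present, false_alarms, n_catch):
    """Equation (2): d'_j,i = z(hit_rate_j,i) - z(fa_rate_j,i)."""
    return (z(corrected_rate(hits, n_target_present))
            - z(corrected_rate(false_alarms, n_catch)))

def normalized_rt(mean_rt_condition, mean_rt_neutral):
    """ASSUMED form of Equation (1): RT change relative to the
    subject's neutral-condition baseline (our guess, not verified)."""
    return (mean_rt_condition - mean_rt_neutral) / mean_rt_neutral
```

With 24 trials per condition, a perfect score (24 hits, 0 false alarms) yields a finite d' of about 3.46 rather than infinity, which is the purpose of the correction.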
The result showed a significant difference between groups in mean RTs (t(58) = 7.10, p < 0.001, Cohen's d = 0.39) (Figure 2A), whereas no such difference occurred in overall d' (t(58) = 0.68, p = 0.245) (Figure 2B). These results confirm the first part of our first hypothesis and indicate that AVGPs responded faster, albeit with similar precision to NVGPs. In particular, the average advantage of AVGPs was 220 ms, a 14% variation over the total RT, which can be considered a medium effect size. Please note that the lack of group differences in precision may be related to the accuracy-based inclusion criteria, which were introduced precisely to render latency estimations interpretable.
Figure 2. In all plots, error bars indicate the standard error of the mean, and asterisks indicate significant differences (**p-value < 0.01). (B) Visual search accuracy (d prime) plotted for the two groups (all three sound conditions pooled; see text for details). (C) Normalized reaction times in the target-consistent and distractor-consistent conditions. (D) D primes plotted separately for the three sound conditions.
To evaluate our second hypothesis, whether AVGPs benefit more from cross-modal cues and/or are more resistant to cross-modal distractors, we expected to see an interaction between Group and Sound Condition. We entered normalized RTs into a repeated-measures analysis of variance (ANOVA) with the between-participants factor Group (AVGPs vs. NVGPs) and the within-participants factor Sound Condition (target-consistent and distractor-consistent, normalized by RTs in the neutral condition). The group*condition interaction was not significant (F(1, 58) = 0.218, p = 0.64) (Figure 2C). We ran another repeated-measures ANOVA on d' scores with the same factors, and the interaction between group and sound condition was again not significant (F(2, 57) = 0.526, p = 0.59) (Figure 2D).
The analyses did not confirm our second hypothesis and suggest that AVGPs do not benefit more from, nor are more interfered with by, cross-modal cues than NVGPs, in speed or accuracy. In addition to the hypothesis-driven analyses described above, we performed two further exploratory analyses which are not aimed directly at testing our initial hypotheses but help characterize the pattern of results and provide some reality checks: (1) Overall cross-modal effects. We compared the neutral condition with the target-consistent condition, and the neutral with the distractor-consistent condition. This analysis is informative as to whether, overall, congruent cross-modal cues benefit performance, incongruent cross-modal cues hinder performance, or both. It also helps situate the present data with respect to previous studies finding cross-modal benefits in visual search. We ran two one-sided paired-samples t-tests comparing mean RTs of the target-consistent and distractor-consistent conditions to mean RTs of the neutral condition. The results show a significant speed advantage of the target-consistent condition over the neutral condition (t(59) = 1.849, p = 0.033, Cohen's d = 0.16). The difference between mean RTs of the distractor-consistent and neutral conditions was, however, not significant (t(59) = 0.602, p = 0.275) (Figure 3A). We ran two more one-sided paired-samples t-tests, this time comparing d' of the target-consistent and distractor-consistent conditions to the neutral condition. The results showed a marginal tendency for higher sensitivity in the target-consistent condition compared to the neutral condition (t(59) = 1.405, p = 0.083, Cohen's d = 0.26) (Figure 3B). As with the RT outcome, there was no significant difference between d' of the distractor-consistent and neutral conditions (t(59) = 0.266, p = 0.395).
Altogether, these results indicate that congruent cross-modal cues tend to benefit performance, but incongruent cross-modal cues do not hinder performance, with respect to neutral conditions. These results replicate recent findings from Kvasova & Soto-Faraco (2019) using naturalistic scenes and are in line with previous cross-modal findings using more traditional laboratory tasks (Iordanescu, Guzman-Martinez, Grabowecky, & Suzuki, 2008; Knoeferle, Knoeferle, Velasco, & Spence, 2016; Laurienti et al., 2003). (2) Gender-dependent effects. Given that we collected data from a gender-balanced sample for both the AVGP and NVGP groups, we performed exploratory analyses of the two main hypotheses with gender as a between-subjects variable. Regarding the first hypothesis (overall advantage of AVGPs over NVGPs), a group (AVGP/NVGP) by gender (male/female) between-subjects ANOVA on mean RTs revealed no group*gender interaction or main effect of gender (F(1, 56) = 0.98, p = 0.32; F(1, 56) = 0.04, p = 0.85, respectively), whereas group had a significant main effect (F(1, 56) = 6.72, p < 0.05, η2 = 0.11), as would be expected from the main analysis. We ran another group*gender between-subjects ANOVA on d' scores. The result indicated no significant group*gender interaction (F(1, 56) = 0.19, p = 0.67) nor main effect of group (F(1, 56) = 0.43, p = 0.51), while the main effect of gender was significant (F(1, 56) = 6.79, p < 0.05, η2 = 0.11). As can be seen in the descriptive data in Table 1, total d' scores for males (M = 2.78, SD = 0.56) were higher than for females (M = 2.45, SD = 0.42) (see Figure 4). Regarding our second hypothesis (differential cross-modal advantage for AVGPs), we entered gender and video-game experience as between-participants factors and condition as a within-participants variable in an ANOVA on normalized RTs. This ANOVA yielded a nearly significant group*condition*gender interaction (F(1, 56) = 3.07, p = 0.085).
Owing to this marginal effect, we broke the analysis down into two separate ANOVAs, one for each gender, looking for a possible group by condition interaction. Neither the ANOVA for males (F(1, 28) = 2.58, p = 0.12) (Figure 5A) nor the one for females (F(1, 28) = 0.79, p = 0.38) (Figure 5B) was significant in isolation. This parallels the result of the main analysis, reported above. We performed some follow-up analyses by running a repeated-measures ANOVA on normalized RTs for AVGPs only, using gender as a factor. We found a significant interaction between gender and sound condition on normalized RTs (F(1, 57) = 5.84, p = 0.022) (Figure 5C). Please note that the main effect of gender would be meaningless in this analysis, as the RTs are normalized, but interactions can be interpreted. Post-hoc t-tests showed a significant difference between male and female AVGPs in the distractor-consistent condition (t(28) = 2.80, p = 0.009, Cohen's d = 1.0), whereas there was no such difference between male and female NVGPs (t(28) = 1.47, p = 0.152). This result would suggest an interference effect for male AVGPs compared to female AVGPs, such that distractor sounds slowed down their reaction times. Given the low statistical power of this exploratory analysis, this result should be taken cautiously. Entering gender as a factor in a gender*group*condition repeated-measures ANOVA on d' scores, we found a significant condition by gender interaction (F(2, 55) = 3.51, p = 0.037) and a significant main effect of gender (F(1, 59) = 11.34, p = 0.001). Follow-up analyses showed significant differences between male and female AVGPs in d' for the distractor-consistent condition (t(28) = 2.91, p = 0.007, Cohen's d = 1.08) (Figure 5D) and between male and female NVGPs in d' for the neutral condition (t(28) = 3.488, p = 0.002, Cohen's d = 1.27) (Figure 5E).
When we analyzed all sensitivity data for gender differences, the differences between males and females in d' were significant for the distractor-consistent condition (t(58) = 2.966, p = 0.004, Cohen's d = 0.77) and the neutral condition (t(58) = 3.667, p = 0.001, Cohen's d = 0.97) (Figure 5F). These findings suggest that males generally responded more precisely than females in the distractor-consistent and neutral conditions. Again, these outcomes must be interpreted carefully, given that the study was underpowered for investigating interactions with the gender variable. Figure 5 (caption, partial): (D) d' for the target-consistent, distractor-consistent, and neutral conditions plotted for male and female AVGPs, and (E) for male and female NVGPs. (F) d' separately for each gender in each of the three sound conditions. Error bars indicate standard error of the mean, and significant differences are indicated by asterisks (*p < 0.05, **p < 0.01). The first aim of the present study was to assess whether the visual search advantage demonstrated by AVGPs over NVGPs with simple stimuli in classic psychophysical tasks generalizes to real-life, multisensory, complex situations. Using a visual search task in real-life scenes, with audio-visual congruent/incongruent cross-modal cues, we demonstrated that AVGPs extend their advantage to real-life-like scenarios by responding faster while keeping precision equal to NVGPs. In line with previous studies on multisensory advantages of AVGP experience (Joseph D. Chisholm & Kingstone, 2012, 2015a; Donohue et al., 2010; Stewart, Martinez, Perdew, Green, & Moore, 2019), our results reveal that the advantage of extensive action video game experience can transfer to naturalistic multisensory situations. While the mechanism of this generalization remains debated, one explanation is that AVGPs may have a heightened speed of processing (e.g., M. W. Dye, C. S. .
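The t-tests above report Cohen's d as the effect size. As a hedged illustration (not the authors' analysis code), Cohen's d for two independent groups is the mean difference divided by the pooled standard deviation:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    # Unbiased (n - 1) sample variances for each group
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd
```

By common convention, d around 0.8 or larger counts as a large effect, which is why the values near 1.0 reported here are substantial despite the small subgroup sizes.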
Action video game playing requires rapid processing of audio-visual information and promotes action per unit of time, forcing players to make decisions and execute responses as fast as possible. Each fast and accurate response provides an incentive for players, while delays in processing often have severe consequences. Therefore, extensive experience with action video games may lead to more efficient visual processing and speeded RTs across a range of unrelated tasks without sacrificing accuracy. Most past studies comparing AVGPs to NVGPs on RT measures have consistently shown that AVGPs are faster overall than NVGPs (Castel et al., 2005; Greenfield et al., 1994; Nuyens, Kuss, Lopez-Fernandez, & Griffiths, 2019; Powers et al., 2013; Schubert et al., 2015; Stewart et al., 2019; Torner, Carbonell, & Castejón, 2019). Note that if playing merely induced faster decision making, without increasing information-processing efficiency, it would typically come at the cost of accuracy (Heitz, 2014). This does not seem to be the case in most findings. Another explanation for the advantage of AVGPs would be a greater attentional control capacity compared to NVGPs (e.g., C Shawn Green & Bavelier, 2006), allowing them to direct their attention toward the spatial position of the visual stimulus more quickly and accurately, which in turn allows them to include inputs from other modalities when these are consistent with the goal-directed visual search and to exclude irrelevant inputs from processing. It has been suggested that attentional control plasticity as a result of action video gameplay is a fundamental building block of cognitive enhancement in AVGPs (Bavelier & Green, 2019). Enhancement of various aspects of top-down attention as an effect of action video gameplay is seen in cross-sectional studies (Bediou et al., 2018; Joseph D. Chisholm & Kingstone, 2012, 2015b).
Similarly, intervention studies mirror the cross-sectional findings by showing that action video game training can radically alter visual attentional processing. The results of these studies suggest a causal role of action gaming in spatial cognition and top-down attention, since they rule out inherently enhanced attentional control in AVGPs (Bediou et al., 2018; Chiappe, Conger, Liao, Caldwell, & Vu, 2013; C Shawn Green & Bavelier, 2003; C Shawn Green & Bavelier, 2012). Furthermore, training studies distinguish action video games from other game types, such as mimetic games, sport games, simulation games, or puzzle games, because the same impact has not been found for those other types (Cohen, Green, & Bavelier, 2007; Powers & Brooks, 2014; Powers et al., 2013). Given its online cross-sectional design, the current study provides evidence that the advantages associated with action video game experience extend beyond typical laboratory tasks to more complex naturalistic scenes. As in many previous studies, a degree of caution must be advised when considering the causal relationship between action video game experience and improved visual search in a real-life multisensory context, since a training component was not included in the protocol. Therefore, future work assessing performance in multisensory paradigms pre- and post-action video game training will be important to further evaluate the relation between action video game experience and cognitive abilities in real life. The second aim of the present study was to assess whether there is a cross-modal advantage for AVGPs over NVGPs. Entering the normalized data into the analysis of performance in the visual search task in real-life scenarios, the results suggest that AVGPs do not benefit from cross-modal cues more than NVGPs, nor are they more resistant to distractors, within the statistical power of the present study, which was sensitive to medium to large effect sizes.
These results are consistent with previous studies (Gao et al., 2018; Rupp et al., 2016) showing no difference between AVGPs and NVGPs in the ability to integrate audio-visual stimuli, or in driving performance while distracted in a driving simulator. Yet, AVGPs do display greater performance in some cross-modal tasks, such as temporal order judgments (TOJs) for visual and auditory stimuli (Donohue et al., 2010). Note that, unlike our task, TOJs require effective segregation (rather than integration) of cross-modal events. Because processing the meaning of a complex naturalistic sound can require more time due to the temporal nature of the information (for a similar procedure and for a review see Vatakis & Spence, 2010), in our task the sound was presented slightly earlier than the videos to capture attention, allowing us to study object-based congruency/incongruency effects in the integration of semantic information. Overall, this cross-modal semantic congruence benefited performance across the board, but it did so to the same extent in AVGPs and NVGPs. This pattern of results suggests that although AVGPs are faster in overall visual search, they do not benefit more than NVGPs from semantic audio-visual integration to direct their attention toward the target object. One explanation is that AVGPs and NVGPs may employ similar cognitive strategies when processing the higher-level, semantic aspects of the information. Hence, the AVGP advantage may be based on early, lower-level processing stages. It has been shown that AVGPs benefit from low-level spatial and temporal factors in their visual (C Shawn Green & Bavelier, 2006; Greenfield et al., 1994; Schubert et al., 2015; Wong & Chang, 2018; Xuemin & Bin, 2010), auditory (C. S. Green et al., 2010; Stewart et al., 2019), or audio-visual (Zhang, Tang, Moore, & Amitay, 2017) search tasks.
A recent meta-analysis confirmed a smaller impact on higher cognitive performance such as inhibition (g = 0.31, 95% CI [0.07, 0.56], df = 7.2, p = .02) and verbal cognition (g = 0.30, 95% CI [0.03, 0.56], df = 7.7, p = .033), while there was no significant impact on problem-solving in cross-sectional studies (Bediou et al., 2018). Future studies need to explore whether extensive action video game experience has a positive impact on higher cognitive functions or whether it is limited to low-level perception and spatial cognition, even if under the mediation of top-down attention. Our exploratory analyses revealed that, overall, semantic congruency between sounds and target objects sped up search latencies in comparison with neutral sounds, while semantic incongruency in the distractor-consistent condition did not produce any disadvantage with respect to the baseline (neutral condition). Hence, consistent cross-modal cues benefit performance, whilst incongruent cross-modal cues do not hinder performance to a measurable extent. This finding replicates previous results (Knoeferle et al., 2016; ), indicating that the semantic congruency effect can generalize beyond typical laboratory protocols and guide attention in a complex, multisensory environment. It also provides an extension of simpler, laboratory protocols (Iordanescu et al., 2008; Laurienti et al., 2003). The design of this study allowed us to calculate the false alarm rate and d' for the distractor-consistent condition, which was a limitation in the earlier study. In keeping with the conclusions of that study, there was no difference in d' between the distractor-consistent and neutral conditions. This indicates that the cross-modal congruence effects can more reliably be taken as an advantage of congruent sounds than as a distractor effect: incongruent sounds neither slow down response times nor elicit more impulsive responses than neutral conditions.
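For reference, the sensitivity index d' used throughout is computed from the hit and false-alarm rates. The sketch below is illustrative only (the exact correction for extreme rates used in the study is not stated here); it applies a common 1/(2N) clipping so that rates of exactly 0 or 1 remain computable:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).

    Rates are clipped to [1/(2N), 1 - 1/(2N)] because the inverse normal
    CDF is undefined at 0 and 1 (a standard, but not the only, correction).
    """
    z = NormalDist().inv_cdf
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_signal, 1 / (2 * n_signal)),
                   1 - 1 / (2 * n_signal))
    fa_rate = min(max(false_alarms / n_noise, 1 / (2 * n_noise)),
                  1 - 1 / (2 * n_noise))
    return z(hit_rate) - z(fa_rate)
```

Under this definition, equal d' across conditions means that any RT difference reflects speed rather than a shift in response precision, which is the logic behind the distractor-condition comparison above.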
An open question here is: if cross-modal semantics are clearly being processed, given the facilitation, why does semantic incongruency not hinder performance? There are two possible answers. One is that target-inconsistent sounds do not strongly distract participants from responding to the visual target when the visual stimuli are highly informative. In this study, when the cue word is presented to participants, it activates a semantic network related to the target and creates an attentional template that directs attention to semantically congruent inputs from other modalities while inhibiting unrelated cross-modal information. If input from another modality is incongruent with the target search and easy to inhibit, it cannot enter processing far enough to create a distractor effect. The second possibility is that, in our study, some auditory stimuli may not have been informative enough to direct participants' attention and affect visual search, because some of them were physically similar (e.g., the sounds of coins and keys) or semantically similar (e.g., sounds of musical instruments). Thus, we cannot be sure that they were powerful enough to play the proper role of a distractor or neutral sound. To clarify this point, further studies should examine the effect of highly informative (hard to inhibit) versus less informative (easy to inhibit) cross-modal cues. The low prevalence of female AVGPs has typically caused a large gender imbalance in cross-sectional studies comparing AVGPs with NVGPs, which have been almost exclusively based on male populations (Cohen et al., 2007; C Shawn Green & Bavelier, 2003; C. S. Green et al., 2010; Adam C. Oei & Patterson, 2013). Here, we used a gender-balanced sample of participants, with equal numbers of males and females in both the AVGP and NVGP groups.
Therefore, although the sample size was not calculated to detect possible gender effects, we believe that exploratory analyses with this variable may provide a first impression unavailable in previous studies. In terms of the putative general advantage of AVGPs in the search task overall (our first hypothesis), our results suggest that the main AVGP effect applies across the board. Playing AVGs enhances the speed of processing or attentional control capacity while the accuracy of responses (d' scores) remains unchanged, irrespective of gender, as far as the sensitivity of our design allows us to tell. Our results also hinted at a basic difference between experienced male and female AVG players in their strategy for dealing with cross-modal cues. Note upfront that these analyses are probably underpowered and will need confirmation. However, what we observed in the analyses by gender is that female AVGPs were faster in the distractor-consistent condition than male AVGPs, but male AVGPs provided more accurate responses than female AVGPs. If confirmed, this finding may be interpreted as reflecting slightly different strategies: the increased speed of female AVGPs in the distractor-consistent condition can be viewed as impulsive behavior, in which female AVGPs respond faster but make more errors, compared to males, who sacrifice time for precision (see Figure 4). With due caution, given the exploratory nature of this analysis, this finding could be in line with previous studies suggesting a gender difference in selective attention and spatial cognition (Evans & Hampson, 2015; Halpern, 2013; Merritt et al., 2007; Posner & Marin, 2016; Stoet, 2017). More studies, with larger samples and adequate statistical power, should be performed to follow up on this result.
For now, one conclusion that can be drawn from this study is that there is potential in studying gender differences in the effects of AVG playing, an area that has been neglected in most studies so far. It is worth mentioning that this study was performed online, with the experiment presented in a browser on each participant's personal device (personal computer or laptop). This adds to the generalization of our findings to more ecological situations. Despite our efforts, the quality, size, and luminance of the auditory stimuli and video clips probably varied between participants. Our assumption is that this variation was similar between groups (AVGPs vs. NVGPs and men vs. women). There is reason to assume that such group differences were negligible, because our data replicated the cross-modal semantic effect, very similar to the earlier laboratory-based study. However, more work in other natural contexts (e.g., virtual reality, or other situations that do not involve sitting in front of a display) is needed to evaluate the extent to which the observed advantages generalize. The results of the present investigation demonstrate that the AVGP advantage in attention tasks extends to naturalistic multisensory situations. We failed, however, to demonstrate that AVGP experience creates any specific advantage in benefiting from cross-modal cues, or any greater resistance to distractors, compared to NVGPs. Our exploratory analyses shed some light on semantic aspects of multisensory integration by generalizing (and confirming) previous laboratory findings on semantic congruency/incongruency effects in cross-modal interactions. Other findings from the exploratory analyses with gender indicate that although the AVGP advantage applies across the board, there could be strategic differences in how this advantage is achieved.
To gain a full understanding of AVG advantages in males and females, it will be important to include female participants in this field of study and to directly compare results between genders more systematically. Salvador Soto-Faraco has been funded by Ministerio de Ciencia e Innovación (Ref: PID2019-108531GB-I00 AEI/FEDER) and AGAUR, Generalitat de Catalunya (2017 SGR 1545). This project has been co-funded (50%) by the European Regional Development Fund under the framework of the ERDF Operative Programme for Catalunya 2014-2020, with a grant of 1.527.637,88€. The datasets generated for this study are available on request to the corresponding author.

References

Enhancing Attentional Control: Lessons from Action Video Games
Meta-analysis of action video game impact on perceptual, attentional, and cognitive skills
The effects of action video game experience on the time course of inhibition of return and the efficiency of visual search
When hearing the bark helps to identify the dog: Semantically-congruent sounds modulate the identification of masked pictures
Crossmodal constraints on human perceptual awareness: Auditory semantic modulation of binocular rivalry
Improving multi-tasking ability through action videogames
Reduced attentional capture in action video game players
Improved top-down control reduces oculomotor capture: The case of action video game players. Attention, Perception, & Psychophysics
Action video game players' visual search advantage extends to biologically relevant stimuli
Action video games and improved attentional control: Disentangling selection- and response-based processes
Enhanced change detection performance reveals improved strategy use in avid action video game players
Training visual attention with video games: Not all games are created equal
Control of goal-directed and stimulus-driven attention in the brain
Semantic-based crossmodal processing during visual suppression, 6(722)
Video game players show more precise multisensory temporal processing abilities
Training Attentional Control Improves Cognitive and Motor Task Performance
Effects of action video game experience on change detection
The development of attention skills in action video game players
Increasing Speed of Processing With Action Video Games. Current Directions in Psychological Science
Working memory capacity as executive attention. Current Directions in Psychological Science
Sex-dependent effects on tasks assessing reinforcement learning and interference inhibition
Neural correlates of enhanced visual attentional control in action video game players: An event-related potential study
Action Video Games Influence on Audiovisual Integration in Visual Selective Attention Condition
Action Video-Game Training and Its Effects on Perception and Attentional Control
Action video game modifies visual selective attention
Effect of action video games on the spatial distribution of visuospatial attention
Learning, attentional control, and action video games
Action video game training for cognitive enhancement
Playing Some Video Games but Not Others Is Related to Cognitive Abilities: A Critique of
Improved probabilistic inference as a general learning mechanism with action video games
Action video games and informal education: Effects on strategies for dividing visual attention
Sex differences in cognitive abilities
Object familiarity and semantic congruency modulate responses in cortical audiovisual integration areas
The speed-accuracy tradeoff: History, physiology, methodology, and behavior
Does playing video games effect cognitive abilities in Pakistani children?
Assessing the effects of audiovisual semantic congruency on the perception of a bistable figure
Changes in search rate but not in the dynamics of exogenous attention in action videogame players
Characteristic sounds facilitate visual search
Not so fast: Rethinking the effects of action video games on attentional capacity
Multisensory brand search: How the meaning of sounds guides consumers' visual attention
Characteristic Sounds Facilitate Object Search in Real-Life Scenes
Not so automatic: Task relevance and perceptual load modulate cross-modal semantic congruence effects on spatial orienting. bioRxiv
Crossmodal sensory processing in the anterior cingulate and medial prefrontal cortices
Enhancing the contrast sensitivity function through action video game training
Haptic guidance of overt visual attention
Crossmodal semantic congruence can affect visuo-spatial processing and activity of the fronto-parietal attention networks
Evidence for gender differences in visual selective attention
The Empirical Analysis of Non-problematic Video Gaming and Cognitive Skills: A Systematic Review
Enhancing Cognition with Video Games: A Multiple Game Training Study
Enhancing perceptual and attentional skills requires common demands between the action video games and transfer tasks
Isolating shape from semantics in haptic-visual priming
Attention and performance XI. Routledge
Evaluating the Specificity of Effects of Video Game Training
Effects of video-game play on information processing: A meta-analytic investigation
Examining a supramodal network for conflict processing: A systematic review and novel functional magnetic resonance imaging data for related visual and auditory Stroop tasks
Examining the Relationship Between Action Video Game Experience and Performance in a Distracted Driving Task
Building, Hosting and Recruiting: A Brief Introduction to Running Behavioral Experiments Online
Video game experience and its influence on visual attention parameters: An investigation using the framework of the Theory of Visual Attention (TVA)
Multisensory Interactions in the Real World
Supramodal executive control of attention
Auditory cognition and perception of action video game players
Sex differences in the Simon task help to interpret sex differences in selective attention
Video game practice optimizes executive control skills in dual-task and task switching situations
Using power analysis to estimate appropriate sample size
A comparative analysis of the processing speed between video game players and non-players. Aloma: Revista de Psicologia
Audiovisual temporal integration for complex speech, object-action, animal call, and musical stimuli
Attentional advantages in video-game experts are not related to perceptual tendencies
Effects of action video game on spatial attention distribution in low and high perceptual load task
Supramodal Enhancement of Auditory Perceptual and Cognitive Learning by Video Game Playing