key: cord-0058214-eqblziz8
authors: Thevin, Lauren; Machulla, Tonja
title: Guidelines for Inclusive Avatars and Agents: How Persons with Visual Impairments Detect and Recognize Others and Their Activities
date: 2020-08-10
journal: Computers Helping People with Special Needs
DOI: 10.1007/978-3-030-58796-3_21
sha: 6f6119d6831bd5401f2568b5b0e9cedfb6a54243
doc_id: 58214
cord_uid: eqblziz8

Realistic virtual worlds are used in video games, in virtual reality, and to run remote meetings. In many cases, these environments include representations of other humans, either as stand-ins for real humans (avatars) or as artificial entities (agents). The presence and individual identity of such virtual characters are usually coded by visual features, such as visibility in certain locations and appearance in terms of looks. For people with visual impairments (VI), this creates a barrier to detecting and identifying co-present characters and interacting with them. To improve the inclusiveness of such social virtual environments, we investigate which cues people with VI use to detect and recognize others and their activities in real-world settings. For this, we conducted an online survey with fifteen participants (adults and children). Our findings indicate an increased reliance on multimodal information: vision for silhouette recognition; audio for recognition through pace, white cane, jewelry, breathing, voice, and keyboard typing; the sense of smell for fragrance, food smell, and airflow; and tactile information for length of hair, size, the way of guiding or holding the hand and the arm, and the reactions of a guide dog. Environmental and social cues indicate whether somebody is present, e.g., a light turned on in a room or somebody answering a question. Many of these cues can already be implemented in virtual environments with avatars; we summarize them in a set of guidelines.

Virtual reality (VR) technology allows users to visit virtual environments and to interact with them. This increasingly includes interactions with virtual representations of other human users and artificial characters controlled by the computer, termed avatars and agents, respectively. Much recent research focuses on improving the appearance of such virtual humans and on how characteristics such as emotional expressiveness, personalization, or motion profiles influence the quality of the interpersonal interaction. As with many other features of VR environments, current representations of human characters rely heavily on visual aspects. As a result, they are less accessible to users with visual impairments. Difficulties include not being aware of the presence of other characters, or not being able to discern their location, identity, or activity. Consequently, persons with visual impairments face a higher threshold when it comes to joining social or collaborative VR experiences. Recently, there have been efforts to increase the accessibility of VR for people with VI [23]. So far, these efforts have focused on improving interactions with physical features of the VR environment. Our work aims to also improve inclusiveness in terms of the social and collaborative aspects of VR. To this end, we investigated how people with VI determine the presence, identity, and activity of other humans in real-world settings.
These factors are likely to influence whether a person will approach others with the goal of engaging in a social interaction: first, a person with VI has to be aware of the presence of potential interaction partners; second, the identity of co-present persons might play a role for certain types of interactions (such as banter between friends); and third, the activity might determine whether the other person is available for an interaction or not (e.g., a working colleague should not be interrupted). We created an online survey answered by fifteen adults and minors with VI. We present the results of this survey and propose guidelines for the creation of inclusive virtual characters that can be implemented with currently available technological solutions (Fig. 1).

Virtual environments support an increasing range of interactive and collaborative activities [15], such as physical and cognitive therapies [3, 11], training [12, 14, 17], gaming, telepresence for meetings [18], and communication with conversational agents [13, 16]. All of these applications require the representation of oneself or others in virtual space, in the form of (self-)avatars or virtual agents. These representations rely strongly on visual cues, as these are considered essential for the realism and faithfulness of avatars [19]. Hence, many research efforts focus on photo-realistic avatars as well as on the visual rendering of life-like motions and non-verbal cues. For example, research on therapy to counter perceived body image distortion in anorexia has used biometrically correct virtual bodies, so that the visual representation is realistic [3, 11]. Crisis management training [12, 17] and surgery training [14] aim to teach collaboration with virtual agents and other users in order to improve collective work.

The representation of others and of virtual agents has mostly made use of visual feedback. Occasionally, multisensory cues are used to increase the realism and individuality of avatars, e.g., in the context of communication. Here, it is increasingly common to additionally implement auditory cues (synthetic voices, Voice over Internet Protocol (VoIP), etc.; e.g., [17] and [14]). Voice rendering, combined with visual cues, is known to be a key component of collaborative skills and avatar realism [19]. Only rarely do social interactions in VR include other auditory cues or tactile feedback to the user, such as a touch to the forearm or shoulder [6, 7]. As of yet, it is unclear whether these developments are conducive to creating inclusive virtual environments. Improving fine visual detail is unlikely to create strong cues for people with VI, and the current focus on voices ignores other potential auditory sources as cues to the presence and identity of others.

Accessibility of Virtual Reality for People with VI. Multiple projects have specifically investigated VR for people with VI. It is now possible to navigate auditory, sometimes audio-visual, virtual environments without vision using a force-feedback joystick and environmental cues through audio-tactile feedback [10], with a keyboard to explore a virtual audio environment [2], with smartphone applications using gestures [4, 5], and by physically walking in a virtual or augmented room [9, 20]. Two works propose virtual white canes: one simulates a white cane with a braking system [22], and another designs a virtual white cane through vibration in the hand controller of the VR system [8].
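To give an intuition for how such controller-vibration feedback can stand in for physical cane contact, the sketch below maps the distance to the nearest virtual obstacle to a vibration amplitude. It is a minimal, engine-agnostic illustration of the general idea, not the implementation of the cited systems [8, 22]; the names (VirtualCane, distance_to_nearest_obstacle, the ray cast that would supply the distance) are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class VirtualCane:
    """Sketch of a vibration-based virtual white cane (engine-agnostic)."""
    length: float = 1.2          # virtual cane length in meters (assumed value)
    max_amplitude: float = 1.0   # strongest vibration the controller can output

    def update(self, distance_to_nearest_obstacle: float) -> float:
        """Return the vibration amplitude for the hand controller.

        `distance_to_nearest_obstacle` would come from a ray cast along the
        controller's pointing direction in the VR engine (not shown here).
        """
        if distance_to_nearest_obstacle >= self.length:
            return 0.0  # nothing within cane reach: no feedback
        # Closer obstacles produce stronger vibration, like firmer cane contact.
        proximity = 1.0 - distance_to_nearest_obstacle / self.length
        return self.max_amplitude * proximity

# Example: an obstacle 0.3 m away yields a fairly strong vibration.
cane = VirtualCane()
print(cane.update(0.3))   # ~0.75
print(cane.update(2.0))   # 0.0 (out of reach)
```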
In all of these works, the user interacts with the physical aspects of the virtual environment only, for example by exploring it. As far as we are aware, no virtual characters are present, which leaves the potential for collaborative experiences unaddressed.

We created a questionnaire to gather insights into the strategies and cues used by persons with VI in real-world settings when attempting to detect and identify others with the goal of initiating an interaction. To this end, we developed 20 questions falling into several categories. The full set of questions is provided in Table 1. A first set of categories covers four types of mutual awareness in social and collaborative settings: environmental awareness, user and interaction awareness, action awareness, and organizational awareness [1]. As summarized in [21], environmental awareness refers to sharing the same space (Q1), action awareness is being aware of the concurrent actions of the other users (Q6), interaction and user awareness concerns knowing the available interactions and knowing who the other users are (Q2, Q3, Q4, Q5, Q8, Q9), and organizational awareness is about knowing and sharing the implications of context on the group (Q7). Questions Q8, Q9, Q19, and Q20 aim to identify good practices both for representing people and for initiating interactions in virtual environments. Further, we asked about the use of the senses (Q10 to Q15) and about other potential information found in the literature and in blogs (Q16-Q17), and included one open question (Q18).

The survey was self-administered and organized as follows: (1) presentation of the goal of the study, confidentiality, and consent form (for minors, this part was read and agreed upon by the participant's legal guardian for legal reasons; it was the only part that required another person to answer on behalf of the minor participants); (2) demographic information: age, gender, and level of vision; (3) the survey itself: the twenty questions presented in Table 1. We first sent our survey to three blind adults so that they could verify its accessibility and the questions. We then distributed the survey to professionals at two schools for people with visual impairments in France. The schools forwarded the survey to former students and to parents of current students.

Questionnaire responses were returned by fifteen participants in total, i.e., eight adults and seven minors under the age of eighteen. For confidentiality reasons and to respect the terms of the consent form, we do not provide individual demographics. However, we describe each demographic dimension (gender, age, and level of vision) separately. Regarding the adults' group, 6 participants were male, 2 were female, and 0 were other. Their average age was 38 years. Regarding the minors' group, 5 participants were male, 1 was female, and 1 was other. Their average age was 12 years, with a standard deviation of 4 years. The ages range from 6 to 16 years, with the following values: 6, 10, 10, 10, 14, 15, 16. One participant had low vision, 2 were blind with light perception, and 4 were blind without light perception.

In the following, we analyze participants' responses in terms of their own observational and behavioral strategies, as well as strategies used by others, to detect and subsequently identify a person as known or unknown (Sects. 4.1, 4.2 and 4.3) and to identify their activity (Sect. 4.4). In Sect. 4.5, we summarize strategies that others might use to initiate an interaction.
Numbers in round brackets indicate the number of participants who mentioned a particular fact.

Observation Strategies: Vision is used to identify silhouettes (2 participants), although audition is more commonly used, by paying attention to the noises people make (9 participants) through their activities (indistinct noise, voice, breathing, typing on a computer, pace) or by being directly greeted by somebody in the room. Participants also use smell (4 participants), such as perfume, food, and hygiene products such as skin care and aftershave; touch, for perceiving airflow and clothes; and, interestingly, a sense of co-presence (5 participants) in the form of passive echolocation or a feeling of being observed. Smell can also help to estimate the distance of another person.

Behavioral Strategies: Participants mention saying hello and then waiting to see whether somebody answers. The direction of the answer can help to visually search for the other person. An indirect strategy is trying to detect whether the light in a room is turned on, since rooms containing people are often illuminated. This is useful when visual capabilities are insufficient to recognize silhouettes but light perception is possible.

Observation Strategies: People can be identified visually by the individual shape of their silhouette (2 participants) and by their face if close enough (2 participants). However, eleven of the participants do not use vision at all to identify others. Visual or tactile feedback is used to recognize somebody by their hair (tactilely, 2 participants; visually, 2 participants), their clothes (3 participants), and their size (from the silhouette; tactilely, via the height of the guiding arm; and via audio when the person is talking). Audition is used to pay attention to the voice (14 participants; the sample needs to be long enough to disambiguate identity) and to person-specific sounds like laughter (2 participants), the way of breathing (3 participants), and the way of moving and walking (3 participants). Other person-specific sounds mentioned were the way of playing with a pen, typing on a keyboard, and sounds from jewelry. Jewelry can also help to identify a person by touch. Two participants said they rarely dare to use active tactile exploration, as it may make others uncomfortable. Passive tactile feedback is used to recognize close relatives (e.g., by the way another person grabs one's arm or guides, how their hands feel, their musculature). Smell and fragrance can help (10 participants) but do not always give a clear cue to identity. Participants also mentioned using a person's passive echolocation signature for identification (2 participants). Lastly, the reactions of a guide dog indicate that a person is known; however, this usually does not help to identify the person (2 participants).

Behavioral Strategies: Some participants let people approach and present themselves (5 participants). To this end, they define a precise meeting point (3 participants) and make their arrival known (4 participants; by phone, by calling out loud, or by tapping the white cane on the floor). The context can help to disambiguate persons (colleagues in the workplace, the landlord inside a specific house). Other people can draw attention to themselves by calling out or waving at the person with VI. One participant reports never going out without a sighted guide; two others report that they are not able to identify people and have to rely on others approaching them.
A person who is easy to identify has a recognizable voice (9 participants) or breathing, has a particular smell or fragrance (3 participants), or makes distinctive noises (wristbands, keys, heels, a personal style of walking (3 participants), a wheelchair). A person is difficult to recognize if their voice is difficult to identify (7 participants; the person is not talking, speaks too quietly, or their voice strongly resembles that of another person). Similarly, a person who does not move or who is encountered outside of their usual context is difficult to recognize, as are people encountered in large public spaces with too many people (4 participants) and too much noise (3 participants). Lastly, familiarity plays a role: a person who is encountered less often is more difficult to recognize (4 participants), because there were insufficient opportunities to learn distinctive elements for identification (clothes, pace, voice, way of grabbing).

Observation Strategies: In general, the cues are the same as when identifying a known person, but without any correspondence in memory: participants use the voice (10 participants), typical words or gestures, the way of grabbing the arm or guiding, the silhouette, and smell. The fact that participants do not know a person is usually confirmed during the conversation and by asking questions. Sometimes, the course of the conversation leads to the realization that the participant had wrongly assumed to be talking to a known person.

Observation Strategies and Joining an Activity: Ongoing activities can be identified visually from gestures and, in the case of physical activities, from the movements of human silhouettes (2 participants). Audition is particularly helpful for picking up the noises that humans or materials emit (9 participants; e.g., a keyboard, or the specific sounds of a familiar activity such as board-game tokens or cooking tools). The density of the noise, the number of different voices, and the content of the conversation are also considered useful cues to the activity (6 participants), as is the general atmosphere of the room (2 participants; e.g., tavern vs. meeting). Smell and touch give information about the materials used (cooking, the smell of a pen when drawing). Three participants indicated that they would not participate in an activity or a conversation if they were not explicitly included by the others or invited to join.

In the following, we summarize good and bad practices mentioned by our participants regarding how others can identify themselves and initiate an interaction.

Good Practices: In many situations, it is helpful if the approaching person states their name (8 participants), their function or job (3 participants), and the context of previous encounters (5 participants). If the person is unknown, they should indicate why they are approaching and, if applicable, the name of the person who sent them. When others do not introduce themselves properly, our participants' strategies are to ask who it is (4 participants), to ask about possible shared memories or previous encounters (2 participants), and to ask questions in general.

Bad Practices: Several participants stated that sudden and unannounced physical contact is undesirable, such as catching somebody who is moving with a white cane by the arm to stop them (2 participants), patting them on the shoulder (2 participants), or other touch without prior consent (4 participants).
Approaching somebody silently can generate negative effects, such as surprising the participants or making them uncomfortable because they failed to hear the person approach. Addressing a person with VI verbally has its pitfalls as well: others may not speak long enough to be identified (e.g., "Hello" is too short), a call from too far away can trigger a visual search for the caller without successful localization, and calling or speaking loudly at close distance may be startling (3 participants).

First Impression: What influences the first impression somebody makes on a person with VI? Participants stated that they rely on concrete cues: the handshake, the tone and intonation of the voice (10 participants), whether the voice feels warm, the way of speaking and the words used (3 participants), the attention the person pays, the face, and the overall behavior (2 participants). One participant stated that they can tell from the voice and intonation whether the visual impairment will be an obstacle.

The responses to our questionnaire suggest a number of extensions to virtual characters that would increase their accessibility for persons with VI. Extensions that are feasible with current technologies are described in the first subsection; the answers also open future perspectives, which we describe in the second subsection.

Visual Attributes. Avatars and agents should have distinctive silhouettes, clothes (shape, color), and faces, so that they are easily recognized even with low vision. They could have personalized movement patterns, e.g., in their idle animation or when talking. A light turned on in a virtual room can indicate that an avatar or agent is present. Some adjustments make VR potentially more accessible, for instance avoiding that the eye has to adapt to large differences in brightness, e.g., when changing virtual environments (between rooms, or from outside to inside).

Audio Attributes. Voice is a highly valuable cue for identifying people, in particular if the avatar talks long enough. Distinctive voices, e.g., differing in pitch, differentiate between characters. Even with only three voice types referring to gender (feminine, masculine, mixed), a user can differentiate between three characters. Vocabulary, speed, and rhythm add specificity. Other person-specific sounds such as breathing, laughter, the way of walking, or different types of shoes (heels, sports shoes, squeaky shoes) improve the chances of identifying the virtual character. Pace can be a cue for deciding whether to start an interaction or not (e.g., walking fast may imply little time). Identity and social cues could be conveyed by any sound-generating activity linked to a habit (playing with a pen, tapping fingernails on the table), to objects (keys, chains, a sword), or to jewelry (wristbands, earrings).

Multimodal Attributes. The height of a person can be represented visually as well as via spatialized audio feedback. The fabric of virtual characters' garments can be represented visually by a texture as well as by distinctive sounds such as rustling and swishing. Avatars in a wheelchair can be very recognizable by their silhouette and associated sounds.

Airflow. Airflow from a fan could simulate that someone has just come closer or walked by.
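To make these attribute recommendations concrete, the following is a minimal sketch of how a per-character, non-visual cue profile could be stored and queried, assuming a generic audio and haptics pipeline. All identifiers (CueProfile, presence_cues, the cue strings) are hypothetical and not tied to any particular VR framework; the distance thresholds are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CueProfile:
    """Non-visual identity cues for one avatar or agent (illustrative only)."""
    name: str
    voice_pitch: str            # e.g. "low", "medium", "high"; applied when the character speaks
    shoe_sound: str             # e.g. "heels", "sneakers", "squeaky"
    habit_sounds: List[str] = field(default_factory=list)  # keys, pen, jewelry, ...
    garment_sound: str = "none" # rustling, swishing, ...
    uses_wheelchair: bool = False

def presence_cues(profile: CueProfile, distance_m: float) -> List[str]:
    """Return the sound/haptic events to spatialize at the character's position.

    A real system would hand these to the engine's 3D-audio and haptics APIs;
    here we only name them. Closer characters expose more detail, mirroring
    the real-world cues reported by our participants.
    """
    cues = [f"footsteps:{profile.shoe_sound}"]
    if profile.uses_wheelchair:
        cues.append("wheelchair:rolling")
    if distance_m < 3.0:                      # within conversational range (assumed threshold)
        cues.append(f"garment:{profile.garment_sound}")
        cues.extend(f"habit:{s}" for s in profile.habit_sounds)
    if distance_m < 1.5:                      # close enough to trigger airflow feedback
        cues.append("haptics:fan_airflow")
    return cues

# Example: a character wearing heels and carrying jingling keys approaches the user.
alex = CueProfile("Alex", voice_pitch="low", shoe_sound="heels",
                  habit_sounds=["keys"], garment_sound="rustling")
print(presence_cues(alex, distance_m=2.0))
# ['footsteps:heels', 'garment:rustling', 'habit:keys']
```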
Initiating an interaction is a crucial prerequisite for collaboration. Therefore, it is important that it is transparent to the user how they can join activities. The user should be able to address the entire virtual room, the equivalent of calling out "Hello?", using their voice or through a controller. The virtual characters, whether agents or avatars, should detect this call (via broadcast to remote users, or via audio recognition for agents). The user may also want to send a directed call to a particular character (a virtual phone call, a virtual poke) or a semi-directed call (such as making noise with the white cane). The virtual characters should be able to answer, wave at the person, or come closer. They may also spontaneously address the user and introduce themselves, or even invite the user to join. The virtual space should ease the recognition of characters and activities (e.g., by serving as a meeting point, providing distinctive activity sounds, or ensuring good separation of multiple voices); the meeting place (noisy and crowded, or not) makes recognition harder or easier.

Smell. Combined with airflow, a fragrance dispenser could indicate that someone is in the room; the distance to this person could be indicated by the smell intensity. Many fragrances and spices are available in liquid or other forms (vanilla, chocolate, coconut, mint, banana, musk, cedar, cumin, curry, basil, cinnamon), as well as ready-made scent kits, and can be used to give olfactory feedback.

Tactile Feedback for Shapes. Tactile feedback helps to recognize by touch the jewelry, the face, and objects that help to identify a person.

Tactile Feedback for Materials. Tactile feedback could indicate hair (long, short, no hair) with only one actuator at the level of the user's hand. Tactile feedback for cloth can be added to visual and audio feedback.

Tactile Feedback for Human Contact. Tactile feedback can convey a lot of information, both to identify a person and to form a first impression. In real life, tactile cues are used for recognition during handshakes and guiding, and even to assess the size and musculature of a person.

Guide Dog. A virtual guide dog could react differently depending on the time spent with a virtual character or with the real person controlling the avatar.

Passive Echo-Location. Passive echolocation is the ability to perceive shapes and obstacles from reverberations without purposefully emitting a sound for echoing. If it can be modeled more efficiently in VR, it could become a new means of perceiving spaces and people.
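As an illustration of the distance-coded presence cues suggested above (smell intensity and airflow), the sketch below maps the distance between the user and a virtual character to output levels for a hypothetical fragrance dispenser and fan. The device interface (set_level), the maximum cue distance, and the falloff curve are assumptions for the sake of the example, not part of any existing system.

```python
import math

MAX_CUE_DISTANCE = 5.0  # meters beyond which no olfactory or airflow cue is given (assumed)

def presence_intensity(distance_m: float) -> float:
    """Map distance to a 0..1 intensity: strong when close, fading with distance."""
    if distance_m >= MAX_CUE_DISTANCE:
        return 0.0
    # Smooth falloff; other curves (linear, inverse-square) would work equally well.
    return math.cos((distance_m / MAX_CUE_DISTANCE) * math.pi / 2)

def update_presence_devices(distance_m: float, dispenser, fan) -> None:
    """Drive a fragrance dispenser and a fan from a character's distance.

    `dispenser` and `fan` stand for whatever hardware interface is available;
    here they only need a set_level(value_between_0_and_1) method.
    """
    level = presence_intensity(distance_m)
    dispenser.set_level(level)                           # stronger scent when the character is near
    fan.set_level(level if distance_m < 1.5 else 0.0)    # airflow only when passing close by

# Example with stand-in devices that simply print their level.
class PrintDevice:
    def __init__(self, name): self.name = name
    def set_level(self, value): print(f"{self.name}: {value:.2f}")

update_presence_devices(1.0, PrintDevice("fragrance"), PrintDevice("fan"))
update_presence_devices(4.0, PrintDevice("fragrance"), PrintDevice("fan"))
```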
We presented the results of an online survey on the strategies used by people with VI for detecting and identifying people, as well as ongoing activities, in real environments. We use the findings to propose solutions for inclusive collaborative virtual environments, with inclusive representations of virtual characters and activities. Many of these solutions can already be implemented, while others will become achievable with future technological developments. Further, our work is applicable to any virtual environment, such as multiplayer games, applications with non-player characters, training environments, telepresence scenarios, or video conferencing. It may require new ways of recording and broadcasting people, to add information beyond the currently predominant video stream. Our future research will focus on designing and evaluating such virtual characters with people with and without VI.

References

1. Utilizing knowledge context in virtual collaborative work
2. Virtual environments for the transfer of navigation skills in the blind: a comparison of directed instruction vs. video game based learning approaches
3. Body size estimation in women with anorexia nervosa and healthy controls using 3D avatars
4. Virtual navigation for blind people: building sequential representations of the real-world
5. Sensitive interfaces for blind people in virtual visits inside unknown spaces
6. Mediated social touch: a review of current research and future directions
7. A human touch: social touch increases the perceived human-likeness of agents in virtual reality
8. First steps towards walk-in-place locomotion and haptic feedback in virtual reality for visually impaired
9. Virtual navigation environment for blind and low vision people
10. Construction of cognitive maps of unknown spaces using a multi-sensory virtual environment for people who are blind
11. Assessing body image in anorexia nervosa using biometric self-avatars in virtual reality: attitudinal components rather than visual body size estimation are distorted
12. Virtual cities for real-world crisis management
13. Collaborative interaction analysis in virtual environments based on verbal and nonverbal interaction
14. SimCEC: a collaborative VR-based simulator for surgical teamwork education
15. Why and how to use virtual reality to study human social interaction: the challenges of exploring a new research landscape
16. Who, me? How virtual agents can shape conversational footing in virtual reality
17. Designing soundscapes of virtual environments for crisis management training
18. Cultural diversity and information and communication technology impacts on global virtual teams: an exploratory study
19. Collaboration in immersive and non-immersive virtual environments
20. X-road: virtual reality glasses for orientation and mobility training of people with visual impairments
21. How to move from inclusive systems to collaborative systems: the case of virtual reality for teaching O&M
22. Enabling people with visual impairments to navigate virtual reality with a haptic and auditory cane simulation
23. SeeingVR: a set of tools to make virtual reality more accessible to people with low vision

Acknowledgments. We thank the participants and the institutions that distributed our survey, especially IRSA and Ocens. This work was supported by the European Union's Horizon 2020 Program under ERCEA grant no. 683008 AMPLIFY.