key: cord-0881451-c60ia6l9 authors: Lee, Kyungjun; Sato, Daisuke; Asakawa, Saki; Asakawa, Chieko; Kacorri, Hernisa title: Accessing Passersby Proxemic Signals through a Head-Worn Camera: Opportunities and Limitations for the Blind date: 2021-09-13 journal: ASSETS. Annual ACM Conference on Assistive Technologies DOI: 10.1145/3441852.3471232 sha: dc86d97f60eedc1b2fc7b37856c9659dbf614e99 doc_id: 881451 cord_uid: c60ia6l9 The spatial behavior of passersby can be critical to blind individuals to initiate interactions, preserve personal space, or practice social distancing during a pandemic. Among other use cases, wearable cameras employing computer vision can be used to extract proxemic signals of others and thus increase access to the spatial behavior of passersby for blind people. Analyzing data collected in a study with blind (N=10) and sighted (N=40) participants, we explore: (i) visual information on approaching passersby captured by a head-worn camera; (ii) pedestrian detection algorithms for extracting proxemic signals such as passerby presence, relative position, distance, and head pose; and (iii) opportunities and limitations of using wearable cameras for helping blind people access proxemics related to nearby people. Our observations and findings provide insights into dyadic behaviors for assistive pedestrian detection and lead to implications for the design of future head-worn cameras and interactions. Access to the spatial behavior of passersby can be critical to blind individuals. Passersby proxemic signals, such as change in distance, stance, hip and shoulder orientation, head pose, and eye gaze, can indicate their interest in "initiating, accepting, maintaining, terminating, or altogether avoiding social interactions" [50, 63] . More so, awareness of others' spatial behavior is essential for preserving one's personal space. Hayduk and Mainprize [40] demonstrate that the personal space of blind individuals does not differ from that of sighted individuals in size, shape, or permeability. However, their personal space is often violated as sighted passersby, perhaps in an attempt to help, touch them, or grab their mobility aids without consent [19, 103] . Last, health guidelines as with the recent COVID-19 pandemic [74] require one to practice social distancing by maintaining a distance from others of at least 3 feet (1 meter) 1 or 6 feet (2 meters) 2 , presenting unique challenges and risks for the blind community [31, 38] ; mainly, due to the fact that spatial behavior of passersby and signage or markers designed to help maintain social distancing are predominantly accessible through sight and thus are, in most cases, inaccessible to blind individuals. Other senses such as hearing and smell could be utilized perhaps to estimate passersby distance and orientation but require close proximity or quiet spaces; in noisy environments, blind people's perception of surroundings out of the range of touch is limited [28] . More so, mobility aids may work adversely in some cases (e.g., guide dogs are not trained to maintain social distancing [87] ). Why computer vision and wearable cameras. Assistive technologies for the blind that employ wearable cameras and leverage advances in computer vision, such as pedestrian detection [5, 36, 54, 96] , could provide access to spatial behavior of passersby to help blind users increase their autonomy in practicing social distancing or initiating social interaction; the latter motivated our work since it was done right before COVID-19. 
'Speak up' and 'Embrace technology' are two of the tips that the LightHouse Guild provided to people who are blind or have low vision for safely practicing social distancing during COVID-19 [38], mentioning technologies such as Aira [11] and BeMyEyes [27]. However, these technologies rely on sighted people for visual assistance. Thus, beyond challenges around cost and crowd availability, they can pose privacy risks [6, 13, 94]. Prior work, considering privacy concerns for parties that may get recorded, has shown that bystanders tend to be amicable toward assistive uses of wearable cameras [7, 80], especially when data are not sent to servers or stored somewhere [54]. However, little work explores how blind people capture their scenes with wearable cameras and how their data could work with computer vision models. As a result, there are few design guidelines for assistive wearable cameras for blind people.
To better understand the opportunities and challenges of employing such technologies for accessing passersby proxemic signals, we collect and analyze video frames, log data, and open-ended responses from an in-person study with blind (N=10) and sighted (N=40) participants 3. As shown in Fig. 1, blind participants ask passersby for information while wearing smart glasses with our testbed prototype for real-time pedestrian detection. We explore what visual information about the passerby is captured with the head-worn camera; how well pedestrian detection algorithms can extract proxemic signals in camera streams from blind people; how well the real-time estimates can support blind users at initiating interactions; and what limitations and opportunities head-worn cameras can have in terms of accessing proxemic signals of passersby.
1 Minimum distance suggested by the World Health Organization (WHO) [72].
2 Minimum distance suggested by the Centers for Disease Control and Prevention (CDC) in the United States [73].
3 The data were collected right before COVID-19 with a use case scenario of blind individuals initiating social interactions with sighted people in public spaces. Our team, co-led by a blind researcher, was annotating the data when the pandemic reached our shores. Thus, we felt it was imperative to expand our annotation and analysis to include distance thresholds recommended for social distancing. After all, both scenarios share common characteristics, such as the need for proximity estimation and for initiating interactions to maintain distancing, i.e., the 'Speak up' strategy recommended by the LightHouse Guild [38].
Our exploratory study shows that there are still many limitations to overcome for such assistive technology to be effective. When the pedestrians' faces are included, the results are promising. Yet, idiosyncratic movements of blind participants and the limited field of view in current wearable cameras call for new approaches in estimating passersby proximity. Moreover, analyzing images from smart glasses worn by blind participants, we observe variations in their scanning behaviors (i.e., head movement) leading to capturing different body parts of passersby, and sometimes excluding essential visual information (i.e., the face), especially when they are near passersby. Blind participants' qualitative feedback indicates that they found it easy to access proxemics of passersby via smart glasses as this form factor did not require camera aiming manipulation.
They generally appreciated the potential and importance of such a wearable camera system, but commented on how their experience degraded with errors. They mentioned trying to adapt to these errors by aggregating multiple estimates instead of relying on a single one. At the same time, they shared their concern that there was no guaranteed way for them to check errors in some visual estimates. These findings help us discuss the implications, opportunities, and challenges of an assistive wearable camera for blind people.
Our work is informed by prior work in assistive technology for non-visual access to passersby's spatial behaviors and existing approaches for estimating proximity beyond the domain of accessibility. For more context, we also share prior observations related to blind individuals' interactions with wearable and handheld cameras. Prior work on spatial awareness for blind people with wearable cameras and computer vision has mainly focused on tasks related to navigation [29, 55, 68, 70, 99] and object or obstacle detection [4, 55, 56, 64, 70]. We find only a few attempts exploring the spatial behavior of others whom a blind individual may pass by [48, 61] or interact with [51, 91, 96]. For example, McDaniel et al. [61] proposed a haptic belt to notify blind users of the location of a nearby person, whose face is detected by the wearable camera. While closely related to our study, this system was not evaluated with blind participants. Thus, there are no data on how well this technology would work for the intended users. Kayukawa et al. [48], on the other hand, evaluate their pedestrian detection system with blind participants. However, the form factor of their system, a suitcase equipped with a camera and other sensors, is different. Thus, it is difficult to extrapolate their findings to other wearable cameras such as smart glasses that may be more sensitive to blind people's head and body movements. Designing wearable cameras for helping blind people in social interactions is a long-standing topic [32, 51, 66, 76]. Focusing on this topic, our work is also inspired by Stearns and Thieme's [96] preliminary insights on the effects of camera positioning, field of view, and distortion for detecting people in a dynamic scene. However, they focused on interactions during a meeting (similar to [91]) and analyzed video frames captured by a single blindfolded user, which is not a proxy for blind people [3, 79, 88, 100]. In contrast to our work, the context of a meeting does not allow for extensive movements and distance changes, which are typical when proxemics of passersby are needed. In our paper, we explore how blind people's movements and distances can shape the visual characteristics of their camera frames.
Distance between people is key to understanding spatial behavior. Thus, epidemiologists and behavioral scientists have long been interested in automating the estimation of people's proximity [14, 60]. Often their approaches employ mobile sensors [60] or Wi-Fi signals [75]. For example, Madan et al. [60] recruited participants in a residence hall and tracked foot traffic based on their phones' signals to understand how changes in their behavior related to reported symptoms. While these approaches work well for their intended objective, they are limited in the context of assistive technology for real-time estimates of approaching passersby in both indoor and outdoor spaces.
Since the recent pandemic outbreak [74] , spatial proximity estimation (i.e., looking at social distancing) has gained much attention. The computer vision community started looking into this using surveillance cameras [33, 86] or camera-enabled autonomous robots [2]. Using cameras capturing people from the perspective of a third person, they monitor the crowd and detect people who violate social distancing guidelines. However, for blind users, an egocentric perspective is needed. Other approaches include using ultrasonic range sensors that can measure a physical distance between people wearing the sensors (e.g. [41] ). This line of work is promising in that the sensors consume low power, but the sensors can help detect only presence and approximate distance. They cannot provide access to other visual proxemics that blind people may care about [54] . More so, they assume that sensors are present on every person. Thus, in our work, we prioritize RGB cameras that can be worn only by the user to estimate presence, distance, and other visual proxemics. Specifically, we borrow from pedestrian detection approaches in computer vision, but some do not really work for our context. For example, they estimate distances by looking at visual differences of the pedestrian between two images from a stereo-vision camera [43, 69] . They first track the pedestrian and then change the camera angle to position the person in the center of frames [43] . Alternatively, they use two stationary cameras [69] . Neither case is appropriate for our objective. However, approaches that estimate the distance based on a person's face or eyes captured from an egocentric user perspective could work [30, 83, 104, 110] . For our testbed, we use the face detection model of Zhang et al. [110] to estimate a passerby's presence, distance, relative position, and head pose. Smartphones and thus handheld cameras are the status quo for assistive technologies for the blind [9, 16, 17, 45, 47, 53, 101, 102, 116] , though, there are still many challenges related to camera aiming [9, 17, 45, 47, 52, 53, 102, 116] , which can be trickier for a pedestrian detection scenario. Nonetheless, what might start as research ideas quickly translate to real-world applications [1, 11, 21, 23, 24, 26, 27, 57, 84, 92, 98] . However, the use of cameras along with computationally intensive tasks can rapidly heat up the phone and drain its battery [62, 111] . Although this issue is not unique to this form factor, it can harm phone availability that blind users may rely on; many have reported experiencing anxiety when hearing battery warnings in their phones, especially at their work or during commuting [106] . With the promise of hands-free and thus more natural interactions, head-worn cameras are falling on and off the map of assistive technologies for people who are blind or have low vision [37, 42, 58, 78, 95, 97, 113, 115] . This can be partially explained by hardware constraints, such as battery life and weight, limiting their commercial availability [93] which can also relate to factors such as cost, social acceptability, privacy, and form factor [54, 80] . Nonetheless, we see a few attempts in commercial assistive products [1, 11, 26, 71] , and people with visual impairments remain amicable towards this technology [117] , especially when it resembles a pair of glasses [67] similar to the form factor used in our testbed. 
However, to our knowledge, there is no prior work looking at head-worn camera interaction data from blind people, especially in the context of accessing the proxemics of others. Although wearable cameras do not require wearers to aim a camera, including visual information of interest in the frame can still be challenging for blind people. Prior work with sighted [105] and blindfolded [96] participants indicates that the camera position can lead to different frame characteristics, though such data are not a proxy for blind people's interactions nor related to the pedestrian detection scenario. Thus, collecting and analyzing data from blind people, our work takes a step towards understanding their interactions with a head-worn camera in the context of pedestrian detection.
We build a testbed called GlAccess to explore what visual information about passersby is captured by blind individuals with a head-worn camera and how well estimates of passersby's proxemics can support initiating interactions. As shown in Fig. 2, GlAccess consists of Vuzix Blade smart glasses [18], a Bluetooth earphone, and a computer vision system running on a server. With a sampling rate of one image per second, a photo is sent to the server to detect the presence of a passerby and extract proxemic signals, which are then communicated to the user through text-to-speech (TTS). To mitigate cognitive overload, estimates that remain the same as in the last frame are not communicated. Vuzix Blade smart glasses have an 80-degree diagonal camera field of view and run on Android OS. Our server has eight NVIDIA 2080 TI GPUs running on Ubuntu 16.04. We build a client app on the glasses that is connected to the server through RESTful APIs implemented with Flask [81] and uses the IBM Watson TTS to convert the proxemic signals to audio (e.g., "a person is near on the left, not looking at you." as shown in Fig. 1). More specifically, our computer vision system estimates the following related to proxemics:
• Presence: A passerby's presence is estimated through face detection. We employ Multi-task Cascaded Convolutional Networks (MTCNNs) [110] pre-trained on 393,703 faces from the WIDER FACE dataset [108], which is a face detection benchmark with a high degree of variability in scale, pose, occlusion, facial expression, makeup, and illumination. The model provides a bounding box where the face is detected in the frame.
• Distance: For distance estimation, we adopt a linear approach that converts the height of the bounding box from the MTCNNs face detection model to a distance. Using the Vuzix Blade smart glasses, we collect a total of 240 images with six data contributors from our lab. Each contributor walks down a corridor that is approximately 66 feet (20 meters) long; this is the same corridor where the study takes place. We take two photos every meter for a total of 40 photos per contributor. We train a linear interpolation function with the height of faces extracted from these photos and the ground-truth distances between the contributors and the photo-taker wearing the smart glasses. This function estimates the distance as a numeric value, which is communicated either more verbosely (e.g., '16 feet away') or less so by stating only 'near' or 'far' based on a 16 feet (5 meters) threshold. Participants in our study are exposed to both (i.e., experiencing each for half of their interactions).
Fig. 3. Diagram of the study procedure. A blind participant walked eight times in a corridor that is 66 feet (20 meters) long and 79 inches (2 meters) wide. Each time, a different sighted person walked towards them. In four of the eight times, those individuals were sighted study participants. In the other four, they were sighted members of our research lab. In addition to the proxemic signals, GlAccess can recognize the lab members, simulating future teachable applications [46] where users can personalize the model for familiar faces.
• Position: We estimate the relative position of a passerby based on the center location of the bounding box of their face detected by the MTCNNs model. A camera frame is divided into three equal-width regions, which are accordingly communicated to the user as 'left', 'middle', and 'right'.
• Head Pose: We use head pose detection to estimate whether a passerby is looking at the user. More specifically, we build a binary classification layer on top of the MTCNNs face detection model and fine-tune it with the 240 images from our internal dataset collected by the six contributors and 432 images from a publicly available head pose dataset [35]. The model estimates whether a passerby is looking at the user or not. We also simulate future teachable technologies [46, 77] that can be personalized by the user providing faces of family and friends. In those cases, the name of the person can be communicated to the user. Sighted members of our research group volunteered to serve this role in our user study.
• Name: We pre-train GlAccess with a total of 144 labeled photos from five sighted lab members so that the system can recognize them in the study setting to simulate a personalized experience. This face recognition is implemented with an SVM classifier that takes face embedding data from the MTCNNs model.
A minimal code sketch of this per-frame estimation loop is given below.
To explore the potential and limitations of head-worn cameras for accessing passersby proxemics, we conducted an in-person study with blind and sighted participants (under IRB #2019_00000294). We employed a scenario, illustrated in Fig. 3, where blind participants wearing GlAccess were asked to initiate interactions with sighted passersby in an indoor public space. Given the task of asking someone for the number of a nearby office, they walked down a corridor where sighted participants were coming in the opposite direction.
We recruited ten blind participants; nine were totally blind, and one was legally blind. On average, blind participants were 63.6 years old (SD = 7.6). Five self-identified as female and five as male. Three participants (P4, P8, P10) mentioned having light perception, and four participants reported having experience with wearable cameras such as Aira [11], with frequencies shown in Table 1.
We shared the consent form with blind participants via email to provide ample time to read it, ask questions, and consent prior to the study. Upon arrival, the experimenter asked the participants to complete short questionnaires on demographics, prior experience with wearable cameras, and attitude towards technology. The experimenter read the questions; participants answered them verbally. Blind participants first heard about the system and then practiced walking with it in the study environment and detecting the experimenter as he approached in the opposite direction. During the main data collection session, blind participants were told to walk in the corridor and ask a person, if detected, about the office number nearby. On the other hand, sighted participants, one at a time, were told to simply walk in the corridor with no information about the presence of blind participants in the study.
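To make the system description above concrete, here is a minimal sketch of the per-frame estimation loop, assuming the open-source `mtcnn` package as a stand-in for the MTCNNs detector used in GlAccess. The linear distance coefficients are placeholders, not the calibrated values from the 240-image dataset, and the head-pose classifier and SVM-based name recognition are omitted.

```python
import numpy as np
from mtcnn import MTCNN  # stand-in for the MTCNNs face detector used in GlAccess

detector = MTCNN()

# Placeholder linear mapping from face-box height (pixels) to distance (feet),
# standing in for the interpolation function fit on the 240 calibration images.
SLOPE, INTERCEPT = -0.12, 40.0
NEAR_FAR_THRESHOLD_FT = 16.0  # 16 feet (5 meters), as in the paper


def estimate_proxemics(frame_rgb):
    """Return presence, distance, and position for one camera frame (H x W x 3)."""
    faces = detector.detect_faces(frame_rgb)
    if not faces:
        return {"presence": False}

    # Keep the most confident detection (the testbed handled one passerby at a time).
    face = max(faces, key=lambda f: f["confidence"])
    x, y, w, h = face["box"]

    # Distance: linear function of face-box height, then a near/far verbal category.
    distance_ft = max(0.0, SLOPE * h + INTERCEPT)
    category = "near" if distance_ft <= NEAR_FAR_THRESHOLD_FT else "far"

    # Position: divide the frame into three equal-width regions by face center.
    frame_width = frame_rgb.shape[1]
    center_x = x + w / 2
    position = ["left", "middle", "right"][min(2, int(3 * center_x / frame_width))]

    return {"presence": True, "distance_ft": distance_ft,
            "category": category, "position": position}
```

In the testbed, an estimate like this would only be voiced when it differs from the previous frame's, to avoid overwhelming the user with repeated feedback.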
As illustrated in Fig. 3, a blind participant walked in the corridor eight times: four times with four sighted participants and four times with four sighted members of our research lab. Our study was designed not to extend beyond 2 hours. All sighted participants were recruited on site and were thus strangers to the blind participants; blind participants did not meet any sighted participants before the study. A stationary camera was used to record the dyadic interactions in addition to the head-worn camera and the server's logs. Both blind and sighted participants consented to this data collection prior to the study; sighted participants were told that there would be camera recording in the study, which might include their face, but that these images would not be made publicly available without being anonymized. The study ended with a semi-structured interview for blind participants, with questions 5 created for eliciting their experience and suggestions towards the form factor, proxemics feedback delivery, error interaction, and factors impacting future uses of such head-worn cameras.
5 Questionnaire based on [89] is available at https://iamlabumd.github.io/assets2021_lee: -3 indicates the most negative, 0 neutral, and 3 the most positive.
We collected 183-minute-long session recordings from a stationary camera, more than 1,700 camera frames from the smart glasses testbed sent to the server, estimation logs on proxemic signals of sighted passersby that were communicated back to blind participants, and 259-minute-long audio recordings of the post-study interview.
We first annotated all camera frames captured by the smart glasses worn by blind participants. As shown in Table 2, annotation attributes include the presence of a passerby in the video frame, their relative position and distance from the blind participant, whether their head pose indicates that they were looking towards the participant, and whether there was an interaction. Annotations for the binary attributes, presence, head pose, and interaction, were provided by one researcher as there was no ambiguity. For presence, the annotator marked whether the head, torso, arm(s), and leg(s) of the sighted passerby were captured in the camera frames. Regarding head pose, the annotator checked whether the sighted passerby was looking at the blind user. As for interaction, the annotator marked whether the blind or the sighted participant was speaking while consulting the video recordings from the stationary camera. Annotations for distance and position were provided by two researchers, who first annotated independently and then resolved disagreements. The initial annotations from the two researchers achieved Cohen's kappa scores of 0.89 and 0.90 for distance and position, respectively. Distance was annotated as '< 6 ft': within 6 feet (2 meters); 'near': farther than 6 feet (2 meters) but within 16 feet (5 meters); or 'far': farther than 16 feet (5 meters). Distance in frames where the passerby was not included was annotated as n/a. As for position, the annotators checked whether the person was on the left, middle, or right of the frame. This relative position was determined based on the location of the center of the person's face within the camera frame. When the face was not included, the center of the included body part was used. The annotators resolved all annotation disagreements together at the end.
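To make the annotation scheme above concrete, the distance categories can be expressed as a simple mapping, and inter-annotator agreement can be computed with an off-the-shelf implementation of Cohen's kappa. The snippet below is illustrative only; the example labels are hypothetical, not taken from the study data.

```python
from sklearn.metrics import cohen_kappa_score


def distance_category(feet):
    """Map a ground-truth distance (in feet) to the annotation categories used in the paper."""
    if feet is None:
        return "n/a"      # passerby not included in the frame
    if feet <= 6:
        return "<6ft"     # within 6 feet (2 meters)
    if feet <= 16:
        return "near"     # farther than 6 feet but within 16 feet (5 meters)
    return "far"          # farther than 16 feet


# Agreement between two independent annotators (hypothetical labels).
annotator_a = ["far", "near", "near", "<6ft", "n/a", "far"]
annotator_b = ["far", "near", "<6ft", "<6ft", "n/a", "far"]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 would mean perfect agreement
```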
For our quantitative analysis, we mainly used descriptive analysis on our annotated data to see what visual information about passersby blind participants captured using the smart glasses and how well our system performed in extracting proxemic signals (i.e., the presence, position, distance, and head pose of passersby). We adopted common performance metrics for machine learning models (i.e., F1 score, precision, and recall) to measure the performance of our system in terms of extracting proxemic signals. For our qualitative analysis, we first transcribed blind participants' responses to the open-ended interview questions. Then, we conducted a thematic analysis [20] on the open-ended interview transcripts. One researcher first went through the interview data to formulate emerging themes in a codebook. Then, two researchers iteratively refined the codebook together while applying codes to the transcripts and addressing disagreements. The final codebook had a 3-level hierarchical structure (level-1: 11 themes; level-2: 20 themes; level-3: 71 themes).
Fig. 4. Inclusion rates of (a) any body part and (b) the head of passersby in camera frames per distance category. A relatively high percentage of camera frames capture a passerby (face and body) at a distance larger than 6 feet in a corridor that is 79 inches wide; this can vary across blind participants, with some outlier cases being as low as 50%. Within 6 feet, inclusion can get unpredictable, with faces often being included less.
We simplified our user study to have only one passerby in the corridor since our prototype system was able to detect only one person at a time. In the future, designers should consider handling cases where more than one passerby appears. When more than one passerby is included in a camera frame, a system needs to detect the person of most interest to blind users and then deliver that person's proxemics to the user. Such a method should help users perceive the output efficiently and quickly. For example, detecting a person within a specific distance range could be a naive but straightforward way. Although our data collection scenario is somewhat limited, the images, videos, and logs collected from 80 pairs of blind and sighted individuals allow us to account for variations within a single blind participant and across multiple blind participants. Our observations, contextualized with blind participants' feedback, provide rich insights on the potential and limitations of head-worn cameras.
In a real-world setting, where both blind users and other nearby people are constantly in motion, it is important to understand what is being captured by assistive wearable cameras, smart glasses in our case. What is visible dictates what kind of visual information regarding passersby's proxemics can be extracted. In this section, results are based on our analysis of 3,175 camera frames from the head-worn camera and their ground-truth annotations. At what distance is a passerby captured by the camera? Fig. 4 shows that a relatively high percentage of camera frames captured a passerby at a distance larger than 6 feet in a corridor that was 79 inches wide; this varied across blind participants, with some outlier cases being as low as 50%.
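As a side note on the quantitative analysis described above, the per-frame performance metrics (precision, recall, and F1 score) can be computed by comparing the system's presence estimates against the ground-truth annotations. The sketch below uses scikit-learn and hypothetical labels rather than the study data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical per-frame labels: 1 = passerby present, 0 = absent.
ground_truth = [1, 1, 1, 0, 1, 0, 1, 1]   # from the manual annotations
model_output = [1, 0, 0, 0, 1, 0, 1, 0]   # from the face-detection-based estimator

print("precision:", precision_score(ground_truth, model_output))  # no false positives here
print("recall:   ", recall_score(ground_truth, model_output))     # half the passerby frames missed
print("F1:       ", f1_score(ground_truth, model_output))
```

A pattern like the one reported later for presence detection, perfect precision but low recall, shows up here as the detector never firing on empty frames while missing many frames that do contain a passerby.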
Within 6 feet, blind participants seemed to have different tendencies in including either the face or any body part of passersby, and the average inclusion rates tended to be lower than when passersby were more than 6 feet away. When sighted passersby were more than 16 feet away (far), part of their body and their head were included on average in 98% (SD = 3.5) and 98% (SD = 3.6) of the camera frames, respectively. When sighted passersby were less than 16 feet but more than 6 feet away (near), the inclusion ratios of any body part and the head were 98% (SD = 3.5) and 98% (SD = 4.3), respectively. However, when passersby were within 6 feet (2 meters), there was a quick drop in the inclusion rate with high variability across the dyads; passersby's head and any body part were included on average in 56% (SD = 39.1) and 66% (SD = 39.6) of the camera frames, respectively. For example, the camera of P9, who tended to walk fast, consistently did not capture the passerby at this distance. Beyond a fast pace, there are other characteristics that can contribute to low inclusion rates, such as veering [39] and scanning head movements. We observed little veering, perhaps due to hallway acoustics. However, we observed that many blind participants tended to move their head side to side while walking, and more so while interacting with sighted participants to ask them about the nearby office number. This behavior, in combination with the low sampling rate of one frame per second, seemed to contribute to the exclusion of passersby from the camera frames even when they were in front of or next to the blind participants. These findings suggest a departure from a naive fixed-rate, time-based mechanism (i.e., 1 frame/sec) to more efficient dynamic sampling (e.g., increasing the frame rate on proximity), as well as fields of view and computer vision models that can account for such movements.
The types of visual information and proxemic signals desired by a blind user can differ by proximity to a passerby and by whether that person is merely approaching or interacting with them. Nonetheless, what can be accessible will depend on what part of the passerby's body is captured by the camera (e.g., head, torso, and hands). In our dataset, we identified 11 distinct patterns. For example, a (Head, Torso) pattern indicates that only the head and torso of a passerby were visible in a camera frame. Fig. 5 shows the average rate for each pattern and whether a passerby was approaching or interacting with a blind participant. It also includes a breakdown of rates across participants. When passersby were more than 16 feet away (far) and approaching, blind participants tended to capture the passersby in their entirety in more than 81% of the overall frames. This number dropped to 29% within 6 feet, making it difficult to estimate some of their proxemics, but the head presence remained high at 82%. These rates seemed to be consistent among the participants, except for P1 and P5, whose rates were not only lower for including passersby in their entirety (68% and 45%, respectively) but also higher for not including them at all (11% and 8%, respectively). We also analyzed the frames collected during blind participants' interactions with sighted passersby, typically within 16 feet (5 meters). Blind participants tended to capture the passersby in their entirety in about 78% of the overall frames for distances between 6 and 16 feet (2-5 meters). Closer than 6 feet, the number quickly dropped to 12%.
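Before turning to per-participant differences, the dynamic-sampling idea suggested above could look like the following: let the last distance estimate drive the capture interval instead of a fixed 1 frame/sec. The thresholds and rates are illustrative assumptions, not values evaluated in the study.

```python
def next_capture_interval(last_distance_ft):
    """Seconds to wait before capturing the next frame.

    Illustrative policy: sample faster as the passerby gets closer, and fall
    back to the testbed's fixed 1 frame/sec rate when no one was detected.
    """
    if last_distance_ft is None:   # no passerby detected in the last frame
        return 1.0
    if last_distance_ft <= 6:      # within 6 feet (2 meters)
        return 0.2
    if last_distance_ft <= 16:     # 'near': 6-16 feet
        return 0.5
    return 1.0                     # 'far'
```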
More importantly, we observed a higher diversity in the inclusion rate across blind participants for what their camera captured when they were interacting with a passerby within 16 feet. Indeed, the breakdown of rates shows that all except P8, who was legally blind, and P4, who reported having light perception and an onset age of 49, did not include the passersby or their head in more than a quarter of their camera frames. For P3, P5, and P6, who had no prior experience with wearable cameras and reported an onset age at birth, these rates were even higher, reaching up to 91%. When inspecting their videos from the static camera, we find that these participants were indeed next to the passersby, interacting with them, but not facing them.
We evaluate how well models that rely on face detection can estimate passersby's proxemic signals (i.e., presence, position, distance, and head pose) in camera frames from blind participants. Specifically, using the visual estimates and their ground-truth annotations, we report results in terms of precision, recall, and F1 score (Table 3). These results capture blind participants' experience with the testbed, and thus provide context for interpreting their responses related to machine learning errors. Table 3 shows that, when the presence of a passerby was detected, there was actually a passerby (precision=1.0) 6. However, the model's ability to find all the camera frames where a passerby was present was low (recall=0.33), even when their head was visible in the camera frame (recall=0.34). Our visual inspection of these false negatives indicated that, for the majority (80%), this was due to the person being too far away (> 16 feet); thus, their face was too small for the model to detect, raising questions such as: how far away should a system detect people? Since the estimation of position, distance, and head pose depended on the presence detection, we report metrics on both (i) frames where passersby were present and (ii) frames where they were first detected. Once passersby were detected, the estimation of their relative position worked relatively well (precision=0.90; recall=0.84). However, the head pose estimation (i.e., whether the passerby is looking at the user) was not as accurate (precision=0.77; recall=0.77). The distance estimation was also challenging (precision=0.62; recall=0.66). This might be partially due to the limited training data, both in size and in diversity. A passerby's head size in an image, a proxy for estimating the distance, can depend on the blind user's height and camera angle. The majority of blind participants (
On the other hand, our analysis also reveals potential reasons why blind participants might not pay much attention to visual estimates. Since such visual information was mostly inaccessible in their daily lives, they did not rely on it and were unsure how to utilize it. More specifically, P7 and P9 remarked, "[Getting visual information] never occurred to me. [Not sure if] I would want or need anything more than just listening to the voices around me and hearing the walking around me because I've never had that experience," and "It's visual feedback that I'm not used to, I never functioned with, so I don't know, I'm not used to processing or analyzing things in a visual way. I love details, but I just never often think about them much related to the people," respectively.
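Returning to the question raised above of how far away a system should detect people: pinhole-camera geometry gives a rough answer, since the pixel height of a face shrinks in proportion to 1/distance and eventually falls below the smallest face the detector handles reliably. Every constant below (face height, capture resolution, minimum detectable face size) is an assumption for illustration; only the 64-degree horizontal field of view is taken from the paper's description of the glasses.

```python
import math

FACE_HEIGHT_M = 0.24        # assumed average face/head height in meters
HORIZONTAL_FOV_DEG = 64     # horizontal field of view of the smart glasses (from the paper)
IMAGE_WIDTH_PX = 1280       # assumed capture resolution (not specified in the paper)
MIN_FACE_PX = 50            # assumed smallest face box the detector handles reliably

# Approximate focal length in pixels from the horizontal field of view.
focal_px = (IMAGE_WIDTH_PX / 2) / math.tan(math.radians(HORIZONTAL_FOV_DEG / 2))


def face_height_px(distance_m):
    """Projected face height in pixels at a given distance (pinhole model)."""
    return focal_px * FACE_HEIGHT_M / distance_m


# Farthest distance at which the face still spans MIN_FACE_PX pixels.
max_range_m = focal_px * FACE_HEIGHT_M / MIN_FACE_PX
print(f"face at 5 m: {face_height_px(5.0):.0f} px; detectable up to ~{max_range_m:.1f} m")
```

With these assumed numbers the detector would give out at roughly 5 meters (about 16 feet), which is consistent with the false negatives observed beyond that distance; a higher-resolution sensor, a longer focal length, or a detector tuned for small faces would push the range out.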
To address this issue, participants suggested providing training that helps blind users interpret such visual estimates, as P7 said, "I would need maybe to be trained on what to do [with the visual feedback]." Six participants commented on how easy it was to have the smart glasses on and access information, saying "Just wearing the glasses and knowing that would give you feedback" (P2); "Just have to wear it and listen" (P4); and "Once [I] put on and just walk" (P8). Some supported their statements by providing a rationale, such as the head being a natural way to aim towards areas of interest and the fact that they didn't have to do any manipulations ("[...] that part was kind of nice"). P7 and P10 commented on the manipulation, saying "You don't have a lot of controlling and manipulating to do" and "And [it] doesn't take any manipulation - it just feeds new information. And that was easy. It took no interaction manually changed," respectively. However, anticipating that some control would eventually be needed, P8 tried to envision how one could access a control panel for the glasses and suggested pairing them with the phone or supporting voice commands.
Four participants shared concerns about outside noises blocking out the system's audio feedback, or vice versa. P4 mentioned, "The only time it could be harder [to listen to the system feedback] is in a situation with lots of noise or something. If it's very noisy, then it's going to be hard to hear," and "There's a limit to how much you can process because you're walking. When we're walking[,] we're doing more than just listening to [the feedback.] We listen to [others]. Listening is big [to us]." Although GlAccess reported only feedback that changed between two contiguous frames, this is possibly why they wanted to control the verbosity of the system feedback. P7 said, "It was reacting to you every couple seconds or literally every second. Until I'm looking for something specific, it [is] a little overwhelming to have all of it coming in at the same time." Moreover, blind participants emphasized that they would need to access such proxemic signals in real time, especially when they are walking.
Seven blind participants mentioned being confident in interpreting the person recognition output. It seems that they gained their trust in the person recognition feature since they were able to confirm the recognition result with a person's voice. P2 said, "I learned [a lab member]'s voice, and when it said '[lab member]', it was her; or at least I thought it was." Similar to any other machine learning models, however, our prototype system also suffered from errors such as false
Table 4 (fragment): one prior system creates a 360-degree panorama by stitching multiple photos; another places two cameras side by side to obtain a 135-degree horizontal field of view.
Our exploratory findings from the user study lead to several implications for the design of future assistive systems supporting pedestrian detection, social interactions, and social distancing. We discuss those in this section to stimulate future work on assistive wearable cameras for blind people. Section 5.1 reveals that some camera frames from blind participants did not capture a passerby's face but other body parts (i.e., torso, arms, or legs), especially when they were near the passersby. Although some visual cues such as head pose and facial expressions [116] cannot be extracted from those other body parts, detecting them may help increase the accessibility of someone's presence [109, 112].
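One way to act on this body-part observation is to back up the face detector with a general person detector, so that frames showing only a torso or legs still register a presence. The sketch below uses a COCO-pretrained torchvision detector as an example; it is not the model used in the paper, and the score threshold is an arbitrary choice.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# COCO-pretrained detector; class id 1 corresponds to 'person'.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
PERSON_CLASS_ID = 1


@torch.no_grad()
def person_present(frame, score_threshold=0.7):
    """Return True if any sufficiently confident person box is found.

    frame: float tensor of shape (3, H, W) with values in [0, 1].
    """
    detections = model([frame])[0]
    keep = (detections["labels"] == PERSON_CLASS_ID) & (detections["scores"] >= score_threshold)
    return bool(keep.any())
```

Such a fallback would only restore presence and rough position; head pose and face-based distance would still require the face to be visible.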
Also, we see that proximity sensors can enhance the detection of a person's presence and distance as those sensors can provide depth information. In particular, such detection can help blind users to practice social distancing, which has posed unique challenges to people with visual impairments [31] . The sensor data however needs to be processed along with person detection, which often requires RGB data from cameras. Future work that focuses on more robust presence estimation for blind people may consider incorporating proximity sensors with cameras. Regarding delivery of proxemic signals, it is important to note that blind people may have different spatial perception [22] and could thus interpret such proxemics feedback differently, especially if the feedback was provided with adjectives related to spatial perception, such as 'far' or 'near' (Section 5.5). Other types of audio feedback (e.g., sonification) or sensory stimulus (e.g., haptic) may overcome the limitations that the speech feedback has. On the other hand, blind participants' experiences with our testbed may have been affected by its performance and thus could differ if an assistive system had more accurate performance or employed different types of feedback. Future work should consider investigating such a potential correlation. Several factors, such as blind people's tendency of veering [39] and behavior of scanning the environment when using smart glasses (Section 5.1), can change camera input on wearable cameras and thus may affect the performance of detecting proxemic signals. From our study, we observed that the head movement of blind people often led to excluding a passerby from camera frames of a head-worn camera. Although we did not observe blind participants' veering tendency in our study, the veering, which can vary by blind people [39] , could change their camera aiming and consequently affect what is being captured. Also, as discussed in earlier work [22] , blind people's onset age affects the development of their spatial perception. Observing that our blind participants had different onset ages, we suspect that some differences in their spatial perception may explain their different scanning behaviors with smart glasses. Future work should investigate these topics further to enrich our community's knowledge on these matters. Moreover, using a wide-angle camera may help assistive systems to capture more visual information (e.g., a passerby coming from the side of a user) and tolerate some variations in blind people's camera aiming with wearable cameras. The smart glasses in our user study, having only a front view with the limited horizontal range (64-degree angle), are unable to capture and detect people coming from the side or back of a blind user. As described in Table 4 , some prior work has employed a wide-angle camera [96] or two cameras side by side [49] to get more visual data, especially for person detection. However, accessibility researchers and designers need to investigate which information requires more attention and how it should be delivered to blind users. Otherwise, blind users may be overwhelmed and often distracted by too much feedback (Section 5.5). In addition, aiming with a wearable camera can vary by where the camera is positioned (e.g., a chest-mounted camera) as earlier work on wearable cameras for sighted people indicates that the position of a camera can lead to different visual characteristics of images [105] . 
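The camera-plus-proximity-sensor idea above could be prototyped by letting a depth or ultrasonic reading take over where the face-based estimate is weakest, namely at close range. The rule below is a rough illustration with made-up trust thresholds, not a tested fusion design.

```python
def fused_distance_ft(face_estimate_ft, sensor_estimate_ft):
    """Combine camera-based and proximity-sensor distance estimates (both in feet).

    Illustrative policy: prefer the sensor reading at close range, where faces
    were often missing from the frames in the study, and otherwise use whichever
    source is available (or their average when both are).
    """
    if sensor_estimate_ft is None:
        return face_estimate_ft
    if face_estimate_ft is None:
        return sensor_estimate_ft
    if sensor_estimate_ft <= 6:      # within 6 feet (2 meters)
        return sensor_estimate_ft
    return (face_estimate_ft + sensor_estimate_ft) / 2
```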
In the context of person detection, prior work investigated assistive cameras for blind people in two different positions (i.e., either on a suitcase [48, 49] or on a user's head [54, 96]). However, little work explores how camera positions can change the camera input while also affecting blind users' mental mapping and computer vision models' performance. We believe that it is important to study how blind people interact with cameras placed in different positions to explore design choices for assistive wearable cameras.
Data-driven systems (e.g., machine learning models) inherently possess the possibility of generating errors due to several factors, such as data shifts [82] and model fragility [34]. Since the system that study participants experienced merely served as a testbed, it was naturally error prone (Section 5.2), which is largely true for any AI system, and consequently affected blind participants' experiences (Section 5.6). In particular, blind participants pointed out the inaccessibility of error detection in visual estimates. To help blind users interpret feedback from such data-driven assistive systems effectively, it is imperative to tackle this issue (i.e., how to make such errors accessible to blind users). One simple solution may be to provide the confidence score of an estimation, but further research is needed to understand how blind people interpret this additional information (i.e., the confidence score) and to design effective methods of delivering such information. Furthermore, different feedback modalities are worth exploring. In this work, we provided visual estimates in speech, but blind participants sometimes found it too verbose and imagined that it would be difficult for them to pay attention to speech feedback in certain situations, e.g., when they are in noisy areas or talking to someone. To prevent their auditory sensory system from being overloaded, other sensory stimuli such as haptics can be employed to provide feedback [15, 90]. It is, however, important to understand what information each feedback modality can deliver to blind users and how effectively they would perceive the information.
Blind participants envisioned personalizing their smart glasses to recognize their family members and friends (Section 5.7). To provide this experience for users, it would be worth exploring teachable interfaces, where users provide their own data to teach their systems [46]. Our community has studied data-driven approaches in several applications (e.g., face recognition [116], object recognition [9, 47, 53], and sound recognition [44]) to investigate the potential and implications of using machine learning in assistive technologies. However, little work explores how blind users interact with teachable interfaces, especially for person recognition. Our future work will focus on this to learn about the opportunities and challenges of employing teachable interfaces in assistive technologies. Moreover, in Section 5.5, blind participants suggested enabling future smart glasses to detect specific moments (e.g., sitting at a table or talking with someone) to control feedback delivery. We believe that egocentric activity recognition [59, 65, 107], a widely studied problem in the computer vision community, can help realize this experience. However, further investigation is required for adopting such models in assistive technologies, especially for blind users.
For instance, one recent work proposed an egocentric activity recognition model assuming that the input data inherently contain the user's gaze motion [65]. However, that model may not work on data collected by blind people since the assumption (i.e., reliance on eye gaze) only applies to a certain population (i.e., sighted people).
There are still challenges related to hardware, privacy, and public acceptance, which keep wearable cameras from being employed for assistive technologies. For example, Aira ended their support for Horizon glasses in March 2020 due to hardware limitations [10]. Also, prior work investigated blind users' privacy concerns over capturing their environments without being given access to what is being captured [8, 12], and the social acceptance of assistive wearable cameras, as the always-on cameras can capture other people without notification [7, 54, 80]. Beyond the high cost of real-time support from sighted people (e.g., Aira), as of now, there is no effective solution that can estimate the distance between blind users and other people in real time. Computer vision has the potential to provide this proxemic signal in real time on a user's device. Avoiding recordings [54] and using privacy-preserving face recognition [25, 85] may help address some privacy issues, but more concerns may arise if assistive wearable cameras need to store visual data (even on the device itself) to recognize the users' acquaintances. Future work should consider investigating associated privacy concerns and societal issues, as these factors can affect design choices for assistive wearable cameras.
In this paper, we explored the potential and implications of using a head-worn camera to help blind people access proxemics, such as the presence, distance, position, and head pose, of pedestrians. We built GlAccess, which focuses on detecting the head of a passerby to estimate proxemic signals. We collected and annotated camera frames and usage logs from ten blind participants, who walked in a corridor while wearing the smart glasses for pedestrian detection and interacting with a total of 80 passersby. Our analysis shows that their smart glasses tended to capture a passerby's head when they were more than 6 feet (2 meters) away from the passerby. However, the rate of including the head dropped quickly as blind participants got closer to passersby, where we observed variations in their scanning behaviors (i.e., head movements). Blind participants shared that using smart glasses for pedestrian detection was easy since they did not have to manipulate camera aiming. Their qualitative feedback also led to several design implications for future assistive cameras to consider. In particular, we found that errors in visual estimates need to be made accessible to blind users so that they can better interpret such outputs and maintain their autonomy. Our future work will explore this accessibility issue in several use cases where blind people can benefit from wearable cameras.
We thank Tzu-Chia Yeh for her help with the presentation of our work and the anonymous reviewers for their constructive feedback on an earlier draft. This work is supported by NIDILRR (#90REGE0008) and Shimizu Corporation.
OrCam MyEye 2. 2021.
Help people who are blind or partially sighted A study of spatial cognition in an immersive virtual audio environment: Comparing blind and blindfolded individuals Low cost ultrasonic smart glasses for blind Investigating the Intelligibility of a Computer Vision System for Blind Users Privacy concerns and behaviors of people with visual impairments Up to a Limit? Privacy Concerns of Bystanders and Their Willingness to Share Additional Information with Visually Impaired Users of Assistive Technologies Addressing physical safety, security, and privacy for people with visual impairments Recog: Supporting blind people in recognizing personal objects Aira Ends Support for Horizon Connecting you to real people instantly to simplify daily life Privacy Considerations of the Visually Impaired with Camera Based Assistive Technologies: Misrepresentation, Impropriety, and Fairness I am uncomfortable sharing what I can't see": Privacy Concerns of the Visually Impaired with Camera Based Assistive Applications Social distance from the stigmatized: A test of two theories Smartphone haptic feedback for nonvisual wayfinding Blindcamera: Central and golden-ratio composition for blind photographers VizWiz: nearly real-time answers to visual questions Augment Reality (AR) Smart Glasses for the Consumer Is Someone There? Do They Have a Gun": How Visual Information about Others Can Improve Personal Safety Management for Blind Individuals Using thematic analysis in psychology Use Lookout to explore your surroundings Spatial memory for configurations by congenitally blind, late blind, and sighted adults Digit-Eyes. 2021. Identify and organize your world Privacy-preserving face recognition Electronic glasses for the legally blind Be My Eyes. 2021. Brining sight to blind and low-vision people Auditory Space Perception in the Blind: Horizontal Sound Localization in Acoustically Simple and Complex Situations Headlock: A Wearable Navigation Aid That Helps Blind Cane Users Traverse Large Open Spaces Camera distance from face images American Foundation for the Blind. 2020. Flatten Inaccessibility Study: Executive Summary Person localization using a wearable camera towards enhancing social interactions for individuals with visual impairment Analyzing Worldwide Social Distancing through Large-Scale Computer Vision Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples Estimating face orientation from robust detection of salient facial features A Dynamic AI System for Extending the Capabilities of Blind People A dynamic AI system for extending the capabilities of blind people How People Who Are Blind or Have Low Vision Can Safely Practice Social Distancing During COVID-19 The veering tendency of blind pedestrians: An analysis of the problem and literature review Personal space of the blind Ultrasonic Range Sensors Bring Precision to Social-Distance Monitoring and Contact Tracing An augmented-reality edge enhancement application for Google Glass. 
Optometry and vision science: official publication of the American Academy of Person Tracking with a Mobile Robot using Two Uncalibrated Independently Moving Cameras SoundWatch: Exploring Smartwatch-Based Deep Learning Approaches to Support Sound Awareness for Deaf and Hard of Hearing Users Supporting blind photography People with visual impairment training personal object recognizers: Feasibility and challenges BBeep: A Sonic Collision Avoidance System for Blind Travellers and Nearby Pedestrians Guiding Blind Pedestrians in Public Spaces by Understanding Walking Behavior of Nearby Pedestrians Conducting interaction: Patterns of behavior in focused encounters A wearable face recognition system for individuals with visual impairments Revisiting blind photography in the context of teachable object recognizers Hands Holding Clues for Object Recognition in Teachable Machines Pedestrian Detection with Wearable Cameras for the Blind: A Two-Way Perspective ISANA: wearable context-aware indoor assistive navigation with obstacle avoidance for the blind iSee: obstacle detection and feedback system for the blind LookTel Money Reader Use of an augmented-vision device for visual search by patients with tunnel vision Going deeper into first-person activity recognition Social Sensing for Epidemiological Behavior Change Using a haptic belt to convey non-verbal communication cues during social interactions to individuals who are blind What can Android mobile app developers do about the energy consumption of machine learning? Recognition of Spatial Dynamics for Predicting Social Interaction Recovering the sight to blind people in indoor environments with smart technologies Integrating Human Gaze into Attention for Egocentric Activity Recognition Rita Faia Marques, and Abigail Sellen. 2021. Social Sensemaking with AI: Designing an Open-Ended AI Experience with a Blind Child Functionality versus Inconspicuousness: Attitudes of People with Low Vision towards OST Smart Glasses Goby: A wearable swimming aid for blind athletes Stereo vision images processing for real-time object distance and size measurements Designing audio-visual tools to support multisensory disabilities Creating Inclusive Experiences Through Technology World Health Organization Social Distancing, Quarantine, and Isolation: Keep Your Distance to Slow the Spread WHO Director-General's opening remakrs at the media briefing on COVID-19 An analysis of distance estimation to detect proximity in social interactions Social interaction assistant: a person-centered approach to enrich social interactions for individuals with visual impairments Teachable interfaces for individuals with dysarthric speech and severe physical disabilities Image enhancement for impaired vision: the challenge of evaluation Differences between early-blind, late-blind, and blindfoldedsighted people in haptic spatial-configuration learning and resulting memory traces The AT effect: how disability affects the perceived social acceptability of head-mounted display use The Pallets Projects. 2020. 
Flask: web development Dataset shift in machine learning Person to Camera Distance Measurement Based on Eye-Distance KNFB Reader gives you easy access to print and files, anytime Learning to anonymize faces for privacy preserving action detection DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic Covid-19: The effects of isolation and social distancing on people with vision impairment Improved auditory spatial tuning in blind humans The media and technology usage and attitudes scale: An empirical investigation Audio haptic videogaming for navigation skills in learners who are blind A multimodal assistive system for helping visually impaired in social interactions Talking camera app for those with a visual impairment A Survey of Wearable Devices and Challenges Visual Content Considered Private by People Who Are Blind Design of an Augmented Reality Magnification Aid for Low Vision Users Automated Person Detection in Dynamic Scenes to Assist People with Vision Impairments: An Initial Investigation Chroma: a wearable augmented-reality solution for color blindness TapTapSee. 2021. Assistive Technology for the Blind and Visually Impaired X-road: virtual reality glasses for orientation and mobility training of people with visual impairments Visual experience is not necessary for efficient survey spatial cognition: evidence from blindness AIGuide: An Augmented Reality Hand Guidance Application for People with Visual Impairments Helping visually impaired users properly aim a camera just Let the Cane Hit It": How the Blind and Sighted See Navigation Differently Development of face recognition system and face distance estimation using stereo vision camera Effects of camera position and media type on lifelogging images DarkReader: Bridging the Gap Between Perception and Reality of Power Consumption in Smartphones for Blind Users Egocentric daily activity recognition via multitask clustering Wider face: A face detection benchmark SmartPartNet: Part-Informed Person Detection for Body-Worn Smartphones Joint face detection and alignment using multitask cascaded convolutional networks A Smartphone Thermal Temperature Analysis for Virtual and Augmented Reality Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing Understanding Low Vision People's Visual Perception on Commercial Augmented Reality Glasses Designing AR visualizations to facilitate stair navigation for people with low vision Foresee: A customizable head-mounted vision enhancement system for people with low vision A Face Recognition Application for People with Visual Impairments: Understanding Use Beyond the Lab Technology-Mediated Sight: A Case Study of Early Adopters of a Low Vision Assistive Technology