title: Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World
authors: Zhang, Jiaming; Yang, Kailun; Constantinescu, Angela; Peng, Kunyu; Muller, Karin; Stiefelhagen, Rainer
date: 2021-07-07
Common fully glazed facades and transparent objects present architectural barriers and impede the mobility of people with low vision or blindness; for instance, a path detected behind a glass door is inaccessible unless it is correctly perceived and reacted to. However, segmenting these safety-critical objects is rarely covered by conventional assistive technologies. To tackle this issue, we construct a wearable system with a novel dual-head Transformer for Transparency (Trans4Trans) model, which is capable of segmenting general and transparent objects and performing real-time wayfinding to assist people walking alone more safely. In particular, both decoders, built from our proposed Transformer Parsing Module (TPM), enable effective joint learning from different datasets. Besides, the efficient Trans4Trans model, composed of a symmetric transformer-based encoder and decoder, requires little computational expense and is readily deployed on portable GPUs. Our Trans4Trans model outperforms state-of-the-art methods on the test sets of the Stanford2D3D and Trans10K-v2 datasets and obtains mIoU of 45.13% and 75.14%, respectively. Through various pre-tests and a user study conducted in indoor and outdoor scenarios, the usability and reliability of our assistive system have been extensively verified.
Knowledge of glass architecture [6] and glass doors [44, 47] is particularly important for visually impaired people, because transparent objects often present architectural barriers which hinder the mobility of people with low vision or blindness. For example, a path behind a glass door is not a free way to navigate (see Fig. 1) unless it is correctly recognized and reacted to. However, most common vision-based navigation assistance systems [1, 60, 73] cannot handle transparent obstacles well, as 3D vision-based methods hardly recover the depth information of texture-less transparent surfaces [1, 73], whereas conventional image segmentation-based methods do not cover the categories of challenging transparent objects [38, 72]. In addition, guide dogs often get confused and lead people with blindness to full-pane windows, and differentiating between doors and large glass windows is difficult for people with residual sight [55]. A system that supports the recognition of landmarks such as doors is particularly appreciated by people with visual impairments, as finding a door before entering a building is difficult due to the inaccuracy of GPS [4, 55]. To address these issues, we propose a wearable system capable of real-time wayfinding and object segmentation to assist visually impaired individuals in traveling more safely. We present Trans4Trans, short for Transformer for Transparency, an efficient semantic segmentation architecture with dual heads, as shown in Fig. 1(d). As transparent objects are often texture-less or share similar content with their surroundings, it is essential to associate long-range visual concepts to robustly infer transparent regions.
For this reason, Trans4Trans is established with both a transformer-based encoder and a transformer-based decoder to fully exploit the long-range context modeling capacity of the self-attention layers in transformers [59]. In particular, Trans4Trans features a novel Transformer Parsing Module (TPM) to fuse multi-scale feature maps generated from embeddings of dense partitions, and the symmetric transformer-based decoder can consistently parse the feature maps from the transformer-based encoder. Together with semantically predicting general things and stuff classes like walkable areas, the dual-head design allows the system to accurately and completely segment transparent objects, which are safety-critical for navigation. Trans4Trans is integrated into our wearable system, which comprises a pair of smart vision glasses and a mobile GPU processor and delivers a holistic scene understanding swiftly and accurately thanks to the high efficiency of our model. With the complete semantic information, the user interface provides a customized set of acoustic feedback via sonification of detected objects, walkable directions and obstacle warnings, which yields intuitive suggestions without requiring prior knowledge. A comprehensive set of experiments has been conducted on multiple semantic segmentation datasets [2, 70]. In particular, the proposed model outperforms state-of-the-art methods on the test sets of the Stanford2D3D [2] and Trans10K-v2 [70] datasets. Finally, a user study with visually impaired people and a variety of field tests demonstrate the usability and reliability of our assistive system for navigational perception in the wild. To the best of our knowledge, we are the first to use vision transformers for assisting people with visual impairments. In summary, we deliver the following contributions:
• We present a wearable assistive system with a pair of smart vision glasses and a mobile GPU ported with vision transformers for visually impaired people.
• We propose an efficient semantic segmentation architecture, Transformer for Transparency (Trans4Trans), with a transformer-based encoder and decoder, a dual-head design to unify general object and challenging transparent object segmentation, and a Transformer Parsing Module to fuse multi-scale representations.
• Trans4Trans, while maintaining a high efficiency, surpasses state-of-the-art CNN- and transformer-based methods on the Stanford2D3D and Trans10K-v2 datasets.
• Based on our designed system algorithm, we produce a customized set of acoustic feedback and conduct a user study and various field tests, demonstrating the usability and reliability of Trans4Trans.
Semantic segmentation for visual assistance. Whereas traditional assistance systems rely on multiple monocular detectors and depth sensors [1, 13, 14, 37, 60], semantic segmentation allows many navigational perception problems to be solved at once and has therefore been quickly adopted in visual assistance. Yang et al. [72] proposed leveraging semantic segmentation to unify detection tasks and assist terrain awareness, whereas Mao et al. [43] argued for panoptic segmentation towards holistic sensing. In [42, 77], instance-specific segmentation methods like Mask R-CNN [23] were directly applied for content-aware surrounding understanding. Semantic segmentation has also been used to address intersection perception, such as the detection of crosswalks, sidewalks, and blind roads [8, 25]. Moreover, it has received increasing interest and various systems have appeared in the field [16, 18, 26, 38, 48].
Yet, both traditional sensor-based and segmentation-driven approaches cannot handle challenging transparent obstacles well.
Transparent object sensing. Classical visual assistance systems [3, 27] resort to multi-sensor fusion, e.g., fusing RGB-D cameras and ultrasonic sensors, to overcome the difficulties in dealing with transparent obstacles like glass objects, French windows, French doors, etc. Chen et al. [11] design a multimodal stereo matching algorithm to improve the depth measurements of transparent objects with dual depth sensors. Polarization cues [31] and reflection priors [36] are also frequently explored for transparency perception. For example, Xiang et al. [68] propose a polarization-driven semantic segmentation architecture that adaptively bridges the RGB and polarization dimensions, which significantly lifts the performance on classes with polarization properties like glass. Recently, Xie et al. [69, 70] built the Trans10K dataset and showed that, while purely RGB-based transparent object segmentation remains a largely unsolved task, it becomes promising for real-world use as the amount of data increases. This allows the community to go beyond traditional perception regimes relying on sensor fusion schemes and to develop novel methods addressing transparent object segmentation. For example, AdaptiveASPP [7] is designed to extract rich features from multiple fields of view with appropriate importance, whereas EBLNet [22] incorporates an edge-aware graph convolution module to model global shape representations. Differing from most of these accuracy-oriented methods, we aim for an efficient and robust semantic segmentation desirable for navigation assistance. We establish a transformer-based system to assist the detection of transparent objects in real-world scenes.
Efficient transformers for dense prediction. Due to their capacity to model long-range contextual correlations, attention mechanisms from transformers [59] have been introduced into visual recognition tasks to learn inter-dependencies either in the channel or in the spatial dimension [19, 64, 79] by appending attention layers atop convolutional networks. To reduce the quadratic computation overhead of such non-local attention layers w.r.t. the input size, disentangled or asymmetric versions [28, 74, 76, 85] have been constructed. Recently, transformers have been applied directly in vision tasks [9, 17, 84]. In ViT [17] and DeiT [58], a pure transformer is applied to sequences of image patches for image recognition. For pixel-wise tasks, SETR [83] views semantic segmentation from a sequence-to-sequence perspective with vision transformers [17], whereas MaX-DeepLab [61] infers class masks with a dual-path transformer for panoptic segmentation. Inspired by their success, transformer architectures for dense prediction have emerged [15, 40, 57, 66, 67, 71]. An important line of these models proposes lightweight variants like Pyramid Vision Transformers (PVT) [63], ResT [80], and LeViT [20], aiming to optimize the accuracy-efficiency trade-off when porting transformers to real-world applications. In this work, we devise an efficient Trans4Trans framework with a focus on assisting the navigation of visually impaired people in the wild.
In contrast to existing works that either stack attention layers [19, 74] or full encoder-decoder transformers [70] on top of CNN backbones, or employ CNN-based decoders on top of transformer encoders [63, 83], in Trans4Trans both the encoder and the decoder are based on transformers, together with a novel Transformer Parsing Module design in our dual-head decoder.
Inspired by the benefit of the ViT [17] transformer model in acquiring long-range dependencies, our dual-head Trans4Trans model is entirely composed of transformers, as shown in Fig. 2(a), while the single-head variant has only one decoder. The four-stage encoder is borrowed from PVT [63]. Different from the PVT-based Trans2Seg [70], which adopts a CNN decoder, both the encoder and the decoder of Trans4Trans are symmetrically constructed from transformers to maintain consistency between the feature extraction and feature parsing stages. Furthermore, unlike CNN-based models [46, 53, 78, 79] that learn an inductive bias, the transformer-based decoder is expected to be more robust when parsing unseen data captured in the wild. Yet, training a transformer model requires a large-scale dataset [17]. In order to address this data hunger and to correct walkable areas misidentified behind transparent objects, we design a dual-head model. Joint training on multiple datasets brings greater data diversity for learning a robust transformer-based model.
To construct such a lightweight decoder, we propose a Transformer Parsing Module (TPM), illustrated in Fig. 2. Each TPM contains only a single transformer-based layer, so it demands little computing resource and can be flexibly deployed on our portable hardware system. More precisely, our decoder consists of four stages that are symmetric to the encoder, each containing a TPM module with a similar structure. As shown in Fig. 2(a), the pyramid features {F1, F2, F3, F4} from the encoder are parsed consistently by the respective TPM modules. Between two stages, resizing and element-wise addition are performed for pyramid feature fusion. To balance capacity and computational demands, the feature resolution of each TPM is set to H/4 × W/4 × C, for which the default channel number is 64.
Benefiting from our proposed TPM, the GFLOPs and parameter count of this dual-head structure are largely reduced compared to deploying two separate models. Equally important, diverse features can be learned from various datasets. Thereby, the dual-head model remains lightweight and is robust in terms of preventing overfitting when tested in real-world scenarios. The decoder composed of our TPM modules can also be flexibly combined with various CNN- or transformer-based encoder structures. For multi-task learning, mounting additional decoder heads robustifies the features learned via the shared encoder, while the entire model is not computationally overburdened.
Our entire portable system consists of two hardware components: a pair of smart vision glasses and a portable GPU, e.g., an NVIDIA AGX Xavier or a lightweight laptop. The smart vision glasses integrate a RealSense R200 RGB-D sensor to capture RGB and depth images at a resolution of 640×480 in real time, and a pair of bone-conducting earphones for delivering acoustic feedback to visually impaired people. This is critical, as visually impaired people often use the sounds of their surrounding environment for orientation, and bone-conducting headphones do not block their ears while using the system.
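To make the TPM-based dual-head decoder more concrete, the following is a minimal PyTorch sketch under several assumptions: the encoder channel widths (64, 128, 320, 512) follow the PVT family, each TPM is approximated by a single standard nn.TransformerEncoderLayer, and the fusion resizes coarser stages onto finer ones with bilinear interpolation before element-wise addition. It is an illustrative reconstruction, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPM(nn.Module):
    """Transformer Parsing Module (sketch): a 1x1 projection to a common embedding
    size followed by a single transformer layer that parses one pyramid feature map.
    The real module may use a more efficient attention variant; this is illustrative."""
    def __init__(self, in_channels, embed_dim=64, num_heads=2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                                dim_feedforward=4 * embed_dim)

    def forward(self, x):
        x = self.proj(x)                              # B x C x H x W
        b, c, h, w = x.shape
        tokens = x.flatten(2).permute(2, 0, 1)        # (H*W) x B x C, sequence-first
        tokens = self.layer(tokens)
        return tokens.permute(1, 2, 0).reshape(b, c, h, w)

class TPMDecoder(nn.Module):
    """One decoder head: a TPM per encoder stage, fused by resizing + element-wise
    addition from the coarsest stage towards the finest (stride-4) stage."""
    def __init__(self, enc_channels=(64, 128, 320, 512), embed_dim=64, n_classes=13):
        super().__init__()
        self.tpms = nn.ModuleList([TPM(c, embed_dim) for c in enc_channels])
        self.classifier = nn.Conv2d(embed_dim, n_classes, kernel_size=1)

    def forward(self, feats):                         # feats = [F1, F2, F3, F4]
        out = None
        for f, tpm in zip(feats[::-1], list(self.tpms)[::-1]):
            parsed = tpm(f)
            if out is not None:                       # resize coarser fusion result and add
                parsed = parsed + F.interpolate(out, size=parsed.shape[2:],
                                                mode="bilinear", align_corners=False)
            out = parsed
        return self.classifier(out)                   # logits at H/4 x W/4

# Dual-head usage: two TPM decoders share one pyramid encoder (e.g. a PVT backbone).
# Stand-in pyramid features of a 256x256 image at strides 4, 8, 16, 32 (512x512 in the paper).
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((64, 128, 320, 512), (4, 8, 16, 32))]
general_head = TPMDecoder(n_classes=13)       # things/stuff classes incl. walkable area
transparent_head = TPMDecoder(n_classes=11)   # transparent object classes
print(general_head(feats).shape, transparent_head(feats).shape)
```

Instantiating the two lightweight heads on one shared encoder is what keeps the dual-head configuration far cheaper than running two separate models.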
The integrated RealSense R200 sensor leverages a combination of active speckle projection and passive stereo matching, and can thereby work in both indoor and outdoor scenes. In texture-less indoor scenes, the projected infrared speckles augment the environment, which is beneficial for stereo matching algorithms to yield dense depth estimation. In sunny outdoor scenes, although the projected patterns are overwhelmed by sunlight, the infrared components of natural light illuminate the scene and form well-textured infrared image pairs, thus producing robust depth sensing. In our system, depth information is used to perform the obstacle avoidance function and can be used for prioritizing near-range objects over mid-/long-range objects.
Our software components are the aforementioned dual-head Trans4Trans and a user interface as described in Algorithm 1. Starting from the input data and to guarantee the timely capture of the facing environment, the frame rate of the RGB-D stream is set to 60. Once the system starts, it repeats image segmentation every n seconds. According to our experiments, setting the time interval to 2 seconds effectively prevents cognitive overload, especially in complex scenes containing a large number of objects. Still, it is adjustable depending on the demands of users, e.g., a shorter interval for more feedback when exploring unknown spaces.
Obstacle avoidance. When moving in a relatively restricted indoor space, building materials or densely arranged objects impede the flexibility of merely using a white cane as the aid for avoiding obstacles. In order to tackle the collision issue and balance indoor and outdoor scenarios, our system presets the highest priority for obstacle avoidance. In other words, if the average value of the depth information is smaller than the preset distance threshold θ_obstacle, the user is immediately notified in the form of vibration. To minimize the uncertainty of vibrations and the cognitive load, only one single default threshold, set to 1 meter, is used instead of various vibration frequencies for different distances. Another purpose is to preclude the chaotic and low-confidence segmentation obtained from less-textured images when users walk too close to and face an object surface, such as a white wall or a door.
Algorithm 1: Assistive system
Data: RGB-D images X ∈ R^{H×W×3} and Y ∈ R^{H×W}.
Result: General segmentation G ∈ R^{H×W×13}; transparency segmentation T ∈ R^{H×W×11}.
1: initialize walkable rates R_l, R_f, R_r and parameters θ_obstacle, θ_trans, θ_walkable;
2: while the system runs, every n seconds do
3:   update the RGB-D frame and run Trans4Trans segmentation over {left, forward, right};
(Transparent) object segmentation. After receiving the RGB image X ∈ R^{H×W×3}, our efficient Trans4Trans model outputs two segmentation predictions: a general object segmentation G ∈ R^{H×W×13} and a transparent object segmentation T ∈ R^{H×W×11}. The general object segmentation is divided into G_path for the walkable path and G_object for other objects. Afterwards, the walkable mask is further partitioned into three regions corresponding to the {left, forward, right} directions for orientation. In order to correct wrongly segmented walkable area with the high-confidence transparency prediction, the transparent object segmentation is divided into two disjoint sets: T_stuff ∈ R^{H×W×3} with {window, glass door, glass wall}, and T_things ∈ R^{H×W×8} with {shelf, jar/tank, freezer, eyeglass, cup, bowl, bottle, box}.
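The decision logic of Algorithm 1 can be sketched as a pure function over one frame, combining the obstacle rule above with the direction-selection strategy detailed under "Walkable path detection" below. The class indices, the walkable-ratio threshold value and the helper names are hypothetical; only the 1 m obstacle threshold and the left/forward/right split follow the text.

```python
import numpy as np

THETA_OBSTACLE = 1.0            # metres, as stated in the text
THETA_WALKABLE = 0.3            # assumed minimum walkable ratio per direction
WALKABLE_ID = 0                 # assumed id of the walkable-path class in G
TRANSPARENT_STUFF = (0, 1, 2)   # assumed ids of {window, glass door, glass wall} in T

def assistive_feedback(depth, general_seg, transparent_seg):
    """Decide the feedback for one frame.
    depth:           H x W depth map in metres
    general_seg:     H x W label map from the general head (13 classes)
    transparent_seg: H x W label map from the transparency head (11 classes)
    Returns a string standing in for the vibration / sonification interface."""
    # 1. Obstacle avoidance has the highest priority.
    if np.nanmean(depth) < THETA_OBSTACLE:
        return "vibrate: obstacle ahead"

    # 2. Remove walkable pixels that actually lie behind glass (stuff classes of T).
    walkable = (general_seg == WALKABLE_ID)
    walkable &= ~np.isin(transparent_seg, TRANSPARENT_STUFF)

    # 3. Split the walkable mask into left / forward / right thirds and suggest the
    #    direction with the largest local walkable ratio, if it is safe enough.
    h, w = walkable.shape
    left, forward, right = np.split(walkable, [w // 3, 2 * w // 3], axis=1)
    ratios = {"left": left.mean(), "forward": forward.mean(), "right": right.mean()}
    best = max(ratios, key=ratios.get)
    if ratios[best] > THETA_WALKABLE:
        return f"announce: walkable path {best}"
    return "announce: no safe direction detected"

# Toy frame to exercise the logic.
depth = np.full((480, 640), 2.0)
general = np.random.randint(0, 13, size=(480, 640))
transparent = np.random.randint(0, 11, size=(480, 640))
print(assistive_feedback(depth, general, transparent))
```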
Walkable path detection. After obtaining the object segmentation, the walkable area G_path, e.g., the floor category from Stanford2D3D, is further divided horizontally into three directions with local walkable ratios {R_l, R_f, R_r} ← G_path. An intuitive and effective strategy is then to prompt the direction that has the largest walkable area, but only when its local ratio is greater than the preset threshold θ_walkable, for safety. According to our tests, this orientation approach prevents veering on straight paths outdoors and indoors. Furthermore, it can also accurately predict the best instantaneous turning direction while walking through an intersection, so as to constantly yield a safe direction suggestion.
Experiments are conducted on Trans10K-v2 [70] and on Stanford2D3D [2], which splits Areas 1, 2, 3, 4 and 6 as the training set and Areas 5a and 5b as the validation and testing set.
Implementation details. We implement the model with PyTorch 1.8.0 and CUDA 11.2. The learning rate is initialized as 1e-4 and is scheduled by the poly strategy [78] with power 0.9 over 100 epochs. Adam [32] with epsilon 1e-8 and weight decay 1e-4 is used as the optimizer. The batch size is set to 4 on each of four 1080Ti GPUs. To maintain the shape of the position embedding, the images are resized to a resolution of 512×512 for all experiments. For a fair comparison with [70], tricks such as OHEM, auxiliary losses or class-weighted losses are not applied in our experiments.
In this subsection, we present an analysis of the experiments on the different datasets and of the results for combinations of CNN/Transformer components materializing varied encoder-decoder structures. Experimental results on computation complexity in GFLOPs and segmentation accuracy are presented and compared with state-of-the-art methods.
Results. As shown in Table 1, four main encoder-decoder structures are compared. Unlike the ResNet encoder-based Trans2Seg [70] and PVT [63], our Trans4Trans uses both a transformer-based encoder and a transformer-based decoder with a TPM design. It can be seen that single-head Trans4Trans-Medium achieves the best performance in mIoU on both Stanford2D3D (45.73%) and Trans10K-v2 (75.14%), exceeding PVT-Medium by more than 3% on the challenging transparent object segmentation benchmark. Meanwhile, it clearly has a smaller computation complexity in GFLOPs compared to PVT-M and the ResNet50-based Trans2Seg. Trans4Trans-Tiny and -Small also achieve higher performance on Trans10K-v2 than the state-of-the-art structures. Dual-head Trans4Trans consistently improves the performance on Trans10K-v2 by incorporating more general knowledge when learning jointly with supervision from Stanford2D3D, which is more suitable for real-world navigational perception, as it strongly prevents overfitting and reduces the false positives of transparent obstacle warnings observed in our field tests. Overall, these results verify the superiority and efficiency of Trans4Trans for transparent and general object segmentation.
Combination of CNN/Transformer. As shown in Table 2, varied combinations of CNN-/transformer-based encoders and decoders are compared, where FCN [41] and OCNet [79] are composed of only CNNs, whereas Trans2Seg is composed of a CNN-based encoder and a transformer-based decoder. The proposed Trans4Trans is a fully transformer-based encoder-decoder structure. It outperforms both these competitive architectures and PVT, another transformer-based encoder-decoder architecture. Moreover, our Trans4Trans keeps smaller GFLOPs while being more accurate, demonstrating its suitability for transparent object segmentation.
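Referring back to the implementation details above, the optimizer and poly learning-rate schedule can be written down in a few lines of PyTorch; the stand-in model and the number of iterations per epoch are assumptions for illustration.

```python
import torch

# Stand-in model; the hyper-parameters (lr 1e-4, poly power 0.9, 100 epochs,
# Adam with eps 1e-8 and weight decay 1e-4) mirror the implementation details above.
model = torch.nn.Conv2d(3, 13, kernel_size=1)
num_epochs, iters_per_epoch = 100, 1000        # iters_per_epoch is an assumption
max_iters = num_epochs * iters_per_epoch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8, weight_decay=1e-4)

# Poly schedule: lr(t) = base_lr * (1 - t / max_iters) ** 0.9, stepped once per iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: (1 - t / max_iters) ** 0.9)

# Per training iteration one would then call:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```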
Comparison to state-of-the-art models. Following [70], we compare with both accuracy- and efficiency-oriented semantic segmentation models, as shown in Table 3. Compared with both CNN- and transformer-based methods like Trans2Seg [70], the superiority of Trans4Trans is further confirmed. Our Trans4Trans-M model outperforms the state-of-the-art method Trans2Seg by 2.99% in mIoU and 0.87% in ACC, while requiring much fewer GFLOPs. In terms of category-wise accuracy, our Trans4Trans model achieves state-of-the-art IoU on the classes background, jar or tank, window, door, cup, wall, bottle and box. These experimental results show the efficacy of the proposed Trans4Trans architecture for transparent object segmentation.
Channel of TPM. Since one of our critical designs lies in the TPM, we now analyze the effect of the number of embedding channels used in the decoder of Trans4Trans, as shown in Table 4. It can be seen that performance increases as the number of channels increases up to 256, and it drops at 512, where the decoder overfits the encoded features (see Fig. 3) and the computation complexity becomes exceedingly large. For the response-time-critical wearable system, we adopt 64 channels when deploying Trans4Trans due to its high efficiency and good performance.
Table 3: Computation complexity in GFLOPs and category-wise accuracy evaluation and comparison with state-of-the-art semantic segmentation methods on the Trans10K-v2 dataset [70].
To calculate the inference speed of the different versions of our dual-head Trans4Trans model, 300 samples from the Trans10K-v2 test set with a batch size of 1 and a resolution of 512×512 are tested on three different GPUs, i.e., a mobile NVIDIA AGX Xavier in MAXN mode, an NVIDIA GeForce MX350 from a lightweight laptop and an RTX 2070 from a workstation. As shown in Table 5, the computation costs of our tiny Trans4Trans model on the three GPUs are considerably lower than those of the other two versions, while the performances of the three models on both datasets are suitable for our system. In real applications, a more timely response of the navigation system is beneficial for assisting users, given a similar prediction accuracy on each frame. Hence, the tiny version is selected in our user study.
Fig. 4 visualizes qualitative comparisons between our tiny Trans4Trans and the previous state-of-the-art method Trans2Seg [70]. Fig. 4(a) shows failed recognition cases of both models, where our model nevertheless yields a clearly better boundary. Fig. 4(b) shows examples where our model predicts the correct label, whereas Trans2Seg is confused. In Fig. 4(c)(d), it can be seen that our model is not only effective for detecting navigation-related glass doors and glass windows, but can also predict more refined segmentation of small objects like jars/tanks and glass cups.
We further perform field tests by navigating around the university campus and capturing real-world scenes with our smart vision glasses. The collected RGB-D images and corresponding predictions are shown in Fig. 5. The glass door in the first row, captured at a moderate distance, is correctly identified, whereas in the other rows the glass doors are misclassified as walkable paths by general object segmentation models. As can be observed, transparent surfaces are often texture-less and the infrared patterns projected by the glasses pass through the glass regions, so the depth information is often sparse, noisy or even lost, which makes it challenging for 3D vision-based systems [1, 60, 73] to help avoid hazards even when the obstacles are close. In contrast, our Trans4Trans accurately and completely segments these transparent objects while also covering general objects, which makes it ideally suited for safety-critical navigation assistance.
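For reference, the inference-speed protocol described above (300 forward passes at batch size 1 and 512×512 resolution) corresponds to a timing loop like the following sketch; the stand-in model, warm-up count and synchronization details are assumptions rather than the authors' benchmarking code.

```python
import time
import torch

# Stand-in for the dual-head Trans4Trans model; warm-up count is an assumption.
model = torch.nn.Conv2d(3, 13, kernel_size=1).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
x = torch.randn(1, 3, 512, 512, device=device)   # batch size 1, 512x512 resolution

with torch.no_grad():
    for _ in range(10):                           # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(300):                          # 300 samples, as in the protocol above
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{300 / elapsed:.1f} FPS")
```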
We conducted a qualitative study with 5 participants in order to assess the acceptance of our prototype and draw design conclusions [49, 50].
Methodology. The hardware used during the test consisted of the smart vision glasses and a backpack with a lightweight laptop and a battery pack inside. The system's battery life under these conditions was approximately 4 hours.
Figure 5: Visualization of real-world scenes. From left to right: RGB and depth image, segmentation of the walkable path by the single-head model trained on Stanford2D3D, and segmentation of transparent objects (glass door or glass wall) corrected by our dual-head Trans4Trans model.
Participants tried the system inside 2 buildings, and the blind participant also on a 700 m route outdoors (see Fig. 6). The study lasted about 2 hours. As Corona-protective measures, everyone wore FFP2 or surgical masks throughout the study and the prototype was disinfected several times. After a short introduction, all participants agreed to participate and to be recorded, and signed the data protection statement. The participants put on the system and walked around the rooms, thinking out loud [30]. The study was recorded with an action camera and a voice recorder. At the end, demographics and NASA Raw Task Load Index (RTLX) [21] questionnaires were filled in.
Participants. Due to COVID-19 restrictions, only one of the participants (P1B) belonged to the target user group, being early blind. The other 4 participants were sighted (P2-P5). Age and gender of the participants were fairly balanced (see Table 6). When asked if they can see glass objects during the day, P1B said he can sometimes see closed windows due to the light-dark contrast. Windows that open into the room, however, are very dangerous, as one can get serious head injuries (P1B). All sighted participants said they can see glass objects most of the time, but some objects, like bottles and glass cups (P2), glass doors (P3, P5), glass walls and windows (P5), can be challenging under certain light conditions, e.g., backlighting.
Figure 6: Participants using the system for navigation outdoors and indoors.
Table 6: Aggregated demographics of participants.
Cognitive load. The raw task load index, averaged over all participants, was 14.7 with a standard deviation of 4.1. This score is enough to keep the user motivated while not burdening them too much [45]. The blind participant, P1B, had the second lowest score. According to the individual ratings (Fig. 7), effort and physical demand were slightly higher, while frustration was the lowest subscale. This might suggest that users enjoyed the experience of using our system, but a further reduction of hardware would be welcome.
User comments. A thematic analysis [5] performed on the comments made by users yielded the following insights:
• All users found the system useful and were impressed by its functionality (P1B-P5) and smooth running (P5). Users also praised the fact that it works both indoors and outdoors (P2, P3), is easy to use and interact with (P2), and that the several functions are well integrated and the battery life is long (P5). P1B said: "for the first time, I had the feeling that artificial intelligence can be useful [...] It's just cool!".
• All 5 users liked the fact that the system recognizes so many object classes, including glass objects. Objects recommended to be included in future versions were: trash cans, city scooters (P1B) and construction site fences (P5). The detection of some false positives was mentioned by P1B and P5.
• Users found the hardware light (P1B, P2, P3, P5), comfortable (P1B, P3) and good looking (P5). For a commercial system, however, the laptop should be replaced by a belt unit (P1B) or a smartphone (P5) and the glasses should connect wirelessly (P1B, P5).
• The direction of detected objects should be announced (P1B, P2, P5), e.g., as clock directions or via stereo sound (P1B).
• The synthetic voice was very much appreciated by P1B, as "it clearly stands out from background sounds".
• The 1-2 s delay setting was mentioned by P1B, P4 and P5. P1B said the delay is acceptable for object segmentation, as he only uses this function when he is unsure. P5 only mentioned the delay with respect to obstacle avoidance. P4 said the delay is fine.
Augmented reality for partially sighted people. Since transparent obstacles are often a threat for people with low vision in everyday navigation, and even challenging for sighted people in some cases, we further test our Trans4Trans method with a HoloLens 2 device by capturing real-world data around our computer vision laboratory. As shown in Fig. 8, transparent objects like glass doors, transparent walls, and glass windows can be completely and consistently segmented, and the colored segmentation mask can easily be overlaid and naturally projected onto the original RGB image captured by the glasses for rendering augmented or mixed reality. This field test demonstrates that the proposed Trans4Trans framework is not only helpful for assisting blind people, but can also be potentially useful for partially sighted people.
We have looked into the perception of transparent objects via segmentation with Trans4Trans, an efficient transformer architecture established with both a transformer-based encoder and a transformer-based decoder. With a novel Transformer Parsing Module (TPM) integrated in the dual-head design, Trans4Trans precisely segments general and transparent objects. It attains state-of-the-art performance on the Stanford2D3D and Trans10K-v2 datasets, while being swift and robust enough to support online navigational perception. The learned efficient vision transformer is ported into our wearable system with a pair of smart vision glasses designed to help visually impaired people travel and explore real-world scenes, where transparent objects are omnipresent. Extensive results from a user study and various field tests show that the proposed assistive system is reliable and imposes a low cognitive load.
References
Navigation assistance for the visually impaired using RGB-D sensor with range expansion
Joint 2D-3D-semantic data for indoor scene understanding
Smart guiding glasses for visually impaired people in indoor environment
Floor extraction and door detection for visually impaired guidance
Using thematic analysis in psychology
Glass architecture: is it sustainable? Passive and Low Energy Cooling for the Built Environment
FakeMix augmentation improves transparent object detection. arXiv
Rapid detection of blind roads and crosswalks by using a lightweight semantic segmentation network
End-to-end object detection with transformers
HarDNet: A low memory traffic network
Improving RealSense by fusing color stereo vision and infrared stereo vision for the visually impaired
Encoder-decoder with atrous separable convolution for semantic image segmentation
Real-time pedestrian crossing lights detection algorithm for the visually impaired
Crosswalk navigation for people with visual impairments on a wearable device
Twins: Revisiting spatial attention design in vision transformers. arXiv
Bring the environment to life: A sonification module for people with visual impairments to improve situation awareness
An image is worth 16x16 words: Transformers for image recognition at scale
V-Eye: A vision-based navigation system for the visually impaired
Dual attention network for scene segmentation
LeViT: a vision transformer in ConvNet's clothing for faster inference
NASA-Task load index (NASA-TLX); 20 years later
Enhanced boundary learning for glass-like object segmentation
Mask R-CNN
Deep residual learning for image recognition
Outdoor walking guide for the visually-impaired people based on semantic segmentation and depth map
Development of a wearable guide device based on convolutional neural network for blind or visually impaired persons
Glass detection and recognition based on the fusion of ultrasonic sensor and RGB-D sensor for the visually impaired
CCNet: Criss-cross attention for semantic segmentation
DUNet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems
Using the think aloud method (cognitive labs) to evaluate test design for students with disabilities and English language learners
Deep polarization cues for transparent object segmentation
Adam: A method for stochastic optimization
DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation
DFANet: Deep feature aggregation for real-time semantic segmentation
RefineNet: Multi-path refinement networks for high-resolution semantic segmentation
Rich context aggregation with reflection prior for glass surface detection
KrNet: A kinetic real-time convolutional neural network for navigational assistance
Deep learning based wearable assistive system for visually impaired people
Feature pyramid encoding network for real-time semantic segmentation
Swin transformer: Hierarchical vision transformer using shifted windows
Fully convolutional networks for semantic segmentation
Unifying obstacle detection, recognition, and fusion based on millimeter wave radar and RGB-depth sensors for the visually impaired
Panoptic lintention network: Towards efficient navigational perception for the visually impaired
Suitability evaluation of visual indicators on glass walls and doors for visually impaired people
Helping the blind to get through COVID-19: Social distancing assistant using real-time semantic segmentation on RGB-D video
ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network
Don't hit me! Glass detection in real-world scenes
The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces
Estimating the number of subjects needed for a thinking aloud test
How many test users in a usability study
ENet: A deep neural network architecture for real-time semantic segmentation
ContextNet: Exploring context and detail for semantic segmentation in real-time
Fast-SCNN: Fast semantic segmentation network
U-Net: Convolutional networks for biomedical image segmentation
Closing the gap: Designing for the last-few-meters wayfinding problem for people with visual impairments
MobileNetV2: Inverted residuals and linear bottlenecks
Segmenter: Transformer for semantic segmentation
Training data-efficient image transformers & distillation through attention
Attention is all you need
Enabling independent navigation for visually impaired people through a wearable vision-based feedback system
MaX-DeepLab: End-to-end panoptic segmentation with mask transformers
Deep high-resolution representation learning for visual recognition
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Non-local neural networks
LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation
Fully transformer networks for semantic image segmentation
P2T: Pyramid pooling transformer for scene understanding
Polarization-driven semantic segmentation via efficient attention-bridged fusion
Segmenting transparent objects in the wild
Segmenting transparent object in the wild with transformer
SegFormer: Simple and efficient design for semantic segmentation with transformers
Unifying terrain awareness for the visually impaired through real-time semantic segmentation
IR stereo RealSense: Decreasing minimum range of navigational assistance for visually impaired individuals
Capturing omni-range context for omnidirectional segmentation
DenseASPP for semantic segmentation in street scenes
Disentangled non-local neural networks
Content-aware video analysis to guide visually impaired walking on the street
BiSeNet: Bilateral segmentation network for real-time semantic segmentation
OCNet: Object context for semantic segmentation
ResT: An efficient transformer for visual recognition
ICNet for real-time semantic segmentation on high-resolution images
Pyramid scene parsing network
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
Deformable DETR: Deformable transformers for end-to-end object detection
Asymmetric non-local neural networks for semantic segmentation