title: HIDA: Towards Holistic Indoor Understanding for the Visually Impaired via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor
authors: Liu, Huayao; Liu, Ruiping; Yang, Kailun; Zhang, Jiaming; Peng, Kunyu; Stiefelhagen, Rainer
date: 2021-07-07
Corresponding author: Kailun Yang (e-mail: kailun.yang@kit.edu).

Independently exploring unknown spaces or finding objects in an indoor environment is a daily but challenging task for visually impaired people. However, common 2D assistive systems lack depth relationships between various objects, making it difficult to obtain an accurate spatial layout and the relative positions of objects. To tackle these issues, we propose HIDA, a lightweight assistive system based on 3D point cloud instance segmentation with a solid-state LiDAR sensor, for holistic indoor detection and avoidance. The entire system consists of three hardware components, two interactive functions (obstacle avoidance and object finding) and a voice user interface. Based on voice guidance, the point cloud of the most recent state of the changing indoor environment is captured through an on-site scan performed by the user. In addition, we design a point cloud segmentation model with dual lightweight decoders for semantic and offset predictions, which satisfies the efficiency requirements of the whole system. After the 3D instance segmentation, we post-process the segmented point cloud by removing outliers and projecting all points onto a top-view 2D map representation. The system integrates the information above and interacts with users intuitively through acoustic feedback. The proposed 3D instance segmentation model achieves state-of-the-art performance on the ScanNet v2 dataset. Comprehensive field tests with various tasks in a user study verify the usability and effectiveness of our system for assisting visually impaired people in holistic indoor understanding, obstacle avoidance and object search.

When sighted people enter an unfamiliar indoor environment, they can observe and perceive the surroundings through their vision. Such a global scene understanding, however, is challenging for people with visual impairments. They often need to approach and touch the objects in the room one by one to distinguish their categories and get familiar with their locations. This is not only inconvenient but also creates risks for visually impaired people in their everyday living and travelling tasks. In this work, we develop a system to help vision-impaired people understand unfamiliar indoor scenes.

Some assistance systems leverage various sensors (such as radar, ultrasonic, and range sensors) to help the vision-impaired avoid obstacles [3, 24, 36, 65]. With the development of deep learning, vision tasks like object detection and image segmentation can yield precise scene perception, and different vision-based systems have been proposed for environment perception and navigation assistance for visually impaired people. However, most 2D image semantic-segmentation-based systems [31, 50, 56] and 3D-vision-based systems [4, 6, 58] cannot provide a holistic understanding, because these systems only process the current image, or the image with depth information, captured by the camera, rather than a complete scan of the surroundings.
Compared with 2D images, 3D point clouds contain more information and are well suited to reconstructing the surrounding environment. Therefore, in this work, we propose HIDA, an assistance system for Holistic Indoor Detection and Avoidance based on semantic instance segmentation. The main structure of HIDA is shown in Fig. 1. When the user enters an unfamiliar room, the user can wake up the system by voice, and the system then helps the user scan the surroundings by running Simultaneous Localization and Mapping (SLAM). The obtained point cloud is fed into an instance segmentation network, where each point is assigned instance information. We adapt PointGroup [23] as the basis of our instance segmentation network and modify its structure to acquire 3D semantic perception with higher precision. We enable a cross-dimension understanding by converting the point cloud with instance-aware semantics into a usable and suitable representation, i.e., a 2D top-view segmentation, for assistance. After reading the current user location, the system informs vision-impaired people through voice about the surrounding obstacles and suggests a safe passable direction. In addition, the user can specify a certain type of object in the room, and our system will tell the user the distance and direction of that object.

Our segmentation model is trained and evaluated on the ScanNet v2 dataset [9], which yields an accurate and robust surrounding perception. During the field experiment and various pre-tests, the learned model performed satisfactorily in our indoor usage scenarios, which makes it suitable for real-world applications. Even when an object was only partially scanned, the network is still able to recognize it and render a relatively accurate classification. To evaluate the assistance functions of our system, we designed different tasks. Our system achieves significant results in reducing collisions with obstacles. According to the questionnaire survey, the users believe that our system will help vision-impaired people in indoor scene understanding. To the best of our knowledge, we are the first to use 3D semantic instance segmentation for assisting the visually impaired. In summary, we deliver the following contributions:
• We propose HIDA, a wearable system with a solid-state LiDAR sensor, which helps the visually impaired to obtain a holistic understanding of indoor surroundings with object detection and obstacle avoidance.
• We design a 3D instance segmentation model, achieving state-of-the-art mAP on ScanNet v2.
• We convert the point cloud with semantic instance information into a usable top-view representation suitable for assisting vision-impaired people.
• We conduct user studies to evaluate obstacle avoidance and object search, verifying the usability and benefit of the proposed assistance system.

Semantic Segmentation for Visual Assistance. Since the surge of deep learning, particularly the concept of fully convolutional networks [35], semantic segmentation can be performed end-to-end, which enables a dense surrounding understanding. Thereby, semantic segmentation has been introduced into vision-based navigational perception and assistance systems [10, 19, 31, 37, 40, 53, 66]. Yang et al. [56] put forth a real-time semantic segmentation architecture to enable universal terrain perception, which has been integrated into a pair of stereo-camera glasses and coupled with depth-based close obstacle segmentation. Mao et al.
[38] designed a panoptic lintention network to reduce the computational complexity of panoptic segmentation for efficient navigational perception, which unifies semantic and instance-specific understanding. Semantic segmentation has also been leveraged to address intersection perception with lightweight network designs [5, 18], whereas most instance-segmentation-based assistance systems [36, 61] directly use the classic Mask R-CNN model [15] and rely on sensor fusion to output distance information. Compared to 2D segmentation-driven assistance, 3D scene parsing systems [6, 58, 62] fall behind, as these classical point cloud segmentation works focus on designing principled algorithms for walkable area detection [1, 48, 62] or stairs navigation [44, 58, 59]. In this work, we devise a 3D semantic instance segmentation system to help visually impaired people perceive the entire surrounding and provide a top-view understanding, which is critical for various indoor travelling and mapping tasks [20, 28, 29, 33].

3D Semantic and Instance Segmentation. With the appearance of large-scale indoor 3D segmentation datasets [2, 9], point cloud semantic instance segmentation has become increasingly popular. It makes it possible to go beyond 2D segmentation and render both point-wise and instance-aware understanding, which is appealing for assisting the visually impaired. Early works follow two main streams. One is detection-based, first extracting 3D bounding boxes and then predicting point-level masks [17, 55, 60]. Another prominent paradigm is segmentation-driven, which first infers semantic labels and then groups points into instances by using point embedding representations [27, 32, 45, 51, 52]. More recently, PointGroup [23] was designed to enable better grouping of points into semantic objects and separation of adjacent instances. DyCo3D [16] employs dynamic convolution customized for 3D instance segmentation. OccuSeg [14] relies on multi-task learning by coupling embedding learning and occupancy regression. 3D-MPA [11] generates instance proposals in an object-centric manner, followed by a graph convolutional model enabling higher-level interactions between nearby instances. Additional methods use panoptic fusion [22, 41, 54], transformers [13, 67] and omni-supervision [12, 57] towards complete understanding. In this work, we build a holistic semantic instance-aware scene parsing system to help visually impaired people understand the entire surrounding. We augment instance segmentation with a lightweight dual-decoder design to better predict semantic and offset features. Differing from other cross-dimension [21, 34] and cross-view [7, 42, 43] sensing platforms, we aim for top-view understanding, and our holistic instance-aware solution directly leverages 3D point cloud segmentation results and aligns them onto 2D top-view representations for generating assistive feedback.

The entire architecture of HIDA is depicted in Fig. 1, including hardware components, user interfaces, and the algorithm pipeline. To maximize the stability of the portable system, only very few parts are integrated into our prototype. As shown in Fig. 1a), the system is composed of three hardware components. First, a lightweight solid-state LiDAR sensor is attached to a belt for collecting point clouds. The RealSense L515, as the world's smallest high-resolution LiDAR depth camera, is well suited to being part of a wearable device. In addition, its scanning range of up to 9 m is suitable for most indoor scenes.
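For illustration, the snippet below shows how a single depth frame from the L515 can be turned into an XYZ point cloud with the pyrealsense2 SDK. This is only a minimal sketch and not the system described in the paper, which accumulates frames into a full scan through SLAM (see below); the default stream profile is an assumption.

```python
import numpy as np
import pyrealsense2 as rs

# Start a depth stream with the SDK's default profile (assumed here for simplicity).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth)
pipeline.start(config)

try:
    # Grab one frame and convert it into an N x 3 array of XYZ points (in meters).
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()
    pc = rs.pointcloud()
    points = pc.calculate(depth)
    xyz = np.asanyarray(points.get_vertices()).view(np.float32).reshape(-1, 3)
    print(f"captured {xyz.shape[0]} points")
finally:
    pipeline.stop()
```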
A laptop placed in a backpack is the second component of our system. The laptop with a GPU ensures that the instance segmentation can be performed in an online manner. As input and output interface, a bone-conduction headset with a microphone is the third component. Audio commands from users are recognized by the voice user interface. Also beneficially, thanks to the bone-conduction earphones, internal acoustic cues from our system and external environmental sounds can be perceived separately by the users, which is safety-critical for assisting the visually impaired.

The users collect the point cloud independently under audio guidance. At the same time, the system also needs to obtain the user's position in the point cloud map. Both can be achieved through Simultaneous Localization and Mapping (SLAM) with odometry [8, 26, 49]. There are many mature SLAM frameworks. However, for the visually impaired, collecting point clouds in an unfamiliar environment differs from conventional usage. Specifically, the entire room is usually scanned at the entrance, so the user's movement is limited to a small range and mostly consists of in-situ rotation. In addition, the motion of the human body is less stable than that of a robot. In this case, the odometry of the SLAM system may lose tracking. If this happens, the camera has to be moved back to the previous keyframe position for loop closing, which is hard for vision-impaired people. Therefore, we place more value on the robustness of the mapping process. In our field test, RTAB-Map [26] achieved a reliable performance under these requirements. Fig. 2 shows two scanning results in real-world scenes using RTAB-Map. It can be seen that even within a limited range of movement, HIDA can obtain point clouds with rich and dense object information. With some practice, users can complete the collection of dense point clouds on their own in most cases, even though they cannot monitor the captured point cloud on a display at the same time.

In our system, once the "start" instruction is recognized by the voice user interface, the environment scanning process begins. The odometry simultaneously updates the current position and orientation of the user. Under audio guidance, the user can slowly turn around to scan the surrounding environment. In practice, the default scan time for a whole room is set to 20 seconds, which yields a sufficient number of keyframes from different directions. After that, the captured point cloud is sent to the segmentation network for 3D instance segmentation.

In the data preprocessing stage, if the point cloud is too large, the segmentation time and memory requirements increase significantly. Considering the efficiency of 3D segmentation, we restrict the number of points to at most 200,000. If the total number of points in the point cloud exceeds 200,000, we downsample them evenly to save running time and memory. In addition, noise points caused by the influence of natural lighting are removed as outliers.
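The preprocessing step can be illustrated with a short sketch using Open3D. This is not the authors' released code; the paper only specifies the 200,000-point budget and the removal of outliers, so the even-downsampling strategy and the statistical outlier-removal parameters below are assumptions for illustration.

```python
import numpy as np
import open3d as o3d

MAX_POINTS = 200_000  # point budget from the paper

def preprocess(points_xyz: np.ndarray, colors_rgb: np.ndarray) -> o3d.geometry.PointCloud:
    """Evenly downsample an oversized cloud and remove sparse outlier points."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    pcd.colors = o3d.utility.Vector3dVector(colors_rgb)

    # Keep every k-th point so the result stays under the budget.
    n = points_xyz.shape[0]
    if n > MAX_POINTS:
        every_k = int(np.ceil(n / MAX_POINTS))
        pcd = pcd.uniform_down_sample(every_k)

    # Statistical outlier removal; parameters are illustrative, not from the paper.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd
```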
Inspired by PointGroup [23], we design a 3D instance segmentation architecture. PointGroup uses a 7-layer sparse convolution U-Net [46] to extract features, where each layer consists of an encoder and a decoder. The extracted features are decoded into two branches: a semantic branch and an offset branch. The semantic branch builds clusters of points that share the same semantic labels. The offset branch predicts per-point offset vectors to shift each point towards its instance centroid and builds shifted point clusters that belong to the same instances. These cluster proposals are voxelized and then scored. At inference time, Non-Maximum Suppression (NMS) is performed to make the final instance predictions.

Different from [23], our key adaptation lies in the backbone structure, in order to enable a clear separation of dense instances and reach a better performance on semantic and offset predictions, which are essential for 3D instance segmentation. The architecture of our modified model is shown in Fig. 3. Specifically, we use one encoder and dual decoders in each layer, one decoder for semantic features and another for offset features, so that the two feature types are extracted separately in the early stage of the network. In addition, this dual-decoder U-Net consists of only 5 layers instead of 7, which reduces parameters and speeds up inference. More details are shown in Fig. 4. This architecture has a clearly better performance compared with the original backbone, while maintaining a low number of parameters. As mentioned before, many objects may only be partially scanned. Even so, the model can still accurately identify most of the objects in the point cloud. We show some examples of instance segmentation results in Fig. 2.

After capturing the point cloud and obtaining semantic and instance information, we can provide a variety of information to vision-impaired people. Although the point cloud has already been downsampled, computing with the whole point cloud is still very time-consuming. To reduce the processing time, the proposed system traverses the entire point cloud only once to extract the information of interest to the user. The implementation is as follows: First, all points are projected onto a plane parallel to the ground (the XY-plane). Then, the current user location is updated. For each instance, 5 feature points are searched among the points belonging to the same segmented instance: the point closest to the user and the four coordinate extreme points (corner points) of this object. The direction is calculated according to the camera pose information. We build a 2D camera coordinate system X'Y' relative to the XY-plane, and then transform the feature point coordinates (x_i, y_i) into (x'_i, y'_i), where +x' corresponds to the user's front and +y' to the user's left. Raw coordinates are not intuitive for the user, so our user interface represents instances by their distance and direction relative to the user. The directions are divided into 12 areas, as shown in Fig. 5a). In this way, the location of each instance is replaced by the distance and direction of the point on the instance that is closest to the user. Besides, in order to help the user bypass an obstacle, the directions of the "corner points" are also marked as directions "occupied" by the object. An example is shown in Fig. 5b).

Obstacle avoidance. Based on this representation, we propose two interactive functions. The first function is obstacle avoidance. The user sets an obstacle avoidance range, and the system eliminates the directions occupied by obstacles within this range to find a passable area. If all directions in the scanned area are already occupied by obstacles, directions pointing to the unscanned area are suggested to the user as potentially passable. The passable area and the information of all objects within the detection range are then output, as sketched below.
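The following sketch illustrates the top-view post-processing just described: projecting instance points into a user-centric 2D frame, summarizing each instance by its closest point and corner points, binning directions into 12 sectors, and finding passable sectors. It is a simplified reconstruction under assumptions (sector 0 starting at the user's forward direction, corner points taken in the user frame), not the authors' implementation.

```python
import numpy as np

NUM_SECTORS = 12                 # direction areas, cf. Fig. 5a)
SECTOR_DEG = 360.0 / NUM_SECTORS

def to_user_frame(points_xy, user_xy, user_heading_rad):
    """Map 2D map-frame points into a frame where +x' is the user's front, +y' the left."""
    d = points_xy - user_xy
    c, s = np.cos(-user_heading_rad), np.sin(-user_heading_rad)
    R = np.array([[c, -s], [s, c]])          # rotation by -heading
    return d @ R.T

def sector_of(p):
    """Counter-clockwise sector index of a 2D point; convention assumed, not from the paper."""
    ang = np.degrees(np.arctan2(p[1], p[0])) % 360.0
    return int(ang // SECTOR_DEG)

def instance_summary(instance_points_xyz, user_xy, user_heading_rad):
    """Distance/direction of the closest point plus the sectors 'occupied' by corner points."""
    xy = to_user_frame(instance_points_xyz[:, :2], user_xy, user_heading_rad)
    dist = np.linalg.norm(xy, axis=1)
    closest = xy[dist.argmin()]
    corners = [xy[xy[:, 0].argmin()], xy[xy[:, 0].argmax()],
               xy[xy[:, 1].argmin()], xy[xy[:, 1].argmax()]]
    return dist.min(), sector_of(closest), {sector_of(c) for c in corners}

def passable_sectors(summaries, avoid_range_m):
    """Sectors not occupied by any obstacle closer than the avoidance range."""
    occupied = set()
    for d, s, corner_sectors in summaries:
        if d <= avoid_range_m:
            occupied |= {s} | corner_sectors
    return sorted(set(range(NUM_SECTORS)) - occupied)
```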
Fig. 6 shows a functional example in an indoor scene.

Object finding. The second function is object search. The user specifies an object category of interest through a voice command, such as a "Find a desk" instruction. The system then searches for the corresponding instance and returns the object position through acoustic cues. For instance, "Found a desk, distance 2.2 meters, direction directly forward" is output as speech via the bone-conduction headset. In addition, in order to help users navigate to the object, our system also alerts the user to obstacles in the direction leading to the object, for example: "Attention, a chair in this direction, distance 1.3 meters". The detailed schematic view is given in Fig. 7.

We trained our instance segmentation model on the ScanNet v2 dataset [9], containing 1613 scans of indoor scenes. The dataset is split into 1201, 312, and 100 scans in the training, validation, and testing subsets, respectively. We set the cluster voxel size to 0.02 m, the cluster radius to 0.03 m, and the minimum cluster point number to 50. In the training process, we use the Adam optimizer [25] with a learning rate of 0.001. Our model was trained for 360 epochs over the entire training set with a batch size of 8. We report the quantitative performance in Table 1 and Table 2. Several qualitative visualizations of the segmentation results are shown in Fig. 8. In these three scenes, the instances are placed quite densely (e.g., many chairs are placed side by side in the third scene), and some instances have not been completely scanned. Nevertheless, our model can still separate them well, which is beneficial for upper-level assistance.

Following previous research, mean Average Precision (mAP) is used as the main evaluation metric (see Table 1 and Table 2). Specifically, AP25 and AP50 denote the AP scores with the IoU threshold set to 25% and 50%, while AP averages the scores with IoU thresholds from 50% to 95% with a step size of 5%. We first assess the performance on the validation set of ScanNet v2, as shown in Table 2. Here, we also compare the performance with different backbone sizes: "*-S" denotes a smaller backbone with m = 16 and "*-L" denotes a larger backbone with m = 32. If we focus on the smaller backbone and perform a fair comparison with the original PointGroup model [23] and the recent DyCo3D model [16], the proposed architecture clearly exceeds them, i.e., by 2.8% compared to PointGroup and 2.6% compared to DyCo3D. We also report the class-wise performance of our 3D point cloud instance segmentation model on the testing set of ScanNet v2, as listed in Table 1. Compared with state-of-the-art methods, our DD-UNet+Group model achieves the best performance measured in mAP (43.6%). Among the 11 architectures, our method reaches high scores on many classes relevant for assisting the visually impaired.

We tested the system on a laptop (Intel Core i7-7700HQ, NVIDIA GTX 1050Ti). We scanned four real-world scenes, including a densely-scanned meeting room, a bedroom, a corridor, and an office, with differing scan times and point cloud sizes. We measured the pre-processing time, the instance segmentation time (broken down by network stage), and the extraction time of object information, listed in Table 3. It can be seen that the processing of the point cloud is still computationally complex, which inevitably leads to a short waiting time for the user. In the scenario of helping the visually impaired gather a complete scene understanding, search for objects of interest, or travel in unfamiliar indoor scenes, a short waiting time is acceptable. Yet, a more powerful mobile processor and a more efficient large-scale point cloud instance segmentation method are expected to allow users to start travelling immediately after the scanning.
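As a concrete illustration of the object-finding interaction described above, a query against the per-instance summaries from the earlier sketch could look as follows. The category labels, message wording, and helper signatures are assumptions for illustration, not the authors' implementation.

```python
def find_object(category, instance_labels, summaries):
    """Build a spoken message for the nearest instance of the requested category.

    instance_labels: dict instance_id -> predicted class label
    summaries: dict instance_id -> (distance_m, sector, corner_sectors),
               as produced by instance_summary() in the earlier sketch (hypothetical helper).
    """
    candidates = [(summaries[i][0], i)
                  for i, label in instance_labels.items() if label == category]
    if not candidates:
        return f"No {category} found in the scanned area."

    dist, best = min(candidates)
    sector = summaries[best][1]
    msg = f"Found a {category}, distance {dist:.1f} meters, direction sector {sector}."

    # Warn about closer obstacles that occupy the same direction sector.
    for i, (d, s, corners) in summaries.items():
        if i != best and d < dist and sector in ({s} | corners):
            msg += f" Attention, a {instance_labels[i]} in this direction, distance {d:.1f} meters."
    return msg
```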
Two tasks were designed to test HIDA with 7 participants, including 5 males and 2 females, whose ages range from 24 to 30. Due to COVID-19 restrictions, especially the social-distancing regulations that are challenging for the visually impaired [39, 47], the voluntary participants were sighted and were blindfolded during the tests. The first task focuses on the functionality of obstacle avoidance, whereas the second tests the functionality of instance locating and the guidance ability for assisting users in finding objects of interest. While each participant was executing the tasks, we recorded the whole process for subsequent analysis.

Task 1, find a passable direction: The scenario of the first task is as follows: the user enters a wide corridor (passable in most directions); within three meters around the entrance, we place different kinds of obstacles and leave a "gap" in a random direction. The users were asked to find the gap and leave this obstacle area. Each user performed this task first using only the white cane and then using both the white cane and our system. The arrangement and position of the obstacles and the gap were changed after each successful run. Fig. 9a) illustrates some of the qualitative analysis results of task 1. It can be seen that when using only a white cane, users generally perform an unguided global search, trying to recognize the surrounding environment and find passable areas in a brute-force-like manner. When using HIDA, participants can directly find the correct direction of the gap. Collisions with obstacles were significantly reduced, providing a safe walking condition in indoor environments with obstacles. However, a few problems were observed among different users. For example, the vision-impaired user sometimes cannot completely scan the surrounding environment. If our system does not find a "gap" between the scanned obstacles, it suggests two directions pointing to the unscanned area. The last row of Fig. 9a) represents this case. Although one of the two suggested directions is correct, it is hard for users to return to the previous position after they first tried the wrong direction. This indicates that more practice is needed in order to maximize the system's effect.

Task 2, find a specific object: We chose an office for the second task since the object placement there is more complex, posing a higher difficulty level for testing our system. First, users were taken to different start points. Then, the users were asked to find a random chair and sit down. Each user fulfilled this task twice, once using only the white cane and once using both the white cane and our system. We only changed the position of some objects each time. In order to avoid the user becoming more familiar with the room on the second visit, half of the users first performed the task with the white cane only and then with both the system and the cane, while the other half did the reverse. Fig. 9b) visualizes part of the results.
Similar to task 1, when relying only on the white cane, the user perceived the surrounding objects one by one by touching them and feeling their shape, showing a low efficiency in finding the object. When using our system as a supplement, the users walked directly toward the chair, skipping the process of recognizing object shapes by touch, which indicates a higher efficiency in searching for objects, and they could bypass the obstacles on the path. As shown in the last row of Fig. 9b), some objects may not be correctly classified since the accuracy of our segmentation network is not perfect. A network with an even higher precision is expected to be designed and deployed in the future.

Efficiency analysis. We also measured the average time users took to complete these tasks, reported in Table 4. As can be seen from the data, using our system did not significantly reduce the whole duration. The reason is that scanning the surrounding environment, segmenting the point cloud, and the audio interaction with users are still time-consuming with the current method. If we only measure the time from the user leaving the starting area to completing the task, the differences are clear, indicating the potential and superiority of the proposed system, which gave a correct direction and significantly reduced wrong attempts.

Table 5 (excerpt): Questionnaire. For each statement, users select a score from 1 to 5, where 5 means strongly agree and 1 means strongly disagree.
Operability: I can get familiar with this system easily; the audio guidance is clear; wearing this device has no negative effect on my other daily actions.
Overall: For vision-impaired people, this system helps.

Questionnaire. A questionnaire covering multiple aspects, listed in Table 5, was designed to further evaluate the user experience of our system. After completing the above two tasks, each participant answered these questions. Fig. 10 shows the feedback from the participants regarding our system. Participants made positive comments in terms of understanding, complexity, and comfort of the functionality of our system. They consider that the system has an intuitive and smooth interface and that it is much easier to find objects with this system. However, the functionality and the operability still need to be improved. In summary, most participants think that the functions of HIDA are indeed very helpful in the case of visual impairment and that wearing the current hardware would not have too much impact on their daily lives. But they expect more functions; in addition, they think that some training is necessary in order to become familiar with the system, after which they would accomplish the tasks more smoothly.

User comments. We also collected comments from the participants. Positive comments include that our system did help them to understand the environment, provided a very smooth interface, is easy to wear, and that the bone-conduction headset is very suitable for such a system. However, some participants also put forward directions for improving the system. Regarding hardware, one participant hopes for a lighter processor instead of a laptop in the future, which would make the system more wearable. Regarding functionality, two participants suggested that providing real-time object information would help the visually impaired even more.
Regarding interaction, one participant mentioned that the current voice interaction is still a bit time-consuming, especially when there are many obstacles, since the output of information becomes somewhat lengthy. In addition, some participants thought the system should provide clearer guidance on how to scan a room completely, or more training before usage. These comments are very constructive for our future work.

In this work, we proposed HIDA, a 3D vision-based assistance system that helps visually impaired people gather a holistic indoor understanding with object detection and obstacle avoidance. Using point clouds obtained with a wearable solid-state LiDAR sensor, it provides obstacle detection and specific object search. Visual SLAM and instance-aware 3D scene parsing are combined, where point-wise semantics are converted to yield a dense top-view surrounding perception. The devised 3D instance segmentation model attains state-of-the-art performance on the ScanNet v2 dataset, thanks to separate predictions of semantic and offset features. The overall system is verified to be useful and helpful for indoor travelling and object searching, and it is promising for more assisted-living tasks. However, there are some limitations of our system. First, the scanning of the surrounding environment and the point cloud segmentation are time-consuming, making real-time processing impossible. In addition, the accuracy of the current point cloud instance segmentation model still leaves much room for improvement. Last but not least, although we obtain a satisfactory point cloud map with semantic instance information, real-time interaction with the map has not been fully investigated, because the visual odometry easily loses tracking when the movement range is too large. In the future, we intend to address the above points and introduce improvements to the system, such as leveraging multi-sensor fusion to speed up the scanning of the surrounding environment and improving the odometry to obtain the location of the user on the map in real time. Moreover, designing a higher-precision segmentation network structure will also significantly improve the accuracy of the holistic guidance delivered by our system.

References
[1] Navigation assistance for the visually impaired using RGB-D sensor with range expansion
[2] 3D semantic parsing of large-scale indoor spaces
[3] Smart guiding glasses for visually impaired people in indoor environment
[4] Enhancing perception for the visually impaired with deep learning techniques and low-cost wearable sensors. Pattern Recognition Letters
[5] Rapid detection of blind roads and crosswalks by using a lightweight semantic segmentation network
[6] Computer vision for the visually impaired: the sound of vision system
[7] Semantic MapNet: Building allocentric semantic maps and representations from egocentric views
[8] Panoramic annular SLAM with loop closure and global optimization
[9] ScanNet: Richly-annotated 3D reconstructions of indoor scenes
[10] V-Eye: A vision-based navigation system for the visually impaired
[11] 3D-MPA: Multi-proposal aggregation for 3D semantic instance segmentation
[12] Omni-supervised point cloud segmentation via gradual receptive field component reasoning
[13] PCT: Point cloud transformer
[14] OccuSeg: Occupancy-aware 3D instance segmentation
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN
[16] DyCo3D: Robust instance segmentation of 3D point clouds through dynamic convolution
[17] 3D-SIS: 3D semantic instance segmentation of RGB-D scans
[18] Outdoor walking guide for the visually-impaired people based on semantic segmentation and depth map
[19] Development of a wearable guide device based on convolutional neural network for blind or visually impaired persons
[20] An indoor positioning framework based on panoramic visual odometry for visually impaired people
[21] Bidirectional projection network for cross dimension scene understanding
[22] Panoramic panoptic segmentation: Towards complete surrounding understanding via unsupervised contrastive learning
[23] PointGroup: Dual-set point grouping for 3D instance segmentation
[24] Safe local navigation for visually impaired users with a time-of-flight and haptic feedback device
[25] Adam: A method for stochastic optimization
[26] RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation
[27] 3D instance segmentation via multi-task metric learning
[28] ISANA: Wearable context-aware indoor assistive navigation with obstacle avoidance for the blind
[29] Semantic scene mapping with spatiotemporal deep neural network for robotic applications. Cognitive Computation
[30] 3D instance embedding learning with a structure-aware loss function for point cloud segmentation
[31] Deep learning based wearable assistive system for visually impaired people
[32] MASC: Multi-scale affinity with sparse convolution for 3D instance segmentation
[33] Indoor topological localization based on a novel deep learning technique
[34] 3D-to-2D distillation for indoor scene parsing
[35] Fully convolutional networks for semantic segmentation
[36] Unifying obstacle detection, recognition, and fusion based on millimeter wave radar and RGB-depth sensors for the visually impaired
[37] Computer vision-based assistance system for the visually impaired using mobile edge artificial intelligence
[38] Panoptic lintention network: Towards efficient navigational perception for the visually impaired
[39] Helping the blind to get through COVID-19: Social distancing assistant using real-time semantic segmentation on RGB-D video
[40] The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces
[41] PanopticFusion: Online volumetric semantic mapping at the level of stuff and things
[42] Cross-view semantic segmentation for sensing surroundings
[43] MASS: Multi-attentional semantic segmentation of LiDAR data for dense top-view understanding
[44] Stairs detection with odometry-aided traversal from a wearable RGB-D camera. Computer Vision and Image Understanding
[45] Joint semantic-instance segmentation of 3D point clouds with multi-task pointwise networks and multi-value conditional random fields
[46] U-Net: Convolutional networks for biomedical image segmentation
[47] Active crowd analysis for pandemic risk mitigation for blind or visually impaired persons
[48] Enabling independent navigation for visually impaired people through a wearable vision-based feedback system
[49] Lightweight 3-D localization and mapping for solid-state LiDAR
[50] An environmental perception and navigational assistance system for visually impaired persons based on semantic stixels and sound interaction
[51] SGPN: Similarity group proposal network for 3D point cloud instance segmentation
[52] Associatively segmenting instances and semantics in point clouds
[53] Footprints and free space from a single color image
[54] SceneGraphFusion: Incremental 3D scene graph prediction from RGB-D sequences
[55] Learning object bounding boxes for 3D instance segmentation on point clouds
[56] Unifying terrain awareness through real-time semantic segmentation
[57] Omnisupervised omnidirectional semantic segmentation
[58] A new approach of point cloud processing and scene segmentation for guiding the visually impaired
[59] 3-D object recognition of a robotic navigation aid for the visually impaired
[60] GSPN: Generative shape proposal network for 3D instance segmentation in point cloud
[61] Content-aware video analysis to guide visually impaired walking on the street
[62] Ilyes Mendili, and Soedji Ablam Edoh Barnabe. Ego-semantic labeling of scene from depth image for visually impaired and blind people
[63] Point cloud instance segmentation using probabilistic embeddings
[64] Spatial semantic embedding network: Fast 3D instance segmentation with deep metric learning
[65] An indoor wayfinding system based on geometric features aided graph SLAM for the visually impaired
[66] Perception framework through real-time semantic segmentation and scene recognition on a wearable system for the visually impaired