Indoor Navigation Assistance for Visually Impaired People via Dynamic SLAM and Panoptic Segmentation with an RGB-D Sensor
Wenyan Ou, Jiaming Zhang, Kunyu Peng, Kailun Yang, Gerhard Jaworek, Karin Müller, Rainer Stiefelhagen
Date: 2022-04-03

Exploring an unfamiliar indoor environment and avoiding obstacles is challenging for visually impaired people. Currently, several approaches achieve the avoidance of static obstacles based on the mapping of indoor scenes. To address the issue of distinguishing dynamic obstacles, we propose an assistive system with an RGB-D sensor that detects dynamic information of the scene. Once the system captures an image, panoptic segmentation is performed to obtain prior dynamic object information. With sparse feature points extracted from the images and the depth information, the user's pose can be estimated. After the ego-motion estimation, dynamic objects can be identified and tracked. The poses and speeds of the tracked dynamic objects are then estimated and passed to the user through acoustic feedback.

Human perception of the environment is largely vision-based, which makes vision an indispensable part of daily life. People with visual impairments usually have very limited or no access to this channel. They mainly rely on information from other modalities to perceive their surroundings, the most important of which is hearing. When the environment is too noisy, their perception of dynamic objects is therefore impaired, which can lead to collisions with obstacles or even injuries and greatly affects their daily life. Besides, visually impaired people find it hard to maintain proper social distances from others during the COVID-19 pandemic [13]. Some assistance systems tackle this issue through Simultaneous Localization And Mapping (SLAM) and deep learning approaches [12, 19] to provide accurate guidance to visually impaired people, but they are less effective in highly dynamic scenarios. To address this problem, we propose a system that helps people with visual impairments perceive dynamic objects in indoor environments and understand their motion.

Recently, research on SLAM has gradually shifted from traditional static environments to more diverse dynamic environments, which are closer to reality. Depending on how dynamic objects are handled, current solutions can be divided into two categories: some works directly discard the dynamic information as outliers [3, 17], while others keep it and jointly optimize the static map, the camera pose, and the dynamic objects in the scene [2, 20].

In this work, we propose a novel indoor assistance system for visually impaired people that operates in dynamic environments; its main structure is shown in Fig. 1. We develop a wearable assistance system based on an RGB-D sensor, which estimates the user's ego-pose together with a static feature point map. Dynamic objects can be identified by the proposed system, and their average depth relative to the user can be obtained. When a dynamic object belongs to a prior class, e.g., people, it can also be tracked between frames. Moreover, the linear velocity of prior dynamic objects can be estimated and transmitted to the user through the bone-conduction headphones mounted on our smart glasses (see Fig. 1(a)).
In summary, our main contribution is a wearable assistive system with an RGB-D sensor that achieves localization and mapping and helps visually impaired people detect dynamic objects in indoor scenes and obtain the corresponding motion cues.

SLAM in dynamic environments. In the past years, many visual SLAM systems, such as ORB-SLAM [14] and DVO-SLAM [6], have been proposed and achieve satisfactory performance. However, these systems are based on the assumption of a static or only slightly dynamic environment. If there are highly dynamic objects in the environment, which is closer to real-life scenarios, pose estimation and mapping may yield poor results. Some approaches deal with dynamic objects in a purely geometry-based way [5, 7]. Other works combine deep learning and geometry-based methods to eliminate the negative effects of dynamic objects [3, 17, 21]. Recently, some works [2, 20] tackle this issue by tracking dynamic objects instead of removing them.

SLAM in assistance systems. In assistive applications for visually impaired people, SLAM is often used for user positioning and obstacle detection [10, 18]. For navigational assistive systems, obstacle detection should provide more detailed information that helps to avoid collisions and understand the surroundings. Therefore, semantic information has also been incorporated into SLAM systems to achieve semantic path and destination finding [22]. Besides, a prior map with semantic information can be built by SLAM in advance for indoor navigation systems [1, 12] and later used for global path planning. Differing from these existing works, our work addresses localization and mapping in challenging dynamic environments and combines feature descriptor matching for ego-motion estimation with optical flow tracking for dynamic object motion estimation to guarantee valid tracking. Besides, our system can help visually impaired people avoid collisions with diverse moving obstacles.

Fig. 1(a) shows our assistive system, consisting of a pair of smart glasses and a lightweight laptop. First, the surroundings of the visually impaired user are captured by the RGB-D sensor attached to the glasses of the wearable device. Then, panoptic segmentation of the RGB image is executed online on the laptop, and optionally human joint keypoint estimation is performed. After this pre-processing step, the RGB-D image and the segmentation mask are passed into the tracking module and further processed in the local mapping and loop closing modules. From the perspective of human-computer interaction, information about the surroundings is delivered to the user through the bone-conduction earphones on the glasses.

PanopticFCN [11] is leveraged with the captured RGB image as input. The output of panoptic segmentation consists of semantic and instance masks, as visualized in Fig. 2(b). Since people tend to move dynamically in real-life indoor scenes, the class 'person' is annotated as prior dynamic. If a more accurate speed estimate of moving people is desired, the human joint keypoint coordinates detected by OpenPose [4] can be generated as well. We select the joint keypoints on the shoulders and the middle hip, constraining the sampled points to the torso, so that the points are more likely to lie on only slightly deforming parts of the body, enabling stable motion estimation.
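To make this keypoint selection concrete, the following is a minimal sketch assuming OpenPose's BODY_25 output layout; the index constants and the select_torso_points helper are illustrative assumptions, not part of the actual implementation.

```python
import numpy as np

# BODY_25 joint indices (assumption: standard OpenPose BODY_25 layout).
R_SHOULDER, L_SHOULDER, MID_HIP = 2, 5, 8
TORSO_JOINTS = (R_SHOULDER, L_SHOULDER, MID_HIP)

def select_torso_points(pose_keypoints, conf_thresh=0.3):
    """Keep only the shoulder and mid-hip keypoints of one detected person.

    pose_keypoints: (25, 3) array of (x, y, confidence) from OpenPose.
    Returns an (N, 2) array of pixel coordinates on the torso.
    """
    pts = []
    for j in TORSO_JOINTS:
        x, y, c = pose_keypoints[j]
        if c > conf_thresh:          # discard low-confidence detections
            pts.append((x, y))
    return np.asarray(pts, dtype=np.float32)
```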
The dynamic objects handled by our system fall into two categories: prior dynamic objects, such as people or pets, and non-prior moving objects, e.g., a carton passively moved by a person, as shown in Fig. 2(c). On the one hand, non-prior dynamic objects are identified after the initial ego-pose estimation through a depth-difference method similar to [3]. The main difference between our approach and [3] is that the instance mask is directly leveraged to calculate the percentage of dynamic points in an object and decide whether it is moving, rather than using region growing on depth images. On the other hand, the 3D scene flow of sampled points on prior objects in consecutive frames is used to verify whether a prior object is dynamic. When the scene flow magnitude of a point is larger than a threshold (0.02 in our work), the point is regarded as dynamic. The object is regarded as dynamic if the percentage of dynamic points on it exceeds a threshold (30%). Considering the computational cost, only dynamic prior objects are tracked and described with a speed; dynamic non-prior and static prior objects are fed back with their direction and average depth relative to the user.

We follow the pipeline proposed in [14] for ego-motion estimation, adding a "non-prior dynamic object identification" step before tracking the local map. In the mapping step, only static points are considered as map points. Thanks to this excellent feature-based visual SLAM framework, we can estimate the ego-pose robustly and accurately, which is essential for the subsequent dynamic object tracking.

We leverage a methodology similar to [20] for the pose and speed estimation of dynamic objects. A preparation procedure is first executed on a separate thread while the ego-motion is being solved. The potential dynamic objects with prior labels are numbered in the obtained panoptic segmentation mask, while all other pixels are annotated as 0. Then, DIS optical flow [8] is used to find corresponding keypoints in the current frame. With this dense optical flow method, if a potential prior dynamic object appears in the mask of the last frame but fails to be segmented in the current frame, the object keypoints of the last frame can still be tracked and recovered in the current frame. Dynamic objects tracked by optical flow, as presented in Fig. 2(d), can be assigned unique track indices over time. To guarantee a sufficient number of tracked dynamic points, we sample every fifth point within an object mask; when the number of tracked points within a dynamic object drops below the threshold, these sampled candidate points are supplied. After obtaining the ego-pose of the current frame, the scene flow of the points can be calculated, so prior dynamic objects that actually remain static can be filtered out according to the scene flow magnitude. For the initial pose estimation of dynamic objects, the result with more inliers is selected from either the EPnP method [9] or propagation of the previous motion. Finally, further pose optimization and speed calculation as in [20] are used to obtain the final result.
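As a concrete illustration of the scene-flow check described above, here is a minimal sketch; it assumes the sampled points have already been back-projected to 3D and expressed in a common world frame (i.e., ego-motion compensated), and the function name and array shapes are illustrative rather than the paper's actual code.

```python
import numpy as np

FLOW_THRESH = 0.02    # scene flow magnitude threshold used in our work
DYNAMIC_RATIO = 0.30  # fraction of dynamic points needed to flag the object

def is_object_dynamic(pts_prev_world, pts_curr_world):
    """Decide whether a prior object (e.g. a person) is actually moving.

    pts_prev_world, pts_curr_world: (N, 3) arrays of the same sampled points
    in the world frame at the previous and current frame, i.e. already
    compensated for camera ego-motion.
    """
    scene_flow = pts_curr_world - pts_prev_world           # per-point 3D motion
    magnitudes = np.linalg.norm(scene_flow, axis=1)
    dynamic_fraction = np.mean(magnitudes > FLOW_THRESH)   # share of moving points
    return dynamic_fraction > DYNAMIC_RATIO
```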
We test our system on sequences of the public indoor TUM RGB-D dataset [16] and the Bonn RGB-D dataset [15]. We select sequences with dynamic objects, including seated and walking people, to simulate real-life scenarios in offices or other rooms that are important for visually impaired people.

Quantitative results of pose estimation. We evaluate the ego-motion estimation using the RMSE of the absolute trajectory error (ATE) and of the relative pose error (RPE) as metrics, which indicate the robustness of the system, as proposed in [16]. Table 1 presents the comparison between our system, the baseline framework ORB-SLAM2 [14], and ORB-SLAM2 with prior semantic information. Our system performs better in most cases, which verifies the advantage of our approach in highly dynamic indoor scenes. Since the dynamic objects in the TUM RGB-D sequences mainly carry the prior label 'person', the difference between the latter two methods is small. For slightly dynamic scenes such as fr3/sitting_rpy, ORB-SLAM2 performs better, since it retains more valid keypoints located on the people; nevertheless, the ATE of the latter two methods remains relatively small. For the crowd sequence with three fast-moving and rotating people, the second method shows a better result, but it is only effective for sequences with prior dynamic objects. For challenging sequences such as moving_nonobstructing_box and moving_obstructing_box, our system shows robust performance.

Table 1. Ego-motion comparison on the TUM and Bonn datasets. The top five sequences are from TUM, the bottom six from Bonn. Units: ATE (m), RPE_t (m/frame), RPE_r (degree/frame). ORBv2: ORB-SLAM2 [14]. For each method (ORBv2 (RGB-D), ORBv2 with semantic information, and our system), the columns report ATE, RPE_t, and RPE_r.

Qualitative results of dynamic object estimation. Since ground-truth information on the dynamic objects is not provided, a qualitative analysis of the estimation is presented in this section. We compute the dynamic objects' speeds as shown in Fig. 2(d). In the example scenario, the moving direction and speed of an oncoming or passing person are detected by the system and delivered to the user, so that they can react in time and avoid a collision.

To investigate the map built for further navigation tasks, we evaluate the mapping capability of our system. Since the sparse point cloud map generated by ORB-SLAM2 cannot easily be used in practical applications, we generate a dense point cloud map offline after obtaining the keyframe poses for the whole sequence. In Fig. 3, we visualize the dense point cloud with and without the effect of dynamic objects. Moreover, we also generate octree maps based on the corrected dense point cloud map, which are suitable for assistive functions. The overlapping prior dynamic objects (persons) in the point cloud are removed and a valid map is generated. Note that over the whole sequence, the point cloud of non-prior dynamic objects, such as the chairs in the left sequence, still contains two versions, i.e., the position before the object was moved by the person and the position after it became static again. Whether this kind of point cloud should be maintained or deleted calls for a more sophisticated strategy; for long-term SLAM systems, the map would be locally updated at certain times.
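To illustrate how such a dense map can be assembled offline from the keyframe poses, the following is a minimal sketch assuming a pinhole camera model and a per-keyframe mask of prior dynamic pixels; the function name and parameters are illustrative and not the paper's actual implementation.

```python
import numpy as np

def backproject_keyframe(depth, dynamic_mask, pose_wc, fx, fy, cx, cy):
    """Back-project one keyframe's depth image into a world-frame point cloud.

    depth:        (H, W) depth in meters.
    dynamic_mask: (H, W) bool array, True where a prior dynamic object
                  (e.g. a person) was segmented, so those pixels are excluded.
    pose_wc:      (4, 4) camera-to-world transform of the keyframe.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth > 0) & (~dynamic_mask)         # drop invalid depth and dynamic pixels
    z = depth[valid]
    x = (u[valid] - cx) * z / fx                  # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # (4, N) homogeneous
    pts_world = (pose_wc @ pts_cam)[:3].T                    # (N, 3) world coordinates
    return pts_world

# A dense map is then obtained by concatenating the clouds of all keyframes.
```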
Runtime analysis. We test the average computational time of the system. The measured time excludes panoptic segmentation and joint keypoint extraction, as these depend entirely on the GPU type and the chosen neural network. Moreover, the time cost of our system is strongly related to the number of tracked dynamic objects. Across all sequences, we achieve an average speed of 4∼7 FPS on an i5-10210U CPU, which is reasonable for indoor navigation.

Real-life scenarios. In addition to localization and mapping in dynamic environments, we also explore how dynamic information can be used in an assistive system for the visually impaired. We collected several sequences in real-life scenarios, two of which are shown in Fig. 4. Our system can provide several pieces of dynamic information about a prior dynamic object, including its number, average depth, position, pose, velocity, and possible moving direction. The depth value of a person can help the user maintain social distance in a public indoor environment such as a shopping mall. Moreover, when a moving object comes closer to the user and its speed is relatively high, a reminder of the potential risk can be passed to the user. Objects with high velocity are often dangerous for visually impaired people, and this velocity information can enhance current obstacle avoidance modules that mainly rely on depth information [12, 19].

We also designed a questionnaire regarding the expected form of feedback from our system. Based on personal discussions and the results of the online questionnaire, the voice feedback of the system should be user-related and easy to understand. 'User-related' means that the moving object could affect the user's walking in the short term; in this case, the system should remind the user of the potential risk with a special signal tone.
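As an illustration of the risk reminder described above, the following minimal sketch combines the estimated depth and speed of a tracked person into a feedback level; the threshold values and function name are illustrative assumptions, not parameters reported in this work.

```python
# Illustrative thresholds (assumptions, not values from this work).
NEAR_DEPTH_M = 2.0      # object closer than 2 m counts as "near"
FAST_SPEED_MPS = 1.0    # object faster than 1 m/s counts as "fast"

def risk_alert(avg_depth_m, speed_mps):
    """Return a feedback level for one tracked dynamic object."""
    if avg_depth_m < NEAR_DEPTH_M and speed_mps > FAST_SPEED_MPS:
        return "signal_tone"   # urgent, user-related warning
    if avg_depth_m < NEAR_DEPTH_M:
        return "voice_hint"    # nearby but slow object
    return "none"              # no feedback needed
```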
In this work, an assistive system is developed to help people with visual impairments understand dynamic changes in indoor scenes. Static keypoints obtained by sparse-feature visual SLAM are combined with dynamic keypoints obtained by optical flow tracking: the former serve to estimate the ego-motion robustly, and the latter support the identification and stable tracking of dynamic objects without additional object models. However, our system still has some limitations. Since the result of panoptic segmentation is rarely perfectly accurate, some errors occur when identifying non-prior dynamic objects. Additionally, the computational complexity of the system depends strongly on the number of dynamic objects, so methods for reducing the number of optimized parameters need to be integrated. In future work, we intend to conduct user experience research, i.e., invite visually impaired volunteers to use our devices and collect feedback, to further improve our system towards more holistic scene perception and reliable navigation assistance.

References

[1] Wearable travel aid for environment perception and navigation of visually impaired people
[2] DynaSLAM II: Tightly-coupled multi-object tracking and SLAM
[3] DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes
[4] OpenPose: Realtime multi-person 2D pose estimation using part affinity fields
[5] RGB-D SLAM in dynamic environments using point correlations
[6] Dense visual SLAM for RGB-D cameras
[7] Effective background model-based RGB-D dense visual odometry in a dynamic environment
[8] Fast optical flow using dense inverse search
[9] EPnP: An accurate O(n) solution to the PnP problem
[10] Vision-based mobile indoor assistive navigation aid for blind people
[11] Fully convolutional networks for panoptic segmentation
[12] HIDA: Towards holistic indoor understanding for the visually impaired via semantic instance segmentation with a wearable solid-state LiDAR sensor
[13] Helping the blind to get through COVID-19: Social distancing assistant using real-time semantic segmentation on RGB-D video
[14] ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras
[15] ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals
[16] A benchmark for the evaluation of RGB-D SLAM systems
[17] DS-SLAM: A semantic visual SLAM towards dynamic environments
[18] An indoor wayfinding system based on geometric features aided graph SLAM for the visually impaired
[19] Trans4Trans: Efficient transformer for transparent object segmentation to help visually impaired people navigate in the real world
[20] VDO-SLAM: A visual dynamic object-aware SLAM system
[21] PoseFusion: Dense RGB-D SLAM in dynamic human environments
[22] A wearable navigation device for visually impaired people based on the real-time semantic visual SLAM system