SeekNet: Improved Human Instance Segmentation via Reinforcement Learning Based Optimized Robot Relocation
Venkatraman Narayanan, Bala Murali Manoghar, RV RamaPrashanth, Aniket Bera
Date: 2020-11-17

Amodal recognition is the ability of a system to detect occluded objects. Most state-of-the-art visual recognition systems lack the ability to perform amodal recognition. A few studies have achieved amodal recognition through passive prediction or embodied recognition approaches. However, these approaches struggle with challenges in real-world applications, such as dynamic objects. We propose SeekNet, an improved optimization method for amodal recognition through embodied visual recognition. Additionally, we implement SeekNet for social robots, where there are multiple interactions with humans in crowds. Hence, we focus on occluded human detection and tracking and showcase the superiority of our algorithm over other baselines. We also experiment with SeekNet to improve the confidence of COVID-19 symptom pre-screening algorithms using our efficient embodied recognition system.

Recent advances in robotics and AI have enabled remarkable progress in autonomous driving, mobile robots, social robots, etc. Most of these systems rely on a robust visual recognition system. Much recent work has improved visual recognition tasks such as object recognition [28, 34, 39] and semantic segmentation [6, 22, 35]. Very few efforts have focused on amodal object recognition [40] and segmentation [12, 27, 42]. Amodal visual recognition is the ability of a system to perceive occluded objects [26]. Some attempts have been made to solve amodal recognition tasks by modeling them as an embodied recognition problem [7, 40]. These methods utilize the locomotive ability of a mobile robot to solve amodal recognition, rather than passively attempting to predict the occluded object. Such a system works particularly well for social robots, since much of the environment is occluded from the robot's FOV. The works [7, 40] have enabled a method to overcome occlusion through active movement. However, these algorithms suffer from the following challenges:
• They only work on a single target object at a time and expect only one instance of the target object within the searchable area.
• They lack the ability to track a dynamic object.
We propose SeekNet to overcome such shortcomings, with an ability to track dynamic objects and achieve embodied recognition tasks. Since object detection is a vast topic and targeted algorithms are required to address the many sub-groups (or classes) under object detection, we mainly focus on embodied recognition for social robots. We test our algorithm on a social robot, and since social robots primarily interact with dynamic humans in the environment, we design an embodied recognition system that targets humans in the environment. To this end, our main contributions are:
• We present a novel approach to perform amodal segmentation of humans in a crowded environment and track them further.
• We present a navigation scheme, based on policy networks, that explores a predefined environment to track the humans in it.
• We demonstrate an application of SeekNet to improve pre-screening algorithms aimed at COVID-19 detection, which is currently a global pandemic.
COVID-19 or coronavirus cases have spiked across the world.
To slow the spread of COVID-19, the CDC (Centers for Disease Control and Prevention) in the US and the WHO (World Health Organization) are encouraging people to self-quarantine if they have symptoms of COVID-19, in order to slow down the outbreak, reduce the chance of infection among high-risk populations, and reduce the burden on the health care system. Even with an aggressive testing process, it is not practical for people to get tested as frequently as would be required. Hence, a fast pre-screening method that identifies potential COVID-19 carriers would be highly helpful. Many contactless methods use a variety of sensors for pre-screening COVID-19 symptoms. We employ social robots in indoor environments to achieve higher confidence during the COVID-19 screening process and to ensure better coverage and a higher screening frequency. Furthermore, rather than relying on one modality, we fuse multiple modalities that simultaneously measure vital signs, like body temperature, respiratory rate, and heart rate, to improve the screening accuracy.

The paper is organized as follows: Section 2 presents related work, section 3 gives an overview of our pipeline, section 4 describes each stage of our pipeline in detail, and finally in section 5 we evaluate the theoretical and practical results of our work.

From a social robot navigation perspective, we have to treat humans and other objects differently. We also have to draw boundaries between different humans, and thus instance segmentation becomes an integral part of our pipeline. Broadly speaking, there are three approaches to instance segmentation. The first approach generates a pixel map of separate objects using the output from the object detection task. Good results have been achieved using Mask-RCNN [13], and there have been several improvements over this model to increase execution speed. Another approach is to use individual networks to collect high-level object information and low-level per-pixel information; the results of the two networks are combined to form a pixel map of individual objects. The third method uses a single FCN, whose features are post-processed to obtain both object-level and pixel-level information. Though there have been many advancements in segmentation, only a few methods achieve frame rates sufficient for the navigation domain. Box2Pix [37] is twice as fast as other existing approaches. It represents a balanced fusion of object and pixel knowledge, producing accurate instance segmentation with an efficient single FCN forward pass and a single post-processing pass over the image.

For social navigation, the problem of occlusion is more pronounced because of frequent human-human interaction, even in a sparse crowd [8, 23]. For pedestrians, detection in such complex scenarios is usually achieved using pose estimation techniques [3, 5]. These are two-stage networks, where the first stage extracts the skeleton information and the second stage combines pixel classification and pose information to generate a pixel map of individual humans. The approaches of [36, 41] generate accurate masks even with heavy occlusion. Though the results are very promising, these networks are computationally intensive, and the execution time increases exponentially with the number of humans in the scene. Thus these methods are not suitable for integration with a navigation scheme where real-time execution is necessary.
To this end, our algorithm, SeekNet, uses simpler instance segmentation methods and leverages the movement capability of robots to achieve occlusion-free masks. To better understand the shape of an object under occlusion, Qi et al. [27] trained a model to estimate the hidden region. Though they produce good results when the shape of the object is complex, they are far from human-level performance. In order to accurately determine the shape of an object, Yang et al. [40] imitate the human ability to actively move and control the viewing angle. They introduced the task of embodied amodal segmentation and addressed the problem using Embodied Mask-RCNN. This approach is trained for static objects, but moving humans are the prime targets in the case of social navigation. To this end, our approach, SeekNet, is trained for such dynamic environments and can maintain a constant distance from the target.

There are numerous AI models available to distinguish between an infectious and a non-infectious person, and the accuracy of these models is improving rapidly. [9, 31, 38] explore different AI approaches related to COVID-19 detection. However, these models do not scale well, and there is a high risk of bias, which raises concerns about their use in daily practice. To mitigate this, [32] uses multiple sensors to improve accuracy. Various methods to fuse different algorithms have also been explored in [24, 25]. These methods use late fusion due to the unavailability or sparse availability of multi-modal data such as temperature, heart rate, respiratory rate, cough signature, etc., for the same human subject. Uniformly, all these models need a specific environment setting to achieve their best accuracy and are far from deployment in public places. Our model, SeekNet, could be leveraged for deploying these contactless diagnosis models in public places, as it can seclude a single person in a group, keep them in focus for the entire course of measurement, and assess the person's state of health.

In order to coexist with humans, social robots need to understand their emotional state and incorporate socially acceptable behavior. Emotion recognition from features such as facial expressions, gestures, and walks has been addressed in the literature surveyed in [1, 2, 4, 29]. Multimodal and context-aware affect recognition models are also available for this purpose [19-21, 33].

The primary goal of SeekNet is to learn an optimal solution that improves weakly learned detectors. Specifically, our navigation pipeline aims at improving visual recognition algorithms in an embodied recognition setup. SeekNet relies on an RGB camera onboard a mobile robot, along with other components necessary for robot perception and navigation. Our visual perception system consists of amodal recognition and amodal segmentation to identify potentially occluded objects, refined by our novel navigation system that improves the detection confidence (or accuracy) by maneuvering the robot to a more advantageous position. Our system relies on a mobile robot's ability to reposition itself to better complete the visual recognition task at hand. Figure 2 provides a brief overview of our SeekNet system. The following subsections describe our approach in detail. We discuss the details of the datasets used to train our perception and policy networks, along with any other processing techniques used.
We also provide details on our amodal recognition and segmentation routine, including human detection and segmentation from an RGB camera. Finally, we discuss our navigation system.

Since we are deploying SeekNet on a robot, the model has to achieve frame rates sufficient for use in the navigation pipeline. We build on [37] and adapt their network architecture to generate three types of outputs: instance segmentation, object classification, and object classification confidence. We retain the modifications that [37] makes to GoogLeNet's inception modules. The additional inception modules added to the backbone achieve a larger receptive field, which helps in identifying humans near the robot. To predict the actual box parameters and box classes, we add 1×1 convolutions from different levels of the backbone and recursively compute the theoretical receptive field as in [37], where $RF$ denotes the input and output receptive fields, $s$ is the stride of the corresponding layer, and $k$ is the kernel size. We keep the theoretical receptive field at least twice the maximum of the height or width of the prior boxes when assigning them to a specific layer. This is done to compensate for the reduction in the receptive field during training, as suggested by [18]. As in [37], the semantic class and center offset class are predicted using skip connections from the inception modules of the corresponding layers. This branch consists of 1×1 convolutions with element-wise addition and deconvolutions that sequentially upscale the low-resolution feature maps.

We use a hybrid loss (equation 2) to train our network. The hybrid loss is a weighted combination of the losses for each of the sub-tasks (semantic, offsets, bounding box, classification) performed by our network, using the approach presented by Kendall et al. [15] to learn the task uncertainties $\sigma$:

$L_{total} = \frac{1}{\sigma_{sem}^2} L_{sem} + \log \sigma_{sem} + \frac{1}{\sigma_{off}^2} L_{offsets} + \log \sigma_{off} + \frac{1}{\sigma_{bbox}^2} L_{bbox} + \log \sigma_{bbox} + \frac{1}{\sigma_{cls}^2} L_{cls} + \log \sigma_{cls} \quad (2)$

We use a standard cross-entropy loss for semantic segmentation, $L_{sem}$, and an L2 regression loss for the center offset vectors, $L_{offsets}$. Both the semantic segmentation loss and the center offset loss are normalized over the number of valid pixels. $L_{bbox}$ is the L2 regression loss for the bounding box parameters ($x_{min}$, $y_{min}$, $x_{max}$, $y_{max}$). For the classification loss $L_{cls}$, we use the focal loss [17] to counter the imbalance between foreground and background classes.

The OCHuman dataset has heavy occlusions, and for a heavily occluded human the bounding box will be small. To detect such small objects, the IoU threshold for the prior boxes has to be reduced, which results in poor object detection performance. To mitigate this problem of low coverage, we use relative box parameters instead of IoU, as suggested by [37]. This also helps the training process, since we match the loss (based on corner offsets) and the generation metric in a common space. The relative change between a prior box $b_{prior}$ of size ($x_{min}$, $y_{min}$, $x_{max}$, $y_{max}$) and an annotated ground truth box $b_{GT}$ is given by equation 3, where $\Delta x$ and $\Delta y$ are the absolute differences between the two boxes' x and y parameters, and $tl$, $br$ represent the top-left and bottom-right positions. In order to densely cover both small and large objects, we use 21 prior boxes. The dimensions of the prior boxes are found by clustering, as suggested by [30]. We combine the three outputs (semantic class, center offset vectors, and object detections with bounding boxes) to generate the instance segmentation output, as proposed in [37].

We train a policy network to achieve the embodied recognition task.
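As an illustration of the receptive field bookkeeping described above, the helper below computes the theoretical receptive field with the standard backward recursion $RF_{in} = s \cdot RF_{out} + (k - s)$. The exact recursion used in [37] is not reproduced in this text, so this function and its example layer list are a sketch under that assumption, not the paper's implementation.

```python
def receptive_field(layers, rf_out=1):
    """Compute the theoretical receptive field at the network input.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    rf_out: receptive field at the final layer's output (1 pixel by default).

    Uses the standard backward recursion RF_in = s * RF_out + (k - s),
    applied from the last layer back to the first (an assumed formulation).
    """
    rf = rf_out
    for kernel, stride in reversed(layers):
        rf = stride * rf + (kernel - stride)
    return rf


if __name__ == "__main__":
    # Example: three 3x3 convolutions with stride 2 each (illustrative only).
    print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # -> 15
```

Prior boxes would then be assigned to the layer whose theoretical receptive field is at least twice the larger side of the box, following the rule stated above.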
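To make the uncertainty-weighted hybrid loss of equation 2 concrete, here is a minimal PyTorch sketch in the style of Kendall et al. [15]. The class name HybridLoss, the choice to learn $\log \sigma^2$ for numerical stability, and the initialization are assumptions; only the general form (scale each sub-task loss by $1/\sigma^2$ and add $\log \sigma$) comes from the text.

```python
import torch
import torch.nn as nn


class HybridLoss(nn.Module):
    """Sketch of an uncertainty-weighted multi-task loss (Kendall et al. [15]).

    Each sub-task loss L_i is scaled by 1/sigma_i^2 and regularized by log(sigma_i),
    with log(sigma_i^2) kept as a learnable parameter for numerical stability.
    Task names follow the paper's sub-tasks; implementation details are assumptions.
    """

    def __init__(self, tasks=("sem", "offsets", "bbox", "cls")):
        super().__init__()
        # One learnable log-variance per sub-task, initialized to 0 (sigma = 1).
        self.log_vars = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(1)) for t in tasks}
        )

    def forward(self, losses):
        """losses: dict mapping task name -> scalar loss tensor."""
        total = 0.0
        for task, loss in losses.items():
            log_var = self.log_vars[task]
            # 1/sigma^2 * L + log(sigma), since log(sigma) = 0.5 * log(sigma^2).
            total = total + torch.exp(-log_var) * loss + 0.5 * log_var
        return total
```

During training, the dictionary would carry $L_{sem}$ (cross-entropy), $L_{offsets}$ and $L_{bbox}$ (L2 regression), and $L_{cls}$ (focal loss), with the log-variances optimized jointly with the network weights.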
Based on the human detection confidence from our amodal recognition (section 3.1.1), we identify potential goal points to pursue and refine with our embodied recognition system. Our identification process selects detections whose confidence falls below a threshold $\lambda$. We build our policy network upon [11, 40]. The policy network receives the LiDAR scans and the individual human segmentation masks from the amodal recognition system and outputs probabilities over the action space considered for the navigation task.

Action Space: The action space is a set of permissible robot velocities in continuous space. The action velocities consist of a translational and a rotational velocity. We set the bounds on the translational velocity, $v \in [0.0, 1.0]$, and the rotational velocity, $w \in [-1.0, 1.0]$, to accommodate the robot kinematics. We sample actions at step $t$ using equation 6, where $l_0, l_1, l_2$ represent the three consecutive processed LiDAR scan frames and $h_0, h_1, \ldots, h_t$ represent the historical human segmentation masks concatenated together.

Policy Network: The policy network has the components $\{f_{human}, f_{lidar}, f_{act}\}$. $f_{human}$ encodes the human segmentation masks: we resize the masks to 244×244 and pass them to $f_{human}$, which consists of four 5×5 Conv-BatchNorm-ReLU blocks, each followed by a 2×2 MaxPool, producing an encoded human segmentation mask $z^{img}_t$. We process the three consecutive LiDAR frames by passing them through two 1×1 Conv layers, followed by a 256-D fully-connected (FC) layer; the LiDAR frames are encoded as $z^{lidar}_t = f_{lidar}([l_0, l_1, l_2])$. $f_{act}$ is a multi-layer perceptron (MLP) with one 128-D FC hidden layer that finally produces the action velocities. $f_{act}$ takes in the encoded human trajectories $z^{img}_t$, the LiDAR encoding $z^{lidar}_t$, the previous velocity $v_{t-1}$, the goal position $s_g$, and the current robot position $s_t$ to predict the robot velocities at time $t$, as given by equation 6:

$v_t = f_{act}(z^{img}_t, z^{lidar}_t, v_{t-1}, s_g, s_t) \quad (6)$

$v_t$ is then sent to a linear layer with a softmax to derive the probability distribution over the action space, from which the action is sampled. We learn $\{f_{human}, f_{lidar}, f_{act}\}$ via reinforcement learning.

Rewards: Our reward function for the policy network is inspired by [11]. We aim to arrive at an optimal strategy that avoids collisions during navigation while improving the detection confidence of the targeted object (a human). The reward function achieving these goals is given in equation 7: the reward $r$ at time $t$ is a combination of a reward for avoiding collisions, $r_c$, a reward for smooth movement, $r_w$, and a reward for improving the detection confidence, $r_h$,

$r^t = r^t_c + r^t_w + r^t_h \quad (7)$

The penalty for colliding with obstacles is given by equation 8. To ensure smooth navigation, the penalty for large rotational velocities is given by equation 9. To ensure that we progressively reposition the robot to improve the detection confidence on the targeted object, we reward the system based on equation 10. This penalty is only applied when the robot is actively pursuing a target. In our implementation, we use $r_{arrival} = 15$, $w_g = 2.5$, $r_{collision} = -15$, $w_w = -0.1$, $r_p = 2.5$, $r_p = -0.5$, $\xi = 0.1$.

We evaluate our amodal recognition efficiency based on classification accuracy ($Acc_{cls}$) and segmentation accuracy as mean Intersection-over-Union (IoU) on the first frame of detection. We also report the tracking accuracy ($Acc_{tr}$) to evaluate our amodal recognition system. We evaluate the embodied recognition system in terms of the change in classification accuracy ($\Delta^h_{acc}$).
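The sketch below shows how the policy components described above could be wired together in PyTorch. Layer sizes follow the text where it states them (four 5×5 Conv-BatchNorm-ReLU blocks with 2×2 MaxPool, two 1×1 convolutions plus a 256-D FC layer, a 128-D hidden FC layer); the channel widths, the number of discretized actions, and the pooling at the end of the mask encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SeekNetPolicy(nn.Module):
    """Sketch of the policy components {f_human, f_lidar, f_act} from the text."""

    def __init__(self, lidar_beams=512, n_actions=21):
        super().__init__()
        # f_human: four 5x5 Conv-BatchNorm-ReLU blocks, each followed by 2x2 MaxPool,
        # applied to the resized (B, 1, 244, 244) human segmentation masks.
        blocks, in_ch = [], 1
        for out_ch in (16, 32, 64, 64):          # channel widths are assumptions
            blocks += [nn.Conv2d(in_ch, out_ch, 5, padding=2),
                       nn.BatchNorm2d(out_ch), nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.f_human = nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten())

        # f_lidar: two 1x1 convolutions over 3 stacked scans, then a 256-D FC layer.
        self.f_lidar = nn.Sequential(
            nn.Conv1d(3, 16, 1), nn.ReLU(),
            nn.Conv1d(16, 16, 1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * lidar_beams, 256), nn.ReLU())

        # f_act: MLP with one 128-D hidden FC layer; the output is mapped to a
        # distribution over discretized (v, w) actions via a linear + softmax head.
        feat_dim = 64 + 256 + 2 + 2 + 2          # z_img, z_lidar, v_{t-1}, s_g, s_t
        self.f_act = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.action_head = nn.Linear(128, n_actions)

    def forward(self, masks, lidar, v_prev, goal, pose):
        z_img = self.f_human(masks)              # (B, 64)
        z_lidar = self.f_lidar(lidar)            # (B, 256)
        x = torch.cat([z_img, z_lidar, v_prev, goal, pose], dim=-1)
        logits = self.action_head(self.f_act(x))
        return torch.softmax(logits, dim=-1)     # probability over the action space
```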
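The reward shaping can be summarized with a short function like the one below. Only the constants quoted in the text are used; the functional forms of equations 8-10 are not available in this extraction, so the collision, smoothness, and confidence terms here (including the 0.7 rotation threshold and the treatment of the duplicated $r_p$ value) are assumptions for illustration.

```python
# Reward constants quoted in the text; r_p appears twice in the source, so the
# second value (-0.5) is kept as a separate penalty constant here (an assumption).
R_ARRIVAL, W_G = 15.0, 2.5
R_COLLISION, W_W = -15.0, -0.1
R_P_POS, R_P_NEG, XI = 2.5, -0.5, 0.1


def step_reward(collided, reached_goal, dist_prev, dist_now, rot_vel,
                conf_prev, conf_now, pursuing_target):
    """Combine collision, smoothness, and confidence rewards (r = r_c + r_w + r_h)."""
    # r_c: penalize collisions, reward arrival, and shape by progress toward the goal.
    if collided:
        r_c = R_COLLISION
    elif reached_goal:
        r_c = R_ARRIVAL
    else:
        r_c = W_G * (dist_prev - dist_now)

    # r_w: penalize large rotational velocities to encourage smooth motion.
    r_w = W_W * abs(rot_vel) if abs(rot_vel) > 0.7 else 0.0  # 0.7 threshold assumed

    # r_h: reward improvements in the target's detection confidence, applied only
    # while the robot is actively pursuing a target.
    r_h = 0.0
    if pursuing_target:
        delta = conf_now - conf_prev
        r_h = R_P_POS * delta if delta > XI else R_P_NEG

    return r_c + r_w + r_h
```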
In this work, we use datasets specially designed for detecting humans under heavy occlusion.

OCHuman: OCHuman [41] is a large dataset designed for the three most important human-related tasks: detection, pose estimation, and instance segmentation. The dataset captures the severe occlusion between human bodies that is often encountered in everyday life. It contains 8110 detailed annotated human instances within 4731 images and primarily emphasizes occlusion, to encourage the development of algorithms better suited to practical, real-life situations.

JTA Dataset: JTA (Joint Track Auto) [10] is a massive collection for pedestrian pose estimation and tracking in urban scenarios. The data is created by exploiting the highly photorealistic video game Grand Theft Auto V, developed by Rockstar North. The dataset contains 512 video clips from several urban scenarios and covers variations in illumination and a variety of viewing angles. It also covers indoor and outdoor scenes with natural actions like sitting, running, chatting, etc., in typical crowded environments. The clips are precisely annotated with visible and occluded body parts and with people tracking in 2D and 3D coordinates.

Amodal Recognition: We train our pipeline on the datasets described in section 4.2 with a 90%-10% train-validation split. We use the ADAM [16] optimizer with decay parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to train our networks. We set the initial learning rate to 0.009 with a 10% decay every 250 epochs. The models were trained with the hybrid loss detailed in section 3.2. We performed our experiments on two Nvidia RTX 2080 Ti GPUs with 11 GB of GPU memory each and 64 GB of RAM.

Embodied Recognition: We train our embodied recognition policy network in a simulation environment generated using the Stage Mobile Robot Simulator. We generate multiple scenarios (see figure 3) with obstacles to train our policy network. We use RMSProp [14] to train the policy network with a learning rate of 0.00004 and $\epsilon = 0.00005$.

We tabulate the results from our experiments in Table 1. We use scenarios 5 and 6 from figure 3 to perform our comparison studies. We report the metrics mentioned in section 4.1 for all the baselines described in [40]. It can be seen that our implementation has the best change in classification accuracy across all our experiments.

As mentioned earlier, we use SeekNet to improve existing algorithms for pre-screening COVID-19 symptoms. We perform an experiment similar to our embodied human detection and segmentation routine explained above, with the exception that we focus on the COVID-19 symptom detection confidence instead of the human detection confidence. As explained in section 1, we use an ensemble of algorithms that screens for COVID-19 symptoms from a multitude of sensors. In Table 2, we report the findings from this experiment. Similar to our embodied human detection experiment, we see a significant boost in COVID-19 screening confidence when using our algorithm. It is important to note that we still use the amodal human detection pipeline to detect and target humans for COVID-19 symptom screening.
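For reference, the optimizer settings quoted in the implementation details above can be set up as in the following PyTorch sketch. The placeholder networks, the StepLR schedule (gamma = 0.9 every 250 epochs to realize the stated 10% decay), and the mapping of the 0.00005 value to RMSProp's eps argument are assumptions.

```python
import torch

# Amodal recognition network: ADAM with beta1 = 0.9, beta2 = 0.999,
# initial learning rate 0.009, decayed by 10% every 250 epochs.
perception_net = torch.nn.Linear(8, 2)           # placeholder for the real network
optim_percep = torch.optim.Adam(perception_net.parameters(),
                                lr=0.009, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optim_percep, step_size=250, gamma=0.9)

# Embodied recognition policy network: RMSProp with lr = 4e-5;
# the 5e-5 value from the text is mapped to eps here (an assumption).
policy_net = torch.nn.Linear(8, 2)               # placeholder for the policy network
optim_policy = torch.optim.RMSprop(policy_net.parameters(), lr=0.00004, eps=0.00005)
```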
We propose SeekNet to overcome the shortcomings of passively detecting occluded objects, with an ability to track dynamic objects and achieve embodied recognition tasks. Since object detection is a vast topic and targeted algorithms are required to address the many sub-groups (or classes) under object detection, we mainly focused on embodied recognition for social robots. We tested our algorithm on a social robot, and since social robots primarily interact with dynamic humans in the environment, we designed an embodied recognition system that targets humans in the environment.

References
[1] Learning unseen emotions from gestures via semantically-conditioned zero-shot perception with adversarial autoencoders.
[2] How are you feeling? Multimodal emotion learning for socially-assistive robot navigation.
[3] STEP: Spatial temporal graph convolutional networks for emotion perception from gaits.
[4] Aniket Bera, and Dinesh Manocha. Generating emotive gaits for virtual agents using affect-based autoregression.
[5] Take an emotion walk: Perceiving emotions from gaits using hierarchical attention pooling and affective mapping.
[6] A simple, strong, and fast baseline for bottom-up panoptic segmentation.
[7] Embodied question answering.
[8] Can a robot trust you? A DRL-based approach to trust-driven human-guided navigation.
[9] A guide to deep learning in healthcare.
[10] Learning to detect and track visible and occluded body joints in a virtual world.
[11] Autonomous mapless navigation in crowded scenarios.
[12] Learning to see the invisible: End-to-end trainable amodal instance segmentation.
[13] Piotr Dollár, and Ross Girshick. Mask R-CNN.
[14] Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
[15] Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
[16] Adam: A method for stochastic optimization.
[17] Kaiming He, and Piotr Dollár. Focal loss for dense object detection.
[18] Understanding the effective receptive field in deep convolutional neural networks.
[19] Emotions don't lie: An audio-visual deepfake detection method using affective cues.
[20] M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues.
[21] EmotiCon: Context-aware multimodal emotion recognition using Frege's principle.
[22] EfficientPS: Efficient panoptic segmentation.
[23] ProxEmo: Gait-based emotion learning and multi-view proxemic fusion for socially-aware robot navigation.
[24] Contactless vital signs measurement system using RGB-thermal image sensors and its clinical screening test on patients with seasonal influenza.
[25] Infection screening system using thermography and CCD camera with good stability and swiftness for non-contact vital-signs measurement by feature matching and MUSIC algorithm.
[26] Vision science: Photons to phenomenology.
[27] Amodal instance segmentation with KINS dataset.
[28] DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution.
[29] Modeling data-driven dominance traits for virtual characters using gait analysis.
[30] YOLO9000: Better, faster, stronger.
[31] AI-driven tools for coronavirus outbreak: Need of active learning and cross-population train/test models on multitudinal/multimodal data.
[32] Temperature-compensated infrared-based low-cost mobile platform module for mass human temperature screening.
[33] Decision-level fusion method for emotion recognition using multimodal emotion recognition information.
[34] EfficientDet: Scalable and efficient object detection.
[35] Hierarchical multi-scale attention for semantic segmentation.
[36] Pose2Instance: Harnessing keypoints for person instance segmentation.
[37] Single-shot instance segmentation by assigning pixels to object boxes.
[38] Computer vision for COVID-19 control: A survey.
[39] CSPNet: A new backbone that can enhance learning capability of CNN.
[40] Embodied amodal recognition: Learning to move to perceive objects.
[41] Pose2Seg: Detection free human instance segmentation.
[42] Learning semantics-aware distance map with semantics layering network for amodal instance segmentation.