key: cord-0043147-24w3n4jj
authors: Zhou, Cong; Yu, Han
title: Mask-Guided Region Attention Network for Person Re-Identification
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_22
sha: 92164ea3e8136496345c46694bed0a4f57d95f6f
doc_id: 43147
cord_uid: 24w3n4jj

Person re-identification (ReID) is an important and practical task that identifies pedestrians across non-overlapping surveillance cameras based on their visual features. In general, ReID is extremely challenging due to complex background clutter, large pose variations and severe occlusions. To improve its performance, a robust and discriminative feature extraction methodology is particularly crucial. Recently, the feature alignment technique driven by human pose estimation, that is, matching two person images by their corresponding parts, has increased the effectiveness of ReID to a certain extent. However, we argue that these methods still suffer from problems such as imprecise handcrafted segmentation of body parts, and that further improvements can be achieved. In this paper, we present a novel framework called Mask-Guided Region Attention Network (MGRAN) for person ReID. MGRAN consists of two major components: Mask-guided Region Attention (MRA) and Multi-feature Alignment (MA). MRA generates spatial attention masks and meanwhile masks out background clutter and occlusions. Moreover, the generated masks are utilized for region-level feature alignment in the MA module. We evaluate the proposed method on three public datasets: Market-1501, DukeMTMC-reID and CUHK03. Extensive experiments with ablation analysis show the effectiveness of this method.

1 Introduction

Person re-identification (ReID) aims to identify the same individual across multiple cameras. In general, it is considered a sub-problem of image retrieval: given a query image containing a target pedestrian, ReID ranks the gallery images and searches for the same pedestrian. It plays an important role in various surveillance applications, such as intelligent security and pedestrian tracking. Over the past years, many methods [1-4] have been proposed to address the ReID problem. However, it remains an open problem due to large pose variations, complex background clutter, various camera views, severe occlusions and uncontrollable illumination conditions.

Recently, with the improvement of human pose estimation, feature alignment methods have alleviated some of these difficulties, but several problems remain. For one, pedestrians often carry items such as backpacks, handbags and caps. These items are definitely helpful for ReID, and we can treat them as special parts of pedestrians which should be included in the corresponding local region, as in Fig. 2(b). A third problem is that methods like [1, 9, 12] align only the part features, considered as local features, while the global feature of the whole pedestrian region is not considered. However, each pedestrian is intuitively associated with a global feature, including body shape, walking posture and so on, which cannot be replaced by local features. When global features are neglected, the final feature representation is not comprehensive and robust enough. Meanwhile, previous works [13, 14] extract the global feature from the entire pedestrian image, including background clutter and occlusions, which introduces noise and makes the feature representation inaccurate. Here, we utilize pedestrian masks to redesign the global feature: clutter is first removed with the masks, and the global features of pedestrians are then extracted.
After these operations, multi-feature fusion can be used to align features.

Based on the above motivations, we propose a new Mask-Guided Region Attention Network for person re-identification. The contributions of our work can be summarized as follows:

• To make better use of the feature alignment technique for person re-identification, a unified framework called Mask-Guided Region Attention Network (MGRAN) is proposed.
• To further reduce the noise from background clutter and occlusions, we explore utilizing masks to separate pedestrians from them and obtain finer silhouettes of pedestrian bodies.
• Region-level feature alignment, based on the head, the upper body and the lower body, is introduced as a more appropriate alignment method for ReID.
• We redesign the global feature and utilize multi-feature fusion to improve the accuracy and completeness of the feature representation.

2 Related Work

Recently, person re-identification methods based on deep learning have achieved great success [13, 15, 16]. In general, these methods can be classified into two categories, namely feature representation and distance metric learning. The first category [1, 3, 17, 18] often treats ReID as a classification problem; these methods are dedicated to designing view-invariant representations for pedestrians. The second category [19-21] mainly aims at measuring the similarity between pedestrian images by learning a robust distance metric.

Among these methods, many [9, 12] achieved success through feature alignment, and numerous studies have proved the importance of feature alignment for ReID. For example, Su et al. [5] proposed a Pose-driven Deep Convolutional model (PDC) that uses a Spatial Transformer Network (STN) to crop body regions based on pre-defined centers. Xu et al. [9] achieved more precise feature alignment with their proposed Attention-Aware Compositional Network (AACN) and further improved identification performance. However, these methods align part features based on handcrafted body-part shapes, which are usually imprecise. In our model, we utilize pixel-level pedestrian masks to align features, intending to obtain more precise information about body parts.

With the rapid development of deep-learning-based instance segmentation methods such as Mask R-CNN [22] and Fully Convolutional Networks (FCN) [23], we can now easily obtain high-quality pedestrian masks that can be used in person re-identification. Furthermore, these instance segmentation methods can be naturally extended to human pose estimation by modeling keypoint locations as one-hot masks. The performance of person re-identification can be further improved by integrating the results of instance segmentation and human pose estimation.

The spatial attention mechanism has achieved great success in image understanding and has been widely used in various tasks, such as semantic segmentation [24], object detection [25] and person re-identification [26]. For example, Chu et al. [6] proposed a multi-context attention model for pose estimation. Inspired by these methods, we use spatial attention maps to remove undesirable clutter in pedestrian images. Different from them, however, we use binary pedestrian masks as spatial attention maps to obtain more precise information about pedestrian bodies.

3 Mask-Guided Region Attention Network

The overall framework of our Mask-Guided Region Attention Network (MGRAN) is illustrated in Fig. 3. MGRAN consists of two main components: Mask-guided Region Attention (MRA) and Multi-feature Alignment (MA).
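To make the overall data flow concrete before detailing each component, the following is a minimal, hedged sketch of the pipeline in PyTorch. It substitutes torchvision's off-the-shelf detection models for the authors' MRA network, which is a unified two-branch Mask R-CNN trained with four keypoints rather than COCO's seventeen; all module and function names here are our own assumptions, not the authors' released code.

```python
import torch
import torchvision
from torchvision.models import resnet50

# Off-the-shelf stand-ins for MRA's two branches (pedestrian masks, keypoints).
mask_net = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
kp_net = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Shared ResNet-50 encoder standing in for MA's four parameter-sharing branches.
backbone = resnet50(weights=None)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()  # drop the fc layer

@torch.no_grad()
def global_feature(image: torch.Tensor) -> torch.Tensor:
    """MRA -> MA flow for the global branch.

    image: float tensor (3, H, W) in [0, 1]; assumes at least one person is detected.
    """
    det = mask_net([image])[0]                 # person detection with soft masks
    mask = (det["masks"][0, 0] > 0.5).float()  # top detection, binarized at 0.5
    person = image * mask                      # mask out clutter and occlusions
    # The three local branches (head / upper body / lower body) repeat this
    # mask-then-encode pattern on keypoint-defined sub-regions (kp_net output).
    return encoder(person.unsqueeze(0)).flatten(1)  # (1, 2048) global descriptor
```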
The MRA module aims to generate two kinds of attention maps: pedestrian masks and human body keypoints. It is constructed as a two-branch neural network that predicts the attention maps of the pedestrians and of their keypoints, respectively. The MA module is constructed as a four-branch neural network. It utilizes the estimated attention maps to extract global and local features, and the extracted features are then fused for multi-feature alignment.

Different from other works, we use binary masks as attention maps to highlight specific regions of the human body in the image. With the rapid development of instance segmentation, there are many alternative methods for generating pedestrian masks. In this paper, we choose Mask R-CNN [22] to predict the masks due to its high accuracy and flexibility. As shown in Fig. 4, there are two types of masks: pedestrian masks and keypoint masks. They are learned simultaneously in a unified form through our proposed Mask-guided Region Attention module.

Pedestrian Masks P. A pedestrian mask P is the encoding of an input image's spatial layout. It is a binary encoding: pixels in the pedestrian region are encoded as 1 and all other pixels as 0. Following the original Mask R-CNN article, we set hyper-parameters as suggested by the existing Faster R-CNN work [27] and define the loss L_{mask}(P) on each sampled RoI in Mask R-CNN as the average binary cross-entropy loss

L_{mask}(P) = -\frac{1}{N} \sum_{i=1}^{N} \left[ P_i^{*} \log \sigma(P_i) + (1 - P_i^{*}) \log\left(1 - \sigma(P_i)\right) \right],

where N is the number of pixels in a predicted mask, \sigma denotes the sigmoid function, P_i is a single pixel in the mask, and P_i^{*} is the corresponding ground-truth pixel. Furthermore, the classification loss L_{cls} and the bounding-box loss L_{box} of each sampled RoI are set as indicated in [21].

Fig. 4. Two types of masks: pedestrian masks and keypoint masks. In this paper, we define four keypoints (blue dots). By connecting two adjacent keypoints, we can divide the pedestrian region into three local regions: the head, the upper body and the lower body. (Color figure online)

Keypoint Masks K. Modeling each keypoint location as a one-hot mask, MRA predicts four keypoint masks, one for each of the four keypoints shown in Fig. 4. Following the original Mask R-CNN article, during training we minimize the cross-entropy loss over an m^2-way softmax output for each visible ground-truth keypoint, which encourages a single point to be detected.

Based on the attention masks generated by the Mask-guided Region Attention module, we propose a Multi-feature Alignment (MA) module to align the global feature and the local features. MA consists of two main stages, called Space Alignment (SA) and Multi-feature Fusion (MF). The complete structure of MA is shown in Fig. 3.

Space Alignment (SA). Space Alignment aims to obtain the pedestrian region and the three local regions. Based on the attention masks generated by the MRA module, we propose a simple and effective approach to obtain them. Specifically, we first apply the Hadamard product between the original image M and the corresponding pedestrian mask P to obtain the pedestrian region:

M^{*} = M \odot P,

where \odot denotes the Hadamard product operator, which performs element-wise multiplication of two matrices or tensors, and M^{*} denotes the pedestrian region. It is worth noting that we apply the Hadamard product to the original image to guarantee the accuracy of the features. Some works [9, 12] apply spatial attention maps to processed data, such as data already processed by convolution, which introduces noise into the attention region from other regions of the image.
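As a concrete reference for the two operations just defined, here is a short PyTorch sketch; the function names are ours, and this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pedestrian_mask_loss(pred_logits: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Average binary cross-entropy over the N pixels of one predicted mask:
    L_mask(P) = -(1/N) sum_i [P*_i log sigma(P_i) + (1 - P*_i) log(1 - sigma(P_i))]."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt_mask, reduction="mean")

def space_alignment(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """M* = M (Hadamard product) P, applied to the ORIGINAL image rather than to
    convolved feature maps, so no noise leaks into the attention region."""
    # image: (3, H, W); mask: (H, W) soft mask, binarized at the paper's 0.5 threshold.
    return image * (mask > 0.5).float().unsqueeze(0)
```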
Secondly, based on the obtained pedestrian body region, we utilize the four keypoint masks to obtain the three local regions, by connecting two adjacent keypoints and segmenting the pedestrian region as shown in Fig. 4.

Multi-feature Fusion (MF). In this stage, we use four ResNet-50 networks [28] to extract the features of the four regions generated by the SA stage, respectively. Feature fusion is then used to align the features:

F = \mathrm{Concat}(f_g, f_l^1, f_l^2, f_l^3),

where \mathrm{Concat}(\cdot) denotes the concatenation operation on feature vectors, f_g represents the global feature of the whole pedestrian body region, f_l^1, f_l^2 and f_l^3 denote the features of the three local regions, respectively, and F is the final feature vector for the input pedestrian image. Overall, our framework integrates MRA and MA to extract features for input pedestrian images.

We construct the Mask R-CNN model with a ResNet-50-FPN backbone and train it on the annotated person images in the COCO dataset [29]. Furthermore, the floating-number mask output is binarized at a threshold of 0.5. In MF, the four ResNet-50 networks share the same parameters, and we use the Margin Sample Mining Loss (MSML) [30] to conduct distance metric learning based on the four features extracted by ResNet-50. All images input to Mask R-CNN and ResNet-50 are scaled by a factor of 1/256. Finally, MRA and MA are trained independently.

4 Experiments

In this section, the performance of the Mask-Guided Region Attention Network (MGRAN) is compared with several state-of-the-art methods on three public datasets. Furthermore, a detailed ablation analysis is conducted to validate the effectiveness of the MGRAN components.

We evaluate our method on three large-scale public person ReID datasets: Market-1501 [31], DukeMTMC-reID [32] and CUHK03 [1]; their details are shown in Table 1. For fair comparison, we follow the official evaluation protocol of each dataset. For Market-1501 and DukeMTMC-reID, the rank-1 identification rate (%) and mean Average Precision (mAP) (%) are used. For CUHK03, Cumulated Matching Characteristics (CMC) at rank-1 (%) and rank-5 (%) are adopted.

We choose 13 methods in total with state-of-the-art performance for comparison with our proposed framework MGRAN. These methods can be categorized into two classes according to whether human pose information is used. Spindle-Net (Spindle) [12], Deeply-Learned Part-Aligned Representations (DLPAR) [10], MSCAN [33] and the Attention-Aware Compositional Network (AACN) [9] are pose-relevant. Online Instance Matching (OIM) [14], Re-ranking [34], the deep transfer learning method (Transfer) [35], SVDNet [15], the Pedestrian Alignment Network (PAN) [36], the Part-Aligned Representation (PAR) [10], Deep Pyramid Feature Learning (DPFL) [13], DaF [37] and the null space semi-supervised learning method (NFST) [38] are pose-irrelevant.

The experimental results are presented in Tables 2, 3 and 4. Based on these results, it is clear that our MGRAN framework outperforms the compared methods, showing the advantages of our approach. Specifically, compared with the second best method in each setting, our framework achieves rank-1 accuracy improvements of 6.10%, 1.89%, 1.28%, 7.62% and 6.57% across the evaluation settings of the three datasets.

Table 3. Comparison results on the DukeMTMC-reID dataset.

Method         Rank-1   mAP
SVDNet [15]    76.70    56.80
OIM [14]       68.10    -
PAN [36]       71.59    51.51
AACN [9]       76.84    59.25
MGRAN (Ours)   78.12    63.57

We further evaluate the effect of our proposed multi-feature fusion and region-level feature alignment by ablation analysis.
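Before turning to the ablation details, the sketch below illustrates two steps described above under stated assumptions: the keypoint-based split of the pedestrian region in SA, and the multi-feature fusion F = Concat(f_g, f_l^1, f_l^2, f_l^3) in MF, with a flag that mirrors the global-feature removal used in the MF ablation that follows. We assume the four keypoints pair into a neck line and a waist line approximated by horizontal cuts; this pairing and all names are our own, not the paper's.

```python
import torch

def split_regions(person: torch.Tensor, keypoints: torch.Tensor):
    """Split a masked pedestrian image (3, H, W) into head / upper body / lower body.
    Assumes keypoints (4, 2) pair as a neck line (rows 0-1) and a waist line
    (rows 2-3), each approximated by a horizontal cut at its mean height."""
    ys = keypoints[:, 1].float()
    neck, waist = int(ys[:2].mean()), int(ys[2:].mean())
    return person[:, :neck, :], person[:, neck:waist, :], person[:, waist:, :]

def fuse_features(f_g, f_l1, f_l2, f_l3, use_global: bool = True) -> torch.Tensor:
    """F = Concat(f_g, f_l1, f_l2, f_l3); use_global=False reproduces the
    ablation variant that drops the global feature from the final vector."""
    parts = ([f_g] if use_global else []) + [f_l1, f_l2, f_l3]
    return torch.cat(parts, dim=1)
```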
Multi-feature Fusion (MF). We verify the effectiveness of MF on the Market-1501 and DukeMTMC-reID datasets by removing global features from the final feature vectors. As shown in Table 5, MF increases rank-1 accuracy by 2.61%, 2.25% and 0.81% on Market-1501 (Single Query), Market-1501 (Multiple Query) and DukeMTMC-reID, respectively. Furthermore, mAP improvements of 3.77%, 0.77% and 3.47% on Market-1501 (Single Query), Market-1501 (Multiple Query) and DukeMTMC-reID are achieved with MF.

Region-Level Feature Alignment (RFA). We align features at the part level and at the region level, respectively, to verify the effectiveness of our proposed region-level feature alignment. Specifically, we replace the region-level feature alignment in MGRAN with part-level feature alignment and keep the other parts unchanged. As shown in Table 6, RFA increases rank-1 accuracy by 1.19% and 1.32% on CUHK03 (Labeled) and CUHK03 (Detected), and rank-5 accuracy by 1.53% and 1.46%, respectively. These results show the usefulness of the proposed RFA.

5 Conclusion

In this paper, we propose a novel Mask-Guided Region Attention Network (MGRAN) for person re-identification to deal with the clutter and misalignment problems. MGRAN consists of two main components: Mask-guided Region Attention (MRA) and Multi-feature Alignment (MA). MRA generates spatial attention maps to mask out undesirable clutter and obtain finer silhouettes of pedestrian bodies. MA aligns features at the region level, which is more appropriate for ReID. Our method has achieved some success; in future work, we plan to exploit newer techniques such as GANs to further improve the performance of ReID.

References

[1] DeepReID: deep filter pairing neural network for person re-identification
[2] Deep ranking for person re-identification via joint representation learning
[3] Person re-identification by multi-channel parts-based CNN with improved triplet loss function
[4] Deep attributes driven multi-camera person re-identification
[5] Pose-driven deep convolutional model for person re-identification
[6] Multi-context attention for human pose estimation
[7] Stacked hourglass networks for human pose estimation
[8] Pose-aware person recognition
[9] Attention-aware compositional network for person re-identification
[10] Deeply-learned part-aligned representations for person re-identification
[11] Person re-identification by salience matching
[12] Spindle Net: person re-identification with human body region guided feature decomposition and fusion
[13] Person re-identification by deep learning multi-scale representations
[14] Joint detection and identification feature learning for person search
[15] SVDNet for pedestrian retrieval
[16] Cross-view asymmetric metric learning for unsupervised person re-identification
[17] PersonNet: person re-identification with deep convolutional neural networks
[18] Embedding deep metric for person re-identification: a study against large variations
[19] In defense of the triplet loss for person re-identification
[20] Deep feature learning with relative distance comparison for person re-identification
[21] Fast R-CNN. In: ICCV
[22] Mask R-CNN. In: ICCV
[23] Fully convolutional networks for semantic segmentation
[24] Attention to scale: scale-aware semantic image segmentation
[25] SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning
[26] End-to-end comparative attention networks for person re-identification
[27] Faster R-CNN: towards real-time object detection with region proposal networks
[28] Deep residual learning for image recognition
[29] Microsoft COCO: common objects in context
[30] Margin sample mining loss: a deep learning based method for person re-identification
[31] Scalable person re-identification: a benchmark
[32] Unlabeled samples generated by GAN improve the person re-identification baseline in vitro
[33] Learning deep context-aware features over body and latent parts for person re-identification
[34] Re-ranking person re-identification with k-reciprocal encoding
[35] Deep transfer learning for person re-identification
[36] Pedestrian alignment network for large-scale person re-identification
[37] Divide and fuse: a re-ranking approach for person re-identification
[38] Learning a discriminative null space for person re-identification