key: cord-0869432-xqgf8jif
authors: Pan, Yadong; Kawai, Ryo; Yoshida, Noboru; Ikeda, Hiroo; Nishimura, Shoji
title: Training Physical and Geometrical Mid-Points for Multi-person Pose Estimation and Human Detection Under Congestion and Low Resolution
date: 2020-06-21
journal: SN COMPUT
DOI: 10.1007/s42979-020-00217-9
sha: 60f65cfa28baec240b3d63e9e88272a768327d64
doc_id: 869432
cord_uid: xqgf8jif

This paper introduces the design and evaluation of NeoPose, a system developed for multi-person pose estimation and human detection. NeoPose targets human detection in congested scenes and in low-resolution images. Under such conditions, we compared different versions of NeoPose with other existing algorithms in a human detection task, and used the task to discuss the usefulness of two kinds of mid-points (physical and geometrical) and of a deconvolution structure. The experimental results indicated that NeoPose with geometrical mid-points and the deconvolution structure performed best in terms of both precision and recall.

Human detection and pose estimation are two closely related problems in recent artificial intelligence research. They can be used for human action recognition [1, 2, 3], tracking [4, 5] and re-identification [6] of humans in online surveillance, and human-object interaction [7]. Most of the latest pose estimation algorithms embed a human detector at the beginning of their data processing unit, such as [9, 10, 11], which ranked in the top three of the COCO key-point challenge 2019. These human detection-based algorithms are called top-down methods. However, in real industrial problems, such as outdoor surveillance where people are often crowded together (Fig. 1a) or monitoring of suspicious people near the border between two countries where people are captured at low resolution (Fig. 1b), the human detector tends to fail.
This problem has been pointed out by Gkioxari et al. [8]. For real industrial problems, we consider that a human key-point-based method, called a bottom-up method, is a better solution than a top-down method for recognizing humans and human behavior. A bottom-up method first recognizes human key-points (also called body region points) on the visible areas of human bodies across the whole image, then associates those visible key-points into individual persons and generates human bounding boxes. As a result, a bounding box may not enclose a person's full body, but the detection itself is reasonable. Figure 2 compares top-down and bottom-up human detection under congestion and low image resolution. The example shows how a bottom-up method helps in such outdoor scenes. Therefore, the scope of this paper is to develop a better bottom-up approach for solving real industrial problems.

In this paper, we propose a bottom-up pose estimation system called NeoPose. NeoPose detects different types of body region points in the image and associates them into different individuals. Each group of associated body region points is called a pose vector, which represents the pose of an individual person in the image. Human detection is realized by calculating the bounding boxes that enclose each pose vector. Compared to previous work such as OpenPose [12], Art-Track [14] and Associative Embedding [13], NeoPose performed better in a human detection task under low image resolution. The task is explained in a later section of this paper.

The data flow in NeoPose (Fig. 3) follows our previous work [15], where a structure called a basic pattern is generated for each person in the image after different types of body region points are detected by a deep neural network. A basic pattern is a set of body region points including a person's shoulders, ears and neck.
After the basic patterns are generated, each body region point of any other type is either associated with one of the basic patterns or ignored as a false detection. Mid-points (the middle of two body region points), which are also detected by the deep neural network, are used as references when deciding which basic pattern a body region point should be associated with.

To extend our previous research, we made some additional design changes to NeoPose in this paper. (i) We extended the design of mid-points from physical mid-points to geometrical mid-points. Both are defined on the general body region points shown in Fig. 4a. Physical mid-points (Fig. 4b) are mid-points that are physically located on the human body, such as the mid-point between a person's right shoulder and right waist. A geometrical mid-point, on the other hand, is defined as the mid-point between any two kinds of body region points, and depending on the pose it may lie around rather than on the person's body (Fig. 4c). By this definition, geometrical mid-points include physical mid-points but cover more types of mid-points. (ii) We enhanced the deep neural network to train both general body region points and mid-points. In this paper, we compare the quality of human detection under different designs of NeoPose and discuss the usefulness of training mid-points under low image resolution, as well as its use in solving real industrial problems.

NeoPose applies a deep neural network trained on the COCO dataset [16]. The network (Fig. 5) consists of two stages and trains and infers the body region points as well as the mid-points. The resolution of the input image can be adjusted but should be a multiple of 8. The network generates one feature map for each type of body region point and mid-point. The first stage is the backbone of the neural network.
In our previous research, we used a vgg for the first stage. In this paper, we modified the vgg by adding a deconvolution (dconv) structure consisting of an up-sampling layer, two convolution layers and a concatenation layer. This follows recent work [20, 21] where deconvolution was applied to reduce false detections.

As to the second stage, in our previous research we trained 19 channels for 18 body region points and the background in one branch, and 10 physical mid-points in another branch. The mid-term feature from the body region points' branch was shared with the mid-points' branch. In this paper, we changed this part of the network to a four-branch structure. The main branch, which receives the data from the first stage (vgg/vgg + dconv), consists of 19 channels for training 18 body region points and the background. The other three branches were designed for 30 types of geometrical mid-points, each defined by one body region point in $S = \{N_0, N_1, N_2\}$ and one in $\{N_8, \ldots, N_{17}\}$, where the geometrical mid-points corresponding to the same item in $S$ were trained in the same branch. For example, the middle of $N_8$ and $N_0$ and the middle of $N_9$ and $N_0$ were trained in the same branch. In the network, the mid-term feature from the main branch was shared with the other three branches. The output of the network includes 49 feature maps (19 for the main branch and 30 for the geometrical mid-points).

When training geometrical mid-points, the loss was calculated over all channels and pixels from the network score and the ground truth, where $C$ stands for the 49 channels, $P$ represents all pixels in the feature map, $S^{T}_{P}$ is the score calculated by the deep network and $S^{G}_{P}$ is the ground truth. The ground truth for geometrical mid-points was calculated based on that of the body region points.

The process of generating basic patterns follows the method described in our previous research [15]. For associating the detected body region points, we propose a novel method in this paper, built on the training of the 30 types of geometrical mid-points.
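The 30 geometrical mid-point types and their ground-truth positions can be illustrated with a minimal sketch. It assumes that the second index set is N_8 to N_17 (which gives 3 × 10 = 30 types), that annotations arrive as a dictionary from body region index to pixel coordinates, and that a mid-point exists only when both of its body region points are annotated. The names `MIDPOINT_TYPES`, `midpoint_ground_truth` and `branch_of` are hypothetical, and how a position is turned into a feature map is not covered here.

```python
from itertools import product

# Assumed index sets: S = {N0, N1, N2} and the limb points N8..N17.
ANCHORS = (0, 1, 2)
TARGETS = tuple(range(8, 18))

# One mid-point type per (anchor, target) pair: 3 * 10 = 30 types.
MIDPOINT_TYPES = list(product(ANCHORS, TARGETS))


def midpoint_ground_truth(keypoints):
    """Return {(anchor, target): (x, y)} for every annotated pair.

    `keypoints` maps a body region index to an (x, y) tuple; points that
    are not annotated are simply absent from the dict (an assumption).
    """
    result = {}
    for a, t in MIDPOINT_TYPES:
        if a in keypoints and t in keypoints:
            (ax, ay), (tx, ty) = keypoints[a], keypoints[t]
            # The mid-point is the arithmetic mean of the two positions.
            result[(a, t)] = ((ax + tx) / 2.0, (ay + ty) / 2.0)
    return result


def branch_of(midpoint_type):
    """Mid-points that share the same item of S go to the same branch."""
    return ANCHORS.index(midpoint_type[0])
```

With this layout, the mid-points of (N_8, N_0) and (N_9, N_0) fall in the same branch, and the 19 main-branch channels plus the 30 mid-point types account for the 49 output feature maps.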
For each detected body region point of type $N_i$ ($8 \le i \le 17$), also called an $N_i$ point, the presence of its corresponding types of geometrical mid-points is checked to determine which basic pattern it should be associated with. Figure 6 shows an example of detected basic patterns, body region points of the left knee, and the geometrical mid-points corresponding to the left knee and left shoulder. The figure shows that even when such a geometrical mid-point lies around rather than on a person's body (e.g., the person on the right), it can be detected by the deep neural network.

To associate an $N_i$ point ($8 \le i \le 17$) with a basic pattern, NeoPose first builds links from the $N_i$ point to the $N_0$, $N_1$ and $N_2$ points of each basic pattern. If any of the three points is missing from a basic pattern, the link to that point is not built. However, the $N_0$ point must exist by the definition of a basic pattern. Next, for each basic pattern, NeoPose counts the number of valid links, that is, links for which a geometrical mid-point of the corresponding type is detected within an ellipse area between the two terminals of the link. Figure 7 shows an example of the ellipse area. Following our previous research [15], $R_{major}$ of the ellipse area is set to $|(N_i, N_j)| \times 0.35$, and $R_{minor}$ is set to $R_{major} \times 0.75$. The basic pattern(s) with the largest number of valid links are considered candidate basic pattern(s). Then, the candidate whose $N_0$ point is closest to the $N_i$ point is accepted as the basic pattern for the $N_i$ point. Taking Fig. 8a as an example, for the $N_i$ point, the basic patterns with id 1 and id 3 have the largest number of valid links. In this case, the $N_i$ point is associated with the basic pattern with id 1, since the distance from the $N_i$ point to the $N_0$ point of that basic pattern is shorter than for the other one (id 3).
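The link counting and tie-breaking described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the ellipse is assumed to be centred at the middle of the link with its major axis along the link, a point with no valid link to any basic pattern is assumed to be rejected (returned as `None`), and the function names and data layout are hypothetical.

```python
import math


def in_ellipse(p, a, b, major_ratio=0.35, minor_ratio=0.75):
    """True if p lies inside the ellipse placed between terminals a and b.

    R_major = |a - b| * 0.35 and R_minor = R_major * 0.75 as in the text.
    Centre at the middle of the link and major axis along the link are
    assumptions of this sketch.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return False
    r_major = length * major_ratio
    r_minor = r_major * minor_ratio
    cx, cy = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    ux, uy = dx / length, dy / length
    # Coordinates of p in the ellipse frame (along and across the link).
    along = (p[0] - cx) * ux + (p[1] - cy) * uy
    across = -(p[0] - cx) * uy + (p[1] - cy) * ux
    return (along / r_major) ** 2 + (across / r_minor) ** 2 <= 1.0


def associate(point, point_type, patterns, midpoints):
    """Choose the basic pattern id for one N_i point (8 <= i <= 17).

    patterns:  {pattern_id: {0: (x, y), 1: (x, y), 2: (x, y)}}; entries
               1 and 2 may be missing, entry 0 must exist.
    midpoints: {(anchor, point_type): [(x, y), ...]} detected mid-points.
    """
    best_key, best_id = None, None
    for pid, pts in patterns.items():
        valid = 0
        for anchor in (0, 1, 2):
            if anchor not in pts:
                continue  # no link is built to a missing point
            candidates = midpoints.get((anchor, point_type), [])
            if any(in_ellipse(m, point, pts[anchor]) for m in candidates):
                valid += 1
        # More valid links wins; ties go to the shorter distance to N0.
        key = (valid, -math.dist(point, pts[0]))
        if best_key is None or key > best_key:
            best_key, best_id = key, pid
    if best_key is None or best_key[0] == 0:
        return None  # assumption: no valid link means no association
    return best_id
```

Because each call depends only on one $N_i$ point and the fixed set of basic patterns, calls for different points are independent of each other.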
For the other types of $N_i$ points ($5 \le i \le 7$), the basic pattern with the minimum distance from its $N_0$ point to the $N_i$ point is selected. Overall, the process of associating all the different types of body region points with basic patterns can be done in parallel, which helps improve NeoPose's processing speed.

After the $N_i$ points have been associated with basic patterns, NeoPose reduces the number of $N_i$ points associated with each basic pattern. If a basic pattern is associated with multiple $N_i$ points of the same type, the $N_i$ point with the minimum distance to the $N_2$ point of the basic pattern is accepted and the others are excluded (Fig. 8b). As a result, each basic pattern is associated with no more than one body region point of any specific type. Figure 9 shows some examples of images rendered with poses analyzed by NeoPose with the deconvolution structure and geometrical mid-points.

In this section, we tested the performance of four designs of NeoPose: (i) vgg + physical mid-points, (ii) vgg + deconvolution + physical mid-points, (iii) vgg + geometrical mid-points, and (iv) vgg + deconvolution + geometrical mid-points. The test is a human detection test on images from the MHP dataset [17]. The dataset contains around 4000 images, each capturing multiple people in a variety of poses in outdoor scenes, which meets the need to investigate NeoPose for industrial use (e.g., outdoor surveillance). To simulate the situation in many industrial problems, we resized all images in the dataset to a fixed, smaller height (120 pixels), with the width scaled according to the image's aspect ratio, and used the resized images as input to NeoPose's deep neural network. After running pose estimation on all images, we extracted the pose vectors that associate at least 10 body region points.
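The post-processing steps described so far (keeping one point per type, resizing to a height of 120 pixels, keeping pose vectors with at least 10 points, and enclosing each pose vector in a bounding box) can be sketched as below. All function names are hypothetical. Two details are assumptions rather than statements from the text: the resized width is snapped to a multiple of 8 to satisfy the network's input constraint, and the $N_2$ point is assumed to be present when duplicates are pruned.

```python
import math


def prune_duplicates(n2_point, candidates):
    """Keep, for each body region type, the candidate closest to N2.

    candidates: list of (point_type, (x, y)) associated with one basic
    pattern. Returns {point_type: (x, y)}, at most one point per type.
    """
    kept = {}
    for ptype, xy in candidates:
        if ptype not in kept or (
            math.dist(xy, n2_point) < math.dist(kept[ptype], n2_point)
        ):
            kept[ptype] = xy
    return kept


def resized_size(width, height, target_height=120, multiple=8):
    """Target (width, height) preserving the aspect ratio.

    Snapping the width to a multiple of 8 is an assumption of this sketch.
    """
    scale = target_height / height
    new_width = max(multiple, round(width * scale / multiple) * multiple)
    return new_width, target_height


def bounding_box(pose_vector):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around a pose."""
    xs = [x for x, _ in pose_vector.values()]
    ys = [y for _, y in pose_vector.values()]
    return min(xs), min(ys), max(xs), max(ys)


def detections(pose_vectors, min_points=10):
    """Boxes for the pose vectors that associate at least `min_points` points."""
    return [bounding_box(pv) for pv in pose_vectors if len(pv) >= min_points]
```

Here a pose vector is represented as a dictionary from body region index to coordinates, so `len(pv)` is the number of associated body region points.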
In our experience of solving industrial problems, 10 associated body region points is a practical threshold for considering a human detection successful. For each pose vector with at least 10 associated body region points, a minimum bounding box enclosing all body region points in the pose vector was calculated (Fig. 10). Sub-images aligned with those bounding boxes, rendered with all body region points of the pose vector, were cropped from the original images. The cropped images were checked one by one by experts in image sensing and categorized into three classes: (i) Correct detection, which means all body region points in the pose vector are located on the same person's body without a fatal error. An error is considered fatal if it misleads the understanding of a person's pose, as judged by the experts from experience. (ii) False detection, which means the body region points in the pose vector are located on different persons, or some are located on the background rather than on a human body. (iii) Ghost detection, where all body region points are located on the background rather than on a human body.

Fig. 9 Pose estimation on images from MHP dataset [17] using NeoPose that applies deconvolution structure and geometrical mid-points

The evaluation explained above was conducted for different bottom-up approaches, including OpenPose [12], Art-Track [14], Associative Embedding (AE) [13] and the four versions of NeoPose. The evaluation results are shown in Table 1. The results suggested that: (i) NeoPose with deconvolution and geometrical mid-points performed best in terms of both precision (88.0%) and recall (80.7%). (ii) Whether physical or geometrical mid-points were applied, the deconvolution structure helped to reduce ghost detections. Figure 11 shows in more detail how the deconvolution structure helps reduce false positive detections of body region points.
(iii) Compared to networks with physical mid-points, networks with geometrical mid-points gained more correct detections and therefore raised the level of recall.

In OpenPose [12], the association of body regions is realized through the part affinity field (PAF), which is a surface integral over the pixels between two body regions. However, in low-resolution images the reliability of PAF drops, which leads to errors in pose estimation. The mid-point design (both physical and geometrical) targets this issue by simplifying the representation of the correlation between two body regions. Compared to OpenPose, NeoPose with either physical or geometrical mid-points performed better in the human detection task.

As to the association of body region points, networks with geometrical mid-points gained a larger number of correct associations. One important reason is that the association based on geometrical mid-points is done in parallel: any type of body region point can be associated with a basic pattern without relying on other types of body region points. With physical mid-points, in contrast, the association is done in sequence (e.g., right waist, right knee, right ankle), so a missed association of any body region prevents the following ones from being associated with the basic pattern. Moreover, geometrical mid-points can be located off a person's body, which means that what we trained is not only the visible information but also the geometrical correlation among different body region points. Figure 6 shows that geometrical mid-points can be successfully detected. Training such correlations has recently been studied elsewhere, for example in training the center point of a person [19]. We also consider that this approach to training could be used in future research on human-object and human-human interaction.
Deconvolution has recently been considered a practical way to enhance the robustness of sensing algorithms [20, 21]. In this research, we evaluated its use on low-resolution images, which had not been studied in previous research, and found that the false positive detections of body region points decreased significantly with networks that include the deconvolution structure.

The task used to evaluate different algorithms in this paper targets real industrial problems where the captured people are crowded and at low resolution. Research by Gkioxari et al. [8] and our previous work [15] indicated that under such conditions, top-down human detectors tend to fail in both human detection and pose estimation. Instead of top-down methods, in this paper we investigated how bottom-up methods can be used to solve human detection problems. In the task on the MHP dataset, where all images were resized to a smaller resolution (height = 120 pixels), the networks with the deconvolution structure produced almost no ghost detections, which suggests that such network designs could be suitable for real industrial problems such as the tasks shown in Fig. 1. In those tasks, what matters most is to confirm that the recognized target is indeed a person. By associating multiple body regions, human detection becomes more reliable in low-resolution images, and even when some parts of the human body are occluded, it is possible to determine the presence of a person by associating the visible body regions. This idea could also be extended to the recognition of objects other than humans.

Fig. 11 False positive detections of human body region points using different versions of NeoPose. a1, a2 vgg + physical mid-points, b1, b2 vgg + deconvolution + physical mid-points, c1, c2 vgg + geometrical mid-points, d1, d2 vgg + deconvolution + geometrical mid-points

This research is an extended version of our previous work [15].
To find solutions for real industrial problems, we have focused on the following ideas: (i) Using pose estimation for human detection, in order to deal with occlusion in congested scenes and to improve the confidence of detected persons under low image resolution. (ii) For low-resolution images, simplifying the deep neural network from training areas (e.g., PAF in OpenPose) to training mid-points for the association of body regions. (iii) In addition, in this research we discussed how physical and geometrical mid-points performed in a task on the MHP dataset and evaluated the use of a deconvolution structure.

Although our experiment suggested that NeoPose performed better than other recent bottom-up approaches, neither precision nor recall reaches 90 percent. This is a challenging issue because in real industrial problems the environment, the image quality and the status of the captured persons are usually complex. This complexity is also the reason why many good algorithms from academic studies are still not suitable for release as industrial products. In future work, we will continue looking for better theory and better algorithms to bridge the gap between academic and industrial uses of AI algorithms. We are also interested in seeing our algorithms applied in many industrial scenes, such as autonomous driving buses, automatic sports behavior analysis, and gesture-based human-computer interfaces for touchless systems in the post-COVID-19 world, which we consider topics for future publications.

Conflict of interest Yadong Pan is working at NEC. Before he started his career at NEC, he was supervised by Prof. Kenji Suzuki at University of Tsukuba, Japan. Ryo Kawai is working at NEC. Before he started his career at NEC, he was supervised by Prof. Yasushi Yagi at Osaka University, Japan. Noboru Yoshida, Hiroo Ikeda and Shoji Nishimura are also working at NEC.
Before they joined NEC, their majors were not directly related to image processing or pattern recognition.

Potion: pose motion representation for action recognition
Pose encoding for robust skeleton-based action recognition
Chained multistream networks exploiting pose, motion, and appearance for action classification and detection
Efficient online multi-person 2D pose tracking with recurrent spatio-temporal affinity fields
Pose flow: efficient online pose tracking
Pose-driven deep convolutional model for person reidentification
Detecting and recognizing human-object interactions
Using k-poselets for detecting people and localizing their keypoints
Rethinking on multi-stage networks for human pose estimation
Distribution-Aware coordinate representation for human pose estimation
Multi-person pose estimation with enhanced channel-wise and spatial information
Realtime multi-person 2d pose estimation using part affinity fields
Associative embedding: end-to-end learning for joint detection and grouping
Arttrack: articulated multi-person tracking in the wild
Multi-person pose estimation with mid-points for human detection under real-world surveillance
The MHP (Multi-Human Parsing) dataset
The MOT (Multiple Object Tracking) dataset: https://motchallenge
Objects as points
Deep high-resolution representation learning for human pose estimation
Simple baselines for human pose estimation and tracking