key: cord-0749210-hwskmwwl
authors: Zhou, Fangbo; Zhao, Huailin; Zhang, Yani; Zhang, Qing; Liang, Lanjun; Li, Yaoyao; Duan, Zuodong
title: COMAL: compositional multi-scale feature enhanced learning for crowd counting
date: 2022-03-11
journal: Multimed Tools Appl
DOI: 10.1007/s11042-022-12249-9
sha: 7372b05bf591bb588aa9630ef49be6982f6d3716
doc_id: 749210
cord_uid: hwskmwwl

Accurately modeling the scale variations of crowd heads is an effective way to improve the accuracy of crowd counting methods. Most counting networks apply a multi-branch network structure to obtain head features at different scales. Although they have achieved promising results, they do not perform well on scenes with extreme scale variation because of their limited scale representation ability. Meanwhile, these methods are prone to mistaking background objects for foreground crowds in complex scenes because of their limited context and high-level semantic information. We propose a compositional multi-scale feature enhanced learning approach (COMAL) for crowd counting to handle the above limitations. COMAL enhances the multi-scale feature representations from three aspects: (1) the semantic enhanced module (SEM) is developed to embed high-level semantic information into the multi-scale features; (2) the diversity enhanced module (DEM) is proposed to enrich the variety of scales in the crowd features; (3) the context enhanced module (CEM) is designed to strengthen the multi-scale features with more context information. Based on the proposed COMAL, we develop a crowd counting network under the encoder-decoder framework and perform extensive experiments on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets. Qualitative and quantitative results demonstrate the effectiveness of the proposed COMAL.

Crowd counting aims to estimate the number and density distribution of people in an image or video frame.
It is particularly important for public safety and management [6, 52, 53]; during the COVID-19 pandemic in particular, accurate crowd counting helps avoid gatherings of people. It has also attracted widespread attention from many scholars [5, 12, 23, 36, 38, 49, 58]. However, crowd counting is very challenging because of the large scale variation of crowd heads and complex backgrounds. In recent years, with the renaissance of deep learning [28, 50, 51], convolutional neural network (CNN) based methods [8, 15] have achieved significant progress in the crowd counting task [13, 18, 27, 55, 63]. They formulate the task as a regression problem [10, 22, 32, 39, 64], designing a sophisticated network to establish the nonlinear relationship between the input crowd image and its corresponding crowd density map. Among these directions, efficiently modeling the scale variations of a crowd is a classical and active research topic, and many methods have been proposed to handle it. For example, multi-column networks [37, 44, 66] are designed to model different scales of crowd heads. However, they usually have complicated structures, take a long time to optimize, and require large computational resources, which makes them ill-suited to real-world applications. Encoder-decoder frameworks have recently become popular in crowd counting, and sophisticated decoders have been proposed. For instance, Zhao et al. [68] design a decoder with different auxiliary task branches to obtain robust representations from the auxiliary tasks. Xie et al. [60] extract multi-scale features via a decoder with stacked dilated convolutional layers and recurrent modules. Although these methods achieve promising performance, they may fail in scenes with extreme crowd scale variations and complicated backgrounds because of their limited scale representations and representational ability.
Thus, modeling a crowd's scale variations in different scenes is still a challenging and unsolved problem for crowd counting.

To address the challenges mentioned above, we aim to extract efficient multi-scale feature representations for crowd counting from three aspects: (1) extracting high-level semantic features of the crowd to enhance the crowd-aware representations; (2) modeling continuous scale variations of the crowd for multi-scale crowd counting; and (3) extracting long-range pixel dependencies to obtain context information. To this end, we propose a compositional multi-scale feature enhanced learning approach (COMAL). Specifically, for semantic feature enhancement, the semantic enhanced module (SEM) embeds the semantic information from high-level features into the multi-scale crowd features. For scale diversity enhancement, the diversity enhanced module (DEM) enriches the variety of feature representations via three diversity enhanced blocks arranged in a cascade. For context enhancement, the context enhanced module (CEM) extracts context information along the spatial and channel dimensions via the neural attention mechanism. With the help of COMAL, the multi-scale features gain strong representational ability and abundant feature representations, which helps handle the scale variation challenges of different crowd scenes. Based on the proposed COMAL, we design a counting network under the encoder-decoder framework, in which COMAL is used as the decoder for final crowd density estimation. Extensive experiments are performed on commonly used crowd counting benchmarks, and our network outperforms the other state-of-the-art methods. The visualization results further support the effectiveness of the proposed COMAL.
To summarize, the main contributions of our paper are fourfold:
- We propose a semantic enhanced module (SEM) to embed high-level semantic information into the multi-scale features, which improves crowd recognition in complex crowd scenes.
- We develop a diversity enhanced module (DEM) to enrich the scale representations. It helps the counting network handle cases of extreme scale variation.
- We design a context enhanced module (CEM) to strengthen the extracted multi-scale features with more context information. CEM helps the counting network distinguish the foreground crowd from the background in complex crowd scenes.
- We combine the above three modules into a compositional learning approach, COMAL, and build an encoder-decoder network based on it for crowd counting. With the assistance of COMAL, the counting network outperforms the other state-of-the-art methods on commonly used crowd counting benchmarks.

The rest of this paper is organized as follows. Section 2 reviews related work on CNN-based crowd counting and multi-scale feature learning methods. In Section 3, we introduce COMAL and its components in detail. We present the experiment details and model analysis in Section 4 and conclude in Section 5.

In this section, we review the CNN-based crowd counting methods and multi-scale feature representation learning methods.

We first review the crowd counting methods [7, 29, 34, 45, 46, 57, 62] and summarize them in Table 1. For example, Zhang et al. [66] proposed a Multi-column Convolutional Neural Network (MCNN) with different convolutional structures to handle the scale variations of crowd heads. Sam et al. [44] designed Switch-CNN, which trains a switch classifier to select the optimal CNN regressor for density estimation at a specific scale. However, the limitation of Switch-CNN is that it chooses the result of one sub-network rather than fusing the results of all of them. Deb et al.
[11] proposed an aggregated multi-column dilated convolution network for perspective-free counting. Although the above multi-column networks have achieved significant progress, they only consider a limited set of crowd scales and do not perform well in scenes with continuous scale variation. To reduce the computational cost, Li et al. [30] proposed CSRNet, which adopts dilated convolutional layers to enlarge the receptive field of the network. However, the six successive dilated convolutional layers of CSRNet cause a serious gridding effect [54], which prevents crowd features from being extracted efficiently. To solve this problem, our SEM adopts multiple parallel filters with different dilation rates to exploit multi-scale features. Cao et al. [3] proposed a scale aggregation network (SANet), which applies a scale aggregation module to extract multi-scale features and transposed convolutional layers to regress the final crowd density map. Besides, some neural attention based methods have also been applied to the crowd counting task [16, 19]. Guo et al. [19] explored a scale-aware attention fusion method with different dilation rates to obtain different visual granularities of the crowd's region of interest. Gao et al. [16] proposed a space-/channel-wise attention regression network to exploit the context information of the crowd scene for accurate crowd counting. These well-designed attention models effectively encode large-range contextual information. In contrast, we propose a compositional learning approach to enhance the multi-scale features, which guides the counting network to learn robust representations for different crowd scenes.

Scale variation is a common problem in different computer vision tasks [4, 9, 20, 31, 67], and many multi-scale feature representation learning methods have been proposed to solve it. Lin et al. [31] proposed a feature pyramid network (FPN), which fuses high-level features and low-level features by element-wise summation for small object detection.
Zhao et al. [67] proposed a pyramid scene parsing network (PSPNet) for aggregating context information at different scales. Inspired by spatial pyramid pooling (SPP) [21], Chen et al. [9] proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which uses four convolutions with different dilation rates. ASPP effectively enlarges the network's receptive field and obtains multi-scale information, which helps the network achieve superior results on the semantic segmentation task. He et al. [20] proposed the Adaptive Pyramid Context Network (APCNet), which uses Adaptive Context Modules to leverage local and global representations to estimate an affinity weight for local regions. To obtain larger-scale information, Cao et al. [4] proposed a global context network (GCNet), which focuses on the connection between different image positions by establishing long-range relationships between pixels. In this paper, we propose the DEM to enrich the multi-scale feature representations, and apply the proposed SEM and CEM to strengthen the feature representations.

In this section, we first give an overview of the counting network with the proposed COMAL. Then, SEM, DEM, and CEM are elaborated. Finally, we describe the loss function and evaluation metrics we use.

The overview of the counting network used in this paper is shown in Fig. 1. Following [2, 16, 30], we choose VGG-16 [48] as the feature encoder. However, in order to obtain semantic features, we use the first thirteen convolutional layers instead of the first ten. Then, the encoder features are fed to SEM, DEM, and CEM sequentially to get the enhanced multi-scale crowd features. Finally, the extracted multi-scale features are processed by a single 1 × 1 convolutional layer and a bilinear interpolation operation to regress the final crowd density map. Each component of the counting network is described as follows.
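As a rough illustration of the encoder choice described above, the sketch below computes the feature-map size produced by the first ten versus the first thirteen convolutional layers of VGG-16 for a 576 × 768 input (the training size reported in the experiments). It assumes the standard VGG-16 layout (3 × 3 convolutions with padding 1, 2 × 2 max pooling with stride 2) and that every pooling layer lying between the retained convolutions is kept; the paper does not state which pooling layers COMAL keeps, and the helper name `encoder_output` is hypothetical.

```python
# Standard VGG-16 convolutional layout: integers are the output channels of
# 3x3 convolutions, "M" marks a 2x2 max-pooling layer with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def encoder_output(height, width, num_convs):
    """Return (channels, height, width) after the first `num_convs` conv layers.

    Assumption: convolutions preserve the spatial size, and a pooling layer is
    applied only if at least one retained convolution follows it.
    """
    channels, convs_seen = 3, 0
    for layer in VGG16_CFG:
        if convs_seen == num_convs:
            break  # stop before any pooling that follows the last kept conv
        if layer == "M":
            height, width = height // 2, width // 2
        else:
            channels, convs_seen = layer, convs_seen + 1
    return channels, height, width


ten_layer = encoder_output(576, 768, num_convs=10)       # three poolings
thirteen_layer = encoder_output(576, 768, num_convs=13)  # four poolings
print(ten_layer, thirteen_layer)
```

Under these assumptions the deeper encoder halves the spatial resolution once more (stride 16 instead of 8) in exchange for features from the last convolutional block.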
We propose the SEM to generate multi-scale crowd features with abundant semantic information for final crowd density estimation. The detailed structure of SEM is shown in Fig. 2. It has two paths: the low-level feature process path (LFP) and the high-level feature process path (HFP). [...] dimension and processed by the bilinear interpolation operation to the same size as the low-level features. Unlike the previous approach [31], which directly uses element-wise summation to fuse the upsampled high-level features and the low-level features, we follow the design of ExFuse [65]. The output of HFP is multiplied element-wise with the output of LFP to generate the initial multi-scale features, which gives the network more discriminative features. More analysis can be found in Section 4.3.

Although SEM generates multi-scale crowd features, the representations of these crowd features are limited, which hinders the performance of the counting network in complex scenes. To increase the diversity of crowd features, we design the DEM, which consists of three diversity enhanced blocks (DEB). The design philosophy of DEM comes from [56]. As shown in Fig. 3, each DEB has two branches: one branch has a single 3 × 3 convolutional layer and the other has two stacked 3 × 3 convolutional layers. All 3 × 3 convolutional layers have half as many channels as the input features, and the outputs of the two branches are fused by element-wise summation. We place three DEBs in a cascade after SEM, as shown in Fig. 3 (b), which is equivalent to eight parallel branches with different receptive fields, as shown in Fig. 3 (c).

To increase the discriminability of the proposed COMAL, we propose the CEM to exploit the context information in the multi-scale crowd features. The detailed architecture of CEM is shown in Fig. 4. It consists of a position attention module (PAM) and a channel attention module (CAM). The PAM encodes the context information by calculating the long-range relationships between pixels.
Its detailed structure is shown in Fig. 4. The input features are first processed by a 3 × 3 convolutional layer. After that, the processed features are fed into a 1 × 1 convolutional layer and a Softmax layer to get the position attention weight $P^{att}_i$, where $\{P_i \mid i \in \{1, \dots, N\}\}$ denotes the i-th position of the input feature map and $N = H \times W$ is the number of positions in the feature map. The position attention weight $P^{att}_i$ is fed to a bottleneck structure constructed from two 1 × 1 convolutional layers. Specifically, we place a layer normalization (LN) layer between the two 1 × 1 convolutional layers for better weight optimization. The output of the bottleneck is fused with the input of PAM via residual learning to give the final position attention feature, where $P$ denotes the input feature of PAM, ReLU(·) and LN(·) denote the ReLU and LN layers, and $W_{p1}$ and $W_{p2}$ represent the weights of the two 1 × 1 convolutional layers.

The structure of CAM, also shown in Fig. 4, is similar to that of PAM. Unlike PAM, CAM applies a global average pooling layer to acquire the global context information and produce the final channel attention weight $C^{final}_i$, where $C$ denotes the input feature of CAM, $C_m$ represents the global average pooling feature, and $W_{c1}$ and $W_{c2}$ denote the weights of the two 1 × 1 convolutional layers.

Following [66], we convolve the head annotation points with a Gaussian kernel to generate a crowd density map $F(x)$, which is defined as follows:

$$F(x) = \sum_{i} \delta(x - x_i) * G_{\sigma}(x)$$

where $G_{\sigma}(x)$ stands for the Gaussian kernel with parameter σ, $x_i$ is a ground truth head location, and $x$ is a pixel position in the input image. σ is set to different values for different datasets. For ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF, σ is set to 15.
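For a fixed σ, the density-map construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `gaussian_density_map` is hypothetical, the toy grid and σ = 2 are chosen to keep the example small (the paper uses σ = 15 on full images), and each Gaussian is renormalized over the image so that heads near the border still contribute exactly one to the total, which is a common convention rather than something the paper specifies.

```python
import math


def gaussian_density_map(points, height, width, sigma):
    """Sum one normalized 2-D Gaussian per annotated head position.

    `points` holds (row, col) head locations. Each kernel is renormalized over
    the image (an assumption), so the map sums to the number of heads.
    """
    density = [[0.0] * width for _ in range(height)]
    for head_row, head_col in points:
        kernel = [
            [math.exp(-((r - head_row) ** 2 + (c - head_col) ** 2)
                      / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / total
    return density


heads = [(5, 5), (10, 20), (0, 0)]  # toy annotations, including a corner head
dmap = gaussian_density_map(heads, height=16, width=32, sigma=2.0)
count = sum(map(sum, dmap))  # integrating the map recovers the head count
print(round(count, 6))
```

The property that matters for counting is visible here: summing the density map gives back the number of annotated heads, so a network that regresses the map also yields a count.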
For ShanghaiTech Part A, σ is equal to $\beta \bar{d}_i$, where $\bar{d}_i$ represents the average distance to the k nearest neighbors and β is set to 0.3.

We use the L2 loss to optimize the proposed COMAL. The loss function is defined as follows:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i, \Theta) - F_i \right\|_2^2$$

where $N$ is the total number of training images, $F(X_i, \Theta)$ is the estimated density map generated by COMAL with parameters Θ, $X_i$ represents the input image, and $F_i$ is the ground truth of the input image $X_i$.

The mean absolute error (MAE) and the mean square error (MSE) are chosen to evaluate the effectiveness of our method. The formulations are as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - \hat{C}_i \right)^2}$$

where $N$ stands for the total number of test images, and $C_i$ and $\hat{C}_i$ denote the ground-truth and estimated counts of the i-th test image.

In this section, we first describe the implementation details and experiment setup. Then, we introduce the commonly used crowd counting datasets and compare our method with other state-of-the-art methods. Finally, we conduct ablation experiments to evaluate the effectiveness of each component of our method.

We apply the Adam optimizer to train our network. Following [16, 19, 40], the initial learning rate is set to 1 × 10^-5 and decays by a factor of 0.99 every two epochs. The weight decay is set to 1 × 10^-4. To optimize the network better, we apply a magnification factor to enlarge the values of the ground truth density map. The magnification factor is set to 100 for ShanghaiTech Part A and UCF-QNRF, 200 for ShanghaiTech Part B, and 10 for UCF_CC_50. All training images are cropped and resized to 576 × 768. The experiments are conducted under the PyTorch framework with a single NVIDIA RTX 2080 Ti GPU.

We evaluate our method on three commonly used crowd counting datasets. The details of each dataset are shown in Table 2. ShanghaiTech [66] includes 1,198 images with 330,165 annotated people. It is divided into two parts: Part A and Part B. Part A contains 482 highly crowded images randomly collected from the Internet. Part B contains 716 images taken on the bustling streets of downtown Shanghai.
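Because the comparisons that follow are reported as MAE and MSE, a minimal sketch of the two metrics may be useful. It assumes the convention common in crowd counting, where "MSE" is the square root of the mean squared count error, and that a predicted count is obtained by summing the predicted density map and dividing by the magnification factor used during training. The function names and the example numbers are hypothetical and are not results from the paper.

```python
import math


def count_from_density(density, magnification=1.0):
    """Integrate a (magnified) density map into a scalar count."""
    return sum(map(sum, density)) / magnification


def mae(true_counts, pred_counts):
    """Mean absolute error between ground-truth and estimated counts."""
    n = len(true_counts)
    return sum(abs(t - p) for t, p in zip(true_counts, pred_counts)) / n


def mse(true_counts, pred_counts):
    """Root of the mean squared count error (called MSE in this literature)."""
    n = len(true_counts)
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(true_counts, pred_counts)) / n)


# Hypothetical counts for three test images, for illustration only.
true_counts = [100.0, 200.0, 300.0]
pred_counts = [110.0, 190.0, 330.0]
mae_value = mae(true_counts, pred_counts)
mse_value = mse(true_counts, pred_counts)

# A toy 2x2 map trained with magnification factor 100 encodes three people.
toy_count = count_from_density([[50.0, 150.0], [0.0, 100.0]], magnification=100.0)
print(mae_value, mse_value, toy_count)
```

MSE penalizes large per-image errors more heavily than MAE, which is why the two metrics are usually read together: MAE for accuracy, MSE for robustness.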
The comparison results on the ShanghaiTech dataset are presented in Table 3. We can see that the proposed COMAL outperforms other state-of-the-art methods in terms of the MSE metric. Specifically, compared with CSRNet, COMAL reduces MAE and MSE by 8.6 and 17.9, respectively, which benefits from the proposed SEM avoiding a serious gridding effect [54]. Compared with SCAR, COMAL also achieves better counting accuracy, which benefits from the proposed SEM and DEM. The qualitative results in Fig. 5 further support the effectiveness of our method. We observe from the fourth column that the proposed DEM can capture continuous scale changes of the crowd. Besides, we conduct a further statistical analysis of the performance of COMAL on the ShanghaiTech Part A dataset. Specifically, as shown in Table 4, the ShanghaiTech Part A dataset is divided into five crowd density levels. We compare the performance of COMAL, SCAR, and CSRNet on the five crowd density levels; the comparison is shown in Fig. 6. We find that COMAL performs better than the other counting networks on all five crowd density levels, which demonstrates the effectiveness of the proposed method.

Following previous works, we perform five-fold cross validation to evaluate the performance of the proposed COMAL on UCF_CC_50. The quantitative results are presented in Table 5. Among the compared methods, SFCN† with Pre-GCC [40] uses synthetic data to supplement the limited training images of UCF_CC_50 and achieves better counting performance. However, COMAL achieves state-of-the-art results among methods without synthetic data pretraining, which further supports the superiority of our method. Although the crowd distribution varies hugely in this dataset, COMAL lowers MAE and MSE by 17.5 and 20.8, respectively, relative to TEDNet, which is significant progress for the crowd counting task.
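The five-fold protocol mentioned above can be sketched as follows. UCF_CC_50 contains 50 images, so each fold trains on 40 and tests on 10, and the reported metric is the average over the five folds. The contiguous split and the function names are assumptions made for illustration; the paper does not say how its folds were drawn, and the per-fold values below are placeholders rather than reported numbers.

```python
def five_fold_splits(num_images=50, num_folds=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Assumes `num_images` is divisible by `num_folds` and uses contiguous
    folds; an actual protocol may shuffle or follow a published split.
    """
    indices = list(range(num_images))
    fold_size = num_images // num_folds
    for k in range(num_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]


def average_over_folds(fold_scores):
    """Average a per-fold metric (for example MAE) into one reported number."""
    return sum(fold_scores) / len(fold_scores)


splits = list(five_fold_splits())
for fold_id, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold_id}: {len(train_idx)} train / {len(test_idx)} test")

# Placeholder per-fold MAE values, for illustration only.
example_mae = average_over_folds([200.0, 250.0, 150.0, 300.0, 100.0])
print(example_mae)
```

Every image is used for testing exactly once, which matters for a dataset this small: a single train/test split of 50 images would give a very noisy estimate.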
The performance of the proposed COMAL on UCF-QNRF is presented in Table 6. We can see that COMAL outperforms the other methods that do not use synthetic data pretraining, which further supports the superiority of our method. Compared with TEDNet, COMAL lowers MAE by 10.9, which further supports the effectiveness of our method. Without the help of synthetic data, our method still achieves an MAE similar to that of Pre-GCC [40].

Table 6 (UCF-QNRF; only the rows that survive in the extracted text)
Method | MAE | MSE
Switch-CNN [44] | 228 | 445
CSRNet [30] | 120.3 | 208.5
PACNN [47] | - | -
SCAR [16] | - | -
TEDNet [26] | 113 | 188
DADNet [19] | 113.2 | 189.4
RAZ-Net [35] | 116 | 195
2-DA-CNN [69] | - | -
SFCN† with Pre-GCC [40] | 102 |

Qualitative and quantitative results of the ablation study are displayed in Fig. 7 and Table 7. We can see that the counting performance improves continually as the proposed components are added to the counting model, and the best results are achieved with all the proposed components, which supports the effectiveness of our method. Specifically, compared with the fourth model, COMAL lowers MAE and MSE by 6.6 and 8.6, respectively, which demonstrates the importance of the context information generated by CEM for final crowd counting.

We design three different structures to verify the effectiveness of each component in COMAL, as shown in Table 8, where C(N_c) represents a convolutional layer with N_c filters. From the first row and the last row of Table 8, we can see that the counting accuracy drops when a convolutional layer replaces the SEM, which demonstrates that the high-level semantic features are important for final crowd counting. Besides, comparing the second row with the third row, we find that the method with DEM performs better than the method without DEM. This is attributed to the multi-scale features generated by the DEM.
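The scale diversity credited to DEM can be made concrete with a small enumeration. Each DEB offers two routes (one or two 3 × 3 convolutions), so a cascade of three DEBs contains 2^3 = 8 distinct paths. The sketch below lists them and their receptive fields, assuming stride-1, undilated 3 × 3 convolutions (the paper does not state the stride or dilation); the function name `cascade_paths` is hypothetical.

```python
from collections import Counter
from itertools import product


def cascade_paths(num_blocks=3, branch_depths=(1, 2), kernel=3):
    """Enumerate every route through `num_blocks` cascaded two-branch blocks.

    Each route is a tuple giving the number of convolutions taken in each
    block. With stride-1, undilated convolutions (an assumption) every
    convolution widens the receptive field by `kernel - 1` pixels.
    """
    paths = []
    for choice in product(branch_depths, repeat=num_blocks):
        receptive_field = 1 + sum(choice) * (kernel - 1)
        paths.append((choice, receptive_field))
    return paths


paths = cascade_paths()
rf_histogram = Counter(rf for _, rf in paths)
for choice, rf in paths:
    print(choice, f"{rf}x{rf}")
print(dict(rf_histogram))
```

Under these assumptions the eight routes span receptive fields from 7 × 7 to 13 × 13, with several routes sharing the intermediate sizes. Adding a fourth block would double the number of routes again, which is consistent with the trade-off between diversity and network complexity examined in the DEB-number ablation.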
For the last two rows of Table 8, we can see that CEM outperforms CBAM [59] by a large margin, which further supports the effectiveness of our proposed CEM.

We explore the effect of the number of DEBs on the final counting accuracy. The comparison results are displayed in Table 9. We can see that the counting performance of COMAL improves as the number of DEBs, N_DEB, increases, and COMAL achieves the best results when N_DEB is equal to 3, which benefits from the scale diversity provided by the DEBs. However, when N_DEB is larger than 3, the counting performance drops. The reason is that more DEBs increase the complexity of the network and hinder the optimization of the counting network.

To evaluate the rationality of CEM, we explore the performance of COMAL with only PAM (COMAL w/PAM) or only CAM (COMAL w/CAM) on the ShanghaiTech Part A dataset. The quantitative results are shown in Table 10. We can see that the counting accuracy improves with the help of either CAM or PAM, and the model achieves the best results with the full CEM, which demonstrates the effectiveness of our method. The qualitative results in Fig. 8 further support the importance of CEM to the final counting accuracy.

In this paper, we propose COMAL for multi-scale crowd counting. We use the first 13 layers of VGG-16 as the encoder to extract features, and adopt the proposed decoder to process the extracted features for final density estimation. COMAL is evaluated on three challenging crowd counting datasets and achieves superior results compared with other state-of-the-art methods. However, COMAL has a large number of network parameters, which makes it unsuitable for devices with limited computational resources. Besides, we only model the spatial context information of images and do not extract the temporal information of video. Thus, in future work, we plan to extend COMAL to the video crowd counting task with a lightweight design.
Crowd counting on images with scale variation and isolated clusters
Crowdnet: A deep convolutional network for dense crowd counting
Scale aggregation network for accurate and efficient crowd counting
Gcnet: Non-local networks meet squeeze-excitation networks and beyond
Robust crowd counting based on refined density map
Privacy preserving crowd monitoring: Counting people without people models or tracking
Scale pyramid network for crowd counting
Convolutional neural networks for forecasting flood process in internet-of-things enabled smart city
Rethinking atrous convolution for semantic image segmentation
Object counting and instance segmentation with image-level supervision
An aggregated multicolumn dilated convolution network for perspective-free counting
Crowd counting by adaptively fusing predictions from an image pyramid
Cnn-based density estimation and crowd counting: A survey
Domain-adaptive crowd counting via inter-domain features segregation and gaussian-prior reconstruction
Exploring deep learning for view-based 3d model retrieval
Scar: Spatial-/channel-wise attention regression networks for crowd counting
Pcc net: Perspective crowd counting via spatial convolutional network
Feature-aware adaptation and structured density alignment for crowd counting in video surveillance
Dadnet: Dilated-attention-deformable convnet for crowd counting
Adaptive pyramid context network for semantic segmentation
Spatial pyramid pooling in deep convolutional networks for visual recognition
Crowd counting using scale-aware attention networks
Scene-adaptive accurate and fast vertical crowd counting via joint using depth and color information
Multi-source multi-scale counting in extremely dense crowd images
Composition loss for counting, density map estimation and localization in dense crowds
Crowd counting and density estimation by trellis encoder-decoder networks
C^3 framework: An open-source pytorch code for crowd counting
Smo-dnn: Spider monkey optimization and deep neural network hybrid classifier model for intrusion detection
Where are the blobs: Counting by localization with point supervision
Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes
Feature pyramid networks for object detection
Decidenet: Counting varying density crowds through attention guided detection and density estimation
Context-aware crowd counting
Point in, box out: Beyond counting persons in crowds
Recurrent attentive zooming for joint crowd counting and precise localization
Gate and common pathway detection in crowd scenes and anomaly detection using motion units and lstm predictive models
Towards perspective-free object counting with deep learning
Kumbh a case study for dense crowd counting and modeling
Learning from synthetic data for crowd counting in the wild
Pixel-wise crowd understanding via synthetic data
Density-aware curriculum learning for crowd counting
Iterative crowd counting
Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn
Switching convolutional neural network for crowd counting
Top-down feedback for crowd counting convolutional neural network
Crowd counting via adversarial cross-scale consistency pursuit
Revisiting perspective information for efficient crowd counting
Very deep convolutional networks for large-scale image recognition
Generating high-quality crowd density maps using contextual pyramid cnns
A framework for prediction and storage of battery life in iot devices using dnn and blockchain
An effective feature engineering for dnn using hybrid pca-gwo for intrusion detection in iomt architecture
Counting people by clustering person detector outputs
Convolutional neural networks for crowd behaviour analysis: a survey
Understanding convolution for semantic segmentation
Sclnet: Spatial context learning network for congested crowd counting
Deep people counting in extremely dense crowds
Improving deep crowd density estimation via pre-classification of density
Cbam: Convolutional block attention module
Rsanet: Deep recurrent scale-aware network for crowd counting
Adaptive scenario discovery for crowd counting
Learn to scale: Generating multipolar normalized density maps for crowd counting
Cross-scene crowd counting via deep convolutional neural networks
Crowd counting via scale-adaptive convolutional neural network
Exfuse: Enhancing feature fusion for semantic segmentation
Single-image crowd counting via multi-column convolutional neural network
Pyramid scene parsing network
Leveraging heterogeneous auxiliary tasks to assist crowd counting
Two stages double attention convolutional neural network for crowd counting