key: cord-0585090-jr8jrvhr
authors: Wang, Xin; Zhao, Yang; Yang, Tangwen; Ruan, Qiuqi
title: Multi-Scale Context Aggregation Network with Attention-Guided for Crowd Counting
date: 2021-04-06
journal: nan
DOI: nan
sha: e6aa78ae8a16d88bf36ae1dd5dcf0e6efb59ce98
doc_id: 585090
cord_uid: jr8jrvhr

Crowd counting aims to predict the number of people in an image and generate the corresponding density map. The task faces many challenges, including varying head scales, diverse crowd distributions across images, and cluttered backgrounds. In this paper, we propose a multi-scale context aggregation network (MSCANet) based on a single-column encoder-decoder architecture for crowd counting, which consists of an encoder built on dense context-aware modules (DCAMs) and a hierarchical attention-guided decoder. To handle scale variation, we construct the DCAM to aggregate multi-scale contextual information by densely connecting dilated convolutions with varying receptive fields. The proposed DCAM can capture rich contextual information of crowd areas due to its long-range receptive fields and dense scale sampling. Moreover, to suppress background noise and generate a high-quality density map, we adopt a hierarchical attention-guided mechanism in the decoder, which integrates more useful spatial information from the shallow feature maps of the encoder by introducing multiple supervision signals based on a semantic attention module (SAM). Extensive experiments demonstrate that the proposed approach achieves better performance than other similar state-of-the-art methods on three challenging benchmark datasets for crowd counting. The code is available at https://github.com/KingMV/MSCANet

Crowd counting has attracted much attention in recent years due to its important applications, including video surveillance and public security. In addition, it is a key technique for high-level behavior analysis algorithms, such as crowd behavior analysis and crowd gathering detection. Specifically, crowd density estimation is also beneficial for preventing the spread of the 2019-nCoV virus. However, scale variations, large crowd diversity, and background clutter remain critical challenges for crowd counting. As shown in Figure 1, head scales within an image vary with camera perspective, and crowd distributions exhibit different patterns across scenes. CNN-based methods often overestimate the density of background regions because of their complexity, as analyzed in several crowd counting surveys [1], [2]. Besides, some textured areas (such as trees and buildings) are more likely to be mistaken for crowds in the density map because their appearance is very similar to that of congested crowd areas.

Fig. 1. Samples containing scale variations and background clutter in the ShanghaiTech dataset. The red rectangles indicate human heads of different sizes. The green rectangles contain clutter similar to crowd areas, especially high-density crowds. The first column shows the original image, the second column the ground-truth density map, and the third column the density map predicted by the CSRNet method [3] from 2018.

To address the scale-variation issue, many multi-column network based methods [4]-[7] have been proposed to extract multi-scale features, where the different column networks are designed with different kernel sizes. However, these multi-column methods have a bloated structure, which leads to redundant information across subnetworks, as analyzed in CSRNet [3].
Besides, inspired by the Inception architecture, some scale-aware modules [8], [9] adopt multiple convolution kernels with different receptive fields to extract features at various scales. These modules can be plugged directly into existing single-column networks. The advantage of single-column methods is their elegant network structure and high training efficiency. However, the dilation rates of the kernels in the different columns of these scale-aware modules need to be carefully selected, which makes it challenging to capture continuously varying scales. We observe that a small object in a crowded area can be characterized by its neighboring information, and that rich contextual information at multiple scales can improve counting accuracy. Therefore, in this paper, we investigate how to extract multi-scale contextual information from different receptive fields.

The attention mechanism is usually used to direct more attention to foreground areas [10]-[14]. The common idea is to design a single attention model with a bloated structure to predict an attention map and then multiply the estimated density map by the predicted attention map. However, this architecture makes the whole model overly complicated and leads to a heavy computational burden. We observe that shallow feature maps contain considerable edge and background noise, so it is necessary to suppress this noise in the shallow feature maps.

In this paper, we propose a multi-scale context aggregation network (MSCANet) based on an encoder-decoder architecture for crowd counting, which handles scale variations and generates a high-quality density map. Specifically, we design a dense context-aware module (DCAM) to aggregate multi-scale contextual information by densely connecting dilated kernels with different receptive fields. Compared with existing scale-aware modules that only extract context at a limited number of scales, the proposed DCAM captures richer contextual information and greater scale diversity due to its larger and finer range of receptive fields. Moreover, inspired by UNet [15] and FPN [16], we present a hierarchical attention-guided decoder that hierarchically integrates different feature maps of the encoder to produce a high-quality density map. Multiple lightweight semantic attention modules (SAMs) are added to different fusion layers of the decoder, guiding the model to pay more attention to crowd areas in the shallow feature maps.

The contributions of this paper are summarized as follows:
• We propose a multi-scale context aggregation network based on an encoder-decoder architecture for crowd counting, which improves multi-scale representation and generates a high-quality density map.
• A multi-scale context aggregation module is designed to help the encoder capture rich context over a broad range of scales. Specifically, the module consists of multiple stacked dense context-aware modules (DCAMs), each of which densely connects multiple dilated kernels with different receptive fields.
• We propose a hierarchical attention-guided decoder to explicitly integrate important information from different feature maps of the encoder into a high-quality density map. Multiple SAMs are introduced at different stages of the decoder to direct more attention to crowd areas in the shallow feature maps.
• Extensive experiments are conducted on three crowd counting datasets. The results demonstrate that our proposed method achieves better performance than other similar state-of-the-art methods.
To address the issue of scale variations, researchers have proposed many multi-column methods [4], [5] that explicitly use different columns with different receptive fields to extract multi-scale features. Zhang et al. [4] first adopted a multi-column convolutional neural network (MCNN) to handle scale variations, where multi-scale features are extracted by three branches with kernels of different sizes (large, medium, small). Sam et al. [5] proposed Switch-CNN, an improvement on MCNN that introduces a switch classifier to choose the optimal branch for each input image patch. However, multi-column networks are difficult to train and each subnetwork is inefficient. Therefore, single-column deeper network architectures are usually adopted in many state-of-the-art methods. Among these single-column methods [3], [9], [6], dilated convolution kernels and deformable convolution kernels are used to build multi-scale aware modules. Li et al. [3] used multiple dilated convolutional layers in a single-column network to increase the receptive fields without losing spatial information. Chen et al. [9] proposed a Scale Pyramid Module (SPM) to extract multi-scale features, which employs dilated convolutions with different rates in parallel. Guo et al. [6] explored scale-aware attention fusion with various dilation rates to capture different visual information from crowd regions of interest, and used deformable convolutions to generate a high-quality density map. However, since the dilation rates of the kernels in the different columns are fixed in the above methods, these scale-aware methods only capture contextual information from specific receptive fields, which limits the representation of additional scales.

The goal of the attention mechanism is to make the model focus on useful information to improve counting performance. Liu et al. [12] proposed DecideNet to combine an estimated density map with a detection map through attention guidance. Liu et al. [10] designed an attention map generator to provide a mask of the crowd region, after which a multi-scale deformable network predicts the density map of the masked crowd area. Miao et al. [11] built a dense attention network based on shallow features to diminish the impact of background noise. Chen et al. [17] proposed a crowd attention convolutional network (CAT-CNN) for crowd counting, where people in the estimated density map receive more attention by encoding a confidence map. However, these existing methods design a single attention model with a complicated structure to enhance the attention on foreground areas.

The proposed method is illustrated in Figure 2. Its key components are the multi-scale context aggregation module and the hierarchical attention-guided decoder. A scale pyramid module based on dilated convolution kernels is a common technique for extracting multi-scale features, but since the receptive field of each column is fixed in advance, such methods extract only a limited number of scales. Inspired by [6], we observe that contextual information is very important for high-density crowd areas, particularly in pixel-level regression tasks. Therefore, we design a multi-scale context aggregation module that aggregates multi-scale contextual information of the crowd through dilated convolutions. The encoder contains two parts: a feature extraction backbone and a multi-scale context aggregation module composed of multiple dense context-aware modules.
1) Feature extraction backbone: Similar to previous crowd counting works [3], [10], [9], we adopt the first ten layers of the VGG-16 model [18] pre-trained on the ImageNet dataset as the backbone, for its strong transfer learning ability. The last two pooling layers and all fully connected layers of VGG-16 are removed. The backbone consists of five blocks {C1, C2, C3, C4, C5}. All dilation rates in the C5 block are set to 2, which enlarges the receptive fields and reduces the loss of feature map information.

2) Dense context-aware module: The dense context-aware module (DCAM) densely connects multiple dilated convolutions with different dilation rates to integrate contextual information from various receptive fields. It is well known that a dilated convolution with rate d enlarges the receptive field of a k × k kernel to k + (k − 1) × (d − 1) without reducing the feature map size. In [9], the dilation rates of the four column branches are usually set to 2, 4, 6, 8 or other fixed values. However, contextual diversity is then restricted by the number of branches: an overly large receptive field and overly sparse context sampling can degrade model performance. To capture detailed contextual information, we stack multiple dilated convolution layers with dense connections to enlarge the range of receptive fields. As shown in Figure 3, the proposed DCAM consists of three densely connected dilated convolution layers. Considering that head scale variations are continuous, the dilation rates of the DCAM are set to 2, 4, and 6, respectively. The output H_l of the l-th dilated convolution layer is concatenated with all outputs {H_1, H_2, ..., H_{l-1}} from the preceding layers, and the result is fed into the next dilated convolution layer. Besides, we employ a global average pooling layer F_avg to capture the global contextual information of the input X, and concatenate it with the outputs of the different dilated layers. Finally, a 1 × 1 convolution fuses the concatenated feature maps. Moreover, to handle more scale diversity, we stack three DCAMs to further enhance the fusion of different contextual information. The contextual information Y can be obtained as follows:

H_l = F_{3, d_l}([X, H_1, ..., H_{l-1}]; Φ_l),
Y = F_{1,1}([H_1, H_2, H_3, U_b(F_avg(X))]; Φ),

where F_{k,d}(·) denotes a k × k dilated convolution with dilation rate d, Φ is the parameter set of each convolution kernel, [·] denotes concatenation, and U_b is a bilinear upsampling operation.

The hierarchical attention-guided decoder (HAGD) is proposed to progressively integrate different feature maps of the encoder into a high-quality density map. Generally, features of small-scale people can be extracted in earlier layers, but these layers also contain considerable noise from background objects (trees, buildings). Therefore, it is necessary to suppress this background noise during feature fusion. Based on this motivation, we adopt an attention-guided mechanism to increase the attention on crowd areas in the shallow feature maps through a lightweight semantic attention module (SAM). Similar to the UNet structure [15], we integrate feature maps F_6 and F_3 through a feature fusion module (FM), and the fused features M_{3,6} from FM_1 are then passed to the next stage, where they are concatenated with feature map F_2 through FM_2. As shown in Figure 4, the SAM is composed of stacked CBR(m, n) blocks followed by a 1 × 1 convolution, where CBR(m, n) denotes a convolution layer with m filters of size n × n followed by a Batch Normalization layer and a ReLU layer. In the last layer, the probability that each pixel belongs to the foreground is obtained by a sigmoid function.
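To make the DCAM construction concrete, the following PyTorch sketch implements a single dense context-aware module as described above: three 3 × 3 dilated convolutions with rates 2, 4, and 6, densely connected so that each layer sees the input plus all preceding outputs, a global-average-pooling branch upsampled bilinearly, and a 1 × 1 fusion convolution. This is a minimal illustration, not the authors' released code; the channel widths (`growth`, `out_ch`) and the BN/ReLU placement are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCAM(nn.Module):
    """Dense Context-Aware Module (illustrative sketch).

    Three 3x3 dilated convolutions (rates 2, 4, 6) are densely connected:
    each layer receives the input plus all preceding outputs. A global-
    average-pooling branch supplies image-level context, and a 1x1
    convolution fuses the concatenation. Channel widths and BN/ReLU
    placement are assumptions, not taken from the paper.
    """

    def __init__(self, in_ch, growth=64, out_ch=512, rates=(2, 4, 6)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for d in rates:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=d, dilation=d),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))
            ch += growth  # dense connection: next layer sees all previous outputs
        # Global context branch: GAP -> 1x1 conv (F_avg in the paper's notation).
        self.gap = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, growth, 1),
            nn.ReLU(inplace=True),
        )
        # 1x1 fusion over [H1, H2, H3, upsampled GAP features].
        self.fuse = nn.Conv2d(len(rates) * growth + growth, out_ch, 1)

    def forward(self, x):
        feats, outs = [x], []
        for layer in self.layers:
            h = layer(torch.cat(feats, dim=1))
            outs.append(h)
            feats.append(h)
        # Bilinear upsampling of the global context (U_b in the paper).
        g = F.interpolate(self.gap(x), size=x.shape[2:],
                          mode='bilinear', align_corners=False)
        return self.fuse(torch.cat(outs + [g], dim=1))
```

Stacking three such modules on top of the truncated VGG-16 features, as the paper stacks three DCAMs, would form the multi-scale context aggregation part of the encoder.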
To suppress the background noise in the final density map, element-wise multiplication is used to fuse the feature map M_{2,3} from FM_2 with the attention map estimated by the SAM. The final density map is then obtained through a density estimation module (DME). The DME consists of 3 × 3 × 64, 3 × 3 × 32, and 1 × 1 × 1 convolutions, where each convolution layer is followed by a BN layer and a ReLU layer:

Y_p = DME(M_{2,3} ⊙ D_p),

where Y_p is the estimated density map, D_p is the foreground mask predicted by the SAM, and ⊙ represents the element-wise multiplication operation.

From a hierarchical ensemble perspective, we feed each fused feature of the decoder into a SAM to give the foreground more attention. During training, multiple supervision losses of the SAMs at different scales are computed to improve localization accuracy and reduce the difficulty of model learning. The ground-truth attention mask D_gt is obtained from the ground-truth density map Y_gt as

D_gt(x_i) = 1 if Y_gt(x_i) > t, and 0 otherwise,

where D_gt(x_i) is the ground-truth value at pixel x_i, so D_gt is a binary map, and t is a threshold set to 0.0001. To obtain ground-truth attention masks at different scales, we resize D_gt to 1/2, 1/4, and 1/8 of its original size. The SAM aims to predict the probability that each pixel belongs to the foreground, which can be seen as semantic segmentation with two labels. Therefore, we train the SAM with a binary cross-entropy loss, defined as

L_att = -(1/N) Σ_{i=1}^{N} [D_i^{gt} log(D_i^{p}) + (1 - D_i^{gt}) log(1 - D_i^{p})],

where N is the number of training samples, and D_i^{p} and D_i^{gt} are the predicted mask and the ground-truth mask of input sample i, respectively. We combine multiple SAM losses at different scales to give more attention to crowd areas during feature fusion. Besides, the Euclidean distance between the predicted density map and the ground-truth density map Y_i^{GT} is used to define the regression loss L_den. The final loss L is therefore the combination of the multiple semantic losses and the regression loss:

L = L_den + λ_1 L_att^1 + λ_2 L_att^2 + λ_3 L_att^3,

where λ_1, λ_2, λ_3 are weights that adjust the contribution of the SAMs at different scales, set to 1e-2, 1e-3, and 1e-4 in our experiments, and L_att^1, L_att^2, L_att^3 represent the attention losses of the SAMs at the different levels, respectively.

Following previous works [3], [8], [9], we use a Gaussian kernel to blur each head annotation and generate the density map of an image. A head at pixel x_i can be represented in the image by a delta function δ(x - x_i), and the density map is obtained by convolving it with a Gaussian kernel G_σ(x):

Y(x) = Σ_i δ(x - x_i) * G_{σ_i}(x), with σ_i = β · d̄_i,

where G_{σ_i}(x) is a geometry-adaptive Gaussian kernel with variance σ_i, and the variance σ_i of each head is determined by the average distance d̄_i to its k nearest neighbors. For the ShanghaiTech Part A dataset, we use the geometry-adaptive kernel to generate density maps; the average distance d̄_i is computed by KNN (K = 3) and the parameter β is set to 0.3. For ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF, σ is set to 15.

The Adam optimizer is employed to minimize the model loss for 500 epochs. The initial learning rate is set to 1e-4 and is decreased by a factor of 0.1 every 10 epochs. The weights of the first ten layers of the pre-trained VGG-16 are used to initialize the backbone, and the other parameters are initialized from a Gaussian distribution with zero mean and a standard deviation of 0.01. All training images are randomly cropped to a fixed size of 320 × 320 to train the model, and the mini-batch size is 16. Besides, the training images and density maps are horizontally flipped with a probability of 0.5.
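As a concrete illustration of the geometry-adaptive ground-truth generation described above, the following sketch builds a density map from head coordinates: each head annotation is a delta impulse blurred by a Gaussian whose σ_i is β times the average distance to its K = 3 nearest neighbors (β = 0.3), matching the ShanghaiTech Part A setting. The per-impulse filtering strategy, boundary clipping, and the fallback σ for a single head are our assumptions; the paper only specifies the δ-function convolution with G_{σ_i}.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def adaptive_density_map(points, shape, k=3, beta=0.3, fallback_sigma=15.0):
    """Geometry-adaptive ground-truth density map (illustrative sketch).

    points: (M, 2) array of (x, y) head coordinates.
    shape:  (H, W) of the output density map.
    Each head is a delta impulse blurred with a Gaussian whose sigma is
    beta times the mean distance to its k nearest neighbors, so the map
    still sums (approximately) to the head count M.
    """
    density = np.zeros(shape, dtype=np.float32)
    points = np.asarray(points, dtype=np.float64).reshape(-1, 2)
    if len(points) == 0:
        return density
    tree = KDTree(points)
    # Query k+1 neighbors: the nearest neighbor of a point is itself.
    dists, _ = tree.query(points, k=min(k + 1, len(points)))
    dists = dists.reshape(len(points), -1)
    for (x, y), d in zip(points, dists):
        ix = min(max(int(round(float(x))), 0), shape[1] - 1)
        iy = min(max(int(round(float(y))), 0), shape[0] - 1)
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[iy, ix] = 1.0
        # d[0] is the distance to the point itself (zero); average the rest.
        sigma = beta * d[1:].mean() if len(d) > 1 else fallback_sigma
        density += gaussian_filter(impulse, sigma)
    return density
```

Filtering one impulse per head is simple but slow for dense images; practical implementations usually truncate the Gaussian to a small window around each head. For Part B, UCF_CC_50, and UCF-QNRF, a fixed σ = 15 would be used instead of the adaptive one.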
In the test phase, the whole image is fed to the model to predict its density map. The MAE and RMSE metrics are used to evaluate the performance of the proposed model; they measure the accuracy and the robustness of the model, respectively. They are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |Y_i^{pred} - Y_i^{GT}|,
RMSE = sqrt((1/N) Σ_{i=1}^{N} (Y_i^{pred} - Y_i^{GT})^2),

where N is the number of test images, and Y_i^{GT} and Y_i^{pred} are the ground-truth and estimated counts for the i-th image, respectively.

1) ShanghaiTech Dataset: This dataset consists of 1198 images with a total of 330,165 annotated people and is divided into Part A and Part B. Part A includes 482 images, with 300 images for training and 182 for testing. Part B contains 716 images, with 400 for training and 316 for testing. As shown in Table I, our method achieves an MAE of 60.1 and an RMSE of 100.2, the best performance among the listed methods. This result shows that the proposed method is able to aggregate multi-scale features to address scale variations.

2) UCF_CC_50 Dataset: This dataset contains only 50 images of different sizes, with 40 images for training and 10 for testing, and the number of people per image ranges from 96 to 4633. We perform 5-fold cross-validation to evaluate our method; the results are shown in Table I.

Fig. 5. Visualization of estimated density maps on the UCF-QNRF dataset. The first row presents the input images, the second row the ground-truth density maps, the third row the density maps predicted by CSRNet [3], and the fourth row the density maps predicted by our method.

3) UCF-QNRF Dataset: This is a large-scale crowd counting dataset with a total of 1.25 million annotated heads, consisting of 1535 high-quality images with 1201 images for training and 334 for testing. Our proposed method obtains an MAE of 100.8 and an RMSE of 185.9, outperforming several state-of-the-art methods in Table I by a significant margin. This indicates that diverse contextual information can improve the counting accuracy in high-density areas. The visualizations of the estimated density maps on UCF-QNRF are shown in Figure 5. Moreover, the PSNR and SSIM metrics are used to compare the quality of the predicted density maps with other methods that focus on generating high-quality density maps. As shown in Table II, our method obtains a higher PSNR than other similar methods, indicating that the hierarchical attention-guided mechanism improves the visual quality of the density map.

In the ablation experiments, we use the first ten layers of the VGG16-BN network [18] as our backbone to generate density maps, followed by two 3 × 3 convolutional layers and one 1 × 1 convolutional layer. Our baseline model obtains a strong benchmark result after introducing batch normalization layers and the data augmentation strategy: it achieves an MAE of 72.3 and an RMSE of 118, outperforming most early methods such as MCNN [4] and Switch-CNN [5]. Besides, the backbone combined with the dense context-aware module (DCAM) achieves an improvement of 13.3% in MAE and 15.6% in RMSE. This shows that the proposed DCAM enhances the feature representation of the model and handles head scale variations better. Moreover, when we introduce the hierarchical attention-guided decoder, the quality of the predicted density map is significantly improved, by 10.1% in PSNR and 9.7% in SSIM. This demonstrates that the detailed information from the shallow feature maps helps recover more spatial information in the density map.
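For completeness, here is a minimal sketch of the MAE and RMSE metrics defined at the start of this section, computed over per-image counts obtained by summing each density map; the function name and interface are ours, not from the paper.

```python
import numpy as np

def count_metrics(pred_maps, gt_maps):
    """MAE and RMSE over per-image counts (illustrative sketch).

    pred_maps, gt_maps: sequences of 2-D density maps; the count of an
    image is the integral (sum) of its density map.
    """
    pred = np.array([p.sum() for p in pred_maps], dtype=np.float64)
    gt = np.array([g.sum() for g in gt_maps], dtype=np.float64)
    err = pred - gt
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse
```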
In this paper, we propose a novel crowd counting architecture called MSCANet, which helps to capture robust multi-scale features and generate a high-quality density map. The proposed DCAM densely connects multiple dilated kernels with different receptive fields to aggregate more diverse contextual information. To reduce the background noise in the density map, the HAGD is designed to integrate different-level shallow features of the encoder through multi-level supervision. Extensive experiments on three benchmarks show that the proposed method achieves competitive results compared with other state-of-the-art methods in terms of both counting accuracy and density map quality.

References
[1] CNN-based density estimation and crowd counting: A survey.
[2] A survey of recent advances in CNN-based single image crowd counting and density estimation.
[3] CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes.
[4] Single-image crowd counting via multi-column convolutional neural network.
[5] Switching convolutional neural network for crowd counting.
[6] DADNet: Dilated-attention-deformable ConvNet for crowd counting.
[7] Crowd counting and density estimation by trellis encoder-decoder networks.
[8] Scale aggregation network for accurate and efficient crowd counting.
[9] Scale pyramid network for crowd counting.
[10] ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding.
[11] Shallow feature based dense attention network for crowd counting.
[12] DecideNet: Counting varying density crowds through attention guided detection and density estimation.
[13] PCC Net: Perspective crowd counting via spatial convolutional network.
[14] Learning multi-level density maps for crowd counting.
[15] U-Net: Convolutional networks for biomedical image segmentation.
[16] Feature pyramid networks for object detection.
[17] Crowd counting with crowd attention convolutional neural network.
[18] Very deep convolutional networks for large-scale image recognition.
[19] Context-aware crowd counting.
[20] Learn to scale: Generating multipolar normalized density maps for crowd counting.
[21] MSPNet: Multi-supervised parallel network for crowd counting.