key: cord-0134207-5fdfsw1a
authors: Lin, Hui; Ma, Zhiheng; Ji, Rongrong; Wang, Yaowei; Hong, Xiaopeng
title: Boosting Crowd Counting via Multifaceted Attention
date: 2022-03-05
journal: nan
DOI: nan
sha: 211b6cd50993de7dd77aa03231a06f8a733b41d8
doc_id: 134207
cord_uid: 5fdfsw1a

This paper focuses on the challenging crowd counting task. As large-scale variations often exist within crowd images, neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation well. To address this problem, we propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding. MAN incorporates global attention from a vanilla transformer, learnable local attention, and instance attention into a counting model. Firstly, the local Learnable Region Attention (LRA) is proposed to dynamically assign an exclusive attention region to each feature location. Secondly, we design the Local Attention Regularization to supervise the training of LRA by minimizing the deviation among the attention allocated to different feature locations. Finally, we provide an Instance Attention mechanism to focus dynamically on the most important instances during training. Extensive experiments on four challenging crowd counting datasets, namely ShanghaiTech, UCF-QNRF, JHU++, and NWPU, have validated the proposed method. Codes: https://github.com/LoraLinH/Boosting-Crowd-Counting-via-Multifaceted-Attention.

* This is the pre-print version of a CVPR22 paper.
† Corresponding author.

Crowd counting plays an essential role in congestion estimation, video surveillance, and crowd management. Especially after the outbreak of coronavirus disease (COVID-19), real-time crowd detection and counting have attracted more and more attention. In recent years, typical counting methods [20, 21, 41, 50] utilize a Convolutional Neural Network (CNN) as the backbone and regress a density map to predict the total crowd count. However, due to the wide viewing angle of cameras and the 2D perspective projection, large-scale variations often exist in crowd images. Traditional CNNs with fixed-size convolution kernels struggle to deal with these variations, and counting performance is severely limited. To alleviate this issue, multi-scale mechanisms have been designed, such as multi-scale blobs [48], pyramid networks [22], and multi-column networks. These methods introduce an intuitive local-structure inductive bias [43], suggesting that the receptive field should be adaptive to the size of objects.

Lately, the rise of Transformer models, which adopt a global self-attention mechanism, has significantly improved performance on various natural language processing tasks. Nonetheless, it was not until ViT [10] introduced patch division as a local-structure inductive bias that transformer models could compete with and even surpass CNN models on vision tasks. The development of vision transformers suggests that both the global self-attention mechanism and the local inductive bias are important for vision tasks. The study of transformer-based crowd counting is still in its preliminary stage [19, 49] and faces major challenges in introducing this local inductive bias into transformer models for crowded scenes. These models usually use fixed-size attention as in ViT, which is limited in encoding 2D local structure, as pointed out by [10], and is clearly inadequate for handling the large-scale variations of crowd images.
To solve this problem, in this paper, we improve both the structure and the training scheme of vision transformers for crowd counting from the following three perspectives.

Firstly, in response to the limitations in local region encoding, we propose the Learnable Region Attention (LRA) to emphasize the local context. Different from previous vision transformers that adopt fixed patch-division schemes, LRA can flexibly determine which local region to attend to for each feature location. As a result, the local attention module provides an efficient way of extracting the most relevant local information in the presence of scale changes. Moreover, it removes the dependence on the position embedding module of ViT, which has been shown to be inefficient at encoding local spatial relations [10].

Secondly, we propose an efficient Local Attention Regularization (LAR) method to regularize the training of the LRA module. Inspired by the recent finding on human behavior [5] that people often allocate similar attention resources to objects with similar real-world sizes, regardless of their sizes in 2D images, we require the attention allocated to each feature location to be similar. Based on this understanding, we design LAR to optimize the distribution of local attention by penalizing the deviation among the allocations. LAR enforces a small span of visual attention over crowded areas, and vice versa, for a balanced and efficient allocation of attention.

Finally, we make an attempt to apply the attention mechanism at the instance (i.e., point annotation) level in images and propose the Instance Attention module. As the point annotations provided in popular crowd benchmarks are sparse and occupy only a very small portion of each human head, there are unavoidable annotation errors. To alleviate this issue, we use Instance Attention to focus dynamically on the most important instances during training.

In summary, we propose a counting model with multifaceted attention, termed the Multifaceted Attention Network (MAN), to address large-scale variations in crowd images. The contributions are summarized as follows:
• We propose the local Learnable Region Attention to dynamically allocate an exclusive attention region to each feature location.
• We design a local region attention regularization method to supervise the training of LRA.
• We introduce an effective instance attention mechanism to dynamically select the most important instances during training.
• We perform extensive experiments on popular datasets including ShanghaiTech, UCF-QNRF, JHU++, and NWPU, and show that the proposed method makes a solid advance in counting performance.

Existing crowd counting methods can be categorized into three types: detection-based, regression-based, and point-supervised. Detection-based methods [15, 18] construct detection models to predict a bounding box for every person in the image, and the final predicted count is the number of boxes. However, their performance is limited by occlusion in congested areas and by the need for additional annotations. Regression-based methods [13, 50] predict the count by regressing to a pseudo density map generated from point annotations. Further improvements such as multi-scale mechanisms [2, 22, 30, 48], perspective estimation [44, 46], and auxiliary tasks [17, 51] have promoted the development of crowd counting. Recently, many works propose to avoid the inaccurate generation of pseudo density maps and to use point supervision directly.
BL [21] designs a loss function based on Bayesian theory, calculating the deviation of expectation for each annotated point. Further works [14, 23, 38] focus on optimal transport and measure the divergence without depending on a Gaussian-distribution assumption.

The transformer [36] has rapidly been adopted across a wide range of machine learning areas. [9] proposes Bidirectional Encoder Representations from Transformers (BERT) to enable deep bidirectional pre-training of language representations. [26] makes use of transformers to achieve strong natural language understanding through generative pre-training. [8] introduces a generalization of the Transformer model which extends its theoretical capabilities. The Vision Transformer (ViT) [10] is the first to apply a transformer architecture to vision tasks and demonstrates outstanding performance. DETR [3] further boosts the efficiency of vision transformers with a focus on object detection. More recently, these advances have enabled effective applications of transformers to various tasks: [32, 42, 53] adopt transformers for instance or semantic segmentation, [34, 52, 54] improve the accuracy and efficiency of object detection, and the properties of transformers are also leveraged for tracking in [4, 33, 39].

The self-attention module is a key component in many deep learning models, and especially in different kinds of transformers. In order to better extract relative information, some previous works endow the attention module with variable behavior. [47] proposes a flexible self-attention module which computes attention weights over words with a dynamic weight vector. DiSAN [27] introduces multi-dimensional attention and directional self-attention to perform feature-wise selection and produce context-aware representations. Longformer [1] utilizes dilated sliding-window attention to combine local and global information, and [25] enables more focused attention through dynamic differentiable windows. In vision tasks, the Swin Transformer [19] designs shifted attention windows with overlap to achieve cross-window connections. The study in [35] develops blocked local attention and attention downsampling to improve speed, memory usage, and accuracy. [45] proposes focal self-attention to capture both local and global interactions in vision transformers. A 2D version of the sliding-window attention of Longformer [1] is introduced in [49] to achieve linear complexity w.r.t. the number of tokens. We extend these variable attentions to a learnable one, motivated by the large-scale variations in crowd images. Our proposed 2D Learnable Region Attention (LRA) breaks the constraint of traditional fixed-size local attention windows in vision tasks and is robust to scale variations.

In this section, we elaborate on the Multifaceted Attention Network, which consists of three major components: the Learnable Region Attention (LRA), the Local Attention Regularization (LAR), and the Instance Attention Loss. Figure 1 presents an overview of the framework. For each image I, we first use VGG-19 [28] as our backbone to extract features F ∈ R^{C×W×H}, where C, W, and H are the number of channels, the width, and the height, respectively. The feature map is then flattened and fed into a transformer encoder equipped with the proposed LRA to learn features that are robust to various scales. Afterwards, a regression decoder predicts the final density map D ∈ R^{W×H} from these features.
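To make this pipeline concrete, the following is a minimal PyTorch sketch of the backbone–encoder–decoder flow, not the authors' released implementation: the encoder below is a plain transformer encoder used as a stand-in for the LRA-equipped encoder, and the decoder channel widths are assumptions rather than values taken from the paper.

```python
# Minimal sketch (assumptions noted above) of the MAN forward pipeline:
# VGG-19 features -> flattened tokens -> transformer encoder -> density map.
import torch
import torch.nn as nn
from torchvision.models import vgg19


class MANPipelineSketch(nn.Module):
    def __init__(self, d_model=512, num_layers=4):
        super().__init__()
        # Convolutional part of VGG-19 (load ImageNet-pretrained weights in practice).
        self.backbone = vgg19().features
        # Stand-in for the LRA-equipped encoder: a vanilla transformer encoder
        # with T = 4 layers, matching the encoder depth reported in the paper.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=num_layers)
        # Regression decoder: upsampling + two 3x3 convs + one 1x1 conv, with ReLU.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(d_model, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1), nn.ReLU(inplace=True))

    def forward(self, img):
        f = self.backbone(img)                 # F: (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.encoder(tokens)          # scale-aware features
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(f)                 # predicted density map D
```

Summing the predicted density map gives the estimated count, e.g. `MANPipelineSketch()(torch.rand(1, 3, 512, 512)).sum()`.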
Finally, we use a Local Attention Regularization dedicated to supervising the training of the LRA module and an Instance Attention Loss to constrain the training of the whole network.

The traditional transformer network [36] adopts a self-attention layer in the encoder. It connects all pairs of input and output positions and thus considers the global relations among the current features. It is computed by

A = S((Q W^Q)(K W^K)^T / √d) V W^V,   (1)

where S is the softmax function and 1/√d is a scaling factor based on the vector dimension d. W^Q, W^K, W^V ∈ R^{d×d} are weight matrices for the projections, and Q, K, V, which are derived from the source features, stand for the query, key, and value vectors, respectively. However, since it regards the input as an unordered sequence and indiscriminately considers all correlations among features, the global attention model is position-agnostic [7]. Therefore, we propose a learnable local region attention to take spatial information into account and enable more focused attention.

As fixed-size convolution kernels and pre-designed attention patterns [19, 49] are insufficient for learning cross-scale spatial information, we seek to design a mechanism by which each feature learns to attend to the most suitable local region. Specifically, since a rectangular region can be identified by two vertices, we begin with a region filter mechanism to obtain the exclusive region for each position. We first define two filter functions of position p = (x_p, y_p), where 0 ≤ x_p < W and 0 ≤ y_p < H, as

f^{bl}(p, b) = 1 if x_p ≥ x_b and y_p ≥ y_b, and 0 otherwise;
f^{ur}(p, u) = 1 if x_p ≤ x_u and y_p ≤ y_u, and 0 otherwise.   (2)

Given two predicted vertices, the bottom-left b = (x_b, y_b) and the upper-right u = (x_u, y_u), for a specific feature, the filter regions for it can be calculated by

R^{bl}(p) = f^{bl}(p, b),   R^{ur}(p) = f^{ur}(p, u).   (3)

After that, taking the Hadamard product of the two filter regions, R = R^{bl} • R^{ur}, the region map R ∈ R^{W×H} can finally be expressed as

R(p) = f^{bl}(p, b) · f^{ur}(p, u).   (4)

In particular, when global attention is adopted, R is an all-ones matrix.

It is worth noting that, under the above filter mechanism, each exclusive region depends entirely on only two discrete points, which is not learnable and lacks flexibility. Therefore, to explore local relations further and improve the effectiveness of learning, we redesign the region filter mechanism based on coverage probability projections. First, given the query and key vectors Q, K ∈ R^{WH×d}, we replace the two binary filter regions by introducing two coverage probability maps C^1 and C^2, each predicted from Q and K through a softmax function. To obtain a 2D learnable attention map, C^1 and C^2 are reshaped to order-3 tensors of size R^{WH×W×H}. For each i ∈ {1, . . . , WH} along the first axis of C^1 and C^2, there are two corresponding probability maps C^1_i, C^2_i ∈ R^{W×H}, that is, C^1_i = C^1(i, :, :) and C^2_i = C^2(i, :, :).

We then redesign the filter region maps by following the Cumulative Distribution Function (CDF) w.r.t. two different directions, namely from the bottom left (bl) to the upper right (ur) and the opposite. More specifically, given a 2D probability map C_i, for any position p = (x_p, y_p) with 0 ≤ x_p < W and 0 ≤ y_p < H, we have

R̂^{bl}_i(p) = Σ_{x ≤ x_p} Σ_{y ≤ y_p} C_i(x, y),   R̂^{ur}_i(p) = Σ_{x ≥ x_p} Σ_{y ≥ y_p} C_i(x, y),

and the region map is formed as the Hadamard product (•) of two such cumulative maps. Nonetheless, since the two probability maps are computed by a softmax function, the cumulative distributions R̂^{bl}_i and R̂^{ur}_i may contain a large number of zero values. The resulting region map, the product of the two, will then be trivial, as illustrated by the first row of Figure 2.
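As an illustration only (not the paper's exact formulation: which of C^1 and C^2 feeds which cumulative direction, and how the subsequent reverse is implemented, are assumptions here), the two cumulative maps can be computed with prefix sums, e.g. in PyTorch:

```python
# Sketch of the CDF-style filter regions described above. Shapes follow the
# text: c1 and c2 each hold WH probability maps of size (W, H).
import torch

def cdf_bl(c):
    # Cumulative sum from the bottom-left corner up to each position p.
    return torch.cumsum(torch.cumsum(c, dim=-2), dim=-1)

def cdf_ur(c):
    # Cumulative sum from each position p up to the upper-right corner
    # (suffix sum, via flipping, prefix-summing, and flipping back).
    flipped = torch.flip(c, dims=(-2, -1))
    return torch.flip(torch.cumsum(torch.cumsum(flipped, dim=-2), dim=-1),
                      dims=(-2, -1))

def naive_region_map(c1, c2):
    # Hadamard product of the two cumulative maps. As noted in the text, this
    # naive product can be nearly all-zero when the softmax mass of c1 and c2
    # does not overlap; the reverse described next addresses exactly this.
    return cdf_bl(c1) * cdf_ur(c2)

# Example: a 4x4 feature grid, so WH = 16 region maps of size (4, 4).
c1 = torch.softmax(torch.rand(16, 16), dim=-1).reshape(16, 4, 4)
c2 = torch.softmax(torch.rand(16, 16), dim=-1).reshape(16, 4, 4)
regions = naive_region_map(c1, c2)   # (16, 4, 4), values in [0, 1]
```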
Therefore, we perform a reverse so that, no matter which cumulative direction is chosen, the cumulative maps derived from C^1_i and C^2_i have a nontrivial overlap, as shown in Figure 2. The complete region map R̃_i is obtained by combining the two reversed cumulative maps. After obtaining the learnable region map R̃, we incorporate it into the attention module to form the learnable local attention: with W^Q_loc, W^K_loc ∈ R^{d×d} being the parameter matrices specific to local attention, the LRA output is computed by modulating the attention between Q W^Q_loc and K W^K_loc with the region map R̃. Compared with R in Eq. 4, which relies on discrete vertex points, R̃ takes the form of differentiable parameter arrays. The proposed learnable region attention mechanism is thus trainable and more flexible in determining the attended regions. The global attention is then defined by sharing the same refined value vectors with LRA. Finally, the output of the complete attention module is a combination of the global attention and our proposed learnable region attention (LRA).

We take inspiration from the recent finding in the study of human behavior that the human visual system usually pays similar attention to objects with similar real-world sizes [5]. To mimic this phenomenon, we design the local region attention regularization module to supervise the training of the learnable local region attention module. The goal is to balance the distributions of local attention and penalize the deviation among the attention allocated to different local regions. More specifically, given a predicted learnable region map R_i ∈ R^{W×H} and a feature map F ∈ R^{C×W×H}, we compute the attention-weighted feature, which can be formulated as a double tensor contraction [6, 24] of the second and third modes of F with the first and second modes of R_i:

v_i = Σ_{x} Σ_{y} R_i(x, y) F(:, x, y) ∈ R^C.

To keep the allocation of attention resources consistent across regions, we regularize the local attention by minimizing the variance among the weighted features: the regularization term L_LAR accumulates a deviation penalty G(v_i, E) over all feature locations i, where E is the mean allocation of attention resources over all local attention regions and G is the deviation penalty between two weighted features. The scheme of Local Attention Regularization is shown in Figure 3.

For optimizing the entire network, we provide the Instance Attention Loss. As the ground truth provided in popular crowd benchmarks takes the form of sparse point annotations that occupy only a very small portion of each human head, such human-labeled annotations inevitably contain spatial errors. To alleviate the negative influence of annotation noise, we impose a dynamic selection mechanism named Instance Attention, considering that the trained model sometimes predicts more correct signals than the annotations. The mechanism weighs the per-instance deviations e = [ε_j]_{j=1}^{N} between predictions and labels with an instance attention mask m, so that only the selected instances contribute to the loss. For example, with an MSE loss, N equals the size of the density map, while with the Bayesian Loss [21] (BL), N equals the number of annotated points. Considering performance and robustness, we choose BL as the deviation function:

ε_j = |1 − Σ_p Prob_j(p) D(p)|,

where j indexes the annotated points, D is the final predicted density map, and Prob_j(p) is the posterior probability of the occurrence of the j-th annotation given position p. The instance attention mask m provides a mechanism to select or weigh the instances. We regard the deviation ε_j between a prediction and its label as a kind of label uncertainty: if ε_j is too large, the label of that instance is contradicted, and we should dynamically reduce its importance or exclude the instance from back-propagation.
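Before turning to the efficient binary form of the instance attention mask described next, here is a minimal sketch of the LAR term above, under stated assumptions: the deviation penalty G is not spelled out here, so a squared Euclidean distance to the mean allocation is used as a stand-in.

```python
# Sketch of Local Attention Regularization: penalize the deviation of each
# attention-weighted feature from the mean allocation (G assumed squared L2).
import torch

def lar_loss(region_maps, features):
    # region_maps: (N, W, H), one learnable region map per feature location (N = W*H)
    # features:    (C, W, H), the feature map F
    # Double tensor contraction over the spatial modes: v_i = sum_{x,y} R_i(x, y) F(:, x, y)
    weighted = torch.einsum('nwh,cwh->nc', region_maps, features)  # (N, C)
    mean_alloc = weighted.mean(dim=0, keepdim=True)                # E, the mean allocation
    return ((weighted - mean_alloc) ** 2).sum(dim=1).mean()
```

In training, a term of this kind would be added to the counting loss with the balance weight λ reported in the implementation details.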
For efficient computation, we adopt m as a binary vector. We first obtain the indices that sort the deviations in ascending order, k = sortID(e). Then m is given by

m_{k_j} = 1 if j ≤ δ · N, and 0 otherwise,

where δ is the threshold. Clearly, in the normal case, m = [1]^N and δ = 1.0. In the experiments, we set δ = 0.9, which means that only the 90% of annotated points with the lowest deviations from the prediction are involved in supervision. Finally, the overall loss function of MAN combines the masked instance-level deviations with the LAR term, weighted by a balance parameter λ.

(Table 1 caption: BL [21] serves as our baseline. The best performance is shown in bold and the second best is underlined.)

Network Structure: We adopt VGG-19, pre-trained on ImageNet, as our CNN backbone. We follow [36] for the structure of the transformer encoder and replace its attention module with our proposed LRA. Specifically, as LRA is spatially aware, the feature map is fed directly into the encoder without position encoding. Our regression decoder consists of an upsampling layer and three convolution layers with ReLU activation; the kernel sizes of the first two layers are 3 × 3 and that of the last is 1 × 1.

Training Details: We first apply random scaling and horizontal flipping to each training image and then randomly crop patches of size 512 × 512. As some images in ShanghaiTech A have lower resolution, the crop size for this dataset is changed to 256 × 256. We also limit the shorter side of each image to at most 2048 pixels in all datasets. We use the Adam algorithm [12] with a learning rate of 10^{-5} to optimize the parameters. We set the number of encoder layers T to 4 and the loss-balancing parameter λ to 100.

Experiments for evaluation are conducted on four of the largest crowd counting datasets: ShanghaiTech [50], UCF-QNRF [11], JHU-Crowd++ [29], and NWPU-CROWD [40].

Evaluation Metrics: We evaluate counting methods by two commonly used metrics, the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), defined as

MAE = (1 / M) Σ_{i=1}^{M} |N_i − N^{gt}_i|,   MSE = sqrt((1 / M) Σ_{i=1}^{M} (N_i − N^{gt}_i)^2),

where M is the number of test images, and N^{gt}_i and N_i are the ground-truth and estimated counts of the i-th image, respectively. MAE reflects the accuracy of a method, while MSE reflects its robustness; lower values of both indicate better performance [50].

We evaluate our model on the above four datasets and list thirteen recent state-of-the-art methods for comparison. BL [21] serves as our baseline. The quantitative counting results are listed in Table 1. As the results show, our MAN achieves high accuracy on all four benchmark datasets. On UCF-QNRF, MAN improves the MAE and MSE of the second-best method S3 [14] from 80.6 to 77.3 and from 139.8 to 131.5, respectively. On JHU++, it improves these two values from 59.4 to 53.4 and from 244.0 to 209.9, respectively. Compared with BL, MAN significantly boosts counting accuracy on all four datasets: the improvements are 9.6% and 11.3% in MAE and MSE on ShanghaiTech A, 12.9% and 15.1% on UCF-QNRF, 28.8% and 30.0% on JHU-Crowd++, and 27.4% and 28.9% on NWPU-CROWD. Visualizations of our MAN are shown in Figure 6.

Ablation Studies: We perform ablation studies on UCF-QNRF and provide quantitative results in Table 2. We start from the BL baseline [21] and then test the contribution of the vanilla transformer encoder [36]: MAE and MSE are reduced by 3.9% and 3.4%, respectively. Adding the Instance Attention Loss (IAL) improves the baseline by 4.5% and 2.5%. The combination of the transformer and IAL further boosts counting accuracy.
We then replace the vanilla attention module with our proposed LRA: performance improves by 2.7% and 3.5% without IAL, and by 1.8% and 5.7% with IAL. However, it is worth noting that when LRA is adopted alone, without global attention, performance drops, indicating that both global and local information are important. Finally, when LAR is also adopted, the best performance is achieved, boosting the counting accuracy of BL by 12.9% and 15.1% in MAE and MSE, respectively.

We conduct experiments to understand the parameter selection of the proposed Instance Attention Loss, comparing counting accuracy under different thresholds on UCF-QNRF; the results are shown in Figure 5. We take the Multifaceted Attention Network without the Instance Attention Loss as the baseline, which corresponds to δ = 1. We observe that counting accuracy is best at δ = 0.9, i.e., when the model cuts off the 10% of annotations with the largest deviations from the prediction. As δ is set smaller, the accuracy of MAN declines noticeably, which can be explained by insufficient use of the ground truth, leaving the model only weakly supervised. For 0.8 ≤ δ < 1, the results are much better than supervision with all annotations. This may suggest that roughly 20% of the annotations negatively influence counting performance when used in training, and our Instance Attention Loss reduces this negative influence conveniently and effectively.

Visualizations of LRA: Figure 4 presents a visualization of the region mask R_i in Learnable Region Attention (LRA), where the location of the corresponding feature i is marked by a white circle. Supervised by the Local Attention Regularization, LRA is able to balance the allocation of attention resources. As can be seen, the attention region gradually narrows as the scale of the attended crowd becomes smaller, so that the number of people requiring attention in each region stays roughly the same. This resembles humans' efficient deployment of attention resources and supports the usefulness of our LRA and LAR.

Table 3 reports a comparison of model size, floating-point operations (FLOPs) computed on one 384 × 384 input image, and inference time for 1024 × 1024 images. The model size and inference time of MAN are close to those of VGG19+Trans and much smaller than those of ViT-B. Moreover, MAN and VGG19+Trans differ only marginally in FLOPs. This shows that the proposed components are lightweight compared with vanilla transformers.

This paper aims to enhance the ability of transformers to encode local spatial context for crowd counting. We contribute to the structure of transformers by proposing a Learnable Region Attention module. We also improve the training pipeline by designing Local Attention Regularization to balance the attention allocated to each proposed region, and by introducing the Instance Attention Loss to reduce the influence of label noise. The proposed Multifaceted Attention Network achieves state-of-the-art performance on four crowd counting datasets. As future work, we will consider applying our model to a wider range of vision tasks.
Figure 6. Visualizations on UCF-QNRF. The first and third rows are input images, while the second and fourth rows are the corresponding density maps predicted by our MAN. Warmer colors indicate higher density.

[1] Longformer: The long-document transformer.
[2] Scale aggregation network for accurate and efficient crowd counting.
[3] End-to-end object detection with transformers.
[4] Transformer tracking.
[5] Attention scales according to inferred real-world object size.
[6] Tensors: a brief introduction.
[7] ConViT: Improving vision transformers with soft convolutional inductive biases.
[8] Universal transformers.
[9] Pre-training of deep bidirectional transformers for language understanding.
[10] An image is worth 16x16 words: Transformers for image recognition at scale.
[11] Composition loss for counting, density map estimation and localization in dense crowds.
[12] Adam: A method for stochastic optimization.
[13] CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes.
[14] Direct measure matching for crowd counting.
[15] DecideNet: Counting varying density crowds through attention guided detection and density estimation.
[16] Context-aware crowd counting.
[17] Semi-supervised crowd counting via self-training on surrogate tasks.
[18] Point in, box out: Beyond counting persons in crowds.
[19] Swin transformer: Hierarchical vision transformer using shifted windows.
[20] Towards a universal model for cross-dataset crowd counting.
[21] Bayesian loss for crowd count estimation with point supervision.
[22] Learning scales from points: A scale-aware probabilistic model for crowd counting.
[23] Learning to count via unbalanced optimal transport.
[24] Using double contractions to derive the structure of slice-wise multiplications of tensors with applications to semi-blind MIMO OFDM.
[25] Differentiable window for dynamic local attention.
[26] Improving language understanding by generative pre-training.
[27] DiSAN: Directional self-attention network for RNN/CNN-free language understanding.
[28] Very deep convolutional networks for large-scale image recognition.
[29] JHU-Crowd++: Large-scale crowd counting dataset and a benchmark method.
[30] Generating high-quality crowd density maps using contextual pyramid CNNs.
[31] Rethinking counting and localization in crowds: A purely point-based framework.
[32] Segmenter: Transformer for semantic segmentation.
[33] TransTrack: Multiple-object tracking with transformer.
[34] Rethinking transformer-based set prediction for object detection.
[35] Scaling local self-attention for parameter efficient visual backbones.
[36] Attention is all you need.
[37] A generalized loss function for crowd counting and localization.
[38] Distribution matching for crowd counting.
[39] Transformer meets tracker: Exploiting temporal context for robust visual tracking.
[40] NWPU-Crowd: A large-scale benchmark for crowd counting and localization.
[41] ECCNAS: Efficient crowd counting neural architecture search.
[42] End-to-end video instance segmentation with transformers.
[43] Vision transformer advanced by exploring intrinsic inductive bias.
[44] Perspective-guided convolution networks for crowd counting.
[45] Focal self-attention for local-global interactions in vision transformers.
[46] Reverse perspective network for perspective-aware object counting.
[47] Dynamic self-attention: Computing attention over words dynamically for sentence embedding.
[48] Multi-scale convolutional neural networks for crowd counting.
[49] Multi-scale vision longformer: A new vision transformer for high-resolution image encoding.
[50] Single-image crowd counting via multi-column convolutional neural network.
[51] Leveraging heterogeneous auxiliary tasks to assist crowd counting.
[52] End-to-end object detection with adaptive clustering transformer.
[53] Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers.
[54] Deformable DETR: Deformable transformers for end-to-end object detection.