key: cord-0832988-yile6wg7 authors: Xie, Jinyang; Gu, Lingyu; Li, Zhonghui; Lyu, Lei title: HRANet: Hierarchical region-aware network for crowd counting date: 2022-02-02 journal: Appl Intell (Dordr) DOI: 10.1007/s10489-021-03030-w sha: 0afcae66f79aea5a80a94410d6b97844f31ea79d doc_id: 832988 cord_uid: yile6wg7 Aiming to tackle the most intractable problems of scale variation and complex backgrounds in crowd counting, we present an innovative framework called Hierarchical Region-Aware Network (HRANet) for crowd counting in this paper, which can better focus on crowd regions to accurately predict crowd density. In our implementation, first, we design a Region-Aware Module (RAM) to capture the internal differences within different regions of the feature map, thus adaptively extracting contextual features within different regions. Furthermore, we propose a Region Recalibration Module (RRM) which adopts a novel region-aware attention mechanism (RAAM) to further recalibrate the feature weights of different regions. By the integration of the above two modules, the influence of background regions can be effectively suppressed. Besides, considering the local correlations within different regions of the crowd density map, a Region Awareness Loss (RAL) is designed to reduce false identification while producing the locally consistent density map. Extensive experiments on five challenging datasets demonstrate that the proposed method significantly outperforms existing methods in terms of counting accuracy and quality of the generated density map. In addition, a series of specific experiments in crowd gathering scenes indicate that our method can be practically applied to crowd localization. Aiming to estimate the number of persons in dense and complex scenes, crowd counting is receiving increasing academic attention due to its great value in real-world applications such as public safety [1, 2] , intelligence gathering [3, 4] , emergency management [5, 6] , and data analysis [7, 8] . What is more, to prevent the spread of COVID-19 virus in public places, it can be used to perform social distance monitoring by providing effective crowd size information [9] . Motivated by the recent successful use of Convolutional Neural Network (CNN) in various computer vision tasks, most crowd counting methods are inclined to adopt CNN-based methods. Despite the impressive advancements have been achieved, these approaches suffer from two intractable problems. On the one hand, the size of individuals in the image tends to vary with the distance from the camera, which leads to the scale variation problem [10] , as illustrated in the top of Fig. 1 . On the other hand, the background regions in the crowd image have a similar appearance or color with the foreground regions, which is not conducive to accurate identification of crowd regions [11] , as illustrated in the bottom of Fig. 1 . To address the above problems, numerous solutions are put forward from different technical perspective [12, 13] , such as multi-column architectures [14, 15] , adaptive feature pyramids [16, 17] , and multiple CNNs with different sizes of receptive fields [18, 19] . In addition, considering the low quality of the input images, some methods try to pre-process the input images by introducing meta-heuristic-based image processing algorithms [20] [21] [22] [23] [24] [25] and image optimization algorithms [26, 27] , thus improving the processing efficiency of the crowd images. Though these schemes can mitigate these problems to some extent, it is still challenging to achieve high-precision crowd counting in extremely complex scenarios due to the presence of various background interferences. To be specific, firstly, the influence of background regions is not fully considered. These methods are not targeted to suppress the weight of the background region in the feature map, so that features irrelevant with the crowd can easily be extracted from the background regions [28] . In this case, it is necessary to further calibrate the weights of different regions in the feature map, thus suppressing the representation ability of the background regions. Secondly, the local correlation is ignored. Due to the presence of scale variation, different crowd regions in density map contain different scale information, which is local correlation [29] . However, the Euclidean loss commonly used by existing methods does not effectively capture local correlation, so that the generated crowd density maps have poor local continuity. Therefore, we can devise a new loss function that can intentionally learn local correlations on different regions. According to the above analysis, we propose a Hierarchical Region-Aware Network (HRANet) based on the multi-level architecture, which can adaptively recalibrate the weights of different local regions to better focus on the crowd regions. In our implementation, to capture the internal differences within different regions in crowd images, we design a Region-Aware Module (RAM), which can flexibly extract region adaptive features at each level. Additionally, to further suppress the weight of background regions in these features, we implement a region-aware attention mechanism on the Region Recalibration Module (RRM) to capture the relative influence of region features on different image locations. Based on this influence, RRM can recalibrate the weights of different regions. By combining the above two components, our HRANet can effectively suppress the influence of background regions on the counting results. In addition, we design a Region Awareness Loss (RAL) to enforce our model to learn global and local correlations within different regions. Specifically, we divide the ground-truth and the generated density map into different size sub-regions and further compute the structural similarity between these subregions, so that our method can produce locally consistent density maps and reduce false identifications. In general, the contributions of our work are threefold: (1) We design a Region-Aware Module (RAM), which captures the internal variation of the crowd image to adaptively extract contextual information of different regions. (2) We design a Region Recalibration Module (RRM), which adopt a novel region-aware attention mechanism to capture the relative influence of regional features on different image locations, thus recalibrating the weights of local regions at each level. (3) We design a Region Awareness Loss (RAL), which can enforce the proposed model to learn global and local correlations within different sizes of regions to produce locally consistent density maps. The organization of the rest of the paper is as follows. Section 2 illustrates the related work associated with our method. Section 3 describes our method in detail. Section 4 performs detailed comparative experiments. Section 5 analyzes the superiority of our method. Section 6 concludes the whole paper. Motivated by the successful use of deep learning, many CNN-based methods [30, 31] have widely developed in areas of crowd counting and have made significant gains. In this study, we focus on two categories of approaches associated with our work: multi-column based method and multi-level based method. Multi-column based methods usually adopt multiple branches with different sized convolution kernels to encode different receptive fields, so that they can capture multi-scale features. As a pioneering work that employs multiple columns, MCNN [32] extracts multi-scale features by utilizing three branches with convolution kernels of different sizes. In the same way, CP-CNN [33] adopts two branches to extract the local and global contextual information in feature maps, respectively, and further combines them to generate the density maps. DeepCount [34] learns density information from multiple branches to train a density-aware regressor. DSSI-Net [29] uses a message-passing mechanism to refine the feature information between three branches. DADNet [35] introduces dilated CNN with different rates on the multicolumn structure to capture more multi-scale contextual information. McML [36] introduces a statistical network, so that it can estimate the mutual information between different scales. In general, these multi-column based methods can effectively extract multi-scale features from different branches to accommodate crowds on different scale. However, the network architecture of each branch in these methods are almost identical, which inevitably leads to significant information redundancy. Multi-level based methods exploit the hierarchical structure of CNN to capture multi-scale features from within the backbone network. For example, ANF [37] introduces Conditional Random Fields (CRFs) to aggregate multi-level features. It also incorporates non-local attention mechanisms as inter-and intra-layer attention to expand the receptive fields between different levels. MBTTBF [38] introduces a bidirectional feature fusion method in the multi-level network to facilitate the full fusion of multi-level features. TEDNet [39] proposes a decoder with dense skip connections in the multi-level structure to hierarchically aggregate multi-level features. SaCNN [40] extracts multi-level features from the backbone FCN hierarchically and resizes them to the same size for generating crowd density maps. This kind of methods effectively extract features at different scales from the network, which significantly reduces the computational effort of the model. However, since the low-level features extracted by these methods contain more edge information, which may interfere with the final counting results. Loss functions aim to measure the difference between the predicted results and the real data, thus guiding the model to achieve better optimization results. The loss function commonly used in crowd counting is the MSE loss [41] , which is mainly used to evaluate the variability of the data. After that, considering the local similarity between different regions due to scale variation, some researches [42, 43] start to design different similarity preserving metrics [44, 45] and similarity losses [28, 39] to reduce the difference between the ground-truth and the estimated density map. DSSINet [29] designs a Dilated Multiscale Structural Similarity (DMS-SSIM) loss to learn local correlations within regions of different sizes. Specifically, the DMS-SSIM is computed by measuring the structural similarity between the multi-scale region centered at the given pixel on the estimated density map and the ground-truth. TEDNet [39] proposes a combinatorial loss to learn local correlations and spatial correlations in density maps. CFANet [28] designs a novel structural loss function (SL), which can improve both the structural similarity and counting accuracy between the estimated density map and the ground-truth. In this section, we first provide the overview of the proposed Hierarchical Region-Aware Network (HRANet) and then describe each module in detail. In our implementation, the HRANet adopts an encoderdecoder architecture base on the UNet [46] , whereby we can obtain a high-resolution estimated density map, and meanwhile retains more spatial detail information. As illustrated by Fig. 2 , our HRANet consists of four components: encoder, RAM, RRM, and decoder. To flexibly capture deeper features, we adopt the first 13 layers pre-trained VGG-16 as encoder by virtue of its adaptive structure for follow-up feature fusion. Given a crowd image I, the encoder output the multi-level features {F 1 3 , F 1 4 , F 1 5 } , which contain the contextual information at different scales. Due to the fact that feature F 1 i at each level can only encode a single size of receptive field, leading to its insufficient expression capacity. To remedy this, we design the RAM to extract wider contextual information, and the RRM to recalibrate weights for these contextual information, thus improving the expression capability of feature F 1 i . To be specific, the RAM is responsible for capturing the internal differences within different regions in multi-level features F 1 i and generating region adaptive features F 2 i . Subsequently, the RRM is utilized to further capture the relative influence of of F 2 i on each image location to recalibrate the weight of different regions, and generate final output features F 3 i . Based on the final output features, we apply multiple decoders to aggregate them hierarchically. Here, each decoder is composed of a bi-linear up-sampling operation and a 1 × 1 convolutional operation. where U bi represents the up-sampling operation, Concat () represents the concatenation operation, Conv() represents the 1 × 1 convolutional operation. Finally, we apply a set of convolutional layers to regress the aggregated features to produce the crowd density map. The detailed structure of these convolutional layers is: , where C(m,n) represents the convolutional layer, m represents the number of channels, and n represents the size of the convolutional kernel. U represents the bilinear interpolation up-sampling operation. RAM aims to adaptively extract contextual features on different regions. As illustrated in Fig. 3 , the RAM consists of two branches, which are responsible for extracting global contextual features and region adaptive features, respectively. In the first branch, we first squeeze the number of channels of F 1 i to preserve the representation power of global contextual features by utilizing a 1 × 1 convolutional layer. Hereby, we can obtain the global contextual features F2 i . In the second branch, we design a region-aware block, consisting of a 3 × 3 average pooling layer followed by a 1 × 1 convolutional layer, to capture the internal differences within different regions of the crowd image. Specifically, we first divide the multi-level features F 1 i into different regions using an average pooling operation. Then, the output features are convolved by a 1 × 1 convolution to generate the region-aware features F2 i , which contain adaptive contextual information of different regions. Afterwards, we combine the global contextual features where Concat() is the concatenation operation, Dconv() is the 3 × 3 depthwise convolutional operation. To suppress the influence of background features on counting results and further improve the representation ability of multi-level features, RRM is designed to recalibrate the weights of different regions in the feature map. The RRM is implemented with a Region-aware Attention Mechanism (RAAM), as shown in Fig. 4 . Taking region features F 2 i and multi-level features F 1 i as input, RAAM first captures the relative influence of region features on each image location by computing the contrast features F3 i : Then, we send these contrast features F3 i into a 1 × 1 convolutional network followed by a sigmoid activation layer to calculate weights W i of different regions: where Sig() is the sigmoid activation function, Conv() is the 1 × 1 convolutional operation. To obtain the recalibration features F3 i , an elemental multiplication operation between the region features F 2 i and the weights W i . where ⊗ is elemental multiplication between the weights W i and the contextual features F 2 i . Fig. 3 The details of Region-Aware Module (RAM). ⊕ represents the channel concatenation operation region-aware block Fig. 4 The details of Region-Aware Attention Mechanism (RAAM). ⊕ represents the channel concatenation operation, ⊖ represents the subtraction operation, and ⊗ represents the elemental multiplication operation Finally, we combine the recalibration feature F3 i and the multi-level feature F 1 i using the channel concatenation operation to obtain the final feature F 3 i , which contains rich global information and local region adaptive information: During training, we adopt MSE loss [41, 47] to measure the difference between the estimated density map and groundtruth, which is expressed as where N is the number of images, D i is the density map of the given image I, and D gt i is the ground-truth. In addition, considering the local correlation within different size regions of the density map, we also introduce a Region Awareness Loss (RAL) at each level to guide the proposed model for learning the scale diversity at the local regions, as shown in Fig. 5 . Specifically, we first use a 1 × 1 convolution on the output features of each level to obtain the prediction X i of of this level, and perform up-sampling operation on X i to enforce it has the same size as the ground-truth Y. Then the groundtruth Y is divided into three layers of 1 × 1, 2 × 2 , and 4 × 4 sub-regions sequentially by applying different average pooling layers. Additionally, the same operation is performed on X i . Finally, the structural similarity loss between X i and Y on each sub-region is computed separately. Thereby, the L ral is obtained by aggregating the losses on all layers: where P ave denotes the average pooling operation, k 2 j denotes the number of sub-regions, and S is the number of average pooling layers. N is the number of images. In addition, SSIM is the Structural Similarity Loss [48] . Therefore, the final loss is the summation of all the above losses. where =0.01 is the scale-specific weight, which is used to balance the influence of L i ral . In this section, we validate the performance of the proposed method on five publicly available crowd counting datasets in several aspects, and compared with existing methods to validate the novelty of the proposed method. Ground-truth generation We use a Gaussian kernel (normalized to 1) to generate ground-truth by blurring each head annotation in the crowd image. Specifically, for sparse crowd scenes, we follow the method of CSRNet [49] which uses a fixed Gaussian kernel to produce the ground-truth. In addition, for dense crowd scenes, we use an adaptive Gaussian kernel to produce ground-truth F(x): Fig. 5 The details of Region Awareness Loss(RAL). ⊕ represents the summation operations where x is the location of the pixel in crowd image, N represents the number of head annotations. x i is the target object in ground-truth , and d i is the average distance of its k nearest neighbors. We use the Gaussian kernel with parameter i to convolve x − x i , where is a constant. In our implementation, we set =0.3 and i = 3. Evaluation metrics The mean absolute error (MAE) and the mean square error (MSE) is used to evaluate the counting performance and robustness of our method respectively. They are defined as: where N is the total number of crowd images, z i is groundtruth of the i-th image, ẑ i is prediction of our model in the i-th image. MAE determines the accuracy of the crowd estimation and MSE determines the robustness of the crowd estimation. To more visually evaluate the quality of the crowd density maps generated by our method, we adopt the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity in Image (SSIM) [48] as evaluation criteria. PSNR is used to calculate the error between the corresponding pixels, SSIM is used to measure the structural similarity between the ground-truth and the estimated density map. Implementation In our implementation, the encoder of our model consists of the first 13 convolutional layers and 3 pooling layers of VGG-16, which have been pre-trained on ImageNet. For the other convolutional layers, we use a Gaussian distribution with = 0.01 to initialize them randomly. In addition, our up-sampling operation is based on bilinear interpolation. We also use Adam Optimizer to train our network for 200 epochs and learning rate is initially set to 1e-5. Shanghai Tech [32] This dataset is one of the largest publicly available benchmark. It has 1198 images and 330,165 head annotations, which is composed of Part A and Part B. The Part A has total 482 images randomly collected from the Internet, the crowd scene is relatively dense. The part B includes total 716 images taken from a commercial street, the crowd scene is relatively sparse. [47] This dataset is the first publicly available and extremely challenging benchmark dataset. It has only 50 highly dense crowd images and the number of annotations for each image varies from 94 to 4543. Since this dataset has less data, in order to effectively conduct the comparison experiments, we perform 5-fold cross-validation following the method of CSRNet [49] . UCF-QNRF [50] This dataset is one of the publicly available large scale crowd image datasets, which is consists of 1535 crowd images with 1251642 annotations. Since it has a wide range of people from 49 to 12,865 in each image, leading to its extremely challenging. WorldExpo'10 [51] This dataset is a publicly large-scale crowd counting benchmark, it includes 1,132 annotated video sequences, each of which has 50 frame rates and its resolution is 576 × 720 pixels. In the test set, these frames have 5 scenes, each scene includes 120 images. We validate the performance of our method on two datasets with relatively sparse crowd scenes: ShanghaiTech Part B [32] and WorldExpo'10 [51] , and compare our method with some existing representative methods to analyze the advantages of our method in sparse crowd scenes. ShanghaiTech Part B [32] The comparison results are shown in Table 1 . As can be seen from the table, in terms of MAE, our method improves by 10% compared with DSSI-Net which has the best performance. Similarly, in terms of MSE, our method improves by 6% compared with DSSINet. WorldExpo'10 [51] As can be seen from the table, we achieve the best average MAE performance, which is 10% higher than the second-best method. Furthermore, in scene 2, our method still outperforms the other methods, leading the second place by 5% in terms of MAE. Similarly, in scene 4, our method is 11% ahead of the second best. In summary, our method significantly outperforms existing methods in crowd scenes with relatively low crowd density, which is attributed to our Region-Aware Module (RAM). Since our RAM can adaptively capture the internal differences within different regions, which results in our method capturing local contextual information to the maximum extent, and thus performing crowd counting more accurately. We validate the performance of our method on three datasets with dense crowd scenes: ShanghaiTech Part A [30] , UCF_CC_50 [41] , and UCF-QNRF [44] , thus analyzing the advantages of our method in dense crowd scenes. ShanghaiTech Part A [32] The comparison results are shown in Table 2 . As can be seen from the table, compares with EDENet that has the best performance, our method improves the MAE by 2%, which indicates that our method has higher counting accuracy. Similarly, our method improves the MSE by 3% compared with CDADNet, which reflects the higher robustness of our method. UCF_CC_50 [47] The comparison results are shown in Table 2 . As can be seen from the table, in terms of robustness, our method achieves the second-best performance, CDADNet is 3% ahead of our method. However, in terms of counting accuracy, our method achieves optimal performance, which is 3% ahead of CDADNet. UCF-QNRF [50] The comparison is summarized in Table 2 . As can be seen from the table, our method improves MAE by 2% compared with EDENet that has the best performance. Our method also produces the best MSE, which is 8% higher than the second-best method. Moreover, compared with the recently proposed ASANet and MTL-DP, our method is 18% and 14% ahead of them in terms of MSE, respectively. The comparison results indicate that our method still has significant advantages in dense crowded regions, which reflects the effectiveness of our Region-Aware Attention Mechanism (RAAM). Since RAAM can further recalibrate the weights of different regions in the feature map, so that our method can focus on the crowd regions in the image and avoid the influence of complex background on the counting results. In this section, we evaluate the quality of the density maps produced by using PSNR and SSIM as evaluation criteria. Specifically, we perform a comparison experiment between our method and some representative methods to analyze the superiority of the proposed method in terms of the quality of the generated density map. As can be seen from Table 3 , our method obtains the PSNR of 28.31 and the SSIM of 0.89, which improves by 9% and 7% over the second-best method. It can be seen from these results that our method is significantly better than other methods, which indicates that our method can generate higher quality crowd density maps. To validate the effectiveness of each component in our network, in this section, we perform several sets of ablation experiments on the ShanghaiTech part A&B dataset [32] to analyze the influence of the different settings on the counting performance. We apply our proposed encoder and the decoder as the backbone, and apply MSE loss as the basic loss function. After that, we gradually add the corresponding modules to verify the effectiveness of these modules on the ShanghaiTec dataset. Furthermore, we visualize some density maps generated by the corresponding components to visually compare their performance, as shown in Fig. 8 . Firstly, we verify the effectiveness of RAM and the results are shown in Table 4 . Compared with Backbone individually, Backbone+RAM improves MAE by 3% and MSE by 4% on the ShanghaiTech part A dataset. In addition, Backbone+RAM improves MAE by 17% and MSE by 8% on the ShanghaiTech part B dataset. This significant improvement shows that our RAM can effectively capture the internal variation of the crowd region to extract wider contextual information. As shown in Fig. 8 , compared with Backbone, Backbone+RAM can avoid the influence of the background regions to some extent, which further demonstrates the excellent performance of our RAM in extracting local region features. Secondly, we verify the effectiveness of RRM and the results are shown in Table 4 . Compared to the above two versions, Backbone+RAM+RRM improves MAE by 10% and 21% respectively on the ShanghaiTech part A& B dataset. This substantial performance improvement fully validates the effectiveness of the RAAM. As seen in Fig. 8 , we use RAAM to further recalibrate the weights of local regions, which results in our method better focusing on the crowd regions and avoiding the influence of background regions on the final counting results. Thirdly, we verify the effectiveness of the RAL. Compared with using MSE as the loss function individually, using RAL to learn local correlations within different regions improves our model MAE by 10% and MSE by 8% on the ShanghaiTech part A. Therefore, we can conclude that using RAL as the loss function can improve the counting performance more effectively. As shown in Fig. 6 , we visually show the training process of our method with different loss functions, from which we can see that using RAL to train our method can converge faster and more smoothly. Finally, we verify the impact of parameter in the loss function on the counting performance. As shown in Fig. 7 , our method achieves the optimal performance in different datasets when is set to 0.01. Therefore, we finally choose =0.01 as the scale-specific weight to balance the impact of RAL on each level. In this section, we further analyze the differences between our method and existing methods. In addition, we also verify the generalization ability of our method in crowd localization. We visualize some density maps generated by the proposed method and some representative methods in different crowd scenes, as shown in Fig. 9 . Based on this figure, we further analyze the reasons why our method outperforms existing methods. As shown in the yellow marked regions in the figure, when dealing with crowd scenes with large crowd scale variation, the distribution of crowds in the density map generated by our method is closer to the groundtruth. It reflects the fact that our method can respond more effectively to rapid scale variation. Since the features at different levels are complementary to each other, their combination can improve the representation ability of multi-scale features. Thus, our method hierarchically combines these features, so that our method is more robust to scale variation. As shown in the red marked regions in the figure, when dealing with crowd scenes with complex background, our method can easily adapt to cluttered crowd scenes. This reflects that our method can better focus on crowd regions. Due to the presence of region-aware attention mechanism, our method can further recalibrate the weights of different regions, thus suppressing the influence of background regions on the counting results. In the same way, as shown in the last two columns in the figure, when dealing with crowd scenes with different crowd densities, our method estimates the number of pedestrians closer to the ground-truth, which is attributed to our Region-Aware Module (RAM). Our RAM can effectively extract adaptive features within different regions, so that our method can capture wider contextual information compared with existing methods. In this section, we verify the generalization ability of our method for performing crowd localization on the NWPU dataset [8] . In our implementation, we deploy our method on the IIM framework [62] which is proposed by the Gao et al. After that, we follow existing crowd localization methods and apply Recall (Rec.), instance-level Precision (Pre.), and F1-measure to validate our method. All of the above evaluation criteria are defined in [8] . As can be seen from the Fig. 11 , our method significantly outperforms the average level. This is attributed to the fact that our method can generate high-resolution feature maps, which makes our method have more spatial information to perform crowd localization. In addition, our region-aware attention mechanism further recalibrates the weights of different regions, so that our method can effectively avoid the interference of noisy information and accurately locate the heads of the pedestrian. Additionally, to illustrate the performance of the proposed method in performing crowd localization, we visualize the crowd localization maps generated by our method, as shown in Fig. 10 . In this paper, we devise a Hierarchical Region-Aware Network (HRANet) for crowd counting, which addresses the crowd scale variation problem and the complex background problem from two aspects: more accurate focus on crowd regions and adequately considering local correlations on different regions. For the former, we propose a Region-Aware Module (RAM) and a Region Recalibration Module (RRM). By integrating these two modules, our method can capture the influence of local regions on different image locations to further recalibrate the weights of different regions, thus suppressing the representation ability of background information. For the latter, we design a Region Awareness Loss (RAL), which can guide our model to learn global and local correlations in crowd density map, thus producing locally consistent density maps. Several sets of experiments on five crowd counting datasets show that proposed method can achieve more accurate counting results compared with the existing advanced methods. In addition, multiple sets of comparative experiments have fully demonstrated strong generalization ability of our method. A hybrid model of convolutional neural networks and deep regression forests for crowd counting Neuron linear transformation: Modeling the domain shift for crowd counting Strategies to utilize the positive emotional contagion optimally in crowd evacuation Iot-based positive emotional contagion for crowd evacuation Encoderdecoder based convolutional neural networks with multi-scaleaware modules for crowd counting Learning spatial awareness to improve crowd counting Arch formation-based congestion alleviation for crowd evacuation Nwpu-crowd: A largescale benchmark for crowd counting and localization Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting Pcc net: Perspective crowd counting via spatial convolutional network Cnn-based density estimation and crowd counting: A survey Nas-count: Counting-by-density with neural architecture search Effective use of convolutional neural networks and diverse deep supervision for better crowd counting Switching convolutional neural network for crowd counting Relational attention network for crowd counting Attention scaling for crowd counting Ha-ccn: Hierarchical attentionbased crowd counting network Scale pyramid network for crowd counting Pyramid-dilated deep convolutional neural network for crowd counting The arithmetic optimization algorithm Multi-modal medical image fusion based on equilibrium optimizer algorithm and local energy functions Combining gabor energy with equilibrium optimizer algorithm for multi-modality medical image fusion A novel approach based on three-scale image decomposition and marine predators algorithm for multi-modal medical image fusion A novel approach based on grasshopper optimization algorithm for medical image fusion Gradient-based optimizer: A new metaheuristic optimization algorithm An improved medical image synthesis approach based on marine predators algorithm and maximum gabor energy A new adaptive tuned social group optimization (sgo) algorithm with sigmoid-adaptive inertia weight for solving engineering design problems Coarse-and fine-grained attention network with background-aware loss for crowd density map estimation Crowd counting with deep structured scale integration network Fine-grained crowd counting Crowd counting via adversarial cross-scale consistency pursuit Single-image crowd counting via multi-column convolutional neural network Generating high-quality crowd density maps using contextual pyramid cnns Deep densityaware count regressor Dadnet: Dilated-attentiondeformable convnet for crowd counting Improving the learning of multi-column convolutional neural network for crowd counting Attentional neural fields for crowd counting Multi-level bottom-top and topbottom feature fusion for crowd counting Crowd counting and density estimation by trellis encoderdecoder networks Crowd counting via scale-adaptive convolutional neural network Learning from synthetic data for crowd counting in the wild Global and local semantics-preserving based deep hashing for cross-modal retrieval Discriminative deep metric learning for asymmetric discrete hashing Learning discrete class-specific prototypes for deep semantic hashing Correlation filteringbased hashing for fine-grained image retrieval U-net: Convolutional networks for biomedical image segmentation Multi-source multiscale counting in extremely dense crowd images Image quality assessment: from error visibility to structural similarity Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes Composition loss for counting, density map estimation and localization in dense crowds Cross-scene crowd counting via deep convolutional neural networks Crowd counting via hierarchical scale recalibration network Context-aware crowd counting Shallow feature based dense attention network for crowd counting Denet: A universal network for counting crowd with varying densities and scales Adversarial scaleadaptive neural network for crowd counting Density-aware and background-aware network for crowd counting via multi-task learning Cdadnet: Context-guided dense attentional dilated network for crowd counting Edenet: Elaborate density estimation network for crowd counting Scar: Spatial-/channel-wise attention regression networks for crowd counting Scale aggregation network for accurate and efficient crowd counting Learning independent instance maps for crowd localization Jinyang Xie was bor n in Shandong, China Acknowledgements This work was supported by the National Natural Science Foundation of China under Grant 61976127. Conflicts of interest Conflict of interest The authors declare that they have no conflict of interest.