title: S$^2$FPR: Crowd Counting via Self-Supervised Coarse to Fine Feature Pyramid Ranking
authors: Gao, Jiaqi; Huang, Zhizhong; Lei, Yiming; Wang, James Z.; Wang, Fei-Yue; Zhang, Junping
date: 2022-01-13 (arXiv:2201.04819v1 [cs.CV])

Most conventional crowd counting methods utilize a fully-supervised learning framework to learn a mapping between scene images and crowd density maps. Under such fully-supervised training settings, a large quantity of expensive and time-consuming pixel-level annotations is required to generate the density maps used as supervision. One way to reduce costly labeling is to exploit self-structural information and inner relations among unlabeled images. Unlike previous methods that utilize these relations and structural information at the original image level, we explore such self-relations in the latent feature spaces, where richer relations and structural information can be extracted. Specifically, we propose S$^2$FPR, which extracts structural information and learns partial orders of coarse-to-fine pyramid features in the latent space for better crowd counting with massive unlabeled images. In addition, we collect a new unlabeled crowd counting dataset (FUDAN-UCC) with 4,000 images in total for training. One by-product is that our proposed S$^2$FPR method can leverage numerous partial orders in the latent space among unlabeled images to strengthen the model's representation capability and reduce estimation errors for the crowd counting task. Extensive experiments on four benchmark datasets, i.e., UCF-QNRF, ShanghaiTech PartA and PartB, and UCF-CC-50, show the effectiveness of our method compared with previous semi-supervised methods. The source code and dataset are available at https://github.com/bridgeqiqi/S2FPR.

Crowd counting has broad applications in traffic control, public safety surveillance, and smart city planning, such as preventing stampedes and estimating participation in rallies or parades. In a pandemic such as COVID-19, effective crowd counting can help authorities determine whether social distancing can still be maintained in a given public space. The goal of crowd counting is to estimate the number of people in a given image or video sequence, especially in crowded scenes.

Fig. 1. Motivation. $F_{V_1}$ and $F_{V_2}$ are feature patches cropped from feature maps in the latent space. Their corresponding regions in the input space are $I_1$ and $I_2$, respectively, according to the receptive field. Thus, it should be guaranteed that the output counts $g(F_{V_2})$ predicted from larger patches in the latent space are no fewer than the counts $g(F_{V_1})$ predicted from their sub-patches in the same feature map.

Although crowd counting has been an active research area in recent years, it remains challenging due to the influence of many extrinsic factors such as occlusion, illumination, head size variation, diverse perspectives, and non-uniform distribution. Earlier counting methods were based on detection [1], [2]. They mainly focused on designing a robust human-pose or human-body detector to count pedestrians in a given scene using a sliding-window template matching trick [2]. Consequently, the precision of counting depends to a great extent on the performance of those detectors, which are computationally expensive and time-consuming.
An alternative way is to regard crowd counting as a regression task [3], [4], [5] by building a mapping from the crowd image to the final count. Although these two approaches work well in sparse scenes, and several state-of-the-art object detection methods [6], [7], [8], [9] improve counting accuracy with the help of multi-scale features, performance is still limited by occlusion, congested scenes, and the tiny head sizes far from the camera. Benefiting from the strong representation learning ability of convolutional neural networks (CNNs) [10], [11], [12], [13], [14], CNN-based methods [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31] are employed to predict a density map of a still image, because the density map contains more spatial information about the distribution of people and its integral equals the number of people in the image. For example, multi-branch architectures [16], [18], [17], [19] are designed to extract multi-scale features and detect heads of varying sizes, because different-sized convolutional filters have varying receptive fields, which are more useful for learning non-uniform crowd distributions. To avoid the similar features and redundant patterns extracted by multi-column backbones, CSRNet [22] shows that a deeper single-column CNN can outperform those multi-column models. Further, MLCNN [27] fuses features extracted from different layers of a single deep network to generate more accurate density maps.

Although most existing CNN-based supervised learning methods have achieved good results on public datasets [32], [16], [33], [34], a large quantity of pixel-level or box-level annotations must be manually labeled, which can be prohibitively costly and time-consuming, especially when the scenes are highly crowded. For instance, it can take over thirty minutes for an adult to annotate a single image of an immensely congested scene containing more than 2,000 people. One possible solution [35] is to utilize a game environment to synthesize a new dataset for crowd counting: models are initially trained on the synthetic dataset in a supervised way and then fine-tuned on the real dataset. However, there may be physical differences between the characters in the game and pedestrians in the real world. Other solutions [36], [37], [38], [39] leverage limited labeled images and abundant unlabeled images for semi-supervised crowd counting. More specifically, auxiliary tasks such as Gaussian processes [38] and segmentation surrogate tasks [39] are used to generate pseudo-labels for unlabeled images, helping the feature extractor learn more robust representations and the crowd counter learn a more discriminative decoder. The L2R methods [36], [37] exploit the structural information of unlabeled images at the image level to help the counter predict more accurate density maps. However, the quality of the generated pseudo-labels depends greatly on the model capacity and representation learning ability, surrogate tasks may introduce extra computational cost and parameters, and simple constraints among different scales of images have only limited power to reduce estimation errors.
To address the aforementioned issues, in this paper we propose a novel semi-supervised learning method called S$^2$FPR, which leverages partial orders of coarse-to-fine pyramid features from different stages among unlabeled images to predict more accurate density maps from limited labeled samples. Items in intermediate feature maps from different layers represent their corresponding regions (receptive fields) in the original input image. These intermediate features taken from unlabeled images can drive the model to learn more general and robust representations. Meanwhile, our method further improves the sample efficiency on unlabeled data, since its ranking pairs are built from features at different scales and are at least three times as numerous as the image-level pairs in L2R. In addition, to help the model learn task-specific representations, we propose a new Unlabeled Crowd Counting dataset (FUDAN-UCC), which contains 4,000 images from the image search engine GettyImages and serves as the unlabeled dataset throughout training. Our main contributions can be summarized as follows.

• We propose a coarse-to-fine feature margin ranking loss for semi-supervised crowd counting that utilizes partial orders and structural information among unlabeled images to help the model estimate counts more accurately with limited labeled images. The proposed coarse-to-fine ranking loss at the feature level is simple, intuitive, and easy to implement.

• We construct a new large unlabeled crowd counting dataset (FUDAN-UCC) from the Internet, which contains 4,000 high-resolution images of congested scenes. We believe this new unlabeled dataset can significantly facilitate the development of the semi-supervised crowd counting community.

• We demonstrate that our method outperforms other semi-supervised crowd counting methods on several benchmark datasets by clear margins.

In this section, we review previous crowd counting approaches, including traditional methods based on hand-crafted features and deep-learning-based models. To learn to count from limited labeled images, we also review recent semi-supervised, weakly-supervised, and self-supervised learning methods for crowd counting.

Detection-based Methods: Some earlier works concentrate on detecting pedestrians one by one for counting. The detector is trained on classical hand-crafted features, such as SIFT, HOG, and edges, extracted from the whole or parts of human bodies. [40], [1], [41], [42] extract general features from the whole body to train a classifier such as an SVM, boosting, or random forests. However, they achieve limited performance under heavy occlusion. By contrast, body-part features [43], [44], [45], e.g., heads and shoulders, can improve accuracy to some extent. Nevertheless, counting-by-detection methods only work well for sparse scenes because of their sensitivity to severe occlusion and density variation.

Regression-based Methods: Regression-based approaches focus on enhancing the ability to estimate global counts for crowd counting. They typically learn a mapping from local and global features, extracted from local regions or whole images, to the overall count.
These methods can be divided into two steps: i) extracting useful features, including foreground features, textures, corners, histograms of oriented gradients (HOG), and local binary patterns (LBP); and ii) training a regression model, such as linear regression, ridge regression, Bayesian Poisson regression, or Gaussian process regression, on the features extracted in step i). Nevertheless, these two conventional counting paradigms mainly rely on hand-crafted features and may not perform well in extremely crowded scenes. Moreover, both ignore the pedestrian distribution, which also limits their performance.

To better learn the spatial distribution of people in the scene, Lempitsky et al. [46] proposed to predict a density map instead of regressing a scalar count. A density map approximately reflects the distribution of people, and its integral is equal to the number of people in a given image. Most off-the-shelf deep-learning-based methods are built upon stacks of convolution operations to regress crowd counts or density maps. More specifically, Wang et al. [47] used an AlexNet-like architecture to predict the number of people in highly crowded scenes. Considering perspective information, Zhang et al. [15] achieved better counting on unseen images in cross-scene settings. Zhang et al. [16] further proposed the MCNN architecture, which contains multi-column convolutional layers with different kernel sizes, to resolve the scale variation issue for crowd counting. After that, many multi-column models were put forward. Sam et al. [18] designed a switchable network for training patches with different density levels. Sindagi et al. [17] introduced local and global contextual information to help the network generate high-quality density maps. Observing that features extracted from different columns usually have similar and redundant patterns, Li et al. [22] proposed a deeper single-column network with dilated convolutions to address these issues and achieve better performance. Jiang et al. [27] fused features from different layers by simple concatenation in a CNN to obtain a multi-scale feature representation. Besides, Liu et al. [23] mixed scale-aware contextual features and perspective maps to estimate density maps, while Cao et al. [25] and Chen et al. [26] designed a scale aggregation network and a scale pyramid network, respectively, to tackle the scale variation problem. Zhao et al. [29] distilled perspective information with a depth-embedded module to learn better scale-aware representations for crowd counting. Jiang et al. [48] obtained segmentation masks according to regions of different density levels and introduced a scaling factor to jointly estimate people counts. Yang et al. [49] uniformly warped the input images through a perspective transformation to force head sizes at different locations to the same scale. Furthermore, to correct small errors in the ground truth caused by the empirically chosen parameter σ, Wan et al. [50], [51] utilized kernel-based density maps to refine the final density map. Bai et al. [52] self-corrected the density map with an EM algorithm. ZoomCount [28] proposed a zooming mechanism to tackle the underestimation and overestimation issues caused by density variation. Adversarial networks [53], [54], [55] have also been used in crowd counting to generate high-quality density maps. Sindagi et al.
[56] fused multi-level bottom-top and top-bottom features to resolve scale variation for crowd counting. Xiong et al. [57] divided feature maps into several grids and counted hierarchically. Ma et al. [58] proposed a Bayesian loss that learns an expectation of the people distribution from point supervision instead of generated density maps.

Fully-supervised counting methods require massive pixel-level annotations, which can be prohibitively costly. A synthetic dataset [35] constructed in the GTA5 game environment may relieve this data-hungry issue. Another way is to leverage the huge amount of unlabeled crowd images to help the model learn task-specific representations for better crowd counting. For instance, Change et al. [59] implemented a unified active and semi-supervised regression framework to exploit the manifold structure of images. There are two main strategies in semi-supervised settings: a) generating a set of reliable pseudo-labels for unlabeled images and then tuning the model in a supervised way, as in Sindagi et al. [38] and Liu et al. [39], who employed a Gaussian process method and surrogate segmentation tasks, respectively; and b) exploiting self-structural information and constructing an unsupervised loss among unlabeled images to assist in optimizing the model. Liu et al. [36], [37] leveraged unlabeled data collected from the Internet and constructed a margin ranking loss to optimize the model. Sam et al. [60] constructed an unsupervised reconstruction loss to learn useful features from unlabeled images and then trained the counter on labeled images, with the parameters of the front layers frozen while training the subsequent layers.

Utilizing the self-structural information and ordinal relations among unlabeled images is crucial for counting from limited labeled samples. Although Liu et al. [36] showed that such ordinal relations among unlabeled images in the input space are effective, these constraints alone are insufficient. In this paper, we explore such relations at different feature levels, because features in the deep layers of a CNN carry more semantic information, closer to densities and distributions. By the definition of the receptive field, relative positions do not change after stacks of convolution operations. In other words, one feature patch in an intermediate feature map corresponds to one sub-region of the given input image (see Fig. 1). Therefore, the output predicted from a feature patch, $g(F_{V_2})$, should be no less than that predicted from its sub-patch, $g(F_{V_1})$. We regard such relations as partial orders. We exploit these partial orders in multiple intermediate layers and construct a coarse-to-fine feature pyramid margin ranking loss to help the model learn from unlabeled images. In this way, our model increases the utilization of unlabeled images to at least three times that of L2R [36].

In this section, we introduce the partial orders hidden in coarse-to-fine features of unlabeled images, which are embedded in the latent space.

Fig. 2. Proof. The images in the first row are what the cropped feature patches in the second row can see. Since the feature maps have 512 channels in practice, we fuse these channels and visualize the mean feature map. The predictions and ground truths of these cropped feature patches are shown in the last two rows. Partial orders and fixed ordinal relations ('≥') between these cropped feature patches are also annotated.
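The constraint that Figs. 1 and 2 illustrate can be stated compactly. Writing $\mathrm{Cnt}(\cdot)$ for the true count within a region (a notation we introduce only for this exposition) and $g(\cdot)$ for the count predicted from a feature patch, for nested regions $I_1 \subseteq I_2$ with corresponding feature patches $F_{V_1}$ and $F_{V_2}$:

$$I_1 \subseteq I_2 \;\Rightarrow\; \mathrm{Cnt}(I_1) \le \mathrm{Cnt}(I_2), \quad \text{and we therefore require} \quad g(F_{V_1}) \le g(F_{V_2}).$$

The ranking loss introduced below turns violations of the right-hand inequality into a training signal at every chosen latent level.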
To make full use of such selfsupervised structural information to estimate densities more accurately, we design a mixed loss function to optimize our model. In the area of crowd counting, a common sense is that for any images of arbitrary size, the number of people in an image patch is always more than or at least equal to that of people in its sub-regions [36] . Inspired by this observation and the definition of the receptive field, we believe this common sense can also be held in the features level because the convolution and pooling layers in convolutional neural networks do not change the relative positions of the objects in one image. In other words, each position in the intermediate feature maps should represent the specific corresponding regions in the input space, which is often called the receptive field. Thus, it should be guaranteed that the output counts predicted by larger feature patches in the latent space would be no fewer than the counts predicted by their sub-patches in the same feature map, as shown in Fig. 1 . Additionally, such partial orders should be held in different intermediate layers of the network. In this way, we can greatly increase the utilization ratio of partial orders and structural information among unlabeled images. Further, we visualize the receptive field, feature patches in hidden layers, and the prediction and ground truth of our proposed method in Fig. 2 . The visualization results clearly validate our hypothesis that partial orders will hold in the intermediate feature layers. To compute the margin rank loss of unlabeled images, a prerequisite is to obtain a set of feature patch pairs. After A feature map F ∈ R C×H×W , number of cropped patches M , cropped ratio r. Step 1 : Choose a point as center point in one small region randomly. The region is defined to be r M the shape of this feature map F , with the same aspect ratio, centered by point ( c 2 , h 2 , w 2 ) . Initialize the feature pairs set S = ∅. Step 3 : Choose the first and largest cropped patch v 0 = F . Crop M − 1 patches centered at the new center point. These M − 1 patches v 1 , v 2 , ...v M are cropped by the ratio r iteratively. Step 5 : Resize these M patches to the same size of F and add each feature patch pair < v m , v n >, ∀m < n to the S. A set of feature patch pairs S. Notations : F ∈ R C×H×W where C, H, W represents the channels, height, and width of this feature map F respectively. ( c 2 , h 2 , w 2 ) is the coordinate of the center point. the cropping process, specifically, we guarantee that each subregion is fully contained by its larger sub-region for training. For each level of the feature map, we then randomly crop M sub-regions so that any two of M + 1 regions can make one candidate pair. Formally, we have We describe the network architecture in detail here. Our network mainly consists of two modules, feature extractor and crowd density map estimator. Feature extractor is to learn coarse-to-fine features through several convolutions and maxpooling operations while crowd density map estimator is to regress the density map based on these features. Our backbone of the feature extractor module is derived from the VGG-16 network [11] . We only use the first ten layers of VGG-16 with pre-trained weights to train our feature extractor. where the feature extractor module f (·; θ) with the parameters θ contains the first ten layers of pretrained VGG-16 network. And the module uses the i-th input image to output its corresponding feature v (i) . 
We describe the network architecture in detail here. Our network mainly consists of two modules: a feature extractor and a crowd density map estimator. The feature extractor learns coarse-to-fine features through several convolution and max-pooling operations, while the crowd density map estimator regresses the density map from these features. The backbone of the feature extractor module is derived from the VGG-16 network [11]; we use only the first ten layers of VGG-16, with pre-trained weights, as our feature extractor:

$$v^{(i)} = f(x^{(i)}; \theta),$$

where the feature extractor module $f(\cdot; \theta)$ with parameters $\theta$ contains the first ten layers of the pretrained VGG-16 network and maps the $i$-th input image $x^{(i)}$ to its corresponding feature $v^{(i)}$.

Dilated convolutional layers with $3 \times 3$ kernel size, dilation rate 2, and stride 1, followed by an upsampling layer, constitute the density map estimator, the same as described for CSRNet-B [22]:

$$D^{(i)} = g(v^{(i)}; \phi),$$

where $D^{(i)}$ is the predicted density map of the $i$-th image and $g(\cdot; \phi)$ is the density estimator with parameters $\phi$, containing six dilated convolutional layers, a $1 \times 1$ convolutional layer, and an upsampling layer.

We use the $L_2$ loss as our supervised loss $L_s$ on labeled images:

$$L_s = \frac{1}{2N} \sum_{i=1}^{N} \big\| g\big(f(x^{(i)}; \theta); \phi\big) - y^{(i)} \big\|_2^2,$$

where $N$ is the number of training images in a batch, and $x^{(i)}$ and $y^{(i)}$ are the $i$-th original input image and the corresponding ground-truth density map in the batch, respectively.

For training on the unlabeled images, we use the same feature extractor architecture with shared parameters, as shown in Fig. 3. In a deep CNN, feature maps are downsampled by convolutional or pooling layers, and the receptive field implies that each pixel in an intermediate feature map captures the information from one region of the input space. As discussed in Section III-A, larger regions contain the same number of people as, or more than, their smaller sub-regions in the input space. Similarly, because of the receptive field, the counts predicted from smaller regions of feature maps in the latent space should be the same as or fewer than those predicted from super-regions of the same feature maps.

We adopt the margin ranking loss as our self-supervised loss on unlabeled images. We expect the network to learn the ordinal relations of feature maps in the latent space, and we should guarantee that these ordinal relations hold for multi-scale features in different latent spaces. We crop the feature maps of unlabeled images in each latent space and construct the margin ranking loss of the $i$-th unlabeled image, $L_r^{(i)}$, as follows:

$$L_r^{(i)} = \sum_{\langle v_{u,m}^{(i)},\, v_{u,n}^{(i)} \rangle \in S} \max\!\big(0,\; g(v_{u,m}^{(i)}) - g(v_{u,n}^{(i)}) + \varepsilon\big),$$

where $\varepsilon$ is the margin, which we empirically set to zero, and in each pair $v_{u,m}^{(i)}$ denotes a sub-region of its super-region $v_{u,n}^{(i)}$. We expect the count of a smaller region of feature maps, $g(v_{u,m}^{(i)})$, in the latent space to be no more than that of its super-region, $g(v_{u,n}^{(i)})$. Therefore, when the network predicts the correct ordinal relation $g(v_{u,m}^{(i)}) \le g(v_{u,n}^{(i)})$, the loss $L_r^{(i)}$ is zero and no gradient is backpropagated. Otherwise, the loss $L_r^{(i)}$ equals the difference between these two estimates, and the gradients are backpropagated to update the parameters of our model. The total self-supervised loss is therefore defined as follows:

$$L_u = \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{\langle v_{u,m}^{(i),k},\, v_{u,n}^{(i),k} \rangle \in S_k} \max\!\big(0,\; g(v_{u,m}^{(i),k}) - g(v_{u,n}^{(i),k}) + \varepsilon\big),$$

where $N$ is the number of unlabeled images used in the training process, $K$ is the number of coarse-to-fine latent spaces chosen to construct the feature pair sets, $M$ is the number of cropped patches of feature maps from the same latent space $k$, and $\langle v_{u,m}^{(i),k}, v_{u,n}^{(i),k} \rangle$ is a pair in the set $S_k$ of the $k$-th latent space.

We have now introduced the feature pair set generation method, the main architecture of our model, the fully-supervised loss $L_s$, and the self-supervised ranking loss $L_u$ on unlabeled images. The final loss we adopt to train our model is the combination of $L_s$ and $L_u$ with the hyperparameter $\lambda$:

$$L_{total} = L_s + \lambda L_u.$$
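Putting the pieces together, here is a minimal sketch of the self-supervised ranking loss and the mixed objective described above, reusing the pair generator from the earlier sketch. The per-level density heads (`heads`) and all names are illustrative assumptions; the official implementation may organize the estimator $g(\cdot;\phi)$ differently.

```python
# A minimal sketch of L_u and L_total, assuming the pair generator
# from the previous sketch; `heads` and all names are illustrative.
import torch


def pyramid_ranking_loss(pyramid_feats, heads, M=4, r=0.75, eps=0.0):
    """Self-supervised loss L_u for one unlabeled image.

    pyramid_feats: K feature maps, one per chosen latent level.
    heads: K callables; heads[k] maps a (1, C_k, H_k, W_k) feature map
    to a density map whose spatial sum is the predicted count g(.).
    """
    loss = pyramid_feats[0].sum() * 0.0   # zero, on the right device
    for feat, head in zip(pyramid_feats, heads):
        for v_large, v_small in generate_feature_patch_pairs(feat, M, r):
            c_large = head(v_large.unsqueeze(0)).sum()
            c_small = head(v_small.unsqueeze(0)).sum()
            # hinge term: penalize a sub-patch whose predicted count
            # exceeds that of its super-patch (margin eps set to zero)
            loss = loss + torch.clamp(c_small - c_large + eps, min=0.0)
    return loss


def total_loss(pred_density, gt_density, l_u, lam=1.0):
    """L_total = L_s + lambda * L_u, with an L2 density loss as L_s
    (the exact normalization used in the paper may differ)."""
    l_s = 0.5 * ((pred_density - gt_density) ** 2).sum(dim=(1, 2, 3)).mean()
    return l_s + lam * l_u
```

In the ablation reported below, λ = 1 gives the best results on ShanghaiTech PartA.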
A. Experimental Setups.

Existing semi-supervised crowd counting methods mainly follow two different training settings. On the one hand, the off-the-shelf benchmark datasets (ShanghaiTech PartA [16], ShanghaiTech PartB [16], UCF-CC-50 [32], and UCF-QNRF [33]) are usually split into labeled and unlabeled subsets with different proportions (5%, 25%, 30%, and 50% of the full dataset labeled). On the other hand, extra crowd images are collected as unlabeled data for semi-supervised training. For fair comparisons, in this paper we conducted experiments under both training settings, strictly following most of the previous literature. Unless noted otherwise, the 100% labeled setting under the semi-supervised mode additionally uses our newly collected unlabeled dataset.

We captured 4,000 images in total from the image search engine GettyImages with the keyword 'crowd' to construct the unlabeled dataset used in our experiments. Fig. 4 illustrates a few natural images from our collected dataset, which contains varying scenarios, diverse people distributions, and different illumination. Their resolutions range from 221 × 612 to 612 × 612. This captured dataset may serve as a standard unlabeled crowd counting benchmark for researchers investigating semi-supervised, weakly-supervised, or unsupervised learning methods in the future.

The UCF-CC-50 dataset [32]: The UCF-CC-50 dataset is the first large-scale congested-scene dataset for pedestrian counting. It contains only 50 images, of varying resolutions, collected from publicly available images on websites. An average of 1,279 persons appear in each image; the maximum and minimum are 4,543 and 94, respectively. Both the small number of images and the drastic variation in the number of people pose a big challenge for the counting task. We use 5-fold cross-validation for testing due to the limited samples, and we generate the ground-truth density maps with geometry-adaptive Gaussian kernels [16] for a fair comparison.

ShanghaiTech PartA dataset [16]: The ShanghaiTech PartA dataset consists of 482 images of varying resolutions, with 241,677 heads annotated under different illumination conditions and crowd densities. The density distribution is highly variant, ranging from 33 to 3,139 persons with an average of 501 persons per image. We used geometry-adaptive kernels to generate the ground truth for all images. For fully supervised learning, researchers often split this dataset into two parts for training and testing; the validation set is normally chosen as ten percent of the training images.

ShanghaiTech PartB dataset [16]: The ShanghaiTech PartB dataset is made up of 716 images with 88,488 head annotations, captured from busy streets in central business districts of Shanghai. The resolution of these pictures is fixed at 768 × 1024 pixels. 400 images are used for training and validation, while the rest are used for testing. The ground truth of these images is generated by a fixed Gaussian kernel whose variance σ is set to 15.

The UCF-QNRF dataset [33]: To construct a larger crowd counting dataset that includes dramatic variation in head sizes, diverse viewpoints and perspectives, and different places and times of day, 1,535 images were collected from several search engines such as Google Image Search and Flickr. Over 1,251,642 coordinates were labeled, costing more than 2,000 labor-hours. Due to the high resolution of the images, we limit the shorter side of a given image to no more than 1,500 pixels to fit the memory capacity; images are rescaled at the same aspect ratio to lose as little global and local contextual information as possible. The training and test sets consist of 1,201 and 334 images, respectively. Similar to the ground-truth generation for the ShanghaiTech PartB dataset, a fixed Gaussian kernel is adopted.
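For reference, here is a minimal sketch of the ground-truth generation used above: geometry-adaptive Gaussian kernels in the style of [16] for the congested datasets, or a fixed kernel (σ = 15) for ShanghaiTech PartB. The values β = 0.3 and k = 3 follow the common recipe from [16] and are assumptions here rather than settings stated in this paper; the per-head loop is written for clarity, not speed.

```python
# A minimal sketch of density-map ground-truth generation; beta = 0.3
# and k = 3 follow the common recipe of [16] and are assumptions here.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree


def density_map(points, height, width, beta=0.3, k=3, fixed_sigma=None):
    """points: (n, 2) array of (x, y) head coordinates.
    The returned map integrates (approximately) to the number of heads."""
    dmap = np.zeros((height, width), dtype=np.float32)
    if len(points) == 0:
        return dmap
    tree = KDTree(points)
    for x, y in points:
        delta = np.zeros_like(dmap)
        delta[min(int(y), height - 1), min(int(x), width - 1)] = 1.0
        if fixed_sigma is not None:          # e.g. sigma = 15 for PartB
            sigma = fixed_sigma
        elif len(points) > 1:
            # geometry-adaptive: mean distance to the k nearest heads
            dists, _ = tree.query((x, y), k=min(k + 1, len(points)))
            sigma = beta * float(np.mean(dists[1:]))  # skip self (dist 0)
        else:
            sigma = 15.0                     # lone head: fall back
        dmap += gaussian_filter(delta, sigma)
    return dmap
```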
We follow preprocessing similar to that of previous works [38], [39] for supervised training; for semi-supervised learning, we randomly choose different proportions of the training set as labeled samples and regard the others as unlabeled. Considering the large resolutions of the UCF-QNRF images and limited memory, we resize the shorter side to no more than 1,920 pixels while keeping the original aspect ratio. Effective data augmentations such as random horizontal flipping, random cropping, and normalization are adopted for both labeled and unlabeled training images. The captured images are also resized at the same aspect ratio to fit the cropping operations. We use the Adam optimizer with a learning rate of $10^{-5}$ and a weight decay of $10^{-4}$ in all of our experiments.

We employ two commonly used evaluation metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \big| \hat{Y}_i - Y_i \big|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( \hat{Y}_i - Y_i \big)^2},$$

where $N$ denotes the number of images in the test set, and $\hat{Y}_i$ and $Y_i$ are the predicted and actual counts of the $i$-th image, respectively. Briefly speaking, MAE reflects the precision of the estimates and RMSE reflects their robustness; RMSE is more sensitive to outliers. Our goal is a model with both a low MAE and a low RMSE.

Evaluation on the UCF-CC-50 dataset: The experimental results on the UCF-CC-50 dataset are shown in Table I. Since the UCF-CC-50 dataset contains only 50 images, it is not suitable for division into labeled and unlabeled subsets, so we utilize the unlabeled images from our collected dataset to train our model. For a fair comparison with other work, we use the same 5-fold cross-validation to compute the average MAE and average RMSE. The baseline model, trained in a fully-supervised way using only the 50 labeled images, achieves 266.10 average MAE. We reproduced the L2R method using our model architecture; it improves performance on this dataset, reaching 261.60 average MAE. Our approach achieves a further improvement of nearly 10 MAE over the L2R method.

Evaluation on the ShanghaiTech dataset: The experimental results on the ShanghaiTech PartA and PartB datasets are shown in Table II and Table III. All labeled and unlabeled images are chosen from the ShanghaiTech datasets: we randomly pick 5%, 25%, and 50% of the images in the training set as labeled samples, and the rest of the training set is regarded as unlabeled for training. We compare our method against three previous methods: L2R [36], IRAST [39], and Sindagi et al. [38]. These methods used different ratios of labeled images to train their models and reported the corresponding results, so for a fair comparison we use the same proportions of labeled images. For PartA, we reach the best result when the ratio is 50%, with an MAE of 78.4. For the 5% and 25% settings, we obtain results competitive in MAE and RMSE with Sindagi et al. [38]. A possible reason is that the 5% setting means only 15 labeled images for training, and the ranking loss on the remaining unlabeled images reflects only qualitative relations, which cannot help the model predict specific counts more accurately with such a small number of labeled images. As for the PartB dataset, we observe that partial orders among unlabeled images provide limited assistance.
The reasons for this are that 1) the density distribution in this dataset is relatively sparse and the number of people is relatively small, and 2) our proposed qualitative partial orders among unlabeled images may be more suitable and efficient for crowded scenes. Nevertheless, our proposed method still achieves performance comparable to other methods. If we use all images from the training set together with our collected unlabeled images for training, S$^2$FPR achieves the lowest MAE and RMSE scores.

Evaluation on the UCF-QNRF dataset: The experimental results on the UCF-QNRF dataset are shown in Table IV. We also randomly choose 5%, 25%, 30%, and 50% of its training set as labeled images and regard the rest as unlabeled for semi-supervised training, to compare with previous methods [36], [38], [39]. Moreover, we reproduce the L2R [36] method using our collected unlabeled dataset. The results indicate that our proposed method S$^2$FPR is superior to the previous semi-supervised methods.

Different utilization ratios of labeled images: We design an ablation study to verify whether our approach is robust under settings with varying numbers of labeled images, conducted on the ShanghaiTech PartA dataset. We randomly choose 5%, 25%, and 50% of the images, respectively, to form the labeled dataset, and the remaining images from the training set are the unlabeled ones. As shown in Table V, our method achieves a consistent performance improvement over training with only labeled images.

Impact of varying λ: Further, to examine the role of the margin ranking loss in the final mixed loss, we try different values of the hyperparameter λ, which weights the self-supervised loss. The value of λ is chosen from {0.1, 0.5, 1, 5, 10}; the performance for each value is shown in Table VI. We use the unlabeled images from our collected dataset together with all images from the training set of the ShanghaiTech PartA dataset to evaluate the impact of λ. Our model achieves the best performance on the ShanghaiTech PartA dataset when λ is set to 1.

Combination of ranking losses in different layers: Coarse-to-fine pyramid features in different layers represent everything from high-level semantic information to low-level visual information, including textures, edges, and backgrounds, among others, and partial orders should exist in feature patches from different layers. Therefore, we report the results obtained by diverse combinations of ranking losses at disparate layers (low-level, intermediate-level, and high-level), as shown in Table VII. We discover that performance improves as the utilization ratio of coarse-to-fine pyramid feature patches with partial orders increases. To be specific, such partial orders at high-level layers are more helpful to crowd counting than those at the other two levels.

Fig. 5. Visualization. From left to right, the four columns show the original image, the ground truth, the label-only prediction, and the label+unlabeled prediction, respectively. The first-row images come from the ShanghaiTech PartA dataset, the second row from the UCF-CC-50 dataset, the third row from the UCF-QNRF dataset, and the last row from the ShanghaiTech PartB dataset.

We visualize the results on these four different datasets in Fig. 5. The visualization results clearly demonstrate the effectiveness of our coarse-to-fine feature pyramid ranking loss.
The performance when utilizing partial orders among unlabeled images is better than that of training on labeled images only. More specifically, in the first and second rows of Fig. 5, the label-only predicted density map tends toward a uniform distribution of the crowd, whereas when we consider partial orders among unlabeled images during training, the prediction is close to the actual distribution.

Our work focused on taking advantage of partial orders from coarse-to-fine pyramid features to help the neural network enhance its qualitative discrimination among unlabeled images. Extensive experiments show that the proposed model S$^2$FPR obtains better performance than other state-of-the-art methods with the help of the self-supervised coarse-to-fine feature pyramid ranking loss, especially in congested scenes. Being simple and intuitive, our proposed method is easy to implement. Besides, we proposed a new unlabeled crowd counting dataset (FUDAN-UCC), which can serve as an unlabeled dataset for the semi-supervised and weakly-supervised crowd counting community in the foreseeable future.

References
[1] Pedestrian Detection in Crowded Scenes
[2] Pedestrian Detection: An Evaluation of the State of the Art
[3] Bayesian Poisson Regression for Crowd Counting
[4] Crowd Counting using Multiple Local Features
[5] Feature Mining for Localised Crowd Counting
[6] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[7] You Only Look Once: Unified, Real-Time Object Detection
[8] YOLOv3: An Incremental Improvement
[9] SSD: Single Shot Multibox Detector
[10] ImageNet Classification with Deep Convolutional Neural Networks
[11] Very Deep Convolutional Networks for Large-Scale Image Recognition
[12] Going Deeper with Convolutions
[13] Deep Residual Learning for Image Recognition
[14] Densely Connected Convolutional Networks
[15] Cross-Scene Crowd Counting via Deep Convolutional Neural Networks
[16] Single-Image Crowd Counting via Multi-Column Convolutional Neural Network
[17] Generating High-Quality Crowd Density Maps using Contextual Pyramid CNNs
[18] Switching Convolutional Neural Network for Crowd Counting
[19] CrowdNet: A Deep Convolutional Network for Dense Crowd Counting
[20] ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding
[21] ResnetCrowd: A Residual Deep Learning Architecture for Crowd Counting, Violent Behaviour Detection and Crowd Density Level Classification
[22] CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes
[23] Context-Aware Crowd Counting
[24] PaDNet: Pan-Density Crowd Counting
[25] Scale Aggregation Network for Accurate and Efficient Crowd Counting
[26] Scale Pyramid Network for Crowd Counting
[27] Learning Multi-Level Density Maps for Crowd Counting
[28] ZoomCount: A Zooming Mechanism for Crowd Counting in Static Images
[29] Scale-Aware Crowd Counting via Depth-Embedded Convolutional Neural Networks
[30] PCC Net: Perspective Crowd Counting via Spatial Convolutional Network
[31] Mask-Aware Networks for Crowd Counting
[32] Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images
[33] Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds
[34] NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting
[35] Learning from Synthetic Data for Crowd Counting in the Wild
[36] Leveraging Unlabeled Data for Crowd Counting by Learning to Rank
[37] Exploiting Unlabeled Data in CNNs by Self-Supervised Learning to Rank
[38] Learning to Count in the Crowd from Limited Labeled Data
[39] Semi-Supervised Crowd Counting via Self-Training on Surrogate Tasks
[40] Histograms of Oriented Gradients for Human Detection
[41] Monocular Pedestrian Detection: Survey and Experiments
[42] Pedestrian Detection via Classification on Riemannian Manifolds
[43] Object Detection with Discriminatively Trained Part-Based Models
[44] Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors
[45] Detection and Tracking of Multiple, Partially Occluded Humans by Bayesian Combination of Edgelet Based Part Detectors
[46] Learning to Count Objects in Images
[47] Deep People Counting in Extremely Dense Crowds
[48] Attention Scaling for Crowd Counting
[49] Reverse Perspective Network for Perspective-Aware Object Counting
[50] Adaptive Density Map Generation for Crowd Counting
[51] Kernel-Based Density Map Generation for Dense Object Counting
[52] Adaptive Dilated Network With Self-Correction Supervision for Counting
[53] Crowd Counting via Adversarial Cross-Scale Consistency Pursuit
[54] Multi-Scale Generative Adversarial Networks for Crowd Counting
[55] Adversarial Learning for Multiscale Crowd Counting under Complex Scenes
[56] Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting
[57] From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer
[58] Bayesian Loss for Crowd Count Estimation with Point Supervision
[59] From Semi-Supervised to Transfer Counting of Crowds
[60] Almost Unsupervised Learning for Dense Crowd Counting