key: cord-0229866-1a2u9wa9 authors: Liu, Weizhe; Durasov, Nikita; Fua, Pascal title: Leveraging Self-Supervision for Cross-Domain Crowd Counting date: 2021-03-30 journal: nan DOI: nan sha: 14739e62c1bba4f5df5cc39d1edc5d8252ab62e9 doc_id: 229866 cord_uid: 1a2u9wa9 State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, these data-driven approaches rely on large amount of data annotation to achieve good performance, which stops these models from being deployed in emergencies during which data annotation is either too costly or cannot be obtained fast enough. One popular solution is to use synthetic data for training. Unfortunately, due to domain shift, the resulting models generalize poorly on real imagery. We remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to recognize upside-down real images from regular ones and incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning purposes. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting ones without any extra computation at inference time. Crowd counting is important for applications such as video surveillance and traffic control. For example during the current COVID-19 pandemic, it has a role to play in monitoring social distancing and slowing down the spread of the disease. Most state-of-the-art approaches rely on regressors to estimate the local crowd density in individual images, which they then proceed to integrate over portions of the images to produce people counts. The regressors typically use Random Forests [35] , Gaussian Processes [4] , or more recently Deep Networks [101, 107, 56, 68, 90, 76, 71, 49, 37, 67, 74, 102, 42, 48, 66, 26, 60, 3] , with most stateof-the-art approaches now relying on the latter. Unfortunately, training such deep networks in a traditional supervised manner requires much ground-truth annotation. This is expensive and time-consuming and has slowed down the deployment of data-driven approaches. The total number of people obtained by integrating these maps is overlaid on the images. Bottom row: Estimated people density maps by the network of [86] with overlaid estimated total number of people. Because the network has been trained on synthetic images, the estimated number of people in the synthetic image is very close to the correct one. This is not the case in the real one because of the large domain shift between synthetic and real images. One way around this difficulty is to use synthetic data for training purposes. However there is usually too much domain shift-change in statistical properties-between real and synthetic images for networks trained in this manner to perform well, as shown in Fig. 1 . In this paper, we remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. We force our network to learn perspective-aware features on the real images and build into it the ability to use these features to predict its own uncertainty using a fast variant of the ensemble method [13] to effectively use pseudo labels for fine-tuning. We train it as follows: ages, and upside-down version of the latter. 
1. We use synthetic images, real images, and upside-down versions of the latter. We train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This simple approach to self-supervision forces the network to learn features that are perspective-aware on the real images.

2. At the end of this first training phase, in which we perform image-wise self-supervision on the real images, our network is semi-trained and the uncertainties attached to the people densities it estimates are meaningful. We exploit them to provide pixel-wise self-supervision by treating the densities the network is confident about as pseudo labels, which we use as if they were ground-truth labels to re-train the network. We iterate this process until convergence.

Our contribution is therefore a novel approach to self-supervision for cross-domain crowd counting that relies on stochastic density maps, that is, maps with uncertainties attached to them, instead of the more traditional deterministic density maps. Furthermore, it explicitly leverages a specificity of the crowd counting problem, namely the fact that perspective distortion affects density counts. We will show that it consistently outperforms state-of-the-art cross-domain crowd counting methods.

Given a single image of a crowded scene, the currently dominant approach to counting people is to train a deep network to regress a people density estimate at every image location. This density is then integrated to deliver an actual count [43, 50, 72, 45, 27, 108, 103, 84, 38, 40, 96, 55, 41, 91]. Most methods count people in individual images [92, 73, 77, 9, 83, 99, 100], while others account for temporal consistency in video sequences [90, 104, 14, 44, 47, 46]. While effective, these approaches require a large annotated dataset for training purposes, which is hard to obtain in many real-world scenarios. Unsupervised domain adaptation seeks to address this difficulty. We discuss earlier approaches to it, first in a generic context and then for the specific purpose of crowd counting.

Unsupervised Domain Adaptation. Unsupervised domain adaptation aims to align the source and target domain feature distributions given annotated data only in the source domain. A popular approach is to learn domain-invariant features by adversarial learning [80, 16, 21, 81, 7, 22, 65, 105, 8, 106, 31, 54, 10, 11, 24, 53, 89], which relies on an extra discriminator network to narrow the gap between the two domains. Another way to bridge the domain gap is to define a specific domain shift metric that is then minimized during training [51, 52, 28, 12, 82, 58, 29, 62, 33, 95, 39, 34, 93, 94, 36, 59]. Other widely used approaches include generating realistic-looking synthetic images [69, 20, 2, 98, 97], incorporating self-training [70, 6, 18, 75], transferring model weights between different domains [63, 64], and using domain-specific batch normalization [5]. The method of [79] introduces a self-supervised auxiliary task, such as detecting image rotation in unlabeled target-domain images, for cross-domain image classification, and it served as an inspiration to us.

Crowd Counting. Most of the techniques described above are intended for classification problems, and very few have been demonstrated for crowd counting purposes.
One exception is the approach of [86, 17, 87], which trains the deep model on synthetic images and then narrows the domain gap by using a CycleGAN [109] extension to translate the synthetic images so that they look real, and then retrains the model on these translated images. A limitation of this work is that the translated images, while more realistic than the original synthetic ones, are still not truly real. Another exception is the method of [78]. It uses pseudo labels generated by a network trained on synthetic images as though they were ground-truth labels. It relies on Gaussian Processes to estimate the variance of the pseudo labels and to minimize it. However, the uncertainty of these pseudo labels is not estimated or taken into account, and the computational requirements can become very large when many synthetic images are used simultaneously. The method of [19] uses adversarial learning to align features across different domains. However, it relies on extra discriminator networks, which are complicated and hard to train. The methods of [61, 23, 88] leverage a few target labels to bridge the domain gap and therefore incur extra annotation cost. By contrast to these approaches, ours explicitly takes uncertainty into account and leverages a specificity of the crowd counting problem, namely the fact that perspective distortion matters.

We propose a fully unsupervised approach to fine-tuning a network that has been trained on annotated synthetic data, so that it can operate effectively on real data despite a potentially large domain shift. At the heart of our method is a network that estimates people density at every location while incorporating a variant of the deep ensemble approach [13] to provide uncertainties about these estimates. The key to success is to first pre-train this network so that these uncertainties are meaningful and then to exploit them to recursively fine-tune the network.

We have therefore developed a two-stage approach that first relies on real images and upside-down versions of these to provide an image-wise supervisory signal. We use them to train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This yields a partially-trained network that can operate on real images and return meaningful uncertainty values along with the density values. We can therefore exploit them to provide a pixel-wise supervisory signal by treating the people density estimates the network is most confident about as pseudo labels, which are treated as ground truth and used to re-train the network. We iterate this process until the network predictions stabilize. Fig. 2 depicts our complete approach.

Figure 2 (caption): Two-stage approach. Top: During the first training stage, we use synthetic images, real images, and flipped versions of the latter. The network is trained to output the correct people density for the synthetic images and to classify the real images as being flipped or not. Bottom: During the second training stage, we use synthetic and real images. We run the previously trained network on the real images and treat the least uncertain people density estimates as pseudo labels. We then fine-tune the network on both kinds of images and iterate the process.

Let $D^s = \{(\mathbf{x}^s_i, \mathbf{y}^s_i)\}_{i=1}^{N_s}$ be a synthetic source-domain dataset, where $\mathbf{x}^s$ denotes a color synthetic image and $\mathbf{y}^s$ the corresponding crowd density map. The target-domain dataset is defined as $D^t = \{\mathbf{x}^t_i\}_{i=1}^{N_t}$ without ground-truth crowd density labels, where $\mathbf{x}^t$ denotes a color real image.
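To make this setup concrete, the following is a minimal PyTorch-style sketch of how the two datasets could be organized for the first training stage. It is not the authors' code; the class names and the 50% flip probability are illustrative assumptions. Source samples carry density maps, while target samples only carry the flip label needed for the auxiliary upside-up versus upside-down task.

```python
import random
import torch
from torch.utils.data import Dataset

class SourceCrowdDataset(Dataset):
    """Synthetic images x^s with their ground-truth density maps y^s."""
    def __init__(self, images, density_maps):
        self.images = images
        self.density_maps = density_maps

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        return self.images[i], self.density_maps[i]

class TargetCrowdDataset(Dataset):
    """Real images x^t without density labels: each sample is randomly
    kept upside-up (label 0) or flipped upside-down (label 1)."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        x = self.images[i]               # tensor of shape (3, H, W)
        flipped = random.random() < 0.5
        if flipped:
            x = torch.flip(x, dims=[1])  # reverse the height axis
        return x, torch.tensor(float(flipped))
```

The flip labels come for free, which is what makes the auxiliary task self-supervised: no annotation of the real images is ever required.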
In most real-world scenarios, we have $N_s \gg N_t$. Our goal is to learn a model that performs well on the target-domain data. To this end, we use a state-of-the-art encoder/decoder architecture for people density estimation [86]. We chose this one because it has already been used by cross-domain crowd counting approaches and therefore allows for a fair comparison of our approach against earlier ones. Let $E$ and $D$ be the encoder and decoder networks that jointly form the people density estimation network $F$ of [86]. Given an input image $\mathbf{x}$, $E$ returns the deep features $\mathbf{f} = E(\mathbf{x})$ that $D$ takes as input to return the density map $D(\mathbf{f})$.

One way to enable self-supervision for classification purposes is to use a partially trained network to predict labels and associated probabilities, and to treat the most probable ones as pseudo labels that can be used for training purposes as though they were ground-truth labels [98, 97]. This strategy is widely used to provide pixel-wise [111] and image-wise [110] self-supervision to address classification problems. If the probability measure is reliable and allows the discarding of potentially erroneous labels, repeating this procedure several times results in the network being progressively refined without any need for ground-truth labels.

To implement a similar mechanism in our context, we need more than labels at the image level. We require estimates of which individual densities in an estimated density map are likely correct and which are not. In other words, we need a stochastic crowd density map instead of the deterministic one that existing methods produce.

Figure 3 (caption): Masksembles approach. During training, for every input vector, a binary mask is selected from a set of pre-generated masks and is used to zero out a corresponding set of features. Performing the inference several times using different masks then yields an ensemble-like behavior.

Among the methods that can be used to turn our network $F$ into one that returns such stochastic density maps, MC-Dropout [15] and Deep Ensembles [32] have emerged as two of the most popular. Both exploit the concept of ensembles to produce uncertainty estimates. Deep Ensembles are widely acknowledged to yield significantly more reliable uncertainty estimates [57, 1]. However, they require training many different copies of the network, which can be very slow and memory consuming. Instead, we rely on Masksembles, a recent approach [13] that operates on the same basic principle as MC-Dropout. However, instead of achieving randomness by dropping a different subset of weights for each observed sample, it relies on a set of precomputed binary masks that specify the network parameters to be dropped. Fig. 3 depicts this process.

In practice, we associate a Masksembles layer with the first convolutional layer of the decoder $D$. During training, for each sample in a batch, we randomly choose one of the masks and set the corresponding weights to one or zero in the Masksembles layer, which drops the corresponding parts of the model just like standard dropout. During inference, we run the model multiple times, once per mask, to obtain a set of predictions and, ultimately, an uncertainty estimate. This provides uncertainty estimates that are almost as reliable as those of Deep Ensembles without having to train multiple networks, and it is therefore much faster and easier to train.
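To make the mechanism concrete, below is a minimal sketch of the mask-based idea, not the implementation of [13]: we simplify mask generation to independent Bernoulli draws per channel, whereas Masksembles generates its masks with controlled overlap, and all names, the number of masks, and the keep probability are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MasksemblesLayer(nn.Module):
    """Simplified sketch: N fixed binary masks over feature channels.
    The real method [13] builds masks with controlled overlap; here
    they are drawn once from a Bernoulli for illustration only."""
    def __init__(self, channels, num_masks=4, keep_prob=0.8):
        super().__init__()
        masks = (torch.rand(num_masks, channels) < keep_prob).float()
        self.register_buffer("masks", masks)  # fixed, not trained
        self.num_masks = num_masks

    def forward(self, x, mask_id):
        # x: (B, C, H, W); zero out the channels dropped by mask mask_id
        return x * self.masks[mask_id].view(1, -1, 1, 1)

@torch.no_grad()
def predict_with_uncertainty(encoder, decoder_head, mask_layer, x):
    """One forward pass per mask: the per-pixel mean of the predictions
    is the density estimate and the per-pixel std is the uncertainty."""
    f = encoder(x)
    preds = torch.stack(
        [decoder_head(mask_layer(f, m)) for m in range(mask_layer.num_masks)]
    )                          # (N, B, 1, H, W)
    y_bar = preds.mean(dim=0)  # density map
    u = preds.std(dim=0)       # pixel-wise uncertainty
    return y_bar, u
```

During training, `mask_id` would be drawn at random for each sample; at inference, looping over all masks yields the ensemble-like set of predictions from which the mean density and the pixel-wise uncertainty are computed, as formalized next.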
Formally, we write

$$\bar{\mathbf{y}} = \frac{1}{M} \sum_{m=1}^{M} F_m(\mathbf{x}), \qquad \mathbf{u} = \mathrm{std}\left(\{F_m(\mathbf{x})\}_{m=1}^{M}\right),$$

where $\mathbf{x}$ is the input image and $F_m$ is the modified network $F$ used with mask $m$, out of $M$ masks in total. $\bar{\mathbf{y}}$ and $\mathbf{u}$ are of the same size as the input image, and we treat the individual values $u \in \mathbf{u}$ as pixel-wise uncertainties.

$F_m$ can be trained in a supervised fashion using the synthetic training set $D^s$, but that does not guarantee that it will work well on real images. Hence, we introduce the auxiliary task decoder $D_{aux}$, shown at the top of Fig. 2, whose task is to classify an image as being oriented normally or being upside-down from the features produced by the encoder. To train the resulting two-branch network, we use synthetic images from $D^s$ along with real images from $D^t$ and flipped versions of these, such as the ones shown in Fig. 4. For the synthetic images, the output should minimize the usual $L_2$ loss given the ground-truth density maps and, for the real images, the output should minimize a cross-entropy loss for binary classification as being either upside-up or upside-down. Formally, we introduce the loss function

$$\mathcal{L} = \sum_{i=1}^{N_s} L_s\left(F_m(\mathbf{x}^s_i), \mathbf{y}^s_i\right) + \sum_{i=1}^{N_t} L_a\left(D_{aux}(E(\mathbf{x}^t_i)), y^t_i\right),$$

which we minimize with respect to the weights of the encoder $E$ and the two decoders $D$ and $D_{aux}$. $L_s$ is the $L_2$ distance between the predicted people density map and the ground-truth one $\mathbf{y}^s_i$, while $L_a$ is the cross-entropy loss for binary classification given the ground-truth upside-up or upside-down label $y^t_i$ for image $\mathbf{x}^t_i$. We use this label only for the real images because we have ground-truth annotations for the synthetic ones. As will be shown in the results section, this provides sufficient supervision for the synthetic images, and applying the image-wise supervision to them as well brings no obvious improvement.

Algorithm 1 (excerpt): procedure FIRST STAGE($D^s$, $D^t$): initialize the weights of the people density estimation network $F_m$, with a single encoder $E$ and two decoders $D$ and $D_{aux}$.

Note that $L_s$ and $L_a$ use the same encoder $E$. To minimize $L_a$, and hence correctly estimate whether an input image is upside-down or not, $E$ must extract meaningful features from the real images and not only from the synthetic ones. Furthermore, these features must enable the decoder $D$ to handle scene perspective, that is, the fact that people densities are typically higher at the top of the image than at the bottom in upside-up images. In other words, minimizing $L_a$ forces $E$ to produce perspective-aware features, while minimizing $L_s$ forces the decoder $D$ to operate on such features to properly estimate people densities on the synthetic images. In this way, we make $E$ produce features that are appropriate for both synthetic and real images, hence mitigating the domain shift between the two, as will be demonstrated in the results section. This first training stage is summarized by the first procedure of Alg. 1.

After the first training stage described above, our model can produce both a density map $\bar{\mathbf{y}}$ and its corresponding uncertainty $\mathbf{u}$. Let $F^0_m$ be the corresponding network. We can now refine its weights to create increasingly better tuned networks $F^k_m$ for $1 \le k \le K$ by iteratively minimizing