key: cord-0229866-1a2u9wa9 authors: Liu, Weizhe; Durasov, Nikita; Fua, Pascal title: Leveraging Self-Supervision for Cross-Domain Crowd Counting date: 2021-03-30 journal: nan DOI: nan sha: 14739e62c1bba4f5df5cc39d1edc5d8252ab62e9 doc_id: 229866 cord_uid: 1a2u9wa9 State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, these data-driven approaches rely on large amount of data annotation to achieve good performance, which stops these models from being deployed in emergencies during which data annotation is either too costly or cannot be obtained fast enough. One popular solution is to use synthetic data for training. Unfortunately, due to domain shift, the resulting models generalize poorly on real imagery. We remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to recognize upside-down real images from regular ones and incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning purposes. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting ones without any extra computation at inference time. Crowd counting is important for applications such as video surveillance and traffic control. For example during the current COVID-19 pandemic, it has a role to play in monitoring social distancing and slowing down the spread of the disease. Most state-of-the-art approaches rely on regressors to estimate the local crowd density in individual images, which they then proceed to integrate over portions of the images to produce people counts. The regressors typically use Random Forests [35] , Gaussian Processes [4] , or more recently Deep Networks [101, 107, 56, 68, 90, 76, 71, 49, 37, 67, 74, 102, 42, 48, 66, 26, 60, 3] , with most stateof-the-art approaches now relying on the latter. Unfortunately, training such deep networks in a traditional supervised manner requires much ground-truth annotation. This is expensive and time-consuming and has slowed down the deployment of data-driven approaches. The total number of people obtained by integrating these maps is overlaid on the images. Bottom row: Estimated people density maps by the network of [86] with overlaid estimated total number of people. Because the network has been trained on synthetic images, the estimated number of people in the synthetic image is very close to the correct one. This is not the case in the real one because of the large domain shift between synthetic and real images. One way around this difficulty is to use synthetic data for training purposes. However there is usually too much domain shift-change in statistical properties-between real and synthetic images for networks trained in this manner to perform well, as shown in Fig. 1 . In this paper, we remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. We force our network to learn perspective-aware features on the real images and build into it the ability to use these features to predict its own uncertainty using a fast variant of the ensemble method [13] to effectively use pseudo labels for fine-tuning. We train it as follows: ages, and upside-down version of the latter. 
1. We use synthetic images, real images, and upside-down versions of the latter. We train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This simple approach to self-supervision forces the network to learn features that are perspective-aware on the real images.

2. At the end of this first training phase, in which we perform image-wise self-supervision on the real images, our network is semi-trained and the uncertainties attached to the people densities it estimates are meaningful. We exploit them to provide pixel-wise self-supervision by treating the densities the network is confident about as pseudo labels, which we use as if they were ground-truth labels to re-train the network. We iterate this process until convergence.

Our contribution is therefore a novel approach to self-supervision for cross-domain crowd counting that relies on stochastic density maps, that is, maps with uncertainties attached to them, instead of the more traditional deterministic density maps. Furthermore, it explicitly leverages a specificity of the crowd counting problem, namely the fact that perspective distortion affects density counts. We will show that it consistently outperforms state-of-the-art cross-domain crowd counting methods.

Given a single image of a crowded scene, the currently dominant approach to counting people is to train a deep network to regress a people density estimate at every image location. This density is then integrated to deliver an actual count [43, 50, 72, 45, 27, 108, 103, 84, 38, 40, 96, 55, 41, 91]. Most methods count people in individual images [92, 73, 77, 9, 83, 99, 100], while others account for temporal consistency in video sequences [90, 104, 14, 44, 47, 46]. While effective, these approaches require a large annotated dataset for training purposes, which is hard to obtain in many real-world scenarios. Unsupervised domain adaptation seeks to address this difficulty. We discuss earlier approaches to it, first in a generic context and then for the specific purpose of crowd counting.

Unsupervised Domain Adaptation. Unsupervised domain adaptation aims to align the source and target domain feature distributions given annotated data only in the source domain. A popular approach is to learn domain-invariant features by adversarial learning [80, 16, 21, 81, 7, 22, 65, 105, 8, 106, 31, 54, 10, 11, 24, 53, 89], which relies on an extra discriminator network to narrow the gap between the two domains. Another way to bridge the domain gap is to define a specific domain shift metric that is then minimized during training [51, 52, 28, 12, 82, 58, 29, 62, 33, 95, 39, 34, 93, 94, 36, 59]. Other widely used approaches include generating realistic-looking synthetic images [69, 20, 2, 98, 97], incorporating self-training [70, 6, 18, 75], transferring model weights between different domains [63, 64], and using domain-specific batch normalization [5]. The method of [79] introduces a self-supervised auxiliary task, such as detecting image rotation in unlabeled target-domain images, for cross-domain image classification, and it served as an inspiration to us.

Crowd Counting. Most of the techniques described above are intended for classification problems, and very few have been demonstrated for crowd counting purposes.
One exception is the approach of [86, 17, 87], which trains the deep model on synthetic images and then narrows the domain gap by using a CycleGAN [109] extension to translate the synthetic images so that they look real, and then retrains the model on these translated images. A limitation of this work is that the translated images, while more realistic than the original synthetic ones, are still not truly real. Another exception is the method of [78]. It uses pseudo labels generated by a network trained on synthetic images as though they were ground-truth labels. It relies on Gaussian Processes to estimate the variance of the pseudo labels and to minimize it. However, the uncertainty of these pseudo labels is not estimated or taken into account, and the computational requirements can become very large when many synthetic images are used simultaneously. The method of [19] uses adversarial learning to align features across different domains. However, it relies on extra discriminator networks, which are complicated and hard to train. The methods of [61, 23, 88] leverage a few target labels to bridge the domain gap and therefore incur extra annotation cost. By contrast to these approaches, ours explicitly takes uncertainty into account and leverages a specificity of the crowd counting problem, namely the fact that perspective distortion matters.

We propose a fully unsupervised approach to fine-tuning a network that has been trained on annotated synthetic data, so that it can operate effectively on real data despite a potentially large domain shift. At the heart of our method is a network that estimates people density at every location while incorporating a variant of the deep ensemble approach [13] to provide uncertainties about these estimates. The key to success is to first pre-train this network so that these uncertainties are meaningful and then to exploit them to recursively fine-tune the network.

We have therefore developed a two-stage approach that first relies on real images and upside-down versions of these to provide an image-wise supervisory signal. We use them to train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This yields a partially-trained network that can operate on real images and return meaningful uncertainty values along with the density values. We can therefore exploit them to provide a pixel-wise supervisory signal by treating the people density estimates the network is most confident about as pseudo labels, which are treated as ground truth and used to re-train the network. We iterate this process until the network predictions stabilize. Fig. 2 depicts our complete approach.

Figure 2 (caption): Two-stage approach. Top: During the first training stage, we use synthetic images, real images, and flipped versions of the latter. The network is trained to output the correct people density for the synthetic images and to classify the real images as being flipped or not. Bottom: During the second training stage, we use synthetic and real images. We run the previously trained network on the real images and treat the least uncertain people density estimates as pseudo labels. We then fine-tune the network on both kinds of images and iterate the process.

Let $D^s = \{(\mathbf{x}^s_i, \mathbf{y}^s_i)\}_{i=1}^{N_s}$ be a synthetic source-domain dataset, where $\mathbf{x}^s$ denotes a color synthetic image and $\mathbf{y}^s$ the corresponding crowd density map. The target-domain dataset is defined as $D^t = \{\mathbf{x}^t_i\}_{i=1}^{N_t}$ without ground-truth crowd density labels, where $\mathbf{x}^t$ denotes a color real image.
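To make this setup concrete, the following is a minimal PyTorch-style sketch of how the two datasets could be organized for the first training stage. It is not the authors' code; the class names and the 50% flip probability are illustrative assumptions. Source samples carry density maps, while target samples only carry the flip label needed for the auxiliary upside-up versus upside-down task.

```python
import random
import torch
from torch.utils.data import Dataset

class SourceCrowdDataset(Dataset):
    """Synthetic images x^s with their ground-truth density maps y^s."""
    def __init__(self, images, density_maps):
        self.images = images
        self.density_maps = density_maps

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        return self.images[i], self.density_maps[i]

class TargetCrowdDataset(Dataset):
    """Real images x^t without density labels: each sample is randomly
    kept upside-up (label 0) or flipped upside-down (label 1)."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        x = self.images[i]               # tensor of shape (3, H, W)
        flipped = random.random() < 0.5
        if flipped:
            x = torch.flip(x, dims=[1])  # reverse the height axis
        return x, torch.tensor(float(flipped))
```

The flip labels come for free, which is what makes the auxiliary task self-supervised: no annotation of the real images is ever required.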
In most real-world scenarios, we have $N_s \gg N_t$. Our goal is to learn a model that performs well on the target-domain data. To this end, we use a state-of-the-art encoder/decoder architecture for people density estimation [86]. We chose this one because it has already been used by cross-domain crowd counting approaches and therefore allows for a fair comparison of our approach against earlier ones. Let $E$ and $D$ be the encoder and decoder networks that jointly form the people density estimation network $F$ of [86]. Given an input image $\mathbf{x}$, $E$ returns the deep features $\mathbf{f} = E(\mathbf{x})$ that $D$ takes as input to return the density map $D(\mathbf{f})$.

One way to enable self-supervision for classification purposes is to use a partially trained network to predict labels and associated probabilities, and to treat the most probable ones as pseudo labels that can be used for training purposes as though they were ground-truth labels [98, 97]. This strategy is widely used to provide pixel-wise [111] and image-wise [110] self-supervision to address classification problems. If the probability measure is reliable and allows the discarding of potentially erroneous labels, repeating this procedure several times results in the network being progressively refined without any need for ground-truth labels.

To implement a similar mechanism in our context, we need more than labels at the image level. We require estimates of which individual densities in an estimated density map are likely correct and which are not. In other words, we need a stochastic crowd density map instead of the deterministic one that existing methods produce.

Figure 3 (caption): Masksembles approach. During training, for every input vector, a binary mask is selected from a set of pre-generated masks and is used to zero out a corresponding set of features. Performing the inference several times using different masks then yields an ensemble-like behavior.

Among the methods that can be used to turn our network $F$ into one that returns such stochastic density maps, MC-Dropout [15] and Deep Ensembles [32] have emerged as two of the most popular. Both exploit the concept of ensembles to produce uncertainty estimates. Deep Ensembles are widely acknowledged to yield significantly more reliable uncertainty estimates [57, 1]. However, they require training many different copies of the network, which can be very slow and memory consuming. Instead, we rely on Masksembles, a recent approach [13] that operates on the same basic principle as MC-Dropout. However, instead of achieving randomness by dropping a different subset of weights for each observed sample, it relies on a set of precomputed binary masks that specify the network parameters to be dropped. Fig. 3 depicts this process.

In practice, we associate a Masksembles layer with the first convolutional layer of the decoder $D$. During training, for each sample in a batch, we randomly choose one of the masks and set the corresponding weights to one or zero in the Masksembles layer, which drops the corresponding parts of the model just like standard dropout. During inference, we run the model multiple times, once per mask, to obtain a set of predictions and, ultimately, an uncertainty estimate. This provides uncertainty estimates that are almost as reliable as those of Deep Ensembles without having to train multiple networks, and it is therefore much faster and easier to train.
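To make the mechanism concrete, below is a minimal sketch of the mask-based idea, not the implementation of [13]: we simplify mask generation to independent Bernoulli draws per channel, whereas Masksembles generates its masks with controlled overlap, and all names, the number of masks, and the keep probability are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MasksemblesLayer(nn.Module):
    """Simplified sketch: N fixed binary masks over feature channels.
    The real method [13] builds masks with controlled overlap; here
    they are drawn once from a Bernoulli for illustration only."""
    def __init__(self, channels, num_masks=4, keep_prob=0.8):
        super().__init__()
        masks = (torch.rand(num_masks, channels) < keep_prob).float()
        self.register_buffer("masks", masks)  # fixed, not trained
        self.num_masks = num_masks

    def forward(self, x, mask_id):
        # x: (B, C, H, W); zero out the channels dropped by mask mask_id
        return x * self.masks[mask_id].view(1, -1, 1, 1)

@torch.no_grad()
def predict_with_uncertainty(encoder, decoder_head, mask_layer, x):
    """One forward pass per mask: the per-pixel mean of the predictions
    is the density estimate and the per-pixel std is the uncertainty."""
    f = encoder(x)
    preds = torch.stack(
        [decoder_head(mask_layer(f, m)) for m in range(mask_layer.num_masks)]
    )                          # (N, B, 1, H, W)
    y_bar = preds.mean(dim=0)  # density map
    u = preds.std(dim=0)       # pixel-wise uncertainty
    return y_bar, u
```

During training, `mask_id` would be drawn at random for each sample; at inference, looping over all masks yields the ensemble-like set of predictions from which the mean density and the pixel-wise uncertainty are computed, as formalized next.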
Formally, we write

$$\bar{\mathbf{y}} = \frac{1}{M} \sum_{m=1}^{M} F_m(\mathbf{x}), \qquad \mathbf{u} = \mathrm{std}\left(\{F_m(\mathbf{x})\}_{m=1}^{M}\right),$$

where $\mathbf{x}$ is the input image and $F_m$ is the modified network $F$ used with mask $m$, out of $M$ masks in total. $\bar{\mathbf{y}}$ and $\mathbf{u}$ are of the same size as the input image, and we treat the individual values $u \in \mathbf{u}$ as pixel-wise uncertainties.

$F_m$ can be trained in a supervised fashion using the synthetic training set $D^s$, but that does not guarantee that it will work well on real images. Hence, we introduce the auxiliary task decoder $D_{aux}$, shown at the top of Fig. 2, whose task is to classify an image as being oriented normally or being upside-down from the features produced by the encoder. To train the resulting two-branch network, we use synthetic images from $D^s$ along with real images from $D^t$ and flipped versions of these, such as the ones shown in Fig. 4. For the synthetic images, the output should minimize the usual $L_2$ loss given the ground-truth density maps and, for the real images, the output should minimize a cross-entropy loss for binary classification as being either upside-up or upside-down. Formally, we introduce the loss function

$$\mathcal{L} = \sum_{i=1}^{N_s} L_s\left(F_m(\mathbf{x}^s_i), \mathbf{y}^s_i\right) + \sum_{i=1}^{N_t} L_a\left(D_{aux}(E(\mathbf{x}^t_i)), y^t_i\right),$$

which we minimize with respect to the weights of the encoder $E$ and the two decoders $D$ and $D_{aux}$. $L_s$ is the $L_2$ distance between the predicted people density map and the ground-truth one $\mathbf{y}^s_i$, while $L_a$ is the cross-entropy loss for binary classification given the ground-truth upside-up or upside-down label $y^t_i$ for image $\mathbf{x}^t_i$. We use this label only for the real images because we have ground-truth annotations for the synthetic ones. As will be shown in the results section, this provides sufficient supervision for the synthetic images, and applying the image-wise supervision to them as well brings no obvious improvement.

Algorithm 1 (excerpt): procedure FIRST STAGE($D^s$, $D^t$): initialize the weights of the people density estimation network $F_m$, with a single encoder $E$ and two decoders $D$ and $D_{aux}$.

Note that $L_s$ and $L_a$ use the same encoder $E$. To minimize $L_a$, and hence correctly estimate whether an input image is upside-down or not, $E$ must extract meaningful features from the real images and not only from the synthetic ones. Furthermore, these features must enable the decoder $D$ to handle scene perspective, that is, the fact that people densities are typically higher at the top of the image than at the bottom in upside-up images. In other words, minimizing $L_a$ forces $E$ to produce perspective-aware features, while minimizing $L_s$ forces the decoder $D$ to operate on such features to properly estimate people densities on the synthetic images. In this way, we make $E$ produce features that are appropriate for both synthetic and real images, hence mitigating the domain shift between the two, as will be demonstrated in the results section. This first training stage is summarized by the first procedure of Alg. 1.

After the first training stage described above, our model can produce both a density map $\bar{\mathbf{y}}$ and its corresponding uncertainty $\mathbf{u}$. Let $F^0_m$ be the corresponding network. We can now refine its weights to create increasingly better tuned networks $F^k_m$ for $1 \le k \le K$ by iteratively minimizing