title: Bringing Generalization to Deep Multi-View Pedestrian Detection
authors: Vora, Jeet; Dutta, Swetanjal; Jain, Kanishk; Karthik, Shyamgopal; Gandhi, Vineet
date: 2021-09-24

Multi-view Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address these concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, and varying numbers of cameras, and (b) we discuss the properties essential to bring generalization to MVD and propose a barebones model to incorporate them. We perform a comprehensive set of experiments on the WildTrack, MultiViewX, and GMVD datasets to motivate the necessity of evaluating the generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/GMVD

"Essentially all models are wrong, but some are useful." - George E. P. Box

In this work, we pursue the problem of Multi-View Detection (MVD), a mainstream solution for dealing with occlusions, especially when detecting humans/pedestrians in crowded settings. The input to MVD methods is images from multiple calibrated cameras observing the same area from different viewpoints with an overlapping field of view. The predicted output is an occupancy map [1] in the ground plane (bird's eye view). Solutions for MVD have evolved from classical methods [1], [2], [3], to hybrid approaches [4], to end-to-end trainable deep learning architectures [5]. Expectedly, the current landscape of MVD is dominated by end-to-end trainable deep learning methods [5], [6], [7]. We argue that by training and testing on homogeneous data, current deep MVD methods have overlooked critical fundamental concerns, and to render them useful, the focus should shift towards their generalization abilities.

Ideally, three forms of generalization are essential for the practical scalability and deployment of MVD methods, as illustrated in Fig. 1:
1) Varying number of cameras: The model should adapt to a varying number of cameras (a network trained on six camera views should work on a setup with five cameras).
2) Varying configuration: The model should not overfit to the specific camera configuration. The performance should be similar even with altered camera positions, as long as they span the dedicated area.
3) Varying scenes: Models trained on one scene should work on another (a model trained at a traffic signal should work on a setup inside a university).

Surprisingly, the existing deep learning-based MVD methods are primarily trained and tested with the same camera configuration, on the same scene, using the same number of cameras. Even the environmental conditions (time, weather, etc.) are similar across train and test splits.
For instance, the most commonly used WildTrack dataset [8] includes a 200-second recording from all cameras, where the first 3 minutes are used for training and the remaining 20 seconds are used for testing. We argue that the current State Of The Art (SOTA) methods are seriously hindered from a deployment perspective. The current models [5], [6], [7] will break if a camera malfunctions. They will need retraining if a camera needs to be added to the setup. Furthermore, our experiments show that the performance drops significantly if the camera positions or the scene is varied. The SOTA models also seem to overfit to the order in which the cameras are fed to the model.

The absence of a diverse dataset is a major shortcoming. The available datasets, WildTrack (real) and MultiViewX (synthetic), comprise a single short sequence, where the initial frames are used for training and the later ones for testing. In Figure 2, we show that the evaluation strategy in both datasets is unreliable and prone to overfitting. To this end, we propose a novel Generalized MVD (GMVD) dataset. Given the privacy concerns, COVID restrictions, hardware setup difficulties, the requirement of manual annotations, etc., we believe curating a sizeable synthetic dataset is the right way forward. Hence, we use the Unity and GTA game environments to capture the GMVD dataset. It includes about 53 sequences captured in 7 different scenes with significant variations in camera configuration, weather, lighting conditions, pedestrian appearance, etc. The number of cameras also varies across scenes. We use 6 scenes for training and 1 scene for testing. The proposed GMVD dataset sets up a new benchmark for evaluating MVD with generalization. It further allows reserving valuable real-world footage [8] directly for testing.

Furthermore, we suggest a set of design guidelines to ensure the practical usability of deep MVD methods. We demonstrate that permutation invariance, transfer learning, and regularization are vital for generalization. We improve the baseline architecture [5] with appropriate changes and establish SOTA generalization for MVD. We want to emphasize that we do not claim any major architectural novelty, and our work focuses on the barebones baseline architecture. Overall, our work makes the following contributions:
1) We conceptualize and emphasize the importance of generalization in MVD and propose a novel GMVD dataset for the same.
2) We highlight the shortcomings of the current evaluation methodology and propose novel experimental setups on existing datasets.
3) We adapt the baseline architecture to bring generalization to deep MVD. We show that permutation invariance is crucial for MVD and that average pooling is one minimal way to achieve it. We propose a novel drop view regularization.
4) We back our claims using an extensive set of experiments and ablation studies. We show staggering improvements in scene and configuration generalization, paving the way for practicable MVD.

Seminal work by Fleuret et al. [1] cast MVD as predicting occupancy probabilities over a discrete grid, an idea which has stood the test of time. The classical methods in MVD rely on background subtraction to compute a likelihood over a fixed set of anchor boxes derived using scene geometry, project them onto the top view, and adopt conditional random field (CRF) or mean-field inference for spatial aggregation [1], [2], [3].
The classical methods, however, observe a gradual degradation in detection performance with increased crowding, as background subtraction becomes less effective with the increase in crowds and clutter. Some methods do away with background subtraction and instead rely on handcrafted classifiers [9].

Anchor-based MVD methods replace background subtraction with anchor-based deep pedestrian detectors like Faster R-CNN [10], SSD [11], and YOLO [12]. Some of these methods process each view separately [13], and some process them simultaneously [14], [15]. The inaccuracies in the pre-defined anchor boxes [4] limit the performance of these methods. Even if the boxes are correct, locating the exact ground point to project in each 2D bounding box presents a challenge and leads to a significant amount of error. Moreover, some of the anchor-based methods still rely on operations outside of Convolutional Neural Networks (CNNs), requiring a balance to be worked out between different potential terms [14].

MVDet [5] is a recent anchor-free approach that aggregates multi-view information by perspective transformation, concatenating the multi-view feature maps on the ground plane, and then performing large-kernel convolutions for spatial aggregation. It overcomes the limitations of manual tuning of CRF potentials, reliance on pre-defined 3D anchor boxes, and projection errors from monocular detectors. It aggregates projected features from a ResNet [16] backbone using three convolutional layers to predict the final occupancy map. MVDet achieves a notable improvement over the preceding anchor-based methods (over 14% improvement on the WildTrack dataset [8]). The idea from [5] was further enhanced by using deformable transformers [17] to improve the feature aggregation in MVDeTr [6]. More recently, SHOT [7] introduced a combination of homographies at multiple heights to improve the quality of the projections.

We propose a new MVD dataset incorporating the three forms of generalization discussed above (Figure 1). Some example frames from the proposed Generalized Multi-View Detection (GMVD) dataset are illustrated in Figure 3. The GMVD dataset contains diverse non-overlapping scenes within and across the training and test sets. In contrast, the existing MVD datasets, WildTrack and MultiViewX, include noticeable overlap across the train and test splits (single scene, pedestrian appearance, and location), encouraging existing MVD methods to overfit to dataset-specific aspects and thus hindering their practicality. The GMVD dataset, by design, prevents such overfitting by keeping a clear separation between the train and test splits.

Capturing a real-world MVD dataset is difficult, primarily because of privacy concerns. COVID restrictions also limit the capture of crowded human scenes. Moreover, such a dataset requires significant manual annotation effort. Consequently, we curate the GMVD dataset using synthetic environments. The GMVD dataset is curated using Grand Theft Auto V (GTAV) and the Unity game engine. We employ two different environments to avoid overfitting to a single synthetic data generation source. This reasoning is aligned with recent works [18], [19], which utilize multi-source datasets to improve generalization performance. The GMVD dataset includes seven distinct scenes, one indoor (subway) and six outdoors. One of the scenes is reserved for the test split. We vary the total number of cameras in each scene and provide different camera configurations within a scene.
Additional salient features of GMVD include daytime variations (morning, afternoon, evening, night) and weather variations (sunny, cloudy, rainy, snowy). We generate multiple short sequences for each scene while randomly varying the daytime and the weather. The generation of multiple random sequences ensures diversity, as different pedestrians (with different clothing and appearance) are picked in each case. The dataset also includes significant variations in lighting conditions. Local illumination sources come into play due to the presence of indoor and night scenes.

We compare our dataset with the existing ones in Table I. Avg. Coverage represents the average number of cameras observing each location. For GMVD, the average coverage varies from 2.76 to 6.4 cameras depending on the scene. In addition to the discussed variations, GMVD is advantageous due to its dataset size, especially in terms of the total number of individual sequences. We therefore propose the GMVD dataset as a new benchmark for MVD. We further encourage future methods to train on the GMVD dataset and test their performance on sparsely available, difficult-to-capture real-world datasets like WildTrack.

Dataset Generation: We used the Script Hook V [20] library to interface with the GTAV environment. For each scene, the camera positioning and orientation were determined manually so as to maximize camera coverage. All the cameras were positioned above the humans' average height. Due to hardware limitations, it is commonplace to have a small synchronization delay in real-world multi-camera setups. To emulate such a realistic scenario, we induce a small synchronization error (20-100 ms) between different camera views [21]. A ground plane was defined for each location, partially overlapping with each camera's field of view. Only pedestrians inside the ground plane were considered for multi-view detection. We relied on GTA's navigational AI engine to avoid collisions and to obtain realistic pedestrian behavior. In the Unity environment, the scene is manually curated by putting together 3D models of streets, buildings, and other props. We used the PersonX [22] 3D human models to create the pedestrians. To avoid collision errors (which are present in the MultiViewX dataset), pedestrians were spawned at random locations within the region of interest for every frame. Since both environments are synthetic, the 3D-2D correspondences were directly available from the game engines. We use a procedure similar to [5] for camera calibration.

Track Labels: Our work focuses on a comprehensive analysis of the problem of Multi-View Detection. However, the proposed dataset can also be useful for the task of multi-view pedestrian tracking. To this end, for the sequences generated from the GTAV environment, we collect the track labels while capturing the data. While we do not use track labels in this work, we provide them with the dataset, which will be beneficial for the community in the future. We provide a total of 125,000 frames with track labels. The GTAV frames for the GMVD dataset are regularly sampled from these densely annotated sequences.

We propose an anchor-free deep MVD method along the lines of [5], [6], [7], specifically tailored to improve generalization abilities by modifying the training objective and making use of an average pooling strategy on the projected feature maps. The overall architecture is shown in Fig. 4.
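As a rough companion to Fig. 4, the sketch below shows one way such a pipeline can be organized in PyTorch: per-view backbone features are projected onto the ground plane, average pooled across views, and passed through dilated convolutions to produce the occupancy map. This is a minimal sketch under stated assumptions, not the authors' exact implementation: the module names and grid size are illustrative, the backbone is a plain (non-dilated) ResNet-18 truncation, and the homography-based warp is a grid_sample stand-in for the perspective projection detailed in the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


def ground_plane_warp(feat, P, grid_size):
    """Inverse-warp a (C, Hf, Wf) feature map onto a (C, Hg, Wg) ground-plane grid.
    P is the 3x3 homography mapping ground-grid coordinates (col, row, 1) to
    feature-map pixel coordinates; it must already include the image-to-feature scale."""
    C, Hf, Wf = feat.shape
    Hg, Wg = grid_size
    ys, xs = torch.meshgrid(torch.arange(Hg), torch.arange(Wg), indexing="ij")
    ground = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (Hg, Wg, 3)
    pix = ground @ P.T                                                   # (Hg, Wg, 3)
    pix = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)                    # divide out the scale s
    norm = torch.stack([2 * pix[..., 0] / (Wf - 1) - 1,                  # x to [-1, 1]
                        2 * pix[..., 1] / (Hf - 1) - 1], dim=-1)         # y to [-1, 1]
    return F.grid_sample(feat[None], norm[None], align_corners=True)[0]  # (C, Hg, Wg)


class MultiViewDetector(nn.Module):
    def __init__(self, grid_size=(120, 360), feat_dim=512):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # per-view feature extractor
        self.grid_size = grid_size
        # spatial aggregation: three 3x3 convolutions with dilation 1, 2, and 4
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 512, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(512, 1, 3, padding=4, dilation=4))

    def forward(self, images, homographies):
        # images: list of N tensors (3, Hi, Wi); homographies: list of N 3x3 matrices
        projected = [ground_plane_warp(self.backbone(img[None])[0], P, self.grid_size)
                     for img, P in zip(images, homographies)]
        pooled = torch.stack(projected).mean(dim=0)   # permutation-invariant average pooling
        return self.decoder(pooled[None])[0, 0]       # (Hg, Wg) occupancy logits
```

Because the pooling is a simple mean over whichever views are supplied, the same trained weights can, in principle, be applied to a different number or ordering of cameras, which is the property the following sections rely on.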
The input to our pipeline is a set of images from multiple calibrated RGB cameras with overlapping fields of view, and the expected output is the occupancy map for pedestrians.

Feature Extractor: We use a ResNet18 [16] backbone as a feature extractor, replacing the last three strided convolutions with dilated convolutions to retain a high spatial resolution in the feature maps. Given N camera views with images of size (H_i, W_i), where H_i and W_i correspond to the height and width of the images, C-channel features are extracted for the N camera views, giving a tensor of size (N, C, H_f, W_f), where H_f and W_f represent the height and width of the extracted features.

Perspective Transformation: The extracted features are then projected onto the ground plane grid using a perspective transformation, where (H_g, W_g) corresponds to the height and width of the ground plane grid. For the calibrated cameras, K represents the intrinsic camera parameters and [R|t] represents the extrinsic camera parameters (R is the rotation matrix and t is the translation vector). In the world coordinate system, the ground plane corresponds to Z = 0, i.e., W = (X, Y, 0, 1)^T. A pixel of an image I = (x, y)^T is related to the ground plane point as follows:

s (x, y, 1)^T = K [R|t] (X, Y, 0, 1)^T = P (X, Y, 1)^T,

where s is a scaling factor and P is the 3x3 perspective transformation matrix obtained from K [R|t] by dropping the column corresponding to Z.

Average Pooling: We first project the ResNet feature maps from each viewpoint onto the bird's eye view using the perspective transformation to obtain the projected feature maps fm_i (where i = 1, 2, ..., N). Following this, we average pool the projected feature maps fm_i to obtain the final bird's eye view feature representation F of size (C, H_g, W_g), written as

F = (1/N) Σ_{i=1}^{N} fm_i.

While there can be many other alternatives to average pooling, we opt for this solution primarily because it is permutation-invariant. Unlike MVDet, where the camera views ideally need to be input in the same order during inference as during training, our proposed solution can accept an arbitrary number of views in an arbitrary order. Furthermore, the average pooling solution is free from any learnable parameters, which ensures that no overfitting is introduced by this operation. The projected feature maps for N cameras, of size (N, C, H_g, W_g), reduce after average pooling to (C, H_g, W_g), thus removing the dependency on the number of camera views and thereby allowing the model to take an arbitrary number of views as input.

Loss Function: The loss function compares the output probabilistic occupancy map (p) with the ground truth (g). Inspired by the work on saliency estimation in images and videos [26], [27], [28], we use a combination of the Kullback-Leibler Divergence (KLDiv) and Pearson Cross-Correlation (CC) metrics as the loss function. The cross-correlation term is computed as

CC(p, g) = σ(p, g) / (σ(p) σ(g)),

where σ(p, g) is the covariance of p and g, and σ(p) and σ(g) are the standard deviations of p and g, respectively; the final loss combines the KLDiv term with this correlation term.

Datasets: We experiment on the WildTrack [8] and MultiViewX [5] datasets, in addition to the proposed GMVD dataset.

Evaluation metrics: We use the standard evaluation metrics proposed in [8].

State-of-the-art comparisons: We compare against nine different methods. The set includes one monocular object detection baseline (referred to as RCNN clustering [13]); a classical probabilistic occupancy map method [1]; four anchor-based methods [30], [14], [15], [29]; and three recent end-to-end trainable deep MVD approaches [5], [6], [7]. For the generalization experiments, we only compare against the recent state-of-the-art methods MVDet [5], MVDeTr [6], and SHOT [7]. Downsampled images of 720 × 1280 pixels serve as the input to the model.
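Before the remaining implementation details, the following is a minimal sketch of the KLDiv + CC objective described above, in PyTorch. How the two terms are weighted and how the maps are normalized into distributions are assumptions made for illustration; p and g denote the predicted and ground-truth occupancy maps.

```python
import torch


def kl_cc_loss(p, g, eps=1e-8):
    """Combined KLDiv + Pearson cross-correlation loss between a predicted
    occupancy map p and the ground-truth map g (both non-negative 2D tensors)."""
    p, g = p.flatten(), g.flatten()
    # KLDiv term: treat both maps as distributions over the ground-plane cells
    p_dist = p / (p.sum() + eps)
    g_dist = g / (g.sum() + eps)
    kl = (g_dist * torch.log((g_dist + eps) / (p_dist + eps))).sum()
    # CC term: covariance normalised by the standard deviations
    pc, gc = p - p.mean(), g - g.mean()
    cc = (pc * gc).sum() / (torch.sqrt((pc * pc).sum() * (gc * gc).sum()) + eps)
    # assumed combination: minimise KLDiv while maximising the correlation
    return kl + (1.0 - cc)
```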
The features extracted from ResNet-18 have C = 512 channels and are bilinearly interpolated to a shape of 270 × 480. These extracted features of size (N, C = 512, H_f = 270, W_f = 480) are projected onto the top view to obtain features of size (N, 512, H_g, W_g) for the N viewpoints, which are average pooled to obtain the ground plane grid of shape (512, H_g, W_g). H_g and W_g vary from scene to scene, depending on the area of the ground plane. The spatial aggregation has three layers of dilated convolution with a 3 × 3 kernel size and dilation factors of 1, 2, and 4. Training is done for ten epochs with early stopping; we set the batch size to 1 and use the SGD optimizer with momentum 0.9 and a one-cycle learning rate scheduler. A probability of τ or more on the occupancy grid is considered a detection. For the GMVD experiments, τ is determined using MultiViewX as a validation set, and for the other experiments, we use τ = 0.4 in alignment with previous works. Non-Maximal Suppression (NMS) is applied with a spatial resolution of 0.5 m. All training and testing have been performed on a single Nvidia GTX 1080 Ti GPU. Unless specifically mentioned, we always use pre-trained ImageNet [32] weights while training our proposed model.

Like prior works, we evaluate our approach on the WildTrack and MultiViewX datasets in Table II. We find that our proposed model attains satisfactory performance on the test sets of both WildTrack (best MODA score of 87.2) and MultiViewX (best MODA score of 88.2). This is slightly worse than the recently proposed methods [6], [7], but far superior to the performance of the classical and anchor-based MVD methods. However, we would like to highlight that the traditional evaluation protocol is highly misleading since the train and test sets have significant overlap, thereby encouraging overfitting. Therefore, we emphasize the evaluation across a varying number of cameras, changing camera configurations, and on new scenes.

Changing Camera Configurations: In this experiment, the camera positions are varied between the train and test sets. We train all the models on two sets of camera views and then test the trained models on both sets. The results are provided in Table V. When the models are evaluated on the same camera configuration, all the models have satisfactory performance. However, when evaluated on a different camera configuration, MVDet, MVDeTr, and SHOT see a huge degradation in performance. Our model is fairly robust to the changing camera configuration. Especially when trained with DropView regularization, the resulting model outperforms all other models by over 20 percentage points.

Scene Generalization: Finally, an important concern with the practical utility of MVD methods is that, since real-world data is scarce, a trained model should be able to generalize to new scenes. We first evaluate the scene generalization abilities of the MVD methods by training them on MultiViewX and evaluating them on WildTrack in Table IV. Our proposed model is able to utilize the extra camera present in the WildTrack dataset and achieves a MODA score of 70.7. This further highlights the benefits of an architecture that works with an arbitrary number of views, since the performance during inference can be enhanced by adding more views. However, even without the additional view, our model achieves a MODA score of 66.1, which is much higher than SHOT, which only achieves a MODA score of 53.6.
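For reference, the sketch below shows one way the detections used in these evaluations can be read off the predicted occupancy map: cells with probability at least τ are kept, and greedy NMS with a 0.5 m radius is applied, as stated in the implementation details. The ground-plane cell size is an assumption and must be set to the dataset's actual grid resolution.

```python
import torch


def extract_detections(occupancy, tau=0.4, nms_radius_m=0.5, cell_size_m=0.1):
    """Threshold the (Hg, Wg) occupancy probability map at tau, then apply greedy
    NMS so that no two kept detections are closer than nms_radius_m on the ground."""
    Hg, Wg = occupancy.shape
    ys, xs = torch.meshgrid(torch.arange(Hg), torch.arange(Wg), indexing="ij")
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()
    scores = occupancy.flatten()
    keep = scores >= tau
    coords, scores = coords[keep], scores[keep]
    order = scores.argsort(descending=True)            # process high-confidence cells first
    radius_cells = nms_radius_m / cell_size_m
    selected = []
    for idx in order.tolist():
        if all(torch.dist(coords[idx], coords[j]) > radius_cells for j in selected):
            selected.append(idx)
    return coords[selected], scores[selected]          # ground-plane cell positions and scores
```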
In addition to this, we perform the scene generalization experiment proposed in [7], where the MultiViewX scene is split into two halves, and each half is covered using 3 cameras. In this setting as well (Table VI), our proposed approach with DropView regularization has a MODA score of 66.1, which is significantly higher than both SHOT (49.1) and MVDeTr (56.5).

GMVD Benchmark: Having shown that our proposed model is capable of comprehensive generalization, we benchmark our proposed approach on the GMVD dataset (Table VII). We train our model on the training set of the GMVD dataset and use the MultiViewX dataset for validation. Since each sequence in the training set has a different number of cameras, none of the existing methods can be adapted to this setting, as they can be trained only with a fixed set of cameras. When evaluated on WildTrack, our model is able to achieve a MODA score of 80.1, which is a significant improvement over the results from training on MultiViewX. Notably, this shows that by training on our synthetic dataset, we can nearly attain the same performance as training on WildTrack itself. When evaluated on the GMVD test set, our model achieves a MODA score of 68.2. The results empirically suggest the difficulty of the GMVD test set compared to WildTrack and MultiViewX, resulting from a distinct train-test split and the presence of extensive variations. We believe that our dataset can serve two important purposes. The first is as a diverse, synthetic dataset from which a model can be adapted to real-world data. The second is that the GMVD dataset itself can be a challenging benchmark to evaluate the generalization capabilities of MVD methods. In this setting, using MultiViewX for validation is ideal, since it ensures that no information from the test set is leaked during training.

The biggest limitation in the field of Multi-View Detection is that real-world data capture is extremely challenging due to the difficulty of collecting a dataset with people, in addition to the challenges involved in the hardware setup and annotations. The absence of a large, diverse benchmark significantly hampers the progress of this topic. Therefore, the existing WildTrack dataset is extremely valuable for the community. However, due to its limited size and variety, it is not suitable for training and should only be used to evaluate the generalization abilities of the models. In this regard, we hope that our proposed dataset and our barebones model serve as useful tools in bridging the gap between the theory and real-world application of MVD methods. In our work, we have not explored the use of unsupervised domain adaptation techniques to bridge the gap between the feature distributions of the synthetic and real datasets; this direction is left for future work.

We find the current Multi-View Detection setup severely limited, encouraging models to overfit to the training configuration. Therefore, we conceptualize and propose novel experimental setups to evaluate the generalization capabilities of MVD models in a more practical setting. We find the state-of-the-art models to have poor generalization capabilities on our proposed setups. To alleviate this issue, we introduce changes to the feature aggregation strategy and the loss function, as well as a novel regularization strategy. With the help of comprehensive experiments, we demonstrate the benefits of our proposed architecture.
In addition to this, we propose a diverse, synthetic, but realistic dataset, which can be used both as an evaluation benchmark and as a training dataset for various MVD methods. Overall, we hope our work plays a crucial role in steering the community towards more practical Multi-View Detection solutions.

We ablate the choice of the loss function in Table VIII for the scene generalization experiment. We consider the Mean Squared Error (MSE), KL-Divergence (KL), and Pearson Cross-Correlation (CC), as well as our chosen loss function (KL+CC). We find that the combination of KL-Divergence and Pearson Cross-Correlation achieves significantly better results than any other loss function.

First, we show the predicted occupancy maps of MVDet, MVDeTr, SHOT, and our method and compare them with the ground truth in the traditional setting. Subsequently, qualitative results are shown w.r.t. the three generalization abilities, obtained from both the WildTrack and MultiViewX datasets. The results under the traditional evaluation, which contain the occupancy maps of the ground truth, our method, MVDet, MVDeTr, and SHOT, are shown in Fig. 7. The occupancy map from our method, which uses average pooling, the KL+CC loss function, and ImageNet pre-training, gives more accurate localization compared to the base MVDet architecture. The results (maps) are competitive when compared to SHOT and MVDeTr. The maps obtained using MVDeTr are sharper and more focused; however, they also have more false positives.

Varying number of cameras: The output occupancy maps for a varying number of cameras are shown in Fig. 8. WildTrack consists of seven cameras; we show the results inferred with three to six cameras. As the number of views increases, we get a more accurately localized occupancy map.

Changing camera configurations: The output occupancy maps for cross-subset evaluation are shown in Fig. 10. Here, we have the occupancy maps for a model trained on one set and tested on the other set; for example, trained on camera views one, three, five, and seven and tested on cameras two, four, five, and six, or vice versa, following the camera splits shown in Figure 6. Clearly, the pre-training improves localization in both methods. Furthermore, our method with average pooling is better at disambiguating the occlusions and also gives brighter outputs (resulting in sharp maxima).

In this subsection, the qualitative results for the MultiViewX dataset are shown. We consider configurations similar to those used for the WildTrack dataset. The obtained results clearly indicate the improvements our method brings over the MVDet, MVDeTr, and SHOT models, and the observations are similar to those for the WildTrack dataset. Fig. 7 shows the traditionally evaluated results.

Varying number of cameras: The output occupancy maps for a varying number of cameras are shown in Fig. 11. MultiViewX consists of six cameras; we show the results inferred with three to five cameras. As the number of views increases, we get a more accurately localized occupancy map.

Changing camera configurations: The output occupancy maps for cross-subset evaluation are shown in Fig. 12. Here, we have the occupancy maps for a model trained on one set and tested on the other set; for example, trained on camera views one, three, and four and tested on cameras two, five, and six, or vice versa. The camera splits are shown in Figure 9 and their results in Table IX.
The qualitative results of the output occupancy maps for cross-dataset evaluation are shown in Fig. 13, when we train on a synthetic dataset (MultiViewX) and test on a real dataset (WildTrack). The first four occupancy maps are the outputs of MVDet, MVDeTr, SHOT, and our method when tested on only six views of the WildTrack dataset, for a fair comparison with the other methods. We also show the output occupancy map when tested on all the views of the WildTrack dataset. Our method provides accurately localized occupancy maps and disambiguates the occlusions better than the other methods.

References
[1] Multicamera people tracking with a probabilistic occupancy map
[2] Multiple object tracking using k-shortest paths optimization
[3] Sparsity driven people localization with a heterogeneous network of cameras
[4] Foveabox: Beyound anchor-based object detection
[5] Multiview detection with feature perspective transformation
[6] Multiview detection with shadow transformer (and view-coherent data augmentation)
[7] Stacked homography transformations for multi-view pedestrian detection
[8] Wildtrack: A multi-camera hd dataset for dense unscripted pedestrian detection
[9] Conditional random fields for multi-camera object detection
[10] Faster r-cnn: Towards real-time object detection with region proposal networks
[11] Ssd: Single shot multibox detector
[12] You only look once: Unified, real-time object detection
[13] Multi-view people tracking via hierarchical trajectory composition
[14] Deep occlusion reasoning for multicamera multi-target detection
[15] Deep multi-camera people detection
[16] Deep residual learning for image recognition
[17] Deformable detr: Deformable transformers for end-to-end object detection
[18] mdalu: Multisource domain adaptation and label unification with partial datasets
[19] Learning to generalize unseen domains via memory-based multisource meta-learning for person re-identification
[20] Script Hook V
[21] The mta dataset for multi-target multi-camera pedestrian tracking by weighted distance aggregation
[22] Dissecting person re-identification from the viewpoint of viewpoint
[23] Dropout: a simple way to prevent neural networks from overfitting
[24] Self-supervised feature learning by learning to spot artifacts
[25] Self-supervised learning for video correspondence flow
[26] What do different evaluation metrics tell us about saliency models
[27] Tidying deep saliency prediction architectures
[28] Vinet: Pushing the limits of visual modality for audiovisual saliency prediction
[29] Semantic driven multi-camera pedestrian detection
[30] Generalizable multi-camera 3d pedestrian detection
[31] Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol
[32] Imagenet: A large-scale hierarchical image database