key: cord-0272153-vjooy75c
authors: Barquero, Germán; Hupont, Isabelle; Fernández, Carles
title: Rank-based verification for long-term face tracking in crowded scenes
date: 2021-07-28
journal: nan
DOI: 10.1109/tbiom.2021.3099568
sha: 3801a5f127a682d36e76613092d62428a29b49f5
doc_id: 272153
cord_uid: vjooy75c

Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that often cannot operate in real-time, making them impractical for video-surveillance. In this paper we present a long-term, multi-face tracking architecture conceived for working in crowded contexts where faces are often the only visible part of a person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking, and is particularly robust to the motion and occlusions of people. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on rank-based face verification. The proposed rank-based constraint favours higher inter-class distance among tracklets, and reduces the propagation of errors due to wrong reconnections. Additionally, a correction module is included to correct past assignments with no extra computational cost. We present a series of experiments introducing novel specialized metrics for the evaluation of long-term tracking capabilities, and publicly release a video dataset with 10 manually annotated videos and a total length of 8' 54". Our findings validate the robustness of each of the proposed modules, and demonstrate that, in these challenging contexts, our approach yields up to 50% longer tracks than state-of-the-art deep learning trackers.

Recent advances in Convolutional Neural Networks (CNNs), IP cameras (optics, video compression, frame rates, ultra-high resolutions) and computational hardware (GPUs, DNN accelerators) have allowed video-surveillance systems to move to increasingly crowded, large-scale and unconstrained scenarios. These novel scenarios typically involve crowds of people massively walking toward cameras located at near-eye level, as illustrated in Figure 1. Such scenarios include large public open spaces (squares, large avenues, parks) and critical infrastructures (transport stations, airports, government buildings, malls) of the utmost importance for law enforcement bodies. There is a need for accurate video analytics systems in these locations, where major security threats occur (e.g. terrorist attacks) and where COVID-19 prevention measures have to be guaranteed.

Figure 1: In these scenarios, the proposed face tracking approach surpasses full-body and general object tracking methods.

Video-surveillance applications need to raise immediate alerts as a response to security and sanitary threats. For example, when a face recognition system is deployed, alarms are sent to end-users (e.g. police bodies) every time a new subject is detected or identified. However, the recurrent sending of duplicate alarms has to be controlled and minimized, so as not to overwhelm end-users. To avoid this infobesity problem, subjects need to be tracked, so that one single alarm is generated per track. Obtaining long and accurate tracks (e.g. without ID switches and fragmentations) is therefore essential to increase the system's usability. In the case of COVID-19 prevention, it is also extremely important to accurately track individuals over the long term, so that people counts are as accurate as possible.
Real-time operation is also an important requirement for video analytics systems monitoring crowds. A subject may remain in the scene for seconds or even minutes. Making enforcement bodies wait until the end of tracks to receive alerts (which is known as an offline strategy) implies losing precious time that could save lives. Hence, online constraints become essential. Nevertheless, from the algorithmic perspective, it is particularly challenging to perform reliable online and long-term tracking of subjects in this context since:

• The high pedestrian densities observed in crowded video-surveilled places lead to reduced pedestrian visibility. This results in frequent mutual occlusions which impede full-body detection, deteriorating the performance of today's most common detection-based tracking algorithms [1]. Consequently, tracking algorithms have to rely exclusively on the facial region, which is most of the time the only visible part of the subjects, instead of full-body or upper-body regions as in classic pedestrian tracking works [2, 3].
• People in video-surveilled places typically move all around the scene, positioning themselves closer to or farther from the camera, and becoming occluded or blurred for long periods. Existing generic object trackers cannot handle these situations properly, as demonstrated in [4].
• Current state-of-the-art trackers, especially those based on Deep Learning, have a high computational cost [5]. While their real-time deployment may be possible in low-to-moderate pedestrian density scenes, computational performance drops dramatically in densely crowded scenarios.

There is a lack of available datasets covering video-surveillance scenarios where persons' faces are recorded at near-eye level, so that they are visible enough to allow face detection and tracking at the microscopic level, while also showing a macroscopic view of the crowd. Crowded video-surveillance videos recorded at near-eye level are particularly difficult to collect and annotate.

This work is an extension of [6], in which we proposed an architecture especially conceived for long-term face tracking in crowded video-surveillance contexts. More specifically, our main contributions can be summarized as follows:

• Our architecture recovers from partial and full long-term occlusions thanks to a novel online tracklet reconnection module grounded on rank-based face verification techniques.
• We also propose a track correction module, which updates past track assignments with current information. This module has no extra computational cost, while considerably improving long-term tracking performance.
• We validate the system against different state-of-the-art trackers, and present an ablation study quantifying the contribution of each proposed module. Four validation metrics designed to evaluate long-term tracking capabilities are introduced for that purpose.
• We publicly release to the community the video dataset used in our experiments, including ground truth track annotations. All our videos present novel crowd scenes recorded at near-eye level, where faces are visible enough to be analysed at the microscopic level, while also benefiting from a macroscopic view of the crowd.

In this extended version, we present an improved tracklet reconnection module that substantially outperforms the previous one [6] in all validation videos.
We extend the validation dataset with five new videos, which increase the total dataset length by 4' 4" (about twice the original duration) and represent more challenging scenarios in terms of illumination, people's motion (velocity, direction, etc.) and occlusions. We have also updated the analysis of the state of the art by including the most recent tracking methods. Two new trackers have been validated on our dataset and incorporated into our comparisons [7, 8]. Finally, we have included an experiment that replicates a typical video-surveillance context, in which long video sequences are simulated by means of distractors, to demonstrate the robustness of the proposed method in real-life settings.

Multi-object tracking is commonly carried out by assigning a single-object tracker to each target of interest. Current state-of-the-art single-object trackers are based on Deep Learning (DL). A typical DL approach consists in splitting the tracking process into two stages: first, a region proposal network extracts regions of interest from the image and, then, a discriminator selects the best region candidate for the target object. For the second stage, a number of approaches rely on Siamese CNNs [9, 10, 11, 12]. These trackers usually make strong temporal consistency assumptions to ensure a certain computational efficiency. Alternatively, other works propose to break these constraints and enable the tracker to search for the target at arbitrary positions and scales [8]. However, unconstrained trackers become unstable in scenarios where tracked objects are very similar. To gain efficiency, single-stage or "one-shot" trackers have been recently proposed [13, 14, 15]. In contrast to two-stage methods, one-shot trackers exploit multi-task learning to perform both object detection and feature extraction using a single network. Despite their impressive trade-off between accuracy and computational performance, they have so far only been applied to pedestrian (full-body) tracking. Their translation to other use cases, such as face-based tracking, still needs to be explored. Other works suggest saving a pool of templates from past track images, keeping the most representative and mutually different among them, and using them to match regions of interest [16]. Very recent works leverage temporal and spatial Transformers [17] to perform tracking-by-attention [18, 19, 20]. Although they are very promising, their performance in multi-object tracking, and especially in face tracking, still needs to be properly explored.

Nevertheless, a common drawback of DL-based methods is that they are computationally expensive and cannot handle long-term tracking by themselves: they fail to re-locate out-of-view targets when these re-appear in the scene. We refer the reader to [4] for a comprehensive demonstration and to [5] for an in-depth survey on DL trackers. A longer-term DL tracker has been recently proposed in [21], but it requires an initial offline training of the model with images of the particular target to track, and is thus not suitable for our use case. More computationally efficient trackers are not based on DL, yet achieve competitive accuracies. A clear example is SORT [7], in which Kalman filters and the Hungarian algorithm are combined to associate detections. In [22], the use of a Kernelized Correlation Filter (KCF) visual tracker is proposed to fill the gaps generated after applying a classic intersection-over-union (IOU) data association strategy between frame detections.
Both algorithms work at high frame rates, but still rely heavily on IOU values, which makes them prone to ID switches. Other high-performing visual trackers include MOSSE [23] and CSRT [24], which are both based on correlation filters, and the Median Flow tracker [25], based on motion flow. However, again, these trackers tend to fail under long-term occlusions.

The tracking of persons has mainly been tackled by applying generic object tracking approaches targeting full-body regions. Few studies rely exclusively on faces to track persons and address face tracking as a problem with its own particularities. Taking advantage of the high accuracies reached by current face detectors, some face tracking works propose tracking-by-detection approaches. In [26], a generic AdaBoost face detector is combined with an adaptive structured output SVM tracker, using an IOU data association strategy. However, this approach is only suitable for short-term tracking, as it does not implement any tracklet reconnection strategy, and its core tracker [27] has been proven slower and less accurate than many newer trackers [28, 29, 30]. As a longer-term approach, [31] applies the Tracking-Learning-Detection (TLD) paradigm to faces: the face is tracked and simultaneously learned by a detector that supports the tracker once it fails. More recent approaches achieve long-term face tracking by using clustering techniques to associate short-term tracklets [32, 33]. Short-term tracklets are obtained by combining detectors and simple data association methods. Then, facial features are computed for each detection through DL face recognition models, and clusters are extracted from the feature space to collapse same-identity tracklets. Although these approaches achieve state-of-the-art results, they work fully offline and imply a high computational cost, which is not suitable for real-time tracking.

It is also worth mentioning that some works propose tracking mechanisms to improve face recognition in videos, in scenarios where persons of interest are previously enrolled using a few still images. In [34] and [35], simple visual trackers are used to obtain tracks from which new (unseen) high-quality face stills are collected. These stills are matched against reference images to identify people. Interestingly, they are additionally used to enrich the gallery of enrolled images, thus improving face recognition performance. Nevertheless, these works focus on face recognition, leaving tracking as a secondary task.

Studies on people tracking have traditionally focused on full-body pedestrian tracking in low-to-moderately crowded urban scenes. As a result, several pedestrian video datasets are available and commonly used by the community [2, 36, 37]. Another field for which a large number of datasets is available is crowd analysis, e.g. crowd counting or crowd behavior understanding [38, 39, 40]. These crowd datasets usually contain high-angle views, in which people's faces appear at very low resolutions (mostly below 30x30 pixels). Some datasets have been conceived for face tracking and are thus closer to our scope. A relevant example is the dataset released in [41]. However, it focuses on sitcom and music videos taken with many different shots, and is not crowded at all (6 IDs per video). Also, the MobiFace dataset has been released to evaluate in-the-wild face tracking algorithms for mobile devices [4].
Videos are recorded from moving smartphone cameras, sometimes in "selfie" mode, and contain few faces (fewer than 5) per video. Consequently, none of these datasets cover our use case. The only exception is the ChokePoint dataset, which provides a collection of 48 videos capturing individual subjects walking through two portals [42]. To pose a more challenging real-world surveillance problem, two extra sequences were recorded in an indoor crowded environment; these represent the scenarios that we are targeting in this work.

This work presents a four-module architecture that overcomes the limitations of previous approaches to favor long-term face tracking in crowded video-surveillance environments (see Figure 2). The following sections describe each module in detail.

The system firstly extracts tracklets following a tracking-by-detection approach. The tracking module is in charge of predicting face locations over the frames where the face detector fails, using one simple visual tracker. Several visual tracking algorithms are available for that purpose in our implementation, including KCF, MOSSE, Median Flow and CSRT. The tracking module creates a tracklet T_i for every newly detected face i. In case the detector loses a face in the following frame, it keeps predicting its position until the data association module decides to (i) update the position with a new detection, or (ii) force it to die. These tracklets are additionally used to collect the pool of reference images that serve as a basis for the face identification mechanism (cf. Section 3.3).

In order to decide which detection should guide which tracklet, a data association problem needs to be solved. Once the tracking module has predicted the new K_t positions of live tracklets for a frame t, the data association module retrieves the N_t faces detected in that frame. A state-of-the-art face detector is used in our implementation for that purpose [43]. Then, to establish correspondences between predicted and detected face bounding boxes, the Munkres implementation of the Hungarian algorithm is applied to their IOU values [44]. For every correspondence with an IOU value above a threshold λ_IOU, the tracking module updates the corresponding tracklet with the new detected bounding box position. For detected faces without a tracklet correspondence, a new tracklet is initialized and these faces are considered as new identities. Tracklet predictions without a face detection correspondence are kept alive for T_max frames; the tracking module keeps predicting their location over those frames where no association is made. If the tracklet is not updated for T_max consecutive frames, it is forced to die and is marked as inactive.
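This association step maps directly onto a standard linear assignment problem. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples; the function names are illustrative, the default λ_IOU matches the value reported in our experiments, and SciPy's `linear_sum_assignment` stands in for the Munkres implementation of the Hungarian algorithm cited above [44].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detected, lambda_iou=0.25):
    """Match the K_t tracklet predictions to the N_t detections of a frame.

    Returns (matches, unmatched_detections, unmatched_tracklets); a match
    is only kept when its IOU exceeds lambda_iou.
    """
    if not predicted or not detected:
        return [], list(range(len(detected))), list(range(len(predicted)))
    cost = np.zeros((len(predicted), len(detected)))
    for i, p in enumerate(predicted):
        for j, d in enumerate(detected):
            cost[i, j] = -iou(p, d)  # Hungarian algorithm minimizes cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] > lambda_iou]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_d = [j for j in range(len(detected)) if j not in matched_d]
    unmatched_t = [i for i in range(len(predicted)) if i not in matched_t]
    return matches, unmatched_d, unmatched_t
```

Unmatched detections would then spawn new tracklets, while unmatched tracklets age until T_max consecutive frames pass without an update, as described above.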
When a partial or full occlusion occurs, trackers generally lose the tracked target and consider it as a new object when it re-appears. To overcome this limitation, our system incorporates an online face-based tracklet reconnection module (FBTR). This module is inspired by face verification: a face recognition model is used to collect reference image templates from each tracklet, and then a matching procedure is applied to unify same-identity tracklets. In our implementation, we use the state-of-the-art face recognition model of [45].

The selection of reference templates is driven by image quality. More particularly, three indicators are considered: (i) face detection confidence, (ii) head pose angles and (iii) an image sharpness metric. Detection confidence is a value directly provided by the face detector [43]. Pitch, yaw and roll head pose angles are estimated using the fiducial extractor by Zhu et al. [46]. Image sharpness is obtained by applying the Laplace operator to the facial area, as proposed by Nikitin et al. [47]. Using these quality indicators, face detections contained in each tracklet are divided into three groups (see Figure 3):

• Enrollable faces. Faces with high visual quality, used to enroll identities.
• Verifiable faces. Faces that have enough quality to produce reliable templates. In the tracklet reconnection process, verifiable faces are matched against enrollable faces. Note that enrollable faces are a subset of verifiable faces.
• Discarded faces. Their low quality makes them unsuitable for the FBTR module, as they would produce unreliable templates.

After data association, the FBTR component checks the quality of each detected face. If not discarded, a template is extracted with the face recognition model and stored either as enrollable or verifiable. Thus, in this work, the structure of a tracklet includes: (i) the set of detections associated to the same ID; and (ii) metadata in the form of image templates extracted by the face recognition model, labelled either as "verifiable" or "enrollable".

Then, for each tracklet T_k ∈ T with an assigned detection D_k in the current frame, we retrieve the tracklets T_i ∈ T', with T' = T \ {T_k}. For each retrieved tracklet T_i, the average of its enrollable face templates, E_{T_i}, is computed and taken as the tracklet reference template. The average of all verifiable face templates of tracklet T_k, V_{T_k}, is also computed. Now, letting S be the similarity function of the face recognition model, we define T_R as the rank-R retrieved tracklet candidate, i.e. the tracklet at position R after having sorted all candidates from highest to lowest similarity to T_k:

$$ T_R = \operatorname*{rank}_{R} \big\{\, S(E_{T_i}, V_{T_k}) : T_i \in \mathcal{T}' \,\big\}. \quad (1) $$

As part of the reconnection process, we fuse tracklet T_k with T_1 (the rank-1 retrieved tracklet), but only when the following conditions are satisfied:

1) The similarity between T_1 and T_k is above a fixed threshold λ_FBTR:

$$ S(E_{T_1}, V_{T_k}) > \lambda_{FBTR}, \quad (2) $$

where 0 ≤ λ_FBTR ≤ 1.

2) That highest similarity is above the average of the next C highest ones, by a margin 1/ε:

$$ S(E_{T_1}, V_{T_k}) > \frac{1}{\varepsilon} \cdot \frac{1}{C} \sum_{R=2}^{C+1} S(E_{T_R}, V_{T_k}), \quad (3) $$

where 0 < ε ≤ 1 and C ∈ N≥1.

The rank-based constraint in Equation 3 helps filter out wrong reconnections. This is particularly useful since, as the number and length of tracklets increase, tracking errors may result in mixed-up identity templates for a tracklet, which inevitably keeps reducing the inter-class distance among them. Thus, our second condition helps avoid the propagation of reconnection errors and allows us to be more permissive with the choice of λ_FBTR. Therefore, whenever T_1 verifies the two previous conditions, the FBTR module re-assigns detection D_k from tracklet T_k to the most similar candidate T_1. Tracklets T_k and T_1 are joined, and the pair (T_k, T_1) is appended to a list of track pairs, which is the input to the correction module. To quantify the improvement brought by the rank-based reconnection criterion, both the current version of this module (FBTR) and the older one without the rank-based constraint (S-FBTR) [6] will be considered in our experiments.
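The two acceptance conditions can be expressed compactly. Below is a sketch of the rank-based test, assuming templates are L2-normalized embeddings so that a dot product plays the role of the similarity function S; the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def rank_based_reconnect(v_k, enrolled, lambda_fbtr=0.5, eps=0.8, C=6):
    """Decide whether tracklet k should be fused with its rank-1 candidate.

    v_k      : mean of the verifiable templates of tracklet k (V_Tk).
    enrolled : dict {tracklet_id: mean of its enrollable templates (E_Ti)}.
    Templates are assumed L2-normalized, so the dot product acts as the
    similarity function S. Returns the id of the rank-1 tracklet, or None
    if either condition fails.
    """
    ids = list(enrolled)
    sims = np.array([np.dot(enrolled[i], v_k) for i in ids])
    order = np.argsort(sims)[::-1]          # highest similarity first
    s1 = sims[order[0]]
    # Condition 1: absolute similarity threshold (Eq. 2).
    if s1 <= lambda_fbtr:
        return None
    # Condition 2: rank-based margin over the next C candidates (Eq. 3).
    # With fewer than C+1 candidates, the mean is taken over those available.
    runners_up = sims[order[1:C + 1]]
    if len(runners_up) and s1 <= runners_up.mean() / eps:
        return None
    return ids[order[0]]
```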
The correction module receives the pairs of tracklets joined by the FBTR module. For each pair (T_k, T_1), it retrieves all the detections assigned in the past to T_k and switches their track ID to T_1. This strategy has a complexity of O(n), where n is the number of detections inside the re-identified tracklet, so it can be implemented without adding significant computational cost, and it is beneficial for refining the tracking history. This feature is particularly interesting for forensic video analysis. Figure 4 shows the effect of adding this module. The upper example (Figure 4a) illustrates a regular tracklet reconnection: T_4 is reconnected to T_3 as soon as a verifiable face is found, whereas T_5 cannot be reconnected to T_2 because no enrollable face was available. The example at the bottom (Figure 4b) shows the effect of incorporating the correction module, which now replaces T_4 with T_3 not only for the current detection, but also for the ones previous to instant t_i.

In this section, we present novel metrics especially suitable for evaluating the long-term tracking performance of our system. Two of these metrics revisit and extend those commonly used in multi-object tracking, namely mismatch errors (number of ID switches produced across tracks) and track fragmentations [48]. The remaining two, Completion Rate Sum (CRS) and Completion Rate Plot (CRP), are introduced in this work for the first time.

One of the main challenges in multi-object tracking is to avoid drifting, i.e. losing a target. Assuming that the facial detector has high accuracy, in our use case drifting only happens when two or more tracklets switch their target identity. The drifting effect becomes much more dramatic when incorporating reconnection capabilities: when switching its target, the tracklet integrity is lost, leaving the system vulnerable to future misassignments. Therefore, there is a need to revisit the mismatch error metric by taking into account two new important concepts:

• Soft-mismatch error (smme). Produced when the tracker switches the correct identity (ID) to a new one that has not been associated to any track until that time. This leads to ID fragmentation, but can potentially be recovered by the FBTR module.
• Hard-mismatch error (hmme). Produced when, instead of switching to a new identity, the tracker switches to a previously assigned one. This leads to a probably unrecoverable ID switch.

Another desired feature of our system is its ability to obtain long-term tracklets. As our track annotations only consider faces detected by the face detector, which were not corrected and therefore match the outputs of our architecture, traditional metrics like MOTA or MOTP [49] are not suitable for evaluating the tracking accuracy. Thus, it is necessary to introduce metrics able to quantify the length of tracklets generated by detection-based trackers. Overall, the goal is to build a robust long-term tracker that reduces fragmentation while keeping the number of ID switches low. To achieve this, we formulate three new scalar metrics and a graphical one:

$$ Frag = \frac{\sum smme}{\#dets}, $$

where the numerator is the sum of soft-mismatch errors produced in a video, and #dets is the total number of faces detected (according to ground truth annotations); and

$$ ID\text{-}Sw = \frac{\sum hmme}{\#dets}, $$

where hmme is the total number of hard-mismatch errors in the video.
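Under one straightforward reading of these definitions, both metrics can be computed in a single pass over the per-detection track assignments. The sketch below assumes a chronological list of (ground-truth ID, assigned track ID) pairs; all names are hypothetical.

```python
def long_term_metrics(assignments):
    """Compute (Frag, ID-Sw) from chronological (gt_id, track_id) pairs.

    A soft mismatch (smme) is counted when a ground-truth identity switches
    to a track id never used before; a hard mismatch (hmme) when it switches
    to a track id already assigned earlier in the video.
    """
    last_track = {}      # gt_id -> last track id assigned to it
    seen_tracks = set()  # all track ids observed so far
    smme = hmme = 0
    for gt_id, track_id in assignments:
        prev = last_track.get(gt_id)
        if prev is not None and track_id != prev:
            if track_id in seen_tracks:
                hmme += 1  # switch to a previously used id: ID switch
            else:
                smme += 1  # switch to a brand-new id: fragmentation
        last_track[gt_id] = track_id
        seen_tracks.add(track_id)
    n_dets = max(len(assignments), 1)
    return smme / n_dets, hmme / n_dets
```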
This section describes the dataset used to evaluate the proposed architecture, and the results of the different experiments carried out with it. All experiments were conducted on a desktop machine with an Intel(R) Core(TM) i7-9700K CPU at 3.60GHz and an NVIDIA GeForce RTX 2070 GPU. The system thresholds λ_IOU, λ_FBTR and λ_S-FBTR were adjusted using a private collection of training videos to 0.25, 0.5 and 0.7, respectively. The parameters of the FBTR module were set to ε = 0.8 and C = 6, based on the experimental results presented in Section 5.3. Quality thresholds for detection confidence, head angles and image sharpness were respectively set to 0.95, ±25° and 0.9 for enrollable faces, and to 0.8, ±60° and 0.75 for verifiable faces.

Since there are no public datasets fully corresponding to our use case (cf. Section 2.3), we have compiled and annotated a set of ten videos showing crowded indoor and outdoor video-surveillance scenes. The total duration of the dataset is 8' 54". Two of these videos come from the extra sequences (cameras P2E S5 and P2L S5) of the well-known public ChokePoint dataset [42]. To force re-appearances of subjects and validate the performance of the FBTR module, the three sequences recorded by the P2E S5 cameras were concatenated, leading to the Choke1 video. Similarly, the video Choke2 was generated from the P2L S5 camera sequences. The remaining eight videos were selected from YouTube. The main characteristics of all ten videos are summarized in Table 1.

Table 1: Description of the videos used to evaluate our tracking architecture. In videos marked with an asterisk (*), subjects leave and re-enter the scene twice. Density refers to the mean number of face detections per frame.

Tracks were semi-automatically annotated to obtain ground truth data. Firstly, face bounding boxes were retrieved using a state-of-the-art face detector [43]. Only faces with a detection confidence above 0.50 were considered. Then, detections were manually verified, and tracks and corresponding IDs were annotated (all videos and annotations are publicly available at https://github.com/hertasecurity/LTFT).

This first experiment aims at benchmarking different trackers in order to choose the most appropriate one for the tracking module. We analyze the performance of our data association module using the following visual trackers: CSRT [24], KCF [22], Median Flow [25], MOSSE [23] and SORT [7]. Additionally, we include in our benchmark the longer-term tracker TLD [31] and the state-of-the-art deep trackers GlobalTrack [8] and SiamRPN++ [10]. Table 2 and Figure 5 present the results obtained on all of our evaluation videos.

Table 2: Evaluation of the data association module on the whole dataset, using different trackers in the tracking module. (*) SORT was implemented with its default data association method.

The results highlight the lower performance of TLD. This is probably due to its capability to re-identify targets, which makes it vulnerable to ID switches. This effect is also observed with GlobalTrack. In this case, the deep learning models stored for each lost tracker quickly lead to a memory overflow, which prevents it from being evaluated under the same conditions. The best-performing tracker in terms of ID switches is MOSSE, but its high fragmentation leads to a poor overall completion rate. The SiamRPN++ tracker achieves the highest completion rate (CRS = 0.722) and the lowest fragmentation (Frag = 0.01502). However, KCF is very close to it (CRS = 0.710, Frag = 0.01650) and runs at a much higher FPS, which is critical in video-surveillance contexts. SORT provides the highest frame rate and a fairly low number of ID switches, in exchange for higher fragmentation and, consequently, a lower completion rate. According to these findings, simple visual trackers and more sophisticated DL-based trackers lead to similar performances in such challenging environments, where occlusions are extremely frequent. In the following experiments we will use the KCF tracker as the core of our tracking module.

In this section we analyze how the choice of the parameters ε and C affects the stability of the FBTR module.
All the experiments in this section are run on a private collection of six videos, with a total length of 10' 57". To mimic the state of the FBTR module at any given point of the tracking timeline, we manually annotate the videos with ground truth detections and identity tracks, yielding a perfect tracking database with non-mixed identities. Note that this database combines identities from all video sequences. Furthermore, in order to simulate noisy tracking in real-life conditions, we generate additional databases with a predefined number of mixed identities. This is achieved by randomly picking a pair of pure tracklets and moving a slice of 5 consecutive templates from each of them to a new single tracklet, which then contains the 5+5 mixed templates. By repeating this operation, we generate six databases in which an increasing percentage of identities are mixed (0, 5, 10, 15, 20 and 25%).

Next, in order to validate the discriminating capabilities of the rank-based FBTR module depending on the choice of ε, we proceed as follows. For each identity in the database, we randomly extract a slice of 5 verifiable templates, which simulates a target reconnection when queried against the selected database. We additionally generate another query, which mixes 5+5 verifiable templates from two different identities, hence simulating a tracklet that underwent an ID switch. For each of these two queries, and for values of C = {1 . . . 9}, we compute ε by isolating it in Equation 3. In the case of the reconnection query, we consider either a correct or a wrong reconnection, depending on whether the identity assigned by the highest similarity match is correct or not. This way, we obtain 3 distribution density curves (correct reconnections, wrong reconnections and ID switches) for each combination of C and dataset (9×6 = 54 experiments in total). Each process is repeated ten times to avoid edge results.

Figure 6 contains a subset of the resulting distributions, which show high discrimination between correct and wrong reconnections of pure identities when no ID switches are considered (row A). Although the reconnection of mixed identities poses a bigger challenge, a threshold ε = 0.8 seems to offer a good trade-off between correctly and wrongly filtered reconnections. This is desirable to ensure the stability of the FBTR module. When using only the best and second-best similarity matches (C = 1), the increase in mixed identities causes a dramatic distortion of the distributions, also decreasing the discriminating power of ε. Favorably, C acts as a distribution regularizer and helps reduce this effect.

Figure 6: Distributions of the rank-based threshold ε for correct/wrong reconnections (pure identities) and ID switches (mixed identities). C varies across columns. A), B) and C) use a FBTR database with 0%, 10% and 20% of mixed identities, respectively.

Additionally, Figure 7 plots the percentage of filtered match candidates for C ∈ {1, 3, 5, 7, 9} when setting ε = 0.8. As expected, we observe a significant improvement for high C values as the proportion of mixed identities increases. For example, with 5% of the identities mixed, any value of C filters out around 75% of wrong reconnections, but large values of C reduce correct reconnections by only 10%, as opposed to the undesirable reduction of over 20% with C = 1. A saturation plateau is reached around C = 6, which is the setting used for the remaining experiments.
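Isolating ε in Equation 3 yields the statistic whose distributions are analyzed above: the ratio between the average of the next C highest similarities and the highest one. A minimal sketch, with illustrative names:

```python
import numpy as np

def isolated_eps(similarities, C=6):
    """Isolate epsilon from Eq. 3 for one query.

    Given the similarities of a query against all database identities
    (assumed to contain at least C+1 entries), return the ratio between
    the mean of the next C best matches and the best match.
    """
    s = np.sort(np.asarray(similarities))[::-1]  # descending order
    return s[1:C + 1].mean() / s[0]

# Usage: the rank-based condition of Eq. 3 accepts the rank-1 match
# whenever this statistic falls below the chosen epsilon, e.g.
# accept = isolated_eps(sims, C=6) < 0.8
```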
In this section, we present an ablation study that quantifies the contribution of each module (TM, DA, FBTR, and CM) and demonstrates the suitability of the proposed approach for long-term tracking in crowded video-surveillance scenarios. The following ablation experiments are presented:

• DA. Tracking is performed following a simple data association strategy: the tracking module is deactivated, and data association is computed based on the IOU value of detections in consecutive frames.
• DA+TM. In this experiment, data association is carried out using KCF in the tracking module. No face identification or correction mechanisms are applied.

Table 3 and Figure 8 show the results of the ablation study on our whole dataset. It can be clearly observed that each added module increases the overall tracking completion rate. The impact of FBTR is particularly noteworthy: it increases the CRS by 9.8% (from 0.593 to 0.651) and 7.3% (from 0.710 to 0.762) when added to DA and DA+TM, respectively. The rank-based FBTR constraint demonstrates its efficacy by enhancing the CRS of its simplified version by 4.3% (from 0.624 to 0.651) and 3.3% (from 0.738 to 0.762) when added to DA and DA+TM, respectively. The correction module also improves long-term tracking (CRS increases of 3.4% with S-FBTR, from 0.738 to 0.763, and 2.8% with FBTR, from 0.762 to 0.783) without any extra computational cost. At the same time, it strongly reduces the fragmentation generated by the FBTR module, by 19.1% (from 0.1850 to 0.1496) and 26.6% (from 0.1896 to 0.1391) for the simplified and rank-based FBTR, respectively. As expected, the number of ID switches increases as we achieve longer-term tracking, but its value stays reasonably low (0.00497 and 0.00530 for S-FBTR and FBTR, respectively) and slightly decreases when using the CM module (0.00486 and 0.00512, respectively).

Table 4 details the ablation results per video. Results on Choke1 and Choke2 highlight the good performance of our architecture, especially of the FBTR module (CRS increases above 60%), in contexts where people leave and re-enter the scene. The impact of the CM is also dramatic, reaching CRS values up to 0.940 and reducing fragmentation by up to 74%. The remaining videos do not contain subject re-appearances. In the case of Sidewalk and Bengal, where long occlusions are frequent, FBTR improves short-term tracking by 19.0% (CRS = 0.796) and 15.3% (CRS = 0.844), respectively. In Street, where the camera angle is lower, the FBTR module becomes more vulnerable to ID switches (an increase of 24.9%), even though it still considerably increases the CRS (3.6%). The Terminal 1, 2, 3 and 4 videos are by far the most challenging in terms of illumination, motion and occlusions. In these videos, the impact of FBTR and CM is lower, but the DA+TM+FBTR+CM architecture is still the most successful one. In Shibuya, which has a much higher resolution (3840×2160), the face detector becomes an important bottleneck, propagating a dramatic speed decrease to all combinations (down to 3.3 FPS). At fixed resolutions (1080p), we can appreciate a greater decrease in computational performance for the FBTR module in videos with a higher density of faces, such as Sidewalk (from 15.5 to 5.4 FPS) or Bengal (from 19.9 to 8.5 FPS).
We define a distractor as a person (identity) that does not appear in the set of videos used for validation, but that is registered in the system to cause potential failures (misidentifications) in the face verification procedure. Deploying the proposed method in real surveillance scenarios would imply considerably longer videos and the continuous analysis of crowded video streams, involving many more identities. To simulate these conditions, distractor identities are enrolled in the FBTR module before the video processing starts. Then, we re-run the complete architecture without the CM extension. Table 5 shows the metrics relative to three distractor sets of increasing size: 200, 2K and 20K distractors. Although the computational performance drops with the number of distractors (by 3%, 23% and 71%, respectively), the resulting values for fragmentation, ID switches and CRS remain stable. These results anticipate the effectiveness of the presented method when applied to longer videos with a higher number of identities.

Table 5: Distractor analysis results. Columns: Distractors, Frags, ID-Switches, CRS.

We have presented an architecture for long-term, multi-face tracking in crowded video-surveillance scenarios. The proposed method benefits from advances in the fields of face detection and face recognition to achieve long-term tracking in contexts that are particularly unconstrained in terms of movement, re-appearances and occlusions. We have introduced specialized metrics conceived to evaluate long-term tracking capabilities, and publicly released a dataset with ten videos representing the targeted use case. The series of experiments carried out leads to interesting findings. Firstly, we demonstrate that our novel tracklet reconnection strategy, grounded on rank-based face verification, allows us to obtain up to 50% longer tracks. Secondly, we show how the proposed rank-based constraint helps to keep higher inter-class distances among tracklets, while minimizing the propagation of errors due to mixed-up identities. As a result, the completion rate increases by up to 4.3%, without significant penalties in terms of ID switches or fragmentation. Finally, the proposed cost-free correction module has been proven to increase tracking robustness, not only by further improving long-term capabilities, but also by reducing fragmentation.

Isabelle has participated in more than 35 national and European public-funded R&D projects, and has more than 50 international publications. Carles Fernández Tena received a PhD Cum Laude in Computer Vision and AI from Universitat Autònoma de Barcelona in 2010, also receiving the 2010 Extraordinary PhD Award. He is currently CTO at Herta, where he has been leading the Research department since 2014. He has published more than 60 scientific articles in peer-reviewed international journals and conferences, and participated in public-funded projects from the FP6, FP7 and H2020 frameworks. His research interests include Deep Learning and Computer Vision, particularly face recognition and video analytics in very unconstrained environments.
This work was partly funded by the Spanish project AI-MARS (CIEN CDTI Programme, grant number IDI-20181108).

References
[1] Tracking pedestrian heads in dense crowd.
[2] Deep-ReID: Deep filter pairing neural network for person re-identification.
[3] An integrated deep learning framework for occluded pedestrian tracking.
[4] MobiFace: A novel dataset for mobile face tracking in the wild.
[5] Deep learning in video multi-object tracking: A survey.
[6] Long-term face tracking for crowded video-surveillance scenarios.
[7] Simple online and realtime tracking.
[8] GlobalTrack: A simple and strong baseline for long-term tracking.
[9] Distractor-aware siamese networks for visual object tracking.
[10] SiamRPN++: Evolution of siamese visual tracking with very deep networks.
[11] Fast online object tracking and segmentation: A unifying approach.
[12] MOTS: Multi-object tracking and segmentation.
[13] MOTS: Multi-object tracking and segmentation.
[14] Towards real-time multi-object tracking.
[15] FairMOT: On the fairness of detection and re-identification in multiple object tracking.
[16] Tracking holistic object representations.
[17] Attention is all you need.
[18] TrackFormer: Multi-object tracking with transformers.
[19] Track to detect and segment: An online multi-object tracker.
[20] Learning spatio-temporal transformer for visual tracking.
[21] Tracking by instance detection: A meta-learning approach.
[22] Extending IOU based multi-object tracking by visual information.
[23] Visual object tracking using adaptive correlation filters.
[24] Discriminative correlation filter with channel and spatial reliability.
[25] Forward-backward error: Automatic detection of tracking failures.
[26] Online multi-face detection and tracking using detector confidence and structured SVMs.
[27] Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[28] High-speed tracking with kernelized correlation filters.
[29] Fully-convolutional siamese networks for object tracking.
[30] Object tracking based on online representative sample selection via non-negative least square.
[31] Face-TLD: Tracking-learning-detection applied to faces.
[32] A prior-less method for multi-face tracking in unconstrained videos.
[33] Tracking persons-of-interest via adaptive discriminative features.
[34] Adaptive appearance model tracking for still-to-video face recognition.
[35] An automatic system for unconstrained video-based face recognition.
[36] Future frame prediction for anomaly detection: a new baseline.
[37] Pedestrian attribute recognition at far distance.
[38] Data-driven crowd analysis in videos.
[39] Cross-scene crowd counting via deep convolutional neural networks.
[40] CVPR19 tracking and detection challenge: How crowded can it get? arXiv preprint.
[41] Tracking persons-of-interest via adaptive discriminative features.
[42] Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition.
[43] FaceBoxes: A CPU real-time face detector with high accuracy.
[44] The Hungarian method for the assignment problem.
[45] ArcFace: Additive angular margin loss for deep face recognition.
[46] Face alignment in full pose range: A 3D total solution.
[47] Face quality assessment for face verification in video.
[48] MOT16: A benchmark for multi-object tracking.
[49] Evaluating multiple object tracking performance: the CLEAR MOT metrics.