A Probabilistic Integrated Object Recognition and Tracking Framework

Francesc Serratosa a, René Alquézar b & Nicolás Amézquita a

a Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007 Tarragona, Spain
b Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Llorens Artigas 4-6, 08028 Barcelona, Spain

Abstract. This paper describes a probabilistic integrated object recognition and tracking framework called PIORT, together with two specific methods derived from it, which are evaluated experimentally in several test video sequences. The first step in the proposed framework is a static recognition module that provides class probabilities for each pixel of the image from a set of local features. These probabilities are updated dynamically and supplied to a tracking decision module capable of handling full and partial occlusions. The two specific methods presented use RGB colour features and differ in the classifier implemented: one is a Bayesian method based on maximum likelihood and the other is based on a neural network. The experimental results obtained have shown that, on the one hand, the neural net based approach performs similarly to, and sometimes better than, the Bayesian approach when they are integrated within the tracking framework. On the other hand, our PIORT methods have achieved better results than other published tracking methods in video sequences taken with a moving camera and including full and partial occlusions of the tracked object.

Keywords: Object tracking; object recognition; occlusion; performance evaluation; probabilistic methods; video sequences; dynamic environments

1. Introduction

One of the most general and challenging problems a mobile robot has to confront is to identify, locate and track objects that are common in its environment. To this end, object models have to be defined or learned in conjunction with some associated recognition and tracking procedures. There are several issues that have to be considered when dealing with object location and tracking, which deserve some prior discussion.

The first important issue is to determine the type of object model to learn, which usually depends on the application environment. For instance, in [1], the target was an aerial vehicle, and in [2,3,4,5] the targets were people. In [6], they used specific parameters of the object to be tracked. In [7], they track hands using textures. In [8], they developed and implemented a real system for a simultaneous localization and mapping (SLAM) algorithm for mobile robots based on an extended Kalman filter. It was applied to indoor environments and used stereo vision based on two web-cams. That system differs from ours in that we aim to track objects captured by the mobile-robot cameras, rather than localize the position of our own robot. While tracking an object, for instance people walking in the street, the system could try to recognize the person through a face recognition system. There is a lot of literature related to this field [9]. Moreover, other systems not only identify subjects but also detect the mood of these subjects or specific pathologies [10]. Face identification or recognition is beyond the scope of this paper. Finally, another field related to object tracking is automatic hand-gesture recognition [11].
In this kind of system, hands have to be tracked and their trajectory (position, speed, acceleration) has to be analyzed to infer the meaning of the movement.

In our case, we want a mobile robot equipped with a camera to locate and track general objects (people, other robots, balls, wastepaper bins, ...) in both indoor and outdoor environments. A useful object model should be relatively simple and easy to acquire from the result of image processing steps. For instance, the result of a colour image segmentation process, consisting of a set of regions or spots characterized by simple features related to colour, may be a good starting point to learn the model [7, 12]. Although structured models like attributed graphs or skeletons can be synthesized for each object from several segmented images [13, 14], we have decided to investigate a much simpler approach in which the object is just represented as an unstructured set of pixels. Other methods detect some characteristic points of the object to be tracked [15]; in a learning phase, the most repeatable keypoints of the specific object are learned. Another interesting work is [16], in which the algorithm searches for different region tracks. These methods have been shown to perform well when the variability of the object features is low. Nevertheless, with deformable objects it is difficult to extract some representative points.

One of the main drawbacks of structural methods is that the segmented images can be quite different from one frame to the next, and therefore it is difficult to match the structure in the current frame with that of the previous ones. In [17], the model was specially designed to segment and track objects from video sequences that suffer from abrupt changes. The starting point of our approach is to accept these differences between segmented images and use a more rudimentary model in which the basic element is not the spot or region of the segmented image but its pixels. An example of a structural method was reported in [14], where the object model was based on the skeleton of the object obtained in the segmented images. Since the skeletons resulting from two almost equal images can be very different, the applicability of such an approach is limited. The tracking step was performed in [14] by an extension of the Kalman filter in which the skeleton and other geometrical features were considered. Other options have been [18, 19], where the model explicitly incorporated the relation between position and time. Finally, other methods are based on keeping the information of the silhouette of the object to be tracked. In [5], the method is based on learning a dynamic and statistical model of the silhouette of the object. In our case, we cannot use this kind of system, since we assume that the deformation of the object to be tracked is not predictable.

A second significant issue is to provide the tracking procedure with the capacity of determining occlusions and re-emergencies of tracked objects, i.e. occlusion handling. Over recent years, much research has been devoted to solving the problem of object tracking under occlusions [4], because, in real-world tracking, a target being partly or entirely covered by other objects for an uncertain period of time is common. Occlusions pose two main challenges to object tracking systems. The first challenge is how to determine the beginning and the end of an occlusion. The second challenge is how to predict the location of the target during and at the end of the occlusion.
Determining occlusion status is very hard for trackers whose only knowledge about the target is its initial appearance. When some parts of an occluder are similar to those of the target, the occluder and the target may be confused. Various approaches that analyze occlusion situations have been proposed. The most common one is based on background subtraction [4, 19, 20]. Although this method is reliable, it only works with a fixed camera and a known background, which is not our case in mobile robotics. In [4], they used several cameras, and tracking and occlusion of people is solved by a multi-view approach. In [20], they achieve real-time tracking with small images. Evidence is gathered from all of the cameras into a synergistic framework. Other approaches are based on examining the measurement error for each pixel [22, 23]. Pixels whose measurement error exceeds a certain value are considered to be occluded. These methods are not very appropriate in outdoor scenarios, where the variability of the pixel values between adjacent frames may be high. A mixture of distributions is used in [24] to model the observed value of each pixel, where the occluded pixels are characterised by having an abrupt difference with respect to a uniform distribution. Contextual information is exploited in [25, 27]. These methods have better performance in terms of analysing occlusion situations, but tracking errors are observed to occur frequently and propagate. In addition, in the case of using these approaches in a mobile robot application, the robot surroundings need to be known a priori.

Determining the re-emergence of the target and recapturing its position after it has been completely occluded for some time is the other main challenge. Setting a similarity threshold is one method, yet the optimal threshold value is difficult to determine. This problem is circumvented in [22], where the image region that matches best with the template over a prefixed duration is assumed to be the reappearing target. In [23], an observation model and a velocity motion model were defined. The observation model was based on an adaptive appearance model, and the velocity motion model was derived using a first-order linear predictor. Both approaches are defined in the framework of particle filters, with provisions for handling occlusion.

In scenarios where the motion of the target is neither smooth nor predictable, most of the aforementioned methods would fail. Recently, new object tracking methods that are robust to occlusion have been reported with very promising results [27, 28]. The method reported in [27] relies on background subtraction (it works only for static cameras) and a k-NN classifier to segment foreground regions into multiple objects using on-line samples of the object's local appearance features taken before the occlusion. The method described in [28] relies on adaptive template matching, but it only handles partial occlusions and the matching process seems to be computationally costly.

A third relevant issue, which is generally less often mentioned, is to integrate the recognition and tracking steps in a common framework that helps to exploit some feedback between them. To the best of our knowledge, there are few existing works that combine recognition and tracking in an integrated framework [29, 30]. Object recognition and tracking are usually performed sequentially and without any feedback from the tracking to the recognition step [14].
These tasks are often treated separately and/or sequentially on intermediate representations obtained by the segmentation and grouping algorithms [31-33]. Sometimes, they are applied in the reverse order, with a first tracking module supplying inputs to the recognition module, as, for instance, in gesture recognition [34]. An integrated framework for tracking and recognising faces was presented in [30]. Conventional video-based face recognition systems are usually embodied with two independent components: the recognition and the tracking module. In contrast, an architecture was defined in [30] that tightly couples these two components within a single framework. The complex and nonlinear appearance manifold of each registered person was partitioned into a collection of sub-manifolds, each of which models the face appearances of the person in nearby poses. The sub-manifolds were approximated by low-dimensional linear subspaces computed by PCA. Finally, Artificial Intelligence techniques were applied to tracking objects in [35].

This paper describes thoroughly and in detail the current state of a probabilistic integrated object recognition and tracking (PIORT) methodology that we have developed in recent years, as well as two particular methods derived from it. It also presents a collection of experimental results in test video sequences obtained by PIORT methods and alternative tracking methods. Previous stages in the development of PIORT, together with preliminary results, have been partially reported elsewhere [36-39]. In the experimental evaluation carried out, PIORT methods have been compared to six state-of-the-art tracking methods for which we were able to obtain the program codes and apply them to the test video sequences:
- Template Match by Correlation (TMC) [40];
- Basic Meanshift (BM) [41];
- Histogram Ratio Shift (HRS) [42];
- Variance Ratio Feature Shift (VRFS) [42];
- Peak Difference Feature Shift (PDFS) [42]; and
- Graph-Cut Based Tracker (GCBT) [43, 44].

Their codes were downloaded from the VIVID tracking evaluation web site www.vividevaluation.ri.cmu.edu, which unfortunately no longer seems to be accessible. We briefly summarise these methods next.

In the TMC method [40], the features of the target object are represented by histograms. These histograms are regularised by an isotropic kernel, which produces spatially smooth functions suitable for gradient-based optimisation. The metric used to compare these functions is based on the Bhattacharyya distance and the optimisation is performed by the mean-shift procedure. In [41], a general non-parametric framework is presented for the analysis of a multimodal feature space and the separation of clusters. The mean-shift procedure (localisation of stationary points in the distributions) is used to obtain the clusters. Within this framework, a segmentation application is described. In [42], three different tracking methods are presented. They are based on the hypothesis that the best feature values to track an object are the ones that best discriminate between the object and the current background. Therefore, with several sample densities of the object and also of the background, the system computes the separability of both classes and obtains new features. The feature evaluation mechanism is embedded in a mean-shift tracking system that adaptively selects the top-ranked discriminative features for tracking.
In the first method, Histogram Ratio Shift (HRS), the weights applied to each feature are dynamically updated depending on the histograms of the target and of the background. In the second one, Variance Ratio Feature Shift (VRFS), the ratio between the variance of the target and that of the surrounding background is computed and considered for selecting the features. Finally, the Peak Difference Feature Shift (PDFS) smooths the histogram of the features with a Gaussian kernel; moreover, it considers possible distracter objects near the target and dynamically changes the feature selection. Lastly, in [43, 44], a method called Graph-Cut Based Tracker (GCBT) is presented for the direct detection and segmentation of foreground moving objects. The method first obtains several groups of pixels with similar motion and photometric features. The mean-shift procedure is used to validate the motion and bandwidth. Then, the system segments the objects within a MAP framework.

Our PIORT methodology is based on the iterative and adaptive processing of consecutive frames by a system that integrates recognition and tracking in a probabilistic framework. The system uses object recognition results provided by a classifier, e.g. a Bayesian classifier or a neural net, which are computed from colour features of image regions for each frame. The location of tracked objects is represented through probability images that are updated dynamically using both recognition and tracking results. The tracking procedure is capable of handling quite long occlusions. In particular, object occlusion is detected automatically and the prediction of the object's apparent motion and size takes into account the cases of occlusion entering, full occlusion and occlusion exiting. In contrast with [27], our tracking method does not rely on background subtraction or a fixed camera, and, unlike [28], it can cope with complete occlusion and does not involve any template to match and update.

In our approach, the following assumptions are made:
i) target objects may be distinguished from other objects and the background based on colour features of their appearance, though these features may experience slight variations during the image sequence; in fact, this is a requirement of the classifiers we currently use for static recognition and could be relaxed or changed if the classifier in this module were replaced or used a different set of object appearance features;
ii) the target object's shape, apparent motion and apparent size can all vary smoothly between consecutive frames (non-rigid deformable objects are thus allowed); we think this is not a strong assumption if a typical video acquisition rate is used, as large changes in shape, motion and size are allowed over the whole sequence;
iii) image sequences may be obtained either from a fixed or a slowly moving camera (this is also quite realistic for most applications in practice);
iv) target objects may be occluded during some frames, but their motion does not change abruptly during occlusion; this last assumption is certainly stronger and may fail in some cases, but it stems from the need to predict an approximate position of the object during occlusion based on its previous trajectory.

The rest of the paper is organized as follows. A formal description of our probabilistic framework for object recognition and tracking is given in Section 2. As shown in Fig. 1, the system involves three modules: static recognition, dynamic recognition and tracking decision modules.
The methods used for the static recognition module are specified in Section 3. The dynamic recognition module is explained in Section 4. The tracking decision module is described in detail in Section 5. The experimental results are presented in Section 6. Finally, conclusions and future work are discussed in Section 7.

2. A probabilistic framework for integrated object recognition and tracking

Let us assume that we have a sequence of 2D color images I^t(x,y) for t = 1,...,L, that there are at most N objects of interest of different types in the sequence (associated with classes c = 1,...,N, where N ≥ 1), and that a special class c = N+1 is reserved for the background. Furthermore, let us assume that the initial position of each object is known and represented by N binary images, p_c^0(x,y), for c = 1,...,N, where p_c^0(x,y) = 1 means that the pixel (x,y) belongs to a region covered by an object of class c in the first image. If fewer than N objects are actually present, some of these images will be all-zero and they will not be processed further, so, without loss of generality, we consider in the sequel that N is the number of present objects to track. Hence, we would like to obtain N sequences of binary images T_c^t(x,y), for c = 1,...,N, that mark the pixels belonging to each object in each image; these images are the desired output of the whole process and can also be regarded as the output of a tracking process for each object. We can initialize these tracking images (for t = 0) from the given initial positions of each object, that is,

T_c^0(x,y) = p_c^0(x,y)    (1)

In our approach, we divide the system into three modules. The first one performs object recognition in the current frame (static recognition) and stores the results in the form of probability images (one probability image per class), which represent for each pixel the probabilities of belonging to each one of the objects of interest or to the background, according only to the information in the current frame. This can be achieved by using a classifier that has been trained previously to classify image regions of the same objects using a different but similar sequence of images, where the objects have been segmented and labeled. Hence, we assume that the classifier is now able to produce a sequence of class probability images Q_c^t(x,y) for t = 1,...,L and c = 1,...,N+1, where the value Q_c^t(x,y) represents the estimated probability that the pixel (x,y) of the image I^t(x,y) belongs to the class c, which has been computed taking into account a local feature vector (see Section 3). In general, the probability images Q_c^t(x,y) can be regarded as the output of a static recognition module defined by some function r on the current image:

Q_c^t(x,y) = r(I^t(x,y))    (2)

In the second module (dynamic recognition), the results of the first module are used to update a second set of probability images, p_c, with a meaning similar to that of Q_c but now taking into account as well both the recognition and tracking results in the previous frames through a dynamic iterative rule. More precisely, we need to store and update N+1 probability images p_c^t(x,y), for c = 1,...,N+1, where the value p_c^t(x,y) represents the probability that the pixel (x,y) at time t belongs to an object of class c (for c = 1,...,N) or to the background (for c = N+1).
In general, these dynamic probabilities should be computed as a certain function f of the same probabilities in the previous step, the class probabilities given by the classifier for the current step (which have been obtained from the actual measurements) and the tracking images resulting from the previous step:

p_c^t(x,y) = f\left( p^{t-1}(x,y),\; Q^t(x,y),\; T^{t-1}(x,y) \right)    (3)

The update function f used in our system is described in Section 4; it incorporates some additional arguments coming from the tracking module to adapt its parameters.

Finally, in the third module (tracking decision), tracking binary images are determined for each object from the current dynamic recognition probabilities, the previous tracking image of the same object and some other data, which contribute to provide a prediction of the object's apparent motion in terms of translation and scale changes as well as to handle the problem of object occlusion. Formally, the tracking images T_c^t(x,y) for the objects (1 ≤ c ≤ N) can be calculated dynamically using the pixel probabilities p^t(x,y) according to some decision function d:

T_c^t(x,y) = d\left( p^t(x,y),\; T_c^{t-1}(x,y) \right)    (4)

in which some additional arguments and results may be required (see (12) and Section 5 for a detailed description of the tracking decision module).

3. Static recognition module

In our PIORT (Probabilistic Integrated Object Recognition and Tracking) framework, the static recognition module is based on the use of a classifier that is trained from examples and provides posterior class probabilities for each pixel from a set of local features. The local features to be used may be chosen in many different ways. A possible approach consists of first segmenting the given input image I^t(x,y) into homogeneous regions (or spots) and computing some features for each region that are afterwards shared by all its constituent pixels. Hence, the class probabilities Q_c^t(x,y) are actually computed by the classifier once for each spot in the segmented image and then replicated for all the pixels in the spot. For instance, RGB color averages can be extracted for each spot after color segmentation and used as the feature vector v(x,y) for a classifier. In the next two subsections we present two specific classifiers that have been implemented and tested within the PIORT framework using this type of information.

Figure 1: Block diagram of the dynamic object recognition and tracking process.

3.1 A simple Bayesian method based on maximum likelihood and background uniform conditional probability

Let c be an identifier of a class (between 1 and N+1), let B denote the special class c = N+1 reserved for the background, let k be an identifier of an object (non-background) class between 1 and N, and let v represent the value of a feature vector. Bayes' theorem establishes that the posterior class probabilities can be computed as

P(c \mid v) = \frac{P(v \mid c)\, P(c)}{P(v)} = \frac{P(v \mid c)\, P(c)}{P(v \mid B)\, P(B) + \sum_{k=1}^{N} P(v \mid k)\, P(k)}    (5)

Our simple Bayesian method for static recognition is based on imposing the two following assumptions:
a) equal priors: all classes, including B, will have the same prior probability, i.e. P(B) = 1/(N+1) and P(k) = 1/(N+1) for all k between 1 and N.
b) a uniform conditional probability for the background class, i.e. P(v | B) = 1/M, where M is the number of values (bins) in which the feature vector v is discretized.

Note that the former assumption is that of a maximum likelihood classifier, whereas the latter assumes no knowledge about the background.
After imposing these conditions, equation (5) turns into

P(c \mid v) = \frac{P(v \mid c)}{\frac{1}{M} + \sum_{k=1}^{N} P(v \mid k)}    (6)

and this gives the posterior class probabilities we assign to the static probability images, i.e. Q_c^t(x,y) = P(c | v(x,y)) for each pixel (x,y) and time t. It only remains to set a suitable constant M and to estimate the class conditional probabilities P(v | k) for all k between 1 and N (object classes). To this end, class histograms H_k are set up using the labeled training data and updated on-line afterwards using the tracking results in the test data. For constructing the histograms, let v(x,y) be the feature vector consisting of the original RGB values of a pixel (x,y) labeled as belonging to class k. We uniformly discretize each of the R, G and B channels into 16 levels, so that M = 16 × 16 × 16 = 4096. Let b be the bin onto which v(x,y) is mapped by this discretization. To reduce discretization effects, a smoothing technique is applied when accumulating counts in the histogram as follows:

H_k(b) := H_k(b) + \left( 10 - \#\text{neighbors}(b) \right)
H_k(b') := H_k(b') + 1 \quad \text{if } b' \text{ is a neighbor of } b    (7)

where the number of neighbors of b (using non-diagonal connectivity) varies from 3 to 6, depending on the position of b in the RGB space. Hence, the total count C_k of the histogram is increased by ten (instead of one) each time a pixel is counted, and the conditional probability is estimated as P(v | k) = H_k(b) / C_k, where b is the bin corresponding to v. The above smoothing technique is also applied when updating the histogram from the tracking results; in that case the RGB value v(x,y) in the input image I^t(x,y) of a pixel (x,y) is used to update the histogram H_k (and the associated count C_k) if and only if T_k^t(x,y) = 1.

3.2 A neural net based method

In this method, a neural net classifier (a multilayer perceptron) is trained off-line from the labeled training data. The RGB color averages extracted for each spot after color segmentation are used as the feature vector v(x,y) and supplied as input to the network in both the training and test phases. In contrast to the Bayesian method described previously, training data for the background class are also provided, by selecting some representative background regions in the training image sequence, because the network needs to gather examples for all classes including the background. The network is not retrained on-line using the tracking results in the test phase (this is another difference with respect to the Bayesian method described). It is well known that, using a 1-of-c target coding scheme for the classes, the outputs of a network trained by minimizing a sum-of-squares error function approximate the posterior probabilities of class membership (here, Q_c^t(x,y)), conditioned on the input feature vector [45]. In any case, to guarantee that the posterior probabilities properly sum to unity, the network outputs (which are always positive values between 0 and 1) are divided by their sum before assigning the posterior probabilities.
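As an illustration of the above, the following sketch (ours, not the authors' implementation; the array layout and helper names are merely illustrative) accumulates a class histogram with the neighbor smoothing of eq. (7) and evaluates the posterior of eq. (6) for the N object classes and the background class:

```python
# Illustrative sketch of the histogram-based static recognition of Section 3.1
# (not the authors' code). RGB space is discretized into 16x16x16 bins (M = 4096).
import numpy as np

N_LEVELS = 16
M = N_LEVELS ** 3

def rgb_to_bin(rgb):
    """Map an RGB triple (0-255 per channel) to its 3D histogram bin."""
    return tuple(int(v) // (256 // N_LEVELS) for v in rgb)

def add_sample(hist, rgb):
    """Accumulate one labeled pixel with the smoothing of eq. (7): the bin itself
    receives 10 - #neighbors counts and each non-diagonal neighbor receives 1,
    so the total count of the histogram always grows by 10."""
    r, g, b = rgb_to_bin(rgb)
    neighbors = [(r + d, g, b) for d in (-1, 1)] + \
                [(r, g + d, b) for d in (-1, 1)] + \
                [(r, g, b + d) for d in (-1, 1)]
    neighbors = [n for n in neighbors if all(0 <= v < N_LEVELS for v in n)]
    hist[r, g, b] += 10 - len(neighbors)
    for n in neighbors:
        hist[n] += 1

def posterior(hists, counts, rgb):
    """Posterior class probabilities of eq. (6): the N object likelihoods are
    estimated as H_k(b)/C_k and the background likelihood is uniform (1/M)."""
    b = rgb_to_bin(rgb)
    likelihoods = np.array([h[b] / max(cnt, 1) for h, cnt in zip(hists, counts)])
    denom = 1.0 / M + likelihoods.sum()
    return np.append(likelihoods / denom, (1.0 / M) / denom)
```

Here hists would be a list of N arrays of shape (16, 16, 16) and counts the corresponding totals C_k; the returned vector contains the N object posteriors followed by the background posterior.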
4. Dynamic recognition module

Even though the static recognition module can be applied independently to each image in the sequence, this does not exploit the dynamic nature of the problem and the continuity and smoothness properties that are expected in the apparent motion of the objects through the sequence. Hence, a dynamic update of the pixel class probabilities p_c^t(x,y) is desired that takes these properties into account. To this end, not only the previous probabilities p_c^{t-1}(x,y) and the results of the current static recognition Q_c^t(x,y) have to be combined, but also the binary results of the tracking decision in the previous step T_c^{t-1}(x,y) have to be considered, since this permits filtering some possible misclassifications made by the static classifier. Typically, some background spots are erroneously classified as part of an object, and this can be detected if these spots are situated far from the last known position of the object. Therefore, the update function f for the dynamic class probabilities can be defined as follows (for some adaptive parameters α_c^t, 0 < α_c^t < 1):

p_c^t(x,y) = \frac{ T_c^{t-1}(x,y) \left[ \alpha_c^t \, p_c^{t-1}(x,y) + (1 - \alpha_c^t) \, Q_c^t(x,y) \right] }{ \sum_{k=1}^{N+1} T_k^{t-1}(x,y) \left[ \alpha_k^t \, p_k^{t-1}(x,y) + (1 - \alpha_k^t) \, Q_k^t(x,y) \right] }    (8)

A tracking image for the background, which is required in the previous equation, can be defined simply as

T_{N+1}^t(x,y) = \begin{cases} 1 & \text{if } T_c^t(x,y) = 0 \;\; \forall c: 1 \le c \le N \\ 0 & \text{otherwise} \end{cases}    (9)

and computed after the tracking images for the objects.

The parameter α_c^t that weights the influence of the previous probabilities must be adapted depending on the apparent motion of the tracked object of class c. If this motion is very slow, α_c^t should reach a maximum α_max closer to 1, whereas if the motion is very fast, α_c^t should reach a minimum α_min closer to 0. In order to set a proper value for α_c^t, the areas (A_c^{t-1} and A_c^{t-2}) and mass centers (C_c^{t-1} and C_c^{t-2}) of the object in the two previous tracking images are used in the following way. Let r_c^{t-1} = \sqrt{A_c^{t-1}/\pi} and r_c^{t-2} = \sqrt{A_c^{t-2}/\pi} be the estimates of the object radius in the two previous frames obtained by imposing a circular area assumption. Let d_c = \| C_c^{t-1} - C_c^{t-2} \| be the estimated displacement of the object in the 2D image and let d_c^{max} = r_c^{t-1} + r_c^{t-2} be the maximum displacement yielding some overlapping between the two former circles. If d_c ≥ d_c^{max} we would like to set α_c^t = α_min, whereas if d_c = 0 then the value of α_c^t should be set according to the change of the object's apparent size: let s_c = | r_c^{t-1} - r_c^{t-2} | / \max(r_c^{t-1}, r_c^{t-2}) be a scale change ratio. If s_c = 0 (unchanged object size) then we would like to set α_c^t = α_max, whereas in the extreme case s_c = 1 we would set α_c^t = α_min again. Combining linearly both criteria, displacement and scale change, we define the prior value

\hat{\alpha}_c^t = \alpha_{max} - (\alpha_{max} - \alpha_{min}) \frac{d_c}{d_c^{max}} - (\alpha_{max} - \alpha_{min}) \, s_c    (10)

which satisfies \hat{\alpha}_c^t ≤ α_max. Note that the value \hat{\alpha}_c^t = α_max (maximum weight for the previous probabilities) is obtained when both d_c = 0 and s_c = 0, which means that both the centers and the areas of the object are the same in the last two observations (no displacement and no scale change have occurred). Finally, the parameter α_c^t is set as follows:

\alpha_c^t = \begin{cases} \alpha_{min} & \text{if } d_c \ge d_c^{max} \\ \alpha_{min} & \text{if } d_c < d_c^{max} \wedge \hat{\alpha}_c^t \le \alpha_{min} \\ \hat{\alpha}_c^t & \text{if } d_c < d_c^{max} \wedge \hat{\alpha}_c^t > \alpha_{min} \end{cases}    (11)

The constants α_min and α_max were set to 0.1 and 0.6, respectively, in our experiments (see Section 6).
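A minimal sketch of this dynamic update (our own illustration, not the paper's code) is given below; p_prev, Q and T_prev are arrays of shape (N+1, H, W) holding p^{t-1}, Q^t and T^{t-1}, and alpha holds one weight per class:

```python
# Sketch of the dynamic recognition update of eq. (8) and of the adaptive
# weight of eqs. (10)-(11). Not the authors' code; names are illustrative.
import numpy as np

ALPHA_MIN, ALPHA_MAX = 0.1, 0.6

def adaptive_alpha(A1, A2, C1, C2):
    """alpha_c^t from the two previous areas (A1 = A^{t-1}, A2 = A^{t-2}) and
    mass centers (C1, C2) of one tracked object, following eqs. (10)-(11)."""
    r1, r2 = np.sqrt(A1 / np.pi), np.sqrt(A2 / np.pi)    # radii under a circular-area assumption
    d = np.linalg.norm(np.asarray(C1) - np.asarray(C2))  # displacement d_c
    d_max = r1 + r2                                      # maximum overlapping displacement
    s = abs(r1 - r2) / max(r1, r2)                       # scale change ratio s_c
    if d >= d_max:
        return ALPHA_MIN
    a_hat = ALPHA_MAX - (ALPHA_MAX - ALPHA_MIN) * (d / d_max) \
                      - (ALPHA_MAX - ALPHA_MIN) * s      # eq. (10)
    return max(ALPHA_MIN, a_hat)                         # eq. (11)

def dynamic_update(p_prev, Q, T_prev, alpha):
    """Per-pixel update of eq. (8), renormalised over the N+1 classes."""
    a = np.asarray(alpha)[:, None, None]
    num = T_prev * (a * p_prev + (1.0 - a) * Q)
    den = np.maximum(num.sum(axis=0, keepdims=True), 1e-12)
    return num / den
```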
5. Tracking decision module

As depicted in Figure 1, the tracking images T_c^t(x,y) for the objects (1 ≤ c ≤ N) can be calculated dynamically using the pixel probabilities p^t(x,y) according to some decision function d. However, this function involves some additional arguments and results, as explained next.

To give an initial estimate of the foreseen translation and rescaling of the object in the current step, the measurements of both the object mass center and area in the tracking images of the two previous steps are required. Hence, the areas A_c^{t-1} and A_c^{t-2} and the mass centers C_c^{t-1} and C_c^{t-2}, already used in the dynamic recognition module as we have seen, must also be supplied here. The application of the estimated transformation to the previous tracking image T_c^{t-1}(x,y) will serve to reduce the image area to explore using the class probabilities while filtering (blacking) out the rest. This strategy alone permits tracking visible objects reasonably well [26, 27], but it fails completely if the object becomes occluded for some frames [28]. In order to cope with occlusion, more information is needed in the decision function d. The key point is to distinguish between the a posteriori tracking image T_c^t(x,y) and an a priori prediction \hat{T}_c^t(x,y), which could maintain some relevant information of the object before the occlusion, such as area and movement. The object mass center C_c^t and area A_c^t needed for tracking should be measured either from T_c^t(x,y) or \hat{T}_c^t(x,y), depending on whether the object is visible or occluded. Hence, an occlusion flag O_c^t has to be determined as an additional result. Moreover, the two previous flags O_c^{t-1} and O_c^{t-2} help to know whether the object is entering or exiting an occlusion. In addition, \vec{m}_c^t is a movement weighted average vector that represents the past trajectory direction of the tracked object, which is useful to solve some ambiguous cases that occur when the object crosses or exits an occlusion by another object with a similar appearance (same-class occlusion). Finally, it should be taken into account that the uncertainty in the prediction \hat{T}_c^t(x,y) grows as the number of consecutive frames the object is occluded increases. In the original method described in [27], two constant parameters ε and δ were used to define an uncertainty region around each pixel transformation. Since we want to adjust the level of uncertainty based on the duration of the occlusion, these parameters have to be adaptive for each object, i.e. ε_c^t and δ_c^t. Summarizing, the decision function d involves the following arguments and results:

\left( T_c^t(x,y),\; \hat{T}_c^t(x,y),\; \vec{m}_c^t,\; O_c^t,\; C_c^t,\; A_c^t,\; \varepsilon_c^t,\; \delta_c^t \right) = d\left( p^t(x,y),\; T_c^{t-1}(x,y),\; \hat{T}_c^{t-1}(x,y),\; \vec{m}_c^{t-1},\; O_c^{t-1},\; O_c^{t-2},\; C_c^{t-1},\; C_c^{t-2},\; A_c^{t-1},\; A_c^{t-2},\; \varepsilon_c^{t-1},\; \delta_c^{t-1} \right)    (12)

This function is described in detail in the next subsections, which cover the different independent sub-modules of the tracking decision module. Figure 2 graphically illustrates some of the calculations that are explained in what follows.

5.1 A priori prediction of the tracked objects

The first step is to give a priori estimates of the mass center and area of the object at time t. The mass center is predicted as follows:

\hat{C}_c^t = \begin{cases} C_c^{t-1} & \text{if } \neg O_c^{t-1} \wedge O_c^{t-2} \\ 2\, C_c^{t-1} - C_c^{t-2} & \text{otherwise} \end{cases}    (13)

When the object is exiting an occlusion, C_c^{t-2} is not reliable enough to be used together with C_c^{t-1} to predict the next movement; therefore, a conservative estimate is given, just the previous measured value. In the rest of the cases (the object is visible, is occluded or is entering an occlusion), a constant rate prediction is used. Note, however, that when the object is occluded, the mass center is not measured on the a posteriori tracking image, but on the a priori one, as we will see later.
It is interesting to notice that the above constant rate prediction can be proved to be equivalent to the one given by a linear Kalman filter for a particular setting of the filter parameters and equations. Let w_c^{t+1} = A w_c^t + B u_c^t + \omega_c^t and d_c^t = H w_c^t + \nu_c^t be respectively the state and measurement equations of a linear Kalman filter (KF) for predicting the mass center of object c. If we set A = I, B = I, H = I, d_c^t = C_c^t as the measurement, u_c^t = C_c^t - C_c^{t-1} as the input and R^t = 0 as the covariance matrix of the measurement noise \nu_c^t (which is assumed to be zero), then the a priori and a posteriori estimates of the state w_c^t given by the KF are 2C_c^{t-1} - C_c^{t-2} and C_c^t, respectively.

Figure 2: Geometrical illustration of the tracking process. Estimates of the object's area and mass center for step t are computed from previous values in t-1 and t-2. For each pixel in step t, a rectangular region in step t-1 is determined, which allows the assignment to the pixel of one of three labels: "certainly belonging to the object" (yellow diagonal-bar-shaded region), "uncertain" (blue brick-shaded region) and "certainly not belonging to the object" (the rest of the image).

The a priori estimate of the object area is calculated as follows:

\hat{A}_c^t = \begin{cases} (A_c^{t-1})^2 / A_c^{t-2} & \text{if } \neg O_c^{t-1} \wedge \neg O_c^{t-2} \\ A_c^{t-1} & \text{otherwise} \end{cases}    (14)

If the object has been visible in the two previous frames, a constant rate of a scale factor is used to predict the area. It can be proved that this prediction is equivalent to the one given by a (non-linear) extended Kalman filter for a particular setting of the filter parameters and equations. Let w_c^{t+1} = f(w_c^t, u_c^t, \omega_c^t) and d_c^t = h(w_c^t, \nu_c^t) be the state and measurement equations, respectively, of an extended Kalman filter (EKF) for predicting the area of object c. If we set f(w_c^t, u_c^t, \omega_c^t) = w_c^t u_c^t + \omega_c^t, h(w_c^t, \nu_c^t) = w_c^t + \nu_c^t, d_c^t = A_c^t as the measurement, u_c^t = A_c^t / A_c^{t-1} as the input and R^t = 0 as the covariance matrix of the measurement noise \nu_c^t (which is again assumed to be zero), then the a priori and a posteriori estimates of the state w_c^t given by the EKF are (A_c^{t-1})^2 / A_c^{t-2} and A_c^t, respectively. In the rest of the cases (the object is occluded or is entering or exiting an occlusion), the area is supposed to remain constant.

From these predictions, a change of coordinates transformation can also be estimated that maps each pixel P_c^{t-1} = (x_c^{t-1}, y_c^{t-1}) of the object c in step t-1 (maybe occluded) into its foreseen position in step t:

\hat{P}_c^t = \hat{C}_c^t + \left( P_c^{t-1} - C_c^{t-1} \right) \sqrt{ \hat{A}_c^t / A_c^{t-1} }    (15)

Actually, we are interested in applying the transformation in the inverse way, i.e. to know which is the expected corresponding position at time t-1, \hat{P}_c^{t-1} = (\hat{x}_c^{t-1}, \hat{y}_c^{t-1}), of a given pixel P_c^t = (x_c^t, y_c^t) in t:

\hat{P}_c^{t-1} = C_c^{t-1} + \frac{ P_c^t - \hat{C}_c^t }{ \sqrt{ \hat{A}_c^t / A_c^{t-1} } }    (16)

This is enough to compute the a priori tracking image \hat{T}_c^t(x,y) at time t, either from the previous a posteriori or a priori tracking image, depending on the previous occlusion flag:

\hat{T}_c^t(x,y) = \begin{cases} T_c^{t-1}(\hat{x}_c^{t-1}, \hat{y}_c^{t-1}) & \text{if } \neg O_c^{t-1} \\ \hat{T}_c^{t-1}(\hat{x}_c^{t-1}, \hat{y}_c^{t-1}) & \text{otherwise} \end{cases}    (17)

where the values of \hat{x}_c^{t-1}, \hat{y}_c^{t-1} are clipped whenever necessary to keep them within the range of valid coordinates.
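The whole prediction step of Section 5.1 can be sketched as follows (our own illustration, not the authors' code; the square root of the area ratio is used as the linear scale factor, consistently with eqs. (15)-(16) above, and mass centers are handled as (x, y) pairs):

```python
# Sketch of the a priori prediction of Section 5.1 (not the authors' code).
import numpy as np

def predict_center(C1, C2, O1, O2):
    """Eq. (13): conservative estimate when exiting an occlusion, constant
    rate prediction 2*C^{t-1} - C^{t-2} otherwise."""
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    return C1.copy() if (O2 and not O1) else 2.0 * C1 - C2

def predict_area(A1, A2, O1, O2):
    """Eq. (14): constant scale-factor prediction when the object was visible
    in the two previous frames, otherwise the last measured area."""
    return (A1 * A1) / A2 if not (O1 or O2) else A1

def predict_tracking_image(T_source, C_prev, C_hat, A_prev, A_hat):
    """Eqs. (16)-(17): every pixel of the a priori image T_hat^t looks up its
    expected position in the source image (T^{t-1} if the object was visible,
    the previous prediction otherwise), clipping out-of-range coordinates."""
    H, W = T_source.shape
    scale = np.sqrt(A_hat / A_prev)
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    xs_src = np.clip(np.round(C_prev[0] + (xs - C_hat[0]) / scale), 0, W - 1)
    ys_src = np.clip(np.round(C_prev[1] + (ys - C_hat[1]) / scale), 0, H - 1)
    return T_source[ys_src.astype(int), xs_src.astype(int)]
```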
5.2 First computation of the tracking images

To compute the a posteriori tracking image T_c^t(x,y), the pixel class probabilities p^t(x,y) are taken into account only in some image region that is determined from T_c^{t-1}(x,y) or \hat{T}_c^{t-1}(x,y) (depending on O_c^{t-1}) and the tolerance parameters ε_c^t and δ_c^t. Since the estimates of the translation and scale parameters in the coordinate transformation can be inaccurate, we define a rectangular region of possible positions for each pixel by specifying some tolerances in these estimates. To this end, we use the adaptive parameters ε_c^t and δ_c^t, which must be positive values set in accordance with our confidence in the translation and scale estimates, respectively (the higher the confidence, the smaller the tolerance, and vice versa), and which are adjusted according to the following rules:

\varepsilon_c^t = \begin{cases} \min\left( \varepsilon_c^{t-1} + \varepsilon_{incr},\; \varepsilon_{max} \right) & \text{if } O_c^{t-1} \vee O_c^{t-2} \\ \varepsilon_{ini} & \text{otherwise} \end{cases}    (18)

\delta_c^t = \begin{cases} \min\left( \delta_c^{t-1} + \delta_{incr},\; \delta_{max} \right) & \text{if } O_c^{t-1} \vee O_c^{t-2} \\ \delta_{ini} & \text{otherwise} \end{cases}    (19)

where ε_ini, δ_ini are default values, ε_max, δ_max are the maximal allowed values and ε_incr, δ_incr are the respective increases for each successive step under occlusion. Note that the tolerances keep on growing when exiting an occlusion until the object has been visible in the two previous frames; this is needed to detect and track the object again.

Let C_c^{t-1} = (x_{Cc}^{t-1}, y_{Cc}^{t-1}) and \hat{C}_c^t = (\hat{x}_{Cc}^t, \hat{y}_{Cc}^t) be respectively the previous mass center and the a priori estimate of the current mass center. The four vertices of the rectangular uncertainty region centered at \hat{P}_c^{t-1} are denoted (top-left) TL_c^{t-1} = (xinf_c^{t-1}, yinf_c^{t-1}), (top-right) TR_c^{t-1} = (xsup_c^{t-1}, yinf_c^{t-1}), (bottom-left) BL_c^{t-1} = (xinf_c^{t-1}, ysup_c^{t-1}) and (bottom-right) BR_c^{t-1} = (xsup_c^{t-1}, ysup_c^{t-1}), where:

xinf_c^{t-1} = \begin{cases} x_{Cc}^{t-1} + \dfrac{ (x_c^t - \hat{x}_{Cc}^t) - \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t + \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{if } (x_c^t - \hat{x}_{Cc}^t) - \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) \ge 0 \\ x_{Cc}^{t-1} + \dfrac{ (x_c^t - \hat{x}_{Cc}^t) - \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t - \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{otherwise} \end{cases}    (20)

xsup_c^{t-1} = \begin{cases} x_{Cc}^{t-1} + \dfrac{ (x_c^t - \hat{x}_{Cc}^t) + \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t - \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{if } (x_c^t - \hat{x}_{Cc}^t) + \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) \ge 0 \\ x_{Cc}^{t-1} + \dfrac{ (x_c^t - \hat{x}_{Cc}^t) + \varepsilon_c^t (\hat{x}_{Cc}^t - x_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t + \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{otherwise} \end{cases}    (21)

yinf_c^{t-1} = \begin{cases} y_{Cc}^{t-1} + \dfrac{ (y_c^t - \hat{y}_{Cc}^t) - \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t + \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{if } (y_c^t - \hat{y}_{Cc}^t) - \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) \ge 0 \\ y_{Cc}^{t-1} + \dfrac{ (y_c^t - \hat{y}_{Cc}^t) - \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t - \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{otherwise} \end{cases}    (22)

ysup_c^{t-1} = \begin{cases} y_{Cc}^{t-1} + \dfrac{ (y_c^t - \hat{y}_{Cc}^t) + \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t - \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{if } (y_c^t - \hat{y}_{Cc}^t) + \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) \ge 0 \\ y_{Cc}^{t-1} + \dfrac{ (y_c^t - \hat{y}_{Cc}^t) + \varepsilon_c^t (\hat{y}_{Cc}^t - y_{Cc}^{t-1}) }{ \sqrt{ (\hat{A}_c^t + \delta_c^t \hat{A}_c^t) / A_c^{t-1} } } & \text{otherwise} \end{cases}    (23)

The values of xinf_c^{t-1}, yinf_c^{t-1}, xsup_c^{t-1} and ysup_c^{t-1} are clipped whenever necessary to keep them within the range of valid coordinates.

In order to understand eqs. (20) to (23), they must first be compared with eq. (16), which gives the position of the rectangle center (jointly for the x and y coordinates).
Then, consider for instance eq. (20), since the other three are simply derived by symmetry. Eq. (20) aims at setting the leftmost value of x in the uncertainty rectangle; to this end, some small proportion of the estimated center displacement in the x coordinate is subtracted in the numerator, and the scale ratio is either enlarged or shrunk in the denominator (by adding or subtracting, respectively, a small proportion of the new estimated area), depending on which of these two options yields the smallest (leftmost) x value; it is easy to check that the latter depends on the sign of the numerator.

Now, each pixel P_c^t = (x_c^t, y_c^t) is labeled, with respect to object c, with one of three labels ("certainly belonging to object c", "certainly not belonging to object c" or "uncertain") as follows. If O_c^{t-1} is false then: if all the pixels in the rectangular region delimited by TL_c^{t-1}, TR_c^{t-1}, BL_c^{t-1}, BR_c^{t-1} have a common value of 1 in T_c^{t-1}(x,y), it is assumed that P_c^t is definitely inside and certainly belongs to object c; on the contrary, if they have a common value of 0 in T_c^{t-1}(x,y), it is assumed that P_c^t is clearly outside and certainly does not belong to object c; otherwise, the rectangular region contains both 1 and 0 values and the pixel P_c^t is initially labeled as "uncertain". However, if O_c^{t-1} is true, T_c^{t-1}(x,y) will represent a totally or partially occluded object and we cannot rely on it, but rather on the predicted \hat{T}_c^{t-1}(x,y), which is based on information prior to the occlusion. If all the pixels in the rectangular region delimited by TL_c^{t-1}, TR_c^{t-1}, BL_c^{t-1}, BR_c^{t-1} have a common value of 0 in \hat{T}_c^{t-1}(x,y), it is assumed that P_c^t does not belong to object c; otherwise (the rectangular region contains both 1 and 0 values, or only 1 values), the pixel P_c^t is labeled as "uncertain".

Only for the uncertain pixels (x,y) will the dynamic probabilities p^t(x,y) be used. Recall that these probabilities will have been updated previously from the object recognition results at time t, Q^t(x,y), also expressed as probabilities. More precisely, we propose the following rule to compute the value of each pixel of the a posteriori tracking image for object c at time t:

T_c^t(x,y) = \begin{cases} 1 & \text{if } (x,y) \text{ certainly belongs to object } c, \text{ or it is uncertain and } p_c^t(x,y) \text{ is the maximum over all classes between } 1 \text{ and } N+1 \\ 0 & \text{otherwise} \end{cases}    (24)
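The labeling and decision rule can be sketched as follows (ours, not the authors' code); the per-pixel uncertainty rectangle is assumed to be provided by eqs. (20)-(23), so only the decision logic is shown:

```python
# Sketch of the three-label assignment of Section 5.2 and the rule of eq. (24).
# Not the authors' code; rectangles are assumed precomputed from eqs. (20)-(23).
import numpy as np

CERTAIN_IN, UNCERTAIN, CERTAIN_OUT = 2, 1, 0

def label_pixel(ref_image, rect, occluded_prev):
    """Label one pixel from the values of the reference tracking image
    (T^{t-1} if the object was visible, its a priori prediction if it was
    occluded) inside the pixel's uncertainty rectangle."""
    x_inf, x_sup, y_inf, y_sup = rect
    window = ref_image[y_inf:y_sup + 1, x_inf:x_sup + 1]
    if not occluded_prev:
        if window.all():
            return CERTAIN_IN        # the whole rectangle lies inside the object
        if not window.any():
            return CERTAIN_OUT       # the whole rectangle lies outside the object
        return UNCERTAIN
    # Under occlusion only the prediction is trusted, and never as "certainly in".
    return CERTAIN_OUT if not window.any() else UNCERTAIN

def tracking_decision(labels, p, c):
    """Eq. (24): a pixel is set in T_c^t if it certainly belongs to object c, or
    if it is uncertain and class c has the largest dynamic probability.
    labels has shape (H, W); p has shape (N+1, H, W); classes are 1-based."""
    winner = p.argmax(axis=0) == (c - 1)
    return ((labels == CERTAIN_IN) | ((labels == UNCERTAIN) & winner)).astype(np.uint8)
```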
5.3 Post-processing of the tracking images

Sometimes, the tracking images T_c^t(x,y) obtained by applying eq. (24) contain disconnected regions of 1-valued pixels or, in other words, more than one connected component T_{ci}^t, 1 ≤ i ≤ I, I > 1. This may be produced by a variety of causes, mainly segmentation or recognition errors, but it may also be due to possible partial occlusions of the target object by an object of a different class. In addition, a particular problem that leads to an object split occurs immediately after a same-class crossing or occlusion: when the target object has just finished crossing another object or region which is recognized to be in the same class (distracter), the tracking method is misled into following both the object and the distracter. It is very difficult to devise a general method that can always distinguish between erroneous components due to noise or distracters and correct object components, especially if separated components are allowed in order to cope with partial occlusions, but some useful heuristics based on properties such as size, movement or shape may be defined that work reasonably well in a majority of cases.

In order to eliminate noisy regions and to circumvent the same-class crossing problem, while handling partial occlusions at the same time, we propose a post-processing step that removes from T_c^t(x,y) some possible artifacts or distracters (setting some initially 1-valued pixels to zero). In fact, this step is only carried out if T_c^t(x,y) contains more than one component. In such a case, we need to choose which components T_{ci}^t to keep (one or more) and which to discard. To this end, three heuristic filters are applied sequentially, whenever two or more components remain before the filter application.

The first filter is aimed at deleting small noisy regions and is solely based on their size. Let A_{ci}^t be the area of the i-th connected component T_{ci}^t and let Area(T_c^t) be the total area covered by 1-valued pixels in T_c^t(x,y). The i-th component is removed if the ratio A_{ci}^t / Area(T_c^t) is below a given threshold κ, e.g. κ = 0.15.

The second filter is aimed at deleting distracters, including those appearing after a same-class occlusion, and is based on a comparison between the apparent movement of the remaining components and the recent previous trajectory of the tracked object, represented by the movement vector \vec{m}_c^{t-1}. Let C_{ci}^t be the mass center of the i-th connected component T_{ci}^t and define an associated movement vector \vec{z}_{ci}^t = C_{ci}^t - C_c^{t-1} for each component. Then,

\cos \theta_{ci} = \frac{ \langle \vec{z}_{ci}^t, \vec{m}_c^{t-1} \rangle }{ \| \vec{z}_{ci}^t \| \, \| \vec{m}_c^{t-1} \| }    (25)

is a measure of the alignment between the vectors \vec{m}_c^{t-1} and \vec{z}_{ci}^t, which is only reliable for our purposes if both vectors have a sufficiently large norm. Otherwise, the angle θ_{ci} can be considered rather random, since it may be affected a lot by adding small perturbations to the vectors. Consequently, abrupt trajectory changes (greater than 90 degrees) are penalized by removing the i-th component T_{ci}^t when the condition \cos \theta_{ci} < 0 \wedge \| \vec{z}_{ci}^t \| \ge \lambda \wedge \| \vec{m}_c^{t-1} \| \ge \lambda holds, where λ is another threshold, e.g. λ = 3. However, to guarantee that at least one component is kept, the remaining component for which \vec{z}_{ci}^t is the most collinear vector with respect to \vec{m}_c^{t-1}, i.e. the component i such that

i = \arg\max_i \left( \frac{ \langle \vec{z}_{ci}^t, \vec{m}_c^{t-1} \rangle }{ \| \vec{z}_{ci}^t \| } \right)    (26)

is never removed by this second filter.

The third filter is also aimed at deleting distracters and is based on a comparison between the shapes of the components and that of the a priori prediction of the target object (represented by the 1-valued region in \hat{T}_c^t(x,y)). For each remaining component T_{ci}^t, the a priori prediction of the target object is moved from its original center \hat{C}_c^t to the component center C_{ci}^t, thus resulting in a translated copy \hat{T}_{ci}^t(x,y), and the spatial overlap between both shapes is then measured as follows:

SO_{ci} = \frac{ \text{Area}\left( T_{ci}^t \cap \hat{T}_{ci}^t \right) }{ \text{Area}\left( T_{ci}^t \cup \hat{T}_{ci}^t \right) }    (27)

The components having a spatial overlap SO_{ci} < 0.243 (which is the overlap obtained between two circles of the same size when one of the centers is located on the border of the other circle) are deleted by this third filter, unless SO_{ci} is the maximum spatial overlap among the remaining components. This exception guarantees the persistence of at least one component in the final tracking image. As a result of the post-processing, the pixels of all the components T_{ci}^t removed by any of the three filters are set to zero in the final tracking image T_c^t(x,y).
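A condensed sketch of the three filters (ours, not the authors' code; scipy's connected-component labelling is used, and np.roll is a crude stand-in for the translation of the predicted shape) could look like this:

```python
# Sketch of the post-processing filters of Section 5.3 (not the authors' code).
# Thresholds follow the text: kappa = 0.15, lambda = 3, minimum overlap 0.243.
import numpy as np
from scipy import ndimage

def _overlap(T_hat, comp):
    """Eq. (27) with the prediction translated onto the component center
    (np.roll is an approximation that ignores wrap-around effects)."""
    dy, dx = (np.array(ndimage.center_of_mass(comp)) -
              np.array(ndimage.center_of_mass(T_hat))).round().astype(int)
    moved = np.roll(np.roll(T_hat, dy, axis=0), dx, axis=1)
    union = np.logical_or(moved, comp).sum()
    return np.logical_and(moved, comp).sum() / union if union else 0.0

def postprocess(T, T_hat, C_prev, m_prev, kappa=0.15, lam=3.0, so_min=0.243):
    comps, n = ndimage.label(T)
    if n <= 1:
        return T
    keep = list(range(1, n + 1))

    # Filter 1: remove components that are too small relative to the total area.
    total = (T > 0).sum()
    keep = [i for i in keep if (comps == i).sum() / total >= kappa] or keep[:1]

    # Filter 2: remove components whose movement contradicts the recent
    # trajectory m^{t-1} (eq. (25)), keeping the most collinear one (eq. (26)).
    if len(keep) > 1 and np.linalg.norm(m_prev) >= lam:
        centers = {i: np.array(ndimage.center_of_mass(comps == i))[::-1] for i in keep}
        z = {i: centers[i] - np.asarray(C_prev, float) for i in keep}
        proj = {i: float(np.dot(z[i], m_prev)) / max(np.linalg.norm(z[i]), 1e-9) for i in keep}
        best = max(keep, key=lambda i: proj[i])
        keep = [i for i in keep
                if i == best or not (np.linalg.norm(z[i]) >= lam and proj[i] < 0.0)]

    # Filter 3: remove components whose shape overlaps too little with the
    # a priori prediction, keeping at least the best-overlapping one.
    if len(keep) > 1:
        so = {i: _overlap(T_hat, comps == i) for i in keep}
        best = max(keep, key=lambda i: so[i])
        keep = [i for i in keep if i == best or so[i] >= so_min]

    return np.isin(comps, keep).astype(np.uint8)
```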
5.4 Determination of occlusion and geometric measurements

Once both T_c^t(x,y) and \hat{T}_c^t(x,y) have been determined, it is possible to detect the occurrence of an occlusion (i.e. to set the current occlusion flag) in the following way. Let Area(T_c^t) be the measured area of the 1-valued region in the final T_c^t(x,y) and let Area(\hat{T}_c^t) be the measured area of the 1-valued region in \hat{T}_c^t(x,y). Then,

O_c^t = \begin{cases} \text{Area}(T_c^t) < r_1 \, \text{Area}(\hat{T}_c^t) & \text{if } O_c^{t-1} \\ \text{Area}(T_c^t) < r_2 \, \text{Area}(\hat{T}_c^t) & \text{otherwise} \end{cases}    (28)

where 0 < r_1 < r_2 < 1 (for instance, r_1 = 0.5, r_2 = 0.75). Note that the condition for remaining in occlusion mode is harder than the condition for initiating an occlusion. This facilitates the recovery of the object track when exiting an occlusion or when a false occlusion has been detected.

Next, the a posteriori estimates of the object mass center and area are selected between those of the a priori and a posteriori tracking images based on the value of the occlusion flag:

C_c^t = \begin{cases} MC(T_c^t) & \text{if } \neg O_c^t \\ MC(\hat{T}_c^t) & \text{otherwise} \end{cases}    (29)

where MC(T_c^t) is the measured mass center of the 1-valued region in the final T_c^t(x,y) and MC(\hat{T}_c^t) is the measured mass center of the 1-valued region in \hat{T}_c^t(x,y), and

A_c^t = \begin{cases} \text{Area}(T_c^t) & \text{if } \neg O_c^t \\ \text{Area}(\hat{T}_c^t) & \text{otherwise} \end{cases}    (30)

Finally, the movement weighted average vector \vec{m}_c^t is updated afterwards as follows:

\vec{m}_c^t = \begin{cases} \left( \vec{v}_c^t + (t-1) \, \vec{m}_c^{t-1} \right) / t & \text{if } t < 1/\beta \\ \beta \, \vec{v}_c^t + (1 - \beta) \, \vec{m}_c^{t-1} & \text{if } t \ge 1/\beta \end{cases}    (31)

where β is a positive parameter between 0 and 1, e.g. β = 0.2, and \vec{v}_c^t is the current movement defined by \vec{v}_c^t = C_c^t - C_c^{t-1}. Note that the second row in (31) is a typical moving average computation, while the first row denotes a simple average for the starting steps, and both give the same result for t = 1/β.
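The occlusion bookkeeping of this subsection can be sketched as follows (ours, not the authors' code; mass centers are handled as (x, y) pairs):

```python
# Sketch of the occlusion flag and geometric measurements of Section 5.4
# (not the authors' code), with r1 = 0.5, r2 = 0.75 and beta = 0.2 as in the text.
import numpy as np
from scipy import ndimage

R1, R2, BETA = 0.5, 0.75, 0.2

def update_occlusion_state(T, T_hat, O_prev, m_prev, C_prev, t):
    area, area_hat = (T > 0).sum(), (T_hat > 0).sum()

    # Eq. (28): a stricter ratio (r1) is required to *remain* occluded than to
    # *enter* an occlusion (r2), which eases recovery when the object reappears.
    ratio = R1 if O_prev else R2
    occluded = area < ratio * area_hat

    # Eqs. (29)-(30): measure on the a posteriori image when visible and on the
    # a priori prediction when occluded.
    src = T_hat if occluded else T
    cy, cx = ndimage.center_of_mass(src)
    C = np.array([cx, cy])
    A = (src > 0).sum()

    # Eq. (31): weighted moving average of the displacement
    # (plain average during the first steps, while t < 1/beta).
    v = C - np.asarray(C_prev, float)
    if t < 1.0 / BETA:
        m = (v + (t - 1) * np.asarray(m_prev, float)) / t
    else:
        m = BETA * v + (1.0 - BETA) * np.asarray(m_prev, float)
    return occluded, C, A, m
```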
6. Experimental results

We were interested in testing both PIORT approaches in video sequences including object occlusions and taken with a moving camera. Nevertheless, we also performed a first set of validation experiments in video sequences taken with a still camera. In all tests we defined N = 1 object of interest to track. All images in the video sequences were segmented independently using the EDISON implementation of the mean-shift segmentation algorithm, whose code is available at http://www.caip.rutgers.edu/riul/research/code.html. The local features extracted for each spot were the RGB colour averages. For object learning, spots selected through ROI (region-of-interest) windows in the training sequence were collected to train a two-layer perceptron using backpropagation and to build the target class histogram. When using the neural net in the test phase, the class probabilities for all the spots in the test sequences were estimated from the net outputs. When using the histogram, the spot class probabilities were estimated according to equation (6). In both cases, the spot class probabilities were replicated for all the pixels in the same spot. For object tracking in the test sequences, ROI windows for the target object were only marked in the first image to initialise the tracking process.

The recognition and tracking results of our PIORT approaches for the test sequences were stored in videos where each frame has a layout of 2 x 3 images with the following contents:
- the top left is the image segmented by EDISON;
- the top middle is the image of probabilities given by the static recognition module for the current frame;
- the top right is the a priori prediction of the tracking image;
- the bottom left is the image of dynamic probabilities;
- the bottom right is the original image with a graphic overlay that represents the boundaries of the a posteriori binary tracking image (the final result for the frame); and
- the bottom middle is an intermediate image labelled by the tracking module, where yellow pixels correspond to pixels labelled as "certainly belonging to the object", light blue pixels correspond to pixels initially labelled as "uncertain" but with a high dynamic probability, dark blue pixels correspond to pixels labelled as "uncertain" and with a low probability, dark grey pixels are pixels labelled as "certainly not belonging to the object" but with a high probability, and the rest are black pixels with both a low probability and a "certainly not belonging to the object" label.

For comparison purposes, tracking of the target objects in the test sequences was also carried out by applying the following six methods, which only need the ROI window mark in the first frame of the test sequence: Template Match by Correlation (TMC), which refers to normalized correlation template matching [40]; Basic Meanshift (BM) [41]; Histogram Ratio Shift (HRS) [42]; Variance Ratio Feature Shift (VRFS) [42]; Peak Difference Feature Shift (PDFS) [42]; and Graph-Cut Based Tracker (GCBT) [43, 44]. These methods have been briefly commented on in Section 1.

From the tracking results of all the tested methods, two evaluation metrics were computed for each frame: the spatial overlap and the centroid distance [46]. The spatial overlap SO(GT_k, ST_k) between the ground truth GT_k and the system track ST_k in a specific frame k is defined as the ratio

SO(GT_k, ST_k) = \frac{ \text{Area}\left( GT_k \cap ST_k \right) }{ \text{Area}\left( GT_k \cup ST_k \right) }    (32)

and Dist(GTC_k, STC_k) refers to the Euclidean distance between the centroids of the ground truth (GTC_k) and the system track (STC_k) in frame k. Naturally, the larger the overlap and the smaller the distance, the better the performance of the system track. Since the centroid distance can only be computed if both GT_k and ST_k are non-null, a failure ratio was measured as the number of frames in which either GT_k or ST_k was null (but not both) divided by the total number of frames. Finally, an accuracy measure was computed as the number of good matches divided by the total number of frames, where a good match is either a true negative or a true positive with a spatial overlap above a threshold of 0.243 (which is the overlap obtained between two circles of the same size when one of the centres is located on the border of the other circle).
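For reference, these frame-level measures can be computed as in the following sketch (ours, not the authors' evaluation code), given binary ground-truth (GT) and system-track (ST) masks per frame:

```python
# Sketch of the evaluation measures: spatial overlap (eq. (32)), centroid
# distance, failure ratio and accuracy. Not the authors' evaluation code.
import numpy as np
from scipy import ndimage

SO_THRESHOLD = 0.243  # overlap of two equal circles when one center lies on the other's border

def spatial_overlap(gt, st):
    union = np.logical_or(gt, st).sum()
    return np.logical_and(gt, st).sum() / union if union else None  # None if both are empty

def centroid_distance(gt, st):
    if not gt.any() or not st.any():
        return None  # undefined when either mask is empty
    return float(np.linalg.norm(np.subtract(ndimage.center_of_mass(gt),
                                            ndimage.center_of_mass(st))))

def sequence_scores(gts, sts):
    """Failure ratio: frames where exactly one of GT/ST is empty.
    Accuracy: true negatives plus true positives with SO above the threshold."""
    failures = good = 0
    for gt, st in zip(gts, sts):
        if gt.any() != st.any():
            failures += 1
        elif not gt.any():                        # true negative: both empty
            good += 1
        elif spatial_overlap(gt, st) > SO_THRESHOLD:
            good += 1
    n = len(gts)
    return failures / n, good / n
```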
6.1 Experimental results on video sequences taken with a still camera

The first set of experiments comprised three test video sequences taken with a still camera that show indoor office scenes where the target to track is a blue ball moving on a table. A similar but different sequence was used for training a neural network to discriminate between blue balls and typical sample regions in the background and for constructing the class histogram of the blue ball (this training sequence is available at http://www-iri.upc.es/people/ralqueza/bluetraining.avi).

In the first test sequence, http://www-iri.upc.es/people/ralqueza/S1S2.avi, two blue balls are moving on the table and one temporarily occludes the other during some frames. Two experiments were performed on this test sequence depending on the initialisation of the tracking: in test S1, the tracking was initialised on the right ball, and in test S2, the tracking was initialised on the left ball. The static recognition module considers that both balls belong to the same class. In both tests, the temporary overlap was correctly managed by our methods, since the tracked ball is well relocated after exiting the occlusion. The corresponding videos displaying the results of the PIORT methods (in the layout described above) are at http://www-iri.upc.es/people/ralqueza/S1_NN.mpg and S2_NN.mpg for the PIORT-Neural net method and at S1_Bayes.mpg and S2_Bayes.mpg for the PIORT-Bayesian method.

Sequence S1: Blue balls crossed. Tracking initialised on the right ball
Sequence S2: Blue balls crossed. Tracking initialised on the left ball

In the second test sequence (test S3), http://www-iri.upc.es/people/ralqueza/S3.avi, the tracked blue ball is occluded twice by a box, for 5 and 12 frames, respectively. Recognition and tracking results for the whole sequence using the PIORT-Neural Net and PIORT-Bayesian methods are at http://www-iri.upc.es/people/ralqueza/S3_NN.mpg and S3_Bayes.mpg, respectively. The tracking of the blue ball is quite satisfactory, since both occlusions are correctly detected and the ball is correctly relocated when exiting the occlusion.

Sequence S3: Blue ball moving occluded by box

In the last test sequence of this group, http://www-iri.upc.es/people/ralqueza/S4.avi, there are again two blue balls and the target moving ball crosses the second ball, which does not move, twice: once in front of it and once behind it. As the recognition module classifies both balls in the same class, the same-class occlusion is not detected as an occlusion (the two balls are merged into a single blue object), but the target ball is nevertheless well tracked after the two crossings. The videos displaying the results of the PIORT-Neural Net and PIORT-Bayesian methods for this sequence are at http://www-iri.upc.es/people/ralqueza/S4_NN.mpg and S4_Bayes.mpg, respectively.

Sequence S4: Blue ball moving around another blue ball

Table 1 presents the results (mean ± std. deviation) of the spatial overlap (SO) and centroid distance (CD) measures together with the failure ratio (FR) and accuracy (Acc) of each tracking method for the four tests S1 to S4, emphasizing in bold the best values for each measure and test. Our PIORT tracking methods worked well in the four tests, obtaining the best values of the four measures (except for the accuracy measure in test S4, where the HRS method gave a slightly superior performance). All methods performed quite well in S1; only the PDFS method performed comparably to the PIORT approaches in S2; only the PIORT methods worked in S3, while the rest failed; and only the BM and HRS methods performed comparably to the PIORT approaches in S4.

Table 1. Results of ball tracking on video sequences taken with a still camera.
SO: Spatial Overlap; CD: Centroid Distance; FR: Failure Ratio; Acc: Accuracy

Tracking method      SO            CD              FR     Acc

S1: Blue balls crossed (right ball)
TMC                  0.56 ± 0.10   5.07 ± 2.07     0      0.98
BM                   0.60 ± 0.06   3.19 ± 1.21     0      1.00
HRS                  0.46 ± 0.11   6.03 ± 2.05     0      1.00
VRFS                 0.66 ± 0.07   1.15 ± 0.47     0      1.00
PDFS                 0.63 ± 0.10   2.01 ± 0.94     0      1.00
GCBT                 0.64 ± 0.18   13.20 ± 52.52   0.05   0.94
PIORT-Neural Net     0.84 ± 0.09   1.38 ± 1.39     0      1.00
PIORT-Bayesian       0.80 ± 0.07   0.75 ± 0.76     0      1.00

S2: Blue balls crossed (left ball)
TMC                  0.22 ± 0.27   44.34 ± 52.24   0      0.41
BM                   0.23 ± 0.29   42.51 ± 50.42   0      0.36
HRS                  0.25 ± 0.31   44.93 ± 51.96   0      0.41
VRFS                 0.28 ± 0.35   42.82 ± 52.62   0      0.41
PDFS                 0.50 ± 0.30   36.27 ± 86.95   0.14   0.77
GCBT                 0.20 ± 0.27   70.69 ± 68.80   0      0.36
PIORT-Neural Net     0.60 ± 0.23   3.94 ± 4.98     0      0.91
PIORT-Bayesian       0.46 ± 0.25   15.04 ± 52.64   0.05   0.73

S3: Blue ball moving occluded by box
TMC                  0.01 ± 0.04   173.40 ± 68.71  0.22   0
BM                   0.01 ± 0.07   182.54 ± 68.14  0.22   0
HRS                  0             187.85 ± 67.96  0.25   0
VRFS                 0.02 ± 0.18   140.14 ± 93.44  0.20   0.17
PDFS                 0.13 ± 0.41   131.07 ± 106.1  0.42   0.02
GCBT                 0             237.02 ± 134.6  0.74   0.22
PIORT-Neural Net     0.81 ± 0.42   0.47 ± 0.38     0      1.00
PIORT-Bayesian       0.53 ± 0.37   8.39 ± 48.61    0.03   0.95

S4: Blue ball moving around still blue ball
TMC                  0.35 ± 0.22   13.10 ± 32.38   0.01   0.75
BM                   0.56 ± 0.15   7.39 ± 29.05    0.01   0.93
HRS                  0.60 ± 0.13   6.21 ± 29.16    0.01   0.96
VRFS                 0.10 ± 0.62   74.68 ± 45.00   0.01   0.14
PDFS                 0.13 ± 0.43   44.39 ± 36.14   0.01   0.17
GCBT                 0.10 ± 0.53   201.60 ± 98.35  0.80   0.18
PIORT-Neural Net     0.74 ± 0.21   5.90 ± 29.33    0.01   0.94
PIORT-Bayesian       0.72 ± 0.20   5.58 ± 29.38    0.01   0.94

6.2 Experimental results on video sequences taken with a moving camera

The second set of experiments comprised another three test video sequences where the target is again a ball, but this time taken with a moving camera. The first of them (test S5) shows an indoor office scene where a blue ball is moving on a table and is temporarily occluded, while other blue objects appear in the scene. This test sequence can be downloaded at http://www-iri.upc.es/people/ralqueza/S5.avi.

Sequence S5: Blue bouncing ball on table

The other two test sequences in this group show outdoor scenes in which a Segway robot tries to follow an orange ball that is being kicked by a person. Both include multiple occlusions of the tracked orange ball and differ in the surface over which the ball runs, which is pavement in test S6 and grass in test S7 (see http://www-iri.upc.es/people/ralqueza/S6.avi and S7.avi, respectively). A similar but different sequence was used for training a neural network to discriminate between orange balls and typical sample regions in the background and for constructing the class histogram of the orange ball (this training sequence is available at http://www-iri.upc.es/people/ralqueza/orangetraining.avi). The tracking result videos for these test sequences are available at http://www-iri.upc.es/people/ralqueza/S5_NN.mpg, S5_Bayes.mpg, S6_NN.mpg, S6_Bayes.mpg, S7_NN.mpg and S7_Bayes.mpg.

Table 2. Results of ball tracking on video sequences taken with a mobile camera.
SO: Spatial Overlap; CD: Centroid Distance; FR: Failure Ratio; Acc: Accuracy

Tracking method      SO            CD               FR     Acc

S5: Blue bouncing ball on table
TMC                  0.28 ± 0.48   74.65 ± 91.53    0.19   0.43
BM                   0.23 ± 0.52   78.40 ± 90.33    0.19   0.37
HRS                  0.16 ± 0.45   125.88 ± 11.80   0.43   0.30
VRFS                 0.20 ± 0.38   96.72 ± 134.84   0.39   0.60
PDFS                 0.28 ± 0.57   103.60 ± 36.77   0.41   0.59
GCBT                 0.01 ± 0.29   188.79 ± 18.13   0.75   0.21
PIORT-Neural Net     0.60 ± 0.40   12.53 ± 59.38    0.05   0.95
PIORT-Bayesian       0.59 ± 0.39   12.46 ± 59.40    0.05   0.95

S6: Segway - Orange ball on pavement
TMC                  0.06 ± 0.40   146.35 ± 81.83   0.03   0.14
BM                   0.09 ± 0.43   110.94 ± 76.70   0.03   0.19
HRS                  0.09 ± 0.38   156.99 ± 103.80  0.41   0.21
VRFS                 0.16 ± 0.68   70.46 ± 49.17    0.03   0.21
PDFS                 0.14 ± 0.59   117.09 ± 81.43   0.03   0.21
GCBT                 0.01 ± 0.34   233.56 ± 62.12   0.93   0.06
PIORT-Neural Net     0.72 ± 0.20   2.67 ± 19.21     0.01   0.98
PIORT-Bayesian       0.13 ± 0.73   202.14 ± 99.35   0.81   0.19

S7: Segway - Orange ball on grass
TMC                  0.02 ± 0.29   137.93 ± 84.53   0.04   0.04
BM                   0.15 ± 0.27   125.13 ± 116.14  0.34   0.35
HRS                  0.03 ± 0.33   190.63 ± 89.72   0.54   0.08
VRFS                 0.59 ± 0.21   7.93 ± 38.85     0.02   0.95
PDFS                 0.33 ± 0.50   121.46 ± 125.91  0.48   0.51
GCBT                 0.01 ± 0.37   208.39 ± 83.88   0.79   0.04
PIORT-Neural Net     0.47 ± 0.23   17.02 ± 60.98    0.06   0.88
PIORT-Bayesian       0.25 ± 0.49   133.43 ± 126.22  0.53   0.42

Sequence S6: Segway - Orange ball on pavement
Sequence S7: Segway - Orange ball on grass

Table 2 presents the results (mean ± std. deviation) of the spatial overlap (SO) and centroid distance (CD) measures together with the failure ratio (FR) and accuracy (Acc) of each tracking method for the three tests S5 to S7. Our PIORT-Neural Net method performed well in the three tests, obtaining the best spatial overlap and accuracy values in tests S5 and S6 and results slightly below those of the VRFS method in test S7, in which VRFS gave the best values of all four measures. Our PIORT-Bayesian method worked well in test S5 but failed to track the orange ball correctly in tests S6 and S7. In S5, only the two PIORT methods performed well; in S6, only the PIORT-Neural Net method worked while the rest failed; and in S7, only the VRFS and PIORT-Neural Net methods obtained satisfactory results.

Sequence S8: Pedestrian with red jacket

The last set of experiments comprised another three test video sequences, taken with a moving camera in outdoor environments, where the targets are humans, or more precisely, some part of their clothing. The first sequence in this group (test S8) is a long sequence taken on a street where the aim is to track a pedestrian wearing a red jacket (see http://www-iri.upc.es/people/ralqueza/S8.avi); it includes total and partial occlusions of the followed person by other walking people and by objects on the street. In this case, a short sequence of the scene taken with a moving camera located in a different position (http://www-iri.upc.es/people/ralqueza/redpedestrian_training.avi) was used as the training sequence.

The other two test sequences in this group, tests S9 and S10, show outdoor scenes in which humans riding Segway robots and wearing orange T-shirts are followed. In test S9 a single rider is followed, whereas in test S10 two men are riding two Segway robots simultaneously and crossing each other. These test sequences are at http://www-iri.upc.es/people/ralqueza/S9.avi and S10.avi, and the training sequence associated with them is at http://www-iri.upc.es/people/ralqueza/T-shirt_training.avi.
Sequence S9: Guy on Segway with orange T-shirt
Sequence S10: Men on Segway with orange T-shirts

The tracking result videos for these test sequences are available at http://www-iri.upc.es/people/ralqueza/S8_NN.mpg, S8_Bayes.mpg, S9_NN.mpg, S9_Bayes.mpg, S10_NN.mpg and S10_Bayes.mpg.

Table 3 presents the results of the evaluation measures of each tracking method for the three tests S8 to S10. Both PIORT methods gave the best results, very similar to each other, in tests S8 and S9, and the PIORT-Neural Net method performed clearly best in test S10. Note that in the pedestrian sequence (S8), an occlusion by people carrying red bags distracted the PIORT tracking module and caused a momentary drop in performance, especially for the centroid distance measure, but the tracker was able to correctly recover the target after that occlusion. In sequence S8, only the PDFS method performed comparably to the PIORT approaches in terms of accuracy and centroid distance, although it achieved a considerably lower spatial overlap. In test S9, the HRS, VRFS and PDFS methods obtained similar and reasonably good results, but not as good as those of the PIORT methods. Finally, only the PIORT-Neural Net method worked well in test S10, where the PIORT-Bayesian method performed poorly because it switched to following the other Segway-riding man after the two men crossed.

Table 3. Results of human tracking on video sequences taken with a mobile camera.
SO: Spatial Overlap; CD: Centroid Distance; FR: Failure Ratio; Acc: Accuracy

Tracking method      SO            CD               FR     Acc

S8: Pedestrian with red jacket
TMC                  0.44 ± 0.31   25.25 ± 61.10    0.07   0.77
BM                   0.24 ± 0.58   72.08 ± 64.33    0.07   0.34
HRS                  0.35 ± 0.24   13.49 ± 38.27    0.02   0.64
VRFS                 0.45 ± 0.32   34.27 ± 81.13    0.12   0.82
PDFS                 0.50 ± 0.20   11.42 ± 45.11    0.03   0.95
GCBT                 0.04 ± 0.32   194.7 ± 105.3    0.77   0.16
PIORT-Neural Net     0.79 ± 0.24   11.90 ± 50.87    0.04   0.96
PIORT-Bayesian       0.74 ± 0.24   11.15 ± 48.14    0.04   0.95

S9: Guy on Segway with orange T-shirt
TMC                  0.10 ± 0.53   130.3 ± 69.75    0.00   0.15
BM                   0.22 ± 0.13   41.30 ± 58.70    0.01   0.40
HRS                  0.53 ± 0.25   22.83 ± 58.43    0.05   0.86
VRFS                 0.69 ± 0.25   27.69 ± 75.15    0.10   0.90
PDFS                 0.56 ± 0.21   29.19 ± 74.65    0.10   0.90
GCBT                 0.14 ± 0.22   101.6 ± 112.7    0.36   0.19
PIORT-Neural Net     0.73 ± 0.16   3.40 ± 14.78     0.00   0.97
PIORT-Bayesian       0.74 ± 0.13   3.70 ± 14.61     0.00   0.98

S10: Men on Segway with orange T-shirts
TMC                  0.06 ± 0.39   104.3 ± 83.15    0.03   0.10
BM                   0.29 ± 0.28   42.10 ± 59.06    0.03   0.59
HRS                  0.28 ± 0.30   38.72 ± 65.09    0.06   0.58
VRFS                 0.38 ± 0.34   36.81 ± 64.53    0.06   0.61
PDFS                 0.32 ± 0.36   91.14 ± 119.4    0.35   0.56
GCBT                 0.04 ± 0.31   187.1 ± 103.1    0.72   0.08
PIORT-Neural Net     0.73 ± 0.18   8.37 ± 40.74     0.03   0.96
PIORT-Bayesian       0.16 ± 0.58   81.36 ± 62.93    0.03   0.22

7. Conclusions, discussion and future work

In this paper we have described an updated version of the probabilistic integrated object recognition and tracking (PIORT) methodology that we have developed in recent years, partially reported in [36-39], and presented a collection of experimental results on test video sequences, with the aim of comparing two particular approaches derived from PIORT, based on Bayesian and neural net methods, respectively, with several state-of-the-art tracking methods proposed by other authors.
An improved method for object tracking, capable of dealing with rather long occlusions and with same-class object crossings, has been proposed for inclusion within our probabilistic framework, which integrates recognition and tracking of objects in image sequences. PIORT does not use any contour information; instead, it relies on the results of an iterative, dynamic probabilistic approach to object recognition. These recognition results are represented at pixel level as probability images and are obtained through the use of a classifier (e.g. a neural network) applied to region-based features. The PIORT framework is divided into three parts: a static recognition module, where the classifier is applied to single-frame images; a dynamic recognition module, which updates the object probabilities using previous recognition and tracking results; and a tracking decision module, where binary tracking images are determined for each object (an illustrative per-frame sketch of this pipeline is given near the end of this section). This third module combines the recognition probabilities with a model that predicts the object's apparent motion in terms of translation and scale changes, while coping with the problems of occlusion and re-emergence detection. Moreover, the tracking module can deal with object splitting, whether due to partial occlusions or to same-class object crossings, and in most cases it is able to select and track only the target object after it crosses or is occluded by another object recognized as belonging to the same class, i.e. it is able to re-establish the identity of the target object.

The experimental work reported in this paper has focused on the case of single-object tracking, simply because the tracking methods we had available for comparison only allowed single-object tracking. However, as shown in [36], the PIORT system is capable of tracking multiple objects of different classes simultaneously and, as demonstrated in the experiments, it can be applied to video sequences acquired either by a fixed or by a moving camera. The size, shape and movement of the target objects can vary smoothly along the sequence, but the appearance features used by the classifier (up to now, colour features) should remain rather stable for successful tracking.

It must be taken into account that the global performance of the system depends not only on the ability of the tracking method but also on the quality of the object recognition probabilities provided by the trained classifier. In this regard, false positive detections by the classifier are only harmful for tracking when they are very close to or “touching” the target; otherwise they are filtered out by the second and third modules. Even in the first case, the tracking module is sometimes able to distinguish between the target and a false distractor, when the latter is different enough in terms of size, shape or motion trajectory. Concerning false negative errors by the classifier, they can be partially compensated by the second module, especially when the apparent motion of the target is slow and hence the previous probabilities carry more (adaptive) weight than the current ones given by the classifier. Nevertheless, if the classifier fails to detect the target object for just a few consecutive frames, the tracker will assume a target occlusion and proceed in occlusion mode, which implies an assumption of constant motion along the previous trajectory and a growing uncertainty in the target position.
In this case, object tracking can sometimes be recovered if the classifier redetects the target afterwards, depending basically on the real trajectory of the target and the duration of the gap.

In this paper, we have presented two static recognition methods that can be embedded in the first module of PIORT, giving rise to two different instances of the methodology. Both methods are based on the use of a classifier that is trained from examples and provides posterior class probabilities for each pixel from a set of local features. The first classifier is based on a maximum-likelihood Bayesian method in which the conditional probabilities for the object classes are obtained from the class histograms (for discretized RGB values), while a uniform conditional probability is assumed for the background. The second classifier is based on a neural net trained with the RGB colour averages extracted for each spot of the segmented images (an illustrative sketch of a histogram-based classifier of the first kind is also given near the end of this section). Even though the characteristics of these two classifiers are quite different, the recognition and tracking results of PIORT using both approaches were excellent and very similar in five of the ten test sequences, which might mean that the good ability of PIORT to track the objects is mostly due to the cooperation of the three inner modules and is not very dependent on the specific method used for object recognition. However, in the remaining five test sequences, the tracking method based on a neural net classifier clearly outperformed the one based on a simple Bayesian classifier, which failed in three of these test sequences. Indeed, we observed that updating the histograms at each frame may cause severe drift errors once the tracker begins to fail, which result in a rapid breakdown of the Bayesian classifier performance in subsequent frames. Hence, depending on the particular application, it might be preferable not to update the histograms after training.

The performance of both the Bayesian and the neural net classifiers also depends somewhat on the quality of the image segmentation process carried out previously. In the case of good segmentations, like the ones we obtained using EDISON for the test sequences, the probability images given by the classifiers are smooth (large areas with the same values) and this eases the tracking, whereas in the case of over-segmentations, the probability images may be noisy due to an excess of spots and this may hinder stable tracking.

In the experimental comparison with six other methods proposed in the literature for object tracking, a PIORT method obtained the best results in nine of the ten test sequences and only a slightly inferior performance with respect to the best method (VRFS) in the remaining one. Except for the first test sequence S1, where all methods worked fine, the six alternative methods tested mostly failed to track the target objects correctly in the test sequences, due to the difficult instances of occlusions and object crossings they contain. However, we are aware that the six alternative methods tested here are not model-based (i.e. they are not trained in advance), unlike PIORT, and thus it is hardly surprising that PIORT obtained the best results.
The availability (for us) of their implementations was the main reason why we selected them, but we plan to carry out future experimental comparisons of PIORT against state-of-the-art model-based tracking methods such as those by Cremers [5] and Lepetit [15], once implementations of these methods are available to us to run the experiments.

Although further experimental work is needed, the new tracking module included in PIORT has so far proven to be effective under occlusions lasting several frames produced by an object of a class different from that of the target object. If the occluding and the target objects are recognised as belonging to the same class, the occlusion is not detected as such and both objects are merged temporarily; despite this behaviour, the tracking method is able in most cases to recover and track the original target when the same-class occlusion or crossing ends. However, as observed in some of the test sequences, there are still cases where the behaviour of the tracking decision module of PIORT should be improved, particularly in the step of object re-emergence after occlusion and when other objects of similar appearance are next to the target. The upgrade of this tracking module will be the subject of future research.

We think that the PIORT approaches to object tracking are especially suitable in noisy environments where segmented images vary so much across successive frames that it is very hard to match the corresponding regions or contours of consecutive images. The empirical results presented are quite satisfactory, despite the numerous mistakes made by the static recognition module, which can mostly be ignored thanks to the integration with the proposed tracking decision module.

A fair criticism that can be raised against PIORT is that too many parameters need to be set. Apart from the parameters specific to the classifier in the first (static recognition) module, the dynamic recognition module uses two parameters, which are bounds on the linear adaptive weighting of previous and current probabilities, and the tracking decision module uses up to twelve parameters: six related to the uncertainty in the target position prediction, three for the tracking image post-processing filters (one for each filter), two for occlusion mode determination and one more for a weighted average computation of the target movement vector. It is very difficult to get rid of these parameters in our approach, but the default values reported in the previous sections have been tuned carefully to yield a stable and satisfactory behaviour of PIORT in all the test sequences. Of course, for new sequences these default values may not be optimal and some further tuning might improve the performance. A sensitivity analysis for each one of the PIORT parameters would be extremely hard to carry out and assess, since the system response may also depend strongly on the specific features of the input sequences. From our experience, we hypothesize that, in general, small variations of the given default values do not significantly affect the obtained tracking results, but larger ones could.

As future work, we want to extend the experimental validation of PIORT by applying it to new and more difficult image sequences, in particular sequences where multiple objects are tracked simultaneously in the scene. And, as commented before, new comparative studies against state-of-the-art model-based tracking methods (e.g. [5, 15]) would be very interesting to carry out whenever possible.
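As a complement to the description of the three modules given at the beginning of this section, the following is a minimal per-frame sketch of how such a pipeline could be organised. It is an illustration under simplifying assumptions, not PIORT's implementation: a fixed blending weight stands in for the bounded adaptive weighting of the dynamic recognition module, and a plain threshold stands in for the full tracking decision logic with motion prediction and occlusion handling; all names are hypothetical.

```python
import numpy as np

def piort_like_sequence(static_prob_frames, weight=0.6, threshold=0.5):
    """Hedged per-frame sketch of a PIORT-like recognition-and-tracking loop.

    static_prob_frames : iterable of (H, W) arrays with per-pixel object probabilities
                         produced by the static recognition module (the classifier).
    weight             : stand-in for the bounded adaptive weighting of previous vs.
                         current probabilities used by the dynamic recognition module.
    threshold          : stand-in for the tracking decision module, which in PIORT also
                         predicts translation/scale and handles occlusion/re-emergence.
    Returns the list of binary tracking images, one per frame.
    """
    dyn_probs = None
    tracks = []
    for static_probs in static_prob_frames:
        # Dynamic recognition: blend previous dynamic probabilities with the current
        # static ones; PIORT adapts this weight within bounds at each frame.
        dyn_probs = static_probs if dyn_probs is None else (
            weight * dyn_probs + (1.0 - weight) * static_probs)

        # Tracking decision: reduced here to a threshold on the dynamic probabilities.
        tracks.append(dyn_probs > threshold)
    return tracks
```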
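The text above describes the first static classifier as a maximum-likelihood Bayesian method built from class histograms of discretized RGB values, with a uniform conditional probability for the background. The following is a minimal sketch of a classifier of that kind; the number of bins per channel, the equal class priors and the interface are illustrative assumptions, and the neural-net variant (trained on per-spot RGB averages) is not shown.

```python
import numpy as np

class HistogramBayesClassifier:
    """Hedged sketch of a histogram-based Bayesian pixel classifier with a uniform
    background likelihood. Bin count, priors and interface are assumptions, not PIORT's code."""

    def __init__(self, bins_per_channel=16):
        self.bins = bins_per_channel
        self.hist = np.zeros((bins_per_channel,) * 3, dtype=np.float64)

    def _indices(self, rgb):
        # Map 8-bit RGB values to histogram bin indices.
        idx = (rgb.astype(np.int64) * self.bins) // 256
        return idx[..., 0], idx[..., 1], idx[..., 2]

    def fit(self, object_pixels):
        """object_pixels: (N, 3) array of RGB samples taken from the training frames."""
        r, g, b = self._indices(object_pixels)
        np.add.at(self.hist, (r, g, b), 1.0)
        self.hist /= self.hist.sum()          # class-conditional likelihood p(rgb | object)

    def predict_proba(self, image):
        """image: (H, W, 3) RGB array; returns an (H, W) posterior image p(object | rgb)."""
        r, g, b = self._indices(image)
        p_obj = self.hist[r, g, b]            # p(rgb | object) from the class histogram
        p_bg = 1.0 / self.bins ** 3           # uniform p(rgb | background)
        return p_obj / (p_obj + p_bg + 1e-12) # equal class priors assumed
```

In a PIORT-like setting, the posterior image returned by predict_proba would play the role of the static probability image consumed by the per-frame loop sketched above.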
For the two approaches currently used in the static recognition module, an obvious upgrade is to replace the RGB colour space by the HSI colour space, since the latter seems better suited for matching or tracking objects, especially in natural environments with changing illumination. In addition, we are interested in implementing and testing new classifiers in the static recognition module, which could exploit other features completely different from the basic colour features used up to now. For instance, an SVM classifier could be applied to a set of features formed by Gabor filter responses, provided that class probability values were estimated from margin values. Another possible extension would be to replace, in the third module, the simple rules used in the a priori predictions of target centres and areas (equivalent to noiseless Kalman filters) by the full Kalman filter formulation, considering noise for both the dynamics and the observations. However, this replacement would increase even further the number of system parameters, and it is not clear that it would result in significant changes in the overall system behaviour.

Acknowledgements

This research was partially supported by Consolider Ingenio 2010, project CSD2007-00018, by the CICYT project DPI 2007-61452 and by the Universitat Rovira i Virgili (URV) through a predoctoral research grant.

References

[1] H. Wang, J. Peng, L. Li, “Runway detection of an unmanned landing aerial vehicle based on vision”, IJPRAI 20(8), pp. 1225-1244, 2006.
[2] H-D. Yang, S-W. Lee, S-W. Lee, “Multiple Human Detection and Tracking Based on Weighted Temporal Texture Features”, IJPRAI 20(3), pp. 377-392, 2006.
[3] J-W. Hsieh, Y-S. Huang, “Multiple-person tracking system for content analysis”, IJPRAI 16(4), pp. 447-462, 2002.
[4] A. Villanueva, R. Cabeza, S. Porta, “Gaze Tracking System Model Based on Physical Parameters”, IJPRAI 21(5), pp. 855-878, 2007.
[5] S. Kang, B-W. Hwang, S-W. Lee, “Multiple People Tracking Based on Temporal Color Feature”, IJPRAI 17(6), pp. 931-949, 2003.
[6] J. Ning, L. Zhang, D. Zhang, C. Wu, “Robust Object Tracking Using Joint Color-Texture Histogram”, IJPRAI 23(7), pp. 1245-1263, 2009.
[7] A. Sanfeliu, F. Serratosa, R. Alquézar, “Second-order random graphs for modeling sets of attributed graphs and their application to object learning and recognition”, Int. Journal of Pattern Recognition and Artificial Intelligence 18 (2004) 375-396.
[8] A. Chatterjee, O. Ray, A. Chatterjee, A. Rakshit, “Development of a real-life EKF based SLAM system for mobile robots employing vision sensing”, Expert Systems with Applications 38 (2011) 8266-8274.
[9] S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, 2005.
[10] K. Burçin, N.V. Vasif, “Down syndrome recognition using local binary patterns and statistical evaluation of the system”, Expert Systems with Applications 38 (2011) 8690-8695.
[11] C.Y. Tsai, Y.H. Lee, “The parameters effect on performance in ANN for hand gesture recognition system”, Expert Systems with Applications 38 (2011) 7980-7983.
[12] G.L. Foresti, “Object recognition and tracking for remote video surveillance”, IEEE Trans. on Circuits and Systems for Video Technology 9 (1999) 1045-1062.
[13] F. Moreno-Noguer, A. Sanfeliu, D. Samaras, “Dependent Multiple Cue Integration for Robust Tracking”, IEEE Trans. Pattern Anal. Mach. Intell. 30(4), pp. 670-685, 2008.
[14] H. Lee, J. Kim, H. Ko, “Prediction Based Occluded Multitarget Tracking Using Spatio-Temporal Attention”, IJPRAI 20(6), pp. 925-938, 2006.
[15] C-J. Chang, J-W. Hsieh, Y-S. Chen, W-F. Hu, “Tracking Multiple Moving Objects using a Level-Set Method”, IJPRAI 18(2), 2004.
[16] A. Senior, A. Hampapur, Y-L. Tian, L. Brown, S. Pankanti, R. Bolle, “Appearance models for occlusion handling”, Image and Vision Computing 24 (2006) 1233-1243.
[17] H.T. Nguyen, A.W.M. Smeulders, “Fast occluded object tracking by a robust appearance filter”, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1099-1104.
[18] S.K. Zhou, R. Chellappa, B. Moghaddam, “Visual tracking and recognition using appearance-adaptive models in particle filters”, IEEE Trans. Image Process. 13 (2004) 1491-1506.
[19] A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, “Robust online appearance models for visual tracking”, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1296-1311.
[20] K. Ito, S. Sakane, “Robust view-based visual tracking with detection of occlusions”, in: Proc. Int. Conf. Robotics Automation, 2001, vol. 2, pp. 1207-1213.
[21] K. Hariharakrishnan, D. Schonfeld, “Fast object tracking using adaptive block matching”, IEEE Trans. Multimedia 7 (2005) 853-859.
[22] L. Zhu, J. Zhou, J. Song, “Tracking multiple objects through occlusion with online sampling and position”, Pattern Recognition 41 (2008) 2447-2460.
[23] J. Pan, B. Hu, “Robust occlusion handling in object tracking”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, Minnesota, June 2007.
[24] Z. Tu, X. Chen, A.L. Yuille, S.C. Zhu, “Image parsing: unifying segmentation, detection, and recognition”, in: Proc. Ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 18-25.
[25] K-C. Lee, J. Ho, M-H. Yang, D. Kriegman, “Visual tracking and recognition using probabilistic appearance manifolds”, Computer Vision and Image Understanding 99 (2005) 303-331.
[26] S.C. Zhu, A. Yuille, “Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation”, IEEE Trans. on Pattern Analysis and Machine Intelligence 18 (1996) 884-900.
[27] J. Malik, S. Belongie, T. Leung, J. Shi, “Contour and texture analysis for image segmentation”, Int. J. Computer Vision 43 (2001) 7-27.
[28] Z. Tu, S.C. Zhu, “Image segmentation by data driven Markov chain Monte Carlo”, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (2002) 657-673.
[29] F-S. Chen, C-M. Fu, C-L. Huang, “Hand gesture recognition using a real-time tracking method and hidden Markov models”, Image and Vision Computing 21 (2003) 745-758.
[30] L. Maddalena, A. Petrosino, A. Ferone, “Object Motion Detection and Tracking by an Artificial Intelligence Approach”, IJPRAI 22(5), 2008.
[31] N. Amézquita Gómez, R. Alquézar, F. Serratosa, “Object recognition and tracking in video sequences: a new integrated methodology”, in: Proc. 11th Iberoamerican Congress on Pattern Recognition, CIARP 2006, LNCS 4225, pp. 481-490.
[32] N. Amézquita Gómez, R. Alquézar, F. Serratosa, “A new method for object tracking based on regions instead of contours”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, Minnesota, June 2007.
[33] N. Amézquita Gómez, R. Alquézar, F. Serratosa, “Dealing with occlusion in a probabilistic object tracking method”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, Alaska, June 2008.
[34] R. Alquézar, N. Amézquita Gómez, F. Serratosa, “Tracking deformable objects and dealing with same class object occlusion”, in: Proc. Fourth Int. Conf. on Computer Vision Theory and Applications (VISAPP 2009), Lisboa, Portugal.
[35] D. Comaniciu, V. Ramesh, P. Meer, “Kernel-based object tracking”, IEEE Trans. on Pattern Analysis and Machine Intelligence 25 (2003) 564-577.
[36] D. Comaniciu, P. Meer, “Mean shift: a robust approach toward feature space analysis”, IEEE Trans. on Pattern Analysis and Machine Intelligence 24 (2002) 603-619.
[37] R. Collins, Y. Liu, “On-line selection of discriminative tracking features”, IEEE Trans. on Pattern Analysis and Machine Intelligence 27 (2005) 1631-1643.
[38] A. Bugeau, P. Pérez, “Track and cut: simultaneous tracking and segmentation of multiple objects with graph cuts”, in: Proc. Third Int. Conf. on Computer Vision Theory and Applications (VISAPP 2008), Funchal, Madeira, Portugal.
[39] Y. Boykov, O. Veksler, R. Zabih, “Fast approximate energy minimization via graph cuts”, IEEE Trans. on Pattern Analysis and Machine Intelligence 23 (2001) 1222-1239.
[40] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[41] F. Yin, D. Makris, S.A. Velastin, “Performance evaluation of object tracking algorithms”, in: Proc. 10th IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2007).
[42] S.M. Khan, M. Shah, “Tracking Multiple Occluding People by Localizing on Multiple Scene Planes”, IEEE Trans. on Pattern Analysis and Machine Intelligence 31 (2009) 505-519.
[43] D. Cremers, “Dynamical Statistical Shape Priors for Level Set Based Tracking”, IEEE Trans. on Pattern Analysis and Machine Intelligence 28 (2006) 1262-1273.
[44] V. Lepetit, P. Fua, “Keypoint Recognition using Randomized Trees”, IEEE Trans. on Pattern Analysis and Machine Intelligence 28 (2006) 1465-1479.
[45] M. de la Gorce, N. Paragios, D. Fleet, “Model-Based Hand Tracking with Texture, Shading and Self-occlusions”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2008).
[46] Y. Huang, I.A. Essa, “Tracking Multiple Objects through Occlusions”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2005), pp. 1051-1058.
[47] T. Yang, S. Li, Q. Pan, J. Li, “Real-time multiple objects tracking with occlusion handling in dynamic scenes”, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2005), pp. 970-975.