key: cord-0433025-a5lsyqkz authors: Fuhl, Wolfgang title: 1000 Pupil Segmentations in a Second using Haar Like Features and Statistical Learning date: 2021-02-03 journal: nan DOI: nan sha: 32a335c716a03da3a4f2143456b6b9b6dbbf1f09 doc_id: 433025 cord_uid: a5lsyqkz

In this paper we present a new approach for pupil segmentation. It can be computed and trained very efficiently, making it ideal for online use with high speed eye trackers as well as for energy-saving pupil detection in mobile eye tracking. The approach is inspired by the BORE and CBF algorithms and generalizes their binary comparisons to Haar features. Since these features are intrinsically very susceptible to noise and fluctuating light conditions, we combine them with conditional pupil shape probabilities. In addition, we rank each feature according to its importance in determining the pupil shape. Another advantage of our method is the use of statistical learning, which is very efficient and can even be used online. https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/?p=%2FStatsPupil&mode=list

We introduce tiny Haar features which follow elliptical shapes for pupil segmentation, together with a detection area, a shape-conditioned probability distribution, and statistical feature weighting.

The plethora of image based eye tracking applications [6, 18] has continued to grow in recent years. The most important areas of application are currently driver monitoring [2], virtual reality [62], augmented reality [66], medicine [1, 7, 8], market research [62, 68], remote support [69], human computer interaction [10, 58], supportive explanation models for computer vision models [78], and many more. Of course, not only pupil detection is important for this, but also scan path analysis [11, 15], eye movement classification [13, 14, 23, 28, 35, 36, 47], visualizations [30, 32, 34, 54], and validation of the approaches and models used [24, 37]. These diverse application areas bring different image based challenges [9, 31, 33, 46, 48, 49] and challenging resource restrictions [17, 19, 21]. Some of these image based challenges are changing illumination conditions, reflections on glasses, make-up, recording errors, and highly off-axial pupil positions. In addition, the diversity of people using eye tracking devices raises new challenges such as deformed pupils [33, 43-46], which occur after eye surgery [5]. Other challenges in eye tracking are different recording techniques such as RGB and NIR imaging. While NIR is mostly used in head mounted eye trackers [49], RGB imaging is still used in remote eye tracking [22] and especially web cam based eye tracking [67].
Due to the current situation with the Covid pandemic, web cam based eye tracking is becoming more and more important for market research [62, 67] and scientific studies [67]. Nowadays, the gaze signal alone is also no longer sufficient, as the eye provides a variety of other sources of information. These are the pupil response to cognitive load [4], the pupil shape for eyeball regression [18], and the eyelids to determine a person's fatigue [40-42]. The cognitive load is very interesting for the detection of mental disorders [65] or for ranking a person's performance capability [59]. Eyeball regression is used to improve the robustness of eye trackers against drifts of the device and to improve the accuracy [70]. The fatigue detection of a person is important for critical applications like driving [76], flying [64], flight surveillance [61], and many more. Due to the progress in eye tracker technology so far, mobile applications [56], long-term studies [73], high speed eye tracking for fundamental research [51], and the consumer market such as computer games [53], as well as the privacy aspects of eye tracking [12, 39], are becoming more and more important. For this it is necessary that the algorithms can be used in as resource-saving and robust a manner as possible [17, 19, 21, 25, 29]: to consume as little energy as possible in mobile applications [56], to guarantee real-time capability in high speed eye tracking [51], and not to waste computing capacity which is needed for computer games [63].

In this paper we present a resource-sparing approach, which is inspired by CBF [21] and BORE [17]. Our approach was developed with cheap execution and easy training as its main features. The features used are Haar features [74], which can be computed very efficiently. In addition, we speed up the computation of the Haar features by downscaling the images instead of computing integral images. Another important feature of our algorithm is the use of conditional pupil ellipse distributions, which allow us to consider only those ellipses that can actually occur. As a training method we use statistical learning, which has a complexity of O(n) and can be computed very efficiently. In this way, our detector can even be personalized and used optimally for individuals with minimal resources.

Contributions of this work to the state of the art:
(1) We generalize the features of BORE [17] and CBF [21] as Haar features. In CBF [21] and BORE [17] only direct pixel comparisons were used; we use the difference of areas (a minimal sketch of such a feature follows this list).
(2) Our approach is the first to use ellipse parameter conditional probability distributions for ellipse selection. This avoids unnecessary checks of ellipse points, as is the case in CBF [21] and BORE [17].
(3) Compared to BORE [17] and CBF [21], we use index tables whereby each feature has to be evaluated only once. This further reduces resource consumption and, in combination with the precomputed indexes already presented in CBF [21] and BORE [17], further speeds up the process.
(4) Our approach is simply trained on occurrence statistics. This procedure is much more resource efficient than the unsupervised learning and evaluation of all possible combinations as done in BORE [17]. It is also much faster than the random combination evaluation used in CBF [21].
(5) Using feature weighting, our approach also has the ability to find ellipses that are not fully present in the image.
(6) Compared to BORE [17] and CBF [21], our approach segments the pupil explicitly. For BORE [17], there is only an experimental implementation for ellipse extraction, and in the case of CBF [21], only the pupil center is detected.
(7) Compared to the Tiny CNNs [19, 24], our approach has significantly reduced runtime and hence resource consumption. In addition, we need only a fraction of the time to train our model, and no GPU is required to execute a teacher network.
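To make contribution (1) concrete, the following is a minimal, self-contained sketch (not the released implementation) of a Haar-like difference feature evaluated on a downscaled image instead of an integral image. The window size and pixel positions are illustrative assumptions.

```python
import numpy as np

def downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Block-average the image; one pixel then stands for a factor x factor area."""
    h, w = image.shape[0] // factor * factor, image.shape[1] // factor * factor
    return image[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def haar_difference(small: np.ndarray, inner: tuple, outer: tuple) -> float:
    """Haar-like feature as the difference of two pixels in the downscaled image.

    Each downscaled pixel already holds an area mean, so the classic
    'sum over rectangle' of an integral image collapses to a pixel lookup,
    e.g. dark pupil interior versus brighter surround.
    """
    return float(small[outer] - small[inner])

# Toy usage: a dark disk (pupil) on a bright background.
img = np.full((96, 96), 200.0)
yy, xx = np.mgrid[:96, :96]
img[(yy - 48) ** 2 + (xx - 48) ** 2 < 15 ** 2] = 30.0
small = downsample(img, 8)                                 # 12 x 12 area means
print(haar_difference(small, inner=(6, 6), outer=(6, 9)))  # large positive response
```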
For BORE [17] , there is only one experimental implementation for ellipse extraction and in the case of CBF [21] , only the pupil center. (7) Compared to the Tiny CNNs [19, 24] , our approach has significantly reduced runtime and hence resource consumption. In addition, we only need a fraction of time to train our model as well as no GPU to execute the teacher network. Since pupil tense has evolved in different directions, we divide the related work into three areas. These are classical computer vision approaches, deep neural networks, and resource saving machine learning approaches. In the field of pupil detection and exact pupil center determination, the first major breakthrough came with the use of cate images [71] . Previously, adaptive thresholds were used [50] . A major disadvantage of edge images are their susceptibility to noise and motion blur. Therefore, edge filtering methods were introduced [33, 46, 48] , which suppress noise and pass only relevant edge segments. In addition to this, angular integral projection function [33] and also blob detection [46] were used. Another improvement in pupil shape reconstruction was the evaluation of individual segments [52] . Alternative to edge detection, the radial symmetry transform was used to detect the pupil center [60] . With the advent of convolutions in neural network and the success in the field of image processing, these CNNs were also used for pupil detection and segmentation as well as they are continuously refined [26, 27] . The first window-based approach was PupilNet [44, 45] , running in real time on a single CPU core only. Later, large residual networks were also used [16, 38] and puplished together with huge annotated data sets as well as generative adversarial networks [20] were used. The first U-Net with interconnections was poposed with DeepVOG [77] . New loss formulations regarding the pupil shape where proposed in [3] . Additional to those loss formulations an L1 loss connected to the central part of a fully connected convolutional network was proposed in [55] . The first real-time machine learning methods combined with simple features were introduced by PupilNet [44, 45] and continued BORE [17] and CBF [21] . BORE [17] is capable of non-supervised learning and self-optimization. CBF [21] , on the other hand, uses random ferns and pixel comparisons to determine the center of the pupil. Also the supervised decent method (SDM) was used for the regression of the pupil center in remote images [57] . However, this has the disadvantage of being highly dependent on the mean shape, which we will show in our later evaluation in combination with landmarks for segmentation on head mounted images. Another approach was created using the teacher, student training method [19, 24] . These tiny cnns [19, 24] are very robust and were successfully used across datasets. The paper reported a runtime of 16ms but the provided nets only have a runtime of 4-8ms on a CPU core. In addition, they learn to evaluate the accuracy and give therefore a validity of the pupil ellipse [24] . The approach presented by us is also to be classified in this category. This is due on the one hand to the fact that our approach uses statistical learning and on the other hand to the resource-saving use of our method. In addition to these properties, our approach can also be trained very fast and resource-saving. Our approach uses statistical learning and by this it is enough to look at each training sample three times to create a detector. 
The first step in training our detector is to create the search area (the set of x, y tuples) and the ellipses e expected there. This gives us the set EL, which stores all ellipses for each position x, y. Reformulated as a conditional probability distribution, EL corresponds to the probability of an ellipse e under the condition of being at position x, y:

EL = P(e | x, y). (1)

To calculate EL, we need one pass over the training data. In the second step, we reduce EL to speed up our detector and to reduce overfitting. For this, we represent each ellipse by eight landmarks (see Figure 1) and round them to integers. For the reduction, all ellipses with the same landmark distances are combined into one ellipse, with a maximum deviation of one pixel per landmark. This gives us the reduced conditional probability distribution EL.

The next step is to create our feature extractors from the landmarks. For this we use Haar features. Instead of computing area differences on integral images, we use the difference of pixels in downscaled images. In the second pass over the training set, we store, for each ellipse in EL, all occurrences of the eight differences d. Since the set of eight differences per ellipse is very large, we want to reduce it. To do so, we compute the best five difference sets for each positive probability in EL, noting that each difference set consists of eight differences, one per landmark. For this we use mean shift clustering with a maximum of five clusters. Reformulated as a conditional probability distribution, we thus have positive probabilities for five difference sets d under the condition of an ellipse occurring at some position:

D = P(d | e, x, y). (2)

In the last pass, the individual differences or landmarks are weighted with respect to their robustness. These feature weights are computed in the third pass over the training set. For this we use the difference set with the minimum distance of the landmark differences and weight a feature positively if its sign matches and negatively if its sign differs. Based on this we statistically weight the reliability of each feature. After the pass, the weights of each feature are normalized to sum to one and thus also form a probability distribution. This gives us a conditional probability distribution W similar to the one for the difference sets:

W = P(w | e, x, y). (3)

To use the detector, the landmark differences for the ellipses must be calculated at all possible positions. Then the minimum difference to the stored difference sets is calculated, and the deviation is weighted by the feature weights W. The final ellipse and position is then the global minimum, as described in Equation 4:

argmin_{e,x,y} min_{d in D(e,x,y)} sum_{i=1..8} W_i(e, x, y) * |h̃_i(e, x, y) - d_i|, (4)

where h are the Haar features, d is a difference set, h̃ is the set of differences in the input image for ellipse e at position x, y, and W is the corresponding feature weight. Overall, Equation 4 searches for the minimum difference of the eight landmark positions over the entire input image. If there are multiple equally good positions, we use the conditional probability distribution EL to select the most probable ellipse and position. While this already provides a very efficient detector, further optimizations are necessary, like the precalculation of all indexes in the image, as already presented with BORE [17] and CBF [21]. Also, all differences are indexed so that the difference at each position is calculated only once.
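The following self-contained sketch evaluates Equation 4 over a batch of candidates and returns the global minimum. The data layout and shapes are illustrative assumptions; this is not the released implementation.

```python
import numpy as np

def detect(observed: np.ndarray, diff_sets: np.ndarray, weights: np.ndarray) -> int:
    """Equation 4 over a batch of (ellipse, x, y) candidates.

    observed:  (n, 8)    landmark differences measured in the input image
    diff_sets: (n, 5, 8) up to five stored difference sets per candidate
    weights:   (n, 8)    per-landmark feature weights, each row sums to one
    """
    dev = np.abs(diff_sets - observed[:, None, :])       # (n, 5, 8) deviations
    per_set = (dev * weights[:, None, :]).sum(axis=-1)   # (n, 5) weighted distances
    score = per_set.min(axis=1)                          # closest difference set
    return int(score.argmin())                           # global minimum

# Toy usage: candidate 3 is constructed to match one of its stored sets exactly.
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 8))
sets = rng.normal(size=(10, 5, 8))
sets[3, 0] = obs[3]
w = np.full((10, 8), 1.0 / 8.0)
print(detect(obs, sets, w))                              # -> 3
```

In combination with the index tables mentioned above, each entry of `observed` would be computed once and reused by all candidates that share the same landmark pair.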
The data used for the evaluation are the segmented pupils of [38]. The data set consists of two files, p1_image.mp4 and p2_image.mp4, with an image resolution of 192 × 144. The first file contains data taken in a driving simulator and the second file contains images from real world driving. Because of this, there are no reflections or strong light fluctuations in the data of the first file, so it contains much simpler images. Therefore, we decided to use the first file with more than 500,000 frames as the training data set. The second file, with more than 350,000 frames and much more challenging images, is used as the evaluation data set.

For the data augmentation, we used up to 20% random noise, as well as reflections with an intensity of up to 20%, where the reflections are taken from randomly selected images. Also, we randomly changed the contrast of the image in the range of -40 to 40. In addition, we shifted the image randomly in a range of -10 to 10 pixels and used zooming with a random factor in the range of 0.8 to 1.2. For the TinyCNNs [19], this was done online during training. For all other approaches, the data augmentation was computed in advance, resulting in five images from each frame. Of course, an image could also occur without augmentation.

Since we trained our approach once with the real data and once with the simulator data, we also give the details of the simulator training here. First, we used the simulator of [18] and inverted the images so that the pupil is dark. Then we selected the data based on the pupil ellipse, keeping those which matched the pupil ellipses in the normal training set. For the data augmentation, we used the same approaches as for training on the real data, except for adjusting the contrast of the background and the pupil of the simulated images. Here we first used the differences from the training set to adjust the contrast.
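A minimal sketch of one such augmentation draw follows. The exact noise model, reflection compositing, and contrast definition are not specified precisely in the text, so the readings below (noise amplitude as a fraction of the intensity range, contrast as a percent rescale around the mean) are assumptions for illustration:

```python
import numpy as np

def augment(image: np.ndarray, reflection: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One random draw of the augmentations listed above (illustrative reading)."""
    img = image.astype(np.float32)
    h, w = img.shape

    # Up to 20% random noise (read as a noise amplitude up to 20% of the range).
    img += rng.uniform(-1, 1, size=img.shape) * 255.0 * rng.uniform(0.0, 0.2)

    # Reflection overlay with an intensity of up to 20%, from another image.
    alpha = rng.uniform(0.0, 0.2)
    img = (1.0 - alpha) * img + alpha * reflection.astype(np.float32)

    # Contrast change in the range -40 to 40 (read as percent rescale around the mean).
    c = rng.uniform(-40.0, 40.0)
    img = img.mean() + (img - img.mean()) * (1.0 + c / 100.0)

    # Random shift in the range -10 to 10 pixels.
    dy, dx = rng.integers(-10, 11, size=2)
    img = np.roll(np.roll(img, dy, axis=0), dx, axis=1)

    # Zoom with a random factor in 0.8 to 1.2 (nearest neighbour, edge-clipped).
    f = rng.uniform(0.8, 1.2)
    ys = np.clip((np.arange(h) / f).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / f).astype(int), 0, w - 1)
    return np.clip(img[np.ix_(ys, xs)], 0.0, 255.0)

rng = np.random.default_rng(1)
out = augment(np.full((144, 192), 120.0), np.zeros((144, 192)), rng)
print(out.shape)  # (144, 192)
```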
The hardware used for training and running the final models consists of an Intel i5-4570 CPU running at 3.2 GHz. The system has 16 GB of DDR4 memory and an NVIDIA 1050 TI with 4 GB of GDDR5 memory. The GPU was only used for training the TinyCNNs [19]. All runtime analyses were performed on one CPU core.

For a comparison with the state of the art, we use ElSe as a representative of edge-based approaches, BORE as a resource saving alternative, the TinyCNNs pre-trained on LPW [72] and provided by the authors, two TinyCNNs newly trained on the presented training data, and SDM [75] for landmark detection, also as a resource sparing alternative. Figure 2 shows the cumulative euclidean pupil center pixel error, which is important for the gaze estimation accuracy. As can be seen in both plots in Figure 2, SDM and BORE perform worst. BORE cannot handle the reflections very well, as can be seen especially in Figure 3, where a high mean pupil center error is present nearly everywhere in the image space. For SDM this is different, since the method performs well in the area near the mean shape (Figure 3). The best performance at a cumulative pupil center error of zero is achieved by ElSe (Figure 2 right). It also reaches the highest values for the cumulative intersection over union at a value of 0.9 (Figure 2 left). Apart from this, the TinyCNNs and the proposed approach are more robust, reaching nearly 90% at a pixel error of two (Figure 2 right). Beyond a pixel error of two, our approach is outperformed by the TinyCNNs, but it needs only a fraction of their computation time (see Table 1). For the segmentation quality, our approach keeps up with the pre-trained TinyCNNs, whereas the newly trained ones perform significantly better (Figure 2 left). This is due to the reduction of possible ellipses, as described in the method section. In addition, it can be seen in Figure 2 that our approach trained on the simulated data performs only slightly worse compared to the one trained on the real data.

If Figure 3 and Figure 4 are compared for each approach, it can be seen that ElSe has a lot of invalid detections over the entire image space (Figure 3). This stems from heavy reflections, which render edge detection inapplicable. In Figure 4, on the other hand, ElSe has a good average segmentation quality over the image space, with the exception of the upper right area where occlusions by the eyelid occur. Another important observation is the clear center bias, which can be seen for SDM by comparing Figure 3 and Figure 4. Looking at our approach and the TinyCNNs, we notice that they have good coverage of the entire image space (Figure 3 and Figure 4). In terms of segmentation, however, our approach is significantly worse in the outer areas (Figure 4). Table 1 shows the training time in hours as well as the execution time in milliseconds on a single CPU core. As can be seen, our approach outperforms the other approaches in terms of execution time. For the training time, ElSe is the fastest approach, since it does not have to be trained. Combining the detection, segmentation, training time, and execution time results, we think the proposed approach is a valuable contribution to the eye tracking community.

While the presented approach with Haar features in combination with statistical learning has a very low training time and a very low runtime and can also be trained on simulated data, it of course also has disadvantages. The first disadvantage is the search area: no pupils or ellipses can be found outside this area. This limitation can easily be circumvented by arbitrarily extending the search area, but this has a negative impact on both the detection rate and the runtime. Another disadvantage of the presented approach is the statistic itself, which in the case of feature weighting weights frequent occurrences of valid features more heavily. This means that large data sets of similar images lead to features that are valid in these images being weighted more heavily than others, which results in overfitting to these images. Also, the presented approach only recognizes shapes which are present in the training data, because unknown shapes are not sampled and have no probability of occurrence. This can easily be remedied with simulated data or data manipulation, but it also leads to an increased runtime.

How we think the algorithm should be applied: Since the presented algorithm is very fast to execute and statistical learning is very efficient for training, our idea for its application is direct training after calibration. Here, an expensive deep neural network could be used in a first step to segment the pupils offline. Then statistical learning is used to weight the Haar features and the ellipses. Through this, it would be possible to create a personalized detector, as is the case with BORE [17], and deploy it online in a resource-efficient manner (sketched below). A disadvantage of this approach is, of course, that a one-point calibration could not be used in this case; coverage of the whole search area would have to be guaranteed.
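A sketch of this proposed calibrate-then-train flow follows. The teacher network and the statistical trainer are hypothetical stubs standing in for the components described above, not released code:

```python
import numpy as np

def teacher_segment(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an expensive offline DNN pupil segmentation."""
    return (frame < frame.mean()).astype(np.uint8)   # toy threshold 'mask'

def fit_statistics(frames, masks) -> dict:
    """Hypothetical stand-in for the statistical learning of Equations 1-3."""
    return {"trained_on": len(frames)}               # toy 'detector'

def personalize(calibration_frames):
    # 1) The heavy teacher labels the calibration frames once, offline.
    masks = [teacher_segment(f) for f in calibration_frames]
    # 2) The cheap statistical detector is fitted on these labels and can
    #    then run online for this specific person and device.
    return fit_statistics(calibration_frames, masks)

frames = [np.random.rand(144, 192) * 255 for _ in range(10)]
print(personalize(frames))                           # {'trained_on': 10}
```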
In this work, we have presented a new approach to efficiently train pupil detectors and segment pupils. While it is not able to segment pupils as accurately as, for example, edge-based approaches, it is comparatively robust and very efficient to compute. To overcome the disadvantage in segmentation quality, a finer segmentation can of course be performed in a second step, but this again incurs additional computational overhead. Overall, we believe that our approach is a valuable contribution to the online adaptation of pupil detectors and their use in high speed eye tracking.

References
[1] Feature-based attentional influences on the accommodation response
[2] Affordable visual driver monitoring system for fatigue and monotony
[3] RITnet: real-time semantic segmentation of the eye for gaze tracking
[4] Using task-induced pupil diameter and blink rate to infer cognitive load
[5] Effect of cataract surgery and pupil dilation on iris pattern recognition for personal authentication
[6] A breadth-first survey of eye-tracking applications
[7] Towards intelligent surgical microscopes: Surgeons gaze and instrument tracking
[8] Towards automatic skill evaluation in microsurgery
[9] Image-based extraction of eye features for robust eye tracking
[10] From perception to action using observed actions to learn gestures
[11] Encodji: encoding gaze data into emoji space for an amusing scanpath classification approach
[12] Reinforcement learning for the privacy preservation and manipulation of eye tracking data
[13] Histogram of oriented velocities for eye movement detection
[14] Rule based learning for eye movement type detection
[15] Ferns for area of interest free scanpath classification
[16] MAM: Transfer learning for fully automatic video annotation and specialized detector creation
[17] BORE: Boosted-oriented edge optimization for robust, real time remote pupil center detection
[18] Neural networks for optical vector and eye ball parameter estimation
[19] Tiny convolution, decision tree, and binary neuronal networks for robust and real time pupil outline estimation
[20] The applicability of Cycle GANs for pupil and eyelid segmentation, data generation and image refinement
[21] CBF: Circular binary features for robust and real-time pupil center detection
[22] Evaluation of State-of-the-Art Pupil Detection Algorithms on Remote Eye Images
[23] Eye movement velocity and gaze data generator for evaluation, robustness testing and assess of eye tracking software and visualization tools
[24] Learning to validate the quality of detected landmarks
[25] Multi Layer Neural Networks as Replacement for Pooling Operations
[26] Rotated Ring, Radial and Depth Wise Separable Radial Convolutions
[27] Weight and Gradient Centralization in Deep Neural Networks
[28] A Multimodal Eye Movement Dataset and a Multimodal Eye Movement Segmentation Analysis
[29] Training Decision Trees as Replacement for Convolution Layers
[30] Region of interest generation algorithms for eye tracking data
[31] Ways of improving the precision of eye tracking data: Controlling the influence of dirt and dust on pupil detection
[32] Arbitrarily shaped areas of interest based on gaze density gradient
[33] ExCuSe: Robust Pupil Detection in Real-World Scenarios
[34] Automatic Generation of Saliency-based Areas of Interest for the Visualization and Analysis of Eye-tracking Data
[35] Fully Convolutional Neural Networks for Raw Eye Tracking Data Segmentation, Generation, and Reconstruction
[36] Fully Convolutional Neural Networks for Raw Eye Tracking Data Segmentation, Generation, and Reconstruction
[37] Explainable Online Validation of Machine Learning Models for Practical Applications
[38] 500,000 images closer to eyelid and pupil segmentation
[39] The Gaze and Mouse Signal as additional Source for User Fingerprints in Browser Applications
[40] EyeLad: Remote Eye Tracking Image Labeling Tool
[41] Eyes Wide Open? Eyelid Location and Eye Aperture Estimation for Pervasive Eye Tracking in Real-World Scenarios
[42] Fast and Robust Eyelid Outline and Aperture Detection in Real-World Scenarios
[43] Fast camera focus estimation for gaze-based focus control
[44] Pupilnet: Convolutional neural networks for robust pupil detection
[45] Pupilnet v2.0: Convolutional neural networks for cpu based real time robust pupil detection
[46] ElSe: Ellipse Selection for Robust Pupil Detection in Real-World Environments
[47] Eye movement simulation and detector creation to reduce laborious parameter adjustments
[48] Non-Intrusive Practitioner Pupil Detection for Unmodified Microscope Oculars
[49] Pupil detection for head-mounted eye tracking in the wild: An evaluation of the state of the art
[50] Detecting and tracking eyes by using their physiological properties, dynamics, and appearance
[51] RemoteEye: An open-source high-speed remote eye tracker
[52] SET: a pupil detection method using sinusoidal approximation
[53] If looks could kill-an evaluation of eye tracking in computer games
[54] Biomedical Engineering Systems and Technologies
[55] EllSeg: An Ellipse Segmentation Framework for Robust Gaze Tracking
[56] Eye tracking for everyone
[57] Supervised descent method (SDM) applied to accurate pupil detection in off-the-shelf eye tracking systems
[58] Eye tracking and eye-based human-computer interaction
[59] Increasing human performance by sharing cognitive load using brain-to-brain interface
[60] Fast and robust ellipse detection algorithm for head-mounted eye tracking systems
[61] Evaluation of eye metrics as a detector of fatigue
[62] Combining virtual reality and mobile eye tracking to provide a naturalistic experimental environment for shopper research
[63] Kernel foveated rendering
[64] Analyzing pilots' fatigue for prolonged flight missions: Multimodal analysis approach using vigilance test and eye tracking
[65] The effects of cognitive load on attention control in subclinical anxiety and generalised anxiety disorder
[66] Automatic analysis of eye-tracking data for augmented reality applications: A prospective outlook
[67] Searchgazer: Webcam eye tracking for remote studies of web search
[68] Exploring natural eyegaze-based interaction for immersive virtual reality
[69] Eye tracking support for visual analytics systems: foundations, current applications, and research challenges
[70] Self-calibrating head-mounted eye trackers using egocentric visual saliency
[71] Robust real-time pupil tracking in highly off-axis images
[72] Labelled pupils in the wild: a dataset for studying pupil detection in unconstrained environments
[73] Wearable eye tracking for mental health monitoring
[74] Rapid object detection using a boosted cascade of simple features
[75] Supervised descent method and its applications to face alignment
[76] Real-time eye tracking for the assessment of driver fatigue
[77] DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning
[78] Studying relationships between human gaze, description, and computer vision