title: An Assessment of the Eye Tracking Signal Quality Captured in the HoloLens 2
authors: Aziz, Samantha D.; Komogortsev, Oleg V.
date: 2021-11-14
doi: 10.1145/3517031.3529626

Abstract

We present an analysis of the eye tracking signal quality of the HoloLens 2's integrated eye tracker. Signal quality was measured from eye movement data captured during a random saccades task from a new eye movement dataset collected on 30 healthy adults. We characterize the eye tracking signal quality of the device in terms of spatial accuracy, spatial precision, temporal precision, linearity, and crosstalk. Most notably, our evaluation of spatial accuracy reveals that the eye movement data in our dataset appears to be uncalibrated. Recalibrating the data post hoc using a dedicated recalibration task from our dataset produces notably better eye tracking signal quality.

1 Introduction

The growing integration of dedicated eye trackers into augmented reality (AR) devices promotes the application of eye tracking to improve interaction between the user and the device. Through foveated rendering, for example, eye tracking can prolong the battery life of untethered devices without reducing the perceived quality of the simulated environment. However, each potential application of eye tracking requires a different level of signal quality to be effective. Eye movement biometrics [1], for example, requires higher quality gaze data than simple gaze-based interaction with the environment [2]. It is therefore imperative to determine whether an eye tracker captures gaze data at a suitable quality for a particular use case.

Data quality metrics such as spatial accuracy and precision are commonly reported by eye tracker manufacturers. However, discrepancies between these values and those achieved in practice have been well documented in the literature [2, 4, 5]. Eye tracking signal quality also depends heavily on the conditions in which the data were collected [6], and it can be difficult to routinely achieve optimal results regardless of experimental conditions [5]. Taken together, these findings underscore the importance of studying eye tracking signal quality under a variety of experimental conditions.

The recently released HoloLens 2 includes a built-in eye tracker and represents an accessible option for integrating eye tracking into research. To benchmark its performance, we evaluate the quality of the eye tracking data captured by the HoloLens 2. We seek to establish a benchmark for the HoloLens 2's eye tracking signal quality in an experimental setup similar to those commonly found in the literature (e.g., [3, 8]), while additionally comparing our results to a previous study of the HoloLens 2's eye tracking signal quality presented by [7]. We also investigate phenomena, namely linearity and crosstalk, that exhibited unique properties in an eye movement dataset previously collected in virtual reality [3]. In presenting this analysis, we seek to clarify whether the observations in that virtual reality-based investigation persist across head-mounted virtual/augmented reality eye tracking devices. If they do, these findings would have significant implications for future researchers interested in using eye tracking in virtual and augmented reality.
We introduce a new, publicly available eye movement dataset collected with the HoloLens 2's integrated eye tracker (n=30) and describe its eye tracking signal quality using spatial accuracy, spatial precision, temporal precision, linearity, and crosstalk. From this analysis, we identify discrepancies between the manufacturer-specified signal quality and our own results. We also observe how these results respond to post hoc recalibration. The gaze data described in this manuscript is publicly available and can be downloaded at https://doi.org/10.18738/T8/9T99DU

2 Methodology

2.1 Subjects

Thirty-three subjects (15 female, 18 male, median age: 21, age range: 19-36) took part in this study. Ten subjects who normally wore glasses removed them for this study, and four subjects wore contact lenses while participating; none of the participants wore glasses during data collection. Three subjects were excluded from this analysis because of excessively noisy data or data loss; namely, subjects were excluded if more than 20% of their data consisted of invalid samples. The remaining 30 subjects were used for our signal quality analysis. All subjects wore masks throughout the experiment, as data collection was conducted during the COVID-19 pandemic.

2.2 Tasks

We employed data from a random saccades task to compute the eye tracking quality measures reported herein. The stimulus was a white ring subtending 0.5°, displayed at a viewing distance of 1500 mm. Subjects were instructed to fixate on the center of the stimulus as it appeared in random positions uniformly sampled from a visual field spanning ±15° horizontally and ±10° vertically. The stimulus remained at each position for between 1 and 1.5 seconds before jumping to a random location at least 3° away. The stimulus jumped 80 times in each trial.

To study the effects of manual calibration on the collected data, we also included a recalibration task that took place immediately before the random saccades task. Subjects fixated on a 1°-wide stimulus as it appeared in predetermined positions forming a 13-point grid spanning the same field of view as the random saccades task. The positions of these points were the same across all subjects, but were presented in random order. The recalibration task is illustrated in Figure 1.

2.3 Apparatus

We created our stimulus using Unity 2019.4.21 and captured gaze data with Microsoft's Mixed Reality Toolkit (MRTK) plugin. We added a custom event to the MRTK to capture and save the gaze data reported by the HoloLens 2. Headset position tracking was disabled so that the stimulus remained at a fixed position relative to the center of the headset. To reduce the number of visual distractions from the environment, subjects faced a non-reflective black canvas for the duration of the experiment. Subjects also used a chinrest to minimize head movements during data collection.

The HoloLens 2's integrated eye tracker, referred to herein as the "HoloLens 2", features a sampling rate of 30 Hz and a nominal spatial accuracy of 1.5° [9]. The HoloLens 2 captures gaze data from the left and right eyes simultaneously. However, the manufacturer does not provide an API to extract the raw monocular signals from the device. The two monocular signals are instead combined into a single gaze ray, which is then made available via the MRTK. The methods used to compose this ray from the monocular signals are not publicly available at the time of writing. Each gaze sample is represented by a three-dimensional gaze vector v = (v_x, v_y, v_z).
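To make the random saccades protocol described in Section 2.2 concrete, the following sketch generates a comparable target sequence (positions within ±15° horizontally and ±10° vertically, jumps of at least 3°, dwell times of 1 to 1.5 s, 80 jumps). This is an illustrative Python sketch only; the actual stimulus was implemented in Unity, and the function name is ours.

```python
import math
import random

def generate_target_sequence(n_jumps=80, h_limit=15.0, v_limit=10.0,
                             min_jump=3.0, dwell_range=(1.0, 1.5)):
    """Generate (x, y) target positions in degrees and dwell times in seconds
    for a random saccades task: positions are sampled uniformly within
    +/-h_limit x +/-v_limit, and each new target is at least min_jump degrees
    from the previous one."""
    targets = [(random.uniform(-h_limit, h_limit), random.uniform(-v_limit, v_limit))]
    dwells = [random.uniform(*dwell_range)]
    while len(targets) < n_jumps + 1:  # initial position plus 80 jumps
        x = random.uniform(-h_limit, h_limit)
        y = random.uniform(-v_limit, v_limit)
        px, py = targets[-1]
        if math.hypot(x - px, y - py) >= min_jump:  # enforce the minimum jump size
            targets.append((x, y))
            dwells.append(random.uniform(*dwell_range))
    return targets, dwells
```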
We transformed raw gaze data into degrees of visual angle using simple trigonometry and MATLAB's atan2d function: θ_H = atan2d(v_x, v_z) and θ_V = atan2d(v_y, v_z).

During data collection, we observed that the HoloLens 2's gaze data appeared to be largely uncalibrated. Gaze positions were systematically offset within subjects, despite calibration taking place immediately before the task began. While the source of this phenomenon is unclear, it is present in all subjects. Because the HoloLens 2's built-in calibration does not provide feedback, we used the recalibration task described in Section 2.2 to test the extent to which the captured signal could be calibrated manually. We selected linear regression as a form of post hoc data recalibration, fitting each gaze channel as θ_cal = A + B·θ_H + C·θ_V, where the recalibration coefficients A, B, and C were computed using the data obtained from the recalibration task that preceded the random saccades task, as described in Section 2.2. These coefficients were then applied to the data from the random saccades task.

Eye tracking signal quality is typically measured during a stable fixation on a target. Although there are a number of algorithms available for classifying eye movements (e.g., I-VT), they require input parameters that need fine-tuning and produce significantly different outputs for non-ideal parameter values [10, 11]. Rather than relying on a fixation detection algorithm, we used a data-driven approach to select the most stable subset of samples across all subjects. First, we minimized the latency between the gaze signal and its corresponding target position using an approach proposed by [3]: we estimated the optimal saccade latency for each recording by shifting the gaze signal backward in time, by up to 800 milliseconds, until the mean Euclidean distance between the measured gaze position and the target position was minimized. The results of our approach are illustrated in Figure 2.

We then identified stable fixation periods by the absence of error due to saccadic movement. First, the angular offset between the gaze signal and the target was calculated on a per-sample basis for the first 30 samples (approximately 1000 ms) from the beginning of each target step. We then calculated the mean angular error at each sample index across all fixations and empirically selected the largest contiguous window of samples with the lowest error (Figure 3), as lower error likely indicates that the eye is stable and fixating on the target [12]. As a result, the first 233 ms of each fixation were discarded to eliminate instability caused by non-fixational movement. The first 466 ms of the remaining gaze signal were then used for all analyses herein.

Spatial accuracy is measured as the distance between the reported gaze position and the target position, reported in degrees of visual angle. By treating each stable fixation as a series of n gaze samples with measured gaze positions (xg_i, yg_i) and target positions (xt_i, yt_i), we determine the spatial accuracy of that series with one of the following:

H = (1/n) Σ_i |xg_i − xt_i|
V = (1/n) Σ_i |yg_i − yt_i|
C = (1/n) Σ_i sqrt((xg_i − xt_i)² + (yg_i − yt_i)²)

where H, V, and C respectively denote horizontal, vertical, and combined spatial accuracy. We then take the median spatial accuracy value across fixations within each subject, and present the average spatial accuracy of the dataset as the median value across all subjects. We also include the mean spatial accuracy of our dataset to compare our results with the eye tracking signal quality analysis of [7]. Table 1 summarizes the spatial accuracy values measured across the dataset. In the original data, accuracy is notably worse in the vertical direction.
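For illustration, the sketch below shows one way to implement the vector-to-angle conversion and the linear-regression recalibration described above, using Python/NumPy in place of MATLAB (np.degrees(np.arctan2(...)) plays the role of atan2d). The regression design, an intercept plus both gaze components per output channel, is our reading of the coefficients A, B, and C; all function names are ours.

```python
import numpy as np

def vectors_to_degrees(v):
    """Convert gaze vectors of shape (n, 3) into horizontal and vertical angles,
    in degrees of visual angle, following the atan2-based conversion above."""
    theta_h = np.degrees(np.arctan2(v[:, 0], v[:, 2]))  # horizontal component
    theta_v = np.degrees(np.arctan2(v[:, 1], v[:, 2]))  # vertical component
    return theta_h, theta_v

def fit_recalibration(gaze_h, gaze_v, target_h, target_v):
    """Fit per-channel linear models target ~ A + B*gaze_h + C*gaze_v by least
    squares on data from the 13-point recalibration task."""
    X = np.column_stack([np.ones_like(gaze_h), gaze_h, gaze_v])
    coef_h, *_ = np.linalg.lstsq(X, target_h, rcond=None)
    coef_v, *_ = np.linalg.lstsq(X, target_v, rcond=None)
    return coef_h, coef_v  # each is (A, B, C)

def apply_recalibration(gaze_h, gaze_v, coef_h, coef_v):
    """Apply the fitted coefficients to gaze from the random saccades task."""
    X = np.column_stack([np.ones_like(gaze_h), gaze_h, gaze_v])
    return X @ coef_h, X @ coef_v
```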
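The latency-minimization and spatial-accuracy steps could be implemented along the following lines. This is a minimal sketch assuming gaze and target signals already expressed in degrees and sampled at the nominal 30 Hz rate, with the 800 ms bound converted into a whole number of samples; function names are ours.

```python
import numpy as np

def estimate_latency_samples(gaze_h, gaze_v, targ_h, targ_v, fs=30.0, max_latency_s=0.8):
    """Return the backward shift (in samples) of the gaze signal that minimizes
    the mean Euclidean distance between gaze and target positions."""
    max_shift = int(round(max_latency_s * fs))
    errors = []
    for shift in range(max_shift + 1):
        gh, gv = gaze_h[shift:], gaze_v[shift:]          # shift gaze earlier in time
        th, tv = targ_h[:gh.size], targ_v[:gv.size]      # align target to shifted gaze
        errors.append(np.mean(np.hypot(gh - th, gv - tv)))
    return int(np.argmin(errors))

def spatial_accuracy(gaze_h, gaze_v, targ_h, targ_v):
    """Horizontal, vertical, and combined spatial accuracy (degrees) for the
    stable samples of a single fixation."""
    acc_h = np.mean(np.abs(gaze_h - targ_h))
    acc_v = np.mean(np.abs(gaze_v - targ_v))
    acc_c = np.mean(np.hypot(gaze_h - targ_h, gaze_v - targ_v))
    return acc_h, acc_v, acc_c
```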
The histograms in Figure 4 show the distribution of spatial accuracy across all fixations in the dataset. While the original data has an unusually high frequency of fixations with large error, the recalibrated data closely resembles the right-skewed distribution that is characteristic of spatial accuracy [5]. This corrective effect is seen most prominently in the vertical dimension.

As a sanity check, we re-calculated the average spatial accuracy for the original data using the raw gaze vectors reported by the HoloLens 2. Given the raw gaze vectors v described in Section 2.3 and a target vector u, we measured the difference between them using cosine similarity: θ = arccos((v · u) / (‖v‖ ‖u‖)). The resulting spatial accuracy θ was nearly identical to the results described in Table 1. These results indicate that our data processing and selection efforts did not introduce the extremely high error observed in the original data.

Table 2: Average RMS spatial precision across subjects, expressed separately as the 50th, 75th, and 90th percentiles. H, V, and C respectively denote horizontal, vertical, and combined gaze directions. Mean spatial precision is reported as the standard deviation of intersample distance for comparability; all other precision values are calculated using RMS.

By treating each fixation as a set of n gaze samples, we compute spatial precision with the following equation from [13]: RMS = sqrt((1/(n−1)) Σ_i θ_i²), where θ_i is the Euclidean distance between consecutive samples. Although [13] recommend using angular distance over Euclidean distance as a measure of intersample distance, [12] demonstrate that RMS precision does not differ significantly between the two when calculated on stable fixation periods. Similar to our description of spatial accuracy, we compute the median spatial precision within each subject and describe the average spatial precision of the dataset as the median precision value across subjects. For comparability, we also include the mean precision calculated as the standard deviation of intersample distances, as described in [7]. Table 2 summarizes our results. The distribution of precision values across all fixations in the dataset is also shown in Figure 5.

We evaluate the temporal precision of the device by calculating the variability of inter-sample intervals (ISIs) between consecutive timestamps. Given n gaze samples captured by the HoloLens 2, each sampled at a timestamp t_i reported by the device, ISIs are calculated by taking the difference between the timestamps of consecutive samples. After processing the data captured by the device, we were left with 89,721 gaze samples across all subjects. The mean difference between timestamps is 34.8 ms (SD 21.9 ms), which corresponds to approximately one sample period. We also investigated the proportion of samples that were dropped by the HoloLens 2. Dropped samples are identified when ISIs exceed 49.95 ms (50% more than the ideal ISI of 33.3 ms). 4,088 samples across the entire dataset fit this criterion (4.6% of the dataset).

Linearity measures the extent to which spatial accuracy changes based on the location of the target. Many eye tracking signal quality investigations [4], including those in virtual reality [3], have revealed that spatial accuracy varies systematically relative to the region of the screen at which the user is looking, typically expressed as the target's location in the field of view. We apply the approach used by [3] for calculating linearity to our own dataset.
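A sketch of the cosine-similarity sanity check and the RMS precision equation from [13], as we apply them here, assuming NumPy arrays of gaze samples; function names are ours.

```python
import numpy as np

def angular_error_from_vectors(v, u):
    """Angle in degrees between each raw gaze vector v[i] and target vector u[i],
    computed from their cosine similarity."""
    cos_sim = np.sum(v * u, axis=1) / (np.linalg.norm(v, axis=1) * np.linalg.norm(u, axis=1))
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

def rms_precision(gaze_h, gaze_v):
    """RMS of sample-to-sample Euclidean distances within one stable fixation."""
    step = np.hypot(np.diff(gaze_h), np.diff(gaze_v))
    return np.sqrt(np.mean(step ** 2))

def sd_precision(gaze_h, gaze_v):
    """Standard deviation of intersample distances, for comparability with [7]."""
    step = np.hypot(np.diff(gaze_h), np.diff(gaze_v))
    return np.std(step)
```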
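The temporal-precision and dropped-sample checks reduce to simple differences between consecutive timestamps; a minimal sketch, assuming timestamps in milliseconds:

```python
import numpy as np

def isi_statistics(timestamps_ms, nominal_isi_ms=1000.0 / 30.0, drop_factor=1.5):
    """Return the mean and standard deviation of inter-sample intervals and the
    number of intervals exceeding 1.5x the nominal interval (about 49.95 ms at
    30 Hz), which we treat as dropped samples."""
    isi = np.diff(np.asarray(timestamps_ms, dtype=float))
    dropped = int(np.sum(isi > drop_factor * nominal_isi_ms))
    return isi.mean(), isi.std(), dropped
```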
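Finally, the linearity slope (with the 95% confidence interval used in the results below) and the quadratic crosstalk fit discussed in the following section could be computed along these lines; SciPy and NumPy stand in for the original tooling, and the exact fitting procedure of [3] may differ from this sketch.

```python
import numpy as np
from scipy import stats

def linearity_slope(target_pos, gaze_pos, alpha=0.05):
    """Fit gaze = slope * target + intercept and return the slope with its
    (1 - alpha) confidence interval; an ideal tracker yields a slope of 1.0."""
    fit = stats.linregress(target_pos, gaze_pos)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, len(target_pos) - 2)
    ci = (fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr)
    return fit.slope, ci

def crosstalk_fit(driving_axis, orthogonal_axis):
    """Quadratic fit of the orthogonal gaze component against the driving one;
    near-zero quadratic and linear terms indicate an intercept-only fit."""
    return np.polyfit(driving_axis, orthogonal_axis, deg=2)  # [quadratic, linear, intercept]
```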
Table 3 summarizes the slope of the fitted linearity equation for the horizontal and vertical components of the data across calibration strategies. The ideal linearity slope has a value of 1.0, indicating a one-to-one linear relationship between target locations and gaze positions. We identify a linearity slope as significantly different from ideal when its 95% confidence interval does not contain the ideal value of 1.0. While the original data is significantly different from ideal, the recalibrated data features lower error across the field of view and a linearity slope that successfully approximates ideal conditions. Figure 6 illustrates how linearity is expressed in our dataset.

Crosstalk describes the extent to which rotation of the eye in one direction (e.g., horizontal or vertical) affects the gaze position reported in the orthogonal direction. Because [3] observed quadratic crosstalk in their virtual reality eye movement dataset, we reproduce their analysis to assess whether this phenomenon is common in virtual/augmented reality eye movement datasets.

Of all the eye tracking signal quality metrics we investigated, only spatial accuracy is publicly benchmarked by the manufacturer. We find that the original data's spatial accuracy results are consistently worse than the typical spatial accuracy reported by the manufacturer (1.5°). Similarly, [7] report a mean spatial accuracy of 0.77° in an experimental setup similar to our own, which is significantly lower (better) than both our findings and the 1.5° accuracy reported by the HoloLens 2 manufacturer. Recalibrating our data markedly improves spatial accuracy. In fact, the average spatial accuracy achieved by the recalibrated data better approximates the manufacturer's reported spatial accuracy benchmark.

Although there are no manufacturer-supplied metrics for spatial precision, [7] provide a benchmark in their own study of the HoloLens 2's signal quality, reporting a mean spatial precision of 0.24°. Our comparable spatial precision result of 0.14° is lower (better) than that prior work. The difference in measured spatial precision values may come down to differences in experimental setups: we made efforts to eliminate noise in the data caused by head movement, whereas Kapp et al. did not.

Because crosstalk and linearity are closely related to spatial accuracy, these measures were also improved by recalibration. Our results for linearity illustrate that spatial accuracy tends to deteriorate at the extremes of the field of view. This is consistent with the findings of other eye movement signal quality analyses in the literature [4, 1]. Recalibration predictably brought linearity values closer to the ideal. The scale of this improvement in linearity may partially depend on the design of the recalibration task, which captures gaze data across the entire field of view.

Our analysis of crosstalk is based on a novel approach from [3], where quadratic crosstalk was observed in an eye movement dataset captured in virtual reality. A subset of our data also exhibits partially quadratic crosstalk, but this transforms into an intercept-only fit after recalibration. It is unclear whether the quadratic vertical crosstalk we initially observed is indeed endemic to head-mounted virtual/augmented reality eye tracking devices, or is simply a consequence of less-than-ideal eye tracking signal quality.
Incidentally, less-than-ideal eye tracking signal quality may be more common in head-mounted devices, as it can be more difficult to consistently achieve a good headset fit on participants. Future investigations may be able to disentangle the possible sources of crosstalk found in these datasets by grouping individual recordings by spatial accuracy and then analyzing crosstalk separately for each group.

Overall, the recalibrated signal quality results represent a realistic analysis of the HoloLens 2's achievable eye tracking signal quality. Based on our investigation, it appears that the device's built-in calibration was not applied to the data we collected. It is also possible that the presence of masks during data collection may be partially responsible for the degradation of eye tracking signal quality in our dataset by introducing spurious corneal reflections or contributing to lens fogging. However, the relatively low (higher-quality) spatial precision achieved by our data indicates that this may not be the case. Our study may have elicited vergence-accommodation conflict in some participants, as we placed our stimulus at a distance of 1.5 m from the viewer, while the manufacturer recommends a focal plane of 2.0 m in their evaluations of spatial accuracy.

Further investigation into the eye tracking signal quality of the device is limited by the nature of the gaze data available through the HoloLens 2's eye tracking API. The device processes gaze data to remove personally identifying information [9], but it is not known how the signal is further affected (e.g., filtering). It is important to note that updates to the manufacturer's eye tracking API since the time of writing may change the characteristics of the gaze signal that is presented to users.

We evaluated the HoloLens 2's eye tracking signal quality using commonly reported signal quality descriptors, including the first analysis of the device's linearity and crosstalk. Our investigation contributes to a growing body of literature that characterizes the nature of eye tracking data in AR environments. Based on our findings, we recommend that future studies using the HoloLens 2 include a recalibration task that captures gaze data across the viewing field to enable post hoc data correction.

References

[1] Eye movement biometrics using a new dataset collected in virtual reality.
[2] Gaze typing in virtual reality: Impact of keyboard design, selection method, and motion.
[3] Evaluating the data quality of eye tracking signals from a virtual reality system: Case study using SMI's eye-tracking HTC Vive.
[4] Cleaning up systematic error in eye-tracking data by using required fixation locations. Behavior Research Methods, Instruments, & Computers.
[5] Common predictors of accuracy, precision and data loss in 12 eye-trackers.
[6] Toward everyday gaze input: Accuracy and precision of eye tracking and implications for design.
[7] ARETT: Augmented Reality Eye Tracking Toolkit for head mounted displays.
[8] GazeBase, a large-scale, multi-stimulus, longitudinal eye movement dataset.
[9] Eye tracking on HoloLens 2.
[10] Identifying fixations and saccades in eye-tracking protocols.
[11] Standardization of automated analyses of oculomotor fixation and saccadic behaviors.
[12] Angular offset distributions during fixation are, more often than not, multimodal.
[13] Eye tracker data quality: What it is and how to measure it.
[14] The influence of calibration method and eye physiology on eyetracking data quality.
[15] Photosensor oculography: Survey and parametric analysis of designs using model-based simulation.
[16] Multimodality during fixation, part II: Evidence for multimodality in spatial precision-related distributions and impact on precision estimates.

Acknowledgments

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1840989.