title: GaVe: A Webcam-Based Gaze Vending Interface Using One-Point Calibration
authors: Zeng, Zhe; Liu, Sai; Cheng, Hao; Liu, Hailong; Li, Yang; Feng, Yu; Siebert, Felix Wilhelm
date: 2022-01-14

Hao Cheng and Yu Feng are with the Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany (e-mail: hao.cheng@ikg.uni-hannover.de; yu.feng@ikg.uni-hannover.de). Hailong Liu is with the Nara Institute of Science and Technology, Japan (e-mail: liu.hailong@is.naist.jp). Yang Li is with the ifab-Institute of Human and Industrial Engineering, Karlsruhe Institute of Technology, Germany (e-mail: yang.li@kit.edu). Felix Wilhelm Siebert is with the Department of Technology, Management and Economics, Technical University of Denmark, Denmark (e-mail: fwisi@dtu.dk).

Even before the Covid-19 pandemic, beneficial use cases for hygienic, touchless human-machine interaction had been explored. Gaze input, i.e., information input via the eye movements of users, represents a promising method for contact-free interaction in human-machine systems. In this paper, we present the GazeVending interface (GaVe), which lets users control actions on a display with their eyes. The interface works with a regular webcam, available on most of today's laptops, and only requires a one-point calibration before use. GaVe is designed in a hierarchical structure, presenting broad item clusters to users first and subsequently guiding them through another selection round, which allows the presentation of a large number of items. Cluster/item selection in GaVe is based on the dwell time of fixations, i.e., the duration that users look at a given cluster/item. A user study (N=22) was conducted to test optimal dwell-time thresholds and comfortable human-to-display distances. Users' perception of the system, as well as error rates and task completion times, were registered. We found that all participants were able to use the system after a short training and showed good performance during system usage, selecting a target item within a group of 12 items in 6.76 seconds on average. Participants quickly understood how to interact with the interface. We provide design guidelines for GaVe and discuss the potential of the system.

Since touchless input is not only convenient but also hygienic, the Covid-19 pandemic has led to a rise in demand for touchless human-machine interaction in the public space. Especially in high-traffic fast-food restaurants and public transportation ticket offices, touchless ordering and ticketing systems are needed to prevent the transmission of viruses. Gaze-based input represents a promising method for contact-free interaction in human-machine interaction (HMI) systems. In daily life, humans use their eyes mainly to obtain information, but methods have been developed to also use eye movements as an input modality in HMI. For example, various interfaces have been developed which let users control websites [1], enter text [2]-[4], or enter PIN codes [5], [6] with their eyes. [7] demonstrated that gaze-based interaction can be superior to conventional touch interfaces in the automotive environment. For public displays, gaze input has multiple advantages. First, gaze input facilitates touchless interaction with the interface, which prevents the transmission of, e.g., viruses via surfaces touched by multiple users.
Second, gaze input can prevent shoulder surfing and ensure user privacy when using public displays. Third, as the price of commercial eye tracker devices is decreasing, it presents a cost-efficient input method. More recently, gaze estimation has been conducted on off-the-shelf consumer hardware such as webcams [8], [9]. This makes gaze estimation technically and economically feasible for all devices that include a front camera, such as cellphones, tablets, and laptops. Thus, using gaze input is no longer limited by high hardware costs and can benefit a much larger user group.

Despite these advances, gaze-based interaction still faces a number of challenges that need to be considered in the design of interfaces:

1) The "Midas touch" problem [10]: Searching for and selecting an interactive item are not always clearly separated. It can be challenging to distinguish a user merely looking at an object on a screen from the user's intention to select that object.
2) The "Fat finger" problem: On touch displays, slightly inaccurate finger placement can lead to unintentionally wrong inputs. In gaze-based interaction, miniature eye movements such as tremor, drift, and microsaccades [11] produce noise in the gaze position estimation. Thus, the gaze cursor is not as accurate as an equivalent mouse cursor.
3) The calibration requirement: The calibration process is considered time-consuming. To enable gaze interaction on a public display, the system should avoid or shorten the calibration process to improve user acceptance and experience.

To address the aforementioned challenges, we propose a novel gaze interface which can be built using an off-the-shelf webcam. The "Midas touch" and "Fat finger" problems are addressed through a high spatial separation of interactive display elements, while the calibration requirement is reduced to a brief one-point calibration. The contributions of this work are as follows:

Human gaze can contain complex information about a person's interests, hobbies, and intentions [12]. To leverage this information, eye tracking technologies are applied to measure eye positions and movements. They have been widely used in medical, marketing, and psychological research. Moreover, with the help of eye tracking, eye gaze has been transformed into an alternative input modality for controlling or interacting with digital devices [10]. Today, there are multiple, functionally different ways to use eye gaze as an input modality. The most popular gaze interaction methods are summarized in the following paragraphs.

Dwell-based gaze interaction: In dwell-based gaze interaction, gaze duration, i.e., dwell time, is used to activate an action, and the eye position replaces the mouse cursor on the screen of the focused device. Dwell-based gaze interaction is subject to the "Midas touch" problem, i.e., the difficulty of distinguishing between information intake and object selection on a display. To address the "Midas touch" problem, a time-based threshold is set for the selection of an object. Only when this pre-defined dwell-time threshold has been reached will the corresponding action be triggered. Generally, the optimal setting for the dwell-time threshold varies from 200 ms to 1000 ms [2], [13], [14]. Hence, prior to the implementation of an interface, an assessment of dwell-time thresholds for the specific task can be necessary.
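To make this mechanism concrete, the following minimal sketch shows how a dwell-time threshold can turn a stream of per-frame gaze targets into selection events. It is an illustration rather than the paper's implementation; the class name, the default 1.0 s threshold, and the frame-wise update interface are our own assumptions.

```python
import time

class DwellDetector:
    """Fire a selection event when gaze rests on the same target for
    at least `threshold` seconds (illustrative sketch)."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold    # dwell-time threshold in seconds
        self.current_target = None    # target currently being fixated
        self.fixation_start = None    # timestamp at which the fixation began

    def update(self, target, timestamp=None):
        """Feed the currently fixated target (or None) once per camera frame.
        Returns the target when the dwell threshold is reached, else None."""
        timestamp = time.monotonic() if timestamp is None else timestamp
        if target != self.current_target:
            # Gaze moved to another target (or off all targets): restart the timer.
            self.current_target, self.fixation_start = target, timestamp
            return None
        if target is not None and timestamp - self.fixation_start >= self.threshold:
            # Threshold reached: trigger the selection and reset the detector.
            self.current_target = self.fixation_start = None
            return target
        return None
```

In a complete interface, `update()` would be called once per frame with whichever interactive element the estimated gaze position currently hits.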
Thanks to its straightforward function and easy implementation, dwell-based gaze interaction has become one of the most popular gaze interaction methods. Tracking accuracy, however, is an additional challenge for dwell-based gaze interaction systems, as the method relies on relatively high accuracy to correctly register the spatial location of the fixated object on the display. Hence, a calibration of the tracking system is needed, and the size and spatial separation of the interactive items on the display can influence detection performance [15].

Blink-based gaze interaction: In blink-based systems, the action of closing one's eyes is used to trigger an action in the interface. To prevent unintentional triggering of actions through involuntary blinks, only voluntary blinks are used for gaze interaction. Frequently, voluntary blinking is defined by blink duration [16], with blinks over 200 ms registered as voluntary [17], or single-eye closure is used as a trigger [18]. Similar to the dwell-based method, the eye position is used to control the cursor on the device's screen, and performance is similarly influenced by the accuracy of eye tracking.

Gesture-based gaze interaction: Differing from the above two methods, gesture-based gaze interaction utilizes intentional saccades to trigger actions on a display. Natural saccades occur when the human gaze "jumps" from a fixated point to a new end point. Eye gestures are defined as an ordered sequence of intentional saccades [19]. They consist of different "paths" of saccades which can be mapped to specific interaction commands. Eye gesture-based interaction has several advantages over dwell- and blink-based gaze interaction. Firstly, eye gestures can distinguish intentional interaction commands from unintentional ones, thus effectively solving the "Midas touch" problem. Secondly, compared to dwell-based interaction, the control area of eye gestures does not rely on the exact gaze position, only on the relative position between the start and end points of saccades. However, gesture-based methods have a considerable disadvantage: users need to learn and remember the defined gaze gestures before using them, which heavily limits their applicability for public displays.

Pursuit-based gaze interaction: Smooth pursuit eye movements occur when the eyes follow a moving object. Pursuit-based interaction is established by matching the trajectories of eye movements to the trajectories of moving objects on a display [20]. Different types of trajectories can be used, e.g., circular trajectories [21], [22], linear trajectories [23]-[25], and irregular trajectories such as an object's outline [26]. In comparison to the other gaze interaction methods mentioned above, pursuit-based gaze interaction does not require precise gaze coordinates or a personal calibration. As a dynamic interface, however, pursuit-based interaction differs markedly from existing human-machine interfaces, so user acceptance needs to be considered when designing pursuit-based interfaces.

Eye tracking relies on technology that can register eye position and eye movement. Most eye tracking devices combine a camera and infrared (IR) light sources to estimate the gaze position, using the IR light to locate the eyes in relation to the camera. Recently, off-the-shelf cameras have been used to estimate eye gaze [27], [28].
Some studies developed interaction systems using gaze direction detection to enter text [29], [30] or PIN codes [31]. However, while results are promising, the spatial accuracy of off-the-shelf camera gaze estimation is still relatively low. The gaze estimation error is around 5-6° of visual angle for model-based methods and 2-4° for appearance-based estimation methods [8]. To circumvent the low spatial accuracy, [27] propose to utilize large interactive items. In addition to the accuracy problem, the time needed for calibration may affect a user's acceptance and experience. This motivates researchers to design applications that work without personal calibration, such as EyeTell [32], which uses smooth-pursuit movements captured by the front-facing camera of a tablet. Such a calibration-free design is appealing when it comes to the use of public displays. Thus, to facilitate touchless ordering on public vending displays, in this work we focus on developing a dwell-based gaze interface that uses an off-the-shelf camera and avoids a lengthy calibration process.

Weighing the advantages and disadvantages of available gaze-based interaction systems, we implemented a dwell-time-based gaze interaction system with a brief (2-second) one-point calibration. It is characterized by ease of understanding and implementation. Our system is implemented on an off-the-shelf webcam and uses facial landmarks and a shape-based method to estimate the direction of gaze. As shown in Figure 2, the gaze estimation module consists of the following parts: (1) face detection, (2) iris position and pupil center detection, (3) calculation of the ratio of each pupil center, (4) one-point calibration, and (5) estimation of one of five gaze directions, i.e., right, left, up, down, and center. Steps 1 and 2 use the 68-point face detection method implemented by the open-source Python Dlib library [33], resulting in an initial rough estimate of a user's pupil center. In steps 3-5, we optimize the gaze estimation to detect five gaze directions using a one-point calibration. In the following, we explain our method in detail.

(1) Face detection: Dlib's 68-point facial landmark detector is used to detect a frontal face. Of these landmarks, 12 points are used for detecting the eyes (6 points per eye). As shown in Figure 3, landmarks 36-41 are used for detecting the left eye and landmarks 42-47 for detecting the right eye.

(2) Iris position and pupil center detection: After narrowing down to the eye areas using Dlib's 68 facial landmarks, we further partition the eye region into a left-eye and a right-eye image. The two images are then analyzed individually to detect their corresponding iris positions using the scleral-iris limbus method. Both eye images are then converted into a binary mask by applying bilateral filtering, with the iris contours rendered in black in order to distinguish the iris from the other parts of the eye. In the end, the center coordinates of the pupil for each eye can easily be derived from the iris contour by calculating the image moments.
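As an illustration of steps (1) and (2), the sketch below detects the facial landmarks with Dlib and estimates the pupil centers from the binarized eye regions via image moments. It is a simplified reconstruction rather than the authors' code: the landmark model path, the fixed binarization threshold, and the helper name `pupil_center` are assumptions, and the scleral-iris limbus analysis is reduced to a plain threshold segmentation.

```python
import cv2
import dlib
import numpy as np

# Dlib face detector and 68-point landmark predictor (model file path is an assumption).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

LEFT_EYE = range(36, 42)    # landmarks 36-41 (left eye, as in Figure 3)
RIGHT_EYE = range(42, 48)   # landmarks 42-47 (right eye)

def pupil_center(gray, landmarks, indices, threshold=40):
    """Estimate one eye's pupil center in frame coordinates."""
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in indices],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)          # bounding box of the eye landmarks
    eye = cv2.bilateralFilter(gray[y:y + h, x:x + w], 9, 75, 75)
    # Binarize so that the dark iris becomes the foreground blob of the mask.
    _, mask = cv2.threshold(eye, threshold, 255, cv2.THRESH_BINARY_INV)
    m = cv2.moments(mask)                       # image moments of the iris blob
    if m["m00"] == 0:
        return None                             # eye closed or segmentation failed
    return x + m["m10"] / m["m00"], y + m["m01"] / m["m00"]

cap = cv2.VideoCapture(0)                       # default webcam
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):
        landmarks = predictor(gray, face)
        print("left pupil:", pupil_center(gray, landmarks, LEFT_EYE))
        print("right pupil:", pupil_center(gray, landmarks, RIGHT_EYE))
cap.release()
```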
(3) Calculation of the pupil center ratio: The ratio of the pupil center is calculated based on its central position in relation to the edge positions that the pupil can normally reach. The horizontal ratio of the left pupil is calculated as

h_ratio_left = (x - x_min) / (x_max - x_min),    (1)

where x is the central x-coordinate of the left pupil extracted in the steps above, and x_max and x_min are the maximum and minimum values of the eyelid edge, respectively, that the pupil can reach. The value h_ratio_left is the horizontal ratio for the left eye and ranges from 0 to 1. When the ratio is close to 1, the participant is looking in the leftmost direction; when it is close to 0, the participant is looking in the rightmost direction.

Based on the observation that the pupil rarely reaches the eyelid edge denoted by the landmark positions, e.g., p36 and p39, we optimized the maximum and minimum values in a pilot study. Seven participants (4 males and 3 females) were asked to record their pupil movement by orienting their eyes toward the eyelid edges. They were asked to keep their head still while facing a display screen in front of them, and then to turn and hold their eyes for a period of time in the rightmost, leftmost, topmost, and bottom-most directions. We recorded the data for each direction and calculated the average h_ratio and v_ratio. The mean h_ratio is 0.28 when participants look rightmost and 0.87 when they look leftmost; these two values are used as h_ratio_min and h_ratio_max. In the vertical direction, the average ratios are 0.48 when gazing at the top and 0.95 when gazing at the bottom; these two values are used as v_ratio_min and v_ratio_max. Thus, we re-normalize the ratios using the data from the pilot study, for the left pupil as

h_ratio_final = (h_ratio_left - h_ratio_min) / (h_ratio_max - h_ratio_min).    (2)

The final horizontal ratio of the pupil center is the average of the left and right eyes; the right eye is estimated using the same method. With the optimized minimum and maximum ratios acquired from the pilot study, we were able to extend the original gaze tracking method proposed by [33] to the vertical direction. Table I shows a comparison of the ratio estimation for the left pupil center using the landmark positions and the optimized positions from the pilot study. It can be seen that the actual ratio range the iris can reach is much smaller than the one defined by the landmark points, especially for the vertical ratio.

(4) One-point calibration: Calibration is the process of mapping the local eye coordinates obtained from the eye tracker/camera to a specific point on the display (resolution 1920 × 1080 pixels). For GaVe, we use a one-point calibration to simplify the process and ensure a short calibration time in walk-up-and-use scenarios. The calibration is visualized in Figure 4. At the start of the calibration, a red point is displayed in the center of the screen; the participants are instructed to keep their head still and look at the red point until it turns green, which happens after two seconds. In the calibration visualization of Figure 4, the coordinates of the calibration point, (x_screen, y_screen), are (960, 540) on the screen. This participant's data produces detected horizontal and vertical ratios, denoted h_c and v_c, of (0.56, 0.51). These individual h_c, v_c ratios are set as the central point of the screen for the individual user. It should be noted that the h_c, v_c ratios vary slightly between users.
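The sketch below summarizes steps (3) and (4) in code: it computes the raw pupil ratios, re-normalizes them with the pilot-study bounds reported above (0.28/0.87 horizontally, 0.48/0.95 vertically), and stores the ratios measured during the one-point calibration as the user's personal screen center. Function names, the clipping to [0, 1], and the example landmark values are our own assumptions, not the authors' code.

```python
# Pilot-study bounds: the pupil never quite reaches the eyelid edges,
# so the raw ratios are re-normalized to the range it actually covers.
H_MIN, H_MAX = 0.28, 0.87   # rightmost / leftmost
V_MIN, V_MAX = 0.48, 0.95   # topmost / bottom-most

def raw_ratio(coord, edge_min, edge_max):
    """Formula (1): position of the pupil center between the eyelid edges."""
    return (coord - edge_min) / (edge_max - edge_min)

def renormalize(ratio, r_min, r_max):
    """Formula (2): map the raw ratio onto the pilot-study range.
    Clipping to [0, 1] is an added robustness assumption."""
    return min(1.0, max(0.0, (ratio - r_min) / (r_max - r_min)))

def final_ratios(eyes):
    """Average the re-normalized horizontal/vertical ratios of both eyes.

    `eyes` is a list of two tuples (x, y, x_min, x_max, y_min, y_max): the
    pupil center and the eyelid-edge extremes taken from the landmarks."""
    h = [renormalize(raw_ratio(x, x_min, x_max), H_MIN, H_MAX)
         for (x, _, x_min, x_max, _, _) in eyes]
    v = [renormalize(raw_ratio(y, y_min, y_max), V_MIN, V_MAX)
         for (_, y, _, _, y_min, y_max) in eyes]
    return sum(h) / len(h), sum(v) / len(v)

# One-point calibration: the ratios measured while the user fixates the central
# calibration point become that user's personal screen center (h_c, v_c).
h_c, v_c = final_ratios([(31.2, 14.1, 24.0, 36.0, 10.0, 18.0),   # left eye (example values)
                         (30.8, 14.3, 24.0, 36.0, 10.0, 18.0)])  # right eye (example values)
```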
(5) Gaze direction estimation: After the one-point calibration, we obtain the horizontal and vertical ratios of the central point, h_c and v_c. According to the data from our pilot study, the individual central point can vary slightly around the actual central point of the screen. In the pilot study mentioned in step (3), all participants completed the task of looking at the four targets on the screen (top, bottom, left, and right) in turn, and the ratios were recorded. From these data we derived a central space, more precisely a rectangle-like region (as shown in Figure 5), whose width and length are denoted as w and l. Based on the pilot study, we calculated the approximate ratios of w and l in relation to the actual width and height of the screen and found that 0.4 and 0.2 are values that suit all participants. Therefore, we adopted these values for the final mapping from the gaze position to the screen position. More specifically, the total area of the screen was partitioned into the center (the rectangular region), left, right, up, and down. When the gaze is mapped into one of these areas on the target screen, the corresponding gaze event, i.e., "look right", "look left", "look up", "look down", or "look center", is detected. Figure 6 shows the partition process, where h_ratio_final and v_ratio_final represent the current ratios of the pupil center in relation to the edge positions.
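Expressed in code, the mapping from the final ratios to one of the five gaze events can look like the sketch below. It reflects the partition described above but fills in details the text leaves open: we assume the central rectangle is centered on the calibrated point (h_c, v_c) with half-extents w/2 and l/2 in ratio space, and that outside the rectangle the dominant displacement direction decides the event.

```python
W, L = 0.4, 0.2   # central rectangle size relative to screen width and height

def gaze_event(h, v, h_c, v_c, w=W, l=L):
    """Map the final pupil ratios to one of five gaze events.

    h_c and v_c come from the one-point calibration. For the horizontal ratio,
    values near 1 mean looking left and near 0 mean looking right; for the
    vertical ratio, larger values mean looking further down.
    """
    dh, dv = h - h_c, v - v_c
    if abs(dh) <= w / 2 and abs(dv) <= l / 2:
        return "look center"                 # inside the non-interactive rectangle
    if abs(dh) >= abs(dv):                   # horizontal displacement dominates
        return "look left" if dh > 0 else "look right"
    return "look down" if dv > 0 else "look up"
```

In GaVe, the returned event drives the arrow highlighting and, combined with the dwell-time threshold, the cluster and item selection described next.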
After the development of the gaze direction estimation process (Figure 6), a gaze-based vending machine interface (GaVe), which utilizes dwell time, was designed and developed. The general structure and all stages of the item selection process of GaVe are visualized in Figure 7. In the GaVe interface, all menu items are located in the non-interactive center of the screen. Interactive zones are located outside this central area, so users can look at the menu options without triggering any item selection. As shown in Figure 7(a), there are four clusters in the initial interface, arranged in the four directions up, down, left, and right. In each of the four clusters, three items are grouped, e.g., the top cluster combines a pizza, a burger, and a hot dog. Four arrows are located outside of the central area to visualize the interactive zones. Once a user's gaze is detected in one of the four interactive directions (defined in Figures 5 and 6), the corresponding arrow is marked with a gray circle as real-time visual feedback. If the user continuously focuses on an arrow for longer than a predefined time threshold, the circle around the arrow turns red to confirm the first stage of the cluster selection.

Once a cluster is selected, the items in the selected cluster are presented (Figure 7(d)). To smooth the expansion, the item located on the right side of the original cluster is also displayed on the right side in this stage. The same goes for the item originally located on the left side of the cluster: it is moved to the left side in the item-selection stage. The item originally located in the middle of the cluster is moved down to the bottom position. In the top position, a back button appears, allowing users to go back to the cluster stage. For the final item selection, users again fixate on one of the directional arrows, again receiving visual feedback on their selection.

An example of this two-stage item-selection process is described in the following and visualized in Figure 7. The target item ("chicken drumstick") is presented in the middle of the screen.

(0) Idle process: When the system detects that the user is looking towards the center of the interface, i.e., the non-interactive area within the rectangle defined in Figure 5, no action is triggered and the interface shows the cluster-selection screen (Figure 7(a)). GaVe stays in the idle process as long as no looking up, down, left, or right is detected.

(1) Cluster selection: The first stage, cluster selection, is illustrated in Figure 7(a)-(c) from top to bottom. In Figure 7(a), the user must select the lower cluster consisting of "chicken drumstick-chips-popcorn" by looking at the down arrow. As shown in Figure 7(b), the down button is highlighted with a gray circle to show that this button is in focus. As shown in Figure 7(c), if the user continuously looks at this button for the predefined time threshold, the circle turns red to confirm the first stage of the cluster selection.

(2) Item selection: The second stage, item selection, is illustrated in Figure 7(d)-(f). In this stage, the items in the selected cluster are expanded, as shown in Figure 7(d). In Figure 7(e), the target "chicken drumstick" is located on the left. To continue the selection, the user needs to look at the left side of the interface; again, the gray circle gives feedback that the left side of the interface has been detected. As the final step, Figure 7(f) shows the confirmation interface when the target item ("chicken drumstick") is selected.

If more than 10 seconds are spent in the selection process of one interface stage, the system resets to the initial cluster-selection screen and jumps to the next target item. The previous item-selection task is then registered as a missed selection.
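The selection flow just described, an idle state, a cluster stage, and an item stage, each guarded by the dwell threshold and a 10-second timeout, can be summarized as a small state machine. The sketch below is our own simplified reconstruction of this logic, not the authors' implementation: class and state names are illustrative, visual feedback is reduced to return values, and cluster items are assumed to be stored in left-middle-right order.

```python
class GaVeSelection:
    """Simplified two-stage dwell selection with a per-stage timeout (sketch)."""

    STAGE_TIMEOUT = 10.0   # seconds per interface stage, as in the user study

    def __init__(self, clusters, dwell_threshold=1.0):
        # clusters maps a direction ("up", "down", ...) to its three items,
        # assumed to be stored in [left, middle, right] order.
        self.clusters = clusters
        self.dwell_threshold = dwell_threshold
        self._full_reset(None)

    def _full_reset(self, t):
        self.stage = "cluster"            # "cluster" -> "item"
        self.selected_cluster = None
        self.stage_start = self.fix_start = t
        self.fix_target = None

    def update(self, event, t):
        """`event` is the current gaze event ("look up", ..., "look center")."""
        if self.stage_start is None:      # first frame after initialization
            self.stage_start = self.fix_start = t
        if t - self.stage_start > self.STAGE_TIMEOUT:
            self._full_reset(t)
            return ("missed", None)       # registered as a missed selection

        target = None if event == "look center" else event
        if target != self.fix_target:     # gaze moved: restart the dwell timer
            self.fix_target, self.fix_start = target, t
            return ("pending", None)
        if target is None or t - self.fix_start < self.dwell_threshold:
            return ("pending", None)      # idle, or dwell not yet long enough

        self.fix_target = None            # dwell confirmed, consume this fixation
        if self.stage == "cluster":
            self.selected_cluster = target.split()[-1]      # "up", "down", ...
            self.stage, self.stage_start = "item", t
            return ("cluster_selected", self.selected_cluster)
        if target == "look up":           # back button in the item stage
            self.stage, self.stage_start = "cluster", t
            return ("back", None)
        items = self.clusters[self.selected_cluster]        # [left, middle, right]
        item = dict(zip(("look left", "look down", "look right"), items))[target]
        self._full_reset(t)
        return ("item_selected", item)
```

Per frame, the output of `gaze_event(...)` and a timestamp would be fed into `update()`, and the returned tuple drives the visual feedback and the logging of completed or missed selections.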
To explore the usability of the GaVe interface, we implemented it in a stylized vending machine, using a webcam to register participants' gaze. In the study, we experimentally varied the threshold for the gaze dwell time that triggers an action, the distance from the user to the screen, and the size of the central area, to identify the optimal setup for the interface.

In total, 22 participants were recruited for the experiment (13 males, 9 females, mean age: 28.1 years, ranging from 23 to 40 years). Ten of the 22 participants wore glasses and 3 participants wore contact lenses; the remaining 9 participants did not wear visual aids. Most of the participants had no experience with eye tracking and gaze interaction.

The study used a three-factor within-subjects design. The independent variables were the distance from the user to the screen (45, 55, and 65 cm), the size of the central non-interactive area (small, medium, and large), and the dwell-time threshold (0.5, 0.8, 1.0, and 1.2 s).

The participants' performance was assessed through objective and subjective criteria. The objective criteria included task completion time and error rate. The task completion time is defined as the time that participants take to complete a trial, i.e., to finish the selection of a given target item. Errors are registered when participants select an item that is not the current target item, or when they are unable to interact with the system for a predefined time, i.e., no action (cluster/item selection) is registered within 10 s. The error rate is calculated as the number of trials registered as errors divided by the total number of trials. The subjective experience of participants was assessed after completion of the experiment. The participants were asked three questions: 1) Which is the most comfortable distance for you? 2) At what distance do you think you can select the target most accurately? 3) How would you evaluate this gaze interaction system?

The experiment was conducted in the eye tracking laboratory of the Chair of Human-Machine Systems at the Technical University of Berlin. Before the experiment, all participants signed a consent form and answered a demographic questionnaire. After this, the participants were given a short introduction to the interactive system and received an explanation of how to use it. The participants were instructed to select a given target item as accurately and quickly as possible. The given target item was displayed in the central area of the interface in a semi-transparent form. Each experimental condition included four trials, i.e., it was repeated four times. All participants performed the experiment in a seated position to adjust and stabilize the interaction distance. The order of the distance conditions was balanced among participants to avoid frequent changes of the seated position, whereas the orders of the central area size and dwell time conditions were randomized across participants to prevent sequence effects. After completing the task for each distance condition, the participants were asked to answer the three open questions listed above. During the experiment, the participants were allowed to rest after each condition. The experiment lasted approximately 30 minutes.

In the following, the objective measures of the user study are presented, followed by the subjective assessment by the participants. A three-way repeated-measures ANOVA (3 × 3 × 4) was conducted for the data analysis. The Shapiro-Wilk test and Q-Q plots were used to validate the assumption of data normality. We used the Greenhouse-Geisser correction when Mauchly's sphericity test indicated that the data did not fulfill the sphericity assumption. Moreover, Bonferroni correction was applied for post-hoc pairwise comparisons.
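As an illustration of this analysis pipeline, the sketch below runs the three-way repeated-measures ANOVA with statsmodels on a hypothetical long-format trial log (the file name and column names are assumptions). It covers only the omnibus test; Mauchly's test, the Greenhouse-Geisser correction, the Aligned Rank Transform used for the error rates, and the Bonferroni-corrected post-hoc comparisons are not included.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format trial log: one row per trial with the participant id,
# the three within-subject factors, and the measured task completion time.
trials = pd.read_csv("gave_trials.csv")
# expected columns: participant, distance_cm, area_size, dwell_s, completion_time

# Three-way repeated-measures ANOVA (3 x 3 x 4) on task completion time.
# aggregate_func="mean" averages the four repetitions per condition so that each
# participant contributes exactly one value per cell, as AnovaRM requires.
result = AnovaRM(
    data=trials,
    depvar="completion_time",
    subject="participant",
    within=["distance_cm", "area_size", "dwell_s"],
    aggregate_func="mean",
).fit()
print(result.anova_table)
```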
For task completion time and error rate, detailed results for all experimental conditions are presented in Table II. It can be observed that at a distance of 45 cm from the user to the screen, with the one-second gaze dwell time and the medium-sized central area, the participants achieved the minimum error rate with a relatively short task completion time (highlighted in boldface with an underline). A detailed analysis of this finding is given in the following subsections.

A. Task completion time

Figure 8 visualizes the task completion time for the different distances to the screen and the different dwell-time conditions. It can be observed that task completion time is closely associated with the dwell time: the task completion time increased alongside the dwell time set in the conditions. To further analyze the task completion time under the different distance conditions, we removed the fixed duration of the dwell time. The results are illustrated by the red bars within the original bar plots in Figure 8. Even after removing the fixed duration of the dwell time for activating an action, we still found that the task completion time is longer for the dwell-time conditions of 1.0 and 1.2 s than for those of 0.5 and 0.8 s. Furthermore, we found no two-way or three-way interactions between the factors.

B. Error rate

Since the Shapiro-Wilk test showed that the error rate is not normally distributed (p < .05), we applied an Aligned Rank Transform [34] before the repeated-measures ANOVA. As shown in Figure 9(a), the error rate decreased when the dwell time increased from 0.5 to 1.0 s, reaching its lowest value at 1.0 s, and rose again when the dwell time increased to 1.2 s. Significant differences in terms of the distance were found between each pair of conditions (p < .001). However, there was no significant difference regarding the size of the central area (F(2, 735) = 1.77, p = 0.17). We found an interaction effect between the dwell time and the size of the central area (p < .05) with respect to the error rate: the error rate at the 0.5 s dwell time is significantly higher than that at the 1.0 s dwell time for both the small-sized area condition (p < .01) and the medium-sized area condition (p < .001).

We further analyzed the error rate by distinguishing between false detections (Figure 9(b)) and missed detections (Figure 9(c)). A false detection was registered when the selected item was not the given target item. A missed detection was registered when a participant did not activate an action within the predefined time frame of 10 s. Figures 9(b) and (c) visualize how the false detection rate, i.e., the fraction of false detections over the total number of trials, decreases gradually from 0.5 to 1.2 s of dwell time, while the missed detection rate increases gradually over the same range.

Besides the objective variables, i.e., task completion time and error rate, we also collected subjective feedback from the participants. In terms of the most comfortable distance to the screen, 55% of the participants thought that the smallest distance of 45 cm was the most comfortable condition to accomplish the given tasks, while 32% considered 55 cm the most comfortable distance, and only the remaining 13% chose 65 cm. In terms of system accuracy, 91% of the participants felt that the system was most accurate at 45 cm, while only 9% preferred the distance of 55 cm, and no participant perceived the largest distance of 65 cm as the most accurate one. Asked about their general evaluation of the system, many participants considered this gaze interface innovative. One participant with a medical background mentioned that this touchless interaction was hygienic and gave the interactive system a highly complimentary remark.

The aim of this study was to design a webcam-based gaze interface for touchless human-computer interaction on public displays. The need for a touchless input modality has been increasing particularly during the Covid-19 pandemic. We developed a gaze-based interface for a vending machine, in which the interaction is triggered by gaze direction estimation through a webcam. A controlled laboratory experiment was conducted to study the usability of the system and to comprehensively assess optimal system parameters. From the user study, we found that the GaVe interface is effective and easy to use for most participants, even for participants wearing contact lenses or glasses. All participants were able to use the system after a short introduction.

The detailed analysis of the user study showed that there was a marked increase in task completion time when the dwell time was lengthened from 0.5 to 1.2 s, even when accounting for the longer wait time of the dwell-time-based trigger itself. One possible reason for this result is that an excessive dwell-time duration strains the eyes, which in turn increases the difficulty of the fixation-based selection [35]. More actions were interrupted by a failure to keep the eyes on the items for the required duration. To further analyze error rates, we divided errors into false detections and missed detections. As the dwell-time threshold increases, the false detection rate decreases; in contrast, the missed detection rate increases.
Based on the combined results above, 0.8-1.0 s is overall a relatively robust parameter range for the interface design. Across the three distance settings (45 cm, 55 cm, 65 cm), the closer distances resulted in lower task completion times and error rates in most cases. Consistent with the objective evaluations, about half of the participants rated 45 cm as the most comfortable distance, followed by 55 cm. The vast majority (approx. 90%) of participants felt that the accuracy was higher at a distance of 45 cm compared to the other two distance conditions. Although there were no significant differences in terms of the size of the central area in relation to task completion time and error rate, the descriptive results show that the medium-sized central area condition achieved a slightly shorter task completion time and a lower error rate than the small- and large-sized central area conditions.

To apply the GaVe interface in real-world applications, future research should consider, first, the screen size. The display used in this study is relatively small; a larger screen size may improve the correct detection rate. Second, head movement was not fully considered in our interface design, potentially limiting the real-world use of the interface; head movements should be taken into account during gaze estimation to achieve a more robust interaction. In addition, individual height differences between users can also affect the usability of the system. This could be addressed by automatically adapting the camera height to a user's height to improve both face detection and gaze estimation, as well as the user experience.

VII. CONCLUSION

In this paper, we conducted a proof-of-concept study for a hands-free input method based on gaze estimation using a webcam. The GaVe interface was designed based on dwell time using this proposed method. Users can easily interact with the gaze-based interface after a 2 s one-point calibration. As a touchless control modality, this interface design can improve the hygiene of using public displays, especially during the COVID-19 pandemic. Based on the results of the user study, we draw the following conclusions for the design of public gaze-based interfaces: (1) a user-to-interface distance between 45 cm and 55 cm is preferred and supports accurate use, (2) the dwell-time threshold should be set to 0.8-1.0 s, and (3) the size of the central non-interactive area of the interface can vary, up to 13.4° × 10.32° of visual angle, without negative effects. In addition, our research can provide guidance on structuring the interface design for touchless ordering services in similar applications, such as ticket vending machines, automatic coffee machines, and parking meters, as the number of items presented per selection round can be decreased by inserting additional selection rounds.

The author Zhe Zeng would like to thank the China Scholarship Council (CSC) for financially supporting her PhD study at the Technical University of Berlin, Germany.
REFERENCES

[1] GazeTheWeb: A gaze-controlled web browser
[2] Twenty years of eye typing: Systems and design issues
[3] SMOOVS: Towards calibration-free text entry by gaze using smooth pursuit movements
[4] HGaze Typing: Head-gesture assisted gaze typing
[5] A rotary dial for gaze-based PIN entry
[6] Entering PIN codes by smooth pursuit eye movements
[7] Gaze-based interaction on multiple displays in an automotive environment
[8] Evaluation of appearance-based methods and implications for gaze-based applications
[9] CamType: Assistive text entry using gaze with an off-the-shelf webcam
[10] What you look at is what you get: Eye movement-based interaction techniques
[11] The oculomotor control system: A review
[12] Eye movements: The past 25 years
[13] Fast gaze typing with an adjustable dwell time
[14] Eye movements in gaze interaction
[15] An evaluation of an eye tracker as a device for computer input
[16] Communication via eye blinks and eyebrow raises: Video-based human-computer interfaces
[17] BlinkWrite2: An improved text entry method using eye blinks
[18] Gaze+Hold: Eyes-only direct manipulation with continuous gaze modulated by closure of one eye
[19] Interacting with the computer using gaze gestures
[20] Pursuits: Spontaneous interaction with displays based on smooth pursuit eye movement and moving targets
[21] Orbits: Gaze interaction for smart watches using smooth pursuit eye movements
[22] Smooth pursuit study on an eye-control system for continuous variable adjustment tasks
[23] A text entry interface using smooth pursuit movements and language model
[24] Calibration-free gaze interfaces based on linear smooth pursuit
[25] Speye: A calibration-free gaze-driven text entry technique based on smooth pursuit
[26] Outline pursuits: Gaze-assisted selection of occluded objects in virtual reality
[27] Eye typing using Markov and active appearance models
[28] SearchGazer: Webcam eye tracking for remote studies of web search
[29] Efficient eye typing with 9-direction gaze estimation
[30] Smartphone-based gaze gesture communication for people with motor disabilities
[31] GazeTouchPIN: Protecting sensitive data on mobile devices using secure multimodal authentication
[32] EyeTell: Tablet-based calibration-free eye-typing using smooth-pursuit movements
[33] Eye tracking library easily implementable to your projects
[34] The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures
[35] Fast gaze typing with an adjustable dwell time