Control of computer pointer using hand gesture recognition in motion pictures
Yalda Foroutan, Ahmad Kalhor, Saeid Mohammadi Nejati, Samad Sheikhaei
December 24, 2020

A user interface is designed to control the computer cursor through hand detection and classification of hand gestures. A hand dataset with 6720 image samples is collected, covering four classes: fist, palm, pointing to the left, and pointing to the right. The images are captured from 15 persons against simple backgrounds, from different perspectives, and under different light conditions. A CNN is trained on this dataset to predict a label for each captured image and to measure the similarity between images. Finally, commands are defined for clicking, right-clicking, and moving the cursor. The algorithm reaches 91.88% accuracy and can be used with different backgrounds.

Nowadays, computers play an essential role in human life and reach into most aspects of people's personal and social lives. Mass production and the increased availability of personal computers have a growing impact on daily life. One purpose of human-computer interaction (HCI) is to control and improve the communication between humans and computers, for instance by creating an interface that transforms hand gestures into meaningful commands. Despite HCI, intermediary devices such as the computer mouse are still in use. Since the first mouse was introduced 50 years ago, much progress has been made to improve it. Although the computer mouse has improved significantly, this hardware is based on direct contact and restricts the user to controlling the computer from a close distance. Given the COVID-19 pandemic, the end of the lifetime of contact-based gadgets that control machines and computers can be anticipated.

In recent years, machine learning researchers have studied human activities [1]. Some of these works present alternative ways to control computers, for instance using a Kinect sensor [2], an EEG mouse [3], or EMG signals [4] to classify human actions and assign mouse commands to them in order to control the computer pointer. These methods remove the mouse hardware but require other hardware that is more expensive and bulkier than a usual mouse. Therefore, to replace the mouse, it is preferable to control the computer pointer with software alone. Since hands are the organs humans use most to move objects, people can also use their hands to control intelligent systems. Thus, one effective way to replace mouse hardware is to control computers with hand gesture recognition.

Previous work on hand gesture recognition is either contact-based or contact-free and based on machine learning. Older works were contact-based, such as a data glove [5], which detects hand gestures and tracks their movement. Besides their price, such gadgets restrain hand movements because of their weight, wiring, and sensors. With the advancement of computer vision and the ability to obtain more information from images, these gloves have become simpler: their wiring was removed, and a camera is now used to track hand movements [6]. Indeed, the camera can learn to follow the color of the glove or the shape of the palm and fingers. Later, the glove was replaced with colored fingertips [7].
Eventually, the fingertips were removed as well, and hand gesture applications became touch-less, based on machine learning techniques and frames captured from a camera. Machine learning techniques, which rely on imaging systems such as cameras, can be an alternative to contact-based approaches. Perhaps the most useful feature of these techniques is the freedom of hand movement they allow, which should be a requirement for HCI systems. Such techniques are combined with image processing and computer vision methods: image processing converts the images captured by a camera into digital form and applies scaling, filtering, and noise removal, while computer vision gives computers the ability to distinguish between different gestures the way a human does. For example, in [8], human skin is detected by setting a color threshold for skin color, and the background of the images is removed. In [9], no learning algorithm is used; instead of single frames, a video sequence is processed with methods such as skin detection and an approximate median model. [10] detects both hand and head based on skin color and creates a black-and-white mask from each frame; a VGGNet model is then trained to differentiate between hand and head. In [11], the angle between the thumb and index finger is used to discriminate hand gestures. In other papers, such as [12], the images are converted to the HSV color space so that each color is accessible in a single component. In [13], hands move in front of a blue background to exploit the difference between human skin and the background in HSV space. Hand gesture recognition usually has two stages: hand localization and gesture classification. SIFT is one of the fastest tools for feature extraction and was known for its speed before the emergence of neural networks. In [14], SIFT is used to distinguish between eight gestures in order to control machines such as fans and washing machines.

Although color-based methods are simple and easy to implement, they generalize poorly from person to person and in challenging conditions. These methods depend on skin color; in [15], for instance, users must manually specify their skin color. Furthermore, pixel-wise differentiation between human skin and backgrounds is more complicated and sensitive than it seems.

The availability of large datasets has increased the use of neural networks and made them a substitute for classic machine learning methods, since they are invariant to light conditions, perspective variations, and different backgrounds. Object detection algorithms based on neural networks can be applied to hand detection. Algorithms such as You Only Look Once (YOLO) and the Single Shot Multi-Box Detector (SSD) are usable for real-time tasks and achieve higher accuracy than RCNN or Faster RCNN. [16] takes advantage of two SSD detectors: the first detects the head and shoulder area, and the second recognizes hand gestures within the detected area. In addition, an SSD can be trained on new images [17]. Selective dropout can reduce the computational load [18] and thus performance lag. If hand detection and gesture recognition are merged in the detection stage, the output is directly a valid predicted label [19]. Hand gesture recognition algorithms based on deep learning are more accurate than classic methods. Even though deep learning approaches do not require hand-crafted feature extraction, they are slow because of their complexity.
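The color-based pipelines cited above (e.g., [8], [12]) generally amount to converting each frame to HSV, thresholding a skin-tone range, and cleaning up the resulting mask. The following is a minimal OpenCV sketch of that idea; the HSV bounds are illustrative assumptions, and tuning them per user and per lighting condition is exactly the generalization problem noted above.

```python
import cv2
import numpy as np

# Illustrative skin-tone bounds in HSV; real systems must tune these
# per user and lighting, which is the main weakness of color-based methods.
LOWER_SKIN = np.array([0, 40, 60], dtype=np.uint8)
UPPER_SKIN = np.array([25, 180, 255], dtype=np.uint8)

def skin_mask(frame_bgr):
    """Return a binary mask of skin-colored pixels in a BGR frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_SKIN, UPPER_SKIN)
    # Morphological opening/closing to remove speckle noise.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("skin_mask.png", skin_mask(frame))
    cap.release()
```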
Also, a data glove and a hand gesture recognizer, based on an SSD hand detector and an SVM respectively, have been used to classify the gestures of both hands [20].

In this section, a human-computer interface based on hand gestures is designed to control the computer pointer. For this HCI, a suitable dataset must be collected in order to train a convolutional neural network (CNN). The trained CNN is used for two purposes. Finally, a cursor controller is designed to convert a predicted label into a cursor action.

A hand dataset with 6720 image samples (300 x 300) of 15 subjects in 18 different simple backgrounds is collected. The dataset has four gesture classes: fist, palm, pointing to the right, and pointing to the left. Subjects were asked to photograph both hands and to use both the palmar and dorsal sides. Figure 1 depicts four gesture samples from the dataset. 5120 samples are used for the training set and 1600 for the validation and test sets. Usually, a dataset is split randomly into training and validation sets. Here, the training set and the validation/test sets have different distributions in order to avoid overfitting: the training samples are captured directly from webcams, without any intermediary software, whereas the validation samples are the images accepted by the SSD hand detector. The CNN classifier is therefore trained on one distribution and must cope with another one for the validation data and at run time.

Users move their hands in front of the computer webcam so that frames can be captured. The frames are preprocessed and passed to an SSD hand detector. If a hand is detected in a frame, the SSD produces two outputs: a cropped frame and the center coordinate of the crop. Since the SSD draws a bounding box around the hand in each frame, the hand is cropped from the bounding-box area. The cropped frame is then fed to the classification part, and the center of the crop is used by the mouse commands to move the computer cursor. If no hand is detected in the frame, the next frame is considered. The hand detection process is shown in Figure 2.

Since the SSD detects all kinds of hand gestures, two types of cropped frames reach the classification part: frames showing one of the four classes defined in the dataset, and frames showing undefined gestures. The four defined classes should become mouse commands, and no action should occur for other gestures. One of the critical challenges in classification is rejecting unwanted classes the way a human would. To predict a valid label for frames with defined classes and to reject the others, a CNN is trained. The EfficientNet-B0 architecture is used, followed by 8 fully connected layers for classification [21]. The last layer of EfficientNet-B0 has 1280 neurons, which the 8 fully connected layers reduce to 4 neurons [22]. The collected hand samples, resized to 70 x 70 x 3, train the CNN for 20 epochs, and the CNN reaches 99 percent accuracy on the test set. For evaluation, cropped frames are preprocessed and fed to the frozen CNN, where they are classified into 4 + 1 categories: the 4 classes mentioned before and 1 category for all other gestures. To reject the unwanted cropped frames, a Radial Basis Function (RBF) network is designed: after the CNN is trained, its last 8 dense layers are removed, and the remaining part forms a similarity network (a minimal sketch of this construction is given below).
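The construction just described (an EfficientNet-B0 backbone, a dense head ending in 4 output neurons, and a similarity network obtained by dropping that head) can be sketched in Keras roughly as follows. The widths of the 8 fully connected layers are not reported in the text, so the values below are assumptions; this is an illustration of the idea, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

NUM_CLASSES = 4
INPUT_SHAPE = (70, 70, 3)  # cropped hand images resized to 70 x 70 x 3

def build_classifier():
    # EfficientNet-B0 backbone; global average pooling yields a 1280-d feature vector.
    backbone = EfficientNetB0(include_top=False, weights="imagenet",
                              input_shape=INPUT_SHAPE, pooling="avg")
    x = backbone.output
    # 8 fully connected layers reducing 1280 -> 4 (hidden widths are assumed).
    for units in (1024, 512, 256, 128, 64, 32, 16):
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(backbone.input, outputs)

def build_similarity_network(classifier, num_dense_layers=8):
    # Remove the dense head: the tensor feeding the first Dense layer is the
    # 1280-d embedding used by the similarity network.
    embedding = classifier.layers[-num_dense_layers].input
    return models.Model(classifier.input, embedding)

if __name__ == "__main__":
    clf = build_classifier()
    clf.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    # clf.fit(train_ds, validation_data=val_ds, epochs=20)  # data loading omitted
    encoder = build_similarity_network(clf)
```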
Indeed, the similarity network acts as an encoder, or feature extractor, which reduces each sample from dimensions 70 x 70 x 3 to a 1280-dimensional vector. All samples of each class are fed through this network, and the mean vector of the encoded samples of each class is calculated, yielding 4 reference vectors. Next, the encoded samples of the validation and test sets are compared with these references using the Euclidean distance, and the maximum distance observed for each class is defined as that class's threshold. When a cropped frame enters the classification part, its encoding is compared with the references and the smallest distance is chosen. If the chosen distance is lower than the corresponding threshold, the cropped frame belongs to the dataset; otherwise, it comes from an unwanted class, is ignored, and a new frame is given to the SSD. Hence, the classification part performs two tasks: the classifier predicts a label for the cropped frame, and the similarity network compares the cropped frame with the reference vectors and the defined thresholds to determine whether it represents one of the four dataset classes. The classifier and the similarity network act independently, and the result of the similarity network validates the predicted label. If the classification part produces a valid label, the computer cursor responds to it.

A controller unit is needed to assign a command to each predicted label in order to control the computer pointer. If the proposed algorithm is off, it is turned on simply by showing the user's palm. After that, the recognized palm moves the cursor according to the center coordinate of the cropped frame. Since the SSD uses input images with 300 x 300 resolution, this coordinate must be converted to a meaningful coordinate on the screen. When the algorithm is on, the user can click or right-click at the cursor position by pointing to the left or right, respectively. By showing a fist, the user turns the application off, and no action occurs until the palm turns it on again (see Figure 3).

Figure 4 summarizes the proposed algorithm for controlling the computer cursor. The webcam captures a frame and passes it to the hand detector. If there is a hand in the frame, the frame is cropped and its center coordinate is calculated. The cropped frame is fed to the classification part, and if it represents a gesture defined in the dataset, it yields a valid label and controls the computer cursor. A sketch of the similarity check and the coordinate mapping follows.
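To make the decision logic and the coordinate conversion concrete, here is a minimal NumPy sketch of the similarity check and the 300 x 300-to-screen mapping described above. It assumes the 1280-d embeddings produced by the similarity network are already available; the function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

SSD_SIZE = 300  # resolution of the SSD input frame, per the text

def build_references(all_embeddings):
    """all_embeddings: dict label -> (N, 1280) array of encoded dataset samples.
    The per-class mean vectors serve as the 4 reference vectors."""
    return {label: emb.mean(axis=0) for label, emb in all_embeddings.items()}

def build_thresholds(heldout_embeddings, references):
    """Thresholds come from validation/test encodings: for each class, the maximum
    Euclidean distance between its held-out samples and its reference vector."""
    return {label: np.linalg.norm(emb - references[label], axis=1).max()
            for label, emb in heldout_embeddings.items()}

def validate(embedding, references, thresholds):
    """Pick the nearest reference; accept the frame only if that distance is
    below the class threshold, otherwise treat it as an unwanted gesture."""
    dists = {label: np.linalg.norm(embedding - ref)
             for label, ref in references.items()}
    nearest = min(dists, key=dists.get)
    return nearest if dists[nearest] <= thresholds[nearest] else None

def to_screen(cx, cy, screen_w, screen_h):
    """Map the crop center from the 300 x 300 SSD frame to screen coordinates."""
    return int(cx / SSD_SIZE * screen_w), int(cy / SSD_SIZE * screen_h)
```

In the paper, the classifier's predicted label and this distance test run independently, and the label is only acted on when the test accepts the frame; a valid palm then moves the cursor to the mapped coordinate, while pointing left or right triggers a click or right-click through the operating system's mouse interface.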
The proposed algorithm was developed on a personal computer running Ubuntu with a GTX 1080 Ti GPU and was implemented in the Python programming language using the Keras framework and the OpenCV library. The algorithm controls the pointer, including clicking, right-clicking, and moving, at 15 frames per second.

To evaluate the classification part, two deep neural network architectures, VGG16 and EfficientNet-B0, are examined. Measures such as the number of parameters, test accuracy, and run-time are considered to compare the performance of the architectures. Even though VGG16 reaches higher accuracy than EfficientNet-B0, it has more than three times as many parameters, so it requires more run-time than EfficientNet-B0. As a result, the classification part uses EfficientNet-B0 as its feature extractor.

Figure 5: Controlling the computer cursor by hand gesture recognition. Recognizing the palm turns the algorithm on; pointing to the left or right performs a click or right-click, respectively; recognizing the user's fist turns the algorithm off.

As mentioned before, the training set has a different distribution from the validation and test sets. In the learning process, approximately 76% of the dataset is used for training and the remaining samples for validation and test. To evaluate the proposed algorithm, three new backgrounds (white, simple, and complex), two distances from the webcam, and two light conditions are considered. For each situation, ten frames including both hands are captured. The three chosen backgrounds are illustrated in Figure 6. Therefore, 80 frames are checked for each hand position, and the results are presented in Table 1. The confusion matrix of the proposed algorithm is shown in Table 2. The lowest accuracy occurs in clicking mode because, in the complex background, a brown closet confused the SSD so that it could not detect the hand gesture correctly.

In this paper, a human-computer interaction algorithm based on hand gesture recognition is proposed to control the computer pointer without any contact. It was also observed that deep learning methods, such as CNNs, are now mostly used instead of classic machine learning and image processing tools. The proposed algorithm can be useful for other intelligent systems in complex backgrounds and under different light conditions and camera perspectives. Since the proposed algorithm is touch-less, it can be used in public places.

References
[1] Human activity recognition adapted to the type of movement.
[2] Robust part-based hand gesture recognition using Kinect sensor.
[3] EEG mouse: A machine learning-based brain-computer interface.
[4] Real-time hand gesture recognition using the Myo armband and muscle activity detection.
[5] 3-D hand motion tracking and gesture recognition using a data glove.
[6] Real-time hand-tracking with a color glove.
[7] SixthSense: A wearable gestural interface.
[8] Vision based gesturally controllable human computer interaction system.
[9] Computer vision based human-computer interaction using color detection techniques.
[10] Hand gesture recognition with skin detection and deep learning method.
[11] A method for controlling the mouse movement using a real time camera.
[12] Hand detection using HSV model.
[13] Human hand gesture based system for mouse cursor control.
[14] Economical and user-friendly design of vision-based natural-user interface via dynamic hand gestures.
[15] Globefire. globefire/hand detection tracking opencv.
[16] Long-range hand gesture recognition with joint SSD network.
[17] Hand gesture recognition based on single-shot multibox detector deep learning.
[18] Light YOLO for high-speed gesture recognition.
[19] Hand posture detection and classification using You Only Look Once (YOLO v2) object detector.
[20] Smart glove and hand gesture-based control interface for multi-rotor aerial vehicles.
[21] Rethinking model scaling for convolutional neural networks.
[22] Indoor and outdoor face recognition for social robot, Sanbot robot as case study.