authors: Sarma, Debajit; Bhuyan, M. K. title: Methods, Databases and Recent Advancement of Vision-Based Hand Gesture Recognition for HCI Systems: A Review date: 2021-08-29 journal: SN Comput Sci DOI: 10.1007/s42979-021-00827-x
Hand gesture recognition is regarded as an important field of research in computer vision with diverse applications in the human-computer interaction (HCI) community. Major uses of gesture recognition cover domains such as sign language, medical assistance, virtual and augmented reality, and so on. The initial task of a hand gesture-based HCI system is to acquire raw data, which can be accomplished mainly by two approaches: sensor based and vision based. The sensor-based approach requires instruments or sensors to be physically attached to the arm/hand of the user to extract information, while vision-based schemes require the acquisition of images or videos of the hand gestures through a still/video camera. Here, we mainly discuss vision-based hand gesture recognition, with a short introduction to sensor-based data acquisition methods. This paper surveys the primary approaches in vision-based hand gesture recognition for HCI. Major topics include different types of gestures, gesture acquisition systems, major problems of the gesture recognition system, and the steps in gesture recognition, namely acquisition, detection and pre-processing, representation and feature extraction, and recognition. We provide an elaborated list of databases and also discuss the recent advances and applications of hand gesture-based systems. A detailed discussion is provided on feature extraction and the major classifiers in current use, including deep learning techniques. Special attention is given to classifying the schemes/approaches at various stages of the gesture recognition system for a better understanding of the topic and to facilitate further research in this area. In this age of technology, deep into the information era, technological progress has reached such a point that nearly everybody, in every nook and corner of the world and independent of any discipline, has interacted with computers in some way or the other. However, a typical user should not need to acquire formal computer education to use computers for basic tasks in everyday life. Human-computer interaction (HCI) is a field of study which aims to facilitate the interaction of users, whether experts or novices, with computers in a simple way. It improves user experience by identifying factors that help to reduce the learning curve for new users and also provides facilities like keyboard shortcuts and other navigational aids for common users. In designing an HCI system, three main factors should be considered: functionality, usability and emotion [73]. Functionality denotes the actions or services a system makes available to the user. However, a system's functionality is only useful if the user can exploit it effectively and efficiently. The usability of a system denotes the extent to which the system can be used effectively and efficiently to fulfill user requirements. A proper balance between functionality and usability results in good system design.
Taking account of emotion in HCI includes designing interfaces that are pleasurable to use from a physiological, psychological, social, and aesthetic perspective. Considering all three factors, an interface should be designed to fit optimally between the user, the device, and the required services. Figure 1 illustrates this concept. In recent years, significant effort has been devoted to body motion analysis and gesture recognition. With the increased interest in human-computer interaction (HCI), research related to gesture recognition has grown rapidly. Along with speech, gestures are an obvious choice for natural interfacing between a human and a computer. Human gestures constitute a common and natural means for nonverbal communication. A gesture-based HCI system enables a person to input commands using natural movements of the hand, head, and other parts of the body [171] (Fig. 2). Since the hand is, apart from the face, the most widely used body part for gesturing [93], hand gesture recognition from visual images forms an important part of this research. Generally, hand gestures are classified as static gestures (or simply postures) and dynamic or trajectory-based gestures. Again, dynamic or trajectory-based gestures can be isolated or continuous. Before going into more depth, let us first see how to acquire data or information for hand gesture recognition. The task of acquiring raw data for hand gesture-based HCI systems can be achieved mainly by two approaches [36]: sensor based and vision based (Fig. 3). Sensor-based approaches require the use of sensors or instruments physically attached to the arm/hand of the user to capture data consisting of the position, motion and trajectories of the fingers and hand. Sensor-based methods are mainly as follows: 1. Glove-based approaches measure position, acceleration, degrees of freedom and bending of the hand and fingers. Glove-based sensors generally comprise flex sensors, gyroscopes, accelerometers, etc. 2. Electromyography (EMG) measures the electrical pulses of human muscles and decodes the bio-signal to detect finger movements. 3. WiFi and radar use radio waves, broad-beam radar or spectrograms to detect changes in signal strength. 4. Others utilize ultrasonic, mechanical, electromagnetic and other haptic technologies. Vision-based approaches require the acquisition of images or videos of the hand gestures through video cameras. 1. Single camera - includes webcams, different types of video cameras and smart-phone cameras. 2. Stereo-camera and multiple camera-based systems - a pair of standard color video or still cameras captures two simultaneous images to give a depth measurement. Multiple monocular cameras can better capture the 3D structure of an object. 3. Light coding techniques - projection of light to capture the 3D structure of an object. Such devices include PrimeSense, Microsoft Kinect, Creative Senz-3D, Leap Motion Sensor, etc. 4. Invasive techniques - body markers such as hand color, wrist bands, and finger markers. However, the term vision based is generally used for capturing images or videos of the bare hand without any glove and/or marker. The sensor-based approach reduces the need for the pre-processing and segmentation stages, which are essential to classical vision-based gesture recognition systems. The architecture of HCI systems can be broadly categorized into two groups based on the number and diversity of their inputs and outputs: unimodal HCI systems and multimodal HCI systems [83] (Fig. 4). 1.
Unimodal HCI systems Unimodal systems can be (a) vision based (e.g., body movement tracking [147], gesture recognition [146], facial expression recognition [115, 189], gaze detection [206], etc.), (b) audio based (e.g., auditory emotion recognition [47], speaker recognition [105], speech recognition [125], etc.), or (c) based on different types of sensors [113]. 2. Multimodal HCI systems Individuals generally use multiple modalities during human-to-human communication. Hence, to assess a user's intention or behavior comprehensively, HCI systems should also incorporate data from multiple modalities [162]. Multimodal interfaces can be built using combinations of inputs, for example, gesture and speech [161] or facial pose and speech [86], and so forth. Multimodal systems have several major applications. (Fig. 3 shows human-computer interaction using: a CyberGlove-II (picture courtesy: https://www.cyberglovesystems.com/products/cyberglove-II/photos-video); b a vision-based system.) The ability to perceive hand gestures visually is essential for the future advancement of vision-based HCI. Static gesture recognition, or pose estimation of the isolated hand under constrained conditions, is largely a solved problem. However, many aspects of dynamic hand gestures must still be addressed, and this remains an interdisciplinary challenge mainly due to three difficulties: • Dynamic hand gestures vary spatio-temporally and carry assorted and different meanings; • The human hand has a complex non-rigid structure, making it hard to recognize; and • There are still many open challenges in computer vision itself, making it an ill-posed problem. A gesture recognition system consists of several subsystems connected in series. Because of this serial arrangement, the overall performance of the system depends on the accuracy of every subsystem. Thus, overall performance is strongly affected by any subsystem that acts as a "weak link". All gesture-based applications depend on the ability of the device to read gestures efficiently and correctly from a stream of continuous gestures. The goal of developing human-computer interfaces based on the human hand has motivated research on continuous hand gesture recognition. Two major challenges in continuous hand gesture recognition are constraints related to segmentation and problems in spotting the hand gestures correctly in a continuous stream of gestures. But there are many other challenges apart from these, which we discuss now. More on constraints in hand gesture recognition can be found in [32] by the same authors. • Challenges in segmentation Exact segmentation of the hand or the gesturing body part from the captured recordings or images still remains a challenge in computer vision owing to limitations like illumination variation, background complexity, and occlusion. -Illumination variation The precision of skin color segmentation techniques is generally affected by illumination variation. Because of lighting changes, the chrominance properties of the skin tones may change, and the skin color will appear different from the original color. Many methods use luminance-invariant color spaces to accommodate varying illumination [27, 66, 89, 90, 173]. However, these methods are useful only for a very narrow range of illumination changes. Moritz et al.
found that the skin reflectance locus and the illuminant locus are directly related, which means that the perceived color is not independent of illumination changes [209]. Sigal et al. used a dynamic histogram segmentation technique to counter illumination changes [199, 200]. In the dynamic histogram method, a second-order Markov model is used to predict the histogram's time-evolving nature. The method is applicable only for a set of images with predefined skin-probability pixel values. This method is very promising for videos with smooth illumination changes but fails for abrupt illumination changes. Also, this method is applicable to the time progression of illumination changes. In many cases where the illumination change is discrete, and the input data is a set of skin samples obtained under randomly changed illumination conditions, this method performs poorly. Stern [130] proposed another adaptive approach. This method is dependent on the accuracy of the skin pixels detected from the face detection method, and it may fail if the face is not detected perfectly or the detected face has a mustache, beard, spectacles, or hair falling over it. Although a color correction strategy is used to convert the colors of the frame in the absence of a face, this solution is temporary and prone to error. In [190], the authors converted the RGB color space into HSV and YCbCr color cues to compensate for illumination variation in the skin-segmentation method used to segment the hand portion from the background. Biplab et al. [31] used a fusion-based image-specific model for skin segmentation to deal with the problem of segmentation under varying illumination conditions. -Background complexity Another serious issue in gesture recognition is the proper segmentation of skin-colored objects (e.g., hands, face) against a complex static/dynamic background. An example of a complex background is shown in Fig. 6. Different types of complex backgrounds exist: • Cluttered background (static) Although the background statistics are fairly constant, the background color and texture are highly varied. This kind of background can be modeled using Gaussian mixture models (GMMs). However, to model backgrounds of increasing complexity, more Gaussians must be included in the GMM. • Dynamic background The background color and texture change with time. Although hidden Markov models (HMMs) are often used to model signals that have a time-varying structure, unless they follow a well-defined stochastic process, their application to background modeling is computationally complex. The precision of skin segmentation methods is limited by the presence or movement of skin-colored objects in the background, which increases false positives. • Camouflage The background is skin-colored or contains skin-colored regions, which may abut the region of interest (e.g., the face, hands). For example, when a face appears behind a hand, this complicates hand gesture recognition, and when a hand appears behind a face, this complicates face region segmentation. These kinds of cases render it nearly impossible to segment the hand or face regions solely from pixel color information. Figure 7 shows a case of camouflage. The major problem with almost all segmentation methods based on color space is that the feature space lacks spatial information about the objects, such as their shape. These are the main issues of hand and face segmentation for gesture recognition. As shown in Fig. 6a, the background might be cluttered and have some skin-colored regions.
In these conditions, it is difficult to segment actual skin regions (see Fig. 6b). The approach in [167] assumes that, owing to the smooth texture of human skin, skin regions in images would be more homogeneous and have fewer edges than skin-colored regions in the background. The performance of this technique degrades when skin regions have many edges because of complex hand poses. Jhang et al. proposed an adaptive skin color segmentation method based on a skin probability distribution histogram (SPDH) [246]. For a particular image, the SPDH plots the total number of pixels having a certain normalized skin probability against that normalized skin probability. Finally, the valley of the SPDH, determined using a trained artificial neural network (ANN), is used as the optimum threshold for the image. The whole system's accuracy depends on how accurate the normalized skin probability is. Also, the color deviation histogram (CDH) method fails if the background color becomes similar to the skin color, as in that case the color deviation will be very small for that group of pixels. Wang et al. combined the RGB and YCgCb color spaces and the texture information of the skin regions to detect the skin [225]. From the results, it is evident that this method fails if there is a color similarity between the background and the skin regions. Avinash et al. proposed a skin color segmentation method by combining the HSI and YCbCr color spaces with some morphological operations and labeling [13]. Their primary assumption was that the background color is different from the skin color, and thus this method fails drastically in the presence of skin-colored backgrounds. Pisharady et al. used biologically inspired features like Gabor wavelets to handle the problem of complex backgrounds [171] (Fig. 7). -Occlusion Another major challenge is mitigating the effects of occlusion in gesture recognition. In single-handed gestures, the hand may occlude itself, apart from being occluded by other objects. The problem is more severe in two-handed gestures, where one hand may occlude the other while performing the gestures. The appearance of the hand is affected by both kinds of occlusion, subsequently hampering the recognition of gestures. In monocular vision-based gesture recognition, the appearance of gesturing hands is view dependent, as shown in Fig. 8. In [120], the pose of the hand and the postures of the fingers were reconstructed using the positions of color markers in the image. Occlusion was handled by predicting the finger positions and by validating 3D geometric visibility conditions. -Multiple cameras with tracking-based gesture recognition Instead of using multiple cameras and hand tracking separately, a fusion-based approach using both of them may be suitable for occlusion handling. Utsumi et al. used an asynchronous multi-camera tracking system for hand gesture recognition [219]. Though multiple camera-based systems are one solution to this problem, these devices are not perfectly accurate. View-invariant 3D models or depth-measuring sensors can provide some more insight into this problem (Fig. 9). • Difficulties related to the articulated shape of the hand The accurate detection and segmentation of the gesturing hand are significantly affected by variations in illumination and shadows, the presence of skin-colored objects in the background, occlusion, background complexity, and various other issues.
The complex articulated shape of the hand makes it even more difficult to model the appearance of the hand for both static and dynamic gestures. Moreover, in the case of dynamic or trajectory-based gestures, tracking the physical movement of the hand is quite challenging due to the varied size, shape and color of the hand. Generally, it is expected that a generic gesture recognition system should be invariant to the shape, size and appearance of the gesturing body part. The human hand has 27 bones: 14 in the fingers, 5 in the palm, and 8 in the wrist (Fig. 10a). The 9 interphalangeal (IP) joints have one degree of freedom (DOF) each for flexion and extension. The 5 metacarpophalangeal (MCP) joints have 2 DOFs each: one for flexion and extension and the other for abduction or adduction (spreading the fingers) in the palm plane. The carpometacarpal (CMC) joint of the thumb, which is also called the trapeziometacarpal (TM) joint, has 2 DOFs along nonorthogonal and nonintersecting rotation axes [74]. The palm is assumed to be rigid. Lee et al. proposed a 27-DOF hand model (Fig. 10b). (Fig. 10: a the skeletal structure of the hand [48]; b the kinematic model [123].) As evident from Fig. 10, the hand is an articulated object with more than 20 DOF. Because of the interdependencies between the fingers, the effective number of DOF reduces to approximately six. Their estimation, in addition to the location and orientation of the hand, results in a large number of parameters to be estimated. Estimation of the hand configuration is extremely difficult because of occlusion and the high degrees of freedom. Even data gloves are not able to acquire the hand state perfectly. Compared with sensors for glove-based recognition, computer vision methods are generally at a disadvantage. To avoid these constraints, [150] tracked air-written gestures through fingertip detection alone, but with the limitation that sign language detection is not possible. For monocular vision, it is impossible to know the full state of the hand unambiguously for all hand configurations, as several joints and finger parts may be hidden from the view of the camera. Applications in vision-based interfaces need to keep these limitations in mind and focus on gestures that do not require full hand pose information. General hand detection in unconstrained settings is a largely unsolved problem. In view of this, systems often locate and track hands in images using color segmentation, motion flow, background subtraction, or a combination of these techniques. • Gesture spotting problem Gesture spotting means locating the beginning and end points of a gesture in a continuous stream of gestures. Once gesture boundaries are resolved, the gesture can be extracted and classified. However, spotting meaningful patterns in a stream of gestures is an exceptionally difficult task, mainly because of two issues: segmentation ambiguity and spatio-temporal variability. For sign language recognition, the framework should support the natural gesturing of the user to enable unhindered interaction. Before feeding the video into the recognition system, non-gestural movements should be removed from the video sequence, since these movements often blend into a gesture sequence. Examples of non-gestural movements include "movement epenthesis" and "gesture co-articulation" (shown in Fig. 11). Movement epenthesis occurs between two gestures, where the current gesture is affected by the preceding or the following gesture.
Gesture co-articulation is an unwanted movement that occurs in the middle of performing a gesture. In some cases, a gesture may be similar to a sub-part of a longer gesture, referred to as the "sub-gesture problem" [7]. When a user tries to repeat the same gesture, spatio-temporal variations in the shape and speed of the hands will occur. The system must accommodate these variations while maintaining an accurate representation of the gestures. Though the static hand gesture recognition problem [52, 59, 60, 156, 174] is almost solved, to date there are only a handful of works dealing with these three problems of continuous hand gesture recognition systems [16-18, 95, 133, 211, 240]. • Problems related to two-handed gesture recognition The inclusion of two-handed gestures in a gesture vocabulary can make HCI more natural and expressive for the user. It can greatly increase the size of the vocabulary because of the different combinations of left- and right-hand gestures. Previously proposed methods include template-based gesture recognition with motion estimation [78] and two-hand tracking with colored gloves [10]. Despite its advantages, two-handed gesture recognition faces some major difficulties: -Computational complexity The inclusion of two-handed gestures can be computationally expensive because of their complicated nature. (Fig. 11: a movement epenthesis problem [18]; b gesture co-articulation, marked with a red line [202]; c sub-gesture problem, where gesture '5' is a sub-gesture of gesture '8' [7].) Moreover, the tracking of two interacting hands in a real environment is still an unsolved problem. If the two hands are clearly separated, the problem can be solved as two instances of the single-hand tracking problem. However, if the hands interact with each other, it is no longer possible to use the same method to solve the problem because of overlapping hand surfaces [160]. • Hand gestures with facial expressions Incorporating facial expressions into the hand gesture vocabulary can make it more expressive, as it can enhance the discrimination of different gestures with similar hand movements. A major application of hand and face gesture recognition is sign language. Little work has been reported in this research direction. Von Agris et al. used facial and hand gesture features to recognize sign language automatically [2]. This approach also has the following challenges: -the simultaneous tracking of both hand and face; -higher computational complexity compared with the recognition of only hand gestures. • Difficulties associated with extracted features It is generally not recommended to consider all the image pixel values in a gesture video as the feature vector. This would not only be time-consuming but would also require a great many examples to span the space of variation, particularly if multiple viewing conditions and multiple users are considered. The standard approach is to compute some features from each image and concatenate them into a feature vector for the gesture model. A gesture model should consider both the spatial and temporal movements of the hand along with its characteristics. No two samples of the same gesture will produce exactly the same hand and arm movements or the same sequence of visual images, i.e., gestures suffer from spatio-temporal variability. Spatio-temporal variability exists even when the same user performs the same gesture on different occasions.
Each time the user performs a gesture, the shape and position of the hand and the speed of the motion normally change. Accordingly, extracted features should be rotation-scaling-translation (RST) invariant. However, different image processing techniques have their own constraints in producing RST-invariant features. Another limitation is that processing a large amount of image information is time-consuming, and thus real-time application may be difficult. The essential task of vision-based systems is to detect and recognize visual signs for communication. A vision-based scheme is more convenient than a glove-based one because of its natural approach: it can be used anywhere within a camera's field of view and is easy to deploy. The fundamental task of vision-based gesture recognition is to acquire visual data from a given scene and try to extract the relevant gestures. This must be performed as a series of stages, namely acquisition, detection and pre-processing; gesture representation and feature extraction; and recognition (Fig. 12). 1. Acquisition, detection and pre-processing The acquisition and detection of the gesturing body part is vital for an effective vision-based gesture recognition (VGR) system. Acquisition involves capturing gestures using imaging devices. The main task of detection and pre-processing is essentially the segmentation of the gesturing body part from images or videos as precisely as possible. 2. Gesture representation and feature extraction The task of the next subsystem in a hand gesture recognition system is to model or represent the gesture. The performance of a gestural interface is directly related to the proper representation of hand gestures. After gesture modeling, a set of features must be extracted for gesture recognition. Different types of features have been identified for representing particular types of gestures [25]. 3. Recognition The last subsystem of a recognition framework has the task of recognition or classification of gestures (Fig. 12 shows the basic architecture of a typical gesture recognition system). A suitable classifier recognizes the incoming gesture parameters or features and groups them either into predefined classes (supervised) or by their similarity (unsupervised) [146]. There are numerous classifiers used for both static and dynamic gestures, each with its own benefits and constraints. Gesture acquisition involves capturing images or videos using imaging devices. The detection and classification of moving objects present in a scene is a key research problem in the field of action/gesture recognition. The most important research challenges are segmentation, detection, and tracking of moving objects in a video sequence. The detection and pre-processing stage mainly deals with localizing gesturing body parts in images or videos. Since dynamic gesture analysis consists of all these subtasks, this stage can be subdivided into segmentation and tracking, or a combination of both. Moreover, segmentation is a vital step for static gestures as well. 1. Segmentation Segmentation is the process of partitioning an image into distinct parts and thereby finding the region of interest (ROI), which is the hand in our case. Precise segmentation of the hand or the body parts from the captured images still remains a challenge owing to well-known limitations in computer vision like illumination variation, background complexity, and occlusion.
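Before detailing the segmentation strategies, the three-stage pipeline of Fig. 12 described above can be outlined as a minimal skeleton. This is only an illustrative sketch; the function names are placeholders, not components defined in the paper.

```python
def recognize_gesture(frames, segment, extract_features, classify):
    """Skeleton of the three-stage pipeline: detection/pre-processing,
    representation/feature extraction, and recognition."""
    feature_sequence = []
    for frame in frames:
        hand_region = segment(frame)                              # stage 1: detect/segment the hand
        feature_sequence.append(extract_features(hand_region))    # stage 2: represent the gesture
    return classify(feature_sequence)                             # stage 3: assign a gesture class
```

Any concrete system plugs its own segmentation, feature extraction and classification modules into these three slots.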
A large portion of the segmentation strategies can be broadly classified as follows (Fig. 13): (a) skin color-based segmentation, (b) region based, (c) edge based, (d) Otsu thresholding, and so on. The simplest method to detect skin regions of an image is through an explicit boundary specification for skin tone in a particular color space, e.g., RGB [69], HSV [205], YCbCr [28] or CMYK [193]. Many researchers drop the luminance component and use only the chrominance components, since the chrominance signals contain the skin color information. This is because such hue-saturation spaces are less sensitive to illumination changes compared with the RGB color space [190]. However, color cues show variations in skin color under different illumination conditions, skin color also changes across human ethnicities, and segmentation is further constrained in the presence of skin-colored objects in the background. Occlusion also leads to many issues in the segmentation process. Recently published literature shows that the performance of model-based approaches (parametric and non-parametric) is better than that of explicit boundary specification-based methods [97]. To improve detection accuracy, many researchers have used parametric and non-parametric model-based approaches for skin detection. For example, Yang et al. [237] used a single multivariate Gaussian to model the skin color distribution. However, the skin color distribution possesses multiple co-existing modes, so the Gaussian mixture model (GMM) [238] is more appropriate than a single Gaussian function. Lee and Yoo [124] proposed an elliptical modeling-based approach for skin detection. Elliptical modeling has lower computational complexity than GMM modeling. However, many true skin pixels may be rejected if the ellipse is small, whereas if the ellipse is larger, many non-skin pixels may be detected as skin pixels. Among the non-parametric model-based approaches for skin detection, the Bayes skin probability map (Bayes SPM) [88], self-organizing map (SOM) [22], k-means clustering [154], artificial neural network (ANN) [33], support vector machine (SVM) [69] and random forest [99] are noteworthy. The region-based approach involves region growing, region splitting and region merging techniques. Rotem et al. [184] combined patch-based information with edge cues under a probabilistic framework. In edge-based techniques, basic edge-detection approaches like the Prewitt filter, Canny edge detector and Hough transform are used. Otsu thresholding is a clustering-based image thresholding method that converts a gray-level image into a binary image with an automatically selected threshold so that only two objects remain, i.e., the hand and the background [145]. In the case of videos, all these methods can be applied with dynamic adaptation. Tracking can also be considered as a part of pre-processing in the hand detection process, as tracking and segmentation together help to extract the hand from the background. Although skin segmentation is perhaps the most preferred technique for segmentation or detection, it is still not very effective under constraints like scene illumination variation, background complexity, and occlusion [190]. Fundamentally, when prior information about moving objects, such as appearance and shape, is not known, pixel-level change can still provide effective motion-based cues for detecting and localizing objects.
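Before turning to those motion-based cues, here is a minimal sketch of the explicit-boundary skin segmentation and Otsu thresholding described above. The Cr/Cb bounds are commonly cited illustrative values, not thresholds prescribed by the paper.

```python
import cv2
import numpy as np

def skin_mask_ycbcr(bgr_frame):
    """Explicit-boundary skin segmentation using only the chrominance channels."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)     # Y, Cr, Cb lower bounds (illustrative)
    upper = np.array([255, 173, 127], dtype=np.uint8)  # Y, Cr, Cb upper bounds (illustrative)
    mask = cv2.inRange(ycrcb, lower, upper)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove small false positives

def otsu_hand_mask(bgr_frame):
    """Otsu thresholding: assumes the hand and background form two gray-level classes."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask
```

Both functions return binary masks of the candidate hand region; in practice they are followed by the tracking or model-based refinement discussed in the text.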
Different methodologies for moving object detection using pixel-level change are background subtraction, inter-frame difference, and three-frame difference [241]. Maintaining a stable background model is computationally expensive, which makes background subtraction vulnerable for long and varied video sequences [241]. Apart from this, the choice of temporal distance between frames is a tricky question; it essentially depends on the size and speed of the moving object. Although inter-frame difference methods can easily detect motion, they perform poorly in localizing the object. The three-frame difference approach [92] uses the previous, current and future frames to localize the object in the current frame. The use of future frames introduces a lag in the tracking system, and this lag is acceptable only if the object is far away from the camera or moves slowly relative to the high capture rate of the camera. Tracking of the hand can be difficult because the hand can move fast and its appearance can change drastically within a few frames. In such cases, model-based algorithms like the mean-shift [56], Kalman filter [44] and particle filter [30] are some of the methods used for tracking. The mean-shift is a purely non-parametric mode-seeking algorithm that iteratively shifts a data point to the average of the data points in its neighborhood (similar to clustering). However, tracking often converges to an incorrect object when the object changes its position very quickly between two neighboring frames. Because of this problem, a conventional mean-shift tracker fails to localize a fast-moving object. [152, 185, 190] used a modified mean-shift algorithm called continuous adaptive mean-shift (CAMShift), where the window size is adjusted so as to fit the gesture area, reflecting any variation in the distance between the camera and the hand. Though CAMShift performs well with objects that have a simple and consistent appearance, it is not robust in more complex scenes. The motion model for the Kalman filter is based on the assumption that the velocity is relatively small while objects are moving, and thus it is modeled by zero-mean, low-variance white noise. One restriction of the Kalman filter is the assumption that the state variables follow a Gaussian distribution; hence, the Kalman filter gives inaccurate estimates for state variables that do not follow a linear Gaussian model. The particle filter is generally preferred over the Kalman filter since it can handle non-linearity and non-Gaussianity. The fundamental idea of the particle filter is to use a weighted set of sample particles to approximate the probability distribution, i.e., the required posterior density function is represented by a set of random samples with associated weights, and estimation is performed based on these samples and weights. Both the Kalman filter and the particle filter have the disadvantage of requiring prior knowledge to model the system. The Kalman filter or particle filter can be combined with the mean-shift tracker for precise tracking. In [224], the authors detected hand movement using AdaBoost with histogram of oriented gradients (HOG) features. Here, the first step is object labeling by segmentation and the second step is object tracking; accordingly, the tracking update is done by calculating the distribution model over the various label values.
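As a minimal sketch of the constant-velocity Kalman filter described above for tracking a hand centroid, the following assumes a linear Gaussian model; the process and measurement noise values are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter over the state (x, y, vx, vy)."""
    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)      # state transition (constant velocity)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)      # only the position is measured
        self.Q = process_var * np.eye(4)              # zero-mean, low-variance process noise
        self.R = meas_var * np.eye(2)                 # measurement noise
        self.x = np.zeros(4)                          # state estimate
        self.P = np.eye(4)                            # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """z: measured hand centroid [x, y], e.g., from skin segmentation."""
        y = np.asarray(z, float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Calling predict() before each frame and update() whenever a measurement is available gives a smoothed centroid trajectory; as noted above, a particle filter would be preferred when the motion or noise is non-linear or non-Gaussian.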
Skin segmentation and tracking together can give quite good performance [68], but researchers have adopted other methods as well where skin segmentation is not very effective. Based on spatio-temporal variation, gestures are mainly classified as static or dynamic. Static gestures are simply the pose or orientation of the gesturing part (e.g., the hand pose) in space and hence are sometimes simply called postures. On the other hand, dynamic gestures are defined by the trajectory or temporal deformation (e.g., shape, position, motion, etc.) of body parts. Again, dynamic gestures can be either of a single isolated trajectory type or of a continuous type, occurring in a stream, one after another. 1. Gesture representation A gesture must be represented using a suitable model for its recognition. Based on feature extraction methods, gesture representations are of the following types: model based and appearance based (Fig. 14). (a) Model based Here, gestures can be modeled using either a 2D model or a 3D model. The 2D model essentially relies on different color-based models like RGB, HSV, YCbCr and so forth, or on silhouettes or contours obtained from 2D images. The deformable Gabarit model relies on sets of active deformable shapes. On the other hand, 3D models can be classified into mesh models [98], geometric models, volumetric models and skeletal models [198]. 3D models have also been used for gesture representation [179], but they have their own disadvantages in terms of precision, accuracy, etc. [32]. (b) Appearance based The appearance-based model attempts to recognize gestures either directly from visual images/videos or from features derived from the raw data. The inputs of such models may be either the image sequences themselves or some features obtained from the images, which can be used for hand tracking or classification purposes. For instance, Wilson and Bobick [228] presented results on actions, mostly hand motions, where the actual gray-scale images (with no background) are used directly in the representation. Rather than using raw gray-scale images, Yamato et al. [234] used body silhouettes, and Akita [5] used body shapes/edges. Yamato et al. [234] used low-level silhouettes of human activities in a hidden Markov model (HMM) system, where binary silhouettes of background-subtracted images are vector quantized and used as input to the HMMs. In Akita's work [5], edges and some simple two-dimensional body configuration information were used to determine the body parts in a hierarchical manner (first find the legs, then the head, arms and trunk) based on stability. When using two- or three-dimensional structural data, individual features or properties must be extracted and tracked from each frame of the video sequence. Consequently, motion understanding is essentially accomplished by recognizing a sequence of static configurations, which requires prior detection and segmentation of the object. Furthermore, since the early days, sequential state-space models like generative hidden Markov models (HMMs) [122] or discriminative conditional random fields (CRFs) [19] have been proposed to model the dynamics of action/gesture videos. Temporal ordering models like dynamic time warping (DTW) [7] have likewise been applied in the context of dynamic action/gesture recognition. In most of the literature, e.g., [165], it is mentioned that gestures are represented by either a model-based or an appearance-based model.
The motion-based methods are also generally included among the appearance-based methods (as shown in Fig. 14). However, here we want to discuss the motion-based methods separately. This is because the shape and appearance of the body/body part depend on many factors, e.g., illumination variation, image resolution, skin color, clothing, etc., whereas motion estimation should be independent of the shape and appearance of the gesturing hand (at least in theory). Optical flow and motion templates are the two major motion-based representation schemes and can be used directly to describe human gestures/actions [191]. There are also a few examples like [191, 192, 232] where these two methods are combined. (a) Optical flow [119] used a combination of HOG and HOF descriptors for recognizing human actions from movies. [39] further proposed computing derivatives of the optical flow, focusing on optical flow differences between frames (motion boundaries). Yacoob and Davis [233] used optical flow estimates to track predefined polygonal patches placed on regions of interest for facial expression recognition. [229] introduced an integrated approach where the optical flow is integrated frame by frame over time by considering the consistency of direction. In [135], the optical flow was used to detect the direction of motion along with the RANSAC algorithm, which in turn helped to further localize the motion points. In [95], the authors used optical flow-guided trajectory images for dynamic hand gesture recognition with a deep learning-based classifier. (b) Motion templates Motion templates are compact representations of a gesture video in which the motion dynamics of the video are encoded into an image. Since a single image summarizes the motion information of the whole video, these images are called motion-fused images, temporal templates or motion templates, and they are useful for video analysis. There are three widely used motion fusion strategies, namely the motion energy image (MEI) and motion history image (MHI) [3, 21], dynamic images (DI) [20], and methods based on PCA [49]. We will not go into the details of these methods; they can be found in [191] by the same authors. 2. Feature extraction After modeling a gesture, the next step is to extract a set of features for gesture recognition. For static gestures, features are obtained from image data such as color and texture, or from posture data such as direction, orientation, shape, and so forth. There are three basic features for the spatio-temporal patterns of dynamic gestures, namely location, orientation and velocity [242]. The last subsystem of a gesture recognition framework has the task of recognition, where a suitable classifier recognizes the incoming gesture parameters or features and groups them either into predefined classes (supervised) or by their similarity (unsupervised). Here, we have tried to classify the hand gesture recognition techniques into categories for easier understanding. Based on the type of input data and the method, the hand gesture recognition process can be broadly categorized into three sections: • Conventional methods on RGB data • Depth-based methods on RGB-D data • Deep networks - a new era in computer vision. Vision-based gesture recognition generally depends on three stages, where the third module consists of a classifier which classifies the input gestures. However, each classifier has its own advantages as well as limitations.
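Returning to the motion templates described above, the following is a minimal sketch of a motion history image (MHI) update using a simple frame-difference motion mask; the duration tau and the motion threshold are illustrative assumptions.

```python
import numpy as np

def update_mhi(mhi, prev_gray, curr_gray, tau=30, motion_thresh=25):
    """One MHI step: pixels with recent motion are set to tau, older motion decays by 1."""
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    moving = diff > motion_thresh
    return np.where(moving, tau, np.maximum(mhi - 1, 0))

# Usage over a gesture clip: initialize mhi = np.zeros(frame_shape), feed consecutive
# gray-scale frames, then use the final MHI (and MEI = mhi > 0) as the compact template
# that a static-image classifier can consume.
```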
Here, we discuss the conventional methods of classification for static and dynamic gestures on RGB data. • Static gesture recognition Static gestures are basically finger-spelled signs in still images without any time dimension. The unsupervised k-means and the supervised k-NN, SVM and ANN are the major classifiers for static gesture recognition. -k-means It is an unsupervised classifier that evaluates k center points to minimize the clustering error, defined by the sum of the distances of all data points to their respective cluster centers. For a set of observations x_1, x_2, ..., x_n in a d-dimensional real vector space, k-means clustering partitions the n observations into a set of k clusters or groups S = {S_1, S_2, ..., S_k} (k ≤ n), whose centers are given by μ_i = (1/|S_i|) Σ_{x ∈ S_i} x. The classifier initially places k cluster centers arbitrarily in the feature space. Each point in the dataset is assigned to the closest cluster center, and the center locations are updated to the average location of each group. This cycle is repeated until a stopping condition is met. The stopping condition can be either a user-specified maximum number of iterations or a distance threshold on the movement of the cluster centers. Ghosh and Ari [59] used a k-means clustering-based radial basis function neural network (RBFNN) for static hand gesture recognition. In this work, k-means clustering is used to determine the RBFNN centers. -k-nearest neighbors (k-NN) k-NN is a non-parametric algorithm where the information in the feature space can be multidimensional. It is a supervised learning scheme with a set of labeled vectors as training data. The number k decides the number of neighbors (close feature vectors) that influence the classification. Commonly, an odd value of k is chosen for two-class classification. Each neighbor may be given the same weight, or more weight may be given to those nearest to the input by applying a Gaussian distribution. In uniform voting, a new feature vector is assigned to the class to which the majority of its neighbors belong. Hall et al. assumed two statistical distributions (Poisson and binomial) for the sample data to obtain the optimal value of k [67]. k-NN has been used in various applications, for example, hand gesture-based media player control [138], sign language recognition [64], and so on. -Support vector machine (SVM) An SVM is a supervised classifier for both linearly separable and non-separable data. When it is not possible to linearly separate the input data in the current feature space, the SVM maps the data into some higher-dimensional space where the data can be linearly separated. This mapping from a lower- to a higher-dimensional space makes the classification of the data simpler and the recognition more accurate. SVMs have been used for gesture recognition on several occasions [41, 98, 132, 183]. SVMs were originally designed for two-class classification, and an extension to multi-class classification is necessary in many cases. Dardas et al. [41] applied an SVM along with bag-of-visual-words features for hand gesture recognition. Weston and Watkins [226] proposed an SVM formulation that solves a multi-class pattern recognition problem in a single optimization stage. However, their optimization procedure was found to be too complicated to apply to real-life pattern recognition problems [77].
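Before continuing with multi-class SVM strategies, here is a minimal sketch of the k-means and k-NN classifiers described above, using scikit-learn; the feature vectors are random placeholders for whatever static-posture descriptors are actually extracted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Placeholder features (e.g., shape/orientation descriptors of static postures).
X_train = np.random.rand(200, 32)          # 200 training postures, 32-D features
y_train = np.random.randint(0, 5, 200)     # 5 hypothetical posture classes
X_test = np.random.rand(10, 32)

# Unsupervised k-means: iteratively assigns points to the nearest of k centers.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
print("cluster assignments:", kmeans.predict(X_test))

# Supervised k-NN: an odd k, with closer neighbors weighted more heavily.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_train, y_train)
print("predicted classes:", knn.predict(X_test))
```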
Rather than using a single optimization, multiple binary classifiers can be combined to handle multi-class classification problems, for example, through the "one-against-all" and "one-against-one" strategies. Murugeswari and Veluchamy [151] used a "one-against-one" multi-class SVM for gesture recognition. It was found that the "one-against-one" strategy performs better than the other strategies [77]. -Artificial neural network (ANN) The ANN is a statistical learning algorithm used for different tasks like function approximation, pattern recognition and classification. ANNs can be used as biologically inspired supervised classifiers for gesture recognition, where training is performed using a set of labeled input data. The trained ANN assigns new input data to the labeled classes. ANNs can be used to recognize both static [59] and dynamic hand gestures [157, 163]. [157] applied an ANN to classify gesture motions using a 3D articulated hand model; a dataset collected using the Kinect® sensor [163] was used for this. Using input from a data glove, Kim et al. [102] applied ANNs to recognize Korean sign language from the movements of the hand and fingers. A limitation of the traditional ANN design is its inability to handle temporal sequences of features efficiently and effectively [165]. In particular, it cannot compensate for changes in temporal shifts and scales, especially in real-time applications [177]. Among several modified structures, multi-state time-delay neural networks [239] can handle such changes to some extent using dynamic programming. Fuzzy neural networks have also been used to recognize gestures [220]. • Dynamic gesture recognition Dynamic gestures or trajectory-based gestures are gestures having trajectories with temporal information in terms of video frames. Dynamic gestures can be either of a single isolated trajectory type or of a continuous type, occurring one after another in a stream. The recognition performance for dynamic gestures, especially continuous gestures, depends fundamentally on the gesture spotting scheme. Dynamic gesture recognition schemes can be categorized into direct and indirect methods [7]. Approaches in the direct method first detect the temporal boundaries of the performed gestures and then apply the same standard techniques as isolated gesture recognition. Typically, motion cues like speed, acceleration and trajectory curvature [242], or specific starting/ending marks such as an open/closed palm [7], can be used for boundary detection. In the indirect approach, on the other hand, temporal segmentation is intertwined with recognition. In indirect methods, gesture boundaries are typically detected by finding time intervals that give good scores when matched with one of the gesture classes in the input sequence. Such procedures are vulnerable to false positives and recognition errors, as they have to deal with two vital constraints of dynamic gesture recognition [146]: (1) spatio-temporal variability, i.e., a user cannot reproduce the same gesture with exactly the same shape and duration, and (2) segmentation ambiguity, i.e., problems caused by erroneous boundary detection. Indirect methods try to minimize these problems as much as possible. Indirect methods can be of two types (Fig. 15 summarizes the conventional dynamic gesture recognition techniques): non-probabilistic, i.e., (a) dynamic programming/dynamic time warping, (b) ANN; and probabilistic, i.e., (c) HMM and other statistical methods, (d) CRF and its variants.
Some other common techniques are eigenspace-based methods [164], curve fitting [196], finite-state machines (FSM) [16, 19] and graph-based methods [194]. -Dynamic programming/dynamic time warping (DTW) Dynamic time warping (DTW) is a template-matching approach based on dynamic programming, and it has been used extensively in isolated gesture recognition. It can find the optimal alignment of two signals in the time domain. Each element in a time series is represented by a feature vector, and the DTW algorithm calculates the distance between each possible pair of points in the two time series in terms of their feature vectors. The basic steps of DTW are as follows: • Consider two time series P = (p_1, p_2, ..., p_m) and Q = (q_1, q_2, ..., q_n), where p_i and q_i are the feature vectors for the ith elements of the corresponding time sequences. • Compute a local distance d(p_i, q_j) for every pair of elements and accumulate it along the best warping path, D(i, j) = d(p_i, q_j) + min{D(i−1, j), D(i, j−1), D(i−1, j−1)}, so that D(m, n) gives the optimal alignment cost. DTW has been applied for gesture classification by several authors [7, 80, 127, 211]. Alon et al. [7] proposed a DTW-based approach that can handle the sub-gesture problem. Lichtenauer et al. [127] introduced a hybrid method by applying statistical DTW (SDTW) only for time warping and another classifier on the warped features. -Hidden Markov model (HMM) Though the HMM originally emerged in the field of speech recognition, it is now one of the most widely used techniques for gesture recognition, with numerous variants. The HMM is used extensively because it can model the spatio-temporal variability of gesture videos. Since a trajectory-based gesture is a series of images, past knowledge is needed to help the system recognize gestures, and an HMM provides exactly this. Before we elaborate on the HMM, let us understand a traditional Markov process over a set of states S = {s_1, s_2, ..., s_N} with a state sequence Q = q_1, q_2, ..., q_T. A stochastic process has the nth-order Markov property if the current event's conditional probability density depends only on the n most recent events. For n = 1, the process is called a first-order Markov process, where the current event depends only on the previous event. This is a useful assumption for hand gestures, where the positions and orientations of the hands are treated as events. The HMM has two special properties for encoding hand gestures: (a) it assumes a first-order model, i.e., it encodes the present time (t) in terms of the previous time (t − 1), the Markov property of the underlying unobservable finite-state Markov process; and (b) it has a set of random functions, each associated with a state, that produce an observable output at discrete intervals. In this way, an HMM is a "doubly stochastic" process [176]. The states in the hidden stochastic layer are governed by a set of probabilities: i. the state transition probability distribution A, which gives the probability of transition from the current state to the next possible state; ii. the observation symbol probability distribution B, which gives the probability of an observation for the present state of the model; iii. the initial state distribution Π, which gives the probability of a state being the initial state. An HMM is therefore expressed as λ = (A, B, Π), where the initial probability distribution is Π = {π_j} with π_j = P(q_1 = s_j) for 1 ≤ j ≤ N. The modeling of a gesture involves two phases: feature extraction and HMM training. In the first phase, a particular gesture sequence is represented by a set of feature vectors. Each of these feature vectors describes the trajectory of the hand corresponding to a particular state of the gesture. The number of such states depends on the nature and complexity of a gesture. In the second phase, the vector set is used as input to the HMM.
The global HMM structure is formed by connecting in parallel the trained HMMs (λ_1, λ_2, ..., λ_G), where G is the number of gestures to be recognized. For dynamic gestures, temporal components like the start state, the end state, and the set of observation sequences (e.g., position) are mapped by an HMM classifier using a set of boundary conditions. For a given observation sequence, the key issues of an HMM are evaluation (computing the likelihood of the observation sequence given a model), decoding (finding the most likely hidden state sequence) and training (estimating the model parameters). HMMs are frequently applied for trajectory-based gesture recognition [72, 117, 122, 166]. But the main disadvantage of the HMM is that every gesture model has to be represented and trained separately as a new class, independent of anything else already learned. -Conditional random field (CRF) The CRF is basically a variant of the Markov model with some added advantages. The HMM requires strict independence assumptions across multivariate features and conditional independence between observations. This is generally violated in continuous gestures, where observations depend not only on the state but also on past observations. Another disadvantage of using the HMM is that the estimation of the observation parameters needs a huge amount of training data. The distinction between the HMM and the CRF is that the HMM is a generative model that defines a joint probability distribution to solve a conditional problem, thus spending modeling effort on the observations in order to compute the conditional probability. Moreover, one HMM is constructed per label or pattern, and the HMM assumes that all the observations are independent. On the other hand, the CRF is a discriminative model that uses a single model of the probability of the whole label sequence conditioned on the given observation sequence. CRFs can easily represent contextual dependencies and have computationally attractive properties: they support efficient recognition using dynamic programming, and their parameters can be learned using convex optimization. Both the HMM and the CRF can be used for labeling sequential data. For a given observation sequence x, we want to choose a label sequence y* such that the conditional probability P(y|x) is maximized, that is, y* = argmax_y P(y|x) (Eq. 2). Maximum entropy Markov models (MEMMs) are discriminative models where each state has an exponential model that takes the observation sequence as input and outputs a probability distribution over the next possible states. Each of the P(y_t | y_{t−1}, x) is an exponential model of the form P(y_t | y_{t−1}, x) = (1/Z(x, y_{t−1})) exp(Σ_a λ_a f_a(x, y_t)), where Z is a normalization constant and the summation is over all features f_a with weights λ_a. However, the MEMM suffers from the label bias problem, i.e., the transition probabilities of leaving a given state are normalized only for that state (local normalization). MEMMs have a non-linear decision surface because the current observation can only influence which successor state is chosen; it cannot shift probability mass toward that state. To avoid this effect, a CRF uses an undirected graphical model that defines a single log-linear distribution over the joint vector of the whole class label sequence given a specific observation sequence, and accordingly the model has a linear decision surface. Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G.
Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. The conditional distribution factorizes over the cliques of the graph as P(y | x) = (1/Z(x)) Π_C ψ_C(y_C, x), where Z is the normalization constant and ψ_C is the potential function over clique C, with ψ_C(y_C, x) = exp(λ · f(y_C, x)), where f(.) is the feature vector defined over the clique and λ is the corresponding weight vector for those features. Bhuyan et al. [19] proposed a recognition method applying a CRF to a novel set of motion chain code features. Sminchisescu et al. [203] compared algorithms based on the CRF and the MEMM for recognizing human motion in video sequences; the undirected conditional model (CRF) and the directed conditional model (MEMM), with different windows of observations, were compared with the HMM. Both the MEMM and the HMM have trouble capturing long-range observation dependencies that are useful in discriminating among various gestures. It is seen that CRFs have better recognition performance than MEMMs, which in turn typically outperform traditional HMMs. This is because the CRF applies an undirected graphical model to overcome the label bias problem present in maximum entropy Markov models (MEMMs), where states with low-entropy transition distributions effectively ignore their observations. The main constraint of the CRF is that training is more time-consuming, ranging from several minutes to several hours for models having longer windows of observations (compared with seconds for HMMs, or minutes for MEMMs), on a standard desktop PC. -Some other classification methods Here, we discuss some other classification techniques that have also been used for the classification of gestures. Patwardhan and Roy [164] presented an eigenspace-based method to represent trajectory-based hand gestures containing both shape and trajectory information in a rotation-, scale- and translation (RST)-invariant manner. Shin et al. [196] presented a curve-fitting-based geometric framework using Bezier curves fitted to the 3D motion trajectory of the hand. The gesture velocity is incorporated in the algorithm to enable trajectory analysis and classification of dynamic gestures having variations in velocity. Bhuyan et al. [16, 19] represented the keyframes of a gesture trajectory as a sequence of states ordered in the spatio-temporal space, which constitutes a finite-state machine (FSM) that classifies the input. Graph-based frameworks have also been applied as a powerful scheme for pattern recognition problems but were left practically unused for a long period of time due to their high computational cost. [194] used graphs for gesture matching in an eigenspace to handle hand occlusion (Fig. 16). Depth information is largely invariant to illumination variation and skin color and offers a fairly clear segmentation from the background. Therefore, the major problems in segmentation, like illumination variation and occlusion, can be handled well with the help of depth information to a great extent. Due to these advantages, depth-measuring cameras have been used in the field of computer vision for many years. However, the applicability of depth cameras was restricted because of their excessive cost and low quality.
With the introduction of low-cost color-depth (RGB-D) cameras like Kinect® by Microsoft, the Leap Motion Controller (LMC) by Leap Motion, Intel RealSense®, Senz3D® by Creative and DVS128® by iniLabs, a new revolution emerged in gesture recognition by providing high-quality depth images that can handle issues like complex backgrounds and variation in illumination. Out of all these, hand gesture recognition on Kinect®-based datasets and 'one-shot learning' with RGB-D data are the prominent approaches most discussed in depth-based hand gesture recognition. • Kinect®-based methods Kinect® has a combined RGB and IR camera along with a depth sensor [248]. It uses the infrared projector and sensor for depth computation and an RGB camera for capturing RGB data only. The infrared projector projects a predefined pattern on the objects and a CMOS sensor captures the deformations in the reflected pattern. Depth information is then calculated by mapping a three-dimensional view of the scene obtained from the deformation information. Kinect® acquires RGB-D information by combining structured light with two classic computer vision techniques: depth from focus and depth from stereo. The skeletal information obtained from these RGB-D sensors is converted into more meaningful and reliable features, and algorithms are developed for robust gesture classification. Classification of hand gestures is particularly difficult because of the complex articulation and relatively small area of the hand region. Kinect® is helpful in addressing these central issues in computer vision [163, 181, 212]. It also has diverse applications ranging from gaming to the classroom [71, 116]. • Other depth sensor-based methods The Leap Motion Controller (LMC) and Intel RealSense® are the most used RGB-D sensors for HCI applications apart from Kinect®. RealSense® is more robust to self-occlusions and it can capture pinching gestures. The LMC is another RGB-D sensor whose purpose is to locate 3D fingertip positions instead of the whole-body depth information, as is the case with the Kinect® sensor. It can detect only fingertips lying parallel to the sensor plane, but with high accuracy. In [133], a feature vector with depth information is computed using a leap motion sensor and fed into a hidden conditional neural field (HCNF) to classify dynamic hand gestures. Leap motion sensors can also be applied in other applications, e.g., virtual environments [178] and sign language recognition [172]. • One-shot learning methods on RGB-D data Using deep learning, human-level performance has become achievable on complex image classification tasks. However, these models rely on a supervised training paradigm and their success is heavily dependent on the availability of labeled training data. Also, the classes that the models can recognize are limited to those they were trained on. This makes these models less useful in realistic scenarios for classes where enough labeled data is not available during training. Also, since it is practically not possible to train on images of all possible objects, the model is expected to recognize images from classes with a limited amount of data in the training phase, or precisely with a single example. So, in the case of a small dataset, 'one-shot learning' may be very useful. Various researchers [108, 221, 230] have used one-shot learning in both deep learning and non-deep learning paradigms for the recognition of hand gestures, especially with RGB-D data.
Wu et al. [230] presented a framework to learn gestures from just one training sample per class, i.e., 'one-shot learning'. Features are obtained based on the extended motion history image (Extended MHI) and the gestures are recognized based on the maximum correlation coefficient. The extended MHI improves the representation of the MHI by compensating for static regions and repeated activities. A multi-view spectral embedding (MSE) scheme is utilized to fuse the RGB and depth information in a physically meaningful way. The MSE algorithm finds the intrinsic relationship between RGB and depth features, improving the recognition rate of the algorithm. In [136], the authors used a methodology combining MHI with statistical measures and frequency-domain transformation on depth images for one-shot-learning hand gesture recognition. Due to the availability of the depth information, the background-subtracted silhouette images were obtained using a simple mask threshold.
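As a small illustration of the simple mask-threshold step mentioned above, the following sketch segments the hand from a depth map by keeping only pixels close to the nearest valid depth value. The sensor convention (missing depth encoded as 0) and the distance margin are assumptions for illustration, not settings taken from the cited works.

```python
# Minimal sketch of depth-based hand segmentation via a simple range threshold.
# It assumes the gesturing hand is the closest object to an RGB-D sensor; all
# depth values and the margin below are illustrative only.
import numpy as np

def segment_hand(depth_mm, margin_mm=120, invalid=0):
    """Return a binary mask of pixels within `margin_mm` of the nearest valid depth."""
    valid = depth_mm != invalid                      # assume missing depth is encoded as 0
    if not np.any(valid):
        return np.zeros_like(depth_mm, dtype=bool)
    nearest = depth_mm[valid].min()                  # hand assumed closest to the camera
    return valid & (depth_mm <= nearest + margin_mm)

# Toy 6x6 depth map (millimetres): a "hand" blob at ~600 mm in front of a ~2000 mm background
depth = np.full((6, 6), 2000, dtype=np.int32)
depth[2:5, 2:5] = 600
print(segment_hand(depth).astype(int))               # 1s mark the segmented hand region
```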
Though the idea of artificial intelligence (AI) is quite ancient, modern AI first came into the picture around the mid-twentieth century. AI aims at developing intelligence in machines so as to make them work and respond like humans. This can be achieved when the machines are made to have certain traits, e.g., reasoning, problem solving, perception, learning, etc. Machine learning (ML) is one of the cores of AI. There are a large number of applications of ML in many aspects of modern human society. Consumer products like cameras and smartphones are the best examples where ML techniques are being employed increasingly. In the field of computer vision, ML techniques have been vigorously used in different applications like object detection, image classification, face recognition, gesture and activity recognition, semantic segmentation, and many more. In conventional ML, engineers and data scientists have to identify useful features and handcraft the feature extractor manually, which requires considerable engineering skill and domain knowledge. To identify important and powerful features, they must have considerable domain expertise. The issue of "handcrafting features" can be addressed if good features can be learned automatically. This automatic learning of features can be done by a learning approach called "representation learning". These are methods that enable a machine to automatically learn the representations that are crucial for detection or classification. Recently, deep learning has shown outstanding performance, outperforming "non-deep" state-of-the-art methods in the action and gesture recognition fields. Deep learning, a subfield of ML, is based on representation learning methods having multiple levels of representation. Deep learning is a family of ML algorithms in which the extraction of multiple levels of features is possible. In several fields, such as computer vision, deep learning methods have been proven to perform much better than conventional ML methods. The main reason for deep learning having an upper hand over classical ML is the fact that the feature learning mechanism at these different levels of representation is fully automatic, thereby allowing the computational model to implicitly capture intricate structures embedded in the data. The deep learning methods are said to have a deep architecture because of the hierarchical processing of information at different levels of abstraction, where higher-level features are expressed in terms of lower-level features. This has propelled the development of learning powerful and effective representations directly from raw data, and deep learning provides a feasible way of automatically learning different levels of image-specific features by utilizing multiple layers. Deep networks are capable of discovering salient latent structures within unlabeled and unstructured raw data and can be utilized for both feature extraction as well as classification [110]. The recent popular deep learning methods like the convolutional neural network (CNN), recurrent neural network (RNN) and long short-term memory (LSTM) have demonstrated competitive performance in both image/video representation as well as classification. But deep learning approaches have mainly two inherent requirements: huge data for training purposes and expensive computation. In this modern era, the abundance of high-quality, easily available labeled datasets from different sources, along with parallel graphics processing unit (GPU) computing, has played a vital role in the success of deep learning by fulfilling these requirements. We will see all these methods one by one, but before that let us discuss one major problem of deep learning, the requirement of huge data, and how various researchers have tried to overcome it through data augmentation when the database is limited. • The need for data augmentation in deep learning methods Contrary to hand-crafted features, there is growing interest in features learned and represented by deep neural networks [12, 29, 37, 43, 58, 85, 94, 101, 110, 121, 129, 148, 149, 153, 169, 201, 215, 217, 223, 249, 250]. But the fundamental necessity of deep learning methods is a large number of training examples. Various researchers have stressed the significance of utilizing diverse training samples for CNNs/RNNs [110]. For datasets with restricted variety, they have proposed data augmentation techniques in the training stage to prevent CNNs/RNNs from overfitting. Krizhevsky et al. [110] utilized different data augmentation procedures in training for the 1000-class recognition problem. Simonyan and Zisserman [201] applied spatial augmentation to every image frame to train CNNs for video-based human action classification. However, these data augmentation strategies were restricted to spatial variations only. Pigou et al. [169] temporally translated video frames, apart from applying spatial transformations, to add variation to video sequences containing dynamic movement. Molchanov et al. [148] applied space-time video augmentation methods to keep the 3D-CNN from overfitting.
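To make the above concrete, the following is a minimal sketch of the kind of spatial (random crop and flip) and temporal (frame shift) augmentations applied to video clips. The clip layout, crop size and shift range are arbitrary illustrative choices, not the exact settings of the cited papers.

```python
# Illustrative sketch of simple spatial and temporal augmentations for
# video-based gesture data. A clip is a (T, H, W, C) array; the crop size,
# shift range and flip probability are arbitrary choices.
import numpy as np

rng = np.random.default_rng(42)

def spatial_augment(clip, crop=100, flip_p=0.5):
    """Random crop (same window for every frame) plus optional horizontal flip."""
    t, h, w, c = clip.shape
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    out = clip[:, y:y + crop, x:x + crop, :]
    if rng.random() < flip_p:
        out = out[:, :, ::-1, :]                  # mirror gestures left-right
    return out

def temporal_augment(clip, max_shift=4):
    """Random circular temporal shift, adding variation in gesture onset."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(clip, shift, axis=0)

clip = rng.random((32, 112, 112, 3), dtype=np.float32)   # dummy 32-frame RGB clip
aug = temporal_augment(spatial_augment(clip))
print(aug.shape)                                          # (32, 100, 100, 3)
```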
Hubel and Wiesel proposed the model of the cat's visual cortex, which later helped in the development of CNNs. The first neural network architecture for visual pattern recognition was presented by K. Fukushima in 1980 and was named the "neocognitron" [57]. This network was based on unsupervised learning. Finally, in the late 90s, Yann LeCun and his collaborators developed the CNN, which showed exciting results in various recognition tasks [121]. But until 2012, CNNs did not evolve much owing to the requirements of deep learning methods mentioned above. After the work of Krizhevsky et al. [110], various researchers applied CNNs in various domains for classification as well as other purposes. Generally, the 2D-CNN is used in the case of images, where only spatial information can be accessed, whereas, for video processing, the 3D-CNN (C3D) is quite effective as it can extract both spatial and temporal information. A fusion-based approach with a CNN as the trajectory-shape extractor of a gesture video and a CRF as the temporal feature extractor is proposed in [235]. In [190], the authors used a CNN for the recognition of hand gestures using trajectory-to-contour-based images obtained through a skin segmentation and tracking method. In [245], the authors used pseudo-color-based MHI images as input to convolutional networks. The authors of [96] proposed a model for isolated gesture recognition using optical flow, where the trajectory-contour of the moving hand with varied shape, size and color is detected and the hand gesture is classified through a VGG16 CNN framework. • 3D-CNN (C3D) model The 2D-CNN can handle 2D images for various tasks like recognition, acting on the raw data directly, whereas 3D-CNN models, also called C3D, act on videos for gesture or action detection. The framework obtains features from the spatial as well as the temporal dimension by performing convolutions in 3D, thereby capturing both the spatial and the motion information present in the video sequence. The authors of [85] introduced a C3D network for human action recognition. To examine the progression of short video clips and aggregate the network's responses over all the clips, Tran et al. [215] employed a C3D to learn the spatio-temporal features from sliced video clips and then fuse these features to make the final classification. The authors of [223] used a temporal segment network that works on video segments called snippets for spatio-temporal evaluation in action recognition. The 3D-CNN (C3D) is quite effective as it can extract both spatial and some temporal information at a lower cost in terms of both data and computation compared to RNN/LSTM [101, 192]. • Two-stream model Ciregan et al. [37] demonstrated the advantage of utilizing multiple CNNs in parallel, improving the performance of the whole network by 30-80% for different image classification tasks. Also, for large-scale video classification, Karpathy et al. [94] found that the best results can be obtained by combining two separate streams of CNNs trained with original and spatially cropped video clips. Simonyan and Zisserman [201] proposed separate CNN streams for spatial and temporal information extraction, which are later combined in a late-fusion scheme; here, one stream uses optical flow for action recognition. To recognize sign language gestures, Neverova et al. [153] utilized CNNs to combine color and depth information from hand areas and upper-body skeletons. A two-stream model with two C3D networks that takes RGB frames and the optical flow computed from the RGB stream as inputs was used in [101] for action recognition. The authors of [250] used a hidden two-stream CNN model whose input is a raw video sequence and which can detect the activity class without explicitly computing optical flow. Here the network predicts the motion information from consecutive frames through a temporal stream CNN, which makes the network 10× faster [250] by avoiding the time-consuming optical flow computation.
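As an illustration of the 3D convolutions underlying C3D-style spatio-temporal models, the following is a minimal sketch in PyTorch. The layer sizes and the class count are arbitrary; this is not the architecture of any cited work.

```python
# Minimal 3D-CNN sketch in PyTorch, showing how spatio-temporal features are
# extracted with 3D convolutions over (time, height, width), as in C3D-style
# models. Layer sizes are arbitrary and purely illustrative.
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # convolve jointly over time and space
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                       # pool only spatially at first
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),                               # then pool in time and space together
            nn.AdaptiveAvgPool3d(1),                       # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                                  # x: (batch, 3, T, H, W)
        f = self.features(x).flatten(1)
        return self.classifier(f)

model = Tiny3DCNN(n_classes=10)
clip = torch.randn(2, 3, 16, 112, 112)                     # two 16-frame RGB clips
print(model(clip).shape)                                    # torch.Size([2, 10])
```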
• Long-term video prediction-RNN/LSTM/GRU A CNN can handle only limited local temporal information, and consequently researchers have moved towards the RNN, which can deal with temporal information using recurrent connections in the hidden layers [12]. However, the major disadvantage of the RNN is its short-term memory, which is inadequate for real-life variations in gestures or actions. To address this issue, long short-term memory (LSTM) [58] was introduced, which can handle longer-range temporal structure. Gestures or actions in a video sequence can be considered as a sequential temporal evolution of the body/body-part in a space-time representation. So, 3D-CNN/RNN/LSTM are the networks generally applied in video-based action/gesture recognition. In addition to 3D-CNNs, recurrent neural networks have also been applied for dynamic hand gesture classification [149, 249]. The authors of [29] extracted hand trajectory and hand posture features from RGB-D data and then used a two-stream recurrent neural network (2S-RNN) to fuse the multi-modal features. Spatio-temporal graphs are good for representing long-range spatio-temporal variations. Hence, a combination of high-level spatio-temporal graphs and RNNs can also be applied to resolve the issue of spatio-temporal representation in RNNs [84]. The short-term memory problem and the vanishing/exploding gradient problem of the RNN can be handled to some extent by adding 'gates' in the LSTM. Hence, networks based on LSTM can be efficiently utilized for the representation of dynamic gestures [43, 129, 217]. However, in both RNN and LSTM, the problem of vanishing/exploding gradients is more acute than in CNNs, and they are more data-hungry. Gated recurrent units (GRUs) are simplified LSTM units with adaptive gates and fewer parameters, which makes the training process faster. The authors of [197] presented a skeleton-based dynamic hand gesture recognition technique that divides geometric features into several parts and uses a gated recurrent unit-recurrent neural network (GRU-RNN) for each feature part. Since each divided feature part has fewer dimensions than the whole feature, the number of hidden units needed for optimization is decreased. Consequently, the scheme achieved improved recognition performance with fewer parameters. Thus, more or less, deep learning techniques can give exceptional performance in both feature extraction and recognition tasks due to their inherent feature learning ability. The powerful and effective algorithms of deep networks are capable of tackling complex pattern recognition and optimization tasks.
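As an illustration of the recurrent models discussed above, the following is a minimal GRU-based gesture classifier sketch in PyTorch: a sequence of per-frame feature vectors (e.g., skeleton joints or geometric features) is summarized by a GRU, and its last hidden state is classified. All dimensions are illustrative only and do not correspond to any cited method.

```python
# Minimal recurrent gesture classifier in PyTorch, in the spirit of the
# GRU/LSTM-based approaches discussed above. Per-frame feature vectors are
# summarized by a GRU; the last hidden state is mapped to gesture logits.
import torch
import torch.nn as nn

class GRUGestureClassifier(nn.Module):
    def __init__(self, feat_dim=42, hidden=64, n_classes=14):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, T, feat_dim)
        _, h_n = self.gru(x)                   # h_n: (1, batch, hidden), final time step
        return self.fc(h_n[-1])                # logits over gesture classes

model = GRUGestureClassifier()
seq = torch.randn(4, 30, 42)                   # four 30-frame sequences of 42-D features
print(model(seq).shape)                         # torch.Size([4, 14])
```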
The development of standard hand gesture datasets is an essential requirement for the reliable analysis and verification of hand gesture recognition techniques. There are a few freely accessible hand gesture databases that have been created for the purpose of hand gesture analysis and comparative studies. Several authors have come out with such lists of databases [11, 12, 35, 170]. But most of them have not given a detailed analysis in a concise way, though all of them tried to include the most-used hand and human activity databases. In this work, we have tried to collate a comprehensive list of the 50 most used freely accessible hand gesture databases with a brief description of each in two tables. Table 2 mainly gives the content and description, whereas Table 3 gives the links to the publicly available sources. The vision-based hand gesture approach is more intuitive and suitable compared to the glove-based approaches used in HCI since it can be used within the field of view of a camera anywhere and at any time. The operator does not need to master any special hardware and, thus, it is easier to deploy. A vision-based approach also enables a variety of gestures to be used that can be updated in software. Computer vision methods can enable HCI that is difficult or impossible to achieve with other modalities. Visual information is important in human-human communication because meaning is conveyed through identity, facial expression, posture, gestures, and other visually observable attributes. Therefore, intuitively, it is possible to have natural HCI by sensing and perceiving these visual cues from video cameras placed appropriately in the environment. The major benefit of VGR is that it requires only modest, low-cost devices as input. Even an advanced camera can be integrated on a single chip. Large-scale manufacturing is thus much simpler in contrast to other input devices like data gloves with mechanical components. Furthermore, the expense of image processing equipment can be minimized since most computers now have a central processing unit and graphics processing unit fast enough to perform these computer vision tasks. While other input devices like a mouse, joystick, and trackpad are restricted to a particular function, camera-based computer vision techniques are flexible enough to offer a whole range of possible future applications in human-computer interaction as well as in user authentication, video conferencing, and distance education. Another significant benefit of computer vision is that it is non-intrusive. Cameras are open input devices that do not need direct contact with the user to sense activities. The user can communicate with the computer without wires and without manipulating intermediary devices. Moreover, humans are more comfortable communicating with body postures or gestures than using mechanical techniques like clicking a mouse, pressing a keyboard, or touching a touch-sensitive screen, and thus experience more comfortable and more natural interactions than with traditional interaction techniques. These are the major advantages of a VGR system, including a natural, contact-free method of interaction. However, vision-based gesture interfaces also have many disadvantages, including user fatigue, cultural differences, the requirement of high-speed processing, and noise sensitivity. Moreover, they are more difficult to realize because current computer vision schemes are still limited in processing such highly articulated, non-convex, and flexible objects as the human hand. Vision-based recognition is extremely challenging not just because of its diverse contexts, multiple interpretations, and spatio-temporal variations, but also because of the complex non-rigid properties of the human hand. The current classifiers utilized for vision-based gesture recognition are not equipped to handle all the gesture classification issues at the same time. Each of them has at least one drawback limiting the overall performance of the gesture recognition methods. Despite all the drawbacks, the number of VGR systems is expected to increase in daily life; and as such, interactive technology needs to be designed effectively to provide a more natural way of communication. Therefore, currently, vision-based gesture recognition has become a major research field in HCI and there are various real-life implementations of VGR. More specifically, hand gesture-based VGR systems can provide a non-contact input modality.
The widespread use of gesture-based interfaces for vision-based HCI is possible due to the advantages mentioned above. One of the breakthroughs in VGR is the introduction of Microsoft Kinect® as a contact-less interface [248]. The Kinect has huge potential in different applications, for example, medical care [82], educational training [116], and so on. However, its poor outdoor performance and limited depth resolution restrict its usability. Recently, SoftKinetic's gesture control technology has been incorporated in BMW vehicles to permit drivers to navigate the in-vehicle infotainment system easily [1]. The most recently implemented and some proposed applications of VGR include sign language recognition [127], virtual reality (VR) [187], virtual games [112], augmented reality (AR) [180], smart video conferencing [142], smart home and office [155], medical services and clinical help (MRI navigation) [82], robotic surgery [213], wheelchair control [104], driver monitoring [204], vehicle control [168], interactive presentation modules [244], virtual classrooms [71], e-commerce [9], etc. Some of the significant applications (see Fig. 17) of hand gesture-based HCI are illustrated below: • Augmented reality and virtual reality Hand gestures can be very useful for realistic manipulation of virtual objects in virtual environments [187] and as an interface for virtual gaming [112]. Many problems like detection, registration and tracking can be solved using augmented reality techniques [180]. • Sign language recognition Hand gestures are useful for sign language recognition for the deaf-mute community [127]. The system mainly acts as an interpreter between the deaf/mute and others. • Vehicle monitoring and vehicle control Gesture-based interfaces may be used to operate a vehicle [168], and also for driver monitoring [204]. • Healthcare and medical assistance Gesture-based interfaces have many applications in healthcare and medicine; for example, MRI navigation in the operating room [82], medical volume visualization tasks, and browsing radiology images are some of the possible applications. Gestures can also be used to train physicians in robotic surgery [213] and for medical assistance to physically disabled persons, including hand gesture-based wheelchair control [104]. • Information retrieval Gesture-based interfaces can also be used for day-to-day information retrieval from the internet [166]. • Education Gesture interfaces for controlling presentations (e.g., PowerPoint®) are helpful for teachers [244]. Gesture-based interfaces can be used for window menu activation. Gesture interfaces can be useful in controlling the desktop, television, etc., and also for tablet PC applications [155]. In public spaces such as airports, where cameras are already embedded, passengers can take benefit from hand tracking and gesture recognition to control menus without physically touching a platform. Though there are some other touch-less technologies such as voice recognition, language and pronunciation become a barrier in many instances. Moreover, people are focusing on using smartphones to minimize contact when it comes to aspects such as check-in. However, with smartphones, passengers still often have to touch a screen, which still poses a risk. Additionally, at airport border control, it is often forbidden to use a smartphone. So, there are further limits to these existing features.
In addition, on roads, drivers can control auto navigation through simple in-air movements. In such cases, hand-tracking and gesture recognition technology can provide a hardware-agnostic solution to these problems. Another major challenge to overcome for a gesture recognition system is the implementation of an efficient real-time application. A good gesture recognizer should fulfill the following requirements, the most important being computational efficiency for real-time implementation: • Robustness The system should be robust to real-world conditions like noisy visual information, changing illumination, cluttered and dynamic backgrounds, occlusion, and so on. • Scalability The core of the system should be adaptable to different scales of applications like sign language recognition, robot navigation, virtual environments, and so on. • Computational efficiency The system should be computationally efficient. • User's tolerance The system should detect mistakes performed by the user and ask the user to repeat them until the mistake is corrected. As shown in Fig. 18, the real-time implementation of gesture recognition algorithms can be achieved by using graphics processing units (GPUs), alone or in combination with general-purpose CPUs, to increase the processing speed. Hand gesture recognition is a significant field of exploration in computer vision with various applications in HCI. Applications include desktop tools, computer games, healthcare, medical assistance, robotics, sign language, vehicle monitoring, and virtual reality environments. Interfaces support unimodal or multimodal interaction by utilizing computer vision, speech recognition, wearable sensors, or a combination of these and other technologies. Utilizing more than one modality can make the interaction more natural and accurate, but it also increases system complexity. Both static and dynamic gestures provide a helpful and natural human-computer interface. Dynamic gestures can be grouped depending on their meanings and appearances. They can be obtained primarily from vision-based systems or from wearable-sensor-based gloves. In principle, vision-based gesture interfaces should be preferred to data gloves because of their simplicity and low cost. While glove-based gesture recognition is almost a solved problem, vision-based gesture recognition is still in its developing stage. (Fig. 17 Applications of hand gesture recognition systems: a virtual reality, b gesture-based interaction with robots (picture courtesy http://www.robots-dreams.com/pc-based-robosapien-control-project), c desktop computing application, d virtual computer games using gestures, e sign language recognition, f vehicle control (picture courtesy: http://www.automotiveworld.com/news-releases/3D-gesture-recognition-virtual-touch-screen-bring-new-meaning-vehicle-controls/), g gesture-controlled robotic surgery (picture courtesy: http://www.purdueexponent.org/campus/collection_daa8e8c2-3e15-11e0-bb90-0017a4a78c22.html) and h television and desktop control.) Vision-based gesture recognition typically depends on the proper segmentation of the gesturing body parts. Image segmentation is significantly influenced by factors including physical movement, variations in illumination and shadows, and background complexity. The complex articulated shape of the hand makes it difficult to represent the appearance of gestures.
Moreover, the variation of gesture boundaries due to spatio-temporal differences among hand gestures makes the spotting and recognition process more difficult. Recognition of static as well as dynamic gestures becomes harder still in the presence of occlusion. Occlusion estimation is a challenging problem in its own right and an active area of research. Occlusions can be estimated using multiple cameras or tracking-based methods. The inclusion of depth information in gesture recognition can make the recognition process more accurate. Deep learning techniques have brought a new perspective to various applications of computer vision. Deep learning strategies can be used in both feature extraction and recognition owing to their inherent feature learning ability in finding salient latent structures within unlabeled and unstructured raw data. This paper surveyed the main approaches in vision-based hand gesture recognition for HCI. Major topics were different classes of gestures and their acquisition; gesture system architectures; and applications and recent advances of gesture-based human-computer interfaces in HCI. A detailed discussion was provided on the features and major classifiers in current use. Also, a brief description of different hand gesture databases was listed with their available source links. The scope of gesture naturalness and expressiveness can be enhanced by including facial expressions or allowing the use of both hands. However, this increases the size of the gesture vocabulary, inherently increasing the complexity. The authors declare that they have no conflict of interest. softkinetic's gesture control technology rolls out in additional car model The significance of facial features for automatic sign language recognition Motion history image: its variants and applications Trajectory space: a dual representation for nonrigid structure from motion Image sequence analysis of real world human motion Human hand postures and gestures recognition: towards a human-gesture communication interface A unified framework for gesture recognition and spatiotemporal gesture segmentation A low power, fully event-based gesture recognition system Building multi-modal personal sales agents as interfaces to e-commerce applications Recognizing two handed gestures with generative, discriminative and ensemble methods via fisher kernels Use of GPU in gesture recognition Intelligent biometric group hand tracking (IBGHT) database for visual hand tracking research and development A survey on deep learning based approaches for action and gesture recognition in image sequences Color hand gesture segmentation for images with complex background A new 2D static hand gesture colour image dataset for ASL gestures A dynamic approach and a new dataset for hand-detection in first person vision FSM-based recognition of dynamic hand gestures via gesture summarization using key video object planes Continuous hand gesture segmentation and co-articulation detection Feature extraction from 2d gesture trajectory in dynamic hand gesture recognition A novel set of features for continuous hand gesture recognition Dynamic image networks for action recognition The recognition of human movement using temporal templates A SOM based approach to skin detection with application in real time systems High accuracy optical flow estimation based on a theory for warping Large displacement optical flow: descriptor matching
in variational motion estimation Invariant features for 3-D gesture recognition Egocentric gesture recognition using recurrent 3D convolutional neural networks with spatiotemporal transformer modules Face segmentation using skin-color map in videophone applications Face segmentation using skin-color map in videophone applications Two streams recurrent neural networks for large-scale continuous gesture recognition Real-time user interface using particle filter with integral histogram Combining image and global pixel distribution model for skin colour segmentation Review of constraints on vision-based gesture recognition for humancomputer interaction A skin detector based on neural network. In: Communications, circuits and systems and West Sino expositions 6dmg: a new 6D motion gesture database Survey on 3D hand gesture recognition A review of hand gesture and sign language recognition techniques Multi-column deep neural networks for image classification A color hand gesture database for evaluating and improving algorithms on hand gesture and posture recognition Human detection using oriented histograms of flow and appearance Hand gesture recognition using bag-of-features and multi-class support vector machine Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques Skeleton-based dynamic hand gesture recognition Long-term recurrent convolutional networks for visual recognition and description Development of gesture-based human-computer interaction applications by fusion of depth and colour video streams Benchmark databases for video-based automatic sign language recognition Speech recognition techniques for a sign language recognition system Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit Visionbased hand pose estimation: a review. Comput Vis Image Underst Principal motion components for one-shot gesture recognition Multi-modal gesture recognition challenge 2013: dataset and results Two-frame motion estimation based on polynomial expansion Static hand gesture recognition based on hog characters and support vector machines. 
In: Instrumentation and measurement, sensor network and automation (IMSNA) Intrinsic images by entropy minimization Instructing people for training gestural interactive systems Most probable longest common subsequence for recognition of gesture character input The estimation of the gradient of a density function, with applications in pattern recognition Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition Learning precise timing with LSTM recurrent networks A static hand gesture recognition algorithm using k-mean based radial basis function neural network Static hand gesture recognition using mixture of features and SVM classifier Monocular tracking of the human arm in 3D A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior Constraint integration for efficient multiview pose estimation with self-occlusions K-nearest correlated neighbor classification for Indian sign language gesture recognition using feature fusion Chalearn gesture challenge: design and first results Segmentation of the face and hands in sign language video sequences using color and motion cues Choice of neighbor order in nearest-neighbor classification Automatic skin segmentation and tracking in sign language recognition Automatic skin segmentation for gesture recognition combining region and support vector machine active learning Recognizing hand gesture using Fourier descriptors Gesture recognition using kinect in a virtual classroom environment Lip shape and hand position fusion for automatic vowel recognition in cued speech for French The axes of rotation of the thumb carpometacarpal joint Documentation of pointing and command gestures under mixed illumination conditions: video sequence database Determining optical flow A comparison of methods for multiclass support vector machines Two-handed gesture tracking incorporating template warping with static segmentation The generalized uniqueness wavelet descriptor for planar closed curves User independent hand gesture recognition by accelerated DTW A full-body gesture database for automatic gesture recognition Intention, context and gesture recognition for sterile MRI navigation in the operating room Multimodal human-computer interaction: a survey. 
Comput Vis Image Underst Structural-RNN: deep learning on spatio-temporal graphs 3D convolutional neural networks for human action recognition Multimodal biometric human recognition for perceptual human-computer interaction Statistical color models with application to skin detection Statistical color models with application to skin detection Computer vision-based human body segmentation and posture estimation Fuzzy system learned through fuzzy clustering and support vector machine for human skin color segmentation HMM and IOHMM for the recognition of mono-and bi-manual 3D hand gestures A human motion estimation method using 3-successive video frames thesis: a framework for research and design of gesture-based human-computer interactions Large-scale video classification with convolutional neural networks Deep networkbased hand gesture recognition using optical flow guided trajectory images Deep networkbased hand gesture recognition using optical flow guided trajectory images Spatial-based skin detection using discriminative skin-presence features Real time hand pose estimation using depth sensors Skin detection: a random forest approach Efficient skin detection under severe illumination changes and shadows Improving human action recognition with two-stream 3D convolutional neural network A dynamic gesture recognition system for the Korean sign language (KSL) Tensor canonical correlation analysis for action classification A method for controlling wheelchair using hand gesture recognition An overview of text-independent speaker recognition: from features to supervectors Robotic wheelchair based on observations of people using integrated sensors Gesture recognition with a time-of-flight camera One-shot-learning gesture recognition using HOG-HOF Deeply learned view-invariant features for cross-view action recognition Imagenet classification with deep convolutional neural networks A study of the effect of illumination conditions and color spaces on skin segmentation Enhancing the gaming experience using 3D spatial user interface technologies Hand data glove: A new generation real-time mouse for human-computer interaction Hand posture and face recognition using a fuzzy-rough approach Extraction of informative regions of a face for facial expression recognition A kinect-based assessment system for smart classroom Natural movement generation using hidden Markov models and principal components A variational approach to monocular hand-pose estimation Learning realistic human actions from movies Visual tracking of hand posture with occlusion handling Gradient-based learning applied to document recognition An hmm-based threshold model approach for gesture recognition Models and technology in computer animation, computer animation series An elliptical boundary model for skin color detection An overview of noiserobust automatic speech recognition Action recognition based on a bag of 3D points Sign language recognition by combining statistical DTW and independent classification Recognizing actions by shape-motion prototype trees Spatio-temporal LSTM with trust gates for 3D human action recognition. 
Spatio-temporal LSTM with trust gates for 3D human action recognition Real-time skin color detection under rapidly changing illumination conditions Learning discriminative representations from RGB-D video data Hand posture recognition using finger geometric feature Dynamic hand gesture recognition with leap motion controller An iterative image registration technique with an application to stereo vision An optical flow based approach for action recognition A template matching approach of one-shot-learning gesture recognition Humancomputer interaction based on visual hand-gesture recognition using volumetric spatiograms of local binary patterns Feature weighted nearest neighbour classification for accelerometer-based gesture recognition Idiap two handed gesture dataset. Switzerland: IDIAP Research Institute Hand gesture recognition with leap motion and kinect devices Ouhands database for hand detection and pose recognition Automatic analysis of multimodal group actions in meetings Exploiting silhouette descriptors and synthetic data for hand gesture recognition A survey of research on contextaware homes Vision-based hand gesture recognition of alphabets, numbers, arithmetic operators and ascii characters in order to develop a virtual text-entry interface system Gesture recognition: a survey A survey of advances in vision-based human motion capture and analysis Hand gesture recognition with 3D convolutional neural networks Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network Fingertip detection and tracking for recognition of air-writing in videos Hand gesture recognition system for real-time application Hand gesture recognition using Camshift algorithm Moddrop: adaptive multi-modal gesture recognition Skin color segmentation by texture feature extraction and k-mean clustering Gesture based automating household appliances Geometry-based static hand gesture recognition using support vector machine Visual recognition of continuous hand postures Grasp recognition using a 3D articulated model and infrared images Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations Tracking the articulated motion of two strongly interacting hands Multimodal interfaces. 
The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications Toward an affect-sensitive multimodal human-computer interaction Human gesture recognition using kinect camera Hand gesture modelling and recognition involving changing shapes and trajectories, using a predictive Eigentracker Visual interpretation of hand gestures for human-computer interaction: a review A real-time hand gesture recognition system for daily information retrieval from internet Adaptive skin segmentation in color images The search for a safer driver interface: a review of gesture recognition human machine interface Sign language recognition using convolutional neural networks Recent methods and databases in vision-based hand gesture recognition: a review Attention based detection and recognition of hand postures against complex backgrounds Libras sign language hand configuration recognition based on 3D meshes Published by Foundations of Computer Science A study on static hand gesture recognition using moments Spelling it out: real-time ASL fingerspelling recognition A tutorial on hidden Markov models and selected applications in speech recognition Vision based hand gesture recognition for human computer interaction: a survey A leap-supported, hybrid AR interface approach Model-based tracking of self-occluding articulated objects Static and dynamic hand-gesture recognition for augmented reality applications Robust part-based hand gesture recognition using kinect sensor Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera Finger spelling recognition from RGB-D information using kernel descriptor Combining region and edge cues for image segmentation in a probabilistic Gaussian mixture framework An efficient sign language recognition (SLR) system using Camshift tracker and hidden Markov model (hmm) Chairgest: a challenge for multimodal mid-air gesture recognition for close HCI Hand posture and gesture recognition techniques for virtual reality applications: a survey Rules of play: game design fundamentals Static and dynamic 3D facial expression recognition: a comprehensive survey Hand gesture recognition using deep network through trajectory-to-contour based images Optical flow guided motion template for hand gesture recognition Two-stream fusion model for dynamic hand gesture recognition using 3D-CNN and 2D-CNN optical flow guided motion template Human colour skin detection in CMVK colour space Graph-based matching of occluded hand gestures Dynamic hand gesture recognition: an exemplar-based approach from motion divergence fields Gesture recognition using Bezier curves for visualization navigation from registered 3-D data Skeleton-based dynamic hand gesture recognition using a part-based GRU-RNN for gesture-based interface Real-time human pose recognition in parts from single depth images Estimation and prediction of evolving color distributions for skin segmentation under varying illumination Skin color-based video segmentation under time-varying illumination Two-stream convolutional networks for action recognition in videos Recognition of global hand gestures using self co-articulation information and classifier fusion Conditional models for contextual human motion recognition Determining driver visual attention with one camera A novel method for automatic face segmentation, facial feature extraction and tracking. 
Signal Process Image Commun A literature survey on robust and efficient eye localization in real-life scenarios Tracking body and hands for gesture recognition: Natops aircraft handling signals database Adaptive color space switching for tracking under varying illumination Skin colour detection under changing lighting conditions Real-time fingertip localization conditioned on hand gesture classification Dynamic hand gesture recognition using motion trajectories and key frames Recognizing hand gestures with microsoft's kinect Surgical gesture segmentation and recognition Real-time continuous pose recovery of human hands using convolutional networks Learning spatiotemporal features with 3D convolutional networks A system for person-independent hand posture recognition against complex backgrounds Gesture recognition with a convolutional long short-term memory recurrent neural network Direct manipulation interface using multiple cameras for hand gesture recognition Hand detection and tracking using pixel value distribution model for multiple-camera-based gesture interactions Human-computer interaction for smart environment applications using fuzzy hand posture and gesture models One-shot learning gesture recognition from RGB-D data using bag of features Superpixel-based hand gesture recognition with kinect depth camera Temporal segment networks: towards good practices for deep action recognition Hidden-markov-models-based dynamic hand gesture recognition Skin color detection under complex background Multi-class support vector machines. Citeseer: technical report Purdue RVL-SLLL American sign language database Learning visual behavior for gesture analysis Detecting salient motion by accumulating directionally-consistent flow One shot learning gesture recognition from RGBD images Point context: an effective shape descriptor for RST-invariant trajectory recognition Movement human actions recognition based on machine learning Recognizing human facial expressions from long image sequences using optical flow Recognizing human action in timesequential images using hidden Markov model Continuous hand gesture recognition based on trajectory shape information Research on a skin color detection algorithm based on self-adaptive skin color model Skin-color modeling and adaptation Gaussian mixture model for human skin color and its applications in image and video databases Extraction of 2D motion trajectories and its application to hand gesture recognition Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming Moving object localization in thermal imagery by forward-backward MHI Hand gesture recognition using combined features of location, angle and velocity Bighand2. 2m benchmark: hand pose dataset and state of the art analysis A hand gesture based interactive presentation system utilizing heterogeneous cameras Fusion of 2D CNN and 3D densenet for dynamic gesture recognition An adaptive skin color detection algorithm with confusing backgrounds elimination Hand gesture recognition with surfbof based on gray threshold segmentation Microsoft kinect sensor and its effect Two-stream RNN/CNN for action recognition in 3D videos Hidden two-stream convolutional networks for action recognition