key: cord-0060978-skcp9i4x authors: Zhu, Zhaosong; Jiang, Xianwei; Zhang, Juxiao title: Sign Language Video Classification Based on Image Recognition of Specified Key Frames date: 2020-06-13 journal: Multimedia Technology and Enhanced Learning DOI: 10.1007/978-3-030-51103-6_33 sha: ac081c4e52e1b5342551f9e5cf3998bc09ca1518 doc_id: 60978 cord_uid: skcp9i4x This paper is based on the Chinese sign language video library and discusses the design of a video classification algorithm based on handshape recognition of key frames in video. Video classification in a sign language video library is an important part of sign language arrangement and is also the premise of video feature retrieval. At present, handshape classification of sign language video is done manually; the work is labor-intensive and the results are prone to inconsistency and error. In this paper, from the angle of computer image analysis, key frames are defined and extracted, and then the region of interest is identified. Finally, an improved SURF algorithm is used to match the region of interest against the existing hand images, completing the classification of the video. The entire process is based on the actual development environment, and it can serve as a reference for classification based on video image features. With the development of information technology, Internet technology and multimedia technology have been greatly improved. In particular, the emergence of "we media" platforms such as YouTube, Facebook and TikTok has led to explosive growth of video on the Internet. Under these conditions, manual annotation has become impossible, and the subjectivity of manual tagging cannot meet the needs of users.
In order to facilitate the management and retrieval of massive video collections, automatic video classification is particularly important. Automatic video classification is also widely used in video monitoring, network supervision, medicine and other fields. For example, Johnson et al. proposed a multi-modal monitoring method which extracts static features of the human body from multiple angles to realize the detection, separation and recognition of human beings at a distance [1]. Through video classification, video on the Internet can be regulated to filter out undesirable content (pornography, violence, etc.) [2]. Video classification has also been applied to libraries of wireless capsule endoscopy videos, with all videos classified according to the organs diagnosed [3]. In addition to specific fields, there are also general video classification methods: Fischer et al. proposed in 1995 that video could be divided into news, sports, business, advertising, cartoons, etc. [4]; Huang et al. proposed a classification algorithm based on text features, extracting user-generated text features and applying a classifier [5]; Jiang et al. proposed a method of video classification using a support vector machine (SVM) based on visual features (color, motion, edge, etc.) [6]; Subashini et al. proposed a machine learning algorithm based on audio features and image histograms [7]. However, current video classification methods generally have two problems: 1) insufficient universality, since prior knowledge is needed to design classification rules that apply only to certain fields; 2) algorithmic complexity, since multi-level deep learning algorithms require substantial computing resources when processing a video library containing a large amount of video. In order to solve these problems, this paper proposes a classification method based on key frame images of sign language video.
The main steps are as follows: 1) extraction of key frames from the sign language video; 2) image visual feature preprocessing and hot area extraction; 3) feature matching with the designated image to achieve video classification. The object of this paper is the video library of Chinese sign language (csl-lib, Project NO. zda125-8, 2016). The library contains 57,531 sign language vocabulary videos from nine specific regions in China. At present, the copyright of this dataset belongs to the Chinese language and script commission, and some contents will be released later. In retrieval operations on the video library, the handshape index is an important retrieval method; it is also the only retrieval method based on video image features, and has important video analysis and research value. The classification is mainly based on the 60 hand shapes in sign language (see Fig. 1). Manual classification would consume a great deal of human, material and time resources. In view of this problem, a set of practical classification methods is proposed from the perspective of computer image processing. The SURF algorithm [9], itself based on Lowe's SIFT algorithm [8], is improved by plane angle rotation, and the key frame extraction algorithms of literature [10, 11] are applied. Finally, a classification method based on key frame matching is proposed, which can be applied to sign language handshape classification and has the characteristics of batch processing and high efficiency. The key frame for video retrieval is defined as the image showing the handshape used by the gesture in the sequence formed by the video stream. Take the sign language video 'lightning/electricity' as an example. This is standard sign language: in most parts of mainland China it has the same or similar stroke (the left hand does not move, and the right hand draws the shape of lightning in the air), so it is a typical case.
Key frame extraction is mainly divided into two steps. Firstly, video serialization and graying: the grayscale processing adopts the general formula (1) proposed in literature [12] to form the video sequence shown in Fig. 2. Secondly, key frames are extracted according to the algorithm. The extracted key frame image should have two characteristics: 1) it should be stable for a certain duration; 2) its edges should be clear enough to be preprocessed into hand recognition material. According to these two characteristics, a key frame is described as follows during processing: the current frame differs little from the preceding m frames and the following m frames. The value of m is a natural number and must not exceed the number of frames; in this paper, considering the efficiency of video library processing, m = 2, so the difference is calculated against the two preceding and two following frames. The difference coefficients are generated by a Gaussian function, and the differences are summed at the end. Assume the video has n frames and let i index the frames; V_i is defined as the mean grayscale value of sequence image i, so the pairs (i, V_i) form a discrete function f(x). Fig. 3 shows the mean grayscale value function of the "lightning" sign language video sequence. Key frame extraction then selects the frames where the change rate |f'(x)| is smallest, according to formula (2). Since it is hard to compute the derivative of a discrete sequence directly, f'(x) is approximated with a coefficient operator S, normalized by the Gaussian function, as in formulas (3), (4) and (5):

S = [-0.135, -0.365, 1, -0.365, -0.135]    (4)

The resulting |f'(x)| data sequence (Fig. 4) reflects the changes between images. According to the actual situation of the video, the head subsequence and the tail subsequence are removed, because these two subsequences may contain useless information.
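The change-rate computation above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the frame data are synthetic, the head/tail trim length and the mapping of the operator output back to frame indices are assumptions, while the coefficient operator S is taken directly from formula (4).

```python
import numpy as np

# Gaussian-normalized coefficient operator S from formula (4), m = 2
# neighbours on each side of the current frame.
S = np.array([-0.135, -0.365, 1.0, -0.365, -0.135])

def gray_means(frames):
    """Mean grayscale value of each frame -> the discrete function (i, V_i)."""
    return np.array([f.mean() for f in frames])

def change_rate(v):
    """|f'(x)| approximated by convolving the mean sequence with S."""
    # 'valid' mode: the first and last m frames lack a full neighbourhood.
    return np.abs(np.convolve(v, S, mode="valid"))

def key_frames(frames, p=3, trim=10):
    """Indices of the p most stable frames, head/tail subsequences removed."""
    v = gray_means(frames)
    r = change_rate(v)                      # output j corresponds to frame j + 2
    idx = np.arange(2, 2 + len(r))
    mask = (idx >= trim) & (idx < len(frames) - trim)
    cand, scores = idx[mask], r[mask]
    order = np.argsort(scores)
    return sorted(cand[order[:p]].tolist())
```

Because S sums to zero, a perfectly static run of frames yields a change rate of (numerically) zero, which is exactly the stability property the key-frame definition requires.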
In this project, the first roughly 10 frames correspond to the model's gesture preparation stage and the last roughly 10 frames to the gesture homing stage, so both need to be removed. Then the p frames with the smallest rate of change, min_p |f'(x)|, are found in the sequence. With p = 3, three ordinal numbers satisfying the condition are obtained: 17, 32 and 37, corresponding to frames 17, 32 and 37 (Fig. 5). Once the video key frames are extracted, hand shape matching and recognition is carried out. Interference in hand image matching mainly comes from two sources: 1) differences between the matched images, i.e. the key frame image and the matching handshape image come from different sources, so their image features differ considerably and matches do not form easily; 2) spatial differences, i.e. even for the same handshape, a large spatial position transformation makes matching difficult. For the second point, there is no valid three-dimensional transposition algorithm for plane space. Therefore, in order to reduce the first kind of interference and form effective matching features, the key frame image is preprocessed before matching to improve the matching accuracy. Preprocessing is divided into two steps: 1) image marginalization and binarization to reduce pixel interference; 2) image hot area extraction to reduce the interference of non-hot regions. A high-pass filter is used to process the image. The improved regional edge detection algorithm proposed by Canny [13] and Wang et al. [14] was referenced and simplified to meet the needs of this project. The kernel matrix of the Sobel operator is used as the filter (Fig. 6). Horizontal and vertical filtering are performed respectively, forming the matrices sobel_X and sobel_Y, and the L1 norm of the two is then taken to obtain the filtering result.
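The Sobel-plus-L1 edge step can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions: a naive "valid"-mode 2-D filter is used instead of an image library so the code stays self-contained, and the threshold default simply echoes the gray value 128 used in this project.

```python
import numpy as np

# Sobel kernels used as the horizontal and vertical high-pass filters.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
SOBEL_Y = SOBEL_X.T

def filter2d(img, kernel):
    """Naive 'valid'-mode 2-D cross-correlation (adequate for a small demo)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_edges(gray, threshold=128):
    """L1-combined Sobel responses, binarized at the given gray threshold."""
    gx = filter2d(gray, SOBEL_X)   # sobel_X response
    gy = filter2d(gray, SOBEL_Y)   # sobel_Y response
    mag = np.abs(gx) + np.abs(gy)  # L1 norm of the two filtering results
    return (mag >= threshold).astype(np.uint8)
```

The L1 combination |gx| + |gy| is cheaper than the Euclidean magnitude and, after the low-threshold binarization, preserves the same basic contour information.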
Finally, a low threshold is used for binarization to retain the basic contour information; in this project, the gray midpoint value of 128 is adopted as the threshold. The result is shown in Fig. 7. The threshold value should be optimized according to the specific situation of the image. (Fig. 7. Key frame images after marginalization.) The hot area extraction process is as follows. Suppose the key frame image is Img_i, its previous frame is Img_{i-1}, and its next frame is Img_{i+1}; the extraction formula is then applied to these three frames. By using the differences between adjacent images to extract the hot areas, background interference can be minimized and only the dynamic region is considered. Finally, the results are filtered by a low threshold, and the remaining part is the hot area. The final results are shown in Fig. 8, and the positions of the hot areas are shown in Table 1. In the process of matching the key frame hot areas with the feature handshape images, scale-invariant feature matching is needed. The SURF algorithm [9], an accelerated version of the SIFT algorithm [8], was used and its results were further filtered. The SURF algorithm has the following six steps: (1) Construct the Hessian matrix of the image through formula (8) and calculate its eigenvalues. The convolution window of a Gaussian filter is used in the calculation, with a simplified 3 * 3 matrix. (2) Construct the Gaussian difference scale space of the image, generated by convolving Gaussian difference kernels with the image at different scales (formulas (8), (9) and (10)). Since the image size is small, the space is set to 3 scales, each containing five levels. The filter template size increases gradually from scale 1 to scale 3, and within one scale the filter's blur coefficient increases gradually, forming a 3 * 5 scale space. (3) Locate the feature points.
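The hot-area step can be sketched as follows. The paper's exact extraction formula is not reproduced in this text, so the code uses one plausible reading: the key frame is differenced against both neighbours, and only pixels that changed relative to both are kept, which suppresses the static background. The low-threshold default is an assumption.

```python
import numpy as np

def hot_area(prev_f, key_f, next_f, low_threshold=15):
    """Binary mask of the dynamic (hot) region around key frame Img_i,
    from its differences with Img_{i-1} and Img_{i+1}."""
    d_prev = np.abs(key_f.astype(int) - prev_f.astype(int))
    d_next = np.abs(key_f.astype(int) - next_f.astype(int))
    # Keep pixels that differ from BOTH neighbours: static background drops out.
    return (np.minimum(d_prev, d_next) > low_threshold).astype(np.uint8)

def bounding_box(mask):
    """Position of the hot area as (row0, col0, row1, col1), or None."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.min(), cols.min(), rows.max(), cols.max()
```

Taking the minimum of the two differences (rather than their sum) means a region the hand has just left, which differs from only one neighbour, is not marked hot.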
Each pixel processed by the Hessian matrix is compared with the 26 points in its neighborhood in 2D image space and scale space. The key points are preliminarily located, and then weak and erroneous key points are filtered out, screening out the final stable feature points. (4) Assign the main direction of each feature point. The Haar wavelet responses in a circular neighborhood of the feature point are computed, and the sums of the horizontal and vertical Haar-like features of all points within a 60-degree sector are calculated. The sector is rotated at intervals of 0.2 radians and the Haar wavelet response in the region recalculated; the direction of the sector with the largest value is finally taken as the main direction of the feature point. (5) Generate the feature point descriptors. (6) The Euclidean distance between the descriptors of two feature points is calculated to determine the matching degree: the shorter the Euclidean distance, the better the two feature points match. At the same time, the sign of the Hessian matrix trace is checked. Feature points whose matrix traces have the same sign represent contrast changes in the same direction; if the signs differ, the pair is excluded directly, even if the Euclidean distance is 0. This algorithm is used to match the hot areas of the key frames against each handshape image in turn, finally forming a percentage matching value. Not all 60 handshape feature matching results are listed; some typical results are shown in Fig. 9. As can be seen from Fig. 9, although the handshape with the highest matching degree is the one in the top left figure, at about 14.44% among the 60 hand shapes, the differences from the matching degrees of the other handshapes are not large enough to form a reliable classification result, e.g. 13.33% in the bottom left figure, 11.11% in the top right figure, and 11.11% in the bottom right figure.
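The matching criterion in step (6) can be sketched without reimplementing SURF itself. The sketch below shows only nearest-neighbour matching by Euclidean distance with the Laplacian (Hessian trace) sign check; the descriptor contents, the sign lists, and the distance cutoff are all illustrative assumptions.

```python
import numpy as np

def match_descriptors(desc_a, sign_a, desc_b, sign_b, max_dist=0.5):
    """Return index pairs (i, j): nearest neighbour by Euclidean distance,
    rejecting candidates whose Hessian trace signs differ."""
    pairs = []
    for i, (da, sa) in enumerate(zip(desc_a, sign_a)):
        best_j, best_d = -1, float("inf")
        for j, (db, sb) in enumerate(zip(desc_b, sign_b)):
            if sa != sb:       # opposite contrast direction: exclude outright
                continue
            d = float(np.linalg.norm(da - db))
            if d < best_d:
                best_j, best_d = j, d
        if best_j >= 0 and best_d <= max_dist:
            pairs.append((i, best_j))
    return pairs
```

The sign check is essentially free (it is a single comparison) and prunes candidates before any distance is computed, which is why SURF stores the Laplacian sign alongside each descriptor.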
From the matching results, it can be seen that the matched feature point pairs need to be filtered because of the amount of interference. The filtering method, inspired by the geometric characteristics proposed by Liu et al. [15], is as follows. If the matched images are placed in the same plane, the two images have relatively fixed positions in the plane coordinate system, as shown in Fig. 10. Let ab be the line segment connecting the center points of imageA and imageB, and let α be the angle between ab and the horizontal. Let the angle increment be Δα, and suppose points M and N are a pair of matching points, with β the angle between line MN and the horizontal. If β satisfies formula (11), then (M, N) is kept as a matching pair; otherwise it is deleted from the matching point pairs. The filter can be tuned by adjusting the size of Δα.

β ∈ [α − Δα, α + Δα]    (11)

The matching point pairs formed after filtering are shown in Fig. 11. The matching degrees of the handshapes in the four cases were 21.43% in the top left, 17.86% in the bottom left, 5.36% in the top right, and 5.36% in the bottom right. The differences are now large enough for classification, and a good result distribution was formed over the 60 handshape matches. Based on these results, the handshapes in the sign language video "lightning" match well with the seventh and eighth hand shapes in Fig. 1. In order to classify the videos in the sign language database by specified handshapes, this paper proposes a combined algorithm flow. Firstly, the key frames of the digital video are extracted: the video image sequence matrix differences are used to calculate the sequence change rate, and the frames with smaller change rates are taken as key frames. Secondly, feature extraction and hot area extraction are carried out on the key frame images.
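The geometric filter of formula (11) can be sketched as follows. It assumes, as an illustration, that both images are already laid out side by side so that all point coordinates live in one plane coordinate system; the Δα default is arbitrary.

```python
import math

def line_angle(p, q):
    """Angle (radians) between segment pq and the horizontal."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def filter_matches(center_a, center_b, pairs, d_alpha=0.1):
    """Keep pairs (M, N) whose angle beta lies in [alpha - d_alpha, alpha + d_alpha],
    where alpha is the angle of the line ab joining the two image centers."""
    alpha = line_angle(center_a, center_b)
    kept = []
    for m, n in pairs:
        beta = line_angle(m, n)
        if abs(beta - alpha) <= d_alpha:
            kept.append((m, n))
    return kept
```

The intuition: with the images fixed in the plane, correct matches all connect roughly parallel positions, so their segments MN share the orientation of ab, while spurious matches scatter at arbitrary angles and fall outside the Δα band.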
Feature extraction uses the Sobel operator to extract the contour, followed by binarization, and hot area extraction uses the differences between adjacent images to separate the image regions with larger values. Finally, handshape matching is performed, mainly using the SURF algorithm; in addition, the point pairs generated by SURF are filtered by their angle in the plane to form a handshape-matching distribution satisfying the requirements. Compared with supervised learning algorithms, this process avoids the sample learning stage and complicated classification computation, saves computing resources, and has a certain efficiency and practicability. However, because the video key frame image undergoes spatial transformation, it cannot match the specified handshape completely. The next research direction focuses on solving this problem with depth information [16], and at the same time introducing deep learning algorithms and more graphical features [17] to apply this algorithm flow to video classification and other practical applications.
[1] A multi-view method for gait recognition using static body parameters
[2] Video filtration for content security based on multimodal features
[3] Wireless capsule endoscopy video classification using an unsupervised learning approach
[4] Automatic recognition of film genres
[5] Text-based video content classification for online video-sharing sites
[6] An automatic video content classification scheme based on combined visual features model with modified DAGSVM
[7] Audio-video based classification using SVM
[8] Distinctive image features from scale-invariant keypoints
[9] SURF: speeded up robust features
[10] Unordered image key frame extraction based on image quality constraint
[11] A self-adaptive weighted affinity propagation clustering for key frames extraction on human action recognition
[12] Mover's distance as a metric for image retrieval
[13] A computational approach to edge detection
[14] Image edge detection algorithm based on improved Canny operator
[15] Gesture recognition based on geometric features
[16] Fingertip detection and gesture recognition based on Kinect depth data
[17] Gesture recognition: a survey