title: Shot boundary detection method based on a new extensive dataset and mixed features
authors: Gushchin, Alexander; Antsiferova, Anastasia; Vatolin, Dmitriy
date: 2021-09-02

Shot boundary detection is one of the key stages of video data processing. We propose a new shot boundary detection method based on several video features, such as color histograms and object boundaries. The developed algorithm was tested on the open BBC Planet Earth [1] and RAI [2] datasets; the MSU CC datasets, built from videos used in the video codec comparison conducted at MSU as well as videos from the IBM set, were also created. The total dataset for algorithm development and testing exceeds the well-known TRECVID datasets. In the tests, the proposed scene change detection algorithm outperformed its counterparts with a final F-score of 0.9794.

One of the basic steps in video processing is splitting a video into scenes. For example, scene splitting is a necessary step in video annotation and indexing [3], keyframe search [4], and automatic video format conversion [5]. Existing algorithms achieve high accuracy in detecting transitions between scenes in general cases, but they still make mistakes on complex transitions (Fig. 1). Moreover, existing algorithms were developed on open datasets that may contain errors. For example, when analyzing one of the most popular datasets, BBC [1], several frame-level inaccuracies were found in the transition markup: in the From Pole to Pole video, the first scene ends at frame 632, but the second begins at frame 650. Since the algorithm must report the exact frame of a scene change, such errors were corrected to allow more accurate evaluation. Thus, the challenges of creating a method for quickly and accurately partitioning video into scenes, as well as creating a large dataset with accurate partitioning, are relevant.

Since different definitions of scene transitions appear in the literature (the definition of a scene varies: it can be a stretch of footage between two splices or a semantic part of a movie), we state the basic definitions used in this work, relying on the formal problem statement formulated by the authors of [6]. The basic element of video is the frame; frames are combined into scenes (shots), and scenes are combined into semantic scenes. A scene (shot) is a continuous stretch of video, shot with a single camera, without splices or interruptions. A semantic scene is a sequence of scenes with the same semantics. The task of a shot boundary detector is to indicate all frames at which a scene change occurs. In most cases, the content within a scene changes gradually, and at the boundaries there is an editing splice, so this task is trivial for humans. Scene changes are divided into two types: abrupt and gradual. An abrupt change (cut) is an instantaneous transition from a frame of one scene to a frame of the next. A gradual change can be a dissolve (the gradual appearance of a new scene on top of the previous one), a fade (a gradual transition to a black frame and back), or a wipe. Examples of such transitions are shown in Fig. 2.
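To keep these definitions concrete, the sketch below fixes a minimal data model for a detector's output. It is an illustration only: the class and field names are our own and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class TransitionType(Enum):
    """Transition categories from the definitions above."""
    CUT = "cut"            # abrupt: instantaneous change between scenes
    DISSOLVE = "dissolve"  # gradual: new scene appears on top of the old one
    FADE = "fade"          # gradual: transition through a black frame and back
    WIPE = "wipe"          # gradual: new scene pushes the old one out

@dataclass
class ShotBoundary:
    """One detector answer: the frame at which a scene change occurs."""
    frame_index: int
    kind: TransitionType
```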
Most shot boundary detection algorithms work in two steps:

• calculating the value of one or more frame-difference metrics;
• setting a threshold for frame classification (at this stage, machine learning is often used for automatic classification).

An additional step can be filtering frames to remove false positive detections.

The purpose of this work was to create a new shot boundary detection method and compare it with existing methods on a new, large volume of data. The paper is structured as follows:

• Section 2 gives an overview of algorithms in the field, as well as datasets for comparing them;
• Section 3 gives a detailed overview of the proposed approach to the problem at hand;
• Section 4 contains the results of testing the open algorithms and comparing them with the proposed method;
• Section 5 contains conclusions and further plans.

In one of the most detailed works devoted to the analysis of shot boundary detection algorithms [7], the authors considered their drawbacks as well as ways to improve them. The main drawbacks include slow operation and errors in cases of flashes, fast camera movement, etc.

In most existing methods, the first step is the calculation of features for frames. A frequently used feature is a frame-similarity metric that measures the degree of difference between frames: at a scene change its value increases, while inside a scene it stays close to zero. The most popular techniques are the calculation of color histograms [7], [8], boundary gradients [9]-[11], geometric transformations [12], [13], and motion vectors [14], [15]. One of the simplest ways to construct such a metric is a pixel-by-pixel comparison of frames [7]. Other difference metrics are also calculated between frames: for example, [16] uses the L*a*b* color space and its color-distance formula.

Color histograms were used, for example, in [7] and [8]. Histograms can be computed for RGB as well as other color spaces (HSV, YCbCr, L*a*b*). With this approach the algorithm is less sensitive to motion within the frame, but it may produce many false positives on scenes with flashes and rapid lighting changes. A minimal sketch of a histogram-based metric is given at the end of this section.

The use of boundary gradients partially solves the problem of false positives when the camera or objects move within the frame, since frame edges can be matched without relying on lighting. This technique was used in [9] and [10]. In [11], the authors used object boundaries within frames to construct a histogram of oriented gradients; the histograms for different levels of a resolution pyramid are concatenated, which allowed the authors to characterize object boundaries in the frame at different scales.

Motion vectors were used in [17], [14], [15]. With them, a scene change detection algorithm can adapt to movement within the scene, camera movement, or the appearance of a large object in the frame. Motion vectors take longer to compute than color- or edge-histogram approaches, but they can be used together with quickly computable metrics to achieve high accuracy (for example, by computing motion vectors before comparing frame boundaries).

Another popular technique is geometric transformations of frames: the contourlet and Fourier transforms [12] and the Walsh-Hadamard transform [13]. These methods are sensitive to frame motion and resolution, which can seriously increase the running time of the algorithm.
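As an illustration of the histogram-based family of metrics discussed above, here is a minimal sketch in Python with OpenCV. It is a generic example, not the exact formulation of any cited paper; the input file name and parameter values are placeholders.

```python
import cv2
import numpy as np

def hist_difference(prev_frame, frame, bins=32):
    """Per-channel color-histogram difference between consecutive frames.

    The value stays near zero inside a shot and spikes at a cut.
    """
    diff = 0.0
    for ch in range(3):  # B, G, R channels
        h1 = cv2.calcHist([prev_frame], [ch], None, [bins], [0, 256])
        h2 = cv2.calcHist([frame], [ch], None, [bins], [0, 256])
        cv2.normalize(h1, h1)
        cv2.normalize(h2, h2)
        diff += float(np.abs(h1 - h2).sum())
    return diff / 3.0

# Usage: collect the metric for every pair of consecutive frames.
cap = cv2.VideoCapture("video.mp4")  # placeholder input file
ok, prev = cap.read()
scores = []
while ok:
    ok, cur = cap.read()
    if not ok:
        break
    scores.append(hist_difference(prev, cur))
    prev = cur
cap.release()
```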
The authors of one contourlet-based method used an improved contourlet transform, which is not sensitive to the problems mentioned above. Rarer ways of constructing metrics include SIFT, SURF, and entropy [18], [19]. They can give comparable accuracy but require more computational resources.

Many of the methods described above can be applied not to the whole image but to part of it. In this approach the frame is divided into blocks (overlapping or not) and the metric is calculated for each block. The per-block values can be concatenated into a vector, aggregated into histograms (including cumulative ones [10]), or summarized by statistics (e.g., mean and variance). Partitioning into blocks makes the algorithm less susceptible to changes in limited parts of the scene (e.g., rapid movement of objects or flashes). Thus, using a combination of features based on different characteristics of the video and of individual frames achieves higher detection accuracy and reduces the number of false positives, but increases the runtime of the algorithm.

After the frame-similarity metrics are calculated, each frame is classified into one of three categories: abrupt scene change (cut), gradual scene change (dissolve, wipe, fade-in, or fade-out), or no scene change. Since the algorithm needs to analyze all frames, which is a resource-intensive task, some authors use video preprocessing: with additional fast algorithms they select the segments where scene changes are supposedly present and further process only those segments. This approach assumes that no scene changes occur in frames outside these segments.

There are three main approaches to classification: a fixed threshold, an adaptive threshold, and machine learning. A threshold (or a set of thresholds) is the simplest classifier: the values of a metric or metrics are compared with a predetermined threshold, and a decision is made about whether the frame belongs to a certain class. This approach is rarely used, as it is more advantageous to select thresholds for each individual video based on its features. An adaptive threshold does not have this disadvantage and can change not only per video [10] but also depending on the metric values in some neighborhood of the frame [20]. Thus, the threshold is adjusted not for the whole video but for a particular scene; a sketch of this idea is given at the end of this section. Recently, with the development of machine learning, learned classifiers are increasingly used: SVM [21], bagged tree classifiers [11], k-nearest neighbors, and neural networks. The authors of [22] analyzed the techniques used; according to their research, SVM showed the best results.

Certain video artifacts significantly complicate detection algorithms, for example, flashes and camera/object motion. Some of them can be handled at the preprocessing stage: for example, separate metrics for flashes are introduced in [23].

In most papers on shot boundary detection methods, the authors compare the performance of algorithms on the dataset used in the TRECVID competition. This is one of the most famous and extensive comparisons of shot boundary detection methods; it was conducted annually for 7 years and tested 57 algorithms on different sets of annotated videos. After the end of each competition, articles were released analyzing the participants' solutions and their results (e.g., [22]). The dataset included mostly documentaries and television shows. There are also a number of articles comparing shot boundary detection methods (e.g., [7]).
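The adaptive-threshold idea can be sketched as follows. This is a generic illustration, not the method of [10] or [20]; the window size and the mean-plus-k-sigma rule are assumptions chosen for clarity.

```python
import numpy as np

def adaptive_cut_frames(scores, window=15, k=3.0):
    """Flag frame i as a cut candidate when its metric value exceeds
    mean + k * std of the values in a sliding neighborhood around i."""
    scores = np.asarray(scores, dtype=float)
    cuts = []
    for i, s in enumerate(scores):
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        neighborhood = np.concatenate([scores[lo:i], scores[i + 1:hi]])
        if neighborhood.size == 0:
            continue
        if s > neighborhood.mean() + k * neighborhood.std():
            cuts.append(i)
    return cuts
```

Compared with a single global threshold, the decision here depends only on nearby frames, so a dark, static scene and a bright, dynamic one are judged against different baselines.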
We requested access to the TRECVID dataset, but unfortunately, due to COVID-19, the organizers were unable to provide it (the usage agreement requires the dataset to be shipped on DVDs, which was not possible at the time).

To create the shot boundary detection algorithm, the OS VSD dataset [24] was collected using Yandex.Toloka [25]. The creation of the dataset was divided into several steps:

• Several algorithms, configured to maximize the recall of the results, were run on all videos.
• A list of potential scene changes was created by combining the results of all algorithms.
• Each potential scene change was cut from the original video as a short sequence 40 frames long.
• Yandex.Toloka was used to show all these sequences to people for markup:
  - For each video segment, observers indicated whether it contained a scene change.
  - Each video segment was shown to at least 5 different people; if the results were ambiguous, the number of observers was increased until agreement between observers was reached.

This resulted in an additional 19 videos with a total duration of 965 minutes, surpassing the existing TRECVID datasets. The table above shows other comparative characteristics of this set.

Methods that have shown high accuracy in existing comparisons were used as base features for the new algorithm. To analyze them, the newly created OS VSD dataset was used, on which these methods were compared. In the first step, the proposed algorithm uses several metrics to describe frame differences. These metrics are built on boundary gradients and frame color histograms. This approach takes into account several factors possible during a scene change and extracts more information about the frames being compared. In the second stage, the LGBM algorithm is applied to these metrics for classification; it was chosen as a result of experiments with different machine learning techniques.

First, let us describe the features that our algorithm relies on (sketches of two of them are given after this list):

• Metrics proposed in [26]. These metrics use the mean and standard deviation of pixel brightness in blocks of frames.
• Cumulative color histogram metric, based on [10]. First, the Sobel operator is applied to the frames to find edges, a trapezoidal smoothing function is applied, and the cumulative histogram of frame blocks is calculated.
• Metric proposed in the Max Remain repository [27]. It calculates the difference between the color histograms of two consecutive frames; the output is a vector whose length is 3 times the number of histogram bins.
• Histogram of object boundaries in the frame. First, the Sobel operator is applied to find edges; the frame is divided into 100 non-overlapping blocks, a histogram of edges is built and compared with that of neighboring frames. A threshold is applied to discard blocks that differ between neighboring frames.
• Metric proposed in the aysebilgegunduz repository [28]: the Bhattacharyya distance between the histograms of consecutive frames.

The features of [30] were also tested, but were discarded during the experiments due to their low accuracy compared to the other metrics.
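Two of the listed features can be sketched as follows. These are approximations for illustration: the exact channel, bin, and block choices of [28] and of the edge-histogram feature are not reproduced here, so the parameters below are assumptions.

```python
import cv2
import numpy as np

def bhattacharyya_metric(prev_frame, frame, bins=64):
    """Bhattacharyya distance between grayscale histograms of two
    consecutive frames (in the spirit of the [28] metric)."""
    g1 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    h1 = cv2.calcHist([g1], [0], None, [bins], [0, 256])
    h2 = cv2.calcHist([g2], [0], None, [bins], [0, 256])
    cv2.normalize(h1, h1)
    cv2.normalize(h2, h2)
    return cv2.compareHist(h1, h2, cv2.HISTCMP_BHATTACHARYYA)

def edge_block_feature(frame, grid=10):
    """Mean Sobel edge magnitude over a grid x grid partition of the
    frame (100 blocks for grid=10), comparable across neighboring frames."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = cv2.magnitude(gx, gy)
    h, w = mag.shape
    bh, bw = h // grid, w // grid
    return np.array([
        mag[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].mean()
        for r in range(grid) for c in range(grid)
    ])
```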
As a training dataset, 19 videos from the video hosting service youtube.com with a total duration of 26 minutes (38917 frames) were selected. Also, 2 videos from the BBC Planet Earth set were added, with a total duration of 96 minutes (144700 frames). Thus, the training dataset consisted of 21 videos with a duration of 122 minutes (183617 frames); it contained 917 abrupt and 54 gradual scene changes. The test dataset consisted of 9 videos taken from the BBC Planet Earth dataset and 10 videos from the RAI dataset; in total it comprised 563 minutes of video (804883 frames), with 4510 abrupt and 348 gradual scene changes.

Linear and logistic regression, SVM, k-means, LGBM, and random forest were tried as learning algorithms. The LGBM algorithm showed the best results, and its parameters were chosen using cross-validation. The contribution of the features to the final model can be seen in Fig. 3.

The accuracy of the algorithm was measured on the test dataset, and a comparison with counterparts was made. The F1 score, which combines the precision and recall of the found scene changes, was used to measure accuracy; a sketch of the evaluation procedure is given at the end of this section. Table 3 shows the scores obtained: the proposed algorithm outperformed popular methods in terms of accuracy.

The efficiency of the used metrics was also analyzed. Fig. 4 shows the values of pairs of metrics on each frame of the training dataset, with different colors indicating the presence and absence of scene changes. The illustration shows that many pairs of metrics yield a clear separation, for example, the pair of metrics from [26]. Fig. 5 shows an example of the [26] metric, which has the largest contribution to the accuracy of the model, on a segment of video from the test set: on frames with scene changes the metric takes large values that are easily separable from the rest.

Fig. 5. The metric value from the article [26] on a random segment of the video from the test dataset. The blue line is the metric value; vertical lines mark frames with scene changes.
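The evaluation can be sketched as follows. The paper does not spell out its exact matching procedure, so the one-to-one matching rule and the frame tolerance below are assumptions for illustration.

```python
def precision_recall_f1(detected, ground_truth, tol=2):
    """Score detected scene-change frames against ground-truth boundaries,
    counting a detection as correct if it lies within `tol` frames of an
    unmatched ground-truth boundary."""
    gt = sorted(ground_truth)
    matched = set()
    tp = 0
    for d in sorted(detected):
        hit = next((g for g in gt
                    if abs(g - d) <= tol and g not in matched), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```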
In this paper, we proposed a new method for detecting scene changes based on a combination of metrics. The algorithm was tested on the BBC Planet Earth and RAI datasets; its precision was 0.9784 and its recall was 0.9803. The proposed method outperformed its counterparts in the F1-score metric. At the moment, the algorithm is slower than its analogues, as it uses a larger number of features. In the further development of the project, we plan to speed up the proposed method and to analyze the performance of the methods on cases that are complex for classification.

References

• Innovative shot boundary detection for video indexing
• Shot based keyframe extraction using edge-LBP approach
• A formal study of shot boundary detection, Circuits and Systems for Video Technology
• Methods and challenges in shot boundary detection: a review
• Comparison of video shot boundary detection techniques
• Gradual shot boundary detection using localized edge blocks
• Video shot boundary detection using block based cumulative approach
• A new pyramidal opponent color-shape model based video shot boundary detection
• Advanced and adaptive shot boundary detection
• Hadamard transform kernel-based feature vector for shot boundary detection
• Scene detection and retrieval of video using motion vector and occurrence rate of shot boundaries
• Automatic shot boundary detection combining color, edge, and motion features of adjacent frames
• A novel bifold-stage shot boundary detection algorithm: invariant to motion and illumination
• Video shot boundary detection using motion activity descriptor
• Shot boundary detection from videos using entropy and local descriptor
• Fast shot segmentation combining global and local visual descriptors
• Multi-modal visual features-based video shot boundary detection
• Video shot boundary detection using multiscale geometric analysis of NSCT and least squares support vector machine
• Video shot boundary detection: seven years of TRECVID activity
• Effective fades and flashlight detection based on accumulating histogram difference
• Adaptive moment detector of instantaneous scene changes in a video stream and its training method based on the signs of video stream content: dark/light, calm/dynamic
• Predicting and optimizing image compression
• MSU VQMT scene change detection tool
• FFmpeg shot boundary detection tool

Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785a and the non-commercial fund of science and education development "Intellect".