title: Enhancing VMAF through New Feature Integration and Model Combination
authors: Zhang, Fan; Katsenou, Angeliki; Bampis, Christos; Krasula, Lukas; Li, Zhi; Bull, David
date: 2021-03-10
doi: 10.1109/pcs50896.2021.9477458

Abstract: VMAF is a machine learning based video quality assessment method, originally designed for streaming applications, which combines multiple quality metrics and video features through SVM regression. It offers higher correlation with subjective opinions compared to many conventional quality assessment methods. In this paper we propose enhancements to VMAF through the integration of new video features and alternative quality metrics (selected from a diverse pool), alongside multiple model combination. The proposed combination approach enables training on multiple databases with varying content and distortion characteristics. Our enhanced VMAF method has been evaluated on eight HD video databases, and consistently outperforms the original VMAF model (0.6.1) and other benchmark quality metrics, exhibiting higher correlation with subjective ground truth data.

I. INTRODUCTION

In recent years, the consumption of video data has increased significantly. CISCO predicted that, by 2022, 80% of the world's data traffic would be video [1]; given the impact of the COVID-19 pandemic since 2020, this is likely an underestimate. Video quality assessment is thus a key tool for video service providers, enabling them to accurately estimate the quality of their encoded content and hence improve the experience of their users. Quality assessment plays a critical role in benchmarking encoding methods, comparing codec configurations, and creating optimal bit rate ladders prior to adaptive streaming over networks with variable bandwidth to devices with differing capabilities.

Perceptual video quality can be assessed through psychophysical experiments which, while effective, are costly and time consuming. Objective quality metrics offer more efficient solutions, but to be effective they must be designed to achieve good correlation with subjective results. Objective quality assessment methods can be classified into three primary categories according to how much information they utilise from the reference (original) material: (i) full reference (FR), (ii) reduced reference, and (iii) no reference. In this paper, we focus solely on full reference video quality metrics.

The most commonly used FR metric is PSNR (peak signal-to-noise ratio). PSNR simply measures the pixel-wise distortions between the test content and its reference counterpart, and thus does not always correlate well with visual perception. Over the past two decades, many perceptually inspired objective quality metrics have been proposed, targeting enhanced correlation with subjective quality compared to PSNR. Notable examples include SSIM [2] and its variants [3, 4], VIF [5], VSNR [6], VQM [7], MOVIE [8], MAD [9], STMAD [10] and PVM [11]. Further details on objective quality assessment methods can be found in references [12, 13].

In recent years, improved quality assessment performance has been achieved using machine learning techniques. One of the most successful examples is the Video Multimethod Assessment Fusion (VMAF) approach, developed by Netflix.
VMAF combines several existing quality metrics and video features, including ADM [14], VIF (at four different scales) and the average temporal frame difference (TI), using a support vector machine (SVM) regressor. It was trained on a large HD video quality database, VMAF+ [15], and has been reported to outperform conventional quality assessment methods on various subjective databases. It also includes extensions developed for different viewing scenarios, such as UHDTV and mobile phones. However, VMAF includes only one temporal feature (TI), which is based on simple frame differencing; this can lead to inconsistent performance when VMAF is applied to video content with complex temporal activity (e.g. dynamic textures) [16]. Additionally, the VMAF+ training database is limited in that it only contains test sequences generated by an H.264 codec and resolution re-sampling. The trained fusion model therefore cannot be guaranteed to achieve optimal performance on content compressed by other video codecs.

In this context, inspired by previous work on dynamic texture classification [17] and model fusion [18], we propose a VMAF enhancement method based on three primary modifications: (i) the ADM metric is enhanced by integrating a feature related to dynamic textures; (ii) two new SVM models are trained based on selection from a pool of original VMAF and new features; (iii) the final quality index is obtained by linearly combining these two models, which are trained on two different databases. The enhanced VMAF method has been benchmarked against the original VMAF and other popular quality metrics on eight evaluation databases. Results show consistent improvement in correlation with subjective ground truth on all test datasets.

The remainder of this paper is organised as follows. Section II describes the proposed algorithm, focusing on the details of the three primary modifications. Section III presents the experimental configurations, including training/test materials and evaluation metrics. Section IV provides the comparison results between the proposed method and benchmark approaches on the test content, alongside the complexity analysis. Finally, conclusions and future work are outlined in Section V.

II. PROPOSED METHOD

The proposed approach is illustrated in Fig. 1. Compared to the original VMAF method, the new approach calculates a modified ADM index from the input reference and test video frames, which is fused with the temporal frame difference (TI), four VIF values (at four different resolution scales) and newly selected video features using a re-trained SVM regression model (Model 1). A second SVM model (Model 2) is trained on newly selected features and a different training dataset to provide a further estimate of the test video quality. The outputs from these two SVM models are combined using a normalised linear model to generate the final quality index. The implementation details of each of these components are described below.

A. Enhancement of ADM

ADM [14] is an image quality assessment method which predicts perceptual quality by separately estimating detail losses (DLM) and additive impairments (AIM). The calculation of the detail losses exploits two HVS characteristics: contrast sensitivity and spatial masking. However, the current ADM model is limited in that it does not fully capture temporal masking effects (for example those due to dynamic textures, e.g. water, smoke and steam). Therefore, based on the dynamic texture classification method proposed in [17], we have modified the contrast masking thresholds MT_λ (equation (18) of [14]) in ADM to include a weighting by a dynamic texture feature (DTF_λ), as shown below:

MT_λ,new = MT_λ · (1 + α · DTF_λ)

Here MT_λ,new are the new masking thresholds, which replace the original MT_λ in the ADM calculation; λ is the wavelet decomposition level defined in [14]; and α is a parameter that was obtained empirically to achieve optimal correlation performance (in terms of the Spearman Rank Order Correlation) with subjective ground truth on the VMAF+ database [15]. DTF_λ is a dynamic texture feature based on the displaced frame difference (luma component only) after motion compensation (using the previous frame as reference). The motion compensation operation is performed using an optical flow approach (the Lucas-Kanade method) [19]. Using notation similar to that in [14], where o represents the original frame and o_DF represents the displaced frame after motion compensation:

DTF = |o − o_DF|

DTF is then decomposed using a 2D discrete wavelet transform (DWT) at different scales to calculate the corresponding DTF_λ at each scale λ.
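To make this concrete, the following is a minimal sketch of the DTF computation, not the authors' implementation: it substitutes OpenCV's dense Farneback flow for the Lucas-Kanade approach used in the paper, and the per-level pooling of the wavelet detail coefficients is an assumption, since the exact pooling is not specified above.

```python
import cv2
import numpy as np
import pywt

def dtf_per_scale(prev_luma, curr_luma, levels=4, wavelet="db2"):
    """Dynamic texture feature sketch: motion-compensate the previous
    frame, take the displaced frame difference DTF = |o - o_DF|, then
    decompose it with a 2D DWT. Inputs are 8-bit single-channel frames."""
    # Dense optical flow from the current frame to the previous one
    # (Farneback here; the paper uses a Lucas-Kanade approach).
    flow = cv2.calcOpticalFlowFarneback(
        curr_luma, prev_luma, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    # Motion compensation: sample the previous frame at the displaced
    # positions to obtain the displaced frame o_DF.
    h, w = curr_luma.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    displaced = cv2.remap(prev_luma, map_x, map_y, cv2.INTER_LINEAR)

    # Displaced frame difference (luma only).
    dtf = np.abs(curr_luma.astype(np.float32) - displaced)

    # Multi-scale 2D DWT; coeffs[0] is the approximation, coeffs[1:]
    # are (H, V, D) detail tuples, coarsest level first.
    coeffs = pywt.wavedec2(dtf, wavelet, level=levels)

    # Pool each level to a scalar DTF_lambda (mean absolute detail
    # coefficient -- an assumed pooling, for illustration only).
    return [np.mean([np.abs(band).mean() for band in detail])
            for detail in coeffs[1:]]
```

Each per-level value would then weight the corresponding masking threshold MT_λ via the equation above.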
B. New Feature Selection

In order to achieve improved correlation performance, new feature candidates (alongside the original VMAF features) are computed. These include popular quality metrics and video features, as summarised in TABLE I. These features are calculated separately on the luma and chroma channels (where applicable) and at four different resolution scales (low resolution frames are obtained through 2D DWT decomposition), resulting in a total of 165 new features.

TABLE I: Candidate quality metrics and features.
Quality metrics: PSNR, SSIM [2], MS-SSIM [4], VSNR [6], VIF [5], MAD [20], PVM [11]
Image and video features: SI [21], CF [21], TP [16], ∆SI, ∆CF, ∆TI, ∆TP, LUMA, BL, ED

The definitions of most features in TABLE I can be found in the cited references. Here, the blurring artefact feature BL is similar to that in [11]; it calculates the energy loss in the three high frequency DWT subbands (H, V, D) of the test video frames. The edge artefact feature ED is the additional energy present in the high frequency DWT subbands compared to the reference coefficients.

During feature selection, we followed an approach similar to the Sequential Forward Method Selection (SFMS) method in [22], using the Spearman Rank Order Correlation Coefficient (SROCC) as the performance measurement. The procedure is summarised in Algorithm 1.

Algorithm 1: Sequential forward feature selection
1: Initialise the selected feature set F*
2: while candidate features remain do
3:   Pick the next best feature: the candidate that yields the highest SROCC when added to F*
4:   if the SROCC improves then
5:     Add the feature to F*
6:   else
7:     Break the while loop
8:   end if
9: end while
10: return F*

For SVM Model 1 in Fig. 1, the selected feature set is initialised with the six original VMAF features, in which the ADM metric is replaced by the enhanced version described in Section II-A. For SVM Model 2, the initial selected feature set is empty.

C. Model Combination

Frame level outputs from the two SVM models, Q_M1 and Q_M2, are linearly combined to generate the final quality index for each frame:

Q = β · Q_M1 + (1 − β) · Q_M2

The weighting parameter β ∈ [0, 1] was selected to achieve the best overall SROCC value on the training datasets.
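A minimal sketch of the greedy loop in Algorithm 1 is given below, under stated assumptions: features are taken as already pooled to one vector per sequence, scikit-learn's NuSVR stands in for the SVM regressor, and the cross-validation protocol and hyper-parameters are illustrative rather than those used in the paper.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import cross_val_predict
from sklearn.svm import NuSVR

def srocc_for(feature_idx, X, y):
    """Cross-validated SROCC of an SVR trained on a feature subset."""
    preds = cross_val_predict(NuSVR(C=1.0, nu=0.9), X[:, feature_idx], y, cv=5)
    rho, _ = spearmanr(preds, y)
    return rho

def forward_select(X, y, initial=()):
    """Sequential forward feature selection driven by SROCC.
    X: (n_sequences, n_candidates) feature matrix; y: subjective MOS.
    `initial` seeds F* (e.g. the six VMAF features for Model 1,
    empty for Model 2)."""
    selected = list(initial)
    best = srocc_for(selected, X, y) if selected else -np.inf
    remaining = [f for f in range(X.shape[1]) if f not in selected]
    while remaining:
        # Pick the next best feature: the candidate whose addition
        # yields the highest SROCC.
        scores = {f: srocc_for(selected + [f], X, y) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break  # no improvement: stop and return F*
        selected.append(f_best)
        remaining.remove(f_best)
        best = scores[f_best]
    return selected, best
```

The same loop serves both models by changing only the `initial` seed set and the training database.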
III. EXPERIMENTAL CONFIGURATION

Ten video quality databases containing HD (1920×1080) content have been used for training and evaluating the proposed approach. As we focus solely on distortions and artefacts introduced through compression, non-compression distortion versions have been removed from these datasets. However, because resolution re-sampling is commonly used in many video streaming applications, test sequences with re-sampling artefacts have been retained. The databases employed here, their compression codecs and distortion types are summarised in TABLE II.

Among these databases, VMAF+ is used to train SVM regression model 1 (M1), as it was the training dataset used for the original VMAF model. CC-HDDO was selected to train the second SVM model (M2), because it contains video content compressed by both HEVC and AV1 codecs, as well as resolution re-sampling artefacts. The other eight databases are used for evaluation.

The subjective scores associated with the eight test databases were used to evaluate the performance of the proposed method. For reference and further comparison, eight other popular objective quality metrics have also been tested on these databases: PSNR, SSIM [2], MS-SSIM [4], VIF [5], VSNR [6], ADM [14], VMAF (0.6.1) [15] and ST-VMAF [18]. The performance of all metrics was evaluated using the Spearman Rank Order Correlation Coefficient (SROCC). A significance test was also conducted to identify differences in performance between the original VMAF (0.6.1) and the other tested metrics on all test databases. The approach in [8, 11] was used, whereby an F-test was conducted on the residuals between the mean opinion scores (MOS) of each database and the MOS predicted by each tested objective quality metric through non-linear regression using a logistic function.

Our three primary contributions have also been tested individually on the eight test datasets and compared to the full version. (1) ADM modification (denoted w/o E-ADM): the effectiveness of the enhanced ADM method (E-ADM) was evaluated by replacing it with the original ADM [14]; the other selected features remain the same, and the SVM models were re-trained on the same training databases. (2) New feature selection (denoted w/o NF): the newly selected features are substituted by the original VMAF features (except E-ADM), and the two models are trained separately on the two databases using the same (original VMAF) features; the enhanced ADM metric is retained here. (3) Model combination (denoted M1 and M2): the correlation performance of the two separate SVM models is presented and compared to the combined version.

IV. RESULTS AND DISCUSSION

The newly selected feature sets for the two SVM models are listed in TABLE III, where M1 corresponds to the extension of the original VMAF and M2 is the second model based on newly selected features. The weighting parameter β is 0.5.

TABLE IV presents a summary of the performance comparison between the proposed approach and the eight benchmark quality metrics. The values in each cell, x(y), correspond to the SROCC value (x) and the F-test result (y) at a 95% confidence interval: y = 1 indicates that the metric is significantly superior to VMAF 0.6.1 (y = −1 if the opposite is true), while y = 0 indicates no significant difference between them. For each evaluated quality metric, the SROCC value on each test dataset is presented alongside an aggregate (Overall) SROCC value for all eight databases. We followed the method in [18] to calculate the aggregate correlation coefficient using the Fisher transformation [29], in which the SROCC value ρ for each database is first transformed as:

z = 0.5 · ln((1 + ρ) / (1 − ρ))

The transformed values are then averaged and inverse transformed to obtain the aggregate SROCC.

It can be observed that the proposed method offers higher aggregate SROCC values across the eight test databases than the other quality metrics. Furthermore, improvement over the original VMAF 0.6.1 has been achieved on all eight test databases, and this result is statistically significant (based on the F-test) on the NFLX dataset. Moreover, according to the ablation study results in TABLE IV, all three primary contributions lead to higher overall correlation performance compared to their tested replacements, with the newly selected features contributing the largest improvement.
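The aggregation amounts to averaging correlations in Fisher z-space. A minimal sketch follows; the SROCC values in the example are placeholders, not results from the paper:

```python
import numpy as np

def aggregate_srocc(per_database_sroccs):
    """Aggregate SROCC values across databases via the Fisher
    transformation: z = 0.5 * ln((1 + rho) / (1 - rho)) = arctanh(rho),
    average in z-space, then inverse-transform with tanh."""
    z = np.arctanh(np.asarray(per_database_sroccs, dtype=float))
    return float(np.tanh(z.mean()))

# Hypothetical per-database SROCC values, for illustration only.
print(aggregate_srocc([0.90, 0.85, 0.78, 0.92]))  # -> single overall SROCC
```

Averaging in z-space rather than averaging the raw correlations reduces the bias introduced by the bounded [−1, 1] range of correlation coefficients [29].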
When quality metrics are employed in practical applications, it is useful to compare their performance on distorted video sequences generated by different codecs and coding configurations. Inspired by the database combination idea proposed in [30], we calculated the subjective MOS difference (after the ranges had been normalised to 0-100) between every pair of test sequences within each of the eight test databases. When computing each MOS difference, the order of the pair was chosen such that the difference is greater than or equal to zero. This results in MOS difference scores for 102,842 test sequence pairs. The quality index difference between the test sequences in every pair was also computed (following the same ordering). The distributions of the quality index difference values are shown in Fig. 2 for both the original VMAF and the proposed method. We also calculated the accuracy ratio for both metrics, defined as the proportion of pairs for which the quality index difference produces the same classification result as the MOS difference. The proposed approach offers a higher accuracy ratio (85.8%) than the original VMAF (84.4%); this 1.4 percentage point improvement represents an additional 1,440 (102,842 × 1.4%) video pairs that are correctly classified, and is significant according to Fisher's exact test (p = 1.2 × 10^−11).

V. CONCLUSION

In this paper, we have proposed an enhanced version of VMAF based on dynamic texture feature integration (into ADM), additional feature selection and model combination. Multiple training databases have been used to train the SVM fusion models. The new approach shows consistent improvement over the original VMAF model on all test databases in terms of correlation with subjective ground truth. It also achieves superior performance when used to compare different distorted versions of the same sources on a large combined dataset. Future work will focus on the more challenging cases (e.g. the outliers in Fig. 2) that arise when multiple codecs and various coding configurations are employed.
REFERENCES
[1] CISCO visual networking index: forecast and methodology.
[2] Image quality assessment: from error visibility to structural similarity.
[3] Video quality assessment based on structural distortion measurement.
[4] Multi-scale structural similarity for image quality assessment.
[5] An information fidelity criterion for image quality assessment using natural scene statistics.
[6] VSNR: a wavelet-based visual signal-to-noise ratio for natural images.
[7] A new standardized method for objectively measuring video quality.
[8] Motion tuned spatio-temporal quality assessment of natural videos.
[9] Most apparent distortion: full-reference image quality assessment and the role of strategy.
[10] A spatiotemporal most-apparent-distortion model for video quality assessment.
[11] A perception-based hybrid model for video quality assessment.
[12] Objective video quality assessment methods: a classification, review, and performance comparison.
[13] Intelligent Image and Video Compression: Communicating Pictures.
[14] Image quality assessment by separately evaluating detail losses and additive impairments.
[15] Toward a practical perceptual video quality metric.
[16] BVI-HD: a video quality database for HEVC compressed and texture synthesised content.
[17] A parametric framework for video compression using region-based texture models.
[18] Spatiotemporal feature integration and model fusion for full reference video quality assessment.
[19] An iterative image registration technique with an application to stereo vision.
[20] Seven challenges in image quality assessment: past, present, and future research.
[21] Analysis of public image and video databases for quality assessment.
[22] A fusion-based video quality assessment (FVQA) index.
[23] A subjective comparison of AV1 and HEVC for adaptive video streaming.
[24] MCL-V: a streaming video quality assessment database.
[25] Comparing VVC, HEVC and AV1 using objective and subjective assessments.
[26] SHVC verification test results.
[27] IVP subjective quality video database.
[28] Report on the validation of video quality models for high definition video content.
[29] Averaging correlations: expected values and bias in combined Pearson rs and Fisher's z transformations.
[30] Training objective image and video quality estimators using multiple databases.