title: Troubleshooting Blind Image Quality Models in the Wild
authors: Wang, Zhihua; Wang, Haotao; Chen, Tianlong; Wang, Zhangyang; Ma, Kede
date: 2021-05-14

Recently, the group maximum differentiation competition (gMAD) has been used to improve blind image quality assessment (BIQA) models, with the help of full-reference metrics. When applying this type of approach to troubleshoot "best-performing" BIQA models in the wild, we are faced with a practical challenge: it is highly nontrivial to obtain stronger competing models for efficient failure-spotting. Inspired by recent findings that difficult samples of deep models may be exposed through network pruning, we construct a set of "self-competitors," as random ensembles of pruned versions of the target model to be improved. Diverse failures can then be efficiently identified via self-gMAD competition. Next, we fine-tune both the target and its pruned variants on the human-rated gMAD set. This allows all models to learn from their respective failures, preparing themselves for the next round of self-gMAD competition. Experimental results demonstrate that our method efficiently troubleshoots BIQA models in the wild with improved generalizability.

Over the years, researchers and engineers in the field of image processing and computer vision have realized the importance of blind image quality assessment (BIQA) [42]. Numerous BIQA models [3, 30, 33, 34, 47, 48] have been proposed, focusing mainly on boosting performance on existing IQA datasets of fixed sizes. However, the superior correlation numbers on closed test sets may not translate in a reliable way to generalization in the open visual world [44, 29, 36, 50]. Therefore, computational methods for probing and improving the generalizability of BIQA models are highly desirable.

In 2008, Wang and Simoncelli [44] described a maximum differentiation (MAD) competition procedure to compare IQA models in the space of all possible images. Ma et al. [29] proposed gMAD, a discrete instantiation of the MAD method, by restricting the search space to some specific domain of interest. Both methods are able to automatically and efficiently expose failures of a relatively weak IQA model by letting it compete with a set of strong models.

Figure 1: Failure cases of a "top-performing" BIQA method, UNIQUE [52], spotted by an ensemble of its pruned versions. (a) Best/worst-quality images according to the ensemble, with near-identical quality reported by UNIQUE. (b) Best/worst-quality images according to UNIQUE, with near-identical quality reported by the ensemble.

Wang and Ma [43] took advantage of gMAD to identify counterexamples of a BIQA model [52] using a set of stronger full-reference IQA metrics. Furthermore, they demonstrated that harnessing gMAD-selected failures significantly improves BIQA generalizability. Despite the demonstrated success, the progressive failure identification and model rectification pipeline proposed in [43] has two drawbacks. First, it can only be applied to the synthetic distortion scenario, where full-reference IQA models are computable. For BIQA models in the wild, whose input images contain realistic camera distortions, it is highly nontrivial to obtain a list of stronger methods to falsify a state-of-the-art model.
Second, the competing full-reference models are fixed throughout model development, rendering failure-spotting less effective as the target model becomes stronger [43].

BIQA with DNNs. Recently, there has been a surge of interest in developing BIQA models based on DNNs. A major challenge along this direction is to constrain the large set of network parameters using a small set of human-rated images with mean opinion scores (MOSs). Kang et al. [20] trained a DNN with one convolution layer on 32 × 32 patches to compensate for the lack of training data. Bosse et al. [3] developed a DNN with more convolution layers using the same patch-based training strategy. Ma et al. [30] leveraged distortion identification as an auxiliary task to warm up the training. Kim et al. [21] used the error map from the Minkowski metric to regularize the training. Ma et al. [31] took a step further and exploited multiple full-reference metrics as noisy annotators for training DNN-based BIQA models without MOSs. These methods are mostly designed to handle synthetic distortions [38, 26], with limited generalizability to realistic distortions [14, 18]. To meet the cross-distortion-scenario challenge, Zhang et al. [51] bilinearly pooled two feature representations that are sensitive to synthetic and realistic distortions, respectively. Zhang et al. [52] described a simple method to train BIQA models on multiple IQA datasets. The resulting UNIQUE model is capable of assessing image quality in the laboratory and in the wild, and will be used as the target model to demonstrate the feasibility of the proposed method.

Network Pruning. DNNs commonly hinge on overparameterization [28] and can be effectively compressed [23]. Network pruning [15] has been an effective technique to remove redundant computation at surprisingly little sacrifice of test accuracy. For example, Han et al. [15] proposed to prune DNNs by thresholding model weights based on their magnitudes. Li et al. [24] pruned DNN filters with small ℓ1- or ℓ2-norms. Liu et al. [27] encouraged channel sparsity by adding ℓ1-constraints on the batch normalization scaling parameters. Molchanov et al. [35] estimated filter importance using Taylor expansion. He et al. [16] used the geometric median to select the most redundant filters. For a recent review, we refer the reader to [2]. Some researchers have started to rethink pruning beyond just an ad hoc compression tool, and to explore its in-depth connection with DNN memorization/generalization. Frankle et al. [12] were the first to show that full DNNs contain highly sparse "critical subnetworks" that can be trained in isolation from scratch. Such critical subnetworks can be effectively identified by pruning [49, 13]. The most relevant work is due to Hooker et al. [17], who showed that pruning a trained image classifier tends to harm its performance more on the most difficult and long-tailed training images. This implies that pruning might effectively spot samples not well learned by the current model, and provides novel insight into exposing a trained model's potential weaknesses. We take inspiration from [17] and improve upon their method to troubleshoot BIQA models, by identifying and leveraging quality-discriminable images between pruned and non-pruned models.

Active Learning. The main idea of active learning is to mine the most valuable samples to label from a large unlabeled dataset [8]. In active learning for regression, query by committee (QBC) selects the samples on which a committee of models disagrees the most [37, 5].
Expected model change maximization (EMCM) selects samples that can cause the largest change to the current model [6]. Greedy sampling (GS) was originally proposed as a clustering method robust against outliers [1]. It was adapted to active learning in [45], with the goal of selecting samples that increase the diversity of model responses. Residual active learning (RSAL) [9, 10] trains two models to fit the target outputs and the prediction residuals, respectively. The residual model is then used to select samples with the maximum predicted residuals. The procedure of identifying gMAD images [29] in this work can be seen as a form of active sampling with the criterion that the selected images have the greatest potential to falsify the target model. We will compare the error-spotting efficiency of several active learning methods in Section 4.3.

Figure 2: Diagram of troubleshooting BIQA models in the wild. We start with a differentiable parametric target BIQA model, seek pairs of images by letting it compete with ensembles of pruned variants in gMAD [29], collect human scores for the gMAD set, and fine-tune all models on the combination of the previously seen databases and the newly annotated gMAD set. The target and competing models co-evolve for the next round of troubleshooting.

We formulate the general problem of troubleshooting BIQA models in the wild as follows. We assume a strong off-the-shelf BIQA model f that has been trained on a labeled set D of images captured in the wild. Also assumed is a large-scale unlabeled set S, containing images with much greater scene complexities and realistic distortions. The end goals are to identify diverse failures of f in S with a limited human labeling budget, and to leverage the exposed failures to further improve the generalizability of f. The gMAD competition [29] suggests selecting images that optimally distinguish f from a stronger competing model, because those images are most likely to be its counterexamples. The core question is then "how to obtain a set of diverse competing models for efficient failure-spotting?" In this paper, we create strong competing models, dubbed "self-competitors," from the target model by network pruning [17]. After obtaining the labeled gMAD set L through subjective testing, we jointly fine-tune the target and competing models on the combination of D and L, attempting to learn from the spotted failures without forgetting previously seen data [25]. Figure 2 illustrates the proposed diagram of troubleshooting BIQA models in the wild.

The success of failure-spotting of the target model by gMAD [29] depends on the strength of the competing models. Here we resort to network pruning for competing model construction for two reasons. First, DNN-based models are highly overparameterized. Therefore, the performance drop of pruned models on test sets is often insubstantial, and can be easily recovered if fine-tuning is allowed. Second, Hooker et al. [17] showed in the context of image classification that images that differentiate between the original and pruned classifiers are the least-memorized and weakest-learned ones by the original models. They appear the most challenging for both models, and sometimes even for humans, to classify. In BIQA terms, these are likely to be selected by gMAD as the most informative images to falsify both the target and competing models. Specifically, we first generate a list of pruned models $\{h_j\}_{j=1}^m$ from the target model f.
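A minimal sketch of how such pruned self-competitors might be obtained with off-the-shelf PyTorch pruning utilities is given below; the helper name, the pruning ratios, and the omission of the subsequent fine-tuning step are illustrative assumptions, not the authors' implementation.

```python
# Sketch (assumed, not the authors' code): deriving pruned "self-competitors"
# from a trained target BIQA model via one-shot magnitude pruning in PyTorch.
import copy
import torch
import torch.nn.utils.prune as prune

def magnitude_pruned_copy(target_model: torch.nn.Module, amount: float) -> torch.nn.Module:
    """Return a copy of the target model with `amount` of the smallest-magnitude
    weights removed from every Conv2d/Linear layer."""
    pruned = copy.deepcopy(target_model)
    for module in pruned.modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning mask permanent
    return pruned

# Hypothetical usage: several pruning ratios yield a pool {h_j} of self-competitors.
# target = load_pretrained_target()  # assumed loader for the target model f
# pool = [magnitude_pruned_copy(target, r) for r in (0.3, 0.5, 0.7)]
```

In the paper, six different pruning algorithms with three pruning ratios each are used, and every pruned model is fine-tuned to recover its performance before serving as a competitor.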
Thanks to the prosperity of the network pruning field, we are able to leverage a diverse set of state-of-the-art network pruning techniques [15, 16, 24, 27, 35], with different hyperparameter settings, to encourage diversity among the pruned models. Furthermore, as ensemble models have been shown to achieve stronger generalizability than individual models in many fields of machine learning [53], we create n ensemble models $\{g_i\}_{i=1}^n$ by randomly combining a subset of s models out of $\{h_j\}_{j=1}^m$:

$$g_i(x) = \frac{1}{s}\sum_{j=1}^{m}\alpha_{ij}\,h_j(x), \tag{1}$$

where $\alpha_{ij} = 1$ if $h_j$ is selected to create $g_i$ and $\alpha_{ij} = 0$ otherwise, and $\sum_j \alpha_{ij} = s$. In Eq. (1), $\{h_j\}_{j=1}^m$ have been mapped to the same perceptual scale such that the weighted summation is legitimate.

Given the large-scale unlabeled dataset S, gMAD selects the top-k pairs of images that best discriminate between the target model f and the competing model $g_i$:

$$(\hat{x}_{ik}, \hat{y}_{ik}) = \arg\max_{x, y \in S}\; g_i(x) - g_i(y), \quad \mathrm{s.t.}\;\; f(x) = f(y),\;\; \{x, y\} \cap \{\hat{x}_{ij}, \hat{y}_{ij}\}_{j=1}^{k-1} = \varnothing, \tag{2}$$

where $\{\hat{x}_{ij}, \hat{y}_{ij}\}_{j=1}^{k-1}$ is the set of k − 1 pairs of images that have already been selected. The roles of f and $g_i$ may be switched by exchanging their positions in the objective and the constraint. However, recursive optimization of Eq. (2) may simply expose different instantiations of failures with the same underlying root causes (as shown in the first row of Figure 3). To encourage spotting diverse failures, a fine-grained version of gMAD can be formulated as

$$(\hat{x}_{ik}, \hat{y}_{ik}) = \arg\max_{x, y \in S}\; g_i(x) - g_i(y), \quad \mathrm{s.t.}\;\; f(x), f(y) \in [T_l, T_u],\;\; \{x, y\} \cap \{\hat{x}_{ij}, \hat{y}_{ij}\}_{j=1}^{k-1} = \varnothing, \tag{3}$$

where $[T_l, T_u]$ defines a quality level, within which f predicts the two images x and y to have similar quality (see the second row of Figure 3). In this case, f is regarded as the defender, while $g_i$ is the attacker. We may select several (non-overlapping) quality levels to cover the full quality spectrum. All image pairs selected by exhausting the competing models and the quality levels form the gMAD set M, whose size is considerably smaller than that of the unlabeled set S and is adjustable to fit the available human labeling budget. In practice, it is possible to further diversify the spotted failures in M by decreasing k and increasing n, provided that the created ensemble models differ from one another to a certain degree (see Figure 3). Finally, we conduct subjective testing to collect human scores for each (x, y) ∈ M, leading to four possible results:

• Case I. Both f and $g_i$ are consistent with humans in ranking the perceived quality of x and y. This happens because f is a top-performing model, while $g_i$ closely resembles f as its pruned version. We may reduce the possibility of this outcome by increasing the size of S and selecting more suitable network pruning algorithms.

• Case II. $g_i$ is consistent with human perception, but f is not. In this case, the selected (x, y) constitutes a failure of f, and is informative for subsequent model rectification.

• Case III. f is consistent with human perception, but $g_i$ is not. In this case, f successfully spots a counterexample of $g_i$, which seems to deviate from the original goal. However, it is worth noting that (x, y) is still useful in improving the performance of f because we co-evolve f and $g_i$ in the subsequent stage of model rectification. That is, $g_i$ is also given the opportunity to learn from its failures, increasing the possibility of exposing f's weaknesses in the next round of the gMAD competition.

• Case IV. Neither f nor $g_i$ is consistent with human perception. In this case, (x, y) manifests itself as a double failure, which is the most informative in improving the generalizability of f [17].

The labeled gMAD set L exposes aspects of the weaknesses of f, and is thus useful for improving its generalization to the real world.
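To make Eqs. (1) and (3) concrete, the following NumPy sketch averages a random subset of pruned models and picks one gMAD pair per quality level (the k = 1 case) from a precomputed table of predictions; the function names and array layout are assumptions for illustration rather than the authors' code.

```python
# Illustrative sketch of Eq. (1) (ensemble averaging) and Eq. (3) with k = 1.
import numpy as np

def ensemble_predictions(h_preds: np.ndarray, selected: np.ndarray) -> np.ndarray:
    """h_preds: (m, N) predictions of m pruned models on N unlabeled images,
    already mapped to a common perceptual scale; selected: boolean mask of the
    s models forming one ensemble g_i. Returns the averaged prediction (Eq. 1)."""
    return h_preds[selected].mean(axis=0)

def gmad_pair(f_pred: np.ndarray, g_pred: np.ndarray, level: tuple):
    """Pick the image pair that g_i rates most differently while the defender f
    places both images inside the same quality level [level[0], level[1]]."""
    in_level = np.where((f_pred >= level[0]) & (f_pred < level[1]))[0]
    if in_level.size < 2:
        return None
    best = in_level[np.argmax(g_pred[in_level])]   # g_i's best-quality image
    worst = in_level[np.argmin(g_pred[in_level])]  # g_i's worst-quality image
    return best, worst  # indices into the unlabeled set S
```

The defender's equality constraint is realized here through the quality-level interval, which is how the fine-grained formulation in Eq. (3) keeps the defender's predictions of the two images close.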
To avoid catastrophic forgetting [32], we choose to combine L with the previously seen dataset D, and jointly fine-tune f and $\{g_i\}_{i=1}^n$ on the combined set. By doing so, all models are able to learn from their respective failures and improve their generalizability for the next round of the gMAD competition. We may iterate this procedure of failure identification and model rectification for several rounds, leading to a progressive human-in-the-loop troubleshooting method for BIQA models in the wild. We denote the target and competing models in the first round as $f^{(0)}$ and $\{g_i^{(0)}\}_{i=1}^n$, respectively. In the t-th round, we fine-tune $f^{(t-1)}$ and $\{g_i^{(t-1)}\}_{i=1}^n$, compute the responses of $f^{(t)}$ on S, and seek pairs of images associated with $f^{(t)}$ and $\{g_i^{(t)}\}_{i=1}^n$ to form the next gMAD set.

In this section, we first describe the experimental setups, and then provide quantitative and qualitative results to validate the feasibility of the proposed method, followed by an ablation study to test the failure-spotting efficiency of our method.

Target Model f. We use UNIQUE [52], a state-of-the-art BIQA model with, to the best of our knowledge, the best cross-distortion-scenario performance to date. We retrain it on six IQA datasets, i.e., LIVE [38], CSIQ [22], KADID-10k [26], BID [7], LIVE Challenge [14], and KonIQ-10k [18]. We leave out 20% of the images for monitoring the performance changes of f during troubleshooting.

Unlabeled Dataset S. To construct the large-scale dataset S for gMAD to seek potential failures of f, we first download 750,000 images from the Internet, followed by automatic pre-screening to remove duplicate and non-photographic images. Afterward, we sample 100,000 images with marginal distributions nearly uniform with respect to image attributes, including bitrate, JPEG compression ratio, brightness, colorfulness, contrast, and sharpness [40]. Finally, we down-sample the images such that the long edge has 1,024 pixels to facilitate computational prediction. The constructed S includes a wide range of realistic camera distortions, such as sensor noise contamination, motion and out-of-focus blurring, under- and over-exposure, contrast reduction, color cast, and mixtures of these.

To encourage the diversity of the competing model pool, we adopt six network pruning algorithms: one-shot magnitude pruning (OMP) [15], ℓ1-filter pruning [24], ℓ2-filter pruning [24], TaylorFOWeight pruning [35], network slimming [27], and FPGM pruning [16], among which OMP performs unstructured weight pruning, while the others perform filter pruning. Fine-tuning is conducted after each pruning method to recover model performance. In addition, we use three different pruning ratios for each method, resulting in a total of m = 18 pruned models $\{h_j\}_{j=1}^{18}$ from f. We randomly combine s = 8 out of the m = 18 pruned models, giving rise to n = 120 ensemble models $\{g_i\}_{i=1}^{120}$. Note that ensembling requires all pruned models to use the same perceptual scale. To achieve this, we map all model predictions onto the MOS scale [0, 100] of the LIVE Challenge Database [14], with higher values indicating better perceptual quality. The fitted mapping function can be treated as part of the pruned model. As formulated in Eq. (1), ensembling is implemented by simple averaging, which gives all pruned models equal weights.
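The exact form of the per-model mapping function is not specified above; a common choice in the IQA literature is a monotonic logistic fitted between raw predictions and MOSs on a calibration set, sketched here under that assumption (the function name and initialization are illustrative).

```python
# Hedged sketch: fitting a monotonic logistic to map a pruned model's raw scores
# onto the [0, 100] MOS scale of a calibration set such as LIVE Challenge.
import numpy as np
from scipy.optimize import curve_fit

def logistic4(q, a, b, c, d):
    # 4-parameter monotonic logistic: maps raw quality scores q to MOS units.
    return (a - b) / (1.0 + np.exp(-(q - c) / np.abs(d))) + b

def fit_rescaler(raw_scores, mos):
    """raw_scores: model predictions on the calibration images;
    mos: corresponding MOSs in [0, 100]. Returns a callable rescaling function."""
    p0 = [np.max(mos), np.min(mos), np.mean(raw_scores), np.std(raw_scores) + 1e-6]
    params, _ = curve_fit(logistic4, raw_scores, mos, p0=p0, maxfev=10000)
    return lambda q: logistic4(np.asarray(q, dtype=float), *params)
```

Once fitted, the rescaler becomes part of each pruned model, so that the simple averaging in Eq. (1) operates on a shared perceptual scale.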
Labeled gMAD Set L. To seek gMAD pairs, we set five quality levels, roughly covering the full quality range from "bad", "poor", "fair", "good", to "excellent". Two types of pairs are queried by treating the target model f in Eq. (3) as the defender and the attacker, respectively. We retain a single pair that best differentiates between f and $g_i$ at each quality level by setting k = 1. We perform two rounds of troubleshooting (i.e., r = 2), and the numbers of gMAD pairs in $L^{(1)}$ and $L^{(2)}$ are 1,184 and 1,194, respectively. We gather human data from 20 subjects in an office environment with a calibrated display [41], using the single-stimulus continuous quality rating method. We process the raw subjective data using the outlier detection and subject rejection algorithm in [4]. We find that all subjects are valid, and 2.89% and 3.03% of the ratings are identified as outliers and subsequently removed.

For each round of model rectification, we fine-tune all 19 models, including the target model, with the same optimization settings. Specifically, the Adam method is used with a learning rate of $10^{-5}$ and a mini-batch size of 32, half drawn from the previous training set D and half from the gMAD set L. The maximum number of epochs is set to ten. During fine-tuning, we re-scale and crop the images to 384 × 384. We test on images of their original sizes. Each round of fine-tuning takes about 162 GPU hours, as measured on a machine with a single RTX 2080Ti.

Failure Identification. Table 1 lists the Spearman rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) results between model predictions and MOSs on the gMAD sets $L^{(1)}$ and $L^{(2)}$. We also include the performance on the test set of KonIQ-10k [18] for reference.

Table 1: Results on the test set of KonIQ-10k [18] and on the gMAD sets $L^{(1)}$ and $L^{(2)}$. The top section lists two knowledge-driven models. The middle section contains three DNN-based models retrained on the training set of KonIQ-10k. Note that the result of f on $L^{(2)}$ is obtained by fine-tuning it on both D and $L^{(1)}$.

Model | KonIQ-10k | L^(1) | L^(2)
NIQE [34] | 0.521 | 0.340 | 0.293
HOSA [46] | 0.520 | 0.336 | 0.287
DB-CNN [51] | 0.806 | 0.690 | 0.641
MetaIQA [54] | 0.841 | 0.801 | 0.751
HyperIQA [39] | 0… | |

Table 2: Global ranking results of $f^{(1)}$ and $f^{(2)}$ in gMAD. A larger aggressiveness/resistance value indicates better performance [29].

Model | Aggressiveness | Resistance
f^(1) | 1.366 | 0.019
f^(2) | 1.467 | 0.034

Several aspects of the results are worth noting. First, the correlation numbers of f on $L^{(1)}$ and $L^{(2)}$ are much lower than those on KonIQ-10k, indicating the effectiveness of our method in exposing failures of a "top-performing" BIQA model. Second, despite being fine-tuned on $L^{(1)}$, $f^{(1)}$ delivers slightly worse performance on $L^{(2)}$ than $f^{(0)}$ does on $L^{(1)}$. This suggests that the co-evolving ensemble models are able to spot stronger errors of f in the second round of troubleshooting. Third, the identified counterexamples of f in each round show increasing transferability in falsifying five existing BIQA models, as evidenced by larger performance drops on $L^{(2)}$ than on $L^{(1)}$.
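For reference, the SRCC and PLCC numbers reported in Table 1 can be computed with standard SciPy routines; the snippet below is an illustrative sketch, not the authors' evaluation script.

```python
# Illustration: computing SRCC and PLCC between model predictions and MOSs.
import numpy as np
from scipy.stats import spearmanr, pearsonr

def correlation_metrics(predictions, mos):
    predictions = np.asarray(predictions, dtype=float)
    mos = np.asarray(mos, dtype=float)
    srcc = spearmanr(predictions, mos).correlation  # rank correlation (prediction monotonicity)
    plcc = pearsonr(predictions, mos)[0]            # linear correlation (prediction accuracy)
    return srcc, plcc
```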
In Figure 4, we take a look at the empirical distributions of the four possible outcomes of gMAD pairs (see Section 3.2). Generally, the ensemble models are more aggressive in falsifying the target model, and are also more resistant to the target's attacks. After the first round of fine-tuning, the failure-spotting capability of all models is significantly improved.

We progressively fine-tune the target model $f^{(0)}$ on D ∪ $L^{(1)}$ to obtain $f^{(1)}$, which is further fine-tuned on D ∪ $L^{(1)}$ ∪ $L^{(2)}$ to obtain $f^{(2)}$. To verify the relative improvements resulting from model rectification, we let $f^{(0)}$, $f^{(1)}$, and $f^{(2)}$ play the gMAD game against one another on S \ L. In each of the three pairwise competitions, we select 100 gMAD pairs at five quality levels for human annotation. As suggested in [29], we aggregate the paired comparison results into two global ranking vectors, which indicate how aggressive one model is in falsifying other models as the attacker, and how resistant one model is in surviving other models' attacks as the defender. Table 2 shows the global ranking results, with higher values indicating better performance. It is easy to conclude that f continually evolves into a better model in terms of both aggressiveness and resistance in gMAD, without forgetting previously seen data. This verifies the feasibility of the proposed scheme for troubleshooting BIQA models in the wild.

We next compare $f^{(0)}$, $f^{(1)}$, and $f^{(2)}$ qualitatively. Figure 5 shows four representative gMAD pairs between $f^{(0)}$ and $f^{(1)}$. It is clear that the pairs of images in (a) and (b) exhibit substantially different quality, in disagreement with $f^{(0)}$. In contrast, $f^{(1)}$ correctly predicts the top images to have much better quality than the bottom images. When the roles of $f^{(0)}$ and $f^{(1)}$ are reversed, $f^{(0)}$ still fails to expose failures of $f^{(1)}$ (see (c) and (d)), suggesting that $f^{(1)}$ is significantly improved by learning from the gMAD set in the first round.

Figure 6: Representative gMAD pairs between $f^{(1)}$ and $f^{(2)}$. (a) Fixing $f^{(1)}$ at the low quality level. (b) Fixing $f^{(1)}$ at the high quality level. (c) Fixing $f^{(2)}$ at the low quality level. (d) Fixing $f^{(2)}$ at the high quality level.

Figure 6 depicts four gMAD competition results between $f^{(1)}$ and $f^{(2)}$. In (a) and (b), we observe that $f^{(2)}$ is able to falsify $f^{(1)}$ by finding counterexamples that appear dark, but the perceptual gaps between the best and worst cases are not as large as those when $f^{(1)}$ attacks $f^{(0)}$. In (c) and (d), $f^{(2)}$ successfully survives the attacks from $f^{(1)}$, with pairs of images of similar quality according to human perception. This indicates that the perceptual gains from the second round of fine-tuning are not as substantial as those in the first round, which is a common phenomenon in active learning. We also show four gMAD pairs between $f^{(0)}$ and $f^{(2)}$ in Figure 7 to further demonstrate the improvements of $f^{(0)}$ after two rounds of troubleshooting. $f^{(2)}$ favors the top images in (a) and (b), which is consistent with human judgments, suggesting that $f^{(2)}$ successfully attacks $f^{(0)}$. $f^{(0)}$ fails to penalize the top images in (c) and (d), which are spotted by $f^{(2)}$. This further validates the improved generalizability of f to the real world.

In this subsection, we show that the gMAD sampling in our method has stronger failure-spotting capability than five active learning methods for regression: random sampling, QBC [37], EMCM [6], RSAL [9, 10], and GS [1]. We conduct experiments on the smartphone photography attribute and quality (SPAQ) dataset [11], which contains 11,125 human-rated images captured by 66 smartphones. We sample a subset of 200 images with each method, and compute the SRCC and PLCC between MOSs and the predictions of $f^{(0)}$ (i.e., UNIQUE [52]). Table 3 shows the results, with a lower correlation coefficient indicating better performance. As can be seen, the images selected by gMAD [29] lead to the worst performance of $f^{(0)}$ among all methods, which demonstrates the failure-spotting capability of the proposed gMAD sampling.
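As an illustration of one of these baselines, a QBC-style committee-disagreement selector can be sketched as follows; the disagreement measure (standard deviation across committee members) and the function name are assumptions, since the exact implementations compared in Table 3 are not detailed here.

```python
# Hedged sketch of a query-by-committee (QBC) active-learning baseline.
import numpy as np

def qbc_select(committee_preds: np.ndarray, budget: int) -> np.ndarray:
    """committee_preds: (n_models, N) quality predictions of a committee on N
    unlabeled images. Returns indices of the `budget` most disagreed-upon images."""
    disagreement = committee_preds.std(axis=0)       # spread across the committee
    return np.argsort(disagreement)[::-1][:budget]   # largest disagreement first
```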
We have introduced a computational method for progressively troubleshooting BIQA models in the wild. The key to the success of our method is to construct strong "self-competitors" as random ensembles of pruned versions of a "top-performing" target model. We have demonstrated the effectiveness of the ensemble models in exposing diverse counterexamples of the target model in the gMAD competition. A second advantage of our method is the flexibility to co-evolve the target and competing models, which allows all models to learn from their respective failures, making progressive troubleshooting of the target model more effective. Our work opens up a new line of research in BIQA, with many important topics to be explored. For example, the current work only performs two rounds of troubleshooting due to the limited human labeling budget. Nevertheless, it would be interesting to mathematically analyze the convergence of the proposed method, or to devise a practical stopping criterion to guide the number of fine-tuning rounds. Moreover, the computational complexity of constructing competing models, i.e., pruning followed by fine-tuning, is relatively high. It is thus worth exploring more computationally efficient methods, e.g., snapshot ensembles [19], for competing model construction. Another future direction is to extend the current work to troubleshoot BIQA models in both the laboratory and the wild, towards universal and generalizable BIQA.

References

[1] Greedy sampling for approximate clustering in the presence of outliers.
[2] What is the state of neural network pruning?
[3] Deep neural networks for no-reference and full-reference image quality assessment.
[4] ITU-R. Methodology for the subjective assessment of the quality of television pictures.
[5] Active learning for regression based on query by committee.
[6] Maximizing expected model change for active learning in regression.
[7] No-reference blur assessment of digital pictures based on multifeature classifiers.
[8] Improving generalization with active learning.
[9] A two-stage regression approach for spectroscopic quantitative analysis. Chemometrics and Intelligent Laboratory Systems.
[10] Kernel ridge regression with active learning for wind speed prediction.
[11] Perceptual quality assessment of smartphone photography.
[12] The lottery ticket hypothesis: Finding sparse, trainable neural networks.
[13] The lottery ticket hypothesis at scale.
[14] Massive online crowdsourced study of subjective and objective picture quality.
[15] Learning both weights and connections for efficient neural network.
[16] Filter pruning via geometric median for deep convolutional neural networks acceleration.
[17] What do compressed deep neural networks forget?
[18] KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment.
[19] Snapshot ensembles: Train 1, get M for free. International Conference on Learning Representations.
[20] Convolutional neural networks for no-reference image quality assessment.
[21] Deep CNN-based blind image quality predictor.
[22] Most apparent distortion: Full-reference image quality assessment and the role of strategy.
[23] Optimal brain damage.
[24] Pruning filters for efficient ConvNets.
[25] Learning without forgetting.
[26] KADID-10k: A large-scale artificially distorted IQA database.
[27] Learning efficient convolutional networks through network slimming.
[28] Rethinking the value of network pruning.
[29] Group maximum differentiation competition: Model comparison with few samples.
[30] End-to-end blind image quality assessment using deep neural networks.
[31] Blind image quality assessment by learning from multiple annotators.
[32] Catastrophic interference in connectionist networks: The sequential learning problem.
[33] No-reference image quality assessment in the spatial domain.
[34] Making a "completely blind" image quality analyzer.
[35] Importance estimation for neural network pruning.
[36] A fusion-based approach to enhancing multi-modal biometric recognition system failure prediction and overall performance.
[37] Query by committee.
[38] A statistical evaluation of recent full reference image quality assessment algorithms.
[39] Blindly assess image quality in the wild guided by a self-adaptive hyper network.
[40] Shaping datasets: Optimal data selection for specific target distributions across dimensions.
[41] Final report from the video quality experts group on the validation of objective models of video quality assessment.
[42] Reduced- and no-reference image quality assessment.
[43] Active fine-tuning from gMAD examples improves blind image quality assessment.
[44] Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities.
[45] Active learning for regression using greedy sampling.
[46] Blind image quality assessment based on high order statistics aggregation.
[47] Unsupervised feature learning framework for no-reference image quality assessment.
[48] From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality.
[49] Drawing early-bird tickets: Toward more efficient training of deep networks.
[50] Predicting failures of vision systems.
[51] Blind image quality assessment using a deep bilinear convolutional neural network.
[52] Uncertainty-aware blind image quality assessment in the laboratory and wild.
[53] Ensemble Methods: Foundations and Algorithms.
[54] MetaIQA: Deep meta-learning for no-reference image quality assessment.

Acknowledgments

The authors would like to thank all subjects who participated in our subjective study during this period of the coronavirus pandemic. This work was supported in part by the National Natural Science Foundation of China (62071407), and the CityU SRG-Fd and APRC Grants (7005560 and 9610487).