key: cord-0135676-ymev9cuv authors: Ding, Keyan; Ma, Kede; Wang, Shiqi; Simoncelli, Eero P. title: Comparison of Image Quality Models for Optimization of Image Processing Systems date: 2020-05-04 journal: nan DOI: nan sha: e4d02fa4e6338dac56b21602cb95d08db49de0a8 doc_id: 135676 cord_uid: ymev9cuv

The performance of objective image quality assessment (IQA) models has been evaluated primarily by comparing model predictions to human judgments. Perceptual datasets (e.g., LIVE and TID2013) gathered for this purpose provide useful benchmarks for improving IQA methods, but their heavy use creates a risk of overfitting. Here, we perform a large-scale comparison of perceptual IQA models in terms of their use as objectives for the optimization of image processing algorithms. Specifically, we evaluate eleven full-reference IQA models by using them as objective functions to train deep neural networks for four low-level vision tasks: denoising, deblurring, super-resolution, and compression. Extensive subjective testing on the optimized images allows us to rank the competing models in terms of their perceptual performance, elucidate their relative advantages and disadvantages for these tasks, and propose a set of desirable properties for incorporation into future IQA models. The implementations are available at https://github.com/dingkeyan93/IQA-optimization.

The goal of objective image quality assessment (IQA) is the construction of computational models that predict the perceived quality of visual images. Such models can be used to evaluate and compare image processing methods and systems (Wang, 2011). The standard paradigm for testing IQA models is to compare them to human perceptual quality ratings of distorted images, many of which are available in datasets such as LIVE or TID2013 (Ponomarenko et al., 2015). However, excessive reuse of these test sets during IQA model development may lead to overfitting, and as a consequence, poor generalization to images corrupted by distortions that are not in the test sets (see Table 3).

A highly promising (but relatively under-studied) application of objective IQA measures is to guide the design and optimization of new image processing algorithms. The parameters of image processing methods are usually adjusted to minimize the mean squared error (MSE), the simplest of all fidelity metrics, despite the fact that it has been widely criticised for its poor correlation with human perception of image quality (Girod, 1993). Early attempts at perceptual optimization used the structural similarity (SSIM) index (Wang et al., 2004) in place of MSE, and achieved modest gains in applications of image restoration (Channappayya et al., 2008), wireless video streaming (Vukadinovic and Karlsson, 2009), video coding, and image synthesis (Snell et al., 2017). Driven by the surge of impressive results from deep neural networks (DNNs), recent authors have used perceptual measures based on pre-trained DNNs for optimization purposes (Johnson et al., 2016), although these have not been tested against human judgments.

In this paper, we systematically evaluate a large set of full-reference IQA models in the context of perceptual optimization. To determine their suitability for optimization, we first test the models on recovering a reference image from a given initialization by optimizing the model-reported distance to the reference. For many IQA methods, we find that the optimization does not converge to the reference image, and often creates severe distortions.
These optima are either local (due to non-convexity of the IQA objective), or global but non-unique (due to the loss of information in the IQA objective). We select a subset of eleven IQA models as useful "perceptual" objectives for use in optimizing DNNs in four low-level vision tasks: image denoising, blind image deblurring, single image super-resolution, and lossy image compression. Extensive subjective testing on the optimized images reveals the relative performance of the competing models. Moreover, careful inspection of their visual failures indicates limitations in model design, which in turn sheds light on developing future IQA models in a principled way.

Full-reference IQA methods can be broadly classified into five categories:

- Error visibility methods. These apply a distance measure directly to pixels (such as MSE), or to transformed representations of the images. The MSE possesses useful properties for optimization (e.g., differentiability and convexity), and when combined with linear-algebraic tools, analytical solutions can often be obtained. For example, the classical solution to the MSE-optimal denoising problem (assuming a translation-invariant Gaussian signal model) is the Wiener filter (Weiner, 1950). Given that MSE in the pixel domain is poorly correlated with perceived image quality, many IQA models operate by first mapping to a more perceptually appropriate representation (Safranek and Johnston, 1989; Daly, 1992; Lubin, 1993; Watson, 1993; Teo and Heeger, 1994; Watson et al., 1997; Larson and Chandler, 2010; Laparra et al., 2016).

- Structural similarity (SSIM) methods. These are constructed to measure the similarity of local image "structures". The prototype is the SSIM index (Wang et al., 2004), which combines similarity measures of three conceptually independent components: luminance, contrast, and structure. It has become a de facto standard in the field of perceptual image processing, and has inspired subsequent IQA models based on feature similarity (Zhang et al., 2011), gradient similarity (Liu et al., 2012a), edge strength similarity, and saliency similarity.

- Information-theoretic methods. These attempt to measure some approximation of the mutual information between the perceived reference and distorted images as an indication of perceptual image quality. Statistical modeling of the image source, the distortion process, and the human visual system (HVS) is critical in algorithm development. The prototype is the visual information fidelity (VIF) measure.

- Learning-based methods. These learn the relationship between the input images and the perceptual distance from a large set of examples, using supervised machine learning methods. By leveraging the power of DNNs, these methods have come to dominate the field of IQA, in terms of performance on existing image quality databases (Bosse et al., 2018; Prashnani et al., 2018). But given the high dimensionality of the input space (i.e., the number of pixels, typically millions), these methods are prone to overfitting the data. Strategies that compensate for the insufficiency of labeled training data include building on pre-trained networks (Ding et al., 2020), training on local image patches (Bosse et al., 2018), and combining multiple IQA databases (Zhang et al., 2019b).
- Fusion-based methods. These aim to combine existing IQA methods to build a "super-evaluator" that exploits the diversity and complementarity of the incorporated methods for improved quality prediction performance (analogous to "boosting" methods in machine learning). Fusion combinations can be determined empirically (Ye et al., 2014) or learned from data (Liu et al., 2012b; Ma et al., 2019). Some methods incorporate deterministic or statistical image priors to regularize an IQA measure (Jordan, 1881; Ulyanov et al., 2018). Since such regularizers can be seen as a form of no-reference IQA measure (Wang and Bovik, 2011), we also view these as fusion solutions.

3 Screening of Full-Reference IQA Models for Perceptual Optimization

We used a naïve task to demonstrate the issues encountered when using IQA models in gradient-based perceptual optimization. This task also allows us to pre-screen existing models, and to motivate the design of experiments used in subsequent comparisons. Given a reference (undistorted) image $x$ and an initial image $y_0$, we aim to recover $x$ by numerically optimizing

$$\hat{y} = \arg\min_y D(y, x), \tag{1}$$

where $D$ denotes a full-reference IQA measure with a lower score indicating higher predicted quality, and $\hat{y}$ is the recovered image. As an example, if $D$ is the MSE, the (trivial) analytical solution is $\hat{y} = x$, indicating full recoverability. For the majority of current IQA models, which are continuous and differentiable, solutions must be sought numerically, using gradient-based iterative solvers.
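To make the procedure concrete, the following is a minimal PyTorch sketch of the optimization in Eq. (1); the function name and the optimizer settings are illustrative assumptions rather than our exact experimental configuration, and `D` stands for any differentiable full-reference IQA measure with the convention that lower is better.

```python
import torch

def recover_reference(D, x, y0, steps=1000, lr=0.01):
    """Recover x from initialization y0 by minimizing D(y, x).

    D  -- differentiable full-reference IQA measure (lower = better)
    x  -- reference image tensor, shape (1, 3, H, W), values in [0, 1]
    y0 -- initial image tensor of the same shape
    """
    y = y0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = D(y, x)      # model-predicted perceptual distance
        loss.backward()     # gradient with respect to the image itself
        optimizer.step()
        with torch.no_grad():
            y.clamp_(0, 1)  # keep pixel values in the valid range
    return y.detach()
```

In practice, the choice of initialization $y_0$ (white noise or a JPEG image) largely determines which optimum the iterates converge to.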
We consider a total of 17 methods: three error visibility methods, MAD (Larson and Chandler, 2010), PAMSE, and NLPD; seven structural similarity methods, MS-SSIM (Wang et al., 2003), CW-SSIM (Wang and Simoncelli, 2005), FSIM (Zhang et al., 2011), SFF (Chang et al., 2013), GMSD (Xue et al., 2014), VSI, and MCSD; two information-theoretic methods, IFC (Sheikh et al., 2005) and VIF; and five deep learning-based methods, GTI-CNN (Ma et al., 2018), DeepIQA (Bosse et al., 2018), PieAPP (Prashnani et al., 2018), LPIPS, and DISTS (Ding et al., 2020). As this paper focuses on the perceptual optimization performance of individual IQA measures, fusion-based methods are not included.

Fig. 1 and Fig. 2 show the recovery results from two different initializations: a white Gaussian noise image and a JPEG compressed version of the reference image. For all methods, we find that optimization converges to a final image with a substantially lower score than that of the initial image. All models based on injective mappings (MS-SSIM, PAMSE, NLPD, and DISTS) are guaranteed to recover the reference image (although the convergence may depend on the choice of initial image). Many of the remaining IQA models fail to recover the main structures of the reference image when initialized with the white Gaussian noise image, or create noticeable model-dependent distortions when initialized with the JPEG compressed image. This is because these methods rely on surjective mapping functions to transform the images to a "perceptual" space for quality computation. For example, GTI-CNN (Ma et al., 2018) uses a surjective DNN with four stages of convolution, subsampling, and half-wave rectification. The resulting undercomplete representation is optimized for invariance to geometric transformations, at the cost of significant information loss. The examples demonstrate that preservation of some aspects of this lost information is important for perceptual quality. Similar arguments can be applied to other surjective DNN-based IQA models, such as DeepIQA (Bosse et al., 2018) and PieAPP (Prashnani et al., 2018). In addition, with a better initialization (e.g., a JPEG compressed image with roughly correct local luminances), optimization guided by surjective models achieves a perceptually better image, compared with initialization with purely white Gaussian noise. Nevertheless, the visual quality of the final images is in some cases worse than that of the initial JPEG image (see Fig. 2 (h), (n), (o), and (p)).

The reference image recovery test results were used to pre-screen the full set of IQA models, excluding those that perform poorly (due to surjectivity) or nearly identically to another model (due to similar design). The following 11 full-reference IQA models were selected for subsequent evaluation:

1. MAE, the mean absolute error ($\ell_1$-norm) of pixel values, has been frequently adopted in optimization, as has MSE, despite its poor perceptual relevance. Nevertheless, MAE has been shown to consistently outperform MSE in image restoration tasks (Zhao et al., 2016).

2. MS-SSIM (Wang et al., 2003), the multi-scale extension of the SSIM index (Wang et al., 2004), provides more flexibility than single-scale SSIM, allowing for a wider range of viewing distances. MS-SSIM has become a standard "perceptual" quality measure, and has guided the design of DNN-based image super-resolution (Zhao et al., 2016; Snell et al., 2017) and compression (Ballé et al., 2018) algorithms.

3. VIF, the Visual Information Fidelity measure, predicts the quality of the distorted image by quantifying how much information in the reference image is preserved. VIF can be computed in either the spatial or the wavelet (Simoncelli et al., 1992) domain. A distinct property of VIF relative to other IQA models is that it can indicate that the "distorted" image has visual quality superior to that of the reference.

4. CW-SSIM (Wang and Simoncelli, 2005), the Complex Wavelet SSIM index, is designed to be robust to small geometric distortions such as translation and rotation. The construction allows for consistent phase shifts of wavelet coefficients, which preserves local image features. CW-SSIM addresses the limitation of most IQA methods that require a precise registration process at the front end.

5. MAD (Larson and Chandler, 2010), the Most Apparent Distortion measure, explicitly models adaptive strategies of the HVS. Specifically, a detection-based strategy is employed for near-threshold distortions, and an appearance-based strategy is activated if the distortions are clearly visible.

6. FSIM (Zhang et al., 2011), the Feature SIMilarity index, computes quality estimates based on phase congruency (Kovesi, 1999) as the primary feature, and incorporates the gradient magnitude as the complementary feature. It also supplies a color version by making quality measurements from chromatic components.

7. GMSD (Xue et al., 2014), the Gradient Magnitude Similarity Deviation, computes pixel-wise gradient magnitude similarity followed by standard deviation (std) pooling. This pooling strategy is problematic: an image with large but constant local distortion yields an std of zero (indicating the best predicted quality).

8. VSI, the Visual Saliency Induced quality index, assumes that the change of salient regions due to image degradation is closely related to the change of visual quality.
It combines saliency magnitude similarity with gradient magnitude similarity, and demonstrates good quality prediction performance, especially for localized distortions, such as local patch substitution (Ponomarenko et al., 2015).

9. NLPD, the Normalized Laplacian Pyramid Distance, mimics two nonlinear transformations of the early visual system, local luminance subtraction and local gain control, and combines the resulting values using weighted $\ell_p$-norms. The parameters are optimized to minimize the representation redundancies, instead of matching human judgments. NLPD has been successfully employed to optimize image rendering algorithms (Laparra et al., 2017), where the input reference image has a much higher dynamic range than that of the display. It has also been used to optimize a compression system.

10. LPIPS, the Learned Perceptual Image Patch Similarity model, computes the distance between deep representations of two images. The authors showed that feature maps of different DNN architectures have "reasonable" effectiveness in accounting for human perception of image quality. As LPIPS has many different configurations, we choose the default one based on the VGG network (Simonyan and Zisserman, 2015) with the weights learned from the BAPPS dataset.

11. DISTS (Ding et al., 2020), the Deep Image Structure and Texture Similarity metric, is designed with explicit tolerance to texture resampling (e.g., replacing one patch of grass with another). DISTS is based on an injective mapping function built from a variant of the VGG network, and combines structure and texture similarity measurements between corresponding feature maps of the two images. It is sensitive to structural distortions, tolerant of texture resampling, and robust to mild geometric transformations.

We re-implemented all 11 of these models using PyTorch, and verified that our code could reproduce the published performance results for each model on the LIVE, CSIQ (Larson and Chandler, 2010), and TID2013 (Ponomarenko et al., 2015) databases (see Table 2 in Appendix A). We also modified grayscale-only models to accept color images, by computing scores on the RGB channels separately and averaging them to obtain an overall quality estimate.

We used each of the 11 full-reference IQA models as objective functions for optimizing the parameters of DNNs to solve four low-level vision tasks: image denoising, blind image deblurring, single image super-resolution, and lossy image compression. The parameters of each network are optimized to minimize an IQA measure over a database of corrupted and original image pairs via stochastic gradient descent. Implementations of all IQA models, as well as the DNNs for the four tasks, are available at https://github.com/dingkeyan93/IQA-optimization.

Image denoising is a core application of classical image processing, and also plays an essential role in testing prior models of natural images. In its simplest form, one aims to recover an unknown clean image $x \in \mathbb{R}^N$ from an observed image $y$ that has been corrupted by additive white Gaussian noise $n$ of known variance $\sigma^2$, i.e., $y = x + n$. Denoising algorithms can be roughly classified into spatial domain methods (e.g., the Wiener filter (Weiner, 1950), the bilateral filter (Tomasi and Manduchi, 1998), and collaborative filtering (Dabov et al., 2007)) and wavelet transform methods (Donoho and Johnstone, 1995; Simoncelli and Adelson, 1996; Portilla et al., 2003). Later, sparsifying transforms (Elad and Aharon, 2006) and variants of nonlinear shrinkage functions were directly learned from natural image data (Hel-Or and Shaked, 2008; Raphan and Simoncelli, 2008). In recent years, purely data-driven models based on DNNs have achieved new levels of performance. Here, we constructed a simplified DNN, shown in Fig. 3, inspired by the EDSR network (Lim et al., 2017). The network was trained to estimate the noise (which is then subtracted from the observation to yield a denoised image) by minimizing a loss function defined by

$$\ell(\phi) = D\big(y - f_\phi(y),\ x\big), \tag{2}$$

where $D$ is an IQA measure and $f_\phi: \mathbb{R}^N \to \mathbb{R}^N$ is the mapping of the DNN, parameterized by the vector $\phi$.
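A minimal training-step sketch of this residual formulation (Eq. (2)) is given below; `model` plays the role of $f_\phi$ from Fig. 3 and `D` is any differentiable IQA loss, both assumed to be provided, and the helper name is hypothetical.

```python
import torch

def denoising_step(model, D, optimizer, x, sigma=50.0):
    """One training step of Eq. (2): the network predicts the noise,
    which is subtracted from the observation to form the denoised image.

    model -- DNN f_phi mapping R^N -> R^N (predicts the noise)
    D     -- differentiable IQA measure D(distorted, reference)
    x     -- mini-batch of clean images in [0, 1], shape (B, 3, H, W)
    """
    y = x + (sigma / 255.0) * torch.randn_like(x)  # noisy observation
    denoised = y - model(y)                        # subtract predicted noise
    loss = D(denoised, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```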
The goal of image deblurring is to restore a sharp image $x$ from a blurry observation $y$, which can occur due to camera defocus or motion, and/or the motion of objects in a scene. The observation process is usually described by

$$y = Kx, \tag{3}$$

where $K \in \mathbb{R}^{N \times N}$ denotes a spatially-varying linear blur kernel. Blind deblurring refers to the problem in which the blur kernel is unknown. Most early methods, e.g., the classical Lucy-Richardson algorithm (Richardson, 1972; Lucy, 1974), focused on non-blind deblurring, where the blur kernel is given. Successful blind deblurring methods, such as those of Fergus et al. (2006) and Pan et al. (2016), rely heavily on statistical priors of natural images and geometric priors of blur kernels. With the success of deep learning, many DNN-based approaches (Tao et al., 2018; Kupyn et al., 2018) attempt to directly learn the mapping function for blind deblurring without explicitly estimating the blur kernel. Here we also adopted this "kernel-free" approach to train a DNN for image deblurring in an end-to-end fashion. We employed the same network architecture used in denoising (see Fig. 3) with the same loss function (Eq. (2)).

Single image super-resolution aims to enhance the resolution and quality of a low-resolution image, which can be modelled by

$$y = Px, \tag{4}$$

where $P$ denotes downsampling by a factor of $\beta$. This is an ill-posed problem, as downsampling is a projection onto a lower-dimensional subspace. Early attempts exploited sampling theory (Li and Orchard, 2001) or natural image statistics (Sun et al., 2008). Later methods focused on learning mapping functions between the low-resolution and high-resolution images through sparse coding (Yang et al., 2010), locally linear regression (Timofte et al., 2013), self-exemplars (Huang et al., 2015), etc. Since 2014, DNN-based methods have come to dominate this field as well (Dong et al., 2014). An efficient method of constructing a DNN-based mapping is to first extract features from the low-resolution input and then upscale them with sub-pixel convolution (Shi et al., 2016; Lim et al., 2017). Here, we followed this method in constructing a DNN-based function $f_\phi: \mathbb{R}^{N/\beta^2} \to \mathbb{R}^N$, with the architecture specified in Fig. 4. The loss is specified by

$$\ell(\phi) = D\big(f_\phi(y),\ x\big). \tag{5}$$
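For intuition, here is a sketch of how these observation models (Eqs. (3) and (4)) can be simulated to generate training pairs; the normalized blur kernel is an assumed stand-in for the motion kernels of Kupyn et al. (2018), while the bicubic downsampling matches the super-resolution protocol described below.

```python
import torch
import torch.nn.functional as F

def blur(x, kernel):
    """y = Kx: convolve each channel with a blur kernel (Eq. (3)).

    x      -- images, shape (B, C, H, W)
    kernel -- 2-D kernel, shape (kh, kw), assumed normalized to sum to 1
    """
    kh, kw = kernel.shape
    weight = kernel.expand(x.shape[1], 1, kh, kw)  # one copy per channel
    return F.conv2d(x, weight, padding=(kh // 2, kw // 2),
                    groups=x.shape[1])             # channel-wise convolution

def downsample(x, beta=4):
    """y = Px: bicubic downsampling by a factor of beta (Eq. (4))."""
    return F.interpolate(x, scale_factor=1.0 / beta,
                         mode='bicubic', align_corners=False)
```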
Data compression involves finding a more compact data representation from which the original image can be reconstructed. Compression can be either lossless or lossy. Here we followed a prevailing scheme in lossy image compression, transform coding, which consists of transformation, quantization, and entropy coding. Traditional image compression methods (e.g., the most widely used standard, JPEG) use a fixed linear transform for all bit rates. More recently, many researchers have demonstrated the visual benefits of nonlinear transforms, especially DNN-based learnable ones that are capable of adapting their parameters to different bitrate budgets. In this paper, we constructed two DNNs for the analysis and synthesis transforms, respectively, as shown in Fig. 5. The analysis transform $f_a$ maps the image to a latent feature vector $z$, whose values are then quantized to $L$ levels with centers $\{c_1, \ldots, c_L\}$, where $c_i \in \mathbb{R}$ for $i = 1, \ldots, L$. This quantized representation, $\tilde{z} = Q(f_a(x))$, is fed to the synthesis transform $f_s$ to reconstruct the compressed image: $y = f_s(\tilde{z})$. The quantization step has zero gradients almost everywhere, which prevents gradient-based training via backpropagation. Hence, we used a soft differentiable approximation (Mentzer et al., 2018),

$$\tilde{z} = \sum_{i=1}^{L} \frac{\exp\big(-s (z - c_i)^2\big)}{\sum_{j=1}^{L} \exp\big(-s (z - c_j)^2\big)}\, c_i, \tag{6}$$

to backpropagate gradients during training, where the scale parameter $s$ controls the approximation level of the quantization.

Fig. 3 Network architecture used for denoising and deblurring. Apart from initial and final convolutional blocks, it contains 16 residual blocks, each consisting of two convolutions and a half-wave rectifier (ReLU). Conv $h \times w \times c_{in} \times c_{out}$ indicates affine convolution with filter size $h \times w$, over $c_{in}$ input channels, producing $c_{out}$ output channels.

Fig. 4 Network architecture used for super-resolution, containing 16 residual blocks followed by two upsampling modules, each composed of an upsampler (factor of 2, using nearest-neighbor interpolation) and a convolution.

Fig. 5 Network architecture used for lossy image compression, which includes an analysis transformation $f_a$, a quantizer $Q$, and a synthesis transformation $f_s$. $f_a$ is comprised of $n$ blocks, each with a convolution and downsampling (stride) by 2 followed by two residual blocks. After the last block, another convolution layer with $m$ filters is added to produce the internal code representation, the values of which are then quantized by $Q$. $f_s$ consists of a cascade that is mirror-symmetric to $f_a$, with nearest-neighbor interpolation used to upsample the feature maps.

In lossy image compression, the objective function is a weighted sum of two terms that quantify the coding cost and the reconstruction error, respectively:

$$\ell = \lambda H(\tilde{z}) + D(y, x). \tag{7}$$

The first term is typically the entropy (Shannon, 1948) of the discrete code $\tilde{z}$, which provides a lower bound on the bitrate for transmitting the quantized coefficients. The second term is the distortion between the compressed image $y$ and the original image $x$, which we quantified with an IQA model $D$. The Lagrange multiplier $\lambda$ controls the rate-distortion trade-off. Due to the substantially different scales of IQA models, we would need to manually adjust $\lambda$ for each model in order to enable a fair comparison at similar bitrates. To avoid this, following Agustsson et al. (2019), we set $\lambda = 0$ in Eq. (7), and controlled an upper bound on the bitrate by adjusting the architecture of $f_a$ (i.e., the dimension of $\tilde{z}$) and the number of quantization levels $L$ in $Q$. This elimination of the entropy from the objective also means that we did not need to continually re-estimate the probability mass function $P(\tilde{z})$, which varies with changes in the network parameters. The optimization objective in Eq. (7) is thus reduced to

$$\ell(\phi, \psi) = \mathbb{E}_x\big[ D\big(f_s(Q(f_a(x))),\ x\big) \big], \tag{8}$$

where $\phi$ and $\psi$ are the parameters of $f_a$ and $f_s$, respectively. The expectation is approximated by averaging over mini-batches of training images.
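A sketch of this soft quantizer (Eq. (6)), together with the hard quantizer used at test time, is shown below; the tensor shapes and broadcasting scheme are illustrative.

```python
import torch

def soft_quantize(z, centers, s=1.0):
    """Differentiable approximation of the quantizer Q (Mentzer et al., 2018).

    z       -- latent features, any shape
    centers -- 1-D tensor of the L quantization centers {c_1, ..., c_L}
    s       -- scale controlling how closely softmax mimics hard assignment
    """
    d2 = (z.unsqueeze(-1) - centers) ** 2  # squared distance to each center
    w = torch.softmax(-s * d2, dim=-1)     # soft assignment weights
    return (w * centers).sum(dim=-1)       # convex combination of centers

def hard_quantize(z, centers):
    """Actual quantizer applied at test time: nearest center."""
    idx = torch.argmin((z.unsqueeze(-1) - centers) ** 2, dim=-1)
    return centers[idx]
```

As $s$ grows, the softmax weights concentrate on the nearest center, so the soft quantizer approaches the hard one while retaining useful gradients.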
In this section, we present in detail the training of our DNN-based computational models for the four low-level vision tasks, and the subjective testing used to collect human ratings of the optimized images as the ground truth.

For denoising, we fixed the noise std to $\sigma = 50$. For deblurring, we simulated various kernels with different motion patterns and blur levels, as in Kupyn et al. (2018). For super-resolution, we generated low-resolution images by downsampling high-resolution images by a factor of $\beta = 4$ using bicubic interpolation. For compression, we set the number of quantization levels to $L = 2$ with centers $\{-1, 1\}$, the quantization scale parameter to $s = 1$, the number of downsampling stages to $n = 4$, and the number of output channels of $f_a$ to $m = 64$. This leads to a maximum bitrate of $H(\tilde{z}) \le m \log_2 L / 4^n = 0.25$ bits per pixel (bpp).

We chose the 4,744 high-quality images in the Waterloo Exploration Database (Ma et al., 2017b) as reference images. Training was performed in two stages. In the first stage, we pre-trained a network using MAE as the loss function for all four tasks. In the second stage, we fine-tuned the network parameters by optimizing the desired IQA model. Pre-training brings several advantages. First, a number of models are sensitive to initialization (e.g., CW-SSIM, MAD, FSIM, GMSD, and VSI), and pre-training yields more reasonable optimization results (also validated in the task of reference image recovery). Second, models that require backpropagating gradients through multiple stages of computation (e.g., LPIPS and DISTS) converge much faster. Third, it helps us to test whether the recently proposed IQA models lead to consistent perceptual gains on top of MAE, a special case of the simple $\ell_p$-norm distance.

For each training stage of the four tasks, we used the Adam optimizer (Kingma and Ba, 2015) with a mini-batch size of 16 and an initial learning rate of $10^{-4}$, which decays by a factor of 2 every 100K iterations; we set the maximum number of iterations to 500K. We randomly extracted patches of size $192 \times 192 \times 3$ during training, and tested on 20 independent images from the DIV2K validation set (see Fig. 6). Training took roughly 1,000 GPU hours (measured on an NVIDIA GTX 2080 device) for a total of $4 \times 11 = 44$ models. Special treatment (e.g., gradient clipping) was given to some IQA models (e.g., FSIM and VSI) to facilitate training and convergence. Generally, it can be difficult to stabilize the training of DNNs to convergence, especially given that the gradients of different IQA models exhibit idiosyncratic behaviors. Fortunately, a simple criterion exists to test the validity of the optimization results: for a given low-level vision task, the DNN optimized for the IQA measure $D_i$ should produce the best result (averaged over an independent set of images) in terms of $D_i$ itself, when compared with DNNs optimized for $\{D_j\}_{j \neq i}$. Fig. 7 shows the rankings of results generated by networks optimized for each of the 11 IQA models (corresponding to one column in each subfigure) on the DIV2K validation set (Timofte et al., 2017), where 1 and 11 indicate the best and worst rankings, respectively. By inspecting the diagonal elements of the four matrices, we conclude that 43 out of 44 models satisfy the criterion, verifying the soundness of our training procedure. The only exception occurs when MAE is the optimization goal and NLPD is the evaluation measure for the deblurring task. Nevertheless, MAE ranks its own results second. As shown in Sec. 6.2, the resulting images from MAE and NLPD look visually similar.
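This criterion amounts to checking that the diagonal of a cross-evaluation rank matrix contains (mostly) ones, as in the following schematic sketch; the array layout and score orientation are assumptions for illustration.

```python
import numpy as np

def check_self_consistency(scores):
    """Verify the sanity criterion on a cross-evaluation score array.

    scores[i, j, k] -- score assigned by measure D_j to the k-th test image
                       produced by the network optimized for D_i
                       (assumed oriented so that lower = better)
    Returns, for each measure D_i, the rank of its own network among all
    networks when judged by D_i itself (1 = best, as in Fig. 7).
    """
    scores = np.asarray(scores)                       # shape (11, 11, num_images)
    mean = scores.mean(axis=2)                        # average over test images
    ranks = mean.argsort(axis=0).argsort(axis=0) + 1  # rank within each column
    return np.diag(ranks)                             # ideally all ones
```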
We conducted a comprehensive subjective study to acquire human opinions on the perceptual optimization results. A two-alternative forced choice (2AFC) method was employed, allowing differentiation of fine-grained quality variations. Specifically, subjects were asked to choose which of two images has better perceived quality. The original image was also shown for reference. Subjects were allowed unlimited viewing time, and were free to adjust their viewing distance. Our customized graphical user interface also allowed them to zoom in or out of any portion of the two images for more careful visual comparison. We formed a total of $\binom{11}{2} \times 4 \times 20 = 4{,}400$ paired comparisons for the 11 IQA models, 4 tasks, and 20 test images. To reduce fatigue, we performed the experiment in multiple sessions, each consisting of 500 randomly selected comparisons, and allowed subjects to take a break at any time during a session. Subjects were encouraged, but not required, to participate in multiple sessions. An outlier detection feature was also added: we included 5 pairs in which one image was of unambiguously better quality (e.g., the original and noisy images). Our intention was to discard the voting results of subjects who failed on more than one of these pairs. We gathered data from 25 subjects with general background knowledge of image processing and computer vision, who were otherwise naïve to the purpose of this study. The voting results of all subjects turned out to be valid. In total, each image pair was evaluated by at least 5 subjects, and each IQA model was ranked over 1,000 times for each vision task. The numbers of valid responses for denoising, deblurring, super-resolution, and compression were 6,516, 6,471, 6,473, and 6,540, respectively.

In this section, based on the subjective data, we conducted a quantitative comparison of the IQA models through the lens of perceptual optimization, yielding observations that are difficult to obtain from existing IQA databases. We also qualitatively compared the visual results associated with the IQA models. Last, we combined a top-performing IQA model with an adversarial loss (Goodfellow et al., 2014) to test whether additional perceptual gains could be obtained in blind image deblurring.

We employed the Bradley-Terry model (Bradley and Terry, 1952) to convert the paired comparison results to global rankings. This probabilistic model assumes that the visual quality of the $k$-th test image optimized for the $i$-th IQA model, $q_i^k$, follows a Gumbel distribution with location $\mu_i^k$ and scale $s$. Assuming independence between $q_i^k$ and $q_j^k$, the difference $q_i^k - q_j^k$ is a logistic random variable, and therefore $p_{ij}^k = P(q_i^k \ge q_j^k)$ can be computed using the logistic cumulative distribution function:

$$p_{ij}^k = \frac{1}{1 + \exp\big(-(\mu_i^k - \mu_j^k)/s\big)},$$

where $s$ is usually set to 1, leading to a simplified expression:

$$p_{ij}^k = \frac{\exp(\mu_i^k)}{\exp(\mu_i^k) + \exp(\mu_j^k)}.$$

As such, we may obtain the negative log-likelihood of our pairwise count matrix $W^k$:

$$\ell\big(\mu^k; W^k\big) = -\sum_{i}\sum_{j} w_{ij}^k \log p_{ij}^k, \tag{12}$$

where $w_{ij}^k$ represents the number of times that $D_i$ is preferred over $D_j$ for the $k$-th test image. For each of the four low-level vision tasks, we minimized Eq. (12) iteratively using gradient descent to obtain the optimal estimate $\hat{\mu}^k$. We averaged $\hat{\mu}^k$ over the 20 test images, resulting in four global rankings of perceptual optimization performance, as shown in Fig. 8.
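A compact sketch of this maximum-likelihood fit follows; the negative log-likelihood of Eq. (12) is written in its numerically stable log-sigmoid form, and the use of Adam rather than plain gradient descent is a convenience assumption.

```python
import torch

def fit_bradley_terry(W, iters=5000, lr=0.01):
    """Fit Bradley-Terry scores mu to one pairwise count matrix W (Eq. (12)).

    W[i, j] -- number of times D_i was preferred over D_j for one test image
    Returns maximum-likelihood location estimates mu (defined only up to an
    additive constant, so they are anchored to zero mean).
    """
    W = torch.as_tensor(W, dtype=torch.float32)
    mu = torch.zeros(W.shape[0], requires_grad=True)
    optimizer = torch.optim.Adam([mu], lr=lr)
    for _ in range(iters):
        optimizer.zero_grad()
        # log p[i, j] = log sigmoid(mu_i - mu_j), with s = 1
        logits = mu.unsqueeze(1) - mu.unsqueeze(0)
        nll = -(W * torch.nn.functional.logsigmoid(logits)).sum()
        nll.backward()
        optimizer.step()
    return (mu - mu.mean()).detach()
```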
It is clear that MS-SSIM (Wang et al., 2003) and MAE are superior to the other IQA models in the task of denoising, whereas the DNN-based measures DISTS (Ding et al., 2020) and LPIPS outperform the others in the tasks of deblurring, super-resolution, and compression. Thus, there is no single IQA model that performs best across all tasks. We ascribe this to differences in the nature of the tasks: denoising requires distinguishing signal from noise, whereas deblurring, super-resolution, and compression must recover missing details conditioned on the degraded information. Therefore, MS-SSIM and MAE, which are known to prefer smooth appearances, excel at denoising (although the quantitative performance of most IQA models is close in this task); DISTS and LPIPS, which explicitly represent aspects of fine texture, are superior for the remaining three tasks. Finally, it is important to note that many recent models, despite their impressive abilities to explain existing IQA databases, do not offer additional perceptual gains over MAE (and may even reduce perceptual quality).

To investigate whether the optimization results of the IQA models are statistically significant, we conducted an independent two-sample t-test. The null hypothesis is that the ranking scores $\{\mu_i^k\}_{k=1}^{20}$ for $D_i$ and $\{\mu_j^k\}_{k=1}^{20}$ for $D_j$ come from the same normal distribution with unknown variance. When the test accepts the null hypothesis at the $\alpha = 5\%$ significance level, the two IQA models belong to the same group, and have statistically indistinguishable performance. The grouping results are shown in Fig. 8, from which we find that the perceptual gains of MS-SSIM over MAE are statistically insignificant on all four tasks. This is quite surprising, because MS-SSIM is widely regarded as a much better perceptual IQA model than MAE. Relying on similar sets of VGG features (Simonyan and Zisserman, 2015), DISTS and LPIPS also achieve similar performance, except for the super-resolution task, where the former is statistically better.

Fig. 9 Denoising results on two regions cropped from an example image, using a DNN optimized for different IQA models.

By computing the SRCC between the objective model rankings (in Fig. 7) and the subjective human rankings (in Fig. 8), we are able to compare the algorithm-level performance of the 11 IQA models. We find from Table 1 that there is a lack of correlation between model predictions and human judgments for the majority of IQA methods. DISTS and LPIPS tend to rank the images with complex model-dependent distortions in a more perceptually consistent way. We refer interested readers to Appendix A for an image-level comparison on several IQA databases dedicated to various low-level vision problems.
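Both statistical analyses are straightforward to reproduce with SciPy; the following sketch (with assumed array inputs) mirrors the two-sample t-test grouping and the SRCC computation of Table 1.

```python
from scipy import stats

def same_group(mu_i, mu_j, alpha=0.05):
    """Two-sample t-test on per-image ranking scores of two IQA models.

    mu_i, mu_j -- length-20 arrays of Bradley-Terry scores over test images
    Returns True if the two models are statistically indistinguishable,
    i.e., the test fails to reject the null hypothesis at level alpha.
    """
    _, p_value = stats.ttest_ind(mu_i, mu_j)
    return p_value > alpha

def rank_agreement(objective_ranks, subjective_ranks):
    """SRCC between model-predicted and human rankings (as in Table 1)."""
    rho, _ = stats.spearmanr(objective_ranks, subjective_ranks)
    return rho
```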
In this section, we show some visual examples produced by each IQA-optimized model, summarize the main types of distortions, and diagnose the shortcomings of the competing full-reference IQA methods.

Fig. 10 Deblurring results for two regions cropped from an example image, using a DNN optimized for different IQA models.

Fig. 9 shows the denoising results for the "cat" image. It is not hard to observe that MAE, MS-SSIM, and NLPD do a good job of denoising flat regions, but tend to over-smooth texture regions. VIF encourages detail enhancement, leading to artificial local contrast, while GMSD produces a relatively dark appearance, presumably because it discards local luminance information. Moreover, the results of FSIM and VSI exhibit noticeable artifacts. LPIPS and DISTS preserve fine details, but may not fully remove noise in smooth regions, mistaking the remaining noise for visually plausible texture. Overall, the traditional IQA models MAE and MS-SSIM denoise images with various content variations robustly, keeping high-frequency information loss within an acceptable range. This may explain why they are still the dominant objective functions in this task.

Fig. 10 shows the deblurring results for the "basket" image. We find that most IQA methods fail, but in different ways. Specifically, the results of MAE, MS-SSIM, CW-SSIM, and NLPD are still quite blurry. FSIM, GMSD, and VSI generate severe ringing artifacts. VIF again fails to adjust the local contrast. MAD exhibits undesirable dot artifacts, although the main structures are sharp. LPIPS succeeds in deblurring this example, while DISTS produces a result that is closest to the original. This is consistent with current state-of-the-art deblurring results, generated by incorporating comparisons of the later stages of the VGG features into the loss.

Fig. 11 shows the super-resolution results for the "corner tower" image. Again, MAE, MS-SSIM, NLPD, and especially CW-SSIM produce somewhat blurry images, without recovering the fine details that carry high-frequency information. MAD, FSIM, GMSD, and VSI are able to generate some "structures", but these are perceived as unpleasant model-dependent artifacts. Benefiting from its texture synthesis capability, DISTS has the potential to super-resolve perceptually plausible fine details, although they differ from those of the original image.

Fig. 11 Super-resolution results for two cropped regions from an example image, using a DNN optimized for different IQA models.

Fig. 12 shows the compression results for the "airplane" image at 0.24 ± 0.01 bpp. A JPEG image, compressed to 0.25 bpp, suffers from block and blur artifacts. Overall, the main structures of the original image are well preserved by most IQA models, but the fine details (e.g., the lawn) have to be discarded at this low bitrate, or are poorly synthesized as other forms of distortion. VIF reconstitutes a dreary image with over-enhanced global contrast, and CW-SSIM superimposes wavy artifacts on the underlying image. Dot and ringing artifacts are again apparent in the results of MAD and VSI, respectively. The image produced by NLPD is blurred and red-shifted. LPIPS and DISTS show more potential in synthesizing textures that are visually similar to the original, outperforming the other IQA models in this task.

We now summarize and diagnose the "novel" artifacts created during perceptual optimization, which are not typically seen in the traditional image databases used for quality assessment.

- Blurring is a frequently seen distortion type in all four vision tasks, and is mainly caused by error visibility methods (e.g., MAE and NLPD) and structural similarity methods (e.g., MS-SSIM), which rely on simple injective mappings. Specifically, MAE and SSIM work directly with pixels, and NLPD transforms the input image to a multi-scale overcomplete representation using a single stage of local mean subtraction and divisive normalization. Under the strict constraints imposed by the vision tasks, they prefer to make a more conservative estimate, producing something akin to a superposition of all possible outcomes with sharp structures, as would occur with a more conventional loss such as MSE.

- Ringing is a high-frequency distortion type that often occurs in the images optimized for FSIM, VSI, and GMSD (see Fig. 10 (i)-(k)). One common characteristic of the three models is that they rely heavily (in some cases, solely) on local gradient magnitude for feature similarity comparison, underweighting (or abandoning) other perceptually important features (such as local luminance and local phase).
This creates enormous "shortcuts" that DNN-based computational models can take to generate distortions with similar local gradient statistics.

- Dot patterns are typical in the optimization results of MAD, which extracts lower-order image statistics from the responses of Gabor filters at multiple scales and orientations. The resulting set of statistical measurements is insufficient to summarize natural image structures that exhibit higher-order dependencies. Therefore, MAD is "blind" to distortions that satisfy the same set of statistical constraints, and gives the optimized distorted image a high quality score.

- Over-enhancement of local image contrast is encouraged by VIF, which, in most of our experiments, causes significant quality degradation. We believe this arises because VIF does not fully respect the reference information when normalizing the covariance term. Specifically, only the second-order statistics of the reference image are used to construct the normalization factor. By incorporating the same statistics computed from the distorted image into the normalization, the problem of over-enhancement may be alleviated. In general, quality assessment of image enhancement is a challenging problem (Fang et al., 2015; Wang et al., 2015), and to the best of our knowledge, all existing full-reference IQA models fail to reward properly-enhanced cases while penalizing over-enhanced cases.

- Luminance and color artifacts are perceived in the final images associated with many IQA models. Two causes seem plausible. First, methods such as GMSD discard luminance information, leaving a huge "null space" to accommodate luminance distortions. Second, methods such as MS-SSIM and NLPD were originally designed for grayscale images only. Applying them to RGB channels separately fails to take into account saturation (color contrast). Transforming to a perceptually better color space, and making use of knowledge of color distortions (Rajashekar et al., 2009), offers an opportunity for improvement.

In the field of image restoration and generation, many state-of-the-art algorithms are based on adversarial training (Goodfellow et al., 2014), demonstrating an impressive capability to synthesize realistic visual content. The output of the adversarial loss is the probability of an image being computer-generated, and is therefore of low relevance to the perceived quality of that image. In other words, the adversarial loss is a poor no-reference IQA model at the image level (verified by our experiments on the LIVE database). However, it may be a "good" one at the algorithm level, meaning that given a set of images generated by a computational method and another set of natural photographic images, the average probability on the combined set quantitatively measures the capability of the method to generate realistic high-quality images. In this subsection, we explored the combination of the adversarial loss and top-performing IQA measures for additional perceptual gains. We chose the task of blind image deblurring, and fine-tuned a state-of-the-art model, DeblurGAN-v2 (Kupyn et al., 2019). According to our perceptual optimization results, we selected the best-performing IQA model (DISTS) for this experiment. We followed the same training strategy, and only replaced the loss function of the generator from

$$\ell = 0.5\, \ell_{\mathrm{MSE}} + 0.006\, \ell_{\mathrm{VGG}} + 0.01\, \ell_{\mathrm{adv}} \tag{13}$$

to

$$\ell = \ell_{\mathrm{DISTS}} + \lambda\, \ell_{\mathrm{adv}}. \tag{14}$$

The first and second terms in Eq. (13) compute the MSE on pixels and on the responses of conv3_3 of VGG19 (Simonyan and Zisserman, 2015), respectively, and $\ell_{\mathrm{adv}}$ denotes a variant of the adversarial loss (Kupyn et al., 2019).
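A sketch of the replacement generator objective (Eq. (14)) is given below; the non-saturating adversarial term and the weight `lam` are simplifying assumptions standing in for the relativistic adversarial variant of DeblurGAN-v2, and the callable names are hypothetical.

```python
def generator_loss(dists, discriminator, restored, sharp, lam=0.01):
    """Fine-tuning loss for the generator: DISTS plus an adversarial term,
    replacing the original 0.5*MSE + 0.006*VGG + 0.01*adv combination.

    dists         -- callable DISTS(x, y) returning a scalar distance
    discriminator -- critic returning realism logits for an image batch
    """
    perceptual = dists(restored, sharp)
    # generic non-saturating adversarial term: push critic logits upward
    adversarial = -discriminator(restored).mean()
    return perceptual + lam * adversarial
```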
An immediate advantage of this replacement is that the number of hyperparameters is reduced, making manual hyperparameter adjustment easier. After fine-tuning, the average DISTS value decreases from 0.22 to 0.18 on the Köhler test dataset (Köhler et al., 2012). Fig. 13 shows two visual examples, from which we find that the fine-tuned results have sharper edges and enhanced contrast, indicating that perceptual gains may be obtained by DISTS on the two examples.

We have conducted a comprehensive study of the perceptual optimization of four low-level vision tasks, guided by 11 full-reference IQA models. This provides an alternative means of testing the perceptual relevance of IQA models in a more realistic setting, which we believe is an important complement to the conventional methodology for IQA model evaluation. Extensive subjective testing led to several useful findings. First, through perceptual optimization, we generated a number of novel distortions (different from those in existing IQA databases), which may easily fool many competing models. It should be noted that the emergence of specific distortions is in principle dependent on the experimental setting (e.g., initialization strategy, model architecture, and optimization technique). Second, the standard full-reference IQA models MS-SSIM and MAE will continue to play a central role in optimizing image processing systems due to their robustness and simplicity. Third, more recent IQA models with surjective mappings may still be used to monitor image quality and to optimize the parameter settings of image processing methods, but only in a limited and well-controlled space. Last, the two DNN-based models, LPIPS and DISTS, seem to stand out in our experiments, but their high computational complexity and lack of interpretability may hinder their application.

Our work has interesting connections to two separate lines of research. First, inspired by the philosophy of "analysis by synthesis" (Grenander, 1970), Wang and Simoncelli (2008) introduced the maximum differentiation competition methodology to automatically synthesize images for efficiently comparing IQA models. However, the generated images may be highly unnatural, and are therefore of less practical importance. Ma et al. (2020) alleviated this issue by manually constraining the search space to a finite image set of current interest. Our approach can be seen as a mixture of the two methods, in the sense that the test images are automatically generated, and are well controlled by specifying realistic vision tasks. Second, the existence of type II adversarial examples (Szegedy et al., 2013) has exposed the vulnerability of many computer vision algorithms, where a tiny change to the input, imperceptible to the human eye, causes the algorithm to make classification mistakes. In our case, weaknesses in an IQA model are exposed through optimized images that serve as type I adversarial examples of the model: a significant change is made to the original image that substantially degrades its perceptual quality, but the model claims that this image is of high quality.

Fusion of IQA models, which has not been investigated in this paper, may be a feasible way of developing a better IQA method in the context of perceptual optimization, as is constantly practiced by researchers in related fields. The main technical difficulty is the lack of a principled way to combine IQA models of different perceptual scales.
Besides model fusion, a more promising design strategy for new IQA methods is to enforce a set of desirable properties. First, the transformation used in the IQA model should be perceptual, mapping the input images into a space where Euclidean distances match human measurements of image quality. This is in the same spirit in which color scientists pursue perceptually uniform color spaces. Zhang et al. (2018) and Ding et al. (2020) demonstrated that a cascade of linear convolution, downsampling, and rectified nonlinearity optimized for high-level vision tasks may be a good candidate. Second, the IQA model should enjoy unique optima (i.e., the underlying mapping should be injective) to guarantee that images close to the optimum are visually similar to the original. This criterion was respected by early models (e.g., MS-SSIM), but has been largely overlooked in recent IQA model development. Third, the IQA model should be continuous and differentiable, with well-behaved gradients, to aid optimization in complex situations (e.g., training DNNs with millions of parameters). Last but not least, the IQA model should be computationally efficient, enabling real-time quality assessment and perceptual optimization.

Appendix A

A conventional method for evaluating IQA models is to compute their agreement with subjective scores on one or more standardized IQA databases (e.g., LIVE, CSIQ (Larson and Chandler, 2010), or TID2013 (Ponomarenko et al., 2015)) consisting of artificially distorted images. Many existing IQA models have achieved impressive correlation numbers on these databases (see Table 2), but their performance in assessing the perceptual quality of images produced by low-level vision algorithms has not been tested. In this appendix, we tested them on multiple human-rated image generation/restoration quality assessment databases, including:

- A denoising database, FLT (Egiazarian et al., 2018), consisting of 300 images, obtained by filtering 75 grayscale texture images with the BM3D algorithm at different levels of noise suppression.

- Two motion deblurring databases, Liu13 (Liu et al., 2013) and Lai16 (Lai et al., 2016). Liu13 contains 240 synthetically blurred examples, each of which is deblurred by five algorithms. Lai16 synthesizes 100 non-uniformly and 100 uniformly blurred images, which are further deblurred by 13 algorithms.

- Two super-resolution databases, Ma17 (Ma et al., 2017a) and QADS. The former has 1,620 super-resolved images from 9 methods, while the latter contains 980 images created by 21 methods.

- A dehazing database, SHRQ (Min et al., 2019), composed of 45 regular hazy images and 30 aerial hazy images, which are dehazed by 8 algorithms.

- A depth image-based rendering database, Tian19 (Tian et al., 2018), consisting of 140 rendered images from 10 sequences with 7 methods.

- Two texture synthesis databases, SynTex (Golestaneh et al., 2015) and TQD (Ding et al., 2020). SynTex contains 105 synthesized textures using 5 algorithms from 21 reference textures. TQD has 10 reference textures, each with 15 variations, including 7 artificially distorted images, 4 example-based synthesized textures, and 4 cropped subimages.

- A patch database, BAPPS, which includes 26.9K image patches generated by colorization, video deblurring, frame interpolation, and super-resolution algorithms, respectively.

Table 3 shows the performance comparison of 13 IQA methods in terms of SRCC. Considering the potentially noisy judgments, BAPPS is evaluated using the 2AFC score: $pq + (1 - p)(1 - q)$, where $p$ is the fraction of human votes and $q \in \{0, 1\}$ is the vote of an IQA model. When $q$ agrees with the majority of human votes, the 2AFC score is larger, indicating better performance (see Table 4).
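The 2AFC score is simple to compute; in the sketch below, the worked example values are illustrative.

```python
def two_afc_score(p, q):
    """2AFC score pq + (1 - p)(1 - q) used for BAPPS.

    p -- fraction of human votes preferring image A over image B
    q -- the IQA model's binary vote (1 if it prefers A, else 0)
    """
    return p * q + (1 - p) * (1 - q)

# Example: if 80% of humans prefer A and the model agrees (q = 1),
# the score is 0.8; if the model disagrees (q = 0), it drops to 0.2.
print(two_afc_score(0.8, 1))  # 0.8
print(two_afc_score(0.8, 0))  # 0.2
```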
We find that the overall performance of all models is lower than on the standard IQA databases (see Table 2), indicating the difficulty of generalizing to unseen distortions. Moreover, DNN-based measures are relatively better than knowledge-driven models on these application-oriented databases, but there is still significant room for improvement to meet real-world challenges.

Fig. 14 and Fig. 15 show two quality assessment examples for compression and super-resolution, respectively. Here we only compared the most widely used measures, PSNR and SSIM, and the two that performed best in both optimization and assessment, LPIPS and DISTS. It is not surprising to find that PSNR and SSIM have poor correlations with human opinions, as they focus on signal fidelity rather than perceptual quality. LPIPS and DISTS perform better, but the former is somewhat oversensitive to texture substitution (see Fig. 15). As many recent image restoration algorithms succeed in generating richer textures, DISTS holds much promise for use in quality assessment for such applications.

Fig. 15 Super-resolution results produced by …, EDSR (Lim et al., 2017), SRGAN (Ledig et al., 2017), ESRGAN, and RankSRGAN (Zhang et al., 2019a), respectively. One can see that the GAN-based results (f)-(h) are visually superior to the others, contrary to the predictions of PSNR and SSIM. LPIPS indicates that the result (f) is worse than (d) and (e), in disagreement with visual inspection. DISTS correlates well with human perception in this example.
References

- Generative adversarial networks for extreme learned image compression
- End-to-end optimization of nonlinear transform codes for perceptual quality
- End-to-end optimized image compression
- Variational image compression with a scale hyperprior
- Deep neural networks for no-reference and full-reference image quality assessment
- Rank analysis of incomplete block designs: I. The method of paired comparisons
- Sparse feature fidelity for perceptual image quality assessment
- SSIM-optimal linear image restoration
- Image denoising by sparse 3-D transform-domain collaborative filtering
- Visible differences predictor: An algorithm for the assessment of image fidelity
- Image quality assessment: Unifying structure and texture similarity
- Learning a deep convolutional network for image super-resolution
- Adapting to unknown smoothness via wavelet shrinkage
- Statistical evaluation of visual quality metrics for image denoising
- Image denoising via sparse and redundant representations over learned dictionaries
- No-reference quality assessment of contrast-distorted images based on natural scene statistics
- Removing camera shake from a single photograph
- What's wrong with mean-squared error?
- Super-resolution from a single image
- The effect of texture granularity on texture synthesis quality
- Generative adversarial nets
- A unified approach to pattern analysis
- A discriminative approach for wavelet denoising
- Single image super-resolution from transformed self-exemplars
- Perceptual losses for real-time style transfer and super-resolution
- Sur la série de Fourier
- Adam: A method for stochastic optimization
- Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database
- Image features from phase congruency
- DeblurGAN: Blind motion deblurring using conditional adversarial networks
- DeblurGAN-v2: Deblurring (orders-of-magnitude) faster and better
- A comparative study for single image blind deblurring
- Perceptual image quality assessment using a normalized Laplacian pyramid
- Perceptually optimized image rendering
- Most apparent distortion: Full-reference image quality assessment and the role of strategy
- Photo-realistic single image super-resolution using a generative adversarial network
- New edge-directed interpolation
- Enhanced deep residual networks for single image super-resolution
- Image quality assessment based on gradient similarity
- Image quality assessment using multi-method fusion
- A no-reference metric for evaluating the quality of motion deblurring
- The use of psychophysical data and models in the analysis of display system performance
- An iterative technique for the rectification of observed distributions
- Learning a no-reference quality metric for single-image super-resolution
- High dynamic range image compression by optimizing tone mapped image quality index
- Waterloo Exploration Database: New challenges for image quality assessment models
- Geometric transformation invariant image quality assessment using convolutional neural networks
- Blind image quality assessment by learning from multiple annotators
- Group maximum differentiation competition: Model comparison with few samples
- Conditional probability models for deep image compression
- Quality evaluation of image dehazing methods using synthetic hazy images
- Blind image deblurring using dark channel prior
- Image database TID2013: Peculiarities, results and perspectives
- Image denoising using scale mixtures of Gaussians in the wavelet domain
- PieAPP: Perceptual image-error assessment through pairwise preference
- Quantifying color image distortions based on adaptive spatiochromatic signal decompositions
- Optimal denoising in redundant representations
- Bayesian-based iterative method of image restoration
- A perceptually tuned sub-band image coder with image dependent quantization and post-quantization data compression
- A mathematical theory of communication
- Image information and visual quality
- An information fidelity criterion for image quality assessment using natural scene statistics
- Image and video quality assessment research at LIVE
- Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
- Noise removal via Bayesian wavelet coring
- Shiftable multiscale transforms
- Very deep convolutional networks for large-scale image recognition
- Learning to generate images with perceptual similarity metrics
- Image super-resolution using gradient profile prior
- Intriguing properties of neural networks
- Scale-recurrent network for deep image deblurring
- Perceptual image distortion
- A benchmark of DIBR synthesized view quality assessment metrics on a new database for immersive media applications
- Anchored neighborhood regression for fast example-based super-resolution
- 2017 challenge on single image super-resolution: Methods and results
- Bilateral filtering for gray and color images
- Deep image prior
- Trade-offs in bit-rate allocation for wireless video streaming
- SSIM-motivated rate-distortion optimization for video coding
- A patch-structure representation method for quality assessment of contrast changed images
- Multiscale contrast similarity deviation: An effective and efficient index for perceptual image quality assessment
- ESRGAN: Enhanced super-resolution generative adversarial networks
- Applications of objective image quality assessment methods
- Reduced- and no-reference image quality assessment: The natural scene statistic model approach
- Translation insensitive image similarity in complex wavelet domain
- Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities
- Multiscale structural similarity for image quality assessment
- Image quality assessment: From error visibility to structural similarity
- DCTune: A technique for visual optimization of DCT quantization matrices for individual images
- Visibility of wavelet quantization noise
- Extrapolation, Interpolation and Smoothing of Stationary Time Series: With Engineering Applications
- Perceptual fidelity aware mean squared error
- Gradient magnitude similarity deviation: A highly efficient perceptual image quality index
- Fast direct super-resolution by simple functions
- Image super-resolution via sparse representation
- Beyond human opinion scores: Blind image quality assessment based on synthetic scores
- Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising
- FSIM: A feature similarity index for image quality assessment
- VSI: A visual saliency-induced index for perceptual image quality assessment
- The unreasonable effectiveness of deep features as a perceptual metric
- RankSRGAN: Generative adversarial networks with ranker for image super-resolution
- Learning to blindly assess image quality in the laboratory and wild
- Edge strength similarity for image quality assessment
- Loss functions for image restoration with neural networks
- Visual quality assessment for super-resolved images: Database and method

Acknowledgements The authors would like to thank all participants who contributed to our subjective study during the special period of the coronavirus disease (COVID-19) outbreak.