title: Comparison of Full-Reference Image Quality Models for Optimization of Image Processing Systems
authors: Ding, Keyan; Ma, Kede; Wang, Shiqi; Simoncelli, Eero P.
date: 2021-01-21
journal: Int J Comput Vis
DOI: 10.1007/s11263-020-01419-7
Communicated by Daniel Scharstein.

The performance of objective image quality assessment (IQA) models has been evaluated primarily by comparing model predictions to human quality judgments. Perceptual datasets gathered for this purpose have provided useful benchmarks for improving IQA methods, but their heavy use creates a risk of overfitting. Here, we perform a large-scale comparison of IQA models in terms of their use as objectives for the optimization of image processing algorithms. Specifically, we use eleven full-reference IQA models to train deep neural networks for four low-level vision tasks: denoising, deblurring, super-resolution, and compression. Subjective testing on the optimized images allows us to rank the competing models in terms of their perceptual performance, elucidate their relative advantages and disadvantages in these tasks, and propose a set of desirable properties for incorporation into future IQA models.

The goal of objective image quality assessment (IQA) is the construction of computational models that predict the perceived quality of visual images. IQA models are generally classified according to their reliance on the availability of an original reference image. Full-reference methods compare a distorted image to the complete reference image, reduced-reference methods require only partial information about the reference image, and no-reference (or blind) methods operate solely on the distorted image. The standard paradigm for testing IQA models is to compare them to human quality ratings of distorted images, which have been made available in datasets such as LIVE and TID2013 (Ponomarenko et al. 2015). However, excessive reuse of these test sets during IQA model development may lead to overfitting, and as a consequence, poor generalization to images corrupted by distortions that are not present in the test sets (see Table 4).

A highly promising but relatively under-studied application of IQA measures is to use them as objectives for the design and optimization of new image processing algorithms. The parameters of image processing methods are usually adjusted to minimize the mean squared error (MSE), the simplest of all fidelity metrics, despite the fact that it has been widely criticised for its poor correlation with human perception of image quality (Girod 1993). Early attempts at perceptual optimization using the structural similarity (SSIM) index (Wang et al. 2004) in place of MSE achieved perceptual gains in applications of image restoration (Channappayya et al. 2008), wireless video streaming (Vukadinovic and Karlsson 2009), video coding, and image synthesis (Snell et al. 2017). A recent publication used perceptual measures based on pre-trained deep neural networks (DNNs) for optimization of super-resolution results (Johnson et al. 2016), although these have not been tested against human judgments. In this paper, we systematically evaluate a large set of full-reference IQA models in the context of perceptual optimization.
To determine their suitability for optimization, we first test the models on recovering a reference image from a given initialization by optimizing the model-reported distance to the reference. For many IQA methods, we find that the optimization does not converge to the reference image, and can generate severe distortions. These optima are either local, or global but non-unique. We select eleven optimization-suitable IQA models as perceptual objectives, and use them to optimize DNNs for four low-level vision tasks: image denoising, blind image deblurring, single image super-resolution, and lossy image compression. Extensive human perceptual tests on the optimized images reveal the relative performance of the competing models. Moreover, inspection of their visual failures indicates limitations in model design, providing guidance for the development of future IQA models.

Full-reference IQA methods can be broadly classified into five categories:

-Error visibility methods apply a distance measure directly to pixels (e.g., MSE), or to transformed representations of the images. The MSE in particular possesses useful properties for optimization (e.g., differentiability and convexity), and when combined with linear-algebraic tools, analytical solutions can often be obtained. For example, the classical solution to the MSE-optimal denoising problem (assuming a translation-invariant Gaussian signal model) is the Wiener filter (Wiener 1950). Given that MSE in the pixel domain is poorly correlated with perceived image quality, many IQA models operate by first mapping images to more perceptually appropriate representations (Safranek and Johnston 1989; Daly 1992; Lubin 1993; Watson 1993; Teo and Heeger 1994; Watson et al. 1997; Larson and Chandler 2010; Laparra et al. 2016), and measuring MSE within that space.

-Structural similarity (SSIM) methods are constructed to measure the similarity of local image "structures", often using correlation measures. The prototype is the SSIM index (Wang et al. 2004), which combines similarity measures of three conceptually independent components: luminance, contrast, and structure. It has become a de facto standard in the field of perceptual image processing, and provided a prototype for subsequent IQA models based on feature similarity (Zhang et al. 2011), gradient similarity (Liu et al. 2012a), edge strength similarity, and saliency similarity.

-Information-theoretic methods measure some approximation of the mutual information between the perceived reference and distorted images. Statistical modeling of the image source, the distortion process, and the human visual system (HVS) is critical in algorithm development. A prototypical example is the visual information fidelity (VIF) measure.

-Learning-based methods learn a metric from a training set of images and corresponding perceptual distances using supervised machine learning methods. By leveraging the power of DNNs, these methods have achieved state-of-the-art performance on existing image quality databases (Bosse et al. 2018; Prashnani et al. 2018). But given the high dimensionality of the input space (i.e., millions of pixels), these methods are prone to overfitting the limited available data. Strategies that compensate for the insufficiency of labeled training data include building on pre-trained networks (Ding et al. 2020), training on local image patches (Bosse et al. 2018), and combining multiple IQA databases (Zhang et al. 2019b).
-Fusion-based methods combine existing IQA methods to build a "super-evaluator" that exploits the diversity and complementarity of their constituent methods (analogous to "boosting" methods in machine learning). Fusion combinations can be determined empirically (Ye et al. 2014) or learned from data (Liu et al. 2012b; Ma et al. 2019). Some methods incorporate deterministic or statistical image priors to regularize an IQA measure (Jordan 1881; Ulyanov et al. 2018). Since such regularizers can be seen as a form of no-reference IQA measure (Wang and Bovik 2011), we also view these as fusion solutions.

We used a naïve task to demonstrate the issues encountered when using IQA models in gradient-based perceptual optimization. This task also allows us to pre-screen existing models, and to motivate the design of experiments used in subsequent comparisons. Given a reference (undistorted) image x and an initial image y_0, we aimed to recover x by numerically optimizing

ŷ = arg min_y D(y, x),  (1)

where D denotes a full-reference IQA measure with a lower score indicating higher predicted quality, and ŷ is the recovered image. For example, if D is MSE, the (trivial) analytical solution is ŷ = x, indicating full recoverability. The majority of current IQA models are continuous and differentiable, and solutions must be obtained numerically using gradient-based iterative solvers.

We considered an initial set of 17 methods, which we believe cover the full spectrum of full-reference IQA methods. These include three error visibility methods-MAD (Larson and Chandler 2010), PAMSE, and NLPD; seven structural similarity methods-MS-SSIM (Wang et al. 2003), CW-SSIM (Wang and Simoncelli 2005), FSIM (Zhang et al. 2011), SFF (Chang et al. 2013), GMSD (Xue et al. 2014), VSI, and MCSD; two information-theoretic methods-IFC (Sheikh et al. 2005) and VIF; and five DNN methods-GTI-CNN (Ma et al. 2018), DeepIQA (Bosse et al. 2018), PieAPP (Prashnani et al. 2018), LPIPS, and DISTS (Ding et al. 2020). As this paper focuses on the perceptual optimization performance of individual IQA measures, fusion-based methods are not included.

Fig. 1 Reference image recovery test. Starting from (a) a white Gaussian noise image, we recover images by optimizing the predicted quality relative to a reference image, using different IQA models (b)-(r)

Figures 1 and 2 show recovery results from two different initializations-a white Gaussian noise image and a JPEG-compressed version of a reference image, respectively. For all IQA methods, the optimization converges to a final image with a substantially better score than that of the initial image. Models based on injective mappings, such as MS-SSIM, PAMSE, NLPD, and DISTS, are able to recover the reference image (although the rate of convergence may depend on the choice of initial image). Many of the remaining IQA models generate a final image with worse visual quality than that of the initial image (e.g., compare Fig. 2 (a) with (o) or (p)), often with noticeable model-dependent artifacts. This is because these methods rely on surjective mapping functions to transform the images to a reduced "perceptual" space for quality computation. For example, GTI-CNN (Ma et al. 2018) uses a surjective DNN with four stages of convolution, subsampling, and half-wave rectification. The resulting undercomplete representation is optimized for geometric transformation invariance, at the cost of significant information loss.
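To make the recovery test concrete, the sketch below shows one way to implement the optimization of Eq. (1) in PyTorch, the framework used for our re-implementations. It is a minimal illustration rather than the exact procedure used in our experiments: `iqa_model` is a placeholder for any differentiable full-reference measure returning a scalar distance (lower is better), and the optimizer settings are assumptions.

```python
import torch

def recover_reference(iqa_model, x, y_init, steps=2000, lr=0.01):
    """Recover the reference x by minimizing D(y, x) over the image y.

    iqa_model: differentiable callable returning a scalar distance
               (lower = better predicted quality).
    x, y_init: tensors of shape (1, 3, H, W) with values in [0, 1].
    """
    y = y_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = iqa_model(y, x)  # model-reported perceptual distance
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            y.clamp_(0.0, 1.0)  # keep the image in the valid range
    return y.detach()
```

Running this loop from a white-noise y_init versus a JPEG-compressed y_init reproduces the two initialization conditions of Figs. 1 and 2.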
The recovery examples demonstrate that preservation of some aspects of this lost information is important for perceptual quality. Similar arguments can be applied to other surjective DNN-based IQA models, such as DeepIQA (Bosse et al. 2018) and PieAPP (Prashnani et al. 2018). Generally, optimization guided by the surjective models "recovers" more structures when initialized with the JPEG image (which provides roughly correct local luminances), as compared to initialization with purely white Gaussian noise.

Fig. 2 Reference image recovery test. Starting from (a) a JPEG-compressed version of a reference image, we recover images by optimizing the predicted quality relative to the reference image, using different IQA models (b)-(r)

The reference image recovery test results were used to pre-screen the initial set of IQA models, excluding those that perform poorly (due to surjectivity). In addition, we excluded models with similar designs. This process yielded 11 full-reference IQA models to be compared in our human subject evaluations:

1. MAE, the Mean Absolute Error (ℓ1-norm) of pixel values, has been frequently adopted in optimization, despite its poor perceptual relevance. Nevertheless, MAE has been shown to consistently outperform MSE (ℓ2-norm) in image restoration tasks (Zhao et al. 2016).

2. MS-SSIM (Wang et al. 2003), the Multi-Scale extension of the SSIM index (Wang et al. 2004), provides more flexibility than single-scale SSIM, allowing for a wider range of viewing distances. It decomposes the input images into Gaussian pyramids (Burt and Adelson 1983), and computes contrast and structure similarities at each scale and luminance similarity at the coarsest scale only. MS-SSIM has become a standard "perceptual" quality measure, and has been used to guide the design of DNN-based image super-resolution (Zhao et al. 2016; Snell et al. 2017) and compression (Ballé et al. 2018) algorithms.

3. VIF, the Visual Information Fidelity measure, quantifies how much information from the reference image is preserved in the distorted image. A Gaussian scale mixture (Portilla et al. 2003) is used as a source model to summarize natural image statistics, and mutual information is estimated assuming only signal attenuation and additive noise perturbations. A distinct property of VIF relative to other IQA models is that it can handle cases in which the "distorted" image is visually superior to the reference.

4. CW-SSIM (Wang and Simoncelli 2005), the Complex Wavelet SSIM index, is designed to be robust to small geometric distortions such as translation and rotation. The construction allows for consistent local phase shifts of wavelet coefficients, which preserves image features. CW-SSIM addresses a common limitation of IQA methods: the requirement of precise spatial registration of the reference and distorted images.

5. MAD (Larson and Chandler 2010), the Most Apparent Distortion measure, explicitly models adaptive strategies of the HVS. Specifically, a detection-based strategy considering local luminance and contrast masking is employed for near-threshold distortions, and an appearance-based strategy involving local spatial-frequency statistics is activated for supra-threshold distortions. The two strategies are combined by a weighted geometric mean, where the weight is determined based on the amount of distortion.

6. FSIM (Zhang et al. 2011), the Feature SIMilarity index, assumes that the HVS understands an image mainly according to its low-level features.
It computes quality estimates based on phase congruency (Kovesi 1999) as the primary feature, and incorporates the gradient magnitude as the complementary feature. Moreover, the phase congruency component serves as a local weighting factor to derive an overall quality score. FSIM also supplies a color version by making quality measurements from chromatic components.

7. GMSD (Xue et al. 2014), the Gradient Magnitude Similarity Deviation, focuses on computational efficiency of quality prediction, by simply computing pixel-wise gradient magnitude similarity followed by standard deviation (std) pooling. This pooling strategy is, however, problematic, because an image with large but constant local distortion yields an std of zero (indicating the best predicted quality).

8. VSI, the Visual Saliency Induced quality index, assumes that the change of salient regions due to image degradation is closely related to the change of visual quality. The saliency map is used not only as a quality feature, but also as a weighting function to characterize the importance of a local region. By combining saliency magnitude, gradient magnitude, and chromatic features, VSI demonstrates good quality prediction performance, especially for localized distortions, such as local patch substitution (Ponomarenko et al. 2015).

9. NLPD, the Normalized Laplacian Pyramid Distance, mimics the nonlinear transformations of the early visual system: local luminance subtraction and local gain control, and combines these values using weighted ℓp-norms. The parameters are optimized to minimize the representation redundancies, instead of matching human judgments. NLPD has been successfully employed to optimize image rendering algorithms (Laparra et al. 2017), where the input reference image has a much higher dynamic range than that of the display. It has also been used to optimize a compression system.

10. LPIPS, the Learned Perceptual Image Patch Similarity model, computes the Euclidean distance between deep representations of two images. The authors showed that feature maps of different DNN architectures have "reasonable" effectiveness in accounting for human perception of image quality. As LPIPS has many different configurations, we chose the default one based on the VGG network (Simonyan and Zisserman 2015), with the weights learned from the BAPPS dataset. VGG-based LPIPS can be seen as a generalization of the "perceptual loss" (Johnson et al. 2016), which computes the Euclidean distance on convolution responses from one stage of VGG.

11. DISTS (Ding et al. 2020), the Deep Image Structure and Texture Similarity metric, is explicitly designed to tolerate texture resampling (e.g., replacing one patch of grass with another). DISTS is based on an injective mapping function built from a variant of the VGG network, and combines SSIM-like structure and texture similarity measurements between corresponding feature maps of the two images. It is sensitive to structural distortions, but at the same time robust to texture resampling and modest geometric transformations.

We re-implemented all 11 of these models using PyTorch, and verified that our code could reproduce the published performance results for each model on the LIVE, CSIQ (Larson and Chandler 2010), and TID2013 (Ponomarenko et al. 2015) databases (see Table 2 in "Appendix 1"). We modified grayscale-only models to accept color images, by computing scores on RGB channels separately and averaging them to obtain an overall quality estimate.
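The following sketch illustrates this channel-wise extension. The uniform averaging over R, G, and B follows the description above, but the wrapper itself is an illustrative assumption rather than our released code; `gray_iqa` stands in for any grayscale-only model.

```python
import torch
import torch.nn as nn

class RGBWrapper(nn.Module):
    """Apply a grayscale-only IQA model to each RGB channel and average."""

    def __init__(self, gray_iqa: nn.Module):
        super().__init__()
        self.gray_iqa = gray_iqa  # expects inputs of shape (N, 1, H, W)

    def forward(self, y, x):
        # Score each channel independently, then average the three scores.
        scores = [
            self.gray_iqa(y[:, c:c + 1], x[:, c:c + 1]) for c in range(3)
        ]
        return torch.stack(scores).mean()  # overall quality estimate
```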
We used each of the 11 full-reference IQA models to guide the learning of DNNs to solve four low-level vision tasks:

-image denoising,
-blind image deblurring,
-single image super-resolution,
-lossy image compression.

The parameters of each network are optimized to minimize an IQA measure over a database of corrupted and original image pairs via stochastic gradient descent. Implementations of all IQA models, as well as the DNNs for the four tasks, are available at https://github.com/dingkeyan93/IQA-optimization.

Image denoising is a core application of classical image processing, and also plays an essential role in testing prior models of natural images. In its simplest form, one aims to recover an unknown clean image x ∈ R^N from an observed image y that has been corrupted by additive white Gaussian noise n of known variance σ^2, i.e., y = x + n. Denoising algorithms can be roughly classified into spatial domain methods [e.g., the Wiener filter (Wiener 1950), the bilateral filter (Tomasi and Manduchi 1998), and collaborative filtering (Dabov et al. 2007)], and wavelet transform methods (Donoho and Johnstone 1995; Simoncelli and Adelson 1996; Portilla et al. 2003). Adaptive sparsifying transforms (Elad and Aharon 2006) and variants of nonlinear shrinkage functions have also been directly learned from natural image data (Hel-Or and Shaked 2008; Raphan and Simoncelli 2008). In recent years, purely data-driven models based on DNNs have achieved state-of-the-art levels of performance.

Fig. 4 Network architecture used for super-resolution, containing 16 residual blocks followed by two upsampling modules, each composed of an upsampler (factor of 2, using nearest-neighbor interpolation) and a convolution

Fig. 5 Network architecture used for lossy image compression, which includes an analysis transformation f_a, a quantizer Q, and a synthesis transformation f_s. f_a is comprised of n blocks, each with a convolution and downsampling (stride) by 2 followed by two residual blocks. After the last block, another convolution layer with m filters is added to produce the internal code representation, the values of which are then quantized by Q. f_s consists of a cascade that is mirror-symmetric to f_a, with nearest-neighbor interpolation used to upsample the feature maps

Here, we constructed a simplified DNN, shown in Fig. 3, inspired by the EDSR network (Lim et al. 2017). The network was trained to estimate the noise (which is then subtracted from the observation to yield a denoised image), by minimizing a loss function defined as

ℓ(φ) = D(y − f_φ(y), x),  (2)

where D is an IQA measure and f_φ: R^N → R^N is the mapping of the DNN, parameterized by vector φ.

The goal of image deblurring is to restore a sharp image x from a blurry observation y, which can occur due to defocus and motion of the camera, and motion of objects in a scene. The observation process is usually described by

y = Kx,  (3)

where K ∈ R^{N×N} denotes a spatially-varying linear kernel. Blind deblurring refers to the problem in which the blur kernel is unknown. Most early methods, e.g., the classical Lucy-Richardson algorithm (Richardson 1972; Lucy 1974), focused on non-blind deblurring, where the blur kernel is assumed known. Successful blind deblurring methods, such as (Fergus et al. 2006; Pan et al. 2016), rely heavily on statistical priors of natural images and geometric priors of blur kernels. With the success of deep learning, many DNN-based approaches (Tao et al. 2018; Kupyn et al. 2018) attempt to directly learn the mapping function for blind deblurring without explicitly estimating the blur kernel.
Here we also adopted this "kernel-free" approach to train a DNN for image deblurring in an end-to-end fashion. We employed the same network architecture used in denoising (see Fig. 3) with the same loss function (Eq. (2)).

Single image super-resolution aims to enhance the resolution and quality of a low-resolution image, which can be modelled by

y = Px,  (4)

where P denotes downsampling by a factor of β. This is an ill-posed problem, as downsampling is a projection onto a lower-dimensional subspace, and its solution must rely on some form of regularization or prior model. Early attempts exploited sampling theory (Li and Orchard 2001) or natural image statistics (Sun et al. 2008). Later methods focused on learning mapping functions between the low-resolution and high-resolution images through sparse coding (Yang et al. 2010), locally linear regression (Timofte et al. 2013), self-exemplars (Huang et al. 2015), etc. Since 2014, DNN-based methods have come to dominate this field as well (Dong et al. 2014). An efficient method of constructing a DNN-based mapping is to first extract features from the low-resolution input and then upscale them with sub-pixel convolution (Shi et al. 2016; Lim et al. 2017). Here, we followed this method in constructing a DNN-based function f_φ, shown in Fig. 4. The loss is specified by

ℓ(φ) = D(f_φ(y), x).  (5)

Data compression involves finding a more compact data representation from which the original image can be reconstructed. Compression can be either lossless or lossy. Here we followed a prevailing scheme in lossy image compression, transform coding, which consists of transformation, quantization, and entropy coding. Traditional image compression methods (e.g., the most widely used standard, JPEG) use a fixed linear transform for all bit rates. More recently, many researchers have demonstrated the visual benefits of nonlinear transforms, especially DNN-based learnable ones that are capable of adapting their parameters to different bitrate budgets.

Fig. 6 Test images (from the validation set of DIV2K) used in the subjective experiment

In this paper, we constructed two DNNs for analysis and synthesis transforms, respectively, as shown in Fig. 5. The analysis transform f_a maps the image to a latent feature vector z, whose values are then quantized to L levels with the centers being {c_1, ..., c_L}, where c_i ∈ R for i = 1, ..., L. This quantized representation ẑ = Q(f_a(x)) is fed to the synthesis transform f_s to reconstruct the compressed image: y = f_s(ẑ). The quantizer has zero gradients almost everywhere (and infinite gradients at the transitions), which prevents training via gradient descent. Hence, we used a soft differentiable approximation (Mentzer et al. 2018) to backpropagate gradients during training, in which hard assignment of each coefficient to its nearest quantization center is replaced by a softmax-weighted average of the centers; the scale parameter s controls the degree to which Q(·) approximates quantization.

In lossy image compression, the objective function is a weighted sum of two terms that quantify the coding cost and the reconstruction error, respectively:

ℓ = λ H(ẑ) + D(f_s(ẑ), x).  (7)

The first term is typically the entropy of the discrete codes ẑ, which provides a lower bound on the bitrate for transmitting the quantized coefficients (Shannon 1948). Rather than including this term in the objective, we constrained the bitrate to a fixed budget by adjusting the architecture of f_a (i.e., the dimension of ẑ) and the number of quantization levels L in Q.
This elimination of the entropy from the objective also means that we did not need to continually re-estimate the probability mass function P(ẑ), which varies with changes in the network parameters. The optimization objective in Eq. (7) is reduced to

ℓ(φ, ψ) = E_x [ D(f_s(Q(f_a(x))), x) ],  (8)

where φ and ψ are the parameters of f_a and f_s, respectively. The expectation is approximated by averaging over mini-batches of training images.

In this section, we describe in detail the training of our DNN-based computational models for the four low-level vision tasks, and the subjective testing procedure used to collect human ratings of the optimized images.

For denoising, we fixed the noise std to σ = 50 (relative to pixel values in the range [0, 255]). For deblurring, we simulated various kernels with different motion patterns and blur levels, as in Kupyn et al. (2018). For super-resolution, we generated low-resolution images by downsampling high-resolution images by a factor of β = 4 using bicubic interpolation. For compression, we set the number of quantization levels to L = 2 with centers {−1, 1}, the quantization scale parameter to s = 1, the number of downsampling stages to n = 4, and the number of output channels of f_a to m = 64. This leads to a maximum of

H(ẑ)/(W × H) ≤ ((W/2^4) × (H/2^4) × 64 × log_2(2))/(W × H) = 0.25

bits per pixel (bpp). We chose the 4744 high-quality images in the Waterloo Exploration Database (Ma et al. 2017b) as reference images.

Training was performed in two stages. In the first stage, we pre-trained a network using MAE as the loss function for all four tasks. In the second stage, we fine-tuned the network parameters by optimizing the desired IQA model. Pre-training brings several advantages. First, some IQA models are sensitive to initialization (e.g., CW-SSIM, MAD, FSIM, GMSD, and VSI), and pre-training yields more reasonable optimization results (also validated in the task of reference image recovery). Second, models that require backpropagating gradients through multiple stages of computation (e.g., LPIPS and DISTS) converge much faster. Third, it helps us to test whether the recently proposed IQA models lead to consistent perceptual gains on top of MAE, a special case of the simple ℓp-norm distance.

Fig. 10 Denoising results on two regions cropped from an example image, using a DNN optimized for different IQA models

For each training stage of the four tasks, we used the Adam optimizer (Kingma and Ba 2015) with a mini-batch size of 16 and an initial learning rate of 10^−4, which decays by a factor of 2 every 100K iterations; we set the maximum number of iterations to 500K. We randomly extracted patches of size 192 × 192 × 3 during training, and tested on 20 independent images selected from the DIV2K validation set (see Fig. 6). Training took roughly 1000 GPU hours (measured using an NVIDIA GTX 2080 device) for a total of 4 × 11 = 44 models. Special treatment (i.e., gradient clipping and a smaller learning rate) was given to FSIM and VSI, without which their losses failed to converge in our trials. Generally, it can be difficult to stabilize the training of DNNs to convergence, especially given that the gradients of different IQA models exhibit idiosyncratic behaviors.
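A minimal sketch of this fine-tuning stage is given below, with the batch size, learning-rate schedule, and iteration budget taken from the description above. The gradient-clipping threshold is an assumption, and clipping would only be enabled for models such as FSIM and VSI; for denoising, the network output would be a noise estimate, so the restored image is y − net(y) rather than net(y).

```python
import torch

def finetune(net, iqa_loss, loader, max_iters=500_000, device="cuda"):
    """Fine-tune a restoration network using an IQA model as the loss.

    iqa_loss: callable D(output, target), lower = better predicted quality.
    loader:   yields (corrupted, reference) patch pairs of size 192x192x3.
    """
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    # Halve the learning rate every 100K iterations.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100_000, gamma=0.5)
    it = 0
    while it < max_iters:
        for y, x in loader:
            y, x = y.to(device), x.to(device)
            opt.zero_grad()
            loss = iqa_loss(net(y), x)
            loss.backward()
            # Clipping (assumed max_norm) stabilizes models with
            # idiosyncratic gradients, e.g., FSIM and VSI.
            torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
            opt.step()
            sched.step()
            it += 1
            if it >= max_iters:
                break
    return net
```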
Fortunately, a simple criterion exists to test the validity of the optimization results: for a given low-level vision task, the DNN optimized for the IQA measure D_i should produce the best result (averaged over an independent set of images) in terms of D_i itself, when compared to DNNs optimized for {D_j}_{j≠i}. Figure 7 shows the ranking of results generated by networks optimized for each of the 11 IQA models (corresponding to one column in one subfigure) on the DIV2K validation set (Timofte et al. 2017), where 1 and 11 indicate the best and worst rankings, respectively. By inspecting the diagonal elements of the four matrices, we conclude that 43 out of 44 models satisfy the criterion, confirming the validity of our training procedure. The only exception is when MAE is the optimization goal and NLPD is the evaluation measure for the deblurring task. Nevertheless, MAE ranks its own results in second place. As shown in Sect. 6.2, the resulting images from MAE and NLPD look visually similar.

We conducted an experiment to acquire human perceptual comparisons of the IQA-optimized results. A two-alternative forced choice (2AFC) method was employed, allowing differentiation of fine-grained quality variations. On each trial, subjects were shown two images optimized according to two different IQA methods, presented on the left and right side of the corresponding reference image (see Fig. 8). Subjects were asked to choose which of the two images had better quality. Subjects were allowed unlimited viewing time, and were free to adjust their viewing distance. A customized graphical user interface (GUI) was used to display the images at resolution matched to the screen (i.e., 512 × 512 pixels), and subjects were able to zoom in to any portion of the images for more careful comparison. The screen had a resolution of 1920 × 1080 pixels, and was calibrated in accordance with the recommendations of ITU-R BT.500-11 (ITU-R 2002). Tests were performed in indoor spaces with ordinary illumination levels.

Fig. 11 Deblurring results for two regions cropped from an example image, using a DNN optimized for different IQA models

We generated a total of 55 × 4 × 20 = 4400 paired comparisons (55 model pairs drawn from the 11 IQA models, 4 tasks, and 20 test images). We gathered data from 25 subjects (13 males and 12 females) aged between 18 and 22, with normal or corrected-to-normal visual acuity. Subjects had general background knowledge of image processing and computer vision, but were otherwise naïve to the purpose of this study. To reduce fatigue, we performed the experiment in multiple sessions, each consisting of 500 randomly selected comparisons with randomized left-right presentation, and allowed subjects to take a break at any time during the session. Subjects were encouraged, but not required, to participate in multiple sessions. In order to detect subjects that were not properly performing the task, we included 5 pairs in which one image was of unambiguously better quality (e.g., the original and a noisy image). Our intention was to discard the results of subjects who failed on more than one of these pairs, but the results of all subjects turned out to be valid. In total, each image pair was evaluated by at least 5 subjects, and each IQA model was ranked over 1000 times for each vision task.

Based on the subjective data, we conducted a quantitative comparison of the IQA models through the lens of perceptual optimization. We also qualitatively compared the visual results associated with the IQA models.
Last, we combined a top-performing IQA model with adversarial loss (Goodfellow et al. 2014) to test whether additional perceptual gains could be obtained in blind image deblurring.

We employed the Bradley-Terry model (Bradley and Terry 1952) to convert paired comparison results to global rankings. This probabilistic model assumes that the visual quality of the k-th test image optimized for the i-th IQA model, q_i^k, follows a Gumbel distribution with location μ_i^k and scale s. Assuming independence between q_i^k and q_j^k, the difference q_i^k − q_j^k is a logistic random variable, and therefore p_ij^k = P(q_i^k ≥ q_j^k) can be computed using the logistic cumulative distribution function:

p_ij^k = 1 / (1 + exp(−(μ_i^k − μ_j^k)/s)),  (10)

where s is usually set to 1, leading to a simplified expression:

p_ij^k = exp(μ_i^k) / (exp(μ_i^k) + exp(μ_j^k)).  (11)

As such, we may obtain the negative log-likelihood of our pairwise count matrix W^k:

ℓ(μ^k; W^k) = −Σ_{i,j} w_ij^k log ( exp(μ_i^k) / (exp(μ_i^k) + exp(μ_j^k)) ),  (12)

where w_ij^k represents the number of times that D_i is preferred over D_j for the k-th test image. For each of the four low-level vision tasks, we minimized Eq. (12) iteratively using gradient descent to obtain the optimal estimate μ̂^k. We averaged μ̂^k over the 20 test images, resulting in four global rankings of perceptual optimization performance, as shown in Fig. 9.

It is clear that MS-SSIM (Wang et al. 2003) and MAE are superior to the other IQA models in the task of denoising, whereas the DNN-based measures DISTS (Ding et al. 2020) and LPIPS outperform the others in all other tasks. Thus, there is no single IQA model that performs best across all tasks. We ascribe this to differences in the nature of the tasks: denoising requires distinguishing signal and noise, whereas deblurring, super-resolution, and compression all require recovery of discarded information from partial deterministic measurements (for the first two, via linear projection, and for compression, via quantization). MS-SSIM and MAE are both known to prefer smooth appearances, and are seen to excel at denoising. Both DISTS and LPIPS explicitly represent aspects of fine textures, and are superior for the remaining three tasks. Finally, it is important to note that many of the models, despite their impressive abilities to explain existing IQA databases, are outperformed by MAE, the simplest metric in our set.

To determine whether the optimization results of the IQA models are statistically significant, we conducted an independent paired-sample t-test. The null hypothesis is that the ranking scores {μ_i^k}_{k=1}^{20} for D_i and {μ_j^k}_{k=1}^{20} for D_j come from the same normal distribution with unknown variance. When the test cannot reject the null hypothesis at the α = 5% significance level, the two IQA models have statistically indistinguishable performance, and we considered them to belong to the same group. Grouping results are shown in Fig. 9. Surprisingly, we find that the perceptual gains of MS-SSIM over MAE are statistically insignificant on all four tasks, despite the fact that MS-SSIM is far better than MAE at explaining existing IQA databases. Relying on similar sets of VGG features (Simonyan and Zisserman 2015), DISTS and LPIPS also achieve similar performance, except for the super-resolution task, where the former is statistically better.

Fig. 13 Compression results for two cropped regions from an example image, using a DNN optimized for different IQA models
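The maximum-likelihood fitting of Eq. (12) can be sketched as follows, using the simplification s = 1. The variable names and optimizer settings are illustrative assumptions; the scores are centered at the end because they are identifiable only up to an additive constant.

```python
import torch

def fit_bradley_terry(W, iters=5000, lr=0.1):
    """Fit Bradley-Terry location scores mu from a pairwise count matrix W.

    W[i, j] = number of times model i was preferred over model j.
    Returns mu with zero mean (scores are unique only up to a shift).
    """
    W = torch.as_tensor(W, dtype=torch.float32)
    n = W.shape[0]
    mu = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        # p[i, j] = P(model i preferred over j) under the logistic model
        p = torch.sigmoid(mu.unsqueeze(1) - mu.unsqueeze(0))
        nll = -(W * torch.log(p + 1e-12)).sum()  # negative log-likelihood
        nll.backward()
        opt.step()
    return (mu - mu.mean()).detach()
```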
By computing the Spearman's rank correlation coefficient (SRCC) between objective model rankings (Fig. 7) and subjective human rankings (Fig. 9), we are able to compare the algorithm-level performance of the 11 IQA models on the new dataset. We find from Table 1 that there is a lack of correlation between model predictions and human judgments for the majority of IQA methods. DISTS and LPIPS tend to rank the images with complex model-dependent distortions in a more perceptually consistent way. We refer interested readers to "Appendix 1" for more comparisons on several IQA databases dedicated to low-level vision problems.

In this subsection, we show example images produced by each IQA-optimized method, qualitatively summarize the types of visual distortion, and use them to diagnose the shortcomings of the corresponding IQA models.

Figure 10 shows denoising results for the "cat" image. We observe that MAE, MS-SSIM, and NLPD do a good job of denoising flat regions, but tend to over-smooth texture regions. VIF encourages detail enhancement, leading to artificial local contrast, while GMSD produces a relatively dark appearance, presumably because it discards local luminance information. Moreover, the results of FSIM and VSI exhibit noticeable artifacts. LPIPS and DISTS preserve fine details, but may not fully remove noise in smooth regions, mistaking the remaining noise for visually plausible texture. Overall, the traditional IQA models MAE and MS-SSIM denoise images with various content variations robustly, keeping high-frequency information loss within an acceptable range. This may explain why they are the dominant objective functions for this task.

Figure 11 shows deblurring results for the "basket" image. We see that most of the IQA methods fail, but in different ways. Specifically, the results of MAE, MS-SSIM, CW-SSIM, and NLPD are quite blurred. FSIM, GMSD, and VSI generate severe ringing artifacts. VIF again fails to adjust the local contrast. MAD exhibits undesirable white dot artifacts, although the main structures are sharp. LPIPS succeeds in deblurring this example, while DISTS produces a result that is closest to the original. This is consistent with current state-of-the-art deblurring results (Kupyn et al. 2019), generated by incorporating comparison of the VGG features into the loss.

Figure 12 shows super-resolution results for the "corner tower" image. Again, MAE, MS-SSIM, NLPD, and especially CW-SSIM produce somewhat blurred images, without recovering fine details. MAD, FSIM, GMSD, and VSI are able to generate some "structures", but these are perceived as unpleasant model-dependent artifacts. Benefiting from its texture synthesis capability, DISTS has the potential to super-resolve perceptually plausible fine details, although they differ from those of the original image.

Figure 13 shows compression results for the "airplane" image at 0.24 ± 0.01 bpp. A JPEG image, compressed to 0.25 bpp, suffers from block and blur artifacts. Overall, the main structures of the original image are well preserved for most IQA models, but the fine details (e.g., the grass) have to be discarded at this low bitrate, or are synthesized with other forms of distortion. VIF reconstitutes a desaturated image with over-enhanced global contrast, and CW-SSIM superimposes periodic artifacts on the underlying image. White dots and ringing artifacts are again apparent in the results of MAD and VSI, respectively. The image produced by NLPD is blurred and red-shifted. Both LPIPS and DISTS succeed in synthesizing textures that are visually similar to those of the original.
We can summarize the artifacts created during perceptual optimization, some of which are not found in traditional image databases for the purpose of quality assessment:

-Blurring is a frequently seen distortion type in all four of the tasks, and is mainly caused by error visibility methods (e.g., MAE and NLPD) and structural similarity methods (e.g., MS-SSIM), which rely on simple injective mappings. Specifically, MAE and MS-SSIM work directly with pixels, and NLPD transforms the input image to a multi-scale overcomplete representation using a single stage of local mean subtraction and divisive normalization. Under the strict constraints imposed by the tasks, they prefer to make a more conservative estimate, producing something akin to an average of all possible outcomes with sharp structures, as would occur when optimizing MSE.

-Ringing is a high-frequency distortion type that often occurs in the images optimized for FSIM, VSI, and GMSD (see Fig. 11 (i)-(k)). One common characteristic of the three models is that they rely heavily (in some cases, solely) on local gradient magnitude for feature similarity comparison, underweighting (or abandoning) other perceptually important features (such as local luminance and local phase). This creates "shortcuts" that the DNNs can exploit, generating distortions with similar local gradient statistics.

-White dot artifacts appear in the images optimized for MAD, whose appearance-based strategy summarizes supra-threshold distortions using local statistics computed from responses of Gabor filters at multiple scales and orientations. The resulting set of statistical measurements seems insufficient to summarize natural image structures that exhibit higher-order dependencies. Therefore, MAD is "blind" to distortions that satisfy the same set of statistical constraints, and gives the optimized distorted image a high-quality score.

-Over-enhancement of local image contrast is encouraged by VIF, which, in most of our experiments, causes significant quality degradation. We believe this arises because VIF does not fully respect reference information when normalizing the covariance term. Specifically, only the second-order statistics of the reference image are used to construct the normalization factor. By incorporating the same statistics computed from the distorted image into the normalization, the problem of over-enhancement may be alleviated. In general, quality assessment of image enhancement is a challenging problem (Fang et al. 2015; Wang et al. 2015), and to the best of our knowledge, all existing full-reference IQA models fail to reward properly-enhanced cases, while penalizing over-enhanced cases.

-Luminance and color artifacts are perceived in final images associated with many IQA models. Two causes seem plausible. First, methods such as GMSD discard luminance information. Second, methods such as MS-SSIM and NLPD were originally designed for grayscale images only. Applying them to RGB channels separately fails to take into account hue and saturation information. Transforming to a perceptually better color space, and making use of knowledge of color distortions (Rajashekar et al. 2009), offers an opportunity for improvement.

In the field of image restoration and generation, many state-of-the-art algorithms are based on adversarial training (Goodfellow et al. 2014), demonstrating impressive capabilities in synthesizing realistic visual content.
The output of the adversarial loss is the probability of an image being computer-generated, but this does not confer capabilities for no-reference IQA modeling, as confirmed by a low SRCC of 0.366 on the LIVE dataset. Nevertheless, adversarial loss may be useful at the algorithm level, meaning that, given a set of images generated by a computational method, the average probability quantitatively measures the capability of the method in generating photorealistic high-quality images. In this subsection, we explored the combination of the adversarial loss and a top-performing IQA measure for additional perceptual gains. We chose the task of blind image deblurring, and fine-tuned a state-of-the-art model, DeblurGAN-v2 (under the Inception-ResNet configuration) (Kupyn et al. 2019). The original loss function for the generator is

ℓ = 0.5 ℓ_MSE + 0.006 ℓ_VGG + 0.01 ℓ_adv.

The first and second terms are the MSE on pixels and on responses of conv3_3 of VGG19 (Simonyan and Zisserman 2015), respectively, and ℓ_adv is a variant of the adversarial loss (Kupyn et al. 2019). We selected the best-performing IQA model, DISTS, for this experiment. We followed the same training strategy, but modified the loss function of the generator to be

ℓ = ℓ_DISTS + 0.01 ℓ_adv,

where ℓ_DISTS denotes the DISTS index computed between the restored and reference images. An immediate advantage of this replacement is that the number of hyperparameters is reduced, making manual hyperparameter adjustment easier. After fine-tuning, the average DISTS value decreases from 0.22 to 0.18 on the Köhler test dataset (Köhler et al. 2012). Figure 14 shows two visual examples, from which we find that the fine-tuned results have sharper edges and enhanced contrast, indicating that perceptual gains may be obtained by DISTS on the two examples.

We have conducted a comprehensive study of perceptual optimization of four low-level vision tasks, guided by eleven full-reference IQA models. This provides an alternative means of testing the perceptual relevance of IQA models in a practical setting, which we believe is an important complement to the conventional methodology for IQA model evaluation. Our main findings are as follows. First, through perceptual optimization, we generated a number of distortions (different from those used in existing IQA databases), which may easily fool the respective models or models of similar design philosophies (see Table 1). It should be noted that the emergence of specific distortions is in principle dependent on the experimental choices (e.g., initialization strategy, model architecture, and optimization technique). Second, although they underperformed the DNN-based models on three of the four applications, the standard full-reference IQA models (MS-SSIM and MAE) are still valuable tools for optimizing image processing systems, due to their robustness and simplicity. Third, more recent IQA models with surjective mappings may still be used to monitor image quality and to optimize the parameter settings of image processing methods, but in a limited and well-controlled space. Last, the two DNN-based models (LPIPS and DISTS) offered the best overall performance in our experiments, but their high computational complexity and lack of interpretability may hinder their use.

Our work has interesting connections to two separate lines of research. First, inspired by the philosophy of "analysis by synthesis" (Grenander 1970), Wang and Simoncelli (2008) introduced the maximum differentiation (MAD) competition methodology to automatically synthesize images for efficiently comparing IQA models.
Given two IQA models, MAD generates samples in the space of all possible images that best discriminate the two models. However, the synthesized images may be highly unnatural, and in this case, of limited practical importance. Ma et al. (2020) alleviated this issue by manually constraining the search space to a finite image set of practical interest. Our approach combines the best aspects of these two methods, in the sense that the test images for model comparison are automatically generated by the trained networks, but arise as solutions to real-world vision tasks, and are thus of practical importance. Second, the existence of type II adversarial examples (Szegedy et al. 2013) has exposed the vulnerability of many computer vision algorithms, where a tiny change to the input that is imperceptible to the human eye causes the algorithm to make classification mistakes. In our case, weaknesses in an IQA model are exposed through optimized images that may be interpreted as type I "adversarial" examples of the model: a significant change is made to the original image that substantially degrades its perceptual quality, but the model still claims that this image is of high quality.

The analysis of our experimental results suggests several desirable properties that should be included in future IQA methods. First, the transformation used in the IQA model should be perceptual, mapping the input images into a space where a simple distance measure (e.g., Euclidean) matches human judgements of image quality. This is in the same spirit in which color scientists pursue perceptually uniform color spaces, and is an underlying principle of a number of existing models (e.g., NLPD). Zhang et al. (2018) and Ding et al. (2020) demonstrated that a cascade of linear convolution, downsampling, and rectified nonlinearity optimized for high-level vision tasks may be a good candidate. Second, the IQA model should enjoy unique optima (i.e., the underlying mapping should be injective) to guarantee that images close to optimal are visually similar to the original. This criterion was respected by early models (e.g., MS-SSIM), but has been largely overlooked in recent IQA model development. Third, the IQA model should be continuous and differentiable, with well-behaved gradients, to aid optimization in complex situations (e.g., training DNNs with millions of parameters). Last but not least, the IQA model should be computationally efficient, enabling real-time quality assessment and perceptual optimization. To the best of our knowledge, although many current IQA models possess subsets of these properties, no current IQA model satisfies them all.

A conventional method for evaluating IQA models is to compute their agreement with subjective scores on one or more standardized IQA databases [e.g., LIVE, CSIQ (Larson and Chandler 2010), or TID2013 (Ponomarenko et al. 2015)], consisting of artificially distorted images. Many existing IQA models achieve impressive correlation with these databases (see Table 2), but their performance in assessing the perceptual quality of images produced by low-level vision algorithms has not been tested. In this appendix, we tested them on multiple human-rated image generation/restoration databases, including a denoising database-FLT (Egiazarian et al. 2018), two motion deblurring databases-Liu13 (Liu et al. 2013) and Lai16 (Lai et al. 2016), two super-resolution databases-Ma17 (Ma et al. 2017a) and QADS, a dehazing database-SHRQ (Min et al. 2019),
a depth image-based rendering database-Tian19 (Tian et al. 2018), two texture synthesis databases-SynTex (Golestaneh et al. 2015) and TQD (Ding et al. 2020), and a patch similarity database-BAPPS. The details of these databases are summarized in Table 3.

Tables 4 and 5 show the performance comparisons of 13 IQA methods in terms of the SRCC and 2AFC scores. As suggested in Zhang et al. (2018), the 2AFC score is computed as pq + (1 − p)(1 − q), where p is the percentage of human votes and q ∈ {0, 1} is the vote of an IQA model. When q agrees with the majority of human votes, the 2AFC score is larger, indicating better performance. We find that the overall performance of all models is lower compared to that on the standard IQA databases (see Table 2), indicating the difficulty of generalizing to unseen distortions. Moreover, DNN-based measures are relatively better than knowledge-driven models on these application-oriented databases, but there is still significant room for improvement.

Figure 15 shows a quality assessment example for real-world super-resolution methods. Here we only compared the most widely used measures (PSNR and SSIM), and the two that performed best both in optimization and assessment (LPIPS and DISTS). It is not surprising that PSNR and SSIM have poor correlation with human opinions, as they focus more on signal fidelity than perceptual quality (Blau and Michaeli 2018). LPIPS and DISTS perform better, but the former is somewhat oversensitive to texture substitution. As many recent image restoration algorithms succeed in generating richer textures, DISTS holds much promise for use in quality assessment for such applications (Figs. 16, 17, 18 and 19).

References

Generative adversarial networks for extreme learned image compression
End-to-end optimization of nonlinear transform codes for perceptual quality
End-to-end optimized image compression
Variational image compression with a scale hyperprior
The perception-distortion tradeoff
Deep neural networks for no-reference and full-reference image quality assessment
Rank analysis of incomplete block designs: I. The method of paired comparisons
The Laplacian pyramid as a compact image code
Sparse feature fidelity for perceptual image quality assessment
SSIM-optimal linear image restoration
Image denoising by sparse 3-D transform-domain collaborative filtering
Visible differences predictor: An algorithm for the assessment of image fidelity. Human Vision, Visual Processing
Image quality assessment: Unifying structure and texture similarity
Learning a deep convolutional network for image super-resolution
Adapting to unknown smoothness via wavelet shrinkage
Statistical evaluation of visual quality metrics for image denoising
Image denoising via sparse and redundant representations over learned dictionaries
No-reference quality assessment of contrast-distorted images based on natural scene statistics
Removing camera shake from a single photograph
What's wrong with mean-squared error
Super-resolution from a single image
The effect of texture granularity on texture synthesis quality. Applications of Digital Image Processing XXXVIII
Generative adversarial nets
A unified approach to pattern analysis
A discriminative approach for wavelet denoising
Single image super-resolution from transformed self-exemplars
Methodology for the subjective assessment of the quality of television pictures. Geneva: International Telecommunication Union
Perceptual losses for real-time style transfer and super-resolution
Sur la série de Fourier
Adam: A method for stochastic optimization
Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database
Image features from phase congruency
DeblurGAN: Blind motion deblurring using conditional adversarial networks
DeblurGAN-v2: Deblurring (orders-of-magnitude) faster and better
A comparative study for single image blind deblurring
Perceptual image quality assessment using a normalized Laplacian pyramid
Perceptually optimized image rendering
Most apparent distortion: Full-reference image quality assessment and the role of strategy
Photo-realistic single image super-resolution using a generative adversarial network
New edge-directed interpolation
Enhanced deep residual networks for single image super-resolution
Image quality assessment based on gradient similarity
Image quality assessment using multi-method fusion
A no-reference metric for evaluating the quality of motion deblurring
The use of psychophysical data and models in the analysis of display system performance
An iterative technique for the rectification of observed distributions
Learning a no-reference quality metric for single-image super-resolution
Geometric transformation invariant image quality assessment using convolutional neural networks
Group maximum differentiation competition: Model comparison with few samples
Waterloo exploration database: New challenges for image quality assessment models
Blind image quality assessment by learning from multiple annotators
High dynamic range image compression by optimizing tone mapped image quality index
Conditional probability models for deep image compression
Quality evaluation of image dehazing methods using synthetic hazy images
Blind image deblurring using dark channel prior
Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication
Image denoising using scale mixtures of Gaussians in the wavelet domain
PieAPP: Perceptual image-error assessment through pairwise preference
Quantifying color image distortions based on adaptive spatio-chromatic signal decompositions
Optimal denoising in redundant representations
Bayesian-based iterative method of image restoration
A perceptually tuned subband image coder with image dependent quantization and postquantization data compression
A mathematical theory of communication
Image information and visual quality
An information fidelity criterion for image quality assessment using natural scene statistics
Image and video quality assessment research at LIVE
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
Noise removal via Bayesian wavelet coring
Very deep convolutional networks for large-scale image recognition
Learning to generate images with perceptual similarity metrics
Image super-resolution using gradient profile prior
Intriguing properties of neural networks
Scale-recurrent network for deep image deblurring
Perceptual image distortion. Human Vision, Visual Processing, and Digital Display V, 2179
A benchmark of DIBR synthesized view quality assessment metrics on a new database for immersive media applications
Anchored neighborhood regression for fast example-based super-resolution
NTIRE 2017 challenge on single image super-resolution: Methods and results
Bilateral filtering for gray and color images
Deep image prior
Trade-offs in bit-rate allocation for wireless video streaming
A patch-structure representation method for quality assessment of contrast changed images
SSIM-motivated rate-distortion optimization for video coding
Multiscale contrast similarity deviation: An effective and efficient index for perceptual image quality assessment
ESRGAN: Enhanced super-resolution generative adversarial networks
Reduced- and no-reference image quality assessment: The natural scene statistic model approach
Image quality assessment: From error visibility to structural similarity
Translation insensitive image similarity in complex wavelet domain
Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities
Multiscale structural similarity for image quality assessment
DCTune: A technique for visual optimization of DCT quantization matrices for individual images. Society for Information Display Digest of Technical Papers
Visibility of wavelet quantization noise
Extrapolation, interpolation and smoothing of stationary time series: with engineering applications
Perceptual fidelity aware mean squared error
Gradient magnitude similarity deviation: A highly efficient perceptual image quality index
Fast direct super-resolution by simple functions
Image super-resolution via sparse representation
Beyond human opinion scores: Blind image quality assessment based on synthetic scores
Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising
VSI: A visual saliency-induced index for perceptual image quality assessment
FSIM: A feature similarity index for image quality assessment
The unreasonable effectiveness of deep features as a perceptual metric
RankSRGAN: Generative adversarial networks with ranker for image super-resolution
Learning to blindly assess image quality in the laboratory and wild
Edge strength similarity for image quality assessment
Loss functions for image restoration with neural networks
Visual quality assessment for super-resolved images: Database and method

The authors would like to thank all subjects who participated in our subjective study during this period of the coronavirus pandemic. This work was supported in part by the National Natural Science Foundation of China (62071407 to KDM and 62022002 to SQW), the CityU SRG-Fd and APRC Grants (7005560 and 9610487 to KDM), the Hong Kong RGC Early Career Scheme (9048122 to SQW), and the Howard Hughes Medical Institute (investigatorship to EPS).