key: cord-0058283-niloat70
authors: Hukkelås, Håkon; Lindseth, Frank; Mester, Rudolf
title: Image Inpainting with Learnable Feature Imputation
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_28
sha: 46ec1dc35990eb821fd96798ece4ec6e0de52e14
doc_id: 58283
cord_uid: niloat70

A regular convolution layer applying a filter in the same way over known and unknown areas causes visual artifacts in the inpainted image. Several studies address this issue with feature re-normalization on the output of the convolution. However, these models use a significant number of learnable parameters for feature re-normalization [41, 48], or assume a binary representation of the certainty of an output [11, 26]. We propose (layer-wise) feature imputation of the missing input values to a convolution. In contrast to learned feature re-normalization [41, 48], our method is efficient and introduces a minimal number of parameters. Furthermore, we propose a revised gradient penalty for image inpainting, and a novel GAN architecture trained exclusively on adversarial loss. Our quantitative evaluation on the FDF dataset reflects that our revised gradient penalty and alternative convolution improve generated image quality significantly. We present comparisons on CelebA-HQ and Places2 to current state-of-the-art to validate our model. (Code is available at: github.com/hukkelas/DeepPrivacy. Supplementary material can be downloaded from: folk.ntnu.no/haakohu/GCPR_supplementary.pdf)

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-71278-5_28) contains supplementary material, which is available to authorized users.

Image inpainting is the task of filling in missing areas of an image. Use cases for image inpainting are diverse, such as restoring damaged images, removing unwanted objects, or replacing information to preserve the privacy of individuals. Prior to deep learning, image inpainting techniques were generally either exemplar-based, searching for and replacing with similar patches [4, 8, 23, 29, 38, 43], or diffusion-based, smoothly propagating information from the boundary of the missing area [3, 5, 6].

Convolutional Neural Networks (CNNs) for image inpainting have led to significant progress in the last couple of years [1, 24, 42]. In spite of this, a standard convolution does not consider whether an input pixel is missing or not, making it ill-fitted for the task of image inpainting. Partial Convolution (PConv) [26] is a modified convolution that zeroes out invalid (missing) input pixels and re-normalizes the output feature map depending on the number of valid pixels in the receptive field. This is followed by a hand-crafted certainty propagation step, where an output is assumed valid if one or more features in the receptive field are valid. Several proposed improvements replace the hand-crafted components in PConv with fully learned components [41, 48]. However, these solutions use ∼50% of the network parameters to propagate the certainties through the network.

We propose Imputed Convolution (IConv): instead of re-normalizing the output feature map of a convolution, we replace uncertain input values with an estimate from spatially close features (see Fig. 2). IConv assumes that a single spatial location (with multiple features) is associated with a single certainty.
In contrast, previous solutions [41, 48] require a certainty for each feature in a spatial location, which allocates half of the network parameters to certainty representation and propagation. Our simple assumption keeps certainty representation and propagation minimal: in total, replacing all convolution layers with IConv increases the number of parameters by only 1-2%.

We use the DeepPrivacy [15] face inpainter as our baseline and suggest several improvements to stabilize the adversarial training: (1) We propose an improved version of gradient penalties to optimize Wasserstein GANs [2], based on the simple observation that standard gradient penalties cause training instability for image inpainting. (2) We combine the U-Net [35] generator with Multi-Scale-Gradient GAN (MSG-GAN) [19] to enable the discriminator to attend to multiple resolutions simultaneously, ensuring global and local consistency. (3) Finally, we replace the inefficient representation of the pose information for the FDF dataset [15]. In contrast to the current state-of-the-art, our model requires no post-processing of generated images [16, 25], no refinement network [47, 48], and no additional loss term to stabilize the adversarial training [41, 48]. To our knowledge, our model is the first to be trained exclusively on an adversarial loss for image inpainting.

Our main contributions are the following:
1. We propose IConv, which utilizes a learnable feature estimator to impute uncertain input values to a convolution. This enables our model to generate visually pleasing images for free-form image inpainting.
2. We revisit the standard gradient penalty used to constrain Wasserstein GANs for image inpainting. Our simple modification significantly improves training stability and generated image quality at no additional computational cost.
3. We propose an improved U-Net architecture, enabling the adversarial training to attend to local and global consistency simultaneously.

In this section, we discuss related work for generative adversarial networks (GANs), GAN-based image inpainting, and the recent progress in free-form image inpainting.

Generative Adversarial Networks. The Generative Adversarial Network [9] is a successful unsupervised training technique for image-based generative models. Since its conception, a range of techniques has improved the convergence of GANs. Karras et al. [21] propose a progressive growing training technique that iteratively increases the network complexity to stabilize training. Karnewar et al. [19] replace progressive growing with the Multi-Scale Gradient GAN (MSG-GAN), where they use skip connections between the matching resolutions of the generator and discriminator. Furthermore, Karras et al. [20] propose a modification of MSG-GAN in combination with residual connections [12]. Similar to [20], we replace progressive growing in the baseline model [15] with a modification of MSG-GAN for image inpainting.

GAN-Based Image Inpainting. GANs have seen wide adoption for the image inpainting task, due to their ability to generate semantically coherent results for missing regions. Several studies propose methods to ensure global and local consistency: using several discriminators that focus on different scales [16, 25], dedicated modules that connect spatially distant features [39, 44, 45, 47], patch-based discriminators [48, 49], multi-column generators [40], or progressive inpainting of the missing area [11, 50].
In contrast to these methods, we ensure consistency over multiple resolutions by connecting different resolutions of the generator with the discriminator. Zheng et al. [52] propose a probabilistic framework to address the issue of mode collapse for image inpainting, and they generate several plausible results for a missing area. Several methods propose combining the input image with auxiliary information, such as user sketches [17], edges [31], or exemplar images [7]. Hukkelås et al. [15] propose a U-Net based generator conditioned on the pose of the face.

GANs are notoriously difficult to optimize reliably [36]. For image inpainting, the adversarial loss is therefore often combined with other objectives to improve training stability, such as pixel-wise reconstruction [7, 16, 25, 33], perceptual loss [39, 51], semantic loss [25], or style loss [41]. In contrast to these methods, we optimize exclusively on the adversarial loss. Furthermore, several studies [17, 40, 41, 47] propose to use Wasserstein GAN [2] with gradient penalties [10]; however, the standard gradient penalty causes training instability for image-inpainting models, as we discuss in Sect. 3.2.

Fig. 2. Illustration of (a) partial convolution (PConv) [26], (b) gated convolution [48], and (c) our proposed solution. ⊙ is element-wise product and ⊕ is addition. Note that C^L is binary for partial convolution.

Free-Form Image Inpainting. Image inpainting with irregular masks (often referred to as free-form masks) has recently attracted more attention. Liu et al. [26] propose Partial Convolutions (PConv) to handle irregular masks, where they zero out input values to a convolution and then perform feature re-normalization based on the number of valid pixels in the receptive field. Gated Convolution [48] modifies PConv by removing the binary-representation constraint, combining the mask and feature representation within a single feature map. Xie et al. [41] propose a simple modification to PConv, where they reformulate it as "attention" propagation instead of certainty propagation. Both of these PConv adaptations [41, 48] double the number of parameters in the network when replacing regular convolutions.

In this section, we describe a) our modifications to a regular convolution layer, b) our revised gradient penalty suited for image inpainting, and c) our improved U-Net architecture.

Consider the case of a regular convolution applied to a given feature map I ∈ R^N:

    f(I)_x = (I * W^F)_x,    (1)

where * is the convolution and W^F ∈ R^D is the filter. To simplify notation, we consider a single filter applied to a single one-dimensional feature map; the generalization to a regular multidimensional convolution layer is straightforward. A convolution applies this filter to all spatial locations of our feature map, which works well for general image recognition tasks. For image inpainting, however, there exists a set of known and unknown pixels; a regular convolution applied to all spatial locations is therefore largely undefined ("unknown" is not the same as 0 or any other fixed value), and naive approaches cause noticeable visual artifacts [26].
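To illustrate the problem, the following small PyTorch snippet (our illustration, not taken from the paper) applies the same convolution to a masked image whose missing region is filled with two different constants; the outputs near the hole disagree, showing that the result over unknown pixels is dictated by the arbitrary fill value rather than by image content.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.rand(1, 1, 8, 8)           # toy single-channel image
mask = torch.ones_like(image)
mask[..., 2:6, 2:6] = 0                  # 0 = missing pixels, 1 = known pixels
kernel = torch.rand(1, 1, 3, 3)

# Naively fill the hole with two different "unknown" placeholder values.
filled_zero = image * mask                                            # missing pixels set to 0
filled_mean = image * mask + (1 - mask) * image[mask.bool()].mean()   # missing pixels set to mean

out_zero = F.conv2d(filled_zero, kernel, padding=1)
out_mean = F.conv2d(filled_mean, kernel, padding=1)
# The outputs disagree wherever the receptive field touches the hole:
print((out_zero - out_mean).abs().max())
```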
We propose to replace the missing input values to a convolution with an estimate from spatially close values. To represent known and unknown values, we introduce a certainty C_x for each spatial location x, where C ∈ R^N and 0 ≤ C_x ≤ 1. Note that this representation enables a single certainty to represent several values when the input has multiple channels.

Furthermore, we define Ĩ_x as a random variable with discrete outcomes {I_x, h_x}, where I_x is the feature at spatial location x, and h_x is an estimate from spatially close features. In this way, we want the output of our convolution to be given by

    O_x = φ(f(E[Ĩ])_x),    (2)

where φ is the activation function and O is the output feature map. We approximate the probabilities of each outcome using the certainty C_x; that is,

    E[Ĩ_x] = C_x · I_x + (1 − C_x) · h_x.    (3)

We assume that a missing value can be approximated from spatially close values. Therefore, we define h_x as a learned certainty-weighted average of the surrounding features:

    h_x = (Σ_k ω_k · C_{x+k} · I_{x+k}) / (Σ_k ω_k · C_{x+k}),    (4)

where the sum runs over the K spatial locations surrounding x and ω ∈ R^K is a learnable parameter. In a sense, our convolutional layer will try to learn the outcome space of Ĩ_x. Furthermore, h_x is efficient to implement in standard deep learning frameworks, as it can be realized as a depth-wise separable convolution [37] with a re-normalization factor determined by C.

Propagating Certainties. Each convolutional layer expects a certainty for each spatial location. We handle the propagation of certainties as a learned operation,

    C^{L+1}_x = σ((C^L * W^C)_x),    (5)

where * is a convolution, W^C ∈ R^D is the filter, and σ is the sigmoid function. We constrain W^C to have the same receptive field as f, with no bias, and initialize C^0 to 0 for all unknown pixels and 1 otherwise.

The proposed solution is minimal and efficient, and other components of the network remain close to untouched. We use LeakyReLU as the activation function φ, and average pooling and pixel normalization [21] after each convolution f. Replacing all convolutional layers with O_x (Eq. 2) in our baseline network increases the number of parameters by ∼1%. This is in contrast to methods based on learned feature re-normalization [41, 48], where replacing a convolution with their proposed solution doubles the number of parameters. Similar to partial convolution [26], we use a single scalar to represent the certainty for each spatial location; however, we do not constrain the certainty representation to be binary, and our certainty propagation is fully learned.

U-Net Skip Connection. The U-Net [35] skip connection is a method to combine shallow and deep features in encoder-decoder architectures. Generally, the skip connection consists of concatenating shallow and deep features, followed by a convolution. However, for image inpainting, we only want to propagate certain features. To find the combined feature map for an input from layers L and L + l, we compute a weighted average. Assuming features from two layers in the network, (I^L, C^L) and (I^{L+l}, C^{L+l}), we define the combined feature map as

    I^{L+l+1}_x = γ_x · I^L_x + (1 − γ_x) · I^{L+l}_x,    (6)

and likewise for C^{L+l+1}. γ is determined by

    γ_x = (β_1 · C^L_x) / (β_1 · C^L_x + β_2 · C^{L+l}_x),    (7)

where β_1, β_2 ∈ R^+ are learnable parameters initialized to 1. Our U-Net skip connection is unique compared to previous work and designed for image inpainting. Equation 6 enables the network to only propagate features with a high certainty from shallow layers. Furthermore, we include β_1 and β_2 to give the model the flexibility to learn whether it should attend to shallow or deep features.
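To make the layer concrete, the following is a minimal PyTorch-style sketch of an imputed convolution following Eqs. (2)-(5) as written above. It is an illustration under our reading of the equations, not the reference implementation from the DeepPrivacy repository; the names (ImputedConv2d, imputation_size) are ours, and padding, initialization, and the omission of pixel normalization and pooling are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImputedConv2d(nn.Module):
    """Sketch of IConv: impute uncertain inputs (Eqs. 3-4), convolve (Eq. 2),
    and propagate a single-channel certainty map (Eq. 5)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, imputation_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.feature_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        # Certainty propagation (Eq. 5): same receptive field as the feature conv, no bias.
        self.certainty_conv = nn.Conv2d(1, 1, kernel_size, padding=pad, bias=False)
        # omega (Eq. 4): learnable spatial weights, here a single K x K kernel shared
        # over channels so that h_x stays a depth-wise (per-channel) operation.
        self.omega = nn.Parameter(torch.ones(1, 1, imputation_size, imputation_size))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, I, C):
        # h_x: certainty-weighted average of the surrounding features (Eq. 4).
        pad = self.omega.shape[-1] // 2
        weight = self.omega.expand(I.shape[1], -1, -1, -1)   # one kernel per input channel
        numerator = F.conv2d(I * C, weight, padding=pad, groups=I.shape[1])
        denominator = F.conv2d(C, self.omega, padding=pad).clamp(min=1e-6)  # avoid division by zero
        h = numerator / denominator
        # Expected input (Eq. 3): keep certain features, impute uncertain ones.
        I_expected = C * I + (1 - C) * h
        out = self.act(self.feature_conv(I_expected))         # Eq. 2
        C_next = torch.sigmoid(self.certainty_conv(C))        # Eq. 5
        return out, C_next

# Example usage (shapes only). C0 follows the paper: 0 for unknown pixels, 1 otherwise.
layer = ImputedConv2d(in_ch=3, out_ch=64)
I = torch.rand(1, 3, 128, 128)
C0 = torch.ones(1, 1, 128, 128)
C0[..., 32:96, 32:96] = 0
features, certainties = layer(I, C0)
```

In the full model, pixel normalization and a certainty-weighted average pooling would follow each such layer, as described above.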
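Likewise, a minimal sketch of the certainty-weighted skip connection in Eqs. (6)-(7); again the module name is ours, the certainty maps are assumed single-channel, and keeping β_1, β_2 positive (e.g. by clamping) is left as an assumption.

```python
import torch
import torch.nn as nn

class CertaintyWeightedSkip(nn.Module):
    """Sketch of the revised U-Net skip connection (Eqs. 6-7)."""

    def __init__(self):
        super().__init__()
        # beta_1, beta_2 > 0, initialized to 1 (Eq. 7).
        self.beta1 = nn.Parameter(torch.ones(1))
        self.beta2 = nn.Parameter(torch.ones(1))

    def forward(self, I_shallow, C_shallow, I_deep, C_deep):
        eps = 1e-6  # numerical safety when both certainties are zero
        gamma = (self.beta1 * C_shallow) / (self.beta1 * C_shallow + self.beta2 * C_deep + eps)
        I_out = gamma * I_shallow + (1 - gamma) * I_deep   # Eq. 6
        C_out = gamma * C_shallow + (1 - gamma) * C_deep   # "and likewise" for the certainties
        return I_out, C_out
```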
Improved Wasserstein GAN [2, 10] is widely used in image inpainting [17, 40, 41, 47]. Given a discriminator D, the objective function for optimizing a Wasserstein GAN with gradient penalties is given by

    L = L_adv + λ · E_x̂[(||∇D(x̂)||_p − 1)^2],    (8)

where L_adv is the adversarial loss, p is commonly set to 2 (L_2 norm), λ is the gradient penalty weight, and x̂ is a randomly sampled point between the real image, x, and a generated image, x̃. Specifically, x̂ = t · x + (1 − t) · x̃, where t is sampled from a uniform distribution [10].

Previous methods enforce the gradient penalty only for missing areas [17, 40, 47]. Given a mask M indicating the areas to be inpainted in the image x, where M is 0 for missing pixels and 1 otherwise (note that M = C^0), Yu et al. [47] propose the gradient penalty

    ḡ = ∇D(x̂) ⊙ (1 − M),    L_gp = λ · E_x̂[(||ḡ||_p − 1)^2],    (9)

where ⊙ is element-wise multiplication. This gradient penalty causes significant training instability, as the gradient sign of ḡ shifts depending on the cardinality of M. Furthermore, Eq. 9 imposes ||∇D(x̂)|| ≈ 1, which leads to a lower bound on the Wasserstein distance [18]. Imposing ||∇D(x̂)|| ≤ 1 removes the issue of shifting gradients in Eq. 9; furthermore, imposing the constraint ||∇D(x̂)|| ≤ 1 is shown to properly estimate the Wasserstein distance [18]. Therefore, we propose the following gradient penalty:

    L_gp = λ · E_x̂[max(0, ||∇D(x̂) ⊙ (1 − M)||_p − 1)^2].    (10)

Previous methods enforce the L_2 norm [17, 40, 47]. Jolicoeur-Martineau et al. [18] suggest that replacing the L_2 gradient norm with L_∞ can improve robustness. From empirical experiments (see Appendix 1), we find L_∞ more unstable and sensitive to the choice of hyperparameters; therefore, we enforce the L_2 norm (p = 2). In total, we optimize the following objective function:

    L = L_adv + L_gp.    (11)
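A minimal sketch of how the revised penalty in Eq. (10) might be computed in PyTorch. This reflects our reading of the equation (a one-sided penalty on the discriminator gradient restricted to missing pixels) rather than the reference implementation; the function name, the λ default, and passing the mask explicitly are assumptions.

```python
import torch

def masked_hinge_gradient_penalty(discriminator, real, fake, mask, lambda_gp=10.0):
    """One-sided gradient penalty (Eq. 10): penalize ||grad D(x_hat) * (1 - M)||_2 > 1.
    mask follows the paper's convention: 0 for missing pixels, 1 for known pixels."""
    batch_size = real.size(0)
    t = torch.rand(batch_size, 1, 1, 1, device=real.device)
    x_hat = (t * real + (1 - t) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grad = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat, create_graph=True)[0]
    grad = grad * (1 - mask)                              # restrict to missing areas
    grad_norm = grad.flatten(start_dim=1).norm(p=2, dim=1)
    penalty = torch.clamp(grad_norm - 1, min=0) ** 2      # enforce ||.|| <= 1 only
    return lambda_gp * penalty.mean()
```

In the total objective (Eq. 11), this term would simply be added to the Wasserstein adversarial loss.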
We propose several improvements to the baseline U-Net architecture [15]; see Fig. 3 for our final architecture. We replace all convolutions with Eq. 2, the average pooling layers with a certainty-weighted average, and the U-Net skip connections with our revised skip connection (see Eq. 6). Furthermore, we replace progressive growing training [21] with the Multi-Scale Gradient GAN (MSG-GAN) [19]. For the MSG-GAN, instead of matching different resolutions from the generator with the discriminator, we upsample each resolution and sum up the contributions of the RGB outputs [20]. In the discriminator we use residual connections, similar to [20]. Finally, we improve the representation of pose information in the baseline model (pose information is only used on the FDF dataset [15]).

Pose Information. The baseline model [15] represents pose information as one-hot encoded images for each resolution in the network, which is an extremely memory-inefficient and fragile representation. The pose information, P ∈ R^{K·2}, represents K facial keypoints and is used as conditional information for the generator and discriminator. We propose to replace the one-hot encoded representation and instead pre-process P into a 4 × 4 × 32 feature bank using two fully-connected layers. This feature bank is concatenated with the features from the encoder. Furthermore, after replacing progressive growing with MSG-GAN, we include the same pose pre-processing architecture in the discriminator, and input the pose information as a 32 × 32 × 1 feature map to the discriminator.

We evaluate our proposed improvements on the Flickr Diverse Faces (FDF) dataset [15], a lower-resolution (128 × 128) face dataset. We present experiments on the CelebA-HQ [21] and Places2 [53] datasets, which show that our suggestions generalize to standard image inpainting. We compare against the current state-of-the-art [34, 41, 48, 52]. Finally, we present a set of ablation studies to analyze the generator architecture. (To prevent ourselves from cherry-picking qualitative examples, we present several images, with corresponding masks, chosen by previous state-of-the-art papers [11, 41, 48, 52], thus copying their selection. Appendix 5 describes how we selected these samples. The only hand-picked examples in this paper are Fig. 1, Fig. 4, and Fig. 6.)

Quantitative Metrics. For quantitative evaluations, we report commonly used image inpainting metrics: pixel-wise distance (L1 and L2), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM). None of these reconstruction metrics is a good indicator of generated image quality, as there often exist several possible solutions to a missing region, and they do not reflect human nuances [51]. Recently proposed deep feature metrics correlate better with human perception [51]; therefore, we report the Fréchet Inception Distance (FID) [13] (lower is better) and the Learned Perceptual Image Patch Similarity (LPIPS) [51] (lower is better). We use LPIPS as the main quantitative evaluation.

We iteratively add our suggestions to the baseline [15] (Config A-E) and report quantitative results in Table 1. First, we replace the gradient penalty term with Eq. 10, where we use the L_2 norm (p = 2), and impose the following constraint (Config B):

    ||∇D(x̂) ⊙ (1 − C^0)||_2 ≤ 1,  with  x̂ = t · x + (1 − t) · G(x, C^0),    (12)

where C^0 is the binary input certainty and G is the generator. Note that we are not able to make Config A converge while imposing this constraint. We replace the one-hot encoded representation of the pose information with two fully-connected layers in the generator (Config C). Furthermore, we replace the input to all convolutional layers with Eq. 3 (Config D), setting the receptive field of h_x to 5 × 5 (K = 5 in Eq. 4). Finally, we replace the progressive-growing training technique with MSG-GAN [19] and replace the one-hot encoded pose information in the discriminator (Config E). These modifications combined improve the LPIPS score by 30.0%. The authors of [15] report an FID of 1.84 on the FDF dataset with a model consisting of 46M learnable parameters; in comparison, we achieve an FID of 1.49 with 2.94M parameters (Config E). For experimental details, see Appendix 2.

We extend Config E to the general image inpainting datasets CelebA-HQ [21] and Places2 [53]. We increase the number of filters in each convolution by a factor of 2, such that the generator has 11.5M parameters. In comparison, Gated Convolution [48] uses 4.1M parameters, LBAM [41] 68.3M, StructureFlow [34] 159M, and PIC [52] 3.6M. Compared to [48, 52], our increase in parameters improves semantic reasoning for larger missing regions. Also, compared to previous solutions, we achieve similar inference time, since the majority of the parameters are located at low-resolution layers (8 × 8 and 16 × 16); in contrast, [48] has no parameters at a resolution smaller than 64 × 64. For single-image inference time, our model matches (or outperforms) previous models: on a single NVIDIA 1080 GPU, our network runs at ∼89 ms per image at 256 × 256 resolution, 2× faster than LBAM [41] and PIC [52]. Gated Convolution [48] achieves ∼62 ms per image. See Appendix 2.1 for experimental details.
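As a side note, single-image inference time of this kind is typically measured with warm-up iterations and explicit GPU synchronization; the snippet below is our rough sketch of such a measurement, not the benchmark script behind the numbers above, and the model(x, mask) call signature is an assumption.

```python
import time
import torch

@torch.no_grad()
def measure_inference_ms(model, resolution=256, warmup=10, iters=100, device="cuda"):
    """Rough single-image GPU inference timing in milliseconds per image."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, resolution, resolution, device=device)
    mask = torch.ones(1, 1, resolution, resolution, device=device)
    for _ in range(warmup):          # warm-up to exclude CUDA initialization costs
        model(x, mask)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x, mask)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1000.0
```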
Fig. 4. Qualitative examples on the Places2 validation set with comparisons to Gated Convolution (GConv) [48], StructureFlow (SF) [34], and Pluralistic Image Completion (PIC) [52]; panels show (a) the input, (b) GConv, (c) PIC, (d) SF, and (e) ours. We recommend the reader to zoom in on missing regions. For non-hand-picked qualitative examples, see Appendix 5.

Table 2. Quantitative results on the CelebA-HQ and Places2 datasets. We use the official frameworks to reproduce results from [48, 52]. For (Center) we use a 128 × 128 center mask, and for (Free-Form) we generate free-form masks for each image following the approach in [48]. We report L1, L2, and SSIM in Appendix 3.

Fig. 5. Comparisons with PIC [52], Partial Convolution (PC) [26], Bidirectional Attention (BA) [41], and Gated Convolution (GC) [48]. Examples selected by the authors of [41] (images extracted from their supplementary material); results of [48, 52] generated using their open-source code and models. We recommend the reader to zoom in on missing regions.

Quantitative Results. Table 2 shows quantitative results for the CelebA-HQ and Places2 datasets. For CelebA-HQ, we improve LPIPS and FID significantly compared to previous models. For Places2, we achieve comparable results to [48] for free-form and center-crop masks. Furthermore, we compare our model with and without IConv and notice a significant improvement in generated image quality (see Fig. 1).

Qualitative Results. Figure 4 shows a set of hand-picked examples, Fig. 5 shows examples selected by [41], and Appendix 5 includes a large set of examples selected by the authors of [11, 41, 48, 52]. We notice fewer visual artifacts than models using vanilla convolutions [34, 52], and we achieve comparable results to Gated Convolution [48] for free-form image inpainting. For larger missing areas, our model generates more semantically coherent results compared to previous solutions [11, 41, 48, 52].

Fig. 6. Diverse plausible results: images from the FDF validation set [15]. The left column is the input image with the pose information marked in red. The second column and onwards show different plausible generated results. Each image is generated by randomly sampling a latent variable for the generator (except for the second column, where the latent variable is set to all 0's). For more results, see Appendix 6.

Pluralistic Image Inpainting. Generating different possible results for the same conditional image (pluralistic inpainting) [52] has remained a problem for conditional GANs [14, 54]. Figure 6 illustrates that our proposed model (Config E) generates multiple and diverse results. However, for Places2, we observe that our generator suffers from mode collapse early in training. Therefore, we ask: does a deterministic generator impact the generated image quality for image inpainting? To briefly evaluate the impact of this, we train Config D without a latent variable and observe a 7% degradation in LPIPS score on the FDF dataset. We leave further analysis of this for future work.

Figure 7 visualizes whether the generator attends to shallow or deep features in our encoder-decoder architecture. Our proposed U-Net skip connection enables the network to select features between the encoder and decoder depending on the certainty. Notice that our network attends to deeper features in cases of uncertain features, and to shallower features otherwise.

We propose a simple single-stage generator architecture for free-form image inpainting. Our proposed improvements to GAN-based image inpainting significantly stabilize adversarial training, and to our knowledge, we are the first to produce state-of-the-art results by exclusively optimizing an adversarial objective. Our main contributions are: a revised convolution to properly handle missing values in convolutional neural networks, an improved gradient penalty for image inpainting which substantially improves training stability, and a novel U-Net based GAN architecture to ensure global and local consistency. Our model achieves state-of-the-art results on the CelebA-HQ and Places2 datasets, and our single-stage generator is much more efficient compared to previous solutions.
References

Convolutional neural networks. In: Guide to Convolutional Neural Networks
Filling-in by joint interpolation of vector fields and gray levels
ACM SIGGRAPH 2009 papers - SIGGRAPH '09
Image inpainting
Region filling and object removal by exemplar-based image inpainting
Eye in-painting with exemplar generative adversarial networks
Image quilting for texture synthesis and transfer
Generative adversarial nets
Improved training of Wasserstein GANs
Progressive image inpainting with full-resolution residual network
Deep residual learning for image recognition
GANs trained by a two time-scale update rule converge to a local Nash equilibrium
Multimodal unsupervised image-to-image translation
DeepPrivacy: a generative adversarial network for face anonymization
Globally and locally consistent image completion
SC-FEGAN: face editing generative adversarial network with user's sketch and color
Connections between support vector machines, Wasserstein distance and gradient-penalty GANs
MSG-GAN: multi-scale gradient GAN for stable image synthesis
Analyzing and improving the image quality of StyleGAN
Progressive growing of GANs for improved quality, stability, and variation
Adam: a method for stochastic optimization
Texture optimization for example-based synthesis
Mask-specific inpainting with deep neural networks
Generative face completion
Image inpainting for irregular holes using partial convolutions
Rectifier nonlinearities improve neural network acoustic models
Which training methods for GANs do actually converge?
Examplar-based inpainting based on local geometry
Mixed precision training
EdgeConnect: generative image inpainting with adversarial edge learning
Conditional image synthesis with auxiliary classifier GANs
Context encoders: feature learning by inpainting
StructureFlow: image inpainting via structure-aware appearance flow
U-Net: convolutional networks for biomedical image segmentation
Improved techniques for training GANs
Rigid-motion scattering for image classification
Summarizing visual data using bidirectional similarity
Contextual-based image inpainting: infer, match, and translate
Image inpainting via generative multi-column convolutional neural networks
Image inpainting with learnable bidirectional attention maps
Image denoising and inpainting with deep neural networks
Image inpainting by patch propagation using patch sparsity
Shift-Net: image inpainting via deep feature rearrangement
High-resolution image inpainting using multi-scale neural patch synthesis
The unusual effectiveness of averaging in GAN training
Generative image inpainting with contextual attention
Free-form image inpainting with gated convolution
Learning pyramid-context encoder network for high-quality image inpainting
Semantic image inpainting with progressive generative networks
The unreasonable effectiveness of deep features as a perceptual metric
Pluralistic image completion
Places: a 10 million image database for scene recognition
Toward multimodal image-to-image translation

Acknowledgements. The computations were performed on resources provided by the Tensor-GPU project led by Prof. Anne C. Elster through support from The Department of Computer Science and The Faculty of Information Technology and Electrical Engineering, NTNU. Furthermore, Rudolf Mester acknowledges the support obtained from DNV GL.