International Journal of Advanced Network, Monitoring and Controls Volume 05, No.02, 2020

DOI: 10.21307/ijanmc-2020-013

Image Inpainting Research Based on Deep Learning

Zhao Ruixia
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, China
E-mail: 1364343954@qq.com

Zhao Li
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, China
E-mail: 332099732@qq.com

Abstract—With the rapid development of computer technology, image inpainting has become a research hotspot in the field of deep learning. Image inpainting lies at the intersection of computer vision and computer graphics, and is an image processing technology between image editing and image generation. The proposal of the generative adversarial network alleviates the problems of poor inpainting quality and large differences between the inpainted image and the target image, and has promoted the development of image inpainting technology. In this paper, image inpainting is based on generative adversarial networks. The network structure establishes two inpainting paths, a reconstruction path and a generation path, and the two paths correspond to two groups of networks. The encoder and generator in the network complete the encoding and decoding tasks, respectively, based on the residual network. The discriminator, also built on the residual network, uses a patch-based discriminator to judge the authenticity of the image. This paper uses the Places2 data set to verify the algorithm, and uses two objective evaluation methods, PSNR and SSIM, to evaluate the quality of the inpainted images. Experiments show that the algorithm achieves a good inpainting effect.

Keywords-Image Inpainting; Generative Adversarial Networks; Residual Network; Patch

I. INTRODUCTION

With the development and popularization of computer technology, Internet technology and multimedia technology, digital image processing technology has also developed rapidly. During the storage, transmission and use of digital image information, image information may be damaged or lost. These damaged areas affect the visual effect of the picture and the integrity of its information, and have a certain impact on the application of the picture. A technology that can automatically inpaint damaged digital images is urgently needed, and digital image inpainting technology was born to meet this need.

Image inpainting is one of the most popular areas of deep learning. Its basic principle is: given an image with a damaged or corroded area, use the intact information of the known area to restore the damaged area of the image[1-2]. Digital image inpainting methods can be divided into two major categories: traditional image inpainting methods and deep learning-based image inpainting methods. Traditional methods can be further divided into structure-based inpainting techniques and texture-synthesis-based inpainting techniques. Both can repair small missing areas such as folds and scratches, but as the missing area grows the inpainting effect gradually deteriorates, and problems such as incomplete semantic information and blurred results make the inpainting effect far from ideal.
The emergence of deep neural networks allows a model to obtain an understanding of image semantic information through multi-level feature extraction, which to a certain extent improves the inpainting of large damaged areas. Deep learning has shown exciting prospects in the fields of image semantic inpainting and situational awareness, and image inpainting algorithms based on deep learning can capture higher-level image features than traditional algorithms based on structure and texture, so they are now widely used for image inpainting. At present, image inpainting based on generative adversarial networks is a major research hotspot in the field of deep learning image inpainting, and it lays a solid foundation for the development of image inpainting technology.

A. The basic idea of generative adversarial networks

The generative adversarial network (GAN) is undoubtedly one of the most popular artificial intelligence technologies, and was rated one of the MIT Technology Review's "Top Ten Global Breakthrough Technologies" of 2018. A generative adversarial network is composed of a generative network and a discriminant network. The purpose of the generative network is to estimate the distribution of the data samples from given noise and generate synthetic data. The purpose of the discriminant network is to judge whether its input comes from the generated data or the real data. The generative network and the discriminant network form an adversarial pair. This adversarial idea comes from the zero-sum game in game theory: the two sides of the game each adapt their strategy to the opponent's strategy in an equal game, so as to achieve the goal of winning[3]. Extended to the generative adversarial network, the generative network and the discriminant network are the two sides of the game, and the optimization goal is to reach a Nash equilibrium[4]: the generative network tries to produce data closer to the real data, and correspondingly, the discriminant network tries to distinguish more reliably between real data and generated data. As a result, the two networks improve through confrontation and continue to confront each other after improving, so that the data produced by the generative network becomes better and better, approaching the real data.
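To make the alternating optimization concrete, the following minimal PyTorch sketch shows one adversarial training step. It is a generic illustration, not the networks of this paper: the fully connected generator and discriminator, the latent size, the optimizer settings and the binary cross-entropy loss are placeholder assumptions (Section III replaces the loss with WGAN-GP).

```python
import torch
import torch.nn as nn

# Hypothetical toy networks; the paper's actual networks are the residual
# encoder-decoder and patch discriminator described in Section II.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                      # real: (batch, 784) tensor
    batch = real.size(0)
    z = torch.randn(batch, 100)            # noise input to the generator
    fake = G(z)

    # 1) Discriminator step: real samples labeled 1, generated samples 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # 2) Generator step: try to make D label the fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```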
B. Development of deep learning models

The input of the GAN generative model is random noise, while practical applications generally require explicit variables to control the category or other properties of the data to be generated, for example generating a specific digit from 1 to 9. To solve the problem of generating labeled data, the Conditional Generative Adversarial Network was proposed: information such as category labels or pictures is added to the input so that the generated image better matches the target[5]. The foundation of image inpainting technology based on deep learning is the convolutional neural network, which is used to extract high-dimensional features and predict missing information; this has made image inpainting technology develop rapidly[6-7]. Because the generative and discriminant networks in the original GAN are too simple, images become blurred when generating large-size images. In order to generate clear images, Radford et al.[8] proposed the deep convolutional generative adversarial network. The emergence of several unsupervised image conversion models, such as CycleGAN[9], DiscoGAN[10] and DualGAN[11], has provided further ideas for image inpainting technology.

II. NETWORK STRUCTURE

Image inpainting not only requires that the result conforms to human visual habits, making it difficult for the human eye to detect the traces of inpainting[12], but also requires recovering the information contained in the missing regions as much as possible, so that the restored image is as close as possible to the image before the damage. Based on this goal, this paper builds an image inpainting network framework by studying and analyzing the structural principles of GAN and using the neural network's ability to extract high-dimensional image features. A parallel dual-path framework based on GAN is used: one path is the reconstruction path, which uses the given real image and masked image to obtain the complementary image and reconstruct the original image; the other is the generation path, which uses only the given masked image for inpainting. The input image of the generation path and the input image of the reconstruction path are complementary images of each other. The network structure is built on the basis of the residual network and includes three parts: encoder, generating network and discriminant network. The image inpainting process in this paper is:

(1) Input the masked image and its complement image (together they form the real image) into the encoders E1 and E2 of the reconstruction path and the generation path for encoding;
(2) Fuse the two extracted image features and input them into generators G1 and G2;
(3) Input the image reconstructed by the generator and the real image into discriminator D1 for discrimination;
(4) Input the generated image, the fused image and the real image into discriminator D2 for discrimination;
(5) Discriminators D1 and D2 output the discrimination results and feed them back to the encoders, generators and discriminators through the back-propagation algorithm to update the network parameters and train the network.

The overall structure of the network is shown in Figure 1.

Figure 1. Data flow diagram of the dual-path GAN (encoders E1/E2, generators G1/G2 and discriminators D1/D2 on the reconstruction and generation paths, with coded information fusion of the masked image and its complement)

A. Encoder

The encoder extracts the features of the image based on the residual network. The inputs of encoders E1 and E2 are three-channel images of 256×256 pixels. The residual block is composed of two convolution layers and one skip link. The first layer uses a 3×3 convolution kernel with a sliding step size of 1 and a padding of 1; the second layer uses a 3×3 convolution kernel with a sliding step size of 1 and no padding. The residual structure of the encoder is shown in Figure 2. In this paper there are two parallel paths for image inpainting, the reconstruction path and the generation path, and the encoders of the two paths have the same network structure: a stack of seven residual modules. The encoder network structure is shown in Figure 3.

Figure 2. Residual structure of the encoder (two convolutions plus an identity skip connection, output f(x)+x)

Figure 3. Encoder network structure (seven stacked residual modules mapping the damaged input image to image features)
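As a concrete reading of this description, the sketch below implements one encoder residual block in PyTorch. The channel width, the input stem, and the center-cropping of the identity branch (needed because the second convolution is described as unpadded) are assumptions introduced for illustration, not details given in the paper.

```python
import torch
import torch.nn as nn

class EncoderResBlock(nn.Module):
    """Encoder residual block: two 3x3 convolutions plus a skip link.

    Per Section II.A, the first convolution uses stride 1 / padding 1 and
    the second uses stride 1 / no padding, so each block shrinks the map
    by 2 pixels; the identity branch is center-cropped to keep f(x) + x
    well-defined (the cropping is our assumption).
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=0)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.act(self.conv1(x))
        f = self.conv2(f)
        identity = x[:, :, 1:-1, 1:-1]      # crop to the conv2 output size
        return self.act(f + identity)

# Seven stacked blocks form one encoder, as in Figure 3; the stem and the
# width of 64 channels are placeholders.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),         # 3-channel 256x256 input
    *[EncoderResBlock(64) for _ in range(7)])
```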
B. Generating network

The generating network adopts a Res-Net structure and uses residual decoding blocks to decode the features extracted in the encoding stage. The residual block used in the decoding stage is composed of three parts: a convolution layer, a deconvolution layer and a skip link layer. The convolution layer uses a 3×3 convolution kernel with a sliding step size of 1 and a padding of 1. The deconvolution layer uses a 3×3 convolution kernel with a sliding step size of 2 and a padding of 1, and an output padding of 1 is applied after the deconvolution operation. The skip link layer also performs a deconvolution operation with a 3×3 kernel, a sliding step size of 2, a padding of 1 and an output padding of 1. The generating network uses the Spectral Normalization method to normalize the output data. The network structure of the residual block in the decoding stage is shown in Figure 4. A self-attention mechanism has also been added to the network; it is built on residual blocks and uses Short+Long Term attention to ensure the consistency of the appearance of the generated image. The structure of the generating network is shown in Figure 5.

Figure 4. Decoding residual block network structure (convolution followed by deconvolution on the main branch, merged with a deconvolution skip link)

Figure 5. Generating network structure diagram (residual modules and decoding residual modules with a self-attention mechanism, mapping the input image features to the generated image)
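A hedged PyTorch sketch of this decoding block follows. Interpreting "the output image has a padding of 1" as ConvTranspose2d's output_padding, applying spectral normalization to every layer, and the channel counts are assumptions made for this sketch; the paper does not specify these details.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DecoderResBlock(nn.Module):
    """Decoding residual block: conv -> deconv main branch, deconv skip link.

    Kernel sizes, strides and paddings follow the text of Section II.B;
    where exactly spectral normalization is applied is not specified, so
    wrapping every layer is our assumption.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = spectral_norm(
            nn.Conv2d(in_ch, in_ch, 3, stride=1, padding=1))
        self.deconv = spectral_norm(
            nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                               padding=1, output_padding=1))
        self.skip = spectral_norm(
            nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                               padding=1, output_padding=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.deconv(self.act(self.conv(x)))
        return self.act(f + self.skip(x))   # upsampled residual sum

# Each block doubles the spatial size, e.g. 64x64 features -> 128x128:
x = torch.randn(1, 128, 64, 64)
print(DecoderResBlock(128, 64)(x).shape)    # torch.Size([1, 64, 128, 128])
```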
C. Discrimination network

The discrimination network adopts the structure of PatchGAN. The difference between PatchGAN and an ordinary GAN is that an ordinary GAN outputs a single evaluation of the entire image, while PatchGAN outputs an N×N matrix in which each element corresponds to a receptive field, that is, a patch, of the original image. This paper runs the patch discriminator over the image in a convolutional manner: each element of the output patch map corresponds to a 70×70 patch of the input and represents the probability that the patch is real. The input of the discrimination network is a picture; the target picture is used as a positive example and the inpainted picture as a negative example, so as to judge whether the inpainted picture is true. The discriminators D1 and D2 in this paper have the same network structure and use five convolution layers. The first three layers use a 4×4 convolution kernel with a sliding step size of 1 and a padding of 1; the last two layers use a 4×4 convolution kernel with a sliding step size of 2 and a padding of 1. The discrimination network first extracts the features of the input image and then analyzes and compares the extracted features. Its structure is shown in Figure 6.

Figure 6. Discriminant network structure diagram (five convolution layers mapping the generated image or the real image to a patch probability value)

III. NETWORK TRAINING

In this paper, the WGAN-GP loss is used to optimize the network. WGAN-GP is an improvement of WGAN that adds a gradient penalty to enforce the Lipschitz continuity constraint, making GAN convergence more stable. The loss function of WGAN-GP is composed of the generator loss $L_G$ and the discriminator loss $L_D$; the discriminator loss can be written as

$$L_D = L_D^{WGAN} + \lambda L_{gp}, \qquad L_D^{WGAN} = \mathbb{E}_{z}\big[D(G(z))\big] - \mathbb{E}_{x}\big[D(x)\big], \qquad L_{gp} = \mathbb{E}_{\hat{x}}\Big[\big(\big\|\nabla_{\hat{x}} D(\hat{x})\big\|_2 - 1\big)^2\Big] \quad (1)$$

where $x$ represents a randomly selected sample from the data set, $D(x)$ represents the output of the discriminant model when its input is a real sample, $G(z)$ is a sample generated from the noise $z$, $\hat{x}$ is a point interpolated between a real and a generated sample, $L_D^{WGAN}$ represents the loss function of the original WGAN discriminator, $L_{gp}$ represents the gradient penalty loss newly added in WGAN-GP, and $\lambda$ represents the penalty coefficient.
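The gradient penalty term $L_{gp}$ of Eq. (1) can be implemented as follows. This sketch follows the standard WGAN-GP formulation (random interpolation between real and generated samples); the penalty coefficient λ = 10 is the common default from the WGAN-GP paper, not a value reported here.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """L_gp of Eq. (1): penalize the critic's gradient norm deviating
    from 1 at points interpolated between real and generated images."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

# Discriminator loss of Eq. (1), for any critic network:
# loss_d = critic(fake).mean() - critic(real).mean() \
#          + gradient_penalty(critic, real, fake)
```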
IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental environment

In order to verify the effectiveness of the algorithm proposed in this paper, experiments are conducted on the Ubuntu platform using the Python language and the PyTorch deep learning framework, with 5000 images from the public Places2 data set. The image size is 256×256 pixels, and the images are split 8:2 between training and testing.

B. Experimental results

Since the image inpainting task is to repair the incomplete part of an image, the data set is mask-processed before the inpainting task. In this paper, the preprocessing uses two kinds of masks: random masks and intermediate (center) masks. After the data processing is completed, the image inpainting task is performed. The inpainting results for center occlusion are shown in Figure 7, where (a) is the damaged image, (b) the inpainted image and (c) the real image. The inpainting results for random masks are shown in Figure 8, with the same layout.

C. Experimental analysis

At this stage, there are mainly two kinds of image evaluation methods: subjective evaluation and objective evaluation. This paper combines both to evaluate the inpainted images.

1) Subjective evaluation: From the experimental results in Section IV.B, it can be seen that the content of the images inpainted by this method is basically the same as the target images, the color is very similar to the target images, and under direct visual observation the images look real and natural; the inpainted texture is natural and continuous.

2) Objective evaluation: The objective evaluation uses the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) to evaluate the inpainted images. The higher the PSNR, the less distortion is introduced in the inpainting process and the better the inpainted picture. SSIM measures the similarity of two images: a higher value indicates that the two images are more similar, with a maximum value of 1.

The peak signal-to-noise ratio is defined as

$$PSNR = 10 \log_{10}\left(\frac{L^2}{MSE}\right), \qquad MSE = \frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}\big[I_0(i,j) - I(i,j)\big]^2 \quad (2)$$

where $MSE$ is the mean square error, $L$ is the peak pixel value (255 by default), $I_0(i,j)$ represents the pixel value at $(i,j)$ in the real image, $I(i,j)$ represents the pixel value at $(i,j)$ in the inpainted image, and $M \times N$ is the size of the inpainted image area.

The structural similarity can be written as

$$SSIM(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \quad (3)$$

where $x$ and $y$ represent the two input images, $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$, $C_2$ are constants used to maintain numerical stability, derived from $L$, the dynamic range of pixel values, generally taken as 255.

This paper compares four different image inpainting models using the PSNR and SSIM methods; the scores are listed in Table I below.
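For reference, Eq. (2) and a single-window simplification of Eq. (3) can be computed as below. Standard SSIM averages the statistic over local sliding windows, so this global version is an illustrative approximation, not the exact evaluation code behind Table I; the k1/k2 constants are the conventional defaults.

```python
import numpy as np

def psnr(real, fake, peak=255.0):
    """Eq. (2): peak signal-to-noise ratio between two images in [0, 255]."""
    diff = real.astype(np.float64) - fake.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    """Eq. (3) evaluated once over the whole image (a simplification;
    practical SSIM averages over local windows)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```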
Figure 7. Inpainting results for intermediate (center) masks: (a) damaged image, (b) inpainted image, (c) real image

Figure 8. Inpainting results for random masks: (a) damaged image, (b) inpainted image, (c) real image

TABLE I. EVALUATION RESULTS OF THE PSNR AND SSIM METHODS

Image inpainting model    PSNR     SSIM
CE[13]                    18.72    0.843
GL[14]                    19.90    0.836
GntIpt[15]                20.38    0.855
GMCNN[16]                 20.62    0.851
Ours                      24.06    0.857

V. CONCLUSION AND PROSPECT

In this paper, an image inpainting network structure is built based on GAN. The residual network is used in the encoding and decoding process to reduce the problems of gradient vanishing and gradient explosion. The WGAN-GP loss function is used to update the network parameters during inpainting, improving not only the structural similarity of the inpainted image but also the matching degree of its texture. The Places2 dataset is used for network training and testing, and both subjective and objective evaluation methods are used to evaluate the inpainted images; the objective evaluation selects SSIM and PSNR. The comparison between this paper's inpainting model and the inpainting models of other papers verifies the effectiveness of the proposed algorithm.

REFERENCES
[1] Bertalmio M, Sapiro G, Caselles V, et al. Image inpainting[C]. International Conference on Computer Graphics and Interactive Techniques, 2000: 417-424.
[2] Guillemot C, Meur O L. Image Inpainting: Overview and Recent Advances[J]. IEEE Signal Processing Magazine, 2014, 31(1): 127-144.
[3] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]. Advances in Neural Information Processing Systems, 2014: 2672-2680.
[4] Ratliff L J, Burden S A, Sastry S, et al. Characterization and computation of local Nash equilibria in continuous games[C]. Allerton Conference on Communication, Control, and Computing, 2013: 917-924.
[5] Mirza M, Osindero S. Conditional Generative Adversarial Nets[J]. arXiv preprint arXiv:1411.1784, 2014.
[6] Pathak D, Krahenbuhl P, Donahue J, et al. Context Encoders: Feature Learning by Inpainting[C]. Computer Vision and Pattern Recognition, 2016: 2536-2544.
[7] Yang C, Lu X, Lin Z, et al. High-Resolution Image Inpainting Using Multi-scale Neural Patch Synthesis[C]. Computer Vision and Pattern Recognition, 2017: 4076-4084.
[8] Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[J]. arXiv preprint arXiv:1511.06434, 2015.
[9] Zhu J, Park T, Isola P, et al. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks[C]. International Conference on Computer Vision, 2017: 2242-2251.
[10] Kim T, Cha M, Kim H, et al. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks[J]. arXiv preprint arXiv:1703.05192, 2017.
[11] Yi Z, Zhang H, Tan P, et al. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation[C]. International Conference on Computer Vision, 2017: 2868-2876.
[12] Efros A A, Freeman W T. Image quilting for texture synthesis and transfer[C]. International Conference on Computer Graphics and Interactive Techniques, 2001: 341-346.
[13] Pathak D, Krahenbuhl P, Donahue J, et al. Context Encoders: Feature Learning by Inpainting[C]. Computer Vision and Pattern Recognition, 2016: 2536-2544.
[14] Iizuka S, Simo-Serra E, Ishikawa H. Globally and locally consistent image completion[J]. ACM Transactions on Graphics, 2017, 36(4).
[15] Yu J, Lin Z, Yang J, et al. Generative Image Inpainting with Contextual Attention[C]. Computer Vision and Pattern Recognition, 2018: 5505-5514.
[16] Wang Y, Tao X, Qi X, et al. Image Inpainting via Generative Multi-column Convolutional Neural Networks[C]. Neural Information Processing Systems, 2018: 329-338.