International Journal of Advanced Network, Monitoring and Controls, Volume 05, No. 03, 2020
DOI: 10.21307/ijanmc-2020-029

Hierarchical Image Object Search Based on Deep Reinforcement Learning

Wei Zhang
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China
E-mail: weivanity@gmail.com

Hongge Yao
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China
E-mail: 835092445@qq.com

Yuxing Tan
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China
E-mail: 842061340@qq.com

Abstract—Object detection occupies a pivotal position in modern computer vision research. Its purpose is to accurately locate the object a human is looking for in an image and to classify that object. With the development of deep learning, convolutional neural networks are widely used because of their outstanding performance in feature extraction, which has greatly improved the speed and accuracy of object detection. In recent years, reinforcement learning has emerged in the field of artificial intelligence, showing excellent decision-making ability. In order to combine the perception ability of deep learning with the decision-making ability of reinforcement learning, this paper incorporates reinforcement learning into a convolutional neural network and proposes a hierarchical deep reinforcement learning object detection model.

Keywords-Object Detection; Deep Learning; Reinforcement Learning

I. INTRODUCTION

When observing a picture, humans can immediately recognize the location and category of the objects in the image, almost without conscious thought. This is effortless for us, but a computer lacks the complex reasoning of the human brain, so the task is far from easy. In computer vision, the localization and retrieval of objects in images is affected by two aspects: the content of the image and the quality of the algorithm. Two main factors influence the image itself. The first is that the background and lighting conditions when the picture is taken affect image quality and thus reduce detection accuracy. The second is the image content: several similar objects, objects partially occluded by other objects, and different viewing angles all affect detection accuracy. On the algorithmic side, the main question is how to obtain higher-quality features. Therefore, the key research problem is how to design an algorithm that achieves accurate localization while continuously improving localization speed.

For computers, pictures are collections of binary data, and computers cannot imagine the things behind the data. Our purpose is to let the computer simulate human vision and acquire a basic ability to process images. Human beings receive a large amount of information every day, most of it transmitted through vision, and only part of the information in these visual images is actually needed. Therefore, by extracting the important information and localizing and identifying it accurately, we can greatly reduce the amount of data the computer needs to process and improve the efficiency of data processing.
Reinforcement learning is an important field of machine learning. It constructs a Markov decision process and, imitating human thinking, teaches an agent how to take actions that obtain high reward values from the environment, finding the best strategy for the problem through this constant interaction. Based on this idea, this paper uses reinforcement learning to simulate the human visual attention mechanism. The agent is taught to change the shape of the bounding box and focus on only a salient part of the image at a time, whose features are then extracted by a convolutional neural network. In this way, localization and classification of the object in the image can be achieved.

II. RELATED WORK

A. Traditional object detection algorithm

Traditional object detection algorithms rely on elementary feature extraction methods, such as extracting HOG features of objects and training SVM classifiers for recognition. These algorithms are generally divided into three stages (see Figure 1):

1) Select sliding windows of different sizes according to the size of the object, and use a sliding window to select part of the image as a candidate region.
2) Extract visual features from the candidate regions.
3) Use an SVM classifier for recognition.

Figure 1. Traditional object detection pipeline (input image, region selection, feature extraction, classification)

The traditional object search algorithm has the following disadvantages:

1) The sliding-window selection strategy slides across the entire image from beginning to end, and windows with different size ratios must traverse the image to cover different object sizes. Although this can cover all possible object positions, the brute-force enumeration results in extremely high time complexity and a large number of windows unrelated to the object, so the speed and performance of feature extraction and classification hit a bottleneck.
2) The characteristics of each object are different, which leads to great diversity of appearance, and the background of each object also affects recognition accuracy. Therefore, hand-designed features are not very robust.

B. Object detection algorithm based on deep learning

Since the appearance of CNNs, they have been widely used in computer vision. With the continuous development of science and technology, obtaining large amounts of sample data has become significantly easier, and the continuous improvement of computing power has enabled CNNs to extract features from large amounts of data, bringing huge gains to computer vision. To address the shortcomings of traditional object detection methods, deep-learning-based object detection algorithms use CPMC, Selective Search, MCG, RPN and other methods to generate candidate regions instead of the sliding-window strategy. These methods usually exploit details of the image, such as contrast, edges and color, to extract higher-quality candidate regions while reducing the number of candidate regions and the time complexity. Such object detection methods generally fall into two types: one-stage detection algorithms and two-stage detection algorithms. A one-stage detection algorithm treats object detection as a regression problem and directly predicts the category and position of the object.
Such algorithms have fast detection speed but lower accuracy. A two-stage detection algorithm first generates a large number of region proposals and then classifies these proposals with a convolutional neural network, so its accuracy is higher but its detection speed is slower.

C. Object detection algorithm based on deep reinforcement learning

In recent years, research on deep reinforcement learning has flourished. It has outperformed human master players in many games; in particular, the success of the DeepMind team on the AlphaGo project pushed deep reinforcement learning to a new height. In this context, many researchers have tried to apply deep reinforcement learning to object detection. In 2015, Caicedo et al. adopted a top-down search strategy that analyzes the entire scene first and then moves steadily toward the object location: a large bounding box initially frames the object and is then shrunk step by step until the object is surrounded by a compact bounding box. In 2016, Mathe et al. proposed an image-based sequential search model that extracts image features from a small number of pre-selected image positions in order to search for visual objects efficiently. By formulating sequential search as reinforcement learning of the search policy, their fully trainable model can explicitly balance, for each class, the conflicting goals of exploration (sampling more image regions for better accuracy) and exploitation (stopping the search efficiently when sufficiently confident about the object's location).

The above models all use reinforcement learning techniques to improve deep learning algorithms, and all achieve good results. However, if a visual object search algorithm is required to reach relatively high accuracy, it still needs to rely on a large number of candidate regions, so our research direction is to reduce the number of candidate regions while keeping their quality high.

III. HIERARCHICAL OBJECT SEARCH MODEL BASED ON DRL

A. MDP formulation

This paper regards object detection as a Markov decision process and finds an effective object detection strategy by solving the resulting decision problem. At each step, the agent interacts with the current environment based on the current state, decides the next search action, and receives an immediate reward. By learning to obtain a high cumulative reward, the agent continuously improves its search efficiency.

There are 6 actions in the action space A, of two different types: selection actions and a stop action. A selection action frames a part of the current area as the next observation area; the selection actions consist of four border boxes and one center box, each of which reduces the current search area to a different sub-region (see Figure 2). The stop action indicates that the object has been found, so the bounding box is no longer changed and the search stops.

Figure 2. The selection actions

In reinforcement learning, the state is the premise and basis on which the agent acts. In this model, the state is composed of two parts. One is the feature vector extracted by the convolutional neural network from the current region. The other is the history of actions performed while searching for the object. This history helps to stabilize the search trajectory so that the search does not fall into loops, thereby improving search accuracy. A sketch of this state and action representation is given below.
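To make this concrete, the following sketch shows one way the action space and state vector could be represented in Python. It is a minimal illustration under assumptions not stated in the paper: the exact layout of the five sub-regions, the 4096-dimensional feature vector and the length of the action history are values introduced here for the example, not the authors' implementation.

```python
import numpy as np

# Six actions: five selection actions that shrink the current box to a
# sub-region, plus a stop action. The five-way split into four corner boxes
# and one center box is assumed for this sketch.
ACTIONS = ["top_left", "top_right", "bottom_left", "bottom_right", "center", "stop"]
NUM_ACTIONS = len(ACTIONS)

HISTORY_LEN = 4      # number of past actions remembered (assumed value)
FEATURE_DIM = 4096   # size of the CNN feature vector (assumed value)


def encode_history(past_actions):
    """Encode the last HISTORY_LEN actions as concatenated one-hot vectors."""
    history = np.zeros(HISTORY_LEN * NUM_ACTIONS, dtype=np.float32)
    for i, a in enumerate(past_actions[-HISTORY_LEN:]):
        history[i * NUM_ACTIONS + ACTIONS.index(a)] = 1.0
    return history


def make_state(region_features, past_actions):
    """State = CNN feature vector of the current region + action history."""
    region_features = np.asarray(region_features, dtype=np.float32)
    return np.concatenate([region_features, encode_history(past_actions)])
```

A state built this way has FEATURE_DIM + HISTORY_LEN x NUM_ACTIONS components and would serve as the input of the Q-network described in the next subsection.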
The reward reflects the feedback the agent obtains while interacting with the environment. The agent judges the merit of an action according to the reward it receives, and finally learns the strategy that maximizes the cumulative reward. Since the agent has two types of actions, the reward is computed differently depending on the action type; the reward obtained by the agent depends on the action it takes in the current state. The model uses the overlap ratio (IoU) between the observed region and the object region to evaluate the effect of an action, so that detection accuracy can be measured. Let b' be the observation area after the movement, b the observation area before the movement, and g the ground-truth object area; R_a(s, s') is the reward obtained after a selection action. The reward function can then be expressed as:

R_a(s, s') = sign( IoU(b', g) - IoU(b, g) )                                (1)

If the difference of the overlap ratios is positive, the predicted region has moved closer to the object area; if it is negative, the predicted region has moved farther from the object area. Thus, if the decision improves detection accuracy the reward is positive, otherwise the reward is negative.

The model uses R_t(s, s') as the reward function for the stop action and sets the magnitude of the stop reward to η. At the same time, a threshold τ is needed to determine when the search may end: when IoU(b, g) is greater than the threshold, the object has been found, and the search can end with the stop action. Along with this reward there is also a penalty, applied when IoU(b, g) is still below the threshold and the maximum number of search steps has been reached, so that the agent recognizes the wrong behaviour and corrects it. The reward function for the stop action is as follows:

R_t(s, s') = +η if IoU(b, g) ≥ τ,  −η otherwise                            (2)

B. DQN algorithm

The model uses three fully connected layers to form the Q-network. Its input is the state described above, and the activation values of the 6 neurons in the output layer represent the confidence of the 6 actions; the action with the highest confidence is selected. Given the states, actions and reward functions, the agent applies the Q-learning algorithm to learn the optimal strategy. Because the input image is high-dimensional, this paper uses a neural network Q(s, a; θ) to approximate the Q function. The Q function under a policy π is expressed as follows:

Q^π(s, a) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s, a_t = a ]      (3)

The agent selects the action with the highest Q value from the Q function, and uses the Bellman equation to continuously update it:

Q(s, a) = r + γ max_{a'} Q(s', a')                                          (4)

where s is the current state, a is the action selected in the current state, r is the immediate reward, γ is the discount factor, s' is the next state, and a' is the next action to be taken.

In order to train Q(s, a; θ), a large number of training samples is needed, and these are usually collected sequentially (see Figure 3), but the correlation between adjacent samples makes Q-network learning inefficient and unstable. This paper uses the experience-replay mechanism to solve this problem: when the capacity of the experience pool approaches saturation, old samples are constantly replaced with new ones.

Figure 3. Sample generation process based on the Markov decision process

At the same time, so that most samples are selected with nearly the same probability, samples are drawn from the experience pool at random. The loss function of the training process is set as follows:

L(θ) = E[ ( y − Q(s, a; θ) )² ],  with  y = r + γ max_{a'} Q(s', a')        (5)

where Q(s, a; θ) is the actual output of the network, y is the expected output of the network, r is the current reward value, max_{a'} Q(s', a') is the maximum expected value for the next decision, and γ is the discount factor. A sketch of this training step is given below.
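As an illustration of this training step, the sketch below implements a three-layer fully connected Q-network with an experience pool and the loss of Eq. (5). PyTorch, the hidden-layer width, the batch size and the Adam optimizer are assumptions made for this example; the paper only specifies three fully connected layers, six output neurons, an experience pool of size 1000 and a discount factor of 0.9. The state dimension reuses the assumed sizes from the earlier sketch, and states are taken to be 1-D float tensors.

```python
import random
from collections import deque

import torch
import torch.nn as nn

NUM_ACTIONS = 6
STATE_DIM = 4096 + 4 * NUM_ACTIONS   # CNN features + action history (assumed sizes)
GAMMA = 0.9                          # discount factor used in the paper
REPLAY_CAPACITY = 1000               # experience pool size used in the paper

# Q-network: three fully connected layers, six outputs (one Q value per action).
# The hidden width of 1024 is an assumption; the paper does not state it.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

replay_pool = deque(maxlen=REPLAY_CAPACITY)   # old samples are dropped when full


def store(state, action, reward, next_state, done):
    """Add one transition (tensors for states, int action, float reward) to the pool."""
    replay_pool.append((state, action, reward, next_state, done))


def train_step(batch_size=32):
    """One DQN update: sample uniformly from the pool and minimize Eq. (5)."""
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(list(replay_pool), batch_size)
    states, actions, rewards, next_states, dones = map(list, zip(*batch))

    states = torch.stack(states)
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        y = rewards + GAMMA * (1.0 - dones) * q_next   # no bootstrap after the stop action

    loss = nn.functional.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

During training, the transitions produced by the hierarchical search loop of Section III.C would be pushed into the pool with store(), and train_step() would be called repeatedly as new samples arrive.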
C. Hierarchical object search process

The initial candidate region of the model is the entire image. The candidate region is normalized to a fixed size and fed into a trained CNN model to extract feature values. Following an ε-greedy strategy, the agent either selects one of the actions at random with probability ε, or uses the learned policy to make the decision with probability 1 − ε. After the model performs action a, it switches to a new candidate area, a sub-region of the previous region, and receives the corresponding reward r according to the reward function. The new candidate area is again normalized and fed into the neural network for feature extraction and combined with the previous actions to obtain a new state s'. This hierarchical process is repeated until the chosen action is the stop action or the number of search steps reaches the upper limit. If a stop action occurs, the final reward is given according to the termination reward function. The complete search loop is sketched below.
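The following Python sketch puts these pieces together for a single image. It reuses the names from the earlier sketches (ACTIONS, NUM_ACTIONS, make_state) and treats extract_features, crop_subregion and iou as placeholder callables standing in for the trained CNN, the five region-selection operations and the overlap computation; these names, the stop-reward magnitude and the step limit are assumptions for illustration, not the paper's code.

```python
import random
import numpy as np

MAX_STEPS = 10   # upper limit on search steps (assumed value)
TAU = 0.5        # IoU threshold for the stop action, as set in the experiments
ETA = 3.0        # magnitude of the stop reward (assumed value)


def search_object(image, initial_box, q_net, epsilon, extract_features,
                  crop_subregion, iou, ground_truth=None):
    """Hierarchical search: shrink the box until 'stop' or the step limit.

    q_net is any callable mapping a state vector to 6 action values.
    Returns the final box and the collected (s, a, r, s', done) transitions.
    """
    box, history, transitions = initial_box, [], []

    for _ in range(MAX_STEPS):
        state = make_state(extract_features(image, box), history)

        # epsilon-greedy selection over the 6 actions
        if random.random() < epsilon:
            action = random.randrange(NUM_ACTIONS)
        else:
            action = int(np.argmax(q_net(state)))

        if ACTIONS[action] == "stop":
            # terminal reward of Eq. (2); only available when a ground-truth box is given
            reward = 0.0
            if ground_truth is not None:
                reward = ETA if iou(box, ground_truth) >= TAU else -ETA
            transitions.append((state, action, reward, state, True))
            break

        new_box = crop_subregion(box, ACTIONS[action])   # one of the five sub-regions
        reward = 0.0
        if ground_truth is not None:                     # movement reward of Eq. (1)
            reward = float(np.sign(iou(new_box, ground_truth) - iou(box, ground_truth)))
        history.append(ACTIONS[action])
        next_state = make_state(extract_features(image, new_box), history)
        transitions.append((state, action, reward, next_state, False))
        box = new_box

    return box, transitions
```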
IV. EXPERIMENT

A. Data sets and parameter settings

The model is trained on the Pascal VOC data set, the most widely used data set for object detection. The training set combines Pascal VOC 2007 and Pascal VOC 2012, and the test set is the Pascal VOC 2007 test set. The model uses three fully connected layers to form the Q-network; its input is the information content of the image, and the activation values of the 6 neurons in the output layer represent the confidence of the 6 actions. The parameters of the network are initialized from a standard normal distribution. The initial value of the greedy factor ε is 1; it decreases by 0.1 every iteration and stops decreasing at 0.1. The size of the experience pool is set to 1000, the reward discount coefficient to 0.9, and the IoU threshold for the stop action to 0.5.

B. Experimental results and analysis

● Model training

During training, the value of the loss function keeps declining as the neural network iterates, so the network tends to converge (see Figure 4). When the number of training iterations reaches a certain level, the loss value becomes stable and the parameters of the network have been updated, yielding a neural network model with recognition capability.

Figure 4. Loss function during training

● Results and analysis

The model first analyzes the entire picture and then finds the object through a series of bounding-box transformation actions. Finally, the agent takes the stop action to indicate the end of the search. Figure 5 shows this hierarchical dynamic selection process in detail.

Figure 5. Hierarchical dynamic selection process

The experimental results show that the algorithm model proposed in this paper can improve both search speed and accuracy in object search. However, the experiments also show that there may still be errors between the predicted bounding box and the actual bounding box of the object, because the model can only continue selecting within the area chosen by the previous bounding box; as a result, the predicted bounding box cannot reach other areas of the image. The detection result could be improved by choosing a more appropriate proportion for the framed sub-regions.

V. CONCLUSION

This paper proposes an object detection model based on deep reinforcement learning. The model focuses on different areas of the picture by performing predefined region-selection actions and iterates this process to make the bounding box tightly surround the object, finally achieving localization and classification of the object. Experiments show that the model can effectively detect objects in images.

REFERENCES

[1] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT Press, 2018.
[2] Gagniuc P A. Markov chains: from theory to implementation and experimentation[M]. John Wiley & Sons, 2017.
[3] Hu Y, Xie X, Ma W Y, et al. Salient region detection using weighted feature maps based on the human visual attention model[C]//Pacific-Rim Conference on Multimedia. Springer, Berlin, Heidelberg, 2004: 993-1000.
[4] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.
[5] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[6] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE, 2005, 1: 886-893.
[7] Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers[C]//Proceedings of the Fifth Annual Workshop on Computational Learning Theory. 1992: 144-152.
[8] Papandreou G, Kokkinos I, Savalle P A. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 390-399.
[9] Carreira J, Sminchisescu C. CPMC: Automatic object segmentation using constrained parametric min-cuts[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 34(7): 1312-1328.
[10] Uijlings J R R, Van De Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154-171.
[11] Pont-Tuset J, Arbelaez P, Barron J T, et al. Multiscale combinatorial grouping for image segmentation and object proposal generation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(1): 128-140.
[12] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[13] Caicedo J C, Lazebnik S. Active object localization with deep reinforcement learning[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 2488-2496.
[14] Mathe S, Pirinen A, Sminchisescu C. Reinforcement learning for visual object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2894-2902.
[15] Watkins C J C H, Dayan P. Q-learning[J]. Machine Learning, 1992, 8(3-4): 279-292.
[16] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning[J]. arXiv preprint arXiv:1312.5602, 2013.