key: cord-0501189-x3t01fng
authors: Nguyen, Khoi Khac; Duong, Trung Q.; Do-Duy, Tan; Claussen, Holger; Hanzo, Lajos
title: 3D UAV Trajectory and Data Collection Optimisation via Deep Reinforcement Learning
date: 2021-06-06
journal: nan
DOI: nan
sha: 72d496f041d9b3e753386e311dc60dd40f6d328e
doc_id: 501189
cord_uid: x3t01fng

Unmanned aerial vehicles (UAVs) are now beginning to be deployed for enhancing the network performance and coverage in wireless communication. However, due to the limitation of their on-board power and flight time, it is challenging to obtain an optimal resource allocation scheme for the UAV-assisted Internet of Things (IoT). In this paper, we design a new UAV-assisted IoT system relying on the shortest flight path of the UAVs while maximising the amount of data collected from the IoT devices. Then, a deep reinforcement learning-based technique is conceived for finding the optimal trajectory and throughput in a specific coverage area. After training, the UAV has the ability to autonomously collect all the data from the user nodes at a significant total sum-rate improvement, while minimising the associated resources used. Numerical results are provided to highlight how our techniques strike a balance between the throughput attained, the trajectory and the time spent. More explicitly, we characterise the attainable performance in terms of the UAV trajectory, the expected reward and the total sum-rate.

UAV-aided wireless networks have also been used for machine-to-machine communications [32] and D2D scenarios in 5G [14], [33], but the associated resource allocation problems remain challenging in real-life applications. Several techniques have been developed for solving resource allocation problems [18], [19], [31], [34]-[36]. In [34], the authors conceived a multi-beam UAV communication scheme and a cooperative interference cancellation scheme for maximising the uplink sum-rate received from multiple UAVs by the base stations (BSs) on the ground. The UAVs were deployed as access points to serve several ground users in [35]. Then, the authors proposed successive convex programming for maximising the minimum uplink rate gleaned from all the ground users. In [31], the authors characterised the trade-offs between the ground terminal's transmission power and the specific UAV trajectory, both for a straight and for a circular trajectory. The issues of data collection, energy minimisation and path planning have been considered in [23], [32], [37]-[44]. In [38], the authors minimised the energy consumption of the data collection task considered by jointly optimising the sensor nodes' wake-up schedule and the UAV trajectory. The authors of [39] proposed an efficient algorithm for joint trajectory and power allocation optimisation in UAV-assisted networks to maximise the sum-rate during a specific length of time. A pair of near-optimal approaches was proposed: trajectory optimisation for a given UAV power allocation, and power allocation optimisation for a given trajectory. In [32], the authors introduced a communication framework for UAV-to-UAV communication under the constraints of the UAV's flight speed, location uncertainty and communication throughput. Then, a path planning algorithm was proposed for minimising the associated task completion time, while balancing the performance versus computational complexity trade-off. However, these techniques mostly operate in offline modes and may impose excessive delay on the system.
It is crucial to improve the decision-making time for meeting the stringent requirements of UAV-assisted wireless networks. Again, machine learning has been recognised as a powerful tool for solving the highly dynamic trajectory and resource allocation problems of wireless networks. In [36], the authors proposed a model based on the classic k-means algorithm for grouping the users into clusters and assigned a dedicated UAV to serve each cluster. By relying on their decision-making ability, DRL algorithms have been used for lending each node some degree of autonomy [7], [18]-[21], [28], [29]. In [28], an optimal DRL-based channel access strategy maximising the sum-rate and the α-fairness was considered. In [18], [19], we deployed DRL techniques for enhancing the energy efficiency of D2D communications. In [21], the authors characterised the DQL algorithm conceived for minimising the data packet loss of UAV-assisted power transfer and data collection systems. As a further advance, caching problems were considered in [7] for maximising the cache hit rate and minimising the transmission delay. The authors designed both a centralised and a decentralised system model and used an actor-critic algorithm to find the optimal policy.

DRL algorithms have also been applied to path planning in UAV-assisted wireless communications [9], [22]-[24], [30], [45]. In [22], the authors proposed a DRL algorithm based on the echo state network of [46] for finding the flight path, transmission power and associated cell in UAV-powered wireless networks. The so-called deterministic policy gradient algorithm of [47] was invoked for UAV-assisted cellular networks in [30]. The UAV's trajectory was designed for maximising the uplink sum-rate attained without knowledge of the user locations and the transmit power. Moreover, in [9], the authors used the DQL algorithm for the UAV's navigation based on the received signal strengths estimated by a massive MIMO scheme. In [23], Q-learning was used for controlling the movement of multiple UAVs in a pair of scenarios, namely for static user locations and for dynamic user locations obeying a random walk model. However, the aforementioned contributions have not addressed the joint trajectory and data collection optimisation of UAV-assisted networks, which is a difficult research challenge. Furthermore, these existing works mostly neglected the interference, the 3D trajectory and the dynamic environment.

In this paper, we consider a system model relying on a single UAV to serve several user nodes. The UAV is considered to be an information-collecting robot aiming to collect the maximum amount of data from the users over the shortest distance travelled. We conceive a solution based on the DRL algorithm to find the optimal path of the UAV for maximising a joint reward function based on the shortest flight distance and the uplink transmission rate. We compare our proposed approach to other existing works in Table I. Our main contributions are summarised as follows:

• The UAV system considered has stringent constraints owing to the position of the destination, the UAV's limited flight time and the communication link's constraint. The UAV's objective is to find an optimal trajectory for maximising the total network throughput, while minimising its distance travelled.

• We propose DRL techniques for solving the above problem. The area is divided into a grid to enable fast convergence.
Following its training, the UAV has the autonomy to make a decision concerning its next action at each position in the area, hence eliminating the need for human navigation. This makes UAV-aided wireless communications more reliable and practical, and optimises the resource consumption.

• Two scenarios are considered, relying either on three or on five clusters, for quantifying the efficiency of our approach in terms of the sum-rate, the trajectory and the associated time.

The rest of our paper is organised as follows. In Section II, we describe our data collection system model and the problem formulation of IoT networks relying on UAVs. Then, the mathematical background of the DRL algorithms is presented in Section III. Deep Q-learning (DQL) is employed for finding the best trajectory and for solving our data collection problem in Section IV. Furthermore, we use the dueling DQL algorithm of [48] for improving the system performance and convergence speed in Section V. Next, we characterise the efficiency of the DRL techniques in Section VI. Finally, in Section VII, we summarise our findings and discuss our future research.

Consider a system consisting of a single UAV and M groups of users, as shown in Fig. 1, where the UAV relying on a single antenna visits all clusters to cover all the users. The 3D coordinate of the UAV at time step t is defined as X^t = (x_0^t, y_0^t, H_0^t). Each cluster consists of K users, whose positions are unknown and distributed randomly within the coverage radius C. The users move according to a random walk model with a maximum velocity of v. The position of the kth user in the mth cluster at time step t is defined as X_{m,k}^t = (x_{m,k}^t, y_{m,k}^t). The UAV's objective is to find the best trajectory, while covering all the users, and to reach the dock upon completing its mission. The distance from the UAV to user k in cluster m at time step t is given by

d_{m,k}^t = \sqrt{ (x_0^t - x_{m,k}^t)^2 + (y_0^t - y_{m,k}^t)^2 + (H_0^t)^2 }.

We assume that the communication channels between the UAV and the users are dominated by line-of-sight (LoS) links; thus, the channel between the UAV and the kth user in the mth cluster at time step t follows the free-space path loss model, i.e. h_{m,k}^t = β_0 (d_{m,k}^t)^{-2}, where the channel's power gain at a reference distance of d = 1 m is denoted by β_0. The achievable throughput R_{m,k}^t from the kth user in the mth cluster to the UAV at time t, provided that the user satisfies the distance constraint, is given by the Shannon capacity of the corresponding LoS link, where B and α^2 denote the bandwidth and the noise power, respectively. Then, the total sum-rate over the T time steps from the kth user in cluster m to the UAV is R_{m,k} = \sum_{t=1}^{T} R_{m,k}^t.

Both the current location and the action taken jointly influence the reward obtained by the UAV; thus, the trial-and-error based learning task of the UAV satisfies the Markov property. We formulate the associated Markov decision process (MDP) [49] as a 4-tuple <S, A, P_{ss'}, R>, where S is the state space of the UAV, A is the action space, R is the expected reward of the UAV and P_{ss'} is the probability of transition from state s to state s', with s = s_t and s' = s_{t+1}. Through learning, the UAV can find the optimal policy π*: S → A for maximising the reward R. More particularly, we formulate the trajectory and data collection game of UAV-aided IoT networks as follows:

• Agent: The UAV acts as an agent interacting with the environment to find the maximum of the reward.

• State space: We define the state space by the position of the UAV. At time step t, the state of the UAV is defined as s_t = (x_t, y_t, H_t).
• Action space: At state s_t, the UAV can choose an action a_t from the action space A by following its policy at time step t. By dividing the area into a grid, the action space consists of the UAV's admissible moves between adjacent grid points. The UAV moves in the environment and begins collecting information when users are within its coverage. When the UAV has gathered sufficient information from the kth user in the mth cluster, i.e. R_{m,k} ≥ r_min, that user is marked as collected for this mission and may not be visited by the UAV again.

• Reward function: In the joint trajectory and data collection optimisation, we design the reward function (7) to depend on both the total sum-rate of the ground users associated with the UAV and the reward gleaned when the UAV completes one route. Here, β and ζ are positive weights that represent the trade-off between the network's sum-rate and the UAV's movement, as described in the sequel; P(m, k) ∈ {0, 1} indicates whether or not user k of cluster m is associated with the UAV; R_plus is the reward acquired when the UAV completes a mission by reaching the final destination; and the remaining term represents the average throughput of all associated users.

• Probability: We define P_{s_t s_{t+1}}(a_t, π) as the probability of transition from state s_t to state s_{t+1} by taking the action a_t under the policy π.

At each time step t, the UAV chooses the action a_t based on its local information to obtain the reward r_t under the policy π. Then the UAV moves to the next state s_{t+1} by taking the action a_t and starts collecting information from the users, if any available node in the network satisfies the distance constraint. Again, we use DRL techniques to find the optimal policy π* for the UAV that maximises the reward attained in (7). Following the policy π, the UAV forms a chain of actions (a_0, a_1, ..., a_t, ..., a_final) to reach the landing dock. Our target is to maximise the reward expected by the UAV upon completing a single mission, during which the UAV flies from the initial position over the clusters and lands at the destination. Thus, we design the trajectory reward R_plus gleaned when the UAV reaches the destination in two different ways. Firstly, the binary reward function (8) grants a fixed positive reward only if the final position of the UAV, X_final, coincides with the destination X_target, and is zero otherwise. However, under this design the UAV has to move a long distance to reach the final destination. It may also become trapped in a zone and be unable to complete the mission. These situations lead to increased energy consumption and reduced convergence. Thus, we consider the value of R_plus^t in a different form (9), calculated from the horizontal distance between the UAV and the final destination at time step t, so that the trajectory reward decays exponentially with that distance. When we design the reward function as in (9), the UAV is motivated to move ahead to reach the final destination. However, one of the disadvantages is that the UAV only moves forward. Thus, the UAV is unable to attain the best performance in terms of its total sum-rate in some environmental settings. We compare the performance of the two trajectory reward function definitions in Section VI to evaluate the pros and cons of each approach. We design the reward function by arranging for a trade-off game with the parameters β, ζ to make our approach more adaptive and flexible. By modifying the ratio β/ζ, the UAV adapts to several scenarios: a) fast deployment for emergency services, b) maximising the total sum-rate, and c) maximising the number of connections between the UAV and the users.
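To make the reward design above concrete, the following Python sketch gives one plausible instantiation of the per-step reward; it is an illustration rather than the paper's exact formulation. The transmit power P_TX, the bandwidth B, the noise power and the exponential decay constant are assumed values, and the pairing of β with the trajectory term and ζ with the throughput term is inferred from the trends reported in Section VI.

```python
import numpy as np

# Illustrative constants; only beta_0 = -50 dB is taken from Table II,
# the remaining values are assumptions for this sketch.
B = 1e6                      # bandwidth [Hz] (assumed)
NOISE = 1e-13                # noise power alpha^2 [W] (assumed)
BETA_0 = 10 ** (-50 / 10)    # reference channel power gain at d = 1 m
P_TX = 0.1                   # user transmit power [W] (assumed, not given in the text)

def rate(uav_xyz, user_xy):
    """Uplink rate of one ground user under the free-space LoS model."""
    dx, dy = uav_xyz[0] - user_xy[0], uav_xyz[1] - user_xy[1]
    d_sq = dx ** 2 + dy ** 2 + uav_xyz[2] ** 2       # squared 3D distance
    h = BETA_0 / d_sq                                # free-space path loss channel gain
    return B * np.log2(1.0 + P_TX * h / NOISE)       # assumed Shannon-rate form

def step_reward(uav_xyz, served_users, dest_xy, beta=1.0, zeta=1.0,
                binary=False, reached=False):
    """Per-step reward: zeta-weighted average throughput of the associated
    users plus a beta-weighted trajectory term R_plus, cf. (7)-(9)."""
    avg_rate = np.mean([rate(uav_xyz, u) for u in served_users]) if served_users else 0.0
    dist = np.hypot(uav_xyz[0] - dest_xy[0], uav_xyz[1] - dest_xy[1])
    if binary:
        r_plus = 1.0 if reached else 0.0             # binary trajectory reward (8)
    else:
        r_plus = np.exp(-dist / 100.0)               # exponential shaping (9); decay constant assumed
    # Assumed pairing: beta scales the trajectory term, zeta the throughput term.
    return zeta * avg_rate + beta * r_plus
```

With this pairing, increasing β/ζ pushes the UAV to complete its route more quickly at the expense of the collected throughput, which matches the behaviour discussed in the results.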
Depending on the specific problem, we can adjust the values of the trade-off parameters β, ζ to achieve the best performance. Thus, the game formulation (10) maximises the expected reward subject to the flight-time and distance constraints, where T and T_cons are the number of steps that the UAV takes in a single mission and the maximum number of the UAV's steps given its limited power, respectively. The distance constraint d_{m,k} ≤ d_cons indicates that the served (m, k)-user is within an acceptable distance of the UAV. These stringent constraints, such as the transmission distance, the position and the flight time, make the optimisation problem more challenging. Thus, we propose DRL techniques for the UAV in order to attain the optimal performance.

In this section, we introduce the fundamental concept of Q-learning, in which the so-called value function of the UAV at state s_t is defined as the expected discounted return

V(s_t) = E[ \sum_{i=0}^{\infty} γ^i r_{t+i} ],

where E[·] denotes the expectation over the samples and 0 ≤ γ ≤ 1 denotes the discount factor. The value function can be rewritten by exploiting the Markov property as

V(s_t) = E[ r_t + γ V(s_{t+1}) ].

In a finite game, there is always an optimal policy π* that satisfies the Bellman optimality equation. The action-value function Q(s_t, a_t, π) is obtained when the agent at state s_t takes action a_t and receives the reward r_t under the agent policy π. The optimal Q-value can be formulated as

Q*(s_t, a_t, π) = E[ r_t + γ max_{a'} Q*(s_{t+1}, a', π) ].   (14)

The optimal policy π* can be obtained from Q*(s, a, π) as

π*(s) = arg max_{a ∈ A} Q*(s, a, π).   (15)

From (14) and (15), the same recursion can be expressed in terms of the action a = a_{t+1} that the agent takes at state s_{t+1}. Through learning, the Q-value is updated based on the available information as

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ],   (17)

where α denotes the learning rate of the Q-value update. In RL algorithms, it is challenging to balance exploration and exploitation when selecting the action. The most common approach relies on the ε-greedy policy as the action selection mechanism, i.e.

a_t = arg max_{a ∈ A} Q(s_t, a) with probability 1 − ε, and a random action a_t ∈ A with probability ε.   (18)

Upon assuming that each episode lasts T steps, the action a_t at time step t is selected by following the ε-greedy policy of (18). The UAV at state s_t communicates with the user nodes on the ground if the distance constraint d_{m,k} ≤ d_cons is satisfied. Following the information transmission phase, the corresponding user nodes are marked as collected and may not be revisited later during that mission. Then, after obtaining the immediate reward r(s_t, a_t), the agent at state s_t takes the action a_t to move to state s_{t+1} and updates the Q-value function according to (17).

In the DQL algorithm, the neural network parameters are updated by minimising the loss function

L(θ) = E_{s,a,r,s'} [ ( y^{DQL} − Q(s, a; θ) )^2 ],

where θ denotes the parameters of the network Q and y^{DQL} = r + γ max_{a'} Q(s', a'; θ') is the target value computed with the target network parameters θ'. The corresponding gradient update is ∇_θ L(θ) = E_{s,a,r,s'} [ ( y^{DQL} − Q(s, a; θ) ) ∇_θ Q(s, a; θ) ]. The details of the DQL approach conceived for our joint trajectory and data collection trade-off game of UAV-aided IoT networks are presented in Alg. 1, where L denotes the number of episodes.

Algorithm 1 (DQL for the joint trajectory and data collection game):
1: for episode = 1, ..., L do
2:   Receive the initial observation state s_0
3:   while the mission is not completed do
4:     Obtain the action a_t of the UAV according to the ε-greedy mechanism (18)
5:     Execute the action a_t and estimate the reward r_t according to (7)
6:     Observe the next state s_{t+1}
7:     Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer B
8:     Randomly select a mini-batch of K transitions (s_k, a_k, r_k, s_{k+1}) from B
9:     Update the network parameters θ using gradient descent to minimise the loss L(θ), with the gradient ∇_θ L(θ) given above
10:    Update the state s_t = s_{t+1}
11:    Update the target network parameters after a number of iterations as θ' = θ
12:  end while
13: end for
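As a companion to Alg. 1, the following minimal Python sketch shows how the ε-greedy selection, the replay buffer and the target-network update could be realised. It is written for TensorFlow 2.x eager execution, whereas the paper reports TensorFlow 1.13.1, so it is not the authors' implementation; the network architecture, learning rate, buffer capacity and exploration rate are illustrative assumptions, and only γ = 0.9 and the mini-batch size K = 32 are taken from the parameter settings reported later in the paper.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 3, 6        # (x, y, H) state; 6 grid moves (assumed)
GAMMA, BATCH = 0.9, 32               # discount factor and mini-batch size (Table II / Section VI)
EPSILON = 0.1                        # exploration probability (assumed value)

def build_q_net():
    # Small fully connected Q-network; layer sizes are assumptions.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS),
    ])

q_net, target_net = build_q_net(), build_q_net()
target_net.set_weights(q_net.get_weights())
optimiser = tf.keras.optimizers.Adam(1e-3)
replay = deque(maxlen=10_000)        # replay buffer B

def select_action(state):
    """Epsilon-greedy action selection, cf. (18). `state` is a (STATE_DIM,) array."""
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    q_values = q_net(state[None, :].astype(np.float32))
    return int(np.argmax(q_values.numpy()))

def train_step():
    """One gradient step on L(theta) = E[(y_DQL - Q(s, a; theta))^2] (Alg. 1, step 9)."""
    if len(replay) < BATCH:
        return
    s, a, r, s2 = map(np.array, zip(*random.sample(replay, BATCH)))
    s, s2 = s.astype(np.float32), s2.astype(np.float32)
    # TD target computed with the (periodically copied) target network.
    y = (r + GAMMA * np.max(target_net(s2).numpy(), axis=1)).astype(np.float32)
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(q_net(s) * tf.one_hot(a, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimiser.apply_gradients(zip(grads, q_net.trainable_variables))
```

A periodic call to target_net.set_weights(q_net.get_weights()) implements the target-network refresh of step 11 in Alg. 1.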
Moreover, in this paper, we design the reward obtained in each step to assume one of two different forms and compare them in our simulation results. Firstly, the immediate reward (23) is calculated as the difference between the current and the previous reward of the UAV. Secondly, the total episode reward (24) is designed as the accumulation of all the immediate rewards of each step within one episode.

V. DEEP REINFORCEMENT LEARNING APPROACH FOR UAV-ASSISTED IOT NETWORKS: A DUELING DEEP Q-LEARNING APPROACH

According to Wang et al. [48], the standard Q-learning algorithm is often held back by having to estimate the values of all the state-action pairs. On the other hand, it is unnecessary to estimate the value of each action choice in a particular state. For example, in our environment setting, the UAV only has to consider moving either to the left or to the right when it hits the boundaries. Thus, we can improve the convergence speed by avoiding visiting all state-action pairs. Instead of the single-stream Q-value function of the conventional DQL algorithm, the dueling neural network of [48] is introduced for improving the convergence rate and stability. The network maintains a value stream V(s; θ_V) and an advantage stream A(s, a; θ_A), and the Q-function can be obtained by combining the two streams' outputs as

Q(s, a; θ, θ_A, θ_V) = V(s; θ_V) + A(s, a; θ_A).   (28)

Equation (28) applies to all (s, a) instances; thus, we have to replicate the scalar V(s; θ_V) |A| times to form a matrix. However, Q(s, a; θ, θ_A, θ_V) is only a parameterised estimator of the true Q-function; thus, we cannot uniquely recover the value function V and the advantage function A. Therefore, (28) results in poor practical performance when used directly. To address this problem, we can force the advantage function estimator to have zero advantage at the chosen action by combining the two streams as

Q(s, a; θ, θ_A, θ_V) = V(s; θ_V) + ( A(s, a; θ_A) − max_{a' ∈ A} A(s, a'; θ_A) ).   (29)

Intuitively, for a* = arg max_{a' ∈ A} Q(s, a'; θ, θ_A, θ_V) = arg max_{a' ∈ A} A(s, a'; θ_A), we have Q(s, a*; θ, θ_A, θ_V) = V(s; θ_V). Hence, the stream V(s; θ_V) estimates the value function, while the other stream is the advantage function estimator. We can transform (29) by using an average formulation instead of the max operator, yielding

Q(s, a; θ, θ_A, θ_V) = V(s; θ_V) + ( A(s, a; θ_A) − (1/|A|) \sum_{a' ∈ A} A(s, a'; θ_A) ).   (30)

Now, we can solve the problem of identifiability by subtracting the mean as in (30). Based on (30), we propose a dueling DQL algorithm for our joint trajectory and data collection problem in UAV-assisted IoT networks, as summarised in Alg. 2. Note that estimating V(s; θ_V) and A(s, a; θ_A) does not require any extra supervision; they are computed automatically.

Algorithm 2 (Dueling DQL for the joint trajectory and data collection game):
1: for episode = 1, ..., L do
2:   Receive the initial observation state s_0
3:   while the mission is not completed do
4:     Obtain the action a_t of the UAV according to the ε-greedy mechanism (18)
5:     Execute the action a_t and estimate the reward r_t according to (7)
6:     Observe the next state s_{t+1}
7:     Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer B
8:     Randomly select a mini-batch of K transitions (s_k, a_k, r_k, s_{k+1}) from B
9:     Estimate the Q-value function by combining the two streams as in (30)
10:    Update the network parameters using gradient descent to minimise the loss
11:    Update the state s_t = s_{t+1}
12:    Update the target network parameters after a number of iterations as θ' = θ
13:  end while
14: end for
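The mean-subtracted combination of (30) maps directly onto a two-stream network head. The following tf.keras (2.x-style) sketch is one plausible realisation under assumed layer sizes; it is not the authors' architecture.

```python
import tensorflow as tf

def build_dueling_q_network(state_dim=3, num_actions=6, hidden=128):
    """Dueling Q-network sketch: shared trunk, separate value and advantage
    streams, combined with the mean-subtraction of (30)."""
    s = tf.keras.Input(shape=(state_dim,))
    x = tf.keras.layers.Dense(hidden, activation="relu")(s)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)

    v = tf.keras.layers.Dense(1)(x)             # V(s; theta_V), one scalar per state
    a = tf.keras.layers.Dense(num_actions)(x)   # A(s, a; theta_A), one value per action

    # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
    q = tf.keras.layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([v, a])
    return tf.keras.Model(inputs=s, outputs=q)
```

Using this constructor for both the online and the target network, while keeping the training step of the earlier sketch unchanged, yields the dueling variant summarised in Alg. 2.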
In this section, we present our simulation results characterising the joint optimisation problem of UAV-assisted IoT networks. To highlight the efficiency of our proposed model and of the DRL methods, we consider a pair of scenarios: a simple one having three clusters and a more complex one with five clusters in the coverage area. We use TensorFlow 1.13.1 [51] and the Adam optimiser of [52] for training the neural networks. All the other parameters are provided in Table II; in particular, the discounting factor is γ = 0.9, the maximum number of users per cluster is 10 and the reference channel power gain is β_0 = −50 dB.

In Fig. (2), we present the trajectory obtained after training using the DQL algorithm in the 5-cluster scenario. The green circles and blue dots represent the clusters' coverage and the user nodes, respectively, while the red dots and black triangles represent the UAV's state after taking an action; the proposed paths with (8) and with (9) are shown together with the initial location and the destination zone. The UAV starts at (0, 0), visits about 40 users, and lands at the destination denoted by the black square. In a complex environment setting, it is challenging to expect the UAV to visit all users while satisfying the flight-duration and power-level constraints. For the purposes of comparison, we run the algorithm five times in five different environmental settings and take the average to draw the figures.

Firstly, we compare the reward obtained following (7). Let us consider the 3-cluster scenario and β/ζ = 2 : 1 in Fig. (3a), where the DQL and dueling DQL algorithms using the exponential function (9) reach the best performance. When using the exponential trajectory design function (9), the performance also converges faster than that of the DQL and dueling DQL methods using the binary trajectory function (8). In addition, in Fig. (3b), we compare the performance of the DQL and dueling DQL techniques using different β/ζ values. The average performance of the dueling DQL algorithm is better than that of the DQL algorithm. Likewise, the results obtained using the exponential function (9) are better than those using the binary function (8), as shown in Fig. (4), and the performance of using the episode reward (24) is better than that of using the immediate reward (23).

Fig. 4. The expected reward when using the DQL and dueling DQL algorithms in the 5-cluster scenario; (a) with the binary trajectory reward (8), (b) with the exponential trajectory reward (9).

In Fig. (4a), we compare the performance in conjunction with the binary trajectory design, while in Fig. (4b) the exponential trajectory design is considered. For β/ζ = 1 : 1, the rewards obtained by the DQL and dueling DQL are similar and stable after about 400 episodes. When using the exponential function (9), the dueling DQL algorithm reaches the best performance. Moreover, the convergence of the dueling DQL technique is faster than that of the DQL algorithm, and it maintains a better performance for all the β/ζ pair values, exhibiting better rewards. In addition, when using the exponential function (9), both proposed algorithms show better performance than the ones using the binary function (8) if β/ζ ≤ 1 : 1, but this becomes less effective when β/ζ is set higher.

We compare the performance of the DQL and of the dueling DQL algorithm using different reward function settings in Fig. (6) and in Fig. (7), respectively. The DQL algorithm reaches the best performance when using the episode reward (24) in Fig. (6a), while the fastest convergence speed can be achieved by using the exponential function (9). When β/ζ ≥ 1 : 1, the DQL algorithm relying on the episode reward (24) outperforms the ones using the immediate reward function (23) in Fig. (6b). The reward (7) using the exponential trajectory design (9) achieves a better performance than that using the binary trajectory design (8) for all the β/ζ values. Similar results are observed when using the dueling DQL algorithm in Fig. (7); the immediate reward function (23) is less effective than the episode reward function (24).
In (7), we consider two elements: the trajectory cost and the average throughput. In order to quantify the communication efficiency, we compare the total throughput in different scenarios. In Fig. (8), the performances of the DQL algorithm associated with several β/ζ values are compared. The throughput obtained for β/ζ = 1 : 1 is higher than that of the others, and when β increases, the performance degrades. However, when comparing with Fig. (3b), we realise that in some scenarios the UAV was stuck and could not find the way to the destination, which leads to an increased flight time and distance travelled. More details are shown in Fig. (8b), where we compare the expected throughput of both the DQL and dueling DQL algorithms. The best throughput is achieved when using the dueling DQL algorithm with β/ζ = 1 : 1 in conjunction with (8), which is higher than the peak of the DQL method with β/ζ = 1 : 2.

In Fig. (9), we compare the throughput of the different techniques in the 5-cluster scenario. Let us first consider the binary trajectory design function (8) in Fig. (9a), where the DQL algorithm achieves the best performance using β/ζ = 1 : 1 and β/ζ = 2 : 1. There is only a slight difference between the DQL methods having different settings when using the exponential trajectory design function (9), as shown in Fig. (9b).

In Fig. (10) and Fig. (11), we compare the throughput of different β/ζ pairs. The DQL algorithm reaches the optimal throughput with the aid of trial-and-error based learning; hence, it is important to carefully design the reward function to avoid excessive offline training. As shown in Fig. (6), the throughput is degraded when β/ζ increases. In contrast, with higher β values, the UAV can finish the mission faster. It is a trade-off game in which we can choose an appropriate β/ζ value for our specific purposes.

Fig. 10. The obtained throughput when using the DQL and dueling DQL algorithms in the 5-cluster scenario; the legend compares combinations of the trajectory designs (8), (9) with the reward forms (23), (24).

The performance of the DQL and dueling DQL algorithms employing the episode reward (24) is shown in Fig. (11b). In Fig. (11a), the dueling DQL method outperforms the DQL algorithm for almost all β/ζ values with both reward functions (23) and (24). When we use the episode reward (24), the obtained throughput is stable across different β/ζ values. The throughput attained by using the exponential function (9) is higher than that of using the binary trajectory design (8), and that of using the episode reward (24) is higher than that of using the immediate reward (23).

Fig. 11. The expected throughput when using the DQL and dueling DQL algorithms with 5 clusters; panel (b) corresponds to the episode reward (24).

We can achieve the best performance when using the dueling DQL algorithm with (9) and (24). However, in some scenarios we can achieve a better performance with a different algorithmic setting, as we can see in Fig. (8b) and Fig. (10a). Thus, there is a trade-off governing the choice of the algorithm. In Fig. (12), we compare the performance of our DQL technique using different discount factors γ and exploration values ε in our ε-greedy method. The DQL algorithm achieves the best performance with a discounting factor of γ = 0.9 and ε = 0.9 in the 5-cluster scenario of Fig. (12). Balancing exploration and exploitation, as well as the action chosen, is quite challenging for maintaining a steady performance of the DQL algorithm. Based on the results of Fig. (12), we opted for γ = 0.9 and ε = 0.9 for our algorithmic setting. Next, we compare the expected reward for different mini-batch sizes K.
In the 5-cluster scenario of Fig. (13), the DQL algorithm achieves the optimal performance with a batch size of K = 32. There is only a slight difference in terms of convergence speed, with the batch size of K = 32 being the fastest. Overall, we set the mini-batch size to K = 32 for our DQL algorithm.

In this paper, a DRL technique has been proposed for jointly optimising the flight trajectory and the data collection performance of UAV-assisted IoT networks. The optimisation game has been formulated to balance the flight time and the total throughput, while guaranteeing the quality-of-service constraints. Bearing in mind the limited UAV power level and the associated communication constraints, we proposed a DRL technique for maximising the throughput while the UAV has to move along the shortest path to reach the destination. Both the DQL and the dueling DQL techniques, which have a low computational complexity, have been conceived. Our simulation results showed the efficiency of our techniques both in simple and in complex environmental settings.
REFERENCES
[1] Drone trial to help Isle of Wight receive medical supplies faster during COVID19 pandemic
[2] This Chilean community is using drones to deliver medicine to the elderly
[3] High-resolution mapping based on an unmanned aerial vehicle (UAV) to capture paleoseismic offsets along the Altyn-Tagh fault, China
[4] Charging unplugged: Will distributed laser charging for mobile wireless power transfer work?
[5] Distributed algorithms for robust self-deployment and load balancing in autonomous wireless access networks
[6] Flight time minimization of UAV for data collection over wireless sensor networks
[7] Deep reinforcement learning-based edge caching in wireless networks
[8] Cell-edge user offloading via flying UAV in non-uniform heterogeneous cellular networks
[9] Deep reinforcement learning for UAV navigation through massive MIMO technique
[10] Learning-aided realtime performance optimisation of cognitive UAV-assisted disaster communication
[11] Practical optimisation of path planning and completion time of data collection for UAV-enabled disaster communications
[12] Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs
[13] An introduction of real-time embedded optimisation programming for UAV systems under disaster communication
[14] Real-time optimal resource allocation for embedded UAV communication systems
[15] A near-optimal UAV-aided radio coverage strategy for dense urban areas
[16] Federated learning assisted multi-UAV networks
[17] Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach
[18] Distributed deep deterministic policy gradient for power allocation control in D2D-based V2V communications
[19] Non-cooperative energy efficient power allocation game in D2D communication: A multi-agent deep reinforcement learning approach
[20] Real-time energy harvesting aided scheduling in UAV-assisted D2D networks relying on deep reinforcement learning
[21] On-board deep Q-network for UAV-assisted online power transfer and data collection
[22] Interference management for cellular-connected UAVs: A deep reinforcement learning approach
[23] Reinforcement learning in multiple-UAV networks: Deployment and movement design
[24] Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach
[25] Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates
[26] Reinforcement mechanism design for fraudulent behaviour in ecommerce
[27] Playing Atari with deep reinforcement learning
[28] Deep-reinforcement learning multiple access for heterogeneous wireless networks
[29] Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks
[30] Intelligent trajectory design in UAV-aided communications with reinforcement learning
[31] Energy tradeoff in ground-to-UAV communication via trajectory design
[32] Completion time minimization with path planning for fixed-wing UAV communications
[33] Joint D2D assignment, bandwidth and power allocation in cognitive UAV-enabled networks
[34] Multi-beam UAV communication in cellular uplink: Cooperative interference cancellation and sum-rate maximization
[35] Throughput maximization for UAV-enabled wireless powered communication networks
[36] Real-time deployment and resource allocation for distributed UAV systems in disaster relief
[37] Joint trajectory and communication design for multi-UAV enabled wireless networks
[38] Energy-efficient data collection in UAV enabled wireless sensor network
[39] Unmanned aerial vehicle-aided communications: Joint transmit power and trajectory optimization
[40] Energy-efficient data collection and device positioning in UAV-assisted IoT
[41] Joint optimization on trajectory, altitude, velocity, and link scheduling for minimum mission time in UAV-aided data collection
[42] UAV trajectory planning for data collection from time-constrained IoT devices
[43] 3D UAV trajectory and communication design for simultaneous uplink and downlink transmission
[44] Aerial-ground cost tradeoff for multi-UAV-enabled data collection in wireless sensor networks
[45] Age of information aware trajectory planning of UAVs in intelligent transportation systems: A deep learning approach
[46] The "echo state" approach to analysing and training recurrent neural networks - with an erratum note
[47] Continuous control with deep reinforcement learning
[48] Dueling network architectures for deep reinforcement learning
[49] Markov Decision Processes: Discrete Stochastic Dynamic Programming
[50] Dynamic programming and optimal control
[51] TensorFlow: A system for large-scale machine learning
[52] Adam: A method for stochastic optimization