A Markov Decision Process Approach for Managing Medical Drone Deliveries

Amin Asadi, Sarah Nurre Pinkley, Martijn Mes

2021-06-08

We consider the problem of optimizing the distribution operations at a drone hub that dispatches drones to different geographic locations generating stochastic demands for medical supplies. Drone delivery is an innovative method that introduces many benefits, such as low-contact delivery, thereby reducing the spread of pandemic and vaccine-preventable diseases. While we focus on medical supply delivery in this work, drone delivery is suitable for many other items, including food, postal parcels, and e-commerce. In this paper, our goal is to address drone delivery challenges related to the stochastic demands of different geographic locations. We consider different classes of demand related to geographic locations that require different flight ranges, which is directly related to the amount of charge held in a drone battery. We classify the stochastic demands based on their distance from the drone hub, use a Markov decision process to model the problem, and perform computational tests using realistic data representing a prominent drone delivery company. We solve the problem using a reinforcement learning method and show its high performance compared with the exact solution found using dynamic programming. Finally, we analyze the results and provide insights for managing the drone hub operations.

During the last decade, there has been substantial growth in the use of drones for various applications, including but not limited to transportation, agriculture, and delivery [1, 2, 3, 4]. Specifically, delivery using drones has received extensive attention as it can reduce air pollution and traffic in congested areas [5]. Moreover, drones are a viable option for reaching remote locations with inadequate road infrastructure [6]. During pandemics, drones provide a safe, low-contact delivery method, which can effectively slow the spread of disease [7, 8]. Many companies and organizations, including Vanuatu's Ministry of Health and Civil Aviation [9], Zipline [10], Matternet [11], and Manna Aero [12], use drones to deliver and distribute medical supplies such as vaccines, medicine, and blood units.

Effective operation of a fleet of drones requires addressing battery-oriented issues, including the limited flight range, time-consuming charging operations, and the high price and short lifetime of batteries. In this research, we provide a model that effectively uses the charge inside the batteries to maximize the amount of stochastic demand met. Specifically, we classify the stochastic demands according to the drones' flight range, which depends on the charge inside the drone batteries. We ultimately maximize the expected total met demand for delivering medical items using drones dispatched from a battery swap station located at a drone hub. Battery swap stations are a solution to alleviate the aforementioned issues. In battery swap stations, charged batteries are swapped with empty batteries in a matter of minutes. For instance, Matternet provides a station to automatically swap drone batteries used for delivering blood units and medicine between the supplier and hospitals [13].
Besides the quick swapping operation, recharging batteries in anticipation of demand can reduce overcharging and fast charging of batteries, which are shown to accelerate the battery degradation process [14, 15] . The faster batteries degrade, the quicker the need for battery replacement, leading to a higher cost and environmental waste. The application of a battery swap station is not limited to drones and can be extended to electric vehicles [16, 17] , electric scooters [18] and cell phone battery swaps in airports, hotels, and amusement parks [19] . Notably, the number of electric vehicle swap stations is growing in different regions of the world [20, 21, 22, 23] . In this research, we consider a swap station located at a drone hub that dispatches drones to satisfy multiple classes of stochastic demand for medical supplies, which are classified based on their distance from the station. Given the growth in the number and applications of battery swap stations, it is crucial to optimally manage the stations' operations to reach the highest performance of the station. Thus, we design a decision-making framework to provide optimal recharging and distribution policies when considering the stochastic demand originating from different geographical locations. The drone hub can send drones to locations within their flight range, which differ based on the level of charge inside their batteries. It is a complicated setting, given that the level of charge inside batteries can be any number between 0 and 1, and different combinations of charge levels can be used to satisfy the stochastic demands generated from places located at different distances from the drone hub. Hence, using aggregation and discretization, we classify the demand based on the distance between the hub and demand locations. To the best of our knowledge, we are the first to use such classification for this problem and link a level of charge inside batteries with a class of demand such that the demand of each class can be satisfied with batteries having the same or higher level of charge. That is, each class of demand can be satisfied with one or multiple levels of charge inside batteries. We formulate this problem as a stochastic scheduling and allocation problem with multiple classes of demand (SA-MCD). We use the Markov decision process (MDP) to model the stochastic SA-MCD. It is an appropriate modeling approach for problems like ours that are in the class of sequential decision-making under uncertainty problems [24] . The decisions are made in discrete points in time or decision epochs. We represent the state of the system in the MDP as the number of batteries within each charge level class. The actions of the MDP are the number of batteries recharging from one level to an upper level of battery charge. The transition probability is a complex function governed by multiple classes of stochastic demand. In our MDP, the optimal policy determines the maximum expected total reward, which is a function of total weighted met demand of different classes. We use backward induction (BI) and a reinforcement learning method with a descending -greedy exploration feature (D RL) to solve SA-MCDs. The exploration feature allows the algorithm to take random actions to assist in escaping short-term or local optima by visiting apparently non-attractive states. 
The descending exploration means that as the algorithm proceeds in iterations, the probability of taking an arbitrary action decreases, and the algorithm perform more exploitation (taking favorable actions) by visiting attractive states. BI can provide exact solutions for problems like SA-MCDs that have finite state and action spaces [24] . However, BI runs into the curses of dimensionality and faces computational time and memory issues as our problem size increases. Thus, we apply an RL method with an exploration feature that is able to find high-quality approximate solutions for large-scale SA-MCDs, which are not solvable using BI [25] . We show the convergence of our RL method and demonstrate its capacity to save computational time and memory. We computationally test the SA-MCD model and solution methods on a case study related to the Zipline drone delivery company, which delivers blood units, vaccines, and medical supplies in Rwanda. We consider the drone delivery of these supplies from its station located in the Muhanga district, Rwanda, to the reachable hospitals throughout the country. We consider the population of districts, number of hospitals in each district, number of people using a hospital, and rate of arrivals to each hospital to find the stochastic orders for medical supplies. Then, we convert the orders to the demand for drone missions, given that each drone can carry 2kg of medical products at a time [26] . For our computational experiments, we classify this demand into two classes based on the distance between the station and each hospital (i.e., closer hospitals are classified in level 1 demand, and farther hospitals are classified in level 2 demand). We import the real data associated with the distance between locations, the population of districts, flight regulations in Rwanda, and the Zipline drone configuration, including the speed, flight range, and recharging time. We derive insights from solving SA-MCDs to manage the distribution operations of the swap station using different sets of computational experiments. We provide the optimality gap and average percentage of the met demand using the D RL method for a modest problem (15-21 drones) . Solving the modest size problem shows that the Zipline company needs more drones to satisfy 100% of the stochastic demand. Hence, we draw the relationship between the number of drones in the station and the amount of met demand using our RL solution for larger instances of SA-MCD. We also analyze the interplay between the different demand classes and the use of higher-level charged batteries to satisfy lower-class demand. Main Contributions. We summarize the main contribution of this paper as follows. • We propose stochastic scheduling and allocation problems with multiple classes of demand (SA-MCD) for managing operations of a drone swap station located at drone a hub. We classify the demand based on the distance between the station and hospitals generating the stochastic demands. • We develop an MDP model for the SA-MCD and solve small instances of SA-MCD using backward induction (BI) and show inability of BI to solve large-scale SA-MCDs required for the real application. • We propose applying a descending -greedy reinforcement learning method (D RL) to find optimal and near-optimal policies for the station that faces stochastic demand for sending drones to deliver medical supplies. We show high performance of D RL for solving SA-MCDs. 
• We leverage our model and our approximate solution method to be used for a real application, which is the case study related to Zipline medical supply delivery using drones in Rwanda. • We conduct different sets of experiments to derive insights for managing the operations in a swap station to maximize demand satisfaction. We show that demand classification approach in modeling improves demand satisfaction, which is the primary purpose of delivering medical supplies using drones. The remainder of this paper is organized as follows. In Section 2, we discuss relevant literature related to the modeling, application, and solution methods for the stochastic SA-MCD. In Section 3, we present our Markov Decision Process to model the stochastic SA-MCD. In Section 4, we discuss the exact and approximate solution methods. In Section 5, we outline the computational experiments conducted and provide insights for managing swap station operations. We end with concluding remarks and propose directions for future work in Section 6. There is a growing interest in the use of drones for various applications. We provide an overview of scientific works, which are more relevant to the model, application, and solution methods presented in this paper. Therefore, we focus on providing an overview of research related to managing operations in swap stations, delivering medical items using drones, Markov Decision Process (MDP) modeling for dynamic problems, demand classification, and reinforcement learning (RL) methods. Many researchers have studied managing swap station operations. Asadi and Nurre Pinkley [27] present an MDP model to find the optimal/near-optimal policies (number of recharging/discharging and replacement actions) to maximize the expected total profit for the station facing stochastic demands and battery degradation. They solve this problem using a heuristic, RL methods, and a monotone approximate dynamic programming algorithm [28] to provide insights for managing the internal operations in the swap stations. Widrick et al. [29] propose an MDP model for the same problem when no battery degradation is considered. Nurre et al. [30] do not consider stochasticity and provide a deterministic model to find the optimal policies for managing swap stations. Our work is fundamentally different from the discussed papers as they do not consider different demand classes and multiple states of charge of batteries. Besides, our objective is satisfying the amount of met demand, which suits the healthcare delivery application, that differs from the aforementioned papers. Kwizera and Nurre [31] propose a two-level integrated inventory model to manage internal operations in a drone swap station delivering to multiple customers (or classes, equivalently) but exclude the uncertainty in the system. To the best of our knowledge, we are the first to introduce the stochastic SA-MCDs for managing internal operations in a swap station facing stochastic demands from different geographical locations. In recent years, there has been a rapid growth in using drones for many innovative applications [32] . Several papers [33, 34, 35, 36] review the applications of drones in different contexts, and we refer the reader to Otto et al. [35] for an extensive review on the optimization approaches for civil applications of drones. 
Delivering portable medical items such as blood units and vaccines using drones can positively impact the levels of medical service in remote or congested places where roads are not a viable option for transportation and delivery [35, 5] . Several companies are using drones to deliver medical supplies in different parts of the world [9, 10, 11, 12] . Notably, we focus on a case study related to Zipline, a drone delivery company that started with 15 drones delivering medical items to remote locations in Rwanda in 2016 [37] . After successful operations in Rwanda, Zipline expanded its medical delivery service in the south of Ghana using 30 drones in 2019 [38] . During the COVID-19 pandemic, drone delivery provides a fast, cheap, and reliable method to distribute COVID-19 vaccines. For instance, in Ghana, Zipline already delivered 11000 doses and will deliver more than 2.5 million doses in 2021 [39] . Draganfly, a Canada-based company, will use drones to distribute COVID-19 vaccines to remote areas of Texas starting in Summer 2021 [40] . A drone can store a limited amount of energy that restricts its flight range. This limitation needs to be considered when modeling real-world problems. Common modeling approaches are to use the maximal operation time [41, 42] and maximal flying distance [43, 44] . In this paper, we use the maximal coverage of 80km radius (160km round-trip) [45] from our swap station, which is located at the Muhanga Zipline drone hub, for geographically-based demand classification in our case study. Our SA-MCD problem is in the class of sequential decision-making under uncertainty, and we use a Markov Decision Process (MDP) model, which is appropriate for this class of problems [24] . There is extensive research on the use of MDPs for stochastic problems in the operations research community. A sample of problems and applications that are close to our research include drone applications [46, 47, 48] , dynamic inventory and allocation [49, 50] , and optimal timing of decisions [51, 52, 53, 54] . In SA-MCD, we use demand classification that researchers broadly utilize to study scheduling, allocation, supply chain management, and inventory control problems. We discuss a sample of scientific works that used such a classification in combination with a MDP modeling approach. Gayon et al. [55] provide optimal production policies for a supplier facing multiple classes of demands that are different in the demand rates, expected due dates, cancellation probabilities, and shortage costs. Benjaafar et al. [56] formulate an MDP model to derive optimal production policies for an assembly system wherein the demands are classified based on the difference between shortage penalties incurred due to the lack of inventory to satisfy orders. Thompson et al. [57] categorize patients served by a hospital according to the floors treating patients and the lengths of stay in hospitals. Milnar and Chevalier [58] use an infinite horizon MDP to model an admission control problem to maximize the expected total profit of a firm serving two classes of customers. The customers are classified based on profit margins, order sizes, and lead time. We use a geographically-based demand classification influenced by different flight ranges of a drone based on the level of charge inside its batteries. An instance of considering travel range for demand classification can be found in [59] wherein it is assumed that different types of EVs have different drive ranges. 
They incorporate this information to provide a framework that estimates the charging demands for charging stations and determine the service capacity of the stations without optimizing the system. As they do not consider optimization, this work is significantly different from our problem that considers optimization under uncertainty. When an MDP model is large and complex it suffers from the curses of dimensionality [25, 60] , and thus, is traditionally solved using approximate solution methods. Researchers apply Reinforcement Learning (RL) (or approximate dynamic programming (ADP), the term more used in the operations research community [25] ) to find approximate solutions that are not solvable using standard exact solution methods, (e.g., dynamic programming [24] ). Examples of various ADP/RL methods in dynamic allocation problems are temporal difference learning [61, 62, 63] , case-based myopic RL [64] , Q-Learning [65] , value function approximation [66, 67, 68] , linear function approximations [69] , and policy iteration [70] . In this paper, we apply a value function approximation using a look-up table (e.g., [64, 71] ) with an ε-greedy exploration feature [25, 72] to make our RL method visit and update the value of more (both attractive and unattractive) states in the state space. We reduce the exploration rate (increase the exploitation rate) to make the algorithm converge as it proceeds toward iterations. In Section 4.2, we present a comprehensive explanation of our RL method. In this section, we present our Markov Decision Process (MDP) approach to model the stochastic scheduling and allocation problem with multiple classes of demand (SA-MCD). We proceed by formally describing the classes of demand and the components of our MDP model. For our problem, we consider a set of medical facilities, each with an unknown number of requests (i.e., demand) for drone delivery over time. We know how long a drone needs to fly from the drone hub (located in the swap station) and back to satisfy a request for each medical facility. We cluster medical facilities with similar flight times into demand classes. The demand for each medical facility is then aggregated by demand class. The uncertainty in our MDP is the number of requests (i.e., demand) for each demand class by time. We assume that there is a known probability distribution that governs the uncertainty for each demand class over time. We depict an example of geographically-based demand classification in Figure 1 . We link each demand class with the required amount of battery charge that is necessary to make the round-trip flight from the drone hub to the medical facility and back. In other words, higher demand classes that are farther from the drone hub require more charge than those closer to the hub. Charging all batteries to full charge ensures that each drone+battery pair can satisfy a request from any demand class. However, in reality, this strategy results in a higher total cost, longer recharge times, and faster battery degradation. Thus, we make decisions about how many batteries are recharged to different charge levels over time. Our system incorporates time-varying elements, including the mean demand per class over time. We model the system using a finite horizon MDP. 
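To make the classification step concrete, the following Python sketch (ours) clusters facilities into demand classes by their one-way distance from the hub and aggregates the mean demand per class; the hub coordinate, the facility records, and the 40 km / 80 km class boundaries are illustrative assumptions that only loosely mirror the case study in Section 5.

```python
import math

# Hypothetical inputs: the hub coordinate is only a rough stand-in for the Muhanga
# hub, and the facility records below are illustrative, not the paper's data.
HUB = (-2.08, 29.75)
FACILITIES = [
    {"name": "hospital_A", "coord": (-1.95, 30.05), "mean_demand_per_epoch": 1.4},
    {"name": "hospital_B", "coord": (-2.60, 29.74), "mean_demand_per_epoch": 0.6},
    {"name": "hospital_C", "coord": (-1.30, 30.50), "mean_demand_per_epoch": 0.9},
]

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def classify(distance_km, thresholds=(40.0, 80.0)):
    """Map a one-way distance to a demand class (1-based); None if unreachable."""
    for cls, upper in enumerate(thresholds, start=1):
        if distance_km < upper or (cls == len(thresholds) and distance_km <= upper):
            return cls
    return None

def aggregate_by_class(hub, facilities, n_classes=2):
    """Aggregate the mean demand of the facilities that fall into each class."""
    lam = {c: 0.0 for c in range(1, n_classes + 1)}
    for rec in facilities:
        cls = classify(haversine_km(hub, rec["coord"]))
        if cls is not None:            # facilities out of flight range are dropped
            lam[cls] += rec["mean_demand_per_epoch"]
    return lam

print(aggregate_by_class(HUB, FACILITIES))   # e.g. {1: ..., 2: ...}
```

Facilities beyond the largest threshold receive no class, matching the treatment of hospitals outside the flight range in the case study, and the aggregated values play the role of the per-class mean demands used as input to the model.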
In addition to the relevance of the finite horizon MDP for highly variable and dynamic systems like ours, the finite horizon model is applicable for the station to check the quality (capacity) of batteries at the end of the time horizon and decide whether to replace batteries if needed. We seek to maximize the expected total reward, which equals the summation of the weighted met demand over all demand classes and time periods. We note that the main purpose of our swap station is demand satisfaction; thus, we do not directly incorporate different charging costs and times into our model. However, in the objective function, we use multipliers for the amount of demand of each class that is satisfied using batteries of higher charge levels. In this design, higher charge level batteries are capable of satisfying demand for lower classes, but this can be penalized as it causes unnecessary charging costs. Further, because the mission of our drones is delivering medical items, which are often vital at the moment of the order, we do not consider backlogging or rolling over unmet demands. To maximize the expected total weighted met demand, we seek optimal policies indicating how many batteries are charged to each charge level by state and time. We proceed by detailing the specific components of our MDP.

Decision Epochs: Discrete times within the finite time horizon N < ∞ at which we make decisions. The set of decision epochs is T = {1, 2, . . . , N − 1}.

States: The state of the system, s_t, is dynamic and defined in the C-dimensional state space S. Thus, S = S^1 × S^2 × · · · × S^C, where S^i is the state space for battery charge level i for i = 1, . . . , C. Each S^i = {0, . . . , M}, where M represents the total number of batteries. For each dimension, s^i_t ∈ S^i equals the number of batteries with level i charge. The total number of batteries over all charge levels must not exceed M in accordance with Equation (1), that is, Σ_{i=1}^{C} s^i_t ≤ M. All batteries with a charge level lower than the lowest charge level (i.e., level 1) are implicitly denoted as being level 0, which does not need to be stored but instead can be calculated as s^0_t = M − Σ_{i=1}^{C} s^i_t. The C-dimensional state space corresponds to the C demand classes. As previously mentioned, each demand class represents a set of medical facilities with similar round-trip delivery flight times. We link the state space with these demand classes by stating that a drone powered with charge level i is able to satisfy requests from demand class i and lower. In other words, a request from demand class i is able to be satisfied by a drone powered with charge level i or higher. We make the assumption that a demand request from class i is satisfied using a battery from the lowest, capable, available charge level. For example, imagine we have a request from demand class 2. Any battery with charge level 2, . . . , C is capable of satisfying this request. If a battery is available with charge level 2, this battery is used. However, if no batteries are available with charge level 2, then we look to assign a battery with charge level 3, and continue increasing the charge level until a battery is available. If none are available, we designate this demand as unmet. With this assumption, we maintain higher charge level batteries in inventory.

Actions: We use a_t to denote the recharging action at time t, a vector of size (C + 1)C/2 with one entry for each pair of charge levels 0 ≤ i < k ≤ C, such that a^{ik}_t represents the number of batteries starting at charge level i that are recharged to level k, for k > i, at time t. To ensure that the number of batteries selected to be recharged from each charge level does not exceed the number of batteries at that level, we force the actions to satisfy Equations (5) and (6), that is, Σ_{k > i} a^{ik}_t ≤ s^i_t for each level i = 0, 1, . . . , C − 1.
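For the two-class case used throughout the paper, the state and action vectors and the feasibility conditions above can be represented as in the following sketch. This is our own illustration: M = 15 is chosen only because it is the fleet size used later in the experiments, and the enumeration of feasible actions is one possible implementation choice.

```python
from dataclasses import dataclass

M = 15  # total number of batteries/drones (illustrative; the fleet size varies later)

@dataclass(frozen=True)
class State:
    s1: int  # batteries charged to level 1
    s2: int  # batteries charged to level 2

    @property
    def s0(self) -> int:
        # Level 0 (empty) batteries are implicit: whatever is not at level 1 or 2.
        return M - self.s1 - self.s2

@dataclass(frozen=True)
class Action:
    a01: int  # recharge from level 0 to level 1
    a02: int  # recharge from level 0 to level 2
    a12: int  # recharge from level 1 to level 2

def is_feasible(s: State, a: Action) -> bool:
    """Recharging out of a level cannot exceed the batteries currently at that level."""
    return (min(a.a01, a.a02, a.a12) >= 0
            and a.a01 + a.a02 <= s.s0
            and a.a12 <= s.s1)

def feasible_actions(s: State):
    """Enumerate every feasible recharging action in state s."""
    for a01 in range(s.s0 + 1):
        for a02 in range(s.s0 - a01 + 1):
            for a12 in range(s.s1 + 1):
                yield Action(a01, a02, a12)

# With all 15 batteries fully charged, only the do-nothing action remains feasible.
print(sum(1 for _ in feasible_actions(State(s1=0, s2=15))))   # -> 1
```

Keeping level 0 implicit keeps the state two-dimensional, and the nested enumeration makes the roughly O(M^3) size of the action space visible.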
In Figure 2, we display the transitions between different states due to different recharging actions or demand satisfaction for a single battery. Recharging increases the level of charge of a battery depending on the level of recharging, and demand satisfaction decreases it depending on the class of met demand.

Transition Probabilities: The system transitions from state s_t to a future state s_{t+1} according to the selected action and the realized demand within each demand class. In our system, the demand at time t, denoted D_t, is a vector of size C, i.e., D_t = (D^1_t, . . . , D^C_t), where each D^i_t, for i = 1, . . . , C, is a random variable representing the number of requests for demand class i. As we explain later, the state transitions and the transition probability function are complex, but we illustrate these functions of our MDP for C = 2. However, we note that the model can be applied to a problem with C > 2.

As our state transition is complex, we first define an intermediate state of the system. We define the intermediate state of the system as L = (L^1, L^2) and allow this to represent the number of batteries currently charged at levels 1 and 2 after all actions are taken and batteries are allocated to satisfy demand within their class (i.e., batteries with charge levels 1 and 2 are used to satisfy the demand of class 1 and class 2, respectively). Therefore, the intermediate states should not be mistaken for post-decision states, which show the system's state immediately after making decisions before realizing the uncertainty. We note that this intermediate transition does not incorporate the batteries from charge level 2 used to satisfy remaining demand from class 1. The transitions to the intermediate states are governed by Equations (7) and (8). The future state is then obtained from the intermediate state through Equations (9) and (10), which move the min(U^1, U^2) level 2 charged batteries dispatched for class 1 (and returning with level 1 charge) from the level 2 count to the level 1 count; here U^1 denotes the unmet demand of class 1 after the level 1 charged batteries have been used, and U^2 denotes the number of leftover level 2 charged batteries that can be used to satisfy unmet demand of class 1. For example, suppose that Equations (7) and (8) yield the intermediate state (L^1, L^2) = (0, 7) and that min(U^1, U^2) = 4. Then, 4 level 2 charged batteries are used to satisfy the remaining demand of class 1, and these 4 batteries will return to the station with level 1 charge. Hence, the future state is s^1_{t+1} = L^1 + 4 = 4 and s^2_{t+1} = L^2 − 4 = 3.
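The two-stage allocation just illustrated can be written as a short simulation routine. The sketch below is ours: it takes as given the numbers of level 1 and level 2 charged batteries that are available for dispatch in the period (how these follow from the state and the recharging action is fixed by Equations (7) and (8), which we leave abstract here), treats all leftover level 2 batteries as usable for class 1, and reproduces the intermediate and future states of the example above for one hypothetical choice of inputs.

```python
def allocate_and_transition(avail1, avail2, d1, d2):
    """Two-stage demand satisfaction for C = 2.

    avail1, avail2 : level 1 / level 2 charged batteries available for dispatch
    d1, d2         : realized demand of class 1 / class 2 in the period
    Returns (met1_lvl1, met1_lvl2, met2, L, s_next), where L is the intermediate
    state and s_next is the future state (s1_{t+1}, s2_{t+1}).
    """
    # Stage 1: each class is served by batteries of the matching charge level.
    met1_lvl1 = min(avail1, d1)
    met2 = min(avail2, d2)
    L1, L2 = avail1 - met1_lvl1, avail2 - met2     # intermediate state (L1, L2)

    # Stage 2: leftover level 2 batteries serve the unmet class 1 demand and
    # return to the station with a level 1 charge.
    unmet1 = d1 - met1_lvl1                        # U1 in the text
    met1_lvl2 = min(unmet1, L2)                    # min(U1, U2) batteries are reused
    s_next = (L1 + met1_lvl2, L2 - met1_lvl2)
    return met1_lvl1, met1_lvl2, met2, (L1, L2), s_next

# Hypothetical inputs chosen to reproduce the worked example: intermediate state
# (0, 7), four level 2 batteries reused for class 1, future state (4, 3).
print(allocate_and_transition(avail1=3, avail2=9, d1=7, d2=2))
# -> (3, 4, 2, (0, 7), (4, 3))
```

A routine like this is also all that sampling-based methods need, since they work with simulated transitions rather than the explicit transition probability function of Equation (11).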
Now, we present the transition probability function from state s_t to state s_{t+1} = j = (j^1, j^2) using Equation (11). In all of the cases in Equation (11), the intermediate state transitions are calculated using Equations (7) and (8). The first case in Equation (11) calculates the transition probabilities when the stochastic demand of classes 1 and 2 is less than the number of level 1 and level 2 charged batteries, respectively. The future state equals the intermediate state for each charge level. In the second case, the demand of class 2 is greater than or equal to the number of batteries with level 2 charge; hence, the number of batteries with level 2 charge at time t + 1 equals the number of recently charged batteries from empty or level 1 charge. The future state of the system equals the intermediate state of the system. The third case is similar to the second case, except that in the second case the stochastic demand of class 1 is less than the number of level 1 charged batteries, whereas in the third case the demand is greater than or equal to the number of level 1 charged batteries. The fourth case describes the condition that all of the demand of class 1 can be satisfied using all the available level 1 charged batteries plus the leftover batteries of level 2 after satisfying the demand of class 2. The future state of level 2 batteries will be no more than the intermediate state for level 2. The amount of satisfied demand in stage 2 (satisfying demand of class 1 using remaining batteries of class 2) equals the difference between the intermediate and future state of level 2 charged batteries. The fifth case is similar to the fourth case, except that not all of the demand of class 1 can be satisfied in the first and second stages of the demand satisfaction process.

We note that if the number of classes increases to C = 3, then there are 13 different cases to be considered. Further, we need to consider two intermediate states to capture demand satisfaction with the same level of charge and with one and two levels of charge higher. For the application of our problem, two demand classes are adequate to observe a significant improvement in the quality of solutions (see Section 5) without making the model excessively complicated. If the application requires having more than three classes, we suggest using alternative strategies such as non-Markovian modeling and/or only applying ADP/RL/Q-learning methods, which do not require an explicit transition probability function. Despite the apparent complexity of our transition probability function, we would like to point out the benefit of our MDP model. Incorporating the assumptions for the order of demand satisfaction and the timing of events in our model enables us to capture the system's dynamics without including past state information. As a result, our model holds the Markov property, allowing us to benefit from the MDP properties and guaranteed solution methods.

Reward: We calculate the immediate reward of taking action a_t when the transition from state s_t to state s_{t+1} = j occurs using Equation (12),

r_t(s_t, a_t, j) = ρ_{11}(s^1_t + a^{01}_t − a^{12}_t − L^1) + ρ_{21}(L^2 − j^2) + ρ_{22}(s^2_t + a^{02}_t + a^{12}_t − L^2),   for t = 1, . . . , N − 1.   (12)

We use ρ_{ij} to put weight on the amount of met demand of class j using level i charged batteries. In the first term, (s^1_t + a^{01}_t − a^{12}_t − L^1) denotes the number of level 1 charged drones used to satisfy class 1 demand. In the second term, (L^2 − j^2) equals the number of level 2 charged batteries used to satisfy class 1 demand. In the third term, (s^2_t + a^{02}_t + a^{12}_t − L^2) determines the number of level 2 charged batteries used to satisfy class 2 demand. We note that our objective is to maximize the expected total satisfied demand, which does not directly incorporate cost. However, by adjusting the weights ρ_{ij}, we implicitly include a cost factor by assigning a penalty/reward to demand satisfaction with an excessive level of charge. For instance, when ρ_{21} = ρ_{22} = ρ_{11} = 1, there is no benefit in recharging batteries to level 1 to satisfy class 1 demand, because level 2 charged batteries can satisfy class 1 demand while generating the same reward, and these batteries can satisfy class 2 demand, too. However, if ρ_{21} = 0.5, then we expect to recharge to and use more level 1 charged batteries to satisfy class 1 demand, given that ρ_{11} = 2ρ_{21}, which means we penalize the reward of satisfying class 1 demand with level 2 charged batteries. We vary this parameter and analyze the results in Section 5. In practice, drone delivery companies can select the value of this parameter based on the insights or their preferences.
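To make the role of the ρ_{ij} weights concrete, the short sketch below (ours) evaluates the weighted met demand of Equation (12) directly from the three met-demand counts identified above; the default weights correspond to the ρ_{11} = ρ_{22} = 1, ρ_{21} = 0.5 setting used later in the experiments.

```python
def immediate_reward(met1_lvl1, met1_lvl2, met2, rho11=1.0, rho21=0.5, rho22=1.0):
    """Weighted met demand of Equation (12): rho_ij weighs class j demand served
    with level i charged batteries; rho21 < rho11 penalizes serving class 1
    requests with fully charged (level 2) batteries."""
    return rho11 * met1_lvl1 + rho21 * met1_lvl2 + rho22 * met2

# Continuing the hypothetical transition above (3, 4, and 2 requests met):
print(immediate_reward(3, 4, 2))   # -> 3*1.0 + 4*0.5 + 2*1.0 = 7.0
```

Setting rho21 = 1 removes the penalty, which, as noted above, makes recharging batteries only to level 1 unattractive.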
At the end of the time horizon, we calculate the terminal reward. We assume that no action is taken at the end of the time horizon, that all remaining batteries can be used to satisfy future demand, and that there is sufficient demand for each level. Thus, we define the terminal reward using Equation (13). We calculate the immediate expected reward, r_t(s_t, a_t), from the immediate reward and transition probability functions, as given by Equation (14). We derive the decision rules, d_t(s_t) : s_t → A_{s_t}, from the action set to maximize the expected total reward. Because we select a single action based on the present state, which does not depend on the past states and actions, our decision rules belong to the Markovian decision rules [24]. A policy π is a sequence of decision rules for all decision epochs, that is, d^π_t(s_t) ∀ t ∈ T. We can calculate the expected total reward of policy π for a problem starting from an arbitrary initial state s_1 using Equation (15). The optimal policy, π*, maximizes the expected total reward.

In this section, we first present the exact solution method, backward induction (BI), and continue with our approximate solution method, the reinforcement learning method with a descending ε-greedy exploration feature (DεRL), to solve the stochastic SA-MCDs.

As our Markov Decision Process (MDP) model has finite state and action spaces, there is at least one deterministic optimal policy [24]; thus, backward induction (BI) can determine such a policy (the number of recharging actions) that maximizes the expected total reward, or weighted met demand, over time. Let V*_t(s_t) be the optimal value function, equivalent to the maximum expected total reward from decision epoch t onward when the system is in state s_t. Then, we can use the optimality (Bellman) equations, given by Equation (16), to find the optimal policies for all the decision epochs when moving backward in time. That is, BI sets the value of being in state s_N at the end of the time horizon N equal to the terminal reward value given by Equation (13). Then, the algorithm starts from the last decision epoch and finds the optimal actions (a*_{s_t,t}) and corresponding values (V*_t(s_t)) using Equations (16) and (17), stepping backward in time. The algorithm aims to find the optimal expected total reward over the time horizon, V*_1(s_1), for state s_1, which is the system's initial state at time t = 1. In other words, solving the optimality equations for t = 1 yields the expected total reward over the time horizon. The sizes of the state space, action space, transition probability function, and optimal policies (the best actions for all states over time) are O(M^2), O(M^3), O(M^7 N), and O(M^2 N), respectively. As the size of the problem increases, it becomes challenging for BI to find the optimal solution due to the curses of dimensionality, which cause a drastic increase in computational time and memory.
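To illustrate the recursion behind Equations (16) and (17), the following generic Python sketch performs finite-horizon backward induction. It is a schematic of ours rather than the authors' implementation; the SA-MCD-specific ingredients (the state space of Section 3, the feasible recharging actions, the transition probabilities of Equation (11), and the rewards of Equations (12) and (13)) would be supplied through the callables.

```python
def backward_induction(states, actions, transition, reward, terminal, N):
    """Finite-horizon backward induction (Bellman recursion).

    states            : iterable of hashable states
    actions(s)        : feasible actions in state s
    transition(t,s,a) : dict {next state j: probability p_t(j | s, a)}
    reward(t,s,a,j)   : immediate reward of the transition from s to j under a
    terminal(s)       : terminal reward collected at t = N
    Returns the value functions V[t][s] and a greedy policy pi[t][s].
    """
    states = list(states)
    V = {N: {s: terminal(s) for s in states}}      # boundary condition at t = N
    pi = {}
    for t in range(N - 1, 0, -1):                  # step backward over decision epochs
        V[t], pi[t] = {}, {}
        for s in states:
            best_q, best_a = float("-inf"), None
            for a in actions(s):
                q = sum(p * (reward(t, s, a, j) + V[t + 1][j])
                        for j, p in transition(t, s, a).items())
                if q > best_q:
                    best_q, best_a = q, a
            V[t][s], pi[t][s] = best_q, best_a
    return V, pi
```

Because every (state, action, next state) combination is touched at every epoch, the work grows with the O(M^7 N) transition function, which is exactly where the curses of dimensionality appear.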
Hence, we proceed by presenting our reinforcement learning (RL) method, which is capable of circumventing such problems [25, 60].

In this section, we explain the reinforcement learning method used to find near-optimal policies and to overcome the curses of dimensionality [25] of the stochastic scheduling and allocation problem with multiple classes of demand (SA-MCD). We use a value iteration approach, reinforcement learning with a descending ε-greedy exploration feature (DεRL), which provides high-quality approximate solutions for SA-MCD (see Section 5). We proceed by introducing the notation, summarized in Table 1, and continue with our RL features and procedure.

Table 1: Notation used in the reinforcement learning algorithm.
τ_1 : the number of core RL iterations
τ_2 : the number of sample paths of demands (realized uncertainty)
V_t(s_t) : the optimal value of being in state s_t at time t
V^n_t(s_t) : the approximate value of being in state s_t at time t for iteration n
υ^n_t(s_t) : the observed value of state s_t at time t for iteration n
α_n : the step-size value at iteration n
ε^n : the exploration rate at iteration n
u : a random number used to select exploration or exploitation
z^n_t(s^n_t) : the smoothed value of being in state s_t at time t for iteration n

We first determine the number of drones (M), the number of decision epochs (N − 1), the number of iterations (τ_1), and the number of sample paths (τ_2) in our RL method. Then, we initialize the approximate value at the end of the time horizon, t = N, for all iterations using the terminal reward function given by Equation (13). For every iteration, we select an initial state s^n_1. To select the action, we use the ε-greedy method [25], which allows exploring the action space and works as follows. We generate a random number, u. Then, we compare u with the exploration rate, ε^n, at iteration n. We use a descending function for the exploration rate over the iterations to ensure more states (both attractive and unattractive) are visited and to facilitate the convergence of the algorithm. If u < ε^n, we select a feasible action arbitrarily. Otherwise, we generate τ_2 sample paths of demands (realized uncertainty) and select the action that maximizes the observed value υ^n_t(s_t) according to Equations (18) and (19), wherein V^{n−1}_{t+1}(s_{t+1}) is used to approximate the value of E(V_{t+1}(s_{t+1}) | s_t, a_t) for each sample path. If an action is selected over multiple sample paths, we use the average of V^{n−1}_{t+1}(s_{t+1}) as the approximation. The observed value and the approximated value at the previous iteration are smoothed using a stepsize function. This smoothed value is then used as the present approximation of the value of the observed state. When an action is selected, we sample an observation of the uncertainty (generate a realized value for the stochastic demand) to find the future state. The algorithm steps forward in time and moves to the future observed state until it reaches the last decision epoch, after which a new iteration starts. The same process is repeated until τ_1 iterations are completed. The core loop of the DεRL algorithm is as follows:

while n ≤ τ_1 do
    Select an initial state s^n_1
    for t = 1, . . . , N − 1 do
        Generate a random number u
        if u < ε^n then
            Sample an observation of the uncertainty, D_t
            Determine a random feasible action, a^n_t
            Calculate the observed value, υ^n_t(s^n_t)
        else
            Generate τ_2 sample paths of the uncertainty
            Select the action that maximizes υ^n_t(s_t) over the τ_2 sample paths
            Sample an observation of the uncertainty, D_t
            Calculate the observed value, υ^n_t(s^n_t)
        end if
        Smooth the new observation with the previous approximated value, z^n_t(s^n_t)
        Update the present approximation using the smoothed value, V^n_t(s^n_t) ← z^n_t(s^n_t)
        Determine the next state, s^n_{t+1}
    end for
    Increment n = n + 1
end while
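A compact Python rendering of this loop is given below. It is a schematic of ours, not the authors' implementation: the environment is left abstract behind callables, the exploitation step simply averages each candidate action over τ_2 fresh demand samples, and a plain harmonic stepsize α_n = 1/n stands in for the adaptive stepsize of George and Powell used in the experiments, while the exploration rate follows the ε^n = 1/n schedule reported in Section 5.

```python
import random
from collections import defaultdict

def d_eps_rl(initial_state, feasible_actions, sample_demand, simulate, reward,
             terminal, N, tau1=200_000, tau2=30, seed=0):
    """Schematic descending epsilon-greedy RL with a look-up table value approximation.

    feasible_actions(s)  -> list of feasible recharging actions in state s
    sample_demand(t)     -> one realization of the demand vector D_t
    simulate(t, s, a, d) -> next state reached from s under action a and demand d
    reward(t, s, a, d)   -> immediate weighted met demand
    terminal(s)          -> terminal reward at t = N
    """
    rng = random.Random(seed)
    V = defaultdict(float)                      # look-up table of approximate values

    def value(t, s):                            # terminal values are known exactly
        return terminal(s) if t == N else V[(t, s)]

    for n in range(1, tau1 + 1):
        eps = 1.0 / n                           # descending exploration rate
        alpha = 1.0 / n                         # harmonic stepsize (an assumption)
        s = initial_state
        for t in range(1, N):
            if rng.random() < eps:              # explore: arbitrary feasible action
                a = rng.choice(feasible_actions(s))
            else:                               # exploit: best action over tau2 samples
                def estimate(cand):
                    demands = [sample_demand(t) for _ in range(tau2)]
                    return sum(reward(t, s, cand, d)
                               + value(t + 1, simulate(t, s, cand, d))
                               for d in demands) / tau2
                a = max(feasible_actions(s), key=estimate)
            d = sample_demand(t)                # observe the uncertainty
            v_hat = reward(t, s, a, d) + value(t + 1, simulate(t, s, a, d))
            V[(t, s)] = (1 - alpha) * V[(t, s)] + alpha * v_hat   # smooth and update
            s = simulate(t, s, a, d)            # step forward to the observed state
    return V
```

Only sampled transitions are needed, which is what makes the approach attractive when the transition probability function of Equation (11) becomes unwieldy; the look-up table V plays the role of the value function approximation described above.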
In this section, we explain the results of solving the stochastic scheduling and allocation problems with multiple classes of demand (SA-MCD) using realistic data related to the drone delivery company Zipline. We created this dataset to mimic the geographical locations of the Zipline drone hub and hospitals in Rwanda, Africa, the population of districts, the flight regulations in the country, and the Zipline drone configuration, including the speed, flight range, and recharging time. We solve modest SA-MCDs (15-21 drones) using exact solution methods. We note that we use the number of batteries and the number of drones interchangeably, as each drone has a battery pack, which might consist of multiple batteries that can be charged in parallel. As we run into the curses of dimensionality for larger instances of SA-MCD, we present the results of our reinforcement learning (RL) method, which can provide near-optimal solutions for the modest instances and solve larger problem instances. We deduce managerial insights for managing the swap station's distribution operations that maximize the expected total weighted met demand of multiple classes. We proceed by first explaining the data.

In this paper, we present an actual case study related to the vital operations of Zipline drones, which deliver medical items in Rwanda, Africa. The Zipline station is located in the Muhanga district, west of Rwanda's capital city, Kigali. We focus on drone delivery to satisfy the stochastic demand for blood units originating from hospitals in Rwanda. In Table 2, we summarize the input data, including the name of each hospital, its location, the distance between the station and each hospital, the approximate population of people using each hospital, the number of blood units and flights needed per day for each hospital, and the demand class associated with each hospital. We categorize the demand into two classes based on the distance between the Zipline station and the hospitals. As the flight range of the drone is estimated to be 80km [73], we assume demands of hospitals located within [0km, 40km) and [40km, 80km] fall into class 1 and class 2, respectively. The demand class NA means that the hospital is not reachable from the Zipline station and is excluded from further analysis. We exclude six hospitals (denoted by NA in the last column of Table 2) from the 33 identified hospitals, as they are located at a distance beyond the drone's flight range from the origin, the Zipline station. We consider ten hospitals in class 1 and 17 hospitals in class 2, which are located throughout 15 distinct districts in Rwanda. We approximate the air travel distance between the station and hospitals using the Haversine formula [74], which is broadly used to find the distance between two points on the earth. When calculating the distance, we consider the rules for flying drones in Rwanda, which do not allow drones to fly within a 10km radius of airports [75]. Therefore, we need to adjust the travel distance between the station and two hospitals, Kiziguro and Rwamagana.
Hence, we calculate the closest travel distances such that the flights to these destinations do not violate the rules for flying drones in Rwanda. In Figure 4 , we display the geographical locations of airports, hospitals, and the Zipline station. Consistent with [76, 77] , we use a non-homogenous Poisson process to determine the patients' arrival to hospitals. We examine the daily operations of the drone swap station wherein the time between two consecutive decision epoch is 90 minutes (1.5hr). Thus, N = 24 /1.5 + 1 = 17. The 90-minute intervals provide adequate time for drones to receive charge [78] and complete a round-trip from the furthest delivery mission to the station given that the maximum speed of the drone is 127km/hr [79] . To derive the mean demand of blood units per time t, we apply the following process. First, we determine the number of people using a particular hospital based on the population of each district. If more than one hospital is located in a district, we evenly distribute the district's total population over the number of hospitals located in that district. Second, we calculate the number of blood units needed per year for each hospital by multiplying an estimated portion of the population that needs blood units per year (2% recommended by the World Health Organization (WHO) [80] ) by the number of people using that hospital. The yielded number is an overestimate of the number reported by [81] . However, we use this number to account for pessimistic cases wherein the station faces more demand. Third, we divide the number of blood units required per year by 365 to find the number of blood units needed per day for each hospital. Next, we use the pattern of patient arrivals to hospitals, consistent with [82, 83, 84] , to derive the mean demand for blood units of time t over a day. The pattern in the literature indicates an ascending trend of arrivals from 6:00am to the peak at noon, followed by a descending trend from noon to 6:00am of the following day. Specifically, we use the data from [82] and fit a polynomial function for generating the mean arrival rate of time t. In Figure 5 , we display the mean demand of [82] and our fitted function. We scale the mean demand of blood units of time t such that the summation of the scaled demand over a day equals the calculated number of blood units required per day for each hospital. Then, as each drone can carry two units of blood [26] , we divide the mean demand of blood units of time t by two to find the mean demand for flights for each hospital. Finally, the mean demand for either demand class, λ 1 t , λ 2 t , is the aggregation of mean demands for flights from the hospitals within each demand class. In the first experiment, we consider Zipline has a fleet of 15 drones [37] . We set ρ 11 = ρ 22 = 1 and ρ 21 = 0.5 indicating that satisfying a demand of class 1 using a level 1 charged battery (partially-charged) generates more immediate reward than satisfying that demand using a level 2 charged battery (fully-charged). This setting implies the company provides less reward when drones with excessive level of charge are used to satisfy demands, which can be interpreted as a penalty to account for unnecessary higher recharging costs incurred. The setting of our D RL parameters is as follows. The number of core iterations is τ 1 = 200000. As we will see later, the algorithm will converge after 50000 iterations; however, as the computational time is in a matter of minutes, we keep 200000 iterations for our experiments. 
We test τ 2 = 1, 5, 10, . . . , 50 and observe that increasing the parameter to the value of 30 reduces the optimality gap and increases the robustness of the results and computational time, but excessive increase in the value of τ 2 only magnifies the computational time with little improvement in the quality of the result; thus, we set τ 2 = 30. We use the adaptive stepsize function provided by George and Powell [85] . We use ε n = 1 /n to adjust the value of the exploration rate at iteration n used in the ε-greedy approach to select policies within our RL method. With this function, we ensure a higher rate of exploration/exploitation in early/late iterations, which is desirable for visiting more states and enabling the algorithm to converge as it proceeds with each iteration. In this section, we first feed the data explained in Section 5.1 to solve the problem using exact and approximate solution methods, Backward Induction (BI) and the reinforcement learning method with a descending -greedy exploration feature (D RL), respectively. Then, we analyze the optimal policies (BI solutions) and assess the quality of near-optimal solutions derived from D RL. Moreover, as the drone delivery company can control and adjust ρ 21 (the weight of satisfying class 1 demand using level 2 charged batteries), we analyze the impact of changing the parameter's value on the station's operations and amount of met demand. We also conduct different sets of experiments to solve instances of the problem and answer the following questions. How many batteries are needed in the station to satisfy a certain level of the stochastic demand? What is the contribution of classifying the demand on the demand satisfaction? We use a high-performance computer with four shared memory quad Xeon octa-core 2.4 GHz E5-4640 processors and 768GB of memory for running all of our computational tests. In this section, we present the results from solving stochastic SA-MCD with BI and D RL and using the data presented in Section 5.1. The system's initial state is s 1 = (0, 15), which means all 15 batteries are charged to level 2 (fully-charged). The time horizon is one day and N = 17 wherein the first decision epoch is at midnight and the time between any two consecutive decision epochs is 90 minutes. That is, the decisions are made at 16 decision epochs, t, where t = 00:00, 1:30, 3:00, . . . , 10:30, 12:00, 13:30, . . . , 22:30. The results of D RL in the last iteration can differ in terms of the converged value and met demand. Hence, we generate independent sample paths to report robust results and evaluate our RL method's performance. Our preliminary experiments reveal that our result is robust when >100 sample paths of demand are incorporated for evaluation (the percentage change in the converged values is satisfactory and less than 0.1). Hence, to yield even a higher robustness level, we calculate the average percentage of met demand using Equations (21) and (22) We summarize the results in Table 3 . The percentage of met demand over a sample path equals the total number of demand met over the total realized demand of both classes. We report the average of 500 sample paths in Table 3 . We provide more detailed results about demand satisfaction by class later in Table 5 . As shown, D RL is faster and can generate a high-quality solution with 5.3% of optimality gap (derived from Equation (23)) in 8 minutes. In Figure 6 , we show the convergence of our proposed D RL method. 
In Table 3, we also provide the results of an easy-to-implement, intuitive benchmark. As the model's objective is to maximize the amount of met demand, our proposed benchmark policy makes all of the empty batteries fully charged (usable for demand satisfaction of all classes) at every decision epoch. We note that the benchmark can be quickly implemented, but it needs considerable computational time to calculate the exact expected total reward. The computational time for evaluating the benchmark includes all of the operations of backward induction, except the loop for finding the best actions, as the actions are derived from the (fixed) benchmark policy. Our results show that DεRL outperforms the benchmark policy regarding the expected total reward, average demand met, and computational time. Thus, we only continue with DεRL as the superior approximate solution method.

In Table 4, we provide a summary of the results from solving the problem with 15 to 21 drones using BI and DεRL. We note that BI cannot find the optimal solution when the number of drones is greater than 21. DεRL provides an optimality gap of less than 6% for all instances and significantly reduces computational time. The maximum difference between the average percentage of met demand of DεRL and BI is less than 5%. The results indicate the high performance of our DεRL method in providing approximate solutions.

Table 4: Computational time, memory used, and average percentage of met demand over time for 500 sample paths when ρ_{21} = 0.5 using the BI and DεRL methods.

We can find the visited states and policies over time (sample paths of visited states and policies) using the sample paths of realized demands. Given the initial state, the taken actions, the realized demand, and the state transition functions given by Equations (9) and (10), we can find the future visited state. The consecutive visited states form the sample path of states. The sample paths of policies are the consecutive selected actions derived from the BI and DεRL solution methods. In Figure 7, the first, second, and third rows depict sample paths of states, optimal policies, and met demands when the stochastic demand equals the mean demand at time t and ρ_{21} = 0.5, 1, and 2, respectively, for M = 15. Intuitively, when ρ_{21} = 0.5, the level 1 charged batteries are used to satisfy class 1 demand and a^{12}_t = 0. When ρ_{21} increases, more batteries are recharged from level 1 to level 2 to be used for satisfying the demand of class 1. When ρ_{21} = 1, no batteries are recharged to level 1, because there is no difference between the value of satisfying class 1 demand using level 1 and level 2 charged batteries. Hence, the optimal policy includes recharging batteries to level 2 to be used for either class of demand. For all values of ρ, we observe more recharging actions when demands are in peak periods.

We also compare the average amount of met demand of either class and the optimal policies over time when ρ_{21} (a parameter controlled by the station) varies between 0.5 and 2 with 0.1 increments. We calculate the percentage of demand satisfaction of either class by inputting the associated values of each class into Equations (21) and (22) for different values of ρ_{21}. We use Equation (24) to find the average number of actions over time over all sample paths. We summarize the results based on 500 sample paths of realized demand in Table 5.
We note that we display the number of actions in the last three columns of Table 5 , but we use percentage for the other columns related to demand satisfaction. Intuitively, as ρ 21 increases, more/less class 1 demand is satisfied using drones with level 2/level 1 charged batteries. On average, more recharging occurs from level 1 to level 2 to satisfy the demand of either class. For smaller values of ρ 21 , increasing the parameter value provides more incentive for recharging more drones up to level 2 and satisfying both demand levels. However, for larger values of ρ 21 , as level 2 charged batteries are used more to satisfy class 1 demand, fewer level 2 charged drones are available to satisfy class 2 demand. A significant finding is that 15 drones are not sufficient to satisfy the demand of either class. We proceed with analyzing the impact of increasing the number of drones in the station on the amount of met demand. In this section, we solve the problem for a larger number of drones in the station to find the relationship between this number and the amount of met demand. The analysis provides significant insights for drone delivery companies given the high price to purchase and maintain drones in swap stations. First, we note that backward induction (BI) can solve the problem with at most 21 drones using our computational resources. We summarized the amount of met demand, computational time, and memory used to solve the problem for 15 to 21 drones in Table 4 using BI and RL. For M > 21, we report the results of our D RL method in Table 6 . Table 6 : Computational time, memory used, and average percentage of met demand over time for 500 sample paths when ρ 21 = 0.5 using the D RL method. As shown, when M ≥ 54, the average percentage of met demand over time for 500 sample paths is 100%. We depict a sample path of policies for M = 54 when demand equals mean demand in Figure 8 . As all batteries are initially available with a level 2 charge, we recharge fewer drones in the early morning. Then, we recharge more batteries from 6:00 to 18:00 as the demand of either class increases. Overall, more batteries are recharged to level 2 to satisfy the demand of either class. Figure 8 : Sample paths of states, optimal policies, and demands for 54 drones, ρ 21 = 0.5 when the realized demand of either class equal mean demand. In this section, we compare the outputs of the models with and without demand classification to illustrate the contribution of classifying the stochastic demand. We focus on demand satisfaction as the crucial metric to assess the station's success in delivering medical supplies. We provide this metric for a different number of drones, which is important for decision-makers given the high price of purchasing and maintaining drones. In Figure 9 , we show the average percentage of met demand for a different number of drones when we use/do not use demand classification. The red color shows the percentage for the different number of drones when demand is not classified. In this model, the state of charge of batteries is either full or empty, and full batteries are used to satisfy the demand without considering the classified distance between the station and hospitals. The system's state is the number of fully-charged batteries. The action is the number of recharging actions that make an empty battery fully-charged. The uncertainty is the stochastic demand disregarding the distance between the station and hospitals (demand nodes). 
Hence, when demand is not classified, drones that return from (even a close) delivery mission are not available before recharging for the next delivery task. For this model, optimal policies indicate that more than 150 drones are needed to satisfy 100% of demand over 500 sample paths. In the model with demand classification, we note that finding the optimal policy and, in turn, the average percentage of demand for M > 21 is beyond our computational resources. Therefore, we only show the percentage derived from BI for M ≤ 21 using the color blue. However, using D RL enables us to offer near-optimal policies for this model. As shown, D RL (black line in Figure 9 ) provides an upper bound for the optimal number of drones needed to satisfy a particular level of demand for the model with demand classification. As shown, D RL's policies constantly outperform the optimal policy of the model with no demand classification in terms of the average percentage of met demand. For instance, D RL's policies can satisfy 100% of the met demand with only 54 drones, which is significantly lower than the required number of batteries to hit this target when the demand is not classified. In this research, we addressed managing distribution operations of a drone swap station located at a drone hub to maximize the amount of stochastic met demand for flights delivering medical supplies in Rwanda, Africa. We denoted this problem as stochastic scheduling and allocation problem with multiple classes of demand (SA-MCD) where the stochastic demand is classified based on the distances a drone can fly, which is linked to the level of charge inside the drone's battery. We formulated the problem as a Markov Decision Process (MDP) model wherein the optimal policies determine the number of recharging batteries from one level to a higher level of charge over time when encountering stochastic demand from different demand classes. We solved the problem using backward induction (BI) and observed that we run into time/memory issues when the number of drones is greater than 21. Hence, we applied a reinforcement learning method with a descending -greedy exploration feature (D RL) to find high-quality approximate solutions quickly and overcome the time/memory issue. We designed a set of experiments to show the high performance of our D RL method and obtain insights about how to manage the operations in the station to maximize the expected total weighted met demand when the model parameters vary. We found plenty of directions and opportunities related to this work for future research. For instance, we did not consider the difference between the length of time for different charging levels (still, our model outperformed the model with no demand classification). Future research should consider the time difference between different recharging actions to capture the system's behavior more realistically. As our transition probability function is large-scale and complex, it is worth investigating various model-free approaches, like RL/ADP/Q-learning, that can circumvent the complexities of the burdensome function, particularly when more than three classes are needed. Hence, this research, which introduces a new set of problems that can be used in many applications, opens the avenue for future studies from many perspectives. In terms of modeling and application, we can have multiple demand classification criteria, such as level of emergency (plus distance). Future research can add backlogging unsatisfied demands if it suits the application. 
Additionally, future research should consider how the operational charging and dispatch actions for the different demand classes impact battery degradation, since excessive charging should be avoided in order to extend battery lifecycles.
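To make two of the ingredients summarized above concrete, the following minimal Python sketch illustrates (i) a simple rule for serving each demand class with batteries at the same or a higher charge level and (ii) a descending ε-greedy action-selection rule. All names, the three-level charge discretization, the geometric decay schedule, and the dispatch heuristic are illustrative assumptions for exposition only; they are not the exact implementation evaluated in this paper.

```python
import random

# Minimal sketch, not the paper's implementation: the charge discretization,
# decay schedule, and dispatch heuristic below are illustrative assumptions.

CHARGE_LEVELS = [1, 2, 3]      # 1 = lowest usable charge, 3 = full charge
DEMAND_CLASSES = [1, 2, 3]     # class k can be served by charge level >= k


def dispatch(batteries, demand):
    """Serve each demand class with the lowest sufficient charge level.

    batteries: dict mapping charge level -> number of charged batteries on hand
    demand:    dict mapping demand class -> number of requested flights
    Returns a dict with the number of met requests per class.
    """
    met = {k: 0 for k in DEMAND_CLASSES}
    for k in sorted(DEMAND_CLASSES, reverse=True):          # farthest class first
        for level in (l for l in CHARGE_LEVELS if l >= k):  # lowest sufficient level
            while met[k] < demand.get(k, 0) and batteries[level] > 0:
                batteries[level] -= 1
                met[k] += 1
    return met


def descending_epsilon(iteration, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Exploration rate that decreases geometrically with the iteration count."""
    return max(eps_min, eps_start * decay ** iteration)


def epsilon_greedy(q_values, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))


# Example usage with hypothetical numbers:
state = {1: 2, 2: 1, 3: 3}                  # batteries on hand per charge level
print(dispatch(state, {1: 1, 2: 2, 3: 2}))  # all demand met: {1: 1, 2: 2, 3: 2}
```

The descending schedule reflects the idea behind the DεRL exploration feature: random actions, which help escape local optima, are frequent early on and become rare as the value estimates stabilize. The geometric decay shown here is only one common way to achieve such a schedule.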
This research is supported by the Arkansas High Performance Computing Center which is funded through multiple National Science Foundation grants and the Arkansas Economic Development Commission.