key: cord-0656307-pltu9e3u authors: Ahamed, Tanvir; Zou, Bo; Farazi, Nahid Parvez; Tulabandhula, Theja title: Deep Reinforcement Learning for Crowdsourced Urban Delivery: System States Characterization, Heuristics-guided Action Choice, and Rule-Interposing Integration date: 2020-11-29 journal: nan DOI: nan sha: ce15bc08365726a0b5e56aabb7e083f7b4a97f12 doc_id: 656307 cord_uid: pltu9e3u This paper investigates the problem of assigning shipping requests to ad hoc couriers in the context of crowdsourced urban delivery. The shipping requests are spatially distributed each with a limited time window between the earliest time for pickup and latest time for delivery. The ad hoc couriers, termed crowdsourcees, also have limited time availability and carrying capacity. We propose a new deep reinforcement learning (DRL)-based approach to tackling this assignment problem. A deep Q network (DQN) algorithm is trained which entails two salient features of experience replay and target network that enhance the efficiency, convergence, and stability of DRL training. More importantly, this paper makes three methodological contributions: 1) presenting a comprehensive and novel characterization of crowdshipping system states that encompasses spatial-temporal and capacity information of crowdsourcees and requests; 2) embedding heuristics that leverage the information offered by the state representation and are based on intuitive reasoning to guide specific actions to take, to preserve tractability and enhance efficiency of training; and 3) integrating rule-interposing to prevent repeated visiting of the same routes and node sequences during routing improvement, thereby further enhancing the training efficiency by accelerating learning. The effectiveness of the proposed approach is demonstrated through extensive numerical analysis. The results show the benefits brought by the heuristics-guided action choice and rule-interposing in DRL training, and the superiority of the proposed approach over existing heuristics in both solution quality, time, and scalability. Besides the potential to improve the efficiency of crowdshipping operation planning, the proposed approach also provides a new avenue and generic framework for other problems in the vehicle routing context. Urban delivery has been undergoing an exciting and challenging time with ever-growing online shopping demand. In the US, for example, the share of e-commerce sales in total retail sales has increased from 4.2% in the first quarter of 2010 to 16 .1% in the second quarter of 2020 (Statista, 2020a) . For the food delivery segment alone, revenue amounted to $22.1 billion in 2019 and is projected to reach $32.3 billion by 2024 (Statista, 2020b) . This trend has been further accelerated by the COVID-19 pandemic. On the other hand, the time-sensitive nature of today's urban deliveries is imposing considerable pressure on delivery service providers (DSP) to control logistics cost while meeting customer demand for more frequent and expedited deliveries. Guaranteed delivery within one or two hours after an order is placed online has become increasingly common in urban delivery practices (Setzke et al., 2018) . As a result of the growing demand and increasing time sensitivity, delivery vehicle traffic has been continuously on the rise in urban areas, causing many negative consequences (Kafle et al., 2017) . To address the challenges, crowdshipping has emerged as an attractive new form of urban delivery. 
In crowdshipping, a DSP solicits ordinary people, termed crowdsourcees, who have some available time and may walk, bike, or drive a car to perform delivery to earn some payment. Many companies, including Postmates, Deliv, Piggy Baggy, Amazon Flex, Uber Eats, DoorDash, and Instacart, are rapidly expanding their crowdshipping businesses. By using ordinary people as "ad hoc couriers", crowdshipping can bring significant cost advantages over the traditional asset-based delivery which relies on full-time employees and self-owned vehicles (Arslan et al., 2019) . In this paper, we focus on crowdshipping with spatially distributed request pickup and delivery locations, using "dedicated crowdsourcees" who are also spatially distributed. Distributed pickup and delivery locations are common for delivery from restaurants, grocery stores, and retail shops to customers, and even for document delivery between different office locations. Dedicated crowdsourcees inform the DSP about their available time for performing delivery, which is a popular form in the crowdshipping business (e.g., Amazon Flex, Instacart, Postmates). This is different from another type of crowdshipping using opportunistic (or "on the way") crowdsourcees: the pickupdelivery tasks are "piggybacked" on the crowdsourcees' existing journey with some extra miles and time incurred. We further consider that each crowdsourcee has a limited carrying capacity. A crowdsourcee is paid a fixed rate ($/minute) whenever the crowdsourcee is carrying a request. In addition to the description of the type of crowdsourcees, this paper considers a delivery environment where each shipping request has a narrow time window defined by the time between the earliest time of pickup and the latest time of delivery. This characterization is equivalent to specifying that each request needs to be delivered within a certain amount of guaranteed time (e.g., two hours) after the shipping request becomes available. This delivery environment is prevalent today, especially in food and grocery deliveries. Given the above setup, we consider the following static crowdshipping problem as the central research question of our paper: how can a DSP efficiently assign requests to crowdsourcees by minimizing total shipping cost, while respecting constraints arising from the time availability and carrying capacity limits of crowdsourcees and pickup and delivery time windows of the requests? Cost minimization requires consolidation of multiple requests into one crowdsourcee route, as this can reduce total routing distance thus routing time made by crowdsourcees. In the case that a request is not assigned to a crowdsourcee, the request will have to be picked up and delivered by a backup vehicle, which is more expensive than hiring a crowdsourcee. This crowdshipping problem can be viewed as a specific type of pickup-and-delivery problem and belongs to the broad category of vehicle routing problems (VRP). While many integer programming models and heuristic algorithms have been developed for solving this and similar problems, the novelty of this paper is that we propose, for the first time in the literature, an approach that leverages deep reinforcement learning (DRL) -more specifically deep Q learning (DQN) -to frame and solve the constrained crowdsourcee-shipping request assignment problem. Two salient features of DQN are experience reply and target network which can enhance efficiency, convergence, and stability in DRL training. 
Our work goes beyond simple adoption of the DQN algorithm in the existing literature, by making three major methodological contributions as follows. The first contribution is on a novel representation of system states for the crowdshipping problem. Due to the combinatorial nature of the crowdshipping problem and the heterogeneity of both requests and crowdsourcees in terms of time and carrying capacity, the states of a crowdshipping system cannot be represented by one or a few metrics. A comprehensive representation must in some way capture the sequence of pickup and delivery nodes on each crowdsourcee route. Yet routing sequence alone is not enough to reflect the fact that both requests and crowdsourcees are time sensitive: on the one hand, each request has a limited time window between the earliest possible pickup and the latest delivery (e.g., 2 hours). On the other hand, by dedicating one's time to crowdshipping, a crowdsourcee also has limited time availability. The time information about requests and crowdsourcees, which changes as crowdsourcee routes are constantly created and improved, is an inherent part of the system state that helps the DRL agent make informed routing decisions, especially with respect to what requests need be considered first and what crowdsourcee routes may be given higher priority given time availability and delivery urgency. To this end, a novel representation of system states that leverages the notion of information array is proposed which encompasses not only static location information of request pickup and delivery nodes but information on crowdsourcee routing sequences, request-specific time availability, and crowdsourcee-specific time and capacity availability. The second contribution is on embedment of heuristics-guided action choice in DRL. The combinatorial nature of the problem means that a very large number of different actions can be taken to construct and improve crowdsourcee routing. But enumerating all possible actions would be neither efficient nor practical in DRL training. To preserve training tractability, we abstract the action space into five general types of actions for assigning or improving the assignment of requests to crowdsourcees: 1) inserting an unassigned request to a crowdsourcee route (insertion); 2) moving an assigned request to another place in the same crowdsourcee route (intra-route move); 3) moving an assigned request to a different crowdsourcee route (inter-route move); 4) exchanging the positions of two requests that are assigned to two different crowdsource routes (1-exchange); 5) do-nothing. As many possibilities for taking a specific action still exist given an action type, heuristics that leverage the information offered by our proposed state representation and are based on intuitive reasonings are designed to guide the specific action to take. We show that the embedment of heuristics-guided action choice significantly enhances DRL training efficiency and solution quality. The third contribution is on integration of rule-interposing into DRL training and implementation. The rules aim to prevent certain routes or node sequences from being visited repeatedly during neighborhood moves (i.e., intra-route move, inter-route move, and 1-exchange) within a period of time, as repeated visiting discourages exploring more actions and may get the routing sequence trapped in local optimum, thus compromising the efficiency of DRL training. 
Specifically, we employ two rules that: 1) set up and update a priority list of crowdsourcee routes for each neighborhood move, based on criteria in line with the nature of the neighborhood moves. A crowdsourcee route that is chosen for a neighborhood move will be removed from the priority list and not considered for some period of time; 2) introduce Tabu tenure for the relative positions of pickup and delivery nodes. Two nodes that were neighbored and are moved away are prohibited to be neighbored again for some period of time. With the two rules, computation efforts involved in repeatedly visiting routes or node sequences during neighborhood moves are spared, thereby enhancing the training efficiency by accelerating learning. With the above three methodological contributions, the effectiveness of the proposed DRL-based approach to solve the crowdshipping problems of our interest is demonstrated through extensive numerical analysis. Our results show superiority of the trained DQN algorithm over traditional heuristics in solution quality, time, and scalability. Given that the training of DRL will be performed offline and a trained DRL model can solve problems in a matter of seconds, the proposed approach has significant potential for practical crowdshipping operation planning and even real-time decisionsupport. Moreover, the methodology framework, which in this paper tackles a more complicated type of pickup and delivery problems with time constraints from both "vehicles" (crowdsourcees) and "customers" (shipping requests), provides a new and promising avenue and generic framework for solving general pickup and delivery and VRP problems. The remainder of the paper is structured as follows. Section 2 reviews and synthesizes the relevant literature. In Section 3 provides a detailed presentation of the methodology including the fundamentals of reinforcement learning (RL) and DRL; information array, representation of states, actions, and rewards; the DQN algorithm for crowdshipping; and rule-interposing design. Section 4 implements the DRL model and discusses the results from extensive numerical experiments. Summaries and suggestions for future research are given in Section 5. We organize our review of the relevant literature in two parts. We first review and synthesize the recent advances of DRL in solving vehicle routing problems including both passenger transportation and freight delivery. We will synthesize the problem characteristics and DRL specifications in representative studies, based on which the uniqueness of our paper is highlighted. We also review the literature of crowdshipping, which reveals that DRL has not been adapted to tackle crowdshipping problems. The crowdshipping problem studied in this paper is a routing problem, more specifically belonging to the category of pickup and delivery problems. As such problems are NP-hard, previous efforts have been focused on developing efficient heuristics to obtain solutions with acceptable quality and in a reasonable amount of computation time (e.g., Lu and Dessouky, 2006; Berbeglia et al., 2010; Toth and Vigo, 2014) . Heuristics are typically expressed in the form of a set of rules, which may be interpreted as policies to make routing decisions (Kool et al., 2018) . 
With the advance in DRL and increasing availability of computation power to researchers in recent years, these policies can be formulated under a reinforcement learning framework, parameterized using deep neural networks (DNN), and trained to obtain new and stronger algorithms. Alone this line of thought, DRL-based solution approaches to solving routing problems are garnering growing attention lately. A basic version of routing problems is the traveling salesman problem concerning the routing of a single vehicle. Bello et al. (2016) probably make one of the first attempts to combine reinforcement learning with neural networks to tackle traveling salesman problems. A pointer network comprising two recurrent neural networks for encoding and decoding and an attention function is trained with policy gradient. The authors find that the trained DRL model outperforms Christofides heuristics and represent the policy to capture the property of a node in the context of its graph neighborhood. A fitted Q-learning is adopted to learn a greedy policy that is parameterized by the graph embedding network. For TSP problems considered above, only spatial information of nodes and a single vehicle tour are involved. Actions in DRL pertain to adding nodes-one at a time-to progressively construct the vehicle tour. It can be seen from the above review that most of the multi-vehicle routing problems considered in the existing DRL literature are different in problem characteristics from the crowdshipping problem considered in this paper. As shown in Table 1a to the complexity of DQN training and solutions. In contrast, the full control of a DSP of assigning crowdsourcees to shipping requests means that centralized DQN would be more appropriate. The present paper attempts to overcome these issues while comprehensively capturing the problem characteristics of crowdshipping, as shown in the last row of Table 1a . Because of the richer features and centralized nature for the crowdshipping system of our interest, fully capturing the states of a crowdshipping system requires more complicated representation than in the existing studies. As shown in Table 1b , the existing studies most have vehicle and/or customer locations as the major part of state representation. Very limited efforts are made to include time-related information for either vehicles or customers. On the other hand, given that both crowdsourcees and shipping requests have limited time windows, and that the DRL proposed in our work embeds heuristics-guided action choice which needs time-related information to proceed, the incorporation of time-related information is critical (see subsections 3.2.2-3.2.3). In fact, the heuristics-guided action choice design preserves training tractability, thereby empowering a centralized policy and contributing to the scalability of the proposed DRL approach. For performing these heuristics, information on routing sequence is needed, which is included in the state representation proposed in this paper, but not in any prior studies reviewed. As a result of the heuristics-embedding feature, the specification of action space in our work is richer and more elaborate. Finally, no other papers consider ruleinterposing. These are made clear in Table 1b . The second group of crowdshipping research does not consider intermediate transfers, but uses end-to-end trips either by dedicated crowdsourcees or people "on-the-way" who are willing to make some detour from their original trips to perform delivery. 
Either way, accounting for the willingness of ordinary people to participate in crowdshipping is important, as considered in Archetti et al. (2016) and Dayarian and Savelsbergh (2017). In these two studies, the matching of crowdsourcees with requests is approached by respectively designing variants of classic capacitated vehicle routing problems and developing two rolling horizon dispatching approaches. A rolling horizon framework with an exact solution approach is proposed by Arslan et al. (2020), who look into dynamic pickups and deliveries using "on-the-way" crowdsourcees. Yildiz and Savelsbergh study service and capacity planning in crowdsourced delivery. Despite the proliferation of research in the crowdshipping field, DRL has not been considered in the literature as a way to guide crowdshipping operations. In addition to this major gap, little attention has been paid to the limited time availability that dedicated crowdsourcees are likely to have while performing crowdshipping. This paper intends to address these gaps. Note: The term "vehicle" is quoted because in crowdshipping, "vehicles" would refer to crowdsourcees. Similarly, the term "customers" is quoted because "customers" would refer to shipping requests on the freight side.

This section describes the crowdshipping-adapted DRL methodology. First, we introduce the fundamental ideas of RL and DRL. Then, we discuss how states, actions, and rewards, the essential elements of DRL, are specified in crowdshipping. Building on these specifications, we detail the training process using DQN. Two key ideas are worth mentioning. First, DQN learns from how a policy, i.e., a decision rule which directs what type of action to take given a state, performed on previous instances and improves the policy over time. Knowing the action type, the specific action is then determined by a corresponding heuristic. As mentioned in subsection 2.1.2, such heuristics-guided action choice preserves training tractability and consequently contributes to the scalability of the proposed approach. Second, solutions to a crowdshipping problem instance can be constructed sequentially, one step at a time, which is amenable to the DRL framework.

RL is one of the three categories of machine learning (the other two being supervised learning and unsupervised learning) (Sutton and Barto, 2018). The tenet of RL is to train an agent such that the agent optimizes its behavior by accumulating and learning from its experiences of interacting with the environment. Optimality is measured as maximizing the total reward obtained by taking consecutive actions. Thus, RL is a sequential decision process with the agent as the decision maker. At each decision point, the agent has information about the current state of the environment and selects the best action based on its experience so far. The action taken transitions the environment to a new state. The agent then receives a reward, i.e., a reinforcement signal of how good or bad the action taken is. To formulate the sequential decision process, RL employs the Markov decision process (MDP) as the mathematical foundation to keep track of the progression of the decision process. To do so, the following notation is introduced. $\mathcal{S}$ is the set of states of the environment. $\mathcal{A}$ is the set of actions the agent can take. $\mathcal{R}$ is the set of possible rewards resulting from the agent taking an action at a given state. To illustrate, suppose the environment is in state $s_t \in \mathcal{S}$ at time step $t$. The agent takes an action $a_t \in \mathcal{A}$. The action transitions the environment to a new state $s_{t+1} \in \mathcal{S}$ at the next time step $t+1$. Meanwhile, the agent receives a reward $r_t \in \mathcal{R}$.
The reward is a function of the state-action pair: $r_t = r(s_t, a_t)$ (Fig. 1). Since the actions are taken sequentially, the objective of the agent at any time step $t$ is to maximize the cumulative reward, i.e., the return $G_t$, from $t$ till the last time step $T$:

$G_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$    (1)

If the reward is received over a long period, a discount factor $\gamma \in [0,1]$ is often used to reflect discounting:

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$    (2)

In RL, a policy $\pi$ is a mapping from states to probabilities of selecting each possible action. A value function expresses the expected return when starting in state $s$ and following policy $\pi$ thereafter. At time step $t$, the value function can be written as:

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$    (3)

Related to the value function, we define the value of taking action $a$ in state $s$ and following policy $\pi$ thereafter, denoted as $q_\pi(s, a)$. $q_\pi(s, a)$ is termed the action-value function, or "Q-function", of the state-action pair $(s, a)$. The letter "Q" represents the quality of this state-action pair:

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$    (4)

It is desired to seek an optimal policy $\pi^*$ such that $v_{\pi^*}(s) = \max_a q_{\pi^*}(s, a)$, where $q_{\pi^*}(s, a)$ means that the agent takes action $a$ at state $s$ and follows policy $\pi^*$ thereafter. Clearly, if $q_{\pi^*}(s, a)$ is known for every state-action pair $(s, a)$, then $\pi^*$ is also known. The problem of finding the optimal policy then becomes finding the optimal Q-values $q_*(s, a)$, $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$. To do so, one of the prominent algorithms is Q-learning (Watkins and Dayan, 1992). At a time step, the Q-function value (hereafter simplified as "Q-value") for a given state-action pair is updated using the following rule, which is based on the Bellman optimality equation:

$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') \big]$    (5)

where $s'$ is the transitioned state after taking action $a$ at state $s$, and $\alpha$ is the learning rate. The Q-learning algorithm works well to find the optimal policy when the state-action space is small. However, it becomes computationally inefficient and even infeasible to compute Q-values for every state-action pair when the state-action space is large (just imagine applying Eq. (5) to every state-action pair). Deep Q networks (DQN) deal with this difficulty by approximating the Q-function with a deep neural network (DNN), so that learning remains possible even when the state-action space is very large. A prominent advantage of DQN is that it overcomes the instability and divergence that occur when a nonlinear function approximator such as a neural network is used to represent the Q-function, by embedding two salient features: experience replay and target network, whose use will be discussed in subsection 3.3. Before getting into the details of DQN, below we first describe our specifications of states, actions, and rewards in the context of crowdshipping.

In this section, we propose a novel state and action space design as well as a reward function specification for crowdshipping. A key element of this proposal is the creation of an information array that contains the routing sequence of each crowdsourcee. Let $R$ and $W$ denote respectively the sets of shipping requests and crowdsourcees. The information array is a $|W| \times (2|R| + 1)$ matrix, where $|W|$ and $|R|$ denote respectively the numbers of crowdsourcees (which is equivalent to the number of routes) and shipping requests. Each row indicates the routing sequence of one crowdsourcee. The matrix has $2|R| + 1$ columns to accommodate the extreme possibility that all $|R|$ requests ($2|R|$ nodes) are assigned to a single crowdsourcee, plus the origin node of the crowdsourcee (thus one more column is needed). For example, if the $w$th row of the information array contains the tuple $(o_w, p_1, p_2, d_1, d_2)$, it means that crowdsourcee $w$ will leave his/her origin node $o_w$, go to the pickup node $p_1$ of the first request, pick up the second request at $p_2$, then drop off the first request at $d_1$, and finally drop off the second request at $d_2$. In this case, the cells of the first five columns of the $w$th row are occupied, whereas the remaining cells in the row are empty (Fig. 2).
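To make the information array concrete, below is a minimal sketch of how such a $|W| \times (2|R|+1)$ array could be maintained. This is our own illustration rather than the authors' implementation; the node labels, instance sizes, and helper function are assumptions.

```python
import numpy as np

# Minimal sketch of the |W| x (2|R| + 1) information array described above.
# Node labels are strings here ("o3" = origin of crowdsourcee 3, "p7"/"d7" =
# pickup/delivery node of request 7); empty cells are "".
num_requests = 50       # |R|, assumed instance size
num_crowdsourcees = 22  # |W|

info_array = np.full((num_crowdsourcees, 2 * num_requests + 1), "", dtype=object)

# Initially every row contains only the crowdsourcee's origin node
# (all requests start out assigned to backup vehicles).
for w in range(num_crowdsourcees):
    info_array[w, 0] = f"o{w}"

def set_route(w, node_sequence):
    """Overwrite row w with a routing sequence, padding the rest with empty cells."""
    info_array[w, :] = ""
    info_array[w, :len(node_sequence)] = node_sequence

# Example from the text: crowdsourcee 3 leaves its origin, picks up requests 1 and 2,
# then delivers request 1 followed by request 2.
set_route(3, ["o3", "p1", "p2", "d1", "d2"])
print(info_array[3, :6])   # ['o3' 'p1' 'p2' 'd1' 'd2' '']
```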
The information array is constantly updated after every time step. At the beginning, we assume that all requests are assigned to backup vehicles. Thus, initially each row in the information array contains only the origin node of a crowdsourcee.

The information array provides a foundation for specifying the state space. At each time step $t$, the state of the crowdshipping environment is described by a three-tuple $s_t = \{X_t, Y_t, Z_t\}$ which provides respectively: 1) location information of pickup and delivery nodes of requests and crowdsourcee routing sequences ($X_t$); 2) request-specific time information ($Y_t$); and 3) crowdsourcee-specific time and capacity information ($Z_t$). With $\{X_t, Y_t, Z_t\}$, the agent not only has a complete picture of the crowdsourcee routing sequences, but can also leverage the time-related information to perform heuristics-guided actions, as described in subsection 3.2.3. The first component in the three-tuple, $X_t$, collects the following location information: $x_n$, the coordinate of node $n$; $x_{succ(p_i)}$, the coordinate of the successor node of the pickup node $p_i$ of request $i$ if $i$ is assigned; $x_{pred(d_i)}$, the coordinate of the predecessor node of the delivery node $d_i$ of request $i$ if $i$ is assigned; and $x_{first(w)}$, the coordinate of the first node visited by crowdsourcee $w$ if the crowdsourcee is assigned.

For a request $i$, the slack time $sl_i$ measures how urgently it needs to be assigned:

$sl_i = (b_i - e_i - t_{p_i d_i})(1 - z_i) + \mathcal{M} z_i$    (6)

where $b_i$ is the latest delivery time for request $i$; $e_i$ is the earliest pickup time for request $i$; $t_{p_i d_i}$ is the direct travel time by a crowdsourcee from pickup node $p_i$ to delivery node $d_i$; $\mathcal{M}$ is a very large number; and $z_i$ is a binary variable indicating whether request $i$ is assigned to a crowdsourcee or not. For an unassigned request $i$, its urgency is the difference between the largest amount of time allowed for pickup and delivery ($b_i - e_i$) and the minimum amount of time needed to do so by a crowdsourcee ($t_{p_i d_i}$). The larger the difference, the lower the urgency with which the request needs to be assigned. For an assigned request, a very large number $\mathcal{M}$ is given, which means that its urgency is effectively zero (as it is already assigned). Using this urgency measure, the agent solving for the assignments can prioritize assigning requests that have not been assigned to crowdsourcees.

The unused service time of a request $i$ ($u_i$) quantifies the gap between the latest delivery time and the actual delivery time $\tau^d_i$ (Eq. (7)):

$u_i = b_i - \tau^d_i$    (7)

Conceptually, a larger $u_i$ means greater flexibility in altering the way the request is picked up and delivered (e.g., by moving the request to a different position in the assigned crowdsourcee route or to a different route). Note that in the case of an unassigned request, the request will be delivered by a backup vehicle which departs from a pre-specified depot 0. Assuming that the backup vehicle leaves the depot at the earliest pickup time $e_i$, the actual delivery time will be $\tau^d_i = e_i + t^b_{0 p_i} + t^b_{p_i d_i}$, where $t^b_{0 p_i}$ and $t^b_{p_i d_i}$ denote respectively the travel time of the backup vehicle from the depot to the pickup node and from the pickup node directly to the delivery node.

The occupation time of a request $i$ ($h_i$) quantifies the duration between pickup and delivery of the request:

$h_i = \tau^d_i - \tau^p_i$    (8)

where $\tau^p_i$ is the pickup time of request $i$ by the assigned crowdsourcee. For an unassigned request, $\tau^p_i$ is equal to $e_i + t^b_{0 p_i}$. Thus, $h_i = t^b_{p_i d_i}$.
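For illustration, the request-specific time information in $Y_t$ (Eqs. (6)-(8)) could be computed as in the following sketch. This is a minimal, assumption-laden example: the data structure and field names are ours, not the paper's.

```python
from dataclasses import dataclass

BIG_M = 1e6  # the "very large number" M used for already-assigned requests

@dataclass
class Request:
    earliest_pickup: float         # e_i (minutes)
    latest_delivery: float         # b_i (minutes)
    direct_time_crowd: float       # t_{p_i d_i}, crowdsourcee travel time pickup -> delivery
    direct_time_backup: float      # t^b_{p_i d_i}, backup-vehicle travel time pickup -> delivery
    depot_to_pickup_backup: float  # t^b_{0 p_i}
    assigned: bool = False
    pickup_time: float = None      # tau^p_i, set once a crowdsourcee route is built
    delivery_time: float = None    # tau^d_i

def slack_time(r: Request) -> float:
    """Eq. (6): urgency of assigning request r; M if already assigned."""
    if r.assigned:
        return BIG_M
    return r.latest_delivery - r.earliest_pickup - r.direct_time_crowd

def actual_delivery_time(r: Request) -> float:
    """Delivery time by the assigned crowdsourcee, or by a backup vehicle if unassigned."""
    if r.assigned:
        return r.delivery_time
    return r.earliest_pickup + r.depot_to_pickup_backup + r.direct_time_backup

def unused_service_time(r: Request) -> float:
    """Eq. (7): gap between latest and actual delivery time."""
    return r.latest_delivery - actual_delivery_time(r)

def occupation_time(r: Request) -> float:
    """Eq. (8): duration between pickup and delivery."""
    pickup = r.pickup_time if r.assigned else r.earliest_pickup + r.depot_to_pickup_backup
    return actual_delivery_time(r) - pickup
```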
The third component in the three-tuple, namely $Z_t$, contains four pieces of crowdsourcee-specific time and capacity information: $D_w$, the routing duration for crowdsourcee $w$; $V_w$, the total delivery time violation of requests assigned to crowdsourcee route $w$; $A_w$, the remaining available time for crowdsourcee $w$; and $U_w$, the total capacity violation along the route of crowdsourcee $w$. The calculation of $D_w$ is intuitive. $V_w$ is calculated using Eq. (9), where $R_w$ denotes the set of requests assigned to crowdsourcee $w$. The max operator ensures that a delivery earlier than the latest delivery time (i.e., $\tau^d_i - b_i \le 0$) does not contribute to the violation:

$V_w = \sum_{i \in R_w} \max(\tau^d_i - b_i, 0)$    (9)

The remaining available time of crowdsourcee $w$, $A_w$, is the difference between the crowdsourcee's total available time ($\bar{g}_w - \underline{g}_w$) and the route duration ($D_w$), as shown in Eq. (10), where $\bar{g}_w$ and $\underline{g}_w$ are the end and start of crowdsourcee $w$'s available time window:

$A_w = (\bar{g}_w - \underline{g}_w) - D_w$    (10)

An underlying assumption is that an assigned crowdsourcee will start routing at $\underline{g}_w$. If the total available time of a crowdsourcee is less than the route duration, then $A_w < 0$, meaning that crowdsourcee $w$'s time availability is violated when finishing the last delivery on the route. Given that a crowdsourcee has limited carrying capacity (measured in weight), the total capacity violation along a crowdsourcee route is the total number of capacity violation occurrences at the pickup nodes:

$U_w = \sum_{i \in R_w} \phi_{p_i}$    (11)

where $\phi_{p_i} = 1$ if the total weight carried right after picking up at node $p_i$ exceeds the carrying capacity, and zero otherwise.

As mentioned in Section 1, the combinatorial nature of the crowdshipping problem means that a very large number of different actions can be taken to construct and improve crowdsourcee routing. However, enumerating all possible actions would be neither efficient nor practical in DRL training. To preserve training tractability, in this paper we abstract the action space into five types of actions. At each time step, the agent may perform one action from the five types to alter existing crowdsourcee route(s) or create a new crowdsourcee route. The first type of action pertains to inserting an unassigned request into an existing or new route. The next three types of actions, intra-route move, inter-route move, and 1-exchange, are neighborhood moves of requests that have been previously placed in some existing crowdsourcee routes. The last action type is do-nothing, i.e., no action is taken. We consider do-nothing as an action for preserving good solutions. Specifically, if a very good solution has been achieved, having the option of do-nothing prevents taking another action that would move away from the solution to an inferior one. Fig. 3 provides an illustration of the first four action types. For an altered or newly created route to be feasible, four conditions must hold: (1) request pickup is no earlier than the earliest pickup time, for all requests on the route: $\tau^p_i \ge e_i, \forall i \in R_w$; (2) request delivery is no later than the latest delivery time, for all requests on the route: $\tau^d_i \le b_i, \forall i \in R_w$; (3) the remaining available time of the crowdsourcee after completing the route is non-negative: $A_w \ge 0$; (4) no violation of crowdsourcee capacity occurs on the route: $U_w = 0$.

The remainder of this subsection describes, after an action type is chosen (by the DQN algorithm), how the specific action to take is identified based on system state information and intuitive reasoning. For insertion, we need to determine which request to choose for insertion and where to insert it. The action consists of three steps. Step 1: Select a request. Among the unassigned requests, select the one with the smallest slack time. Step 2: Insert the request into a route.
For the selected request, calculate the distances between the pickup node of the request and each crowdsourcee. For an assigned crowdsourcee, the distance is to the end of the crowdsourcee route. For an unassigned crowdsourcee, the distance is to the crowdsourcee origin. Identify the smallest distance. If the smallest distance occurs for an assigned crowdsourcee, insert the request at the end of that crowdsourcee's route. If the smallest distance occurs for an idle crowdsourcee, create a new route: crowdsourcee origin → request pickup node → request delivery node. Check routing feasibility. If not feasible, move to the crowdsourcee with the second smallest distance and check feasibility. If it is not possible to feasibly insert a request, then move to the request with the second smallest slack time, and so on, until a successful insertion. Step 3: Perform intra-route move. If the request is inserted into an existing crowdsourcee route, explore moving the pickup and delivery nodes of the request to earlier positions in the route to reduce routing cost. The move follows Step 2 of intra-route move below. Place the request at the position that leads to the smallest routing cost. In Step 1, the rationale for considering the unassigned request with the smallest slack time, the information of which comes from $Y_t$ (the second component in the three-tuple state representation), is that we want to get the most urgent unassigned request assigned first. In Step 2, we perform insertion to the nearest crowdsourcee as this incurs the smallest time loss between the crowdsourcee finishing the currently assigned requests and picking up the request under study.

Intra-route move involves moving a later request to an earlier position in a route to reduce routing cost. The action also consists of three steps. Step 1: Select a route. Select the crowdsourcee route with the largest remaining available time. Step 2: Move examination. Enumerate all feasible moves of a request to an earlier place. To illustrate, consider the routing sequence $(o_w, p_1, d_1, p_2, d_2, \dots, p_j, d_j, p_k, d_k)$ and moving request $k$. Move $p_k$ to an earlier position, one place at a time, i.e., to the place right before $d_j$, right before $p_j$, and so on, until right after $o_w$. For each new position of $p_k$, examine the feasibility of holding $d_k$ at its initial place, then moving it one place at a time to an earlier position, as long as $d_k$ is not before $p_k$. For each feasible $(p_k, d_k)$ move, calculate the routing cost. Repeat the above move for every request in the route. Step 3: Identify the best move. Among all the feasible moves in Step 2, pick the one with the smallest routing cost. If that routing cost is smaller than the original routing cost, perform the move. In Step 1, the rationale for considering the route with the largest remaining available time, for which the information comes from $A_w$ in $Z_t$, is that such a route has the greatest flexibility for moving requests around.

Inter-route move picks a request from a route and moves it to another route by performing the following three steps. Step 1: Select a request. Among all assigned requests, select the one with the largest occupation time. Step 2: Move the request to the end of a different route. Insert the request into another existing route or create a new route, following Step 2 of insertion. Step 3: Perform intra-route move of the request, following Step 3 of insertion. In Step 1, the rationale for considering the request with the largest occupation time, for which the information comes from $h_i$ in $Y_t$, is that a larger occupation time may suggest greater time (and thus cost) reduction potential by moving the request to a different route.
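The feasibility conditions and the insertion heuristic's selection order can be sketched as follows. This is a simplified illustration under our own assumptions: it reuses slack_time from the earlier request-feature sketch, and route evaluation, the distance function, and the crowdsourcee helpers (last_node, route_with_appended) are assumed stand-ins rather than code from the paper.

```python
# Simplified sketch of the route feasibility check and of the insertion
# heuristic's selection order (smallest slack time first, nearest crowdsourcee
# first). evaluate_route() is assumed to return the per-request pickup/delivery
# times, the remaining available time A_w, and the capacity-violation count U_w.

def route_is_feasible(route_eval) -> bool:
    """The four feasibility conditions for an altered or new route."""
    return (all(t_p >= e for t_p, e in route_eval.pickups_vs_earliest) and
            all(t_d <= b for t_d, b in route_eval.deliveries_vs_latest) and
            route_eval.remaining_available_time >= 0 and
            route_eval.capacity_violations == 0)

def choose_insertion(unassigned_requests, crowdsourcees, evaluate_route, distance):
    """Try requests in increasing slack-time order; for each, try crowdsourcees
    in increasing distance order; return the first feasible insertion found."""
    for req in sorted(unassigned_requests, key=slack_time):
        ranked = sorted(crowdsourcees,
                        key=lambda w: distance(w.last_node, req.pickup_node))
        for w in ranked:
            candidate = w.route_with_appended(req)   # assumed helper: append pickup+delivery
            route_eval = evaluate_route(candidate)
            if route_is_feasible(route_eval):
                return req, w, candidate
    return None  # no feasible insertion exists
```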
1-exchange pertains to exchanging two requests which are on two different crowdsourcee routes. Performing the move has four steps. Step 1: Select the first request. Among all assigned requests, select the request that has the largest unused service time. Step 2: Select the second request. Excluding the route associated with the first selected request, select the request that has the largest unused service time among the remaining assigned requests. Step 3: Exchange the selected requests. Remove the two requests from their routes and add each request to the end of the other route. Step 4: Perform intra-route move of the two requests, following Step 3 of insertion. In Steps 1 and 2, the rationale for choosing requests with the largest unused service time, for which the information comes from $u_i$ in $Y_t$, is that such requests have the greatest flexibility to be moved around. It should be noted that, unlike the other three actions, we do not require feasibility to be met in performing 1-exchange (so strictly speaking, the intra-route move in Step 4 here is a modified version of Step 3 of insertion, without checking feasibility). This is intentional, to help the search escape local optima (Nanry and Barnes, 2000).

Given the state and the action at a time step, we specify the reward as the change in Total Shipping Cost (TSC) as a result of the action taken. If the action taken at time step $t$ is inserting request $i$ into crowdsourcee route $w$, the reward is computed as:

$r_t = c_w D_{w,t-1} + c_b (t^b_{0 p_i} + t^b_{p_i d_i} + t^b_{d_i 0}) - c_w D_{w,t}$    (12)

where $c_b$ is the unit cost of using a backup vehicle (in $/minute); $c_w$ is the unit cost of using a crowdsourcee (in $/minute); $D_{w,t-1}$ is the route duration (in minutes) of crowdsourcee route $w$ at time step $t-1$; $D_{w,t}$ is the route duration (in minutes) of crowdsourcee route $w$ at time step $t$; and $t^b_{d_i 0}$ is the backup vehicle travel time from the delivery node of request $i$ back to depot 0. In Eq. (12), $c_w D_{w,t-1}$ is the cost of crowdsourcee route $w$ at time step $t-1$, which is before request $i$ is inserted. If the route does not exist before inserting $i$, this term will be zero. $c_b (t^b_{0 p_i} + t^b_{p_i d_i} + t^b_{d_i 0})$ is the cost of picking up and delivering the request by a backup vehicle. $c_w D_{w,t}$ is the cost of the crowdsourcee route at time step $t$, after request $i$ is inserted.

If the action taken at time step $t$ is a neighborhood move, let us use $W'$ to denote the set of crowdsourcee route(s) involved in the move. For intra-route move, $W'$ will have just one route. For inter-route move and 1-exchange, $W'$ will have two routes. The reward is calculated as the difference of the routing costs before and after the move:

$r_t = \Gamma_{t-1} - \Gamma_t$    (13)

where $\Gamma_{t-1}$ and $\Gamma_t$ are the routing costs for the route(s) in $W'$ before and after the neighborhood move:

$\Gamma_{t-1} = \sum_{w \in W'} \big( c_w D_{w,t-1} + \rho_1 V_{w,t-1} + \rho_2 |O_{w,t-1}| + \rho_3 \kappa U_{w,t-1} \big)$    (14)

$\Gamma_t = \sum_{w \in W'} \big( c_w D_{w,t} + \rho_1 V_{w,t} + \rho_2 |O_{w,t}| + \rho_3 \kappa U_{w,t} \big)$    (15)

where $D_{w,t}$ is the route duration for crowdsourcee $w$ at time step $t$; $V_{w,t}$ is the delivery time violation of requests assigned to crowdsourcee route $w$ at time step $t$; $O_{w,t}$ is the violation of available time for crowdsourcee $w$ at time step $t$, $O_{w,t} = \min(A_{w,t}, 0)$; $U_{w,t}$ is the carrying capacity violation for crowdsourcee $w$ at time step $t$; $\rho_1$ is the penalty for delivery time violation; $\rho_2$ is the penalty for crowdsourcee overworking; $\rho_3$ is the penalty for crowdsourcee carrying capacity violation; and $\kappa$ is the capacity violation-to-time conversion factor.
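As an illustration of Eqs. (12)-(15), the reward could be computed as in the sketch below. The unit costs are the values reported in the experimental setup of Section 4; the penalty weights and the conversion factor are placeholders of our own, since their values are not stated in the text.

```python
# Illustrative reward computation following Eqs. (12)-(15).
C_BACKUP = 1.13   # c_b, $/minute (value used in Section 4)
C_CROWD = 0.17    # c_w, $/minute (value used in Section 4)
PEN_LATE, PEN_OVERWORK, PEN_CAPACITY, CAP_TO_TIME = 10.0, 10.0, 10.0, 1.0  # placeholders

def insertion_reward(route_duration_before, route_duration_after,
                     backup_depot_to_pickup, backup_pickup_to_delivery,
                     backup_delivery_to_depot):
    """Eq. (12): old route cost plus avoided backup cost, minus new route cost."""
    avoided_backup = C_BACKUP * (backup_depot_to_pickup +
                                 backup_pickup_to_delivery +
                                 backup_delivery_to_depot)
    return C_CROWD * route_duration_before + avoided_backup \
        - C_CROWD * route_duration_after

def route_cost(duration, late_violation, remaining_time, capacity_violations):
    """Eqs. (14)-(15): crowdsourcee time cost plus penalty terms."""
    overwork = max(-remaining_time, 0.0)   # |min(A_w, 0)|
    return (C_CROWD * duration
            + PEN_LATE * late_violation
            + PEN_OVERWORK * overwork
            + PEN_CAPACITY * CAP_TO_TIME * capacity_violations)

def move_reward(routes_before, routes_after):
    """Eq. (13): cost of the involved route(s) before minus after the move."""
    return sum(route_cost(*r) for r in routes_before) - \
           sum(route_cost(*r) for r in routes_after)
```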
This subsection describes how the DQN algorithm is adapted to the crowdshipping context. In DQN, the training of the agent proceeds through multiple episodes. Each episode is associated with a different crowdshipping problem instance of a certain size, which is randomly generated and starts with an initial state in which all crowdsourcees are idle (unassigned). Training in an episode involves improving the solution to the problem by taking the actions described in subsection 3.2.3, one at a time over a number of time steps. At each time step, an $\epsilon$-greedy strategy is employed to balance exploration and exploitation as the agent decides what type of action to take among insertion, intra-route move, inter-route move, 1-exchange, and do-nothing. By exploration, we mean that the agent takes a random action type, with probability $\epsilon$. By exploitation, the agent takes the one of the five action types above that is the best based on the experiences the agent has learned so far (reflected in the current Q-values, as shown in line 7 in Algorithm 1 at the end of this subsection), with probability $1 - \epsilon$. Once the best action type is chosen, the specific action follows the heuristics described in subsection 3.2.3 (line 8 in Algorithm 1). Consequently, a reward and a new state are observed. While exploitation takes advantage of what has been learned in terms of the best action to take, exploration is necessary to get the agent out of local optima toward even better action sequences, to further reduce total shipping cost. At the beginning of an episode, $\epsilon$ takes value 1, i.e., the focus is purely on exploration, which is intuitive as the agent has zero learned experience (thus nothing to exploit) at this point. Then, as time goes by, the agent gradually increases the probability of exploiting learned actions. A decay rate $\lambda$, a hyperparameter, describes the change in $\epsilon$ between two consecutive time steps (Eq. (16)).

One salient feature of DQN is experience replay, for which a replay memory $M$ is used to store the agent's experiences during training. Up to $|M|$ experiences can be stored in the replay memory. An experience is associated with taking an action at a given state and time step, observing a state transition, and getting a reward. For example, at time step $t$, the agent performs an action $a_t$ which transforms the state from $s_t$ to $s_{t+1}$ and yields a reward $r_t$. The experience is denoted as $e_t = (s_t, a_t, r_t, s_{t+1})$. At the beginning of the training, $M$ is empty. As the training continues, experiences are accumulated and added to replay memory $M$. Once $|M|$ experiences are stored in $M$, adding a new experience requires simultaneous removal of the oldest experience stored in $M$. At each time step, a DNN is trained using a minibatch of samples that are randomly selected from $M$. Note that at the beginning of the training, the number of accumulated experiences in $M$ will be small. In this case, experiences will continuously be accumulated in $M$ but the DNN will not be trained, until the replay memory has accumulated the required number of experiences. The employment of experience replay using randomly selected minibatch samples has multiple advantages. First, because the samples are randomly selected, correlation between samples will be lower compared to learning directly from consecutive samples, thereby enhancing the efficiency of learning. Second, each experience can potentially be used in many weight updates, thus allowing for greater data efficiency. Third, with experience replay the behavior distribution is averaged over many previous states, which contributes to smoothing out learning and avoiding oscillation or divergence in the parameters (Mnih et al., 2015).
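The $\epsilon$-greedy choice over action types and the replay memory could be organized as in the following sketch. This is a minimal illustration of our own; in particular, the multiplicative decay form and the numeric defaults are assumptions, since Eq. (16) is only described as a decay-rate hyperparameter governing the change in $\epsilon$ between time steps.

```python
import random
from collections import deque

ACTION_TYPES = ["insertion", "intra_route", "inter_route", "one_exchange", "do_nothing"]

class EpsilonGreedy:
    """epsilon starts at 1 (pure exploration) and decays toward exploitation.
    A simple multiplicative decay is assumed here for illustration."""
    def __init__(self, decay_rate=0.999, eps_min=0.05):
        self.eps, self.decay_rate, self.eps_min = 1.0, decay_rate, eps_min

    def choose(self, q_values):
        # q_values: dict mapping action type -> current Q estimate for the state
        action = (random.choice(ACTION_TYPES) if random.random() < self.eps
                  else max(q_values, key=q_values.get))
        self.eps = max(self.eps_min, self.eps * self.decay_rate)
        return action

class ReplayMemory:
    """Fixed-capacity memory; adding beyond capacity evicts the oldest experience."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```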
For each selected experience $(s_t, a_t, r_t, s_{t+1})$, state $s_t$ is used as the input to the DNN (with weight parameters $\theta$) to generate the state-action value $Q(s_t, a_t; \theta)$, or Q-value, which is the output of the DNN. Collectively for all the selected experiences, the prediction of the $Q(s_t, a_t; \theta)$'s comprises the first forward pass. $Q(s_t, a_t; \theta)$ is then compared with the target optimal Q-value $q_*(s_t, a_t)$, which gives the maximum expected return achievable by following any DQN policy. Ideally, the target optimal Q-value should satisfy the Bellman optimality equation:

$q_*(s_t, a_t) = r_t + \gamma \max_{a'} q_*(s_{t+1}, a')$    (17)

where $r_t$ is the immediate reward from taking action $a_t$ at state $s_t$. The comparison of $Q(s_t, a_t; \theta)$ with $q_*(s_t, a_t)$ is performed using a loss function. Assuming a square form for the loss function and replacing $q_*(s_t, a_t)$ by the right-hand side of Eq. (17), with the maximization term evaluated by a second forward pass through the target network (with weight parameters $\theta^-$), the loss function $\mathcal{L}$, which depends on the DNN weight parameters $\theta$, can be expressed as:

$\mathcal{L}(\theta) = \mathbb{E}\big[ \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \big)^2 \big]$    (18)

After performing the two forward passes as described above, the gradient of the loss in Eq. (18) with respect to the weight parameters $\theta$ is computed by backpropagation and $\theta$ is updated by gradient descent, while the target network weights $\theta^-$ are held fixed and periodically refreshed by copying $\theta$ at the target network update frequency. In implementing DQN, we use a slightly modified version of the squared loss function called the Huber loss function. For each sample, the squared term is used only if the absolute error falls below a threshold (here we choose the value 1). Otherwise, we use an absolute term, as shown in Eq. (21). Denoting the error of a sample as $\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)$, the per-sample loss is:

$\mathcal{L}_t(\theta) = \tfrac{1}{2}\delta_t^2$ if $|\delta_t| \le 1$, and $\mathcal{L}_t(\theta) = |\delta_t| - \tfrac{1}{2}$ otherwise    (21)

An advantage of the Huber function form is that the loss is less sensitive to outliers than the squared loss for large errors, which prevents exploding gradients. Finally, it should be mentioned that during an episode we also accumulate the rewards that are negative. If the accumulated negative reward in an episode falls below a threshold, then training in that episode is perceived as not promising and consequently terminates. Summarizing, the overall learning algorithm is presented in Algorithm 1 and illustrated in Fig. 4 below.
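To make the two forward passes, the Huber loss, and the target network concrete, here is a minimal PyTorch sketch of one training update. The network architecture, dimensions, and hyperparameter values are our assumptions, not the paper's; only the overall mechanics follow Eqs. (17), (18), and (21) above.

```python
import torch
import torch.nn as nn

NUM_ACTION_TYPES = 5  # insertion, intra-route, inter-route, 1-exchange, do-nothing

class QNetwork(nn.Module):
    """A small fully connected network; the paper's actual architecture is not assumed here."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, NUM_ACTION_TYPES))
    def forward(self, x):
        return self.net(x)

def train_step(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update: two forward passes, Huber loss (threshold 1), gradient step."""
    states, actions, rewards, next_states = batch          # tensors; actions is int64
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # second forward pass (target net)
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next                 # one-step target, Eq. (17)
    loss = nn.SmoothL1Loss()(q_pred, q_target)              # Huber loss, Eq. (21)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(policy_net, target_net):
    """Periodic hard update of the target network weights."""
    target_net.load_state_dict(policy_net.state_dict())
```

In use, one would sample a minibatch from the replay memory above, convert it to tensors, call train_step at each time step, and call sync_target every fixed number of time steps (the target network update frequency listed in Table 4).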
Whether in DRL training or in implementation of the trained policy, it is possible that some routes or node sequences are repeatedly visited during neighborhood moves. This reduces the efficiency of DRL training, as well as the efficiency of the search for the best crowdsourcee-request assignment outcome when a trained policy is applied to solve a problem instance (note that after DRL training is done, at a given state the optimal Q-value only provides what type of action to take, i.e., $a^* = \arg\max_{a \in \mathcal{A}} q_*(s, a)$, where $q_*$ denotes the optimal Q-value). In this subsection, we propose two rules that aim to prevent such repeated visiting of routes and node sequences, by excluding a previously visited route or node sequence from being considered again for a number of subsequent actions. In what follows, the first rule focuses on routes; the second rule focuses on node sequences.

To avoid actions being repeatedly exerted on one or a subset of crowdsourcee routes, the first rule relies on the construction and use of three priority lists of crowdsourcee routes, with each list corresponding to one of the three neighborhood move action types (intra-route move, inter-route move, and 1-exchange) described in subsection 3.2.3. Specifically, when a neighborhood move action type is chosen, we pick the crowdsourcee route(s) from the top of the corresponding priority list to apply the action type. After the specific action is taken on the route(s), the route(s) are removed from the list. Thus, over time, routes will be continuously picked and removed from the priority list. The priority list will be shortened and eventually become empty. Then, we construct a new priority list of all the crowdsourcee routes for the same action type. By doing so, during the life cycle of a priority list, a route is considered only once for the associated action type. This allows more exploration of the same action type on other routes. This construction-destruction of priority lists repeats throughout the training and the implementation of a trained DRL to solve a problem instance. Each of the three priority lists is constructed based on some criterion. For intra-route move, the priority list is constructed by sorting crowdsourcee routes in descending order based on the crowdsourcee's remaining available time, which is consistent with the rationale of Step 1 of the intra-route move heuristic in subsection 3.2.3. For inter-route move, the priority list is constructed by sorting crowdsourcee routes in descending order based on occupation time. After an inter-route move action is taken, that route is removed from the priority list. The occupation time of the route to which the request is moved will be updated. The position of that route in the priority list will also be updated, for which the computational complexity is $O(\log n)$ based on binary search, with $n$ being the number of crowdsourcee routes in the priority list. For 1-exchange, the priority list is constructed by sorting crowdsourcee routes in descending order based on unused service time, which is consistent with Steps 1 and 2 in subsection 3.2.3.4. Thus, the selection of the first and the second requests will be from the two routes with the highest and second highest priority, respectively. After a 1-exchange action is taken, the two routes will be removed from the priority list.

The second interposing rule is that, after a request node (either pickup or delivery) is moved away from a neighboring node (either right before or right after it in the routing sequence) on a crowdsourcee route, the former node cannot be moved back to the same position relative to the latter node over a certain number of subsequent actions. This latter node can be of a different request or of the same request (i.e., the former node is the pickup node of a request and the latter node is the delivery node of the same request). Similar to Rule 1, this rule applies to the three types of neighborhood moves (intra-route move, inter-route move, and 1-exchange). For each type of neighborhood move, a Tabu tenure is created to record for how many subsequent actions a request node cannot be neighbored with another node. Similar to Rule 1, Tabu tenure allows neighborhood moves to explore more routing sequences, rather than getting trapped in routing sequences that have already been explored and are only locally optimal. To operationalize Tabu tenure, two matrices are created and maintained. The first matrix, of dimension $2|R| \times 2|R|$, indicates whether a node (indexed by the column; there are in total $|R|$ requests and thus $2|R|$ pickup and delivery nodes) preceding another node (indexed by the row) is Tabu-ed and for how long. The second matrix, of dimension $(|W| + 2|R|) \times 2|R|$, indicates whether a node (indexed by the column) following another node (indexed by the row) is Tabu-ed and for how long. The second matrix has $|W|$ more rows, which correspond to the origins of the $|W|$ crowdsourcees, as a pickup node can be placed right after the origin of a crowdsourcee. These two matrices are updated whenever an action is taken. Note that if a Tabu-ed position yields a solution that is better than the best solution obtained so far for a problem, then the Tabu tenure will be overridden.

This section illustrates the numerical implementation of the DQN algorithm described in Section 3, for two problem sizes: one of a medium size with 50 requests and 22 crowdsourcees, and the other of a larger size with 200 requests and 70 crowdsourcees.
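Before turning to the experiments, the bookkeeping behind the two rule-interposing mechanisms just described can be sketched as follows. This is a minimal illustration of our own: the tenure length, the aspiration check, and all names are assumptions rather than values or code from the paper.

```python
import numpy as np

def build_priority_list(routes, key, descending=True):
    """Rule 1: sort routes by the criterion of the corresponding neighborhood move
    (remaining available time, occupation time, or unused service time)."""
    return sorted(routes, key=key, reverse=descending)

def pop_top(priority_list):
    """Take the highest-priority route and remove it; rebuild the list once empty."""
    return priority_list.pop(0) if priority_list else None

class TabuTenure:
    """Rule 2: forbid re-neighboring two nodes for `tenure` subsequent actions.
    One matrix forbids node c preceding node r; a second (with |W| extra rows for
    crowdsourcee origins) forbids node c following node r. The tenure length is a
    placeholder, not a value reported in the paper."""
    def __init__(self, num_requests, num_crowdsourcees, tenure=20):
        n = 2 * num_requests
        self.tenure = tenure
        self.precede = np.zeros((n, n), dtype=int)
        self.follow = np.zeros((num_crowdsourcees + n, n), dtype=int)

    def forbid(self, matrix, row, col):
        matrix[row, col] = self.tenure

    def is_tabu(self, matrix, row, col, improves_best=False):
        # Aspiration criterion: a Tabu-ed position is allowed if it beats the
        # best solution found so far.
        return matrix[row, col] > 0 and not improves_best

    def step(self):
        """Decrement all active tenures after each action."""
        self.precede = np.maximum(self.precede - 1, 0)
        self.follow = np.maximum(self.follow - 1, 0)
```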
We present and discuss the results for mediumsize problems in great detail, including problem setup, training results, assessment of the benefits of heuristics-guided action choice and rule-interposing, comparison with three popular heuristic methods, and results sensitivity to key hyperparameters. For the larger-size problems, for paper length considerations we report more briefly the results of implanting the trained DRL model to 20 different problem instances in terms of total shipping cost and computation time, in comparison with the three heuristics. The DQN algorithm is coded and trained in the PyTorch environment. All numerical investigations are conducted on a personal computer with Intel Core i9-10920X CPUs at 3.50GHz and 128GB RAM and NVIDIA Titan RTX GPUs. As mentioned above, we consider a static problem of assigning 50 requests to 22 crowdsourcees. The service area has a square shape of 6 miles × 6 miles. The pickup and delivery locations of each request is randomly generated in the service area. So are the origins of the crowdsourcees. We assume that crowdsourcees bike to perform pickup and delivery at a speed of 10 mph. The carrying capacity of a crowdsourcee is 10 lbs. The available time of a crowdsourcee is randomly drawn from a uniform distribution of 1-2 hours. The weight of a shipping request is also randomly drawn from a uniform distribution of 2-7 lbs. The early pickup time of all requests is the present time. The latest delivery time of a request is randomly drawn from a uniform distribution of 100-120 minutes. If a request is not assigned to any crowdsourcees, it will be picked up and delivered by a backup vehicle which leaves a depot located at the center of the service area and returns to the depot after finishing the delivery. Given the small weight of a request relative to the typical carrying capacity of a backup vehicle, capacity constraints are not considered for backup vehicles. We assume backup vehicles travel at a speed of 20 mph. We follow Kafle et al. (2017) by setting the operating cost of a backup vehicle to be $68/hour ($1.13/minute) and the pay rate for crowdsourcees to be $10/hour ($0.17/minute), which is considerably cheaper. Crowdsourcees get paid whenever carrying requests. time steps and tends to stabilize afterwards. In Fig. 5(b) , the average Q-value keeps improving till after 30,000 time steps. Before that, updates in the DNN considerably improve the DQN algorithm which yields better solutions than before each update. As a result, the average Q-value keeps improving. Fig. 5 (b) also shows a magnifier of the first 5,000 time steps. It is interesting to observe step-wise jumps almost every 400 time steps, which is the target network update frequency (see Table 4 ). In other words, whenever an update of the target network occurs, it improves the average Q-value. The To further show the effectiveness of the DQN algorithm training, we apply the DQN algorithm throughout its training to three randomly generated problem instances of the same size (50 requests and 22 crowdsourcees). Fig. 6 shows the TSC results when applying the DQN algorithm with the most up-to-date DNN weight parameters every 2,500 time steps. It can be seen that TSC will be drastically reduced after the first few hundreds of time steps. For example, for problem instance 1 TSC reduces by more than three-quarters from 1,200 to less than 300. Afterwards, the improvement in TSC is more incremental with some rebounds. Consistent with Fig. 
5(d) , after 40,000 time steps, TSC becomes very stable across all three problem instances. Fig. 6 . Evolution of total shipping cost during training Recall that one novelty of our proposed DRL algorithm lies in the embedment of heuristics-guided action choice in DRL. At each time step, the DRL agent performs one of the five types of actions to create new or change existing crowdsourcee routes. To compare the proposed DRL algorithm with a DRL algorithm without heuristics-guided action choice, a neighborhood move will be randomly chosen given any of the first four action types, as described in Appendix. Similar to what we do in Fig. 6 , we apply the DQN algorithm throughout its training to two randomly generated problem instances and present the TSC results using the most up-to-date DNN weight parameters every 2,500 time steps. For each problem instance, we train the DQN algorithm twice, one with heuristics-guided action choice and the other without. The results are shown in Fig. 7 . Across the 20 problem instances, the average TSC reduction is 15.8% with a standard deviation of 7.5%. The largest reduction, which occurs to problem instance 6, is 28.9%. In this subsection we evaluate the benefits of another novelty of the proposed DRL algorithmthe integration of rule-interposing into DRL training and implementation. As in Fig. 7 , we apply the DQN algorithm throughout its training to the same two randomly generated problem instances, and present the TSC results using the most up-to-date DNN weight parameters every 2,500 time steps. For each problem instance, we train the DQN algorithm twice, one with rule-interposing and the other without. The results are shown in Fig. 10 . For the first problem instance (in blue), although the TSC without the two rules appears to be diminishing at the beginning of the training, the TSC value rebounds after 15,000 time steps, then declines and meets the TSC curve when the two rules are used at around 32,500 time steps. Afterwards, the TSC curve without the two rules experiences some fluctuations and remains well above the TSC curve with the two rules. For the second problem instance (in red), the advantage of rule-interposing is even more visible since the convergence to the lowest stable TSC is much sooner. At the end of the training, a substantial TSC gap remains: the final TSC without the two rules is around 26.9% higher than the final TSC with the two rules. Overall, the results clearly demonstrate the advantage of ruleinterposing in DQN training. Finally, we investigate the sensitivity of DQN training to the values of four hyperparameters: (a) decay rate ; (b) learning rate ; (c) discount factor ; (d) target network update frequency . Fig. 15 presents the results. In each graph in Fig. 15 , a curve corresponds to a specific value of the hyperparameter under investigation and is obtained in a similar fashion as the curves in Fig. 6 , for a randomly generated problem instance. For a given graph, the three other hyperparameters not investigated take their values in Table 4 . While Fig. 15 reports TSC values of one problem instance, we have also experimented with many other randomly generated problem instances and found consistent results. It can be seen that, for all graphs in Fig. 15 Similar to subsection 4.1.5, this subsection compares the performance of the DRL-based approach with the three heuristics (simple heuristic, RTS, and SA). 20 problem instances with 200 requests and 70 crowdsourcees are randomly generated. Fig. 
16 shows that DRL yields the best solution in 18 out of the 20 instances. As in subsection 4.1.5, the solutions from the simple heuristic are always the worst, despite small computation time (Fig. 17) . While the resulting TSC values from RTS and SA are closer to those from DRL, the computation time is much longer (around 15-20 minutes, as opposed to 6-10 seconds using DRL). By comparing the computation time change from the medium-size problem (Fig. 14) , it is clear that DRL is much more scalable than RTS or SA for solving the crowdshipping problems in this paper. Crowdshipping has gained increasing popularity for urban delivery given the low cost of hiring ad hoc couriers to perform pickups and deliveries. In this paper, we propose a novel, deep reinforcement learning-based approach to seek high-quality and computationally efficient assignment of requests to crowdsourcees. In performing the assignment, we consider that requests have time windows for pickup and delivery. In addition, crowdsourcees have limited time availability and carrying capacity. The novelty of the proposed DRL approach lies in its new characterization of system states, the embedment of heuristics-guided action choice, and the integration of rule-interposing into DRL training and implementation. The effectiveness of the approach is demonstrated through extensive numerical analysis. The results show the benefits brought by the heuristics-guided action choice and rule-interposing in DRL training, and the superiority of the proposed approach over existing heuristics in both solution quality, time, and scalability. With its comprehensive and detailed specifications of states, actions, and rewards, the proposed approach not only has the potential to improve the efficiency of crowdshipping operation planning, but provides a new avenue and generic framework that could be adopted to other pickup and delivery problems and vehicle routing contexts. For example, another type of crowdshipping with all requests originating from a central location (depot) can be viewed as a special case of the problems investigated in this paper. Also, while we consider dedicated crowdsourcees in the paper, the proposed DRL-based approach can be conveniently adapted to the context of opportunistic crowdsourcees given the origin and destination of the original trip of each crowdsourcee. For possible extension of the proposed approach, a few directions are suggested. First, future efforts could be made to investigate a dynamic version of the problem. In this case, the choice of actions should not only consider the present situation but also look ahead. Second, in the real world the pickup and delivery locations of shipping requests are usually in different spatial distributions (e.g., the locations of restaurants/retail stores in a city may be quite different from the locations of residential buildings), which gives rise to the need for proactively relocating idle crowdsourcees to balance the spatial distribution of crowdsourcee supply and request pickup demand. It will be interesting to explore how to incorporate relocation decisions in the DRL framework. Third, some behavioral aspects, e.g., a crowdsourcee rejects an assigned request, could be added to further enrich the flexibility of the DRL model. Step 1: Select a request. Among the unassigned requests, randomly select an unassigned request. Step 2: Insert the request to a route. Insert the request to the end of a randomly picked crowdsourcee route (which can be an existing or a new route). 
Intra-route move.
Step 1: Select a route. Select the crowdsourcee route with the largest remaining available time (based on Rule 1 in subsection 3.4.1).
Step 2: Move a request to a different position on the same route. Randomly pick a request from the route. Enumerate all feasible repositionings of the pickup and delivery nodes of the request on the route, and pick the one with the largest cost reduction. If no such move exists, randomly pick another request and repeat. If no improving move can be found after enumerating all requests on the route, then do nothing. (A minimal sketch of this move is given after this appendix.)

Inter-route move.
Step 1: Select a route. Select the crowdsourcee route with the largest occupation time (based on Rule 1 in subsection 3.4.1).
Step 2: Move a request to the end of a different route. Randomly select a request from the route and investigate moving it to the end of a different, randomly picked route. If the move is feasible, perform it; otherwise, randomly pick another route and investigate moving the request to the end of that route. If the request cannot be moved to the end of any other route, then do nothing.

1-exchange.
Step 1: Select two routes. Select the two crowdsourcee routes with the largest and the second largest unused service time (based on Rule 1 in subsection 3.4.1).
Step 2: Select one request from each route and exchange. Randomly select a request from each route and exchange their positions.

Note that for the intra-route move, inter-route move, and 1-exchange, we do not consider Rule 2 of subsection 3.4.2, since that rule is related to heuristics-guided action choice.
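The sketch below illustrates the intra-route move for a single, randomly chosen request under the same assumed route representation as above. The route_cost and is_feasible callables stand in for the paper's routing-cost and feasibility checks and are not part of the original implementation; the enumeration keeps the pickup node ahead of the delivery node and retains the repositioning with the largest cost reduction, if any.

import itertools
import random
from typing import Callable, List, Optional, Tuple

Route = List[Tuple[str, int]]   # e.g. [('p', 3), ('d', 3), ('p', 7), ('d', 7)]

def best_intra_route_move(route: Route,
                          route_cost: Callable[[Route], float],
                          is_feasible: Callable[[Route], bool]) -> Optional[Route]:
    # Return the repositioned route with the largest cost reduction, or None
    # if no request on the route admits a feasible improving move.
    requests = [r for kind, r in route if kind == "p"]
    random.shuffle(requests)                       # requests are tried in random order
    base_cost = route_cost(route)
    for req in requests:
        remainder = [node for node in route if node[1] != req]
        best, best_cost = None, base_cost
        # Enumerate all positions i < j for reinserting the pickup (at i) and the
        # delivery (at j), so the pickup always precedes the delivery.
        for i, j in itertools.combinations(range(len(remainder) + 2), 2):
            trial = list(remainder)
            trial.insert(i, ("p", req))
            trial.insert(j, ("d", req))
            if trial == route or not is_feasible(trial):
                continue
            cost = route_cost(trial)
            if cost < best_cost:                   # keep the move with the largest cost reduction
                best, best_cost = trial, cost
        if best is not None:
            return best                            # improving move found for this request
    return None                                    # no feasible improving move: do nothing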
This research is funded by the National Science Foundation under Grant Number 1663411. The financial support of the National Science Foundation is gratefully acknowledged. An earlier version of the paper was presented at the INFORMS 2020 Annual Meeting. We thank the attendees of the presentation for their helpful feedback.