key: cord-0937028-idrrsz8i authors: nan title: Cross-Regional Customized Bus Route Planning Considering Staggered Commuting During the COVID-19 date: 2021-01-21 journal: IEEE Access DOI: 10.1109/access.2021.3053351 sha: 05692c7a006a44733771a62d6123a527c3d1853e doc_id: 937028 cord_uid: idrrsz8i In order to solve the problem of cross-regional customized bus (CB) route planning during the COVID-19 pandemic, we develop a CB route planning method based on an improved Q-learning algorithm. First, we design a sub-regional route planning approach considering commuters' time windows at pick-up stops and drop-off stops. Second, to find the CB route with the optimal social total travel cost, we improve the traditional Q-learning algorithm, including the state-action pairs, the reward function and the update rule of the Q-value table. Then, a setup method for CB stops is designed and a path impedance function is constructed to obtain the optimal operating path between each pair of stops. Finally, we take three CB lines in Beijing as examples for numerical experiments; the theoretical and numerical results show that (i) compared with the current situation, although the actual operating cost of the optimized route increases slightly, it is covered by the reduction in passengers' travel cost, and the transmission risk of COVID-19 has also dropped significantly; (ii) the improved Q-learning algorithm can solve the problem of data transmission lag effectively and reduce the social total travel cost considerably. In recent years, with the rapid growth of the economy and society, traffic congestion has become more and more serious. Figure 1 is a trend chart of China's automobile numbers from 2015 to 2019. From this figure, it can be seen that the growth rate of automobile numbers remained above 10% from 2015 to 2019 in China.
Compared with other traffic modes such as private cars and taxis, public transportation not only saves infrastructure resources, but also has irreplaceable advantages in terms of passenger transportation capacity, energy saving, and emission reduction. (The associate editor coordinating the review of this manuscript and approving it for publication was Michail Makridis.) The congestion cost of a car in Guangzhou is 1.48 yuan/(person/km), while the congestion cost of a bus is 0.24 yuan/(person/km). For the cost of environmental pollution, the cost of a car is 2.06 yuan/(person/km), while that of a bus is only 0.004 yuan/(person/km) [1], so public transport travel can greatly reduce the cost of social travel. Furthermore, as people's living standards improve, passengers have higher expectations for the comfort of transportation and the convenience of transfers, and the travel needs of residents have become more diverse. Besides, passengers have different travel time windows. Especially during the COVID-19 period, due to the advocacy of the government and enterprises, staggered commuting will become more and more common; therefore, the diversity of passenger demand will become more and more obvious. The contradiction between the single service mode of traditional public transportation and the diversified travel needs of residents has become increasingly prominent, and the development of diversified public transportation is imminent. With continuing urbanization, the phenomenon of ''separation of workplace and residence'' in large cities has become more and more common. Within the six districts of Beijing, only 16.03% of residents are employed near the area where they live, and 26.61% of employees live near the area where they work.
The population is mainly concentrated outside the Fourth Ring Road at 7:00 and mainly concentrated within the Fourth Ring Road at 10:00 due to work; the population distribution at 18:00 and 23:00 is just the opposite, which means the separation of workplace and residence in Beijing is obvious [2]. Commuting between residential and working areas takes a long time and requires multiple transfers. Therefore, a kind of cross-regional CB that travels between residential and work areas with low/no transfer has emerged. CB is an advanced, personalized and flexible demand-responsive public transportation mode, which has the advantages of high reliability and comfort, as well as relative time savings [1], [2]. CB integrates passengers' travel demands and other information, provides services for travelers with similar needs (such as starting and ending points, travel time, etc.), and sets CB stops and operation routes according to passengers' travel needs; only passengers who make a reservation in advance can enjoy the customized bus service. Therefore, passengers' travel demands for customized buses are relatively fixed. Ticket fares are between regular bus and taxi prices, and passengers can enjoy one seat per person, free WIFI, no transfers and other high-quality services that cannot be provided by regular buses, which can attract more passengers and reduce the number of car trips and the traffic burden on the road. Meanwhile, when a major public health emergency such as COVID-19 occurs, public transportation also faces major challenges in blocking the epidemic. Cross-regional CB travel can realize a direct connection between residential areas and work areas without multiple transfers, reducing the risk of disease transmission. With the normalization of epidemic prevention and control, in order to avoid excessive concentration of personnel, many companies have implemented management measures such as ''staggered commuting''.
Staggered commuting stipulates the commuting times of different types of employees; the purpose is to reduce the number of employees arriving and leaving at the same time, thereby reducing the risk of transmission of COVID-19. In our research, staggered commuting is reflected in the different time windows of passengers. Because we consider the travel time cost of all passengers, compared with the normal state, changes in the time windows will lead to different path planning results. Therefore, it is very important to plan a reasonable CB operation route according to passengers' travel time windows so as to reduce the total social travel cost (the sum of the bus operation cost and the passenger travel cost). (Due to the uncertainty of the road traffic state, it is difficult for the CB to depart/arrive at the pick-up/drop-off stops exactly at the times required by passengers. Therefore, a small range of fluctuation in the departure/arrival time of the CB is allowed, which is called the time window. If the departure time submitted by a passenger is 7:00 and a time fluctuation of 10 min is allowed, the time window is [6:50, 7:10].) There is growing research interest in CB route planning. In terms of theoretical research, most existing studies on CB route planning focused on heuristic algorithms [3]-[5]. For route planning, various strategies have been proposed based on different objective functions, such as minimizing operation cost, total delay time of passengers and so on [6]-[8]. Different from the above studies, and considering the time window restrictions of pick-up stops and drop-off stops, this paper focuses on CB route planning based on an improved Q-learning algorithm, which has the advantages of fast convergence and a global optimal solution. Besides, in order to optimize the total time costs of passengers, we design a sub-regional route planning approach considering commuters' time windows of pick-up stops and drop-off stops.
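For concreteness, the time-window construction described above can be sketched in a few lines of Python (an illustrative helper, not part of the proposed method; the function name and the 10-minute default fluctuation are assumptions):

```python
from datetime import datetime, timedelta

def time_window(requested: str, slack_min: int = 10):
    """Build the [earliest, latest] window around a requested time.

    `slack_min` is the allowed fluctuation, e.g. 10 minutes as in the
    example above: a requested 7:00 yields the window [6:50, 7:10].
    """
    t = datetime.strptime(requested, "%H:%M")
    d = timedelta(minutes=slack_min)
    return ((t - d).strftime("%H:%M"), (t + d).strftime("%H:%M"))
```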
Finally, we take three cross-regional CB lines in Beijing as examples for numerical experiments; the results show that the social total travel cost of the route planned by our method is greatly reduced. The remainder of this paper is organized as follows. In Section II, an overview of existing relevant research literature is summarized and the main contributions of this research are introduced. Section III describes the application scenarios in detail. Section IV introduces the improved algorithm design, and then builds the path impedance function based on historical data to obtain the optimal path between two stops. In Section V, we take three CB lines in Beijing as numerical examples. Conclusions, suggestions for future research and some proposals for CB operation are given in Section VI. The main content framework of this study is shown in Figure 2. This study is mainly related to the following two branches of existing literature, and the contributions of this article are listed at the end of this section. CB transit has been studied by some researchers. Liu and Ceder [3] presented a systematic and comprehensive analysis of a CB system. Wang et al. [11] constructed a multi-objective optimization model of CB routes, clearly described the four processes of CB operation in the form of mathematical expressions, and designed a solution method based on the NSGA-II algorithm. Shao et al. [12] established a variable-route bus dispatching bi-level planning model with the goals of the largest number of people served and the smallest passenger travel time, and compared different algorithms through simulation experiments. Aiming at the shortest running time of all emergency CB lines under the constraint that passenger occupancy rate does not exceed a safety threshold, Ma et al. [13] built an optimization model of emergency CB routes in public health emergencies and designed a genetic algorithm to solve the model. Wang et al.
[14] proposed three different survival models to study the mechanism of subscription behavior, among which the shared frailty model that considered unobserved heterogeneity was the most appropriate. Yu et al. [15] introduced a method of generating bus route and stop planning suggestions based on massive demand data: a link network representing the shared paths of the demand is generated from the input; through community detection, the link network is divided into communities with similar travel routes; and by examining the core-periphery structure, the core part of each community is matched with the road network to generate customized bus routes. Hao et al. [16] constructed a complete theoretical framework that considers travel constraints and service quality satisfaction-behavior to explore parents' choices for CB. In order to identify the origin and destination distribution of potential passengers, a spatial clustering algorithm based on pair density was proposed; using this algorithm, a method for extracting potential passengers from ordinary bus passengers based on bus smart card data was introduced [17]. Huang et al. [18] established a two-stage framework and optimized the bus route by establishing three nonlinear programming models. Bie et al. [19] proposed a mixed scheduling method combining all-stop service and stop-skipping service, which optimizes scheduling strategies for multiple routes by minimizing total passenger travel time. Han et al. [20] presented a detailed flow chart of a CB network planning methodology, including individual reservation travel demand data processing and CB line origin-destination (OD) area division considering quantity constraints of demand in areas and distance constraints, based on agglomerative hierarchical clustering. However, the above research did not consider commuters' requirements for the time windows of the drop-off stops.
In fact, commuters are more concerned about the time of arrival at the drop-off stop. In our study, we designed a route planning method for sub-regions considering the requirements of passengers' time windows at pick-up and drop-off stops. Algorithms for vehicle route planning can be divided into two categories: exact algorithms and heuristic algorithms. Exact algorithms include the dynamic programming method, network flow algorithms, the branch and bound method, greedy algorithms, column generation and so on [21]-[23]. Exact algorithms can only solve small-scale problems, and their complexity increases exponentially with the problem scale, so large-scale models can only be solved by heuristic algorithms. At present, heuristic algorithms for vehicle route planning are mainly divided into two categories: population methods and trajectory methods. Population methods mainly include the ant colony algorithm, bee colony algorithm, particle swarm algorithm, genetic algorithm and other intelligent algorithms [24]-[27], while trajectory methods mainly include the simulated annealing algorithm, tabu search algorithm, iterated local search, variable neighborhood search, large neighborhood search and so on [28]-[30]. The studied CB optimization problem is combinatorial and non-deterministic polynomial-time hard (NP-hard) [31], [32]. Lyu et al. [8] used a variety of travel data to optimize the location, route, timetable, and the probability of passengers choosing CB. The above planning methods must fit model parameters from training data, reflect the state of the road network through weight parameters, and then solve the planned route with the corresponding algorithm; the training of such parametric models often results in parameter deviations due to abnormal values in the data.
In the actual model, it is necessary to add assumptions to achieve reasonable results, and the actual problems are difficult to verify. Gao and Huang [33] used a polynomial algorithm to analyze the number of discrete path distributions and pointed out that the number of parameters has an exponential relationship with the number of road sections. Some parametric models have unsatisfactory solution results due to high parameter dimensions. In addition, some scholars have carried out research on line network optimization based on non-parametric models. Mao and Shen [34] designed a non-parametric reinforcement learning model to solve the adaptive path problem in stochastic time-varying networks; their research shows that the combination of Q-learning and tree-based function approximation performs better than traditional stochastic dynamic programming methods during peak demand periods. In addition, reinforcement learning algorithms are also widely used in traffic signal optimization and other fields [35]-[37]. Reinforcement learning is one of the paradigms and methodologies of machine learning, and it is a research hotspot in the field of control and decision-making [38]. Reinforcement learning has rarely been combined with CB route planning. In practical applications, the Q-learning algorithm, as a classic representative of reinforcement learning, has always received great attention in various fields [39]. Nevertheless, the traditional Q-learning algorithm has the problems of slow solving speed and low efficiency [40]. Our study adopts a reinforcement learning method to configure and optimize CB routes, which is of great significance in improving the quality of CB service and meeting passengers' needs for personalized travel.
The above-mentioned representative studies have achieved rich results, but they did not consider the optimization of the time windows of drop-off stops, and most of them used heuristic algorithms to solve route planning problems, which suffer from convergence to local optima and slow solution speed. The Q-learning algorithm has attracted wide attention due to its low requirements on the environment model and its excellent self-updating ability. Besides, considering that the traditional Q-learning algorithm has the problems of slow solving speed and low efficiency, we propose a cross-regional CB route planning method based on an improved Q-learning algorithm. The contributions of this study include: (i) we designed a route planning method for sub-regions, which can achieve the optimization of the time windows of drop-off stops; (ii) we designed an improved Q-learning algorithm, in which the Q-value update strategy is improved to promote algorithm efficiency, to solve for the CB operation route with the optimal total travel cost; (iii) based on historical data, we constructed a path impedance function to determine the optimal operation path between two stops. The scene diagram of this mode is shown in Figure 3; in the single-region to single-region service mode, we set bus stops based on travel demands at both ends of the CB line (pick-up area, drop-off area). In order to better optimize the total time costs of passengers, we transform the entire route plan into two single-region route plans. The scene graphs are shown in Figure 4 and Figure 5. When solving the route of the pick-up area, define N = {1, 2, 3, ..., n} as the real stops in the pick-up area; the virtual stop 0+ is set as the starting stop of the route, its distance from each actual stop is the same, and it is a positive number close to 0. The drop-off area is regarded as a virtual stop n + 1 serving as the terminal stop of the route, and the distance between stop n + 1 and each real stop is the actual distance.
The time when the route in the pick-up area arrives at its terminal is the departure time of the drop-off area. When solving the route of the drop-off area, define M = {1, 2, 3, ..., m} as the real stops in the drop-off area; the pick-up area is regarded as the virtual stop 0−, the starting stop of the route, and the distance between it and each actual stop is the actual distance; the virtual stop m + 1 is set as the line terminal stop, its distance from each actual stop is the same, and it is a positive number close to 0. According to the actual situation, cross-regional commuting travel has certain requirements on departure time and arrival time, so passengers provide the platform with their expected departure and arrival times; the platform assigns passengers whose departure times and locations are close to the same stop, assuming that all passengers arrive at the stop on time at the specified time. The time windows of the CB's pick-up and drop-off stops are shown in Figure 6. The two-dimensional plane composed of the x-axis and y-axis indicates the positions of the stops, and the z-axis represents the time axis. If the CB arrives before or within the travel time window of a stop, there is no delay time, and the passengers have no travel time cost; otherwise, delay time and passenger travel time cost occur. The research environment of this study can be modeled as a deterministic Markov decision process, which can be viewed as a tuple M = ⟨S, A, P, R⟩, where S is a finite set of states, A is a finite set of actions, P is the probability of moving to the next state, and R is the reward function; for instance, r(s, a, s′) represents the immediate reward received when the agent performs action a in state s and moves to state s′. The motivation for the learning ability of the Q-learning algorithm comes from the reward value fed back by the environment after the agent takes an action [39]. The algorithm flow of Q-learning is shown in Figure 7.
The Q-learning algorithm is a value iteration algorithm; its decision-making process and update process are driven by the Q-value table. The Q-value table is a matrix of state-action pairs; when the Q values in the Q-value table no longer change, the Q-value table has converged. The establishment of the Q-value table depends on the returned rewards used for updating. When the agent chooses an action, the environment transfers the state and gives an instant reward; correct actions are rewarded, and wrong actions are punished. The basic update rule is as follows:

Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)] (2)

where r is the instant reward obtained; γ ∈ [0, 1] is the discount factor, which indicates the importance the agent attaches to experience. When the agent updates its state, it comprehensively considers the immediate benefit [r] and the memory benefit [max_{a′} Q(s′, a′)], where the memory benefit refers to the maximum utility value over the actions of the next state in the agent's memory (Q-value table). Therefore, the larger the discount factor, the more importance the agent attaches to the memory benefit. Q(s, a) represents the value determined by the state and action; Q(s′, a′) means the Q value of the action a′ taken in the next state s′; α is the learning rate, which defines how far the current Q value moves in the direction of the latest update. In order to apply the Q-learning algorithm to plan the CB route, the state-action pairs and the reward function of the agent's learning process are the main considerations. Therefore, the first step in applying Q-learning is to define states and actions [42]. This section is intended to illustrate: a) how the CB stops are represented as states; b) the setting method of the CB stops (states). The CB is represented as an agent, and the bus stops (including real and virtual stops) are represented as the states; the states do not change with t, because the stops are static.
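The basic update rule of Eq. (2) above can be sketched with a dictionary-based Q table (the data structure is an illustrative choice, not the paper's implementation; the defaults α = 0.3 and γ = 0.8 match the values later found best in the sensitivity analysis):

```python
def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.8):
    """Apply Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).

    Q maps each state to a dict of action -> value; r is the instant
    reward for taking action a in state s and landing in s_next.
    """
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```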
The real stops are set by the K-means clustering method. First, H data objects are randomly selected from the passenger travel point set as initial centroids. Each centroid is defined by the mean of the data objects contained in its cluster; each remaining object is assigned to the cluster whose centroid is nearest, the new mean points are calculated, and the clusters are re-assigned and updated until the mean value (center of mass) of each cluster no longer changes. The steps of the K-means cluster analysis of dynamic travel demands in a small area are:
Step 1: Process the dynamic data set and determine the location coordinates of passengers.
Step 2: Establish a reservation request data set O.
Step 3: Analyze the dynamic request location distribution and select H initial cluster centers Z_h (h = 1, 2, 3, ..., H) from it.
Step 4: Using Z_h as a reference point, calculate the distance between each remaining dynamic request O_w and the point Z_h, and assign the request to the category of the nearest cluster center, namely

‖O_w − Z_h‖ ≤ ‖O_w − Z_g‖, ∀ g ∈ {1, 2, ..., H} (3)

If Eq. (3) is satisfied, O_w belongs to cluster h.
Step 5: Update the sample points in all cluster areas C_h, and use the mean value of the sample points in each cluster area as the new cluster center Z_h.
Step 6: According to the squared error criterion, calculate the squared error E:

E = Σ_{h=1}^{H} Σ_{p∈C_h} ‖p − Z_h‖² (4)

where p indicates a sample point in cluster C_h, and Z_h is the new cluster center of cluster region C_h.
Step 7: Perform iterative calculation of the squared error E. When the sum of squared errors reaches a local minimum, the operation is complete; otherwise, return to Step 4.
After analyzing the dynamic travel demands using the K-means cluster analysis method, assume that the coordinate of request w in cluster center h is (x_rhw, y_rhw), the coordinates of the composite stop are (x_chw, y_chw), and the maximum acceptable walking distance of passengers is d_max = 300 m.
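Steps 1-7 above can be sketched as a simplified, self-contained K-means (the random initialization, fixed iteration cap, and seed are illustrative assumptions; the d_max filtering is applied separately afterwards):

```python
import math
import random

def kmeans_stops(points, h, iters=50, seed=0):
    """Cluster passenger request coordinates into h candidate stop
    locations. Each centroid is the mean of the requests assigned
    to it; iteration stops when centroids no longer change."""
    rng = random.Random(seed)
    centers = rng.sample(points, h)
    for _ in range(iters):
        # Step 4: assign each request to its nearest cluster center.
        clusters = [[] for _ in range(h)]
        for p in points:
            k = min(range(h), key=lambda i: math.dist(p, centers[i]))
            clusters[k].append(p)
        # Step 5: recompute each center as the mean of its cluster.
        new = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new == centers:  # Step 7: converged
            break
        centers = new
    return centers
```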
Determine whether the vehicle responds to a ride request according to the following formula:

√[(x_rhw − x_chw)² + (y_rhw − y_chw)²] ≤ d_max (5)

According to the service range of the ridesharing stop, the number of passengers that each cluster center accepts for customized service can be determined. Going to different stops corresponds to different actions. In order to avoid local optima, the bus agent chooses actions in each state according to the ε-greedy strategy: it chooses the action with the largest Q value with probability 1 − ε, chooses other actions for exploration with probability ε, and then moves to the next state. Introduce decision variables: x_ij = 1 if the CB drives from stop i to stop j, and 0 otherwise. When solving the routes in the pick-up area, the requirements at the end of an episode are as follows:

Σ_{i∈{0,n+1}∪N} x_ij = 1, ∀ j ∈ N (6)
Σ_{i∈{0,n+1}∪N} x_i0 = 0 (7)
Σ_{j∈N} x_0j = 1 (8)
Σ_{i∈N} x_{i,n+1} = 1 (9)

Eq. (6) indicates that the bus agent traverses all the real stops; Eq. (7) represents that the bus agent will not return to the virtual stop 0; Eq. (8) means that the virtual stop 0 is the starting stop; and Eq. (9) expresses that the virtual stop n + 1 is the terminal stop. Similarly, the conditions for the end of an episode in the drop-off area are as follows:

Σ_{i∈{0,m+1}∪M} x_ij = 1, ∀ j ∈ M (10)
Σ_{i∈{0,m+1}∪M} x_i0 = 0 (11)
Σ_{j∈M} x_0j = 1 (12)
Σ_{i∈M} x_{i,m+1} = 1 (13)

It should be noted that a path is formed by the multiple connected road sections between two adjacent stops, while a route is defined as a directed line from the start stop to the terminal stop. However, there are often multiple alternative paths between two stops, and determining the paths between stops is the basis for constructing the public transport network. Since the paths between stops are independent and contain multiple road sections, this article uses historical data to calculate the travel time of each road section, and then determines the least-impedance path.
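The ε-greedy action choice over unvisited stops might look like the following sketch (the visited-set handling is an assumption implied by the episode constraints, which forbid revisiting stops):

```python
import random

def epsilon_greedy(q_row, visited, eps=0.1, rng=random):
    """Pick the next stop from the current state's Q values.

    With probability eps explore a random unvisited stop; otherwise
    exploit the unvisited stop with the largest Q value. `q_row` maps
    candidate next stops to their Q values.
    """
    candidates = [a for a in q_row if a not in visited]
    if rng.random() < eps:
        return rng.choice(candidates)
    return max(candidates, key=lambda a: q_row[a])
```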
The path impedance function is constructed as:

F_ij^f = Σ_{a∈G} δ_{a,f}^{ij} · (l_a / v_a) · [1 + ϕ(q_a / c_a)^ψ] (14)

where F_ij^f represents the impedance of path f between stop i and stop j; G means the set of road sections; δ_{a,f}^{ij} indicates the path-road section association variable: if road section a belongs to path f between stop i and stop j, its value is 1, otherwise 0; l_a is the length of road section a; v_a represents the free-flow speed of the CB; q_a means the actual traffic volume of road section a, which can be obtained from historical data; c_a expresses the traffic capacity of road section a; and ϕ, ψ are constant parameters. As shown in Figure 8, there are three optional paths between stop i and stop j; path 2 has the smallest impedance, so path 2 is selected as the operation path between stop i and stop j. The reward function plays a guiding role in the agent training process; its purpose is to evaluate the actions taken by the agent. The purpose of training is to maximize the final cumulative reward value, where the reward value is set as the opposite of the cost, that is, the higher the cost, the smaller the reward value. The reward function designed in this paper comprehensively considers the operating cost of the CB company and the travel time cost of passengers to obtain the route with the best total travel cost. The operating cost of a CB company includes fixed costs and vehicle operating costs. Vehicle operating costs are positively correlated with the length of the running line, so the fuel consumption cost of vehicle operation (the product of the line length and the unit fuel consumption cost) is directly used to represent the operating cost:

r_ij^B = p · l_ij (15)

where r_ij^B indicates the operating cost between the current stop i and stop j; l_ij means the distance between stop i and stop j; p represents the unit fuel consumption cost.
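Evaluating the impedance of Eq. (14) over candidate paths and picking the least-impedance one can be sketched as follows (the dict-based section records and the BPR-style defaults ϕ = 0.15, ψ = 4 are assumptions for illustration; the paper's parameter values are not specified here):

```python
def path_impedance(sections, phi=0.15, psi=4.0):
    """Impedance of one candidate path: sum over its road sections
    of (l_a / v_a) * (1 + phi * (q_a / c_a)**psi).

    `sections` is a list of dicts with keys l (length, km),
    v (free-flow speed, km/h), q (observed volume), c (capacity).
    """
    return sum(
        s["l"] / s["v"] * (1.0 + phi * (s["q"] / s["c"]) ** psi)
        for s in sections
    )

def best_path(paths, **kw):
    """Pick the candidate path with the least impedance."""
    return min(paths, key=lambda p: path_impedance(p, **kw))
```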
Assuming that passengers arrive at the stop at the earliest time within the time window, the time cost of a passenger is the waiting time caused by the late arrival of the CB (the bus arrives later than the time window). Eq. (16) is the reward function of passengers' time cost:

r_j^p = k_j · max(0, t_j^s − t_j^l) (16)

where r_j^p represents the time cost of passengers at stop j; k_j is the number of passengers at stop j; t_j^l is the late edge of the time window of stop j; t_i^s means the actual time for the CB to arrive at stop i; and t_j^s represents the actual time for the CB to arrive at stop j, that is, the sum of the arrival time at stop i and the path impedance. The optimization goals of Eq. (15) and Eq. (16) have the same direction, so the multi-objective optimization is transformed into a single-objective optimization, and the final reward function is:

r = −(ω_1 r_ij^B + ω_2 r_j^p) (17)

where r indicates the reward function; ω_1 and ω_2 are respectively the weight coefficients of the operating cost and the passenger time cost [20]; the other variables and parameters have been explained above. The Q-learning algorithm has the problem of data transmission lag in the Q-value update process: the Q value of a subsequent state cannot be fed back to the forward states in time to update their Q values, which affects the convergence speed. In this study, a ''backtracking'' idea is used to improve the update rule of the Q-value table to solve the problem of data transmission lag, improve the convergence speed of the algorithm, and shorten the path planning time. First define the memory matrix M(t) ← [s_t, a_t] to record, in sequence, the state-action pairs experienced by the bus agent. M(t) is a matrix with z rows and 2 columns; at the beginning of training, since no prior experience is set for the agent, all Q values are initialized to 0:

M(t) = [(s_1, a_1); (s_2, a_2); ...; (s_z, a_z)] (18)

where z is the number of states experienced from the initial moment to the current moment.
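Eqs. (15)-(17) described above combine the operating cost and the passenger time cost into one scalar reward; a minimal sketch, assuming equal weights ω_1 = ω_2 = 0.5 and the max(0, ·) delay form (both illustrative assumptions):

```python
def reward(l_ij, p_fuel, k_j, t_arr, t_late, w1=0.5, w2=0.5):
    """Reward for moving from stop i to stop j.

    Operating cost r_B = p_fuel * l_ij (Eq. 15); passenger time cost
    r_p is the waiting caused by arriving after the late edge t_late
    of stop j's time window (Eq. 16). The reward is the negative
    weighted total cost (Eq. 17).
    """
    r_b = l_ij * p_fuel
    r_p = k_j * max(0.0, t_arr - t_late)
    return -(w1 * r_b + w2 * r_p)
```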
Use [s_t, a_t] in M(t) as the index to find the Q value corresponding to the previous ''state-action'' pair and update it. Then subtract 1 from t and determine whether t − 1 is 0. If it is 0, it means that the Q values of all the ''state-action'' pairs experienced before state s_t have been updated; if it is not 0, continue to index the previous ''state-action'' pair in the sequence and update its Q value, until all Q values are updated, so that the Q values of n states are updated without n rounds of repeated training. The update formula is shown in Eq. (19), where k = t − 1, t − 2, ..., 2, 1. Since the reward value includes the passengers' time cost, which is related to the actual time when the bus arrives at a stop, the reward value of each action in a state must be recalculated every time that state is reached. A step-by-step procedure to search for the optimal route using the improved Q-learning algorithm is shown in Figure 9 and discussed below.
Step I: Initialize all Q values in the Q-value table to 0, and let i = 1 to start the first training episode.
Step II: Initialize the state, and let t = 1.
Step III: Update the reward values for all actions based on the current state.
Step IV: Select the next action according to the ε-greedy strategy, then move to the new state.
Step V: Update the memory matrix M(t) and the previous Q values.
Step VI: Determine whether an episode has been completed according to the set training conditions. If an episode is not completed, let t = t + 1, and then judge whether the maximum number of training steps has been reached; if so, go back to Step II; otherwise go back to Step IV. If an episode is completed, let i = i + 1 and then go to Step VII.
Step VII: Determine whether the Q-value table has converged (the Q values in the Q-value table no longer change).
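The backtracking pass over the memory matrix used in Step V can be sketched as below; since Eq. (19) is given only in outline, the per-step rewards list and the dict-based Q table are illustrative assumptions:

```python
def backtrack_update(Q, memory, rewards, alpha=0.3, gamma=0.8):
    """Walk the memory matrix [(s_t, a_t), ...] backwards
    (k = t-1, ..., 1), re-applying the Q-learning rule so that value
    reached late in an episode is fed back to earlier states in one
    pass instead of n repeated training rounds. `rewards[t]` is the
    instant reward observed at step t.
    """
    for t in range(len(memory) - 1, -1, -1):
        s, a = memory[t]
        if t + 1 < len(memory):
            # Memory benefit from the successor state in the episode.
            nxt = max(Q[memory[t + 1][0]].values())
        else:
            nxt = 0.0  # last recorded step has no successor yet
        Q[s][a] += alpha * (rewards[t] + gamma * nxt - Q[s][a])
```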
If the Q-value table has not converged, judge whether the maximum episode number has been reached: if it has, the training ends; if not, go back to Step II. If the Q-value table has converged, the training ends. We take three existing customized bus lines in Beijing as examples to verify the effectiveness of the proposed algorithm. These three CB lines are typical cross-regional CB lines, similar to other cross-regional CB lines in Beijing, passing through large residential areas and workspaces; they are effective measures provided by Beijing to support enterprises in resuming work and production under the normalized epidemic prevention and control situation. The project requires that the locations of the stops cannot be changed, that is, the stop locations are fixed; however, the service order of the stops can be optimized according to the passengers' time window information. Meanwhile, in order to compare the optimization results more intuitively and focus on the effectiveness of the algorithm, we directly adopt the locations of the original stops. The stops' information (numbered according to the order of service) for the pick-up and drop-off areas is shown in Table 1. Field data surveys are used to obtain the actual operating route of the CB, the times of departure from and arrival at each stop, and the numbers of boarding and alighting passengers. Investigators are assigned to the three routes: one sits at the front door and records the CB's arrival time, the number of boarding passengers, and the average boarding time, while the other sits at the back door and records the number of alighting passengers, the average alighting time, and the departure time. The actual expected departure times of passengers are obtained by issuing questionnaires on the CB, and the time windows are determined accordingly.
Route 1 contains 4 pick-up stops and 4 drop-off stops, Route 2 contains 6 pick-up stops and 7 drop-off stops, and Route 3 contains 7 pick-up stops and 2 drop-off stops. Table 2 lists the number of passengers and the time-window information for the stops of the three CB routes.
A sensitivity analysis based on Route 1 is performed to explore the relation between the model parameters and the optimized results. The learning rate and the discount factor are the most critical parameters of the Q-learning algorithm, so we test different parameter combinations to obtain the optimal settings. Both the learning rate and the discount factor range over [0.1, 0.9], and training experiments are performed at increments of 0.1. The results of the parameter sensitivity analysis are shown in Figure 10, where the horizontal axis represents the discount factor, the vertical axis the learning rate, and different colors the number of iterations. The following conclusions can be drawn: when the learning rate is within [0.1, 0.5] and the discount factor is within [0.1, 0.5], the number of iterations increases overall as the learning rate increases; when the learning rate is within [0.5, 0.9] and the discount factor is within [0.1, 0.4], the number of iterations decreases overall as the discount factor increases; and when the discount factor is within [0.6, 0.8] and the learning rate is within [0.1, 0.4], the number of iterations is relatively small compared with other parameter combinations, which indicates that the agent completes training faster, i.e., the algorithm performs better. The algorithm is most efficient when the learning rate is 0.3 and the discount factor is 0.8.
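A sweep of this kind can be sketched as below. The environment is a hypothetical toy chain task standing in for the CB routing problem (so the fastest combination found here need not be the paper's 0.3/0.8); only the grid over the learning rate and discount factor mirrors the experiment.

```python
import itertools
import random

def episodes_to_converge(alpha, gamma, n_states=4, max_episodes=500,
                         tol=1e-4, seed=0):
    """Train tabular Q-learning on a small deterministic chain task and
    return the episode count at which the largest Q-value change within
    an episode first drops below tol (a simple convergence proxy)."""
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = {s: {a: 0.0 for a in actions} for s in range(n_states)}
    for ep in range(1, max_episodes + 1):
        s, done, steps, max_delta = 0, False, 0, 0.0
        while not done and steps < 100:
            # epsilon-greedy selection with epsilon = 0.2
            if rng.random() < 0.2:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[s][x])
            s2 = min(max(s + a, 0), n_states - 1)
            done = s2 == n_states - 1
            r = 1.0 if done else -0.01
            target = r + (0.0 if done else gamma * max(Q[s2].values()))
            delta = alpha * (target - Q[s][a])
            Q[s][a] += delta
            max_delta = max(max_delta, abs(delta))
            s, steps = s2, steps + 1
        if max_delta < tol:
            return ep
    return max_episodes

# Sweep both parameters over [0.1, 0.9] in steps of 0.1, as in the paper.
grid = [round(0.1 * k, 1) for k in range(1, 10)]
results = {(a, g): episodes_to_converge(a, g)
           for a, g in itertools.product(grid, grid)}
best = min(results, key=results.get)
print("fastest (learning rate, discount factor):", best)
```

Plotting `results` as a heat map (discount factor on the x-axis, learning rate on the y-axis) reproduces the kind of view shown in Figure 10.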
Besides, the CB speed and the oil price are set according to the actual situation in Beijing [43], and the time value of passengers is calculated according to Beijing's per capita income in 2019. The parameter and variable settings are shown in Table 3. Based on the actual data, the path with the least impedance between each pair of stops is obtained by Eq. (14), and Table 4 shows the distances between the stops. Figure 11(a-c) and Table 5 show the convergence process of the Q-value table and the comparison of algorithm performance. The total steps (y-axis) represent the number of steps the agent needs to complete one training episode, and the iteration (x-axis) represents the training count. From Figure 11(a-c), we can see that the number of steps the agent needs to complete a training episode keeps decreasing with the number of iterations until the number of exploration steps no longer changes, which means the agent has learned the optimal strategy, i.e., the Q-value table has converged. At convergence, the number of iterations of the improved Q-learning algorithm is significantly smaller than that of the traditional Q-learning algorithm. From Table 5 we can see that, compared with traditional Q-learning, the improved Q-learning reduces the number of training episodes for Route 1 by 37.9% and the solution time by 41.7%; for Route 2 both reduction rates exceed 20%, and for Route 3 both are about 34%. This indicates that the improved Q-learning algorithm mitigates the problem of lag in data transmission, significantly improves the convergence speed, and shortens the route-planning time. The routes solved by the improved Q-learning are as follows (stop numbers): Route 1: 1-3-4-2-5-7-6-8; Route 2: 1-3-2-5-4-6-8-9-7-10-11-12-13; Route 3: 1-2-4-5-3-6-7-8-9. The cost information after route optimization is shown in Table 6.
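Finding the least-impedance path between two stops is a shortest-path search on an impedance-weighted road graph; a standard Dijkstra sketch is shown below. Eq. (14) is not reproduced in this section, so the edge weights and node names here are hypothetical placeholders for the impedance values it would produce.

```python
import heapq

def least_impedance_path(graph, src, dst):
    """Dijkstra search over a directed graph whose edge weights are path
    impedances; returns the minimum-impedance path and its total cost."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    visited = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking the predecessor links backward.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

# Hypothetical impedances on road segments linking two CB stops (A to D).
graph = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.5), ("D", 4.0)],
    "C": [("D", 1.0)],
}
path, impedance = least_impedance_path(graph, "A", "D")
print(path, impedance)   # ['A', 'B', 'C', 'D'] 4.5
```

Running this search once per ordered stop pair yields the inter-stop distance/impedance matrix of the kind reported in Table 4.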
According to the results of the numerical experiment in Table 6, we can make the following observations. (i) Although the fuel consumption cost of the optimized Route 1 increases by 6 yuan compared with the original route, the time cost of passengers is reduced by 98 yuan, the average lost time per passenger is reduced from 4.97 min to 1.7 min, and the total travel cost is reduced by 39.1%. (ii) The fuel consumption cost of the optimized Route 2 increases by 27.8 yuan compared with the original route, but the time cost of passengers is reduced by 168 yuan, the average lost time per passenger is reduced from 5.42 min to 1.69 min, and the total travel cost is reduced by 35.9%. (iii) The fuel consumption cost of the optimized Route 3 increases by 9 yuan compared with the original route, but the time cost of passengers is reduced by 75 yuan, the average lost time per passenger is reduced from 5.85 min to 3.17 min, and the total travel cost is reduced by 28.2%. The method can therefore effectively reduce the total social travel cost. (iv) Compared with the original routes, the waiting times of passengers on Route 1, Route 2 and Route 3 are reduced by 65.8%, 68.9% and 45.7%, respectively; the average waiting time across the three routes is reduced by 60.1%. Assuming that the probability of COVID-19 transmission is positively correlated with the gathering time of passengers, the transmission probability is correspondingly reduced sharply. Based on the above analysis, CB operating companies could moderately increase the fare (after optimization, by 0.12 yuan/person on Route 1, 0.58 yuan/person on Route 2, and 0.32 yuan/person on Route 3) to cover the increase in operating cost caused by route optimization, i.e., to maintain the same operating cost as the status quo.
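The figures quoted above can be checked with a few lines of arithmetic; all numbers below are taken directly from the text and Table 6.

```python
# Reported reductions in passenger waiting time for the three routes (%).
waiting_time_reduction = {"Route 1": 65.8, "Route 2": 68.9, "Route 3": 45.7}

avg_reduction = sum(waiting_time_reduction.values()) / len(waiting_time_reduction)
print(f"average waiting-time reduction: {avg_reduction:.1f}%")   # 60.1%

# Net change in social travel cost per route: the small rise in fuel cost
# is outweighed by the fall in passengers' time cost (yuan, from Table 6).
net_change = {
    "Route 1": 6.0 - 98.0,
    "Route 2": 27.8 - 168.0,
    "Route 3": 9.0 - 75.0,
}
assert all(v < 0 for v in net_change.values())   # every route is a net saving
```

The average of the three route-level reductions reproduces the 60.1% figure stated in observation (iv).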
In this case, because the bus fare is only a transfer payment and is not included in the total social travel cost, the reduction in the total social travel cost comes entirely from the reduction in passengers' travel cost.
In this study, we propose a CB route planning method based on an improved Q-learning algorithm to solve the problem of cross-regional CB route planning during the COVID-19 pandemic. Under this method, the total social travel cost, including the operating cost and the travel cost, is minimized. First, we design a sub-regional route planning approach considering commuters' time windows at pick-up stops and drop-off stops. Second, we develop the improved Q-learning algorithm, including the state-action pair, the reward function, and the update rule of the Q-value table. Specifically, the reward function is set according to the operating cost of the bus company and the time cost of passengers, and the ''backtracking'' idea is applied to the Q-learning algorithm to improve the update efficiency of the Q-value table. Moreover, through the numerical experiment, we obtain the following conclusions. (i) The improved Q-learning algorithm can effectively increase the solution speed and mitigate the lag of data transmission. (ii) The optimized CB routes can effectively reduce the total social travel cost and the transmission risk of COVID-19, thereby providing new ideas for CB route planning and pricing. As an extension of this study, future research could consider passengers' dynamic travel needs in order to plan CB routes in real time; it could also consider passengers arriving late or early at stops, as well as different time-window requirements among passengers at the same stop.
ANGE WANG was born in Shandong, China. He received the bachelor's degree from East China Jiaotong University and was named an Outstanding Graduate in 2019.
After obtaining the recommendation exemption admission qualification, he is currently pursuing the master's degree in transportation planning and management with the Faculty of Urban Construction, Beijing University of Technology. His research interests include resource allocation for customized buses and shared parking. He has published three academic articles (SCI/EI), applied for three national invention patents, and won the 2018 National Scholarship and the National Encouragement Scholarship of China.
HONGZHI GUAN received the B.S. degree from the Xi'an Highway College, Chang'an University, in 1982, and the Ph.D. degree from Kyoto University, Japan, in 1997. He is currently a Professor and a Doctoral Supervisor with the Beijing University of Technology. He is the author of more than 20 books and more than 200 articles. His research interests include traffic behavior analysis and modeling, traffic policy analysis, public transportation planning and management, parking planning and management, and traffic psychology. He is also a member of the Transportation Steering Committee of the Ministry of Education, the Vice Chairman of the Traffic Engineering Teaching Guidance Subcommittee, an expert in the National Urban Traffic Civilization and Smooth Improvement Action Plan Expert Group organized by the Ministry of Public Security, the Central Civilization Office, the Ministry of Housing and Urban-Rural Development, and the Ministry of Transport, and a member of the World Transport Conference (WTC) Executive Committee, among other social appointments.
REFERENCES
[1] A study on the social cost comparison between public transportation and car travel.
[2] 2016 Beijing Resident Travel Big Data Report.
[3] Analysis of a new public-transport-service concept: Customized bus in China.
[4] Role of customized bus services in the transportation system: Insight from actual performance.
[5] A new approach to predict the market and impacts of round-trip and point-to-point carsharing systems: Case study of London.
[6] Large-scale demand driven design of a customized bus network: A methodological framework and Beijing case study.
[7] Customized bus service design for jointly optimizing passenger-to-vehicle assignment and vehicle routing.
[8] CBplanner: A bus line planning framework for customized bus systems.
[9] A two-phase optimization model for the demand-responsive customized bus network design.
[10] Modeling and solving the vehicle routing problem with multiple fuzzy time windows.
[11] CB route vehicle schedule method considering travel time windows.
[12] Optimization algorithms for scheduling of flex-route buses.
[13] Emergency customized bus route optimization under public health emergencies.
[14] Key determinants and heterogeneous frailties in passenger loyalty toward customized buses: An empirical investigation of the subscription termination hazard of users.
[15] Mobile phone GPS data in urban customized bus: Dynamic line design and emission reduction potentials analysis.
[16] Modeling and analyzing of family intention for the customized student routes: A case study in China.
[17] Clustering passenger trip data for the potential passenger investigation and line design of customized commuter bus.
[18] Customized bus route optimization with the real-time data.
[19] Mixed scheduling strategy for high frequency bus routes with common stops.
[20] Customized bus network design based on individual reservation demands.
[21] An analytical bound on the fleet size in vehicle routing problems: A dynamic programming approach.
[22] A column generation approach for an emission-oriented vehicle routing problem on a multigraph.
[23] The cumulative capacitated vehicle routing problem: New formulations and iterated greedy algorithms.
[24] Route planning and track keeping control for ships based on the leader-vertex ant colony and nonlinear feedback algorithms.
[25] Optimization of multi objective vehicle routing problem using a new hybrid algorithm based on particle swarm optimization and artificial bee colony algorithm considering precedence constraints.
[26] A multi-compartment vehicle routing problem with time windows for urban distribution - A comparison study on particle swarm optimization algorithms.
[27] Optimizing nonlinear charging times of electric vehicle routing with genetic algorithm.
[28] A new approach based on the hybridization of simulated annealing algorithm and tabu search to solve the static ambulance routing problem.
[29] Optimizing battery-electric-feeder service and wireless charging locations with nested genetic algorithm.
[30] An adaptive large neighborhood search algorithm for a selective and periodic inventory routing problem.
[31] Optimizing multiterminal customized bus service with mixed fleet.
[32] Intractability of the dial-a-ride problem and a multiobjective solution using simulated annealing.
[33] Real-time traveler information for optimal adaptive routing in stochastic time-dependent networks.
[34] A reinforcement learning framework for the adaptive routing problem in stochastic time-dependent network.
[35] Comparing two multiagent reinforcement learning approaches for the traffic assignment problem.
[36] Evaluating semi-cooperative Nash/Stackelberg Q-learning for traffic routes plan in a single intersection.
[37] Modeling pedestrian-cyclist interactions in shared space using inverse reinforcement learning.
[38] Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning.
[39] A reinforcement learning scheme for the equilibrium of the in-vehicle route choice problem based on congestion game.
[40] Reinforcement learning for vehicle route optimization in SUMO.
[41] Solving the optimal path planning of a mobile robot using improved Q-learning.
[42] Optimal path planning approach based on Q-learning algorithm for mobile robots.
[43] Beijing transport development annual report.