authors: Khalilpourazari, Soheyl; Hashemi Doulabi, Hossein
title: Designing a hybrid reinforcement learning based algorithm with application in prediction of the COVID-19 pandemic in Quebec
date: 2021-01-03
journal: Ann Oper Res
DOI: 10.1007/s10479-020-03871-7

The World Health Organization (WHO) declared COVID-19 a pandemic in March 2020. Since then, 26,795,847 cases and 878,963 deaths had been reported worldwide by September 3, 2020. Prediction of the COVID-19 pandemic will enable policymakers to optimize the use of healthcare system capacity and resource allocation to minimize the fatality rate. In this research, we design a novel hybrid reinforcement learning-based algorithm capable of solving complex optimization problems. We apply our algorithm to several well-known benchmarks and show that the proposed methodology provides quality solutions for the most complex benchmarks. Besides, we show the dominance of the offered method over state-of-the-art methods through several measures. Moreover, to demonstrate the suggested method's efficiency in optimizing real-world problems, we apply our approach to the most recent data from Quebec, Canada, to predict the COVID-19 outbreak. Our algorithm, combined with the most recent mathematical model for COVID-19 pandemic prediction, accurately reflected the future trend of the pandemic with a mean square error of 6.29E−06. Furthermore, we generate several scenarios to deepen our insight into pandemic growth. We determine essential factors and deliver various managerial insights to help policymakers make decisions regarding future social measures.

Researchers use optimization in nearly every study area. Optimization remains a fundamental challenge in science and engineering, primarily because of the difficulty of real-world problems and the limitations of traditional methods. Randomization Search Algorithms (RSAs) are among the most flexible and efficient methodologies for solving complicated problems. These algorithms are mostly polynomial-time algorithms and have significantly lower computational complexity than exact methods. As one of the most commonly used classes of RSAs, metaheuristics are algorithms inspired by natural phenomena to perform optimization. Metaheuristics perform very well in exploring the feasible region and evading local optima using effective movement processes. Healthcare science is one of the top research areas in which metaheuristics have been widely applied. Using these algorithms, scientists can optimize healthcare systems significantly in terms of several objectives, including minimizing cost, waiting time, service time, and delivery time, and maximizing reliability or customer satisfaction. In December 2019, a novel strain of coronavirus called SARS-CoV-2 was discovered in China. The virus causes COVID-19, a severe respiratory disease. Despite the initial measures applied by the government of China, the disease spread quickly to many countries, leading to 26,795,847 infected cases and 878,963 deaths. Currently, there are no effective medications and vaccines for the disease. However, the effectiveness of some treatment options is under study via clinical trials (Health Canada 2020; WHO 2020).
Although most people with mild COVID-19 symptoms recover independently, some other people with severe and critical symptoms need hospitalization in wards and Intensive Care Units (Public Health Authority of Canada 2020). However, due to the limited capacity of the healthcare system, it is impossible to admit all the patients in hospitals. Efficient modeling and prediction of the COVID-19 pandemic will meaningfully aid the officials and healthcare experts in making decisions to stop the spread of the disease. Besides, by forecasting the upcoming trend of the epidemic, we can also optimally allocate resources to hospitals that will avoid equipment shortages and save patients' lives. Prediction of the COVID-19's trend is challenging because of its uncertain nature and complication. Recently, scientists have provided a novel model to simulate the COVID-19 pandemic called SIDARTHE and was initially offered in research by Giordano et al. (2020) published in Nature Medicine. The researchers highlighted the efficiency of the proposed formulation in modeling the pandemic growth. Nevertheless, they emphasized that solving the presented set of differential equations is difficult because of the exceptional characteristics of the model. In the current research, we offer a novel search methodology that will solve many complex optimization problems very efficiently in a short time. Our algorithm simultaneously benefits from the advantages of machine learning (ML) and evolutionary computation (EC). In our research, we propose a hybrid algorithm that involves a reinforcement learning (RL) technique as the main engine and several ECs as updating operators. The learning process achieved by the RL method accelerates the algorithm and enables us to resolve complicated large-scale problems. The procedure utilizes several operators to enhance the exploration and exploitation capabilities of the algorithm. These features help the proposed process to sidestep local optimum while exploiting the solution space intelligently. We implement the algorithm on several well-known recently developed benchmarks to show its efficiency in solving such complex problems. Moreover, we highlighted significant differences in the performance of the method comparing to other methodologies using robust statistical tests. Furthermore, we use the proposed algorithm to model and predict the COVID-19 pandemic in Quebec, Canada, that has the most cases of COVID-19 in the country. Our results accurately predict the peak in the number of infected cases of COVID-19 in the province. The outcomes also determine the peak of the number of cases that develop life-threatening symptoms that will require hospitalization. We also perform complex sensitivity analyses to portray future scenarios that enable us to provide detailed information to policymakers and healthcare professionals. In our study, we also measure the effectiveness of implementing measures such as social distancing and partial lockdown on pandemic growth. The rest of the current study is prepared as follows: In Sect. 2, we deliver a comprehensive review of existing research on the topic. In Sect. 3, we offer a novel algorithm to resolve many complex problems using RL and EC. In Sect. 4, we apply our algorithm to several well-known benchmark functions. We assess the performance of the suggested approach and compare its efficiency to state-of-the-art methods. In Sect. 5, we implement our method to model and predict the COVID-19 pandemic in Quebec, Canada. In Sect. 
6, we perform sensitivity analyses and provide valuable managerial insights to fight the COVID-19 pandemic. In Sect. 7, we conclude the paper. The core idea behind most EC algorithms is to follow a swarm intelligence that is inspired by animal behavior and natural phenomenon . Mirjalili and Lewis (2016) categorized metaheuristics into four main groups: evolutionary algorithms (EAs), physics-based algorithms (PAs), swarm algorithms (SAs), and machine learning-based algorithms (MLAs). EAs imitate the evolution procedure in nature to solve complex problems. PAs utilize laws of physics that enable this family of algorithms to handle complicated problems. On the other hand, SAs simulate the swarm behavior of many individuals in a group. Also, MLAs use artificial intelligence and machine learning to enhance the performance of the previous families of algorithms in terms of exploration and exploitation. Table 1 provides some of the most advanced algorithms on metaheuristics developed in recent years. Many researchers used these algorithms to solve complex optimization problems in different fields (Hoursan et al. 2020; Tirkolaee et al. 2020a, b; Sangaiah et al. 2020; Lotfi et al. 2018 Lotfi et al. , 2019 Lotfi et al. , 2020 Zare Mehrjerdi and Lotfi 2019; Khalilpourazari et al. 2020c; Hashemi Doulabi et al. 2020a, b) . Defined by the no free lunch (NFL) theorem, we can logically prove that no single method performs optimally in resolving all problems (Adam et al. 2019; Wolpert and Macready 1997) . Any metaheuristic algorithm may perform well in some benchmarks but weak in others. This theorem makes this field of study highly interesting for researchers searching for an algorithm that performs promising in many benchmarks. In our algorithm, we consider several ECs and operators and let an RL method decide which algorithm to use to relocate each element. Besides, learning during the optimization process will significantly reduce the computational burden and improve the quality of the results. The learning process accelerates the algorithm due to the fact that using the learning process over iterations, the algorithm adapts its operators to perform the best for each problem. Moreover, we consider a proper framework to maintain a decent equilibrium between exploration and exploitation of the feasible region over generations of the algorithm to avoid local optima. We show the efficiency of our algorithm on the most complex benchmarks in the literature. Besides, we apply the suggested algorithm to predict the COVID-19 pandemic (Holland 1992) Small-world optimization algorithm (SWOA) (Du et al. 2006) Particle swarm optimization (PSO) (Eberhart and Kennedy 1995) Stochastic fractal search (SFS) (Salimi 2015) Hybrid Q-learning based algorithm (this paper) Genetic programming (GP) (Koza and Koza 1992) Curved space optimization (CSO) (Moghaddam et al. 2012 ). Grasshopper optimization algorithm Sine-cosine algorithm (SCA) (Mirjalili 2016a, b) Degree-descending search strategy (DDS) (Cui et al. 2018) Charged system search (CSS) (Kaveh and Talatahari 2010) Ant lion optimization algorithm (ALO) (Mirjalili 2015a) Water cycle algorithm (WCA) (Eskandar et al. 2012) Biogeography based optimizer (BBO) (Simon 2008) Multi-verse optimization (MVO) algorithm (Mirjalili et al. 2016a, b) Crow search algorithm (CSA) ( Galaxy-based search algorithm (GBSA) (Kaveh and Dadras 2017) Grey Wolf optimizer (GWO) (Mirjalili et al. 2014) Lightning search algorithm (LSA) (Shareef et al. 
2015) Evolution strategy (ES) (Rechenberg 1978) Simulated annealing (SA) (Kirkpatrick et al. 1983; CERBY 1985) Dragonfly algorithm (DA) (Mirjalili 2016a, b) Coronavirus optimization algorithm (COA) (Martínez-Álvarez et al. Black hole (BH) algorithm (Hatamlou 2013) Artificial bee colony (ABC) (Karaboga and Basturk 2007) Thermal exchange optimization (Kaveh and Dadras 2017) Moth-flame optimization (MFO) (Mirjalili 2015a, b) Dynamic Virtual Bats Algorithm (DVB) (Topal and Altun 2016) in Quebec, Canada. Our outcomes express that the designed algorithm robustly predicts the future trends of the pandemic. Metaheuristics work based on randomization. By randomization, we mean that these algorithms use random stepsizes while updating each particle's position in the solution space. Metaheuristics use unique operators and strategies to update the position of each particle (solution). The efficiency of these operators and algorithms significantly depends on the solution space of the problem. For instance, some algorithms follow a direct updating procedure, such as water cycle algorithm (WCA), while some other encircle the best solution to update the position of a given particle, such as Grey-Wolf optimizer (GWO). Each of these operators and moving strategies has unique advantages that enable the algorithms to perform well in optimizing specific problems. Therefore, developing new algorithms that could efficiently solve a higher number of optimization problems is essential. In this study, we use several operators (moving strategies) from various algorithms to update the particles' position in the solution space. For updating each particle, we have to choose an operator from the given set of operators. Determining the best strategy and the most efficient operator for any given optimization problem is computationally challenging. Therefore, we use a reinforcement learning method that learns and optimizes the choice of operators during the optimization process to achieve optimal performance. In the following, we first describe all the features of the offered algorithm, and then we define the algorithm in a unique structure. Reinforcement learning (RL) that approximates dynamic programming, and neuro-dynamic programming, is a type of machine learning which determines the best actions in a specific environment to maximize a reward (Bertsekas 2019) . One of the main features of reinforcement learning is that the agent receives a reward or punishment after executing an action. The RL continues to interact with the environment to achieve the optimal policy via trial and error. The Q-Learning is one of the most efficient reinforcement learning algorithms that determine an optimal policy by evaluating taken actions using the environment. Q-learning is a way to optimize solutions in a Markov decision process (MDP) problem (Akhtar and Farrukh 2017) . The Q-learning algorithm aims to maximize the anticipated reward by determining the optimal state-action pairs. The algorithm uses a Q(s, a) table where s t is the state and a t is the action at time step t, and Q is the cumulative reward matrix. The algorithm updates the components of the Q-table Q(s, a) iteratively using Eq. (1). In Eq. (1), t denotes the learning rate parameter and r t is the obtained reward/punishment from the current action. Besides, the expression γ is the scaling factor. 
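To make the update in Eq. (1) concrete, the following is a minimal Python sketch of the tabular Q-learning step described above. The single-state encoding, the number of actions, and the value γ = 0.8 are illustrative assumptions, not settings reported in the paper.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma=0.8):
    """One tabular Q-learning step, Eq. (1):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
    where alpha is the learning rate and gamma the scaling (discount) factor."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

# Illustrative use: a single global state and six actions (one per updating operator).
Q = np.zeros((1, 6))
# The paper rewards an action with +1 when the chosen operator improves the particle
# and punishes it with -1 otherwise; here we assume the action improved the solution.
Q = q_learning_update(Q, state=0, action=2, reward=+1, next_state=0, alpha=0.5)
print(Q)
```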
One of the main challenges in designing an efficient Q-learning algorithm is in determining the importance of the information gained throughout interactions with the environment. For instance, assigning a value close to 1 to the learning rate parameter means that we consider higher importance for the recent information gained. To optimize the value of the learning rate parameter throughout iterations, we use an adaptive methodology that uses Eq. (2) to intelligently tune the learning rate parameter to explore and exploit the search region (Zamli et al. 2018) . In Eq. (2), t is the iteration index, and MT is the maximum generation. In this research, we consider a reward value of 1 if the current action improves the solution quality of the particle; otherwise, we consider a punishment value of − 1. Based on the given illustrations, we present the Pseudo-code of the Q-learning in Algorithm 1. In this research, we consider several efficient operators from different algorithms and let the Q-learning algorithm determine the best action throughout the optimization process to modify the location of each particle in the feasible space. In the following subsections, we present the operators in detail. Similar to other swarm intelligence-based systems, GWO initially generates a set of primary solutions. Then, it sorts the solutions regarding their fitness and considers the three best solutions as the dominant wolves (alpha, beta, and delta). The following equations imitate the encircling behavior of the grey wolves around prey: In Eqs. (3) and (4), t represents the iteration index and A and C characterize location vectors of target and other grey wolves. X p (t) and X (t) are the position of the prey and grey wolf, respectively. These coefficients are calculated as follows: In Eqs. (5) and (6), a 1 decreases over iterations from 2 to 0 and r 1 and r 2 are random numbers. After encircling the prey, the wolves start the hunting process. To mathematically Fig. 1 Hunting behavior in GWO express the movements of grey wolves in the hunting process, we consider that the alpha, beta, and delta have superior knowledge of the probable position of the target (possible optimal solution of the problem). In this framework, the following formulas are recommended to mimic the hunting process (Khalilpourazari et al. 2020a) : Figure 1 depicts a graphical interpretation of the hunting action in 2D space. SCA is a newly developed search procedure that mimics the sine and cosine like movements in the feasible space to modify the elements using Eq. (10): In Eq. (10), X t i is the present location P t i is the location of the best particle and r 1 , r 2 , r 3 are random numbers in (0,1]. Throughout iterations, r 1 displays the movement path, r 2 controls the moving distance, r 3 guarantees a suitable equilibrium among underline or deemphasize the desalination, and r 4 selects a sine or cosine measure for updating procedure. Figure 2 shows the movement behavior of the sine and cosine actions. SCA uses sine and cosine movements intelligently to evade local optima. In addition to adjusting the particles' movements during the solution process, SCA reduces the value of r 1 parameter using the below formula to sustain a proper equilibrium between exploration and exploitation as follows: In Eq. (11), t displays present repetition,a 2 is a constant, and T is the maximum generation. Like most of the optimization paradigms, MFO initiates the optimization by creating random solutions. 
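Before continuing with MFO, the two updating rules described above can be sketched compactly in Python: the GWO encircling and hunting steps of Eqs. (3)-(9) and the SCA move of Eqs. (10)-(11). The sketch follows the standard formulations from the cited papers; the function names, argument layout, and defaults are illustrative.

```python
import numpy as np

def gwo_move(x, alpha, beta, delta, t, max_t, rng):
    """Grey Wolf Optimizer hunting step: encircle the three best wolves
    (alpha, beta, delta) and average the resulting positions (Eqs. 3-9).
    The coefficient a decreases linearly from 2 to 0 over iterations."""
    a = 2.0 * (1.0 - t / max_t)
    new_pos = np.zeros_like(x)
    for leader in (alpha, beta, delta):
        A = 2.0 * a * rng.random(x.shape) - a
        C = 2.0 * rng.random(x.shape)
        D = np.abs(C * leader - x)
        new_pos += leader - A * D
    return new_pos / 3.0

def sca_move(x, best, t, max_t, a2=2.0, rng=None):
    """Sine-Cosine Algorithm step, Eq. (10); r1 shrinks linearly per Eq. (11),
    r4 selects the sine or cosine branch."""
    rng = rng or np.random.default_rng()
    r1 = a2 - t * a2 / max_t
    r2 = 2.0 * np.pi * rng.random(x.shape)
    r3 = 2.0 * rng.random(x.shape)
    r4 = rng.random(x.shape)
    step = np.where(r4 < 0.5,
                    r1 * np.sin(r2) * np.abs(r3 * best - x),
                    r1 * np.cos(r2) * np.abs(r3 * best - x))
    return x + step
```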
Then, it mimics the spiral flying actions of the moths around light sources using a logarithmic spiral function as follows: In Eq. (12), t is a constant in [− 1, 1], and b is a constant for determining the form of the logarithmic spiral. F i is the flame (the best solution), I P X i is the moth, and F i − I P X i calculates the distance between the moth and flame. Figure 3 shows the spiral movement around the flame. In order to maintain a suitable balance among its exploration and exploitation, the MFO algorithm reduces the search radius using the following equations Mohammadi and Khalilpourazari 2017) : In Eqs. (13) and (14), t displays present repetition, and MT is the maximum generation. Particle swarm optimization (PSO) is one of the most efficient procedures for optimization, and it performs promising in solving many complex problems. PSO is a population-based algorithm that uses Eq. (15) to update the location of a given particle in the solution space: In Eq. (15), x i (t) presents the current location of the particle and v i (t + 1) determines the velocity of the particle. v i vector is the main component of the updating operator that is calculated as follows: In Eq. (16), r 1 , r 2 are random numbers in (0, 1], C 1 , C 2 are coefficients, p i (t) is the best solution found by the same solution so far, and G(t) is the best solution attained so far. Water cycle algorithm (WCA) is one of the best algorithms for solving complex problems. WCA is a population-based nature-inspired metaheuristic that mimics the flow of streams to rivers and sea to perform optimization. In Eq. (17), C is a random value. We used the updating operator on the WCA as one of the means to update the location of a given particle in the feasible space. Figure 4 represents the updating procedure in WCA. In this subsection, we use the leading operators of the stochastic fractal search (SFS) offered by Salimi (2015) . SFS utilizes an important scientific property called "fractal". Fractals are complicated geometric shapes that generally have a "fractional dimension," resulting in selfsimilarity. SFS follows diffusion limited aggregation (DLA), which is an efficient technique to create fractals. To simulate the DLA, we use the Gaussian walk and Lévy flight. Figure 5 presents a fractal shape produced through the DLA scheme. We use the following equation to simulate the diffusion process in the DLA method. In Eq. (18), q describes the number of new solutions created through the diffusion of each particle, σ is the standard deviation of the Gaussian walk, and B P is the best solution. Also, x q i denotes new particles produced via diffusion process and x i is the ith solution. Besides, ε and ε are randomly generated numbers in (0, 1]. Element σ is defined using the below formula (Khalilpourazari et al. 2020b) : where log(g) g decreases the length of Gaussian jumps over iterations. Element g is the iteration number and P i is the current position of the particle. In order to enhance exploration and randomness in the population, we also apply a Lévy flight based updating procedure to the particle under consideration as follows: The expression X i new is the new location and X i c is the current location of the particle, respectively. We calculate the Lévy flight using Eq. (13). where r 1 and r 2 are random numbers. In Eq. (13) σ is obtained as follows: In this section, we propose a novel Hybrid Q-learning based algorithm (HQLA) to solve complicated optimization problems. 
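Before detailing HQLA, the remaining updating operators summarized above can be sketched as well: the MFO logarithmic spiral of Eqs. (12)-(14), the PSO velocity update of Eqs. (15)-(16), the WCA stream-to-river move of Eq. (17), and the SFS Gaussian diffusion and Lévy flight of Eqs. (18)-(20). The formulations follow the cited source algorithms; default coefficients such as the PSO inertia weight are assumptions and not values taken from this paper.

```python
import math
import numpy as np

def mfo_move(moth, flame, t, max_t, rng, b=1.0):
    """Moth-Flame spiral move, Eq. (12): D * exp(b*l) * cos(2*pi*l) + flame,
    where D is the moth-flame distance and l is drawn from a range that
    shrinks over iterations (Eqs. 13-14)."""
    r = -1.0 - t / max_t                          # shrinks from about -1 to -2
    l = (r - 1.0) * rng.random(moth.shape) + 1.0  # l in [r, 1]
    D = np.abs(flame - moth)
    return D * np.exp(b * l) * np.cos(2.0 * np.pi * l) + flame

def pso_move(x, v, p_best, g_best, rng, w=0.7, c1=2.0, c2=2.0):
    """PSO update, Eqs. (15)-(16); the inertia weight w is an assumed common choice."""
    v_new = (w * v
             + c1 * rng.random(x.shape) * (p_best - x)
             + c2 * rng.random(x.shape) * (g_best - x))
    return x + v_new, v_new

def wca_move(stream, river, rng, c=2.0):
    """WCA stream flow toward a river, Eq. (17): move a random fraction
    (scaled by C) of the distance to the better solution."""
    return stream + c * rng.random(stream.shape) * (river - stream)

def sfs_diffuse(x, best, t, rng):
    """SFS Gaussian-walk diffusion, Eqs. (18)-(19); the jump length decays
    with log(t)/t over iterations."""
    sigma = np.abs((math.log(t) / t) * (x - best)) if t > 1 else np.abs(x - best)
    eps1, eps2 = rng.random(), rng.random()
    return rng.normal(best, sigma) + eps1 * best - eps2 * x

def levy_flight(shape, rng, beta=1.5):
    """Mantegna-style Lévy step used to randomize SFS particles (Eq. 20)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / beta)
```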
The idea behind this algorithm is to design a solution methodology that is capable of solving complicated problems by adapting its operators to any solution space. We designed an optimization procedure that can benefit from the advantages of several algorithms. The process can use any movement strategy based on each updating operator. However, a reinforcement learning based algorithm (Q-learning) selects the best action in each iteration for each particle. When optimization begins, the algorithm performs several random actions to evaluate the efficiency of each type of operator. As the iterations continue, Q-learning learns how to employ different actions to achieve the best possible solution. For each action, we consider a reward equal to 1 if the current operators improve the solution quality; Otherwise, the algorithm assigns a punishment value of − 1 to the action. The Pseudo-code of the algorithm is available in Algorithm 2. use GWO operators to update the position of the particle; 15. else if the action is SFS 16. use SFS operators to update the position of the particle; 17. else if the action is WCA 18. use WCA operators to update the position of the particle; 19. else if the action is PSO 20. use PSO operators to update the position of the particle; 21. else if the action is MFO 22. use MFO operators to update the position of the particle; 23. else if the action is SCA 24. use SCA operators to update the position of the particle; 25. end 26. check for infeasibility of the particle; 27. bring infeasible particle to the feasible solution space; 28. calculate the objective function value of the particle; 29. determine the reward/punishment value; 30. determine the max Q-value for the next state; 31. update a(t); 32. update the Q-table; 33. update the current state, s(t) s(t + 1); 34. end 35. end 36. return the best solution efficiency of a metaheuristic algorithm, we should apply them to many benchmark functions. For this purpose, we use 29 benchmarks, including Unimodal, multimodal, fixed dimensional multimodal, and hybrid composite functions that are among the most complex benchmarks in the literature. For more details regarding these benchmarks, see "Appendix 1". We compare our algorithm to state-of-the-art methods, including Crow Search Algorithm (CSA), Artificial Bee Colony (ABC), Cuckoo Search (CS), Genetic Algorithm (GA), Moth-Flame optimization (MFO), Gravitational Search Algorithm (GSA), and Dragonfly Algorithm (DA). In order to solve the benchmark functions, we considered 15,000 Number of Function Evaluations (NFEs) to perform a reasonable assessment. Besides, to draw a reliable conclusion, we apply each algorithm on each benchmark 30 times, and report mean, standard deviation, worst and best results. Table 2 provides more details about the values of the main parameters of the algorithms. In the first step, we assess the performance of the HQLA on unimodal benchmarks (F1-F7). Figure 6 depicts a 2D representation of these benchmark functions. Unimodal benchmarks do not have several local optima and are considered to assess the exploration ability of metaheuristic algorithms. This set of test suites are difficult to solve since the algorithms should first locate the global optima approximately and then perform exploitation to provide the best approximation of the location of the global optima. Table 3 provides detailed information on the performance of the algorithms in unimodal benchmarks. 
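To complement the pseudo-code of Algorithm 2, the sketch below shows one way the HQLA loop can be organized in Python, reusing the operator sketches given earlier as callables with a common signature. The single-state Q-table, the epsilon-greedy action selection, and the linearly decaying learning rate are simplifying assumptions standing in for the paper's exact selection rule and Eq. (2), which are not reproduced in the text above.

```python
import numpy as np

def hqla(objective, lower, upper, operators, n_particles=30, max_iter=500,
         gamma=0.8, seed=0):
    """Simplified sketch of the HQLA loop (Algorithm 2): a Q-table selects, for each
    particle in each iteration, which updating operator performs the move.
    `operators` maps a name to callable(x, population, best, t, max_iter, rng)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    X = rng.uniform(lower, upper, (n_particles, lower.size))
    fit = np.array([objective(x) for x in X])
    best_i = int(fit.argmin())
    best, best_fit = X[best_i].copy(), fit[best_i]
    names = list(operators)
    Q = np.zeros((1, len(names)))          # single-state Q-table (an assumption for brevity)
    s = 0
    for t in range(1, max_iter + 1):
        alpha = 1.0 - 0.9 * t / max_iter   # assumed decaying learning rate in place of Eq. (2)
        eps = 1.0 - t / max_iter           # assumed epsilon-greedy exploration schedule
        for i in range(n_particles):
            a = int(rng.integers(len(names))) if rng.random() < eps else int(Q[s].argmax())
            cand = operators[names[a]](X[i], X, best, t, max_iter, rng)
            cand = np.clip(cand, lower, upper)        # repair infeasible particles
            f_new = objective(cand)
            reward = 1.0 if f_new < fit[i] else -1.0  # +1 on improvement, -1 otherwise
            Q[s, a] += alpha * (reward + gamma * Q[s].max() - Q[s, a])  # Eq. (1)
            if f_new < fit[i]:
                X[i], fit[i] = cand, f_new
                if f_new < best_fit:
                    best, best_fit = cand.copy(), f_new
    return best, best_fit
```

For instance, supplying a dictionary such as `{"GWO": ..., "SCA": ...}` built from thin wrappers around the earlier operator functions, together with a benchmark objective such as the sphere function, mirrors the experimental setup described in this section at a small scale.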
Based on the results of Table 3 , we observe that the HQLA performs the best in most of the benchmarks. In F1-F5 and F7, the HQLA outperforms all other algorithms by obtaining the best possible solution for all benchmarks. In F6, HQLA ranks third in providing the best solution for this benchmark. Besides, HQLA provides the lowest standard deviation that shows very low variability in the performance of this algorithm. Moreover, considering the boxplot of the results of Table 3 present in Fig. 7a -c, it becomes apparent that HQLA has the lowest and narrowest boxplot among the algorithms. Furthermore, Fig. 8a -c, that present the convergence plots of the methods, disclose that the HQLA can maintain a perfect balance among exploration and exploitation of the solution space. In Fig. 8a -c, we observe that the HQLA can continually improve the best solution attained in each iteration by choosing the best operator to explore the solution space. The learning process in HQLA enables the algorithm to determine the best operator to change the location of the particles in the solution region by evaluating the efficiency of each updating mechanism. Based on these observations, we conclude that HQLA is a reliable technique for this family of benchmark functions. The second and third family of the benchmarks are multimodal and fixed-dimension multimodal benchmarks. These benchmarks contain several local optima that make the solution process a complicated task. To perform well in solving these benchmark functions, the algorithms should maintain an excellent balance among exploration and exploitation. This will help the algorithms avoid local optimum. Figures 9 and 10 shows a schematic view of these benchmark functions in 2D. We provide detailed information on the performance of the algorithms on solving multimodal and fixed-dimension multimodal benchmarks in Tables 4 and 5. Based on the results, we observe that in F8-F12, F14, and F16-F23 (14 out of 16) benchmark functions, the HQLA outperforms other algorithms considering average, standard deviation, best and worst values over 30 repetitions. In some of these benchmarks, such as F9 and F11, the standard deviation of the results provided by HQLA is zero. In these benchmark functions, the HQLA achieves global optima in all the repetitions. These results indicate that HQLA is a robust and reliable algorithm in solving complex optimization problems. HQLA is able to choose between several operators that enable the algorithms to explore Table 2 The values of the parameters of the algorithms Algorithm and exploit the resolution region intelligently. The learning process in the HQLA helps the algorithm evaluate the efficiency of the operators and discover the most efficient operator (action) for any problem. Besides, Fig. 8a -c depict the perfect balance among exploration and exploitation in the performance of the HQLA throughout iterations. Moreover, in F13 and F15, the HQLA ranked second among all algorithms, which shows its high capability in solving optimization problems. Furthermore, Fig. 7a -c, disclose that the boxplot of the outcomes of the HQLA is narrower and lower than any other algorithm that highlights the superiority of the proposed methodology over existing approaches. The last family of the benchmark functions is hybrid composite benchmarks that are the most challenging (Liang et al. 2005) . Figure 11 presents a graphical representation of these benchmark functions in 2D. 
Computational results of solving this family of benchmark functions using each algorithm are available in Table 6 . Considering the results, we observe that the HQLA provides the best results and outperforms other algorithms in solving hybrid benchmarks. Table 6 shows that the HQLA obtained the lowest average, standard deviation, and best values for these benchmark functions while maintaining the lowest worst value. Besides, Fig. 13 approves this statement by showing that the boxplot of the HQLA is narrower and lower than any other algorithms. Moreover, Fig. 12 shows that the HQLA adjusts exploration and exploitation by intelligently choosing the best operators in each iteration. Based on Fig. 12 , we observe that the HQLA can improve the best solution consistently throughout iterations. Although we showed that the HQLA outperforms other state-of-the-art methods in terms of solution performance, we apply Friedman's test, which is a powerful statistical test, to show the statistical superiority of our algorithm to other methods. Table 7 presents the results of Friedman's test. Based on the outcomes, the average rank of the HQLA is 3.036207, which is far smaller than that of other algorithms. Therefore, from a statistical point of view, HQLA performs significantly better than all other algorithms. We note that we perform the Friedman's test at a 95 percent confidence level. Quebec is one of Canada's provinces that is dealing with the COVID-19 epidemic triggered by the SARS-CoV-2 virus. The province has been reported the most COVI-19 cases that account for more than 63,713 confirmed cases of COVID-19 and 5770 death cases by the disease. On April 29, 2020, Quebec hospitals announced that the healthcare system capacity could not respond to the influx of the COVID-19 patients to the hospitals (Weeks and Tu Thanh 2020; The Globe and Mail n.d.). On June 28, 2020, Montreal's Emergency Rooms (ERs) reported near capacity status due to limited resources (Fahmy and Ross 2020; CTV News. n.d.). These highlight the need for a methodology that could accurately predict the future trend of the pandemic in the province that enables the policymakers to optimize the resource allocation to avoid loss of lives, as many scientists declared that resource shortages such as ventilator shortages are the difference between life and death for patients (Kliff et al. 020 ; The New York Time. n.d.). Developing new methodologies to predict pandemic growth Fig. 8 a Convergence plot of the algorithms. b Convergence plot of the algorithms. c Convergence plot of the algorithms is essential to optimize resource allocation and determine the optimal time to implement lockdown measures. On the other hand, we could optimize resource allocation for lifethreatening cases admitted to ICUs and reduce the disease's fatality rate. In this section, we use the most recent and accurate model called SIDARTHE, presented by Giordano et al. (2020) published in Nature Medicine. The scientists showed that using the model, we could predict the future trend of the pandemic accurately. However, the solution to the problem is a cumbersome task. For more information on the SIDARTHE model, see "Appendix 2". In this research, we used the SIDARTHE model and applied it to real data from Quebec, Canada. In order to solve the model, we utilize HQLA. Figure 14 shows the convergence plot of HQLA throughout iterations. 
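The statistical comparison summarized in Table 7 above can be reproduced in outline with scipy: collect one performance value per algorithm and benchmark, run Friedman's test at the 95 percent confidence level, and compute average ranks. The array below is filled with random placeholders purely to show the mechanics; the paper's table is built from the actual benchmark results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(1)
# Rows: benchmark functions; columns: algorithms. Each entry would be, e.g.,
# the mean objective value over 30 independent runs (placeholders here).
scores = rng.random((29, 8))

stat, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")  # reject H0 if p < 0.05

# Average rank per algorithm (lower is better), as reported for HQLA in Table 7.
avg_ranks = rankdata(scores, axis=1).mean(axis=0)
print(avg_ranks)
```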
We divided the period from January 25, 2020 (day 1) to July 19, 2020 (day 176) into six stages, in each of which the Quebec government applied specific restrictions to control the pandemic; the phases are described below. We note that we considered the sum of mean square errors as the objective function value for our model, and its optimal value for our case study is 6.29E−06. Based on the outcomes, we observe that the HQLA can solve the problem very efficiently. Table 8 shows detailed statistics about the optimized factors of the model. In the following, we compare our results with actual data from Quebec to validate the model. Figure 15 shows the predicted and actual data from Quebec. In Fig. 15, we accurately predict the number of infected cases, recovered cases, and cumulative diagnosed cases. Based on the outcomes, we conclude that the HQLA is an efficient solution methodology for the problem.

As mentioned earlier, we separated the planning horizon into six phases in which the Quebec government announced detailed limitations to control the epidemic. The first case of COVID-19 was detected in Quebec, Canada, on February 27, 2020 (Lapierre 2020). In the first phase, the transmission rate was considered low. Based on our results, we observe that the transmission rates were low in the first stage, and the reproduction rate was R0 = 1.0998. The Quebec province first announced a state of emergency on March 12, 2020. We consider March 15, 2020, to March 24, 2020, as the second phase of the pandemic due to a drastic increase in the number of cases. On March 15, the Quebec government ordered the closure of all recreational and entertainment facilities (Gouvernement du Québec 2020). Following the quick progress in the number of infected cases, on March 27, Montreal declared a local state of emergency, and the Quebec government ordered the closure of all universities and schools. Besides, on March 20, the province banned indoor gatherings. In the second phase of the pandemic, our study estimates a reproduction rate of R0 = 3.8028 for the Quebec province. From March 24, 2020, to March 28, 2020, Quebec received more COVID-19 test kits, enabling the healthcare authorities to perform more tests and identify infected cases. In this phase, we approximate the reproduction rate as R0 = 4.6096. From March 28, 2020, to April 2, 2020, strict actions taken by the government decreased the reproduction rate to R0 = 1.1193. The reproduction rate dropped over the next period, from April 2, 2020, to April 13, 2020, to R0 = 1.0248. After April 13, 2020, the reproduction rate was estimated to be R0 = 0.7782. In order to present the progress of the pandemic over the next few months, we extended our prediction to the next 365 days, as shown in Fig. 16a, b. Based on our results, considering the current social distancing and limitations, we will experience a significant decrease in the number of cases in the next few months only if strict measures such as the partial lockdown remain in place.

In the previous section, we projected the pandemic growth over the next few months, considering the current partial lockdown and closure measures. However, in May 2020, the Quebec government ordered a gradual reopening of businesses. Consequently, it is vital to determine how the reopening of businesses will affect future conditions.
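As a complement to the description above, the sketch below shows how the SIDARTHE dynamics and the calibration objective can be set up in Python with scipy. The right-hand side follows the compartment and parameter descriptions given in Appendix 2 (in particular, per the notation there, δ multiplies the ailing fraction and γ the recognized fraction); the initial condition, the compartments chosen for fitting, and the single-phase parameter set are illustrative simplifications, whereas the paper fits phase-specific parameters over the six stages by minimizing this sum of mean square errors with HQLA.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sidarthe_rhs(t, y, p):
    """SIDARTHE right-hand side, following the compartment and parameter
    descriptions in Appendix 2 (all state variables are population fractions)."""
    S, I, D, A, R, T, H, E = y
    force = p["alpha"] * I + p["beta"] * D + p["delta"] * A + p["gamma"] * R
    dS = -S * force
    dI = S * force - (p["epsilon"] + p["zeta"] + p["lambda"]) * I
    dD = p["epsilon"] * I - (p["eta"] + p["rho"]) * D
    dA = p["zeta"] * I - (p["theta"] + p["mu"] + p["kappa"]) * A
    dR = p["eta"] * D + p["theta"] * A - (p["nu"] + p["xi"]) * R
    dT = p["mu"] * A + p["nu"] * R - (p["sigma"] + p["tau"]) * T
    dH = p["lambda"] * I + p["rho"] * D + p["kappa"] * A + p["xi"] * R + p["sigma"] * T
    dE = p["tau"] * T
    return [dS, dI, dD, dA, dR, dT, dH, dE]

def simulate(p, y0, days):
    """Integrate the model for `days` days; rows follow the S,I,D,A,R,T,H,E order."""
    sol = solve_ivp(sidarthe_rhs, (0, days), y0, args=(p,),
                    t_eval=np.arange(days + 1), rtol=1e-8)
    return sol.y

def mse_objective(p, y0, observed, fit_rows=(1, 6)):
    """Sum of mean square errors between simulated and observed series
    (fit_rows are illustrative, e.g., the infected and healed fractions;
    the paper also fits cumulative diagnosed cases). This is the quantity
    minimized with HQLA, with a reported optimum of 6.29E-06 for Quebec."""
    sim = simulate(p, y0, observed.shape[1] - 1)
    return float(sum(np.mean((sim[r] - observed[i]) ** 2)
                     for i, r in enumerate(fit_rows)))
```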
In this section, we examine the consequences of variations in the transmission rates on the progress of the pandemic. Therefore, we increased the parameters α, β, γ, δ, and ε and explored their impact on the number of infected, recovered, cumulative diagnosed, and death cases. We portray the outcomes in Figs. 17, 18, 19 and 20. The outcomes show that the parameter α has a substantial effect on pandemic growth, and increasing this parameter increases the number of infected, recovered, cumulative diagnosed, and death cases. Besides, increasing β, γ, and δ increases the number of infected, recovered, cumulative diagnosed, and death cases. Moreover, increasing the parameter ε remarkably reduces the number of infected, recovered, cumulative diagnosed, and death cases. Therefore, to decrease the spread of the virus and stop the pandemic, we need to reduce the transmission rate of the infection by applying significant social distancing and behavioral measures while increasing the detection rate of asymptomatic cases. Based on our results, if the lockdown measures are relaxed, the number of infected cases in Quebec will begin to rise again around day 250, that is, from October 1, 2020. Therefore, starting October 1, 2020, Quebec would experience a significant increase in the number of infected cases.

COVID-19 is a severe health threat, and the crisis is growing every day. Therefore, presenting new procedures to model and predict the COVID-19 pandemic is vital. Prediction of the COVID-19 pandemic will enable policymakers to optimize the use of healthcare system capacity and resource allocation to minimize the fatality rate. In this research, we designed a new hybrid reinforcement learning-based algorithm capable of solving complex optimization problems. We applied our algorithm to several well-known benchmarks and showed that the presented methodology provides high-quality solutions for the most complex problems in the literature. Besides, we showed the superiority of the offered method over state-of-the-art methods through several measures. Moreover, to demonstrate the efficiency of the proposed methodology in optimizing real-world problems, we applied our approach to the most recent data from Quebec, Canada, to predict the COVID-19 outbreak. Our algorithm, combined with the most recent mathematical model for COVID-19 pandemic prediction, accurately reflected the future trend of the pandemic. Furthermore, we analyzed several scenarios to deepen our insight into pandemic growth. Our results showed that the transmission rate caused by the interaction of a susceptible case with an asymptomatic case has the most significant effect on future trends. Increasing this parameter can significantly increase the number of infected, recovered, cumulative diagnosed, and death cases. Besides, increasing the transmission rate due to contact of a susceptible case with a diagnosed, ailing, or recognized case increases the number of infected, recovered, cumulative diagnosed, and death cases. Moreover, increasing the parameter ε remarkably reduces the number of infected, recovered, cumulative diagnosed, and death cases. Therefore, to decrease the spread of the virus and stop the pandemic, we need to reduce the transmission rate of the infection while increasing the detection rate of asymptomatic cases. As future research, from a mathematical modeling viewpoint, it would be worthwhile to consider stochasticity in the proposed model (Belen et al. 2011; Özmen et al. 2011; Weber et al. 2011).
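As an illustration of the scenario analysis described earlier in this section, the short sketch below rescales one transmission parameter and re-simulates the model, reusing the `simulate` helper from the previous sketch; the scaling factors, the parameter name, and the baseline dictionary `p_fit` are placeholders rather than fitted values.

```python
def scenario_curves(p_fit, y0, days, name="alpha", factors=(0.8, 1.0, 1.2)):
    """Re-simulate the model with one transmission parameter scaled down or up,
    returning the infected-fraction trajectory for each scenario
    (reuses `simulate` from the SIDARTHE sketch above)."""
    curves = {}
    for f in factors:
        p = dict(p_fit)
        p[name] = f * p_fit[name]
        curves[f] = simulate(p, y0, days)[1]  # row 1 is the infected (I) compartment
    return curves
```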
Besides, it would be interesting to use the proposed model to plan for resource management during the pandemic to decrease the fatality rate by increasing the healthcare system capacity. Based on the predictions, healthcare managers can plan for testing kit allocation to test centers. Also, it would be worthwhile to consider the age, medical condition, and gender of the infected cases in the model from a modeling perspective. Moreover, it would also be interesting to use the proposed model to plan for managing healthcare resources such as personal protective equipment (PPE) and ventilators during the pandemic. In this appendix, we describe the SIDARTHE model proposed by Giordano et al. (2020) . The model considers eight states for people, including Susceptible, Infected, Diagnosed, Ailing, Recognized, Threatened, Healed, Extinct. The model reflects cases in different situations, including symptomatic, asymptomatic, detected, undetected, and life-threatening symptoms that required ICU admission. The suggested model includes eight ordinary differential equations to show the progress of the pandemic over time. We use the following notations, as in Table A2 , to present the model. Based on the notations mentioned above, Giordano et al. (2020) presented the following model.Ṡ Parameters α Transmission rate due to contact of a susceptible case with an infected case β Transmission rate due to contact of a susceptible case with a diagnosed case δ Transmission rate due to contact of a susceptible case with an ailing case γ Transmission rate due to contact of a susceptible case with a recognized case 2 The rate of detecting asymptomatic cases ζ The probability that an infected case is aware of being infected η The probability that an infected case is unaware of being infected θ The detection rate of symptomatic cases λ, κ, ξ , ρ, σ The recovery rate of the five categories of infected cases μ The probability that an undetected/detected infected case shows life-threatening symptoms υ The probability that a detected infected case develops life-threatening symptoms τ Mortality rate The fraction of susceptible (not infected) cases in the population The fraction of infected (infected and undetected cases without symptoms) cases in the population The fraction of ailing (infected and undetected cases with symptoms) cases in the population The fraction of recognized (infected and detected cases with symptoms) cases in the population The fraction of threatened (infected detected cases that developed life-threatening symptoms) cases in the population The fraction of recovered cases in the population The fraction of death cases in the population No free lunch theorem: A review Gradient-based optimizer: A new metaheuristic optimization algorithm Practical Reinforcement Learning: Develop self-evolving, intelligent agents with OpenAI Gym, Python and Java A novel metaheuristic method for solving constrained engineering optimization problems: Crow search algorithm On the classical Maki-Thompson rumour model in continuous time Quebec hospitals struggling with influx of COVID-19 patients even as province moves to reopen-The Globe and Mail Thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm Coronavirus Disease (COVID-19) in Québec | Gouvernement du Québec. (n.d.). 
Retrieved
DDSE: A novel evolutionary algorithm based on degree-descending search strategy for influence maximization in social networks
Small-world optimization algorithm for function optimization
Particle swarm optimization
Water cycle algorithm - A novel metaheuristic optimization method for solving constrained engineering optimization problems
CTV News Montreal Digital Reporter: Montreal ERs are near capacity with delayed health problems, a worrying sign as second wave looms
Artificial intelligence through simulated evolution
Central force optimization
Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy
Exploiting the structure of two-stage robust optimization models with exponential scenarios
Vehicle routing problems with synchronized visits and stochastic travel and service times: Applications in healthcare
Black hole: A new heuristic optimization approach for data clustering
Drugs and vaccines for COVID-19: Authorized clinical trials
Genetic algorithms
A three-dimensional statistical volume element for histology informed micromechanical modeling of brain white matter
A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (ABC) algorithm
A novel meta-heuristic optimization algorithm: Thermal exchange optimization
Black Hole Mechanics Optimization: A novel meta-heuristic algorithm
A novel heuristic optimization method: Charged system search
An efficient hybrid algorithm based on Water Cycle and Moth-Flame Optimization algorithms for solving numerical and constrained engineering optimization problems
Designing energy-efficient high-precision multi-pass turning processes via robust optimization and artificial intelligence
Multi-objective stochastic fractal search: A powerful algorithm for solving complex multi-objective optimization problems
Sine-cosine crow search algorithm: Theory and applications
Optimizing a multi-item economic order quantity problem with imperfect items, inspection errors, and backorders
Designing an efficient blood supply chain network in crisis: Neural learning, optimization and case study
Optimization by simulated annealing. Science
There aren't enough ventilators to cope with the coronavirus - The New York Times
Genetic programming: On the programming of computers by means of natural selection
Quebec's first case of COVID-19 has been confirmed
A novel nature-inspired algorithm for optimization: Virus colony search
Novel composition test functions for numerical global optimization
A robust optimization model for sustainable and resilient closed-loop supply chain network design considering conditional value at risk. Numerical Algebra, Control and Optimization
Investigation of wind farm location planning by considering budget constraints
Interdependent demand in the two-period newsvendor problem
Coronavirus Optimization Algorithm: A bioinspired metaheuristic based on the COVID-19 propagation model
Moth-flame optimization algorithm: A novel nature-inspired heuristic paradigm. Knowledge-Based Systems
The ant lion optimizer
Dragonfly algorithm: A new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems
SCA: A sine cosine algorithm for solving optimization problems. Knowledge-Based Systems
The whale optimization algorithm
Salp Swarm Algorithm: A bio-inspired optimizer for engineering design problems
Multi-verse optimizer: A nature-inspired algorithm for global optimization
Grey Wolf optimizer
Multi-objective grey wolf optimizer: A novel algorithm for multi-criterion optimization
Curved space optimization: A random search based on general relativity theory
Minimizing makespan in a single machine scheduling problem with deteriorating jobs and learning effects
RCMARS: Robustification of CMARS with different scenarios under polyhedral uncertainty set
Differential evolution
GSA: A gravitational search algorithm
Evolutionsstrategien
Stochastic fractal search: A powerful metaheuristic algorithm. Knowledge-Based Systems
Big data-driven cognitive computing system for optimization of social media analytics
Grasshopper optimisation algorithm: Theory and application
Lightning search algorithm
Biogeography-based optimization
Multi-objective optimization for the reliable pollution-routing problem with cross-dock selection using Pareto-based algorithms
Fuzzy mathematical programming and self-adaptive artificial fish swarm algorithm for just-in-time energy-aware flow shop scheduling problem with outsourcing option
A novel meta-heuristic algorithm: Dynamic virtual bats algorithm. Information Sciences
An effective estimation of distribution algorithm for solving the distributed permutation flow-shop scheduling problem
Modeling, inference and optimization of regulatory networks based on time series data
Q&A on coronaviruses
No free lunch theorems for optimization
Cuckoo search via Lévy flights
A hybrid Q-learning sine-cosine-based strategy for addressing the combinatorial test suite minimization problem
Development of a mathematical model for sustainable closed-loop supply chain with efficiency and resilience systematic framework

Appendix 1: See Table A1, the unimodal benchmark functions.