key: cord-0278644-0unvmj2y
authors: Kweon, Jihoon; Kim, Kyunghwan; Lee, Chaehyuk; Kwon, Hwi; Park, Jinwoo; Song, Kyoseok; Kim, Young In; Park, Jeeone; Back, Inwook; Roh, Jae-Hyung; Moon, Youngjin; Choi, Jaesoon; Kim, Young-Hak
title: Deep reinforcement learning for guidewire navigation in coronary artery phantom
date: 2021-10-05
journal: nan
DOI: 10.1109/access.2021.3135277
sha: b250bdf7a782e4ed126407b55b7b33dfed4229bf
doc_id: 278644
cord_uid: 0278644-0unvmj2y

In percutaneous intervention for the treatment of coronary plaques, guidewire navigation is a primary procedure for stent delivery. Steering a flexible guidewire within coronary arteries requires considerable training, and the non-linearity between the control operation and the movement of the guidewire makes precise manipulation difficult. Here, we introduce a deep reinforcement learning (RL) framework for autonomous guidewire navigation in a robot-assisted coronary intervention. Using Rainbow, a segment-wise learning approach is applied to determine how best to accelerate training using human demonstrations with deep Q-learning from demonstrations (DQfD), transfer learning, and weight initialization. The 'state' for RL is customized as a focus window near the guidewire tip, and subgoals are placed to mitigate the sparse reward problem. The RL agent improves its performance, eventually enabling the guidewire to reach all valid targets in a 'stable' phase. Our framework opens a new direction in the automation of robot-assisted intervention, providing guidance on RL in physical spaces involving mechanical fatigue.

Coronary arteries are vessels that supply oxygen-rich blood and nutrients to the myocardium. When coronary arteries are obstructed, the heart muscle is not supplied with sufficient oxygen and energy, resulting in ischemia. Such ischemic heart disease is reported to be the leading cause of death, responsible for 16% of the world's total deaths. Percutaneous coronary intervention (PCI) with balloon angioplasty and stent implantation is the standard treatment for coronary artery stenosis. Catheter access provides a path from the incision area to the coronary ostium, and the balloon and stent are delivered along the guidewire to the target location. Because the diameter of the coronary artery is relatively small (≤4 mm) (Dodge Jr et al., 1992) and the distance between the operating controls and the distal end of the guidewire is long, a considerable amount of specialized training is required for precise manipulation of the interventional devices. Friction with the vessel wall causes deformation of the flexible tip, impeding guidewire control, and the risk of perforation may increase when treating severe lesions (Guttmann et al., 2017). The non-linear relationship between the control motion applied to the guidewire and the movement of the distal end is an important feature that makes precise control of the device difficult. Interventional robots for coronary diseases have been introduced to improve the manipulation of interventional devices with reduced irradiation (Beyar et al., 2005; Kiemeneij et al., 2008). The safety and feasibility of interventional robots have been demonstrated by clinical studies (Weisz et al., 2013; Patel et al., 2020), and their applications have been widened to complex lesions such as multi-vessel disease and chronic total occlusion (Hirai et al., 2020; Mahmud et al., 2017).
Integration of telecommunication systems with robotic apparatus enables the remote operation of robot-assisted PCI (Patel et al., 2019). In pandemic situations such as the current COVID-19 wave, robotic procedures have been proposed as a way to reduce the potential infection risk for medical staff and patients (Attanasio et al., 2021; Zemmar et al., 2020). With the adoption of artificial intelligence, interventional robots are expected to be further automated to minimize interference from human operators (Sardar et al., 2019).

Reinforcement learning (RL) is an area of machine learning that trains an agent to achieve a goal by maximizing rewards, which are imposed by the next state in response to changing conditions when an action is taken in the current state. Deep RL has been applied to various domains, such as Go and computer games, to surpass the world's best human players (Silver et al., 2016; Mnih et al., 2015), and its applications have expanded from software to control engines for real-world hardware (Levine et al., 2016; Gandhi et al., 2017). Considering that simple control operations are repeatedly performed when operating a robot-assisted intervention system, deep RL may be a solution that effectively alleviates the burden on human operators. Recent applications of autonomous control of interventional devices in phantom simulations support the potential applicability of deep RL (You et al., 2019; Behr et al., 2019; Karstensen et al., 2020; Chi et al., 2020; Zhao et al., 2019).

In this study, we propose a deep RL framework for autonomous guidewire navigation in robot-assisted coronary interventions. We focus on how to accelerate the RL training to prevent mechanical fatigue of the guidewire due to repetitive movements. First, under the constraints of a discrete action space and training in a real-world setting, Rainbow (Hessel et al., 2018) was applied (Figure 1a). Rainbow, which integrates Deep Q-Networks (DQN) (Mnih et al., 2015) with recent advancements in reinforcement learning (van Hasselt et al., 2016; Schaul et al., 2015; Wang et al., 2016; Fortunato et al., 2017), has demonstrated outstanding performance in real-world environments (Church et al., 2020). Replay memory (Mnih et al., 2015), a representative off-policy technique for enhancing sample efficiency and increasing training speed with quality data, was a key component in reducing the physical time requirement. To optimize the initial composition of the replay memory, human demonstrations with DQfD (Hester et al., 2018) and weighted random action (WRA) were evaluated. Second, the state for the RL agent was customized with a focus window near the guidewire tip. Given that the movement of the guidewire per control command was about two orders of magnitude smaller than the travel distance of the navigation, the focus window allowed the RL agent to confine its input to the area with the most important information. Subgoals, like the dots in 'Pac-Man', guided the navigation toward goals beyond the focus window and mitigated the sparse reward problem. Finally, segment-wise training was conducted, inspired by the concept of curriculum learning. When expanding the navigation area, transfer learning was applied using the model from the previous training. Our framework was assessed in a two-dimensional (2D) coronary phantom for trainees and further validated in a three-dimensional (3D) coronary phantom with fluid flow.

A robotic module developed at Asan Medical Center was used for guidewire navigation (Figure 1a).
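To make the replay-memory seeding concrete, the following is a minimal sketch, not the authors' implementation, of a replay memory that holds human-demonstration and weighted-random-action transitions together and samples demonstrations with a higher priority, in the spirit of DQfD. The `Transition` fields, the priority values, and the eviction rule are illustrative assumptions.

```python
import random
from collections import namedtuple

# Illustrative transition container; the field names are assumptions, not the authors' code.
Transition = namedtuple("Transition",
                        ["state", "action", "reward", "next_state", "done", "is_demo"])

class SeededReplayMemory:
    """Replay memory that keeps human demonstrations alongside agent transitions and
    samples demonstrations with a higher priority (DQfD-style)."""

    def __init__(self, capacity, demo_priority_bonus=1.0, base_priority=0.01):
        self.capacity = capacity
        self.demo_priority_bonus = demo_priority_bonus  # extra sampling weight for demonstrations (assumed value)
        self.base_priority = base_priority
        self.buffer, self.priorities = [], []

    def push(self, transition):
        priority = self.base_priority + (self.demo_priority_bonus if transition.is_demo else 0.0)
        if len(self.buffer) >= self.capacity:
            # Evict the oldest non-demonstration transition so demonstrations are retained.
            idx = next((i for i, t in enumerate(self.buffer) if not t.is_demo), 0)
            self.buffer.pop(idx)
            self.priorities.pop(idx)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        return random.choices(self.buffer, weights=weights, k=batch_size)
```

Before network updates begin, such a memory would be filled with the recorded demonstration episodes and with transitions generated by weighted random actions, matching the 'transition generation' phase described below.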
A pair of roller units rotating in opposite directions enabled forward and backward motion of the guidewire (Terumo Radifocus 0.035", Terumo Co., Ltd., Japan). The vertical translation of the roller units produced the rotation of the guidewire. The roller units, driven by step motors, generated discrete control commands corresponding to 0.4 mm displacement or 33° rotation of the guidewire at the roller side. The guidewire with a pre-angled tip was delivered via the guiding catheter (Heartrail II JL-3.5, Terumo Co., Ltd., Tokyo, Japan) engaged at the ostium of the coronary artery in the phantom. The entire phantom area was captured by a RealSense™ D435 camera (Intel Co., Ltd., CA) mounted orthogonal to the phantom.

Figure 1: (a) Interaction between the physical environment and the RL agent. Of the entire navigation area, only information around the guidewire tip is defined as the 'state', which is given as an input to the RL agent. The RL agent selects a control command as an 'action' maximizing the expected 'rewards' at the current 'state', and the selected control command is transferred to the robotic device to perform one of the control operations: forward/backward motion and rotation. While this process is repeated, sets of states, actions, rewards, and next states (transitions) are accumulated in the replay memory, and RL training is performed periodically using these transitions. (b) Reward design of reinforcement learning for guidewire navigation.

Figure 2: (a) The navigation area, divided into three zones, is initially designated as the proximal zone and is expanded by adding the medial and distal zones in turn. The goal is set at the target location of the guidewire, and terminal signals are assigned to other branches. (b) Because the goal is not visible in the focus window around the guidewire tip, subgoals are introduced. The focus window contains at least one subgoal or a goal. (c) Experimental setup according to navigation area.

To build an RL agent that determines control commands using Rainbow (Hessel et al., 2018), a convolutional neural network (CNN) was constructed (Figure 1a). Using 'state' information composed of four consecutive images as input, the trained network output a distribution of Q-values for deciding an 'action'. According to the control command, the guidewire was manipulated by the robotic module, and the RL agent received a 'reward' depending on the 'next state'. A step was defined as the generation process of a transition, which was a set of state, next state, action, and reward. Every transition was saved in the replay memory. In each episode, the guidewire tip, initially located in front of the catheter, was moved toward a goal by a combination of control commands. When the guidewire reached a target location within 500 steps, the episode was considered a 'success'; otherwise, it was considered a 'failure'. After finishing an episode, the guidewire was pulled back to the initial location, and then a new episode began. The training consisted of 1000 episodes, and the goal was switched randomly for each episode. At the beginning of training, transitions for the replay memory were generated using weighted random action (WRA) or the transferred weights for a given number of steps. The network was not updated during this 'transition generation' phase. The composition and generation method of transitions for the replay memory are summarized in Table 1.
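The episode structure described above can be summarized in the following minimal sketch. `env`, `agent`, and `memory` are placeholder objects standing in for the robotic environment, the Rainbow agent, and the replay memory, and the update interval and batch size are assumptions rather than the authors' settings.

```python
MAX_STEPS_PER_EPISODE = 500   # an episode that does not reach the goal within 500 steps is a 'failure'
NUM_EPISODES = 1000           # total training episodes per navigation area

def run_training(env, agent, memory, warmup_steps, train_interval=4, batch_size=32):
    """Sketch of the training loop: a warm-up 'transition generation' phase with a frozen
    network, followed by periodic updates from the replay memory."""
    global_step = 0
    for episode in range(NUM_EPISODES):
        state = env.reset()   # guidewire pulled back in front of the catheter; goal chosen at random
        for _ in range(MAX_STEPS_PER_EPISODE):
            action = agent.select_action(state)           # forward, backward, or rotation command
            next_state, reward, done = env.step(action)   # robotic module executes the command
            memory.push((state, action, reward, next_state, done))
            state = next_state
            global_step += 1
            # No network updates during the initial transition-generation phase.
            if global_step > warmup_steps and global_step % train_interval == 0:
                agent.update(memory.sample(batch_size))
            if done:                                      # goal reached or terminal signal hit
                break
```

In the actual setup, a call like `env.step` would correspond to sending one discrete command to the robotic module and grabbing the next camera frame.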
The loss function was defined as a combination of the Rainbow loss (Hessel et al., 2018) with the large margin classification loss and L2 regularization loss from DQfD (Hester et al., 2018). The large margin classification loss was used only for training the RL agent with human demonstrations. The hyper-parameters used for the training are summarized in Table A1 in the Appendix.

In defining the 'state', two major modifications were introduced: the focus window and subgoals. The image area for the state, which was converted to grayscale as in X-ray angiography, was cropped to 84 × 84 pixels near the guidewire tip, allowing the RL agent to focus on the most important information (Figure 2b). The main drawback of the image crop was that the RL agent could not recognize the goal until the guidewire approached the target location. Therefore, subgoals were added along the path leading to the target location. The subgoals were initially set at the bifurcation points, and additional subgoals were placed at a distance of 20 pixels, which is about a quarter of the image size for the state. Also, terminal signals were designated near the entrances of untargeted branches, which helped prevent the RL agent from useless exploration.

The action space of the RL agent was composed of forward/backward motion and rotation. The magnitude of each action generated by the robotic manipulator was fixed. The rotational direction of the guidewire was not changed until the roller units reached their maximum angle. The RL agent accumulated a negative reward of -0.001 per step, while a zero reward was given at subgoals and the final goal (Figure 1b). When the guidewire tip arrived at a terminal signal, a large negative reward of -0.5 was imposed.

DQfD (Hester et al., 2018) was proposed to enhance the performance of reinforcement learning with a small amount of demonstration data. In our application of DQfD, human demonstrations were used to pre-train the network with supervised learning and were sampled with a high priority in the replay memory for reinforcement learning. To record demonstration data for the experiments using the 2D phantom, 10 episodes per target location in the proximal and medial zones were created, with trained personnel generating discrete control commands using a keyboard (Figure 3).

3 Results

First, the RL navigation in this study aimed to deliver the guidewire to a target location in the main vessel or a side branch of a 2D phantom (PCI trainer for beginners, Medi Alpha Co., Ltd., Japan). The training was conducted by dividing the left anterior descending artery into three parts and expanding the navigation area step by step (Figure 2a). The RL performance was evaluated by independently performing the procedure three times per experiment, using a new guidewire each time. The target was selected at the beginning of every episode from a uniform random distribution. A paired Wilcoxon test was used to compare the operation time and number of steps between RL agents. Values of p < 0.001 were considered statistically significant. Statistical analyses were performed using the R package and SPSS 17.0 for Windows (IBM Corp., Armonk, NY, USA).

For the main and side targets in the proximal zone, we evaluated the effects of human demonstrations with DQfD on RL performance for guidewire navigation (Figure 2c). A stenotic lesion was located in the branching area between the targets, which hindered guidewire control. Another side branch, opposite the proximal-side target, was set as a terminal signal.
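As an illustration of the focus-window state and the reward design described in the methods paragraphs above, the following is a minimal sketch. The tip-tracking step, the pixel distance threshold, and the helper names are assumptions rather than the authors' implementation; only the crop size and reward values come from the paper.

```python
import numpy as np

WINDOW = 84              # state crop size in pixels (from the paper)
STEP_PENALTY = -0.001    # accumulated on every step
TERMINAL_PENALTY = -0.5  # imposed when the tip reaches a terminal signal

def crop_focus_window(frame_gray, tip_xy):
    """Crop an 84 x 84 grayscale patch centred on the guidewire tip.
    Tip detection itself is assumed to be provided by the imaging pipeline."""
    h, w = frame_gray.shape
    half = WINDOW // 2
    x0 = int(np.clip(tip_xy[0] - half, 0, w - WINDOW))
    y0 = int(np.clip(tip_xy[1] - half, 0, h - WINDOW))
    return frame_gray[y0:y0 + WINDOW, x0:x0 + WINDOW]

def compute_reward(tip_xy, subgoals, goal_xy, terminals, reach_dist=5.0):
    """Reward per the design in Figure 1b: a small step penalty, zero reward at subgoals
    and the goal, and a large penalty at terminal signals. Distances are in pixels;
    reach_dist is an assumed threshold, not a value reported in the paper."""
    tip = np.asarray(tip_xy, dtype=float)
    if any(np.linalg.norm(tip - np.asarray(t, dtype=float)) < reach_dist for t in terminals):
        return TERMINAL_PENALTY, True        # untargeted branch entered: episode terminates
    if np.linalg.norm(tip - np.asarray(goal_xy, dtype=float)) < reach_dist:
        return 0.0, True                     # goal reached: success
    if any(np.linalg.norm(tip - np.asarray(s, dtype=float)) < reach_dist for s in subgoals):
        return 0.0, False                    # subgoal reached: the step penalty is not applied
    return STEP_PENALTY, False
```

The returned flag indicates whether the episode should terminate (goal reached or terminal signal hit); the 500-step limit from the episode definition would be enforced by the surrounding loop.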
When human demonstrations were added to the replay memory (P1 model), the learning speed was initially faster (< 175 episodes in Figure 4a) than that of the RL agent trained with only WRA (P2 model). Thereafter, however, the success rate of the P2 model increased rapidly, reaching 99% first, at the 216th episode (Table 2). After 500 episodes, RL navigation hardly failed, while the P2 model required significantly fewer control commands (80.1 ± 41.3 vs. 67.7 ± 28.3, p < 0.0001) and less operating time (10.40 ± 6.32 s vs. 9.29 ± 6.00 s, p < 0.0001). Compared to the human operation used for the demonstration data (82.1 ± 34.2), the reduction rates in the number of steps were 10.3% and 23.1% for the P1 and P2 models in the final 100 episodes, respectively. For both models, although the success rate for the proximal-main target (vs. the proximal-side target) was slightly lower in the beginning, the difference between the navigation goals almost disappeared as the RL training progressed.

Figure 5: Tip trajectories of the guidewire in RL navigation. (a) In the proximal zone, the RL agent initially explores the path in a stochastic pattern, and as the RL evolves, the guidewire successfully reaches its goal by repeating efficient patterns in the stable phase. (b) In the medial zone, the RL agent finds an effective way to move the guidewire into a small side branch where the ostium is narrowed. (c) The RL agent passes through the severe obstruction in the distal zone by facing the guidewire tip to the right with respect to the travel direction.

In the early stage of training, the RL agent explored the path in a stochastic pattern (exploration phase in Figure 5a). As the training progressed, unnecessary changes in the orientation of the guidewire tip were gradually reduced, and the probability of escaping from untargeted branches improved (evolving phase). In the 'stable' phase, at the last stage of training, two representative patterns were found in the trajectories of guidewire navigation. In the first pattern, after proceeding along the centerline of the main vessel, the guidewire rotated sharply toward the proximal-side target in the bifurcation area or advanced to the proximal-main target. The second pattern was characterized by avoiding the branch vessel carrying the terminal signal. The RL agent then steered the guidewire along the sidewall of the side branch (proximal-side target) or used the evasive movement again toward the opposite side (proximal-main target).

Because the travel distances to the medial targets were roughly three times longer than those to the proximal targets, it was extremely difficult to reach the goals with only WRA, especially the medial-side target (Figure 2a). For the medial targets, the transfer learning approach was applied by initializing the network with the models trained in the proximal zone. The M1 model, as a control, was trained using human demonstrations with DQfD, like the P1 model. The M2 model took its initial weights from the P1 model, which also generated the initial transitions to the medial targets. The M3 model used the weights of the P2 model, and its initial transitions were produced with both the transferred weights and WRA (Figure 2c and Table 1). Despite the increased travel distance and multiple intervening branches, the success rates of the M2 and M3 models increased sharply in the same pattern as in the proximal experiment, exceeding 95% from the 212th episode for both models (Figure 4b and Table 2).
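As a minimal illustration of the transfer learning used for the M2 and M3 models, the sketch below initializes the network for a new navigation zone from the weights trained on the previous zone. `RainbowNet` and the checkpoint paths are hypothetical placeholders, and the snippet assumes a PyTorch implementation, which the paper does not specify.

```python
import torch

def init_from_previous_zone(network_cls, prev_checkpoint_path, device="cuda"):
    """Initialize an agent network for the expanded navigation area (e.g. medial zone)
    from the weights trained on the previous area (e.g. proximal zone)."""
    model = network_cls()                                 # same architecture as the previous zone
    state_dict = torch.load(prev_checkpoint_path, map_location=device)
    model.load_state_dict(state_dict)                     # carry over all learned parameters
    return model.to(device)

# Hypothetical usage: the M2 model starts from the P1 weights, the M3 model from the P2 weights.
# m2_net = init_from_previous_zone(RainbowNet, "p1_proximal.pt")
# m3_net = init_from_previous_zone(RainbowNet, "p2_proximal.pt")
```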
After the success rate saturated, little difference was found between the performance of the two models using transfer learning. Also, the number of steps and total reward per episode remained stable in terms of 100-episode averages. Although human demonstrations allowed the RL agent to temporarily produce better results (M1 model), after the initial stage (> 200 episodes) the performance of the M2 and M3 models exceeded that of the RL agent trained without transfer learning. The most efficient pattern for the medial targets, unsurprisingly, was to follow the centerline primarily using forward commands, which accounted for most of the trajectories in the last half of the experiment (Figure 5b). The main reasons for failure or longer travel distances were that the guidewire was misled into side branches (medial-main target) and that the orientation of the guidewire tip had to be changed repeatedly to pass through the narrowing (medial-side target).

In the final stage, the goal was to demonstrate that the RL agent was viable for the major destinations of guidewire navigation: the proximal-side, medial-side, and distal-main targets. Transfer learning was applied using the trained weights of the M3 model. To address the overfitting issue, weight initialization was applied by replacing the final layer of the convolutional neural network with randomly generated parameters. Without weight initialization (D1 model), the performance of the RL agent regressed as the training continued. For the last 100 episodes, the D1 model produced a success rate of 25% and mostly failed to reach the distal-main target (Figure 4c). When the final-layer initialization was applied (D2 model), the success rates for the medial-side and distal-main targets fluctuated but eventually approached 100% for all targets. The success rate of the D2 model was 98.0% on average over the last 300 episodes. The change in the number of steps during the training process was relatively small compared with the proximal and medial zones, because the navigation was terminated early in the failed episodes. Initially, guidewire control suffered from the need to adjust the orientation of the guidewire in front of the stenosis (Figure 5c). Unless the guidewire tip faced right relative to the travel direction, it was exceedingly difficult to proceed through the distal obstruction. As the training progressed, even when the guidewire moved along an incorrect route, the RL agent returned it to a path that could reach the designated goal, as learned in the previous experimental stages. Eventually, the navigation trajectories for the distal-main target almost converged, except for irregular patterns around the largest side branches in the proximal zone.

The training process of our framework was further validated using a 3D phantom. Expanding the navigation area in the 3D phantom (Embedded Coronary Model, Trandomed 3D Medical Technology Co., Ltd., China), the training methods of the P2, M3, and D2 models were applied sequentially, following the best scenario in the 2D phantom. In the 3D experimental setup (Figure 6a), the vessel and the guidewire placed away from the center of the camera view could appear shorter than their actual length, similar to the foreshortening effect in X-ray coronary angiography. Also, fluid at a physiological flow rate was supplied by an output-adjustable pump (WT300-1JA, Longer Precision Pump Co., Ltd., China) to the right coronary artery (RCA), which, along with the silicone wall, could affect the dynamic behavior of the guidewire.
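A minimal sketch of the final-layer initialization applied for the D2 model above: after loading the transferred M3 weights, the parameters of the network's last layer are replaced with freshly initialized values. The paper does not describe the network head in detail, so the assumption that the final layer is a standard `nn.Linear` (rather than, say, a noisy or distributional head) is illustrative.

```python
import torch.nn as nn

def reinitialize_final_layer(model):
    """Re-initialize the last linear layer of a transferred network with random parameters,
    mirroring the weight initialization used for the D2 model. Adapt the layer selection
    if the actual head differs (e.g. noisy linear layers in Rainbow)."""
    last_linear = None
    for module in model.modules():
        if isinstance(module, nn.Linear):
            last_linear = module                  # remember the last Linear layer encountered
    if last_linear is not None:
        nn.init.kaiming_uniform_(last_linear.weight, nonlinearity="relu")
        if last_linear.bias is not None:
            nn.init.zeros_(last_linear.bias)
    return model
```

For the D1 model, this step would simply be skipped, so the transferred output layer is reused unchanged.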
Despite substantial changes in the experimental environment, and thereby in its interaction with the guidewire (Figures 6c and 6d), an RL agent was constructed that was able to steer the guidewire to all valid targets (see Figure A1 in the Appendix). The navigation performance of the RL agent improved through the exploration and evolution phases, and eventually it became capable of maneuvering the guidewire into vessels with wavy walls and differing branching patterns (Figure 6b and Video A1).

Our framework demonstrated the potential applicability of autonomous guidewire navigation using RL. To accelerate the training and avoid mechanical fatigue of the guidewire, human demonstrations with DQfD (Hester et al., 2018), transfer learning, and weight initialization were evaluated as a segment-wise learning approach. The focus window and subgoals were introduced to customize the state and reward for the RL agent, respectively. The RL agent improved its navigation performance through the 'exploration' and 'evolution' phases, which eventually enabled the guidewire to reach all valid targets in a 'stable' phase.

Human demonstrations initially accelerated the training but required more time to further increase the success rate. Sampling human demonstrations with a high priority, even after the RL agent's performance exceeds that of the demonstrations, may hinder further improvement. The patterns of the human demonstrations were suboptimal and rarely appeared in the 'stable' phase (Gao et al., 2018). Also, the difference in input frequency between the human operator and the RL agent may cause different interactions between the guidewire tip and the experimental environment. In the navigation to the medial and distal targets, the segment-wise approach helped collect better transitions for training the RL agent toward targets that had a low probability of being reached by random action alone. Transfer learning, which is commonly used to improve the learning speed and performance of deep networks (Girshick et al., 2014), also contributed to reducing the time required to reach 100% success.

Under realistic physiological conditions, providing accurate state information to the RL agent is the key to applying RL to vascular pathways that deform three-dimensionally with the heartbeat. Uncertainties in registration can be an obstacle to applying our framework, which requires precise position information to define the 'state' and place the subgoals. To detect the relative location of the guidewire within the coronary tree, a dynamic coronary roadmap can be helpful (Piayda et al., 2018). A recently updated method provides real-time registration of X-ray angiography with the guidewire tip in fluoroscopic images using ECG gating (Ma et al., 2020). Also, deep-learning segmentation of major vessels in X-ray angiography offers rapid and accurate identification of the target vessel to be reached (Yang et al., 2019).

The ultimate goal of autonomous navigation using deep RL is to build a generalized model encompassing the anatomical diversity of the coronary arteries. The time and cost of training can be an obstacle that fundamentally limits the application of RL navigation. To this end, the development of novel simulators is essential (Wang et al., 2017; Dulac-Arnold et al., 2019). Virtual simulators provide an opportunity to train more subjects quickly at low cost. Distributed RL can improve the training and the performance of the RL agent using the strengths of virtual simulators (Mnih et al., 2016).
Advancements in the modeling of cardiovascular anatomy (Corral-Acero et al., 2020) and interventional devices (Sharei et al., 2018) support the construction of virtual simulators that more accurately mimic interactions in the human body. Physical simulators may help translate results from virtual simulators to in-vivo applications by relieving safety issues. Integration of novel 3D printing techniques (Stepniak et al., 2020) with functional modeling of the cardiovascular system (Vukicevic et al., 2017) may allow the dynamic response of interventional devices to be implemented in physical simulators. Our framework is expected to contribute to the adoption of autonomous navigation not only by providing data necessary for modeling virtual simulators, but also by presenting guidance on training methods for physical simulators involving mechanical fatigue.

References

Lumen diameter of normal human coronary arteries. Influence of age, sex, anatomic variation, and left ventricular hypertrophy or dilation
Prevalence and outcomes of coronary artery perforation during percutaneous coronary intervention
Concept, design and pre-clinical studies for remote control percutaneous coronary interventions
Use of the stereotaxis niobe® magnetic navigation system for percutaneous coronary intervention: Results from 350 consecutive patients
Safety and feasibility of robotic percutaneous coronary intervention: PRECISE (Percutaneous Robotically-Enhanced Coronary Intervention) study
Comparison of robotic percutaneous coronary intervention with traditional percutaneous coronary intervention: A propensity score-matched analysis of a large cohort
Initial report of safety and procedure duration of robotic-assisted chronic total occlusion coronary intervention
Demonstration of the safety and feasibility of robotically assisted percutaneous coronary intervention in complex coronary lesions: results of the CORA-PCI study (Complex Robotically Assisted Percutaneous Coronary Intervention)
Long distance tele-robotic-assisted percutaneous coronary intervention: A report of first-in-human experience
Autonomy in surgical robotics
The rise of robots in surgical environments during COVID-19
Impact of artificial intelligence on interventional cardiology: From decision-making aid to advanced interventional procedure assistance
Mastering the game of Go with deep neural networks and tree search
Human-level control through deep reinforcement learning
End-to-end training of deep visuomotor policies
Learning to fly by crashing
Asymmetric actor critic for image-based robot learning
Automatic control of cardiac ablation catheter with deep reinforcement learning method
Deep reinforcement learning for the navigation of neurovascular catheters
Autonomous guidewire navigation in a two dimensional vascular phantom
Collaborative robot-assisted endovascular catheterization with generative adversarial imitation learning
A CNN-based prototype method of unstructured surgical state perception and navigation for an endovascular surgery robot
Rainbow: Combining improvements in deep reinforcement learning
Deep reinforcement learning with double q-learning
Dueling network architectures for deep reinforcement learning
Noisy networks for exploration
A distributional perspective on reinforcement learning
Deep reinforcement learning for tactile robotics: Learning to type on a braille keyboard
Deep q-learning from demonstrations
Automated curriculum learning for neural networks
Reinforcement learning from imperfect demonstrations
Rich feature hierarchies for accurate object detection and semantic segmentation
Dynamic coronary roadmapping during percutaneous coronary intervention: A feasibility study
Dynamic coronary roadmapping via catheter tip tracking in X-ray fluoroscopy with deep learning based bayesian filtering
Deep learning segmentation of major vessels in X-ray coronary angiography
Robust imitation of diverse behaviors
Challenges of real-world reinforcement learning
Asynchronous methods for deep reinforcement learning
The 'Digital twin' to enable the vision of precision cardiology
Navigation of guidewires and catheters in the body during intervention procedures: A review of computer-based models
Novel 3D printing technology for CT phantom coronary arteries with high geometrical accuracy for biomedical imaging applications
Cardiac 3D printing and its future directions

Ministry of Science & ICT (MSIT, Korea), and Ministry of Health & Welfare (MOHW, Korea) under Technology Development Program for AI-Bio-Robot-Medicine Convergence (20001638), and by the Korea Medical Device Development Fund grant funded by Korea government (the Ministry of Science & ICT, the Ministry of Trade

Figure A1: Success rate, number of steps, operating time, and total reward in three subsequent experiments using the 3D phantom.