title: Orientation Matters: 6-DoF Autonomous Camera Movement for Minimally Invasive Surgery
authors: Abdelaal, Alaa Eldin; Hong, Nancy; Avinash, Apeksha; Budihal, Divya; Sakr, Maram; Hager, Gregory D.; Salcudean, Septimiu E.
date: 2020-12-04

We propose a new method for six-degree-of-freedom (6-DoF) autonomous camera movement for minimally invasive surgery, which, unlike previous methods, takes into account both the position and orientation information from structures in the surgical scene. In addition to locating the camera for a good view of the manipulated object, our autonomous camera takes into account workspace constraints, including the horizon and safety constraints. We developed a simulation environment to test our method on the "wire chaser" surgical training task from validated training curricula in conventional laparoscopy and robot-assisted surgery. Furthermore, we propose, for the first time, the application of the proposed autonomous camera method in video-based surgical skill assessment, an area where videos are typically recorded using fixed cameras. In a study with N=30 human subjects, we show that video examination of the autonomous camera view as it tracks the ring motion over the wire leads to more accurate user error (ring touching the wire) detection than when using a fixed camera view, or camera movement with a fixed orientation. Our preliminary work suggests that there are potential benefits to autonomous camera positioning informed by scene orientation, and this can direct designers of automated endoscopes and surgical robotic systems, especially when using chip-on-tip cameras that can be wristed for 6-DoF motion.

Minimally invasive surgery (MIS) refers to the paradigm of surgery that requires only small incisions into the patient's body to perform complex surgical procedures. Its advantages over open surgery include shorter hospital stays, fewer complications and less pain for the patient. MIS has been used successfully in different surgical specialties such as urology, gynecology and general surgery [1].

Broadly speaking, MIS has two main forms. The first is conventional laparoscopic surgery, where 4-degree-of-freedom (DoF) surgical tools are inserted through small incisions into the body and directly controlled by the surgeon [2]. The second is robot-assisted surgery (RAS), where 7-DoF robotic arms with surgical tools are inserted inside the body and controlled by the surgeon from the surgical console of a teleoperation system. The latter comes with many advantages over the former, such as easier control of the tools, 3D vision and more precise motion [3].

An important component of MIS platforms is the vision system's endoscopic camera. The camera view is also used by surgeons to infer haptic/force feedback information in RAS [4]. Because camera placement is so important, surgical training curricula usually have a dedicated section on training novice surgeons in the skills of moving the endoscopic camera [5]. Poor handling of the endoscopic camera leads to poor visualization, which in turn can disrupt the surgical workflow, prolong surgical procedures [6], and compromise the patient's safety [7]. Therefore, improvements to the endoscopic camera control system are extremely important, as they can improve the overall experience of the surgeon during MIS. The current standard practice is manual control of the endoscope motion.
In conventional laparoscopic surgery, a dedicated camera assistant is responsible for this task under the main surgeon's guidance. Such guidance is usually given verbally, making the entire process less efficient, which can lead to suboptimal camera views [8]. This approach is also associated with poor ergonomics for the camera assistant, who has to hold the endoscope for long periods of time [9]. In robot-assisted surgery, the problems of camera positioning by an assistant are addressed by giving full camera control to the surgeon at the console. This, however, comes with the disadvantage of the surgeon having to switch from controlling the surgical tools to controlling the endoscope, and vice versa. For surgeries that need many camera motions, such switching of control can be disruptive to the surgeon and can add to his/her cognitive workload. Indeed, with this approach the surgeon needs to control both the tools and the camera, unlike in the conventional laparoscopy case. This can interrupt the surgical workflow [10].

To improve the current practice, many groups have proposed methods to automate the camera/endoscope motion in MIS. The majority of these methods are based on using some form of tracking information, e.g., tracking the surgeon's tools [11], eye gaze [12], or contextual information in the surgical scene itself [13]. These camera automation methods use only position information from the structures being tracked, e.g., the position of a specific tool or landmark in the scene. In this article, we argue that in order to provide full visual feedback to the surgeon, both the position and orientation of structures of interest in the scene should be considered.

Good visual feedback in MIS does not only improve the performance of the actual surgical tasks, but can also improve video-based surgical skill assessment. Skill assessment is used to allow surgeons to monitor and evaluate their own (or their trainees') performance in recorded videos, with the goal of assessing their skills and identifying areas of improvement in future performances. One aspect of interest is whether the surgeon/trainee mistakenly touches critical areas in the surgical scene. To spot such instances, good visual feedback in the recorded videos is important. We argue that using automated camera methods that consider both the position and orientation of important structures in the surgical scene can improve the accuracy of video-based surgical skill assessment.

The contributions of this work are as follows:
• We propose a novel autonomous camera method that takes the orientation information of the surgical scene into account. In particular, our proposed method is based on following the normal to features of interest in the surgical view. Our proposed method is intended for endoscopic cameras with six DoF.
• We implement the above autonomous camera concept in a simulated environment of the da Vinci surgical system, where a pickup camera is attached to one of its arms following our proposed concept in [14]. Our implementation includes a motion planning component to satisfy some practical constraints as the camera moves autonomously.
• We propose a novel application of the proposed autonomous camera method in video-based skill assessment in MIS. We compare our method against both a stationary camera and a point-based autonomous camera method where the camera moves to keep a point of interest at the center of the view.
• We evaluate the effectiveness of the proposed method in this application by conducting a user study with N = 30 subjects, where subjects assessed the performance of a simulated surgical training task in recorded videos under the above three camera methods.

There has been an extensive body of work to facilitate camera control in MIS, with the earlier approaches aiming to give the surgeon full control over the camera without the need for any assistance. One of the early projects following this approach was the Automated Endoscopic System for Optimal Positioning (AESOP) project [15]. Different control modalities were tested in the context of this project, such as voice control [16] and eye gaze [12]. Head tracking has also been explored in the EndoAssist™ project, where the camera moves based on the surgeon's head motion [17]. While these methods eliminate the need to have a camera assistant, they add to the cognitive load of the surgeon, which motivated increasing the level of automation in controlling the endoscope.

Automated camera systems in MIS can be divided into three main categories based on the source of information used to automate the camera motion. The first is based on the surgical tools, the second is based on the anatomical structures in the surgical scene, and the third is a combination of the first two.

The majority of the existing work is based on using one or more surgical tools and moving the camera according to their motion. For example, Eslamian et al [18] propose a method to track two surgical tools and move the camera to place the midpoint of the two tools at the center of the field of view (FOV); they apply their method to the da Vinci surgical robot. A similar approach is proposed in [19] for conventional laparoscopic surgery. A variation of this method is also proposed in [11], where the camera moves autonomously to keep the currently used tool in the FOV (not necessarily at its center) based on the current state of the surgical procedure. Moreover, Ma et al [20] track the positions of the two tools to control the camera rotation so that the line segment connecting the two tools is always horizontal. Weede et al [21] build a Markov model based on the motion of the surgical tools in previous surgeries to predict future motions of these tools. Based on the model, they move the camera's focal point to place the midpoint of the predicted tool positions at the center of the FOV.

Anatomical structures have also been used to automate the camera motion. In an emulated surgical debridement subtask, where the goal is to identify and remove damaged tissue so that the remaining tissue heals faster, Li et al [22] propose a "learning from demonstration" approach to automate the camera motion in two dimensions (2D). Their system ranks the damaged tissue and moves the camera's optical axis to place the highest-ranked tissue at the center of the FOV.

The combination of tool and anatomical structure information has also been studied in the context of automating the camera motion in MIS. For example, in [23], the authors propose a method that tracks the midpoint of the two tools as well as another anatomical point in the surgical scene. The choice of the additional anatomical point is based on the current state of the surgical task. Ko et al [13] build a state transition diagram of the cholecystectomy procedure and propose a method to move the camera based on the currently identified state of the procedure.
In their method, the camera is moved to place either the currently used tool or a predefined fixed anatomical structure at the center of the FOV. In all the above work, the autonomous camera method/algorithm is always based on the position information of objects of interest in the view. The gap this work fills is the use of orientation information, in addition to position, to automate the camera motion. In particular, our work explores the use of pose (orientation and position) information of anatomical structures in this context. To the best of our knowledge, this is the first study to explore this aspect in the context of MIS.

Video-based methods are extensively used in MIS for training and skill assessment. The effectiveness of these methods has been demonstrated in surgical settings such as conventional laparoscopy [24] and RAS [25]. Furthermore, these methods have their own dedicated and validated skill assessment tools, such as the Objective Structured Assessment of Technical Skills (OSATS) [26]. The effectiveness of video-based skill assessment methods depends on the quality of the visual feedback provided in the videos. One approach to improve the visual feedback is to use multiple cameras that view the surgical scene from multiple perspectives. Several groups have explored the feasibility and effectiveness of this approach, as in [27] and [28]. What all the above studies have in common is that the cameras used to record the videos are all stationary. In this work, we explore for the first time the effectiveness of using an autonomous camera system based on our proposed method for video-based surgical skill assessment.

In our proposed method, we aim to align the camera such that its optical (or viewing) axis coincides with the normal vector arising from an anatomical structure. Doing so allows the camera to maximize visual coverage of the structure of interest. Additionally, the camera moves to ensure that the anatomical structure is always at the center of the view. In this section, we describe our autonomous camera motion pipeline, various safety measures we have incorporated, and the implementation details of our proposed algorithm. We apply our motion pipeline to the setup shown in Fig. 1.

Fig. 1. The setup used with our autonomous camera method, showing the wire chaser scene that we added to the simulator. The rail pattern is the same as that in the validated curriculum in [30].

To center the anatomical structure of interest in the camera view, we consider the feature's position p_c and its normal vector n. We compute a goal position p_g along this normal vector, at a fixed distance d_f from p_c. To avoid collision with tissue, we consider only the space above the anatomical feature and accordingly consider either n or −n. At each instant, the camera's positional goal is set to be the computed point p_g such that p_g = p_c ± d_f n.

The camera's orientation can be fully described by a frame attached to it, as seen in Fig. 2, and mathematically represented by a rotation matrix R_g, where col_1[R_g] represents the x-axis, col_2[R_g] the y-axis, and col_3[R_g] the z-axis attached to the camera. The primary goal of our algorithm is to align the camera's optical axis with n, achieved by setting the z-axis of this frame to n. Another desirable characteristic when moving the camera in surgery is to maintain a correct camera horizon [29], and this is controlled by the camera's horizontal (or x) axis.
We set this x-axis to be the cross product between a vector that is pointing upwards (i.e., the z-axis of the world frame) and the feature normal vector n, so that we always obtain an x-axis that is parallel to the xy-plane of the world frame. The y-axis is simply chosen to be orthonormal to the other two axes, and this completes the desired rotation matrix R_g.

To facilitate collision avoidance, motion planning is incorporated when the distance between consecutive goal positions is larger than a set threshold. We generate a set of intermediate waypoints (IWP) between the current camera position and the goal position, as seen in Fig. 3. IWP 1 and 2 are set to a fixed distance d_wp above the current and goal positions, respectively, along the positive z-axis of the world frame (or any vector pointing upwards). Using these four points, a piecewise linear trajectory is obtained by interpolation. The orientation of the camera at each of these new intermediate positions is adjusted to ensure that the feature is always centered in the view. The z-axis is set to the vector from the current intermediate point to the feature's center point, the x-axis is adjusted to correct for the horizon as described previously, and the y-axis is chosen to be a vector orthonormal to both.

Without human-in-the-loop control, it is essential to incorporate safety features into any autonomous system. First and foremost, we define a constrained workspace (see Fig. 1) within which we autonomously control the camera. Outside this workspace, the autonomous algorithm freezes, allowing the surgeon to manually control the camera at his/her discretion. We chose to define this constrained workspace in the form of a 3D cone, with the following parameters: the cone tip is the remote center of motion (RCM), the cone height is slightly smaller than the length of the surgical instrument/endoscope, and the cone base radius is empirically chosen to be 10 cm. The cone's directional vector is initially set to the vector joining the RCM and the initial position of the feature, to ensure that the feature of interest is always in view, and remains unchanged after initialization.

An additional safety constraint is to ensure that all surgical tools are always in the FOV. An out-of-view tool can unknowingly cause severe and unwanted damage to surrounding tissue. Given the instrument tip positions in 3D space and using the camera's intrinsic parameters, these tip positions can be computed in image space at each frame. We define two windows within the image, an outer window and an inner window, and adjust the camera's distance to the structure of interest until all surgical tools are found to be within the space between these windows. The adjustment is made by changing the distance d_f in our autonomous camera algorithm in increments of ±1 mm as needed.

Our proposed algorithm is implemented on a simulated da Vinci surgical system, where visual feedback is provided with a stereo endoscope. The 4-DoF endoscope provides limited angular or orientational freedom of movement, and hence cannot be fully exploited to show the merits of our proposed algorithm. We instead present our implementation with a 6-DoF stereo camera that is attached to the end of the surgical tool tip. Another possibility is to use the "pickup" stereoscopic camera concept proposed in our previous work [14].
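Before describing the pickup camera further, the following minimal Python/NumPy sketch summarizes the goal-pose construction and waypoint interpolation described above. It is not the authors' implementation: the function and parameter names (look_at_feature, camera_goal_pose, waypoint_path, d_wp, steps_per_segment), the sign and handedness conventions (the optical z-axis is taken to point from the camera toward the feature), and the fallback for a near-vertical viewing direction are all our assumptions.

```python
import numpy as np


def _unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)


def look_at_feature(p_cam, p_feat, world_up=np.array([0.0, 0.0, 1.0])):
    """Rotation matrix whose columns are the camera x-, y- and z-axes:
    z points from the camera toward the feature (keeping it centered),
    x is kept parallel to the world xy-plane (horizon correction),
    y completes a right-handed frame."""
    z_axis = _unit(p_feat - p_cam)
    x_axis = np.cross(world_up, z_axis)
    if np.linalg.norm(x_axis) < 1e-6:       # looking straight down/up
        x_axis = np.array([1.0, 0.0, 0.0])
    x_axis = _unit(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    return np.column_stack((x_axis, y_axis, z_axis))


def camera_goal_pose(p_c, n, d_f, world_up=np.array([0.0, 0.0, 1.0])):
    """Goal pose at distance d_f from the feature along its face normal
    (p_g = p_c +/- d_f n, choosing the side away from the tissue)."""
    n = _unit(np.asarray(n, dtype=float))
    if np.dot(n, world_up) < 0.0:           # keep the camera above the feature
        n = -n
    p_g = p_c + d_f * n
    return p_g, look_at_feature(p_g, p_c, world_up)


def waypoint_path(p_cam, p_g, p_c, d_wp, steps_per_segment=10,
                  world_up=np.array([0.0, 0.0, 1.0])):
    """Piecewise-linear path through two intermediate waypoints lifted a
    distance d_wp above the current and goal positions; the camera is
    re-oriented at every sample so the feature stays in view."""
    wp1 = p_cam + d_wp * world_up
    wp2 = p_g + d_wp * world_up
    knots = [p_cam, wp1, wp2, p_g]
    poses = []
    for a, b in zip(knots[:-1], knots[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            p = (1.0 - t) * a + t * b
            poses.append((p, look_at_feature(p, p_c, world_up)))
    poses.append((p_g, look_at_feature(p_g, p_c, world_up)))
    return poses
```

In an actual system, each returned pose would be passed to the arm's inverse kinematics and applied only after the cone-workspace and tool-visibility checks described above.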
The pickup camera is inserted axially through a surgical incision into the patient's body and can be picked up and controlled by a surgical instrument (such as the da Vinci ProGrasp forceps) through its grasping interface. For this work, we focus on demonstrating the advantages of our proposed algorithm, and hence obtain and use ground-truth data such as the position and normal vector of the anatomical structure of interest. For future implementations, this data can be obtained through a dedicated vision pipeline as in [31]. It should be noted that any appropriate inverse kinematics module can be used to implement the proposed algorithm on other camera-based robotic systems, such as articulated and snake-like cameras [32].

Due to the COVID-19 situation, we tested our proposed method in a simulated environment, instead of using the da Vinci Research Kit (dVRK) [33] as originally planned. We use the first-generation da Vinci system simulator proposed in [34]. It simulates the full patient-side cart of the da Vinci system, including two patient-side manipulators (PSMs) and a 4-degree-of-freedom endoscopic camera manipulator (ECM). The motions of the PSMs and the ECM can be controlled in the same way as the patient-side cart of the real robot is controlled using the dVRK. The simulator also includes an interface with the Robot Operating System (ROS). The simulator comes with some pre-prepared scenes of different tasks, and it also allows adding new scenes/environments as needed. We used this feature to add the wire chaser task scene shown in Fig. 1, which is described in IV-B. We modified the simulated da Vinci system as described in III-D to include a 6-DoF endoscope.

We test our autonomous camera method on the "wire chaser" task, which is part of the validated training curricula in conventional laparoscopy [30] as well as RAS [5]. The task has also been validated for multiple surgical specialties such as urology, gynecology and general surgery [35]. Previous research has shown that the level of performance in this task is correlated with the performance level in the operating room [36]. The task involves holding a ring and moving it along a rail/wire. It is designed to measure the manual dexterity, hand-eye coordination and camera control skills of trainees. In our version of this task, trainees are penalized if the ring touches the rail. The same version of the task has also been used in the context of robot-assisted surgical training, as in [37].

The ring represents the anatomical structure that we are interested in. The main idea is that a good visualization of the ring (as seen from the camera) is crucial to the wire chaser task. That is why our autonomous camera method uses both the plane of the ring's face and the ring's center as its inputs. The camera then moves autonomously following our proposed method in Section III, so that the viewing plane of the camera is always parallel to the plane of the face of the ring and the camera's focal point is always at the center of the ring. Our hypothesis is that, using the proposed autonomous camera method, subjects can better spot the cases where the ring touches the rail than with other methods that automate the camera based only on position information. The position information in this case is the position of the center of the ring. This hypothesis is tested in the context of video-based skill assessment, where subjects watch videos of the task and their goal is to assess the skill using specific criteria.
Towards this end, we recorded several videos of the wire chaser task. We automated the motion of the ring, held by one PSM, to follow predefined trajectories along the rail. Some of these trajectories are ideal according to the following two conditions: (i) the ring is centered with respect to the rail, and (ii) the ring's face plane is always perpendicular to the rail. Other trajectories were randomized by violating one or both of the above conditions. This in turn introduces a number of collisions between the ring and the rail.

The ring's trajectory along the rail is defined by setting control points evenly spaced from start to finish. A control point's position is defined in (x, y, z) coordinates, and its orientation in Tait-Bryan Euler angles (α, β, γ) that together represent a single rotation R_total = R_x(α) R_y(β) R_z(γ), where R_x, R_y and R_z represent elemental rotations about the x-, y- and z-axes of the simulation world frame, respectively. We automate the ring movement in the simulator to follow the pose of these control points. To introduce collisions between the ring and rail, randomly generated noise is added to the six variables describing each of the control points. By varying the number of control points and the threshold of the noise added to the position and orientation, different levels of difficulty can be produced in the resulting trajectories. For our tests, we chose trajectories with the following parameters: 36 and 71 control points, a position noise threshold of 3-3.5 mm, and an angular noise threshold of 10-30 degrees. A higher number of control points introduces a higher degree of variation in the trajectory, producing more touches/collisions between the ring and the rail, making the trajectory with 71 control points the most difficult one.

We evaluated the proposed autonomous camera method in two ways. The first is by measuring the tracking errors as the ring moves along the ideal and randomized trajectories, as explained in IV-C below. The second is by conducting a user study in which users watch the recorded videos and count the number of touches between the ring and the rail. The goal in this second case is to measure how accurate users are when using the proposed method compared with other methods, as explained in IV-D below.

The first way to evaluate the proposed autonomous camera method is to measure the tracking accuracy when the ring is moving along ideal and randomized trajectories. There were two sources of error in our setup. The first is the lag between the ring motion and the camera motion, as we did not incorporate any information about the future ring path into our autonomous camera method. The second is the limited performance of the simulator on our computing platform. We consider the following three tracking errors:
• Centering error in the image space: the difference in pixels between the position of the center of the ring in the camera view and the position of the center of the view.
• Centering error in 3D space: similar to the first metric, except that it is the difference in millimeters between the 3D position of the center of the ring and the equivalent 3D position of the center of the FOV.
• Orientation error: the angle between the camera's optical axis and the vector n that is perpendicular to the plane of the face of the ring.
The above three errors are reported as a function of time along the entire trajectory of the ring.
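For illustration, the sketch below shows one way such per-frame tracking-error metrics could be computed. It is not the authors' code: the function name, parameter names, the pinhole-projection setup, and the choice of the point at distance d_f along the optical axis as the 3D "center of the FOV" are our assumptions.

```python
import numpy as np


def tracking_errors(ring_center_w, ring_normal_w, T_cam_w, K,
                    image_size=(640, 480), d_f=0.10):
    """Per-frame tracking-error metrics (sketch).

    ring_center_w : ring center in world coordinates, shape (3,).
    ring_normal_w : unit normal of the ring's face in world coordinates.
    T_cam_w       : 4x4 camera pose in the world frame; rotation columns
                    are the camera x-, y-, z-axes (z = optical axis).
    K             : 3x3 pinhole intrinsic matrix.
    d_f           : assumed camera-to-ring distance used to place the
                    3D "center of the FOV" point.
    """
    R = T_cam_w[:3, :3]
    t = T_cam_w[:3, 3]

    # Ring center expressed in the camera frame.
    p_cam = R.T @ (ring_center_w - t)

    # (1) Centering error in image space [pixels]: distance between the
    # projected ring center and the image center.
    uvw = K @ p_cam
    uv = uvw[:2] / uvw[2]
    image_center = np.array(image_size, dtype=float) / 2.0
    centering_px = np.linalg.norm(uv - image_center)

    # (2) Centering error in 3D: distance between the ring center and the
    # point at distance d_f along the optical axis (taken here as the 3D
    # equivalent of the center of the FOV).
    fov_center_w = t + d_f * R[:, 2]
    centering_3d = np.linalg.norm(ring_center_w - fov_center_w)

    # (3) Orientation error [degrees]: angle between the optical axis and
    # the ring normal, sign-agnostic so that parallel and anti-parallel
    # alignments both count as zero error.
    cos_angle = abs(np.dot(R[:, 2], ring_normal_w))
    orientation_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

    return centering_px, centering_3d, orientation_deg
```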
We report these errors in three cases using ideal trajectories and in another three cases using randomized ones.

We conducted a user study (N = 30) to measure the effectiveness of the proposed autonomous camera method in the video-based skill assessment task described above. We recorded videos of the wire chaser task as the ring follows different randomized trajectories under three conditions for the camera motion:
• Condition I: the camera is fixed, showing the entire task. This represents the baseline condition.
• Condition II: the camera motion is automated to follow the center point of the ring, regardless of the ring's orientation. This represents an autonomous camera method that is based solely on position information. We refer to this method as the "centering method".
• Condition III: the proposed autonomous camera method is applied. That is, the autonomous camera method is based on both position and orientation information, as described in Section III above.
In the last two conditions, the camera's initial pose was the same as in the fixed camera condition (condition I) above.

This was a within-subject user study, where each subject was exposed to all the study conditions. We recorded a total of nine videos, three per condition. The nine videos were of the ring moving along the rail in three randomized trajectories with varying levels of difficulty. The goal is to measure the skill assessment accuracy in the videos where the ring moves along the most difficult trajectory (that is, the one with the highest level of randomness, which is the trajectory with 71 control points). Each subject watched the nine videos in three sets, each set containing the three videos of one condition. Subjects were asked to count the number of touches between the ring and the rail in each video.

Counterbalancing was employed to reduce/eliminate the effect of any learning or carryover bias that may exist when a subject is exposed to each condition. The Latin squares method [38] was used to compute the order in which each subject is exposed to a condition. Since the study has three conditions, we applied two Latin squares, the second one being the mirror of the first, which led to a total of six cases representing all six possible orderings of the three conditions.

Due to the restrictions on inviting subjects to the lab (because of the COVID-19 situation), the study was conducted virtually by sending each subject an electronic form containing the videos. We added an attention question in the middle of the form to make sure that subjects were paying attention. Any subject who provided a wrong answer to this question was excluded from the study and his/her data were not considered. All our subjects were university students with little or no exposure to surgery. Previous research shows that crowdsourced video-based surgical skill assessment with non-experts (as in this user study) is as accurate as skill assessment performed by expert surgeons and surgical educators [39]. The user study was approved by the Research Ethics Board at the University of British Columbia.

Based on the performance metrics outlined in IV-C, we conducted two tests with the wire chaser task to evaluate the accuracy of our implemented algorithm. In the first test, the trajectory represents the ideal trajectory of the ring along the rail, without any collisions/touches between the two.
Our 6-DoF camera follows the ring at a fixed distance, with its optical axis aligned with the ring's normal vector. We compute the three metrics, centering error with respect to the left image space, centering error in 3D space, and orientation error, as shown in Fig. 4. This test is repeated three times (represented by trials 1, 2, and 3) to show the repeatability of our algorithm. Across all three trials, we obtained an average image centering error of 35 pixels, 3D centering error of 3.41 mm, and orientation error of 5.45 degrees. The peaks noticeable in the plots correspond to segments of the trajectory where we invoke motion planning (due to widely separated consecutive goal positions). To ensure that the feature is always in view in these cases, we relax our orientation constraint, which leads to large reported errors in the orientation angle. Despite these peaks, the average overall tracking accuracy of our system remains high.

In the second test, we chose three noisy trajectories by adding noise to the ideal positions and orientations of the ring along its path, such that the ring collides with the rail at certain points. The three noisy trajectories are the same trajectories used in the user study described in IV-D, and are represented by Paths 1, 2 and 3 in Fig. 5. We obtained an average image centering error of 36 pixels, 3D centering error of 3.33 mm, and orientation error of 4.30 degrees across the three paths. Similar to Fig. 4, the peaks shown in Fig. 5 correspond to the motion planning segments with relaxed orientation constraints; here the peaks occur at different points in time for the three paths and hence appear more spread out. For both these tests, the errors across all three performance metrics are very low, indicating the high accuracy of our implementation, as can be seen in Figs. 4 and 5. It should be noted that the average errors reported above include the errors from the peaks; even so, the final errors are reasonably low.

As for the user study, we report the assessment errors in the most difficult video, that is, the one with the highest level of randomization. Assessment errors refer to the absolute difference between the ground-truth errors (which we get from the simulator) and the errors reported by each subject. Of the 30 participants in the user study, three were excluded after providing a wrong answer to the attention question. Outliers in the remaining data were also identified and removed. We then compared the subjects' assessment errors across the three study conditions. As shown in Fig. 6, using the proposed autonomous camera method leads to a lower number of assessment errors and less variance between the subjects' scores than the other two conditions. In particular, the proposed method (condition III) leads to 25% and 21% fewer assessment errors compared with the centering method (condition II) and the fixed camera method (condition I), respectively. Furthermore, the standard deviation in the assessment errors using the proposed method is 31% and 32% lower than that of the centering method and the fixed camera method, respectively. These reductions in the average and the standard deviation of the assessment errors show the potential of the proposed method to improve current practice in video-based surgical skill assessment. Previous research in this area shows that variability between assessors is a major practical problem.
Gingerich et al [40] report that this variability comes from the cognitive limitations of the assessors as well as from their making of unjustified inferences. Our proposed autonomous camera method has the potential to contribute to solving these problems, as it can provide better visual feedback, which allows assessors to make more informed assessments and reduces their need to infer or guess due to a lack of visual information.

Orientation matters in viewing the surgical scene. The instructions for many surgical procedures include moving the camera to view specific anatomical structures in a predefined orientation. Our proposed method provides an automated way of achieving this, which can relieve the surgeon of part of the burden of controlling the camera. This is especially true because adjusting the view with respect to orientation requirements is arguably more difficult to achieve with manual camera control than adjusting with respect to position only. This is an even more difficult problem when controlling articulated/snake-like endoscopes [20], and we believe that our proposed method is a first step towards solving it.

In a broader context, we view our proposed autonomous camera method as a component of an intelligent assistant to the surgeon. Such an assistant can use the surgical data to recognize the current stage/step of the surgical workflow. It can then use this to infer the anatomical structures of interest automatically, with input from the surgeon if needed. It can also provide the visualization requirements of these anatomical structures in terms of their required position and orientation. Computer vision methods can be used to identify and recognize these structures in the scene. Our proposed method can then be used to realize the required visualization, and the resulting camera views can be transferred back to the intelligent assistant, which can use them to infer the new step/stage of the surgical workflow. Each of the above components of the intelligent assistant is an active research area on its own (e.g., surgical workflow analysis [41] and tissue identification [31]). Our proposed system provides another such component, which differs from previous work in the flexibility it offers in satisfying a wider range of visualization requirements.

We presented an autonomous camera method for 6-DoF endoscopic camera systems in MIS. Our method takes into consideration both the position and orientation information of anatomical structures of interest in the surgical scene. Our method achieved an average position tracking accuracy of 3 mm and orientation tracking accuracy of 5 degrees when tested on a validated MIS training task in a simulated environment. We also presented some safety measures that can be included in our autonomous camera system to avoid collisions with anatomical structures in the surgical scene and to avoid having the surgical tools outside the FOV. We also tested the effectiveness of using an autonomous camera system for video-based surgical skill assessment. We conducted a user study (N = 30) in which subjects watched videos of a simulated surgical training task under different camera motion/automation conditions. Our results show that using the proposed autonomous camera method leads to up to 25% more accurate skill assessment and up to 32% lower standard deviations between different assessors.
These results demonstrate the potential of the proposed autonomous camera method in augmenting the cognitive abilities of assessors by providing better visual feedback of the tasks compared with the other methods. Our results show the importance of including orientation information in automated camera systems in MIS. With the extensive research on articulated endoscopic cameras, the practical constraints on tracking such information, which exist with the commonly used 4-DoF endoscopic systems, are being removed. Our future work includes improving the proposed autonomous camera pipeline to consider more than one anatomical structure, addressing potential problems in the visual feedback such as occlusions, and testing the proposed system with subjects conducting a surgical task on an MIS platform such as the da Vinci system.

References
[1] Minimally invasive surgery
[2] Laparoscopic surgery 15 years after clinical introduction
[3] Robotics in vivo: A perspective on human-robot interaction in surgical robotics
[4] Force feedback and sensory substitution for robot-assisted surgery
[5] Fundamental skills of robotic surgery: a multi-institutional randomized controlled trial for validation of a simulation-based curriculum
[6] Medical students impact laparoscopic surgery case time
[7] Gravity line strategy may reduce risks of intraoperative injury during laparoscopic surgery
[8] Comparison of robotic versus human laparoscopic camera control
[9] Ergonomic risk associated with assisting in minimally invasive surgery
[10] A review of camera viewpoint automation in robotic and laparoscopic surgery
[11] Towards a cognitive camera robotic assistant
[12] Eye gaze tracking for endoscopic camera positioning: an application of a hardware/software interface developed to automate AESOP
[13] Intelligent interaction between surgeon and laparoscopic assistant robot system
[14] A pickup stereoscopic camera with visual-motor aligned control for the da Vinci surgical system: a preliminary study
[15] Laparoscopic visual field
[16] The voice-controlled robotic assist scope holder AESOP for the endoscopic approach to the sella
[17] The EndoAssist robotic camera holder as an aid to the introduction of laparoscopic colorectal surgery
[18] Development and evaluation of an autonomous camera control algorithm on the da Vinci surgical system
[19] Automatic guidance of an assistant robot in laparoscopic surgery
[20] Visual servo of a 6-DOF robotic stereo flexible endoscope based on the da Vinci Research Kit (dVRK) system
[21] An intelligent and autonomous endoscopic guidance system for minimally invasive surgery
[22] Learning 2D surgical camera motion from demonstrations
[23] Smart cable-driven camera robotic assistant
[24] A randomized controlled study to evaluate the role of video-based coaching in training laparoscopic skills
[25] Play me back: a unified training platform for robotic and laparoscopic surgery
[26] Objective structured assessment of technical skill (OSATS) for surgical residents
[27] Are multiple views superior to a single view when teaching hip surgery? A single-blinded randomized controlled trial of technical skill acquisition
[28] A multi-camera, multi-view system for training and skill assessment for robot-assisted surgery
[29] Construct and face validity of a virtual reality-based camera navigation curriculum
[30] Laparoscopic skills training using inexpensive box trainers: which exercises to choose when constructing a validated training course
[31] Tissue tracking and registration for image-guided surgery
[32] A technical review of flexible endoscopic multitasking platforms
[33] An open-source research kit for the da Vinci® surgical system
[34] A V-REP simulator for the da Vinci Research Kit robotic platform
[35] Validation of the da Vinci surgical skill simulator across three surgical disciplines: a pilot study
[36] Performance of robotic simulated skills tasks is positively associated with clinical robotic surgical performance
[37] An experimental comparison towards autonomous camera navigation to optimize training in robot assisted surgery
[38] Human-Computer Interaction: An Empirical Research Perspective
[39] Crowdsourced assessment of technical skills: a novel method to evaluate surgical performance
[40] Seeing the black box differently: assessor cognition from three research perspectives
[41] Random forests for phase detection in surgical workflow analysis

We would like to thank Jordan Liu for his assistance with the use of the simulator for this work.