title: Learning Ultrasound Scanning Skills from Human Demonstrations
authors: Deng, Xutian; Lei, Ziwei; Wang, Yi; Li, Miao
date: 2021-11-09

Abstract: Recently, robotic ultrasound systems have become an emerging topic owing to the widespread use of medical ultrasound. However, modeling and transferring the ultrasound skill of a trained physician remains a challenging task. In this paper, we propose a learning-based framework to acquire ultrasound scanning skills from human demonstrations. First, the ultrasound scanning skills are encapsulated in a high-dimensional multi-modal model in terms of the interactions among the ultrasound images, the probe pose and the contact force. The parameters of the model are learned from data collected from skilled sonographers' demonstrations. Second, a sampling-based strategy is proposed with the learned model to adjust the extracorporeal ultrasound scanning process, so as to guide a novice sonographer or a robot arm. Finally, the robustness of the proposed framework is validated with experiments on real data from sonographers.

Medical ultrasound imaging is widely adopted in clinical diagnosis owing to its merits of being non-invasive, low-hazard, real-time, safe and low-cost. Nowadays, ultrasound imaging can quickly detect diseases of different anatomical structures, including the liver [1], gallbladder [2], bile duct [3], spleen [4], pancreas [5], kidney [6], adrenal gland [7], bladder [8], prostate [9], thyroid [10], etc. In addition, during the global COVID-19 pandemic, medical ultrasound has also been used for the diagnosis of infected persons by detecting pleural effusion [11], [12]. However, the performance of an ultrasound examination depends highly on the skills and experience of the sonographer, which generally require a large amount of time and effort to acquire [13], [14]. Moreover, the intensive and repetitive ultrasound scanning process places a heavy burden on sonographers' physical condition, further contributing to the scarcity of ultrasound practitioners. To address these issues, many previous studies in robotics have attempted to use robots to assist or even to replace sonographers [15]-[17].

According to the level of system autonomy, robotic ultrasound systems can be categorized into three levels: tele-operated, semi-autonomous and full-autonomous. A tele-operated robotic ultrasound system usually contains two main parts: the teacher site and the student site [18]-[20]. The motion of the student robot is completely determined by the teacher, usually a trained sonographer, through different kinds of interaction devices, including a 3D space mouse [18], an inertial measurement unit (IMU) handle [20], [21], a haptic interface [21], etc. For a semi-autonomous robotic ultrasound system, the motion of the student robot is partly determined by the teacher [22]-[24].

Fig. 1: The feedback information from three different modalities during a free-hand ultrasound scanning process. The first row shows the ultrasound images. The second row shows the contact force along the z-axis between the probe and the skin, collected using a 6-dimensional force/torque sensor. The third row shows the probe pose, collected using an inertial measurement unit (IMU).
For a full-autonomous robotic ultrasound system, the student robot is supposed to perform the whole process of local ultrasound scanning by itself [25]-[27], and the teacher is only involved in emergencies or unexpected situations. To date, only local full-autonomous robotic ultrasound systems have been reported [28], [29]. These robotic ultrasound systems usually focus on scanning certain anatomical structures, such as the abdomen [28], thyroid [26] and vertebra [29]. Despite these achievements, it is still a challenging task to represent and learn the ultrasound scanning skill owing to its high dimensionality and rich modality. A comprehensive survey on robotic ultrasound is given in TABLE I.

In this paper, we propose a learning-based approach to represent and learn ultrasound skills from sonographers' demonstrations, in order to guide the scanning process. During the learning process, the ultrasound images together with the relevant scanning variables (the probe pose and the contact force) are recorded and encapsulated in a high-dimensional model. We then leverage the power of deep learning to implicitly capture the relation between the quality of the ultrasound images and the scanning skills. During the execution stage, the learned model is used to evaluate the current quality of the ultrasound image, and a sampling-based approach is used to adjust the probe motion in order to obtain a high-quality ultrasound image.

TABLE I: A survey of robotic ultrasound systems.

Ref.   Autonomy          Force control   Feedback modalities                                Guidance                      Year
[18]   tele-operated     no              force, orientation, position                       human                         2015
[19]   tele-operated     no              force, orientation, position                       human                         2016
[20]   tele-operated     no              force, orientation, position                       human                         2017
[21]   tele-operated     no              force, orientation, position                       human                         2020
[22]   semi-autonomous   no              force, orientation, position, elastogram           elastogram, human             2017
[23]   semi-autonomous   no              force, orientation, position, vision               CNN, human                    2019
[24]   semi-autonomous   yes             force, orientation, position                       trajectory, human             2019
[30]   semi-autonomous   yes             force, orientation, position, image                CNN, human                    2020
[25]   full-autonomous   yes             force, orientation, position, vision, image, MRI   vision, MRI, confidence map   2016
[26]   full-autonomous   yes             force, orientation, position, image                SVM                           2017
[27]   full-autonomous   no              force, orientation, position, vision               vision                        2018
[28]   full-autonomous   yes             force, orientation, position, vision, MRI          vision, MRI                   2016
[29]   full-autonomous   yes             force, position, vision                            RL                            2021

The main contribution of this paper is twofold:
1. A multi-modal model of ultrasound scanning skills is proposed and learned from human demonstrations, taking the ultrasound images, the probe pose and the contact force into account.
2. Based on the learned model, a sampling-based strategy is proposed to adjust the ultrasound scanning process, in order to obtain a high-quality ultrasound image.

Note that the goal of this paper is to offer a learning-based framework to understand and acquire ultrasound skills from human demonstrations. The learned model can also be ported to a robot system, which is our next step.

This paper is organized as follows. Section II presents related work in the fields of ultrasound image processing and ultrasound scanning guidance. Section III provides the methodology of our model, including the learning process of the task representation, the data acquisition process through human demonstrations and the strategy for scanning guidance during real-time execution.
Section IV describes the detailed experimental validation, with a final discussion and conclusion in Section V.

The goal of ultrasound image evaluation is to understand images in terms of classification [31], segmentation [32], recognition [33], etc. With the rise of deep learning, many studies have attempted to process ultrasound images with the help of neural networks. Liu et al. have summarized the extensive research results on ultrasound image processing with different network structures, including the convolutional neural network (CNN), recurrent neural network (RNN), auto-encoder network (AE), restricted Boltzmann machine (RBM), deep belief network (DBN), etc. [34]. From the perspective of applications, Sridar et al. have employed a CNN for main-plane classification in fetal ultrasound images, considering both local and global features of the ultrasound images [35]. To judge the severity of patients, Roy et al. have collected ultrasound images of COVID-19 patients' lesions to train a spatial transformer network [36]. Deep learning has also been adopted for segmenting thyroid nodules from real-time ultrasound images [37]. While deep learning provides a superior framework for understanding ultrasound images, it generally requires a large amount of expert-labeled data, which can be difficult and expensive to collect.

Fig. 2: The multi-modal task learning architecture with human annotations. The network takes data from three different sensors as input: the ultrasound images, the force/torque (F/T) and the pose information. The data for task learning is acquired through human demonstrations, where the ultrasound quality is evaluated by sonographers. With the trained network, the multi-modal task can be represented as a high-dimensional vector.

Fig. 3: The ultrasound scanning data collected from human demonstrations. The sonographer performs an ultrasound scan with a specifically designed probe holder. The sensory feedback during the scanning process is recorded, including the ultrasound images from an ultrasound machine, the contact force and torque from a 6D F/T sensor, and the probe pose from an IMU sensor.

Confidence maps provide an alternative method for ultrasound image processing [38]. The confidence map is obtained through pixel-wise confidence estimation using a random walk. Chatelain et al. have devised a control law based on the ultrasound confidence map [39], [40], with the goal of adjusting the in-plane rotation and motion of the probe. Confidence maps have also been employed to automatically determine proper parameters for ultrasound scanning [25]. Furthermore, the advantages of confidence maps have been demonstrated by combining them with position and force control to perform automatic position and pressure maintenance [41]. However, the confidence map is built on hand-coded rules, which cannot be used directly to guide the scanning motion.

While the goal of ultrasound image processing is to understand images, learning ultrasound scanning skills aims to obtain high-quality ultrasound images through adjustment of the scanning operation. Droste et al. have used a clamping device with an IMU to capture the relation between the probe pose and the ultrasound images during ultrasound examinations [42]. Li et al. have built a simulation environment based on 3D ultrasound data acquired by a robot arm equipped with an ultrasound probe [43]. However, the ultrasound scanning skills were not explicitly learned.
Instead, a reinforcement learning framework was adopted to optimize the confidence map of the ultrasound images by adapting the movement of the ultrasound probe. All of the above-mentioned works take only the pose and position of the probe as input, whereas in this paper the contact force between the probe and the human body is also encoded, as it is considered a crucial factor in the ultrasound scanning process [44].

For the learning of force-relevant skills, a great variety of previous studies in robotic manipulation have focused on learning the relation between force information and other task-related variables, such as the position and velocity [45], the surface electromyography [46], the task states and constraints [47], and the desired impedance [48]-[50]. A multi-modal representation method for contact-rich tasks has been proposed in [51] to encode the concurrent feedback information from vision and touch. The method was learned through self-supervision and can be further exploited to improve the sampling efficiency and the task success rate. To the best of our knowledge, for a multi-modal manipulation task including feedback from ultrasound, force and motion, this is the first work to learn the task representation and the corresponding manipulation skills from human demonstrations.

Our goal is to learn free-hand ultrasound scanning skills from human demonstrations. We want to evaluate the multi-modal task quality by combining multiple sensory inputs, including the ultrasound images, the probe pose and the contact force, with the goal of extracting skills from the task representation and even transferring skills across tasks. We model the multi-sensory data with a neural network whose parameters are trained on data annotated by human ultrasound experts. In this section, we discuss the learning process of the task representation, the data collection procedure and the online ultrasound scanning guidance, respectively.

Fig. 4: Our strategy for scanning guidance takes the current pose P_t, the contact force F_t and the ultrasound image S_t as input, and outputs the next desired pose P_{t+1} and contact force F_{t+1}. For sampling, we impose a bound between (P_t, F_t) and (P_{t+1}, F_{t+1}), which prevents the next state from moving too far away from the current state. For execution, the desired pose P_{t+1} and contact force F_{t+1} are used as a goal for the human ultrasound scanning guidance. The candidates are sampled from the demonstrated experience, i.e., the feasible sets D_P and D_F.

For a free-hand ultrasound scanning task, three types of sensory feedback are available: ultrasound images from the ultrasound machine, force feedback from a mounted F/T sensor, and the probe pose from a mounted IMU. To encapsulate the heterogeneous nature of this sensory data, we propose a domain-specific encoder to model the task, as shown in Fig. 2. For the ultrasound imaging feedback, we use a VGG-16 network to encode the 224 × 224 × 3 RGB images into a 128-d feature vector. For the force and pose feedback, we use a 4-layer fully connected neural network to produce another 128-d feature vector. The two feature vectors are concatenated into one 256-d vector and passed through a 1-layer fully connected network to yield a 128-d task feature vector. The multi-modal task representation is a neural network model denoted by Ω_θ, whose parameters are trained as described in the following section. The multi-modal model shown in Fig. 2 has a large number of learnable parameters.
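To make the encoder concrete, the following is a minimal PyTorch sketch of the architecture just described. The layer sizes stated in the text are kept (a VGG-16 image branch reduced to a 128-d output, a 4-layer fully connected branch for pose and force, fusion into a 128-d task feature, and a two-layer head with softmax as described in Section IV); the hidden widths of the pose/force branch and all names are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the multi-modal encoder; hidden widths are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class UltrasoundSkillModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Image branch: VGG-16 with its last FC layer resized to 128 channels
        # (the warm start on image classification is done in a prior stage).
        self.image_encoder = vgg16(weights=None)
        self.image_encoder.classifier[6] = nn.Linear(4096, 128)
        # Pose (quaternion, 4-d) + force/torque (6-d): 4-layer FC branch.
        self.pf_encoder = nn.Sequential(
            nn.Linear(4 + 6, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128),
        )
        # Fuse the two 128-d vectors into the 128-d task feature vector.
        self.fusion = nn.Linear(256, 128)
        # Quality head: two FC layers; softmax is applied by the caller.
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, image, pose, force):
        # image: (B, 3, 224, 224); pose: (B, 4); force: (B, 6)
        img_feat = self.image_encoder(image)                    # (B, 128)
        pf_feat = self.pf_encoder(torch.cat([pose, force], 1))  # (B, 128)
        task_feat = self.fusion(torch.cat([img_feat, pf_feat], 1))
        logits = self.head(task_feat)                           # (B, 2)
        return task_feat, logits
```

Passing a batch through this model yields both the 128-d task feature and two-class logits whose softmax gives the image-quality confidence, matching the dual role of Ω_θ described above.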
To obtain the training data, we design a procedure to collect ultrasound scanning data from human demonstrations, as shown in Fig. 3. A novel probe holder is designed with built-in sensors (an IMU and an F/T sensor). A sonographer performs the ultrasound scanning process with the probe, and the data collected during the scanning process is described as follows: D = {(S_i, P_i, F_i)}, i = 1, ..., N denotes a dataset with N observations, where
• S_i ∈ R^{224×224×3} denotes the i-th collected ultrasound image, with cropped size;
• P_i ∈ R^4 denotes the i-th probe pose in terms of a quaternion;
• F_i ∈ R^6 denotes the i-th contact force/torque between the probe and the human skin.

For each recorded sample in the dataset D, the quality of the obtained ultrasound image is evaluated by three sonographers and labeled 1 (a good ultrasound image) or 0 (an unacceptable ultrasound image). With the recorded data and the human annotations, the model Ω_θ is trained with a cross-entropy loss function, which we minimize with stochastic gradient descent. Once trained, the network produces a 128-d feature vector and at the same time evaluates the quality of the task. Given the task representation model Ω_θ, an online adaptation strategy is proposed to improve the task quality by leveraging the multi-modal sensory feedback, as discussed in the next section.

As discussed in the related work, it is still challenging to model and plan complex force-relevant tasks, mainly due to inaccurate state estimation and the lack of a dynamics model. In our case, it is difficult to explicitly model the relations among the ultrasound images, the probe pose and the contact force. Therefore, we formulate the policy of ultrasound skills as a model-free reinforcement learning problem:

max_{P ∈ D_P, F ∈ D_F} Q_θ(S, P, F)   subject to   F_z ≥ 0,    (1)

where Q_θ denotes the quality of the task, computed by passing the sensory feedback S, P, F through the learned model Ω_θ. The constraint F_z ≥ 0 means that the contact force along the normal direction should be positive. D_P and D_F denote the feasible sets of the probe pose and the contact force, respectively. In our case, these two feasible sets are determined by human demonstrations; however, it is worth mentioning that other task-specific constraints for the pose and the contact force can also be adopted here.

Being model-free, the formulation requires no prior knowledge of the dynamics of the ultrasound scanning process, namely the transition probabilities from one state (the current ultrasound image) to another (the next ultrasound image). More specifically, we choose Monte Carlo policy optimization [52], where potential actions are sampled and selected directly from previously demonstrated experience, as shown in Fig. 4. For the sampling, we impose a bound between (P_t, F_t) and (P_{t+1}, F_{t+1}), which prevents the next state from moving too far away from the current state. If the new state (P_{t+1}, F_{t+1}, S_t) is evaluated as good by the task quality function Q_θ, the desired pose P_{t+1} and contact force F_{t+1} are used as the goal for the human ultrasound scanning guidance. Otherwise, new candidates P_{t+1} and F_{t+1} are sampled from the previously demonstrated experience. This process repeats N times, and the (P_{t+1}, F_{t+1}) with the best task quality is chosen as the final goal for the human scanning guidance.
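A short sketch of this sampling loop, assuming the UltrasoundSkillModel from the previous sketch, follows. The demonstration buffer demos and its sample method, the bound radii, the assumption that the first three entries of F are forces (so index 2 is F_z), and the convention that output index 1 corresponds to the "good" label are all illustrative, not the paper's exact implementation.

```python
# Minimal sketch of the Monte Carlo sampling strategy of Fig. 4.
import torch

def suggest_next_action(model, image, pose_t, force_t, demos,
                        n_samples=1000, pose_bound=0.1, force_bound=2.0):
    """Sample (pose, force) candidates from demonstrated experience (the
    feasible sets D_P, D_F) and return the pair that the learned quality
    model Q_theta rates highest."""
    best_quality, best_action = -float("inf"), (pose_t, force_t)
    for pose_c, force_c in demos.sample(n_samples):
        # Bound the candidate so the next state stays near the current one.
        if (torch.norm(pose_c - pose_t) > pose_bound or
                torch.norm(force_c - force_t) > force_bound):
            continue
        if force_c[2] <= 0:  # normal contact force must be positive
            continue
        with torch.no_grad():
            _, logits = model(image.unsqueeze(0),
                              pose_c.unsqueeze(0), force_c.unsqueeze(0))
            quality = torch.softmax(logits, dim=1)[0, 1]  # P(label = 1)
        if quality > best_quality:
            best_quality, best_action = quality, (pose_c, force_c)
    return best_action  # next desired pose and contact force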
Note that this sampling-based approach does not guarantee the global optimality of Equation 1. However, this is sufficient for the human ultrasound scanning guidance, because the final goal only needs to be updated at a relatively low frequency.

In this section, we use real experiments to examine the effectiveness of our proposed approach to multi-modal task representation learning. In particular, we design experiments to answer the following two questions:
• Does the force modality contribute to the task representation learning?
• Is the sampling-based policy effective on real data?

For the experimental setup, we used a Mindray DC-70 ultrasound machine with an imaging frame rate of 900 Hz. The ultrasound image was captured using a MAGEWELL USB Capture AIO device with a frame rate of 120 Hz and a resolution of 2048 × 2160, as shown in Fig. 5. As shown in Fig. 3, the IMU mounted on the ultrasound probe was an ICM20948 and the microcontroller unit (MCU) was an STM32F411.

The ultrasound data was collected at the Hospital of Wuhan University. The sonographer was asked to scan the left kidneys of 5 volunteers with different physical conditions. Before the examination, the sonographer held the probe vertically above the left kidney of a volunteer, and the scanning began once the recording program was launched. Snapshots of the scanning process are shown in Fig. 6. The collected data consists of ultrasound videos, the probe pose (quaternion), the contact force (force and torque) and labels (1/0). In total, there are 5995 samples, of which 2266 are positive (labeled 1), accounting for 37.8%, and 3729 are negative (labeled 0), accounting for 62.2%. Fig. 7 presents trajectories of the recorded information.

The detailed architecture of our network is shown in Fig. 8. We started the training process with a warm start to classify the ultrasound images, using VGG-16 with a cross-entropy loss; the training data included the ultrasound images and labels, the learning rate was 0.001 and the batch size was 20. For the ultrasound skill evaluation, the training data included the images S, the quaternion P, the force F and the labels; given P, F and S as input, the network outputs the predicted label. We fixed the last fully connected layer of VGG-16 to 128 channels and merged its output with the (P, F) feature vector: four fully connected layers transform the (P, F) vector into 128 channels, which are concatenated with the VGG-16 output vector. From the resulting 256-channel vector, two fully connected layers and a softmax layer output the confidence of the label. Fig. 9 presents the accuracy and loss during training. The classification network finally reached an accuracy of 96.89% on the training set and 95.61% on the validation set.

To confirm the correlation between P and F, we organized the data into different input combinations and trained four networks with different input ports. Net1 was trained with S and P, while Net2 was trained with S and F. Net3 was trained with S, P and F, using two parallel 4-layer fully connected neural networks for P and F. Net4 (Fig. 8) was trained with S, P and F, with concatenated (P, F) vectors.
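The difference between the two fusion schemes can be made explicit in code. The following sketch, with illustrative layer widths, shows only the two pose/force encoders; the image branch and the rest of the network are identical in both variants.

```python
# Sketch contrasting the Net3 and Net4 fusion schemes; widths are assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim=128):
    # 4-layer fully connected encoder, as used for the pose/force branches.
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ReLU(),
        nn.Linear(64, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, out_dim),
    )

class Net3Fusion(nn.Module):
    """Two parallel branches: P and F never interact before fusion."""
    def __init__(self):
        super().__init__()
        self.pose_branch = mlp(4)    # quaternion
        self.force_branch = mlp(6)   # force/torque
    def forward(self, pose, force):
        return torch.cat([self.pose_branch(pose),
                          self.force_branch(force)], dim=1)

class Net4Fusion(nn.Module):
    """Single branch over the concatenated (P, F) vector, so pose and
    force interact from the first layer onward (as in Fig. 8)."""
    def __init__(self):
        super().__init__()
        self.pf_branch = mlp(4 + 6)
    def forward(self, pose, force):
        return self.pf_branch(torch.cat([pose, force], dim=1))
```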
The main difference between Net3 and Net4 was thus the existence of interaction between P and F during the training process. Each network was trained five times for 20 training epochs. Fig. 10 presents the performance of the four networks in validation.

Fig. 10: Accuracy of the four networks in validation. Net1 was trained with S and P. Net2 was trained with S and F. Net3 was trained with S, P and F, without interaction between P and F. Net4 was trained with S, P and F, with interaction between P and F.

Online ultrasound scanning skill guidance: we selected a continuous data stream from the dataset for verification, one that had not been used for training the neural network. The sampling process in Fig. 4 was repeated 1000 times, and the action (P, F) with the best task quality was selected as the next desired action. The whole process took 3 to 5 seconds to output the desired action. Fig. 11 presents the predicted components of the contact force compared with the ground truth. Fig. 12 presents the predicted probe pose with the corresponding ultrasound images. Fig. 13 presents the predicted and true probe poses with the corresponding ultrasound images.

There are some limitations to this work. First, the online guidance method is based on random sampling, which introduces a certain degree of randomness; consequently, the predictions can deviate from the true values in the short term. Second, to ensure the effectiveness of the sampling, a large number of samples is required, which means that a higher task-quality improvement requires more computational cost. As the dataset expands, this method will struggle to meet the requirement of timely guidance; this could be addressed by modeling the feasible sets probabilistically to achieve better sampling efficiency. Finally, we believe that with careful adjustments to the neural network, the efficiency of this model could be greatly improved without losing much accuracy.

This paper presents a framework for learning ultrasound scanning skills from human demonstrations. By analyzing the scanning process of sonographers, we define the entire scanning process as a multi-modal model of the interactions among the ultrasound images, the probe pose and the contact force. A deep-learning-based method is proposed to learn the ultrasound scanning skills, from which a sampling-based strategy for ultrasound scanning guidance is derived. Experimental results show that this framework for ultrasound scanning guidance is robust and points to the possibility of developing a real-time learning-based guidance system. In future work, we will speed up the prediction process by taking advantage of self-supervision, with the goal of porting the learned guidance model to a real robot system.

References

[1] Ultrasound in chronic liver disease.
[2] Gallbladder lesions identified on ultrasound: lessons from the last 10 years.
[3] Utility of common bile duct measurement in ED point-of-care ultrasound: a prospective study.
[4] Contrast-enhanced ultrasound of the spleen.
[5] Ultrasound imaging of the hepatobiliary system and pancreas.
[6] Ultrasound-based imaging methods of the kidney: recent developments.
[7] Contrast-enhanced ultrasound for imaging of adrenal masses.
[8] Diagnosis of postoperative urinary retention using a simplified ultrasound bladder measurement.
[9] Ultrasound of the prostate.
[10] Thyroid ultrasound and the increase in diagnosis of low-risk thyroid cancer.
[11] COVID-19 outbreak: less stethoscope, more ultrasound.
[12] Proposal for international standardization of the use of lung ultrasound for COVID-19 patients; a simple, quantitative, reproducible method.
[13] Teaching medical students diagnostic sonography.
[14] Physician training requirements in sonography: how many cases are needed for competence?
[15] Three-dimensional ultrasound-guided robotic needle placement: an experimental evaluation.
[16] Robotic ultrasound systems in medicine.
[17] 3D ultrasound-guided robotic steering of a flexible needle via visual servoing.
[18] Development of a prototype system for robot-assisted ultrasound diagnosis.
[19] An ultrasound robotic system using the commercial robot UR5.
[20] Study of a 6-DOF robot-assisted ultrasound scanning system and its simulated control handle.
[21] Cobot with prismatic compliant joint intended for Doppler sonography.
[22] A robotic control framework for 3-D quantitative ultrasound elastography.
[23] A semi-autonomous robotic system for remote trauma assessment.
[24] 3D ultrasound imaging of scoliosis with force-sensitive robotic scanning. Third IEEE International Conference on Robotic Computing (IRC).
[25] Automatic force-compliant robotic ultrasound screening of abdominal aortic aneurysms.
[26] Development of a control algorithm for the ultrasound scanning robot (NCCUSR) using ultrasound image and force feedback.
[27] Robotic-arm-based automatic ultrasound scanning for three-dimensional imaging.
[28] Towards MRI-based autonomous robotic US acquisitions: a first feasibility study.
[29] Autonomic robotic ultrasound imaging system based on reinforcement learning.
[30] Robot-assisted semi-autonomous ultrasound imaging with tactile sensing and convolutional neural networks.
[31] Breast cancer classification in ultrasound images using transfer learning.
[32] A supervised learning framework of statistical shape and probability priors for automatic prostate segmentation in ultrasound images.
[33] Automatic thyroid nodule recognition and diagnosis in ultrasound imaging with the YOLOv2 neural network.
[34] Deep learning in medical ultrasound analysis: a review.
[35] Decision fusion-based fetal ultrasound image plane classification using convolutional neural networks.
[36] Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound.
[37] Deep learning for real-time semantic segmentation: application in ultrasound imaging.
[38] Ultrasound confidence maps using random walks.
[39] Optimization of ultrasound image quality via visual servoing.
[40] Confidence-driven control of an ultrasound probe: target-specific acoustic window optimization.
[41] Confidence-driven control of an ultrasound probe.
[42] Automatic probe movement guidance for freehand obstetric ultrasound.
[43] Autonomous navigation of an ultrasound probe towards standard scan planes with deep reinforcement learning.
[44] Automatic force-based probe positioning for precise robotic ultrasound acquisition.
[45] Learning force-relevant skills from human demonstration.
[46] Simultaneously encoding movement and sEMG-based stiffness for robotic skill learning.
[47] Planning for multistage forceful manipulation.
[48] Learning task manifolds for constrained object manipulation.
[49] Learning object-level impedance control for robust grasping and dexterous manipulation.
[50] Learning of grasp adaptation through experience and tactile sensing.
[51] Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks.
[52] Reinforcement learning: An introduction.