key: cord-0527038-3xv84bv5 authors: Adewole, Sodiq; Fernandez, Philip; Yeghyayan, Michelle; Jablonski, James; Copland, Andrew; Porter, Michael; Syed, Sana; Brown, Donald title: Lesion2Vec: Deep Metric Learning for Few-Shot Multiple Lesions Recognition in Wireless Capsule Endoscopy Video date: 2021-01-11 journal: nan DOI: nan sha: a6ce280d1e3b2a1627d50a5351703ac4e9b55725 doc_id: 527038 cord_uid: 3xv84bv5

Effective and rapid detection of lesions in the gastrointestinal (GI) tract is critical to a gastroenterologist's response to some life-threatening diseases. Wireless Capsule Endoscopy (WCE) has revolutionized the traditional endoscopy procedure by allowing gastroenterologists to visualize the entire GI tract non-invasively. Once the tiny capsule is swallowed, it sequentially captures images of the GI tract at about 2 to 6 frames per second (fps). A single video can last up to 8 hours, producing between 30,000 and 100,000 images. Automating the detection of frames containing specific lesions in WCE video would relieve gastroenterologists of the arduous task of reviewing the entire video before making a diagnosis. While WCE produces a large volume of images, only about 5% of the frames contain lesions that aid the diagnosis process. Convolutional Neural Network (CNN) based models have been very successful in various image classification tasks. However, they have excessive parameters, are sample-inefficient, and rely on very large amounts of training data. Deploying a CNN classifier for the lesion detection task would require periodic fine-tuning to generalize to any unforeseen category. In this paper, we propose a metric-based learning framework followed by few-shot lesion recognition in WCE data. Metric-based learning is a meta-learning framework designed to establish similarity or dissimilarity between concepts, while few-shot learning (FSL) aims to identify new concepts from only a small number of examples. We train a feature extractor to learn a representation for different small bowel lesions using metric-based learning. At the testing stage, the category of an unseen sample is predicted from only a few support examples, thereby allowing the model to generalize to a new category that has never been seen before. We demonstrate the efficacy of this method on real patient capsule endoscopy data.

Wireless Capsule Endoscopy allows non-invasive visualization of the entire gastrointestinal tract, including the small bowel region, by the gastroenterologist for various disease diagnoses. Traditional upper and lower endoscopy procedures have limited reach, as they only allow visualization of the upper and lower GI tract, leaving the small bowel region inaccessible. A tiny capsule swallowed by the patient captures images at about 2 to 6 frames per second (fps) as it is propelled down the GI tract by intestinal peristalsis. A single WCE examination can last up to 8 hours, producing between 30,000 and 100,000 images compiled as a video. The collected images are subsequently transferred to a workstation where they are reviewed and analyzed frame-by-frame by an expert gastroenterologist for diagnosis. Usually, only about 5% of the entire video contains visual features that aid the diagnosis process. While WCE is an innovative technology, automating the detection of lesions is critical to increasing its clinical application. For over two decades, automatic detection of lesions in WCE data has received much attention from researchers, and several approaches have been proposed in the literature [1-6].
Recently, Convolutional Neural Network (CNN) based models have gained significant attention and are currently the state-of-the-art models for image classification and object detection tasks [7-10]. While CNN-based models have demonstrated great success, they generally require a large number of labeled examples for training and are sample-inefficient [11]. Due to the required expertise, obtaining labels for medical data is generally very difficult. Moreover, while a single endoscopy examination may contain multiple lesions, only about 5% of the entire CE video offers informative content that aids gastroenterologists in their diagnosis, producing far more examples of normal frames than of the useful abnormal ones. Another unique property of WCE data is the approach to data collection: gastroenterologists can only obtain sample data from patients who present with a specific problem. A traditional classification model trained on currently available data will require fine-tuning whenever cases from unseen categories arise, which in turn calls for another round of sample collection in quantities large enough to ensure that the model generalizes to the new category. Such repeated fine-tuning for every new lesion category would ultimately limit the adoption of the system in real-life clinical settings. In contrast, humans possess a remarkable ability to learn a new concept from only a few instances and quickly generalize to new circumstances [11]: given one or a few template examples, humans are able to generalize to new situations. We employ a similar concept to automatically recognize new categories of lesions in WCE data.

Prior efforts to automate the analysis and detection of lesions in capsule endoscopy data can be broadly grouped into:

• Abnormal frame or outlier detection frameworks [2, 12, 13]. Broad categorization into normal/abnormal helps reduce the redundancy caused by the large number of normal frames and also minimizes the review time and effort needed by the gastroenterologist to make a diagnosis. However, this approach does not offer any granular information that specifically helps identify the characteristic lesions in the images, and it still requires the gastroenterologist to review the frames before making a decision.

• Informative/key frame extraction [14-17]: This approach is otherwise known as the video summarization framework, where the model learns to extract key frames (lesion-containing or abnormal) from the entire video sequence.

• Models that detect a specific abnormality or lesion, such as bleeding in [4], polyps in [5, 6, 18, 19], ulcers in [20], and angioectasia in [21-23].

We review each of these prior works in more detail in the following. In [2], Miaou et al. proposed a four-stage classification model based on low-level Hue-Saturation-Intensity (HSI) features followed by fuzzy c-means clustering analysis to separate images carrying different abnormalities in a step-wise manner. The final stage is a neural network model that discriminates normal from abnormal frames. In [24], Mewes et al. applied a similar multi-stage technique to extract quality frames by removing over-/under-exposed images as well as images with significant non-tissue areas. Using color histograms of images, [3] employed a fuzzy neural model, which combines fuzzy systems and artificial neural networks, to detect abnormal lesions in CE images.
Zhao et al. [13] proposed a temporal segmentation approach based on an adaptive non-parametric key-point detection model using multi-feature extraction and fusion. The aim of their work was not only to detect key abnormal frames using pairwise distance, but also to augment the gastroenterologist's performance by minimizing the miss rate and thus improving detection accuracy. [4] proposed to detect bleeding regions in frames by computing statistical features of the first-order histogram probability of the three color channels (RGB) in the images before passing the computed features to a neural network to discriminate bleeding from non-bleeding frames. However, such low-level hand-crafted feature extraction methods may not scale well. In [6], Mamonov et al. proposed a model for colorectal polyp detection based on binary classification using geometric analysis and the texture content of the frames. Their model achieved 47% sensitivity and 90% specificity. Similarly, Hwang et al. [18] proposed a polyp detection model that first segments the affected region using Gabor texture features and then applies K-means clustering; the resulting geometric information is then used to identify frames containing polyps. Yixuan et al. [19] proposed a bag-of-features (BoF) technique that integrates multiple features, such as texture features, scale-invariant feature transform (SIFT), and complete local binary pattern (CLBP), with visual words to automatically detect polyps in WCE images. While SIFT remains the baseline feature for traditional image analysis, CNN-based models achieve significantly better performance under the complex geometric and lighting conditions typical of CE data.

Traditional low-level feature extraction techniques have been well explored in CE image analysis; however, little attention has been given to CNN-based models. For example, Akiyoshi et al. [21] used a CNN-based model, the Single Shot MultiBox Detector, to automatically detect frames with angioectasia in CE images, while [23] proposed a saliency-based unsupervised method for the same task. [22] combined deep learning and hand-crafted features for polyp detection, and [12] applied a CNN-based model in a semi-supervised context to detect frames with abnormalities. The main drawback of CNN-based models is that they typically require vast quantities of labeled data and suffer from poor sample efficiency, which excludes many applications where data is rare or expensive [25]. Given the volume of frames generated in WCE data, the cost of obtaining expert labels for every frame across multiple patients is generally prohibitive. With a large amount of labeled data, a CNN-based classifier can achieve state-of-the-art performance on different lesion categories. However, given the peculiarity of WCE data, achieving 100% accuracy on a limited set of categories is not sufficient for real-world clinical application. This is further exacerbated by the far greater number of normal example frames in each video, which makes obtaining diverse and sufficient examples of every new lesion category difficult [11]. We propose to tackle this problem using metric-based learning and few-shot classification. To the best of our knowledge, no prior effort has explored few-shot learning for capsule endoscopy image analysis. Our contributions are highlighted as follows:

• This is the first work to propose state-of-the-art few-shot learning in the capsule endoscopy image analysis field.
Our experimental evaluation shows that it is possible to learn much about a new category of lesion from just a few examples.

• We conduct extensive experiments to investigate the factors that affect performance, including which CNN architecture produces better performance and the impact of the number of support samples (shots) on the different lesion categories.

The remainder of the paper is organized as follows. Section 2 reviews the principles and basic framework of metric and few-shot learning. Section 3 presents the application of few-shot learning for lesion recognition in WCE data. Section 4 mainly covers the dataset and experimental results. Section 5 summarizes the paper and key directions for future work.

2 Theoretical Background

Few-shot learning is a special case of meta-learning where we aim to learn new concepts from a limited number of labeled samples and quickly adapt to unforeseen tasks. The few-shot learning task is an extension of the single-shot learning [26] framework, in which the discriminating power of the learned embedding space is evaluated. For single-shot learning, given a test image x_i that we wish to classify into one of C classes, we are also given one example image x_c from each category, c = 1, ..., C. We can then query the network to compute the embeddings f(x_i) and f(x_c) for all c = 1, ..., C, and predict the class corresponding to the minimum distance from the test image:

c* = argmin_{c ∈ {1, ..., C}} d( f(x_i), f(x_c) ), (1)

where d is the distance metric between f(x_i) and each f(x_c). In few-shot learning, there is more than one example of each category; in this case, it is possible to learn a distribution over the embedding space for each category.

Metric-based learning is one of the families of approaches used in few-shot learning [27], where the model learns a representation of the current task such that, given a few support instances, it is able to generalize to an unseen task [11]. Metric-based learning is designed to maximize the inter-class distance between embedding features belonging to different classes while simultaneously minimizing the distance between embedding features belonging to the same class. Among the architectures proposed for metric-based learning are the Siamese Neural Network [28] and the Triplet Network [29]. The Triplet Network (TN) proposed by [29] was inspired by the Siamese Neural Network (SNN) proposed in [28]. The SNN was first proposed to solve the signature verification problem as an image matching task. As shown in Figure 1, the SNN consists of two identical sub-networks that are trained simultaneously. The networks share the same set of parameters and are joined at their outputs by a joining neuron. The networks extract features from a pair of input images of different categories, while the joining neuron measures the distance between the two feature vectors. Learning in the twin networks is done with a distance-based metric. [28] proposed the contrastive loss, which is computed from the Euclidean distance d = ||f(x_1) − f(x_2)||_2 between every pair of inputs (x_1, x_2). During training, the SNN is optimized to minimize the distance between pairs of vectors representing inputs from the same class and to increase the distance between vector representations of inputs from different classes. Subsequently, [29] proposed the triplet loss function, which combines two contrastive losses, one between an anchor and an input from the same class (positive) and one between the anchor and an input from a different class (negative), to form a triplet. This architecture is shown in Figure 2.
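To make the single-shot rule in Eq. (1) and its few-shot extension concrete, here is a minimal sketch in PyTorch. It is our own illustration rather than the authors' released code: the embedding network `embed` and the `support_sets` dictionary are hypothetical names, and the distance is assumed to be Euclidean, as in the rest of the paper.

```python
import torch

def predict_few_shot(embed, query, support_sets):
    """Classify `query` by minimum embedding distance to class supports (Eq. 1).

    embed:        trained embedding network f(x) -> R^D (hypothetical)
    query:        image tensor of shape (3, H, W)
    support_sets: dict {class_label: tensor of shape (k, 3, H, W)} holding
                  the k support examples ("shots") of each class
    """
    embed.eval()
    with torch.no_grad():
        q = embed(query.unsqueeze(0))              # (1, D) query embedding
        best_label, best_dist = None, float("inf")
        for label, shots in support_sets.items():
            s = embed(shots)                       # (k, D) support embeddings
            # Distance to a class = minimum over its k support distances.
            dist = torch.cdist(q, s).min().item()
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label
```

With k = 1 support example per class, the inner minimum disappears and the function reduces exactly to the single-shot rule of Eq. (1).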
The TN and SNN have been successful in other domains, such as face verification and recognition [30, 31], where the models directly learn a mapping from face images to a compact Euclidean space. [26] also reported the effectiveness of such networks on a character recognition problem. The triplet loss is defined as

L = Σ_{i=1}^{N} max( ||f(x_i^a) − f(x_i^p)||_2^2 − ||f(x_i^a) − f(x_i^n)||_2^2 + α, 0 ), (3)

where α is a margin that is enforced between positive and negative pairs to prevent the network from learning a trivial solution, and T is the set of all possible triplets in the training set, with cardinality N. In this paper, we apply the triplet loss function to train the network parameters. The triplet loss function shown in Eq. (3) optimizes the parameters of the network to minimize the distance between an anchor and a positive instance by projecting them toward a single point in the embedding space, while simultaneously maximizing the distance between the anchor and a negative instance, as shown in Figure 2. The embedding function f(x) ∈ R^D, parameterized by θ, extracts features from the input image x to produce a D-dimensional feature vector. The loss ensures that a frame x_i^a (anchor) with a specific lesion is closer to all other frames x_i^p (positive) with the same lesion than to any frame x_i^n (negative) with a different lesion or normal tissue:

||f(x_i^a) − f(x_i^p)||_2^2 + α < ||f(x_i^a) − f(x_i^n)||_2^2 for all (x_i^a, x_i^p, x_i^n) ∈ T,

where f represents the function that the CNN model learns.

3 Few-Shot Lesion Recognition in WCE Data

WCE videos have a variety of challenging characteristics. Due to the complex structure of the GI tract and the motion of the capsule camera, the images suffer from uneven illumination, low resolution, variable focal sharpness, and high compression ratios. Some of the video frames contain strong light reflections or may be out of focus because of peristalsis as well as accidental capsule movements through the GI tract. Moreover, non-lesion frames can show deceptive structures such as bubbles, extraneous food items, fecal matter, turbid fluids, and gastric/intestinal juices. Here we propose deep metric learning to perform few-shot classification on WCE images. The framework of the proposed architecture is shown in Figure 4, and a specific example selection of anchor, positive, and negative instances is shown in Figure 5. Training proceeds in the following steps; a code sketch consolidating Steps 2-4 follows the experimental setup in Section 4.

Step 1 (Pre-processing): After exporting the raw video files from the RapidReader software, we processed all videos into frames. Since the gastroenterologist is usually most interested in the small bowel region, we focus only on images of the small bowel. Each frame was pre-processed to trim off the uninformative boundary region.

Step 2 (Sampling): Each batch sample consists of a triplet of anchor, positive, and negative instances [30]. In WCE, the anchor is randomly initialized to one of the training categories containing a certain lesion (e.g., ulcer, as in Figure 5); a positive instance is another frame with the same lesion, while a negative example is a frame with a different lesion or from the normal category. During training, we used the triplet loss (implemented in PyTorch [32]) to optimize the model parameters, forcing the embeddings of similar frames into the same region of the feature space so that the squared distance between the anchor and a positive is minimal while the distance between the anchor and any negative instance is simultaneously maximized.

Step 3 (Forward Pass): We passed the triplet sample through the network to compute the embedding for each frame, and used the Euclidean distance to compute the distances within the triplet. For each epoch, we compute the triplet loss based on Eq. (3), as implemented in the PyTorch framework.
Step 4 (Back-propagation): The parameters of the network were updated using back-propagation to gradually reduce the loss. We experimented with different optimizers but found stochastic gradient descent (SGD) to work best. The learning rate was set to 0.001 and each model was trained for 150 epochs. Each frame was embedded into a fixed feature vector of dimension 128.

We evaluated the performance of the model on the few-shot classification task using four criteria: Precision, Recall, F1-score, and Accuracy. The evaluation metrics are computed from the number of frames correctly identified as containing a given lesion (true positives, TP); the number of frames correctly identified as containing a different lesion for each class (true negatives, TN); the number of missed frames containing a particular lesion (false negatives, FN); and the number of normal frames wrongly identified as containing a particular lesion (false positives, FP). We compute the evaluation metrics based on the equations below:

Precision = TP / (TP + FP) (4)

Recall = TP / (TP + FN) (5)

F1-score = 2 × (Precision × Recall) / (Precision + Recall) (6)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (7)

4 Dataset and Experimental Results

This section details the makeup of our dataset and the experimental steps. Our dataset consists of real patients' capsule endoscopy data collected under the supervision of expert gastroenterologists. The data were collected from 52 patients using the PillCam SB3 capsule. Using the RapidReader software, each video was extracted and processed into frames of 576 × 576 resolution. Each video was anonymized and annotated by two medical research scientists. We randomly selected samples of 4 different lesions (Whipple's disease, ulcer, bleeding, and angioectasia) in the small bowel region for training. The entire training data consisted of 5,360 frames, with an average of 1,072 for each of the four (4) categories. We randomly split the data, using 70% for training and 30% as the test set. During preprocessing, each frame was cropped from 576 × 576 to 500 × 500 to remove the black boundary region, then resized to 224 × 224 to fit into GPU memory. For each epoch, we performed augmentation using random transformations such as horizontal and vertical flips and random rotation. Our model was implemented in PyTorch [32], and we computed the evaluation metrics using the Scikit-learn package [33]. During the experiments, we performed a series of tests to compare performance across multiple deep CNN architectures [34] as well as varying the number of support samples for the few-shot recognition. While we trained the model on four (4) categories, our testing was done on five (5) different lesion categories, introducing an unseen task to the model. In line with the overall aim of meta-learning, the model is expected to map the fifth category to a region of the embedding space that is distant from the seen categories. We tested three (3) different CNN architectures, VGG-19 [7], ResNet-50 [8], and AlexNet [9], replacing the final fully connected layer of each with the embedding feature dimension. The parameters of the networks were initialized from models pretrained on the ImageNet dataset [10]. The qualitative lesion recognition results for the different CNN models show that a large intra-class distance can cause the model to easily make mistakes. This is most common with lesions such as angioectasia that occupy a very tiny region of the entire frame (see Figure 4). With more support samples, the model computes the minimum over all the distances to each support example.
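As a consolidated view of Steps 2-4 and the training details above, the sketch below shows how such a pipeline could be assembled in PyTorch [32]. It is a minimal reconstruction under stated assumptions, not the authors' released code: the triplet margin, the rotation range, the batch size, and the placeholder triplet generator are our own choices, while the crop/resize sizes, augmentations, ResNet-50 backbone, 128-dimensional embedding, SGD learning rate, and epoch count come from the text.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pre-processing and per-epoch augmentation described in Section 4.
train_transform = transforms.Compose([
    transforms.CenterCrop(500),               # 576x576 -> 500x500, trims black border
    transforms.Resize(224),                   # 224x224 to fit GPU memory
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=30),    # rotation range is an assumption
    transforms.ToTensor(),
])

# ResNet-50 pretrained on ImageNet; the final fully connected layer is
# replaced so the network emits a 128-dimensional embedding.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 128)

criterion = nn.TripletMarginLoss(margin=1.0)  # margin value is an assumption
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

def triplet_batches(num_batches=10, batch_size=8):
    """Hypothetical stand-in for Step 2's sampler: a real implementation
    would yield (anchor, positive, negative) WCE frame batches, with
    train_transform applied to each frame as it is loaded."""
    for _ in range(num_batches):
        yield (torch.randn(batch_size, 3, 224, 224),
               torch.randn(batch_size, 3, 224, 224),
               torch.randn(batch_size, 3, 224, 224))

for epoch in range(150):
    for anchor, positive, negative in triplet_batches():
        optimizer.zero_grad()
        # Step 3: forward pass and triplet loss (Eq. 3) on the embeddings.
        loss = criterion(model(anchor), model(positive), model(negative))
        loss.backward()                       # Step 4: back-propagation
        optimizer.step()
```

At test time, the trained `model` would serve as the embedding network passed to the few-shot prediction sketch in Section 2.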
From Table 2, for a single support sample, we obtained the best performance with the ResNet model, with similar results when the support was increased to three (3). With five (5) and seven (7) supports, the AlexNet model performed better on all metrics. VGG-19 only achieved the best performance when there were more supports to compare against.

Table 2: Comparison of model performance based on k shots.

5 Conclusion

This work proposes few-shot multiple-lesion recognition in wireless capsule endoscopy using a metric-based learning framework. Metric-based learning is designed to establish similarity or dissimilarity between concepts. While CNN-based models have achieved significant improvement on object recognition and image classification tasks, they require vast amounts of labeled data to train, in addition to being sample-inefficient. Obtaining labels for WCE data is very challenging, given the volume of data and the expertise needed to provide frame-by-frame labels. We approached these problems using deep metric-based learning applied to a few-shot lesion recognition task. We experimented with different numbers of support samples as well as different base CNN architectures, and demonstrated the effectiveness of our solution on real patients' WCE videos. With our proposed solution, physicians can easily query a patient video database for a specific abnormality or disease based on other clinical information. Future work will measure the impact of different distance/similarity metrics on model performance and extend this framework to an active learning task in which the model requests a label for any example where it is uncertain.

References

[1] A survey on contemporary computer-aided tumor, polyp, and ulcer detection methods in wireless capsule endoscopy imaging
[2] A multi-stage recognition system to detect different types of abnormality in capsule endoscope images
[3] Neuro-fuzzy classification system for wireless-capsule endoscopic images
[4] Automated bleeding detection in capsule endoscopy videos using statistical features and region growing
[5] Capsule endoscopy versus colonoscopy for the detection of polyps and cancer
[6] Automated polyp detection in colon capsule endoscopy
[7] Very deep convolutional networks for large-scale image recognition
[8] Deep residual learning for image recognition
[9] ImageNet classification with deep convolutional neural networks
[10] ImageNet large scale visual recognition challenge
[11] Few-shot classification of aerial scene images via meta-learning
[12] Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images
[13] An abnormality based WCE video segmentation strategy
[14] Wireless capsule endoscopy video summarization: a learning approach based on Siamese neural network and support vector machine
[15] Adaptive features extraction for capsule endoscopy (CE) video summarization
[16] Reduction of capsule endoscopy reading times by unsupervised image mining
[17] Automatic frame reduction of wireless capsule endoscopy video
[18] Polyp detection in wireless capsule endoscopy videos based on image segmentation and geometric feature
[19] Improved bag of feature for automatic polyp detection in wireless capsule endoscopy images
[20] Saliency based ulcer detection for wireless capsule endoscopy diagnosis
[21] Artificial intelligence using a convolutional neural network for automatic detection of small-bowel angioectasia in capsule endoscopy images
[22] Deep learning and hand-crafted feature based approaches for polyp detection in medical videos
[23] A saliency-based unsupervised method for angiectasia detection in endoscopic video frames
[24] Semantic and topological classification of images in magnetically guided capsule endoscopy
[25] Deep few-shot learning for hyperspectral image classification
[26] Siamese neural networks for one-shot image recognition
[27] Meta-learning: a survey
[28] Signature verification using a "Siamese" time delay neural network
[29] Deep metric learning using triplet network
[30] FaceNet: a unified embedding for face recognition and clustering
[31] DeepFace: closing the gap to human-level performance in face verification
[32] PyTorch: an imperative style, high-performance deep learning library
[33] Scikit-learn: machine learning in Python
[34] Deep learning methods for anatomical landmark detection in video capsule endoscopy images