key: cord-0058967-0yy065dp
authors: da Silva, Bruno C. Gregório; Ferrari, Ricardo J.
title: Exploring Deep Convolutional Neural Networks as Feature Extractors for Cell Detection
date: 2020-08-19
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58802-1_7
sha: 4b9aafda3f2d60a9b4ed872fa636c95f0b202e78
doc_id: 58967
cord_uid: 0yy065dp

Among different biological studies, the analysis of leukocyte recruitment is fundamental for the comprehension of immunological diseases. The task of detecting and counting cells in these studies is, however, commonly performed by visual analysis. Although many machine learning techniques have been successfully applied to cell detection, they still rely on domain knowledge, demanding high expertise to create handcrafted features capable of describing the object of interest. In this study, we explored the idea of transfer learning by using pre-trained deep convolutional neural networks (DCNN) as feature extractors for leukocyte detection. We tested several DCNN models trained on the ImageNet dataset on six different videos of mice organs from intravital video microscopy. To evaluate our extracted image features, we used the multiple template matching technique in various scenarios. Our results showed an average increase of 5.5% in the $F_1$-score values when compared with the traditional application of template matching using only the original image information. Code is available at: https://github.com/brunoggregorio/DCNN-feature-extraction.

One of the countless applications of automated image analysis involves cell detection in biological experiments. The automatic detection and counting of leukocytes in the microcirculation of living small animals, for instance, can help in the comprehension of immunological mechanisms underlying inflammatory processes. As a consequence, researchers can develop new drugs and therapeutic strategies to fight several diseases, such as multiple sclerosis, atherosclerosis, ischemia-reperfusion injury, rheumatoid arthritis, and cancer [12]. However, this kind of analysis, which is typically done using intravital video microscopy (IVM), is an arduous and error-prone task, since it is performed by visual observation of the cellular traffic. Different machine learning methods have been proposed in the last few years to overcome this problem. They are often designed for a particular set of images and tested on private datasets using different evaluation metrics [2]. One common approach is the use of shape information in active contours and gradient vector flow for both leukocyte detection and tracking [5, 22-24, 26, 34, 35]. Other works use adaptive template matching [1, 14], image level sets [21], and Monte Carlo techniques [4]. In our previous works, we proposed two different methods for leukocyte detection, based on the local analysis of eigenvalues obtained from Hessian matrices [28, 29] and on second-order moment matrices of the phase congruency technique [10, 11]. Despite promising results (mostly above 0.75 in $F_1$-score), these approaches were developed to enhance and detect blob-like structures in IVM images from the central nervous system (CNS) of mice. Therefore, these methods mostly fail when the cells have distinct appearances or when different image acquisition protocols are used.
Although the works mentioned above have presented significant results, they still rely on domain knowledge, demanding high expertise to create handcrafted features capable of describing the object of interest. In recent decades, artificial neural networks (ANN) and, more specifically, convolutional neural networks (CNN) have attracted considerable attention because of their ability to learn data representations automatically from raw data. Egmont-Petersen et al. [9], for instance, applied an ANN in IVM studies of the mesentery of mice and rats. In their work, they compared the application of an ANN using two training datasets collected from real and synthetic images of cells. Eden et al. [8] also resorted to ANNs for the detection and tracking of leukocytes in IVM. Their cell detection approach started with a motion detection algorithm based on image background subtraction. After this rough detection, they selected only the cells inside the vessel region and used an ANN to classify each sub-region as a target (cell) or a non-target, with the resulting points afterward analyzed by a clustering strategy. These shallow ANN models, however, may not represent complex features, resulting in a low level of generalization and weak learning of data representations.

In order to obtain a CNN model with a high level of generalization and without overfitting, a large number of images with labeled objects is required for proper training. As this condition is not always satisfied, other options should be considered. It is well known that the first layers of deep CNNs (DCNN) trained on natural images learn general features that can be similar to Gabor filters and color blobs [36]. This observation suggests that we can use the output of these layers as feature extractors, in a process called transfer learning. Transfer learning is a popular approach in deep learning in which a network developed for a specific task has the weights of its early layers reused, either as a feature extractor or as a starting point for training a new model, and adapted in response to a new problem. This procedure can exploit the generalization of a previously well-trained architecture in another model setting. In this study, we explore the transfer learning approach by using different models trained on the ImageNet dataset [25] as feature extractors. The resulting feature maps are then selected and used as input for a multiple template matching (MTM) method. Our results show that features extracted by kernels trained on a different task can increase the performance of the well-known template matching technique.
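As a concrete illustration of this transfer learning setup, the sketch below truncates a DCNN pre-trained on ImageNet at an early convolutional layer and uses it as a fixed feature extractor, assuming a TensorFlow/Keras environment. The choice of ResNet50, the layer name `conv1_conv`, and the helper function are illustrative assumptions and do not reproduce the exact code in our repository.

```python
# A minimal sketch of using an ImageNet pre-trained DCNN as a fixed
# feature extractor (assumption: TensorFlow/Keras; the model and layer
# choice are illustrative, not the exact configuration of our code).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# Load the convolutional base only; the ImageNet weights stay frozen
# since the network is used purely as a feature extractor. The input
# size matches the 1400 x 1000 working resolution adopted in Sect. 2.
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(1000, 1400, 3))

# Truncate the network at an early (shallow) convolutional layer, where
# the learned kernels resemble generic Gabor-like and blob detectors.
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("conv1_conv").output)

def extract_feature_maps(frame_gray: np.ndarray) -> np.ndarray:
    """Return the stack of feature images (H, W, C) for one video frame."""
    # Replicate the single gray channel to match the 3-channel input.
    rgb = np.repeat(frame_gray[..., None], 3, axis=-1).astype(np.float32)
    batch = preprocess_input(rgb[None, ...])
    return extractor.predict(batch, verbose=0)[0]
```

Because the truncated model only evaluates the first convolutional block, the forward pass remains cheap even at this relatively high input resolution.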
The rest of this paper is organized as follows: in Sect. 2, we describe our methodology and the database used in this work. Results and discussions are presented in Sect. 3, while in the last section, we make our final considerations.

In this section, we first describe the IVM dataset used in this work and then elaborate on the techniques applied for leukocyte detection and the metrics used to evaluate them. To evaluate our approach, we used six videos from IVM studies, with 705 frames in total, obtained from distinct image acquisition protocols and four different animal organs: brain, spinal cord, cremaster muscle, and mesentery of mice. Figure 1 shows examples of frames from each one of the videos. In all videos, the leukocytes were manually annotated frame by frame by an expert.

All the information necessary to describe our dataset is presented in Table 1. Although some of these videos have a relatively small number of analyzed frames, the total number of manually annotated leukocytes is quite large (see values in Table 1), providing enough data for a proper quantitative evaluation. For more information about the experimental procedures, please refer to our previous works [28, 30] and the works of our collaborators Prof. Juliana Carvalho Tavares, Ph.D. [6, 7] and Prof. Mônica Lopes-Ferreira, Ph.D. [27]. All the images in our dataset went through the following ordered sequence of preprocessing steps: 1) removal of extremely blurred images, 2) noise reduction, 3) contrast standardization, 4) video stabilization, and 5) extraction of the region of interest (ROI). The application of these preprocessing techniques is described in detail in our previous work [28] and is outside the scope of this paper.

Figure 2 illustrates the pipeline of our proposed method for leukocyte detection. The main goal of our approach is to apply the concept of transfer learning using different pre-trained DCNNs and to test the models' genericity when applied to distinct targets. For that, we chose a broad list of models already trained on the ImageNet dataset [25] and used the output of their first convolutional layers in our problem, i.e., in a task entirely different from the original one. Conventional DCNN models generally have a small default input shape. In this study, however, we decided to rescale our input images to a fixed size of 1400 × 1000 pixels, as much of the relevant information in this kind of image is carried by small objects that are quite significant for cellular morphology characterization.

With all the images preprocessed and rescaled, we start our detection pipeline by extracting the first frame of each video and passing it forward through the DCNN model up to the selected layer. In this work, each layer was chosen by visual inspection of its output feature images. Since our image frames present relevant information in small regions, we decided to analyze only the first convolutional layers (shallow layers) of each DCNN. As a consequence of transfer learning, not all output images present relevant characteristics that could help in a detection process. For this reason, we performed a feature image selection step to separate only the best set of features to be used next. To accomplish that, we extracted a small, previously selected ROI and used it as a template for the template matching technique. We then took the corresponding output map for each feature image and applied a thresholding technique to each one of them. In this case, the threshold value was set to 0.9, which results in a map of detection candidates with a high probability of being actual cells. The accuracy of each resulting map was evaluated following the metrics described in Sect. 2.4, but to choose the best set of features, we sorted and normalized the evaluation results in order to select only those top features whose accumulated value (or retained score) was higher than 0.1. Figure 3 illustrates the process of feature selection.
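The selection step can be summarized as in the following sketch, assuming OpenCV for the template matching. The `score_fn` callback stands in for the $F_1$-based evaluation of Sect. 2.4, and the retained-score cut-off implements one plausible reading of the criterion described above; neither is taken verbatim from our repository.

```python
# A minimal sketch of the feature image selection step (assumptions:
# OpenCV for template matching; `score_fn` stands in for the F1-based
# evaluation of Sect. 2.4; the retained-score rule is one plausible
# reading of the criterion described in the text).
import cv2
import numpy as np

def select_feature_maps(maps, template, score_fn,
                        match_thr=0.9, retained=0.1):
    """Rank the channels of `maps` (H, W, C) and keep the best ones."""
    scores = []
    for c in range(maps.shape[-1]):
        channel = cv2.normalize(maps[..., c].astype(np.float32),
                                None, 0.0, 1.0, cv2.NORM_MINMAX)
        ncc = cv2.matchTemplate(channel, template.astype(np.float32),
                                cv2.TM_CCOEFF_NORMED)
        # Keep only high-confidence candidates, then score the channel
        # against the manual annotations (precision/recall/F1).
        scores.append(score_fn(ncc >= match_thr))
    scores = np.asarray(scores, dtype=np.float64)
    scores = scores / (scores.sum() + 1e-12)     # normalize the scores
    keep = np.flatnonzero(scores > retained)     # retained-score cut-off
    return keep[np.argsort(scores[keep])[::-1]]  # best channels first
```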
After the first passage through our pipeline, illustrated by the green dashed arrows in Fig. 2, we have our best set of image features from the DCNN layer and can then apply our approach to all the video frames. The blue arrows in Fig. 2 show the remaining steps in our pipeline. They are similar to the previous steps, except that we now know the best feature set and can finally apply the MTM algorithm to identify the cell candidates. At this step, we also include the original frame image in the vector of selected features. Next, we extracted three pre-selected ROIs from the first frames of each video to be used as template images in the entire processing.

As stated before, to perform our cell detection step and consequently test our proposed approach, we used the template matching (TM) technique [13, 19, 20] with multiple templates. The normalized cross-correlation (NCC) based TM is a pattern recognition algorithm that detects similar objects in an image I(x, y), taking as input the image itself and a template (sub-image) T(x, y) to be detected. The TM algorithm used in this work has the NCC coefficient as its similarity measure, computed as

$$\rho(x, y) = \frac{\sum_{r}\sum_{s}\left[T(r, s) - \bar{T}\right]\left[I(x + r, y + s) - \bar{I}_{T}\right]}{\sqrt{\sum_{r}\sum_{s}\left[T(r, s) - \bar{T}\right]^{2}\;\sum_{r}\sum_{s}\left[I(x + r, y + s) - \bar{I}_{T}\right]^{2}}},$$

where $\bar{T}$ is the average value of the pixel intensities in T(x, y), $\bar{I}_{T}$ is the average value of I in the region coincident with the current position of T, and the sums are realized only over the coordinates common to I(x, y) and T(x, y), delimited by the variables r and s of the summations. The correlation coefficient ρ indicates the level of similarity between the template T and the current image region. It varies in the range [−1, 1], being normalized by the amplitudes of T and I, wherein ρ = 1 means total correlation between the template T and the sub-region of I, ρ = 0 means no correlation, and ρ = −1 means inverse correlation.

As already stated at the beginning of this section, our approach uses a set of leukocyte templates as input for the MTM algorithm. As a result, we obtain intensity maps in which the highest coefficient values indicate the spatial locations in the video frames with high similarity to our selected templates. In the cases where more than one template is used, a fusion step is employed by summing and normalizing all the MTM output maps. Finally, the resulting maps were thresholded using different values, defined in the range [0.7, 0.95] with a step of 0.05.

For the evaluation of our approach, the resulting thresholded maps were assessed by comparing the spatial coordinates of the leukocyte centroids manually identified and annotated by an expert (ground truth) with those automatically detected. In this sense, we considered a detection as true when the distance between a manually annotated centroid and an automatically detected one was less than or equal to k pixels, where k was estimated according to the average radius of the selected templates for each video. Accordingly, in this study we defined the number of true positives (TPs) as the accumulated number of leukocyte positions correctly detected by the algorithm, the false positives (FPs) as the accumulated number of leukocytes automatically detected without correspondence to those manually annotated, and the false negatives (FNs) as the accumulated number of leukocytes that the algorithm could not identify. The measures of precision (P), recall (R), and $F_1$-score [15] were then used to evaluate the overall performance of our proposed approach. These measures are based on the accumulated TPs, FPs, and FNs over the sequence of video frames. They are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = 2\,\frac{P \times R}{P + R}.$$
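The detection and evaluation steps can be sketched as below, assuming OpenCV and SciPy, and assuming templates of identical size so the NCC maps align; the function names are illustrative, not those of our repository.

```python
# A minimal sketch of MTM detection with map fusion and the
# centroid-based evaluation (assumptions: OpenCV and SciPy; all
# templates share the same size so their NCC maps can be summed).
import cv2
import numpy as np
from scipy import ndimage

def mtm_detect(image, templates, threshold):
    """Fuse the NCC maps of several templates and return cell centroids."""
    fused = None
    for tmpl in templates:
        ncc = cv2.matchTemplate(image, tmpl, cv2.TM_CCOEFF_NORMED)
        fused = ncc if fused is None else fused + ncc  # sum the MTM maps
    fused = cv2.normalize(fused, None, 0.0, 1.0, cv2.NORM_MINMAX)
    labels, n = ndimage.label(fused >= threshold)  # group candidate pixels
    centers = ndimage.center_of_mass(fused, labels, range(1, n + 1))
    # Offset by half the template size to map back to image coordinates.
    dy, dx = templates[0].shape[0] // 2, templates[0].shape[1] // 2
    return [(y + dy, x + dx) for y, x in centers]

def evaluate(detections, annotations, k):
    """Greedily match detections to annotations within k pixels; (P, R, F1)."""
    remaining = list(annotations)
    tp = 0
    for d in detections:
        dists = [np.hypot(d[0] - g[0], d[1] - g[1]) for g in remaining]
        if dists and min(dists) <= k:
            tp += 1
            remaining.pop(int(np.argmin(dists)))  # each annotation used once
    fp, fn = len(detections) - tp, len(remaining)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

In our experiments, the fused map would be thresholded at each value in the [0.7, 0.95] sweep and the best-scoring result kept, mirroring the procedure described above.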
The proposed approach for leukocyte detection was quantitatively evaluated using all the videos detailed in Sect. 2.1 for different pre-trained DCNN models. For the sake of comparison, we also applied the MTM directly to the video frames at gray-level intensities, i.e., without passing through any DCNN model. In Fig. 4, we can see the resulting $F_1$-score values for all the videos and models processed. For each one of them, we plotted the best values found in our experiments, considering the set of threshold values tested. Each model exhibits three different bars, colored according to the number of templates used in our experiments with the MTM algorithm.

The plots in Fig. 4 clearly show that selected convolutional layers of CNNs can positively contribute as generic feature extractors, even when coming from models trained for a completely different task. It is also worth noticing that the use of multiple templates can help to recognize targets that are slightly different from one another, like those present in videos B1 and B2. Although some models did not contribute to the MTM technique in most cases, such as ResNet50V2, ResNet101V2, ResNet152V2, InceptionV3, and InceptionResNetV2, the majority of them achieved values similar to or higher than the application of the MTM using only the gray-level intensity information, which also shows the potential of this approach. Table 2 shows the best set of values found in Fig. 4 for a better quantitative comparison of the methods. Indeed, we can observe that only for the C2 video did our strategy present the same value as the gray-level image application, while in the remaining results, we had a considerable improvement (up to 14%). Videos C1 and C2, however, still exhibited low $F_1$-score values (0.66 and 0.60, respectively), which is justifiable since they have the most challenging visual aspects, with a cluttered background and cell sizes on the order of 5 pixels.

Examples of output frames for each processed video are shown in Fig. 5. Each TP point found is illustrated by a green circle, with its respective manual centroid annotation indicated as a cross. The blue circles represent the FP points, while the red squares are the FNs. From the images in Fig. 5, we observe that FN points often correspond to cells that are very close to each other or whose appearance is quite different from the rest. The FP points, however, mostly correspond to bright regions in the images or to erythrocytes, which are smaller cells that appear as bright blurred points in non-consecutive frames and are not part of the manual annotations. Even so, the final results were quite promising and indicate that pre-trained DCNN models can be a good option for generic feature extraction in IVM.

Manual detection and counting of leukocytes are still fundamental tasks in IVM studies. However, visual analysis is error-prone and can generate false statistics and wrong interpretations, depending on the professional's experience and expertise in the field. In this paper, we explored the transfer learning idea by using pre-trained DCNNs for image feature extraction in the application of cell detection in IVM. In order to create reliable image features, we selected the first convolutional layers of DCNNs pre-trained on ImageNet and tested them as the input for the multiple template matching algorithm. We conducted several experiments using six different videos from IVM with one, two, and three template images. Our results showed a considerable improvement in the template matching performance, even using generic image features extracted by DCNN models trained for distinct tasks and targets.
Despite the good results, in cases where small cells are predominant, we observed that some models do not produce information relevant enough for a robust detection process. Nevertheless, we showed that pre-trained DCNN models can indeed provide generic feature images to be used in different applications. Our future work includes tests with models trained on different datasets and a detection process using different strategies for feature selection and fusion in an adaptive multiple threshold technique. Code is available at: https://github.com/brunoggregorio/DCNN-feature-extraction.

References

1. Automatic tracking of rolling leukocytes in vivo
2. Cell tracking via proposal generation and selection
3. Xception: deep learning with depthwise separable convolutions
4. A Monte Carlo approach to rolling leukocyte tracking in vivo
5. Intravital leukocyte detection using the gradient inverse coefficient of variation
6. CCL2 and CCL5 mediate leukocyte adhesion in experimental autoimmune encephalomyelitis: an intravital microscopy study
7. Kinin B2 receptor regulates chemokines CCL2 and CCL5 expression and modulates leukocyte recruitment and pathology in experimental autoimmune encephalomyelitis (EAE) in mice
8. An automated method for analysis of flow characteristics of circulating particles from in vivo video microscopy
9. Detection of leukocytes in contact with the vessel wall from in vivo microscope recordings using a neural network
10. Automatic detection of leukocytes from intravital video microscopy using the phase congruency technique
11. Detection of leukocytes in intravital microscopy video images using the phase congruency technique
12. Intravital microscopy: new insights into cellular interactions
13. Digital Image Processing
14. Biomedical application of target tracking in clutter
15. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation
16. Deep residual learning for image recognition
17. Identity mappings in deep residual networks
18. Densely connected convolutional networks
19. Template matching based on a grayscale hit-or-miss transform
20. Fast template matching
21. Level set analysis for leukocyte detection and tracking
22. A concave cost formulation for parametric curve fitting: detection of leukocytes from intravital microscopy images
23. Motion gradient vector flow: an external force for tracking rolling leukocytes with shape and size constrained active contours
24. Tracking leukocytes in vivo with shape and size constrained active contours
25. ImageNet large scale visual recognition challenge
26. Rolling leukocyte detection based on teardrop shape and the gradient inverse coefficient of variation
27. Stingray venom activates IL-33 producing cardiomyocytes, but not mast cell, to promote acute neutrophil-mediated injury
28. Detection of leukocytes in intravital video microscopy based on the analysis of Hessian matrix eigenvalues
29. Detecting and tracking leukocytes in intravital video microscopy using a Hessian-based spatiotemporal approach
30. Técnica de estabilização de movimento em microscopia intravital utilizando métodos de co-registro de imagens
31. Very deep convolutional networks for large-scale image recognition
32. Inception-v4, Inception-ResNet and the impact of residual connections on learning
33. Rethinking the inception architecture for computer vision
34. Generalized gradient vector flow external forces for active contours
35. Snakes, shapes, and gradient vector flow
36. How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems 27
37. Learning transferable architectures for scalable image recognition