key: cord-0225376-dqe6sluh
authors: Afshar, Parnian; Naderkhani, Farnoosh; Oikonomou, Anastasia; Rafiee, Moezedin Javad; Mohammadi, Arash; Plataniotis, Konstantinos N.
title: MIXCAPS: A Capsule Network-based Mixture of Experts for Lung Nodule Malignancy Prediction
date: 2020-08-13
journal: nan
DOI: nan
sha: 3181c9ca598bc74325dbc9faf976724233af55b9
doc_id: 225376
cord_uid: dqe6sluh

Lung diseases, including infections such as Pneumonia, Tuberculosis, and novel Coronavirus (COVID-19), together with Lung Cancer, are widespread and are typically considered life threatening. In particular, lung cancer is among the most common and deadliest cancers, with a low 5-year survival rate. Timely diagnosis of lung cancer is, therefore, of paramount importance as it can save countless lives. In this regard, deep learning radiomics solutions have the promise of extracting the most useful features on their own in an end-to-end fashion without having access to the annotated boundaries. Among different deep learning models, Capsule Networks are proposed to overcome shortcomings of Convolutional Neural Networks (CNNs), such as their inability to recognize detailed spatial relations. Capsule networks have so far shown satisfying performance in medical imaging problems. Capitalizing on their success, in this study, we propose a novel capsule network-based mixture of experts, referred to as the MIXCAPS. The proposed MIXCAPS architecture takes advantage not only of the capsule network's ability to handle small datasets, but also of automatic splitting of the dataset through a convolutional gating network. MIXCAPS enables capsule network experts to specialize on different subsets of the data. Our results show that MIXCAPS outperforms a single capsule network and a mixture of CNNs, with an accuracy of 92.88%, sensitivity of 93.2%, specificity of 92.3%, and area under the curve of 0.963. Our experiments also show that there is a relation between the gate outputs and a couple of hand-crafted features, illustrating the explainable nature of the proposed MIXCAPS. To further evaluate the generalization capabilities of the proposed MIXCAPS architecture, additional experiments on a brain tumor dataset are performed, showing the potential of MIXCAPS for detection of tumors related to other organs.

Lung cancer, according to recent statistics [1], is associated with the highest mortality rate among all cancer types and is one of the top three cancers in terms of incidence. The combined 5-year survival for lung cancer is still low [2], at 18%, because the majority of patients are diagnosed at advanced stages [3]. What makes the early diagnosis of lung cancer particularly challenging is the lack of visible warning symptoms and signs in the early stages of the disease. Computed Tomography (CT) [4] is by far one of the most advanced and effective techniques used for lung cancer diagnosis. However, even CT scans may not reveal convincing signs that can contribute to early diagnosis of lung cancer. In other words, imaging features of the nodule such as size, shape, and attenuation, which play an important role in identifying the cancer, may not be immediately accessible to the unaided eye [5]. More importantly, human-centered diagnosis is subject to inter-observer variability, meaning that radiologists can reach different judgments depending on their previous experience. Finally, investigating the test results and reaching a conclusive decision can be extremely time-consuming and burdensome [6].

Radiomics analysis [7, 8, 9], referring to the extraction of several quantitative and semi-quantitative features from medical images, is one of the most successful approaches towards automating the cancer diagnosis/prediction process [10]. Features extracted in the radiomics analysis are aimed at capturing different properties of the nodules, such as their shape and texture. Such features have shown association with the nodule malignancy, its stage, and even the patient's survival time. Radiomics is often categorized into two groups: hand-crafted [11, 12, 13, 14] and deep learning-based. The former category involves extraction of a set of pre-defined features that are further processed and analyzed by a statistical or Machine Learning (ML) model. Despite showing satisfactory results in different tasks [15, 16], hand-crafted radiomics is limited to the features defined by the radiologists, and as such there is no guarantee that the features contribute to the problem at hand. Furthermore, since hand-crafted radiomics features are extracted from the annotated Region of Interest (ROI), they are still subject to inter-observer variability, and besides being time-consuming, their performance highly depends on the accuracy of the provided annotations [17]. In other words, extra effort is required to enhance the annotations and select features that are more descriptive and robust [18].

Deep learning-based radiomics [19, 20, 21], proposed to overcome the shortcomings of its hand-crafted counterpart, does not require prior knowledge about the types of features to be utilized. In other words, deep learning-based techniques are capable of extracting features that can best contribute to the problem at hand in an end-to-end fashion. Furthermore, deep learning-based radiomics does not need to be fed with the annotated ROI, which has the promise of reducing the effect of inter-observer variability as well as the burden of segmenting the images. Among different deep learning techniques, Convolutional Neural Networks (CNNs) are the most popular within the field of radiomics [22], due to their ability to efficiently process and learn meaningful features from medical images [23]. Performance of CNNs, however, partly depends on the size of the available dataset [24]. More specifically, CNNs typically fail to determine the spatial relations between image instances and to identify rotation or transformation of an object. As such, CNNs need to be fed with a large dataset containing all the possible transformations of the objects. Large datasets are, however, not typically available in medical imaging, in particular for lung cancer malignancy prediction.

Capsule networks [25], also referred to as CapsNets, are developed with the aim of overcoming the aforementioned drawbacks of CNNs. CapsNets use capsules, instead of individual neurons, to represent imaging instances. CapsNets, therefore, can identify the spatial relations via their "Routing by Agreement" process, through which capsules try to come to a mutual agreement about the existence of the objects. In particular, the ability of CapsNets to handle transformations is further investigated in Reference [26] for medical image segmentation. In our recent studies [27, 28, 29], we showed superior performance and capabilities of CapsNets for tumor type classification.

Capitalizing on the success of CapsNets, in this study we propose a new framework, referred to as the Mixture of Capsule networks (MIXCAPS), for the task of lung nodule malignancy prediction. The proposed MIXCAPS framework is a "Mixture of Experts" type model [30, 31, 32, 33], which has the potential to noticeably improve the classification accuracy by integrating/coupling several experts (individual CapsNets in the context of the proposed MIXCAPS). To be more precise, a mixture of experts solves the classification problem by splitting the dataset into groups of similar samples, with each expert specializing in classifying similar instances. To the best of our knowledge, the proposed MIXCAPS is the first CapsNet-based mixture of experts framework. The MIXCAPS model benefits from the following three important properties: (i) The embedded capsule network is capable of classifying the lung nodules without requiring a large dataset; (ii) The mixture of experts approach enables each CapsNet within the MIXCAPS architecture to focus on a specific subset of the nodules, therefore improving the overall classification performance of the model; and (iii) As shown in our experiments, MIXCAPS is not restricted to the task of lung nodule malignancy prediction. In fact, it can be easily generalized to the prediction of other tumor types such as brain cancer. The following summarizes our contributions:

• CapsNets are utilized, for the first time, as individual experts within a mixture of experts framework.

• A new and modified CapsNet loss function (margin loss) is developed to reflect the loss associated with the experts and gating models.

• Output of the gating model is investigated for potential correlations with nodule hand-crafted features to improve the potential interpretability of the proposed MIXCAPS.

• Generalizability of the proposed MIXCAPS is illustrated via extension and evaluation based on a separate dataset associated with a different prediction task other than the one initially used to design the framework.

The rest of this paper is organized as follows: First, in Section 2, previous studies on lung nodule malignancy prediction are briefly reviewed. In Section 3, the dataset and the pre-processing steps are described, along with the proposed MIXCAPS. Results and discussions are presented in Section 4. Finally, Section 5 concludes the paper.

Generally speaking, most of the studies based on hand-crafted radiomics follow a pre-defined set of steps [7, 8, 9] :

(i) The first step is to pre-process the images and segment the nodule;

(ii) The second and main step is to extract hundreds of features from the segmented nodule. These features mostly fall into three categories of intensity-based, shape-based, and texture-based features. The first category captures basic properties of the nodule related to its histogram. While shape-based features quantify shape-related properties such as area, diameter, and volume, texture-based ones capture the heterogeneity of the nodule texture;

(iii) In the third step of the hand-crafted radiomics analysis, feature reduction techniques are utilized to select the most relevant and robust features;

(iv) In the final step, extracted features are fed to a statistical or machine learning tool to calculate the desired outcome.

For example, the study performed by the authors of Reference [34] is a recent implementation of the above-mentioned steps for extracting hand-crafted radiomics for lung nodule malignancy prediction. In this study, a total of 385 features are extracted from the annotated nodules. Subsequently, based on a correlation analysis, the non-redundant features are selected and fed to a regression model to output the malignancy probability.
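The correlation-based selection step described above can be sketched as follows. This is a minimal illustration and not the procedure of Reference [34]: the greedy keep-or-drop rule, the 0.9 threshold, the function names, and the toy feature values are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no correlation information
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.9):
    """Greedily keep features whose |correlation| with every kept feature is below threshold.

    features: dict mapping a feature name to its list of values (one per nodule).
    The threshold value is an illustrative assumption.
    """
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy values (hypothetical): diameter and volume are nearly collinear, texture is not.
toy = {
    "diameter": [5.0, 9.8, 14.1, 23.3],
    "volume": [65.0, 496.0, 1470.0, 6663.0],
    "texture_entropy": [2.1, 1.4, 2.9, 1.8],
}
selected = drop_redundant(toy)
```

In this toy case the volume feature is dropped because it is strongly correlated with the diameter, while the texture feature is retained. The surviving features would then be passed to the regression model.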

The limitations of hand-crafted radiomics, including its dependence on the annotated region, have caused a surge of interest in deep learning-based radiomics, especially using CNNs [35, 36]. CNNs are powerful models for analyzing images and extracting features that best contribute to the problem at hand, through trainable filters. Furthermore, filters share weights across the input, which significantly reduces the computational cost compared to a fully-connected network. CNNs have been recently used for the problem of lung nodule malignancy prediction. While some studies [37, 38] have proposed to adopt previously developed CNN architectures for the radiomics analysis, others [22, 39] have designed and optimized their own specific CNN-based models. Although showing satisfying results, most of these studies had to use a data augmentation or transfer learning strategy to compensate for the lack of large datasets specifically for the problem of lung nodule malignancy prediction. These strategies, however, are associated with higher computational cost.

Furthermore, there is still no comprehensive study on the effectiveness of these strategies for nodule malignancy prediction. The capsule network (CapsNet), briefly described in the following section, is an alternative and attractive modeling paradigm to address the aforementioned issues, i.e., accounting for more variations in the input, without resorting to heavy data augmentation. Capsule networks are constructed from capsules as their main building blocks. A capsule, represented by a vector, consists of several neurons that collectively represent a specific object at a specific location. While the neurons capture the instantiation parameters of the object, the length of a capsule determines the existence probability of that object. The most important property of a capsule network, distinguishing it from CNNs, is its routing by agreement process. Generally speaking, each Capsule $i$ in a lower layer, having the instantiation parameter vector $u_i$, tries to predict the output of the capsules in the next layer through a trainable weight matrix $W_{ij}$, given by

$$\hat{u}_{j|i} = W_{ij} u_i,$$

where $\hat{u}_{j|i}$ denotes the prediction for parent Capsule $j$. Through the routing by agreement process, the predictions are evaluated in terms of their similarity to the actual outputs. More weight is then given to the successful predictions, before calculating the final output $s_j$ for Capsule $j$, as follows

$$s_j = \sum_i c_{ij} \hat{u}_{j|i},$$

and

$$a_{ij} = s_j \cdot \hat{u}_{j|i},$$

where $a_{ij}$ shows the agreement between the actual output $s_j$ and its prediction $\hat{u}_{j|i}$, and $c_{ij}$ denotes the score assigned to the prediction based on the obtained agreement, normalized through a softmax. The routing by agreement process, summarized in Fig. 1, enables the capsule network to recognize spatial information between image instances.
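The routing by agreement loop can be sketched in a few lines. The sketch below follows the commonly used dynamic routing formulation of Reference [25] and is not taken from the MIXCAPS implementation: the `squash` non-linearity, the accumulated logits `b`, the use of the squashed output when measuring agreement, and the choice of three iterations are assumptions, and the toy prediction vectors are made up for illustration.

```python
import math

def squash(s):
    """Shrink vector s so that its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(v * v for v in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * v for v in s]

def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """Route predictions u_hat[i][j] (child capsule i -> parent capsule j).

    Returns (outputs, c): the parent outputs and the coupling scores c[i][j]
    used to compute them in the last iteration.
    """
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]  # accumulated agreement (logits)
    for _ in range(iterations):
        # Each child distributes its vote over the parents.
        c = [softmax(row) for row in b]
        outputs = []
        for j in range(n_parent):
            # Weighted sum of the predictions for parent j.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                   for d in range(dim)]
            outputs.append(squash(s_j))
        for i in range(n_child):
            for j in range(n_parent):
                # Agreement: dot product between parent output and prediction.
                a_ij = sum(x * y for x, y in zip(outputs[j], u_hat[i][j]))
                b[i][j] += a_ij
    return outputs, c

# Toy predictions: children 0 and 1 agree about parent 0 but disagree about parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.1]],
]
outputs, c = route(u_hat)
```

In the toy input, the first two children make consistent predictions for parent 0 and contradictory ones for parent 1, so their coupling scores shift towards parent 0 over the iterations.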

Tumor classification based on capsule networks has been investigated in several recent studies, leading to increased performance when compared to CNNs. Lung tumor malignancy prediction is considered in Reference [27], where a multi-scale framework is proposed, outperforming single-scale and multi-scale CNNs. Classifying tumors related to other organs, such as the brain, using capsule networks has also been investigated in several studies [28, 29, 40, 41], leading to satisfying performance. This paper makes a unique contribution in this field by introducing a novel CapsNet architecture based on "Mixture of Experts", which is briefly described below.

Mixture of experts (MoE) [31] refers to adopting several experts, each of which is specialized on a subset of the data, to collectively perform the final prediction task. As shown in Fig. 2, experts are separately fed with the input data and the final output is a weighted average of the predictions coming from all the N active experts. The weight g i assigned to Expert i can be either a pre-determined value or a trainable one. One simple example of the former case is averaging over all the experts' predictions [33]. However, more sophisticated approaches such as soft clustering of the input may also be adopted. In the latter case, weights may be trained jointly with the experts. Another approach to using trainable gating weights is to concatenate the feature vectors obtained from the individual experts and feed the resulting vector to an external gating model to make the final decision.
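The two weighting scenarios can be contrasted with a short sketch. It is illustrative only: the function names are hypothetical, the gate is represented by a plain list of scores rather than a trained network, and the expert outputs are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(expert_outputs, gate_logits=None):
    """Weighted average of the experts' per-class predictions.

    expert_outputs: list of N lists of class probabilities, one list per expert.
    gate_logits: N scores from a gating model; None selects simple averaging.
    """
    n = len(expert_outputs)
    if gate_logits is None:
        weights = [1.0 / n] * n          # pre-determined, equal weights
    else:
        weights = softmax(gate_logits)   # input-dependent weights that sum to one
    n_classes = len(expert_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, expert_outputs))
            for k in range(n_classes)]

# Hypothetical predictions of two experts for one input.
experts = [[0.8, 0.2], [0.3, 0.7]]
averaged = mixture_output(experts)               # simple averaging
gated = mixture_output(experts, [2.0, 0.0])      # the gate favors the first expert
```

With simple averaging both experts contribute equally, whereas with a gate the combined prediction moves towards the expert considered responsible for the input.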

The MoE concept has been widely used in medical imaging. The simple averaging scenario is investigated in References [42] and [43] for retinal vessel detection from fundus images and breast cancer detection from histology images, respectively. Trainable gating weights are studied in Reference [44], where hand-crafted and CNN-based features are combined to detect breast cancer from pathology images. The scenario where gating weights are trained jointly with the experts is investigated in Reference [32] for breast cancer diagnosis. In particular, CNN experts are combined using weights coming from an external gating network. The gating network itself is a CNN, taking the same inputs as the experts, and outputting the probability of each expert being responsible for each particular input. Our proposed MIXCAPS, which is based on the same gating scenario as Reference [32], is explained in the next section, along with its incorporated data pre-processing approach.

In this section, first we present the dataset used to design and develop the proposed MIXCAPS. Afterwards, the pre-processing approach, and the proposed MIXCAPS framework are described. 

The lung nodule malignancy dataset is adopted from the Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) dataset [45, 46, 47]. This dataset consists of CT scans from 1,018 subjects. All the images are labeled and annotated by one to four radiologists. Labels include non-nodule, nodule less than 3 mm in size, and nodules with malignancy scores of 1 to 5, where larger numbers denote a higher possibility of malignancy. In this study, we discarded all the cases with an average malignancy score of 3, which indicates an indeterminate malignancy. We then regrouped labels 1 and 2 as benign nodules, and labels 4 and 5 as malignant nodules. Therefore, we ended up with a binary classification problem with a total of 2,283 nodules.
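The labeling rule can be written down as a small helper. This is a sketch under an assumption: the text specifies that an average score of 3 is discarded, and the helper additionally assumes that any average below 3 is treated as benign and any average above 3 as malignant. The function name is hypothetical.

```python
def binarize_malignancy(scores):
    """Map the radiologists' malignancy scores (1 to 5) for one nodule to a binary label.

    Returns 0 for benign, 1 for malignant, and None when the nodule is discarded.
    Assumption: averages strictly below 3 count as benign, strictly above 3 as malignant.
    """
    mean_score = sum(scores) / len(scores)
    if mean_score == 3:
        return None  # indeterminate malignancy, excluded from the dataset
    return 1 if mean_score > 3 else 0

labels = [binarize_malignancy(s) for s in ([1, 2, 2], [4, 5], [2, 4, 3])]
```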

It is worth mentioning that we included all the annotations provided by all the radiologists as separate nodules. However, the malignancy scores are the average over all the provided scores. For each nodule, we extracted a 3D patch from the center of the nodule (center slice and the two immediate neighbors). Patches are extracted to fit the nodule boundary provided by the radiologists. However, to have fixed-size inputs, all patches were zero-padded to 80 × 80 (the largest possible width and height based on the training data).
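A minimal sketch of this patch extraction is given below. It assumes the CT volume is available as a list of 2D slices and the nodule boundary as a bounding box; both representations, the function names, and the choice to center the crop inside the padded frame are assumptions, since the text does not state where the zeros are placed.

```python
def pad_slice(crop, size=80):
    """Zero-pad a 2D crop (list of rows) to size x size, keeping it centered."""
    h, w = len(crop), len(crop[0])
    if h > size or w > size:
        raise ValueError("crop is larger than the target size")
    top, left = (size - h) // 2, (size - w) // 2
    padded = [[0] * size for _ in range(size)]
    for r in range(h):
        padded[top + r][left:left + w] = crop[r]
    return padded

def extract_patch(volume, center, box, size=80):
    """Crop the nodule box from the center slice and its two immediate neighbors.

    volume: list of 2D slices (each a list of rows).
    center: index of the nodule's center slice.
    box: (row_start, row_stop, col_start, col_stop) of the nodule boundary.
    """
    if center < 1 or center > len(volume) - 2:
        raise ValueError("center slice needs a neighbor on each side")
    r0, r1, c0, c1 = box
    patch = []
    for z in (center - 1, center, center + 1):
        crop = [row[c0:c1] for row in volume[z][r0:r1]]
        patch.append(pad_slice(crop, size))
    return patch

# Synthetic volume of ones: 5 slices of 100 x 100 pixels, with a 20 x 30 nodule box.
volume = [[[1] * 100 for _ in range(100)] for _ in range(5)]
patch = extract_patch(volume, center=2, box=(10, 30, 20, 50))
```

The result is a stack of three 80 × 80 slices in which only the cropped nodule region is non-zero.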

The proposed capsule network-based mixture of experts for lung nodule malignancy prediction, referred to as the MIXCAPS, is shown in 

where $G_1$ and $G_2$ are pre-activation outputs. The Softmax layer ensures that $g_1$ and $g_2$ sum to one. These contributions are multiplied by $o_1$ and $o_2$ to calculate the final prediction $o$ as follows

$$o = g_1 o_1 + g_2 o_2.$$

Output vector $o$ encompasses the probabilities of the benign and malignant classes, denoted by $o^{(0)}$ and $o^{(1)}$, respectively. In other words

$$o = [o^{(0)}, o^{(1)}]^T,$$

where superscript $T$ denotes the transpose operator. Originally, margin loss was proposed for the training of capsule networks. In this study, we adopt the same loss function with the difference that the loss $l$ is calculated over the final output of the MIXCAPS instead of the individual capsule networks, as follows

$$l = l^{(0)} + l^{(1)}, \qquad l^{(k)} = T^{(k)} \max(0, m^+ - o^{(k)})^2 + \lambda \, (1 - T^{(k)}) \max(0, o^{(k)} - m^-)^2, \quad k \in \{0, 1\},$$

where $l^{(0)}$ and $l^{(1)}$ denote the losses associated with the benign and malignant classes, respectively. $m^+$, $\lambda$, and $m^-$ are hyper-parameters. Terms $T^{(0)}$ and $T^{(1)}$ are the ground-truth labels for the benign and malignant classes, respectively.
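To make the loss concrete, the sketch below evaluates the margin loss on the gated combination of two expert outputs for a single nodule. The hyper-parameter values (0.9, 0.1, and 0.5) are the defaults commonly used with capsule networks and are an assumption here, as are the function name and the toy gate and expert values.

```python
def margin_loss(o, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over classes.

    o: final mixture output, o[k] is the predicted probability of class k.
    target: ground-truth indicator per class (0 or 1).
    m_plus, m_minus, lam: hyper-parameters (illustrative default values).
    """
    total = 0.0
    for o_k, t_k in zip(o, target):
        # Penalize a present class whose output falls below m_plus.
        present = t_k * max(0.0, m_plus - o_k) ** 2
        # Penalize an absent class whose output rises above m_minus.
        absent = lam * (1 - t_k) * max(0.0, o_k - m_minus) ** 2
        total += present + absent
    return total

# Toy gate and expert outputs for one malignant nodule: [benign, malignant].
g = [0.7, 0.3]
o1, o2 = [0.2, 0.8], [0.6, 0.4]
o = [g[0] * a + g[1] * b for a, b in zip(o1, o2)]   # gated combination of the experts
loss = margin_loss(o, target=[0, 1])
```

Because the loss is evaluated on the combined output, its gradient reaches both the experts and the gating network, which is what allows them to be trained together.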

According to Reference [31], comparing the desired output with the blend of outputs from the experts leads to a strong coupling between experts and to solutions in which many experts are used for one case. However, in this study, we did not encounter such a problem, and therefore did not adopt non-linear combinations of the outputs.

In this subsection, we revisit the idea of capsule networks and show how they can be viewed within the mixture of experts framework. In other words, we show that a CapsNet is a series of consecutive MoE layers such that each lower level capsule with instantiation vector $u_i$ serves as an expert to predict the output of the capsule in the next layer with instantiation vector $s_j$. Generally speaking, there are several solutions to an MoE problem [48]. An Expectation Maximization (EM) algorithm is one applicable solution, through which the experts' weights are considered as hidden variables, whose posteriors are estimated in the E-step, as follows

$$p(z_i^n \mid t^n, x^n) = \frac{p(t^n \mid z_i^n = 1, x^n) \, p(z_i^n = 1 \mid x^n)}{\sum_{j} p(t^n \mid z_j^n = 1, x^n) \, p(z_j^n = 1 \mid x^n)}, \tag{12}$$

where binary variable $z_i^n$ is one when instance $n$ is assigned to expert $i$, and zero otherwise. Term $p(z_i^n \mid t^n, x^n)$ represents the posterior probability of $z_i^n$ given input vector $x^n$ and target vector $t^n$. Following Bayes' rule, this posterior is calculated using the likelihood term $p(t^n \mid z_i^n = 1, x^n)$ and the prior over $z_i^n$, denoted by $p(z_i^n = 1 \mid x^n)$. All the terms appearing in Eq. (12) can be calculated through the MIXCAPS framework. The likelihood term can be replaced by the output of the expert capsule networks $o_i^{n(1)}$, which denotes the probability of malignancy for Instance $n$, based on the $i$-th expert. The prior probability can also be estimated using the output of the gating model $g_i^n$, denoting the probability of assigning Instance $n$ to Expert $i$. The posterior, therefore, can be defined as

$$p(z_i^n \mid t^n, x^n) = \frac{g_i^n \, o_i^{n(1)}}{\sum_{j=1}^{M} g_j^n \, o_j^{n(1)}}, \tag{13}$$

where $M$ is the number of experts.
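The posterior computation amounts to normalizing the product of gate output and expert output across experts. The sketch below is illustrative only: the function name and the numbers are hypothetical, and the example considers a single malignant instance so that each expert's malignancy probability plays the role of the likelihood.

```python
def expert_posteriors(gate_probs, likelihoods):
    """Posterior responsibility of each expert for one instance (Bayes' rule).

    gate_probs: prior probability of assigning the instance to each expert.
    likelihoods: probability each expert gives to the instance's target.
    """
    joint = [g * o for g, o in zip(gate_probs, likelihoods)]
    evidence = sum(joint)  # normalizer: sum over all M experts
    return [j / evidence for j in joint]

# Hypothetical gate outputs and expert malignancy probabilities for a malignant nodule.
posterior = expert_posteriors([0.7, 0.3], [0.8, 0.4])
```

Here the first expert receives a posterior of about 0.82, higher than its prior of 0.7, because it also explains the target better than the second expert.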

To further shed light on the MoE view of CapsNets, it is interesting to note that the EM formulation of the MoE closely resembles the weight update process of a multiple model (MM) [49] approach. In the MM formulation, observations are sequentially generated from different models and the goal is to identify the contribution of each single model $i$ given all the observations up to the current time ($Y_k$), as follows

$$P(i \mid Y_k) = \frac{p(y_k \mid i, Y_{k-1}) \, P(i \mid Y_{k-1})}{\sum_{j} p(y_k \mid j, Y_{k-1}) \, P(j \mid Y_{k-1})}, \tag{14}$$

where y k is the most recent observation. Comparing Eq. (14) Table 1 , MIXCAPS outperforms its two aforementioned counterparts, in terms of sensitivity, specificity, accuracy, and AUC.

Experiment 2 : In the second experiment, we compare the proposed MIXCAPS with several well-known studies on the same dataset. Table 2 shows these studies, their methods, and the obtained results. As it can be inferred from Table 2 , [5] . However, it is worth mentioning that the aforementioned study utilizes hand-crafted radiomics, requiring fine annotation of the nodules, from which our proposed approach is independent. Reference [50] has obtained a higher specificity compared to the proposed MIXCAPS. Its low sensitivity, however, is a sign of an unbalanced classification and/or over-fitting.

Reference [51] has achieved the highest sensitivity among all the other references. Nevertheless, no confidence interval is provided to ensure the robustness of the result. of being assigned to the first expert. Fig. 4 shows two nodules in the test set.

The left nodule, which has a volume of 496.32 and diameter of 9.823, has a low probability of belonging to the first expert, whereas the nodule on the right, with a volume of 6663.44 and diameter of 23.347, has a high probability of being assigned to the first expert. In other words, the first expert tends to handle larger nodules, compared to the second expert.

Although MoE techniques are shown to be able to improve the classification performance, they typically face an objection related to the high computational cost at the test time. This problem, however, can be dealt with by using distillation [54] . Therefore, in our future studies, we will focus on distilling MIXCAPS into a smaller and more time-efficient model.

Brain tumor is among the deadliest cancers. Determining the type of the tumor, which is a challenging task in terms of accuracy and inter-observer variability, can significantly facilitate the control/treatment process. Therefore, we dedicate this subsection to investigating the generalizability of the proposed MIXCAPS to brain tumor type classification. In a previous study [29], we proposed a capsule network-based framework, which we referred to as the BoxCaps, for brain tumor classification, considering not only raw magnetic resonance imaging (MRI) inputs, but also the coarse tumor boundaries. The motivation behind such a framework was that the whole brain image contains valuable information on the location of the tumor with respect to the brain tissue. The CapsNet, however, tends to get distracted from the main tumor region when being fed with all the details from the brain image. As such, we designed a modified architecture where the output capsules were concatenated with the coarse tumor boundary box. This way, the model had access to both the brain tissue and the tumor region.

To investigate whether the MIXCAPS can be generalized to brain tumor classification, we replaced the capsule experts in MIXCAPS with the previously designed BoxCaps architecture, as shown in Fig. 5. We then tested the resulting framework on a brain tumor dataset [55], where train, validation, and test splits are obtained from the same bootstrapping approach used for the LIDC-IDRI dataset. The aforementioned dataset consists of 3,064 images from 233 patients, diagnosed with one of three brain tumor types, i.e., Meningioma, Pituitary, and Glioma. Table 3 presents the obtained results, according to which the MoE approach leads to higher accuracy compared to a single BoxCaps. Furthermore, the MoE approach leads to higher sensitivity for Glioma and Pituitary, and higher specificity for the Meningioma and Pituitary tumor types.
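For a three-class problem, per-type sensitivity and specificity are usually computed in a one-vs-rest manner. The sketch below shows that computation; the function names are hypothetical and the labels are a made-up toy example, not results from Table 3.

```python
def one_vs_rest_metrics(y_true, y_pred, labels):
    """Per-class sensitivity and specificity, treating each label as positive in turn."""
    pairs = list(zip(y_true, y_pred))
    results = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        tn = len(pairs) - tp - fn - fp
        results[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return results

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy labels for illustration only.
y_true = ["Glioma", "Glioma", "Meningioma", "Pituitary", "Pituitary", "Meningioma"]
y_pred = ["Glioma", "Meningioma", "Meningioma", "Pituitary", "Pituitary", "Glioma"]
metrics = one_vs_rest_metrics(y_true, y_pred, ["Meningioma", "Glioma", "Pituitary"])
acc = accuracy(y_true, y_pred)
```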

In this paper, we proposed a capsule network-based mixture of experts framework, referred to as the MIXCAPS, for lung nodule malignancy prediction. The 

Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries

Cancer statistics

Automated pulmonary nodule detection in ct images using deep convolutional neural networks

Reduced lung-cancer mortality with low-dose computed tomographic screening

Highly accurate model for prediction of lung nodule malignancy with ct scans

Radiomics-based prognosis analysis for non-small cell lung cancer

Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach

Radiomics analysis at pet/ct contributes to prognosis of recurrence and survival in lung cancer treated with stereotactic body radiotherapy

From hand-crafted to deep learning-based cancer radiomics: Challenges and opportunities

Radiomics: extracting more information from medical images using advanced feature analysis

Radiomic features analysis in computed tomography images of lung nodule classification

Radiomic feature clusters and prognostic signatures specific for lung and head and neck cancer

Multiview convolutional neural networks for lung nodule classification

Ct-based radiomic analysis of stereotactic body radiation therapy patients with lung cancer

Radiomics: Images are more than pictures, they are data

Histogram-based models on non-thin section chest ct predict invasiveness of primary lung adenocarcinoma subsolid nodules

Applications and limitations of radiomics

Reproducibility and generalizability in radiomics modeling: Possible strategies in radiologic and statistical perspectives

Deep learning based radiomics (dlr) and its usage in noninvasive idh1 prediction for low grade glioma

Precision radiology: Predicting longevity using feature engineering and deep learning methods in a radiomics framework

Bladder cancer treatment response assessment in ct using radiomics with deep-learning

Discovery radiomics for pathologically-proven computed tomography lung cancer prediction

Imagenet classification with deep convolutional neural networks, Neural Information Processing Systems (NIPS)

Convolutional neural networks: An overview and application in radiology, Insights into Imaging

Dynamic routing between capsules, Neural Information Processing Systems (NIPS)

Capsules for biomedical image segmentation

A 3d multi-scale capsule network for lung nodule malignancy classification

Brain tumor type classification via capsule networks

Capsule networks for brain tumor classification based on mri images and coarse tumor boundaries, IEEE International Conference on Acoustics, Speech and Signal Processing

Improper complex-valued multiple-model adaptive estimation

Adaptive mixtures of local experts

Breast cancer diagnosis in dce-mri using mixture ensemble of convolutional neural networks

Deep cnn ensemble with data augmentation for object detection

Quantitative radiomic model for predicting malignancy of small solid pulmonary nodules detected by low-dose ct screening

Classification of patterns of benignity and malignancy based on ct using topology-based phylogenetic diversity index and convolutional neural network

Multi-view multi-scale cnns for lung nodule type classification from ct images

Automatic feature learning using multichannel roi based on deep structured algorithms for computerized lung cancer diagnosis

Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18f-fdg pet/ct images

Automatic detection of lung nodules: false positive reduction using convolution neural networks and handcrafted features

Dilated capsule network for brain tumor type classification via mri segmented tumor region

Convcaps: Multi-input capsule network for brain tumor classification, International Conference on Neural Information

Ensemble of deep convolutional neural networks for learning to detect retinal vessels in fundus images

Mitosis detection in breast cancer histology images via deep cascaded networks

Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features

The Cancer Imaging Archive

The lung image database consortium (lidc) and image database resource initiative (idri): A

The cancer imaging archive (tcia): Maintaining and operating a public information repository

Hierarchical mixtures of experts and the em algorithm

Improper complex-valued multiple-model adaptive estimation

Lung nodule classification by jointly using visual descriptors and deep features, Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging

Pulmonary nodule classification with deep residual networks

Computer aided lung cancer diagnosis with deep learning algorithms

Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification

Distilling the knowledge in a neural network

Retrieval of brain tumors by adaptive spatial pooling and fisher vector representation