key: cord-0231879-l0lu6u6d
authors: Hamdi, Ali; Aboeleneen, Amr; Shaban, Khaled
title: MARL: Multimodal Attentional Representation Learning for Disease Prediction
date: 2021-05-01
journal: nan
DOI: nan
sha: 16160dff005d9ade585fd2d9784b1603a39e312e
doc_id: 231879
cord_uid: l0lu6u6d

Existing learning models often utilise CT-scan images to predict lung diseases. These models are challenged by high uncertainty that affects lung segmentation and visual feature learning. We introduce MARL, a novel Multimodal Attentional Representation Learning model architecture that learns useful features from multimodal data under uncertainty. We feed the proposed model with both lung CT-scan images and the respective patients' historical biological records collected over time. Such rich data makes it possible to analyse both the spatial and temporal aspects of the disease. MARL employs Fuzzy-based spatial image segmentation to overcome uncertainty in CT-scan images. We then utilise a pre-trained Convolutional Neural Network (CNN) to learn visual representation vectors from the images. We augment the patients' data with statistical features from the segmented images, and develop a Long Short-Term Memory (LSTM) network to represent the augmented data and learn sequential patterns of disease progression. Finally, we inject both the CNN and LSTM feature vectors into an attention layer that helps focus on the most useful learning features. We evaluated MARL on regression of lung disease progression and on disease status classification. MARL outperforms state-of-the-art CNN architectures, such as EfficientNet and DenseNet, as well as baseline prediction models. It achieves a 91% R^2 score, higher than the other models by 8% to 27%. MARL also achieves 97% and 92% accuracy for binary and multi-class classification, respectively, improving the accuracy of state-of-the-art CNN models by 19% to 57%. The results show that combining spatial and sequential temporal features produces more discriminative features.

I. INTRODUCTION

Deep representation learning models are proposed to learn discriminative features in various applications. Recently, lung disease prediction tasks, be it progression regression or status classification, have gained much attention due to the COVID-19 pandemic. Existing prediction models are challenged by uncertainty when determining the correct disease patterns [1]-[3]. This uncertainty affects lung disease prediction when performing lung segmentation and feature representation learning. Lung segmentation is challenging due to the fuzziness of the visual composition of CT-scan images. These images contain radio-density Hounsfield scores that also represent other human body parts. Conventional methods often depend on tissue thresholds over the CT-scan Hounsfield values, and employ morphological operations such as dilation to cover the lung nodules at far borders. However, these methods suffer from uncertainty when separating the lung tissues. Therefore, we propose to apply Fuzzy-based spatial segmentation to the lung CT-scans to reduce noisy spots and spurious blobs in the images [4]. We then employ a pre-trained Convolutional Neural Network (CNN) to learn the visual features of the images. Visual representation learning models depend, in most cases, on CNNs to capture useful patterns in a given image [5]. However, CNNs suffer from challenges such as limited local structural information, as they are designed to learn local descriptors using receptive fields [6].
This, in turn, leads to losing important global structures. We address this issue by augmenting the CNN visual features with global temporal characteristics from patients' biological and health records. We train our proposed hybrid model to learn these temporal features through a Long Short-Term Memory (LSTM) network. However, an LSTM learns features from sequences of a fixed length, limiting the significance of the learnt feature space. We overcome this limitation by employing an attention layer that learns what should be learnt from the visual CNN and sequential LSTM features.

As a case study, we utilise a public dataset for Idiopathic Pulmonary Fibrosis (IPF) lung disease [7]. IPF scars lung tissue and worsens over time for unknown causes [2]. When affected, the lungs cannot take in the required amount of oxygen due to difficulty in breathing. The dataset contains both CT-scan images and patients' biological and health records with different attributes collected over periods of time. Such multimodal data harnesses both spatial visual features of the disease and temporal patient attributes.

Our proposed MARL, a novel Multimodal Attentional Representation Learning model architecture, has multiple components, as visualised in Fig. 1. MARL starts by preprocessing the input lung CT-scan images using Fuzzy-based spatial segmentation and preparing temporal sequences from the corresponding patients' data records. Two deep learning networks, a CNN and an LSTM, are then employed to learn the visual and temporal features, respectively. The encoded pairs of feature vectors are injected into an attention layer, and the final feature vectors are propagated through a fully-connected layer. These final feature vectors are evaluated on multiple downstream tasks such as regressing the disease progression and classifying the disease status. In summary, the multimodal feature vectors learnt by MARL help resolve the uncertainty issues in lung disease prediction through the following contributions:

• Producing accurate visual representation vectors by improving CNN feature learning through Fuzzy-based spatial segmentation.
• Developing effective temporal feature learning from the patients' biological records and statistical information of the CT-scan images using an LSTM network.
• Introducing an attention mechanism that improves the feature representation by focusing on the critical input sequences and image parts.
• Carrying out extensive experimental work to evaluate the proposed model on different lung disease prediction tasks such as lung declination regression and disease status classification.

The rest of the paper is organised as follows. Section II reviews related works and contrasts them with our work. Section III explains the proposed MARL. Section IV describes the experimental setups and discusses the results of the performance evaluations. Section V concludes the paper.

II. RELATED WORK

Recent research efforts have used CNN visual representation learning to advance multiple applications [8]-[10]. CNN-based methods are widely utilised to produce visual feature vectors from CT-scan images for disease prediction [11]. Lung images can be categorised based on the disease status [12], [13]. The authors of [14] developed a CNN network to classify the disease status into positive, possible-to-have, and negative. They collected a lung disease dataset of 1,157 high-resolution images.
Their experimental results showed the superiority of deep learning models over radiologists in both accuracy and speed. However, their results depended on a small dataset, which limits the feature learning space. In this paper, we utilise a dataset of 33,026 CT-scan lung images in addition to the patients' tabular data. Besides, their dataset was annotated by a single expert, whose decisions may be erroneous. In contrast, the dataset we use was created and published by the Open Source Imaging Consortium, a substantial cooperative effort between academia and the healthcare industry.

State-of-the-art CNN-based models have proposed various network architectures to improve image representation learning [5], [15]-[23]. CNN-based models are designed with large sets of layers to adapt to the increasing size and complexity of the training data [24]. However, they usually suffer from overfitting when the training data are relatively small. Recently, the overfitting problem has been addressed by techniques such as data augmentation [25]. Moreover, CNNs neglect useful structures due to the limitations of their receptive fields and isotropic mechanism [6]. Therefore, we propose to combine the visual CT-scan data with the corresponding patients' data. Such multimodality adds useful features that increase the accuracy of lung disease prediction.

The patient data are a set of patient attributes and disease progression measurements over time. Therefore, we employ an LSTM to learn temporal features, combined with the CNN visual representations, in order to improve prediction. LSTMs have recently been used to predict the development of different diseases such as Alzheimer's [26], hand-foot-mouth disease [27], and COVID-19 [28]. An LSTM learns best from regularly spaced timestamps at equal intervals; however, CT-scan data are often collected as patient needs dictate, producing irregular collection sequences. The authors in [29] utilised an adapted LSTM to learn from irregular temporal data points for lung cancer detection. Moreover, an LSTM learns features from sequences of a fixed length, which affects the learning of the feature space. Therefore, we design our proposed MARL to combine CNN and LSTM so as to overcome their respective limitations.

Hybridising CNN and LSTM can substantially increase disease prediction accuracy. Recent work in [30] reported that a hybrid model of LSTM and CNN outperformed human experts in lung disease classification. However, their work did not consider the uncertainty in segmentation, and they used a small dataset of 102 patients. Therefore, there is still a need to address the above-discussed limitations and uncertainties in LSTM and CNN networks. We utilise Fuzzy-based spatial segmentation to improve the lung segmentation before the convolutional feature extraction.

Using multimodal datasets contributes to accurate lung disease prediction [31], [32]. The work in [33] predicted the recurrence of lung cancer based on a multimodal fusion of tomography images and genomics. However, using CNN and LSTM on multimodal data adds complexity to the training process. We design an attentional neural layer at the bottleneck that connects the CNN and LSTM vectors with the fully-connected layer. This attention mechanism makes the model focus on the essential features in the input sequences. The authors in [34] combined medical codes and clinical text notes to implement multimodal attentional neural networks.
Similarly, our proposed model, MARL, combines different data modalities: CT-scan images and patients' biological and health records. MARL extracts useful representations from the CT-scan images, the patient data, and visual statistical information.

III. THE PROPOSED MARL

We propose a novel representation learning model architecture for lung disease prediction. The proposed model is designed to address the uncertainty issues that affect downstream prediction tasks such as disease progression regression and binary and multi-class disease status classification. In this section, we explain the workflow phases of MARL.

The given CT-scan images in the dataset have multiple issues regarding colour exposure and varying sizes. We start by correcting the black exposure to ensure high quality in the subsequent feature extraction step. We also crop and scale up the dataset images to a unified size. Around the lung, the CT-scan images include other human body parts, such as bones and blood vessels. Therefore, the images must be segmented to extract the lung parts only. Pixels in the CT-scan images contain radio-density scores: the pixel value represents the mean attenuation of the tissue on the Hounsfield scale, from -1,024 to +3,071. A Hounsfield unit (HU) is a scaled linear transformation of the radio-density's attenuation coefficient measurements. HU values are calculated as in Eq. 1:

    HU = pixel_value × slope + intercept    (1)

where the slope and intercept are stored in the CT-scan file. The projected HUs are interpreted according to ranges, such as bone from +700 to +3,000 HU and lung around -500 HU. However, segmenting the lung based on these numbers alone is cumbersome, because the visual composition of the different body parts is uncertain at multiple locations. Therefore, we implement spatial segmentation based on Fuzzy C-Means (FCM) applied to the HUs.

The lung segmentation suffers from uncertainty due to the fuzzy area around the lung. This fuzziness arises from the nature of the HU values, which represent the various human body parts around the lung. Using spatial Fuzzy C-Means has advantages over classical Fuzzy C-Means. The latter is sensitive to noisy parts of the given images. Besides, classical FCM expects the data to have robust, well-separated partitions in order to implement useful membership functions; in our case, this assumption does not hold due to the high dependency among the image segments. Spatial FCM computes the likelihood that a neighbourhood pixel belongs to a specific segment, e.g., the lung. The Fuzzy membership function uses this spatial likeliness score to calculate the membership value. The work in [4] proposed to compute the membership values m based on the spatial similarity score and the degree of hesitation, as in Eq. 2:

    m_ij = (u_ij^p h_ij^q) / Σ_{k=1..c} (u_kj^p h_kj^q)    (2)

where m_ij denotes the membership value of pixel j in segment i, u denotes the membership function calculated from the degree-of-hesitation score, p regulates the initial membership's weights, q controls the spatial function, and h is the spatial function summarising the neighbourhood memberships. We then apply standard morphological transformation methods, specifically erosion and dilation, to remove the noise that remains after segmentation. A minimal code sketch of this image preprocessing is given at the end of this subsection.

The dataset also contains patient records of biological information. These tabular data are temporally tagged with the timestamps of their collection. The dataset has the following columns:

• Patient unique identifiers that link the biological data with the CT-scan images.
• Week numbers at which the CT-scans were taken.
• FVC scores (forced vital capacity) and the corresponding FVC percentages.
• Patient attributes such as age, sex, and smoking status.
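Before turning to the temporal data, the sketch below illustrates the image preprocessing described above: converting raw CT pixels to Hounsfield units (Eq. 1), clustering the HU values with fuzzy C-means, and cleaning the resulting mask with erosion and dilation. It is a minimal illustration under assumptions, not the authors' released code: the function names are ours, and it uses classical FCM from scikit-fuzzy, whereas MARL applies the spatial variant of [4], whose neighbourhood term h is omitted here.

    import numpy as np
    import pydicom
    import skfuzzy as fuzz
    from scipy import ndimage

    def to_hounsfield(dicom_path):
        # Apply the linear rescale stored in the CT file (Eq. 1).
        scan = pydicom.dcmread(dicom_path)
        pixels = scan.pixel_array.astype(np.float32)
        return pixels * float(scan.RescaleSlope) + float(scan.RescaleIntercept)

    def fcm_lung_mask(hu_image, n_clusters=2, m=2.0):
        # Cluster HU values; the lung/air cluster has the lowest HU centroid.
        data = hu_image.reshape(1, -1)  # skfuzzy expects (features, samples)
        cntr, u, *_ = fuzz.cmeans(data, n_clusters, m, error=1e-4, maxiter=200)
        lung_cluster = int(np.argmin(cntr[:, 0]))
        mask = (np.argmax(u, axis=0) == lung_cluster).reshape(hu_image.shape)
        # Morphological cleanup, as described above: erosion then dilation.
        return ndimage.binary_dilation(
            ndimage.binary_erosion(mask, iterations=2), iterations=2)

The same mask can then be used both to crop the lung region for the CNN branch and to compute the statistical features described next.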
The biological dataset is unbalanced, as shown in Fig. III-C. The patient ages, CT-scan times, and FVC percentages and scores all show unbalanced distributions. Moreover, the dataset has more records of males than females, and more ex-smokers than smokers and non-smokers. This adds further uncertainty due to the bias towards particular categories. Therefore, we enrich the biological tabular data with visual statistical features: we compute the kurtosis, volume, mean, skewness, and moments of each CT-scan segmented lung. Adding such visual statistical information to the tabular biological data helps achieve high accuracy. Moreover, we implement an LSTM network to mitigate the uncertainty caused by the data imbalance by learning useful sequential temporal patterns.

We implement two deep neural networks: first, a Convolutional Neural Network to extract the CT-scan images' visual features; second, a double Long Short-Term Memory network to learn sequential temporal features from the biological and visual statistical data.

1) Convolutional Neural Networks: We employ EfficientNet [5], which has recently outperformed other pre-trained networks in accuracy, size, and efficiency. A CNN is typically composed of one or more convolutional layers, where each layer i represents a function as in Eq. 3:

    Y_i = F_i(X_i)    (3)

where Y_i is the output, F_i is a convolution operator, and X_i is the input feature tensor of shape ⟨H_i, W_i, C_i⟩, with H_i and W_i denoting the spatial dimensions and C_i the channel dimension. Thus, a convolutional network can be composed of multiple stages or groups of CNN layers, as in Eq. 4 [16]:

    N = ⊙_{i=1..s} F_i^{L_i}(X_{⟨H_i, W_i, C_i⟩})    (4)

where N represents the CNN network in which layer F_i is repeated L_i times in stage i. Most CNN architectures aim to find the best layer design and network scale in terms of length L_i and width C_i. The employed EfficientNet is instead designed to maximise the network performance subject to the available resources:

    max_{d,w,r} Accuracy(N(d, w, r))
    s.t. N(d, w, r) = ⊙_{i=1..s} F̂_i^{d·L̂_i}(X_{⟨r·Ĥ_i, r·Ŵ_i, w·Ĉ_i⟩}),
         Memory(N) ≤ target_memory,
         FLOPS(N) ≤ target_flops    (5)

where w, d, and r denote the width, depth, and resolution coefficients for scaling the network, and F̂_i, L̂_i, Ĥ_i, Ŵ_i, Ĉ_i are the predefined parameters of the baseline network.

Scaling the network depth is a popular approach in CNN design: most recent CNNs assume that deeper networks capture richer and more complex features. However, such networks are difficult to train because of the vanishing gradient problem [35]. Recent advances alleviate this issue via batch normalisation [36] and skip connections [16]; even so, returns diminish, and accuracy stops improving as depth keeps increasing [5]. Wider networks are also assumed to capture more fine-grained features and to be easier to train [37], yet wide but shallow networks have difficulty learning high-level representations. Finally, training a CNN on high-resolution images tends to produce better representations. In our case, the CT-scan images come in different resolutions, and this varying image size adds to the uncertainty in capturing useful visual representations. Therefore, we augment the visual feature vector with another feature vector learnt from the biological and visual statistical data through an LSTM model. The temporally recorded patient biological data are combined with the visual statistical features from the CT-scan images. These data are then padded into identical-length sequences, ready for LSTM learning.
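The following sketch shows one plausible form of this augmentation step: computing the statistics named above for a segmented lung with SciPy, and padding per-patient visit sequences with Keras. The column layout, helper names, and the maximum sequence length are our assumptions rather than the paper's exact choices.

    import numpy as np
    from scipy import stats
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def lung_statistics(segmented_hu):
        # Kurtosis, volume, mean, skewness, and a higher moment of lung pixels.
        voxels = segmented_hu[segmented_hu != 0].ravel()
        return [
            stats.kurtosis(voxels),
            float(voxels.size),          # volume proxy: count of lung pixels
            float(voxels.mean()),
            stats.skew(voxels),
            stats.moment(voxels, moment=3),
        ]

    def build_sequences(patient_visits, max_len=10):
        # patient_visits: patient id -> list of per-visit feature vectors
        # (biological attributes concatenated with the statistics above).
        seqs = [np.asarray(v, dtype="float32") for v in patient_visits.values()]
        return pad_sequences(seqs, maxlen=max_len, dtype="float32",
                             padding="post", truncating="post")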
2) Long Short-Term Memory Networks: LSTM is a type of recurrent neural network featuring feedback connections, by which it processes sequences of inputs rather than single inputs. An LSTM with a forget gate can be implemented as in Eq. 6:

    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
    c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)    (6)

where f_t ∈ R^h denotes the activation vector of the LSTM forget gate, x_t ∈ R^d the input vector to the utilised LSTM network, i_t ∈ R^h the activation vector of the LSTM input and update gate, o_t ∈ R^h the activation vector of the LSTM output gate, c̃_t ∈ R^h the activation vector of the LSTM cell input, c_t ∈ R^h the cell state vector, and h_t ∈ R^h the hidden state or output vector of the LSTM unit. W ∈ R^{h×d} denotes the weights of the input, U ∈ R^{h×h} the weights of the recurrent connections, and b ∈ R^h the bias vector learnt throughout the training process. The superscripts h and d denote the number of hidden units and input features, respectively. We implement two LSTM layers on the biological and visual statistical data. The output feature vector is injected into an attention layer alongside the CNN visual feature vector, as shown in Fig. 1.

3) Attention Layer and Feature Vector Concatenation: At this stage, we have extracted two feature vectors from the CNN and LSTM models. We pass these feature vectors to an attention layer to learn the best features. We implement a dot-product attention layer based on Luong attention [38]. The attention layer expects query T_q and value T_v tensors. It starts by computing the query-value dot product, scores = T_q · T_v. The scores are then turned into a distribution with the softmax function, as in Eq. 7:

    distribution = softmax(scores)    (7)

The output distribution vector is utilised to create a linear combination of the value tensor T_v. The resulting attention vector is passed to a fully-connected layer to learn the final representation vector.

IV. EXPERIMENTS AND RESULTS

We present a set of experiments to highlight MARL's efficacy. We implement its components as follows:

• Preprocessing the CT-scan images:
  - Correcting the black exposure.
  - Unifying the image sizes.
  - Using spatial Fuzzy C-Means to segment the lung.
• Preprocessing the patients' health records:
  - Adding visual statistical features of the corresponding images.
  - Building identical-length sequences ready for LSTM sequential learning.
• Utilising multiple state-of-the-art CNN architectures to learn visual feature vectors from the lung images.
• Using a double LSTM architecture to learn sequential temporal features from the health and visual statistical records.
• Implementing two versions of the attention mechanism:
  - MARL V1: using the CNN visual feature vectors as query tensors to find the best features for the LSTM to learn.
  - MARL V2: using the LSTM feature vectors as query tensors to learn where to focus in the images.
• Adding a fully-connected layer to learn the final feature vectors.

The learnt feature vectors are then ready to be consumed by any downstream task. We introduce the following lung disease tasks:

• Estimating disease progression by adding a regression layer on top of MARL.
• Classifying the lung disease status with binary and multi-class models.

A sketch of how these components fit together is given below.
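The following hedged sketch wires the components of Section III together in Keras: an EfficientNet image branch, a double-LSTM branch, and a Luong-style dot-product attention layer joining them, here in the MARL V1 configuration (CNN features as the query). Input shapes, layer widths, and the regression head are illustrative assumptions, not the authors' exact settings; MARL V2 would swap the query and value roles in the attention call.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_marl(img_shape=(224, 224, 3), seq_len=10, n_feats=12, d=128):
        # CNN branch: pre-trained EfficientNetB0 as the visual encoder.
        img_in = layers.Input(img_shape)
        v = tf.keras.applications.EfficientNetB0(include_top=False,
                                                 pooling="avg")(img_in)
        q = layers.Dense(d)(v)                        # visual feature vector

        # LSTM branch: two stacked LSTMs over the tabular/statistical sequences.
        seq_in = layers.Input((seq_len, n_feats))
        h = layers.LSTM(d, return_sequences=True)(seq_in)
        h = layers.LSTM(d, return_sequences=True)(h)  # temporal feature vectors

        # Dot-product (Luong) attention: scores = q · h_t, softmaxed into a
        # distribution forming a linear combination of the LSTM outputs (Eq. 7).
        attended = layers.Attention()([layers.Reshape((1, d))(q), h])
        z = layers.Concatenate()([q, layers.Flatten()(attended)])
        z = layers.Dense(d, activation="relu")(z)     # fully-connected layer
        return Model([img_in, seq_in], layers.Dense(1)(z))  # e.g. FVC regression

The returned model can be compiled with a regression loss for progression estimation, or its head replaced with a softmax layer for the classification tasks below.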
In the next subsections, we discuss the utilised dataset and the experimental results of both the regression and classification tasks. The data include 1,549 patients' health records and 33,026 CT-scan images, of which 880 are used for testing. A sample of CT slices is shown in Fig. 3. Some CT-scan images have different resolutions, and some need colour correction. Moreover, the dataset provides unbalanced data categories: the number of males exceeds the number of females, and the number of ex-smokers exceeds that of smokers and non-smokers.

Table I reports the experimental results of lung disease regression using state-of-the-art CNNs on the CT-scan images under three setups: MARL V1, MARL V2, and fully-connected regression layers placed directly on top of the CNN architectures to extract the visual features, as explained earlier. The results show that our model outperforms the other models: MARL improves the accuracy of all CNN networks, with performance improvements ranging from 11% to 49%, as shown in Table I. Fig. 4 compares the three experimental setups using a radar chart; the superiority of MARL V1 and V2 is noticeable.

TABLE I
R^2 scores for IPF disease-progression regression.

    Architecture              CNN+FC    MARL V1    MARL V2
    —                         54%       68%        63%
    ResNet [16]               43%       60%        65%
    VGG16 [15]                40%       62%        66%
    Xception [23]             37%       72%        74%
    MobileNetV2 [20]          36%       79%        78%
    DenseNet201 [19]          49%       79%        80%
    InceptionResNetV2 [40]    56%       78%        84%
    EfficientNetB0 [5]        46%       91%        90%
    EfficientNetB5 [5]        48%       91%        89.6%

We also evaluate the impact of each data modality and their combinations. Table II and Fig. 5 compare the performance of the regression models using each data source separately and combined. Using the biological data alone produces better results than the image data and the visual statistical data individually, while the multimodal setting dominates all the other regression models. Besides, the table lists the performance results of MARL V1 and V2 on the multimodal data. MARL outperforms all the other models, with 91% and 89.6% R^2 scores for the V1 and V2 setups, respectively. These results are higher by 8% to 27% R^2 than those of the other models provided with multimodal data.

Fig. 5. A radar chart shows the performances of the regression models using different data modalities.

We evaluate the proposed MARL on binary and multi-class classification tasks. For the binary discretisation, we categorise data instances according to whether FVC >= 2500. For the multi-class categorisation, the percent column is utilised to form three classes: severe (up to 60%), mild (60% to 80%), and good (above 80%). Table III lists the performance results of the binary and multi-class IPF lung disease status classification. Consistent with the regression results, MARL V1 and V2 improve the classification performance of the state-of-the-art CNN models, with accuracy improvements ranging from 19% to 57%, as shown in Table III and Fig. 6 (a) and (b).

TABLE III
Accuracy of binary and multi-class disease-status classification.

                              Binary                         Multi-class
    Architecture              CNN+FC   MARL V1   MARL V2     CNN+FC   MARL V1   MARL V2
    MobileNetV2 [20]          44%      86%       81%         12%      88%       87%
    Xception [23]             66%      91%       93%         32%      87%       88%
    InceptionResNetV2 [40]    69%      96%       93.5%       38%      91%       88.9%
    InceptionV3 [39]          69%      95%       93%         37%      88%       89%
    DenseNet201 [19]          64%      90%       89.5%       40%      88%       89%
    EfficientNetB0 [5]        47%      84%       84%         35%      89%       91%
    EfficientNetB5 [5]        40%      97%       95.3%       40%      92%       89.5%

Besides, we evaluate the utilisation of each data modality and their combinations, as in the regression scenarios; see Table IV and Fig. 6 (c) and (d). Using the biological data alone yields better results in the binary task than in the multi-class task, where adding the visual statistical features contributes to higher performance. Models that use multiple data sources keep producing the best results, especially our proposed MARL, which achieves 97% and 92% accuracy for binary and multi-class classification, respectively.
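For concreteness, the status labels used in these classification experiments can be derived as in the sketch below. The column names (FVC, Percent) follow OSIC-style records and are our assumption; the thresholds are those stated above.

    import pandas as pd

    def binary_status(fvc):
        # Binary discretisation: positive class when FVC >= 2500 ml.
        return int(fvc >= 2500)

    def severity(percent):
        # Multi-class discretisation on the percent column.
        if percent <= 60:
            return "severe"
        if percent <= 80:
            return "mild"
        return "good"

    records = pd.DataFrame({"FVC": [2100, 2750], "Percent": [55.0, 82.3]})
    records["binary_label"] = records["FVC"].apply(binary_status)
    records["status_label"] = records["Percent"].apply(severity)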
V. CONCLUSION

We presented MARL, a novel multimodal attentional neural network architecture for representation learning. The proposed model significantly improves the accuracy of regressing and classifying IPF lung disease progression over state-of-the-art models. Multimodal data enable learning better feature representations than single sources. MARL includes several components designed to overcome the uncertainties in lung disease prediction that arise when performing lung segmentation and feature representation learning. Generalising the proposed architecture to other, different applications is worthwhile future work.

ACKNOWLEDGEMENT

Ali Hamdi is supported by an RMIT Research Stipend Scholarship.

REFERENCES

Diagnosis and management of idiopathic pulmonary fibrosis: French practical guidelines.
Reliability and minimal clinically important differences of FVC: results from the scleroderma lung studies (SLS-I and SLS-II).
Spatiotemporal data mining: a survey on challenges and open problems.
Image segmentation using spatial intuitionistic fuzzy C-means clustering.
EfficientNet: Rethinking model scaling for convolutional neural networks.
Understanding the effective receptive field in deep convolutional neural networks.
Open Source Imaging Consortium (OSIC), 2020.
AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data.
Revisiting self-supervised visual representation learning.
DroTrack: High-speed drone-based object tracking under uncertainty.
Convolutional neural networks: an overview and application in radiology.
An official ATS/ERS/JRS/ALAT statement: idiopathic pulmonary fibrosis: evidence-based guidelines for diagnosis and management.
Updated Fleischner Society guidelines for managing incidental pulmonary nodules: common questions and challenging scenarios.
Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study.
Very deep convolutional networks for large-scale image recognition.
Deep residual learning for image recognition.
Deep networks with stochastic depth.
Identity mappings in deep residual networks.
Densely connected convolutional networks, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
MobileNets: Efficient convolutional neural networks for mobile vision applications.
Learning transferable architectures for scalable image recognition.
Xception: Deep learning with depthwise separable convolutions.
flexgrid2vec: Learning efficient visual representations vectors.
Do we really need to collect millions of faces for effective face recognition.
Predicting Alzheimer's disease using LSTM.
A method for hand-foot-mouth disease prediction using GeoDetector and LSTM model in Guangxi, China.
Time series forecasting of COVID-19 transmission in Canada using LSTM networks.
Distanced LSTM: time-distanced gates in long short-term memory models for lung cancer detection.
Lung cancer histology classification from CT images based on radiomics and deep learning models.
Elaboration of a multimodal MRI-based radiomics signature for the preoperative prediction of the histological subtype in patients with non-small-cell lung cancer.
Deep learning for variational multimodality tumor segmentation in PET/CT.
Multimodal fusion of imaging and genomics for lung cancer recurrence prediction.
MNN: multimodal attentional neural networks for diagnosis prediction.
Wide residual networks.
Batch normalization: Accelerating deep network training by reducing internal covariate shift.
MnasNet: Platform-aware neural architecture search for mobile.
Effective approaches to attention-based neural machine translation.
Rethinking the inception architecture for computer vision.
Inception-v4, Inception-ResNet and the impact of residual connections on learning.
Extremely randomized trees.
The elements of statistical learning: data mining, inference, and prediction.
XGBoost: A scalable tree boosting system.
Random forests.
CatBoost: unbiased boosting with categorical features.
Multi-class AdaBoost.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.