title: Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review
authors: Cui, Can; Yang, Haichun; Wang, Yaohong; Zhao, Shilin; Asad, Zuhayr; Coburn, Lori A.; Wilson, Keith T.; Landman, Bennett A.; Huo, Yuankai
date: 2022-03-25

The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on various images (e.g., radiological, pathological, and camera images) and non-image data (e.g., clinical data and genomic data). However, such decision-making procedures can be subjective, qualitative, and have large inter-subject variabilities. With the recent advances in multi-modal deep learning technologies, an increasingly large number of efforts have been devoted to a key question: how do we extract and aggregate multi-modal information to ultimately provide more objective, quantitative computer-aided clinical decision making? This paper reviews the recent studies on dealing with such a question. Briefly, this review includes (1) an overview of current multi-modal learning workflows, (2) a summary of multi-modal fusion methods, (3) a discussion of the performance, (4) applications in disease diagnosis and prognosis, and (5) challenges and future directions.

Routine clinical visits of a single patient might produce digital data in multiple modalities, including image data (i.e., pathological images, radiological images, and camera images) and non-image data (i.e., lab test results and clinical data). The heterogeneous data provide different views of the same patient to better support various clinical decisions (e.g., disease diagnosis and prognosis [1][2][3]). However, such decision-making procedures can be subjective, qualitative, and exhibit large inter-subject variabilities [4][5]. With the rapid development of artificial intelligence technologies, an increasingly large number of deep learning-based solutions have been developed for multi-modal learning in medical applications. Deep learning enables high-level abstraction of complex phenomena within high-dimensional data.

Figure 1. The scope of this review. Multi-modal data containing image data (e.g., radiological images and pathological images) and non-image data (e.g., genomic data and clinical data) are fused through multi-modal learning methods for the diagnosis and prognosis of diseases.

Several surveys have been published on medical multi-modal learning [11][12][13][14]. Boehm et al. [11] reviewed the applications, challenges, and future directions of multi-modal learning in oncology. Huang et al. [12] categorized the fusion methods by the stage of fusion. Schneider et al. [13] and Lu et al. [14] divided the multi-modal learning studies by downstream tasks and modalities. Different from the prior surveys, we review multi-modal fusion techniques from a new perspective, categorizing the methods into operation-based, subspace-based, attention-based, tensor-based, and graph-based fusion methods. We hope the summary of the fusion techniques can foster new methods in medical multi-modal learning. In this survey, we collected and reviewed 29 related works published within the last 5 years.
All of them used deep learning methods to fuse image and non-image medical data for prognosis, diagnosis, or treatment prediction. This survey is organized in the following structure: §2 provides an overview of multi-modal learning for medical diagnosis and prognosis; §3 briefly introduces the data preprocessing and feature extraction for the different uni-modalities, which are the prerequisites for multi-modal fusion; §4 summarizes the categorized multi-modal fusion methods and discusses their motivation, performance, and limitations; §5 provides a comprehensive discussion and future directions; and §6 concludes the survey.

This survey only includes peer-reviewed published studies that used both image and non-image data for disease diagnosis or prognosis in the past five years. All of them used feature-level deep learning-based fusion methods for multi-modal data. A total of 29 studies that satisfied these criteria are reviewed in this survey.

A generalized workflow of the collected studies is shown in Figure 2. Typically, the workflow consists of data preprocessing, uni-modal feature extraction, multi-modal fusion, and a predictor. Due to the heterogeneity of image and non-image modalities, it is rarely feasible to fuse the original data directly, and different modalities typically require separate methods of data preprocessing and feature extraction. Fusion is the crucial step of multi-modal learning and follows the uni-modal data preprocessing and feature extraction steps that serve as its prerequisites. §3 and §4 introduce and discuss the uni-modal feature preparation and the multi-modal fusion, respectively.

Based on the type of inputs for multi-modal fusion, the fusion strategies can be divided into feature-level fusion and decision-level fusion. Feature-level fusion comprises early fusion and intermediate fusion. For decision-level fusion, which is also referred to as late fusion, the prediction results of the uni-modal models (e.g., probabilities, logits, or categorical results from the uni-modal paths in classification tasks) are fused into a multi-modal prediction by majority vote, weighted sum, or averaging. For feature-level fusion, either the extracted high-dimensional features or the original structured data can be used as the inputs. Compared with decision-level fusion, feature-level fusion has the advantage of incorporating the complementary and correlated relationships of the low-level and high-level features of different modalities [12][15], which leads to more variants of fusion techniques. This survey mainly focuses on reviewing methods of feature-level fusion.

Figure 2. Overview of the multi-modal learning workflow. Due to the heterogeneity of the modalities, separate preprocessing and feature extraction methods are used for each modality. For feature-level fusion, the extracted features from the uni-modal models are fused. Note that the feature extraction step can be omitted, because some data can be fused directly (such as tabular clinical features). For decision-level fusion, the different modalities are fused at the probability or final prediction level. Because feature-level fusion contains more variants of fusion strategies, we mainly focus on reviewing the feature-level fusion methods.

In this survey, multi-modal fusion is applied to disease diagnosis and prognosis. The diagnosis tasks include classification (e.g., of disease severity or of benign versus malignant tumors) and regression of clinical scores.
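As a concrete illustration of the decision-level (late) fusion described above, the following is a minimal sketch in PyTorch; the class count, the equal weights, and the placeholder logits are hypothetical and do not correspond to any specific reviewed study.

import torch
import torch.nn.functional as F

def late_fusion_average(image_logits, clinical_logits, weights=(0.5, 0.5)):
    """Weighted average of the uni-modal class probabilities."""
    p_image = F.softmax(image_logits, dim=1)
    p_clinical = F.softmax(clinical_logits, dim=1)
    return weights[0] * p_image + weights[1] * p_clinical

def late_fusion_majority_vote(*unimodal_logits):
    """Majority vote over the hard predictions of the uni-modal models."""
    votes = torch.stack([logits.argmax(dim=1) for logits in unimodal_logits])
    return votes.mode(dim=0).values  # most frequent class per sample

# Toy usage: a batch of 4 patients and 3 classes from two uni-modal models.
image_logits, clinical_logits = torch.randn(4, 3), torch.randn(4, 3)
fused_probabilities = late_fusion_average(image_logits, clinical_logits)
fused_classes = late_fusion_majority_vote(image_logits, clinical_logits)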
Prognosis tasks include survival prediction and treatment response prediction. After obtaining the multi-modal representations, most of the studies used multi-layer perceptrons (MLPs) to generate the prognosis or diagnosis results. The specific diagnosis and prognosis tasks can be categorized into classification or regression tasks depending on whether the outputs are discrete or continuous. To supervise the model training, the cross-entropy loss was usually used for classification tasks, while the mean square error (MSE) was a popular choice for regression tasks. To evaluate the results, the area under the curve (AUC), the receiver operating characteristic (ROC) curve, accuracy, F1-score, sensitivity, and specificity were commonly used for classification, while the MSE was typically used for regression. Although survival prediction can be treated as a time regression task or as a classification of long-term versus short-term survival, the Cox proportional hazards loss function [16] was the popular choice in survival prediction tasks. To evaluate the survival prediction models, the concordance index (c-index) was widely used to measure the concordance between the predicted survival risk and the real survival time.

Due to multi-modal heterogeneity, separate preprocessing and feature extraction methods or networks are required for the different modalities to prepare uni-modal features for fusion. As shown in Table 1, the reviewed studies contain image modalities such as pathological images (H&E), radiological images (CT, MRI, X-ray, fMRI), and camera images (clinical, macroscopic, and dermatoscopic images), as well as non-image modalities such as lab test results (genomic sequences) and clinical features (free-text reports and demographic data). In this section, we briefly introduce these data modalities and summarize the corresponding data preprocessing and feature extraction methods.

Pathological images capture cells and tissues at the microscopic level and are recognized as the "gold standard" for cancer diagnosis [17]. Yet, whole slide images (WSIs) usually cannot be processed directly because of their gigantic size. To focus on the informative regions, regions of interest (ROIs) are usually defined first. Some studies used diagnostic ROIs manually annotated by experts or predicted by pretrained models, while others selected ROIs from dense tissue regions based on pixel intensity. Patches were then usually cropped from the ROIs to fit the computational memory. To learn hidden representations from the image patches, pretrained convolutional neural networks (CNNs) or graph convolutional networks (GCNs) were commonly used as deep learning-based feature extractors, while some other works used conventional feature extraction tools (e.g., CellProfiler [18]) to extract structured features. If multiple patches were extracted from each image, multi-instance learning (MIL) was applied to aggregate the patch information for each pathological image.

Radiological imaging supports medical decisions by providing visible image contrasts inside the human body using radiant energy, and includes magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and X-ray. Similar to the pathological images, some works used manual annotation or pretrained networks to extract ROIs. For feature extraction, some used conventional radiomics methods to extract intensity, shape, and texture features, while more works used 2D or 3D CNNs to learn image representations.
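To make the preceding feature extraction step concrete, the following is a minimal sketch of extracting patch-level features with an ImageNet-pretrained CNN and aggregating them into one case-level vector by mean pooling, a simple MIL baseline. The choice of ResNet-18, the patch size, and the pooling rule are illustrative assumptions rather than the setup of any particular reviewed study (the snippet assumes torchvision >= 0.13 for the weights API).

import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# ImageNet-pretrained backbone with the classification head removed,
# so that the 512-dimensional penultimate features are returned.
encoder = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
encoder.fc = nn.Identity()
encoder.eval()

@torch.no_grad()
def extract_case_feature(patches):
    """patches: (num_patches, 3, 224, 224) cropped from the ROIs of one case.
    Returns a single 512-d vector via mean pooling over the patch features."""
    patch_features = encoder(patches)      # (num_patches, 512)
    return patch_features.mean(dim=0)      # (512,)

# Toy usage with random placeholder patches.
dummy_patches = torch.rand(8, 3, 224, 224)
case_vector = extract_case_feature(dummy_patches)  # shape: (512,)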
Functional MRI (fMRI) is another radiological modality used in some of the reviewed studies, which investigated autism spectrum disorder (ASD) and Alzheimer's disease (AD). The brain images were divided into multiple regions according to a template. Then, the Pearson correlation coefficient between each pair of brain regions was calculated to form the functional connectivity matrix, which was finally vectorized for classification.

In addition to pathological and radiological images, other kinds of medical images captured by optical color cameras are categorized as camera images. Examples include dermoscopic and clinical images of skin lesions [19][20], endoscopic images examining the interior of a hollow organ or body cavity [21], and funduscopic images photographing the rear of the eye [22]. Different from pathological images, camera images can be taken directly, without the need for sectioned and stained sample slides. Also, because camera images are smaller in size, 2D CNNs (often pretrained on ImageNet) are usually applied to the whole image or to detected lesions.

The non-image modalities contain lab test results and clinical features. Laboratory tests check samples of blood, urine, or body tissue, assess the cognitive and psychological status of patients, analyze genomic sequences, etc. Clinical features include demographic information and clinical reports. These modalities are also essential to diagnosis and prognosis in clinical practice. They can be broadly divided into structured data and free-text data, which call for different preprocessing and feature extraction methods.

Most of the clinical data and lab test results in the reviewed works are structured data and can be converted to feature vectors easily. In preprocessing, categorical clinical features were usually converted through one-hot encoding, while numerical features were standardized. As genomic data are high-dimensional, feature selection methods such as selecting the features with the highest variance [5] were used to extract the most expressive features. Missing values are a common problem for structured data. Features with a high missing rate were usually discarded directly, while the remaining missing entries were imputed with the mean, the mode, or the values of similar samples selected by K-nearest neighbors (KNN); some works additionally encoded the missing status as features [23].

Clinical reports capture clinicians' impressions of diseases in the form of unstructured text. Natural language processing techniques are implemented to handle the free-text data and extract informative features from it. For example, Chauhan et al. [24] tokenized the text with ScispaCy [25]; then, a BERT [26] model initialized with weights pretrained on scientific text [27] was used to embed the tokens.

After applying the above modality-specific preprocessing and feature extraction methods, the uni-modal representations were converted to image feature maps or feature vectors. For feature vectors, some conventional feature reduction methods, autoencoders, denoising autoencoders, multi-layer perceptrons (MLPs), and other deep-learning networks were further applied to learn more expressive features with the expected dimensions. For example, Parisot et al.
[28] explored different feature reduction methods, such as recursive feature elimination (RFE), principal component analysis (PCA), and the autoencoder, to reduce the feature dimension of the vectorized functional connectivity matrix. Yan et al. [29] used a denoising autoencoder to enlarge the dimension of the low-dimensional clinical features and enrich their information. Cui et al. [23] deconvoluted the feature vectors to the same size as the feature maps of the image modalities. Sappagh et al. [30] used bidirectional long short-term memory (biLSTM) models to handle the vectorized time-series features. As for the image feature maps learned by CNNs, these feature maps could either be used for fusion directly in order to keep the spatial information, or be vectorized with pooling layers for fusion. Regarding the training strategies, the uni-modal feature extraction module can be independent of the fusion module, or it can be trained or fine-tuned with the fusion module end-to-end.

Fusing the heterogeneous information from multi-modal data to effectively boost model performance is the key pursuit of multi-modal learning. Based on the type of inputs for multi-modal fusion, the fusion strategies can be divided into feature-level fusion and decision-level fusion. Decision-level fusion integrates the probabilistic or categorical predictions from the uni-modal models using averaging, weighted voting, or majority voting [15][31][32][20] to make a final multi-modal prediction. It can tolerate missing modalities, but it may lack interaction among the hidden features. Yoo et al. [33] showed that post-hoc decision-level fusion performs worse than end-to-end-trained decision-level fusion methods, and Holste et al. [15] showed that end-to-end-trained decision-level fusion methods in turn performed worse than feature-level fusion. On the other hand, feature-level fusion fuses the original data or the extracted features of the heterogeneous modalities into a compact and informative multi-modal hidden representation to make the final prediction. Compared with decision-level fusion, more variants of feature-level fusion methods have been proposed to capture the complicated relationships among features from different modalities. This survey reviews these methods and categorizes them into operation-based, subspace-based, attention-based, tensor-based, and graph-based methods. The representative structures of these fusion methods are displayed in Figure 3, and the fusion methods of the reviewed studies are summarized in Table 2.

To combine different feature vectors, the common practice is to perform simple operations: concatenation, element-wise summation, or element-wise multiplication. These operations are parameter-free and flexible to use, but element-wise summation and multiplication require the feature vectors of different modalities to be converted to the same shape. Many early works used one of these simple operations to show that multi-modal learning models outperform uni-modal models [19]. Although the operation-based fusion methods are simple and effective, they might not exploit the complex correlations between heterogeneous modalities. Also, the long feature vectors generated by concatenation may lead to overfitting when the amount of training data is not sufficient [39][40]. More recently, Holste et al. [15] compared these three operation-based methods on the task of breast cancer classification using clinical data and MRI images.
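Before turning to the results of that comparison, the three operations can be summarized in a short sketch; the feature dimensions and the linear projection used to align them are hypothetical placeholders.

import torch
import torch.nn as nn

image_feat = torch.randn(4, 512)     # e.g., CNN features for a batch of 4 patients
clinical_feat = torch.randn(4, 16)   # e.g., encoded clinical variables

# Summation and multiplication need matching shapes, so the low-dimensional
# non-image vector is first projected to the image feature dimension.
project = nn.Linear(16, 512)
clinical_aligned = project(clinical_feat)

fused_concat = torch.cat([image_feat, clinical_feat], dim=1)   # (4, 528)
fused_sum = image_feat + clinical_aligned                      # (4, 512)
fused_product = image_feat * clinical_aligned                  # (4, 512)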
In that study, the low-dimensional non-image features were processed by fully connected layers (FCN) to the same dimension as the image features before fusion. The results showed that the three operations performed comparably (p-value > 0.05), while the element-wise summation and multiplication methods required fewer trainable parameters in the subsequent fully connected layers. When the non-image features learned by the FCN were compared with the original non-image features, the former achieved superior performance. Meanwhile, the concatenation of the feature vectors outperformed the concatenation of logits from the uni-modal models.

Yan et al. [36] investigated the influence of the dimensions of the uni-modal features on the fusion performance when using concatenation. They hypothesized that the high-dimensional vectors of the image data would overwhelm the low-dimensional clinical data. To keep the rich information of the high-dimensional features for a sufficient fusion, they used a denoising autoencoder to increase the dimension of the clinical features. Zhou et al. [41] proposed a three-stage pipeline of uni-modal feature learning and multi-modal feature concatenation, in which every two modalities were fused at the second stage and all three modalities were fused at the third stage, in order to use the maximum number of available samples when some modalities were missing.

The subspace methods aim to learn an informative common subspace across modalities. A popular strategy is to enhance the correlation or similarity of the features from different modalities. Yao et al. [42] proposed the DeepCorrSurv model and evaluated it on the survival prediction task. Inspired by the conventional canonical correlation analysis (CCA) method [53], they proposed an additional CCA-based loss for the supervised FCN network to learn a more correlated feature space for the two modalities. The proposed method outperformed the conventional CCA methods by learning nonlinear features and a supervised correlated space. Zhou et al. [34] designed two similarity losses to enforce the learning of modality-shared information. Specifically, a cosine similarity loss was used to supervise the features learned from the two modalities, and a hetero-center distance loss was designed to penalize the distance between the centers of the clinical features and the CT features belonging to each class. In their experiments, the accuracy dropped from 96.36 to 93.18 without these similarity losses. Li et al. [43] used the average of an L1-norm and an L2-norm loss to improve the similarity of the uni-modal features learned from pathological images and genes; the aligned features were then fused by concatenation as the multi-modal representation. Another study fused the feature vectors from four modalities with the subspace idea for the diagnosis of 20 cancer types [44]. Inspired by the SimSiam network [45], the authors forced the feature vectors from the same subject to be similar with a margin-based hinge loss. Briefly, the cosine similarity scores between the uni-modal features of the same patient were maximized, whereas those of different patients were minimized, and the similarities of different patients were only penalized within a margin. Such regularization enforced similar feature representations for the same patient while avoiding mode collapse. Chauhan et al. [24] used X-ray images and free-text reports in training for pulmonary edema assessment.
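A minimal sketch of such a similarity-based, margin-constrained alignment objective is given below; the margin value, embedding dimension, and exact hinge formulation are illustrative assumptions and differ in detail from the losses used in the individual studies above.

import torch
import torch.nn.functional as F

def alignment_loss(img_emb, gen_emb, margin=0.2):
    """img_emb, gen_emb: (batch, d) embeddings of the same patients, row-aligned.
    Pulls same-patient pairs together; penalizes cross-patient pairs only when
    they come within `margin` of the same-patient similarity."""
    sim = F.cosine_similarity(img_emb.unsqueeze(1), gen_emb.unsqueeze(0), dim=2)
    pos = sim.diagonal()                           # same-patient similarities
    neg = sim - torch.eye(sim.size(0)) * 2.0       # mask out the diagonal
    hinge = torch.clamp(neg.max(dim=1).values - pos + margin, min=0.0)
    return hinge.mean()

# Toy usage with random 128-d embeddings for 4 patients.
loss = alignment_loss(torch.randn(4, 128), torch.randn(4, 128))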
Since only the X-ray images were available at inference, Chauhan et al. did not fuse the learned features of the two modalities; instead, the features were aligned with a ranking-based cosine similarity loss. Such a design was similar to the training strategy of Cheerla et al. [44], which utilized both modalities during training to improve the classification accuracy at inference. They also compared different similarity losses based on ranking the L2-norm and dot-product similarity scores. Another strategy among the subspace-based fusion methods is to learn a complete representation subspace. Li et al. [46] decoded the mean vectors of the multi-modal features and used a reconstruction loss to force the mean vectors to contain the complete information of the different views. The mean vectors trained with the additional decoder and reconstruction loss achieved superior classification accuracy compared with the counterparts without such loss functions. Also, with modality dropout in the training phase, the proposed method learned to reconstruct the missing modality from the learned representation subspace.

Attention-based methods compute and incorporate the importance scores (attention weights) of the multi-modal features when performing aggregation. This process simulates routine clinical practice; for example, the information in a patient's clinical report may prompt the clinician to pay more attention to a certain region of an MRI image. Duanmu et al. [47] built an FCN path for the non-image data along with a CNN path for the image data. The learned feature vectors from the FCN path were employed as channel-wise attention for the CNN path at the corresponding layers. In this way, the low-level and high-level features of the different modalities were fused correspondingly, which achieved better prediction accuracy than simple concatenation. Ye et al. [48] concatenated the learned feature vectors from four modalities and used an attention layer to weight their importance for the downstream task. Chen et al. [49] calculated co-attention weights to generate genomic-guided WSI embeddings. Li et al. [50] aggregated pathological and clinical features to predict the lymph node metastasis (LNM) of breast cancer. To utilize the gigapixel WSIs, they proposed a multi-modal attention-based MIL to achieve patient-level image representation. The clinical features were further integrated to form instance-level image attention for the downstream task. The experiments showed that the proposed attention-based methods outperformed both the gating-based attention used in [51] and a bag-concept layer concatenation [52]. Guan et al. [53] applied the self-attention mechanism [21] to their concatenated multi-modal feature maps. They tiled and transformed the clinical feature vectors to the same shape as the image feature maps to keep the spatial information of the image features. Their performance surpassed both simple concatenation and another subspace method using a similarity loss [24].

In addition to MLPs and CNNs, the attention mechanism was also applied to graph models for multi-modal learning in the medical domain. Cui et al. [23] built a graph where each node was composed of image features and clinical features with category-wise attention. The influence weights of neighbouring nodes were learned by a convolution-based graph attention network (con-GAT) and a novel correlation-based graph attention network (cor-GAT). The attention values were used to update the node features for the final prediction.
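As an illustration of this family of methods, the following is a minimal sketch of channel-wise attention in which a clinical feature vector rescales an image feature map; the layer sizes are hypothetical and the sketch does not reproduce the exact design of any single study above.

import torch
import torch.nn as nn

class ClinicalChannelAttention(nn.Module):
    def __init__(self, clinical_dim, image_channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(clinical_dim, image_channels),
            nn.Sigmoid(),                    # attention weights in (0, 1)
        )

    def forward(self, image_map, clinical):
        # image_map: (batch, C, H, W); clinical: (batch, clinical_dim)
        w = self.gate(clinical).unsqueeze(-1).unsqueeze(-1)   # (batch, C, 1, 1)
        return image_map * w                                  # channel-wise rescaling

# Toy usage: a batch of 2 patients, 64-channel image features, 16 clinical variables.
attn = ClinicalChannelAttention(clinical_dim=16, image_channels=64)
out = attn(torch.randn(2, 64, 14, 14), torch.randn(2, 16))    # (2, 64, 14, 14)

The sigmoid gate keeps the weights between 0 and 1, so the clinical information modulates rather than replaces the image features.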
The above attention-based fusion methods rescale features using complementary information from another modality, while Pölsterl et al. [55] proposed a dynamic affine transform module that also shifts the feature maps. The proposed module dynamically produces a scale factor and an offset conditioned on both the image and the clinical data. In their design, the affine transform was added ahead of the convolutional layer in the last residual block to rescale and shift the image feature maps. As a result, the high-level image features can interact with the compact clinical features, which outperformed simple concatenation and the channel-wise attention-based method [47].

The tensor-based fusion methods conduct outer products across the multi-modal feature vectors to form a higher-order co-occurrence matrix. The high-order interactions tend to provide predictive information beyond what the individual features can provide on their own. For example, a rise in blood pressure is common when a person is under high work pressure, but it is dangerous if symptoms of myocardial infarction and hyperlipidemia are also present [34]. Chen et al. [56] proposed pathomic fusion for prognosis and diagnosis utilizing pathological images, cell graphs, and genomic data. They used a tensor fusion network with a Kronecker product [57] to combine the uni-modal, bimodal, and trimodal features, and a gated-attention layer [58] was added to further control the expressiveness of each modality. Wang et al. [59] used the outer product not only for inter-modal but also for intra-modal feature interactions, surpassing the performance of the CCA-based DeepCorrSurv [42]. More recently, Braman et al. [60] followed the work of pathomic fusion [56] and extended it from three modalities to four. In addition, an orthogonal loss was added to force the learned features of different modalities to be orthogonal to each other, which helped to improve feature diversity and reduce feature redundancy. They showed that their method outperformed simple concatenation and the original Kronecker product.

A graph is a non-grid structure that captures the interactions between individual elements represented as nodes. For disease diagnosis and prognosis, the nodes can represent patients, while the graph edges encode the associations between these patients. Different from CNN-based representations, the constructed population graph updates the features of each patient by aggregating the features of neighbouring patients with similar characteristics. To utilize the complementary information in the non-imaging features, Parisot et al. [28] proposed to build the graph with both image and non-image features to predict autism spectrum disorder and Alzheimer's disease. The nodes of the graph were composed of image features extracted from fMRI images, while the edges of the graph were determined by the pairwise similarities of the image (fMRI) and non-image features (age, gender, site, and gene data) between patients. Specifically, the adjacency matrix was defined by the correlation distance between the subjects' fMRI features multiplied by the similarity measure of the non-image features. Their experiment showed that the proposed GCN model outperformed an MLP with multi-modal concatenation. Following this study, Cao et al.
[61] built graphs similarly but proposed edge dropout and a DeepGCN structure with residual connections instead of the original GCN, which allowed deeper networks while avoiding overfitting and achieved better results.

Abbreviations used in the tables: single nucleotide polymorphism (SNP), copy number variation (CNV), hematoxylin and eosin-stained pathological images (H&E).

In the above sections, we reviewed recent studies using deep learning-based methods to fuse image and non-image modalities for disease prognosis and diagnosis. The feature-level fusion methods were categorized into operation-based, subspace-based, attention-based, tensor-based, and graph-based methods. The operation-based methods are intuitive and effective, but they might yield inferior performance when the interactions between the features of different modalities are complicated; nevertheless, such approaches (e.g., concatenation) are still used to benchmark new fusion methods. Tensor-based methods fuse multi-modal features in a more explicit manner, yet with an increased risk of overfitting. Attention-based methods not only fuse the multi-modal features but also compute the importance of inter- and intra-modal features. Subspace-based methods tend to learn a common space for the different modalities. The current graph-based methods employ graph representations to aggregate the features and incorporate prior knowledge when building the graph structure. Note that the five kinds of fusion methods are not mutually exclusive, since some studies combined multiple kinds of fusion methods to optimize the prediction results.

It is difficult to compare the performance of different feature-level fusion methods directly, since different studies were typically conducted on different datasets with different settings. Moreover, most of the prior studies did not use multiple datasets or external testing sets for evaluation. Therefore, more complete and fair comparative studies and benchmark datasets should be encouraged for multi-modal learning in the medical field.

The reviewed studies showed that the performance of multi-modal models typically surpassed their uni-modal counterparts in downstream tasks such as disease diagnosis or prognosis. On the other hand, some studies also noted that a model fusing more modalities may not always perform better than one with fewer modalities. In other words, the fusion of some modalities may have no influence, or a negative influence, on multi-modal models [35], [48], [60], [61], [28]. This might be because the additional information introduces bias for some tasks. For example, Lu et al. [35] used the data of both primary and metastatic tumors for training to effectively increase the top-k accuracy of the classification of metastatic tumors. However, the accuracy decreased by 4.6% when the biopsy site, a clinical feature, was added. Parisot et al. [28] and Cao et al. [61] demonstrated that the fusion of redundant or noisy information (e.g., age, full intelligence quotient) led to inaccurate neighborhood systems in the population graph and further decreased the model performance. Meanwhile, additional modalities increase the network complexity with more trainable parameters, which may increase the training difficulty and the risk of overfitting. Braman et al. [60] used outer products to fuse uni-modal features.
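For reference, the outer-product (tensor) fusion underlying these methods can be sketched for two modalities as follows; appending a constant 1 to each vector preserves the uni-modal terms, in the spirit of tensor-fusion/Kronecker-product formulations, and the dimensions here are hypothetical. The fused dimension grows multiplicatively with each added modality, which is one source of the parameter growth discussed next.

import torch

def tensor_fusion(img_vec, gen_vec):
    """img_vec: (batch, m), gen_vec: (batch, n) -> fused: (batch, (m+1)*(n+1))."""
    ones = torch.ones(img_vec.size(0), 1)
    a = torch.cat([img_vec, ones], dim=1)        # (batch, m+1)
    b = torch.cat([gen_vec, ones], dim=1)        # (batch, n+1)
    outer = torch.einsum('bi,bj->bij', a, b)     # pairwise feature products
    return outer.flatten(start_dim=1)            # flattened co-occurrence matrix

# Toy usage: 32-d image features and 10-d genomic features for 4 patients.
fused = tensor_fusion(torch.randn(4, 32), torch.randn(4, 10))   # (4, 363)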
However, the outer products of three modalities yielded inferior performance compared with the pairwise fusion and even with the uni-modal models. Thus, although multi-modal learning tends to benefit model performance, modality selection should consider the model capacity, data quality, specific task, etc. This remains an interesting problem worth further exploration.

A concern in this field is data availability. Although deep learning is powerful at extracting patterns from complex data, it requires a large amount of training data to fit a reasonable model. However, over 60% of the reviewed studies used multi-modal datasets containing fewer than 1,000 patients. To improve model performance with limited data, many studies applied feature reduction and data augmentation techniques. Pretrained networks (e.g., pretrained on ImageNet [69]) were widely used instead of training from scratch on small datasets. Meanwhile, several studies [27][22][20] deployed multi-task learning and showed improvements: by sharing representations between related tasks, the models generalized better on the original task. In the future, transfer learning, data synthesis, and federated learning are worth exploring in this domain to optimize model accuracy and improve generalizability with limited data.

Missing data is another aspect of the data availability problem. Complete datasets with all modalities available for every patient are not always guaranteed in routine practice. In the reviewed papers, random modality dropout [34][30], data imputation [30], recurrent neural networks (RNN) [64], and autoencoders [36] have been implemented to handle missing data. However, the comparison of these methods and the influence of missing data in the training and testing phases were not thoroughly investigated. For future work, additional methods such as generative adversarial networks would be a promising direction for addressing the missing data issue.

Explainability is another challenge in multi-modal diagnosis and prognosis. Lack of transparency is identified as one of the main barriers to deploying deep learning methods in clinical practice. An explainable model not only provides a trustworthy result but also helps the discovery of new biomarkers. In the reviewed papers, several explanation methods were used to show the contribution of features to the results. For image data, heatmaps generated with the class activation mapping (CAM) algorithm were used to visualize the activated image regions that were most relevant to the models' outputs [24][48][56]. The activated image regions were compared with prior knowledge to see whether the models focused on the diagnostic characteristics of the images. Li et al. [50] displayed the attention weights of patches to visualize the importance of every patch in the multi-instance learning of a WSI. The studies in [15] and [46] investigated importance scores for individual features, while other studies [30][44] explored the contribution of each modality. The usefulness of these explanations has yet to be validated in clinical practice.

This paper has surveyed recent work on deep multi-modal fusion methods using image and non-image data for medical diagnosis, prognosis, and treatment prediction. The multi-modal framework, the multi-modal medical data, and the corresponding feature extraction were introduced, and the deep fusion methods were categorized and reviewed. In the prior works, multi-modal data typically yielded superior performance compared with uni-modal data.
Integrating multi-modal data with appropriate fusion methods could further improve the performance. On the other hand, there are still open questions about how to achieve more generalizable and explainable models with limited and incomplete multi-modal medical data. In the future, multi-modal learning is expected to play an increasingly important role in precision medicine as a fully quantitative and trustworthy clinical decision support methodology.

This work is supported by Leona M. and Harry B. Helmsley Charitable Trust grant G-1903-03793, NSF CAREER 1452485, and Veterans Affairs Merit Review grants I01BX004366 and I01CX002171.

References
Dermatopathologists' concerns and challenges with clinical information in the skin biopsy requisition form: A mixed-methods study
Non-hematologic diagnosis of systemic mastocytosis: Collaboration of radiology and pathology
Midbrain and hindbrain malformations: Advances in clinical diagnosis, imaging, and genetics
Sources of Variation and Bias in Studies of Diagnostic Accuracy: A Systematic Review
The Effects of Changes in Utilization and Technological Advancements of Cross-Sectional Imaging on Radiologist Workload
A survey on deep learning in medicine: Why, how and when?
A survey on deep learning for multimodal data fusion
Deep learning in digital pathology image analysis: a survey
Deep learning for electronic health records: A comparative review of multiple deep neural architectures
Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis
Harnessing multimodal data integration to advance precision oncology
Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines
Integration of deep learning-based image analysis and genomic data in cancer pathology: A systematic review
Integrating pathomics with radiomics and genomics for cancer prognosis: A brief review
End-to-End Learning of Fused Image and Non-Image Features for Improved Breast Cancer Classification from MRI
DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network
Histopathological Image Analysis: A Review
CellProfiler: Image analysis software for identifying and quantifying cell phenotypes
Multimodal skin lesion classification using deep learning
Seven-Point Checklist and Skin Lesion Classification Using Multitask Multimodal Neural Nets
Accuracy of artificial intelligence-assisted detection of esophageal cancer and neoplasms on endoscopic images: A systematic review and meta-analysis
Applications of deep learning and artificial intelligence in Retina
Co-graph Attention Reasoning Based Imaging and Clinical Features Integration for Lymph
Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment
ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing
BERT: Pre-training of deep bidirectional transformers for language understanding
SCIBERT: A pretrained language model for scientific text
Disease prediction using graph convolutional networks: Application to Autism Spectrum Disorder and Alzheimer's disease
Richer fusion network for breast cancer classification based on multimodal data
Multimodal multitask deep learning model for Alzheimer's disease progression detection based on time series data
Modeling uncertainty in multi-modal fusion for lung cancer survival analysis
Correction to: Evaluation of a convolutional neural network for ovarian tumor differentiation based on magnetic resonance imaging
Deep learning of brain lesion patterns and user-defined clinical and MRI features for predicting conversion to multiple sclerosis from clinically isolated syndrome
Cohesive Multi-modality Feature Learning and Fusion for COVID-19 Patient Severity Prediction
AI-based pathology predicts origins for cancers of unknown primary
Richer fusion network for breast cancer classification based on multimodal data
Predicting cancer outcomes from histology and genomics using convolutional networks
Pan-cancer prognosis prediction using multimodal deep learning
Breaking the curse of dimensionality with convex neural networks
Effective feature learning and fusion of multimodality data using stage-wise deep neural network for dementia diagnosis
Deep correlational learning for survival prediction from multi-modality data
A Novel Pathological Images and Genomic Data Fusion Framework for Breast Cancer Survival Prediction
Deep learning with multimodal representation for pancancer prognosis prediction
Exploring Simple Siamese Representation Learning
G-MIND: an end-to-end multimodal imaging-genetics framework for biomarker identification and disease classification
Prediction of Pathological Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Using Deep Learning with Integrative Imaging
Multimodal Deep Learning for Prognosis Prediction in Renal Cancer
Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images
Multi-modal Multi-instance Learning Using Weakly Correlated Histopathological Images and Tabular Clinical Information
Pathomic fusion: An integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis
Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport
Attention is all you need
Combining 3D Image and Tabular Data via the Dynamic Affine Feature Map Transform
Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis
Tensor fusion network for multimodal sentiment analysis
Attention gated networks: Learning to leverage salient regions in medical images
GPDBN: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction
Deep Orthogonal Fusion: Multimodal Prognostic Biomarker Discovery Integrating Radiology, Pathology, Genomic, and Clinical Data
Using DeepGCN to identify the autism spectrum disorder from multi-site resting-state data
TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays
Attention gated networks: Learning to leverage salient regions in medical images
DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture
Development and validation of a risk prediction model for radiotherapy-related esophageal fistula in esophageal cancer
Cross-modal self-attention network for referring image segmentation
FiLM: Visual reasoning with a general conditioning layer
Latent feature representation with stacked auto-encoder for AD/MCI diagnosis
ImageNet: A large-scale hierarchical image database