key: cord-0714318-hy905dq8 authors: Shen, Wan Xiang; Liu, Yu; Chen, Yan; Zeng, Xian; Tan, Ying; Jiang, Yu Yang; Chen, Yu Zong title: AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks date: 2022-01-31 journal: Nucleic Acids Res DOI: 10.1093/nar/gkac010 sha: 6bebc1b751c23b626173d258be4eb5f5ef983156 doc_id: 714318 cord_uid: hy905dq8 Omics-based biomedical learning frequently relies on data of high dimensions (up to thousands) and low sample sizes (dozens to hundreds), which challenges efficient deep learning (DL) algorithms, particularly for low-sample omics investigations. Here, a novel unsupervised feature aggregation tool, AggMap, was developed to Aggregate and Map omics features into multi-channel 2D spatially-correlated image-like feature maps (Fmaps) based on their intrinsic correlations. AggMap exhibits strong feature reconstruction capabilities on a randomized benchmark dataset, outperforming existing methods. With AggMap multi-channel Fmaps as inputs, the newly-developed multi-channel DL AggMapNet models outperformed the state-of-the-art machine learning models on 18 low-sample omics benchmark tasks. AggMapNet exhibited better robustness in learning noisy data and disease classification. The AggMapNet explainable module Simply-explainer identified key metabolites and proteins for COVID-19 detections and severity predictions. The unsupervised AggMap algorithm, with its good feature-restructuring ability, combined with the supervised explainable AggMapNet architecture, establishes a pipeline for enhanced learning and interpretability of low-sample omics data. Biomedical investigations frequently rely on high-dimensional, unordered-feature, low-sample-size, and multi-platform (BioHULM) data derived from omics (transcriptomics, proteomics, metabolomics) (1-3) analysis. Individual investigations have primarily focused on dozens to hundreds of samples, and the derived omics data contain hundreds to thousands of unordered features (ordered only by appearance). Although deep learning (DL) is superior in learning complex data, conventional machine learning (ML) methods remain the primary tools for recent BioHULM-based biomedical investigations (1-3), such as nonlinear SVM or ensemble tree-based random forest in combination with various feature selection techniques (1,2). Two obstacles hinder the direct DL of BioHULM data. First, DL models trained on BioHULM data tend to overfit because of their numerous hyperparameters (4). A comprehensive study has revealed that ML models outperform deep representation learning models trained on the same low-sample transcriptomic data (5). Secondly, DL outcomes are more difficult to explain than those of some ML algorithms (6). Explainability is essential in biomedical investigations (7,8), particularly for informed decisions, mechanism investigations, biomarker discoveries, and model assessments (9,10). High-performance and explainable DL algorithms are therefore needed for BioHULM tasks (11). To alleviate the "curse of dimensionality" problem in BioHULM learning, recent studies have focused on converting the 1D unordered data into 2D spatially-correlated image-like feature maps (Fmaps) based on genetic locations (12,13), data neighborhoods (14) or functional relationships (15).
Although this conversion enables efficient deep learning with convolutional neural networks (CNNs), the resulting Fmaps lack rich channel information about the cluster groups of the input feature points. Multi-channel networks are helpful for learning complex data by separately learning feature subsets (16), and their representational richness often allows capturing nonlinear dependencies at multiple scales (17). The localized stationarity and compositionality of data can be efficiently extracted by CNNs (18); the success of CNNs hinges on their ability to leverage compositional hierarchies (19) and intrinsic data structures (18). In this work, to enhance efficient CNN-based learning of low-sample omics data, we developed a novel unsupervised feature aggregation tool, AggMap, for aggregating and mapping individual unordered BioHULM feature points (FPs) into spatially-correlated multi-channel 2D Fmaps (Figure 1A). AggMap feature restructuring operates on both the spatial and the channel dimensions of the Fmaps. With unsupervised AggMap, FPs are embedded in a 2D space using the manifold learning method Uniform Manifold Approximation and Projection (UMAP) (20) based on their pairwise correlation distances. Meanwhile, the FPs are agglomerated into multiple feature clusters (feat-clusters) using agglomerative hierarchical clustering (21). FPs are then aggregated into 2D grids by the linear assignment algorithm LAPJV (22) based on the embedding coordinates to form spatially-correlated Fmaps, and the feat-clusters guide feature assignment into split channels. The proposed AggMap can be regarded as a jigsaw puzzle solver (23) because it solves jigsaw puzzles of unordered FPs based on their intrinsic similarities and topological structures. We also constructed a new multi-channel CNN architecture, AggMapNet (Figure 1B), with two explainable modules (Shapley-explainer and Simply-explainer) for enhanced and explainable learning of BioHULM data from AggMap Fmaps. The AggMap/AggMapNet open-source codes are at https://github.com/shenwanxiang/bidd-aggmap. The feature restructuring capability of AggMap was evaluated by a proof-of-concept (POC) experiment on MNIST (24). Interestingly, AggMap could almost completely reconstruct the original images from random permutations based on their intrinsic correlations (Figure 1C), and the reconstruction ability of AggMap is enhanced when it is fitted on randomized data of higher sample size. AggMap's ability to rearrange FPs improves the learning of randomized data. The usefulness of AggMap for learning BioHULM data was evaluated by several tests. First, AggMap multi-channel Fmaps show notable improvements and better robustness than single-channel Fmaps when several datasets are learned by AggMapNet. Secondly, AggMap outperforms existing 2D feature engineering methods such as Lyu-reshape (12) and Bazgir-REFINED (14) on a multi-task RNA-seq-based pan-cancer classification. On a cell-cycle dataset, AggMap can easily pick out stage-specific genes by aggregating and grouping the FPs. Thirdly, multi-channel AggMapNet outperformed six ML models, namely k-nearest neighbors (kNN), L2-regularized multinomial Logistic Regression (LGR), Random Forest (RF), Rotation Forest (RotF), XGBoost (XGB) and LightGBM (LGB), on most of the 18 low-sample transcriptomic benchmark datasets. Lastly, based on the Simply-explainer developed in AggMapNet, we further explored important biomarkers for COVID-19 detection and severity prediction.
The identified COVID-19-relevant biomarkers are highly consistent with literature-reported findings or biological mechanisms. These results demonstrate that the feature representation ability of CNN models can be enhanced by the unsupervised feature aggregation and multi-channel operations in AggMap, and that the developed AggMap/AggMapNet pipeline is well suited for DL of BioHULM data and key biomarker discovery (Figure 1D). Humans are capable of logically restoring broken fragmented objects, for example when solving jigsaw puzzles or restoring cultural property, as illustrated in Figure 2A. This ability arises from pre-learned prior knowledge used to connect and combine the fragments based on their correlations and edge connections; this knowledge is learned through various fragmentation-restoration processes. However, we are unable to reconstruct an image whose pixels have been randomly permuted (e.g. from image "a" to image "b" in Figure 2B), despite our ability to restore the image from larger fragments (Figure 2A), because the original information of the image from "b" to "a" has been lost completely. Nevertheless, we may restructure the image from "a" to "c" based on the similarities of the pixels (feature points, FPs) in "a". The new image "c" is much more structured than image "a", with some fragments very close to those of the original image for various patterns such as flowers, trunks, and leaves. The proposed AggMap was designed to Aggregate and Map unordered FPs into structured feature maps (Fmaps) by imitating this human assembly ability (solving jigsaw puzzles) in a self-supervised way. This restructuring process enables the mapping of unordered FPs into structured patterns for more effective deep learning (DL). To restructure unordered FPs into structured Fmaps, the self-supervised AggMap needs a metric to measure the similarities between the FPs, an approach to embed the FPs, and an algorithm to assign the embedded FPs to a regular grid (i.e. the position map of the FPs). In AggMap, these tasks are performed by the correlation metric, the manifold-based UMAP (20,25) approach, and the linear assignment Jonker-Volgenant (J-V) (22) algorithm, respectively. UMAP was initially developed for dimensionality reduction by embedding samples in a low-dimensional space. It can effectively aggregate similar FPs while preserving their relative proximities in both local and global data structure (20,25), leading to state-of-the-art performance for dimensionality reduction on real-world data (20,25). UMAP is employed in AggMap by default for embedding FPs, instead of samples, into 2D space. There are nine steps in the fitting of AggMap, as indicated by the flowchart (Figure 3) using the restructuring of the randomized MNIST FPs (MNIST with randomly permuted pixels) as an example. In Step 1, given input tabular data A with a shape of (M×N), where M and N are the numbers of samples and features, respectively, AggMap measures the pairwise distances of the FPs by the correlation distance to generate the distance matrix B (N×N). AggMap then uses the UMAP (20) algorithm to embed the FPs into 2D space in Steps 3-7 based on the calculated B. For the pixel-permuted MNIST training data, M is 60 000 and N is 784 (28 × 28), where the N pixels are in arbitrary order. The pairwise correlation distance is defined by

d_{corr}(x_i, x_j) = 1 - r(x_i, x_j)    (1)

where r(x_i, x_j) is the Pearson's r for the given FPs x_i and x_j.
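As a concrete illustration of Step 1, the following minimal Python sketch (our assumption, not the released AggMap code) computes the N × N correlation-distance matrix B from an M × N input table using scipy, whose "correlation" metric is exactly 1 - Pearson's r:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
A = rng.random((1000, 784))                       # stand-in M x N input table
# Step 1: pairwise correlation distance between feature points (columns)
B = squareform(pdist(A.T, metric="correlation"))  # N x N matrix, d_corr = 1 - r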
Step 2 is to conduct hierarchical clustering of the FPs to generate the clusters C based on the calculated B; complete linkage is used, and the default number of clusters is 5. This clustering operation splits the FPs into different groups (clusters), and more clusters produce more fine-grained separations. Each cluster is separately embedded into an individual Fmap channel for feature group-specific or feature-selective learning by a CNN classifier. Because multi-channel colour images contain more information than grayscale images, multi-channel AggMap Fmaps carry more enriched and distinguishable patterns. To visualize the multi-channel Fmaps in the AggMap tool, the FPs of each channel are coloured differently, and the brightness of the colour corresponds to the FP value. The optimal number of FP clusters is a hyperparameter (described in the AggMap Hyperparameters section). Step 3 is the first phase of UMAP graph construction (25); however, unlike default UMAP, which builds the weighted topological k-neighbour graph using Euclidean distance, AggMap builds the weighted graph D by an exponential probability distribution over the correlation distances B:

\mu_{i|j} = \exp\left(\frac{-(d_{corr}(x_i, x_j) - \rho_i)}{\sigma_i}\right)    (2)

where \mu_{i|j} is the weighted adjacency matrix of graph D and \rho_i represents the distance from the i-th FP to its first nearest neighbor; this ensures the local connectivity of the manifold. \sigma_i is a (smoothed) normalization factor for the metric d_corr local to the point x_i. Following UMAP Algorithm 3 (25), \sigma_i can be estimated by binary search given the hyperparameter k, the number of nearest neighbors (i.e. n_neighbors):

\sum_{j=1}^{k} \exp\left(\frac{-(d_{corr}(x_i, x_j) - \rho_i)}{\sigma_i}\right) = \log_2(k)    (3)

The adjacency matrix \mu_{i|j} has to satisfy the symmetry condition according to the UMAP algorithm (25):

\mu_{ij} = \mu_{i|j} + \mu_{j|i} - \mu_{i|j}\,\mu_{j|i}    (4)

The graph D is thus an undirected weighted graph whose adjacency matrix is given by \mu_{ij}; this construction provides an appropriate fuzzy topological representation of the data (20). Steps 4 and 5 construct a weighted graph F in the low-dimensional space (i.e. the 2D embedding space). To initialize the 2D coordinates of the FPs in Step 4, AggMap uses the spectral layout to initialize the embedding E; randomized initialization is unsuitable for preserving global data structure in both t-SNE and UMAP (26), whereas the default spectral layout gives faster convergence and greater stability within the algorithm (25). Specifically, AggMap utilizes the correlation distances B to initialize the embedding E. To ensure a more uniform initialization embedding, AggMap first converts the distance matrix d_corr(i, j) of B into an exponential affinity matrix d(i, j).

Figure 3. Flowchart of the self-supervised AggMap fitting process; the dynamic process of MNIST restructuring from randomly permuted images over 500 epochs is available in Video MNIST.mp4. The input is the M × N matrix, where M is the number of samples and N is the number of FPs with arbitrary order, i.e. the randomly permuted MNIST pixels across the whole training set of the MNIST data (M = 60 000, N = 784). Steps 1-9 are the steps of the fitting stage and Step 10 is the transform stage. Steps 3-7 are the basic ideas of UMAP 2D embedding. One sample, the handwritten number "9", is used as a tracker to illustrate how it is restructured. The blank dots in the objects E, D, F, F', G, H and I are the pixel values of the number "9". The colors in the objects G', H' and I' are the same five colors (clusters) as shown in object C, and the five colors stand for the five clusters of the hierarchical clustering C. The outputs are the single-channel or multi-channel Fmaps.
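Before the detailed derivation of Steps 4-6 continues below, the embedding phase (Steps 3-7) can be sketched compactly with the umap-learn package on the precomputed matrix B from the earlier sketch; note this uses stock UMAP, whereas AggMap's internal variant modifies the graph construction as described above:

import umap  # the umap-learn package

embedder = umap.UMAP(
    n_neighbors=15,       # the "n_neighbors" hyperparameter (Equation 3)
    min_dist=0.1,         # the "min_dist" hyperparameter (Equation 8)
    n_epochs=500,         # CE-loss optimization epochs (Equation 9)
    init="spectral",      # spectral-layout initialization (Step 4)
    metric="precomputed", # feed the correlation-distance matrix B directly
)
G = embedder.fit_transform(B)  # N x 2 coordinates of the feature points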
Subsequently, the matrix d(i, j) is used for the spectral embedding by the Laplacian Eigenmaps (LE) algorithm. LE finds a low-dimensional representation of the data using a spectral decomposition of the graph Laplacian (UMAP Algorithm 4 (25,27)), yielding the 2D embedding results E(x, y), where x and y are the coordinates of the N feature points. The pairwise Euclidean distance d(i, j) of the FPs is then calculated from the 2D coordinates in E, and the weighted graph F in the low-dimensional space is constructed from d(i, j) according to the UMAP definition (25):

v_{ij} = \left(1 + a\,d(i, j)^{2b}\right)^{-1}    (8)

where v_{ij} is the weight matrix of the low-dimensional neighbour graph F, and a and b are parameters estimated by non-linear least-squares fitting to the piecewise function controlled by the min_dist hyperparameter. Step 6 is the graph layout optimization of F. Since there are two weighted graphs, D and F, AggMap optimizes the layout of graph F into F' by minimizing the error between the two topological representations D and F. The graph layout for F (F') is force-directed; the forces are derived from gradients optimizing the edge-wise cross-entropy in formula (9) between the weighted graph \mu (i.e. D) and the equivalent weighted graph v (i.e. F):

CE(\mu, v) = \sum_{i,j}\left[\mu_{ij}\log\left(\frac{\mu_{ij}}{v_{ij}}\right) + (1 - \mu_{ij})\log\left(\frac{1 - \mu_{ij}}{1 - v_{ij}}\right)\right]    (9)

where CE(\mu, v) is the total cross-entropy loss over all the edge existence probabilities between the weighted graphs \mu and v. Minimizing CE(\mu, v) lets the low-dimensional representation settle into a state that relatively accurately represents the overall topology of the source data. During the optimization, the similarity between D and F can also be measured by the Pearson correlation coefficient (PCC):

PCC = \frac{Cov(\mu_{vector}, v_{vector})}{\delta_{\mu_{vector}}\,\delta_{v_{vector}}}    (10)

where \mu_{vector} and v_{vector} are the vector forms of the weighted graphs \mu and v, Cov(\mu_{vector}, v_{vector}) is their covariance, and \delta_{\mu_{vector}} and \delta_{v_{vector}} are the standard deviations of the two weighted graphs. To minimize CE(\mu, v), the derivative of CE(\mu, v) is used to update the coordinates of the low-dimensional data points, optimizing the projection space until convergence:

E^{(i+1)} = E^{(i)} - lr \cdot \frac{\partial CE(\mu, v^{(i)})}{\partial E^{(i)}}    (11)

where lr is the learning rate and v^{(i)} is the weight matrix of the topological graph F, updated after each i-th epoch. The stochastic gradient descent (SGD) algorithm is used for its faster convergence and lower memory consumption, since the gradients are computed for a subset of the data set. The optimized weighted graph F' is generated upon convergence of the loss. Step 7 generates the 2D embedding results G from the layout-optimized graph F'. Meanwhile, Step 8 groups G into G' by the cluster groups defined in Step 2; each colour in G' is one cluster group, as shown in C. Once AggMap has generated G (G'), Step 9 assigns the 2D embedded FPs onto the 2D regular grid H (H') by a linear assignment algorithm. The J-V algorithm (22) is used for the assignment, which preserves the 2D embedded neighbourhood relationships while the FPs are assigned to grid points. Specifically, AggMap calculates the pairwise squared Euclidean distances between the FPs in the 2D embedding and the points of a 2D regular mesh grid as the cost matrix (CM):

CM_{ij} = (x^{embd}_i - x^{grid}_j)^2 + (y^{embd}_i - y^{grid}_j)^2    (12)

where (x^{embd}_i, y^{embd}_i) are the 2D coordinates of the i-th FP and (x^{grid}_j, y^{grid}_j) are the 2D coordinates of the j-th point of the mesh grid. The squared Euclidean distance matrix is an N×N matrix (N is the number of FPs), which serves as the cost matrix (CM) for solving the linear assignment problem (LAP) with the J-V algorithm. The J-V algorithm finds an optimal solution to the global nearest-neighbour assignment problem by finding the set of assignments that minimizes the total cost (i.e. the sum over the CM) of the assignments.
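A minimal sketch of this grid-assignment step: scipy's linear_sum_assignment implements a modified Jonker-Volgenant algorithm and can stand in for the LAPJV solver used by AggMap (the grid normalization below is our assumption):

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_to_grid(emb, grid_w, grid_h):
    """Assign N embedded FPs (emb: N x 2) to an N-cell regular grid."""
    gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
    lo, hi = emb.min(0), emb.max(0)            # match the embedding's bounding box
    grid = lo + grid / np.array([grid_w - 1, grid_h - 1]) * (hi - lo)
    # Squared Euclidean cost matrix (Equation 12)
    cm = ((emb[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    fp_idx, cell_idx = linear_sum_assignment(cm)   # minimize the total assignment cost
    return cell_idx                                # grid cell assigned to each FP

cells = assign_to_grid(G, 28, 28)  # e.g. the 784 embedded FPs (G, from the sketch above)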
The regular grid H (H') is the assignment result: in the 2D grid, each FP has an optimized location, and its neighbours are highly correlated FPs. Based on the regular grid H (H'), the input M×N FPs can be transformed into a standard 4D tensor with a shape of (M, w, h, c), where M, w, h and c are the number of input samples and the width, height and number of channels of the Fmaps, respectively. The minimum value (or zero) of the FPs is used to pad the Fmaps in the case of w × h > N. The output multi-channel Fmap for one sample is shown in I'; each channel contains one cluster group in C. AggMap is an unsupervised learning method because no label is required during feature restructuring. AggMap can be considered a Fmap jigsaw puzzle solver (23) because it solves jigsaw puzzles of unordered FPs based on their intrinsic similarities and topological structures. It can also be regarded as a representation learning tool because it re-presents a 1D vector as an image-like 3D tensor by self-supervised learning; it employs UMAP to restructure unordered FPs by learning their intrinsic structures. The proxy task is to minimize the differences between the two weighted topological graphs built in the input data space and the 2D embedding space. Thus, AggMap can expose the overall topology of the FPs to generate structured Fmaps based on the intrinsic structure of the FPs. AggMap is divided into the three stages of initialization, fitting, and transformation, which is useful for learning low-sample labeled data because higher-sample unlabeled data may be used for training AggMap; the intrinsic relationships between FPs may be better exposed by higher-sample data. The hyperparameters of each AggMap stage are provided in Supplementary Table S1. In the initialization stage, the pairwise distance matrix of the FPs is typically computed from higher-sample unlabeled data (28). The fitting stage controls how AggMap rearranges the FPs. The hyperparameters of this stage are very important and include non-UMAP-mediated and UMAP-mediated parameters. For the non-UMAP-mediated parameters, "cluster_channels" is the number of clusters, i.e. channels, of the multi-channel Fmaps (a greater number of clusters leads to a finer separation of FPs), and "var_thr" is the threshold for filtering out lower-variance FPs (the default is -1, i.e. no FPs are filtered out). For the UMAP-mediated parameters, "n_neighbors" is the number k of nearest neighbours for estimating the manifold structure of the data (Equation 3), "min_dist" is the minimum distance between two FPs that can be explicitly separated and shown in the low-dimensional representation (Equation 8), and the number of epochs ("n_epochs") and the learning rate ("lr") control the minimization of the CE loss (Equation 9) that optimizes the layout of the low-dimensional graph. In the transformation stage, the inputs are 1D vectors and the outputs are structured 3D tensors. The important parameter here is the data scaling method. AggMap supports two data scaling methods, "minmax" and "standard" scaling. In minmax scaling, the data are transformed into the 0-1 range based on the minimal and maximal values of the data.
In standard scaling (also called z-score normalization), each feature is standardized by removing the mean and scaling to unit variance. We specifically developed a simple yet efficient CNN, AggMapNet (Figure 1B), dedicated to learning the structured AggMap Fmaps. The multi-channel Fmaps (each channel in the unstacked format of a single-channel Fmap) have two major advantages for omics data learning. Firstly, the separation of the FPs into several groups (channels) enables feature group-specific or feature-selective learning; the division of images into multiple patches similarly benefits the vision transformer (ViT) model in image recognition tasks (29). Secondly, multi-channel Fmaps provide more enriched information and more distinguishable patterns than single-channel Fmaps. However, AggMap multi-channel Fmaps may potentially break the local coherence and connectivity among the boundary FPs between two clusters, leading to information loss at the boundary. To overcome this potential problem, AggMapNet uses 1 × 1 convolutional kernels in its inception layers for cross-channel learning, which creates a projection of the stack of multi-channel Fmaps and avoids the information loss from local boundary-breaking of the Fmaps. AggMapNet consists of three parts: the input AggMap Fmaps, the CNN-based feature extraction layers, and pyramid fully connected (FC) layers (Supplementary Figure S1). The first convolutional layer has a larger kernel number to increase the data dimension. The max-pooling layer (kernel size = 3) has a stride of 2 for a more aggressive reduction of the spatial resolution, lowering the computational cost. Choosing the right kernel size for the convolution operation is difficult because different tasks may favor different kernel sizes. For improved performance, AggMapNet adopts the naïve inception layer of GoogLeNet (30) (a top performer of the ILSVRC-2014 classification task). The inception layer in AggMapNet consists of three small parallel kernels (sizes 1 × 1, 3 × 3 and 5 × 5) for enhanced local perception; the 1 × 1 convolution serves cross-channel learning. Subsequently, a global max-pooling (GMP) layer is introduced before a dense layer instead of a flatten layer, which significantly decreases the number of parameters; it is followed by dense layers for improved nonlinear transformation capability. Overall, AggMapNet has a relatively small number of trainable parameters (∼0.3 million) but a complex topological structure of two inception blocks. The hyperparameters (HPs) and their default settings are in Supplementary Table S2; they include the network architecture parameters (NAPs) and the training-control parameters (TCPs). The NAPs are the kernel size of the first convolutional layer (conv1_kernel_size), the number of dense layers and their units (dense_layers), the dropout rate in the dense layers (dropout), the number of inception layers (n_inception), and batch normalization after the convolution layers. AggMapNet uses a larger kernel size for the first convolutional layer to enable more expressive power and global perception (28,31). To decrease the number of trainable parameters, the default AggMapNet adopts 2 inception layers and 1 dense layer, and no dropout is used for the dense layer. The TCPs include the number of epochs, the learning rate (lr), and the batch size. The cross-entropy loss is used for both multi-task and binary tasks. During training, AggMapNet has two important parameters for early stopping, the monitor and the patience: the monitor is the metric performance on the validation set, and the patience is the number of epochs with no improvement of the monitor after which training is stopped.
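To make the architecture description concrete, here is a much-simplified Keras sketch of an AggMapNet-style network with naïve inception blocks and a GMP head, together with the early-stopping callback; the layer sizes are illustrative assumptions, and the published defaults are in Supplementary Tables S2/S3:

import tensorflow as tf
from tensorflow.keras import layers

def naive_inception(x, filters):
    # Parallel 1x1 / 3x3 / 5x5 kernels; the 1x1 path also does cross-channel learning
    p1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    p3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    p5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return layers.Concatenate()([p1, p3, p5])

def build_model(w, h, c, n_classes):
    inp = layers.Input((w, h, c))
    x = layers.Conv2D(48, 13, padding="same", activation="relu")(inp)  # large first kernel
    x = layers.MaxPool2D(pool_size=3, strides=2)(x)   # aggressive spatial reduction
    x = naive_inception(x, 32)
    x = naive_inception(x, 64)                        # two inception blocks (the default)
    x = layers.GlobalMaxPool2D()(x)                   # GMP instead of a flatten layer
    x = layers.Dense(128, activation="relu")(x)
    return tf.keras.Model(inp, layers.Dense(n_classes, activation="softmax")(x))

model = build_model(28, 28, 5, 10)
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
stopper = tf.keras.callbacks.EarlyStopping(           # the "monitor"/"patience" parameters
    monitor="val_accuracy", patience=30, restore_best_weights=True)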
In this study, the AggMap feature restructuring object was pre-fit with its default parameters, but different channel setups in AggMap were explored for enhanced AggMapNet performance. The optimized HPs of AggMap and AggMapNet for each dataset are in Supplementary Table S3. The model parameters were determined on the first inner fold of the nested cross-validations and then used for all the outer folds. The early-stopping method was used to avoid overfitting during the nested validations. To limit the variability of models trained on low-sample data, the cross-validations were repeated at least 5 times with different random seeds. Interpretability of DL and ML models is important for informed decisions, mechanism investigations, biomarker discoveries, and model assessments (9,10). However, it is challenging to identify important features effectively and reproducibly from low-sample BioHULM data. Perturbation-based model interpretation is an established post hoc feature attribution method for interpreting black-box models (7,32); it interprets predictions by altering or removing parts of the input features to assess their influence on the prediction. Kernel Shapley Additive exPlanations (SHAP) is such a model interpretation method, applicable to any black-box model (33). Kernel SHAP requires a background data set for training, and feature absence is simulated by substituting feature values with prevalent values of the training data. Feature importance in the kernel Shapley value is measured by comparing the prediction value obtained with the feature and without it (33). AggMapNet integrates the kernel SHAP method (i.e. the Shapley-explainer) as one of its model explainers. However, AggMapNet also complements it with a new model explainer, the Simply-explainer. The kernel Shapley method rests on a solid theoretical foundation in game theory and is a popular explainer for local model explanation, calculating the magnitude of feature attributions. Nonetheless, it has several problems in the measurement of feature importance (34). One problem is its computational complexity in global explanation over many samples (34). Moreover, since the kernel SHAP method considers the amount of contribution each feature makes to the model prediction value rather than to the true value, it may not fully capture the global relationship between the features and the true outcomes (35). The Simply-explainer was developed to provide an additional means of AggMapNet model explanation. The Simply-explainer aims to be faster in calculating the global feature importance of high-dimensional omics features and to consider the relationship of the features with the true labels. The perturbation-based interpretation method Simply-explainer can be used for both local (individual sample) and global (all samples of a dataset) (8) interpretation tasks. The feature importance (FI) score S in the Simply-explainer is determined straightforwardly by replacing each FP with a background value, without retraining the model:
Input: trained model f, feature matrix X, true label vector y, error measure L(y, f). To estimate this error L, the log loss (cross-entropy) is used for classification models and the MSE loss is used for regression models.
a) Estimate the original model error e_orig = L(y, f(X)).
b) For each feature i = 1, ..., k: generate the feature matrix X_pert by replacing feature i with the minimal value (background value) in the data X; this breaks the association between feature i and the true outcome y. Estimate the error e_pert = L(y, f(X_pert)) based on the predictions for the perturbed data, and calculate the perturbation feature importance score S_i = e_pert - e_orig.
c) Sort the features by descending feature importance score S.
d) Optional corrections: apply standard scaling or a logarithm transformation to S.
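The algorithm above maps directly to a short loop. The sketch below is our simplified tabular version (the released implementation in the bidd-aggmap repository operates on the 4D Fmap tensors; here an sklearn-style classifier with predict_proba is assumed):

import numpy as np
from sklearn.metrics import log_loss

def simply_explain(model, X, y):
    """Global feature importance by background-value perturbation."""
    background = X.min(axis=0)                 # per-feature minimal (background) value
    e_orig = log_loss(y, model.predict_proba(X))
    scores = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        X_pert = X.copy()
        X_pert[:, i] = background[i]           # break the feature i <-> label association
        scores[i] = log_loss(y, model.predict_proba(X_pert)) - e_orig
    return scores                              # sort descending for a feature ranking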
The AggMapNet inputs are 4D tensors (batch size, width, height, channels) in multiple channels. The perturbation occurs only on the meaningful FPs, and the perturbation value is a background value (e.g. the zero value of a blank pixel) of the input Fmaps. Note that for the local FI, the model error e is calculated as the log loss of an individual sample's prediction values versus its true labels across the labels (classes), whereas for the global FI, the model error e is calculated as the log loss of the prediction values versus the true labels of one class over many samples. That is, in multi-task settings, the Simply-explainer can calculate the FI for each class. The global FI, based on all samples of a dataset, provides a highly compressed, global insight into the behaviour between the features and the true outcomes for the whole dataset; in contrast, the local FI reveals the features that are important for an individual sample (8). The correction of the FI in the Simply-explainer includes logarithm transformation and standard scaling of the FI values to reveal the important FPs; after scaling, the FPs with FI score S > 0 are considered notable FPs in the saliency-map of the Simply-explainer. The local or global feature importance (FI) of a Fmap can be presented as a 2D saliency-map revealing the important features (Figure 1C); in this study, the revealed important features are presented by saliency-maps (36). The Pearson's correlation coefficient (PCC) and the structural similarity index (SSIM) (37) were used to measure the performance of the two explainers on local explanation of the same MNIST recognition models; the PCC and SSIM were calculated between the original images and the explanation saliency-maps. The full code for AggMap feature restructuring, AggMapNet model learning, and AggMapNet model explanation with both the Shapley-explainer and the Simply-explainer is in Supplementary Figure S2; AggMap/AggMapNet was coded in Python 3+, and AggMapNet was built with the TensorFlow 2.x framework. The datasets used in this study are listed in Table 1. The proof-of-concept (POC) MNIST and F-MNIST benchmark datasets are 28 × 28 grayscale images, with training and test sets of 60 000 and 10 000 images, respectively (24,38). An original image in MNIST or F-MNIST is named Org1, where "1" refers to the number of channels being 1 (grayscale). To randomize Org1, each image was first flattened into a vector of 784 feature points (FPs), then shuffled with a fixed random seed (randomly permuted), and finally refolded into a newly-shuffled 28 × 28 image, named OrgRP1. AggMap restructuring was conducted on the shuffled FPs.
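The Org1-to-OrgRP1 randomization amounts to one shared pixel permutation; a minimal numpy sketch (the seed value here is arbitrary):

import numpy as np

def randomize(images, seed=123):
    """Flatten 28x28 images to 784 FPs, permute with a fixed seed, refold."""
    perm = np.random.RandomState(seed).permutation(784)  # one permutation for all images
    flat = images.reshape(len(images), 784)
    return flat[:, perm].reshape(len(images), 28, 28)    # the OrgRP1 images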
The image reconstruction capability of AggMap was tested under the criterion that the reconstructed image patterns should be independent of how the FPs are shuffled. The reconstructed images are named RPAgg1 and RPAgg5 for the 1-channel and 5-channel AggMap Fmaps, respectively. The structural similarity index (SSIM) (37) between an original image and its restructured image was used to measure the feature restructuring ability of AggMap on MNIST and F-MNIST. The average accuracy was used to measure the performance of the AggMapNet models trained on the multi-task MNIST and F-MNIST. The CCTD-U (39) dataset is cell-cycle transcriptome data of U2OS cells, consisting of the expression levels of 5162 genes at 5 cell cycle stages (G1, G1/S, S, G2, M); this dataset was transformed using z-score standard scaling. The multi-task pan-cancer transcriptomic benchmark dataset TCGA-T of 33 cancers is from normalized-level3 RNA-Seq expression studies of normal and tumor conditions, compiled by Lyu and Haque (12). TCGA-T consists of 10 446 samples (45-1212 samples per cancer type, average 317). We used the same data and pre-processing method provided by these studies (https://github.com/HHHit/DL-based-Tumor-Classification). Each sample contains 10 381 normalized gene expression read counts, transformed using y = log2(x + 1) (12). For training and testing on TCGA-T, the same stratified 10-fold cross-validation metrics of Lyu and Haque (12) were used. The overall average accuracy and the average accuracy of each class were calculated from the test-set performances using the scikit-learn (40) package, with the weighted average used for the average performance calculation. Two groups of low-sample binary-task cancer transcriptomic benchmark datasets, TCGA-S and TCGA-G, include 10 cancer-stage and 8 cancer-grade datasets, respectively, from reproducible RNA-seq analyses compiled by Smith et al. (5). The TCGA-S and TCGA-G datasets consist of 249-1154 and 179-554 samples, respectively; each sample contains 17 970 genes annotated in GO under the "biological process" or "molecular function" categories, with Z-score standardization (5). For training and testing on TCGA-S and TCGA-G, the same 5-fold nested cross-validation ROC-AUC metrics of Smith et al. (5) were used to rigorously assess the out-of-sample performance of the models. Two COVID-19 proteomic/metabolomic datasets, COV-D and COV-S, are from COVID-19 detection (1) and severity determination (2) investigations, respectively. COV-D includes 363 samples (211 COVID-positives and 151 negatives by RT-PCR) from 3 labs. The noisy test sets for the four Fmaps of Org1, OrgRP1, RPAgg1 and RPAgg5 on MNIST were generated by the following steps. First, Gaussian noise of varying levels (standard deviation 0.00 to 0.72 with a step of 0.12) was added to the MNIST/F-MNIST test-set images (images of higher noise levels become harder to recognize, Supplementary Figure S3); i.e. the noise was added to the Org1 test images only (the Fmap values were divided by 255 to scale into 0-1), leading to a noise-added dataset Org1-N. The Org1-N Fmaps were then randomly permuted into OrgRP1-N using the same random seed as for the OrgRP1 generation. Subsequently, the OrgRP1-N Fmaps were transformed into the noisy sets RPAgg1-N and RPAgg5-N by the pre-fit AggMap (Supplementary Figure S4). The AggMapNet models trained on the Org1, OrgRP1, RPAgg1 and RPAgg5 Fmaps were evaluated on the derived noisy test Fmaps Org1-N, OrgRP1-N, RPAgg1-N and RPAgg5-N, respectively.
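A minimal sketch of this noise-injection protocol (clipping the noisy images back to the 0-1 range is our assumption):

import numpy as np

def add_noise(images01, sigma, seed=0):
    """Add Gaussian noise of a given standard deviation to 0-1 scaled images."""
    noisy = images01 + np.random.RandomState(seed).normal(0.0, sigma, images01.shape)
    return np.clip(noisy, 0.0, 1.0)

levels = np.arange(0.0, 0.73, 0.12)  # standard deviations 0.00 to 0.72, step 0.12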
On the multi-task TCGA-T dataset, noise was added to the test set of each fold of the 10-fold cross-validation for testing AggMapNet model performance. Specifically, various levels of Gaussian noise (standard deviation 0.00-0.48 with a step of 0.08, Supplementary Figure S5) were added to the test set of each fold of the TCGA-T dataset (scaled to 0-1), and the unstructured noise-added data were then transformed into 1-channel and 5-channel noise-added but structured Fmaps by the pre-fit AggMap. We compared AggMapNet with several ML models on 18 transcriptomic datasets (10 TCGA-S and 8 TCGA-G, all binary tasks). Smith et al. (5) have benchmarked the performances of three standard ML models (RF, LGR and kNN), with or without feature embedding, on these datasets, but more advanced and efficient tree-based ensemble ML models such as RotF (41), XGB and LGB had not been benchmarked; these were additionally compared here. Feature selection for the ML models was based on the generalized absolute fold change (FC) (44) between the two classes (binary labels) using optimized cut-offs (for example, at least a 0.5× fold change across any two conditions). The cut-offs of the FC were determined by the performance of the nested 5-fold cross-validations; therefore, the number of selected features differs in each of the outer folds of the five-fold cross-validations. All these ML models were evaluated by the average performance over the outer 5-fold cross-validation. An exhaustive grid-search strategy was used to find the best hyperparameters in the inner 5-fold cross-validation. Specifically, for each outer fold, a nested inner 5-fold cross-validation was used to select the optimal hyperparameters, which were then used to build the model for that outer fold. We optimized the important HPs such as "n_estimators", "num_leaves" and "max_depth" for the tree-based models using the "GridSearchCV" module of the scikit-learn package (40). Nested cross-validation is resistant to hyperparameter overfitting, as the model is evaluated on data completely held out from the process of selecting the optimal hyperparameters (5). To make fair comparisons, the AggMapNet models were evaluated with the same data-split random seeds, data scaling methods, and model evaluation metrics as the published ML models. To make AggMapNet more user-friendly, in the nested cross-validations only one hyperparameter of AggMapNet (the number of epochs) was optimized, while all other HPs of AggMap and AggMapNet were kept at their defaults. AggMap was pre-fit on the unlabeled gene expression data of all 18 datasets, and the multi-channel Fmaps were generated with the default parameters (C = 5).
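For reference, the nested cross-validation protocol can be sketched with scikit-learn as follows (the RF estimator, data and parameter grid here are placeholders, not the benchmarked settings):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=50, random_state=0)  # stand-in data
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # hyperparameter selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # held-out evaluation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 500], "max_depth": [None, 10]},
    scoring="roc_auc", cv=inner)
scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
print(scores.mean())  # average outer-fold ROC-AUC, as reported for the benchmarks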
To explore the restructuring ability of AggMap in exposing unordered data, we randomly permuted the MNIST (24) data by shuffling the order of the pixels (Figure 4A). Randomly permuted MNIST represents unordered data with prior intrinsic patterns. We evaluated the extent to which AggMap Fmaps reconstruct MNIST from the randomly permuted data (OrgRP1). We pre-fit AggMap using fractions (full, 1/2, 1/5, 1/10, 1/100 and 1/1000) of OrgRP1-tr (the 60K randomly permuted MNIST training set) and employed it to transform the Fmaps of OrgRP1-tr and OrgRP1-ts (the 10K randomly permuted MNIST/F-MNIST test set). We found that AggMap can reconstruct the randomly permuted MNIST data back to the original data, and that its reconstruction ability depends on the samples used for the pre-fitting. Pre-fitted with the full MNIST OrgRP1-tr, AggMap restored MNIST well, down to local patches. Pre-fitted with 1/2, 1/5 or 1/10 of the MNIST OrgRP1-tr, AggMap roughly restores MNIST with an increasing level of deformation, tilt or flip as the training fraction decreases (Figure 4B). Pre-fitted with 1/100 or 1/1000 of the F-MNIST OrgRP1-tr, AggMap cannot restore F-MNIST but still generates distorted curve-shaped patterns (Supplementary Figure S6A). Pre-fitted with the full F-MNIST OrgRP1-tr, AggMap cannot restore the original F-MNIST but aggregates the original local patches together (Supplementary Figure S6B). The dynamic processes of AggMap restructuring of the randomized MNIST and F-MNIST FPs are shown in Video MNIST.mp4 and Video F-MNIST.mp4, respectively; they show that the restoration ability of AggMap is linked to the reduction of the cross-entropy loss defined in Equation 9. With an increasing number of iterations (epochs), the generated Fmaps become more structured and eventually form stable patterns when the loss converges. AggMap can roughly restore the randomized MNIST FPs into the original images, but not the randomized F-MNIST. MNIST is curve-shaped data whose correlations between FPs are not discrete but more uniformly distributed, which conforms to the UMAP assumption of uniformly distributed data (25). We compared the CE loss and PCC of MNIST with those of F-MNIST during the graph layout optimization stage of AggMap feature restructuring (Supplementary Figure S7A). MNIST has a lower loss and a higher PCC value, indicating that the 2D embedded distribution of MNIST more closely resembles the topological structure of the original data. The final 2D embedding of the MNIST FPs is also more uniformly distributed than that of the F-MNIST FPs (Supplementary Figure S7B). Therefore, AggMap can reconstruct randomized MNIST partly because the manifold structure of the FPs is not totally changed by the random permutation, and this manifold structure can be approximated by the weighted graph in the low dimension. The randomized F-MNIST was restructured into more compact patterns, with some local patches restored to the original patches; thus, AggMap can restructure randomized F-MNIST into a highly structured form even though it cannot fully restore the original images. The split-channel operations based on the cluster groups enable the greyscale images to be presented as multi-channel coloured images (Figure 4C, RPAgg5, each colour representing one channel). Tracking these coloured channels back onto the original MNIST images shows that they occupy the same relative positions (Figure 4C, RPAgg5-tkb), indicating that the reconstruction maintains the same local structure as the original images. Thus, AggMap feature restructuring can expose curve-shaped intrinsic patterns (the MNIST restructuring example) and the local patches of packed intrinsic patterns (the F-MNIST restructuring example). AggMap's ability to expose local intrinsic patterns is very useful for CNN-based learning, because CNN classifiers have been shown to rely strongly on local texture cues (45). We evaluated to what extent AggMap feature restructuring improves AggMapNet classification of randomized MNIST, which sheds light on the ability to learn unordered data. AggMapNet was trained on the training set of each of the four Fmaps separately (Org1, OrgRP1, RPAgg1 and RPAgg5; Figure 4A, C), validated on the corresponding validation set, and tested on the corresponding test set (10K).
The results show that the AggMap (RPAgg1, RPAgg5) transformation improved the MNIST classification performance of AggMapNet from 96.7% (without AggMap, OrgRP1 Fmaps) to 99.1-99.2%, close to the performance (99.5%) of the model trained on the original images (Figure 4D, E). Therefore, the AggMap feature restructuring enhanced DL of randomized (unordered) data. The results show that the high performance of CNNs in image-related tasks lies critically in an architecture that takes advantage of the local order of images, and that AggMap is a useful tool to restructure unordered data into locally-ordered data.

Figure 4. (A) In the training set, the 28 × 28 pixels were reshaped into 784 feature points (FPs) and randomly permuted into 784 unordered FPs (unordered image data), and then reshaped into the shuffled 28 × 28 images (the OrgRP1 images). The randomly permuted images OrgRP1 have completely destroyed the spatial correlation of the original images (Org1). AggMap was fit on the unordered image data and transformed the OrgRP1 images into RPAgg1 (channel = 1). The split-channel operations transform the greyscale images into multi-channel images based on the clustered groups (RPAgg5, channel = 5). (B) AggMap pre-fit with different numbers of randomly permuted images to reconstruct the MNIST images (RPAgg1). All (60K), 1/2, 1/5, 1/10, 1/100 and 1/1000 of the randomly permuted MNIST training set OrgRP1 were used for pre-fitting AggMap, which was then used for the reconstruction of the randomized MNIST test set. (C) The original, randomized, and restructured MNIST data. RPAgg5-tkb: the original images with the pixels divided into five groups according to the 5 channels of RPAgg5 and coloured in the same way as RPAgg5. (D) The historical validation accuracies of AggMapNet training on the four Fmaps. For training on the four Fmaps (Org1, OrgRP1, RPAgg1, RPAgg5), we stratified-sampled 10% of the data from the training set (60K samples) as the validation set, leading to 54 000 training samples and 6000 validation samples; the validation loss and accuracy were monitored during training, and an early-stopping strategy was used to prevent overfitting (storing only the model with the best performance on the validation set).

AggMap multi-channel Fmaps have an obvious advantage over single-channel Fmaps. Visualization of the AggMap multi-channel Fmaps of the cell-cycle CCTD-U (39) dataset at different cell replication phases indicated that multi-channel Fmaps can easily select stage-specific genes (Figure 5). The stage-specific genes of the five cell-cycle phases are easily aggregated into hot-zones in the single-channel Fmaps based on their correlations, while the multi-channel Fmaps further separate the phase-specific genes into different channels. Therefore, the multi-channel Fmaps facilitate group-specific feature learning or feature-selective learning by AggMapNet. Through the hierarchical clustering-guided channel splits, each cluster (group) of FPs may be separately embedded into a different Fmap channel, and more clusters enable more fine-grained separations of the FPs into groups. However, AggMap multi-channel Fmaps may potentially break the local coherence and connectivity among the boundary FPs between two clusters (e.g. Figure 4C, RPAgg5), leading to information loss at the boundary.
To overcome this potential problem, AggMapNet uses the 1 × 1 convolutional kernels in its inception layers for cross-channel learning, which creates a projection of the stack of multi-channel Fmaps and avoids the information loss from local boundary-breakage of the Fmaps. We tested the effect of the channel number on four representative datasets: MNIST, TCGA-S COAD, TCGA-G HNSC and COV-S (Figure 6). The performance of AggMapNet can be improved by increasing the number of channels, because more channels generate more fine-grained separations of the FP groups. However, if the number of channels increases beyond a certain extent, the model performance on some datasets decreases: a greater number of channels in the Fmaps means more trainable parameters in the model, which can lead to overfitting. Overall, multi-channel Fmaps help the CNN-based model learn complex data by separately learning feature subsets, and their representational richness often allows the capturing of nonlinear dependencies at multiple scales (17). Thus, multi-channel Fmaps notably improve the performance of AggMapNet, and the channel number is a hyperparameter that needs to be optimized. CNN models have been proved very vulnerable to attacks in the form of subtle perturbations to inputs that lead a model to predict incorrect outputs (46,47), and although CNNs perform better than or on par with humans on good-quality images, CNN performance is still much lower than human performance on distorted, noisy images (48). We therefore examined whether the AggMapNet models trained on AggMap multi-channel Fmaps show better robustness on noisy test data. These models were trained on the noise-free Org1, OrgRP1, RPAgg1 and RPAgg5 Fmaps and then evaluated on the corresponding noisy test sets (Org1-N, OrgRP1-N, RPAgg1-N, RPAgg5-N; Supplementary Figure S4) at different noise levels. As the noise level increases, the performance of all models deteriorates to varying degrees. Nevertheless, the models trained on the AggMap-transformed multi-channel Fmaps (RPAgg5) of both MNIST and F-MNIST showed better robustness to noise (Table 2, Supplementary Figure S8). For example, at the noise level of 0.36 on the MNIST test set, the classification accuracy of RPAgg5 was still maintained at 90%, whereas the corresponding accuracies of the single-channel Fmaps Org1, OrgRP1 and RPAgg1 were 77%, 56% and 75%, respectively. We performed the same experiments on the multi-task pan-cancer transcriptomic dataset TCGA-T (12): the model trained on the unordered data classified the noise-added pan-cancer transcriptomic test set with 95.1% and 13.2% accuracy at the zero and 0.48 noise levels, respectively (Table 2), while after the pre-fit AggMap transformed the data into 5-channel Fmaps, the model classified the noise-added test set with 96.4% and 53.6% accuracy at the zero and 0.48 noise levels, respectively. These results demonstrate that the multi-channel Fmaps enhanced AggMapNet learning of noisy/unordered data, and that the cluster-based channel-split feature of the AggMap transformation is crucial. Unsupervised AggMap operates in separate fitting and transformation stages, enabling transfer learning (Figure 7A). The fitting operation can be trained on higher-sample data and subsequently used for transforming low-sample data.
In the fitting stage, AggMap takes as input a tabular data matrix of size (n, p) to calculate the correlation distances of the feature points (FPs) for manifold 2D embedding and clustering, where n and p are the numbers of samples and FPs, respectively. In the transformation stage, 1D unordered FPs are input to AggMap for transformation into Fmaps, and a scaling method such as minmax or standard scaling is used for scaling the FPs. This setup is useful for learning low-sample data because a larger amount of unlabeled data can be used for fitting and generating the AggMap objects that transform low-sample 1D unordered data into structured Fmaps; the intrinsic relationships between FPs can be determined more accurately from higher-sample data and exposed by manifold-based learning. We fit AggMap using different numbers of MNIST training samples (full, 1/2, 1/5, 1/10, 1/100 and 1/1000 of 60K), which showed that AggMap Fmaps pre-fitted on a larger dataset have better local and global reconstruction ability (Figure 4B). The same behavior was found on the multi-task omics TCGA-T dataset (Figure 7B). When AggMap is pre-fitted with a very low-sample subset of TCGA-T, the generated Fmap is closer to the randomized result, because the subset data alone cannot properly measure the intrinsic correlations (Figure 7B); conversely, pre-fitting on more data achieves a relatively higher classification accuracy (Figure 7C). The average 10-fold cross-validation accuracies of the five lower-performing tasks of the multi-task TCGA-T (the five cancer types READ, CHOL, UCS, KICH and ESCA) were boosted from 44%, 62%, 79%, 84% and 85% to 47%, 71%, 82%, 91% and 87%, respectively. Thus, the unsupervised transferable AggMap can boost the classification accuracy of AggMapNet if it is pre-fitted on higher-sample unlabeled data. We compared the DL performance of AggMap with the existing 2D Fmap generation methods Lyu-reshape (12) and Bazgir-REFINED (14). The Lyu-reshape method straightforwardly converts a 1D vector into a 2D image by directly reshaping the vector (12) (a minimal sketch of this direct-reshape baseline is given after this paragraph), while the Bazgir-REFINED method projects a 1D vector into a 2D image according to neighborhood dependencies by means of the linear multidimensional scaling (MDS) method (14). AggMap Fmaps exhibited more structured local textures than those of the two existing methods (Figure 7A, Supplementary Figure S9). Based on the 10-fold cross-validation performance of these Fmaps in AggMapNet learning of the TCGA-T dataset (Figure 7B), the models using AggMap Fmaps scored lower loss values than the two existing methods, with the AggMap 5-channel Fmaps achieving the lowest loss values. The accuracies of the AggMapNet models of these Fmaps for the 33 cancers of the multi-task TCGA-T dataset are in Supplementary Table S4. Over all 33 cancer tasks, the use of multi-channel AggMap Fmaps substantially improved the average accuracy over the existing methods, from 92% to 94%. In particular, AggMap Fmaps boosted the accuracies (ACCs) of the cancer classes with lower performances: the models of Lyu-reshape (12) scored <90% ACC for five cancers (CHOL, ESCA, GBM, READ and KICH), and multi-channel AggMapNet improved their accuracies from 35% to 47%, 56% to 71%, 77% to 87%, 81% to 82% and 87% to 91%, respectively. Therefore, the AggMap feature restructuring method outperformed the existing methods Lyu-reshape and Bazgir-REFINED.
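For contrast with AggMap, the direct-reshape baseline amounts to zero-padding and row-wise folding (our reading of the idea, not the reference implementation):

import numpy as np

def direct_reshape(vec):
    """Fold a 1D feature vector into a square 2D image, padding with zeros."""
    side = int(np.ceil(np.sqrt(vec.size)))   # e.g. 10 381 genes -> a 102 x 102 image
    padded = np.zeros(side * side)
    padded[:vec.size] = vec
    return padded.reshape(side, side)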
The advantage of AggMap arises for two reasons. Firstly, the UMAP-mediated AggMap nonlinear mapping tends to generate more structured local textures, and local texture and local connectivity are crucial to the accuracy and robustness of CNN models (45). Secondly, the multi-channel Fmaps help CNN-based models learn complex data by separately learning feature subsets (16), and their representational richness often allows CNN models to capture nonlinear dependencies at multiple scales (17). ML methods, combined with feature dimensionality reduction (DR) or feature selection (FS) techniques, have been commonly used for learning BioHULM data (1,5). AggMapNet models were thus compared with six ML models on 18 low-sample (179-554 patients), high-dimension (17 970 genes) transcriptomic datasets.

Table 2. Proof-of-concept evaluation of the contribution of AggMap feature restructuring to the robustness of AggMapNet classification of the original and noise-added randomized data. AggMapNet was trained with noise-free training sets and tested on noise-added test sets, with original or randomized data as direct input or with AggMap Fmaps as input.

These datasets are the TCGA cancer stage (TCGA-S, 10 datasets) and grade (TCGA-G, 8 datasets) datasets (5). Three standard ML models, RF, LGR and kNN, with or without DR (PCA or a deep representation method, reducing the dimensionality to 512), have been developed and benchmarked by Smith et al. (5), who found no consistent improvements of the ML models with DR (5). We compared AggMapNet with these three ML models with or without PCA. We additionally benchmarked three efficient tree-based models, RotF, XGB and LGB; these three models, with or without FC-based FS, were also compared with AggMapNet. The comparison was measured by the nested five-fold cross-validation ROC-AUC performances of AggMapNet and the six ML models with or without DR or FC-based FS. AggMapNet outperformed all six ML models with or without DR or FS (Figure 7C). Specifically, it outperformed the three standard ML models LGR, RF and kNN on 12, 17 and 18 of the 18 benchmark datasets (Supplementary Table S5), and these three ML models with PCA on 18, 15 and 18 of the 18 benchmark datasets (Supplementary Table S6). AggMapNet outperformed the three tree-based ML models RotF, XGB and LGB on 18, 18 and 15 of the 18 benchmark datasets (Supplementary Table S7), and these three ML models with FS on 18, 16 and 16 of the 18 benchmark datasets (Supplementary Table S8). AggMapNet can achieve higher performance if a proper number of AggMap channels is selected (Figure 6). Therefore, AggMapNet coupled with AggMap is a highly competitive learning method for DL of BioHULM and other low-sample-size, high-dimension data. AggMapNet provides two perturbation-based interpretation methods for revealing the important features of AggMapNet models (i.e. the Simply-explainer and the Shapley-explainer), which facilitates biomarker discovery from BioHULM data. The Shapley-explainer is based on the kernel SHAP values, which consider the amount of contribution each FP makes to the model prediction value (33). AggMapNet also provides a simpler interpretation tool that selects important features by the change in loss between the outcomes and the true labels before and after each feature alteration.
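For reference, a runnable sketch of kernel SHAP via the shap package on the WDBC data discussed below (the logistic-regression surrogate model is our stand-in, not AggMapNet):

import shap
from sklearn.datasets import load_breast_cancer       # the WDBC dataset
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)
background = shap.sample(X, 50)                       # background set required by kernel SHAP
explainer = shap.KernelExplainer(clf.predict_proba, background)
shap_values = explainer.shap_values(X[:10], nsamples=100)  # local attributions for 10 samples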
We first compared the Simply-explainer and the Shapley-explainer on the local explanation of the same MNIST recognition model. The PCC and SSIM between the original image and the explanation saliency-map were used to measure the explanation performance. The Simply-explainer tends to score better explanation performance, with higher PCC and SSIM values than the Shapley-explainer, on the explanation of both noise-free and noisy test images (Supplementary Figures S10 and S11). Noticeably, the Simply-explainer focuses on the relationships between the outcomes and the true labels instead of the changes in prediction values, which might be advantageous for better explanation performance. We also compared the two explainers on the global explanation of the breast cancer diagnostic model trained on the WDBC dataset (569 samples, 30 features) (49). The feature importance (FI) scores of the two explanations are highly correlated (Pearson's r = 0.866) with each other (Supplementary Figure S12A). However, the FI scores of the Simply-explainer tend to be more discrete than those of the Shapley-explainer, suggesting that the Simply-explainer can be a competitive method for biomarker identification. The computational complexity of the Simply-explainer is O(n), which is much lower than that of the kernel Shapley-explainer, whose complexity is O(m*l*(2n + 2048)), where l is the number of background samples, n is the number of features and m is the number of samples (33) (Supplementary Figure S12B). These results indicate that the Simply-explainer is robust, highly discriminative and fast in selecting important features, even from noisy data, which makes it particularly suitable for discovering key biomarkers in BioHULM data. We further tested AggMapNet on COVID-19 detection using the real-world proteome spectra data COV-D, which was derived from the nasal swabs of 363 COVID-19 patients/controls using clinically available MALDI-MS equipment (1). MALDI-MS assays exploit reference spectra for disease diagnosis through proteomic profiling. Six ML models and two feature selection methods were exploited in the previous study (1) for COVID-19 diagnosis and spectral marker identification, where a Support Vector Machine with a radial kernel (SVM-R) achieved state-of-the-art detection performance, with an average accuracy of 93.9% under nested 4-fold cross-validation (4-FCV) with the optimized SVM-R (1). AggMapNet with multi-channel Fmap inputs achieved 94.5% detection accuracy under the same data split method and evaluation metric (Figure 8C). By converting the 1D spectral data into 2D multi-channel Fmaps, AggMap boosted the detection accuracy: the multi-channel Fmaps more easily distinguish the positive cases from the controls (Figure 8A), and their representational richness allows the capturing of nonlinear dependencies at multiple scales, improving the detection accuracy notably compared with single-channel Fmaps (Figure 8B). Examination of the AggMapNet-revealed important features showed that the global feature importance (FI) scores calculated by the Simply-explainer for the COV-D peaks in different folds of the 4-FCV are highly correlated (pairwise fold Pearson's r = 0.71-0.91) (Figure 8D). Notably, 7 of the top-10 and 18 of the top-30 important peaks are among the 39 statistically significant (P < 0.05) COVID-19-correlated peaks of the previous MALDI-MS study (1) (Figure 8D, E), and the most correlated peak of that study (p-7612) ranked fifth by AggMapNet.
AggMapNet was also applied to classify severe and non-severe COVID-19 patients and to identify key proteomic and metabolomic biomarkers. The sera multi-omics (proteomic and metabolomic) dataset COV-S of 49 COVID-19 patients was previously used for the ML-facilitated determination of COVID-19 severity and of sera protein/metabolite changes (potential biomarkers) (2). The random forest (RF) model scored 70% accuracy in separating the severe from the non-severe cases in the independent test cohort of COV-S (2), while AggMapNet with multi-channel Fmap inputs (Figure 9A, B) improved the accuracy to 80%. A non-severe patient, XG45, misclassified by RF but correctly classified by AggMapNet, reportedly received traditional Chinese medicines for >20 days before admission (2). The 43-year-old male non-severe case XG25 was incorrectly classified as severe by both the RF and the AggMapNet models for unclear reasons; another non-severe patient, XG22, misclassified by both RF and AggMapNet, had chronic hepatitis B virus infection, diabetes and a long hospitalization (2) (Supplementary Figure S13).

We analyzed the global feature importance (GFI) of the AggMapNet severity prediction model. The top-50 important FPs, comprising 39 metabolites and 11 proteins, are clustered into six major groups (G1-G6) by 2D UMAP embedding (Figure 9C). Few of the 39 metabolites and 11 proteins are among the important signatures of the previous study (2), partly because of the high variation of merely 41 samples. G1-G5 are metabolites (except for Q9NPH3 in G2), and G6 is proteins (except for genistein sulfate). The metabolites in G1 are the sphingomyelins (SMs), those in G2 are phospholipid metabolites, those in G3 are derivatives of amino acids such as tryptophan and thyroxine, and G4 and G5 are mainly amino acids and their metabolites.

The AggMapNet-selected important metabolites sphingomyelins, phospholipids and ergothioneine (Figure 9D) are consistent with literature reports. The sphingomyelins (SMs) in group G1 make the most important contributions to the model (ranks 1 and 3 in the top-10). Plasma SMs have been shown to be useful biomarkers for distinguishing between COVID-19 patients and healthy subjects (50), and, more importantly, the serum level of the SM metabolite sphingosine-1-phosphate (S1P) has been reported as a novel circulating biomarker negatively associated with COVID-19 severity and morbidity (51). The phospholipid metabolites in group G2, including glycerophosphatidylcholines (GPCs), glycerophosphatidylinositols (GPIs) and glycerophosphatidylethanolamines (GPEs), have also been reported as important phospholipidome signatures of fatal outcomes in Ebola virus disease (EVD) (52) and are associated with the severity of COVID-19 (50). Another highly important biomarker is ergothioneine in G5, which ranked second among all the FPs. It can modulate inflammation and has been proposed as a therapeutic to reduce the severity and mortality of COVID-19, especially in the elderly and those with underlying health conditions (53).
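The correlation-distance UMAP grouping behind Figure 9C can be reproduced in outline with umap-learn and SciPy; in this sketch the random matrix merely stands in for the COV-S measurements, and the complete-linkage choice for the agglomerative step is an assumption, not necessarily the setting used in the study:

    import numpy as np
    import umap
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    # Samples x features matrix restricted to the top-50 FPs
    # (random values stand in for the protein/metabolite levels).
    rng = np.random.default_rng(0)
    X = rng.random((31, 50))

    # Pairwise correlation distances between the feature points.
    D = squareform(pdist(X.T, metric="correlation"))

    # 2D embedding of the 50 FPs from the precomputed distances.
    emb = umap.UMAP(metric="precomputed", random_state=0).fit_transform(D)

    # Agglomerative grouping of the FPs into six clusters (cf. G1-G6).
    groups = fcluster(linkage(squareform(D), method="complete"),
                      t=6, criterion="maxclust")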
Among the 11 AggMapNet-selected important proteins (Figure 9E), three proteins (immunoglobulin kappa 1D-13 and 2D-30, and Semaphorin-4B) are immune-related (54), one protein (Vasorin) modulates the arterial response to injury (55), and six proteins (NOTCH2, MMP2, SELENOP, GPX3, IL1RAP and SOD3) are related to COVID-19 severity (Supplementary Table S9).

[Figure 8, panel F: the top-10 important peaks in AggMapNet and the two-tailed Wilcoxon rank-sum test (with Bonferroni correction) results for peak intensities between COVID-19 and control samples; significance levels: ****P ≤ 0.0001; ***0.0001 < P ≤ 0.001; **0.001 < P ≤ 0.01; *0.01 < P ≤ 0.05; not significant (ns) 0.05 < P ≤ 1.]

[Figure 9: (A) the feature Fmaps; the grid size is 39x39, and each color in the grid stands for one cluster group, i.e. one channel, of the Fmaps; clusters 1 and 2 are mainly proteins, and clusters 3, 4 and 5 are mainly metabolites. (B) The transformed multi-channel Fmaps for the six non-severe and six severe cases; XG43-XG46 and XG20-XG25 are the four severe and six non-severe cases in the independent test cohort, respectively. (C) The top-50 feature points (39 metabolites and 11 proteins) embedded based on the correlation distance using UMAP and grouped into six groups; G1-G5 are metabolites (except for Q9NPH3 in G2), and G6 is proteins (except for genistein sulfate); scatter size stands for the importance score; metabolites in G1 are the sphingomyelins (SMs), and those in G2 are phospholipid metabolites, including glycerophosphatidylcholines (GPCs), glycerophosphatidylinositols (GPIs) and glycerophosphatidylethanolamines (GPEs). (D) The list card for the top-10 ranked feature points with their importance scores; the two-sided Wilcoxon rank-sum test (with Bonferroni correction) P-value is used for the significance test between the severe and non-severe patients. (E) The detailed information card for the proteins in G6.]

These AggMapNet-identified features are thus relevant to COVID-19 detections or severity predictions and may potentially facilitate biomarker discovery. Importantly, these biomarkers were discovered on the basis of only 31 patients (18 non-severe and 13 severe). Many of the signatures are not statistically significant (P < 0.05) in the Wilcoxon rank-sum test for distinguishing severe from non-severe cases because of the limited number of samples, yet AggMapNet model explanation was still able to identify these important biomarkers, suggesting the practicality and effectiveness of the method for low-sample, high-dimension data.
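For concreteness, the significance test referred to above (a two-sided Wilcoxon rank-sum test with Bonferroni correction across the tested features) can be run as in the sketch below; the normal draws are purely hypothetical stand-ins for one biomarker's values in the 13 severe and 18 non-severe patients:

    import numpy as np
    from scipy.stats import ranksums

    # Hypothetical values of one candidate biomarker per group.
    rng = np.random.default_rng(0)
    severe = rng.normal(1.0, 1.0, size=13)
    non_severe = rng.normal(0.0, 1.0, size=18)

    # Two-sided Wilcoxon rank-sum test for this single feature ...
    stat, p = ranksums(severe, non_severe)

    # ... with a Bonferroni correction over the m features tested.
    m = 50
    p_adj = min(p * m, 1.0)
    print(f"raw P = {p:.3g}, Bonferroni-adjusted P = {p_adj:.3g}")

With so few patients per group, even a sizeable intensity difference can fail the corrected threshold, which is why model-explanation-based ranking can complement per-feature testing here.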
Robust learning of biomedical data is vital for disease diagnosis, biomarker discovery and biological mechanism investigations (1,56). The relevant learning tasks are hindered by the high-dimensional, variational (biological and technical) and unordered nature of BioHULM data (57). Another obstacle is the low sample size of typical biomedical investigations and of the corresponding learning tasks (1). These problems confound the process of statistical inference and subject learning outcomes to random chance (58), which may lead to false model explanations and false discoveries and may mask the identification of genuine biological variations (58). Interpretable DL algorithms provide additional means for assessing learning outcomes and support informed clinical decisions and biomarker discoveries (9,10).

The new unsupervised AggMap algorithm restructures and aggregates unordered data into structured multi-channel Fmaps to expose the intrinsic data clusters, enabling enhanced DL by multi-channel CNNs. Together with unsupervised AggMap, the multi-channel CNN-based AggMapNet models showed enhanced and robust performance in learning both low-sample and higher-sample omics data, outperforming the SOTA ML models on most of the 18 low-sample omics benchmark tasks; this suggests that AggMapNet is a valuable complement to ML methods for omics-based learning. The perturbation-based interpretation algorithms facilitate the assessment of the important features of AggMapNet learning, and the revealed important features are highly consistent with literature-reported disease-related markers. Unsupervised AggMap and supervised AggMapNet, together with other emerging DL methods, collectively facilitate enhanced and robust learning of complex omics data for clinical diagnosis and biomedical investigations.

The MNIST/F-MNIST datasets are available in the Open Machine Learning repository (OpenML, https://www.openml.org/). All other datasets in this study are released in the Zenodo repository (https://doi.org/10.5281/zenodo.3999156). The source code of the manifold-guided AggMap framework is freely available at GitHub (https://github.com/shenwanxiang/bidd-aggmap) to allow replication of the results, and the "paper" folder in this repository contains all the codes and results of this study. The example code for the AggMap reconstruction of MNIST from a random permutation is in MNIST reconstruction code.pdf.
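For readers without the package at hand, the core restructuring step (embedding the FPs in 2D and assigning them to a regular grid) can be sketched with umap-learn and SciPy; this is a conceptual illustration only, substituting SciPy's Hungarian solver for the LAPJV algorithm used by AggMap, omitting the agglomerative channel splitting, and using random placeholder data:

    import numpy as np
    import umap
    from scipy.optimize import linear_sum_assignment

    # Toy data: 100 samples x 49 unordered feature points (FPs),
    # so the FPs fit exactly on a 7x7 grid.
    rng = np.random.default_rng(0)
    X = rng.random((100, 49))

    # 1) Embed the FPs in 2D according to their correlation structure.
    emb = umap.UMAP(metric="correlation", random_state=0).fit_transform(X.T)

    # 2) Build a regular grid with one cell per FP.
    side = int(np.ceil(np.sqrt(X.shape[1])))
    gx, gy = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
    grid = np.c_[gx.ravel(), gy.ravel()][: X.shape[1]]

    # 3) Assign every FP to one grid cell, minimising the total squared
    #    displacement between embedding and grid positions.
    emb01 = (emb - emb.min(0)) / (emb.max(0) - emb.min(0) + 1e-12)
    cost = ((emb01[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    fp_idx, cell_idx = linear_sum_assignment(cost)

    # A sample's 1D feature vector can now be laid out as a 2D Fmap.
    fmap = np.zeros(side * side)
    fmap[cell_idx] = X[0, fp_idx]
    fmap = fmap.reshape(side, side)

Every sample is then rendered with the same fixed layout, so spatially nearby grid cells hold correlated FPs, which is the property the CNN exploits.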
REFERENCES

1. Detection of SARS-CoV-2 in nasal swabs using MALDI-MS.
2. Proteomic and metabolomic characterization of COVID-19 patient sera.
3. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer.
4. Deep neural networks for high dimension, low sample size data.
5. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data.
6. Deep learning and alternative learning strategies for retrospective real-world clinical data.
7. Drug discovery with explainable artificial intelligence.
8. From local explanations to global understanding with explainable AI for trees.
9. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets.
10. Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models.
11. Deep learning: new computational modelling techniques for genomics.
12. Deep learning based tumor type classification using gene expression data.
13. Artificial image objects for classification of schizophrenia with GWAS-selected SNVs and convolutional neural network.
14. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks.
15. OmicsMapNet: transforming omics data to take advantage of deep convolutional neural network for discovery.
16. Person re-identification by multi-channel parts-based CNN with improved triplet loss function.
17. Deep learning in biomedicine.
18. Geometric deep learning: going beyond Euclidean data.
19. Deep learning.
20. Dimensionality reduction for visualizing single-cell data using UMAP.
21. fastcluster: fast hierarchical, agglomerative clustering routines for R and Python.
22. A shortest augmenting path algorithm for dense and sparse linear assignment problems.
23. Unsupervised learning of visual representations by solving jigsaw puzzles.
24. The MNIST database of handwritten digits.
25. UMAP: uniform manifold approximation and projection for dimension reduction.
26. Initialization is critical for preserving global data structure in both t-SNE and UMAP.
27. Laplacian eigenmaps and spectral techniques for embedding and clustering.
28. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations.
29. An image is worth 16x16 words: transformers for image recognition at scale.
30. Going deeper with convolutions.
31. Large kernel matters: improve semantic segmentation by global convolutional network.
32. Explaining the predictions of any classifier.
33. A unified approach to interpreting model predictions.
34. Problems with Shapley-value-based explanations as feature importance measures.
35. Explaining the data or explaining a model?
36. Deep inside convolutional networks: visualising image classification models and saliency maps.
37. Image quality assessment: from error visibility to structural similarity.
38. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
39. The S-phase-induced lncRNA SUNO1 promotes cell proliferation by controlling YAP1/Hippo signaling pathway.
40. Scikit-learn: machine learning in Python.
41. Rotation forest: a new classifier ensemble method.
42. XGBoost: a scalable tree boosting system.
43. LightGBM: a highly efficient gradient boosting decision tree.
44. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data.
45. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.
46. Threat of adversarial attacks on deep learning in computer vision: a survey.
47. Adversarial examples are a natural consequence of test error in noise.
48. A study and comparison of human and deep learning recognition performance under visual distortions.
49. UCI machine learning repository, Wisconsin Diagnostic Breast Cancer (WDBC) data set.
50. Omics-driven systems interrogation of metabolic dysregulation in COVID-19 pathogenesis.
51. Decreased serum level of sphingosine-1-phosphate: a novel predictor of clinical severity in COVID-19.
52. Plasma lipidome reveals critical illness and recovery from human Ebola virus disease.
53. Could ergothioneine aid in the treatment of coronavirus patients?
54. The role of semaphorins in immune responses and autoimmune rheumatic diseases.
55. Vasorin, a transforming growth factor β-binding protein expressed in vascular smooth muscle cells, modulates the arterial response to injury in vivo.
56. Artificial intelligence in healthcare.
57. RNA-seq: technical variability and sampling.
58. Avoiding common pitfalls in machine learning omic data science.

We would also like to thank the anonymous reviewers for their constructive suggestions on this article.

Supplementary Data are available at NAR Online.