key: cord-0125761-yrr0onfb
authors: Chang, Yu-Wei; Natali, Laura; Jamialahmadi, Oveis; Romeo, Stefano; Pereira, Joana B.; Volpe, Giovanni
title: Neural Network Training with Highly Incomplete Datasets
date: 2021-07-01
journal: nan
DOI: nan
sha: e43d4cdccd48c15800aa0aa45cd12c132f401b9d
doc_id: 125761
cord_uid: yrr0onfb

Neural network training and validation rely on the availability of large high-quality datasets. However, in many cases only incomplete datasets are available, particularly in health care applications, where each patient typically undergoes different clinical procedures or can drop out of a study. Since the data to train the neural networks need to be complete, most studies discard the incomplete datapoints, which reduces the size of the training data, or impute the missing features, which can lead to artefacts. Alas, both approaches are inadequate when a large portion of the data is missing. Here, we introduce GapNet, an alternative deep-learning training approach that can use highly incomplete datasets. First, the dataset is split into subsets of samples containing all values for a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network whose training is fine-tuned using all complete datapoints. Using two highly incomplete real-world medical datasets, we show that GapNet improves the identification of patients with underlying Alzheimer's disease pathology and of patients at risk of hospitalization due to Covid-19. By distilling the information available in incomplete datasets without having to reduce their size or to impute missing values, GapNet will make it possible to extract valuable information from a wide range of datasets, benefiting diverse fields from medicine to engineering.

Supervised machine-learning models, such as the neural networks employed in deep learning, need to be trained and validated on large high-quality datasets [1]. In particular, these datasets need to be complete, i.e., each datapoint needs to have the values of all features, in order for them to be employed in standard neural-network training algorithms [2]. However, in many applications only incomplete datasets are available, i.e., each datapoint has values only for some of the relevant features [3]. For example, this often occurs in medical applications involving patient data, e.g., because various patients might undergo different clinical and diagnostic procedures at different times, with some patients even dropping out of a study [3].

In order to deal with these missing data, there are two standard approaches. The first and most commonly employed one is to exclude the datapoints that do not have all relevant features [4] . This approach has the advantage of ensuring the integrity of the data employed in training and validation, but it has the drawback of reducing the statistical power of the dataset [5] and, therefore, the predictive ability of the resulting deep-learning models [6] . Furthermore, data exclusion can introduce biases if the data are not missing completely at random [7] .

The second and more complex approach is to impute the missing data. Various statistical imputation strategies have been proposed. The simplest one is arguably to substitute the missing values with their ensemble average [7] . More sophisticated imputation strategies obtain better results employing, e.g., multilayer perceptrons, extreme gradient boosting machines, and support vector machines [8] [9] [10] . For example, Vivar et al. [11] improved the classification of individuals in the datasets of the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Parkinson's Progression Markers Initiative (PPMI) by employing a multiple recurrent graph convolutional network to impute the missing features (the brain volumes obtained from magnetic resonance imaging (MRI)). However, a major drawback of data imputation is that it can amplify biases in the available data or even introduce spurious correlations [4] , especially when data are missing in great numbers or not at random [12] . Because of their drawbacks, both exclusion and imputation strategies can only deal with a limited amount of missing data.
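The two standard approaches can be contrasted on a toy example. The following is a minimal sketch, assuming missing values are encoded as `None`; the function names and the data values are hypothetical and serve only as illustration.

```python
from statistics import mean

def drop_incomplete(rows):
    """Exclusion: keep only the rows with no missing (None) value."""
    return [row for row in rows if all(v is not None for v in row)]

def impute_column_means(rows):
    """Mean imputation: replace each None by the mean of the observed values in its column."""
    n_features = len(rows[0])
    col_means = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Toy data (hypothetical values): 4 datapoints, 3 features, None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [None, 8.0, 9.0],
    [7.0, 5.0, None],
]

complete = drop_incomplete(data)     # only the first datapoint survives
imputed = impute_column_means(data)  # all datapoints survive, gaps filled with column means
```

In this toy case, exclusion keeps one datapoint out of four, while imputation keeps all four but fills a quarter of the entries with values that were never measured.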

To address these limitations, here we introduce GapNet, an alternative neural-network training approach based on a hierarchical architecture that can make use of highly incomplete datasets. GapNet takes advantage of all available information without the need for imputation of missing data. First, the dataset is split into subsets containing all datapoints with a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network whose training is fine-tuned using all complete datapoints. As real-world test cases with highly incomplete datasets, we show that GapNet improves the identification of patients in the Alzheimer's disease continuum in the ADNI cohort and of patients at risk of hospitalization due to Covid-19 in the UK BioBank cohort. By distilling the information available in large incomplete datasets without having to reduce their size or to impute missing values, GapNet allows extracting valuable information from a wider range of datasets and can be employed in many applications.

To demonstrate the GapNet working principle, we first apply it to a simulated dataset, representing a two-class classification problem with F continuous input features. Specifically, we employ the simulated dataset Madelon [13], where the datapoints are clustered around the vertices of an F-dimensional hypercube and assigned to the class of the closest vertex (see details in Methods, "Simulated dataset"). Figure 1a provides a schematic illustration for the case F = 3. Here, we consider a system with F = 40 features, x1, ..., x40, and a label y ∈ {0, 1} giving the class assignment of each datapoint. As shown in Figure 1b, we simulate N = 1000 samples, of which only 100 (samples 451 to 550) have all 40 features.

To establish a benchmark, we first consider a vanilla neural network approach, which is schematically shown in Figure 1c. We employ a dense neural network having an input layer with 40 nodes corresponding to the 40 features, two hidden layers with 80 nodes each (ReLU activation), a dropout layer with a frequency of 0.5, and an output layer with a single node (sigmoidal activation). This vanilla neural network must be trained using the complete samples. Thus, we split the available 100 complete samples into an 80-sample training set and a 20-sample testing set. We train the neural network for 2000 epochs using the Adam optimizer [15] with binary cross-entropy loss.
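To put the size of this benchmark network in perspective, the number of trainable parameters implied by the layer sizes quoted above can be counted directly. This is a worked arithmetic sketch; the helper name is hypothetical, and it assumes standard fully connected layers with one bias per node.

```python
def dense_parameter_count(layer_sizes):
    """Number of weights plus biases in a fully connected network.

    layer_sizes lists the node counts from input to output, e.g. [40, 80, 80, 1].
    Dropout layers have no trainable parameters, so they are not listed.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Vanilla network from the text: 40 inputs, two hidden layers of 80 nodes, 1 output.
vanilla_params = dense_parameter_count([40, 80, 80, 1])
```

Under these assumptions the vanilla network has 9841 trainable parameters, which have to be fitted on only 80 complete training samples.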

The GapNet approach involves a two-stage training process, as schematically presented in Figure 1d. In training stage I (Figure 1d), the input feature space is split into a set of non-overlapping clusters for which complete samples are available. We then train a neural network using each of these clusters to predict the desired output. In the current example, we identify two clusters of features: the first one corresponds to features 1 to 25 (dark green area in Figure 1b) and the second one to features 26 to 40 (light green area), each with 550 samples. Then, we train two dense neural networks to predict y using x1, ..., x25 and x26, ..., x40, respectively. Each neural network consists of an input layer (with 25 and 15 nodes, respectively, corresponding to each cluster of features), two hidden layers (with 50 and 30 nodes, respectively, with ReLU activation), a dropout layer (frequency 0.5), and finally an output layer with a single node (sigmoidal activation). We train these neural networks on the available samples (retaining 20 complete samples for testing, the same for both neural networks) for 2000 epochs (Adam optimizer [15], binary cross-entropy loss).
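The selection of the stage-I training subsets can be sketched as follows. This is an illustration only: missing values are encoded as `None`, the feature values are placeholders, and the row ranges are an assumption chosen to reproduce the counts quoted in the text (550 samples per cluster, 100 complete samples).

```python
N_SAMPLES, N_FEATURES = 1000, 40
# 0-based feature indices of the two clusters (features 1-25 and 26-40 in the text).
CLUSTERS = {"cluster_1": range(0, 25), "cluster_2": range(25, 40)}

def make_row(i):
    """Row i has features 1-25 if i < 550 and features 26-40 if i >= 450 (0-based rows)."""
    row = [None] * N_FEATURES
    if i < 550:
        for j in CLUSTERS["cluster_1"]:
            row[j] = 0.0  # placeholder; real feature values would go here
    if i >= 450:
        for j in CLUSTERS["cluster_2"]:
            row[j] = 0.0
    return row

dataset = [make_row(i) for i in range(N_SAMPLES)]

def rows_complete_for(dataset, feature_indices):
    """Indices of the rows that have a value for every feature in the given cluster."""
    return [i for i, row in enumerate(dataset)
            if all(row[j] is not None for j in feature_indices)]

subset_1 = rows_complete_for(dataset, CLUSTERS["cluster_1"])  # trains the first network
subset_2 = rows_complete_for(dataset, CLUSTERS["cluster_2"])  # trains the second network
complete = rows_complete_for(dataset, range(N_FEATURES))      # usable by the vanilla network
```

Each cluster keeps 550 usable samples, whereas only the 100 rows in the overlap are available to a network that needs all 40 features.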

In training stage II (Figure 1d ), the input and hidden layers of the first-stage neural networks are combined into a single neural network, adding an output node (sigmoidal activation). These new connections are trained for 2000 epochs (Adam optimizer [15] , binary cross-entropy loss) using all available complete samples (retaining for testing the same 20 samples used for testing in training stage I).
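The idea of stage II, training only the new connections on top of the first-stage networks, can be sketched in plain Python. This is not the authors' implementation: the first-stage networks are replaced by fixed random ReLU layers (they are not trained here), plain gradient descent is used instead of Adam, and all names, sizes and data are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu_layer(weights, biases, x):
    """One dense layer with ReLU activation; weights[k] is the weight vector of node k."""
    return [max(0.0, sum(w * v for w, v in zip(w_row, x)) + b)
            for w_row, b in zip(weights, biases)]

def make_frozen_subnetwork(n_in, n_hidden, rng):
    """Stand-in for a stage-I network: a fixed random hidden layer (assumption)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [0.0] * n_hidden
    return lambda x: relu_layer(weights, biases, x)

def train_output_node(hidden_vectors, labels, epochs=200, lr=0.05):
    """Fit only the new output connections by gradient descent on binary cross-entropy."""
    n = len(hidden_vectors[0])
    w, b = [0.0] * n, 0.0
    losses = []
    m = len(labels)
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * n, 0.0, 0.0
        for h, y in zip(hidden_vectors, labels):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            p = min(max(p, 1e-12), 1 - 1e-12)  # guard the logarithms
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for k in range(n):
                grad_w[k] += (p - y) * h[k]
            grad_b += p - y
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
        losses.append(loss / m)
    return w, b, losses

rng = random.Random(0)
net_a = make_frozen_subnetwork(3, 4, rng)  # sees features 0-2 of each sample
net_b = make_frozen_subnetwork(2, 4, rng)  # sees features 3-4 of each sample

# Hypothetical complete samples: 5 features each, with a label depending on both clusters.
samples = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(40)]
labels = [1 if x[0] + x[3] > 0 else 0 for x in samples]

# Stage II: concatenate the hidden outputs of both sub-networks, then train the output node.
hidden = [net_a(x[:3]) + net_b(x[3:]) for x in samples]
w, b, losses = train_output_node(hidden, labels)
```

Because the sub-networks stay fixed, only the output weights and bias are learned from the complete samples, which is what makes the small number of complete samples sufficient in this sketch.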

In order to statistically compare the performance of GapNet to the performance of the vanilla neural network, we repeat the training and testing procedures 100 times with resampling of the training and testing datasets. We evaluate the performance using the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) versus the false positive rate (FPR) as a function of the threshold. The area under the ROC curve (AUC) is then 0.5 for a random estimator and approaches 1 as the estimator performance improves. In Figure 1e, we show the ROC curve for GapNet (orange line) and vanilla neural network (blue line), obtained as the average over 100 repetitions, while the corresponding standard deviation is shown by the shaded areas. The GapNet approach (AUC 0.89 ± 0.11) outperforms the vanilla neural network (AUC 0.68 ± 0.17). This difference is statistically significant with a p-value p < 0.0001 (z = 20.6), tested using the DeLong test [14]. In addition to the AUC comparison, Table I shows the sensitivity, specificity, accuracy, and precision computed by setting the threshold at 0.5. All the metrics are improved by using GapNet, showing its ability to correctly identify the true positive and true negative cases. Interestingly, the variability of the GapNet approach is smaller than that of the vanilla neural network approach, indicating more consistent training results. Figure 1f shows the histograms of the AUC values for each independent run. The GapNet histogram peaks at larger AUC values than that of the vanilla neural network, indicating that GapNet achieves a consistently better performance and increases the robustness of the classification to the missing data.
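The evaluation metrics used here can be made concrete with a short sketch. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative), and the remaining metrics follow from the confusion matrix at a threshold of 0.5. The labels and scores below are hypothetical, and treating a score equal to the threshold as positive is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, accuracy and precision at a fixed threshold.

    Assumes at least one sample in each row and column of the confusion matrix,
    otherwise a division by zero occurs.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp),
    }

# Hypothetical test set of 8 samples.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

auc = roc_auc(labels, scores)
metrics = threshold_metrics(labels, scores)
```

In this toy case 15 of the 16 positive-negative pairs are ranked correctly, so the AUC is 0.9375, while one false positive and one false negative at threshold 0.5 give 0.75 for all four thresholded metrics.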

Overall, these proof-of-principle results on simulated data show the effectiveness and robustness of the GapNet approach in fully exploiting incomplete datasets. Thus, the clear advantage is that GapNet can make use of all available data, without having to impute the missing data.

Figure 1. GapNet improves the classification of simulated data. a Schematic representation of Madelon [13], an artificial dataset used to test binary classification problems. The points in the dataset are clustered around the vertices of a hypercube with dimensions corresponding to the number of features (F = 3 in the schematic) and attributed to 2 different classes (corresponding to the pink and red circles in the schematic). b We simulate N = 1000 samples with F = 40 features (x1, ..., x40). Colored areas represent the available data, while the white areas represent the missing data: Only 100 samples (samples 451 to 550, highlighted by the thick black line) have all 40 features. c The baseline benchmark is provided by a vanilla neural network trained and tested on the complete samples. d The GapNet approach: In training stage I, subsets containing all samples with certain features are used to train individual neural networks. In training stage II, their outputs are combined by an output node, whose connections are trained with all complete samples. e The receiver operating characteristic (ROC) curve showing the true positive rate (TPR) versus the false positive rate (FPR) of the GapNet (orange, area under the curve (AUC) 0.89 ± 0.11) is significantly better than that of the vanilla neural network (blue, AUC 0.68 ± 0.17). Average (lines) and standard deviations (shaded areas) are obtained from 100 random splits of the training and testing datasets. The black dashed line represents the ROC curve of a random classifier. To establish statistical significance, the classifiers are also compared using the DeLong test [14], resulting in a significant p-value < 0.0001 (z = 20.6). f Comparing the histograms of the AUC values for each independent run of the GapNet (orange) and the vanilla neural network (blue), we observe that GapNet delivers better results (larger AUC) in a more consistent way (smaller variance).

As a first real-world GapNet application, we consider the identification of individuals with underlying amyloid pathology, which is one of the earliest pathological changes occurring in Alzheimer's disease (AD) [16] [17] [18] [19]. We use data from the ADNI cohort (see Methods, "ADNI cohort"). In total, 869 individuals underwent 1465 neuroimaging visits including both baseline visits and subsequent longitudinal follow-up visits for some individuals. In each visit, one or more of the following three neuroimaging modalities were employed: structural magnetic resonance imaging (MRI), amyloid positron emission tomography (amyloid-PET), and fluorodeoxyglucose-PET (FDG-PET), which are normally used to assess gray matter atrophy, amyloid deposition and glucose hypometabolism in AD, respectively (Figure 2a). For MRI, we include the mean thickness of the 68 cortical regions of the Desikan atlas [20] and the volumes of 14 subcortical gray matter regions of the Aseg atlas [21]. For amyloid-PET, we include the mean amyloid standard uptake value ratio (SUVR) values from the same brain regions included in MRI. For FDG-PET, we include the SUVR of 5 composite brain regions [22]. The final number of features is 169, and the corresponding regions of interest (ROIs) are listed in Supplementary Table S1. MRI scans were acquired in 233 visits, amyloid-PET scans in 1045 visits, and FDG-PET scans in 1258 visits. Only 120 visits (corresponding to 118 individuals, of whom 40 were cognitively normal subjects without amyloid pathology and 78 were subjects with amyloid pathology) resulted in the acquisition of all three neuroimaging modalities (Figure 2b).

Table I. Performance of GapNet and vanilla neural network. Comparison between the performance of GapNet and that of the vanilla neural network in terms of sensitivity, specificity, accuracy, and precision for the different datasets corresponding to simulated data (Figure 1), neuroimaging data from the ADNI cohort (Figure 2), and Covid-19 severity predictions from the UK BioBank cohort (Figure 3). The numbers in bold represent the best results for each category. GapNet improves all the metrics for all datasets.

To apply the GapNet approach (Figure 2c), we identify three clusters of features corresponding to the three imaging modalities. We split the data into a training set and a testing set. We define the testing set from 20% of the complete data (N test = 24). We use the rest of the data for the training set (N train = 1441 including the remaining 80% of the samples with complete data as well as all the incomplete samples). In training stage I, these three clusters of features are used to train three independent neural networks (input layer with 5, 82 and 82 nodes, respectively; two hidden layers with 10, 164, and 164 nodes, respectively, with ReLU activation; dropout layer with frequency 0.5; output layer with single neuron and sigmoidal activation). We train the MRI network on 209 samples, the amyloid-PET network on 1021 samples, and the FDG-PET network on 1234 samples (in all cases for 2000 epochs with binary cross-entropy loss function and Adam optimizer). These three networks are then combined into a single neural network in training stage II with a joint output node, and the new connections are retrained on the 96 complete training samples (2000 epochs, binary cross-entropy loss function, Adam optimizer).
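The splitting rule described above (test set drawn only from complete samples, everything else used for training) can be sketched as follows. The function name, the seed and the choice of which visits are complete are assumptions made for illustration; only the counts are taken from the text.

```python
import random

def gapnet_split(is_complete, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices).

    is_complete[i] is True when sample i has all features. The test set is drawn
    only from the complete samples; all remaining samples, complete or not,
    go to the training set.
    """
    rng = random.Random(seed)
    complete_idx = [i for i, c in enumerate(is_complete) if c]
    n_test = round(test_fraction * len(complete_idx))
    test_idx = set(rng.sample(complete_idx, n_test))
    train_idx = [i for i in range(len(is_complete)) if i not in test_idx]
    return train_idx, sorted(test_idx)

# ADNI numbers from the text: 1465 visits, 120 of them with all three modalities.
# Which visits are complete is arbitrary here (assumption for illustration).
is_complete = [i < 120 for i in range(1465)]
train_idx, test_idx = gapnet_split(is_complete)
n_complete_train = sum(1 for i in train_idx if is_complete[i])
```

With these inputs the sketch yields 24 test samples, 1441 training samples, and 96 complete samples left for stage II. The same rule applied to 3926 subjects with 501 complete records gives 100 test samples and 3826 training samples.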

The ROC curves in Figure 2d show that the classification performance of the GapNet approach (orange, AUC 0.91 ± 0.09) is superior to that of the standard vanilla neural network (blue, AUC 0.87 ± 0.11). The shaded areas represent the variation of these ROC curves over 100 random splits between the training and testing sets. This difference is statistically significant (DeLong test [14], p-value p < 0.0001, z = 7.76). In addition to the AUC comparison, Table I shows the sensitivity, specificity, accuracy and precision computed by setting the prediction threshold at 0.5. All the metrics are improved by using GapNet, showing its ability to correctly identify the true positive and true negative cases [23]. Figure 2e shows the box plots with the AUC values obtained over the independent runs for the GapNet approach, the networks trained on each of the data clusters, and the vanilla neural network. The GapNet approach achieves the best performance, outperforming the vanilla neural network as well as all networks trained on a single data cluster. Interestingly, the amyloid-PET-trained neural network also outperforms the vanilla neural network approach, probably taking advantage of its larger training set (1021 instead of 96 samples). Nevertheless, both the MRI-trained and the FDG-PET-trained neural networks underperform the vanilla neural network, despite having access to more samples (233 and 1258, respectively).
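The relation between the reported z statistic and the p-value bound can be checked with a short calculation. The sketch assumes that z is approximately standard normal under the null hypothesis of equal AUCs and that the test is two-sided; the text does not state whether a one-sided or two-sided test was used.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value of a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

p_adni = two_sided_p_value(7.76)  # z reported for the ADNI comparison
```

Under these assumptions, z = 7.76 corresponds to a p-value many orders of magnitude below the quoted bound of 0.0001.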

Figure 2. [...] Only 120 visits resulted in the acquisition of all three neuroimaging modalities (highlighted by the thick black line). c GapNet: First a neural network with two hidden layers is trained using each cluster of features (i.e., the brain region data for each neuroimaging modality). Then, all these neural networks are concatenated to return the final output. d Receiver operating characteristic (ROC) curve showing that GapNet (orange, AUC 0.91 ± 0.09) outperforms the vanilla neural network (blue, AUC 0.87 ± 0.11). The solid lines represent the averages over 100 independent runs and the shaded areas the corresponding standard deviations. The black dashed line represents the ROC curve of a random classifier. The classifiers are also compared using the DeLong test [14], resulting in a significant p-value p < 0.0001 (z = 7.76). e Box plots of the AUC values over the independent runs showing that GapNet outperforms the vanilla neural networks trained either on all individuals with all diagnostic modalities ("Vanilla") or on only one diagnostic modality ("Amyloid", "MRI", or "FDG"). The red mark represents the median, the black boxes are the interquartile ranges, and the black horizontal lines are the whiskers. f Relevance heatmaps for the GapNet classification between cognitively normal individuals and patients in the Alzheimer's disease continuum. The color-coded areas represent the regions with high classification importance. For amyloid-PET (upper panel), these regions include the left pericalcarine, right pars triangularis, bilateral superior parietal lobule, right precuneus, right postcentral lobule, left medial orbitofrontal cortex, right precentral lobule at the cortical level, as well as bilateral hippocampus and bilateral amygdala at the subcortical level. For MRI (lower panel), the regions are mainly from the right hemisphere, including the caudal anterior cingulate and posterior cingulate at the cortical level as well as the right caudate and left pallidum at the subcortical level.

In Figure 2f, we mapped the feature importance for the GapNet model by performing a permutation feature analysis [24]. The results show that 16 features are highlighted, from amyloid-PET and MRI. For amyloid-PET, the features included occipital (left pericalcarine gyrus), frontal (right pars triangularis, right postcentral gyrus, left medial orbitofrontal cortex, right precentral gyrus), parietal (bilateral superior parietal gyri, right precuneus), and subcortical areas (bilateral hippocampus, bilateral amygdala). For MRI, the most important features included cortical thickness in limbic areas (right caudal anterior cingulate and right posterior cingulate) as well as volumes in subcortical areas (right caudate, left pallidum). Crucially, most of these features (or ROIs) have been reported to be impaired in AD by previous studies assessing patients at different disease stages. For amyloid-PET, the orbito-frontal and precuneus ROIs have often been identified in the early stages of amyloid pathology [25]; the subcortical ROIs (amygdala and hippocampus) are in line with the Aβ42 accumulation regions that have been reported in amyloid pathology during the early AD stages; and the other three cortical ROIs (pericalcarine, postcentral, and precentral) are consistent with the Aβ42 accumulation regions showing high SUVRs during the late AD stages [25]. For MRI, both the highlighted cortical ROIs (caudal anterior cingulate and posterior cingulate) and subcortical ROIs (caudate and pallidum) have also been reported in previous studies on informative regions across different stages of the AD continuum [26] [27] [28] [29] [30] [31].
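The principle of a permutation feature analysis can be illustrated with a minimal sketch: each feature column is shuffled across samples in turn, and the resulting drop in performance is taken as the importance of that feature. The use of accuracy as the score, the number of repeats, and the toy classifier below are assumptions; the text does not specify the exact settings of the analysis in [24].

```python
import random

def accuracy(model, X, y):
    """Fraction of samples for which the model prediction equals the target."""
    return sum(1 for row, target in zip(X, y) if model(row) == target) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in for a trained classifier: it only looks at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

rng = random.Random(1)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
```

In this toy setting only feature 0 receives a non-zero importance, since shuffling the two ignored features cannot change any prediction.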

As a second example of a real-world case application, we consider an incomplete dataset to predict hospitalization due to Covid-19. The dataset is based on the UK BioBank cohort, which contains information concerning Covid-19 test results, hospitalizations, and clinical examinations (see details in Methods, "UK BioBank cohort"). The aim of this analysis is to discriminate patients at high risk of hospitalization due to severe Covid-19 symptoms from those at low risk of hospitalization, based on their previous medical records.

The cohort includes 3926 individuals and 34 different features with a varying number of records per feature, ranging from 1123 to 3875 values. The parameters include sex, age and easily accessible testing information, such as red blood and white blood cell counts (see Supplementary Table S2 for the full list of features). The missing data are irregularly distributed across the dataset, so we gather the attributes into 7 different clusters based on the overlap between the features in order to minimize the amount of information loss. Figure 3a shows the clusters labeled from A to G and the attributes included in each cluster. Figure 3b shows the input data color-coded based on the different clusters, from cluster A in yellow to cluster G in dark green. The colored areas represent the data, while the missing values are shown as white patches. Discarding all patients with incomplete records leads to a reduced cohort of only 501 subjects, as indicated by the black rectangle in Figure 3b .

The schematic representation of the GapNet approach is provided in Figure 3c . We split the data into training and testing sets. The testing set includes about 20% (N test = 100) of the complete samples, while the training set includes the remaining complete data (401 samples) and all the incomplete data (N train = 3826).

In training stage I, the seven clusters of features are used to train seven independent neural networks with an input layer of 5 nodes, except for cluster A (with 4 nodes). Each of these neural networks has 2 hidden layers with 10 nodes each (ReLU activation) and a dropout layer (frequency 0.2). The output layer consists of a single neuron (sigmoidal activation). We train the neural networks on each cluster of inputs for 500 epochs using the binary cross-entropy loss function and Adam optimizer. These seven networks are then combined into a single neural network in training stage II with a joint output node, and the new connections are retrained on the complete samples (500 epochs, binary cross-entropy loss function, Adam optimizer). Figure 3d shows that the ROC curve for the GapNet approach (orange, AUC 0.69 ± 0.07) is significantly better than that of the vanilla neural network (blue, AUC 0.60 ± 0.07), as demonstrated by the DeLong test [14] (p < 0.0001, z = 11.8). The solid lines are the average ROC curves over 100 repetitions of random splitting between the training and testing sets. Table I shows that the sensitivity, specificity, accuracy, and precision computed by setting the prediction threshold at 0.5 are much improved by the GapNet approach.

Finally, we compare the performance of each cluster to GapNet and the vanilla neural network in Figure 3e. The results are sorted in descending order of median AUC values. It is interesting to note that the GapNet results are better than those of the best cluster, demonstrating that combining the information in the clustered network improves the classification task. Moreover, the shift in the median between the GapNet and the next-best cluster is larger than the average distance between single clusters, showing a solid improvement in the predictions. Figure 3e also shows that the training on some specific features (included in clusters A and E) results in a better predictor than using all 34 features while discarding the missing data. A possible explanation is that the vanilla neural network is trained on a smaller dataset (including only the records from 501 patients), while the single clusters have a larger size (3875 and 1992 subjects for clusters A and E, respectively).

Figure 3. GapNet improves the identification of patients at risk of hospitalization due to Covid-19 in the UK BioBank cohort. a Seven clusters of features employed to predict hospitalization due to Covid-19 (sizes A → 3875, B → 3713, C → 3622, D → 3109, E → 1992, F → 1123, and G → 1536). b The dataset comprises 3926 subjects and 34 features with colors representing the different clusters from A (yellow) to G (dark green). The white areas represent the missing data. The 501 subjects with all features are delimited by black lines. c Network architecture employed for the GapNet approach. Neural networks with two hidden layers are trained for each cluster and then concatenated to obtain a single output. d Receiver operating characteristic (ROC) curve showing that GapNet (orange, AUC 0.69 ± 0.07) outperforms the vanilla neural network (blue, AUC 0.60 ± 0.07). The solid lines represent the averages over 100 independent runs and the shaded areas the corresponding standard deviations. The black dashed line represents the ROC curve of a random classifier. The classifiers are also compared using the DeLong test [14], resulting in a significant p-value p < 0.0001 (z = 11.8). e Box plots showing the statistics of the AUC over the independent runs. The red vertical lines mark the median value for each boxplot, the red dots are the fliers, the black boxes are the interquartile ranges, and the black horizontal lines are the whiskers. The performance of GapNet is better than that of the vanilla neural network. Furthermore, the results are compared with the training of neural networks on the single clusters and the plot shows them in descending order of medians (see Supplementary Table S2 for the complete list of features and the extended names). The GapNet approach features the best outcome. Interestingly, the features in clusters A and E alone result in better classifiers than the vanilla neural network.

The need to handle incomplete datasets is ubiquitous in all fields dealing with empirical data, such as medicine and engineering. While several imputation techniques have been developed, they always rely on (explicit or implicit) assumptions on the frequency and distribution of missing values, and often incur the risk of introducing biases. Here, we have proposed an alternative approach that can be employed also when the missing data are very frequent and not missing at random. We have called this approach GapNet. In the GapNet approach, the neural network undergoes two training stages, first training an ensemble of neural networks, each on a data subset with a complete cluster of features, and then combining these into a single estimator that is fine-tuned on the available complete samples. This estimator is better at predicting the results for complete data when compared both with a simple vanilla neural network approach trained only on complete data and with the single-cluster-trained neural networks. We have demonstrated the superior predictive ability of the GapNet approach on three examples, one corresponding to simulated data as a proof of principle, and the other two on real-world medical datasets of Alzheimer's disease and Covid-19 patients. These findings suggest that the improved predictions obtained by the GapNet approach are potentially generalizable to other datasets.

The major goal for the classification in the AD continuum is to establish a biomarker-based deep-learning model. In this respect, the GapNet approach outperforms the vanilla neural network, delivering robust results and detecting complex relationships when combining different imaging modalities. Notably, the feature importance analysis shows that the important brain regions for the GapNet approach are in line with the brain areas affected in AD reported by previous studies. Specifically, the most predictive features are mainly derived from the amyloid-PET modality, including parietal and frontal regions, which are typical sites of amyloid accumulation in AD [32]. These key findings on AD-related brain changes suggest that the GapNet estimator is capable of producing robust predictions for early and accurate AD diagnosis.

A hot topic in Covid-19 research is to understand the effect of other comorbid diseases and conditions on the risk of developing severe Covid-19 symptoms [33] [34] [35] [36] [37] [38] [39]. Studies of this type can help to understand, for instance, which individual characteristics are associated with a higher risk of hospitalization, and who should be constantly monitored or prioritized in the vaccination process [40] [41] [42]. However, when analyzing patient data, the choice between discarding the incomplete values or imputing them, and the imputation technique employed [43], can lead to different results [44] [45] [46], depending also on the size of the cohort and the handling of missing data. Here, we provide a simple example to predict severe Covid-19 outcomes, with the aim of pointing out the advantages of using GapNet in this context: exploiting the incomplete values and avoiding biases or alterations of the original dataset. In the presented results, the incidence of the different features is consistent with previous findings in the literature. The most relevant clusters, in fact, include features such as age, systolic and diastolic blood pressure and serum creatinine levels (see details in Supplementary Note A, "Feature importance analysis for Covid-19 severity"), connected with known Covid-19 high-risk comorbidities [37, 47, 48].

In conclusion, we have proposed GapNet, a conceptually effective neural-network architecture, to produce more robust predictions in datasets with missing values, which have become increasingly common in research. We have shown how GapNet can detect complex nonlinear relationships between all the variables and is capable of learning and inferring from medical data with incomplete features. We have verified the effectiveness of GapNet on two prominent real-world datasets: the identification of patients in the Alzheimer's disease continuum in the ADNI cohort, and the prediction of patients at risk of hospitalization due to Covid-19 in the UK BioBank cohort. We believe that GapNet is a preliminary step towards generic, scalable architectures that can investigate many real-world medical problems, or even tasks from many domains, holding great potential for several future applications. One of the next steps will be to apply GapNet to more complex kinds of neural network architectures, such as recurrent and convolutional neural networks, as well as to apply it to more complex input data, such as time sequences and images.

To verify the working principle of GapNet, we use a simulated dataset adapted from Madelon [13] implemented with scikit-learn [49] (Figure 1a). We simulate a binary-classification dataset including 1000 samples with 40 features (Figure 1b).

The data used for this analysis were obtained from the ADNI database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). The individuals included in the current study were recruited as part of ADNI-GO and ADNI-2. This study was approved by the Institutional Review Boards of all participating institutions. Informed written consent was obtained from all participants at each site. For up-to-date information, see adni.loni.usc.edu.

Amyloid pathology has been identified using a cut-off of > 976.6 pg/ml on cerebrospinal fluid (CSF) levels of Aβ42, following previously established procedures [50]. Subjects who are cognitively normal and have high CSF Aβ42 values are used as a healthy reference group, while subjects who are cognitively normal, have mild cognitive impairment, or have AD dementia, and who have low CSF Aβ42 values, are included in the group at high risk of having AD (the AD continuum group).
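The group assignment described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: it assumes that "low" and "high" CSF Aβ42 mean below and at-or-above the 976.6 pg/ml cut-off, that subjects matching neither description are left out, and the diagnosis labels `"CN"`, `"MCI"`, and `"AD"` as well as the function name `assign_group` are hypothetical.

```python
from typing import Optional

ABETA42_CUTOFF_PG_ML = 976.6  # cut-off value quoted in the text

# Hypothetical diagnosis labels: cognitively normal, mild cognitive
# impairment, and AD dementia.
AD_CONTINUUM_DIAGNOSES = {"CN", "MCI", "AD"}


def assign_group(diagnosis: str, csf_abeta42: float) -> Optional[str]:
    """Return 'reference', 'ad_continuum', or None if the subject fits neither group."""
    # Assumption: "low" means strictly below the cut-off.
    has_low_abeta42 = csf_abeta42 < ABETA42_CUTOFF_PG_ML
    if has_low_abeta42:
        if diagnosis in AD_CONTINUUM_DIAGNOSES:
            return "ad_continuum"
        return None
    # High CSF Abeta42: only cognitively normal subjects form the reference group.
    if diagnosis == "CN":
        return "reference"
    return None


groups = [
    assign_group("CN", 1200.0),   # cognitively normal, high Abeta42
    assign_group("MCI", 600.0),   # impaired, low Abeta42
    assign_group("MCI", 1200.0),  # impaired, high Abeta42: in neither group
]
```

Under these assumptions, a subject with cognitive impairment but high CSF Aβ42 falls into neither group and is returned as `None`.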

The dataset employed in the prediction of Covid-19 hospitalization is based on the general practitioners' records provided by the UK Biobank for Covid-related research [51]. We build the labels using the Covid-19 test result records and the hospital inpatient register data. The label '0' is assigned to patients who tested positive for Covid-19 and do not appear in the hospital records, and the label '1' to patients hospitalized due to Covid-19, identified by the ICD-10 diagnostic code U07.1 [52].
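As a rough illustration of this labelling rule, the sketch below builds a label dictionary from a list of positive-test patient identifiers and a list of inpatient records. The data layout (tuples of patient identifier and ICD-10 code) and the function name `build_labels` are hypothetical, and leaving unlabelled those patients who appear in the hospital records only under other codes is an assumption, since the text does not say how they are handled.

```python
COVID_ICD10_CODE = "U07.1"  # diagnostic code quoted in the text


def build_labels(positive_ids, inpatient_records):
    """Map patient id -> 0 (not hospitalized) or 1 (hospitalized due to Covid-19).

    positive_ids: iterable of ids of patients with a positive Covid-19 test.
    inpatient_records: iterable of (patient_id, icd10_code) pairs (hypothetical layout).
    """
    in_hospital_records = set()
    covid_hospitalized = set()
    for patient_id, icd10_code in inpatient_records:
        in_hospital_records.add(patient_id)
        if icd10_code == COVID_ICD10_CODE:
            covid_hospitalized.add(patient_id)

    labels = {}
    for patient_id in positive_ids:
        if patient_id in covid_hospitalized:
            labels[patient_id] = 1
        elif patient_id not in in_hospital_records:
            labels[patient_id] = 0
        # Assumption: positive patients hospitalized under other codes stay unlabelled.
    return labels


example_labels = build_labels(
    positive_ids=["a", "b", "c"],
    inpatient_records=[("b", "U07.1"), ("c", "I10"), ("d", "U07.1")],
)
```

In this toy input, patient `"c"` is hospitalized under a different code and is therefore left out, and patient `"d"` has no positive test record.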

The input values are obtained by merging the primary care data from the two largest providers, TPP [53] and EMIS [54]. The data are cleaned as follows: we select the patients who tested positive for Covid-19, we keep only the records coded with the SNOMED CT [55] and CTV3 [56] medical coding systems, and we include records in the five-year interval from 1 January 2015 to 1 January 2020 (antecedent to the first reported case in the dataset). The selected values include many standard medical examinations, such as pressure measurements, body weight, and blood tests. At this point, we discard the less common codes to obtain a dataset of 501 subjects with complete features, and we undersample the more represented category (the non-hospitalized subjects are nearly five times as numerous) to obtain a balanced dataset. The final number of attributes is 34, listed in Table ? ? together with their corresponding Read Code.
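Two of the cleaning steps above, the record filter and the undersampling of the majority class, can be sketched as follows. This is a minimal sketch under stated assumptions: the record fields `"system"` and `"date"`, the function names, the inclusive treatment of both interval end points, and the use of random sampling without replacement are all illustrative choices that the text does not specify.

```python
import random
from datetime import date

# Interval and coding systems quoted in the text; end points assumed inclusive.
START, END = date(2015, 1, 1), date(2020, 1, 1)
KEPT_SYSTEMS = {"SNOMED CT", "CTV3"}


def keep_record(record):
    """Keep a record if it uses an accepted coding system and falls in the interval."""
    return record["system"] in KEPT_SYSTEMS and START <= record["date"] <= END


def undersample(labels, seed=0):
    """Randomly drop samples so that every class has as many samples as the rarest one."""
    rng = random.Random(seed)
    ids_by_class = {}
    for patient_id, label in labels.items():
        ids_by_class.setdefault(label, []).append(patient_id)
    n_keep = min(len(ids) for ids in ids_by_class.values())
    balanced = {}
    for label, ids in ids_by_class.items():
        # Sorting first makes the draw reproducible for a given seed.
        for patient_id in rng.sample(sorted(ids), n_keep):
            balanced[patient_id] = label
    return balanced


records = [
    {"system": "CTV3", "date": date(2017, 5, 3)},
    {"system": "CTV3", "date": date(2012, 5, 3)},   # outside the interval
    {"system": "other", "date": date(2017, 5, 3)},  # other coding system
]
kept = [r for r in records if keep_record(r)]

# Toy labels: ten non-hospitalized (0) and two hospitalized (1) subjects.
toy_labels = {i: 0 for i in range(10)}
toy_labels.update({100: 1, 101: 1})
balanced = undersample(toy_labels)
```

After undersampling, the toy dataset contains two subjects per class. Which majority-class subjects are kept depends on the seed.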

We acknowledge support from the MSCA-ITN-ETN project ActiveMatter sponsored by the European Commission (Horizon 2020, Project Number 812780).

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: 

A systematic survey of computer-aided diagnosis in medicine: Past and present developments

Axes of a revolution: challenges and promises of big data in healthcare

The Prevention and Treatment of Missing Data in Clinical Trials

When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts

Rebutting existing misconceptions about multiple imputation as a method for handling missing data

Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls

The prevention and handling of the missing data

The Feature Selection Effect on Missing Value Imputation of Medical Datasets

Predicting Missing Values in Medical Data Via XGBoost Regression

Data preprocessing issues for incomplete medical datasets

Simultaneous imputation and disease classification in incomplete medical datasets using Multigraph Geometric Matrix Completion (MGMC)

Accounting for missing data in statistical analyses: multiple imputation is not always the answer

Result analysis of the NIPS 2003 feature selection challenge

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

A method for stochastic optimization

Amyloid-first and neurodegeneration-first profiles characterize incident amyloid PET positivity

Frequent amyloid deposition without significant cognitive impairment among the elderly

Rapid decline in episodic memory in healthy older adults with high amyloid-β

Imaging and cerebrospinal fluid biomarkers in early preclinical Alzheimer disease

An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest

Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain

Associations between cognitive, functional, and FDG-PET measures of decline in AD and MCI

Sensitivity, specificity, and predictive values: Foundations, pliabilities, and pitfalls in research and practice

Interpretable machine learning

Cerebrospinal fluid analysis detects cerebral amyloid-β accumulation earlier than positron emission tomography

Prediction of autopsy verified neuropathological change of Alzheimer's disease using machine learning and MRI

Differential regional atrophy of the cingulate gyrus in Alzheimer disease: a volumetric MRI study

Structural MRI biomarkers for preclinical and mild Alzheimer's disease. Human Brain Mapping

Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification

3D maps localize caudate nucleus atrophy in 400 Alzheimer's disease, mild cognitive impairment, and healthy elderly subjects

Automatic classification of cognitively normal, mild cognitive impairment and Alzheimer's disease using structural MRI analysis

In vivo staging of regional amyloid deposition

Association of red blood cell distribution width with mortality risk in hospitalized adults with SARS-CoV-2 infection

Red blood cell distribution width (RDW) predicts COVID-19 severity: a prospective, observational study from the Cincinnati SARS-CoV-2 emergency department cohort

Red cell distribution width (RDW): a prognostic indicator of severe COVID-19

Characteristics of peripheral blood differential counts in hospitalized patients with COVID-19

Coronavirus disease 2019 in chronic kidney disease

Declined serum high density lipoprotein cholesterol is associated with the severity of COVID-19 infection

Cholesterol in relation to COVID-19: should we care about it

Who should be prioritised for COVID-19 vaccines?

Impact of vaccination by priority group on UK deaths, hospital admissions and intensive care admissions from COVID-19

COVID-19 vaccine: A neutrosophic MCDM approach for determining the priority groups

A novel scoring system for prediction of disease severity in COVID-19

Can we predict the severity of coronavirus disease 2019 with a routine blood test? Polish Archives of Internal Medicine

Clinical and laboratory features of COVID-19: Predictors of severe prognosis

COVID-19 mortality in the UK Biobank cohort: revisiting and evaluating risk factors

Predictors of COVID-19 severity: A literature review

Hypertension and its severity or mortality in coronavirus disease 2019 (COVID-19): a pooled analysis

Scikit-learn: Machine learning in Python

CSF biomarkers of Alzheimer's disease concord with amyloid-β PET and predict clinical progression: a study of fully automated immunoassays in BioFINDER and ADNI cohorts

This research has been conducted using data from UK Biobank, a major biomedical database, under the following application number: 37142.