key: cord-1040247-rht0d7ti authors: Singh, Vivek; Kamaleswaran, Rishi; Chalfin, Donald; Buño-Soto, Antonio; San Roman, Janika; Rojas-Kenney, Edith; Molinaro, Ross; von Sengbusch, Sabine; Hodjat, Parsa; Comaniciu, Dorin; Kamen, Ali title: A Deep Learning Approach for Predicting Severity of COVID-19 Patients Using A Parsimonious Set of Laboratory Markers date: 2021-11-27 journal: iScience DOI: 10.1016/j.isci.2021.103523 sha: 31dfa92ce12eb42e01bca2d918790c72fb7f1221 doc_id: 1040247 cord_uid: rht0d7ti The SARS-CoV-2 virus has caused tremendous healthcare burden worldwide. Our focus was to develop a practical and easy to deploy system to predict the severe manifestation of disease in COVID-19 patients with an aim to assist clinicians in triage and treatment decisions. Our proposed predictive algorithm is a trained artificial intelligence-based network using 8,427 COVID-19 patient records from four healthcare systems. The model provides a severity risk score along with likelihoods of various clinical outcomes, namely ventilator use and mortality. The trained model using patient age and nine laboratory markers has the prediction accuracy with an area under the curve (AUC) of 0·78 95% CI: 0·77-0·82, and the negative predictive value NPV of 0·86 95% CI: 0·84-0·88 for the need to use a ventilator and has an accuracy with AUC of 0·85 95% CI: 0·84-0·86, and the NPV of 0·94 95% CI: 0·92-0·96 for predicting in-hospital 30-day mortality. The COVID-19 pandemic has emerged as a worldwide health challenge with an overwhelming burden on healthcare systems, emphasizing the need for effective management of rapidly deteriorating patients constrained by limited clinical resources (Erika et al., 2020; Ranney et al., 2020) . Clinical presentation of COVID-19 ranges from mild symptoms to various levels of respiratory distress and/or multi-organ system failure and death (Grant et al., 2020) . Effective clinical management of these patients is dependent on early stratification of patients who have a higher likelihood of deteriorating (Marini and Gattinoni, 2020) . For example, it has been suggested that a subset of patients with higher likelihood of severity detected during the first encounter and presentation could benefit from immunomodulatory treatments in addition to purely antiviral treatment strategies (Meyerowitz et al., 2020) . Recently, there has been a proliferation of studies proposing prediction models based on various clinical parameters aimed at early stratification of deteriorating patients (Abdulaal et al., 2020; Aktar et al., 2021; Gao et al., 2020; Yan et al., 2020) . These approaches are primarily based on traditional machine learning approaches (Aktar et al., 2021) such as support vector machines (SVM) , random forests (RF) , or deep neural networks based (DNN) methods specifically when it comes to analyzing x-ray or computed tomography (CT) images (Lassau et al., 2021) . Despite these advances, we have yet to see a practical system, which could be used universally with an evidence of generalizability to help with early identification of patients, who develop severe clinical trajectory. The underlying reasons could be summarized by two main factors. First, most of proposed models are trained and validated based on small regional cohorts of less than 10,000 patients Assaf et al., 2020; Burdick et al., 2020; Cabitza et al., 2020; Gao et al., 2020; Li et al., 2020) . This potentially introduces biases in the results, which in turn hinders generalization and replication of findings to J o u r n a l P r e -p r o o f patients from different healthcare systems. Second, there has been a focus to be as comprehensive as possible and to use a variety of data points capturing various aspects of disease. This has led most of the models to include a variety of clinical data points including disease risk factors, comorbidities, various diagnostic markers, and vital signs as inputs to the model. While this approach could potentially lead to the most accurate predictions for carefully curated test cohorts, it lacks generalization capability due to lack of veracity in the routine clinical data (Gianfrancesco et al., 2018) . For example, hypertension is a clinical comorbidity believed to be correlated with unfavorable outcomes(Sheppard James P. et al., 2021) , however, the clinical condition of hypertension, which could be either managed or unmanaged may not be reliably captured from either medical records or direct blood pressure measurements done on a patient. In short, increase in model input data variety will certainly impact its veracity, which in turn diminishes performance and potentially decreases its generalization capacity. Furthermore, from a deployment and utility perspective, it is important to not rely upon clinical datapoints as they may not be easily retrievable due to complexities in connecting to various Electronic Health Record (EHR) systems with heterogenous conventions in coding clinical parameters affecting their interpretations and utility in predictive models (Rajkomar et al., 2018) . We aimed to leverage a deep neural network trained model using heterogenous cohorts of outcome matched patient data from four early epicenters of COVID-19 to predict clinical outcomes. We sought to devise the prediction system to be as practical and as easy to deploy as possible, primarily to maximize its potential impact. To this end, we set up numerous experiments to extensively study various data input contributions to the assigned surrogate outcomes and to better understand redundancies through minimizing interactions among input parameters. Finally, we aimed to create a prediction tool, which could use a parsimonious set of input data while maintaining predictive accuracy. We compared the performance of our proposed models, with other approaches from the J o u r n a l P r e -p r o o f literature. To assess generalization capacity of the approach, we also further evaluated and characterized the model accuracy using a publicly available external test cohort from an independent health system, which was not utilized during the overall model training process. Out of 11,832 patients from three of the institutions, 10,937 who met the criteria for inclusion were used to train and internal test the model. This data split into a training set and a test set consecutively based on split dates for various cohorts to maintain the ratio of 2 to 1. This was mainly to avoid any inadvertent biases and to assess performance on patient data acquired prospectively in the future. 7,293 patients (66·68%) were selected as the training set, and 3,644 patients (33·32%) were assigned to the test set. Additionally, 2,340 patients from Mount Sinai were used as an independent external testing and validation cohort. Specifically, we performed model external testing on the 739 patient cohort admitted after April 28, 2020. The consort diagram Figure 1 depicts the cohort numbers and details from various clinical sites. Table 1 and Table 3 depicts a total of 57 variables comprised of laboratory tests, demographics, and comorbid conditions for patients in the training and testing (i.e., both internal and external) cohorts. Furthermore, the figure shows the statistical significance (i.e., p-value) of the mean value difference between mortality and discharged sub-groups. All the p-values of less than 0·05 demonstrate individual potential of the variable as an independent predictor of mortality. Table 2 lists the distribution of the assigned severity levels broken down based on the specific health system. Table 2 further demonstrates both average and availability of patient data for each severity category. Figure 2 depicts both feature correlation heatmap and distributions of a set of selected features for severity level 4 (i.e., in-hospital mortality versus discharged). We first derived ingredients of the parsimonious model out of the total of 57 features (as depicted in Table 1 and Table 3 ) using the approach explained above. Figure 4 J o u r n a l P r e -p r o o f demonstrates the feature ranking based on the two-step approach of using MRMR, pathway-based clustering, and finally exhaustive testing of performance. Figure 4 (a) specifically shows the ranking of 20 most discriminative features based on MRMR approach, whereas 4 (b) demonstrates the feature importance for a final selected 10 marker model. We compared the performance of the deep profiler model with all the features as inputs versus the parsimonious version using only 10 markers. Figure 5 demonstrates the results for the full and parsimonious models on the left and right columns, respectively. The AUCs for the prediction of in-hospital mortality at 30 days (severity level 4) computed on the test dataset with 3,227 patient data are 0·85 + 0·01 (mean AUC + SE), and 0·85 + 0·004, for full model and parsimonious one, respectively. Similarly, the AUCs for predicting mechanical ventilation, both with and without other organ system failure, are 0·81 + 0·01, and 0·79 + 0·01. Kaplan-Meier curves for various levels of severity for the two models are also comparable. With respect to predictive accuracy of mortality (severity level = 4), the full model algorithm's positive predictive value (PPV) was 0·53 and the negative predictive value (NPV) was 0·93 (Sensitivity of 0·64 and Specificity of 0·88), whereas for the 10 markers model PPV was 0·42 and NPV was 0·94 (Sensitivity of 0·53 and Specificity of 0.85) . Regarding the predictive accuracy of the severity level of 2 and above (i.e., need to use a ventilator), values for PPV and NPV were 0·50 and 0·88 (Sensitivity of 0·62 and Specificity of 0·86) for the full model, and 0·50 and 0·86 (Sensitivity of 0·61 and Specificity of 0·85), for parsimonious model, respectively. We further looked at feature distribution of various severity scores for the test cohort in the latent space (i.e., patient fingerprint as depicted in Figure 1 ). We observed a good separation of the classes after UMAP analysis of the latent fingerprints for both models as depicted in Figure values (Lundberg and Lee, 2017) . Top ten set of most important features for both models are depicted in Figure 5 . In sub figure (h), we observe that the higher values of Age, Troponin-I, and CRP, for example contribute to the estimated high severities, in contrast lower values of D-Dimer have a similar effect. We also looked at model performance over time. As shown in Figure 6 , the model accuracy in predicting clinical endpoints degrades as events (i.e., exacerbation to more severe conditions) happen at later days from the first encounter or admission. In other words, the prediction accuracy is higher for patients who are potentially experiencing higher levels of severity within earlier days after the initial encounter or admission. In all these comparisons, the parsimonious model has maintained the level of performance specifically in predicting mortality, whereas the full marker model has a superior performance in predicting ventilator use or predicting events happening at later timepoints. This could hint toward the importance of co-morbidities and/or redundancies in various blood various markers, which are not explicitly present in the parsimonious model. To put the performance numbers in context, we also used IL-6 as a sole predictor for mortality as was suggested by McGonagle et al.(McGonagle et al., 2020) , and the average AUC for the same test cohort is 0·69, which is significantly lower than both full and parsimonious models. We finally compared our approach to two popular machine learning based approaches namely XGBoost (Chen and Guestrin, 2016) and Random Forest Regression Figure 7 depicts the performance of the mentioned three models (i.e., XGBoost, Random Forest Regression, and Logistic Regression) in terms of AUC for predicting various levels of severity as compared to the proposed method. To assess generalizability of model against unseen patient data from a new clinical site, we performed a series of analysis on the patient cohort from Mt. Sinai. Figure 8 depicts the consort diagram for the 2,340 records in the Mt. Sinai publicly available dataset. Specially, we used patients admitted after April 28, 2020 (i.e., denoted as "ARM1") for the external validation. Figure 9 demonstrates the results of various models including two comparators (i.e., XGBoost and Random Forrest Regression). The 10 marker model trained using internal data sets from Emory, Houston Methodist, and La Paz outperforms other approach by a significant margin. We also further refined the 10 marker model based on the Mt. Sinai data from prior April 28, 2020, however, we did not observe any performance improvement, indicating that additional data for training did not include complementary information to what the model has been trained with. In this study, we focused on developing an easy to use and deploy algorithm for predicting disease severity for COVID-19 patients who were admitted to the hospital. We utilized a novel combination of feature ranking and selection methods, along with a novel deep learning-based approach to develop a 10 marker parsimonious model and algorithm from a total of 57 laboratory, clinical, and demographic variables. The training process has been done using data from 24 hospitals and three health systems in two countries. The predicted disease severity ranges from a severity level 0 (no respiratory problem) to level 4 (in-hospital ≤30-day mortality). The prediction accuracy of severity 4 (i.e., mortality) has an AUC J o u r n a l P r e -p r o o f of 0·86 + 0·01 (mean AUC + SE) for the full model using all 57 parameters and 0·85 + 0·01 for parsimonious 10 marker model. The final selected ingredients of the 10 marker model were all shown to be independent predictors of severity in a number of prior studies (Carubbi et al., 2021; Efros et al., 2021; Fraissé et al., 2020; Long et al., 2020; Lu et al., 2021; Nie et al., 2020; Stringer et al., 2021; Wool and Miller, 2021; Wu et al., 2020; Xiang et al., 2021) . We further externally validated the model using a publicly available dataset from Mt. Sinai and achieved performance with AUC of 0.74 + 0·01 for predicting mortality, which is lower than the internally validated results. We attributed this performance degradation to the lack of key blood markers namely Eosinophil and Lymphocyte percentages in the external validation dataset as can also be seen in the Table 3 . Nevertheless, the performance of the proposed approach is shown to be superior as compared to alternative methodologies. The developed model is easy to deploy and use, primarily due to the few numbers of required laboratory markers. It is also robust as compared to other models using clinical parameters, which may be predictive but at a wider scale are fluctuant, userdependent, and subject to temporal influences. Our final model has nine blood biomarkers and age (Iaccarino Guido et al., 2020) , capturing various underlying biological processes also independently known to be early independent predictors of severity such as immune response (i.e., lymphocytes (Lu et al., 2021) and eosinophils (Fraissé et al., 2020) ), kidney and liver functions (i.e., creatinine (Harrell et al., 1996) , LDH ), cardiac function (i.e., Troponin I (Efros et al., 2021; Nie et al., 2020) ), inflammation processes (i.e., CRP (Stringer et al., 2021) and ferritin (Carubbi et al., 2021) ), and coagulation process (i.e., D-dimer (Long et al., 2020) and INR (Wool and Miller, 2021) ). Furthermore, there are studies looking at interleukins as early predictors of severity (Del Valle et al., 2020; McGonagle et al., 2020) , and we have also observed their contribution; however, interleukin-6 (IL-6) specifically did not make the final cut based on the feature selection methodology we chose. Despite that, we compared the proposed model J o u r n a l P r e -p r o o f performance with an IL-6 based model as the baseline reference. Our approach as depicted in Figure 1 , has many advantages, one of which is the possibility of doing further analysis on the latent feature space. This is particularly important for the model interpretations, uncertainty estimation, and to explore the relationship of unseen patient data to the patient cohort data used for training the model (Kingma and Welling, 2014) . The last one is an important aspect for the clinical utility of the overall system as it could be used to create a histogram of severity distribution for similar patients within the training cohort. The similar patient severity histogram type, which could be classified as either skewed, uniform, monomodal or random, could hint the clinical user regarding the level of uncertainty in the predicted severity level by the algorithm. We do believe that our model has an advantage over other models as we only use the blood markers and that the physiological response to vaccination induced immunity could potentially be expressed in these markers. A prospective validation study is warranted to further characterize the performance of the model and its clinical utility in effectively managing targeted specifically vaccinated patient population. Various therapeutic approaches administrated to the patients from different centers can be considered as confounding factors and can be seen as one of the limitations of this study and the resultant models. Another limitation of the study is that our data were solely gathered from unvaccinated patients. We have yet to study the accuracy of the model on vaccinated patient cohorts. (Ferguson et al., 2012; Lambden et al., 2019) , however, they were modified to align better with COVID-19 disease characteristics and to have a harmonized approach based on data availability from the different sites used for this study. Based on our defined severity scale, a severity score of 0 indicates no respiratory problem, whereas severity levels 1 and 2 are reserved for patients with mild/moderate and severe respiratory problems, respectively. A severity score of 3 is assigned to patients, who had a severe respiratory problem along with other organ system failures with a focus on liver and kidney during the hospital stay. Finally, severity level 4 denotes either in-hospital mortality within 30 days or transfer to hospice. The schema of the proposed (i.e., deep profiler) method, which is based on deriving a patient fingerprint from various demographic, clinical, and laboratory parameters and using it to predict severity score is shown in Figure 3 . Deep profiler consists of three main parts (i.e., networks): an encoder network for extracting prominent features represented in a latent space, which is also referred to as the patient fingerprint, a decoder network for reconstructing the input data to ensure data fidelity of the latent feature representation, and finally a severity classifier network, which is trained to estimate the severity score (Lou et al., 2019) . We chose a variational autoencoder (VAE) based approach to provide a probabilistic representation of input parameters gathered at the first clinical presentation in a latent space (Kingma and Welling, 2014) . In this approach, the encoder output is a probability distribution for each latent attribute corresponding to an input instance. The encoder consists of four fully connected layers with 64, 32, 32 and 16 channels respectively. Each fully connected layer is followed by a batch normalization layer and a leaky rectified linear activation operation (leaky ReLU) with slope of 0·2. The decoder consists of four fully connected layers with 16, 32, 32 and 64 channels J o u r n a l P r e -p r o o f respectively, with each fully connected layer followed by a batch normalization layer and a rectified linear (ReLU) activation operation. The advantage of VAE is twofold. First, the entire autoencoder set of networks (i.e., encoder and decoder) could be trained with entire patient data in an unsupervised manner without the knowledge of their specific clinical outcome and labels. This is specifically important for the problem at hand, since the complex interactions among input datapoints could be disentangled using a potentially much larger set of data points. Secondly, variation aspect of autoencoder forces the latent state representation to be smoother as compared to standard autoencoders and that helps with the improved generalization capacity. We have further outlined advantages of VAE in the supplementary material section, which also include more principled way of dealing with missing input point values, by learning a disentangled representation in the latent space. We employ a classifier network to predict the severity score based on the latent fingerprint variables. Since the order among various severity levels is important, we formulated severity score prediction problem as an ordinal regression one. We first train the weights of the neural network on the ordinal classification task to predict severity levels, and subsequently use an ordinal linear regressor to obtain severity scores between 0 and 4. The ordinal classification task over 5 values is reformulated as 4 binary classification tasks (P. A. Gutiérrez et al., 2016) . Thus, the severity prediction classifier network outputs 4 binary outputs, corresponding to whether severity level is greater or equal to 1, 2, 3 or 4 respectively. The network consists of 4 fully connected layers. The first 3 fully connected layers have 32 channels are followed by a rectified linear activation. The 4 th fully connected layer maps the 32-channel feature to 4 output variables with sigmoid activation. To avoid overfitting, a dropout of 0·25 is employed after the first and second fully connected layers. The network parameters are optimized by minimizing the combined loss of variational auto-encoder with L 1 reconstruction loss and the binary cross entropy loss on the output of the J o u r n a l P r e -p r o o f severity classifier using Adam with a learning rate of 3 × 10 −4 . Our model implementation is based on PyTorch (http://pytorch.org). Given the output scores of the 4 binary network classifiers, we employ an ordinal L 2 -regularized linear regressor (Rennie and Srebro, 2005) to obtain a severity score between 0 and 4. We use the All-Threshold based construction as described in the paper by Rennie et. al (Rennie and Srebro, 2005) , which bounds the mean absolute error between the true and predicted severity levels. The parameters of ordinal linear regressor were obtained by performing grid search with 5-fold cross validation using scikit learn, a machine learning library in Python (https://scikit-learn.org/). Deep profiler outputs the severity risk score between 0-4 for a patient, given the input parameters at the time of admission. The concordance index (CI) (Couronné et al., 2018) of the predicted severity levels, which quantifies the quality of rankings, on the internal testing dataset and external validation dataset of Mt Sinai was 0.71 and 0.64, respectively. In addition to the risk score, the outputs of the severity classification network can also be used to compute the likelihood of the clinical events of acute respiratory failure, end organ damage with respiratory failure and mortality, corresponding to severity levels >=2, >=3 and 4 respectively. However, the softmax output of the deep network classification networks is often not well calibrated to be interpretated as likelihoods. To address this, we fit a logistic regression model to the outputs of the severity classification network. We repeat this process for each deep profiler in the ensemble, then use the mean likelihood of the ensemble as the overall likelihood for the clinical event. Data preprocessing: The laboratory measurements recorded at the time of admission depends on several factors such as the patient's symptoms and thus, in retrospective data cohorts, all laboratory measurements are not available for all the patients. Furthermore, other clinical information such as comorbidities may not be available. Table 1 shows the number of patients for each J o u r n a l P r e -p r o o f characteristic. We preprocess the lab measurements to normalize the input data distributions. For certain laboratory measurements where the input distribution is heavy tailed, we transform them to logarithmic space. All variables are centered and scaled based on median and interquartile range for increased robustness to outliers. All the missing variables are imputed using the median value. Training deep profiler to minimize the combined loss of the variational auto encoder as well as ordinal regression posits several challenges. For instance, uneven data and outcome distributions as reported in Table 2 , e.g., the number of patients for severity level 2 is significantly more than severity level 4. This is addressed by appropriately weighting the loss terms. The losses corresponding to 4 binary classifiers are weighted as {0·3, 1, 1·5, 4} respectively based on distribution of samples over different outcomes and relative importance of the task. Another challenge is that classification loss can significantly bias the encoder weights in the initial epochs, making it difficult to learn a smooth manifold. This is addressed by only optimizing the parameters of the variational auto-encoder for the first 10 epochs and optimizing all the networks jointly for the subsequent epochs. all the networks are optimized jointly. While use of variational auto-encoder helps in dealing with model bias due to missing data via implicit imputation, there is still a risk that a single deep profiler model trained over entire training set can still incorporate such a bias. To further mitigate this risk, we train an ensemble of deep profilers over 10 different splits of the training dataset. Each split partitions the training dataset into a parameter training set and model selection set with 9 to 1 ratio, while maintaining the original ratio of samples for each severity level. Parameter training set is used for learning the weights via backpropagation and the model selection set is used to select the best model over epochs. Given the ensemble of 10 deep profilers, the overall severity score is computed as the average of the severity scores. In addition to the overall severity score, we also report the 95% confidence interval. Missing input data is a prevalent problem specifically for larger cohorts of patients in which consistent data acquisition and availability is hard to ensure. This posits a greater challenge to the acceptance of machine learning based prediction models, since missingness of data can be attributed to the level of severity of the patient (fewer tests may have been done since the patients didn't exhibit severe symptoms) and machine learning models are potential learn the pattern of missingness as a predictive feature. There are different approaches to deal with missing input data points, and these range from leaving the entire record with missing data point aside, to performing missing data imputation as part of pre-processing step (e.g., using K-nearest neighbor method (Faisal and Tutz, 2017) , singular value decomposition (Troyanskaya et al., 2001) , or variational auto-encoders ), or most preferably using implicit data imputation within the prediction task. In this work, since our prediction network has a variational auto-encoder embedded, we use a hybrid approach, which consists of three steps. First, for all missing data points for each patient record, we perform median based imputation. Pseudocode for pre-processing is detailed in the Algorithm 1. Second, during the training of the VAE, we randomly drop various data points from each patient record (de Jong et al., 2019) . In this scenario, the auto-reconstruction loss is computed only on non-missing values. In this case, the latent space representation (i.e., patient fingerprint) becomes inherently robust with respect to missing variables with different patterns of missing at random(S. Phung et al., 2019) . Pseudocodes for training procedure and specifically the training loss are detailed in the Algorithm 2 and 3, respectively. Finally, VAE inherently learns the distribution of the input data by estimating mean and standard deviation of the mapped latent parameters assuming a mixture of Gaussian distributions. Any missing data points at the input could potentially be less severely affecting the latent parameters and the overall distribution formed by the entire training cohort. Finally, since we train an ensemble of networks trained on various folds with random drop-out patterns, we further ensure that the effect of data missingness is minimized. In order to assess the effectiveness of these choices, we conducted additional analysis to study the missing data distribution. Figure S1 (a) shows the histogram of the patients with different number of missing input parameters. The pseudocode for the inference used during evaluation is brought in the Algorithm 4. Nearly half of the patient population has 50% of more of the 57 input parameters missing, thus, the risk of missing data to be a feature is certainly a possibility. We then set up an experiment, where we used the number of missing patient parameters as a feature and studied the patient distribution for each severity level. Figure S1 (b) shows the violin plots for each severity level; notice that the violin plots for each severity level are quite similar suggesting the number of missing data itself isn't a strong indicator of severity in this dataset. We also evaluated the performance of deep profiler on the subset of data where all the 10 markers were obtained at the time of admission. Figure S2 shows the AUCs as well as the UMAP projection of the subset of the data on the training manifold. While this subset of data only has 300 patients, we observed a marginal increase in performance among patients with higher severity while a drop in performance among patients with lowest severity. More importantly notice that the patient samples are projected uniformly over the manifold, thus suggesting the model is not implicitly grouping patients with different number of missing parameters at different location on the manifold. To improve the overall model performance and to increase the utility of the overall predictive system, it is important to devise a parsimonious model with as few input parameters as possible. An exhaustive optimal feature selection algorithm is a nondeterministic polynomial-time (NP)-hard problem as the number of combinations changes based on the factorial of the number of features. Aside from unsupervised feature selection techniques in J o u r n a l P r e -p r o o f which no target label is considered and the goal is to reduce the number of redundant variables through a correlation analysis, we have several categories of supervised approaches. The three categories are filter-based, wrapper-based and intrinsic approaches (Saeys et al., 2007) . Filterbased methods are based on conventional statistical approaches and for the most part do not consider complex non-linear relationship among the features and target labels. Wrapper-based methods are primarily based on iterative selection and/or elimination of features and qualification of the final selected set based on the algorithm performance to predict the target. These methods which are essentially search based are usually computationally expensive and require model training and testing for the number of selected feature set candidates. Finally, in embedded methods, the feature selection is formulated as an explicit part of the predictor loss function in addition to the prediction error. This could also be thought of enforcing regularization on the decision function or the decision boundary (R. Muthukrishnan and R. Rohini, 2016) . We used a modified two step feature selection method (P. Drotár and J. Gazda, 2016; Wang et al., 2007) . The modification is primarily based on bringing in the knowledge of the relationship among various blood biomarkers and clinical co-morbidities. For each of the laboratory markers, we assigned them to clusters based on the relationship with the following underlying physiological processes such as immune response, inflammation process, coagulation pathways, cardiac function, and liver and kidney functions. We used minimum redundance maximum relevance (MRMR) technique to rank order features based on the importance (Mandal and Mukhopadhyay, 2013) and reduce the number of candidates to 20. Furthermore, we chose less than 100 combinations of markers, taking in into account equitable representations from various clusters as stated above. For all these combinations, we trained and evaluated the deep profiler performance on the test dataset and chose the winning combination with the highest performance on basis of the area under the receiveroperator characteristic (ROC) curve (i.e., referred to as AUC). We compared the performance of proposed approach with three approaches, namely, XGBoost 49 , Random Forest Regression 50 and Logistic Regression. In order to have a fair comparison, we used a feature importance approach for each specific method to select the top ten most prominent markers for model creation(Díaz-Uriarte and Alvarez de Andrés, 2006; Genuer et al., 2010) . Specifically, for XGBoost, we utilized a hyperparameter search with 5-fold cross validation on the training set to identify the best performing model with all input parameters that minimizes logistic regression loss function and achieves the highest area under the curve. The search was conducted using XGBoost python library (https://xgboost.readthedocs.io/) , with following parameter ranges: Figure S3 (b). Similarly, for Logistic Regression, we used the 10 features selected using the Random Forest model, as it marginally outperformed the XGBoost model. We then subsequently trained the logistic regression model using 5-fold cross validation, while searching over the following hyperparameters : C ϵ {10 -2 , 10 -1 , 1, 10, 10 2 }, solver ϵ {liblinear, lbfgs}, with class_weight set to "balanced". Finally, for the proposed deep profiler approach, we used the hybrid feature selection as explained in the method section. Sinai data was performed using the portion of the dataset which was taken after April 28, 2020. Furthermore, to assess the upper bound of performance, we trained a model which included all the data from the three internal hospital systems along with the portion of the data from Mt. Sinai taken prior to April 28, 2020. We followed the same training procedure as described before, except that we initialized the model parameters with the learned parameters obtained from training only on the data from three internal hospital systems. We have summarized the results in the main manuscript Figure 9 . Aside from the proposed algorithm results on the external validation data, Figure 9 in the main manuscript also include the results from other approaches namely XGBoost and Random Forest Regression. We observed a moderate degradation of performance for example for mortality prediction (i.e., severity 4) from AUC of 0·85 to 0·74, however, we attribute this to missingness of the while blood cell counts in for the entire external validation cohort. As we have seen in the main manuscript Figure 5 for example, Lymphocyte percentage is predictive of the clinical outcomes and its missingness will most likely adversely affect the performance. The predictive performance of a model was evaluated on a testing data cohort of 3,554 patients (see Figure 1 ) by reporting the AUC to assess discriminative ability for each of the severity levels, the Kaplan-Meier survival plots to analyze time to events (mortality, ventilator use) as well as other evaluation metrics including positive predictive value (PPV), negative predictive value (NPV), sensitivity and specificity. For AUC results, we also considered splitting the clinical outcomes to various time intervals after the initial presentation of the disease to capture possible performance differences for clinical outcomes based on their time elapsed from the first encounter. We used the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2020) to display the latent fingerprint distribution of the data. Using UMAP, we projected the 8-dimensional latent fingerprints to 2-dimensional points such that similar input data are closer and dissimilar points are farther apart with high probability. The UMAP plot shows various instance of predicted severity levels in relation to one another. J o u r n a l P r e -p r o o f Prognostic Modeling of COVID-19 Using Artificial Intelligence in the United Kingdom: Model Development and Validation Machine Learning Approach to Predicting COVID-19 Disease Severity Based on Clinical Blood Test Data: Statistical Analysis and Model Development Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study Utilization of machinelearning models to accurately predict the risk for critical COVID-19 Prediction of respiratory decompensation in Covid-19 patients using machine learning: The READY trial Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests Ferritin is associated with the severity of lung involvement but not with worse prognosis in patients with COVID-19: data from two Italian COVID-19 units XGBoost: A Scalable Tree Boosting System Random forest versus logistic regression: a large-scale benchmark experiment Deep learning for clustering of multivariate clinical patient trajectories with missing values An inflammatory cytokine signature predicts COVID-19 severity and survival Gene selection and classification of microarray data using random forest Myocardial injury in hospitalized patients with COVID-19 infection-Risk factors and outcomes Triage decisionmaking at the time of COVID-19 infection: the Piacenza strategy Missing value imputation for gene expression data by tailored nearest neighbors The Berlin definition of ARDS: an expanded rationale, justification, and supplementary material Eosinophilia in critically ill COVID-19 patients: a French monocenter retrospective study Machine learning based early warning system enables accurate mortality risk prediction for COVID-19 Variable selection using random forests Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data The prevalence of symptoms in 24,410 adults infected by the novel coronavirus (SARS-CoV-2; COVID-19): A systematic review and meta-analysis of 148 studies from 9 countries Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors Age and Multimorbidity Predict Death Among COVID-19 Patients Auto-Encoding Variational Bayes The SOFA score-development, utility and challenges of accurate assessment in clinical trials Integrating deep learning CT-scan model, biological and clinical variables to predict severity of COVID-19 patients Prediction of COVID-19 Severity Using Chest Computed Tomography and Laboratory Measurements: Evaluation Using a Machine Learning Approach D-Dimer and Prothrombin Time Are the Significant Indicators of Severe COVID-19 and Poor Prognosis An image-based deep learning framework for individualising radiotherapy dose: a retrospective analysis of outcome prediction Prognostic value of lymphocyte count in severe COVID-19 patients with corticosteroid treatment A Unified Approach to Interpreting Model Predictions An Improved Minimum Redundancy Maximum Relevance Approach for Feature Selection in Management of COVID-19 Respiratory Distress The Role of Cytokines including Interleukin-6 in COVID-19 induced Pneumonia and Macrophage Activation Syndrome Like Disease UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction Immunomodulation as Treatment for Severe Coronavirus Disease 2019: A Systematic Review of Current Modalities and Future Directions Cardiac Troponin I Is an Independent Predictor for Mortality in Hospitalized Patients With COVID-19 Ordinal Regression Methods: Survey and Experimental Study Two-Step Feature Selection Methods for Selection of Very Few Features Genomic data imputation with variational autoencoders LASSO: A feature selection technique in predictive modeling for machine learning Scalable and accurate deep learning with electronic health records Critical Supply Shortages -The Need for Ventilators and Personal Protective Equipment during the Covid-19 Pandemic Loss functions for preference levels: Regression with discrete ordered labels A deep learning technique for imputing missing healthcare data A review of feature selection techniques in bioinformatics Association Between Blood Pressure Control and Coronavirus Disease COPE Study Collaborators, 2021. The role of C-reactive protein as a prognostic marker in COVID-19 Missing value estimation methods for DNA microarrays Accurate Cancer Classification Using Expressions of Very Few Genes The Impact of COVID-19 Disease on Platelets and Coagulation Clinical evaluation of potential usefulness of serum lactate dehydrogenase (LDH) in 2019 novel coronavirus (COVID-19) pneumonia Renal dysfunction and prognosis of COVID-19 patients: a hospital-based retrospective cohort study An interpretable mortality prediction model for COVID-19 patients IL-6 was not part of the 10-variable model due to the low number of entries in the hospital databases