key: cord-0739106-lh5oj0d4 authors: Ziemys, A. title: Predicting clinical outcomes and hospitalization stay of hospitalized COVID-19 patients by using Deep Learning methods date: 2022-01-30 journal: nan DOI: 10.1101/2022.01.28.22270040 sha: 3da2dc03cd1c07cfbe288222e9b310a8521e937b doc_id: 739106 cord_uid: lh5oj0d4

Predicting outcomes and other critical clinical events of hospitalized COVID-19 patients may provide a valuable asset to healthcare and a chance to improve patient outcomes. Here, we have analyzed over 10,000 hospitalized COVID-19 patients in the Houston Methodist Hospital at the Texas Medical Center from the beginning of the pandemic until April of 2020. This work extends our previous study analyzing longitudinal symptomatics of the hospitalized patients by seeking to understand how standard patient clinical data, like demographics and comorbidities, together with symptom data from early hospitalization, can be used to predict clinical outcomes and hospitalization stay. Deep Learning (DL) classification and regression methods were applied to quantify the importance of patient record features and to perform predictions. The results suggest that patient outcome can be predicted with up to 75% accuracy. However, predicting hospitalization stay proved more complex, indicating that deeper optimization of features is needed.

There is a large body of studies and knowledge dedicated to COVID-19 infection and patient outcomes. However, most of them do not draw on the potentially valuable information source of longitudinal symptomatics. Here, we seek to understand the importance of longitudinal symptomatics in the prediction of clinical outcomes and hospitalization stay by using DL methods. This study capitalizes on the findings of our previous study of longitudinal symptomatics [1] and approaches the clinical data set from the perspective of large data and AI.

Data source.
The study protocol was reviewed by the COVID Retrospective Research Task Force and, owing to the de-identified nature of the data set used, the study was granted a waiver from the IRB. The de-identified data set was acquired from the CURATOR database at the Houston Methodist Hospital (HMH) (PRO00025445). Only adult patients who tested positive for COVID-19 were included. The ages of patients older than 90 years were capped at 90 years for de-identification purposes. Time records associated with each patient were offset by the admission time. Approximately 40 unique symptoms were extracted from records originating from patient flowsheets. To make the analysis more robust, we aggregated symptoms based on their anatomical associations. Table 1 presents symptom groups, their abbreviations, and the unique symptoms attributed to each symptom group. The symptom grouping is unique, so that each individual symptom belongs to one symptom group only. The stiff neck symptom was attributed to the Central Nervous System (CNS) group because other viral brain infections, like viral meningitis [2, 3], present with such a symptom. Eight unique symptom groups were created. Symptom remission was defined as the rate of change of symptom frequency over time within a specific group.

Preprocessing. All patient records with missing clinical variables were excluded from analysis. Because the data set contains a ~1:10 ratio of deceased to alive patients, it was transformed into data sets of 2000 patients with a 1:1 ratio of deceased to alive patients. Collinearity analysis was used in feature selection to remove correlated variables (Figure 1).
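The two preprocessing steps above, class balancing and collinearity filtering, can be illustrated with a minimal standard-library sketch. The text does not state how the 1:1 data sets were drawn or which correlation cutoff was applied, so the random undersampling, the Pearson coefficient, the 0.8 threshold, and all function and field names below are assumptions made for illustration only, not the author's implementation.

```python
import math
import random


def balance_one_to_one(records, label_key, n_total, seed=0):
    """Randomly draw n_total records with equal counts of label 0 and label 1."""
    rng = random.Random(seed)
    per_class = n_total // 2
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    if min(len(positives), len(negatives)) < per_class:
        raise ValueError("not enough records in the smaller class")
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)
    return sample


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_collinear(columns, threshold=0.8):
    """Keep features in order, skipping any whose |r| with a kept feature exceeds threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic stand-in data (hypothetical field names, roughly 1:10 class ratio).
records = [{"id": i, "deceased": 1 if i % 10 == 0 else 0} for i in range(1000)]
balanced = balance_one_to_one(records, "deceased", n_total=100, seed=1)

columns = {
    "age": [1, 2, 3, 4, 5],
    "age_months": [12, 24, 36, 48, 60],  # perfectly correlated with "age"
    "bmi": [30, 22, 27, 21, 29],
}
selected = drop_collinear(columns, threshold=0.8)
```

The greedy filter depends on column order: the first feature of a correlated pair is the one retained, so in practice the columns would be ordered by some prior preference before filtering.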
Most features were poorly correlated, justifying their use in model development. Classification models were compiled with the binary cross-entropy loss function and the steepest gradient descent optimizer, while regression models were compiled with the mean squared error loss function and the Adam optimizer. Feature importance was scored using a permutation approach, and the features were sorted by score for further analysis. The optimal number of features was derived manually by comparing model test accuracy across different numbers of features.

Patient classification based on outcomes. The permutation analysis revealed that patient age, administered electrolytes, steroids, and comorbidities have the most influence on the model (Figure 2). The collinearity of the top 11 features was low, except among a few comorbidities (R ~ 0.5). All top features were used for outcome model optimization. We performed randomized parameter optimization for the classification model by training 1000 randomly parameterized models, each cross-validated with 5-fold repeated random sub-sampling. The average overall accuracy of the top five models was 0.727±0.001. The prediction scores for alive and deceased patients were the following: precision -0.

Hospitalization stay prediction by DL regression model. The feature scoring results differed slightly from those of the outcomes model (Figure 4). Medication and BMI were found to be more important. The symptom remission rate during the first days of hospitalization was also found to be important, along with cardiovascular and renal comorbidities. Model optimization resulted in overfitting and poor prediction, which indicated that deeper analysis is needed to optimize the set of features, the size of the set, and other DL model parameters.
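The permutation-based feature scoring and the repeated random sub-sampling validation mentioned above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the author's code: the `predict` callable stands in for a trained DL classifier, and the number of shuffles, the 20% test fraction, and all function names are assumptions.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild every row with only feature j replaced by its shuffled value.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def random_subsampling_splits(n_rows, n_splits=5, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs, each from an independent random shuffle."""
    rng = random.Random(seed)
    n_test = max(1, int(n_rows * test_fraction))
    for _ in range(n_splits):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


# Toy data: the label equals feature 0, while feature 1 carries no signal.
X = [[i % 2, (i // 2) % 2] for i in range(200)]
y = [row[0] for row in X]


def toy_predict(row):
    """Stand-in for a trained classifier; it looks at feature 0 only."""
    return row[0]


scores = permutation_importance(toy_predict, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
splits = list(random_subsampling_splits(len(X), n_splits=5))
```

Unlike k-fold cross-validation, repeated random sub-sampling draws each test set independently, so a given record may appear in several test sets or in none. A randomized parameter search of the kind described would wrap these splits in an outer loop over randomly drawn model settings.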
Figure 5. Regression model to predict hospitalization stay indicated further steps needed to optimize the model.

[1] Longitudinal symptom and clinical outcome analysis of hospitalized COVID-19 patients. medRxiv
[2] Toscana virus and acute meningitis, France. Emerging Infectious Diseases
[3] Aseptic meningitis in adults and children: Diagnostic and management challenges
[4] Tensorflow: Large-scale machine learning on heterogeneous distributed systems

The raw data sets can be requested by contacting HMH. The data sets generated in this study for the purpose of tables and figures can be requested directly from the authors.

medRxiv preprint doi: https://doi.org/10.1101/2022.01.28.22270040; this version posted January 30, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.