key: cord-0954885-9ce6r8af authors: Qin, Shijie; Li, Weiwei; Shi, Xuejia; Wu, Yanjun; Wang, Canbiao; Shen, Jiawei; Pang, Rongrong; He, Bangshun; Zhao, Jun; Qiao, Qinghua; Luo, Tao; Guo, Yanju; Yang, Yang; Han, Ying; Wu, Qiuyue; Wu, Jian; Dai, Wei; Zhang, Libo; Chen, Liming; Xue, Chunyan; Jin, Ping; Gan, Zhenhua; Ma, Fei; Xia, Xinyi title: 3044 cases reveal important prognosis signatures of COVID-19 patients date: 2021-02-09 journal: Comput Struct Biotechnol J DOI: 10.1016/j.csbj.2021.01.042 sha: ad10a7bda92324cb8c20abf2997d0f422ec6b8b5 doc_id: 954885 cord_uid: 9ce6r8af

Critical patients and intensive care unit (ICU) patients account for most COVID-19 deaths. A reliable method is therefore needed to distinguish COVID-19 patients who may develop critical symptoms from other patients. In this retrospective study, we first evaluated the effects of 54 laboratory indicators on critical illness and death in 3044 COVID-19 patients from the Huoshenshan hospital in Wuhan, China. Second, we identified the eight most important prognostic indicators (neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, albumin, interleukin-6, lymphocyte absolute value and myoglobin) using the random forest algorithm, and found that the dynamic changes of these eight indicators differ significantly across clinical severities. Third, our study reveals that a model containing age and these eight prognostic indicators can accurately predict which patients may develop serious illness or death. Fourth, by combining the analysis of public lung single-cell and bulk transcriptome data, our results suggest that differences in critical illness rate and mortality between genders, more than those between ages, are likely to be attributable to some key genes (e.g. ACE2, TMPRSS2 and FURIN).
Taken together, we suggest that the prognostic model and first-hand clinical trial data generated in this study have important practical clinical significance for predicting and exploring the disease progression of COVID-19 patients.

sputum and nasopharyngeal swab twice in a row (sampling time interval of at least 24 hours). "Improved" means that the overall symptoms of the patient are significantly improved after treatment, but still do not meet the criteria for discharge. "Severity classification" represents the worst state (mild, severe, critical) of a patient during the entire hospitalization. "Mild" symptoms are described as follows: 1. Clinical symptoms are mild and no imaging findings of pneumonia are found. "Severe" symptoms should meet one of the following: 1. Shortness of breath, respiration rate (RR) ≥ 30 times/min. 2. In the resting state, the oxygen saturation is ≤ 93%. 3. Arterial partial pressure of oxygen (PaO2)/oxygen concentration (FiO2) ≤ 300 mmHg (1 mmHg = 0.133 kPa). "Critical" symptoms should meet one of the following: 1. Respiratory failure and the need for mechanical ventilation. 2. Shock. 3. Complicated with other organ failure, with ICU monitoring and treatment required.

The single-cell data of 8 normal tissues came from the Gene Expression Omnibus (GEO) database (GSE122960) [10]. The bulk transcriptome data of normal tissues adjacent to lung cancer were derived from The Cancer Genome Atlas (TCGA), and the standardized fragments per kilobase per million mapped reads (FPKM) expression data of these samples were obtained from the UCSC Xena database (https://xenabrowser.net/hub/). The Seurat 3.0 R package was applied for quality control, filtering, standardization and subsequent analysis [11]. The inclusion criteria for cell quality control were 200-5000 genes detected in a single cell (nFeature_RNA) and less than 5% mitochondrial gene expression. The LogNormalize method was used to normalize the expression matrix.
Cells were clustered using the top 2000 most variable genes with a resolution of 0.5, and visualized in a scatter plot by t-distributed stochastic neighbor embedding (t-SNE) [12].

Continuous variables are reported as median or mean values, and categorical variables as n (%). The two-tailed Wilcoxon rank sum test was applied to compare continuous variables between two groups. When there were three groups (mild, severe and critical), they were compared pairwise. The chi-square test was used to compare frequencies between groups, and the Fisher exact test was applied instead when the expected frequency of the chi-square test was less than 5. To avoid non-convergence in modeling, extreme values of each indicator were processed by the block method, in which values above the 99th percentile were replaced with the 99th percentile and values below the 1st percentile were replaced with the 1st percentile. The impact of various laboratory indicators on clinical critical illness and death was explored by logistic regression. The most important variables, which may contribute to the severity and mortality of COVID-19, were assessed by the random forest machine learning algorithm. Logistic regression was applied to model age, gender and the 8 important prognostic indicators. The receiver operating characteristic (ROC) curve was used to evaluate the quality of the model in the training set and validation set. All statistical analyses were performed using R software (version 3.5.3), and p-values under 0.05 were considered statistically significant.

In this study, we used the following process to find the laboratory indicators that could serve as prognostic factors for developing critical illness and death (Fig 1A). First, 3044 patients with complete clinical information were screened from 3059 COVID-19 patients (Fig 1A).
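The percentile capping described in the statistical methods can be illustrated with a minimal sketch. This is not the authors' code (their analysis was done in R); the function names `percentile` and `cap_extremes` are hypothetical, and the linear-interpolation percentile convention used here is an assumption, since the source does not state which convention was applied.

```python
from typing import List, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Percentile q (0-100) of pre-sorted data, using linear interpolation.

    The interpolation convention is an assumption; the source does not
    specify one.
    """
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def cap_extremes(values: Sequence[float],
                 low_q: float = 1.0,
                 high_q: float = 99.0) -> List[float]:
    """Replace values below the low_q percentile or above the high_q
    percentile with the corresponding percentile value."""
    ordered = sorted(values)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    # Clamp each value into [low, high], preserving the original order.
    return [min(max(v, low), high) for v in values]


if __name__ == "__main__":
    # Illustrative values only, not patient data.
    indicator = list(range(101))
    capped = cap_extremes(indicator)
    print(capped[0], capped[50], capped[100])
```

Capping rather than dropping outliers keeps every patient in the model while limiting the leverage of extreme laboratory values on the logistic regression fit.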
Second, statistics were compiled on the demographics, hospitalization, baseline characteristics, underlying diseases and complications of the 3044 COVID-19 patients, and laboratory indicators from patients with different severity and death outcomes were compared to determine the differences among them. Age and gender were then introduced as covariates to screen, category by category, for significant laboratory indicators that can affect critical illness and death. Next, the 29 significant laboratory indicators extracted from the above results were screened with the random forest algorithm, and a prognostic model containing 8 prognostic indicators and age was constructed and verified by ROC analysis. The dynamic changes of the 8 prognostic indicators were also evaluated. Finally, public single-cell and bulk transcriptome data were jointly analyzed to explore the underlying molecular mechanisms of different COVID-19 types across ages and genders.

As shown in Fig 1, the enrolled patients were aged from 10 to 100 years, and inpatients were most concentrated (n = 932, 30.62%) between 61 and 70 years old (Fig 1B). The other major age groups for hospitalization were 40 to 60 years and 70 to 80 years (Fig 1B). Our study indicated that the elderly constituted the main population among infected patients, which agrees with a previous report [13]. The numbers of cured and improved patients reached 2930 (96.25%) and 48 (1.41%), respectively, indicating that most patients had good treatment outcomes, but the number of deaths remained 66 (2.17%) (Fig 1C). Most patients were classified as mild or severe, while a minority progressed to the critical level (5.2%) (Fig 1D).
Similar to the proportion of critical patients, ICU patients accounted for 4.2%, and there was a great overlap between ICU patients and critical patients (Fig 1E). In terms of gender, the proportion of males was only 1.58% higher than that of females (Fig 1F), which seems to imply that there is no significant gender difference in the infection rate.

In this study, patients were assigned into three severity groups according to their Table 1 ). In mild and moderate patients, the ratio of men to women was equal, but significantly more males developed critical illness (p = 0.002, Table 1), and the mortality rate of males was also markedly higher than that of females (p = 0.013, Table 1). ICU treatment was significantly more frequent among critically ill patients than among mild and severe patients (p < 0.001), accounting for 64.78% (Table 1). Regarding clinical outcomes, there were 4 severe and 61 critical cases among the deceased patients (p < 0.001, Table 1). In addition, most deceased patients were critical patients and underwent ICU treatment. The median length of hospital stay of the 3044 COVID-19 patients was 13.0 days (IQR: 8.0-19.0), and patients with higher disease severity had a significantly longer hospital stay (severe: 14.0 days (IQR: 8.0-22.0); critical: 19.0 days (IQR: 11.0-32.0)) (Table 1).

In our work, the three most common comorbidities of the 3044 COVID-19 patients were hypertension, diabetes and coronary atherosclerosis, which is similar to reports from the United States and other countries [14]. Interestingly, we found that people with underlying diseases, such as hypertension, diabetes, coronary atherosclerosis, tumors, chronic obstructive pulmonary disease and abnormal renal function, were more likely to develop severe and critical illness (Table 1, p < 0.05).
Notably, the four comorbidities most commonly associated with clinical death were hypertension (p < 0.001), diabetes (p = 0.001), coronary heart disease (p = 0.001) and chronic obstructive pulmonary disease (p < 0.001). Logistic regression adjusted for age and gender revealed that hypertension (OR = 1.483, p = 0.014), diabetes (OR = 1.557, p = 0.016) and tumor (OR = 2.315, p = 0.022) were the main risk factors (Table 1 and Table S1). In addition, respiratory failure, acute respiratory distress syndrome and thrombocytopenia were the most common comorbidities among critical and dead patients, and could turn out to be potential lethal factors (Table 1 and Table S1). Notably, we found that 1369 (45.57%) infected patients had no comorbidity (p < 0.001, Table 1 and Table S1).

After sorting and summarizing the laboratory examination indicators of COVID-19 patients, 54 indicators were screened for subsequent analysis. We note that not all 3044 patients underwent all laboratory tests, and the specific number of people tested for each indicator is shown in the results below. Moreover, most patients underwent repeated tests of multiple indicators, but only the earliest test values were used for subsequent analysis. These 54 indicators were roughly divided into 8 categories: blood routine examination, electrolytes, liver function, urine tests, kidney function, heart function, blood coagulation indicators and others. The blood routine examination covered red blood cells and several types of white blood cells, including monocytes, lymphocytes, basophils, eosinophils and neutrophils. Electrolytes included sodium, potassium, chloride, phosphorus, serum magnesium and calcium. The indicators related to liver function mainly comprised alanine aminotransferase, aspartate aminotransferase, total protein, albumin, total bilirubin, direct bilirubin, total bile acid, indirect bilirubin, globulin, alkaline phosphatase, γ- Table S2 ).
Various electrolytes such as sodium, potassium and calcium differed markedly among patients of different severity grades (Table S2). Notably, aspartate aminotransferase and alkaline phosphatase, which are related to liver function, gradually increased as disease severity increased (p < 0.001, Table S2), and total protein and albumin decreased gradually (p < 0.001, Table 2), whereas alanine aminotransferase and indirect bilirubin showed no significant changes (p = 0.670, p = 0.340, Table S2). Urine test-related cystatin C increased with the progression of infection (0.9 vs. 0.96 vs. 1.08, p < 0.001, Table S2). Creatinine, a renal function indicator, also increased as the disease worsened (0.9 vs. 0.96 vs. 1.08, p < 0.001, Table S2). Furthermore, many other indicators were elevated more strongly in more severe disease, for example myoglobin and B-type natriuretic peptide, which are related to cardiac function, and fibrinogen and D-Dimer, whose increase indicates new blood coagulation. Notably, C-reactive protein, interleukin-6, procalcitonin and blood glucose showed a significant increase in severe and critical patients (p < 0.001, Table S2). Similar changes in all the above indicators were also observed between the ICU and non-ICU groups (Table S3).

By comparing laboratory indicators in patients with different survival outcomes and severity classifications, we found that their trends were not identical. Some common prognostic indicators such as neutrophils, interleukin-6, D-Dimer and C-reactive protein increased markedly, while lymphocytes, eosinophils, total protein and albumin decreased abnormally in both critical and dead patients (p < 0.001, Table S2 and Table S4). In addition, sodium, chloride, fibrinogen, globulin and other indicators differed significantly between severity grades (p < 0.001, Table S2) but not between clinical outcomes (p > 0.05, Table S4).
Taken together, we suggest that the disturbance of these indicators may be related to disease progression but not to survival rate, because they are not obvious lethal factors.

We further assessed the contribution of the 54 laboratory indicators above to clinical critical illness and survival. Table 2 ). In short, we found that 44 indicators (p < 0.05) might affect the patient's disease process and survival outcomes (Table 2). To further explore the main indicators in each category, a multi-factor stepwise regression analysis was carried out on each category, including blood routine test, electrolytes, urine tests, and function assessments of kidney, liver, heart and blood coagulation. In terms of cell ratio, we found that white blood cell count (OR = 1.063, p = 0.026) and neutrophil percentage (OR = 1.130, p < 0.001) were the main risk factors (Table 3) that can promote disease progression. At the absolute level, lymphocytes (OR = 0.215, p < 0.001) and neutrophils (OR = 1.415, p < 0.001) were the main prognostic factors, which can inhibit virus infection and increase inflammation, respectively (Table 3). At the electrolyte level, although differences were significant in potassium, sodium and magnesium (p < 0.001), the regression coefficient of the overall model was not significant (p = 0.090, Table 3). In renal function assessments, cystatin C (OR = 3.782, p < 0.001) and urine red blood cells (OR = 1.001, p = 0.006) were the main risk factors, while urea nitrogen (OR = 1.526, p < 0.001), creatinine (OR = 0.994, p < 0.001) and uric acid (OR = 0.991, p < 0.001) played more important roles in urine examination (Table 3). Among the liver function indexes, aspartate aminotransferase, albumin, alkaline phosphatase, etc. were the main prognostic factors. Meanwhile, we found that alanine aminotransferase (OR = 0.990, p = 0.01), globulin (OR = 1.042, p = 0.041) and other indicators were significantly independent of age and gender (Table 3 vs. Table 2).
These findings suggest that these indicators are greatly affected by age and gender, or that they are not sufficiently robust as prognostic indicators. Among cardiac-related indicators, lactate dehydrogenase (OR = 1.013, p < 0.001), myoglobin (OR = 1.012, p < 0.001) and creatine kinase (OR = 0.995, p < 0.001) were the main prognostic factors (Table 3), while prothrombin time (OR = 1.187, p < 0.001), fibrinogen (OR = 1.277, p < 0.0141), thrombin time (OR = 1.099, p < 0.017) and D-Dimer (OR = 1.255, p < 0.001) were the main prognostic factors among coagulation parameters (Table 5). The remaining index items, including C-reactive protein (OR = 1.019, p < 0.001), interleukin-6 (OR = 1.014, p < 0.001), procalcitonin (OR = 2.362, p < 0.001) and blood glucose (OR = 1.227, p < 0.001), were all significantly associated with clinical critical illness and death (Table 3). Finally, we found that 34 laboratory indicators could serve as independent prognostic signatures (Table 3).

Monitoring such a large number of laboratory indicators is a heavy burden for clinicians during antiviral therapy. Therefore, 29 significant prognostic indicators obtained from 491 patients were selected and tested together in the random forest machine learning algorithm. Interestingly, they could clearly distinguish the event group (Critical or ICU or Dead) from the non-event group according to the principal component results (Fig 2A). Moreover, 10-fold cross-validation repeated 5 times was used to select the best number of variables to include in the model, and eight variables turned out to be the most suitable because they gave the smallest error (Fig 2B).
Combined with the importance of indicators given by the random forest algorithm (Fig 2C), eight indicators were selected as the final prognostic indicators, namely neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, albumin, interleukin-6, lymphocyte absolute value and myoglobin, because of their significant differences across disease grades, survival outcomes and ICU grouping (Fig S1). More importantly, these 8 prognostic indicators showed stable and significant differences between event and non-event patients at different times. In particular, neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, myoglobin and interleukin-6 in patients with composite endpoint events were always higher than in the non-event group (Fig 2D). In contrast, the protective factors, lymphocyte count and albumin, were always lower in patients with a composite endpoint event than in those without. Hence, these 8 laboratory indicators can indeed be treated as prognostic factors, because they differed significantly between the critical and mild groups from onset until long before the end event (Fig 2D).

To assist doctors in identifying patients who are more likely to become critically ill or even die, we combined age and the eight prognostic indicators presented above to establish a clinically usable prognostic model. Before that, patients were divided into normal and abnormal groups according to each of the eight prognostic indicators, and the cumulative event rate was compared between the two groups. The cumulative event rates in the abnormal risk factor groups were significantly higher than in the normal groups (p < 0.001, Fig 3A~3F). Similarly, the cumulative event rates in the abnormal protective factor groups were significantly higher than in the normal groups (p < 0.001, Fig 3G and p < 0.001, Fig 3H).
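As a rough illustration of how a cumulative event rate can be computed for one group of patients, the sketch below uses a Kaplan-Meier-type estimator. This is an assumption: the source does not state which estimator or software routine produced Fig 3A~3H, and the function name `cumulative_event_curve`, the censoring convention and the toy follow-up times are all hypothetical.

```python
from typing import List, Sequence, Tuple


def cumulative_event_curve(times: Sequence[float],
                           events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, cumulative event probability) at each distinct event time.

    times  : follow-up time per patient (e.g. days since admission)
    events : True if the composite endpoint occurred at that time,
             False if follow-up ended without the event (censored)
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0  # Kaplan-Meier estimate of remaining event-free
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Multiply by the conditional probability of staying event-free.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, 1.0 - survival))
        # Patients with an event or censored at t leave the risk set.
        at_risk -= n_leaving
    return curve


if __name__ == "__main__":
    # Toy follow-up data, not taken from the study.
    days = [2, 3, 3, 5, 8]
    had_event = [True, False, True, True, False]
    for day, rate in cumulative_event_curve(days, had_event):
        print(day, round(rate, 3))
```

Running the same function separately on the normal and abnormal groups of one indicator gives two curves that can be compared, which is the kind of contrast the figure panels describe.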
During the analysis, we noticed that, compared with the various test indicators, age and gender were always important factors leading to critical illness and death. Therefore, we compared three key genes related to virus infection, ACE2, TMPRSS2 and FURIN, at both the single-cell and whole-tissue levels across ages and genders [15] [16] [17]. As shown in Table S5 ). ACE2 was mainly expressed in alveolar epithelial type 2 (AT2) cells, basal cells and tuft cells (Fig 4C). In our analysis, ACE2 expression was detected in only a limited number of cells. In the AT2 subpopulation, fewer than 1% of cells expressed ACE2, and at a low level. In addition, TMPRSS2 and FURIN, which can promote the binding of SARS-CoV-2 to ACE2 [15] [16] [17], were also mainly expressed in AT2 cells (Fig 4D~4E). The average expression level of ACE2 and the percentage of cells expressing it were higher in the old group (age: 55 years, 63 years, 57 years) than in the young group (age: 21 years, 22 years, 29 years) (Fig 4F). We also found that a higher percentage of cells in older patients expressed TMPRSS2 and FURIN (Fig 4F), even though their expression levels in the older group were lower than in the young group. The results of the single-cell analysis partially explain the differences in infection rate and mortality between people of different ages, which may be related not only to the expression of ACE2, TMPRSS2 and FURIN, but also to the number of cells expressing these three key genes. However, in the analysis of the bulk transcriptome from TCGA, there was no significant difference in ACE2, TMPRSS2 and FURIN between people older than 60 years and those younger than 60 years (Fig S2). We inferred that the tissue-wide transcriptome data masked subtle age-related differences in key molecules, which also highlights the advantages of single-cell transcriptome data.
Considering gender, the infection rate of men was only 1.58% higher than that of women (Fig 1), while the proportion of men who developed critical illness and death was almost twice that of women (Table 1), which might also be attributable to a higher ACE2 expression level [18]. To verify whether there is such a difference, we compared the expression levels and cell ratios of ACE2, TMPRSS2 and FURIN between genders based on single-cell RNA sequencing. As shown in Fig 4G, compared with females, the expression levels of ACE2, TMPRSS2 and FURIN were higher in males, and the proportions of cells expressing ACE2 and TMPRSS2 were also higher. In the bulk transcriptome, the expression of TMPRSS2 in males was significantly higher than in females, which was consistent with the single-cell results. No significant differences in ACE2 or FURIN between genders were detected. These results illustrate, at the molecular level, possible reasons for the higher infection and mortality rates in males.

It is urgent and necessary to find effective methods to predict and monitor the

In our study, the most susceptible people were concentrated between 51 and 70 years old, and the average age of critically ill and deceased patients was higher (Table 1). Consistent with previous extensive reports, the elderly are the main population affected by COVID-19 [13]. In addition, some studies have reported that the difference in infection and death between elderly and young people may be related to the expression of the ACE2 receptor in the body [19]. We further examined this conclusion at the single-cell and bulk-tissue levels. Although at the single-cell level the elderly seem to express more ACE2 and to have a higher proportion of ACE2-expressing cells than the young (Fig 4F), no significant difference was observed at the bulk transcriptome level (Fig S2). The same uncertainty appears in the results for the TMPRSS2 and FURIN genes (Fig S2).
Therefore, although this may partially explain the difference in critical illness and death between elderly and young people, sufficient evidence for a full explanation is still lacking. Accordingly, we propose that the difference in outcome between elderly and young people is related to weakened immunity and the more numerous underlying diseases that accompany increasing age. Our research has shown that patients with underlying diseases have a higher critical illness ratio and mortality (Table 1, p < 0.001 and Table S1). From the perspective of gender, compared with the 39.7% female infection rate in the United States [14], the infection rate of men is only 1.58 percentage points higher than that of women (Fig 1F). However, the proportion of men who develop critical illness and death is double that of women (Table 1 and Table S1), which may also be attributed to higher ACE2 expression [18]. It is more plausible that differences in the expression levels of ACE2, TMPRSS2 and FURIN underlie the different critical illness and death rates between genders rather than between ages, because this is supported by the results at both the single-cell and bulk transcriptome levels (Fig 4G and Fig S2). In terms of comorbidities, hypertension, diabetes, coronary heart disease, tumors and chronic obstructive pulmonary disease are associated with critical illness and adverse clinical outcomes (Table 1 and Table S1), which is consistent with previous reports [14, 20]. In terms of treatment outcome, the numbers of patients cured and improved reached 2644 and 281 respectively (Fig 1B) and the number of deaths was only 66, accounting for 2.17% (Fig 1B). This shows that timely and active medical treatment is essential to curb the mortality of COVID-19 patients.

COVID-19 patients with death and critical illness present a large number of disorders in laboratory indicators.
Compared with mild and severe patients, critical and dead patients have obvious abnormalities in the immune system, kidney function, liver function, heart function, blood coagulation indexes and inflammatory factors. To facilitate clinical monitoring and supervision, we further identified the 8 most important prognostic indicators. Among them, the increase in neutrophil percentage, neutrophil absolute value and interleukin-6 indicates that inflammation and inflammatory storm are among the main manifestations of critical symptoms and death. The decrease in lymphocyte absolute value represents reduced immunity in critically ill and dead patients, resulting in an inability to defend against the combined infection and sepsis indicated by the increase in procalcitonin and C-reactive protein. In addition, the disorders of myoglobin and albumin are related to impaired heart and liver function, suggesting that many important organs of critical and dead patients have been damaged. Compared with mild and severe patients, the values of these eight indicators remain abnormal before the end event, and patients with abnormal indicators are more likely to have composite endpoint events (Fig 2D and Fig 3). The model constructed from the age and gender of the patients combined with the eight indicators has good accuracy in the training set and validation set, with AUC values of 0.878 and 0.897, respectively (Fig 3I~J). Finally, we established a clinically useful regression equation and nomogram to predict the risk probability of developing critical illness and death (Fig 3K). We believe that this model has practical significance for the prediction and monitoring of COVID-19 patients.
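To make the two quantities in this paragraph concrete, the following minimal sketch shows how a logistic regression equation turns indicator values into a risk probability and how an AUC can be computed from predicted risks. It is not the authors' model: the function names `predict_risk` and `roc_auc` are hypothetical, the coefficients in the usage example are placeholders rather than values from the paper, and the pairwise (rank-based) AUC formula is one standard way to compute the statistic, not necessarily the routine the authors used.

```python
import math
from typing import Sequence


def predict_risk(intercept: float,
                 coefficients: Sequence[float],
                 features: Sequence[float]) -> float:
    """Logistic model: risk = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    if len(coefficients) != len(features):
        raise ValueError("one coefficient per feature is required")
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen event patient has a
    higher score than a randomly chosen non-event patient (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Placeholder coefficients and inputs, NOT the published equation.
    risk = predict_risk(-1.0, [2.0], [0.5])
    print(risk)
    # Toy labels (1 = composite endpoint event) and predicted risks.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC near 0.5 means the predicted risks do not separate event from non-event patients, while values approaching 1 mean nearly every event patient is ranked above every non-event patient.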
In summary, we performed a statistical analysis of 3044 COVID-19 patients to find the eight most important prognostic factors (neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, albumin, interleukin-6, lymphocyte absolute value and myoglobin) of COVID-19, and constructed a model to predict the prognosis of patients, which is of great significance for the management and monitoring of COVID-19. Moreover, by reanalyzing public lung single-cell and bulk transcriptome data, we suggest that differences in critical illness rate and mortality between genders, more than those between ages, are likely to be attributable to differences in key genes such as ACE2, TMPRSS2 and FURIN.

However, our study still has several limitations. First, our prediction model still lacks effective validation in external cohorts, so the model may be overfitted to a certain extent. In further studies, it will therefore be important to integrate multiple cohorts for modeling and validation. Only a model that withstands validation in multiple external cohorts can be applied to a complex COVID-19 patient population. Second, our model construction is mainly based on the random forest and logistic regression algorithms, which may have certain deficiencies. A comprehensive comparison of the results of multiple algorithms would deepen our understanding of the key prognostic factors and models. Support vector machines, Adaptive Boosting, neural networks and other artificial intelligence algorithms are good choices. Third, our current study was not able to analyze the genetic background of these differences in laboratory indicators, mainly because we lack genetic information for these patients. We believe that the integration of laboratory indicators with genetic information such as genome, transcriptome and proteome data will greatly broaden our understanding of COVID-19.
We also hope that future studies will pay attention to the output of these data.

We thank our colleagues for their suggestions and criticisms on the manuscript.

We declare that we have no conflict of interest.

-2) has spread throughout many countries around the world. According to statistics up to July 28th, COVID-19 has infected more than 16 million people worldwide and caused more than 650,000 deaths. The number of confirmed cases and deaths in some regions may be even higher than the available data suggest due to multiple factors, such as detection methods, medical resources, and political and cultural differences [1]. COVID-19 has resulted in considerable morbidity and mortality. Thus, early rapid diagnosis, surveillance, risk assessment and medical resource management are essential in the prevention and control of the epidemic before protective vaccines are applied clinically. Generally, the COVID-19 patients with the highest mortality rate are the critically ill and ICU patients, but they account for only a small proportion of hospitalized patients [2]. Therefore, it is crucial to establish reliable methods to distinguish patients with a high probability of developing severe symptoms from others. Although combined nucleic acid detection, antibody detection and computed tomography (CT) imaging can effectively diagnose COVID-19, the severity and prognosis of patients cannot be predicted [3, 4]. Previous studies have reported multiple organ dysfunctions and prognostic markers including neutrophils and interleukin-6 [5].
In addition, mMyoglobin and C-reactive protein are related to myocardial injuries [6] . Alkaline phosphatase is related to liver damage whereaswhile increasing D-DD dimer indicated causes new impaired blood coagulation [7, 8] . These findings show reveal the potential of laboratory indexes serving as indicators of COVID-19 severity. Our previous work also indicated Previously, our group reported that the level of tumor biomarkers iwas associated with the severity of patients and could predict clinical outcomes [9] . However, most of the published data was based on a relatively small sample size, which may reduce the statistical reliability. Therefore, in order to establish a risk stratification model forto finding key laboratory indicators, which predict the disease progression and clinical outcomes of COVID-19 patients, we here further conduct the study based on a large sample sizea research based on a large sample size is conducted. Hospital, and. Written written informed consent was obtained from each patient. According to the world health organization (WHO)/International Severe Acute Respiratory and Emerging Infection Consortium case record form for severe acute respiratory infections, baselines of participants, epidemiological and clinical manifestations, laboratory findings and outcome data were extracted from electronic medical records. Major basic information (i.e., age, sex, the highest historical classification, preliminary diagnosis, discharge diagnosis and discharge conditions) were collected except for patients' personal information (e.g., name and identificationID) and comorbidities were also included in clinical symptoms. Median (IQR) or average value indicates continuous variables, and the n (%) shows categorical variables. Wilcoxon rank sum test was applied to compare the differences of continuous variables between the groups. 
The chi-square test was used to compare frequencies between groups, and Fisher's exact test was applied instead when the expected frequency of the chi-square test was less than 5. To avoid non-convergence in modeling, the extreme values of each indicator were capped (the block method): values above the 99th percentile were replaced with the 99th percentile, and values below the 1st percentile were replaced with the 1st percentile. The impact of the laboratory indicators on critical illness and death was explored by logistic regression. The most important variables, which may contribute to the severity and mortality of COVID-19, were assessed with the random forest machine learning algorithm. Logistic regression was applied to model age, gender and the 8 important prognostic indicators. Receiver operating characteristic (ROC) curves were used to evaluate the quality of the model in the training set and the validation set. The single-cell data of 8 normal tissues came from the Gene Expression Omnibus (GEO) database (GSE122960) [10]. The bulk transcriptome data of normal tissues adjacent to lung cancer were derived from The Cancer Genome Atlas (TCGA), and the standardized fragments per kilobase per million mapped reads (FPKM) expression data of these samples were obtained from the UCSC Xena database (https://xenabrowser.net/hub/). The Seurat 3.0 R package was applied for quality control, filtering, standardization and subsequent analysis [11]. The inclusion criteria for cell quality control were 200-5000 genes detected in a single cell (nFeature_RNA) and less than 5% mitochondrial gene expression. The LogNormalize method was used to normalize the expression matrix. Cells were clustered using the top 2000 most variable genes with a resolution of 0.5.
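The cell quality-control criteria and normalization described above can be sketched as follows. This is a minimal illustration in plain Python rather than the Seurat code used in the study: the function names `passes_qc` and `log_normalize` are hypothetical, the inclusive treatment of the 200-5000 bounds and the `MT-` prefix used to recognize human mitochondrial genes are assumptions, and the scale factor of 10,000 is Seurat's documented default for LogNormalize, which the text does not state explicitly.

```python
import math


def passes_qc(counts, min_genes=200, max_genes=5000, max_mito_frac=0.05):
    """Apply the QC thresholds from the text to one cell.

    counts: dict mapping gene symbol -> raw count for a single cell.
    Assumption: bounds on detected genes are inclusive, and mitochondrial
    genes are those whose symbol starts with "MT-".
    """
    total = sum(counts.values())
    if total == 0:
        return False
    n_genes = sum(1 for c in counts.values() if c > 0)  # nFeature_RNA
    mito = sum(c for g, c in counts.items() if g.startswith("MT-"))
    return min_genes <= n_genes <= max_genes and mito / total < max_mito_frac


def log_normalize(counts, scale_factor=10_000):
    """LogNormalize-style transform: log1p(count / cell total * scale factor)."""
    total = sum(counts.values())
    return {g: math.log1p(c / total * scale_factor) for g, c in counts.items()}


# Toy cell (made-up counts); thresholds are lowered so the tiny example passes.
cell = {"ACE2": 1, "TMPRSS2": 3, "FURIN": 6, "MT-CO1": 0}
keep = passes_qc(cell, min_genes=1, max_genes=10)
normalized = log_normalize(cell)
```

With real data the default thresholds would be applied to every cell before normalization, and only cells for which `passes_qc` returns `True` would be carried forward to clustering.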
Cell scatter plots were generated by t-distributed stochastic neighbor embedding (t-SNE) [12]. All statistical analyses were performed using R software (version 3.5.3), and p-values under 0.05 were considered statistically significant. Wilcoxon rank sum tests were two-tailed; when there were three groups (mild, severe and critical), the groups were compared with each other. In this study, we used the following process to find laboratory indicators that could serve as prognostic factors for developing critical illness and death (Fig 1A). First, 3044 patients with complete clinical information were screened from 3059 COVID-19 patients (Fig 1A). Second, the demographics, hospitalization, baseline characteristics, underlying diseases and complications of the 3044 COVID-19 patients were summarized, and the laboratory indicators of patients with different severity and death outcomes were compared to determine the differences among them. Age and gender were then [...] most patients had good treatment outcomes, while the number of deaths was 66 (2.17%) (Fig 1C). Most patients were classified as mild or severe, while a minority (5.2%) progressed to the critical level according to the classification (Fig 1D).
Similar to the proportion of critical patients, ICU patients accounted for 4.2%, and there was a great overlap between ICU patients and critical patients (Fig 1E). In terms of gender, the proportion of males was only 1.58% higher than that of females (Fig 1F), which seems to imply that the infection rates of men and women are similar. In this study, patients were assigned into three severity groups according to their [...] (Table 1 and Table S1). Besides, respiratory failure, acute respiratory distress syndrome and thrombocytopenia were the most common comorbidities among critical and dead patients, and could be potential lethal factors (Table 1 and Table S1). We also found that 1369 (45.57%) infected patients had no comorbidity. Patients without underlying diseases were less likely to be infected and had better disease classification and survival outcomes (p < 0.0001, Table 1 and Table S1). After sorting and summarizing the laboratory examination indicators of COVID-19 patients [...] (Table S2). Remarkably, aspartate aminotransferase and alkaline phosphatase, which are related to liver function, increased gradually with disease severity (p < 0.001, Table S2), and total protein and albumin decreased gradually (p < 0.0001, Table S2), whereas alanine aminotransferase and indirect bilirubin showed no significant change (p = 0.670 and p = 0.340, Table S2). Urine test-related cystatin C increased as the infection progressed (0.9 vs. 0.96 vs. 1.08, p < 0.0001, Table S2). Creatinine, a renal function indicator, also increased as the disease worsened (0.9 vs. 0.96 vs. 1.08, p < 0.0001, Table S2). Furthermore, many other indicators were more strongly elevated in more severe disease.
For example, myoglobin and B-type natriuretic peptide are related to cardiac function, while increases in fibrinogen and D-dimer indicate new blood coagulation. Notably, C-reactive protein, interleukin-6, procalcitonin and blood glucose showed a significant increase in severe and critical patients (p < 0.0001, Table S2). Similar changes in all the above indicators were also observed between the ICU and non-ICU groups (Table S3). By comparing laboratory indicators in patients with different survival outcomes and severity classifications, we found that their trends were not identical. Most indicators were related to disease progression, based on the comparison of laboratory indexes from patients with different grades and clinical outcomes. Some common prognostic indicators such as neutrophils, interleukin-6, D-dimer and C-reactive protein increased markedly, while lymphocytes, eosinophils, total protein and albumin decreased abnormally in both critical and dead patients (p < 0.0001, Table S2 and Table S4). In addition, sodium, chloride, fibrinogen, globulin and other indicators differed significantly between severity grades (p < 0.0001, Table S2) but not between clinical outcomes (p > 0.05, Table S4). Taken together, we suggest that the disturbance of these indicators may be related to disease progression but not to survival, because they are not obvious lethal factors. We [...] (Table 2). In short, we found 44 indicators in total (p < 0.05) that might affect the patient's disease course and survival outcome (Table 2). To further explore the main indicators in each category, we carried out a multi-factor stepwise regression analysis on each category, including blood routine tests, electrolytes, urine tests, and function assessments of kidney, liver, heart and blood coagulation.
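As a reminder of how the odds ratios (OR) reported for these regressions are conventionally read, the standard logistic regression relationship is sketched below. The symbols $p$, $\beta_j$ and $x_j$ are introduced here only for illustration and are not notation taken from the original analysis.

```latex
\[
\log\frac{p}{1-p} \;=\; \beta_0 + \sum_{j=1}^{k} \beta_j x_j ,
\qquad
\mathrm{OR}_j \;=\; e^{\beta_j}
\]
```

Here $p$ is the probability of the event (critical illness or death) and $x_j$ is the value of the $j$-th indicator. Under this form, an OR above 1 means that the odds of the event rise with each unit increase in that indicator when the other variables are held fixed, and an OR below 1 means that the odds fall.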
In terms of cell ratio, we found that white blood cell count (OR = 1.0633, p = 0.02657) and neutrophil percentage (OR = 1.130297, p < 0.0001) were the main risk factors (Table 3). At the electrolyte level, although the differences in potassium, sodium and magnesium were significant (p < 0.0001), the regression coefficient of the overall model was not significant (p = 0.090898, Table 3). In renal function assessments, cystatin C (OR = [...] (Table 3 vs. Table 2). These findings suggest that these indicators are either greatly affected by age and gender or not sufficiently robust as prognostic indicators. Among cardiac-related indicators, lactate dehydrogenase (OR = 1.01326, p < 0.0001), myoglobin (OR = 1.0122, p < 0.0001) and creatine kinase (OR = 0.995, p < 0.0001) were the main prognostic factors (Table 3). In total, we found that 34 laboratory indicators screened by category could serve as independent prognostic indicators (Table 3). Monitoring such a large number of laboratory indicators is a heavy burden for clinicians during antiviral therapy. Therefore, 2930 significant prognostic indicators obtained from 491 patients were selected and tested together with the random forest machine learning algorithm. Interestingly, they could clearly distinguish the event group (critical, ICU or dead) from the non-event group according to the principal component results (Fig 2A). Moreover, 5 repeats of 10-fold cross-validation were used to select the best number of variables to include in the model, and eight turned out to be the most suitable because it gave the smallest error (Fig 2B).
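The resampling scheme behind "5 times 10-fold cross-validation" can be sketched as below. This shows only how the folds are formed and how an error is averaged over all 50 train/test splits; it is not the random forest implementation used in the study, which the text does not name. The function names and the `error_fn` callback (which would fit a model with a given number of variables on the training indices and return its error on the test indices) are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Fold i takes every n_splits-th shuffled index, starting at i.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [j for k, f in enumerate(folds) if k != fold for j in f]
            yield repeat, fold, train_idx, test_idx


def mean_cv_error(n_samples, error_fn, **kwargs):
    """Average error_fn(train_idx, test_idx) over all repeats and folds."""
    errors = [error_fn(train, test)
              for _, _, train, test in repeated_kfold(n_samples, **kwargs)]
    return sum(errors) / len(errors)


# 491 patients, 5 repeats x 10 folds -> 50 train/test splits.
splits = list(repeated_kfold(491))
```

To choose the number of variables, `mean_cv_error` would be evaluated once per candidate model size and the size with the smallest mean error retained, which is the criterion the text describes for arriving at eight.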
Combined with the importance of indicators given by the random forest algorithm (Fig 2C), eight indicators were selected as the final prognostic indicators, namely neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, albumin, interleukin-6, lymphocyte absolute value and myoglobin, owing to their significant differences across disease grades, survival outcomes and ICU grouping (Fig S1). The dynamic changes of these eight prognostic indicators were also analyzed over time, using the multiple test values recorded for these patients since hospitalization. Neutrophil percentage, procalcitonin, neutrophil absolute value, C-reactive protein, myoglobin and interleukin-6 were always higher in patients with composite endpoint events than in the non-event group (Fig 2D). Conversely, the protective factors, lymphocyte absolute value and albumin, were always lower in patients with a composite endpoint event than in those without. Hence, these 8 prognostic indicators can indeed be treated as prognostic factors, because they differed significantly between groups of patients throughout the hospitalization period (Fig 2D). To assist doctors in identifying patients who are more likely to become critically ill or even die, we combined age, gender and the eight prognostic indicators presented above to establish a clinically available prognostic model.
Prior to that, patients were divided into normal and abnormal groups according to each of the eight prognostic indicators, and the cumulative event rate was compared between the two groups. The cumulative event rates in the abnormal risk factor groups were significantly higher than in the normal groups (p < 0.0001, Fig 3A~3F). Similarly, the cumulative event rates in the abnormal protective factor groups were significantly higher than in the normal groups (p < 0.001, Fig 3G and [...] The model was further verified in another independent cohort of 170 patients, in which its AUC reached 0.897 (95% CI: 0.787-1.000) (Fig 3J). These results demonstrate that the model is robust. Finally, a nomogram based on all 611 patients was drawn to facilitate clinical use and to explain the relationship between the model variables. It allows the risk score of each model indicator to be looked up conveniently, and the risk of disease progression and death to be predicted from the sum of the scores. During the analysis, we noticed that, compared with the various test indicators, age and gender were always important factors leading to critical illness and death. Therefore, we compared three key genes, ACE2, TMPRSS2 and FURIN, which have been reported to be related to virus infection, at both the single-cell and whole-tissue levels across ages and genders [15-17]. [...] Table S53). ACE2 was mainly expressed in alveolar epithelial type 2 (AT2) cells, basal cells and tuft cells (Fig 4C). ACE2 expression was detected in only a limited number of cells: in the AT2 subpopulation, less than 1% of cells expressed ACE2, and at a low level. In addition, TMPRSS2 and FURIN, which have been reported to promote the binding of SARS-CoV-2 to ACE2 [15-17], were also mainly expressed in AT2 cells (Fig 4D~4E).
The average expression level of ACE2 and the percentage of cells expressing it were higher in the old group (ages 55, 63 and 57 years) than in the young group (ages 21, 22 and 29 years) (Fig 4F). We also found that a higher percentage of cells in the older group expressed TMPRSS2 and FURIN (Fig 4F), even though their expression levels in the older group were lower than in the young group. The results of the single-cell analysis partially explain the differences in infection rate and mortality between people of different ages, which may be related not only to the expression of ACE2, TMPRSS2 and FURIN but also to the number of cells expressing these three key genes. However, in the analysis of the bulk transcriptome from TCGA, there was no significant difference in ACE2, TMPRSS2 or FURIN between people older and younger than 60 years (Fig S2). We inferred that the tissue-wide transcriptome data masked subtle differences in key molecules between ages, which also highlights the advantage of single-cell transcriptome data. Regarding gender, the infection rate of men was only 1.58% higher than that of women (Fig 1), while the proportion of men who developed critical illness and death was almost twice that of women (Table 1), which might also be attributable to a higher ACE2 expression level [18]. To verify whether such a difference exists, we compared the expression levels and cell ratios of ACE2, TMPRSS2 and FURIN between genders based on single-cell RNA sequencing. As shown in Fig 4G, the expression levels of ACE2, TMPRSS2 and FURIN were higher in males than in females, and the proportions of cells expressing ACE2 and TMPRSS2 were also higher. In the bulk transcriptome, the expression of TMPRSS2 in males was significantly higher than in females, which was consistent with the single-cell results. No significant difference in ACE2 or FURIN between genders was detected.
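The two quantities compared throughout this passage, the proportion of cells expressing a gene and its expression level, can diverge, as seen for TMPRSS2 and FURIN in the older group. The sketch below is a hedged illustration of how they might be computed per group. The function name `expression_summary` is hypothetical, treating any value above zero as "expressing" is an assumption, and the text does not state whether average expression was taken over all cells or only over expressing cells, so both are returned. The toy values are invented purely to show the effect and are not data from the study.

```python
def expression_summary(values):
    """Summarize one gene across the cells of one group.

    values: normalized expression of the gene, one number per cell.
    Assumption: a cell "expresses" the gene when its value is above zero.
    """
    n_cells = len(values)
    expressing = [v for v in values if v > 0]
    return {
        "pct_expressing": 100.0 * len(expressing) / n_cells if n_cells else 0.0,
        "mean_all_cells": sum(values) / n_cells if n_cells else 0.0,
        "mean_expressing_cells": (
            sum(expressing) / len(expressing) if expressing else 0.0
        ),
    }


# Invented toy values: more expressing cells in one group,
# but a lower level per expressing cell.
group_a = expression_summary([0.0, 0.0, 1.0, 1.0])
group_b = expression_summary([0.0, 0.0, 0.0, 3.0])
```

In this toy case `group_a` has the larger share of expressing cells while `group_b` has the higher level among the cells that do express the gene, which mirrors the kind of pattern described above for the older and younger groups.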
Based on these results, the reasons for the higher infection and mortality rates in males were illustrated at the molecular level. It is urgent and necessary to find effective methods to predict and monitor the [...] In our study, the most susceptible people were concentrated between 51 and 70 years old, and the average age of critically ill and deceased patients was higher (Table 1). Consistent with previous extensive reports, the elderly are the main population affected by COVID-19 [13]. In addition, some studies have reported that the difference in infection and death between elderly and young people may be related to the expression of the ACE2 receptor [19]. To further verify this conclusion, we examined both single-cell and bulk tissue data. Although the elderly seem to express more ACE2 at the single-cell level, with a higher proportion of ACE2-expressing cells than the young (Fig 4F), no significant difference was observed at the bulk transcriptome level (Fig S2). The same uncertainty appears in the results for TMPRSS2 and FURIN (Fig S2). Therefore, although this may partially explain the difference in critical illness and death between elderly and young people, sufficient evidence for a full explanation is still lacking. Accordingly, we propose that the difference in outcome between elderly and young people is related to the weakened immunity and the more extensive underlying diseases that accompany increasing age. Our research shows that patients with underlying diseases have a higher infection rate, critical illness ratio and mortality (p < 0.0001, Table 1 and Table S1). From the perspective of gender, compared with the 39.7% female infection rate in the United States [14], the infection rate of men in our cohort is only 1.58 percentage points higher than that of women (Fig 1F).
However, the proportion of men who develop critical illness and death is double that of women (Table 1 and Table S1), which may also be attributed to higher ACE2 expression [18]. We consider it more likely that differences in the expression of ACE2, TMPRSS2 and FURIN underlie the different critical illness and death rates between genders rather than between ages, because this is more strongly supported by the results at the single-cell and bulk transcriptome levels (Fig 4G and Fig S2). In terms of comorbidities, patients with hypertension, diabetes, coronary heart disease, tumors or chronic obstructive pulmonary disease are prone to critical illness and poor clinical outcomes (Table 1 and Table S1), which is consistent with previous reports [14, 20, 21]. In terms of treatment outcome, the numbers of patients cured and improved reached 2644 and 281, respectively (Fig 1B), and the number of deaths was only 66, accounting for 2.17% (Fig 1B). This shows that timely and active medical treatment is essential to curb the mortality of [...] (Fig 2D and Fig 3). The model constructed from the age and gender of the patients and the eight indicators has good accuracy in the training set and the validation set, with AUC values of 0.8785 and 0.897, respectively (Fig 3I~J). Finally, we constructed a clinically useful regression equation and nomogram to predict the probability of developing critical illness and death (Fig 3K). We believe that this model has practical significance for the prediction and monitoring of COVID-19 patients.
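For readers interpreting the AUC values quoted for the training and validation sets, the sketch below shows one standard way to compute an AUC from predicted risks: the probability that a randomly chosen event patient receives a higher score than a randomly chosen non-event patient, with ties counted as one half. It is an illustration only, not the ROC code used in the study; the function name `roc_auc` is hypothetical and the labels and scores are invented toy values.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of an event case > score of a non-event case).

    labels: 1 for event (critical illness or death), 0 for non-event.
    scores: predicted risk for each patient, in the same order as labels.
    Tied scores contribute 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both event and non-event cases are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented toy example: 2 non-event and 2 event patients.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of event and non-event patients.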
In the end, our study still has some limitations. First, the prediction model we built still lacks validation in external cohorts, which may result in over-fitting to a certain extent. Therefore, in later studies, it will be important to integrate multiple cohorts for modeling and validation. Only a model that can withstand validation in multiple external cohorts can be applied to a complex COVID-19 patient population. Second, our model construction is mainly based on the random forest and logistic regression algorithms, which may have certain deficiencies. A comprehensive comparison of the results of multiple algorithms would deepen our understanding of the key prognostic factors and models. Support vector machines, adaptive boosting, neural networks and other artificial intelligence algorithms are good candidates. Third, our current study has not been able to analyze the genetic background of these differences in laboratory test indicators, mainly because we lack the genetic information of these patients. We believe that integrating laboratory indicators with genetic information such as genome, transcriptome and proteome data will greatly broaden our understanding of COVID-19. We also hope that future studies will pay attention to the output of these data. Supplementary: Table S1. Effect of different underlying diseases on clinical death and critical illness outcomes after correcting for age and gender. Table S2.
Differences of laboratory test indexes between mild, severe and critical patients. Table S3. Differences of laboratory test indexes between ICU and non-ICU patients. Table S4. Differences of laboratory tests between surviving and dying patients.
The research was approved by the Research Ethics Commission of Huoshenshan hospital. Written informed consent was obtained from each patient.
[1] Spread of SARS-CoV-2 in the Icelandic Population.
[2] Real estimates of mortality following COVID-19 infection. Lancet Infect Dis.
[3] Combination of RT-qPCR testing and clinical features for diagnosis of COVID-19 facilitates management of SARS-CoV-2 outbreak.
[4] Sensitivity of Chest CT for COVID-19: Comparison to RT-PCR.
[5] Clinical and Immune Features of Hospitalized Pediatric Patients With Coronavirus Disease 2019 (COVID-19) in Wuhan, China. JAMA Netw Open.
[6] Analysis of the clinical characteristics, drug treatments and prognoses of 136 patients with coronavirus disease 2019.
[7] Characteristics of liver function in patients with SARS-CoV-2 and chronic HBV co-infection. Clinical Gastroenterology and Hepatology.
[8] COVID-19 and its implications for thrombosis and anticoagulation.
[9] Tumor biomarkers predict clinical outcome of COVID-19 patients.
[10] Single-Cell Transcriptomic Analysis of Human Lung Provides Insights into the Pathobiology of Pulmonary Fibrosis.
[11] Comprehensive Integration of Single-Cell Data.
[12] Dimensionality reduction for visualizing single-cell data using UMAP.
[13] Age-dependent effects in the transmission and control of COVID-19 epidemics.
[14] Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area.
[15] SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor.
[16] Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein.
[17] SARS-CoV-2 and bat RaTG13 spike glycoprotein structures inform on virus evolution and furin-cleavage effects.
[18] Sex differences in immune responses to SARS-CoV-2 that underlie disease outcomes. medRxiv.
[19] Nasal Gene Expression of Angiotensin-Converting Enzyme 2 in Children and Adults.
[20] Neutrophil-to-lymphocyte ratio predicts critical illness patients with 2019 coronavirus disease in the early stage.
[21] OpenSAFELY: factors associated with COVID-19 death in 17 million patients.
We thank our colleagues for their suggestions and criticisms on the manuscript. This work is original and has not been published in any journal or platform. All authors have read and approved the manuscript. We declare that we have no conflict of interest.