key: cord-0872059-4nlyht63
authors: Ulgen, Ayse; Cetin, Sirin; Cetin, Meryem; Sivgin, Hakan; Li, Wentian
title: A Composite Ranking of Risk Factors for COVID-19 Time-To-Event Data from a Turkish Cohort
date: 2022-04-09
journal: Comput Biol Chem
DOI: 10.1016/j.compbiolchem.2022.107681
sha: 0921b30bb56d2d4cb36ef1db7aeef2d67f95558c
doc_id: 872059
cord_uid: 4nlyht63

Having a complete and reliable list of risk factors from routine laboratory blood tests for COVID-19 disease severity and mortality is important for patient care and hospital management. It is common to use meta-analysis to combine analysis results from different studies to make the conclusions more reproducible. In this paper, we propose instead to run multiple analyses on the same set of data to produce a more robust list of risk factors. With our time-to-event survival data, the standard survival analysis was extended in three directions. The first is to extend from tests and their corresponding p-values to machine learning and its prediction performance. The second is to extend from single-variable to multiple-variable analysis. The third is to expand from analyzing time-to-decease data, with death as the event of interest, to analyzing time-to-hospital-release data, treating early recovery as a meaningful event as well. This extension of the types of analyses leads to ten ranking lists. We conclude that 20 out of 30 factors are reliably associated with faster death or faster recovery. Considering correlation among factors, and as evidenced by stepwise variable selection in random survival forest, 10 to 15 factors seem to be enough to achieve the optimal prognostic performance. Our final list of risk factors contains calcium, white blood cell and neutrophil counts, urea and creatine, d-dimer, red cell distribution widths, age, ferritin, glucose, lactate dehydrogenase, lymphocytes, basophils, anemia-related factors (hemoglobin, hematocrit, mean corpuscular hemoglobin concentration), sodium, potassium, eosinophils, and aspartate aminotransferase.

The purpose of meta-analysis is to combine information from different datasets in multiple studies in order to provide robust and consistent conclusions on the effect of a factor on an outcome (Borenstein et al., 2009; Haidich, 2010). However, it is less common to attempt multiple analyses on the same set of data to extract robust information. For example, in investigating risk factors for COVID-19 infection susceptibility, disease severity (e.g. hospitalization), and mortality (Guan et al., 2020), the most common approach is to carry out one test to obtain a p-value (J Tian et al., 2020; X Liu et al., 2020; Rosenthal et al., 2020; Williamson et al., 2020; Fadl et al., 2021; T Tian et al., 2020). The test can be a t-test/Wilcoxon test for continuous variables, or a χ2-test/Fisher's exact test for discrete variables, comparing the value of a risk factor between samples in two groups. Alternatively, uni-variable logistic regression can be used, and the null hypothesis that the regression coefficient is zero is tested. For time-to-event (survival) data, the time from hospitalization to death of a COVID-19 patient can be used to examine which factors contribute to a faster death (a higher per-unit-time rate of death), which can be done by Cox regression (the proportional hazards model). The null hypothesis of a zero regression coefficient is then tested. In this paper, we extended the above common practice in three directions, on a COVID-19 patient time-to-event dataset.
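Before describing the three extensions, the standard single-variable Cox practice just described can be sketched in code. The following is a minimal Python illustration using the lifelines package, not the authors' actual pipeline; the column names `time`, `event`, and the factor names are hypothetical placeholders.

```python
import pandas as pd
from lifelines import CoxPHFitter

def univariable_cox(df: pd.DataFrame, factors, time_col="time", event_col="event"):
    """Fit one Cox proportional-hazards model per factor and collect the
    coefficient, hazard ratio exp(coef), and p-value for testing beta = 0."""
    rows = []
    for f in factors:
        cph = CoxPHFitter()
        cph.fit(df[[time_col, event_col, f]], duration_col=time_col, event_col=event_col)
        s = cph.summary.loc[f]
        rows.append({"factor": f, "coef": s["coef"], "HR": s["exp(coef)"], "p": s["p"]})
    return pd.DataFrame(rows).sort_values("p")

# Hypothetical usage: df holds the follow-up time, an event flag (1 = death,
# 0 = censored), and the laboratory measurements as numeric columns.
# ranking = univariable_cox(df, ["urea", "calcium", "neutrophils"])
```

Sorting by p-value reproduces the single-variable, p-value based ranking that the extensions below are designed to complement.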
The first is to use both p-value based measures and prediction-performance based ones. Although the p-value based approach has advantages (the meaning is easy to understand and the result is easy to report), it also has problems. The p-value itself, often treated as the "gold standard of statistical validity", may not be so golden (Nuzzo, 2014). A change in the true prior probability of a signal will change the prediction error even when the p-value stays the same (Nuzzo, 2014; Colquhoun, 2017). There have already been proposals of alternatives to the p-value for evaluating variables (Lu and Ishwaran, 2018; Halsey, 2019). The second extension is to use multiple-variable methods as well as single-variable ones. Single-variable methods evaluate a variable in isolation from other variables. As a result, they cannot detect the conditional importance of a variable or its interaction with other variables. Inconsistency, or larger confidence intervals, between different datasets concerning the importance of a factor may well reflect contextual heterogeneity in other variables (Ghahramani et al., 2020; Kermali et al., 2020). Multi-variable statistical/machine learning models (Strobl et al., 2008) are ideal to supplement the single-variable analyses, but they are less mainstream in COVID-19 data analysis, and existing applications are mostly focused on prediction and diagnosis (Li et al., 2020; McCoy et al., 2021; Bennett et al., 2021; Karthikeyan et al., 2021; Aljameel et al., 2021; Mahdavi et al., 2021; Cornelius et al., 2021; Kocadagli et al., 2022; Malik et al., 2022), not on variable evaluation as in our work. There are more applications of machine learning and artificial intelligence in the context of COVID-19, ranging from drug repurposing to medical assistance (Zeng et al., 2020; Deepthi et al., 2021; Chen et al., 2022; Alafif et al., 2021; Khan et al., 2021; Piccaialli et al., 2021; Dogan et al., 2021; Majeed and Hwang, 2022). On the other hand, over-interpreting specific multivariable models (Yan et al., 2020) may not be good practice, as such models may not be applicable to other data (Barish et al., 2021). The third extension is specific to time-to-event data. For our inpatient data, we have not only the admission-to-death times of deceased patients, but also a larger number of patients who completely recovered and were released. In a typical survival analysis, these patients' time-to-release information would be treated as right-censored data. However, by treating release as the main event of interest, extra information can be obtained (Cetin et al., 2021b,c). With these three extensions, we are able to carry out ten analyses on the same set of data (see the sketch below for how the two outcomes are derived from one event-status variable). Combining these analyses to get a composite ranking of risk factors for faster death or faster recovery from COVID-19, we effectively run a meta-analysis on a single dataset. Besides the standard test for unit hazard ratio from Cox regression (thus p-value based), we also used single-variable random survival forest (Breiman, 2001; Ishwaran et al., 2008) (thus prediction-performance based), as well as multi-variable random survival forest and variable selection in regularized regression (thus multiple-variable based). Two different performance measures (discordance index and integrated Brier score) are used in the single-variable random survival forest. All these analyses were then repeated for the time-to-release data, resulting in ten different sets of results.
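To make the third extension concrete, the sketch below shows one way to derive the two cause-specific outcomes from a single event-status column. It is a minimal pandas illustration; the column names and the 1/2 coding (death/release) are assumptions that mirror the paper's description, not the authors' code.

```python
import pandas as pd

def make_cause_specific(df: pd.DataFrame, status_col="delta"):
    """Return two copies of the data: one for time-to-death (release treated
    as censored) and one for time-to-release (death treated as censored)."""
    death = df.copy()
    death["event"] = (death[status_col] == 1).astype(int)      # 1 = death, 0 = censored
    release = df.copy()
    release["event"] = (release[status_col] == 2).astype(int)  # 1 = release, 0 = censored
    return death, release

# Each of the five analyses (Cox p-value, single-variable RSF D-index and IBS,
# full-model RSF VIM, LASSO selection order) is then run on both copies,
# giving the ten ranking lists described above.
```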
We will show that our composite ranking over ten runs results in a more robust list of risk factors for COVID-19 severity than the p-value based method alone, and that our selected factors are fully validated by being consistent with the literature.

Assessing independent variables' contribution to the time-to-event dependent variable using four methods: Our data is of the following form: 30 independent variables, $x = \{x_1, x_2, \cdots, x_{30}\}$, and one dependent variable $y = \{T, \delta\}$, where $T$ is the time to the event and the event status $\delta$ can take the value 1 (event 1, death), 2 (event 2, release from hospital), or 0 (right-censored). We do not have any samples with right-censored time. Handling multiple-event data usually falls within the scope of "competing risks" survival analysis (example: death from heart attack versus death from stroke), though in our case the two events, risk and benefit, do not point in the same direction. How to analyze our special type of data will be discussed in the Results section. Here we simply reset $\delta = 2$ to $\delta = 0$ and summarize the existing methods. All four of our analyses aim at finding which independent variables are associated with the dependent variable. We describe each method below.

(1) In Cox regression, instead of modelling and fitting the survival function, an arbitrary baseline hazard function is assumed, and a change in an independent variable is assumed to lead to a constant multiple of the whole baseline curve (the proportional hazards hypothesis):
$$ h(t, x_i) = h_0(t)\, e^{\beta_i x_i}, $$
where $h_0(t)$ is the arbitrary baseline hazard function and $h(t, x_i)$ is the hazard function with one of the independent variables present. The p-value $p_i$ for testing $\beta_i = 0$ measures the statistical significance of the contribution of $x_i$ to the time-to-event data, and $e^{\beta_i}$ measures the hazard ratio per unit change in $x_i$.

(2) Random survival forest (RSF) (Ishwaran et al., 2008) is an extension of Random Forest (RF) (Breiman, 2001) to handle time-to-event data. RF/RSF constructs many decision trees (therefore "forest") that separate the dependent variable values into two daughter nodes as much as possible (for an introduction to RF see, for example, (Louppe, 2014; Fernández-Delgado et al., 2014)). Once the splitting of nodes in a tree is done, the cumulative hazard function (CHF) in each external node can be estimated by the Nelson-Aalen estimator (Ishwaran et al., 2008):
$$ H_{node}(t) = \sum_{t_{j,node} \le t} \frac{d_{j,node}}{Y_{j,node}}, $$
where $d_{j,node}$ is the number of events and $Y_{j,node}$ the number of samples at risk at time $t_{j,node}$ in that node. The CHF for a sample with independent variable value $x_i$ is obtained by tracing the path on the tree, according to the $x_i$ value, to reach an external node (Ishwaran et al., 2008). The predicted outcome ("mortality") is this CHF integrated (summed) over the available time-to-event points in the sample:
$$ M(x_i) = \sum_{j=1}^{n_T} H(t_j \mid x_i), $$
where $n_T$ is the number of samples with $\delta = 1$ and the sum is over these samples' event times.

(3) RF/RSF also provides a performance-based evaluation of individual variables when all variables $x$ are used in RF/RSF (Breiman, 2001; Ishwaran et al., 2008). We refer to the permutation-based measure of variable importance as VIM (other choices are external-node-purity based, such as the Gini index (Nembrini et al., 2018)). In this approach, a variable is removed (version 1) or randomized by permuting its values among samples (version 2), and the RF/RSF performance before and after the removal/permutation is compared: the loss in performance defines the importance of that variable. This approach provides only a ranking of variables; no cutoff of the list to separate important from unimportant variables is given.
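Before turning to the fourth method, here is a minimal sketch of the prediction-performance measures used with the RSF (methods (2) and (3) style). It uses the Python scikit-survival package rather than the R randomForestSRC implementation cited by the authors, and the hyperparameters, split, and time grid are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored, integrated_brier_score

def rsf_cindex_and_ibs(X, time, event, seed=0):
    """Fit a random survival forest and report the concordance index and the
    integrated Brier score (IBS) on a held-out split (illustrative settings)."""
    y = Surv.from_arrays(event=event.astype(bool), time=time)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)

    rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=5, random_state=seed)
    rsf.fit(X_tr, y_tr)

    # Concordance: a higher predicted risk score should correspond to an earlier event.
    cindex = concordance_index_censored(y_te["event"], y_te["time"], rsf.predict(X_te))[0]

    # IBS over a time grid kept inside both the training and test follow-up ranges.
    lo = max(np.percentile(y_tr["time"], 10), np.percentile(y_te["time"], 10))
    hi = min(np.percentile(y_tr["time"], 90), np.percentile(y_te["time"], 90))
    times = np.linspace(lo, hi, 15)
    surv = np.vstack([fn(times) for fn in rsf.predict_survival_function(X_te)])
    ibs = integrated_brier_score(y_tr, y_te, surv, times)
    return cindex, ibs
```

Running this once per variable (with X reduced to a single column) gives the single-variable D-index and IBS rankings; running it with all 30 variables gives the full-model performance used later as the reference level.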
(4) The regularized (Hastie et al., 2009) regression LASSO (Tibshirani, 1996, 1997) for Cox regression with multiple variables is:
$$ \hat{\beta} = \arg\max_{\beta} \left[ \log PL(\beta) - \lambda \sum_{j=1}^{30} |\beta_j| \right], $$
where $\log PL(\beta)$ is the Cox partial log-likelihood of the multiple-variable model and $\lambda$ controls the strength of the penalty. As $\lambda$ decreases, variables enter the model (obtain non-zero coefficients) one after another, and this entry order provides a ranking of the variables.

Survival analyses with two types of mutually exclusive events: Our time-to-event data contains two events of very different nature. As early as the 1980s, it was suggested to apply a standard survival analysis program in a type-specific (cause-specific) manner by switching the second event type to a right-censored event (i.e., converting $\delta = 2$ to $\delta = 0$) (Kalbfleisch and Prentice, 1980). The cause-specific (cs) hazard is defined as:
$$ h^{cs}_k(t, x_i) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ \delta = k \mid T \ge t, x_i)}{\Delta t}, \qquad k = 1, 2. $$
Although this definition clearly allows a cause-specific hazard for the second event, most analyses concern only the main event, death, and do not consider the second event type. We argue here that running a survival analysis on the time-to-release data, while masking death as right-censored, actually provides valuable information. If $h^{cs}_2(x_i + 1)/h^{cs}_2(x_i) > 1$, then a higher value of $x_i$ leads to a faster release of a patient, and variable $i$ is protective. For that same variable, we would expect $h^{cs}_1(x_i + 1)/h^{cs}_1(x_i) < 1$ for the death event. The deceased samples should be more important in estimating $h^{cs}_1$; similarly, the released samples should be more useful for estimating $h^{cs}_2(x_i)$. Therefore, we believe running the same survival analysis twice, to find variables contributing to the two cause-specific hazard ratios, uses the dataset more fully. In competing-risks survival analysis there is another, population-level, approach called the Fine-Gray model (Fine and Gray, 1999), which defines the following subdistribution hazard:
$$ h^{sd}_1(t, x_i) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ \delta = 1 \mid \{T \ge t\} \cup \{T < t,\ \delta \ne 1\}, x_i)}{\Delta t}, $$
where the occurrence of a type-2 event has no impact on the calculation of $h^{sd}_1$, the underlying assumption being that a sample experiencing a type-2 event may still go on to experience a type-1 event. However, this scenario is impossible in our example because our two event types, release from hospital and death from COVID-19, are mutually exclusive. Using cause-specific survival analysis for mutually exclusive events is explicitly recommended in (Allison, 2014).

Composite ranking of factors based on five measures (time-to-death): The factors are ranked five times using five measures: the D-index from single-variable RSF, the IBS from single-variable RSF, the p-value for testing unit hazard ratio by single-variable Cox regression, the permutation-based VIM from the full-model RSF, and the variable selection order in LASSO (see the regularization-path sketch below). We found that for the full-model RSF, because of the random components in a run (both samples and variables are randomly chosen in forming a tree), the ranking order may change from run to run, especially for low-ranking factors. Therefore, we ran the full-model RSF 100 times and used the average of these runs. Table 1 shows the results of these five measures. The rank for each method is given within parentheses, and the factors are listed by the overall rank order. It can be seen that some factors are consistently ranked high no matter which method is used (e.g., urea is ranked no.1 by all five methods; neutrophils and calcium are ranked within the top 5 by all five methods). Other factors are not ranked consistently among methods: e.g., LDH and AST are ranked high by the three RSF-based methods but lower by Cox and LASSO; sodium is ranked lower by the single-variable RSF measures. The consistency is reassuring in that the very top factors would be discovered by any method. The inconsistency, or variation, provides the basis for our approach of combining multiple rankings to improve the robustness of the result, even for one dataset (a toy rank-aggregation sketch is also given below).
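To illustrate how a LASSO-penalized Cox model yields a variable-selection order (the fifth measure above), the sketch below traces the regularization path and records when each coefficient first becomes non-zero. It uses scikit-survival's CoxnetSurvivalAnalysis as a stand-in; the actual software and settings used by the authors may differ.

```python
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis

def lasso_entry_order(X, y, feature_names):
    """Rank variables by the order in which they enter the LASSO-Cox path as
    the penalty decreases (the path is assumed to run from strong to weak penalty)."""
    model = CoxnetSurvivalAnalysis(l1_ratio=1.0, n_alphas=200, alpha_min_ratio=0.01)
    model.fit(X, y)                 # y: structured array with 'event' and 'time' fields
    coefs = model.coef_             # shape (n_features, n_alphas)
    nonzero = coefs != 0
    # First path index at which each variable becomes non-zero; variables that
    # never enter the model are pushed to the end of the ordering.
    entry = np.where(nonzero.any(axis=1), nonzero.argmax(axis=1), coefs.shape[1])
    order = np.argsort(entry)
    return [feature_names[i] for i in order]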
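The composite ranking itself is a simple rank aggregation: average the per-measure ranks, then rank the averages. A toy sketch with made-up numbers (chosen only to echo the pattern described for Table 1, not taken from it):

```python
import pandas as pd

# Toy per-measure ranks (1 = most important). The real input would be the
# ranking lists from the time-to-death and time-to-release analyses over all
# 30 factors; only three factors are shown here for illustration.
ranks = pd.DataFrame(
    {
        "rsf_dindex": [1, 3, 2],
        "rsf_ibs": [1, 2, 3],
        "cox_p": [1, 4, 2],
        "rsf_vim": [1, 2, 4],
        "lasso_order": [1, 3, 5],
    },
    index=["urea", "calcium", "neutrophils"],
)

# Average the per-measure ranks, then rank the averages for the composite order.
composite = ranks.mean(axis=1).rank(method="min").astype(int).sort_values()
print(composite)
```

The same aggregation applied to all ten lists (five per outcome) gives the composite-10 rank discussed next.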
Composite ranking of factors based on five measures (time-to-release): Table 2 shows the same five rankings (with the factors ordered by the composite ranking obtained from the five) with time-to-release as the dependent variable. Interestingly, the different methods do not share a common top factor: the three RSF-based methods pick ferritin as the top factor, whereas Cox regression picks age and LASSO picks calcium. When the overall rank in Table 2 is combined with the overall rank in Table 1, we obtain a composite rank using 10 ranking lists (last column of Table 2). To compare the rankings obtained from time-to-death and time-to-release, we plot the two 1/ranks versus the composite-10 rank in Fig.1(A). Generally speaking, the two are consistent. Where the two are less consistent, a "bubble" is formed. We mark a factor's name in black if it is ranked higher (by more than 3) in the time-to-death analyses, in blue if it is ranked higher in the time-to-release analysis, and in grey if the ranks are similar. We can see that neutrophils, urea, glucose, creatine, lymphocytes, etc. are ranked higher in the time-to-death runs, whereas RDW.SD, age, ferritin, HGB, etc. are ranked higher in the time-to-release runs.

Correlated factors: Because collinearity in a regression model is a concern, we examine variable pairs that are correlated with each other. We use plots of the raw data, the correlation coefficient, the p-value for testing zero correlation, and the R^2 from regression to determine the correlation status. Several issues are considered: (1) we check the deceased samples and the survived samples separately; (2) we check both the original data and the log-transformed data, and for the same reason the correlation coefficient and the test of zero correlation are based on the non-parametric Spearman method; (3) when the visual impression of the scatter plot is used as a guide, the R^2 from regression is a better quantity to rely on than, e.g., the p-value for testing zero correlation. We found six pairs of strongly correlated variables: urea and creatine, neutrophils and white blood cell, AST and ALT, RDW.CV and RDW.SD, HGB and HCT, MCV and MCH. The measures of their correlation are shown in Table 3. The lower-ranked factor of a pair is marked with an asterisk in Tables 1 and 2. There are more correlated variable pairs than those shown in Table 3, e.g., sodium and chloride; we use a more conservative R^2 cutoff and require the correlation to hold in both the survived and the deceased groups.

Estimation of the number of independent factors that achieve the optimal prediction performance: Although we have the composite ranking orders of factors (both composite-5 in Tables 1 and 2 and composite-10 in the last column of Table 2), there is still the question of where to cut the list to select the relevant factors. Towards this, we use stepwise variable selection, similar to that in regression (e.g. (Ryan, 2008)), but in the framework of RSF. We first need to clarify the meaning of adding or removing a variable. There are two versions: the first is to actually add a variable starting from an empty model, or to remove a variable starting from the full model. The second version is to keep all variables, but instead of an empty model with no variables, the null model has all variables value-shuffled. Adding the first variable then means retaining its values while keeping the other variables scrambled.
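This second, "virtual addition" version can be rendered schematically as a greedy loop over a value-shuffled null model; the selection criterion it implements is formalized in the next paragraph. The sketch below is an illustrative Python rendering, not the authors' code: `fit_and_score` is a hypothetical callback that fits an RSF on the modified design matrix and returns an error such as the OOB IBS (for instance, a wrapper around the earlier RSF sketch), and it is expected to close over the outcome data.

```python
import numpy as np

def shuffled(col, rng):
    """Return a value-shuffled copy of one column (the operation R in the text)."""
    out = np.array(col, copy=True)
    rng.shuffle(out)
    return out

def forward_select_by_shuffling(X, fit_and_score, n_steps, seed=0):
    """Greedy 'virtual addition': every column is kept, but only the selected
    variables keep their real values; all others stay value-scrambled."""
    rng = np.random.default_rng(seed)
    n_samples, n_vars = X.shape
    selected = []
    # Null model: all variables value-shuffled.
    base = np.column_stack([shuffled(X[:, j], rng) for j in range(n_vars)])
    for _ in range(n_steps):
        best_j, best_err = None, np.inf
        for j in range(n_vars):
            if j in selected:
                continue
            trial = base.copy()
            trial[:, j] = X[:, j]          # candidate j gets its real values back
            for k in selected:
                trial[:, k] = X[:, k]      # previously selected variables stay real
            err = fit_and_score(trial)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        print(f"stage {len(selected)}: variable {best_j} added, error = {best_err:.4f}")
    return selected
```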
The difference between the two versions can be written as (up to step $i$ of a variable selection):
$$ \text{version 1: } f(x_{[1]}, x_{[2]}, \ldots, x_{[i]}), \qquad \text{version 2: } f\big(x_{[1]}, \ldots, x_{[i]}, R(x_{[i+1]}), \ldots, R(x_{[30]})\big), $$
where the subscript $[i]$ refers to the $i$-th variable selected and the operation $R$ refers to random value-shuffling. Figure 2 shows the OOB error IBS as a function of variable selection with this selection criterion (at stage $i$):
$$ x_{[i]} = \arg\min_{j \notin \{[1], \ldots, [i-1]\}} \mathrm{IBS}\Big( f\big(x_{[1]}, \ldots, x_{[i-1]}, x_j, R(\text{all remaining variables})\big) \Big), $$
where the index $j$ goes through all variables not already selected at stages $1, 2, \cdots, i-1$, and all the remaining unselected variables stay value-scrambled. We ran the stepwise variable selection three times each for the time-to-death RSF and the time-to-release RSF. The horizontal line in Fig.2 is the mean IBS from 500 runs of the full model, and the dashed lines are one standard deviation from the mean (for the time-to-death data, IBS = 0.0945 ± 0.000514; for the time-to-release data, IBS = 0.0753 ± 0.000246). These runs allow Fig.2 to be used to roughly estimate the number of (independent) factors needed to achieve the best performance. This estimate will not be precise because different runs exhibit variation; however, 10~15 factors (10 from Fig.2(A) and 15 from Fig.2(B)) should be the correct range.

Final selection of the list of risk factors: Because the factor order in Fig.2 changes from run to run, we instead use the pre-determined rank order (column 1 in Tables 1 and 2) to add variables for the error curves in Fig.3. It has been suggested in the literature that IBS is a better measure than the D-index because it is more practical (Longato et al., 2020) and more quantitative (Kattan and Gerds, 2018). We also prefer the virtual variable addition over the actual one because the change in the error curves in Fig.3 is smoother. Therefore, we use the black curve in Fig.3 to decide which factors to select, with the result listed in Table 4. In Table 4, we mark the factors selected according to Fig.3(B), Fig.3(D), and two other error curves not shown. The last two columns in Table 4 mark factors which would have been selected by the standard p-value approach (p-value < 0.001). The p-value threshold could be 0.01, 0.005, or 0.001, but we consider 0.001 to be a good choice (see, for example, (Colquhoun, 2017; Ioannidis, 2018)). If we choose factors that contribute to better prediction performance, 21-22 factors would be selected, 17-18 of which are independent. These are the factors above the horizontal partition line, except potassium. We may consider chloride a borderline choice as it ranks last in our list, and monocytes a borderline possibility. Note that the Cox regression p-value based selection (at p = 0.001) would select MPV and PDW, which are not on our list. The partition of factors in Table 4 into selected and not-selected ones should be considered somewhat ad hoc. We are more confident in the factors selected on the basis of multiple lines of evidence, and less confident about the factors not selected. It does not imply that the factors not selected here are never relevant to COVID-19 severity/mortality, only that in our data they do not have the fullest level of evidence support. In this work, we have carried out a careful analysis of a single dataset to determine which factors contribute to either faster death or faster release from hospital. By a literature search, we found that all our selected factors have been addressed in other studies and shown to be significantly associated with COVID-19 disease severity.
Here is a partial summary: although leukocyte count may be normal or decreased in COVID-19 patients, the incidence of leukocytosis (higher WBC) increases in ICU patients and has been reported to be associated with COVID-19 severity (Huang et al., 2020; Sayad et al., 2020). There is also an investigation of the causal role of WBC in disease severity by Mendelian randomization (Sun et al., 2021). Note that leukocytosis could also be related to bacterial infections, corticosteroid use, age, or other causes unrelated to COVID-19 severity. Our "within-sample meta-analysis" leads to a more robust conclusion concerning risk factors for COVID-19 severity. Judged against independent studies published by other groups, our cutoff in Table 4 may correspond to a false discovery rate (FP/(FP+TP)) close to zero, i.e., a precision (positive predictive value, TP/(FP+TP)) close to 100%. We may still underestimate the number of risk factors, because the number of variables selected by Figs.2 and 3 represents the number of variables sufficient to achieve optimal performance, and adding more correlated/collinear variables would not decrease the error further. Some factors mentioned in other publications as potentially relevant were not selected by our composite ranking in Table 4; this might reflect two situations. The first is that these factors are only weakly associated with COVID-19 severity/mortality and we are not able to detect them. The second possibility is that no two datasets can be identical, and there is always a chance that heterogeneity, unmeasured covariates, or variation due to finite sample sizes leads to inconsistent conclusions. In the calculation of variable VIM in RSF, one variable is removed/shuffled from the full model, and the most important variable is listed first. It is tempting to extend this into a stepwise variable selection procedure by removing/shuffling the important variables at each stage (currently we remove/shuffle the least important variable first). However, this procedure does not produce an error curve that reaches a plateau (result not shown). Therefore, we did not use this procedure for deciding the cutoff of our variable list. In conclusion, with growing interest in using blood tests for prognosis in COVID-19 patients (COMBAT Consortium, 2022), we have carried out a careful analysis of blood-test factors affecting COVID-19 hazard in time-to-event data from a Turkish cohort. We use multiple measures and methods to rank factors: some are traditional single-variable tests (the Cox proportional hazards model), whereas others are multiple-variable machine learning techniques (random survival forest). A novel choice in our composite ranking is to utilize the shorter hospital stays of released patients to discover protective factors; a protective factor should also be a risk factor when its value changes in the opposite direction. This approach complements the use of the time-to-death information of deceased patients. All of our top choices in the composite ranking list are confirmed by at least one other study of risk factors for COVID-19 severity and/or mortality, corresponding to a 100% positive predictive value.

Acknowledgements. WL thanks the support from the Robert S Boas Center for Genomics and Human Genetics.
Table 4 note: if a factor is selected either by being among the top-ranking variables according to the corresponding error curve, or by having a p-value < 0.001 in Cox regression, it is marked by +.

References:
Prognostic value of leukocytosis and lymphopenia for coronavirus disease severity
The proposal to lower P value thresholds to .005
Random survival forests for R, Rnews
Random survival forest
SARS-CoV-2 infects the human kidney and drives fibrosis in kidney organoids
Gender differences in patients with COVID-19: focus on severity and mortality
Increased levels of lactate dehydrogenase and hypertension are associated with severe illness of COVID-19
The Statistical Analysis of Failure Time Data
Machine learning based clinical decision support system for early COVID-19 mortality prediction
The index of prediction accuracy: an intuitive measure useful for evaluating risk prediction models
Serum ferritin as a predictive biomarker in COVID-19. A systematic review, meta-analysis and meta-regression analysis
The role of biomarkers in diagnosis of COVID-19 - A systematic review
Applications of artificial intelligence in COVID-19 pandemic: a comprehensive review
Clinical prognosis evaluation of COVID-19 patients: An interpretable hybrid machine learning approach
Robustness of Random Forest-based gene selection methods
A meta-analysis: coronary artery calcium score and COVID-19 prognosis
Identifying novel factors associated with COVID-19 transmission and fatality using the machine learning approach
The-more-the-better and the-less-the-better
Beyond standard pipeline and p < 0.05 in pathway enrichment analyses
Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis
Serum ferritin as an independent risk factor for severity in COVID-19 patients
Eosinophil responses during COVID-19 infections and coronavirus vaccination
Elevated fasting blood glucose within the first week of hospitalization was associated with progression to severe illness of COVID-19 in patients with preexisting diabetes: A multicenter observational study
Electrolyte imbalances in patients with severe coronavirus disease 2019 (COVID-19)

Highlights:
*We intended an "intra-data" meta-analysis by extending the standard single-variable, p-value based approach in three directions: also using prediction performance, also using multi-variable approaches, and, specifically for time-to-event data, also analyzing the time-to-release outcome.
*We determined that 20 factors commonly measured in a blood test are associated with COVID-19 patients' severity (or mildness) and/or mortality. Our selected factors are consistent with other reports in the literature.