key: cord-0888083-zfqz4k8a
authors: Park, Hyung; Tarpey, Thaddeus; Liu, Mengling; Goldfeld, Keith; Wu, Yinxiang; Wu, Danni; Li, Yi; Zhang, Jinchun; Ganguly, Dipyaman; Ray, Yogiraj; Paul, Shekhar Ranjan; Bhattacharya, Prasun; Belov, Artur; Huang, Yin; Villa, Carlos; Forshee, Richard; Verdun, Nicole C.; Yoon, Hyun ah; Agarwal, Anup; Simonovich, Ventura Alejandro; Scibona, Paula; Burgos Pratx, Leandro; Belloso, Waldo; Avendaño-Solá, Cristina; Bar, Katharine J; Duarte, Rafael F.; Hsue, Priscilla Y.; Luetkemeyer, Anne F.; Meyfroidt, Geert; Nicola, André M.; Mukherjee, Aparna; Ortigoza, Mila B.; Pirofski, Liise-anne; Rijnders, Bart J. A.; Troxel, Andrea; Antman, Elliott M.; Petkova, Eva
title: Development and Validation of a Treatment Benefit Index to Identify Hospitalized Patients With COVID-19 Who May Benefit From Convalescent Plasma
date: 2022-01-25
journal: JAMA Netw Open
DOI: 10.1001/jamanetworkopen.2021.47375
sha: 61ec969dda4850c876abad0bf7974864ce38564f
doc_id: 888083
cord_uid: zfqz4k8a

IMPORTANCE: Identifying which patients with COVID-19 are likely to benefit from COVID-19 convalescent plasma (CCP) treatment may have a large public health impact.

OBJECTIVE: To develop an index for predicting the expected relative treatment benefit from CCP compared with treatment without CCP for patients hospitalized for COVID-19 using patients’ baseline characteristics.

DESIGN, SETTING, AND PARTICIPANTS: This prognostic study used data from the COMPILE study, ie, a meta-analysis of pooled individual patient data from 8 randomized clinical trials (RCTs) evaluating CCP vs control in adults hospitalized for COVID-19 who were not receiving mechanical ventilation at randomization. A combination of baseline characteristics, termed the treatment benefit index (TBI), was developed based on 2287 patients in COMPILE using a proportional odds model, with baseline characteristics selected via cross-validation. The TBI was externally validated on 4 external data sets: the Expanded Access Program (1896 participants), a study conducted under Emergency Use Authorization (210 participants), and 2 RCTs (with 80 and 309 participants).

EXPOSURE: Receipt of CCP.

MAIN OUTCOMES AND MEASURES: World Health Organization (WHO) 11-point ordinal COVID-19 clinical status scale and 2 derivatives of it (ie, WHO score of 7-10, indicating mechanical ventilation to death, and WHO score of 10, indicating death) at day 14 and day 28 after randomization. Day 14 WHO 11-point ordinal scale was used as the primary outcome to develop the TBI.

RESULTS: A total of 2287 patients were included in the derivation cohort, with a mean (SD) age of 60.3 (15.2) years and 815 (35.6%) women. The TBI provided a continuous gradation of benefit, and, for clinical utility, it was operationalized into groups of expected large clinical benefit (B1; 629 participants in the derivation cohort [27.5%]), moderate benefit (B2; 953 [41.7%]), and potential harm or no benefit (B3; 705 [30.8%]). Patients with preexisting conditions (diabetes, cardiovascular and pulmonary diseases), with blood type A or AB, and at an early COVID-19 stage (low baseline WHO scores) were expected to benefit most, while those without preexisting conditions and at more advanced stages of COVID-19 could potentially be harmed. In the derivation cohort, odds ratios for worse outcome, where smaller odds ratios indicate larger benefit from CCP, were 0.69 (95% credible interval [CrI], 0.48-1.06) for B1, 0.82 (95% CrI, 0.61-1.11) for B2, and 1.58 (95% CrI, 1.14-2.17) for B3. Testing on 4 external datasets supported the validation of the derived TBIs.

CONCLUSIONS AND RELEVANCE: The findings of this study suggest that the CCP TBI is a simple tool that can quantify the relative benefit from CCP treatment for an individual patient hospitalized with COVID-19 that can be used to guide treatment recommendations.
The TBI precision medicine approach could be especially helpful in a pandemic.

The development of the treatment benefit index uses the regression framework of single index models with multiple links (SIMML) 1,2 and is based on an extension for the case of ordinal categorical outcomes 3. It uses proportional odds models (POM) with cumulative probabilities and can capture nonlinear associations between a combination of pretreatment covariates ("single-index") and an ordinal treatment outcome from two or more alternative therapies (here CCP and control). The combination of covariates is referred to, in the statistical literature, as a "single-index"; in the context of precision medicine it is a linear combination of patient pretreatment characteristics, which is termed the treatment benefit index (TBI) in the manuscript.

For the development of the TBI, the log-odds of the ordinal outcome $Y \ge j$, for $j = 1, 2, \ldots, 10$, are modeled through a POM:

$$\operatorname{logit} P(Y \ge j) = \theta_j + m(\tilde{X}) + g(\beta^\top X, A), \quad j = 1, 2, \ldots, 10, \tag{S1.1}$$

where the variable $A$ is the treatment indicator, taking the value 1 (CCP) or 0 (control), and the linear combination of pre-treatment covariates (i.e., $\beta^\top X$) is referred to as a "single-index" or a CCP-TBI, or just TBI. In (S1.1), $\theta_j$ are cut-point parameters associated with the ordered response categories $j = 1, 2, \ldots, 10$, satisfying the property $\theta_1 > \theta_2 > \cdots > \theta_{10}$, and $\tilde{X}$ also represents pre-treatment covariates, but it can be different from $X$. In POM (S1.1), the heterogeneous CCP effect is modeled by the component $g(\beta^\top X, A)$, which is the crucial part determining the TBI. On the other hand, the term $m(\tilde{X})$ in (S1.1), which models the covariates' main effects, has no impact on the differential effect of the CCP treatment and is used to factor out variation common to the outcomes under both treatments (here CCP and control) to improve the performance of the TBI. For model identifiability, in (S1.1), we impose an identifiability constraint $E[g(\beta^\top X, A) \mid X] = 0$ on $g$. This makes the component $g(\beta^\top X, A)$ orthogonal to $m(\tilde{X})$ in equation (S1.1) (see 2,3 for details of the identifiability condition). Then the covariates' main effect $m(\tilde{X})$ in (S1.1) can be estimated separately based on the model

$$\operatorname{logit} P(Y \ge j) = \tilde{\theta}_j + m(\tilde{X}), \quad j = 1, 2, \ldots, 10, \tag{S1.2}$$

where $\tilde{\theta}_j$ are unknown cut-point parameters associated with the ordered response categories $j = 1, 2, \ldots, 10$ (with the property $\tilde{\theta}_1 > \tilde{\theta}_2 > \cdots > \tilde{\theta}_{10}$) and is augmented in the estimation of model (S1.1) (see 2,4 for the efficiency augmentation). For model parsimony, we restricted the functional form of $m(\tilde{X})$ to be linear with respect to the covariates (including transformations and interactions of patient characteristics).

Under model (S1.1), the cumulative odds ratio (OR) for CCP vs control at a given value of the TBI is

$$\mathrm{OR}(\beta^\top X) = \exp\{ g(\beta^\top X, 1) - g(\beta^\top X, 0) \}. \tag{S1.3}$$

Specifically, a cumulative OR less than 1, i.e., $\mathrm{OR}(\beta^\top X) < 1$, indicates that a patient with a particular TBI value is expected to benefit from CCP, whereas a cumulative OR greater than or equal to 1, i.e., $\mathrm{OR}(\beta^\top X) \ge 1$, indicates that such a patient is not expected to benefit from CCP. In general, if the function $\mathrm{OR}(\cdot)$ is not monotone with respect to $\beta^\top X$, then the TBI can be more generally defined as: $\mathrm{TBI} = g(\beta^\top X, A = 0)$. Due to the constraint $E[g(\beta^\top X, A) \mid X] = 0$ that we imposed on the data-driven function $g$ of the model (S1.1), we have $g(\beta^\top X, 1) = -\frac{P(A = 0)}{P(A = 1)}\, g(\beta^\top X, 0)$, where $P(A = 0)$ and $P(A = 1)$ are known randomization probabilities in the context of RCTs. Thus, the function $\mathrm{OR}(\cdot)$ in (S1.3) can be reparametrized to:

$$\mathrm{OR}(\mathrm{TBI}) = \exp\left\{ -\mathrm{TBI} \left( \frac{P(A = 0)}{P(A = 1)} + 1 \right) \right\},$$

which is a monotone function of the TBI, based on which a continuous gradation of the expected benefit from CCP can be provided. However, in this TBI analysis, since the data-driven function $\mathrm{OR}(\cdot)$ was already monotone with respect to $\beta^\top X$, we simply employed $\beta^\top X$ as the CCP-TBI.

The patient pretreatment characteristics that enter into the TBI in model (S1.1 in eAppendix 1) are selected based on internal cross-validation (CV) for the following criteria.
1. The ratio of the CCP efficacy ORs between the group predicted to benefit (B) and the group predicted to not benefit (NB).
By using the efficacy criterion of the cumulative OR=1 for the model-based cumulative OR (S1.3), the patient population can be stratified into the subpopulation B and the subpopulation NB. To quantify the differential CCP benefit between B and NB, we calculate the OR for the CCP efficacy for each subgroup (B and NB), denoted as OR(B) and OR(NB), respectively, and then compute the ratio of these two ORs, i.e., OR(B)/OR(NB). Specifically, a smaller ratio that satisfies OR(B)/OR(NB) < 1 indicates good TBI performance, in terms of separating B and NB, with respect to differential CCP efficacy. We use the ordinal WHO scale at day 14 (which was the co-primary outcome of the study) to train the TBI model, which is then evaluated on all outcomes on a validation set: the ordinal WHO scale, the indicator of WHO ≥ 7, and the indicator of WHO=10 (at day 14 and at day 28).
2. The "value" 6 of treatment classification rules derived from the respective TBIs. Specifically, for any given treatment classification rule, we compute the expected proportions (i.e., the incidences, the so-called "values") of the clinically undesirable binary outcomes (the indicator of WHO ≥ 7 and the indicator of WHO=10, at day 14 and at day 28) when the treatment is assigned according to the treatment classification rule. Smaller expected values indicate better treatment classification rules and, correspondingly, better TBIs. These expected values were not computed for ordinal WHO scores, because means (or medians) are not appropriate summaries for ordinal data.

A set of different TBI models (with different variables entering into the corresponding TBIs) with comparable AIC values were first identified as a set of "candidate" TBI models (see eTable 5). Then further extensive CV was conducted to select the final model from the set of these candidates. The model development workflow through internal CVs focusing on the selection of the variables is presented in eFigure 1.
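The first criterion above can be illustrated for a single binary outcome (for example, the indicator of WHO ≥ 7). The following is a minimal sketch, not the authors' implementation: it assumes each patient already carries a model-based cumulative OR, and it uses a crude 2x2-table odds ratio in place of whatever estimator was used for the reported ORs. The names `Patient`, `odds_ratio`, and `or_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    tbi_or: float  # model-based cumulative OR for this patient (assumed given)
    treated: int   # 1 = CCP, 0 = control
    event: int     # 1 = undesirable binary outcome occurred, 0 = otherwise


def odds_ratio(patients):
    """Crude odds ratio of the event, CCP vs control, from a 2x2 table."""
    a = sum(1 for p in patients if p.treated == 1 and p.event == 1)
    b = sum(1 for p in patients if p.treated == 1 and p.event == 0)
    c = sum(1 for p in patients if p.treated == 0 and p.event == 1)
    d = sum(1 for p in patients if p.treated == 0 and p.event == 0)
    if 0 in (a, b, c, d):
        raise ValueError("2x2 table has an empty cell; crude OR is undefined")
    return (a * d) / (b * c)


def or_ratio(patients, threshold=1.0):
    """Stratify by cumulative OR < threshold (B) vs >= threshold (NB)
    and return OR(B), OR(NB) and the ratio OR(B)/OR(NB)."""
    benefit = [p for p in patients if p.tbi_or < threshold]
    no_benefit = [p for p in patients if p.tbi_or >= threshold]
    or_b = odds_ratio(benefit)
    or_nb = odds_ratio(no_benefit)
    return or_b, or_nb, or_b / or_nb
```

A ratio below 1 returned by `or_ratio` corresponds to the desired separation between B and NB; the threshold of 1 mirrors the efficacy criterion of the cumulative OR=1 described above.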
The respective CV procedures are described in eAppendix 3-5.

eFigure 1. Workflow Diagram for Developing the TBI
The focus was on identifying the robust features to optimize the TBI performance for treatment classification. The model performance was assessed with CV, evaluated with respect to the outcome of: (1) ordinal WHO 11-point scale at days 14 and 28; (2) binary indicator of WHO ≥ 7 at days 14 and 28; (3) binary indicator of WHO=10 (mortality) at days 14 and 28.

eAppendix 3. Split-Sample-Simulation Cross-validation, 1000 replications
The candidate models were compared with respect to how well they performed for separating the benefit groups, via an extensive CV. Specifically, we perform a split sample simulation, repeated 1000 times, to compare treatment classifiers derived from TBIs. Each time, we randomly partition the data, deriving the classifiers in one part (2/3) of the data and subsequently testing them in the remaining part (1/3) of the data. We first record each classifier's subgroup-specific ORs and the ratio, OR(B)/OR(NB), computed in the validation sample. The median of these measures obtained from the 1000 replications was used to compare different TBI models. The TBI models were ranked on the basis of their CV performance, with respect to the outcome used to train the TBI (day 14 ordinal WHO score). This corresponds to Step 1 of part II from the workflow diagram in eFigure 1. This step resulted in sets of TBIs performing well and those with clearly poorer performance. All "well-performing" TBIs were then tested with respect to the remaining five outcomes: the ordinal WHO score at day 28, binary indicator of WHO ≥ 7 at days 14 and 28, and the binary indicator WHO=10 at days 14 and 28. This corresponds to Step 2a of part II of the workflow diagram in eFigure 1.
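The split-sample simulation described above can be sketched generically. This is an illustrative skeleton only: `fit` and `evaluate` are hypothetical callables standing in for TBI estimation and for the computation of a performance measure such as OR(B)/OR(NB), and the function name `split_sample_cv` is an assumption.

```python
import random
import statistics


def split_sample_cv(data, fit, evaluate, n_reps=1000, train_frac=2 / 3, seed=0):
    """Repeat a random train/test partition n_reps times and return the
    median of the test-set performance measure across replications."""
    rng = random.Random(seed)
    n_train = round(len(data) * train_frac)
    scores = []
    for _ in range(n_reps):
        shuffled = list(data)
        rng.shuffle(shuffled)  # new random partition in every replication
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)                      # derive the classifier on 2/3
        scores.append(evaluate(model, test))    # test it on the remaining 1/3
    return statistics.median(scores)
```

Reporting the median rather than the mean follows the text; it is less sensitive to the occasional split in which a subgroup OR is extreme.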
In addition, we compared the "values" of the treatment classifiers, evaluated with respect to the binary outcomes (the indicators of WHO ≥ 7 and WHO=10, at days 14 and 28), where smaller "values" (i.e., the proportions of patients with the bad events) indicate a better performance. This internal CV with respect to the "values" of the treatment classifiers corresponds to Step 2b of part II of the workflow diagram in eFigure 1. Based on those analyses, a small final set of best TBIs was nominated.

eAppendix 4. Leave-One-RCT-Out Cross-validation
As another internal validation criterion for developing the TBIs, a leave-one-RCT-out CV was used as a proxy for an external validation for selecting the patient covariates used in the models. There were 4 large RCTs (RCT IDs: AA, CC, EE, KK) participating in COMPILE and the remaining 4 trials (RCT IDs: BB, DD, GG, RR) were pooled to form a larger combined RCT to be used in the leave-one-RCT-out CV, see eTable 1. We hold out each RCT (among AA, CC, EE, KK and "combined") and train the TBI models on the remaining 4 RCTs. Using the trained classifiers, patients in the held-out RCT were then classified into B and NB, respectively, and the corresponding CCP efficacy ORs, denoted as OR(B) and OR(NB) respectively, and their ratio, OR(B)/OR(NB), were computed with respect to each of the six outcomes. By repeating this step for all RCTs, the leave-one-RCT-out ORs were averaged across all five once-held-out RCTs (AA, CC, EE, KK and "combined"). A similar procedure was also performed to compare classifiers with respect to the "values". The TBI models with the smallest averaged ratio OR(B)/OR(NB) for all or most of the six outcomes and the smallest expected values for all or most of the four binary outcomes were identified. This corresponds to Steps 3a, 3b and 3c of part III from the workflow diagram in eFigure 1.

eAppendix 5. Leave-One-Enrollment-Quarter-Out Cross-validation
Another CV used in developing the TBIs was based on the time of the patients' enrollment in the study. The main analysis of the COMPILE study identified differences in the efficacy of CCP during the 4 quarters from April 2020 to March 2021 when patients were enrolled in the RCTs. While the effects of quarter of enrollment on the outcomes in the control group were expected to change over time due to the changing standards for usual care for COVID-19 patients, we also observed some differences in the relative efficacy of CCP (compared to control) during different quarters with respect to the six outcomes of interest, see eFigure 2 below. One possible explanation for this shift in the efficacy is that the distributions of patient characteristics differed across the quarters. An alternative explanation could be that the efficacy of CCP (vs. control) was not the same in the different quarters even for patients with the same baseline profiles. To investigate this question, similar to the leave-one-RCT-out approach, we performed leave-one-enrollment-quarter-out CV. This step corresponds to 4a, 4b and 4c of part III from the workflow diagram in eFigure 1.

For clinical practice it might be convenient to create categories rather than use the TBI as a continuous measure. The expected benefit from CCP treatment might be usefully described as having three levels: (B1) large benefit, (B2) modest benefit, and (B3) potential harm expected. To determine the cut-off points for creating three levels of benefit, we employed a likelihood criterion as follows. For a given pair of cut-off points (a, b), we define from the TBI a grouping variable with 3 levels: TBI < a; a ≤ TBI < b; and TBI ≥ b, that is, a 3-level categorical TBI.
Then, we use POM to regress the ordinal WHO scale at day 14 (i.e., the outcome used in the development of the TBI) on the 3-level categorized TBI, the treatment indicator and the interaction between the two (as in the estimation of the POMs displayed in Figure 1 of the main manuscript, but we used the 3-level categorical TBI), and assess the model likelihood. This POM with a categorized TBI has a constant effect constraint of TBI within each of the three intervals defined by (a, b), making it effective in determining groups when there are points of inflection in the OR (as in Figure 1 of the main manuscript) with respect to TBI. A grid of pairs of cut-off points is considered and the pair of cut-off points that results in the highest likelihood is selected to define three levels of expected benefit from treatment with CCP.

eAppendix 7. Application of Other Methods for Treatment Classification Rules
An advantage of the single-index POM (S1.1 in eAppendix 1) approach used here is that, in addition to a binary treatment classifier (that identifies a subpopulation expected to benefit from CCP vs. not benefit from CCP), the index provides a continuous, easy-to-compute gradation of the benefit. Most other precision medicine methodologies in the statistical literature under the rubric of treatment classification rules do not provide such a gradient of benefit. Still, we compared the results from several popular machine learning methods to our selected TBI index, in terms of OR(B), OR(NB) and the ratio of ORs OR(B)/OR(NB), as well as in terms of the "values." The following alternative methods were used: Q-learning 7 trained with (1) the original outcome and (2)

For these alternative methods, we also evaluate the "value" of the associated treatment decision rules, with respect to the four binary outcome measures (i.e., the indicator of WHO ≥ 7 and the indicator of death, WHO=10, at days 14 and 28).
These values are compared with those of two "treatment policies": to treat everyone with CCP ("All CCP") and to treat no one with CCP ("No CCP"). The results are given in eTable 3. A smaller "value" indicates better performance. Some of the RBF kernel-based methods could not be computed due to singularities in the estimation. The computation time for the support vector machine-based methods (i.e., RWL/OWL/RICWL) was about 3 hours for each run; this is in contrast to the computation time of a TBI model, which is only a few seconds. The variables that were used in the above alternative methods were: age, sex, baseline WHO score, blood type, and the indicators of cardiovascular disease, diabetes, and pulmonary disease; this set of variables corresponds to the set of variables used in the index "P" specified in eTable 6. Thus, a direct comparison can be made between the results presented in eTables 2 and 3 and the results in eTables 7 and 8 corresponding to the index "P". In eTables 2 and 3, the outcome-weighted learning (OWL) with linear kernel and boosting trained on the binary outcome of WHO ≥ 7 performed best among the alternative approaches considered. However, both of these approaches were clearly outperformed by the index "P" of the TBI models. This is expected because the TBI model (S1.1) efficiently utilized the ordinal structure of the response data, and given the noisy clinical data, the simplicity of the TBI model compared to the alternative approaches provided a superior prediction.

The TBI was developed on complete cases (n=2287). eFigure 3 below provides the marginal and joint distributions of the missingness (n=82) in the key baseline characteristics.

Selection of the covariates in the main effect $m(\tilde{X})$ of the POM (S1.1 in eAppendix 1) for day 14 WHO score was based on AIC; see eTable 5 for the selected covariates used for the development of all TBIs, elicited in Part I of the workflow diagram in eFigure 1.
The model also contains RCT-specific random intercept effects, which were highly significant. The initial model used all baseline covariates existing in the COMPILE study, including (i) age, (ii) sex, (iii) baseline WHO score, (iv) duration of symptoms (in bins of 3 or 4 days), (v) number of days from diagnosis to randomization, (vi) days from randomization to first infusion, (vii) blood type, (viii) diabetes, (ix) cardiovascular disease, (x) pulmonary disease, (xi) quarter of enrollment.

eTable 6 presents the set of covariates in the indices considered. The variables in the candidate TBI models included categorization of continuous features (e.g., age, with the dichotomizing cutoff 67, which was selected based on a CV), collapsing of multi-category variables into factors with fewer levels (e.g., duration of symptoms prior to randomization) and interactions between the variables.

eTable 6. Candidate Sets of Features Considered for Inclusion in the TBI
This is the result of process II in the "Development of the TBI model" in the workflow diagram in eFigure 1.
Index A: Age, baseline WHO score, blood type [1], CVD [3], DM [4], PD [5]
Index B: Baseline WHO score, blood type [1], CVD, DM, PD
Index C: Age, baseline WHO score, blood type [2], CVD, DM, PD
Index D: Baseline WHO score, blood type [2], CVD, DM, PD
Index E: Age, baseline WHO score, blood type [2], CVD, DM
Index F: Baseline WHO score, blood type [2], CVD, DM
Index G: Baseline WHO score, CVD, DM, PD
Index H: Baseline WHO score, CVD, DM
Index I: Age, Baseline WHO score, CVD, DM
Index J: Baseline WHO score, blood type [2], CVD, DM&PD [6]
Index K: Baseline WHO score, blood type [2], CVD, DM&CVD [7], DM&PD
Index L: Baseline WHO score, blood type [2], CVD, DM, DM&CVD, DM&PD
Index M: Baseline WHO score, blood type [2], CVD, DM, DM&CVD
Index N: Baseline WHO score, CVD, DM&CVD, DM&PD
Index O: Baseline WHO score, blood type [2], CVD, DM, PD, DM&CVD, DM&PD
Index P: Age, Sex, Baseline WHO score, blood type [1], CVD, DM, PD
Index Q: Age, Sex, Baseline WHO score, blood type [2], CVD, DM, PD
Index R: Baseline WHO score, blood type [2], CVD
Index S: Age [8], Baseline WHO score, Baseline WHO score&Age [8], blood type [2], CVD, DM&PD
Index T: Age [8], Baseline WHO score, Baseline WHO score&Age [8], CVD, DM&PD
Index U: Age [8], Baseline WHO score, blood type [2], CVD
Index V: Age, Baseline WHO score, Baseline WHO score&Age, blood type [2], CVD, DM&PD
Index "Basic": Baseline WHO score, Baseline WHO score&Age [8], CVD, DM&PD
Index "Expanded": Baseline WHO score, Baseline WHO score&Age [8], blood type [2], CVD, DM&PD

[1] Blood type has 4 levels: O, A, B, AB
[2] Blood type has 2 levels: (A or AB) and (O or B)
[3] CVD = cardiovascular disease
[4] DM = diabetes mellitus
[5] PD = pulmonary disease
[6] DM&PD = comorbid DM and PD
[7] DM&CVD = comorbid DM and CVD
[8] Age = age dichotomized (1 if Age ≥ 67, 0 otherwise)

eAppendix 11.
Results From 1000 Split-Sample-Simulation Cross-validation

Evaluation of TBI with respect to the differential benefit in groups B and NB
CV, using random partitioning of the data, with derivation of classifiers in one part (2/3) of the data and subsequent testing of the classifiers in the remaining part (1/3) of the data, was conducted on all TBIs in eTable 6, and the results are shown in eTable 7 (in the rows A-V, "Basic" and "Expanded"). The last row of the table corresponds to the results for the case where we select and estimate a TBI, rather than using an already specified model (e.g., Index A, Index B, Index C, …) to do the CV. In eTable 7 below, the entries are the medians over the 1000 random splits used for CV, where the performance is evaluated on the respective (1/3) test sets. In the table, the columns labeled as "OR(B)" show the median of the CCP efficacy ORs in the group predicted to benefit (B) from CCP; columns labeled as "OR(NB)" show those in the group not predicted to benefit (NB) from CCP. Columns "B/NB" show the median of the ratios, OR(B)/OR(NB). Smaller values in columns "OR(B)" and "B/NB" correspond to better TBI performance. Looking at the results across all six outcomes, the TBIs "Expanded", J, S and V appear to have the best, or close to best, CV performance among all candidate TBIs. Also, considering that the patient's blood type information may not be readily available when clinicians make treatment decisions, we considered the TBIs that do not require the patient's blood type information (i.e., G, H, I, N, and "Basic"). Among these, the TBI "Basic" has the best performance. In eTable 7, the last row reports the CV result for the case where we estimate a TBI including the selection of patient features, rather than using an already specified set of features (e.g., Index A, Index B, Index C, …) to do the CV.
Specifically, given a training set (which corresponds to 2/3 of the whole sample), we further split this training set into: 1) a "development" set (2/3 of the training set, i.e., 2/3 × 2/3 = 4/9 of the whole sample), and 2) a "validation" set (1/3 of the training set, i.e., 2/3 × 1/3 = 2/9 of the whole sample). Then we conducted the CV within the training set (i.e., derivation of TBIs on the "development" set, and evaluation of their performance on the "validation" set), followed by ranking the TBIs according to their performance on the "validation" set. Then, we selected the best-performing TBI (say, Index A), and re-estimated this selected TBI on the training set (2/3 of the whole sample, specified at the beginning). We then evaluated this trained TBI on the testing set (1/3 of the whole sample, set at the beginning). This CV procedure, including the model selection procedure, was repeated 1000 times, and the median results are reported in the last row of eTable 7. eFigure 4 below displays the distribution of the odds ratios, OR(B) and OR(NB), associated with the last row of eTable 7 from the 1000 split-sample simulation CV, with respect to each of the 6 outcomes considered. There is a reasonably clear separation in the CCP efficacy odds ratio between B and NB, especially with respect to the outcome of being on mechanical ventilator or worse (WHO ≥ 7) (the second column), suggesting that this development approach provides a reasonably well-performing TBI. (The OR distributions obtained from this 1000 split-sample simulation CV for the "Expanded" TBI are provided in eAppendix 30 of this document.)

Evaluation of TBI with respect to "value"
As described in eAppendix 2, we also compared candidate TBIs with respect to their "value". A "value" of a treatment classification rule is the expected value of the outcome if every patient is treated according to the given treatment classification rule.
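To make the definition concrete, the value of a rule can be estimated from randomized data by reweighting the patients whose assigned arm agrees with the rule. The text does not spell out the estimator it used, so the sketch below assumes a standard normalized inverse-probability-weighted estimator with a known randomization probability; the name `ipw_value` and the record layout `(x, a, y)` are hypothetical.

```python
def ipw_value(records, rule, p_treat=0.5):
    """Normalized inverse-probability-weighted estimate of the value of a rule.

    records : iterable of (x, a, y), where x holds the covariates, a is the
              randomized arm (1 = CCP, 0 = control), and y is the binary
              outcome (1 = undesirable event).
    rule    : function mapping x to the recommended arm (1 or 0).
    p_treat : known randomization probability P(A = 1) (assumed constant).
    """
    num = 0.0
    den = 0.0
    for x, a, y in records:
        if a == rule(x):  # only patients treated in agreement with the rule
            w = 1.0 / (p_treat if a == 1 else 1.0 - p_treat)
            num += w * y
            den += w
    if den == 0.0:
        raise ValueError("no patient received the arm recommended by the rule")
    return num / den
```

The one-size-fits-all policies are special cases: `rule=lambda x: 1` gives the "All CCP" value and `rule=lambda x: 0` gives the "No CCP" value, so all three columns of a value table can be produced by the same function.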
This expected "value" is computed with respect to the binary outcomes (i.e., the indicator of WHO ≥ 7 and the indicator of WHO=10 at days 14 and 28), and it indicates the incidence of those clinically undesirable outcomes when the given treatment classification rule is applied to the patient population. eTable 8 (in columns "TBI") reports the median of the values under the TBI-based treatment classification rules considered in eTable 6, obtained from the 1000 split-sample simulation CV. These values are compared with the medians of the values of two one-size-fits-all treatment policies: to treat everyone with CCP (in columns "All CCP") and to treat no one with CCP (in columns "No CCP"). As in eTable 7, the last row in eTable 8 reports the CV results for the case where the selection of patient features is incorporated: for each cross-validation run, the 2/3 training set is further split into a "development" set (2/3 of the training set) and a "validation" set (1/3 of the training set), i.e., 4/9 of the whole sample was used to train the model and 2/9 of the whole sample was used to validate the model; the model with the smallest B/NB ratio (with respect to the ordinal outcome at day 14) was then selected, re-fitted on the initial 2/3 training set, and evaluated on the initial 1/3 testing set.

Several observations can be made from eTable 8. First, while the prevalence of WHO ≥ 7 at day 14 under CCP treatment was 13.6% (see the 3rd column, "All CCP"), under control treatment it was 15.9%, for a difference of 2.3%. Every one of the TBIs had a better (i.e., lower) "value" than not only the policy to treat no one with CCP, but also the policy to treat everyone with CCP. The differences between the values of the "All CCP" policy and the values for all TBIs are small and varying, with the improvement from the treatment decision rule (TDR) based on the "Expanded" TBI equaling 1.2% (comparing the 2nd and 3rd columns).
Second, similar conclusions can be made with respect to the outcome WHO ≥ 7 at day 28, where the rate of WHO ≥ 7 under CCP treatment is 14.1% and under control treatment is 16.8%, for a difference of 2.7%. In this case, the improvement with respect to value if the TDR based on the "Expanded" TBI is implemented instead of treating everyone with CCP is (0.141-0.127=) 1.4%, i.e., half of the average effect of CCP vs. control, which is 2.7% for WHO ≥ 7 at day 28. This illustrates how, by targeted use of CCP, an improvement can be obtained that is half as large in magnitude as the improvement from treating everyone with CCP vs. not treating anyone with CCP. Third, with respect to mortality, both at days 14 and 28, none of the TDRs derived from the considered TBIs resulted in better value than the policy of treating everyone with CCP, or the improvement was minuscule (see mortality at day 28 for the "Expanded" TBI, where an improvement of only 0.1% is achieved; see the 2nd and 3rd columns from the right in the last row of eTable 8). These observations, together with the results in Figures 1 and 2 of the main manuscript, suggest that CCP treatment might be harmful for some patients in B3 with respect to the status of being on ventilation or worse (WHO ≥ 7) at days 14 and 28, but with respect to mortality, CCP is safe with no evidence for harm.

The results in eTables 7 and 8 indicate that the "Expanded" TBI outperformed the rest of the candidate indices in the majority of the cases, in terms of its ability to separate the groups B and NB in a way that makes the outcomes in those groups most different, as well as with respect to the value of the treatment classification rules. In terms of performance, the "Expanded" TBI is followed by other candidate indices (J, S and V in eTable 6). Among those not involving blood type information, the "Basic" TBI had superior performance.
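The day-28 comparison for WHO ≥ 7 quoted above can be checked directly from the reported values (0.168 for "No CCP", 0.141 for "All CCP", and 0.127 for the TDR based on the "Expanded" TBI); the symbols $V_{\cdot}$ are introduced here only as shorthand for those values:

```latex
\begin{aligned}
V_{\text{No CCP}} - V_{\text{All CCP}} &= 0.168 - 0.141 = 0.027 \quad (2.7\%),\\
V_{\text{All CCP}} - V_{\text{TBI}} &= 0.141 - 0.127 = 0.014 \quad (1.4\%),\\
\frac{V_{\text{All CCP}} - V_{\text{TBI}}}{V_{\text{No CCP}} - V_{\text{All CCP}}} &= \frac{0.014}{0.027} \approx 0.52 .
\end{aligned}
```

The additional gain from targeting CCP is therefore roughly half of the average gain from giving CCP to everyone rather than to no one.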
This process (part II of the workflow diagram in eFigure 1) reduced the number of strong candidate TBIs to 5: indices J, S and V, "Basic" and "Expanded".

eTable 7. Results From 1000 Split-Sample Simulation Cross-validation
Based on a TBI developed on the training set (using the POM (S1.1 in eAppendix 1) for WHO scores at day 14), each holdout set is stratified into a group expected to benefit (B) and a group not expected to benefit (NB) from CCP. For all six outcomes, the CCP efficacy ORs are computed within the subgroups B and NB of the holdout set, and the ratio of ORs, OR(B)/OR(NB), is computed; a smaller ratio (with OR(B)/OR(NB) < 1) indicates a good TBI model. The entries show the median of the ORs across the 1000 splits. Index refers to the indices in eTable 5.

Evaluated on the respective test sets, corresponding to the last row of eTable 7 (i.e., for the case where we included feature selection in estimating a TBI).

[Table columns, repeated for each of the six outcomes: OR(B), OR(NB), B/NB]

Based on a TBI developed on the training set (using the POM (S1.1 in eAppendix 1) for WHO scores at day 14), the expected values of the four binary outcomes under the TBI-based treatment classification rules (in column "TBI") and those under two "treatment policies", to treat everyone with CCP (in column "All CCP") and to treat no one with CCP (in column "No CCP"), are evaluated on each testing set; small values are desired. The entries (in columns TBI, All CCP or No CCP) show the median of the estimated expected values across the 1000 splits. Index refers to the models in eTable 5.

The leave-one-RCT-out CV was performed, as described in eAppendix 4, where we use each of the 4 largest trials (AA with 940 subjects, EE with 477 subjects, KK with 381 subjects and CC with 350 subjects) and a combined trial of the remaining 4 small RCTs as a testing set, forming 5 non-overlapping testing sets.
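The construction of the 5 non-overlapping testing sets can be sketched as follows. This is an illustration only: the trial IDs are those given in eAppendix 4, while the record layout (a dict with an `"rct"` key) and the function name `leave_one_rct_out_folds` are assumptions.

```python
LARGE_TRIALS = ("AA", "CC", "EE", "KK")  # the 4 largest RCTs, held out one at a time


def leave_one_rct_out_folds(records, large_trials=LARGE_TRIALS):
    """Yield (held_out_label, train, test) for each of the testing sets.

    Records from any trial not listed in large_trials are pooled into a
    single fold labeled "combined".
    """
    def fold_label(rec):
        return rec["rct"] if rec["rct"] in large_trials else "combined"

    labels = sorted({fold_label(r) for r in records})
    for held_out in labels:
        test = [r for r in records if fold_label(r) == held_out]
        train = [r for r in records if fold_label(r) != held_out]
        yield held_out, train, test
```

Pooling the small trials keeps every testing set large enough for subgroup ORs to be estimable, which appears to be the motivation for the "combined" fold.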
Given each testing RCT, the remaining RCTs were used as a training set. Based on a TBI developed on the training set, the CCP efficacy ORs (and their ratios) for all six outcomes and the expected "values" with respect to the four binary outcomes were evaluated on the testing RCT. We did this for all 5 non-overlapping testing sets, and the results were averaged across these 5 testing sets. In eTable 9, OR(B)/OR(NB) < 1 for all of the TBIs identified in part II of the workflow diagram (i.e., "Expanded", "Basic", J, S and V); this differential CCP efficacy increases confidence in these TBIs. The results shown in eTable 10 for the "value" of the treatment classifiers are also consistent with those of eTable 8 obtained from the 1000 split-sample simulation CV. As in eTables 7 and 8, the last rows of eTables 9 and 10 report the CV results for the case where we estimated a TBI including feature selection (from the 24 index sets considered). To be specific, in the leave-one-RCT-out CV, based on each training sample consisting of four RCTs, we derived classifiers in three RCTs, with subsequent testing of the classifiers in the remaining RCT. We did this for each RCT in the training sample. We then averaged the results, ranked the TBIs and selected a TBI. The selected TBI was then re-estimated on the training sample and evaluated on the RCT initially held out; we did this CV for all 5 RCTs one by one, and the CV results were averaged across the 5 held-out RCTs (reported in the last rows of eTables 9 and 10). The results in the last rows indicate that the performance of the TBIs is satisfactory.

eTable 9. Results From a Leave-One-RCT-Out Cross-validation
Each holdout set is stratified into the groups B and NB, based on a TBI developed on the training set (using the POM (S1.1 in eAppendix 1) for WHO scores at day 14).
For all six outcomes, the CCP efficacy ORs are computed within B and NB of the holdout set, and the ratio, OR(B)/OR(NB), is computed. The entries show the mean of the ORs (in columns "OR(B)" or "OR(NB)") or of the ratio of the ORs (in columns "B/NB") across the 5 holdout RCTs.

The holdout set is stratified into the groups B and NB, based on a TBI developed on the training set (using the POM (S1.1 in eAppendix 1) for WHO scores at day 14). For the four binary outcomes, the expected values of outcomes under the TBI-based decision rules and those under two "treatment policies" (to treat everyone with CCP, "All CCP", and to treat no one with CCP, "No CCP") are estimated based on the holdout testing set; small values are desired. The entries show the mean estimated expected values (in columns TBI, All CCP or No CCP) of the binary outcomes across the 5 holdout RCTs.

The observed heterogeneity of the effect of CCP over time (see eFigure 2), which is not related to patient characteristics, may affect the performance of the TBIs. To assess through CV whether the TBI-based treatment classification rules were stable over time, we used the 4 quarters covering the period from the beginning of April 2020 to the end of March 2021 (as described in Supplement eAppendix 5). The number of subjects in the 4 enrollment quarters was as follows: 641 in April-June 2020 (Q2); 457 in July-September 2020 (Q3); 909 in October-December 2020 (Q4); 362 in January-March 2021 (Q5). As in the previous subsection, we performed a leave-one-enrollment-quarter-out CV for the selected indices J, S, V, "Basic" and "Expanded". The results with respect to the CCP efficacy ORs are shown in eTable 11. For all of the indices and for all outcomes, we have OR(B)/OR(NB) < 1 (in the columns "B/NB"). This provides evidence of internal validity with respect to time of treatment during the pandemic, and increases the confidence in the proposed TBIs.
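The holdout evaluation used in these cross-validations can be sketched as follows for one binary outcome. This is a minimal illustration, not the authors' code: the function names, the toy cell counts, and the convention that a TBI above the cut-point places a patient in group B are assumptions made for the example, and the TBI is taken as already computed from a model fitted on the training folds.

```python
from statistics import mean

def odds_ratio(rows):
    """Odds ratio (CCP vs control) of a bad outcome from (treated, event) pairs."""
    a = sum(1 for t, e in rows if t and e)          # CCP, event
    b = sum(1 for t, e in rows if t and not e)      # CCP, no event
    c = sum(1 for t, e in rows if not t and e)      # control, event
    d = sum(1 for t, e in rows if not t and not e)  # control, no event
    if 0 in (a, b, c, d):
        raise ValueError("empty cell: the odds ratio is undefined here")
    return (a * d) / (b * c)

def fold_summary(fold, cut):
    """fold: list of (tbi, treated, event). Returns (OR(B), OR(NB), OR(B)/OR(NB))."""
    # Assumption: a TBI above the cut-point means "expected to benefit" (B).
    benefit = [(t, e) for z, t, e in fold if z > cut]
    no_benefit = [(t, e) for z, t, e in fold if z <= cut]
    or_b, or_nb = odds_ratio(benefit), odds_ratio(no_benefit)
    return or_b, or_nb, or_b / or_nb

def average_over_folds(folds, cut):
    """Mean of OR(B), OR(NB) and their ratio across the holdout folds."""
    summaries = [fold_summary(f, cut) for f in folds]
    return tuple(mean(col) for col in zip(*summaries))

def toy_fold(b_cells, nb_cells):
    """Hypothetical fold built from cell counts:
    (CCP events, CCP non-events, control events, control non-events)."""
    fold = []
    for z, (a, b, c, d) in ((0.5, b_cells), (0.1, nb_cells)):
        fold += [(z, True, True)] * a + [(z, True, False)] * b
        fold += [(z, False, True)] * c + [(z, False, False)] * d
    return fold

# Two made-up holdout folds (e.g., two held-out quarters).
folds = [toy_fold((10, 40, 20, 30), (20, 30, 15, 35)),
         toy_fold((12, 38, 18, 32), (18, 32, 16, 34))]
mean_or_b, mean_or_nb, mean_ratio = average_over_folds(folds, cut=0.27)
```

A mean ratio below 1 corresponds to the "B/NB" criterion reported in the tables; the sketch averages the fold-specific ratios rather than taking the ratio of the averaged ORs.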
In eTable 12 below, we further analyze the performance reported in eTable 11, for the Basic and Expanded TBIs, which are the final selected TBIs (see the next subsection). In eTable 12, we display each quarter-specific CCP efficacy ORs associated with the two subgroups (B and NB) and their ratios before they were averaged across the four enrollment quarters. The results in eTable 12 indicate that, except for one entry associated with the basic TBI's ratio of ORs (OR(B)/OR(NB)) with respect to the outcome of WHO 7 at day 14 in the quarter Q5 (January -March 2021), which was slightly greater than 1 (1.037), the performance of both the Basic and Expanded TBIs remains good (OR(B)/OR(NB)<1) over each passing quarter. This provides internal support of validity of the TBIs with respect to time of treatment during the pandemic. In eTable 13 below, we report the leave-one-enrollment-quarter-out CV results for the "values" of the treatment classification rules. The results shown in eTable 13 for the "value" of the treatment classification rules are also consistent with those of eTable 8 obtained from the 1000 split-sample simulation CV. As in the previous subsection, the last rows of eTable 11 and eTable 13 report the CV results for the case where we estimate a TBI including model selection (from the 24 index sets considered; see eTable 6). To be specific, in the leave-one-enrollment-quarter-out CV, based on each training sample consist of three enrollment quarters, we derived classifiers in two enrollment quarters, with subsequent test of the classifier in the remaining one enrollment quarter. We did this for each quarter in the training sample. We then averaged the results, and ranked the TBIs and selected a TBI. 
The selected TBI was then re-estimated on the training sample and then evaluated on the enrollment quarter initially held-out; we did this CV for all 4 enrollment quarters one by one, and the CV results were averaged across the 4 held-out enrollment quarters (reported in the last rows of eTable 11 and eTable 13). The results in the last rows in the both eTable 11 and eTable 13 indicate that the performance of TBIs is satisfactory. The holdout set is stratified into a group expected to benefit (B) and a group not expected to benefit (NB) from CCP, based on a TBI developed on the training set (using the POM (eAppendix 1) for WHO scores at day 14). For all six outcomes, the CCP efficacy ORs are computed within B and NB of the holdout set, and the ratio, OR(B)/OR(NB), are computed. The entries show the mean of the ORs (in columns "OR(B)" or "OR(NB)") or of the ratio of the ORs (in columns "B/NB") across the 4 holdout enrollment quarters. The holdout set is stratified into a group that is expected to benefit (B) and a group that is not expected to benefit (NB) from CCP treatment, based on a TBI developed on the training set (using the POM (S1.1 in eAppendix 1) for WHO scores at day 14). For the four binary outcomes, the expected values of outcomes under the TBI-based treatment decisions ("TBI") and those under two "treatment policies" -to treat everyone with CCP ("All CCP") and to treat no one with CCP ("No CCP")-are estimated based on the holdout set; small value are desired. The entries show the mean estimated expected values (in columns TBI, All CCP or No CCP) of outcomes across the 4 holdout enrollment quarters. In this subsection, for the Expanded TBI, we further examine the results presented in eTable 7. We display the distributions of the odds ratios for CCP vs. control, with respect to the six outcomes. 
Specifically, these distributions were obtained from the leave-one-third-out CV (1000 random splits of the data into two-thirds training and one-third testing), as described in eAppendix 3. The patient subgroup predicted to benefit (B; or "CCP-advised") and the patient subgroup predicted to not benefit (NB; or "CCP-not-advised") were identified by the TBI trained on each training set, and the subgroup-specific CCP efficacy odds ratios (OR(B) and OR(NB)) were evaluated on the corresponding testing set; this process was repeated 1000 times. This gave 1000 values of OR(B) and OR(NB), evaluated on 1000 test sets, for each of the six outcomes: a) the ordinal WHO score at day 14 (displayed in eFigure 5); b) the indicator of WHO ≥ 7 at day 14 (displayed in eFigure 7); c) the indicator of WHO = 10 (death) at day 14 (displayed in eFigure 9); d) the ordinal WHO score at day 28 (displayed in eFigure 6); e) the indicator of WHO ≥ 7 at day 28 (displayed in eFigure 8); f) the indicator of WHO = 10 (death) at day 28 (displayed in eFigure 10).

For comparison, we also display the subgroup-specific (i.e., "CCP-advised" and "CCP-not-advised") CCP efficacy odds ratios for a "naïve" treatment decision rule (Naïve TDR), which is defined as follows: give a patient the CCP treatment if the patient satisfies one of the following conditions:
1. The number of days since COVID-19 symptom onset is ≤ 6.
2. The patient's baseline (11-point) WHO score is 4 (i.e., the baseline WHO score is not 5 or 6).
3. The patient has at least one of the following 2 medical conditions: 1) history of diabetes; 2) history of cardiovascular disease.

For each test set, the patients satisfying this criterion were considered to be in the "CCP-advised" (i.e., B) subgroup according to this "naïve" TDR, and the corresponding CCP efficacy OR was computed. Those who were not in the "CCP-advised" subgroup were considered to be in the "CCP-not-advised" (i.e., NB) subgroup, and the corresponding CCP efficacy OR was computed.
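The naïve rule above is simple enough to state as a predicate. The sketch below is only an illustration: the function and argument names are hypothetical, and "one of the following conditions" is read as "at least one", which is an assumption about the intended meaning.

```python
def naive_tdr_advises_ccp(days_since_onset, baseline_who, diabetes, cardiovascular):
    """Return True if the naive rule places the patient in the CCP-advised group.

    Assumption: meeting any one of the three conditions is sufficient.
    """
    early_stage = days_since_onset <= 6           # condition 1
    low_baseline_who = baseline_who == 4          # condition 2 (not 5 or 6)
    comorbidity = diabetes or cardiovascular      # condition 3
    return early_stage or low_baseline_who or comorbidity

# Hypothetical patients: (days since onset, baseline WHO, diabetes, CVD)
patients = [(3, 5, False, False), (10, 4, False, False),
            (10, 6, True, False), (10, 6, False, False)]
groups = ["B" if naive_tdr_advises_ccp(*p) else "NB" for p in patients]
```

Because any single condition triggers the recommendation, a rule of this kind tends to advise CCP for a large share of patients, which is one plausible reason its CCP-advised subgroup resembles the pooled sample.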
The distributions of the subgroup-specific ORs obtained from the 1000 testing sets, for each of the 6 outcomes, are displayed in eFigures 5-10. The top panel of each figure corresponds to the results of the TBI-based stratification and the bottom panel corresponds to those of the Naïve TDR. Within each panel, the histograms and density estimates are displayed for the "CCP-advised" (in blue) and the "CCP-not-advised" (in red) subgroups. In each panel, the location of the blue or red triangle on the horizontal axis represents the median of each distribution. In addition, as a reference point, the median of the ORs evaluated on non-stratified testing samples ("Pooled") is marked with a green triangle, which represents the overall CCP efficacy OR. Overall, those who were identified as "CCP-advised" by the TBI clearly have stronger CCP efficacy (i.e., smaller ORs) than those identified by the Naïve TDR, with respect to all 6 outcomes. For example, for the important outcome of death at day 28 (see eFigure 10), in the bottom panel (corresponding to the naïve stratification approach), the locations of the blue and green triangles are almost the same, implying that the Naïve TDR is not effective in identifying patients who are expected to benefit from the CCP treatment. Furthermore, in eFigures 5-10, the subgroup-specific (i.e., CCP-advised and CCP-not-advised) distributions overlap with each other much more extensively for the Naïve TDR (bottom panels) than for the TBI (top panels). This indicates that, compared with the TBI approach, the naïve rule is less effective in identifying individual patients who do or do not benefit from the CCP treatment, with respect to all outcomes. In addition to this performance advantage, the TBI can quantify the magnitude of the expected relative benefit of taking CCP over control, as described in eFigure 28, which can help guide clinicians' treatment decisions.

eAppendix 15.
Specification of the Basic and Expanded TBIs Overall, in the 1000 split-sample simulation CV (see eTable 7 and 8), the index "Expanded" outperformed the indices S, V and J, with respect to most of the outcomes. The index "Expanded" also consistently outperformed or exhibited a comparable performance as these indices, both in the leave-one-enrollment-quarter-out CV and leave-one-RCT-out CV (see eTable 9-13). In addition to its superior CV performance, the index "Expanded" is simpler and more interpretable than the indices S and V, while generally outperforming the index J. Therefore, we nominated the index "Expanded" as the expanded TBI for the CCP treatment. On the other hand, among those indices not requiring the patient's blood type information, the index "Basic" clearly exhibited the best performance in all the CVs considered. Therefore, we nominated the index "Basic" as the basic TBI, which requires less information than the expanded TBI. When the patient's blood type information is not available, the basic TBI can be used instead of the expanded TBI. The variables and the coefficients ( ) of linear combination (along with 95% bootstrap confidence intervals; the confidence intervals that do not include 0 are marked with *) under model (S1.1 in eAppendix 1) for the final selected TBIs, i.e., the "Basic" and the "Expanded" TBIs, are given in eTable 14 below. Note that the significance (*) in the table is not used for keeping or eliminating terms in the TBI; significance is indicated only as a guidance. The retention of terms in the indices is selected via the CV process outlined above. eTable 14. Coefficients of the Linear Combination for Baseline Patient Characteristics in the Basic and Expanded TBIs The TBIs are continuous measures between 0 and 1. 
In addition to specifying the cut-point (0.25 and 0.27 for the Basic and Expanded TBIs, respectively; see eAppendix 2 for how the cut-point was determined) for the group predicted to benefit (B) and the group predicted to not benefit (NB), the last row of the We follow the risk stratification table approach 13 to cross-tabulate treatment benefit predictions with (i.e., using the Expanded TBI) and without blood type information (i.e., using the Basic TBI). eTable 15 analyzes the extent to which the benefits calculated from the models reflect the actual benefit in the population (model fidelity), and compares the two TBIs (i.e., the indices with and without blood type information), in terms of agreement between the treatment benefit stratifications. The benefit index has a capacity to stratify the population into different benefit levels. We use stratification specified in the last row of eTable 14, i.e., the three benefit groups (B1, B2, B3) stratification. The stratification cut-points were determined as described in eAppendix 6. As indicated in eAppendix 18, these TBIs' cut-points (i.e., 0.2 and 0.4 for the Basic TBI, and 0.2 and 0.37 for the Expanded TBI) roughly corresponded to the modelbased CCP efficacy ORs of 1.10 and 0.80, for the outcome of the ordinal WHO score at day 14, under the TBI model (S1.3). eTable 15 analyzes the models with and without blood type information (the Expanded and Basic TBIs, respectively), in terms of model fidelity. Model fidelity means that the model-predicted treatment efficacy OR for a person with specified predictor values is close to the actual treatment efficacy OR of persons in the population with those same predictor values. Since there are often very few patients with the same predictor values, for evaluation, the patients are aggregated into pre-specified categories as in eTable 15. 
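The two-level and three-level stratifications, and the cross-tabulation of the Basic against the Expanded stratification, can be sketched as below. The cut-point values are those quoted in the text; the mapping of larger TBI values to larger expected benefit (so that TBI ≥ b is B1) and the handling of values exactly at a cut-point are assumptions of this sketch, and the patient scores are made up.

```python
from collections import Counter

# Cut-points quoted in the text: binary cut (model-based OR of 1) and (a, b).
CUTS = {
    "basic":    {"binary": 0.25, "three": (0.20, 0.40)},
    "expanded": {"binary": 0.27, "three": (0.20, 0.37)},
}

def two_level(tbi, index):
    """B if the TBI exceeds the binary cut-point, else NB (assumed direction)."""
    return "B" if tbi > CUTS[index]["binary"] else "NB"

def three_level(tbi, index):
    """B3: TBI < a; B2: a <= TBI < b; B1: TBI >= b (assumed labeling)."""
    a, b = CUTS[index]["three"]
    if tbi >= b:
        return "B1"
    if tbi >= a:
        return "B2"
    return "B3"

def cross_tab(patients):
    """Count patients by (Basic level, Expanded level); patients: (basic, expanded)."""
    return Counter((three_level(bas, "basic"), three_level(exp, "expanded"))
                   for bas, exp in patients)

# Hypothetical (Basic TBI, Expanded TBI) pairs for five patients.
patients = [(0.45, 0.50), (0.30, 0.38), (0.30, 0.30), (0.15, 0.22), (0.10, 0.12)]
table = cross_tab(patients)
```

In a table of this form, off-diagonal cells such as ("B2", "B1") are the patients whose benefit level changes once blood type information is added.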
The fidelity of the benefit prediction models can be assessed by comparing the ORs in the margins of eTable 15 with the corresponding row and column labels. For example, for the Expanded TBI, the observed benefits within each benefit category (B1, B2, B3) are in the bottom "Total" row, and they generally agree with the column labels. Similarly, for the Basic TBI, the observed benefits within each benefit category are in the far-right "Total" column, and they generally agree with the row labels. Thus, both models seem to be well calibrated (under the proportional odds assumption). As indicated in eTable 14, the blood type information enters the CCP benefit calculation, through the indicator of the blood types A or AB, with the associated coefficient 0.14 (> 0). In eTable 15, the ORs within each row (see, e.g., the row associated with B2, specified by the Basic TBI) has an increasing trend, as we move from left to right (i.e., moving from B1 to B2, and to B3, determined from the Expanded TBI). This indicates that, by incorporating blood type information through the Expanded TBI, the patients sorted by the Basic TBI are further sorted according to their benefit level in a more refined manner. That is, the Expanded TBI provides a more refined gradation of the treatment benefit. In eTable 15, we note that there was no patient classified to the benefit group B1 (or B3) by one TBI while being classified to the benefit group B3 (or B1) by the other TBI. That is, there was no large disagreement between the two stratifications. The two TBI's Pearson correlation is 0.90, when they are treated as continuous variables. , ) ) to describe the heterogeneous treatment effect, allowing a simple treatment classification rule based on the odds ratio (S1.3) common to all cumulative odds. On the other hand, a more general cumulative logit model permits the heterogeneous treatment effect to vary across the 10 cumulative logits. 
In both the Basic and Expanded TBIs, we first note that the functions associated with the heterogeneous treatment effect (HTE) in the POM (S1.1 in eAppendix 1) were estimated to be essentially linear. Thus, for the sake of simplicity, we restrict our attention to a linear cumulative logit model for testing the proportional odds assumption associated with HTE. Given the TBI $z$ (either the Basic or the Expanded TBI), we consider the model

$$\operatorname{logit} P(Y \ge y) = \mu_y + m_y(x) + (\gamma_y + \beta_y z)\,A, \quad y = 1, 2, \ldots, 10, \qquad \text{(S3.1)}$$

for the cumulative logits, $\operatorname{logit} P(Y \ge y)$, $y = 1, 2, \ldots, 10$, where HTE is allowed to vary across the cumulative logits. In particular, HTE is described by $\gamma_y + \beta_y z$. In (S3.1), we use a centered treatment variable ($A = -0.5$ for control and $A = 0.5$ for CCP) to make the HTE term orthogonal to the other term, $\mu_y + m_y(x)$, making it robust against misspecification of the main effect of the TBI and the covariates $x$ 14 (note that $m_y(x)$ is restricted to be linear in $x$). In eTable 16 below, we display the estimated coefficients $\gamma_y$ and $\beta_y$, where the coefficients are estimated from a separate logistic regression for each $y = 1, 2, \ldots, 10$. In eTable 16, except for those associated with the cumulative logits corresponding to the lower WHO scores ($y = 1, 2, 3$), the estimates of the coefficients $\gamma_y$ and $\beta_y$ appear to vary relatively little across the cumulative logits of $y = 4, 5, \ldots, 10$: for the Basic TBI, the estimates of $\gamma_y$ range from 0.53 to 0.81 and those of $\beta_y$ range from -2.05 to -3.20; for the Expanded TBI, the estimates of $\gamma_y$ range from 0.73 to 1.05 and those of $\beta_y$ range from -2.51 to -3.55. This suggests that the assumption of proportional odds is practically reasonable. In the following, we report results from a Brant test 15 for the ordinal logit models (S3.1), investigating whether the observed data deviate significantly from the parallel regression assumption. The null hypothesis (H0) of the test is given by: $\gamma_y = \gamma$ and $\beta_y = \beta$ for all $y = 1, 2, \ldots, 10$, for some $\gamma \in \mathbb{R}$ and $\beta \in \mathbb{R}$.
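To see how a linear HTE term translates into a treatment odds ratio, the sketch below assumes that the treatment contribution to the cumulative log odds is (gamma + beta × TBI) times a treatment variable coded -0.5 for control and 0.5 for CCP, so that the CCP vs control log odds ratio is gamma + beta × TBI. The names `gamma` and `beta` and the numeric values are hypothetical; the values were merely picked inside the ranges reported above for the Basic TBI, with the positive range taken as the intercept-like coefficient and the negative range as the TBI slope.

```python
import math

def cumulative_or(tbi, gamma, beta):
    """CCP vs control odds ratio implied by a linear HTE term.

    With treatment coded -0.5 / 0.5, the two arms differ by 1 in the
    treatment variable, so log OR = gamma + beta * tbi.
    """
    return math.exp(gamma + beta * tbi)

def or_one_cutpoint(gamma, beta):
    """TBI value at which the implied odds ratio equals 1."""
    return -gamma / beta

# Hypothetical coefficients, not taken from eTable 16.
GAMMA, BETA = 0.65, -2.60

cut = or_one_cutpoint(GAMMA, BETA)
or_low_tbi = cumulative_or(0.20, GAMMA, BETA)   # above 1: CCP not favored
or_high_tbi = cumulative_or(0.40, GAMMA, BETA)  # below 1: CCP favored
```

With these illustrative values the implied OR crosses 1 at a TBI of about 0.25, and a negative slope means that the OR falls (benefit grows) as the TBI increases.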
In eTable 17, the resulting p-values are all greater than 0.1, indicating that the observed deviation from the proportional odds assumption associated with HTE is not statistically significant at a significance level of 0.1. The analysis results in eTables 16 and 17 suggest that using the POM is likely to be more robust and effective than a more general cumulative logit model for derivation of the TBIs.

An advantage of the TBI approach to developing treatment classification rules is the continuous nature of the index, which allows capitalizing on the gradation of the expected benefit. Our investigations suggested that, among patients expected to benefit from CCP, there was a subgroup expected to have a substantial benefit, and that there was also a subgroup with potential harm from CCP. Because it can capture different degrees of CCP treatment benefit, the TBI allows more refined guidelines regarding the expected benefit. For instance, the expected benefit from CCP can be categorized into three levels: large benefit (B1), modest benefit (B2) and potential harm (B3) from CCP treatment. In eFigure 11 and eFigure 12 below, the model-based cumulative OR functions in (S1.3 in eAppendix 17) are displayed as functions of the Basic (in eFigure 11) and Expanded (in eFigure 12) TBIs, respectively. The two benefit levels (i.e., B and NB), determined by the model-based CCP efficacy OR of 1, are displayed in the left panel (the corresponding cut-point was 0.25 for the Basic TBI, and 0.27 for the Expanded TBI), and the three benefit levels (B1, B2 and B3) are displayed in the right panel. The cut-points (a, b) defining the 3 benefit groups (TBI < a; a ≤ TBI < b; and TBI ≥ b) were chosen to optimize the 3-category model fit, as described in eAppendix 6. These cut-points are (a, b) = (0.2, 0.4) for the Basic TBI, and (a, b) = (0.2, 0.37) for the Expanded TBI, as specified in eTable 14.
For both TBIs, these cut-points roughly corresponded to the model-based CCP efficacy cumulative ORs of 1.10 and 0.80, respectively, for the outcome of the ordinal WHO score at day 14, under the TBI model (S1.3). eFigure 11. Basic TBI Model-Based Cumulative Odds Ratio for the Ordinal WHO Score at Day 14, as a Function of the Basic TBI Basic TBI specified in eTable 14 Two-groups stratification (B and NB) is displayed in the left (panel a) and three-groups stratification (B1, B2 and B3) is displayed in the right (panel b). The within group CCP efficacy odds ratios (for the outcome of the ordinal WHO at day 14, unadjusted for any covariates) are provided in the legends. eFigure 12. Expanded TBI Model-Based Cumulative Odds Ratio for the Ordinal WHO Score at Day 14, as a Function of the Expanded TBI Expanded TBI specified in eTable 14. The two-groups stratification (B and NB) is given in the left (panel a) and the threegroups stratification (B1, B2 and B3) is given in the right (panel b). The within group CCP efficacy odds ratios (for the outcome of the ordinal WHO at day 14, unadjusted for any covariates) are provided in the legends. The subgroup-specific CCP efficacy cumulative ORs indicated in the legends on eFigure 11 and eFigure 12 above were estimated based on frequentist analyses. (Bayesian estimates are provided in eFigure 23 for the Basic TBI, and in eFigure 14 for the Expanded TBI.) In eFigure 20, we display unadjusted CCP efficacy ORs as functions of the Basic TBI for all six outcomes. In Figure 1 of the main manuscript, we display unadjusted CCP efficacy ORs as functions of the Expanded TBI for all six outcomes. eTable 18 and eTable 19 report the results from CV when the expected benefit from CCP is categorized into the three benefit levels (i.e., B1, B2 and B3). 
In eTable 18 and eTable 19, although the relationships of the odds ratios OR(B1) < OR(B2) < OR(B3) and of the expected incidences E(B1) < E(B2) < E(B3) for the Leave-one-RCT-out CV and the Leave-one-enrollmentquarter-out CV were not as clear (especially for some benefit levels in the Basic TBI) in comparison to the more extensive split-sample-simulation CV (1000 random splits) given in the first section of each of the tables, there is strong support for the generalizability of the TBIs in distinguishing groups with different benefit levels. eTable 18. Odds Ratios Associated with All 6 Outcomes, Under 3 Levels of Categorization From the Basic and Expanded TBIs The TBIs are developed using the POM (S1.1 in eAppendix 1) for WHO scores at day 14. Based on a TBI developed on the training set, the holdout set is stratified into 3 groups: large benefit (B1), modest benefit (B2), and potential harm (B3) from CCP. The ORs for all six outcomes are computed within each subgroup (B1, B2 and B3) of the holdout set. The entries show the median (for the 1000 split-sample-simulation) or the mean (for the leave-one-RCT-out and the leaveone-enrollment-quarter-out CVs) of the ORs computed across the holdout sets. 3) Leave-one-enrollment-quarter-out CV These within-group expected incidences are estimated via CV, based on: 1) 1000 random splits into training data set (2/3 rds ) and testing data set (1/3 rd ) (i.e., split-sample-simulation) CV; 2) leave-one-RCT-out CV; and 3) leave-oneenrollment-quarter-out CV. Based on a TBI developed on the training set, the testing set is stratified into B1, B2 and B3, and the corresponding incidences were estimated. The entries show the median (for the 1000 split-sample-simulation CV) or the mean (for the leave-one-RCT-out and the leave-one-enrollment-quarter-out CVs) of the estimated incidences, across the holdout samples. 
In eTable 20, in the rows associated with the enrollment quarters, we only reported the distribution of the patient population across the enrollment quarters within each benefit group (i.e., each column sums to 1). On the other hand, in eFigure 13 below, we display the distribution of the patient population across the different benefit groups (B1, B2 and B3) by quarter, to investigate whether the patient population has shifted over time and its possible relationship with the observed shift in CCP efficacy presented in eFigure 2. In eFigure 13 below, a notable observation is that the proportion of patients in B3 increases over time, with B3 (the group predicted to be potentially harmed by CCP) being most prevalent among patients enrolled between January and March 2021, the quarter that had the lowest CCP efficacy. This partly explains the observed decrease in CCP efficacy over time in the COMPILE study. Additionally, we provide below the binary subgroup characteristics where the subgroups were identified based on the CCP efficacy OR of 1 (OR ≥ 1 or OR < 1) (see eAppendix 2), under the TBI model (S1.3 in eAppendix 1), i.e., the subgroup of the patients who were predicted to benefit (B) and the subgroup of the patients who were not predicted to benefit (NB) from the CCP treatment, as predicted by the Expanded TBI.

After identifying the TBI subgroups associated with the three levels of benefit (B1, B2 and B3), we evaluated the efficacy of CCP with respect to all six outcomes separately within each group. Specifically, we performed the same Bayesian analyses as those used for evaluating the efficacy of CCP in the entire sample; see the main COMPILE results 16 .
Within each of the 3 benefit-level groups defined by the Expanded TBI, eTable 22 describes the posterior distributions of the CCP efficacy odds ratios (ORs) using the median, the 2.5th percentile and the 97.5th percentile (i.e., the 95% credible interval) and the posterior probabilities that the CCP efficacy OR is less than 1 (P(OR<1)) and less than 0.8 (P(OR<0.8)) (see the next page for an explanation of these posterior probabilities).

Here we explain how evidence from the Bayesian posterior distributions of the odds ratios should be interpreted and how this interpretation contrasts with the more common frequentist reporting of evidence in terms of p-values. A probability distribution of an OR makes it possible to estimate the probability that the OR lies in any given interval. The posterior distribution characterizes the OR after the data are observed. The posterior distribution of the OR allows us to compute, for example, the probability that the OR is less than 1 [P(OR<1)]. An OR<1 indicates that there is a desirable effect from the treatment with CCP. The posterior probability for OR<1 [P(OR<1)] is the probability that CCP has any desirable effect; for example, P(OR<1) > 90% provides strong evidence for CCP efficacy. However, with a large sample size, this posterior probability can be high even if the median of the OR is very close to 1; for example, median OR=0.93. To assess whether, beyond the presence of any effect, the effect is large enough to be clinically meaningful, a useful characteristic of the posterior probability distribution is the quantity P(OR<0.8). An OR=0.8 means that the odds of an undesirable outcome under the control treatment are reduced by 20% with CCP. Such an effect can be considered more than a minimal effect. A P(OR<0.8) ≥ 50% indicates that it is more likely than not that the desirable effect of CCP is more than minimal.
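As a rough illustration of how these posterior summaries are obtained once posterior draws of the OR are available, consider the sketch below. It is not the Bayesian model used in the study: the draws are simulated from a made-up log-normal distribution, the percentile rule is a simple nearest-rank approximation, and the 90% and 50% thresholds are applied as "at least", which is an assumption about the inequality intended in the text.

```python
import math
import random
from statistics import median

def posterior_summary(or_draws):
    """Median, approximate 95% credible interval, P(OR<1) and P(OR<0.8)."""
    draws = sorted(or_draws)
    n = len(draws)
    lower = draws[int(0.025 * (n - 1))]   # nearest-rank 2.5th percentile
    upper = draws[int(0.975 * (n - 1))]   # nearest-rank 97.5th percentile
    p_lt_1 = sum(d < 1.0 for d in draws) / n
    p_lt_08 = sum(d < 0.8 for d in draws) / n
    return {
        "median": median(draws),
        "cri95": (lower, upper),
        "p_lt_1": p_lt_1,
        "p_lt_08": p_lt_08,
        # Assumed reading of the evidence rule: both thresholds must be met.
        "more_than_minimal": p_lt_1 >= 0.90 and p_lt_08 >= 0.50,
    }

# Hypothetical posterior draws: log OR ~ Normal(log 0.7, 0.2).
rng = random.Random(0)
simulated = [math.exp(rng.gauss(math.log(0.7), 0.2)) for _ in range(4000)]
summary = posterior_summary(simulated)

# A tiny hand-checkable example.
tiny = posterior_summary([0.5, 0.7, 0.9, 1.1])
```

In practice the draws would come from the fitted posterior (for example, MCMC samples of the treatment coefficient, exponentiated), and the same tallies give the probabilities reported in the tables.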
Combined, the conditions P(OR<1) ≥ 90% and P(OR<0.8) ≥ 50% constitute strong evidence for more than a minimal effect of CCP. Bayesian credible intervals (CrIs) convey different information than frequentist confidence intervals (CIs). A frequentist confidence interval refers to confidence in the analytic procedure (see below). The Bayesian credible interval refers to the probability that the effect of interest (here the OR for CCP vs. control) lies within a particular interval. Therefore, the intervals in the forest plots (displayed in eFigure 14) should not be interpreted in the same way as frequentist CIs. Next, we provide more discussion on the difference between Bayesian credible intervals (CrIs) and frequentist confidence intervals (CIs). For illustration we focus on 95% intervals. A frequentist 95% CI either includes or excludes the true but unknown effect (e.g., OR for CCP vs. control), and, before the data are observed, the probability that the interval will include the true but unknown OR is 95%. Confidence in the procedure means that if we repeated the procedure many times (i.e., performed the same experiment, collected data, fitted the models we used, and estimated the OR and the 95% frequentist CIs), 95% of those CIs would contain the true OR. For any given experiment, we cannot know whether our particular 95% CI contains the true OR or not, but we are confident in the procedure that we used, because under repeated experimentation, 95% of the CIs would cover the true OR. On the other hand, a 95% Bayesian credible interval tells us that, given the data we observed, the probability that the OR is in the interval (a, b) is 95%: this statement incorporates the uncertainty about the distribution of the OR of interest, given the prior information and data. Another contrast between the Bayesian and frequentist frameworks for making inferences and statements about uncertainty is the following.
P-values indicate how likely the data are given the (null) hypothesis, i.e., how likely it would be to observe data at least as extreme as what we observed if the null hypothesis (OR=1) were true. In other words, the p-value is a conditional probability of the data, given a hypothesis. A Bayesian posterior probability, on the other hand, assesses the probability of a hypothesis given the data. For example, P(OR<1) is the conditional probability that CCP has an effect (i.e., OR<1) given the observed data. Combined with the discussion above, having both P(OR < 1) ≥ 90% and P(OR < 0.8) ≥ 50% constitutes strong evidence for more than minimal efficacy of CCP against control. The results in eTable 22 indicate that, with respect to all six outcomes, the benefit group B1 (identified from the Expanded TBI) has both P(OR < 1) ≥ 90% and P(OR < 0.8) ≥ 50%. This constitutes strong evidence of more than minimal efficacy of CCP against control in group B1 with respect to all six outcomes. Patients in B2 are also expected to benefit, but the posterior probability of CCP efficacy is not as strong as that of B1. Finally, the patients in B3, with P(OR < 1) < 30% for all six outcomes (see eTable 22), are unlikely to benefit from the CCP treatment. eFigure 14 below displays the medians and the 95% CrIs for the ORs provided in eTable 22. Note that the 2.5% and the 97.5% of the posterior distributions reported in The posterior distributions are estimated from the Bayesian cumulative proportional odds models for the ordinal WHO scores and logistic regression, adjusted for age, sex, baseline WHO scores, duration of symptoms prior to treatment, diabetes, cardiovascular disease, pulmonary disease and quarter of enrollment.

eAppendix 21. Differential Treatment Effect by Benefit Levels, in Comparison With Differential Treatment Effect by Single Patient Characteristics eFigures 15-20 below contrast the moderating effect of the Expanded TBI to that of the baseline variables prespecified in the COMPILE protocol.
We conducted a set of subgroup analyses for CCP effect moderations by levels of the TBI. The results show strong CCP effect moderations by levels of the TBI, in terms of the magnitude contrast (i.e., the CCP efficacy OR difference) between the subgroup expected to most benefit (B1) and that expected to not benefit (B3), compared to the contrast based on the other individual baseline characteristics. This strong moderation exhibited by the TBI suggests the advantage of using the TBI for guiding clinical decisions over the baseline patient characteristics individually. Although the results were displayed in terms of only the 3 discrete benefit levels (i.e., B1, B2 and B3), the TBI, as it continuously ranges from 0-1, provides a generally continuous gradation of the expected benefit from CCP, which can be useful in clinical practice in quantifying the treatment benefit from CCP for an individual subject. eAppendix 23. Baseline Patient Characteristics by Benefit Levels Determined From the Basic TBI eTable 23 describes the baseline characteristics of patients in the three benefit-level groups (B1, B2 and B3) defined from the Basic TBI. For each benefit group, frequentist's unadjusted OR estimates (not adjusted for any covariates) and the associated 95% confidence intervals are also provided in the bottom of the table. In eTable 23, in the rows associated with the enrollment quarters, we only reported the distribution of the patient population across the enrollment quarters within each benefit-level group (i.e., each column sums to 1). In eFigure 22 below, we display the distribution of the patient population across the different benefit groups (B1, B2 and B3) identified by the Basic TBI, by each quarter. We do this to investigate whether the patient population has shifted over time and its possible relationship with the observed shift in CCP efficacy presented in eFigure 2. 
In eFigure 22, a notable observation is that the proportion of patients in B1 decreases over time, with B1, the group predicted to have the most benefit from CCP, being most prevalent among patients enrolled in the early period of the pandemic, i.e., between April and June 2020. This partly explains the observed decrease in CCP efficacy over time in the COMPILE study. Additionally, we provide the binary subgroup characteristics where the subgroups were identified based on the CCP efficacy OR of 1 (OR ≥ 1 or OR < 1), under the TBI model (S1.3), i.e., the subgroup of the patients who were predicted to benefit (B) and the subgroup of the patients who were not predicted to benefit (NB) from the CCP treatment, as predicted by the Basic TBI.

eAppendix 24. CCP Efficacy in the 3 Benefit Groups With Respect to All 6 Outcomes, Determined by the Basic TBI Within each of the 3 groups defined by the Basic TBI, eTable 25 characterizes the posterior distributions of the odds ratios using the 2.5th percentile, the median, and the 97.5th percentile (i.e., the 95% credible interval) and the posterior probabilities that the odds ratio is less than 1 (P(OR<1)) and less than 0.8 (P(OR<0.8)). The results in eTable 25 indicate that, with respect to all six outcomes, the benefit group B1 (identified from the Basic TBI) has P(OR < 1) ≥ 90% and P(OR < 0.8) ≥ 75%, suggesting strong evidence for more than minimal efficacy of CCP against control for this group with respect to all six outcomes.
There were 8698 EAP subjects (5303 subjects received high-titer CCP, as qualified by a live viral neutralization titer of ≥ 1:250) who had no missing information for the neutralization titer of the transfused CCP or for the following covariates: quarter (or month) of treatment, age, sex, diabetes indicator, pulmonary disease indicator, cardiovascular disease indicator, blood type, and WHO baseline score (ranging from 4 to 6 on a 0-10 scale), and who had a known mortality status at day 28 (the only outcome available from the study); see eTable 26 below for baseline characteristics of this EAP validation cohort. The EAP sample contains only CCP-treated subjects, so the EAP sample was matched with a cohort of control-treated subjects from COMPILE. Considering the evidence that outcomes improved over the course of the pandemic independently of known patient risk factors, and that no EAP patients were enrolled after September 2020, the time period of the analysis was restricted to April 2020 through September 2020. There were 546 control patients in COMPILE treated from April 2020 through September 2020 who were available for matching. This gave a set of 546 control-treated and 8698 CCP-treated subjects, a total of n=9244 patients, available for matching. Since blood type information was available for the whole study sample, both the Basic and Expanded TBI scores could be computed and tested. Coarsened exact matching was used to provide matched cohorts: the continuous covariate (age) was coarsened into 5-year bins, and a complete cross of the coarsened covariates was used to form subclasses defined by each combination of the coarsened covariate levels. The matching was performed via R package MatchIt 19. This resulted in an effective sample size of n=212 control-treated subjects and n=1896 CCP-treated subjects in the matched set; see eTable 27 for the summary of balance of the matched data.
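The coarsened exact matching step can be sketched in plain Python as follows. This is an illustration of the general technique, not the MatchIt implementation: the record fields, the reduced covariate set, and the toy data are hypothetical, and the weights follow the usual coarsened exact matching convention (treated units get weight 1; controls in a stratum are reweighted to the treated share of that stratum).

```python
from collections import defaultdict

def coarsen(unit):
    """Matching key: age in 5-year bins crossed with categorical covariates.
    Only a subset of the covariates is used here, for brevity."""
    return (unit["age"] // 5, unit["sex"], unit["diabetes"], unit["who"])

def cem_weights(units):
    """Return {unit id: matching weight}; units in strata that lack either
    a treated or a control unit are dropped."""
    strata = defaultdict(list)
    for u in units:
        strata[coarsen(u)].append(u)
    matched = [v for v in strata.values()
               if any(u["treated"] for u in v) and any(not u["treated"] for u in v)]
    n_t = sum(u["treated"] for v in matched for u in v)
    n_c = sum(not u["treated"] for v in matched for u in v)
    weights = {}
    for v in matched:
        t = sum(u["treated"] for u in v)
        c = len(v) - t
        for u in v:
            weights[u["id"]] = 1.0 if u["treated"] else (t / c) * (n_c / n_t)
    return weights

def effective_sample_size(ws):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    ws = list(ws)
    return sum(ws) ** 2 / sum(w * w for w in ws)

# Hypothetical toy records (not study data).
units = [
    {"id": 1, "age": 62, "sex": "F", "diabetes": 1, "who": 5, "treated": True},
    {"id": 2, "age": 64, "sex": "F", "diabetes": 1, "who": 5, "treated": True},
    {"id": 3, "age": 60, "sex": "F", "diabetes": 1, "who": 5, "treated": False},
    {"id": 4, "age": 71, "sex": "M", "diabetes": 0, "who": 4, "treated": True},
    {"id": 5, "age": 73, "sex": "M", "diabetes": 0, "who": 4, "treated": False},
    {"id": 6, "age": 74, "sex": "M", "diabetes": 0, "who": 4, "treated": False},
    {"id": 7, "age": 45, "sex": "M", "diabetes": 0, "who": 6, "treated": True},
]
weights = cem_weights(units)
control_ess = effective_sample_size(
    weights[u["id"]] for u in units if u["id"] in weights and not u["treated"]
)
```

In this toy example, unit 7 has no control in its stratum and is discarded, which is how exact matching on many covariates can shrink the usable sample.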
We also considered matching the COMPILE control-treated subjects with the high-titer EAP subjects only, using the same coarsened exact matching procedure, which resulted in a high-titer matched set with an effective sample size of n=1145 CCP-treated patients and n=197 control-treated patients; see eTable 28 below for the summary of balance of the matched data. TBIs provide binary subgroups: the group predicted to benefit from treatment with CCP (group B) and the group not predicted to benefit from treatment with CCP (group NB). As a primary analysis, the TBI would be considered validated if OR(B) < OR(NB). As a sensitivity analysis with respect to the three benefit groups (B1, B2 and B3), the corresponding relationship is OR(B1) < OR(B2) < OR(B3). As another sensitivity analysis (sensitivity with respect to a different definition of CCP treatment), we also considered the case where the analysis was restricted to the patients who received high-titer CCP (as qualified by a live viral neutralization titer of ≥ 1:250). To obtain the B and NB groups and the B1, B2 and B3 groups associated with different levels of benefit, we computed the TBI scores of all patients in the matched sets. Using the prespecified cut-points (see eTable 14), we partitioned the patients in the matched sets into the B and NB groups and the B1, B2 and B3 groups. To compute the subgroup-specific ORs, we fitted, within each benefit subgroup, a logistic regression with the treatment indicator as the only covariate, weighted by the matching weights (i.e., the weights according to the subclasses formed during matching). These subgroup-specific ORs were then compared with one another to validate the relationship between the TBIs and the CCP efficacy.

Results

eTable 29 below shows the distribution of the subjects in the matched data across levels of benefit (two or three levels of benefit) defined by the Basic TBI or the Expanded TBI, respectively, and depending on whether partitioning was restricted to the patients with high-titer CCP or not.
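The subgroup-specific comparison described above can be sketched as follows. The sketch relies on a standard fact: for a weighted logistic regression whose only covariate is a binary treatment indicator, the fitted odds ratio equals the cross-product ratio of the weighted 2x2 table. The function name `weighted_or` and all counts and weights are hypothetical, chosen only to illustrate the check OR(B) < OR(NB).

```python
def weighted_or(rows):
    """Weighted odds ratio of the event (e.g., death), CCP vs control.

    rows: iterable of (treated: bool, event: bool, weight: float).
    Raises ZeroDivisionError if a required cell of the 2x2 table is empty.
    """
    cell = {(t, e): 0.0 for t in (True, False) for e in (True, False)}
    for treated, event, w in rows:
        cell[(treated, event)] += w
    return (cell[(True, True)] * cell[(False, False)]) / (
        cell[(True, False)] * cell[(False, True)]
    )

# Hypothetical matched data: CCP-treated units carry weight 1,
# controls carry their matching weights.
rows_b = (
    [(True, True, 1.0)] * 2 + [(True, False, 1.0)] * 8
    + [(False, True, 2.0)] * 2 + [(False, False, 2.0)] * 3
)
rows_nb = (
    [(True, True, 1.0)] * 4 + [(True, False, 1.0)] * 6
    + [(False, True, 1.0)] * 3 + [(False, False, 1.0)] * 7
)

or_b = weighted_or(rows_b)
or_nb = weighted_or(rows_nb)
# Validation criterion of the primary analysis: OR(B) < OR(NB).
validated = or_b < or_nb
```

The same function can be applied to the B1, B2 and B3 subsets to check the ordering OR(B1) < OR(B2) < OR(B3).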
eTable 30 below reports the results (i.e., the subgroup-specific ORs) of this validation analysis. The results of the primary analysis are highlighted. Both the Basic TBI and the Expanded TBI were validated with respect to the two-level benefit subgroups (showing the relationship OR(B) < OR(NB)), for both the primary analysis using all EAP patients (reported in the highlighted cells in the 3rd and 5th columns) and the sensitivity analysis using the high-titer EAP patients only (reported in the 4th and 6th columns). With respect to the three-level benefit subgroups, the Expanded TBI was also validated (showing the relationship OR(B1) < OR(B2) < OR(B3)), both when matching all EAP patients and when matching only the high-titer CCP patients. The Basic TBI, however, was not validated with respect to the three-level benefit subgroups in either case. As the Basic TBI does not incorporate blood type in the determination of the index, this may indicate the relevance of blood type information for determining the CCP benefit levels.

eTable 26. Baseline Characteristics for Participants Treated Under the EAP With Available Data for All Specified Parameters

"CCP (All)" corresponds to the group of patients who received any CCP, and "CCP (High titer)" corresponds to the group of patients who received high-titer CCP (as qualified by a live viral neutralization titer of ≥ 1:250).

COMPILE were matched, resulting in a total of n=420 subjects in the matched sample. The summary of balance for these matched data is given in eTable 32. Blood type information was available for the whole study sample, and thus both the Basic and Expanded TBI scores could be computed for all n=420 patients.
These TBIs were used to stratify the sample (n=420) into 1) two groups: expected to benefit (B) and not expected to benefit (NB); and 2) three groups: large benefit (B1), moderate benefit (B2) and potential harm (B3) expected from the CCP treatment, using the cutoff points specified in eTable 14 for the Basic TBI and the Expanded TBI, respectively. We then computed the subgroup-specific ORs. Although the 11-point WHO scale ordinal outcome at day 28 was not available for most of the EUA patients (65% missing), the day 14 outcome was available for all the subjects. Thus, we computed the subgroup-specific ORs with respect to the ordinal 11-point WHO outcome at day 14, using cumulative logistic regression with the treatment indicator as the only covariate.

Results: eTable 33 below shows the distribution of the subjects across levels of benefit (two or three levels of benefit) defined by the Basic TBI or the Expanded TBI, respectively.

A single-center, open-label, phase II RCT was conducted to assess the pathogen and host-intrinsic factors influencing the clinical and immunological benefits of passive immunization using CCP therapy in addition to standard of care (SOC) therapy (clinical trial registration: Clinical Trial Registry of India No. CTRI/2020/05/025209) 24. Convalescent plasma was collected from patients who had recovered from COVID-19 (with disease remission at least 28 days prior to screening and after attaining negative status on SARS-CoV-2 RT-PCR) following a screening protocol that also included measuring plasma anti-SARS-CoV-2 spike IgG content. Patients with severe COVID-19 and evidence of acute respiratory distress syndrome (ARDS) with a PaO2/FiO2 ratio of 100-300 (moderate ARDS) were recruited and randomized into two parallel arms, SOC and CCP, with n=40 in each arm. See Table S.6.10 below for the patients' pretreatment characteristics and their clinical information.
Patients were followed up for 30 days post-admission to assess the primary outcomes of all-cause mortality and immunological correlates of clinical benefit. The standard of care received by the patients (supplemented, in the CCP arm, with two transfusions of 200 ml CCP on two consecutive days) included variable pharmacotherapy, in addition to standard protocols for O2 therapy as described in eTable 36 below.

Approach: In this study, blood type information was available only for patients randomized to CCP; therefore, only the Basic TBI could be computed and used for the validation. The Basic TBI scores for all n=80 patients were computed, and these TBI scores were then used to stratify the whole (n=80) sample into 1) two groups: expected to benefit (B) and not expected to benefit (NB); and 2) three groups: large benefit (B1), moderate benefit (B2) and potential harm (B3) expected from the CCP treatment, using the cutoff points specified in eTable 14. We then computed the subgroup-specific ORs to validate the relationship between the TBI and the CCP efficacy. Mortality at day 30 was the only available outcome in common with the COMPILE outcomes. We used logistic regression, with the treatment indicator as the only covariate, to analyze this outcome. Based on the TBI scores, 20 and 60 patients belong to the subgroups B and NB, respectively, and 16, 48 and 16 patients belong to the subgroups B1, B2 and B3, respectively, as reported in eTable 37 below. The ORs (and 95% CIs) computed within the subgroups defined by the Basic TBI are reported in eTable 38 below, supporting validation of the TBI. The results of the primary analysis are highlighted. Since the data are from an RCT, we can draw the relationship between the CCP efficacy odds ratio and the TBI using the whole n=80 sample (as in Figure 1 in the main text), as displayed in eFigure 25 below.
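The stratification and the subgroup-specific OR with its confidence interval can be sketched as follows. The cutoff values are hypothetical placeholders (the actual cut-points are those of eTable 14), and the counts in the example are invented. For a logistic regression with a single binary covariate, the estimated OR and its Wald interval coincide with the 2x2 cross-product ratio and the usual log-scale standard error used here.

```python
import math

def benefit_group(tbi, low=0.3, high=0.6):
    """Assign a three-level benefit group from a TBI score in [0, 1].
    A larger TBI designates a larger expected benefit.
    The cutoffs `low` and `high` are hypothetical, not those of eTable 14."""
    if tbi >= high:
        return "B1"  # large benefit
    if tbi >= low:
        return "B2"  # moderate benefit
    return "B3"      # potential harm or no benefit

def or_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: deaths under CCP,     b: survivors under CCP,
    c: deaths under control, d: survivors under control.
    All four cells must be nonzero.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# Hypothetical counts within one subgroup (not study data).
odds_ratio, ci_low, ci_high = or_with_ci(3, 7, 5, 5)
```

With subgroups as small as those in an n=80 trial, a cell count of zero is possible, in which case this simple estimator is undefined and the interval is in any case very wide.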
Although the confidence bounds for the ORs are wide (a consequence of the small sample size), the Basic TBI is considered externally validated because the subgroup specifications were determined independently using the COMPILE data set and the respective ORs exhibit, on this external dataset, the benefit relationships predicted by the TBI: OR(B) < OR(NB) and OR(B1) < OR(B2) < OR(B3). (We note that imputing blood type information and using the Expanded TBI produced results similar to those of the Basic TBI.)

4 - hospitalized/no O2: 0 (0.0), 0 (0.0), 0 (0.0)
5 - hospitalized/O2 by mask or nasal: 1 (2.5), 3 (7.5), 4 (5.0)
6 - hospitalized/O2 by non-invasive 39 (

eFigure 27 (for the binary outcomes) display the smoothed odds ratios as continuous functions of the Basic and Expanded TBIs, respectively. In eFigure 26 and eFigure 27, for both the Basic and Expanded TBIs, evaluated TBI scores greater than 0.6 are scarce (the small blue ticks on the horizontal axis represent the values of the TBI scores). As a consequence, the confidence bounds for the associated ORs are wide for TBI values greater than 0.6. This is because of the difference in the enrolled patient population between this RCT and the COMPILE study. Nevertheless, the ORs exhibit monotone decreasing relationships with the TBIs (i.e., a large value of the TBI designates a large benefit), indicating validation of the TBIs on this external RCT. In eFigure 28 below, we categorized the 11-point WHO ordinal outcome scale into 4 ordinal categories: i) WHO scale 0-3 (non-hospitalized); ii) WHO scale 4-6 (hospitalized but without mechanical ventilation); iii) WHO scale 7-9 (mechanically ventilated); iv) WHO scale 10 (death), and displayed the stacked probabilities of these 4 ordinal categories as a function of the CCP-TBI and the treatment condition (CCP or Control). In the plots, the darkest (black) band corresponds to the binary outcome WHO score = 10 (i.e., death).
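The 4-category grouping of the WHO scale used for the stacked probability plots can be written as a small Python sketch. The category labels and function names are illustrative, and the proportions computed here are simple empirical shares, whereas the plots display model-predicted probabilities.

```python
from collections import Counter

CATEGORIES = [
    "0-3 non-hospitalized",
    "4-6 hospitalized, no mechanical ventilation",
    "7-9 mechanically ventilated",
    "10 death",
]

def who_category(score):
    """Map an 11-point WHO score (0-10) to one of the 4 ordinal categories."""
    if not 0 <= score <= 10:
        raise ValueError("WHO score must be between 0 and 10")
    if score <= 3:
        return CATEGORIES[0]
    if score <= 6:
        return CATEGORIES[1]
    if score <= 9:
        return CATEGORIES[2]
    return CATEGORIES[3]

def category_proportions(scores):
    """Empirical share of each category, in the stacking order of CATEGORIES."""
    counts = Counter(who_category(s) for s in scores)
    return [counts[c] / len(scores) for c in CATEGORIES]

# Hypothetical day 14 scores for a handful of patients (not study data).
shares = category_proportions([2, 5, 5, 10])
```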
As the value of the CCP-TBI increases (from 0 to 1), the risk of a bad outcome (e.g., death) under the CCP treatment generally decreases, whereas the risk under the Control treatment generally increases, for both the day 14 outcome (left panel) and the day 28 outcome (right panel). When TBI ≤ 0.2, the heights of all color bands are similar for both treatments (CCP and Control), or the Control is slightly favored, indicating that such patients would not benefit much from the CCP treatment. On the other hand, when TBI ≥ 0.4, CCP-treated patients are expected to have a better outcome than control-treated patients. In other words, patients with TBI ≥ 0.4 are predicted to experience a better outcome with CCP treatment than under the Control treatment (i.e., the height of the lighter-colored bands is greater for the CCP-treated than for the Control-treated when TBI ≥ 0.4).

eFigure 29. Fitted POM

The left two panels: the cumulative ODDS( ≥ ) predicted by the cumulative logit POM (S1.1 in eAppendix 1) (for a fixed value of ), as a 2-dimensional response surface of the Risk Index and the TBI, under the CCP treatment (left panel) and under the control treatment (middle panel), respectively. Risk Index refers to the term ( ) in POM (S1.1 in eAppendix 1), scaled to [0,1]; see eTable 4 for the term ( ). TBI refers to the component in POM (S1.1 in eAppendix 1), scaled to [0,1]. The right panel: the cumulative odds ratio comparing CCP vs. control, implied by POM (S1.1), as a function of Risk Index and TBI. Notice that the odds ratio (OR) surface depends only on TBI (and not on Risk Index). This is because, in POM (S1.1 in eAppendix 1), the CCP efficacy OR is a function only of TBI, and not of Risk Index. Thus, independently of Risk Index, TBI can be used to evaluate the relative benefit of CCP vs. control.
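Why the OR surface is flat in the Risk Index can be checked numerically with a short sketch. It assumes a generic proportional odds form in which the log cumulative odds equal an intercept, plus a risk term, plus a treatment term that depends only on the TBI. The notation, the linear treatment term, and all coefficient values are assumptions for illustration, not the fitted model (S1.1).

```python
import math

def treatment_effect(tbi):
    """Hypothetical log odds ratio as a function of TBI in [0, 1];
    decreasing, so that a larger TBI means a larger benefit."""
    return 0.5 - 1.5 * tbi

def cumulative_odds(alpha_j, risk, tbi, treated):
    """Cumulative odds of a worse outcome under an assumed POM:
    log odds = alpha_j + risk + treated * treatment_effect(tbi)."""
    eta = alpha_j + risk + (treatment_effect(tbi) if treated else 0.0)
    return math.exp(eta)

def odds_ratio(alpha_j, risk, tbi):
    """Cumulative OR, CCP vs control; alpha_j and risk cancel in the ratio."""
    return (cumulative_odds(alpha_j, risk, tbi, True)
            / cumulative_odds(alpha_j, risk, tbi, False))

# Same TBI, different intercepts and risk indices: the OR is unchanged.
or_low_risk = odds_ratio(alpha_j=-1.0, risk=0.2, tbi=0.5)
or_high_risk = odds_ratio(alpha_j=0.3, risk=0.9, tbi=0.5)
```

Both cumulative odds change with the risk term, but their ratio reduces to the exponential of the treatment term, which is the sense in which the TBI alone determines the relative benefit.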
Patients received 1 unit of high-titer CCP obtained from the New York Blood Center that met the criteria defined by the FDA on the Ortho-Clinical Diagnostics VITROS Anti-SARS-CoV-2 IgG platform (S/C ≥ 12) or 2 units considered low titer (having an S/C < 12) 21 until were studied retrospectively (see eTable 31 below for baseline characteristics of the patients). According to FDA guidance, CCP was mainly used in patients considered to have early COVID-19 based on the assessment of an infectious diseases physician, patients with no or non-invasive oxygen supplementation, and patients with B cell immunodeficiency. Eligibility was determined when patients were referred by the primary team physician. Clinical, laboratory, and patient outcome data were collected through chart review from the electronic health record under Albert Einstein College of

Given this relatively small sample size, the coarsened exact matching used in eAppendix 25 limited the effective sample size for the CCP-treated patients of the EAP sample to only n=29. Thus, instead of exact matching, we assigned a control unit to each treated patient in the EUA sample as a match, one by one. Specifically, age, sex, diabetes indicator, pulmonary disease indicator, cardiovascular disease indicator, enrollment quarter and the baseline WHO score were used to compute a distance between each treated unit and each control unit, based on which nearest neighbor matching was performed via R package MatchIt 23.

Recommendations for Investigational COVID-19 Convalescent Plasma
US Food and Drug Administration. Convalescent Plasma EUA Letter of Authorization March 9
MatchIt: Nonparametric Preprocessing for Parametric Causal Inference

Approach: Both the Basic TBI and the Expanded TBI were considered for this external validation.
The Basic and Expanded TBI scores for n=309 patients (we excluded n=24 patients with missing baseline oxygen supplementation status) were computed to stratify the sample into 1) two groups: expected to benefit (B) and not expected to benefit (NB); and 2) three groups: large benefit (B1), moderate benefit (B2) and potential harm (B3) expected from the CCP treatment, using the cutoff points specified in Table 1 in the main text. The ORs computed within these subgroups, evaluated with respect to the WHO 6-point scale ordinal outcome specified above, were compared with one another to validate the relationship between the TBIs and the CCP efficacy. The cumulative ORs, computed for both the day 14 and day 30 outcomes, were obtained from cumulative logit proportional odds models with the treatment indicator as the only covariate. Since the study does not have the baseline WHO scale, the information about oxygen supplementation and ICU at baseline were used to surrogate the baseline WHO 11 for the days 14 and 30 WHO 6-point scale ordinal outcomes (see eTable 41), and for the days 14 and 30 binary outcomes of ventilation or worse (see eTable 42). The results of the primary analysis are highlighted.

Although there are clear monotone relationships, the relationships between the Basic TBI and the CCP efficacy odds ratios appear to be less robust than those of the Expanded TBI (presented in Figure 1 of the main manuscript), as suggested by the wider confidence intervals.

The top panel shows the Kaplan-Meier curves for time to death (the two decreasing trajectories in the panel) and time to hospital discharge (the two increasing trajectories in the panel) for group B1, and the middle and bottom panels show the same results for groups B2 and B3, respectively. Results of frequentist inference from stratified log-rank tests and Gray's tests are shown in each panel.
The probability of each outcome category, predicted as a function of the TBI ranging from 0 to 1, is displayed under each of the two treatment conditions (CCP and Control). These stacked probability plots describe the relationship between the CCP-TBI and the treatment outcomes at day 14 (in the left panel) and day 28 (in the right panel).