key: cord-0430108-gf3fynci
authors: Venema, E.; Wessler, B. S.; Paulus, J. K.; Salah, R.; Raman, G.; Leung, L. Y.; Koethe, B. C.; Nelson, J.; Park, J. G.; van Klaveren, D.; Steyerberg, E.; Kent, D. M.
title: Large-scale validation of the Prediction model Risk Of Bias ASsessment Tool (PROBAST) using a short form: high risk of bias models show poorer discrimination
date: 2021-01-25
journal: nan
DOI: 10.1101/2021.01.20.21250183
sha: 5fc16b371f64ae3892b71f38635a6e9babc1cda6
doc_id: 430108
cord_uid: gf3fynci

Abstract

Objective: To assess whether the Prediction model Risk Of Bias ASsessment Tool (PROBAST) and a shorter version of this tool can identify clinical prediction models (CPMs) that perform poorly at external validation.

Study Design and Setting: We evaluated risk of bias (ROB) for 102 CPMs from the Tufts CPM Registry, comparing PROBAST to a short form consisting of six PROBAST items anticipated to best identify high ROB. We then applied the short form to all CPMs in the Registry with at least one validation and assessed the change in discrimination (dAUC) between the derivation and validation cohorts (n=1,147).

Results: PROBAST classified 98/102 CPMs as high ROB. The short form identified 96 of these 98 as high ROB (98% sensitivity), with perfect specificity. In the full CPM Registry, 529/556 CPMs (95%) were classified as high ROB, 20 (4%) as low ROB, and 7 (1%) as unclear ROB. The median change in discrimination was significantly smaller in low ROB models (dAUC -0.9%, IQR -6.2% to 4.2%) than in high ROB models (dAUC -11.7%, IQR -33.3% to 2.6%; p<0.001).

Conclusion: High ROB is pervasive among published CPMs and is associated with poor performance at validation, supporting the application of PROBAST or a shorter version of it in CPM reviews.

• High risk of bias is pervasive among published clinical prediction models.

Introduction

Clinical prediction models (CPMs) estimate an individual patient's risk of a clinical outcome and may be used to select patients for a certain treatment or study. 1-3 The development of a CPM consists of several important steps, including careful predictor selection and model specification. 4,5 Methodological shortcomings may cause bias and systematic overestimation of model performance measures, promoting misleading conclusions about a model's value for clinical practice.

The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was developed to assess risk of bias (ROB) based on the methods used for model development. 8 It was developed for systematic reviews and provides a comprehensive overview of methodological quality. It is unclear, however, whether adherence to the methodological standards of PROBAST is associated with better performance at external validation. Moreover, PROBAST requires both subject-matter and methodological expertise to apply and might be too time intensive for large-scale use. We aimed to examine whether poor PROBAST scores can identify CPMs that perform poorly at external validation, and to develop a short form that is equally capable of identifying poorly performing CPMs.

Methods

We used publications from the Tufts Predictive Analytics and Comparative Effectiveness (PACE) CPM Registry, a database that includes CPMs for cardiovascular disease published in English from January 1990 through March 2015 (www.pacecpmregistry.org). A systematic literature search was performed to comprehensively identify CPMs. 9,10
For this registry, a de novo CPM was defined as a newly derived prediction model to estimate an individual patient's absolute risk of a binary outcome. Information on each CPM was extracted from the original article and entered in the database. CPMs were characterized by index condition, including coronary artery disease, congestive heart failure, arrhythmias, stroke, venous thromboembolism, and peripheral vascular disease. Populations at risk of developing incident cardiovascular disease were classified under the index condition 'population sample'.

We included validations of each de novo CPM in the CPM Registry, which had been previously identified with a Scopus citation search conducted on March 22, 2017. 11 An external validation was defined as any model evaluation on a dataset distinct from the derivation data, including validations performed on a temporally or geographically distinct part of the same cohort (i.e., non-random split sample), that reported at least one measure of model performance (discrimination and/or calibration). Discrimination, expressed as the area under the receiver operating characteristic curve (AUC), describes how well a model separates those who develop the outcome of interest from those who do not. Calibration refers to the agreement between observed and predicted probabilities.

To assess the clinical relatedness of the derivation and validation populations, relatedness rubrics were constructed for the 10 most common index conditions; these included details on the population sample, major inclusion and exclusion criteria, outcome measure, enrolment period, type of intervention, and follow-up duration. Details of these rubrics are available elsewhere. 11

The PROBAST tool assesses risk of bias based on 20 signalling questions in 4 key domains: participants (e.g., study design and patient inclusion), predictors (e.g., differences in predictor definitions), outcome (e.g., differences in outcome assessment), and analysis (e.g., sample size and handling of missing data). All questions are phrased so that "yes" indicates low ROB and "no" indicates high ROB. A domain in which all questions are answered "yes" is rated low ROB, while any question answered "no" indicates high ROB. Unclear ROB is assigned to a domain when there is insufficient information to assess one or more questions. The overall judgment is low ROB when all domains have low ROB. If at least one domain has high ROB, the overall judgment is high ROB as well. If at least one domain has unclear ROB and all other domains have low ROB, the overall judgment is unclear ROB. 8,12
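To make this aggregation rule concrete, the following is a minimal sketch of the overall PROBAST judgment as just described; the function name and the representation of domain ratings as strings are our own illustrative choices, not part of PROBAST itself.

```python
def overall_probast_judgment(domain_ratings):
    """Combine PROBAST domain ratings ('low', 'high', 'unclear')
    into an overall risk-of-bias judgment.

    Rules, per the description above: any high domain -> high ROB;
    all domains low -> low ROB; otherwise (some unclear, rest low)
    -> unclear ROB.
    """
    if any(r == "high" for r in domain_ratings):
        return "high"
    if all(r == "low" for r in domain_ratings):
        return "low"
    return "unclear"

# Example: participants, predictors, and outcome low; analysis unclear.
print(overall_probast_judgment(["low", "low", "low", "unclear"]))  # unclear
```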
We applied the PROBAST tool to all stroke models with at least one external validation and a reported derivation AUC (n=52) and to 50 randomly selected models with other index conditions. Each of these 102 models was assessed by two independent reviewers, blinded to validation study results, using the guidelines in the PROBAST 'Explanation and Elaboration'. 12 Discrepancies were discussed with a third reviewer to arrive at a consensus.

For practical reasons, a selection was made from the items in the original PROBAST. We discussed the relevance of all 20 questions within a group of methodologists with specialized expertise in prediction (including DvK, EWS, and DMK) and rated items according to their potential effect on the performance of a CPM at validation. We aimed for a subset of questions that could be applied by a trained research assistant (i.e., trained study staff without doctoral-level clinical or statistical expertise) within 15 minutes. Usability of the items was verified through testing by experienced research assistants at the Tufts PACE Center.

The following six items were considered most relevant and easy to use: 'outcome assessment', 'events per variable (EPV)', 'continuous predictors', 'missing data', 'univariable analysis', and 'correction for overfitting/optimism'. We adjusted the definitions from the original PROBAST article to improve clarity and resolve ambiguity (see Supplemental Material). One point was assigned for each item that was incorrectly performed, resulting in total scores ranging from 0 to 6 points. As with the full PROBAST, models with a total score of 0 were classified as 'low ROB' and models with a score of ≥1 as 'high ROB'. When the total score was 0 but insufficient information was provided to assess all items, the model was rated 'unclear ROB'. We assumed that the effect of using univariable selection or not correcting for optimism would be negligible when the effective sample size was very large. Hence, we did not assign points to these items when the EPV was ≥25 for candidate predictors, or ≥50 for the final model (when the candidate predictors were unknown); a sketch of this scoring logic is given at the end of this subsection.

We applied this short form to the same set of 102 models and compared the results with those of the full PROBAST. We then refined and clarified the scoring guidelines. Research assistants at the Tufts PACE Center then applied the short form to all de novo CPMs with at least one external validation in the Registry (n=556). Blinded double assessment of the first 40 models was done to compare the assessments and discuss discrepancies. Because the short form was composed of a subset of PROBAST items, CPMs classified as high ROB by the short form were anticipated to also be classified as high ROB by the full PROBAST, whereas CPMs classified by the short form as low ROB might be reclassified as high ROB by the full PROBAST. Thus, all models rated low or unclear ROB were re-assessed by a separate reviewer using the full PROBAST to reveal any potential items suggestive of high ROB not captured by the short form.
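As a rough illustration of the short-form scoring described above, here is a minimal sketch. The item identifiers, the dictionary of per-item verdicts ('ok', 'violated', 'unclear'), and the function name are our own assumptions for illustration, not part of the published form; verdicts are assumed to be supplied for all six items.

```python
SHORT_FORM_ITEMS = [
    "outcome_assessment", "epv", "continuous_predictors",
    "missing_data", "univariable_analysis", "overfitting_correction",
]

def short_form_rob(verdicts, epv_candidate=None, epv_final=None):
    """Score the six-item short form: one point per violated item.

    Per the paper, 'univariable_analysis' and 'overfitting_correction'
    are not penalized when the effective sample size is very large
    (EPV >= 25 for candidate predictors, or >= 50 for the final model
    when the candidate predictors are unknown).
    """
    large_sample = (
        (epv_candidate is not None and epv_candidate >= 25)
        or (epv_candidate is None and epv_final is not None and epv_final >= 50)
    )
    exempt = {"univariable_analysis", "overfitting_correction"} if large_sample else set()

    scored = [item for item in SHORT_FORM_ITEMS if item not in exempt]
    score = sum(1 for item in scored if verdicts.get(item) == "violated")
    if score >= 1:
        return score, "high"
    if any(verdicts.get(item) == "unclear" for item in scored):
        return score, "unclear"
    return score, "low"

# Example: a single violated item yields high ROB regardless of the rest.
verdicts = {item: "ok" for item in SHORT_FORM_ITEMS}
verdicts["missing_data"] = "violated"
print(short_form_rob(verdicts))  # (1, 'high')
```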
Statistical analysis

Cohen's kappa statistic was calculated to assess interrater reliability and the agreement between PROBAST and the short form. As a measure of the observed ROB in each derivation-validation pair, we used the change in discriminative performance between the derivation and validation cohorts, as quantified by the AUC. The AUC ranges from 0.5 (similar to a coin flip) to 1.0 (perfect discrimination). 13 The percent change in discrimination was calculated as

dAUC = 100% × (AUC_validation − AUC_derivation) / (AUC_derivation − 0.5).

For example, when the AUC decreases from 0.70 in derivation to 0.60 in validation, the delta AUC (dAUC) represents a 50% loss in discriminative ability (since 0.50 is the reference value for the AUC). We calculated the median and interquartile range (IQR) of the change in discrimination for low ROB versus high ROB models and stratified by relatedness.
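A minimal computational sketch of this dAUC definition follows; the function name is our own, and the code simply implements the formula above.

```python
def dauc_percent(auc_derivation, auc_validation):
    """Percent change in discrimination between derivation and
    validation, measured relative to the no-information AUC of 0.5."""
    return 100.0 * (auc_validation - auc_derivation) / (auc_derivation - 0.5)

# Worked example from the text: an AUC drop from 0.70 to 0.60
# corresponds to a 50% loss of discriminative ability.
print(dauc_percent(0.70, 0.60))  # -50.0
```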
We used generalized estimating equations (GEE) with a robust covariance estimator 14,15 to assess the association between the ROB classification and the observed change in discrimination, taking into account the correlation between validations of the same CPM. In these analyses, we calculated the absolute difference between two dAUCs; for example, if one model had a dAUC of 20% and the other a dAUC of 6%, the difference in dAUC would be 14%. We also constructed a multivariable GEE model to control for the following factors: relatedness, index condition, CPM authors (same as in validation paper, author overlap, no author overlap), CPM method (logistic, time-to-event, other), CPM center (single, multicenter), CPM source (medical record, registry, trial, other), validation design (cohort, trial, other), validation center (single, multicenter), validation source (medical record, registry, trial, other), CPM parameter degrees of freedom, CPM events per variable, CPM sample size, CPM events, validation events per variable, validation sample size, validation events, relative outcome rate difference >40%, and the difference in years between CPM derivation and validation.

Results

PROBAST was assessed on the first set of 102 models (52 stroke models and 50 models with other index conditions). Of these models, 98 (96%) were rated high ROB and only 4 (3.9%) low ROB. Overall high ROB was mainly driven by high ROB in the analysis domain, while the other three domains contributed little information (Figure 1A). Agreement between the two reviewers before the final consensus meeting was 90% for the overall judgment (kappa 0.33). Interrater agreement per item ranged between 49% and 97% (kappa -0.05 to 0.88; Supplemental Table S1). When the short form was applied to the same 102 models, the sensitivity to detect high ROB was 98% (using the full PROBAST as the reference standard) and the specificity was 100%. Overall agreement was good (98%, kappa 0.79). The item 'outcome assessment' was rated high ROB in only 4% of the models, while the percentage rated high ROB for the other items ranged between 39% and 77% (Figure 1B).

In the full CPM Registry, 529/556 CPMs (95%) were classified as high ROB, 20 (4%) as low ROB, and 7 (1%) as unclear ROB, incorporating the full PROBAST assessment of all low and unclear ROB models. Information on both the derivation AUC and the validation AUC was available for 1,147 validations (62%). The median dAUC of the derivation-validation pairs was -10.7% (IQR -32.4% to 2.8%). The difference was significantly smaller in low ROB models (dAUC -0.9%, IQR -6.2% to 4.2%) than in high ROB models (dAUC -11.7%, IQR -33.3% to 2.6%; p<0.001; Table 1 and Figure 3).

The GEE analyses estimated a difference in dAUC of 16.8% (95% CI 6.1% to 27.6%, p=0.002) for low ROB versus high ROB after adjustment for CPM and validation characteristics (Table 2). This number can be interpreted as an absolute difference between the validation AUC of a high ROB model and a low ROB model of 0.02 when the derivation AUC is 0.60, or of 0.07 when the derivation AUC is 0.90 (Supplemental Table S2).
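To illustrate how such a clustered analysis can be specified, here is a minimal sketch using Python's statsmodels. The toy data, the column names, and the exchangeable working correlation are our own assumptions rather than details reported by the authors; the paper states only that GEE with a robust covariance estimator was used, with validations clustered within CPMs.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: one row per derivation-validation pair, with the
# dAUC (%), the ROB classification, and a CPM identifier so that
# validations of the same CPM are treated as a cluster.
df = pd.DataFrame({
    "dauc": [-12.0, -35.1, 2.4, -0.9, -8.7, -1.5],
    "rob": ["high", "high", "high", "low", "high", "low"],
    "cpm_id": [1, 1, 2, 3, 4, 5],
})

# GEE with an exchangeable working correlation within CPM; statsmodels
# reports robust (sandwich) standard errors by default.
model = smf.gee(
    "dauc ~ C(rob, Treatment(reference='high'))",
    groups="cpm_id", data=df,
    cov_struct=sm.cov_struct.Exchangeable(),
    family=sm.families.Gaussian(),
)
result = model.fit()
print(result.summary())  # coefficient on rob='low' ~ difference in dAUC
```

Note also that, because dAUC is defined relative to (AUC_derivation − 0.5), a 16.8% difference in dAUC corresponds to an absolute difference in validation AUC of 0.168 × (AUC_derivation − 0.5): about 0.02 at a derivation AUC of 0.60 and about 0.07 at 0.90, consistent with the interpretation above.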
Discussion

The selection of items for the short form, which was done strictly by expert opinion and prioritized items in the analysis domain, showed concurrent validity with the PROBAST assessment in 102 papers. Nearly all high ROB CPMs were identified through violation of an item from the analysis domain, which was well represented in the short form. Additionally, because the short form comprises a subset of the PROBAST questions, it is 100% specific for identifying low ROB CPMs. Because the number of low ROB CPMs is very small (typically <5% of the total), review of this subset with the full PROBAST should yield the same classification as applying PROBAST to all CPMs, while requiring a fraction of the time and expertise. The full PROBAST can only reclassify CPMs rated low ROB by the short form to high ROB, not the other way around; in this particular sample, only a single CPM was so reclassified from low to high ROB.

Most CPMs in our study were rated high ROB, in line with previous systematic reviews using PROBAST. 16-20 This is inherent to the structure of PROBAST: one incorrectly performed item in one domain determines an overall judgment of high ROB. While the high percentage of models with high ROB might be interpreted as a reflection of the low overall quality of the literature, it might also reflect limitations of the tool. We found substantial variation in the number of items violated by individual CPMs, which suggests variation in methodological rigor among high ROB CPMs. Moreover, the discriminatory performance within the high ROB group varied widely (IQR of dAUC ranging from -33% to +2.6%). While the purpose of this study was to validate the ROB assessment by PROBAST, future work might explore whether it is useful to identify an "intermediate" ROB category, which might provide a more graded, less stringent assessment.

Our study has several limitations. Poor model performance at external validation can be due to the relatedness of the settings in which the CPM is developed versus validated, to changes in case mix, and to methodological problems (i.e., statistical overfitting). 21,22 Our analyses focused on ROB caused by methodological issues, while the relatedness of the validation cohort and case-mix differences will also affect the observed change in discrimination. 23,24 Relatedness rubrics were used to adjust for differences in relatedness of the validation cohort. The effect of case-mix differences is potentially measurable through a model-based concordance (c) statistic, 25,26 but this requires primary patient-level data to compute. Furthermore, discrimination is not the only, or even necessarily the most important, metric by which to evaluate model performance. The net benefit of applying a model for decision support in a new population is a function of both discrimination and calibration. 27 Calibration might be more sensitive to bias in model development, since it depends on the consistency of both measured and unmeasured predictor effects. However, calibration metrics are incompletely and inconsistently reported; when reported, the metrics provided are usually either clinically uninformative (e.g., the Hosmer-Lemeshow test) or difficult to analyse quantitatively (e.g., graphical). 21

In conclusion, high ROB is pervasive and is associated with poorer model performance at validation, supporting the application of PROBAST in reviews of CPMs. A subset of questions from PROBAST may be particularly useful for high-volume assessments, when classification into high and low ROB categories is the primary goal. Furthermore, the high prevalence of high ROB models emphasizes the need to improve the methodological quality of prediction research.

Funding: This work was supported by a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1606-35555). The authors declare that the funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Table 1 footnotes: Relatedness was assessed using relatedness rubrics specifically developed for this purpose. 11 AUC indicates area under the receiver operator characteristic curve; IQR, interquartile range; ROB, risk of bias.

Table 2 footnotes: ‡Model includes relatedness and risk of bias. Interaction between relatedness and risk of bias was tested in a separate model: interaction p-value = 0.17.
§Model (20 imputed data sets) includes: relatedness, risk of bias, index condition, CPM authors (same as in validation paper, author overlap, no author overlap), CPM method (logistic, time-to-event, other; missing = 20), CPM center (single, multicentre; missing = 23), CPM source (medical record, registry, trial, other; missing = 4), validation design (cohort, trial, other; missing = 18), validation center (single, multicentre; missing = 98), validation source (medical record, registry, trial, other; missing = 44), CPM parameter degrees of freedom, CPM events per variable (missing = 208), CPM sample size (missing = 29), CPM events (missing = 208), validation events per variable (missing = 147), validation sample size (missing = 6), validation events (missing = 147), relative outcome rate difference >40% (missing = 372), and difference in years between CPM and validation. dAUC indicates the change in area under the receiver operator characteristic curve; CI, confidence interval; CPM, clinical prediction model; ROB, risk of bias.

Short form scoring note: the total score can range from 0 to 6, with 0 points indicating low risk of bias and ≥1 point indicating high risk of bias.

References

1. Prognosis and prognostic research: what, why, and how?
2. Clinical Prediction Models: A Practical Approach to Development, Validation and Updating.
3. Support of personalized medicine through risk-stratified treatment recommendations - an environmental scan of clinical practice guidelines.
4. Prognosis and prognostic research: Developing a prognostic model.
5. Towards better clinical prediction models: seven steps for development and an ABCD for validation.
6. Prognosis and prognostic research: validating a prognostic model.
7. Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: An overview and illustration.
8. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies.
9. Clinical Prediction Models for Cardiovascular Disease: Tufts Predictive Analytics and Comparative Effectiveness Clinical Prediction Model Database.
10. Tufts PACE Clinical Predictive Model Registry: update 1990 through 2015.
11. How Well Do Clinical Prediction Models (CPMs) Validate? A Large-scale Evaluation of Cardiovascular Clinical Prediction Models.
12. PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration.
13. Regression modeling strategies.
14. Longitudinal data analysis for discrete and continuous outcomes.
15. Models for longitudinal data: a generalized estimating equation approach.
16. Prognostic models for outcome prediction in patients with chronic obstructive pulmonary disease: systematic review and critical appraisal.
17. Prediction models for prostate cancer to be used in the primary care setting: a systematic review.
18. Systematic review of prediction models in relapsing remitting multiple sclerosis.
19. The Unrealised Potential for Predicting Pregnancy Complications in Women with Gestational Diabetes: A Systematic Review and Critical Appraisal.
20. Evaluating risk prediction models for adults with heart failure: A systematic literature review.
21. Assessing the performance of prediction models: a framework for traditional and novel measures.
22. Assessing the generalizability of prognostic information.
23. External validity of risk models: Use of benchmark values to disentangle a case-mix effect from incorrect coefficients.