Collins, Gary S; Riley, Richard D; van Smeden, Maarten. Flaws in the development and validation of a covid-19 prediction model. Clin Infect Dis, 2020-09-16. DOI: 10.1093/cid/ciaa1406

To the Editor: The covid-19 pandemic has seen the development of a large number of clinical prediction models to support the assessment of disease severity or to aid prognosis. A recent systematic review identified 145 such models and concluded that all were at high risk of bias, citing concerns with data quality, statistical analysis, and reporting, and that consequently none could be recommended for use [1]. We therefore read with interest the recent paper by Dong and colleagues describing the development of a prediction model for assessing survival in covid-19 patients [2]. Unfortunately, we observed a number of concerns in the study that we believe deserve highlighting to readers.

In their paper, the authors randomly split their data into training and validation sets, a practice widely known to be statistically inefficient [3]: it reduces the sample size available for developing the model (increasing the risk of overfitting), while the validation cohort is often too small to evaluate performance reliably. The authors then developed a model on the training cohort, which contained ~75 deaths (60% of 121 deaths), while examining over 30 candidate predictors; this will almost certainly lead to an overfit model [4]. The resulting model was then evaluated in the validation cohort, which contained only ~46 deaths (40% of 121 deaths), far fewer than the recommended minimum of 100 deaths for validation [5].
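To make the overfitting concern concrete, the numbers reported above imply a very low events-per-variable (EPV) ratio. A minimal sketch of that arithmetic, using only the figures stated in this letter (~75 deaths in a 60% training split of 121 deaths, and 30 as a conservative count of the "over 30" candidate predictors); the traditional EPV >= 10 rule of thumb is a rough heuristic, not a guarantee:

```python
# Rough events-per-variable (EPV) check using the figures cited in the letter.
events = 0.60 * 121          # ~75 deaths in the 60% training split
candidate_predictors = 30    # conservative: the paper examined "over 30"

epv = events / candidate_predictors
print(f"EPV = {epv:.1f}")    # well below the traditional >= 10 rule of thumb
```

An EPV of roughly 2.4 is far below even lenient guidance, which is why overfitting is nearly certain here.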
The preferred approach, making the most of the available data, would be to develop the model on all available data and to use resampling methods such as bootstrapping or cross-validation both to adjust the performance measures (e.g., the c-index) for optimism (the reported values are almost certainly too high) and to shrink the model coefficients (which will likely be overestimated [6]). Other major analysis concerns include: categorisation of continuous predictors (which results in a loss of information [7]); no mention of missing data; the use of the lasso followed by 'multivariate' (sic) Cox regression to screen predictors for inclusion; an incorrect (i.e., not reflecting the actual model-building process) and confusing implementation of cross-validation on the validation data; a weak assessment of model calibration based on binning observations; the assessment of both the AUC and the c-index, which both measure discrimination (although only the latter is appropriate for survival outcomes); and no assessment of clinical utility.

Our second point relates to model presentation. The authors presented their model as a nomogram, presumably to aid clinical uptake [8]. However, for other investigators to externally validate this model in their own data, it is important that the model underpinning the nomogram is fully reported, namely all the regression coefficients and, importantly, the baseline survival at the timepoints of interest (i.e., at 14 and 21 days). This latter information is missing, impeding independent external validation of the model.
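The bootstrap optimism correction recommended above can be sketched as follows. This is a minimal illustration on synthetic data with a logistic model and the AUC as the performance measure (not the authors' Cox model, data, or c-index); the resampling logic, repeatedly refitting on a bootstrap sample and contrasting performance in that sample with performance in the original data, is the same idea:

```python
# Sketch of bootstrap optimism correction (Harrell-style) on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 200, 10                              # modest sample, many predictors
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)  # only X[:, 0] is signal

# Apparent performance: fit and evaluate on the same (full) data.
model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Optimism: average gap between bootstrap-sample performance and
# performance of the same refitted model in the original data.
B, optimism = 200, []
for _ in range(B):
    idx = rng.integers(0, n, n)             # bootstrap resample (with replacement)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected = apparent - np.mean(optimism)
print(f"apparent AUC:           {apparent:.3f}")
print(f"optimism-corrected AUC: {corrected:.3f}")
```

Because all n observations are used both for fitting and for honest performance estimation, nothing is wasted on a held-out split, which is the efficiency argument made above.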
The TRIPOD Statement (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) for reporting (covid) prediction models should be consulted (www.tripodstatement.org) so that important information is presented when authors describe their study [9]. Accompanying the TRIPOD Statement is an Explanation and Elaboration paper that describes the rationale for reporting, as well as discussing various methodological aspects [10].

The authors report no conflicts of interest.

References
1. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal.
2. Development and Validation of a Nomogram for Assessing Survival in Patients with COVID-19 Pneumonia.
3. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis.
4. Sample size for binary logistic prediction models: Beyond events per variable criteria.
5. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study.
6. Prognosis Research in Healthcare: Concepts, Methods and Impact.
7. Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model.
8. Guide to presenting clinical prediction models for use in clinical settings.
9. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis: The TRIPOD statement.
10. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration.