authors: Wu, Guangyao; Woodruff, Henry C.; Chatterjee, Avishek; Lambin, Philippe
title: Reply to "COVID-19 prediction models should adhere to methodological and reporting standards"
date: 2020-08-14
journal: Eur Respir J
DOI: 10.1183/13993003.02918-2020

We would like to thank G.S. Collins, M. van Smeden, and R.D. Riley for their commentary on the design, analysis, and reporting of our article [1]. However, their comments seem to stem from a traditional biostatistics angle rather than from a translational, machine-learning research perspective, and the overwhelming majority of the criticisms arise from misunderstandings or misreadings.

The authors inaccurately state that we randomly split the datasets. As described in our manuscript, we split the data non-randomly by time and place, which is the stronger design according to the TRIPOD statement; the use of independent cohorts to test model generalizability makes it a TRIPOD Type 3 study [2]. We agree that splitting reduces the size of the training dataset and thereby increases the probability of overfitting. However, as an RNA virus, SARS-CoV-2 may mutate rapidly and develop diverse characteristics; hence we split the datasets by time and place rather than using cross-validation or bootstrapping.

The authors used the 75 candidate predictors rather than the 7 selected ones to perform their sample size calculation for our training dataset [3]. Although we agree that basing the calculation on candidate predictors is the more rigorous approach, it is overly strict in the modern machine-learning and -omics fields, and it disregards the power of the dimensionality-reduction and feature-selection methods we employed. While we understand that overfitting remains possible, the validation of the model on five datasets from unrelated institutions strengthens the likelihood that the presented model is robust. Test-set results are reported separately to make robustness easier to judge, because poor performance on a small test set is easily hidden by pooling it with a large test set on which performance is good. More importantly, the selected variables make sense from a clinical point of view [4, 5], making our models explainable, transparent, and therefore acceptable to end-users.

We agree that excluding patients with missing data may introduce bias, and we list this as the first limitation in our Discussion. Given the time-critical nature of this quickly developing pandemic, we decided that excluding 38 patients was preferable to imputation, and that any bias introduced by this selection would be revealed in the five external validations and in further validations post-publication.

The authors also inaccurately state that we assume continuous predictors to be linearly associated with the outcome. We emphasize that neither the feature selection nor the modeling assumes a linear association between predictors and outcome. The process of randomizing the outcomes and re-running the analysis is a powerful sanity check against overfitting [6]. We must also point out that the Adaptive Synthetic (ADASYN) algorithm is a published and validated method for dealing with dataset imbalance.
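For readers less familiar with these two safeguards, the sketch below illustrates them on simulated placeholder data; the predictors, outcome, and logistic-regression learner are hypothetical stand-ins rather than our published pipeline, and the open-source imbalanced-learn and scikit-learn packages are assumed to be available.

```python
# Illustrative sketch only: simulated placeholder data, not the study cohort
# or the published model. It shows (i) ADASYN oversampling applied inside a
# pipeline so that synthetic samples are created only from training folds,
# and (ii) the outcome-permutation sanity check against overfitting.
import numpy as np
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))                            # hypothetical: 7 selected predictors
y = (X[:, 0] + rng.normal(size=300) > 1.0).astype(int)   # hypothetical, imbalanced outcome (~25% "severe")

pipe = Pipeline([
    ("balance", ADASYN(random_state=0)),                 # oversample the minority class
    ("clf", LogisticRegression(max_iter=1000)),          # placeholder classifier
])

# Cross-validated AUC with the true outcomes
auc_true = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

# Sanity check: permute the outcomes and repeat. The AUC should drop to
# roughly 0.5; a substantially higher value flags an overfitting-prone setup.
auc_perm = cross_val_score(pipe, X, rng.permutation(y), cv=5, scoring="roc_auc").mean()

print(f"AUC (true labels) = {auc_true:.2f}, AUC (permuted labels) = {auc_perm:.2f}")
```

Placing the resampler inside the pipeline ensures that synthetic minority samples are generated only from the training folds, so the permutation check is not contaminated by leakage.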
Whilst we agree that such resampling could introduce an error in the model intercept, we believe that this error can be estimated when assessing the model's performance on the five external validation datasets.

Everyone has their preferred metrics, and a better metric can often be found than those commonly reported. This is especially true in the convergence zone between machine learning and clinical application, where reporting possibly sub-optimal but easily understood metrics may have added benefit over the more technical metrics used by data scientists. Reporting confusion matrices, a widely used and readily understandable way of evaluating classification performance, can easily be defended. Equally, reporting the universally adopted sensitivity and specificity metrics, together with the results of the calibration plots, aligns well with the readership of this esteemed publication.

The authors call our risk groupings arbitrary. Using three risk groups was a requirement of the clinicians and is common in the clinic, including for COVID-19: low risk (home care), medium risk (hospital surveillance), and high risk (ICU admission). The risk probability thresholds were based on the 25th and 75th probability percentiles in the balanced training set (see the illustrative sketch below). With these thresholds, the low-risk group had a <20% incidence of severe outcomes and the high-risk group a >75% chance of severe outcomes on each test set, which the clinicians deemed clinically useful.

The authors reprimand us for not reporting the model parameters explicitly. For us, the main aim of any clinical triage model is its application to individual patients in a clinical setting, and we believe that both the nomogram and the web calculator satisfy this requirement. In addition, for model evaluation, the model parameters can be fully reconstructed from the nomogram.

There are numerous checklists and guidelines for diagnostic and predictive models [7-10]. In retrospect, we agree that TRIPOD is a more appropriate checklist than STARD for modelling studies, owing to its detailed requirements for reporting methodology and results. We chose a more familiar checklist from the submission guidelines of this Journal (guidelines in which TRIPOD is not listed) and will make sure to also include TRIPOD reporting in the future. Given the quickly changing nature of machine learning and the increasing number of guidelines, it is hard to forge standards, even as the need for them in the reporting of modelling studies grows.

Overall, we believe our work is useful and explainable, and we have received positive feedback from colleagues, including clinicians, who appreciate that their requirements have been taken into account. We are currently validating our models prospectively, out of the conviction that only this approach can truly validate a predefined model.
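To make two of the points above concrete, the following minimal sketch, with simulated probabilities and outcomes standing in for our data, shows how sensitivity and specificity are read directly off a confusion matrix and how the percentile-based thresholds translate predicted probabilities into the three risk groups; the variable names and the 0.5 classification cut-off are illustrative assumptions, not our published implementation.

```python
# Illustrative sketch only: simulated probabilities and outcomes, not the
# study data. Sensitivity/specificity are read off a confusion matrix, and
# the three risk groups use cut-offs at the 25th and 75th percentiles of the
# predicted probabilities in the (balanced) training set.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
p_train = rng.random(200)                                            # hypothetical training-set probabilities
y_test = rng.integers(0, 2, size=100)                                # hypothetical test-set outcomes
p_test = np.clip(0.35 * y_test + rng.normal(0.35, 0.2, 100), 0, 1)   # hypothetical test-set probabilities

# Sensitivity and specificity at an (assumed) 0.5 classification cut-off
tn, fp, fn, tp = confusion_matrix(y_test, (p_test >= 0.5).astype(int)).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Risk groups: thresholds are fixed on the training set, then applied to new patients
low_cut, high_cut = np.percentile(p_train, [25, 75])
groups = np.full(p_test.shape, "medium", dtype=object)   # hospital surveillance
groups[p_test < low_cut] = "low"                         # home care
groups[p_test >= high_cut] = "high"                      # ICU admission

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print({g: int((groups == g).sum()) for g in ("low", "medium", "high")})
```

Fixing the cut-offs on the training set means the grouping rule is fully specified before any test-set patient is scored.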
References:
1. Development of a Clinical Decision Support System for Severity Risk Prediction and Triage of COVID-19 Patients at Hospital Admission: an International Multicenter Study.
2. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration.
3. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes.
4. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
5. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study.
6. An Empirical Approach for Avoiding False Discoveries When Applying High-Dimensional Radiomics to Small Datasets.
7. Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view.
8. Peering into the black box of artificial intelligence: Evaluation metrics of machine learning methods.
9. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction.
10. Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers - From the Radiology Editorial Board.