title: Choosing variant interpretation tools for clinical applications: context matters
authors: Aguirre, Josu; Padilla, Natàlia; Özkan, Selen; Riera, Casandra; Feliubadaló, Lidia; de la Cruz, Xavier
date: 2022-02-19
journal: bioRxiv
DOI: 10.1101/2022.02.17.480823

Motivation: Our inability to solve the Variant Interpretation Problem (VIP) has become a bottleneck in the biomedical/clinical application of Next-Generation Sequencing. This situation has favored the development and use of in silico tools for the VIP. However, choosing the optimal tool for our purposes is difficult because of a fact usually ignored: the high variability of clinical contexts/scenarios across and within countries, and over time.

Results: We have developed a computational procedure, based on the use of cost models, that allows the simultaneous comparison of an arbitrary number of tools across all possible clinical scenarios. We apply our approach to a set of pathogenicity predictors for missense variants, showing how differences in clinical context translate into differences in tool ranking.

Availability: The source code is available at: https://github.com/ClinicalTranslationalBioinformatics/clinical_space_partition

Before describing the results, we first give an intuitive explanation of the application of cost models to the VIP in the clinical setting. Although we focus on the case of pathogenicity predictors, the reasoning is valid for any variant annotation process, whether computational or experimental.

In different healthcare problems, in silico variant annotations are combined with diverse amounts of clinical data. A common characteristic of these problems, e.g., in diagnostics, is that one must identify the correct/best answer among several options.
Each answer is associated with a series of medical actions and costs to patients and their families, healthcare institutions, etc. The medical decision process aims at minimizing these costs (Hunink et al., 2014). Accurate pathogenicity predictions contribute to this goal, whereas misclassification errors (MISC), that is, benign variants predicted as pathogenic or vice versa, favor incorrect decisions. Even the apparently harmless situation in which a predictor produces no answer (which we will call NOA) has consequences: it may call for further clinical testing in diagnosis, limit the reach of screenings, etc.

In summary, both MISC and NOA are associated with costs affecting healthcare stakeholders. Reducing this impact is thus a guiding principle when ranking annotation tools for clinical applications. Here, we advance a rigorous approach based on cost models (Adams and Hand, 1999; Pepe, 2003).

To explore the dependence of the previous ranking on clinical context we could use a brute-force approach: varying rc_1 between 0 and 1 systematically and repeating the ranking at each value. This approach produces a division of the clinical space into regions within which only a single predictor is cost-optimal. However, brute force may be computationally demanding and inaccurate when the number of predictors is large. To avoid these problems, we have developed a fast alternative: an algorithm (CSP-noco, Appendix 1, Supplementary Materials) that produces the exact division of the clinical space using some formal properties of the problem. Below, we illustrate the application of this algorithm to the comparison of a set of pathogenicity predictors across the whole clinical space (Fig. 1) and also show how the results obtained depend on r (Fig. 2).

Figure caption (fragment): … represented by a triangle encompassing all cost scenarios, that is, all (rc_0, rc_1) pairs.
The line divides the triangle into two regions within which a single method is cost-optimal. b, As we add more predictors, a pattern of polygons forms; inside each polygon, a single predictor is cost-optimal. c, Identification and unification of the polygons into regions corresponding to the same predictor, using our approach (CSP-co). d, Ranking of the sixteen pathogenicity predictors according to AUC (grey bars, left half) and to the fraction of cost scenarios for which each predictor is optimal (pink bars, right half; results for r=0.5 and r=0.001). This fraction is the size of the predictor's area in the triangle, divided by 0.5 (the area of the whole triangle).

Because rc has the same meaning as in MISC, it can be used to systematically rank predictors in multiple clinical scenarios in terms of cost. However, for this task we cannot use the above algorithm, CSP-noco, because the clinical space is now two-dimensional. To solve this problem, we have developed a Breadth-First-Search-based algorithm (CSP-co, Appendix 2, Supplementary Materials) that divides the clinical space into regions within which only a single predictor is cost-optimal. As before, this division depends on r, an aspect that we explore below (Fig. 4).

In this section, we show the application of MISC and MISC+NOA, using CSP-noco and CSP-co, respectively, to the comparison of sixteen selected pathogenicity predictors across the clinical space. The goal is to study the dependence of the cost-optimal method on context.

In MISC, each pathogenicity predictor is characterized by a line relating normalized cost (rc) and clinical scenario (represented by an rc_1 value). In Fig. 1a, we show the lines for the sixteen predictors analyzed, for r=0.5. The relative positions of these lines at a given rc_1 value determine which predictor is preferable in terms of cost.
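This line comparison can be carried out exactly for any number of tools: since each predictor's cost is linear in rc_1, the boundaries between optimality regions can only lie at pairwise line intersections. The sketch below illustrates the idea; it is not the CSP-noco implementation from Appendix 1, and it assumes a standard linear form for the normalized MISC cost, rc(rc_1) = rc_1·r·(1 − s_e) + (1 − rc_1)·(1 − r)·(1 − s_p), with illustrative sensitivity/specificity values:

```python
def line_params(r, se, sp):
    """(slope, intercept) of the assumed MISC cost line rc(rc_1)."""
    intercept = (1.0 - r) * (1.0 - sp)   # cost at rc_1 = 0: only false positives weigh
    slope = r * (1.0 - se) - intercept   # cost at rc_1 = 1 minus cost at rc_1 = 0
    return slope, intercept

def exact_regions(named_lines):
    """named_lines: list of (name, slope, intercept) cost lines over rc_1 in [0, 1].
    Returns the cost-optimal division of [0, 1] as (name, start, end) intervals."""
    # Candidate breakpoints: the interval ends plus every pairwise intersection.
    xs = {0.0, 1.0}
    for i, (_, m1, b1) in enumerate(named_lines):
        for _, m2, b2 in named_lines[i + 1:]:
            if m1 != m2:
                x = (b2 - b1) / (m1 - m2)
                if 0.0 < x < 1.0:
                    xs.add(x)
    regions = []
    breaks = sorted(xs)
    for lo, hi in zip(breaks, breaks[1:]):
        mid = 0.5 * (lo + hi)            # any interior point identifies the winner
        name = min(named_lines, key=lambda l: l[1] * mid + l[2])[0]
        if regions and regions[-1][0] == name:
            regions[-1] = (name, regions[-1][1], hi)   # merge adjacent pieces
        else:
            regions.append((name, lo, hi))
    return regions

# Two illustrative tools at r = 0.5: one high-sensitivity, one high-specificity.
lines = [("hiSens",) + line_params(0.5, se=0.95, sp=0.80),
         ("hiSpec",) + line_params(0.5, se=0.80, sp=0.95)]
print(exact_regions(lines))
```

With these two illustrative tools, the high-specificity tool is cost-optimal for rc_1 below 0.5 (scenarios dominated by the cost of false positives) and the high-sensitivity tool above it; the breakpoint is located exactly rather than to grid resolution, which is the practical advantage over a brute-force scan of rc_1.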
This comparison is extended to all possible rc_1 values with our algorithm CSP-noco (Appendix 1, Supplementary Materials). The result obtained (Fig. 1a, bottom) shows that changes in rc_1, i.e., in the clinical scenario, may affect the tool of choice for pathogenicity prediction. More precisely, we see that in 81% of the cases PON-P2 is the cost-optimal method; in the remainder, REVEL and CADD prevail in 16% and 3% of the cases, respectively. This picture contrasts with the fixed view conveyed by the AUC analysis (Fig. 1b), in which REVEL has the highest AUC and would always be the tool of choice, regardless of the clinical context, if predictor selection were based only on AUC. When reproducing the same analysis with MCC instead of AUC, we find (Supplementary Figure S2a) that while choosing methods using MCC is not cost-optimal for some scenarios, on average it is better than using AUC in this case.

Using cost models identifies r, the proportion of pathogenic variants, as a factor that we may want to consider when choosing our preferred predictor. For example, if we change r from 0.5 to 0.001, the ranking of the methods changes (Fig. 1b): REVEL predominates in 99.5% of cost scenarios and PON-P2 in 0.5%. In Fig. 2, we explore more systematically the effect of r on the rankings, describing the transition between REVEL and PON-P2, an effect due to the fact that, as r grows, methods with high sensitivities are favored.

Finally, we would like to note that the previous analyses can be repeated as often as required, e.g., when clinical costs change with time, or when we are working with specific patient cohorts, because our algorithm CSP-noco is computationally inexpensive.
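The REVEL-to-PON-P2 transition cannot be reproduced here without the predictors' measured performances, but the underlying mechanism can: under a linear MISC cost of the form rc_1·r·(1 − s_e) + (1 − rc_1)·(1 − r)·(1 − s_p) (an assumed form; the paper's exact expression is in its Methods), the share of cost scenarios won by a high-sensitivity tool grows with r. The two tools below are hypothetical:

```python
def misc_cost(rc1, r, se, sp):
    """Assumed linear MISC cost at scenario rc_1 for a tool with (se, sp)."""
    return rc1 * r * (1.0 - se) + (1.0 - rc1) * (1.0 - r) * (1.0 - sp)

def optimal_fraction(predictors, r, steps=2000):
    """Fraction of rc_1 scenarios in [0, 1] in which each predictor is cost-optimal."""
    wins = {p["name"]: 0 for p in predictors}
    for i in range(steps + 1):
        rc1 = i / steps
        best = min(predictors, key=lambda p: misc_cost(rc1, r, p["se"], p["sp"]))
        wins[best["name"]] += 1
    return {name: n / (steps + 1) for name, n in wins.items()}

# Hypothetical tools: one favors sensitivity, the other specificity.
tools = [{"name": "hiSens", "se": 0.98, "sp": 0.75},
         {"name": "hiSpec", "se": 0.80, "sp": 0.97}]
for r in (0.001, 0.1, 0.5, 0.9):     # from rare to prevalent pathogenic variants
    print(r, optimal_fraction(tools, r))
```

At r = 0.001 the high-specificity tool wins almost everywhere (with few pathogenic variants, false positives dominate the expected cost); as r grows, the high-sensitivity tool takes over an increasing share of scenarios, the behavior that Fig. 2 describes for the real predictors.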
Again, our goal is to describe the distribution of optimal predictors across the clinical space (which is now two-dimensional), obtained with our algorithm CSP-co (Appendix 2, Supplementary Materials). Once applied, the algorithm divides the clinical space into a mosaic of polygons (Fig. 3b) that is processed to give the distribution of predictors sought (Fig. 3c). We see that introducing coverage considerations into the selection of predictors, i.e., using MISC+NOA, does not affect our previous overall result: deployment context affects the ranking of predictors. There are, however, some quantitative/qualitative differences in the distribution of the predictors. For example, for r=0.5, five tools (CADD, MutationTaster2, PON-P2, REVEL, and Vest), instead of three, constitute the optimal division of the clinical space. Again, this division is uneven, with REVEL prevailing in 77.3% of scenarios, PON-P2 in 8.5%, etc.

Comparison of these results with those of the AUC ranking (Fig. 3d) confirms that there are clinical scenarios for which the predictor with the top AUC is not cost-optimal. NOA plays an important role in this result. PON-P2, which has moderate coverage, prevails in cost scenarios near the diagonal (Fig. 3c), that is, in those cases for which the cost of NOA is negligible. REVEL, on the contrary, because of its higher coverage, predominates over most of the clinical space. As a consequence, using AUC for choosing predictors gives more cost-optimal results than for MISC. Reproducing the same analysis with MCC instead of AUC shows (Supplementary Figure S2b) that choosing predictors using MCC is not cost-optimal for all scenarios either, as for MISC.

Characterization of the impact of different r values (Fig. 4) completes the description of this section.
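The role of coverage can be illustrated numerically. The sketch below assumes a plausible MISC+NOA cost in which answered variants are weighted by the coverage a and the no-answer fraction (1 − a) is charged at rc_2 = 1 − rc_0 − rc_1; the paper's equation (2), not reproduced in this excerpt, is the authoritative definition, and the two tools' parameters are hypothetical. Sampling the triangle on a grid approximates each tool's share of the clinical space, the quantity reported in Fig. 3d:

```python
def noa_cost(rc0, rc1, r, se, sp, a):
    """Assumed MISC+NOA cost at scenario (rc_0, rc_1), with rc_2 = 1 - rc_0 - rc_1."""
    rc2 = 1.0 - rc0 - rc1
    answered = rc1 * r * (1.0 - se) + rc0 * (1.0 - r) * (1.0 - sp)
    return a * answered + (1.0 - a) * rc2

def area_fractions(predictors, r, n=200):
    """Approximate each predictor's share of the triangle rc_0 + rc_1 <= 1
    by sampling it on a regular grid."""
    wins = {p["name"]: 0 for p in predictors}
    total = 0
    for i in range(n + 1):
        for j in range(n + 1 - i):           # keep rc_0 + rc_1 <= 1
            rc0, rc1 = i / n, j / n
            best = min(predictors,
                       key=lambda p: noa_cost(rc0, rc1, r,
                                              p["se"], p["sp"], p["a"]))
            wins[best["name"]] += 1
            total += 1
    return {name: k / total for name, k in wins.items()}

# "wideCov" answers almost everything; "narrowCov" is more accurate but
# abstains more often (hypothetical values).
tools = [{"name": "wideCov",   "se": 0.85, "sp": 0.85, "a": 0.98},
         {"name": "narrowCov", "se": 0.97, "sp": 0.97, "a": 0.60}]
print(area_fractions(tools, r=0.5))
```

With these numbers, the accurate but lower-coverage tool wins only near the diagonal rc_0 + rc_1 ≈ 1, where rc_2, and hence the penalty for producing no answer, vanishes, while the wide-coverage tool predominates elsewhere, mirroring the PON-P2/REVEL contrast described above.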
We find that the overall picture does not vary much between different values of r, because only two to five methods out of sixteen participate in the final division of the clinical space. Their identities change slightly. Across the whole r range, REVEL predominates, with PON-P2 also appearing near the diagonal. As r increases, three other methods (CADD, …

… challenge to interested users, who have to identify the optimal tool for their needs among tens of options. Usually, this is done by ranking pathogenicity predictors on the basis of their AUCs, MCCs, etc., parameters that are blind to the deployment context. Here, we show, applying/extending cost models used in healthcare (Pepe, 2003; Hunink et al., 2014), that this procedure is not optimal and that changes in clinical scenarios can indeed affect the tool of choice. Our work is inspired (i) by the use of cost in the similar problem of deciding between candidate tests for the same task (Pepe, 2003), e.g., for ELISA assays, and (ii) by the use of cost to compare the performance of binary classifiers in the machine learning field (Adams and Hand, 1999; Drummond and Holte, 2006). We have explored two options: MISC, in which only misclassification errors are contemplated, and MISC+NOA, where, in addition, the situation in which the predictor does not produce an outcome is also contemplated. In both cases, clinical context is encoded using a few cost-related parameters (two for MISC and three for MISC+NOA). For each model, we have developed an algorithm (CSP-noco and CSP-co, respectively) that allows the comparison of an arbitrary number of predictors across the whole set of clinical scenarios. We have then applied these results to a set of sixteen pathogenicity predictors used for the VIP. The views resulting from MISC (…

…) A+R is optimal in 19.3% of the scenarios. The situation changes even more when using MISC+NOA, which also takes into account discordances between paired predictors (NOA). The distribution of the clinical space experiences a significant change (Fig. 5) relative to the MISC case. The accumulated size of A+B and A+R drops from 100% to 31.1%, and two new regions appear, corresponding to BayesDel (50%) and REVEL (18.9%), respectively. These new regions correspond to clinical contexts where NOA costs are high relative to misclassification costs. For example, these could be scenarios in which surveillance programs for pediatric patients with an unclear Li-Fraumeni diagnosis are more expensive than usual. In these settings, rejecting in silico evidence when Align-GVGD and its companion tool (REVEL or BayesDel) disagree may be a worse option than using REVEL or BayesDel alone. We are not advocating the use of these two predictors in these regions of the clinical space; we want to emphasize that cost models identify a problem in the selection of in silico tools. There are different options to address this problem, some of which are directly suggested by the cost model (equation (2)). For example, we can focus on the method-dependent parameters, sensitivity (s_e), specificity (s_p), and coverage (a), and look for combinations of predictors that guarantee good s_e and s_p while increasing a. This is a realistic alternative, since presently there are well over 50 pathogenicity predictors available (see Özkan et al., 2021, and references therein). Alternatively, we can aim to increase a by developing methods specific for variants with conflicting predictions (de la Campa, Padilla and de la Cruz, 2017). We can also try to act directly on the cost parameters; e.g., in the case of rc_2, we could look for strategies that reduce the costs of managing patients with no diagnosis. In summary, the use of cost models helps identify key issues in the choice of predictors and provides insight on how to solve them.

Acknowledgements. The authors acknowledge comments on the work from members of the Pirepred European consortium.

References
- Comparing classifiers when the misallocation costs are uncertain
- PolyPhen-2: prediction of functional effects of human nsSNPs
- Performance of ACMG-AMP Variant-Interpretation Guidelines among Nine Laboratories in the Clinical Sequencing Exploratory Research
- Challenges in genetic testing: clinician variant interpretation processes and the impact on clinical care
- Identifying Mendelian disease genes with the variant effect scoring tool
- Predicting the functional effect of amino acid substitutions and indels
- Identification of deleterious mutations within three human genomes
- Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies
- Cost curves: An improved method for visualizing classifier performance
- A Collaborative Effort to Define Classification Criteria for ATM Variants in Hereditary Cancer Patients
- Improved, ACMG-compliant, in silico prediction of pathogenicity for missense substitutions encoded by TP53 variants
- ClinGen TP53 Variant Curation Expert Panel: Specifications of the ACMG/AMP variant interpretation guidelines for germline TP53 variants, Human Mutation
- Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines
- Assessing the Performance of Classification Methods, International Statistical Review
- A unified view of performance metrics: Translating threshold choice into expected classification loss
- Decision Making in Health and Medicine: Integrating Evidence and Values, second edition
- REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants
- Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm
- Development of pathogenicity predictors specific for variants that do not comply with clinical guidelines for the use of computational evidence
- dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs
- A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis
- PMut: A web-based tool for the annotation of pathological variants on proteins
- PON-P2: Prediction Method for Fast Reliable Identification of Harmful Variants
- The computational approach to variant interpretation: principles, results, and applicability
- Variant Interpretation: Theory and Practice
- Inferring the molecular and phenotypic impact of amino acid variants with MutPred2
- The Statistical Evaluation of Medical Tests for Classification and Prediction
- CADD: Predicting the deleteriousness of variants throughout the human genome
- Predicting the functional impact of protein mutations: Application to cancer genomics
- Standards and guidelines for the interpretation of sequence variants: A joint consensus recommendation of the American College of …
- Li-Fraumeni syndrome
- MutationTaster2: Mutation prediction for the deep-sequencing age
- Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models
- Variant Interpretation: Functional Assays to the Rescue
- Problems in variation interpretation guidelines and in their implementation in computational tools
- Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal