Interpretable Machine Learning for Genomics
David S. Watson
2021-10-06

Abstract. High-throughput technologies such as next generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state of the art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.

§1 Introduction

Technological innovations have made it relatively cheap and easy to observe biological organisms at molecular resolutions. High-throughput methods such as next generation sequencing and the full suite of "omic" platforms - e.g., genomic, proteomic, metabolomic, and related technologies - have inaugurated a new era of systems biology, providing data so abundant and detailed that researchers are not always sure just what to do with the newfound embarrassment of riches. One of the most salient traits of these datasets is their sheer size. Sequencing technologies can record anywhere from a few thousand to a few billion features per sample. Another important factor, related but distinct, is that genomic data is not immediately intelligible to humans. Whereas a small child can accurately classify pictures of animals, experts cannot generally survey a genetic sequence and predict health outcomes - at least not without the aid of advanced statistical models. Machine learning (ML) algorithms are designed to automatically mine data for insights using few assumptions and lots of computational power.
With their ability to detect and exploit complex relationships in massive datasets, ML techniques are uniquely suited to the challenges of modern genomics. In this article, I will focus specifically on supervised learning methods, which attempt to estimate a function from inputs (e.g., gene expression) to outputs (e.g., disease diagnosis). ML algorithms have become enormously popular in medical research (Topol, 2019), especially in imaging tasks such as radiological screening (Mazurowski et al., 2019) and tumor identification (McKinney et al., 2020). They have also been successfully applied to complex molecular problems such as antibiotic discovery (Stokes et al., 2020) and predicting regulatory behavior from genetic variation (Eraslan et al., 2019). ML promises to advance our understanding of fundamental biology and revolutionize the practice of medicine, enabling personalized treatment regimes tailored to a patient's unique biomolecular profile.

Despite all their strengths and achievements, supervised learning techniques pose a number of challenges, some of which are especially troubling in biomedical contexts. Foremost among these is the issue of interpretability. Successful ML models are often so complex that no human could possibly follow the reasoning that leads to individual predictions. Inputs may pass through a long sequence of recursive nonlinearities, spanning thousands or millions of parameters, before a prediction emerges out the other side. How can such a black box, no matter how accurate, advance our knowledge of biological mechanisms? How can we trust what we do not understand?

Interpretable machine learning (iML) - also known as explainable artificial intelligence (xAI) or, more simply, explainability - is a fast-growing subfield of computational statistics devoted to helping users make sense of the predictions of ML models. Numerous surveys of this literature already exist, though few focus narrowly on iML for genetics. For reviews of interpretable deep learning in genomics, see Talukder et al. (2020) and Binder (this issue). My goal in this article is not to add yet another survey to this overpopulated field. Instead, I aim to provide a gentle and critical introduction to iML for genomic researchers. I define relevant concepts, motivate prominent approaches, and taxonomize popular methodologies. I examine important opportunities and challenges for iML in genomics, arguing that more intelligible algorithms will ultimately advance our understanding of systems biology and play a crucial role in realizing the promise of precision medicine. I show how iML algorithms augment the traditional bioinformatics workflow with explanations that can be used to guide data collection, refine training procedures, and monitor models throughout deployment (see Fig. 1). To do so, however, the field must overcome several conceptual and technical obstacles. I outline these issues and suggest a number of future research directions, all of which require close interdisciplinary collaboration.

Figure 1. The classic bioinformatics workflow spans data collection, model training, and deployment. iML augments this pipeline with an extra interpretation step, which can be used during training and throughout deployment (incoming solid edges). Algorithmic explanations (outgoing dashed edges) can be used to guide new data collection, refine training, and monitor models during deployment.

The remainder of this article is structured as follows. I review relevant background material and provide a typology of iML in Sect. 2. I examine three motivations for iML in Sect. 3, each of which has a role to play in computational biology.
In Sect. 4, I introduce a number of popular iML methodologies that have been used in recent genomic research. A discussion follows in Sect. 5, where I consider three open challenges for iML that are especially urgent in bioinformatics. Sect. 6 concludes.

§2 Background

In supervised learning, we assume access to a finite training dataset of $n$ input-output pairs, $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, from which we estimate a function $f$ mapping inputs to outputs (Vapnik, 1995). Popular algorithms of this type include (potentially regularized) linear models, neural networks, and tree-based ensembles such as random forests and gradient boosting machines. See Hastie, Tibshirani, & Friedman (2009) for a good introduction.

It is not always obvious what would constitute a successful explanation of a given model prediction. Indeed, explanation itself is an epistemologically contested concept, the subject of ancient and modern philosophical debates (Woodward, 2019). It should perhaps come as no surprise then to learn that algorithmic explanations come in many flavors. The first point to acknowledge is that iML tools are used for different analytic purposes. For instance, they may help to estimate or understand a true functional relationship presumed to hold in nature. Alternatively, they may be used to analyze the behavior of a fitted model - to illuminate the black box, as it were. Finally, they may be involved in the design of so-called "glass box" algorithms, i.e. some novel function class specifically built for transparency. These goals may overlap at the edges, and methods originally intended for one may be repurposed for another. However, each represents a distinct task with its own challenges. Not all are equally prevalent in genomics, though this review will discuss examples of each. Keeping these aims separate is crucial to avoid the conceptual pitfalls addressed in Sect. 5. The following typology is adapted from Molnar's (2021) textbook.

Intrinsic vs. post-hoc. Intrinsically interpretable models, such as sparse linear regressions and shallow decision trees, are transparent by design, and some argue that only such models should be used for high-stakes decisions (Rudin, 2019). Unfortunately, many interesting real-world problems cannot be adequately modelled with intrinsically explainable algorithms. Watson et al. (2019) argue that in clinical medicine, doctors are obligated to use whatever available technology leads to the best health outcomes on average, even if that involves opaque ML algorithms. Of course, flexible ML models are also prone to overfit the training data, especially in high-dimensional settings. The choice of which approach to use invariably depends on contextual factors - the task at hand, the prior knowledge available, and what assumptions the analyst deems reasonable. Should researchers choose to use some black box method, interpreting predictions will require post-hoc tools, which take a target model f as input and attempt to explain its predictions, at least near some region of interest.
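To make the distinction concrete, here is a minimal sketch (simulated data and hypothetical variable names, not drawn from the original article) that fits both an intrinsically interpretable sparse linear model and a black-box ensemble to expression-like data; the latter is the kind of model that the post-hoc tools discussed below are designed to explain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated "expression" matrix: n samples x p genes (purely illustrative).
n, p = 500, 100
X = rng.normal(size=(n, p))
# Outcome depends on a few genes plus an interaction, mimicking a subtype label.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 2.0 * X[:, 2] * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Intrinsically interpretable: an L1-penalized logistic regression.
# Its nonzero coefficients double as a global explanation.
sparse_lm = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse_lm.fit(X_train, y_train)
print("sparse model accuracy:", sparse_lm.score(X_test, y_test))
print("nonzero coefficients:", np.flatnonzero(sparse_lm.coef_))

# Black box: a gradient boosting ensemble. More flexible, but its predictions
# must be interpreted post hoc (e.g., with partial dependence, LIME, or SHAP).
black_box = GradientBoostingClassifier(random_state=0)
black_box.fit(X_train, y_train)
print("black-box accuracy:", black_box.score(X_test, y_test))
```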
Model-specific vs. model-agnostic. Model-specific iML solutions take advantage of the assumptions and architectures upon which particular algorithms are built to generate fast and accurate explanations. For example, much work in iML has been specifically devoted to deep neural networks (Bach et al., 2015; Montavon et al., 2017; Shrikumar, Greenside, & Kundaje, 2017; Sundararajan, Taly, & Yan, 2017), an especially rich class of functions with unique explanatory affordances and constraints. Model-agnostic tools, on the other hand, strive for more general applicability. Treating the fitted function f as a black box, they attempt to explain its predictions with few or no assumptions about the data generating process. Model-agnostic approaches are especially useful in cases where f is inaccessible (for example if an algorithm is protected by intellectual property laws), while model-specific methods are generally more efficient and reliable when f's structure is known.

Global vs. local. A global explanation helps the user understand the behavior of the target model f across all regions of the feature space. This is difficult to achieve when f is complex and/or high-dimensional. A local explanation, by contrast, is only meant to apply to the area near some particular point of interest. For instance, a properly specified linear regression is globally explainable in the sense that the model formula holds with equal probability for any randomly selected datapoint. However, a local linear approximation to some nonlinear f will fit best near the target point, and does not in general tell us anything about how the model behaves in remote regions of the feature space. In biological contexts, we can think of global and local explanations as applying at the population and individual levels, respectively. These are poles of a spectrum that also admits intermediate alternatives, e.g. subpopulation- or group-level explanations.

A final axis of variation for iML tools is their output class. Typically, these methods explain predictions through some combination of images, statistics, and/or examples. Visual explanations are especially well suited to image classifiers (see Fig. 2). Other common visual approaches include plots that illustrate the partial dependence of predictions on one or two features of interest (Friedman, 2001).
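A minimal sketch of one such global, visual explanation, using scikit-learn's partial dependence utilities on the hypothetical black-box model from the previous snippet (the gene indices are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted probability on "gene 0" and "gene 2",
# plus their two-way interaction, averaged over the training distribution.
# (Feature indices are illustrative; in practice one would pass gene names.)
PartialDependenceDisplay.from_estimator(
    black_box, X_train, features=[0, 2, (2, 3)], kind="average"
)
plt.tight_layout()
plt.show()
```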
§3 Motivations

Explanations of ML predictions serve at least three distinct purposes: (1) to audit for potential bias; (2) to validate performance, guarding against unexpected errors; and (3) to discover underlying mechanisms of the data generating process. All three are relevant for genomics.

Healthcare often magnifies social inequalities. For recent evidence, look no further than the COVID-19 pandemic, which disproportionately affects minority populations in the US and UK (Egede & Walker, 2020). ML threatens to automate these injustices. Obermeyer et al. (2019) found evidence of significant racial bias in a healthcare screening algorithm used by millions of Americans. Simulations suggest that rectifying the disparity would nearly triple the number of Black patients receiving medical care. Similar problems are evident in genomic research. Individuals of white European ancestry make up some 16% of the global population, but constitute nearly 80% of all genome-wide association study (GWAS) subjects, raising legitimate concerns that polygenic risk scores and other tools of precision medicine may increase health disparities (Martin et al., 2019). As genomic screening becomes more prevalent, there will be substantial and justified pressure to ensure that new technologies do not reinforce existing inequalities. Algorithmic fairness and explainability may even be legally required under the European Union's 2018 General Data Protection Regulation (GDPR), depending on one's interpretation of the relevant articles (Selbst & Powles, 2017; Wachter et al., 2017). By making a model's reliance on potentially sensitive attributes more transparent, iML methods can help quantify and mitigate potential biases.

The second motivation concerns the generalizability of ML models. Supervised learning algorithms are prone to overfitting, which occurs when associations in the training data do not generalize to test environments. In a famous example, a neural network trained on data from a large New York hospital classified asthmatics with pneumonia as low risk, a result that came as a surprise to the doctors and data scientists working on the project (Caruana et al., 2015). The algorithm had not uncovered some subtle pulmonological secret. On the contrary, the apparent association was spurious. Asthmatics with pneumonia are at such great risk that emergency room doctors immediately send them to the intensive care unit, where their chances of survival are relatively high. It was the extra medical attention, not the underlying condition, that improved outcomes for these patients. Naively applying this algorithm in a new environment - e.g., some hospital where patient triage is performed by a neural network with little or no input from doctors - could have grave consequences. Overfitting has been observed in GWAS models (Nicholls et al., 2020), where associations that appear informative in one population do not transfer to another. Such failures can be difficult to detect given the complexity of the underlying signals, which may depend on environmental factors or subtle multi-omic interactions. The problem of external validity or transportability is well known in the natural and social sciences, if not always well understood (Bareinboim & Pearl, 2016; Pearl & Bareinboim, 2014). Recent work has begun to tackle the problem directly in genomics, for instance by searching for associations that remain consistent across multiple environments (Li et al., 2021). iML algorithms, together with causal inference tools (Imbens & Rubin, 2015; Pearl, 2000; Peters et al., 2017), can help researchers identify and remove spurious signals, ensuring better generalizability to new environments.

A final goal of iML, less widely discussed than the previous two but arguably of greater interest in genomics, is to reveal unknown properties and mechanisms of the data generating process. In this case, the guiding assumption is not that the target model f is biased or overfit; on the contrary, we assume that it has found some true signal and investigate its internal logic to learn more. This may mean examining weights from a support vector machine, approximating a decision boundary with some local linear model, or extracting Boolean rules to describe the geometry of some complex regression surface. Examples of all these approaches and more will be examined below, demonstrating how iML can be - and to some extent already has been - integrated into genomic research workflows. By unpacking the reasoning that underlies high-performance statistical models, iML algorithms can mine for insights and suggest novel hypotheses in a flexible, data-driven manner. Rightly or wrongly, it is this capacity - not its potential utility for auditing or validation - that is likely to inspire more widespread adoption in bioinformatics.

§4 Methodologies

In this section, I introduce a number of prominent approaches to iML. As a running example, consider a hypothetical patient, Alice, whose tumor has been assigned to a molecular subtype by a supervised model trained on gene expression data. Perhaps the simplest and most familiar class of explanatory tools is variable importance (VI) measures, which rank input features by their contribution to model predictions. In a linear model, for instance, (standardized) coefficients serve as natural VI scores, and in our running example they might highlight genes such as BRCA1, which is strongly associated with basal-like breast cancer (BLBC) (Turner & Reis-Filho, 2006), and ESR1, a known marker of the luminal A subtype (Sørlie et al., 2003). The ease of computing VI for linear models has been exploited to search for causal variants in single nucleotide polymorphism (SNP) arrays via penalized regression techniques like the lasso (Tibshirani, 1996) and elastic net (Zou & Hastie, 2005), both popular in GWAS (Waldmann et al., 2013).
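A minimal sketch of this idea, using an L1-penalized model on simulated SNP-like data (hypothetical variables; the lasso here stands in for the GWAS pipelines cited above):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Simulated genotypes: n individuals x p SNPs coded 0/1/2 (minor allele counts).
n, p = 400, 1000
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# A continuous trait driven by a handful of "causal" variants plus noise.
causal = [10, 250, 777]
beta = np.zeros(p)
beta[causal] = [0.8, -0.6, 0.5]
y_trait = G @ beta + rng.normal(scale=1.0, size=n)

# Cross-validated lasso: nonzero coefficients act as variable importance scores.
lasso = LassoCV(cv=5, random_state=0).fit(G, y_trait)
selected = np.flatnonzero(lasso.coef_)
print("selected SNPs:", selected)
print("true causal SNPs:", causal)
```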
Random forests (Breiman, 2001a), one of the most common supervised learning methods in genomics (Chen & Ishwaran, 2012), can provide a range of marginal or conditional VI scores, typically based either on permutations or impurity metrics (Altmann et al., 2010; Nembrini et al., 2018; Strobl et al., 2008). Support vector machines (SVMs) may give intelligible feature weights depending on the underlying kernel (Schölkopf et al., 2004). For example, Sonnenburg et al. (2008) use a string kernel to predict splice sites in C. elegans and extract relevant biological motifs from the resulting positional oligomer importance matrices. The method has since been extended to longer sequence motifs and more general learning procedures (Vidovic et al., 2015; Vidovic et al., 2017). More recently, Kavvas et al. (2020) used an intrinsically interpretable SVM to identify genetic determinants of antimicrobial resistance (AMR) from whole genome sequencing data. Whether these methods tell us about the importance of features in nature or just in some fitted model f depends on whether we assume that f accurately captures the functional form of the relationship between predictors and outcomes.

To go back to the typology above, VI measures are global parameters that may be intrinsic (as in linear models) or post-hoc (as in random forest permutation importance). Model-specific versions are popular, although several model-agnostic variants have emerged in recent years. These include targeted maximum likelihood measures (Hubbard, Kennedy, & van der Laan, 2018; Williamson et al., 2020), nested model tests using conformal inference (Lei et al., 2018; Rinaldo et al., 2019), and permutation-based reliance statistics (Fisher et al., 2019). Such methods typically involve computationally intensive procedures such as bootstrapping, permutations, and/or model refitting, which pose both computational and statistical challenges when applied in settings with large sample sizes and/or high-dimensional feature spaces, such as those commonly found in genomics. A notable exception specifically designed for high-dimensional problems is the knockoff test for variable selection (Barber & Candès, 2015; Candès et al., 2018). The basic idea behind this approach is to generate a set of "control" variables - the eponymous knockoffs - against which to test the importance of the original input features. For a given $n \times p$ design matrix $X$, we say that $\tilde{X}$ is the corresponding knockoff matrix if it meets the following two conditions: (1) exchangeability, i.e. swapping any subset of original columns with their knockoff counterparts leaves the joint distribution of $(X, \tilde{X})$ unchanged; and (2) conditional independence, $\tilde{X} \perp Y \mid X$, so that knockoffs carry no information about the response beyond what is already contained in the original features. Features whose importance scores do not clearly exceed those of their knockoffs can then be discarded while provably controlling the false discovery rate.

Whereas VI measures summarize a model's global behavior, many questions in precision medicine are irreducibly local. A popular class of local methods, exemplified by LIME (Ribeiro et al., 2016), approximates the target model near a single point of interest, for instance by fitting a sparse linear surrogate to the model's outputs on perturbed copies of that point. To see why this matters, recall our hypothetical patient Alice, whose tumor has been classified as BLBC. Suppose she deviates markedly from the classic patient profile, yet our algorithm assigns her to this class with high probability. Given the aggressive treatment regime likely to follow such a diagnosis, Alice wants to be certain that the classification is correct. A local explanation reveals that, though BRCA1 mutations account for many BLBC predictions, in her case, the feature is relatively unimportant. Instead, her local explanation turns largely on CXCR6 - a gene associated with the Basal I subtype, which has a better prognosis on average than Basal II, and is therefore less likely to require high doses of chemotherapy (Milioli et al., 2017).
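A minimal sketch of the idea behind such local surrogates, continuing the earlier hypothetical snippets. This is a simplified, hand-rolled LIME-style approximation rather than the official lime package; the sampling scheme and kernel are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(model, x, X_background, n_samples=2000, width=1.0, seed=0):
    """Fit a weighted linear surrogate to `model` around the point `x`.

    Perturbed samples are drawn by adding Gaussian noise scaled to each
    feature's spread, and weighted by an exponential kernel so that points
    closer to `x` count more. The surrogate's coefficients serve as a
    local feature attribution.
    """
    rng = np.random.default_rng(seed)
    scale = X_background.std(axis=0)
    Z = x + rng.normal(size=(n_samples, x.shape[0])) * scale
    preds = model.predict_proba(Z)[:, 1]                 # black-box outputs
    dist = np.linalg.norm((Z - x) / (scale + 1e-12), axis=1)
    weights = np.exp(-(dist ** 2) / (2 * width ** 2))
    surrogate = Ridge(alpha=1.0).fit(Z - x, preds, sample_weight=weights)
    return surrogate.coef_

# Explain the black-box prediction for one sample ("Alice"), then list the
# locally most influential features.
coefs = local_surrogate(black_box, X_test[0], X_train)
top = np.argsort(np.abs(coefs))[::-1][:5]
print("locally most important feature indices:", top)
```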
Perhaps the most widely adopted local attribution method is SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017), which has been incorporated into explainability toolkits from companies such as Google and IBM. I will briefly explicate the theory behind this method, which unifies a number of similar approaches, including LIME. Shapley values were originally proposed as a way to fairly distribute surplus across a coalition of players (Shapley, 1953). In iML settings, players are replaced by input features and Shapley values measure their contribution to a given prediction. Formally, let $\mathbf{x} \in \mathbb{R}^d$ denote an input datapoint and $f$ the fitted model whose prediction we seek to explain. Following a general formulation proposed in recent work (2020), we consider a value function for each coalition of features $S \subseteq \{1, \dots, d\}$, indexed by a reference distribution $\mathcal{D}$:

$v(S) = \mathbb{E}_{\mathbf{X}_{\bar{S}} \sim \mathcal{D}}\big[f(\mathbf{x}_S, \mathbf{X}_{\bar{S}})\big],$

where features in $S$ are fixed at their observed values and the remaining features $\bar{S}$ are sampled from $\mathcal{D}$. Popular options for $\mathcal{D}$ include the marginal distribution $P(\mathbf{X}_{\bar{S}})$, which is the default choice in SHAP; the conditional $P(\mathbf{X}_{\bar{S}} \mid \mathbf{x}_S)$, implemented in the R package shapr (Aas, Jullum, & Løland, 2019); and the interventional $P(\mathbf{X}_{\bar{S}} \mid do(\mathbf{x}_S))$, recently proposed by Heskes et al. (2020). Each reference distribution offers certain advantages and disadvantages, but the choice of which to use is ultimately dependent upon one's analytical goals (see Sect. 5).

SHAP has been used to identify biomarkers in a number of genomic studies. A model-specific variant known as DeepSHAP - with close ties to related methods DeepLIFT (Shrikumar et al., 2017) and integrated gradients (Sundararajan et al., 2017), all techniques for explaining the predictions of deep neural networks - was recently used in conjunction with a model for predicting differential expression based on genome-wide binding sites on RNAs and promoters (Tasaki et al., 2020). The same tool was used to identify CpG loci in a DNA methylation experiment that best predicted a range of biological and clinical variables, including cell type, age, and smoking status (Levy et al., 2020).

Figure 3. Top genes as selected by DeepSHAP.

Another SHAP variant - TreeExplainer (Lundberg et al., 2020), which is optimized for tree-based ensembles such as random forests - was used to identify taxa in the skin microbiome most closely associated with various phenotypic traits (Carrieri et al., 2021). SHAP is also gaining popularity in mass spectrometry, where data heterogeneity can complicate more classical inference procedures (Tideman et al., 2020; Xie et al., 2020). In each of these cases, further investigation was required to confirm the involvement of selected features in the target functions. The outputs of SHAP, or any other iML algorithm for that matter, are by no means decisive or infallible. However, they offer a principled and novel approach for feature ranking and selection, as well as exploring interactions that can reveal unexpected mechanisms and guide future experiments. For instance, Lundberg et al. (2020) use Shapley interaction values to demonstrate that white blood cells are positively associated with risk of end stage renal disease (ESRD) in patients with high blood urea nitrogen, but negatively associated with ESRD in patients with low blood urea nitrogen (see Fig. 4). The multivariate nature of these attributions makes them more informative than the probe-level analyses common in differential expression testing, even when shrinkage estimators are used to "pool information" across genes (Law et al., 2014; Love, Huber, & Anders, 2014; Smyth, 2004).
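A minimal sketch of SHAP in practice, using the shap Python package's TreeExplainer on the hypothetical gradient boosting model from the earlier snippets (the package is real; the data, model, and feature indices are the illustrative ones defined above):

```python
import numpy as np
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles.
# Note: for some model types shap_values returns one array per class;
# for a binary sklearn gradient boosting model it is a single (n, p) array.
explainer = shap.TreeExplainer(black_box)
shap_values = explainer.shap_values(X_test)

# Global importance: mean absolute attribution per feature across the test set.
global_importance = np.abs(shap_values).mean(axis=0)
print("top features globally:", np.argsort(global_importance)[::-1][:5])

# Local explanation for a single patient ("Alice"): per-feature contributions
# that, together with the expected value, sum to the model's output for her.
print("Alice's top local contributions:",
      np.argsort(np.abs(shap_values[0]))[::-1][:5])
```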
A third family of iML methods relies on rule lists, which describe model behavior through chains of if-then statements. Humans find such conditional logic easy to follow, at least when there are relatively few conditions, which is why rule lists are widely promoted as "intrinsically interpretable" (Lage et al., 2018). This accords with the privileged position of material implication in propositional logic, where → is typically regarded as a primitive relation, along with conjunction (∧), disjunction (∨), and negation (¬). These logical connectives form a functionally complete class, capable of expressing all possible Boolean operations. This flexibility, which allows for nonmonotonic and discontinuous decision boundaries, affords greater expressive power than linear models. Thus, if the true reason for Alice's unexpected diagnosis lies neither in her BRCA1 allele nor her CXCR6 expression but rather in some nonlinear interaction between the two, then she may be better off with a rule list that can concisely explain the (local or global) behavior of that function.

In statistical contexts, rule lists are generally learned through some process of recursive partitioning. For instance, the pioneering classification and regression tree (CART) algorithm (Breiman et al., 1984) predicts outcomes by dividing the feature space into hyperrectangles that minimize predictive error. Computing optimal decision trees is NP-complete (Hyafil & Rivest, 1976), but CART uses greedy heuristics that generally work well in practice. Because individual decision trees can be unstable predictors, they are often combined through ensemble methods such as bagging (Breiman, 2001a), in which predictions are averaged across trees trained on random bootstrap samples, and boosting (Friedman, 2001), in which predictions are summed over a series of trees, each sequentially optimized to improve upon the last. While combining basis functions tends to improve predictions, it unfortunately makes it difficult if not impossible to extract individual rules for better model interpretation. However, some regularization schemes have been developed to post-process complex tree ensembles for precisely this purpose. For instance, Friedman & Popescu (2008) propose the RuleFit algorithm, which mines a collection of Boolean variables by extracting splits from a gradient boosted forest. These engineered features are then combined with the original predictors in a lasso regression, producing a sparse linear combination of splits and inputs. Nalenz & Villani (2018) develop a similar procedure using a Bayesian horseshoe prior (Carvalho et al., 2010) instead of an $\ell_1$ penalty to induce shrinkage. They also combine splits extracted from a random forest with those learned via gradient boosting to promote greater diversity.

Another strand of research in this area has focused on falling rule lists, which create monotonically ordered decision trees such that the probability of the binary outcome $Y = 1$ strictly decreases as one moves down the list. These models were originally designed for medical contexts, where doctors must evaluate patients quickly and accurately. For instance, Letham et al. (2015) design a Bayesian rule list to predict stroke risk, resulting in a model that outperforms leading clinical diagnostic methods while being small enough to fit on an index card. Falling rule lists can be challenging to compute - see the note above about NP-completeness - and subsequent work has largely focused on efficient optimization strategies. Specifically, researchers have developed fast branch-and-bound techniques to prune the search space and reduce training time (Chen & Rudin, 2018; Yang et al., 2017), culminating in several tree-learning methods that are provably optimal under some restrictions on the input data (Angelino et al., 2018; Hu et al., 2019). Less work has been done on localized rule lists, but there have been some recent advances in this direction. Ribeiro et al. (2018) followed up on their 2016 LIME paper with a new method, Anchors, which combines graph search with a multi-armed bandit procedure to find a minimal set of sufficient conditions for a given model prediction. Guidotti et al. (2018) introduce LORE, which simulates a balanced dataset of cases using a genetic algorithm, then fits a decision tree to this synthetic neighborhood from which explanatory and counterfactual rules are extracted.
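A minimal sketch of rule extraction via recursive partitioning, using a shallow scikit-learn decision tree on the simulated data from the earlier snippets (a simple stand-in for the more sophisticated rule-list learners cited above):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# A depth-limited tree: each root-to-leaf path is an if-then rule over a
# handful of (hypothetical) gene-expression thresholds.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Print the learned rules in plain text; feature names are illustrative.
feature_names = [f"gene_{j}" for j in range(X_train.shape[1])]
print(export_text(tree, feature_names=feature_names))
```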
Figure 5. Example rule lists for AMR prediction from genotype data. Each rule detects the presence/absence of a k-mer and is colored according to the genomic locus at which it was found. From Drouin et al. (2019, p. 4).

To date, rule lists have not been as widely used in genomics as feature attributions or local linear approximations. This likely has more to do with computational obstacles than any preference for particular model assumptions, per se. Still, some recent counterexamples buck the trend. Drouin et al. (2019) combine sample compression theory with recursive partitioning to learn interpretable genotype-to-phenotype classifiers with performance guarantees. As depicted in Fig. 5, these lists - visualized as trees and formulated as logical propositions below - can predict AMR in M. tuberculosis and K. pneumoniae with high accuracy using just a small handful of indicator functions over the space of all k-mers. Though their experiments focus on AMR, the method can be applied more generally. Anguita-Ruiz et al. (2020) use a sequential rule mining procedure to uncover gene expression patterns in obese subjects from longitudinal DNA microarray data. Garvin et al. (2020) combined iterative random forests (Cliff et al., 2019), a method for gene regulatory network inference, with random intersection trees (Shah & Meinshausen, 2014), which detect stable interactions in tree-based ensembles, to discover potentially adaptive SARS-CoV-2 mutations.

§5 Discussion

Despite all the recent progress in iML, the field is still struggling with several challenges that are especially important in genomics. I highlight three in particular: ambiguous targets, error rate control, and variable granularity.

§5.1 Ambiguous Targets

I have done my best above to be clear about the distinction between two tasks for which iML is often used: to better explain or understand (a) some fitted model f, or (b) some natural system that f models. It is not always obvious which goal researchers have in mind, yet model- and system-level analyses require entirely different tools and assumptions. Whereas a supervised learning algorithm does not generally distinguish between correlation and causation, the difference is crucial in nature. Clouds predict rain and rain predicts clouds, but the causal arrow runs in only one direction. Genomic researchers face a fundamental ambiguity when seeking to explain, say, why Alice received her unexpected algorithmic diagnosis. Is the goal to explain why the model made the prediction it did, independent of the ground truth? Or, alternatively, is the goal to understand what biological conditions led to the diagnosis? The former, which I will call a model-level explanation, may be preferable in cases of auditing or validation, where the analyst seeks merely to understand what the algorithm has learned, without any further restrictions. In this case, we do not necessarily assume that the model is correct. The latter, which I will call a system-level explanation, is more useful in cases of discovery and/or planning, where real-world mechanisms cannot be ignored. In such instances, we (tentatively) presume that the prediction in question is accurate, at least to a first approximation. Model-level explanations are generally easier to compute, since features can be independently perturbed one at a time. This is the default setting for popular iML tools such as LIME and SHAP. System-level explanations, by contrast, require some structural assumptions about dependencies between variables.
Such assumptions may be difficult or even impossible to test, raising legitimate questions about identifiability and underdetermination. Yet, as Pearl (2000) has long argued, there is value in articulating one's assumptions clearly, opening them up for scrutiny and debate instead of burying them behind defaults. The last year has seen a burst of new papers on causally-aware iML tools (Heskes et al., 2020; Wang et al., 2021), indicating that researchers in computational statistics are increasingly sensitive to the distinction between model- and system-level analyses. Genomic practitioners should avail themselves of both explanatory modes, but always make sure the selected tool matches the stated aim. Addressing this challenge is difficult at both a conceptual level, because the distinction between model- and system-level analyses may not be immediately obvious to practitioners, and at a technical level, because causal approaches can require careful covariate adjustments and data reweighting. The sooner these issues are addressed head on, the more fruitful the results will be.

§5.2 Error Rate Control

Another open challenge in iML concerns bounding error and quantifying uncertainty. Bioinformaticians are no strangers to p-values, which are typically fixed at low levels to control false positive rates in GWAS (Panagiotou & Ioannidis, 2012), or else adjusted to control familywise error rates (Holm, 1979) or FDR (Benjamini & Hochberg, 1995) in other omic settings. Bayesians have their own set of inferential procedures for multiple testing scenarios (Gelman et al., 2012; Scott & Berger, 2010), although there is some notable convergence with frequentism on the subject of q-values (Storey, 2003). In any event, the error-statistical logic that guides testing in computational biology is largely absent from contemporary iML. This may be partially a result of cultural factors. As Breiman (2001b) observed some 20 years ago, there are two main cultures of statistical modeling - one focused on predicting outcomes, the other on inferring parameter values. Authors in contemporary iML, which grew almost exclusively out of the former camp, are generally less worried about error rates than their colleagues in the latter camp. Several critics have pointed out that post-hoc methods do not generally provide standard errors for their estimates or goodness of fit measures for their approximations (Ribeiro et al., 2018; Wachter et al., 2018). Indeed, it would be difficult to do so without some nonparametric resampling procedure such as the bootstrap (Davison & Hinkley, 1997), which would add considerable computational burden as the number of samples and/or features grows. It is not clear that such methods are even applicable in these settings, however, given the instability of bootstrap estimators in high dimensions (Karoui & Purdom, 2018). The knockoff procedures discussed in Sect. 4 are a notable exception, offering finite-sample control of the false discovery rate for variable selection. A handful of other iML methods make at least a nominal effort to quantify uncertainty (Gimenez & Zou, 2019; Ribeiro et al., 2018; Schwab & Karlen, 2019). Yet these examples are perhaps most notable for their scarcity. To gain more widespread acceptance in genomics - and the sciences more generally - iML algorithms will need to elevate rigorous testing procedures from an occasional novelty to a core requirement. Generic methods for doing so in high dimensions raise complex statistical challenges that remain unresolved at present.
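To illustrate what even naive uncertainty quantification for a post-hoc explanation might look like, here is a minimal sketch that bootstraps a permutation-importance estimate to obtain percentile intervals, continuing the earlier hypothetical snippets. As noted above, the cost and validity of such resampling in high dimensions are open questions, so this is a toy illustration rather than a recommended procedure.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def bootstrap_importance(model, X, y, feature, n_boot=200, seed=0):
    """Naive percentile interval for one feature's permutation importance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        result = permutation_importance(
            model, X[idx], y[idx], n_repeats=5, random_state=0
        )
        estimates.append(result.importances_mean[feature])
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return np.mean(estimates), (lo, hi)

est, ci = bootstrap_importance(black_box, X_test, y_test, feature=2)
print(f"permutation importance of gene 2: {est:.3f}, 95% interval {ci}")
```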
§5.3 Variable Granularity

A final challenge I will highlight concerns variable granularity. This is not a major issue in the low- or moderate-dimensional settings for which most iML tools are designed. But it quickly becomes important as covariates increase, especially when natural feature groupings are either known a priori or directly estimable from the data. For instance, it is well-established that genes do not operate in isolation, but rather work together in co-regulated pathways. Thus, even when a classifier uses gene-level RNA-seq data as input features, researchers may want to investigate the prognostic value of pathways to test or develop new hypotheses. In multi-omic models, where features typically represent a range of biological processes, each measured using different platforms, analysts may want to know not just which variables are most predictive overall, but which biomarkers are strongest within a given class. Interactions across subsystems may also be of particular interest. Few methods in use today allow users to query a target model at varying degrees of resolution like this, but such flexibility would be a major asset in systems biology.

Once again, some exceptions are worth noting. Sesia et al. (2020) introduce a knockoff method for localizing causal variants at different resolutions using well-established models of linkage disequilibrium. This amounts to a global post-hoc feature attribution method for whole genome sequencing data. The leave-out-covariates (LOCO) statistic (Lei et al., 2018; Rinaldo et al., 2019) can be used to quantify the global or local importance of arbitrary feature subsets, but only at the cost of extensive model refitting. Groupwise Shapley values have been formally described (Conitzer & Sandholm, 2004) but are not widely used in practice. Resolving the granularity problem will help iML tools scale better in high-dimensional settings, with major implications for genomics. The problem is complicated, however, by the fact that hierarchical information regarding biomolecular function is not always available. Automated methods for discovering such hierarchies are prone to error, while data-driven dimensionality reduction techniques - e.g., the latent embeddings learned by a deep neural network - can be difficult or impossible to interpret. Promising directions of research in this area include causal coarsening techniques (Beckers et al., 2019; Chalupka et al., 2017) and disentangled representation learning (Locatello et al., 2019; Schölkopf et al., 2021).
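One simple way to query a model at a coarser resolution is to permute an entire feature group (e.g., a pathway's genes) jointly and measure the resulting drop in performance. The sketch below illustrates this with hypothetical gene groupings on the simulated data from the earlier snippets; it is a naive stand-in for the more principled group-level methods cited above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def group_permutation_importance(model, X, y, groups, n_repeats=20, seed=0):
    """Drop in accuracy when all features in a group are permuted together.

    `groups` maps a group name (e.g., a pathway) to a list of column indices.
    Shuffling the whole block as a unit preserves within-group correlations
    while breaking the group's association with the outcome.
    """
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    scores = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            X_perm[:, cols] = X[perm][:, cols]     # shuffle the block jointly
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        scores[name] = float(np.mean(drops))
    return scores

# Hypothetical "pathways": disjoint blocks of gene indices.
pathways = {"pathway_A": [0, 1, 2, 3], "pathway_B": [4, 5, 6, 7]}
print(group_permutation_importance(black_box, X_test, y_test, pathways))
```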
§6 Conclusion

The pace of advances in genomics and ML can make it easy to forget that both disciplines are relatively young. The subfield of iML is even younger, with the vast majority of work published in just the last three to five years. The achievements to date at the intersection of these research programs are numerous and varied. Feature attributions and rule lists have already revealed novel insights in several genomic studies. Example-based methods have not yet seen similar uptake, but that will likely change with better generative models. As datasets grow larger, computers become faster, and theoretical refinements continue to accumulate in statistics and biology, iML will become an increasingly integral part of the genomics toolkit. I have argued that better algorithmic explanations can serve researchers in at least three distinct ways: by auditing models for potential bias, validating performance before and throughout deployment, and revealing novel mechanisms for further exploration. I provided a simple typology for iML and reviewed several popular methodologies with a focus on genomic applications.

Despite considerable progress and rapidly expanding research interest, iML still faces a number of important conceptual and technical challenges. I highlighted three with particular significance for genomics: ambiguous targets, limited error rate control, and inflexible feature resolution. These obstacles can be addressed by iML solutions in a post-hoc or intrinsic manner, with model-agnostic or model-specific approaches, via global or local explanations. All types of iML require further development, especially as research in supervised learning and genomics continues to evolve. Ideally, iML would become integrated into standard research practice, part of hypothesis generation and testing as well as model training and deployment. As the examples from Sect. 4 illustrate, this vision is already on its way to becoming a reality.

The future of iML for genomics is bright. The last few years alone have seen a rapid proliferation of doctoral dissertations on the topic - e.g., Greenside (2018); Danaee (2019); Nikumbh (2019); Kavvas (2020); Ploenzke (2020); and Shrikumar (2020) - suggesting that early career academics in particular are being drawn to this highly interdisciplinary area of research. Existing work has been promising, though not without its challenges. As the field continues to gather more data, resources, and brainpower, there is every reason to believe the best is yet to come.

References

Explaining individual predictions when features are dependent: More accurate approximations to Shapley values
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
Permutation importance: A corrected feature importance measure
Learning Certifiably Optimal Rule Lists for Categorical Data
eXplainable Artificial Intelligence (XAI) for the identification of biologically relevant gene expression patterns in longitudinal human studies, insights from obesity research
Opening the Black Box: Interpretable Machine Learning for Geneticists
On pixelwise explanations for non-linear classifier decisions by layer-wise relevance propagation
Controlling the false discovery rate via knockoffs
Causal inference and the data-fusion problem
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion
Approximate Causal Abstraction
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Explainable Machine Learning in Deployment
Random Forests. Machine Learning
Statistical Modeling: The Two Cultures
Classification and Regression
Causal feature learning: an overview
This Looks Like That: Deep Learning for Interpretable Image Recognition
An Optimization Approach to Learning Falling Rule Lists
Random forests for genomic data analysis
Interpretable classification of bacterial Raman spectra with knockoff wavelets
A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
Computing Shapley Values, Manipulating Value Division Schemes, and Checking Core Membership in Multi-Issue Domains
Interpretable machine learning: Applications in biology and genomics. Doctoral dissertation
Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey
Bootstrap Methods and their Application
Interpretable genotype-to-phenotype classifiers with performance guarantees
Slave to the algorithm? Why a "right to explanation" is probably not the remedy you are looking for. Duke Law and Technology Review
Structural Racism, Social Risk Factors, and Covid-19 - A Dangerous Convergence for Black Americans
Deep learning: new computational modelling techniques for genomics
Dermatologist-level classification of skin cancer with deep neural networks
All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously
Greedy Function Approximation: A Gradient Boosting Machine
Predictive Learning via Rule Ensembles
Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal and explainable AI models
Why We (Usually) Don't Have to Worry About Multiple Comparisons
Explaining Explanations: An Overview of Interpretability of Machine Learning
Discovering Conditionally Salient Features with Statistical Guarantees
Interpretable machine learning methods for regulatory and disease genomics. Doctoral dissertation
Local Rule-Based Explanations of Black Box Decision Systems
A Survey of Methods for Explaining Black Box Models
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Invariant Causal Prediction for Nonlinear Models
Causal Shapley Values: Exploiting Causal Knowledge to Explain Individual Predictions of Complex Models
A Simple Sequentially Rejective Multiple Test Procedure
Causability and explainability of artificial intelligence in medicine
Optimal Sparse Decision Trees
Data-adaptive target parameters. Targeted Learning in Data Science
Constructing optimal binary decision trees is NP-complete
Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction
A survey of algorithmic recourse: definitions, formulations, solutions, and prospects
Algorithmic recourse under imperfect causal knowledge: a probabilistic approach
Can We Trust the Bootstrap in High-dimensions? The Case of Linear Models
Biologically-interpretable machine learning for microbial genomics. Doctoral dissertation
A biochemically-interpretable machine learning classifier for microbial GWAS
An Evaluation of the Human-Interpretability of Explanation
Faithful and Customizable Explanations of Black Box Models
voom: precision weights unlock linear model analysis tools for RNA-seq read counts
Distribution-Free Predictive Inference for Regression
Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model
MethylNet: an automated and modular deep learning approach for DNA methylation analysis
Searching for consistent associations with a multi-environment knockoff filter
Explainable AI: A Review of
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
From local explanations to global understanding with explainable AI for trees
A Unified Approach to Interpreting Model Predictions
Clinical use of current polygenic risk scores may exacerbate health disparities
Deep learning in radiology: An overview of the concepts and a survey of the state of the art with focus on MRI
International evaluation of an AI system for breast cancer screening
The Explanation Game: Explaining Machine Learning Models Using Shapley Values
Basal-like breast cancer: molecular profiles, clinical features and survival outcomes
A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems
Interpretable Machine Learning: A Guide for Making Black Box Models Interpretable
Explaining nonlinear classification decisions with deep Taylor decomposition
Explanation in Human-AI Systems: A Literature Meta-Review
Definitions, methods, and applications in interpretable machine learning
Tree ensembles with rule structured horseshoe regularization
The revival of the Gini importance?
Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci
Interpretable machine learning methods for prediction and analysis of genome regulation in 3D. Doctoral dissertation
Dissecting racial bias in an algorithm used to manage the health of populations
What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations
Causality: Models, Reasoning, and Inference
External Validity: From Do-Calculus to
Causal inference by using invariant prediction: identification and confidence intervals
The Elements of Causal Inference: Foundations and Learning Algorithms
Invariant Causal Prediction for Sequential Data
Interpretable machine learning methods with applications in genomics. Doctoral dissertation
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Anchors: High-Precision Model-Agnostic Explanations
Bootstrapping and sample splitting for high-dimensional, assumption-lean inference
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers
Toward Causal Representation Learning
Kernel Methods in Computational Biology
CXPlain: Causal Explanations for Model Interpretation under Uncertainty
Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem
Meaningful information and the right to explanation
Random Intersection Trees
A Value for n-Person Games
Interpretable machine learning for scientific discovery in regulatory genomics. Doctoral dissertation
Learning Important Features Through Propagating Activation Differences
Linear models and empirical Bayes methods for assessing differential expression in microarray experiments
POIMs: positional oligomer importance matrices - understanding support vector machine-based signal detectors
Repeated observation of breast tumor subtypes in independent gene expression data sets
A Deep Learning Approach to Antibiotic Discovery
The positive false discovery rate: a Bayesian interpretation and the q-value
Conditional Variable Importance for Random Forests
The many Shapley values for model explanation
Axiomatic Attribution for Deep Networks
Interpretation of deep learning in genomics and epigenomics
Deep learning decodes the principles of differential gene expression
Regression Shrinkage and Selection via the Lasso
Automated Biomarker Candidate Discovery in Imaging Mass Spectrometry Data Through Spatially Localized Shapley Additive Explanations
High-performance medicine: the convergence of human and artificial intelligence
Basal-like breast cancer and the BRCA1 phenotype
The Nature of Statistical Learning Theory
SVM2Motif - Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor
ML2Motif - Reliable extraction of discriminative sequence motifs from learning machines
Explainable Artificial Intelligence: a Systematic Review
Why a right to explanation of automated decision-making does not exist in the general data protection regulation
Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR
Evaluation of the lasso and the elastic net in genome-wide association studies
Clinical applications of machine learning algorithms: beyond the black box
The explanation game: a formal framework for interpretable machine learning
Nonparametric variable importance assessment using machine learning techniques
Scientific Explanation