title: Computational Statistics and Data Science in the Twenty-first Century
authors: Holbrook, Andrew J.; Nishimura, Akihiko; Ji, Xiang; Suchard, Marc A.
date: 2022-04-12

Data science has arrived, and computational statistics is its engine. As the scale and complexity of scientific and industrial data grow, the discipline of computational statistics assumes an increasingly central role among the statistical sciences. An explosion in the range of real-world applications means the development of more and more specialized computational methods, but five Core Challenges remain. We provide a high-level introduction to computational statistics by focusing on its central challenges, present recent model-specific advances and preach the ever-increasing role of non-sequential computational paradigms such as multi-core, many-core and quantum computing. Data science is bringing major changes to computational statistics, and these changes will shape the trajectory of the discipline in the 21st century.

We are in the midst of the data science revolution. In October 2012, the Harvard Business Review famously declared data scientist the sexiest job of the 21st century (Davenport and Patil, 2012). By September 2019, Google searches for the term 'data science' had multiplied over 7-fold (Google Trends, 2020), one multiplicative increase for each intervening year. In the U.S. between the years 2000 and 2018, the number of bachelor's degrees awarded in either statistics or biostatistics increased over 10-fold (382 to 3,964), and the number of doctoral degrees almost tripled (249 to 688) (American Statistical Association, 2020). In 2020, seemingly every major university has established or is establishing its own data science institute, center or initiative.

Data science (Cleveland, 2001; Donoho, 2017) combines multiple pre-existing disciplines (e.g., statistics, machine learning, computer science) with a redirected focus on creating, understanding and systematizing workflows that turn real-world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed-off and esoteric. Data science's emphasis on practical application only enhances the importance of computational statistics, the interface between statistics and computer science primarily concerned with the development of algorithms producing either statistical inference or predictions. Since both of these products comprise essential tasks in any data scientific workflow, we believe that the pan-disciplinary nature of data science only increases the number of opportunities for computational statistics to evolve by taking on new applications and serving the needs of new groups of researchers.

This is the natural role for a discipline that has increased the breadth of statistical application from the beginning. First put forward by R.A. Fisher in 1936 (Fisher, 1960), the permutation test allows the scientist (who owns a computer) to test hypotheses about a broader swath of functionals of a target population while making fewer statistical assumptions (Wald and Wolfowitz, 1944).
With a computer, the scientist uses the bootstrap (Efron, 1992; Efron and Tibshirani, 1994) to obtain confidence intervals for population functionals and parameters of models too complex for analytic methods. Newton-Raphson optimization and the Fisher scoring algorithm facilitate linear regression for binary, count and categorical outcomes (Bliss, 1935; McCullagh and Nelder, 1989). More recently, Markov chain Monte Carlo (MCMC) (Tierney, 1994; Brooks et al., 2011) has made Bayesian inference practical for massive, hierarchical and highly structured models that are useful for the analysis of a significantly wider range of scientific phenomena.

While computational statistics has broadened the reach of statistical application throughout its history, certain central difficulties exist and will remain for the rest of the 21st century. In Section 2, we present the first class of Core Challenges, or challenges that are easily quantifiable for generic tasks. Core Challenge 1 is Big N, or statistical inference when the number 'N' of observations or data points is large. Core Challenge 2 is Big P, or statistical inference when the model parameter count 'P' is large. And Core Challenge 3 is Big M, or statistical inference when the model's objective or density function is multimodal (having many modes 'M'). When large, each of these quantities brings its own unique computational difficulty. Since well over 2.5 exabytes (or 2.5 × 10^18 bytes) of data come into existence each day (Chavan et al., 2014), we are confident Core Challenge 1 will survive well into the 22nd century. But Core Challenges 2 and 3 will also endure: data complexity often increases with size, and researchers strive to understand increasingly complex phenomena. Because many examples of big data become 'big' by combining heterogeneous sources, big data often necessitate big models.

With the help of two recent examples, Section 3 illustrates how computational statisticians make headway at the intersection of big data and big models with model-specific advances. In Section 3.1, we present recent work in Bayesian inference for big N, big P regression. Beyond the simplified regression setting, data often come with structures (e.g., spatial, temporal, network), and correct inference must take these structures into account. For this reason, we present novel computational methods for a highly structured and hierarchical model for the analysis of multi-structured, epidemiological data in Section 3.2.

The growth of model complexity leads to new inferential challenges. Whereas we define Core Challenges 1-3 in terms of generic target distributions or objective functions, Core Challenge 4 arises from inherent difficulties in treating complex models generically. Core Challenge 4 (Section 4.1) describes the difficulties and trade-offs that must be overcome to create fast, flexible and friendly 'algo-ware'. This Core Challenge requires the development of statistical algorithms that maintain efficiency despite model structure and, thus, apply to a wider swath of target distributions or objective functions 'out-of-the-box'. Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand.
But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central and graphics processing units (CPU and GPU) and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.

2 Core Challenges 1-3

Before providing two recent examples of 21st century computational statistics (Section 3), we present three easily quantified Core Challenges within computational statistics that we believe will always exist: big N, or inference from many observations; big P, or inference with high-dimensional models; and big M, or inference with non-convex objective functions or multimodal density functions. In 21st century computational statistics, these challenges often co-occur, but we consider them separately in this section.

Having a large number of observations makes different computational methods difficult in different ways. A worst case scenario, the exact permutation test requires the production of $N!$ datasets. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets (Wald and Wolfowitz, 1944; Efron and Tibshirani, 1994). When, say, population means are of interest, each Monte Carlo iteration requires summations involving $N$ expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression (Williams and Rasmussen, 1996; Rasmussen and Williams, 2006); it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale as $O(N^3)$. As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression. To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck.

We are interested in performing Bayesian inference (Gelman et al., 2013) based on a large vector of observations $x = (x_1, \dots, x_N)$. We specify our model for the data with a likelihood function $\pi(x|\theta) = \prod_{n=1}^{N} \pi(x_n|\theta)$ and use a prior distribution with density function $\pi(\theta)$ to characterize our belief about the value of the $P$-dimensional parameter vector $\theta$ a priori. The target of Bayesian inference is the posterior distribution of $\theta$ conditioned on $x$:
$$\pi(\theta|x) = \pi(x|\theta)\,\pi(\theta) \Big/ \int \pi(x|\theta)\,\pi(\theta)\,\mathrm{d}\theta . \qquad (1)$$
The denominator's multi-dimensional integral quickly becomes impractical as $P$ grows large, so we choose to use the Metropolis-Hastings (M-H) algorithm to generate a Markov chain with stationary distribution $\pi(\theta|x)$ (Metropolis et al., 1953; Hastings, 1970; Tierney, 1994). We begin at an arbitrary position $\theta^{(0)}$ and, for each iteration $s = 0, \dots, S$, randomly generate the proposal state $\theta^*$ from the transition distribution with density $q(\theta^*|\theta^{(s)})$. We then accept proposal state $\theta^*$ with probability
$$a = \min\left(1,\ \frac{\pi(\theta^*|x)\, q(\theta^{(s)}|\theta^*)}{\pi(\theta^{(s)}|x)\, q(\theta^*|\theta^{(s)})}\right) . \qquad (2)$$
The ratio on the right no longer depends on the denominator in Formula (1), but one must still compute the likelihood and its $N$ terms $\pi(x_n|\theta^*)$. It is for this reason that likelihood evaluations are often the computational bottleneck for Bayesian inference.
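As a concrete, minimal sketch of the M-H accept step in Equation (2), the following code runs a random-walk sampler whose symmetric Gaussian proposal cancels the q-ratio; the standard-normal placeholder target, the function name log_post and all tuning values are illustrative assumptions rather than details drawn from the text, and in a real application each call to log_post would sum the N likelihood terms discussed above.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_post(theta):
        # Placeholder log-posterior: a standard normal in each dimension.
        return -0.5 * np.sum(theta ** 2)

    def metropolis_hastings(theta0, n_iter=5000, step=0.5):
        """Random-walk M-H; the symmetric proposal makes the q-ratio equal to one."""
        theta = np.asarray(theta0, dtype=float)
        samples = np.empty((n_iter, theta.size))
        lp = log_post(theta)
        for s in range(n_iter):
            proposal = theta + step * rng.standard_normal(theta.size)
            lp_prop = log_post(proposal)
            # Accept with probability min(1, pi(proposal)/pi(current)), on the log scale.
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = proposal, lp_prop
            samples[s] = theta
        return samples

    draws = metropolis_hastings(np.zeros(2))
    print(draws.mean(axis=0), draws.std(axis=0))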
In the best case, these likelihood evaluations are $O(N)$, but there are many situations in which they scale as $O(N^2)$ (Holbrook et al., 2020a,b) or worse. Indeed, when $P$ is large, it is often advantageous to use more advanced MCMC algorithms that use the gradient of the log-posterior to generate better proposals. In this situation, the log-likelihood gradient may also become a computational bottleneck (Holbrook et al., 2020a).

One of the simplest models for big P problems is ridge regression (Seber and Lee, 2012), but computing can become expensive even in this classical setting. Ridge regression estimates the coefficient vector $\theta$ by minimizing the distance between the observed and predicted values $y$ and $X\theta$ along with a weighted square norm of $\theta$:
$$\hat{\theta} = \operatorname{argmin}_{\theta}\ \left\{ \|y - X\theta\|_2^2 + \|\Phi^{1/2}\theta\|_2^2 \right\} = (X^\top X + \Phi)^{-1} X^\top y .$$
For illustrative purposes, we consider the following direct method for computing $\hat{\theta}$. We can first multiply the $N \times P$ design matrix $X$ by its transpose at the cost of $O(NP^2)$ and subsequently invert the $P \times P$ matrix $(X^\top X + \Phi)$ at the cost of $O(P^3)$. The total $O(NP^2 + P^3)$ complexity shows that (1) a large number of parameters is often sufficient for making even the simplest of tasks infeasible and (2) a moderate number of parameters can render a task impractical when there are a large number of observations. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder (1989).

In the context of Bayesian inference, the length $P$ of the vector $\theta$ dictates the dimension of the MCMC state space. For the M-H algorithm (Section 2.1) with $P$-dimensional Gaussian target and proposal, Gelman et al. (1996) show that the proposal distribution's covariance should be scaled by a factor inversely proportional to $P$. Hence, as the dimension of the state space grows, it behooves one to propose states $\theta^*$ that are closer to the current state of the Markov chain, and one must greatly increase the number $S$ of MCMC iterations. At the same time, an increasing $P$ often slows down rate-limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models (Van Dyk and Meng, 2001) (Sections 3.1 and 3.2), for which each observation has its own parameter set (e.g., $P \propto N$), means that M-H simply does not work for a huge class of models popular with practitioners.

For these reasons, Hamiltonian Monte Carlo (HMC) (Neal, 2011) has become a popular algorithm for fitting Bayesian models with large numbers of parameters. Like M-H, HMC uses an accept step (Equation (2)). Unlike M-H, HMC takes advantage of additional information about the target distribution in the form of the log-posterior gradient. HMC works by doubling the state space dimension with an auxiliary Gaussian 'momentum' variable $p \sim \mathrm{Normal}_P(0, M)$, independent of the 'position' variable $\theta$. The constructed Hamiltonian system has energy function given by the negative logarithm of the joint distribution, and we produce proposals by simulating the system according to Hamilton's equations
$$\dot{\theta} = \frac{\partial H}{\partial p} = M^{-1} p , \qquad \dot{p} = -\frac{\partial H}{\partial \theta} = \nabla_\theta \log \pi(\theta|x) .$$
Thus, the momentum of the system moves in the direction of steepest ascent for the log-posterior, forming an analogy with first-order optimization. The cost is repeated gradient evaluations that may comprise a new computational bottleneck, but the result is effective MCMC for tens of thousands of parameters (Holbrook et al., 2017, 2020a).
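To make the role of the log-posterior gradient concrete, here is a minimal leapfrog-based HMC sketch under an identity mass matrix; the standard-normal placeholder target, the step size, the trajectory length and all variable names are illustrative assumptions, not settings from the text.

    import numpy as np

    rng = np.random.default_rng(2)

    def log_post(theta):
        return -0.5 * np.sum(theta ** 2)          # placeholder log-posterior

    def grad_log_post(theta):
        return -theta                              # its gradient

    def hmc_step(theta, step=0.1, n_leapfrog=20):
        """One HMC transition with identity mass matrix M = I."""
        p = rng.standard_normal(theta.size)        # momentum refreshment
        theta_new, p_new = theta.copy(), p.copy()
        # Leapfrog integration of Hamilton's equations.
        p_new += 0.5 * step * grad_log_post(theta_new)
        for _ in range(n_leapfrog - 1):
            theta_new += step * p_new
            p_new += step * grad_log_post(theta_new)
        theta_new += step * p_new
        p_new += 0.5 * step * grad_log_post(theta_new)
        # Metropolis correction based on the change in total energy.
        current_H = -log_post(theta) + 0.5 * p @ p
        proposed_H = -log_post(theta_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < current_H - proposed_H:
            return theta_new
        return theta

    theta = np.zeros(10)
    samples = []
    for _ in range(2000):
        theta = hmc_step(theta)
        samples.append(theta)
    samples = np.array(samples)
    print(samples.mean(axis=0).round(2))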
The success of HMC has inspired research into other methods that leverage gradient information to generate better MCMC proposals when $P$ is large (Bouchard-Côté et al., 2018).

Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP-complete in general (Murty and Kabadi, 1985), meaning, in layman's terms, that it is impossibly hard. In the absence of a tractable theory, by which one might prove one's global optimization procedure works, brute-force grid and random searches and heuristic methods such as particle swarm optimization (Kennedy and Eberhart, 1995) and genetic algorithms (Davis, 1991) have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well-behaved class of convex functions (Hunter and Lange, 2004; Boyd et al., 2004), which do not admit multiple local minima. Since Fisher introduced his 'maximum likelihood' in 1922 (Fisher, 1922), statisticians have thought in terms of maximization, but convexity theory still applies by a trivial negation of the objective function. Nonetheless, most statisticians safely ignored concavity during the 20th century: exponential family log-likelihoods are log-concave, so Newton-Raphson and Fisher scoring are guaranteed optimality in the context of GLMs (Boyd et al., 2004; McCullagh and Nelder, 1989).

Nearing the end of the 20th century, multimodality and non-convexity became more important for statisticians considering high-dimensional regression, i.e., regression with many covariates (big P). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector $\hat{\theta}$ by performing best subset selection (Beale et al., 1967; Hocking and Leslie, 1967):
$$\hat{\theta} = \operatorname{argmin}_{\theta}\ \|y - X\theta\|_2^2 \quad \text{subject to} \quad \|\theta\|_0 \le k , \qquad (3)$$
where $0 < k \ll P$, and $\|\cdot\|_0$ denotes the $\ell_0$-norm, i.e., the number of non-zero elements. Because best subset selection requires an immensely difficult non-convex optimization, Tibshirani (1996) famously replaces the $\ell_0$-norm with the $\ell_1$-norm, thereby providing sparsity while nonetheless maintaining convexity.

Historically, Bayesians have paid much less attention to convexity than have optimization researchers. This is most likely because the basic theory (Tierney, 1994) of MCMC does not require such restrictions: even if a target distribution has one million modes, a well-constructed Markov chain explores them all in the limit. Despite these theoretical guarantees, a small literature has developed to tackle multimodal Bayesian inference (Geyer, 1991; Tjelmeland and Hegstad, 2001; Lan et al., 2014; Nishimura and Dunson, 2016) because multimodal target distributions do present a challenge in practice. In analogy with Equation (3), Bayesians seek to induce sparsity by specifying priors such as the spike-and-slab (Mitchell and Beauchamp, 1988; Madigan and Raftery, 1994; George and McCulloch, 1997), which mixes a point mass ('spike') at $\theta_p = 0$ with a diffuse ('slab') component for each of the $P$ coefficients. As with the best subset selection objective function, the spike-and-slab target distribution becomes heavily multimodal as $P$ grows and the support of the binary inclusion-indicator vector $\Gamma$'s discrete distribution grows to $2^P$ potential configurations.
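As a brief aside before turning to model-specific advances, the sketch below illustrates the convex $\ell_1$ relaxation (Tibshirani, 1996) mentioned above, fitting lasso solutions of increasing sparsity with scikit-learn; the simulated data, the chosen penalty values and all names are arbitrary assumptions for demonstration only.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)

    # Simulate a sparse ground truth: only 5 of P = 100 coefficients are non-zero.
    N, P = 200, 100
    X = rng.standard_normal((N, P))
    theta_true = np.zeros(P)
    theta_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
    y = X @ theta_true + rng.standard_normal(N)

    # The l1 penalty yields sparse solutions while keeping the optimization convex.
    for alpha in (0.01, 0.1, 0.5):
        fit = Lasso(alpha=alpha).fit(X, y)
        print(f"alpha={alpha}: {np.sum(fit.coef_ != 0)} non-zero coefficients")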
In the following section, we present an alternative Bayesian sparse regression approach that mitigates the combinatorial problem along with a state-of-the-art computational technique that scales well both in $N$ and $P$. These challenges will remain throughout the 21st century, but it is possible to make significant advances for specific statistical tasks or classes of models. Section 3.1 considers Bayesian sparse regression based on continuous shrinkage priors, designed to alleviate the heavy multimodality (big M) of the more traditional spike-and-slab approach. This model presents a major computational challenge as $N$ and $P$ grow, but a recent computational advance makes the posterior inference feasible for many modern large-scale applications. And because of the rise of data science, there are increasing opportunities for computational statistics to grow by enabling and extending statistical inference for scientific applications previously outside of mainstream statistics. Here, the science may dictate the development of structured models with complexity possibly growing in $N$ and $P$. Section 3.2 presents a method for fast phylogenetic inference, where the primary structure of interest is a 'family tree' describing a biological evolutionary history.

3.1 Bayesian sparse regression in the age of big N and big P

With the goal of identifying a small subset of relevant features among a large number of potential candidates, sparse regression techniques have long featured in a range of statistical and data science applications (Hastie et al., 2015). Traditionally, such techniques were commonly applied in the "$N \gg P$" setting, and computational algorithms correspondingly focused on this situation (Friedman et al., 2010), especially within the Bayesian literature (Bhattacharya et al., 2016). Due to a growing number of initiatives for large-scale data collection and new types of scientific inquiry made possible by emerging technologies, however, datasets that are "big N" and "big P" at the same time are increasingly common. For example, modern observational studies using healthcare databases routinely involve $N \approx 10^5 \sim 10^6$ patients and $P \approx 10^4 \sim 10^5$ clinical covariates. The UK Biobank provides brain imaging data on $N = 100{,}000$ patients, with $P = 100 \sim 200{,}000$ depending on the scientific question of interest (Passos et al., 2019). Single cell RNA sequencing can generate datasets with $N$ (the number of cells) in the millions and $P$ (the number of genes) in the tens of thousands, with the trend indicating further growth in data size to come (Svensson et al., 2019).

Continuous shrinkage: alleviating big M

Bayesian sparse regression, despite its desirable theoretical properties and flexibility to serve as a building block for richer statistical models, has always been relatively computationally intensive, even before the advent of "big N and big P" data (George and McCulloch, 1997; Nott and Kohn, 2005; Ghosh and Clyde, 2011). A major source of its computational burden is the severe posterior multimodality (big M) induced by the discrete binary nature of spike-and-slab priors (Section 2.3). The class of global-local continuous shrinkage priors is a more recent alternative that shrinks the $\theta_p$'s in a more continuous manner, thereby alleviating (if not eliminating) the multimodality issue (Carvalho et al., 2010; Polson and Scott, 2010). This class of prior is represented as a scale-mixture of Gaussians:
$$\theta_p \mid \lambda_p, \tau \sim \mathrm{Normal}(0, \tau^2 \lambda_p^2), \qquad \lambda_p \sim \pi_{\mathrm{local}}(\cdot), \qquad \tau \sim \pi_{\mathrm{global}}(\cdot) .$$
The idea is that a small global scale parameter $\tau \ll 1$ would shrink most $\theta_p$'s towards zero, while the local scale $\lambda_p$'s, with their heavy-tailed prior $\pi_{\mathrm{local}}(\cdot)$, allow a small number of $\tau\lambda_p$'s, and hence $\theta_p$'s, to be estimated away from zero. While motivated by two different conceptual frameworks, the spike-and-slab can be viewed as a subset of global-local priors in which $\pi_{\mathrm{local}}(\cdot)$ is chosen as a mixture of delta masses placed at $\lambda_p = 0$ and $\lambda_p = \sigma/\tau$.
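To visualize how such a scale-mixture prior behaves, the following sketch draws coefficients with a half-Cauchy local prior (the horseshoe choice of Carvalho et al., 2010); the particular values of tau and P are arbitrary assumptions, and the code is only a simulation from the prior, not a posterior computation.

    import numpy as np

    rng = np.random.default_rng(4)

    P = 10_000
    tau = 0.01                                   # small global scale: aggressive shrinkage
    lam = np.abs(rng.standard_cauchy(P))         # heavy-tailed half-Cauchy local scales
    theta = rng.normal(0.0, tau * lam)           # theta_p | lam_p, tau ~ Normal(0, tau^2 lam_p^2)

    # Most draws hug zero, but the heavy tails let a few coefficients escape shrinkage.
    print("proportion with |theta_p| < 0.01:", np.mean(np.abs(theta) < 0.01).round(3))
    print("largest |theta_p| values:", np.sort(np.abs(theta))[-3:].round(2))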
Continuous shrinkage mitigates the multimodality of spike-and-slab by smoothly bridging small and large values of $\lambda_p$. On the other hand, use of continuous shrinkage priors does not address the increasing computational burden from growing $N$ and $P$ in modern applications. Sparse regression posteriors under global-local priors are amenable to an effective Gibbs sampler, a popular class of MCMC we describe further in Section 4.1. Under the linear and logistic models, the computational bottleneck of this Gibbs sampler stems from the need for repeated updates of $\theta$ from its conditional distribution
$$\theta \mid \tau, \lambda, \Omega, y, X \sim \mathrm{Normal}_P\!\left(\Phi^{-1} X^\top \Omega y,\ \Phi^{-1}\right), \quad \text{where} \quad \Phi = X^\top \Omega X + \tau^{-2}\Lambda^{-2} , \qquad (4)$$
$\Omega$ is an additional diagonal-matrix parameter and $\Lambda = \mathrm{diag}(\lambda)$. Sampling from this high-dimensional Gaussian distribution requires $O(NP^2 + P^3)$ operations with the standard approach (Rue and Held, 2005): $O(NP^2)$ for computing the term $X^\top \Omega X$ and $O(P^3)$ for Cholesky factorization of $\Phi$. While an alternative approach by Bhattacharya et al. (2016) provides a complexity of $O(N^2 P + N^3)$, the computational cost remains problematic in the "big N and big P" regime at $O(\min\{N^2 P, NP^2\})$ after choosing the faster of the two.

The conjugate gradient (CG) sampler of Nishimura and Suchard (2018), combined with their prior-preconditioning technique, overcomes this seemingly inevitable $O(\min\{N^2 P, NP^2\})$ growth of the computational cost. Their algorithm is based on a novel application of the CG method (Hestenes and Stiefel, 1952; Lanczos, 1952), which belongs to a family of iterative methods in numerical linear algebra. Despite its first appearance in 1952, CG received little attention for the next few decades, only making its way into major software packages such as MATLAB in the 1990s (Van der Vorst, 2003). With its ability to solve a large and structured linear system $\Phi\theta = b$ via a small number of matrix-vector multiplications $v \to \Phi v$ without ever explicitly inverting $\Phi$, however, CG has since emerged as an essential and prototypical algorithm for modern scientific computing (Cipra, 2000; Dongarra et al., 2016).

Despite its earlier rise to prominence in other fields, CG had not found practical applications in Bayesian computation until rather recently (Nishimura and Suchard, 2018; L. Zhang et al., 2019). We can offer at least two explanations for this. First, being an algorithm for solving a deterministic linear system, it is not obvious how CG would be relevant to Monte Carlo simulation, such as sampling from $\mathrm{Normal}_P(\mu, \Phi^{-1})$; ostensibly, such a task requires computing a "square root" $L$ of the precision matrix so that $\mathrm{Var}(L^{-1} z) = L^{-1} L^{-\top} = \Phi^{-1}$ for $z \sim \mathrm{Normal}_P(0, I_P)$. Secondly, unlike direct linear algebra methods, iterative methods such as CG have a variable computational cost that depends critically on the user's choice of a preconditioner and thus cannot be used as a "black-box" algorithm. More generally, this novel application of CG to Bayesian computation is a reminder that other powerful ideas in other computationally intensive fields may remain untapped by the statistical computing community; knowledge transfers will likely be facilitated by having more researchers working at intersections of different fields.
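Before describing how CG is turned into a sampler, the matrix-free character of the method is easy to demonstrate: below, SciPy's cg routine solves $\Phi\theta = b$ using only the products $v \to Xv$ and $w \to X^\top w$, never forming $\Phi$ explicitly. The toy dimensions, the half-Cauchy local scales and the right-hand side are arbitrary illustrative assumptions under the $\Phi$ structure of (4).

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(5)

    # Toy dimensions and data; in applications N and P are far larger.
    N, P = 500, 200
    X = rng.standard_normal((N, P))
    omega = rng.uniform(0.5, 2.0, size=N)        # diagonal of Omega
    tau, lam = 0.1, np.abs(rng.standard_cauchy(P))

    def phi_matvec(v):
        # v -> Phi v = X^T Omega X v + tau^{-2} Lambda^{-2} v, using only X v and X^T w.
        return X.T @ (omega * (X @ v)) + v / (tau ** 2 * lam ** 2)

    Phi = LinearOperator((P, P), matvec=phi_matvec)
    b = rng.standard_normal(P)                   # arbitrary right-hand side

    theta, info = cg(Phi, b)                     # matrix-free conjugate gradient solve
    print("converged:", info == 0, " residual:", np.linalg.norm(phi_matvec(theta) - b))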
Nishimura and Suchard (2018) turn CG into a viable algorithm for Bayesian sparse regression problems by realizing that (1) we can obtain a Gaussian vector $b \sim \mathrm{Normal}_P(X^\top \Omega y, \Phi)$ by first generating $z \sim \mathrm{Normal}_P(0, I_P)$ and $\zeta \sim \mathrm{Normal}_N(0, I_N)$ and then setting $b = X^\top \Omega y + X^\top \Omega^{1/2} \zeta + \tau^{-1}\Lambda^{-1} z$, and (2) subsequently solving $\Phi\theta = b$ yields a sample $\theta$ from the distribution (4). The authors then observe that the mechanism through which a shrinkage prior induces sparsity of the $\theta_p$'s also induces a tight clustering of eigenvalues in the prior-preconditioned matrix $\tau^2 \Lambda \Phi \Lambda$. This fact makes it possible for prior-preconditioned CG to solve the system $\Phi\theta = b$ in $K$ matrix-vector operations of the form $v \to \Phi v$, where $K$ roughly represents the number of "significant" $\theta_p$'s that are distinguishable from zero under the posterior. For $\Phi$ having a structure as in (4), $\Phi v$ can be computed via matrix-vector multiplications of the form $v \to Xv$ and $w \to X^\top w$, so each $v \to \Phi v$ operation requires a fraction of the computational cost of directly computing $\Phi$ and then factorizing it.

Prior-preconditioned CG demonstrates an order of magnitude speed-up in posterior computation when applied to a comparative effectiveness study of atrial fibrillation treatment involving $N = 72{,}489$ patients and $P = 22{,}175$ covariates (Nishimura and Suchard, 2018). Though unexplored in their work, the algorithm's heavy use of matrix-vector multiplications provides avenues for further acceleration. Technically, the algorithm's complexity may be characterized as $O(NPK)$, for the $K$ matrix-vector multiplications by $X$ and $X^\top$, but the theoretical complexity is only a part of the story. Matrix-vector multiplications are amenable to a variety of hardware optimizations, which in practice can make orders of magnitude difference in speed (Section 4.2). In fact, given how arduous manually optimizing computational bottlenecks can be, designing algorithms so as to take advantage of common routines (such as those in Level 3 BLAS) and their ready-optimized implementations has been recognized as an effective principle in algorithm design (Golub and Van Loan, 2012).

Whereas big N and big P regression adapts a classical statistical task to contemporary needs, the 21st century is witnessing the application of computational statistics to the entirety of applied science. One such example is the tracking and reconstruction of deadly global viral pandemics. Molecular phylogenetics has become an essential analytical tool for understanding the complex patterns in which rapidly evolving pathogens propagate throughout and between countries, owing to the complex travel and transportation patterns evinced by modern economies (Pybus et al., 2015), along with other factors such as increased global population and urbanisation (Bloom et al., 2017). Advances in sequencing technology are generating pathogen genomic data at an ever-increasing pace, with a trend toward real time that requires the development of computational statistical methods able to process the sequences in a timely manner and produce interpretable results to inform national and global public health organizations. The previous three Core Challenges are usually intertwined: increases in the sample size (big N) and in the number of traits measured per sample (big P) usually happen simultaneously and lead to increased heterogeneity that requires more complex models (big M).
For example, recent studies in viral evolution have seen a continuing increase in sample size: West Nile virus, Dengue, HIV and Ebola virus studies involve 104, 352, 465 and 1,610 sequences, respectively (Pybus et al., 2012; Nunes et al., 2014; Bletsa et al., 2019; Dudas et al., 2017), and the GISAID database had collected 92,000 COVID-19 genomic sequences by the end of August 2020 (Elbe and Buckland-Merrett, 2017). To accommodate the increasing size and heterogeneity in the data and to be able to apply the aforementioned efficient gradient-based algorithms, Ji et al. (2020) propose a linear-time algorithm for calculating an O(N)-dimensional gradient on a tree with respect to the parameters of sequence evolution. The linear-time gradient algorithm calculates each branch-specific derivative through a pre-order traversal that complements the post-order traversal used in the likelihood calculation of the observed sequence data at the tips of the phylogeny, which marginalizes over all possible hidden states on the internal nodes. The pre- and post-order traversals complete Baum's forward-backward algorithm in a phylogenetic framework (Baum, 1972). The authors then apply the gradient algorithm with HMC (Section 2.2) samplers to learn the branch-specific viral evolutionary rates.

Thanks to these advanced computational methods, one can employ more flexible models that lend themselves to more realistic reconstructions and uncertainty quantification. Following a random-effects relaxed clock model, they model the evolutionary rate $r_p$ of branch $p$ on a phylogeny as the product of a global tree-wise mean parameter $\mu$ and a branch-specific random effect $\epsilon_p$. They model the random effects $\epsilon_p$ as independent and identically distributed from a lognormal distribution such that $\epsilon_p$ has mean 1 and variance $\psi^2$ under a hierarchical model, where $\psi$ is the scale parameter. To accommodate the difference in scales of the variability in the parameter space for the HMC sampler, the authors adopt preconditioning with an adaptive mass matrix informed by the diagonal entries of the Hessian matrix. More precisely, the non-zero diagonal elements of the mass matrix are truncated values of the Hessian diagonal estimate $H^{(s)}$ accumulated over the first $s$ HMC iterations. Figure 1 illustrates the estimated maximum clade credibility evolutionary tree of the Dengue virus dataset (Nunes et al., 2014). The authors report relative speedup in terms of the effective sample size per second (ESS/s) of the HMC samplers compared to a univariate transition kernel. The 'vanilla' HMC sampler with an identity mass matrix gains a 2.2× speedup for the minimum ESS/s and a 2.5× speedup for the median ESS/s, whereas the 'preconditioned' HMC sampler gains 16.4× and 7.4× speedups, respectively. Critically, the authors make these performance gains available to scientists everywhere through the popular, open-source software package for viral phylogenetic inference, Bayesian evolutionary analysis by sampling trees (BEAST). In Section 4.1, we discuss how software like BEAST addresses Core Challenge 4, the creation of fast, flexible and friendly statistical algo-ware.
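As a small numerical check on the relaxed-clock parameterization above, the sketch below simulates branch-specific rates $r_p = \mu\,\epsilon_p$ with lognormal random effects constrained to have mean 1 and variance $\psi^2$; the values of mu, psi and the number of branches are arbitrary assumptions, and the code is a simulation from the model, not the authors' inference procedure.

    import numpy as np

    rng = np.random.default_rng(6)

    mu, psi, n_branches = 4e-4, 0.5, 100          # illustrative values only

    # For eps ~ LogNormal(m, s^2) with mean 1 and variance psi^2:
    #   s^2 = log(1 + psi^2) and m = -s^2 / 2.
    s2 = np.log1p(psi ** 2)
    eps = rng.lognormal(mean=-s2 / 2, sigma=np.sqrt(s2), size=n_branches)
    rates = mu * eps                              # branch-specific evolutionary rates

    print("sample mean of eps (should be near 1):", eps.mean().round(3))
    print("sample variance of eps (should be near psi^2 = 0.25):", eps.var().round(3))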
Section 3 provides examples of how computational statisticians might address Core Challenges 1-3 (big N, big P and big M) for individual models. Such advances in computational methods must be accompanied by easy-to-use software to make them accessible to end-users. As Gentle et al. (2012) put it, "While referees and editors of scholarly journals determine what statistical theory and methods are published, the developers of the major statistical software packages determine what statistical methods are used." We would like statistical software to be widely applicable yet computationally efficient at the same time. Trade-offs invariably arise between these two desiderata, but one should nonetheless strive to design algorithms that are general enough to solve an important class of problems and do so as efficiently as possible. Section 4.1 presents Core Challenge 4, achieving 'algo-ware' (a neologism suggesting an equal emphasis on the statistical algorithm and its implementation) that is sufficiently efficient, broad and user-friendly to empower everyday statisticians and data scientists. Core Challenge 5 (Section 4.2) explores the mapping of these algorithms to computational hardware for optimal performance. Hardware-optimized implementations often exploit model-specific structures, but good, general-purpose software should also optimize common routines.

To accommodate the greatest range of models while remaining simple enough to encourage easy implementation, inference methods should rely solely on the quantities that can be computed algorithmically for any given model. The log-likelihood (or log-density in the Bayesian setting) is one such quantity, and one can employ the computational graph framework (Lunn et al., 2009; Bergstra et al., 2010) to evaluate conditional log-likelihoods for any subset of model parameters, as well as their gradients via backpropagation (Rumelhart et al., 1986). Beyond being efficient in terms of the first three Core Challenges, an algorithm should demonstrate robust performance on a reasonably wide range of problems without extensive tuning if it is to lend itself to successful software deployment.

HMC (Section 2.2) is a prominent example of a general-purpose algorithm for Bayesian inference, requiring only the log-density and its gradient. The generic nature of HMC has opened up possibilities for complex Bayesian modeling as early as Neal (1996), but its performance is highly sensitive to model parametrization and to its three tuning parameters, commonly referred to as trajectory length, stepsize and mass matrix (Neal, 2011). Tuning issues constitute a major obstacle to the wider adoption of the algorithm, as evidenced by the development history of the popular HMC-based probabilistic programming software Stan (Gelman, 2014), which employs the No-U-Turn sampler (NUTS) of Hoffman and Gelman (2014) to make HMC user-friendly by obviating the need to tune its trajectory length. Bayesian software packages such as Stan empirically adapt the remaining stepsize and mass matrix (Stan Development Team, 2018); this approach helps make the use of HMC automatic, though it is not without issues (Livingstone and Zanella, 2019) and comes at the cost of significant computational overhead.

Although HMC is a powerful algorithm that has played a critical role in the emergence of general-purpose Bayesian inference software, the challenges involved in its practical deployment also demonstrate how an algorithm, no matter how versatile and efficient at its best, is not necessarily useful unless it can be made easy for practitioners to use. It is also unlikely that one algorithm works well in all situations. In fact, there are many distributions on which HMC performs poorly (Stan Development Team, 2018; Mangoubi et al., 2018).
Additionally, HMC is incapable of handling discrete distributions in a fully general manner, despite the progress made in extending HMC to such situations (Dinh et al., 2017; Nishimura et al., 2020).

But broader applicability comes with its own challenges. Among sampling-based approaches to Bayesian inference, the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) is, arguably, the most versatile of MCMC methods. The algorithm simplifies the task of dealing with a complex multi-dimensional posterior distribution by factorizing the posterior into simpler conditional distributions for blocks of parameters and iteratively updating parameters from their conditionals. Unfortunately, the efficiency of an individual Gibbs sampler depends on its specific factorization and the degree of dependence between its blocks of parameters. Without a careful design or in the absence of an effective factorization, therefore, Gibbs samplers' performance may lag behind alternatives such as HMC (Monnahan et al., 2017). On the other hand, Gibbs samplers often require little tuning and can take advantage of highly optimized algorithms for each conditional update, as done in the examples of Section 3. A clear advantage of the Gibbs sampler is that it tends to make software implementation quite modular; for example, each conditional update can be replaced with the latest state-of-the-art samplers as they appear (Z. Zhang et al., 2020), and adding a new feature may amount to no more than adding a single conditional update. In this way, an algorithm may not work in a completely model-agnostic manner but, with a broad enough scope, can serve as a valuable recipe or meta-algorithm for building model-specific algorithms and software. The same is true for optimization methods. Even though its "E"-step requires a derivation (by hand) for each new model, the EM algorithm (Dempster et al., 1977) enables maximum likelihood estimation for a wide range of models. Similarly, variational inference (VI) for approximate Bayes requires manual derivations but provides a general framework to turn posterior computation into an optimization problem (Jordan et al., 1999). As meta-algorithms, both EM and VI expand their breadth of use by replacing analytical derivations with Monte Carlo estimators but suffer losses in statistical and computational efficiency (Wei and Tanner, 1990; Ranganath et al., 2014). Indeed, such trade-offs will continue to haunt the creation of fast, flexible and friendly statistical algo-ware well into the 21st century.

But successful statistical inference software must also interact with computational hardware in an optimal manner. Growing datasets require the computational statistician to give more and more thought to how the computer implements any statistical algorithm. To effectively leverage computational resources, the statistician must (1) identify the routine's computational bottleneck (Section 2.1) and (2) algorithmically map this rate-limiting step to available hardware such as a multi-core or vectorized central processing unit (CPU), a many-core graphics processing unit (GPU) or, in the future, a quantum computer. Sometimes, the first step is clear theoretically: a naive implementation of the high-dimensional regression example of Section 3.1 requires an order $O(NP^2)$ matrix multiplication followed by an order $O(P^3)$ Cholesky decomposition.
Other times, one can use an instruction-level program profiler, such as Intel VTune (Windows, Linux) or Instruments (OSX), to identify a performance bottleneck. Once the bottleneck is identified, one must choose between computational resources, or some combination thereof, based on their relative strengths and weaknesses as well as the natural parallelism of the target task.

Multi-core CPU processing is effective for parallel completion of multiple, mostly independent tasks that do not require inter-communication. One might generate 2 to, say, 72 independent Markov chains on a desktop computer or shared cluster. A positive aspect is that the tasks do not have to involve the same instruction sets at all; a negative is latency, i.e., that the slowest process dictates overall runtime. It is possible to further speed up CPU computing with single instruction, multiple data (SIMD), or vector, processing. A small number of vector processing units (VPUs) in each CPU core can carry out a single set of instructions on data stored within an extended-length register. Intel's streaming SIMD extensions (SSE), advanced vector extensions (AVX) and AVX-512 allow operations on 128-, 256- and 512-bit registers. In the context of 64-bit double precision, the theoretical speedups for SSE, AVX and AVX-512 are 2-fold, 4-fold and 8-fold. For example, if a computational bottleneck exists within a for-loop, one can unroll the loop and perform operations on, say, 4 consecutive loop bodies at once using AVX (Holbrook et al., 2020a,b). Conveniently, frameworks such as OpenMP (Dagum and Menon, 1998) make SIMD loop optimization transparent to the user (Warne et al., 2019). Importantly, SIMD and multi-core optimization play well together, providing multiplicative speedups.

Whereas a CPU may have tens of cores, GPUs accomplish fine-grained parallelization with thousands of cores that apply a single instruction set to distinct data within smaller work-groups of tens or hundreds of cores. Quick communication and shared cache memory within each work-group balance full parallelization across groups, and dynamic on-loading and off-loading of the many tasks hides the latency that is so problematic for multi-core computing. Originally designed for efficiently parallelized matrix math calculations arising from image rendering and transformation, GPUs easily speed up tasks that are tensor-multiplication intensive, such as deep learning (Bergstra et al., 2011), but general-purpose GPU applications abound. Holbrook et al. (2020a) provides a larger review of parallel computing within computational statistics. The same paper reports a GPU providing 200-fold speedups over single-core processing and 10-fold speedups over 12-core AVX processing for likelihood and gradient calculations while sampling from a Bayesian multi-dimensional scaling posterior using HMC at scale. Holbrook et al. (2020b) reports similar speedups for inference based on spatio-temporal Hawkes processes. Neither application involves matrix or tensor manipulations.
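A minimal sketch of the coarse-grained multi-core pattern described above: several independent Markov chains run in parallel across CPU cores with no inter-communication. The toy standard-normal target, chain settings and use of Python's multiprocessing module are illustrative assumptions, not a recommendation from the text.

    import multiprocessing as mp

    import numpy as np

    def run_chain(seed, n_iter=20_000, step=0.5):
        """One independent random-walk Metropolis chain on a standard normal target."""
        rng = np.random.default_rng(seed)
        theta, lp = 0.0, 0.0
        total = 0.0
        for _ in range(n_iter):
            prop = theta + step * rng.standard_normal()
            lp_prop = -0.5 * prop ** 2
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
            total += theta
        return total / n_iter                     # posterior mean estimate from this chain

    if __name__ == "__main__":
        seeds = range(8)                          # e.g., one chain per core
        with mp.Pool(processes=8) as pool:
            means = pool.map(run_chain, seeds)    # embarrassingly parallel: no communication
        print("per-chain posterior mean estimates:", np.round(means, 3))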
A quantum computer acts on complex data vectors of magnitude 1 called qubits with gates that are mathematically equivalent to unitary operators (Nielsen and Chuang, 2002). Assuming that engineers overcome the tremendous difficulties involved in building a practical quantum computer (where practicality entails simultaneous use of many quantum gates with little additional noise), 21st century statisticians might have access to quadratic or even exponential speedups for extremely specific statistical tasks. We are particularly interested in the following five quantum algorithms: quantum search (Grover, 1996), or finding a single 1 amid a collection of 0s, only requires $O(\sqrt{N})$ queries, delivering a quadratic speedup over classical search; quantum counting (Boyer et al., 1998), or finding the number of 1s amid a collection of 0s, only requires $O(\sqrt{N/M})$ queries (where $M$ is the number of 1s) and could be useful for generating p-values within Monte Carlo simulation from a null distribution (Section 2.1); the quantum minimization algorithm of Durr and Hoyer (1996) finds the minimizer of a finite ordered set in time $O(\sqrt{N})$ and features in the quantum parallel MCMC of Holbrook (2021); to obtain the gradient of a function (e.g., the log-likelihood for Fisher scoring or HMC) with a quantum computer, one only needs to evaluate the function once (Jordan, 2005), as opposed to $O(P)$ times for numerical differentiation; finally, the HHL algorithm (Harrow et al., 2009) obtains the scalar value $q^\top M q$, for the $P$-vector $q$ satisfying $Aq = b$ and $M$ a $P \times P$ matrix, in time $O(\log(P)\,\kappa^2)$, for $\kappa$ the condition number of $A$, delivering an exponential speedup over classical methods. Technical caveats exist (Aaronson, 2015), but HHL may find use within high-dimensional hypothesis testing (big P). Under the null hypothesis, one can rewrite the score test statistic $u^\top(\theta_0)\, I^{-1}(\theta_0)\, u(\theta_0)$ as $u^\top(\theta_0)\, I^{-1}(\theta_0)\, I(\theta_0)\, I^{-1}(\theta_0)\, u(\theta_0)$, for $I(\theta_0)$ and $u(\theta_0)$ the Fisher information and log-likelihood gradient evaluated at the maximum likelihood solution under the null hypothesis. Letting $A = I(\theta_0) = M$ and $b = u(\theta_0)$, one may write the test statistic as $q^\top M q$ and obtain it in time logarithmic in $P$. When the model design matrix $X$ is sufficiently sparse, a common enough occurrence in large-scale regression, to render $I(\theta_0)$ itself sparse, the last criterion for the application of the HHL algorithm is met.
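For intuition, the following classical sketch computes the same quantity $q^\top M q$ that HHL targets by solving $I(\theta_0)\,q = u(\theta_0)$ and contracting; the simulated Fisher information and gradient are stand-ins for quantities that would come from a real model, and there is of course no quantum speedup here.

    import numpy as np

    rng = np.random.default_rng(7)

    P = 50
    # Stand-in Fisher information: symmetric positive definite.
    B = rng.standard_normal((P, P))
    fisher_info = B @ B.T + P * np.eye(P)
    score = rng.standard_normal(P)               # stand-in log-likelihood gradient u(theta_0)

    # Score statistic u^T I^{-1} u computed as q^T M q with I q = u and M = I(theta_0).
    q = np.linalg.solve(fisher_info, score)
    statistic = q @ fisher_info @ q              # equals score @ q = u^T I^{-1} u

    print(np.isclose(statistic, score @ q))      # True: the two forms agree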
Core Challenges 4 and 5 (fast, flexible and user-friendly algo-ware and hardware-optimized inference) embody an increasing emphasis on application and implementation in the age of data science. Previously undervalued contributions in statistical computing, e.g., hardware utilization, database methodology, computer graphics, statistical software engineering and the human-computer interface (Gentle et al., 2012), are slowly taking on greater importance within the (rather conservative) discipline of statistics. There is perhaps no better illustration of this trend than Dr. Hadley Wickham's winning the prestigious COPSS Presidents' Award for 2019 "[for] influential work in statistical computing, visualization, graphics, and data analysis; for developing and implementing an impressively comprehensive computational infrastructure for data analysis through R software; for making statistical thinking and computing accessible to a large audience; and for enhancing an appreciation for the important role of statistics among data scientists" (COPSS, 2020). This success is all the more impressive because Presidents' Awardees have historically been recognized for contributions to statistical theory and methodology, not for scientific software development for data manipulation (Wickham et al., 2007, 2011, 2014) and visualization (Kahle and Wickham, 2013; Wickham, 2016) like Dr. Wickham's. All of this might lead one to ask: does the success of data science portend the declining significance of computational statistics and its Core Challenges? Not at all!

At the most basic level, data science's emphasis on application and implementation underscores the need for computational thinking in statistics. Moreover, the scientific breadth of data science brings new applications and models to the attention of statisticians, and these models may require or inspire novel algorithmic techniques. Indeed, we look forward to a golden age of computational statistics, in which statisticians labor within the intersections of mathematics, parallel computing, database methodologies and software engineering with impact on the entirety of the applied sciences. After all, significant progress toward conquering the Core Challenges of computational statistics requires that we use every tool at our collective disposal.

References

Read the fine print
Statistics Degrees total and by gender. https://ww2.amstat.org/misc/StatTable1987-Current
An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process
The discarding of variables in multivariate analysis
Deep learning on gpus with python. Pages 1-48 in NIPS 2011, BigLearning Workshop
Theano: a CPU and GPU math expression compiler
Fast sampling with Gaussian scale mixture priors in high-dimensional regression
Divergence dating using mixed effects clock modelling: An application to HIV-1
The comparison of dosage-mortality data
Emerging infectious diseases: a proactive approach
The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method
Convex optimization
Tight bounds on quantum searching
Handbook of Markov chain Monte Carlo
The horseshoe estimator for sparse signals
Survey paper on big data
The best of the 20th century: Editors name top 10 algorithms
Data science: an action plan for expanding the technical areas of the field of statistics
COPSS. 2020. Committee of Presidents of Statistical Societies (Website)
Openmp: an industry standard api for shared-memory programming
Data scientist
Handbook of genetic algorithms
Maximum likelihood from incomplete data via the EM algorithm
Probabilistic path Hamiltonian Monte Carlo
High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems
50 years of data science
Virus genomes reveal factors that spread and sustained the ebola epidemic
A quantum algorithm for finding the minimum
Bootstrap methods: another look at the jackknife. Pages 569-593 in Breakthroughs in statistics
An introduction to the bootstrap
Data, disease and diplomacy: Gisaid's innovative contribution to global health
On the mathematical foundations of theoretical statistics
The design of experiments. (Especially Section 21)
Statistical methods for research workers
Regularization paths for generalized linear models via coordinate descent
Sampling-based approaches to calculating marginal densities
Petascale hierarchical modeling via parallel execution
Bayesian data analysis
Efficient metropolis jumping rules
Stochastic relaxation, gibbs distributions, and the bayesian restoration of images
How computational statistics became the backbone of modern data science. Pages 3-16 in Handbook of Computational Statistics
Approaches for bayesian variable selection. Statistica sinica
Markov chain Monte Carlo maximum likelihood
Rao-Blackwellization for Bayesian variable selection and model averaging in linear and binary regression: a novel data augmentation approach
Google Trends. 2020. Data source: Google trends
A fast quantum mechanical algorithm for database search
Quantum algorithm for linear systems of equations
Statistical learning with sparsity: the lasso and generalizations
Monte Carlo sampling methods using Markov chains and their applications
Methods of conjugate gradients for solving linear systems
Selection of the best subset in regression analysis
The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo
A bayesian supervised dual-dimensionality reduction model for simultaneous decoding of lfp and spike train signals
A quantum parallel markov chain monte carlo
Massive parallelization boosts big bayesian multidimensional scaling
Scalable bayesian inference for self-excitatory stochastic processes applied to big american gunfire data
A tutorial on mm algorithms
Gradients do grow on trees: a linear-time O(N)-dimensional gradient for statistical phylogenetics
An introduction to variational methods for graphical models
Fast quantum algorithm for numerical gradient estimation
ggmap: Spatial visualization with ggplot2
Particle swarm optimization
Wormhole Hamiltonian Monte Carlo
Solution of systems of linear equations by minimized iterations
Kinetic energy choice in Hamiltonian/hybrid Monte Carlo
On the robustness of gradient-based mcmc algorithms
The BUGS project: Evolution, critique and future directions
Model selection and accounting for model uncertainty in graphical models using occam's window
Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities?
Generalized linear models, 2nd edn. (Chapman and Hall: London)
Equation of state calculations by fast computing machines
Bayesian variable selection in linear regression
Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo
Some np-complete problems in quadratic and nonlinear programming
Bayesian Learning for Neural Networks
MCMC using Hamiltonian Dynamics. In Handbook of Markov chain Monte Carlo
Quantum computation and quantum information
Geometrically tempered Hamiltonian Monte Carlo
Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods
Prior-preconditioned conjugate gradient for accelerated gibbs sampling in "large n & large p" sparse bayesian logistic regression models
Adaptive sampling for Bayesian variable selection
Air travel is associated with intracontinental spread of dengue virus serotypes 1-3 in Brazil
Personalized psychiatry: big data analytics in mental health
Shrink globally, act locally: Sparse Bayesian regularization and prediction
Bayesian inference for logistic models using Pólya-Gamma latent variables
Unifying the spatial epidemiology and molecular evolution of emerging epidemics
Virus evolution and transmission in an ever more connected world
Black box variational inference
Gaussian Markov random fields: theory and applications
Learning representations by back-propagating errors
Linear regression analysis
Stan Modeling Language Users Guide and Reference Manual, Version 2
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10
Comprehensive comparative effectiveness and safety of first-line antihypertensive drug classes: a systematic, multinational, large-scale analysis
A curated database reveals trends in single-cell transcriptomics
Regression shrinkage and selection via the lasso
Markov chains for exploring posterior distributions. The Annals of Statistics
Mode jumping proposals in mcmc
Iterative Krylov Methods for Large Linear Systems
The art of data augmentation
Statistical tests based on permutations of the observations
Acceleration of expensive computations in bayesian statistics using vector operations
A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms
ggplot2: elegant graphics for data analysis
Reshaping data with the reshape package
The split-apply-combine strategy for data analysis
Tidy data
Gaussian processes for regression. Pages 514-520 in Advances in neural information processing systems
Gaussian processes for machine learning
Practical Bayesian modeling and inference for massive spatial data sets on modest computing environments. Statistical Analysis and Data Mining
Large-scale inference of correlation among mixed-type biological traits with phylogenetic multivariate probit models

Acknowledgments

AJH is supported by NIH grant K25AI153816. MAS is supported by NIH grant U19AI135995 and NSF grant DMS1264153.