key: cord-0623375-hz6g6f8e
authors: Cappello, Lorenzo; Palacios, Julia A.
title: Adaptive preferential sampling in phylodynamics
date: 2020-09-04
journal: nan
DOI: nan
sha: 52033b35419a4d1dfc76148d6144f71bdb2f94d7
doc_id: 623375
cord_uid: hz6g6f8e

Longitudinal molecular data of rapidly evolving viruses and pathogens provide information about disease spread and complement traditional surveillance approaches based on case count data. The coalescent is used to model the genealogy that represents the sample ancestral relationships. The basic assumption is that coalescent events occur at a rate inversely proportional to the effective population size $N_{e}(t)$, a time-varying measure of genetic diversity. When the sampling process (collection of samples over time) depends on $N_{e}(t)$, the coalescent and the sampling processes can be jointly modeled to improve estimation of $N_{e}(t)$. Failing to do so can lead to bias due to model misspecification. However, the way that the sampling process depends on the effective population size may vary over time. We introduce an approach where the sampling process is modeled as an inhomogeneous Poisson process with rate equal to the product of $N_{e}(t)$ and a time-varying coefficient, making minimal assumptions on their functional shapes via Markov random field priors. We provide scalable algorithms for inference, show the model performance vis-a-vis alternative methods in a simulation study, and apply our model to SARS-CoV-2 sequences from Los Angeles and Santa Clara counties. The methodology is implemented and available in the R package adapref.

Molecular sequence data, within the framework of phylodynamics (Grenfell et al. 2004) , is increasingly being used to track disease spread caused by rapidly evolving viruses and pathogens such as Influenza viruses (Rambaut et al. 2008) , Zika (Faria et al. 2016) , and SARS-CoV-2 (Hadfield et al. 2018) . The coalescent process (Kingman 1982b,a) , a probability model of gene genealogies, depends on a parameter called effective population size N e (t), which is a time-varying measure of genetic diversity. When disease dynamics can be modeled by simple epidemiological models such as Susceptible-Infected-Recovered, the coalescent effective population size can be expressed in terms of transmission rates and prevalence (Volz et al. 2009, Frost and Volz 2010) . Accurate and scalable inference for N e (t) is thus relevant to estimate epidemiological parameters of great interest in public health. Although this work is motivated by applications in molecular epidemiology of infectious diseases, estimation of N e (t) is an active area of research with applications ranging across many other scientific domains such as conservation biology and population genetics (e.g. Shapiro et al. (2004) , Huff et al. (2010) , Lorenzen et al. (2011) ).

A common feature in these applications is that genetic data are collected sequentially (heterochronous samples). In viral studies, samples are collected and sequenced when infected individuals attend clinics, hospitals, or testing centers. In ancient DNA studies, specimens are dated according to the time they lived, estimated through radiocarbon dating or other techniques. The coalescent typically models the gene genealogy conditionally on sampling dates, that is, the sampling dates are treated as censoring information (Felsenstein and Rodrigo 1999) . However, in some situations, it is reasonable to assume that samples are collected at a higher frequency when the population is large and at a lower frequency when the population is small: for example, at the onset of an epidemic, as the viral population grows and more people get infected, more resources may be allocated to monitor the viral spread, possibly leading to more molecular sequence collection. The number of SARS-CoV-2 sequences uploaded daily in GISAID offers some evidence of this claim (Shu and McCauley 2017 ) (see the histogram in the supplementary material). Karcher et al. (2016) study the scenario in which the sampling process depends on the population size, and show that an estimator of N e (t) that does not account for this dependence is biased.

This issue was first discussed in the spatial statistics literature by Diggle et al. (2010) , who term preferential sampling a situation in which the process that determines the data locations and the process under study are dependent. In this paper, we will introduce a new model that account for preferential sampling in a coalescent framework, while making minimal assumptions on N e (t), the sampling process, and their dependence.

Three estimators that incorporate preferential sampling into the coalescent framework have been proposed. Volz and Frost (2014) propose an estimator in the case that N e (t) grows exponentially and samples are collected as an inhomogeneous Poisson process with rate linearly dependent on the effective population size. Karcher et al. (2016) assume that N e (t) is a continuous function, and the samples are collected as an inhomogeneous Poisson process with rate λ(t) = exp(β 0 )N e (t) β 1 , for β 0 , β 1 ≥ 0, i.e. the dependence between the sampling process and the effective sample size is described by a parametric model. Recent work by Parag et al. (2020) weakens this assumption substantially, allowing for the dependence between the sampling process and effective population size to vary over time. The key assumption in Parag et al. (2020) is that the sampling rate depends linearly on N e (t) within a given time interval, but the linear coefficient changes across time intervals. The estimator of Parag et al. (2020) , termed Epoch skyline plot (ESP), is an extension of the classic skyline plot estimator for N e (t) (Pybus et al. 2000) , in which the sampling rate and the effective population size are both piecewise-constant, and the location and number of change points (boundary points of the time intervals) are either specified or inferred. As it is typical with skyline plots, the estimates are highly variable, rough, and highly dependent on the specification of change points locations and the number of piecewise-constant pieces used. All three works show that under correct model specification, accounting for preferential sampling leads to a more accurate estimation of N e (t) (in terms of absolute deviations to the true trajectory), and narrower credible regions.

Other non-coalescent approaches for phylodynamics, such as birth-death processes (Stadler 2010) , explicitly incorporate the sampling process by modeling the sampling dates as a partially observed death process where only a fraction of the population is observed. Stadler et al. (2013) extended previous work to allow sampling rates to vary through time. Volz and Frost (2014) show that in both, coalescent and birth-death processes alike, statistical power largely depends on the correct specification of the sampling process rate, rather than on the genealogical model. Hence, the need for a flexible modeling approach of the sampling process, adaptive to any possible scenarios encountered in applications.

There are a plethora of nonparametric estimators of N e (t) following the skyline plot (Pybus et al. 2000) : among others, the generalized skyline plot (Strimmer and Pybus 2001) and the Bayesian skyline plot (Drummond et al. 2005 ) reduce the high variance and roughness that characterize the skyline plot estimators. These methodologies require either fixing or estimating change-points in N e (t). A set of models that do not employ change-points but arbitrary discrete grids is based on Markov random fields (MRF): the Gaussian MRF (GMRF) (Minin et al. 2008, Palacios and Minin 2012) allows for the recovery of smooth continuous trajectories; the Horseshoe MRF (HSMRF) (Faulkner et al. 2020) is an alternative to GMRF which is locally adaptive, i.e. it can successfully recover sharp changes in a trajectory and it is adaptive to a varying level of smoothness.

In this paper, we borrow from this literature and introduce an adaptive preferential sampling framework for phylodynamics, where the adaptivity follows from the fact that the dependence of the sampling process on N e (t) changes over time. The effective population size N e (t) is modeled as a latent parameter included in both, the coalescent and the sampling processes. The latter assumed to be an inhomogeneous Poisson process with rate λ(t) = β(t)N e (t), where β(t) is a continuous function controlling the dependence on N e (t), analogous to that introduced by Parag et al. (2020) . We a priori model N e (t) and β(t) as two Markov random fields (MRF), with the flexibility of using either a GMRF or a HSMRF. The prior choice follows from the properties of the fields. The advantage of the proposed adaptive preferential sampling over the ESP estimator is that there is no need to specify (or estimate) the number and location of the change points of β and N e . Also, the resulting estimates are smooth and the high variability that characterizes skyline estimates disappears.

We develop the methodology assuming that a genealogy is available to the researcher and develop algorithms for inference under this framework. We test our model on simulated data and compare it to alternative methods, including both, estimators that account for preferential sampling and others that do not. We implement our method in the R package adapref, available at https://github.com/lorenzocapp/adapref, provide two algorithms for poste-rior approximation: a Hamiltonian MCMC and a Laplace approximation. We apply our method to SARS-CoV-2 sequences from California and study whether there is evidence of preferential sampling.

The rest of the paper proceeds as follows. In Section 2, we provide background on the coalescent process, the MRF priors on N e (t), and previous work on preferential sampling. In Section 3 we introduce the adaptive preferential sampling framework and explain how to approximate the posterior distribution of model parameters. Section 4 includes a simulation study, in which we test the new set of priors through simulated data and compare them to alternatives. In Section 5, we apply our method to two data sets of SARS-CoV-2 sequences from Santa Clara and Los Angeles counties in California. Section 6 concludes.

Coalescent models are continuous-time Markov chains used to model the set of ancestral relationships of a sample of n individuals from a large population called gene genealogy. In the context of molecular epidemiology, a genealogy is a subset of the transmission history among the samples ( Figure 1 ). Starting from the original work of Kingman (1982a) , several extensions to the standard coalescent have been developed to incorporate more realistic population and sampling features, such as variable population size (Slatkin and Hudson 1991) , longitudinal sampling (also called heterochronous sampling) (Felsenstein and Rodrigo 1999) and population structure (Hudson 1990); Wakeley (2009) provides a good introduction to the subject. Coalescent processes can be characterized by two underlying processes: a jump chain defining the ancestral relationships represented by a binary tree topology and a pure death process that defines the timing of the coalescent events, i.e. the times when pairs of lineages meet their common ancestors. This sequence of holding times defines the branch length of the corresponding tree topology.

Let the vector n = (n 1 , . . . , n m ) denote the sample sizes at times t s = (t s 1 , . . . , t s m ), with m number of sampling points and n total sample size. The process goes backward in time (from present toward the past): with t s 1 = 0 denoting the present time, and t s j > t s j−1 for j = 2, . . . , m. I 0,7 I 1,7 I 1,6 I 0,6 I 1,5 I 0,5 I 0,4 I 0,3 I 0,2

Example of a heterochronous genealogy. A genealogy of 7 individuals sampled at 4 different times (color of tips) with multiplicities (n 1 = 1, n 2 = 2, n 3 = 2, n 4 = 2). Sampling times are denoted by (t s k ) 1:4 , coalescent times are denoted by (t k ) 2:7 and I i,j denoted the interval lengths delimited by coalescent times and/or sampling times, i.e. every time there is a change in the number of lineages.

Let t = (t n+1 , . . . , t 2 ) be the vector of coalescent times with t n+1 = 0 < t n < ... < t 2 . Note that the subscript in t k is not the current number of extant lineages (often a convention in the coalescent literature) but the number of lineages that have yet to coalesce. Starting from t = 0, vectors t and t s partition time into intervals ( Figure 1 ). An interval ending with a coalescent event, say t k , is denoted by I 0,k ; the intervals that end with a sampling time within the interval (t k+1 , t k ) are denoted as I i,k , where i ≥ 1 indexes all the sampling events in (t k+1 , t k ). Formally, for every k ∈ {2, . . . , n}, we define

where the maximum is taken over all t s j < t k , and for every i ≥ 1 we set

) with the max taken over all t s j−i+1 > t k+1 and t s j < t k .

With n i,k denoting the number of extant lineages during the time interval I i,k . Figure 1 plots an example of a heterochronous genealogy with n = (1, 2, 2, 2), at times t t = (t s 1 , . . . , , t s 4 ) with t s 1 = 0. In the interval (t 6 , t 5 ) there are two intervals: I 1,5 = [t 6 , t s 4 ), I 0,7 = [t s 4 , t 5 ). The vector of coalescent times t is a random vector whose density with respect to the Lebesgue measure on R n−1 + depends on two quantities: the coalescent factor C i,k := n i,k 2 , and the effective population size N e (t). The coalescent density can be factorized as the product of the conditional densities of t k−1 given t k , i.e.

(1)

A few remarks. First, the integral over I i,k−1 accounts for the probability of no coalescence during I i,k−1 . It is zero if there are less than i sampling times between t k and t k−1 . Second, conditionally on s, n and t k , the coalescent factors can be computed exactly and N e (t) is the only unknown parameter, sampling times are assumed fixed.

Coalescent times can be alternatively viewed as the realization of an inhomogeneous point process with rate C(t)N e (t) −1 , with the coalescent factor C(t) being defined for all t ≥ 0 by the notation above. This alternative view allows us to frame the problem of inferring N e (t) as that of inferring the intensity function of an inhomogeneous point process. Palacios and Minin (2013) is an example of how this representation is useful in inference and simulations.

Markov random field-based priors on the log effective population trajectory allows smoothing without careful modeling of change points. They are computationally tractable thanks to the sparsity assumption in the covariance matrix of the field (Rue and Held 2005) . All MRF-based priors for phylodynamic inference share the assumption that the trajectory N e (t) is an unknown continuous function. The integral in (1) is numerically approximated by the Riemann sum at a regular grid of M + 1 points (k i ) 1:M +1 , and one assumes that the trajectory N e (t) is well

We stress that neither the grid cell boundaries (k i ) 1:M +1 nor M depend on t and N e (t), with the choice of M commonly based on n (Faulkner et al. 2020) . A description of the discretized coalescent log-likelihood L(θ|t) is given in detail in Palacios and Minin (2012) and Faulkner et al. (2020) .

The Horseshoe Markov random field prior (HSMRF) for θ (Faulkner et al. 2020 ) assumes that the pth order forward differences of θ are independent and Horseshoe distributed (Carvalho et al. 2010) , i.e.

where C + (0, a) is the standard half-Cauchy distribution with positive support with scale parameter a, τ i are the local shrinkage parameters and γ is the global smoothing parameter. To completely specify the prior, one sets θ 1 ∼ N (µ, σ 2 1 ), and for p ≥ 2, the first p values of the field have running order q difference priors as follows:

with a q = 2 −(p−q)/2 . As it is common in the trend filtering literature (Kim et al. 2009 ), only orders 1 and 2 are typically employed in applications.

A related prior consists in assuming that the pth order forward difference of θ, more pre-

is distributed as a GMRF with mean vector µ and covariance matrix Q(γ) corresponding to:

θ 1 ∼ N (µ, σ 2 1 ), and for 1 ≤ q ≤ p − 1 we set ∆ q θ q |a q γ ∼ N (0, a q γ 2 ). A common alternative to the half-Cauchy distribution on γ is a Gamma prior, e.g. in Palacios and Minin (2012) . We will employ both formulations in our implementations.

A fully nonparametric prior on log N e (t) has been studied by Palacios and Minin (2013) , who proposed a Gaussian process prior on the log effective population size. The advantage of this approach is that no grid needs to be specified a priori. In applications, we believe that the GMRF, the discretized version of this prior, achieves a comparable empirical performance.

Preferential sampling arises when the process that determines the locations of the data (i.e. sampling process) and the process under study are stochastically dependent. The notion was intro-duced by Diggle et al. (2010) who show that not accounting for this effect leads to biased inference as a result of the model misspecification. On the other hand, a correctly specified sampling model can lead to more accurate estimates.

In phylodynamics, preferential sampling arises when the sampling process depends on N e (t). Volz and Frost (2014) provide the first evidence that coalescent-based inference under a misspecified sampling process can be biased. They propose a new estimator tailored to a coalescent process with exponentially growing effective population size and a sampling process with rate linearly dependent on N e (t). They show that the estimator obtained by correctly modeling the sampling process is more accurate than the standard coalescent estimator. Karcher et al. (2016) assume that N e (t) is a continuous function and the sampling process is a Poisson process with rate λ(t) = exp(β 0 )N e (t) β 1 , for β 0 , β 1 ≥ 0, i.e. the rate λ(t) is proportional to the effective population size. This model is parsimonious, capturing a variety of scenarios with two parameters: with β 1 = 1, the rate is a constant times N e (t), on the opposite side of the spectrum, with β 1 = 0, one models uniform sampling. Another advantage is that little assumptions are made on N e (t). However, the parametric assumptions on λ(t) make the sampling dates strongly informative about N e (t). This situation can be problematic in the case of sampling dates errors. Moreover, under no preferential sampling or under a different rate λ(t) 

where X is a vector of covariates and β the corresponding linear coefficients. Here a covariate can be for example a dummy variable indicating a change in sampling protocols, or when a new sampling center joined the study. The term β X adds more flexibility to the parametric dependence enforced by exp(β 0 )N e (t) β 1 . Clearly, this extension implies the availability of covariates informative on the sampling design. Parag et al. (2020) introduce the epoch sampling skyline plot (ESP) estimator that allows for a more flexible dependence of the sampling process on the effective population size. More specifically, the ESP method assumes that N e (t) is a piecewise-constant function with r segments described by the vector (N 1 , . . . , N r ) of r parameters, and time is further partitioned in d epochs, such that in epoch i and segment r the sampling process is a Poisson process with rate β i N j , where (β 1 , . . . , β d ) is a vector of d parameters. The vector (β 1 , . . . , β d ) modulates the dependence of the sampling process on the effective population size, assuming that the dependence changes across d epochs. This is a notable advantage over the parametric model of Karcher et al. (2016) : one can model a variety of realistic time-varying sampling protocols, or simply deal with sampling discontinuities typical of outbreaks. We conjecture that higher flexibility in the preferential model reduces the risk of model misspecification and bias when the preferential sampling assumption does not hold.

In the ESP, the endpoints of the r segments coincide with a subset of the coalescent times t. Similarly, the boundary points of the d epochs are determined by a subset of the sampling times. The number of segments r and epochs d, as well as their lengths, need to be determined or inferred. These choices affect the ESP estimates heavily. The authors implement a frequentist and a Bayesian version with independence assumptions in (N 1 , . . . , N p ) and (β 1 , . . . , β d ), leading to estimates with high variance, a characteristic feature of skyline plot-type estimators.

In the adaptive preferential sampling framework, the sampling times (s i ) 2:n are determined by the jumps of an inhomogenous Poisson process with rate λ(t) = β(t)N e (t), with N e (t) effective population size, and β(t), a function modulating the linear dependence between λ(t) and N e (t), and s 1 is fixed to 0. We assume that both β(t) and N e (t) are unknown continuous functions.

To numerically approximate the integrals in (1), we resort to the approximation sketched in in Section 2.2 and detailed in Palacios and Minin (2012) . We employ the regular grid (k i ) 1:M +1 and assume that N e (t) is governed by parameters

A few preliminary remarks. Modeling the log β(t) ensures that β(t) ≥ 0. M is not related to the number of epochs d in ESP: it is solely determined by s n and the grid (k i ) 1:M +1 , which in turn does not depend on t. We have by definition M < M to ensure that β(t) is not modeled after the last sampling time. Lastly, after discretizing β(t), we can write the log-likelihood contribution of the sampling process as

where ∆ i = k i+1 − k i and the first interval [k 1 , k 2 ] is closed to include s 1 . Through the term Under the GMRF prior on β(t), one favors smooth sampling designs, a situation which is also desirable when one does not have exact knowledge of the underlying sampling protocol. Note that the choice of field and order of the priors can be disjoint: for example, one can place a HSMRF of order 1 prior on N e (t) and a GMRF of order 2 prior on β(t).

To formalize, Bayesian phylodynamic inference under adaptive preferential sampling can be written in the most general form as t|θ, n, s ∼ Coalescent model s|θ, α, n ∼ Poisson process (5)

where ξ is the global smoothing parameter of the MRF on α, ψ is the vector of local shrinkage parameter of the HSMRF prior on α, p 1 and p 2 are the orders of the respective MRFs. We will refer to any combination of priors above as the adaptive preferential model.

Note that the adaptive preferential model differs notably from the framework of the ESP estimator by the fact that the parameter vectors θ and α are each dependent, the grid at which they are defined does not depend on t, and these priors favor smooth estimates.

Posterior distributions. Under the assumption that t and s are known, and that we place HSMRF priors on both α and θ, the posterior distribution of model parameters could be readily computed π(α, θ, ψ, τ , γ, ξ|t, s) ∝ L(α, θ|s)L(θ|t)π(θ|τ )π(τ |γ)π(γ|ζ 1 )π(α|ψ)π(ψ|ξ)π(ξ|ζ 2 ),

where L(θ|t) is the discretized coalescent log-likelihood. Under GMRF priors on α and θ, the posterior would be π(α, θ, γ, ξ|t, s) ∝ L(α, θ|s)L(θ|t)π(θ|γ)π(γ|ζ 1 )π(α|ξ)π(ξ|ζ 2 ).

For our analysis, we fixed the pair (g, t), which can be estimated by other methods such as INLA approximates posterior marginals π(γ, ξ|t, s), π(θ i |t, s) for 1 ≤ i ≤ M , and π(α j |t, s)

for 1 ≤ j ≤ M . The posterior marginal distribution of hyperparameters is π(γ, ξ|t, s) ∝ π(γ, ξ, θ, α, t.s) π G (θ, α|γ, ξ, t, s) α=α * (ξ,γ),θ=θ * (ξ,γ)

, where π G (θ, α|γ, ξ, t, s) is the Gaussian approximation of π(θ, α|γ, ξ, t, s) obtained from a Tay-lor expansion around its modes θ * (ξ, γ) and α * (ξ, γ) (modes can be computed through any optimization algorithm, e.g. Newton-Raphson).

The approximation of the marginal distributions of the MRFs π(θ i |t, s) and π(α j |t, s) are

where E G denotes the expected value w.r.t. π G (θ, α|γ, ξ, t, s). Analogously, we can define the same term for π GG (θ −i , α|γ, ξ, t, s).

Given π(θ i |γ, ξ, t, s) and π(α i |γ, ξ, t, s), we use π(γ, ξ|t, s) to integrate out γ and ξ and obtained the desired distributions.

We rely on simulations to evaluate the performance of the adaptive preferential sampling (adaPref) method in estimating the effective population size trajectory in the presence of "strong" preferential sampling and under "weak" preferential sampling in which sampling is preferential only during some time periods. We then evaluate the sensitivity of adaPref posterior inference to the choice of the MRF (GMRF vs HSMRF) priors and their orders (1 vs 2). We compare the performance of adaPref to alternative methods with and without preferential sampling. In the supplementary material we study the approximation error incurred using INLA in place of MCMC.

Also, although β(t) is a parameter of no direct scientific interest, we deem to successfully recover all model parameters, and we study how well our model infer β(t).

Simulation setup. For each simulated dataset, we estimated adaPref posteriors using 16 different models with all possible combinations of GMRFs and HSMRFs of orders 1 and 2 per parameter N e (t) and β(t). We compare these models to those obtained by ignoring preferential sampling implemented in the the R packages spmrf (Faulkner et al. 2020 ) and phylodyn (Karcher et al. 2017 ).

We use smprf to estimate the posterior of N e (t), without preferential sampling, with GMRF rely on Stan, in particular the R interface rstan (Team et al. 2018) . For our implementations, we use the same settings used by Faulkner et al. (2020) .

In addition, we use INLA approximations for models that employ GMRF priors. We used R-INLA (Rue et al. 2009 ) for our implementation of adaPref with GMRF order 1 priors (GMRF1) on both N e (t) and β(t). We compare GMRF1 adaPref with the GMRF1 prior with preferential sampling of Karcher et al. (2016) (parPref) , and the the GMRF1 prior without preferential sampling (noPref) (Palacios and Minin 2012) .

For each dataset, we test the performance of all models through a set of commonly used summary statistics. For a regular grid of time points (v i ) i:K , we consider the sum of relative errors:

the 97.5% and 2.5% quantiles of the posterior distribution of N (v i ); lastly, the envelope measure

which measures the proportion of the curve that is covered by the 95% credible region. We fix K = 100, v 1 = 0 and v k = .8 t 2 . We compute the Watanabe Aikake information criteria (WAIC, Watanabe and Opper (2010) ) implemented in the R package loo (Vehtari et al. 2017) , for model comparison across the 20 models computed through rstan.

Data. We simulate genealogies under two population size trajectories: a piece-wise constant and exponential trajectory (CE), and a bottleneck trajectory (B). For the sampling protocols, we simulate sampling trajectories that resemble situations encountered in applications: a sampling protocol proportional to N e (t) (PP), uniform (U), a "lagged" response to changes in N e (t) (LP), and a situation where only some segments of the sampling trajectory are preferential (UP). The combination of the two acronyms will be used in the plots, e.g B-U refers to bottleneck trajectory and uniform sampling. The first rows of Figures 2-3 depict the N e (red) and λ(t) (black) trajectories (up to a scaling constant and in log-scale) of the eight simulation scenarios considered.

Exact specifics of the trajectories used are given in the supplementary material. We simulate both the coalescent times and the sampling times from their corresponding inhomogeneous Poisson processes using the Lewis-Shedler thinning algorithm (Lewis and Shedler 1979, Palacios and Minin 2013) . We consider three sample sizes n in (100, 300, 500) and 50

simulations for each combination of trajectories and sample size. We estimated posteriors with the twenty-three models for each simulated genealogy. The code for reproducing the simulation study is available at https://github.com/lorenzocapp/adapref as a R package.

Results. We first analyze a single simulated genealogy with n = 500 tips from each simulation scenario and approximate posterior marginals with INLA assuming GMRFs of order 1 priors.

The four panels of Figure 2 second row depict the posterior medians and 95% BCI of N e (t) for the constant-exponential trajectory (CE). Our adaPref results are depicted in black and grey scale, the parPref method in red, and the noPref in blue. Each column corresponds to each a sampling The adaPref model is more heavily parametrized and one may be lead to think that the performance of the adaPref estimator is affected by the sample size. Figure 4 last row panels depict the boxplots of the three statistics considered, now grouping simulations by sample size. There is no detectable sample size effect: the relative performance of the estimators is roughly similar as n increases. The adaPref estimator is the best performing according to DEV and RWD averaging In the stacked bar plots (bottom two rows), each bar refers to a simulation scenario and each subbar refers to a model (color legend). The sub-bar width represents the percentage of simulated datasets for which the model under consideration is in the top two best performance for the corresponding summary statistics ( highest ENV, lowest RWD, and lowest DEV). In the legend, the first capital letter refers to the type of prior on N e (t), the second one on β(t), with H denoting the HSMR prior, G the GMRF prior; the two numbers refer to the corresponding orders. over all the simulation scenarios jointly.

We now assess the sensitivity of different MRF priors on β(t) and N e (t) parameters in the adaPref model and compare the N e (t) adaPref estimators to the noPref estimators according to ENV, RWD, and DEV. We discuss the results considering the noPref model with HSMRF-1 prior on N e (t), the adapref models with HSMRF-1 prior on N e (t) and HSMRFs of orders 1 and 2, and GMRFs of orders 1 and 2, on β(t). In all cases, we use HMC to estimate the corresponding posterior distribution. Figure 5 top two rows include the boxplots of ENV, RWD, and DEV for the eight simulation scenarios. Each bar in Figure 5 bottom two rows depicts the percentage of simulations each model was one of the top two according to ENV, RWD, and DEV across all simulation scenarios. Each bar refers to a simulation scenario according to one metric. In all plots, we include all three sample sizes considered for each simulation scenario (1500 data sets in total). Figure 5 confirm that modeling preferential sampling leads to better accuracy.

Looking at the bottom two rows, if preferential sampling did not matter, all models would have approximately the same chance of being ranked in the top two (20% each). This is roughly the case for ENV in the CE scenarios. All adaPref models always achieve narrower credible regions (RWD). In terms of DEV, there are a few instances in which ignoring preferential sampling leads to better performance (B-PP, B-LP, B-UP). However, one of the adaPref models (HSMRF-1 prior on both parameters) achieves an identical performance in those scenarios. Figure 5 boxplots also show that the adaPref priors lead to narrower credible regions, and sometimes to smaller mean absolute deviations (DEV).

Although we only show results of noPref with HSMR-1 prior on N e (t) in Figure 5 , we computed the WAIC values for the four MRF priors on N e (t) based on GMRF and HSMRF of orders 1 and 2. The chosen HSMR-1 model achieved the highest WAIC more frequently across the 3000 datasets generated (50 runs, three sample sizes, and four sampling rates for each N e (t) trajectory).

SARS-CoV-2 is the virus responsible for the coronavirus disease pandemic in 2019-2020. Molecular surveillance of SARS-CoV-2 complements traditional surveillance methods based on case count data and provides a unique opportunity to retrospectively learn past disease dynamics.

Here, we use the adaPref model for estimating the viral genetic diversity trajectory N e (t), from currently available viral molecular sequences in GISAID obtained from infected individuals. We analyzed viral whole-genome sequences collected in California in Santa Clara (S.C.) and Los Angeles (L.A.) counties and made publicly available in the GISAID EpiCov database (Shu and McCauley 2017). The GISAID reference numbers of the sequences included in this study, together with data access acknowledgments, are included in the supplementary material.

We downloaded all molecular sequences available on June 27, 2020. The datasets consist of 195 and 134 sequences from S.C and L.A. counties respectively, with collection dates ranging from mid-February, 2020 to April 13 of 2020. We included only high coverage sequences with more than 25000 base pairs. Sampling frequency information is depicted in the first and third parameters are: 20 × 10 6 iterations, thinning every 1000 and burnin of 10 × 10 6 iterations. We selected the following priors: Extended Bayesian Skyline prior on N e (t) (Heled and Drummond 2008 ), HKY mutation model with empirically estimated base frequencies (Hasegawa et al. 1985) , and uniform prior on the mutation rate with support constrained between 9 × 10 − The average width of the credible regions (RW = 1

differ across methods and datasets. In S.C. county, RW is 8.5 for noPref, 6.6 for parPref, and 4.7 for adaPref inferences. In L.A. county, the RW is 23.9 for noPref, 13.6 for parPref , and 20.8 for adaPref inferences. We get a general confirmation that preferential sampling estimators lead to narrower credible regions.

A final remark. The estimates of N e (t) presented here are representative of genetic diversity over time and do not directly translate to the number of infections. The coalescent we employed ignores recombination, population structure, and selection, which are all assumptions commonly violated in viruses (Rambaut et al. 2008) . As more scientific knowledge on this virus is produced, the validity of our model assumptions to SARS-CoV-2 will be the subject of further research.

Also, we note that observed nucleotide substitutions may be caused by sequencing errors and these are being ignored in our study.

We have introduced an adaptive preferential sampling model to estimate the effective population size N e (t) of a coalescent process accounting for a situation in which sampling dates are stochastically dependent on the effective population size. We model sampling dates as an inhomogeneous Poisson process with rate β(t)N e (t), where β(t) is a time-varying coefficient that modulates how this dependence varies over time. We assume that both N e (t) and β(t) are continuous functions and model them in a Bayesian framework with Markov random field priors. This methodology allows us to account for preferential sampling while making minimal assumptions on the dependence between the sampling process and the genealogical process. We term the model proposed adaptive preferential sampling.

The adaptive preferential sampling model allows for a situation in which the sampling protocol changes over time but no detailed knowledge on the way samples are collected is available. In particular, the local adaptivity of the Horseshoe Markov random field prior allows also to model abrupt changes in the sampling protocol.

We show through simulation studies that the estimates obtained through the adaptive preferential sampling are more accurate than some of the available alternatives, leading to smaller absolute deviations from the true trajectories and narrower credible regions. The performance is competitive also in a broad set of scenarios in which the parametric assumptions of the alternative methods are met. We provide an application to SARS-CoV-2 monitoring and show the "adaptive nature" of our methodology: in one scenario the estimate was comparable to that of the model without preferential sampling. In a second one, the estimate was matched that of the parametric preferential sampling methodology.

The most direct extension for future work is to include genealogical uncertainty, which is being ignored in the present work. While Kingman heterochronous n-coalescent is the standard coalescent model choice to include genealogical uncertainty, recent works have proposed to infer N e (t) employing lower resolution coalescent models (Sainudiin et al. 2015 , Cappello et al. 2020 ).

Our adaptive preferential sampling framework can be paired with any of the ancestral processes.

Another natural extension to the proposed adaptive preferential framework is to incorporate covariates into λ(t), the sampling rate, as it is done in Karcher et al. (2020) , to include auxiliary information about the sampling protocols available to the modeler. For example, it is easy to imagine that one may have direct control over the sampling protocol. The resulting rate of the sampling process would be λ(t) = β(t)N e (t) + β X(t), where X is a vector of covariates and β the corresponding linear coefficients.

In this paper, we model jointly the coalescent process and a sampling process depending on N e (t). An interesting direction of future work is to model jointly the coalescent process with other processes that depend on N e (t), such as the total number of infected individuals in an epidemic (Volz et al. 2009 ). The adaptive framework introduced in this paper seems to be suitable to such an extension, given that we make limited assumptions on the dependence between the two processes.

Observed preferential sampling of SARS-CoV-2 reported in GISAID. Figure 6 depicts the histogram of the sampling dates of the SARS-CoV-2 USA sequences available on GISAID as of July 20, 2020. The red line depicts the daily number of new cases in USA (data downloaded from https://github.com/nytimes/covid-19-data).

Simulation Details We used simulations to assess the performance of the adaPref priors. We provide here details of the eight simulation trajectories considered (CE and B). In CE scenarios, N e (t) is equal to 3 for t ≤ .15, 3 exp(3 − 20t) for .15 < t ≤ .3, and .15 otherwise; in B scenarios, N e (t) is equal to .6 for t ≤ .2, .005 for .2 < t ≤ .5, and .3 otherwise. The sampling rate λ(t) changes for each simulation scenario. There is a constant of proportionality that will make it change also as a function of the sample size (n = 100, 300, 500) in order to ensure that the maximum sampling times is approximately the same.

In CE-PP λ(t) = c n N e (t) with c n=100 = 180, c n=300 = 500,c n=500 = 830; in CE-U λ(t) = c n with c n=100 = 25, c n=300 = 65,c n=500 = 120; in CE-LP λ(t) = c n 10 exp(1.6 − 20t) for t ≤ .23, and c n 0.5 otherwise, with c n=100 = 45, c n=300 = 140,c n=500 = 210; in CE-UP λ(t) = c n 10 exp(3 − 20t) for t ≤ .3, and c n 0.5 otherwise, with c n=100 = 14, In B-PP λ(t) = c n N e (t) with c n=100 = 520, c n=300 = 1700,c n=500 = 3000; in B-U λ(t) = c n with c n=100 = 15, c n=300 = 40,c n=500 = 70; in B-LP λ(t) = c n 4 for t ≤ .1, c n 2 for .1 < t ≤ .4, and c n 4 otherwise, with c n=100 = 40, c n=300 = 130,c n=500 = 200; in B-UP λ(t) = c n for t ≤ .5, and c n 4 otherwise, with c n=100 = 150, c n=300 = 210,c n=500 = 250. Each box refers to one method (color in the legend) and depicts the distribution of the 150 estimated statistics for each simulation trajectory (50 datasets for each sample size, 150 in total) based on β(t) posterior: ENV, first column; RWD, second column; DEV, third column. In the legend, the first capital letter refers to the type of prior on N e (t), the second one on β(t), with H denoting the horseshoe prior, G the Gaussian prior; the two numbers refer to the corresponding orders.

Accuracy of adaPref priors in recovering β(t). We study the ability of the adaPref prior to infer the sampling intensity β(t), which we recall to be λ(t)/N e (t). Figure 8 depicts the posterior medians and 95% BCI of β(t). It shows that β(t) is well recovered in all the scenarios. Figure 9 includes the boxplots of the performance of the 16 adaPref priors grouped by simulation scenarios. The general conclusion is that empirical accuracy is largely driven by the prior on N e (t): in CE, models with second-order priors on N e (t) (both Gaussian and Horseshoe) achieve lower RWD and DEV (boxes in shades of green and purple); in B, models with HSMRF-1 on N e (t) generally achieve the best performance across the three metrics. We believe that this is a positive feature of the adaPref models because the choice of the prior can be largely based on the prior on N e (t) which is the actual parameter of interest.

INLA posterior approximation study We continue the analysis of the simulation study dis- by simulation study. In the first two rows, each box refers to one method (color in the legend) and depicts the distribution of the 150 estimated statistics for each simulation trajectory (50 datasets for each sample size, 150 in total) based on N e (t) posterior: ENV, first column; RWD, second column; DEV, third column. In the third row, the grouping is done according to the sample size: each box is based on 400 datasets(50 datasets for each of the eight trajectories). In the legend, HMC N 1 N 1 and INLA adaPref are our methods, INLA noPref is the method in Palacios and Minin (2012) , HMC N 1 is the method of Faulkner et al. (2020). cussed in Section 4. Here, we study how well INLA (Rue et al. 2009 ) approximates the posterior marginals of N e (t) across the different simulation scenarios considered and across different sample sizes (100, 300, 500). In this study we consider the HMC approximated posteriors as the "ground truth" and see how the INLA approximations perform. We compare the posteriors approximated with INLA to the ones approximated with HMC of the adaptive preferential sampling model with GMRF-1 prios on both β(t) and N e (t). We do an identical comparison for the no preferential model with GMRF-1 prior on N e (t).

We check whether the accuracy of INLA approximated posteriors differ from that of HMC approximated posteriors. Once more, we employ ENV, RWD, and DEV to study accuracy. 

Beast 2.5: An advanced software platform for Bayesian evolutionary analysis

The Tajima heterochronous n-coalescent: inference from heterochronously sampled molecular data

Stan: A probabilistic programming language

The horseshoe estimator for sparse signals

Geostatistical inference under preferential sampling

Bayesian coalescent inference of past population dynamics from molecular sequences

Zika virus in the Americas: early epidemiological and genetic findings

Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories

Coalescent approaches to HIV population genetics, in 'The Evolution of HIV

Viral phylodynamics and the search for an effective number of infections

Unifying the epidemiological and evolutionary dynamics of pathogens

Nextstrain: real-time tracking of pathogen evolution

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA

Bayesian inference of population size history from multiple loci

Gene genealogies and the coalescent process

Mobile elements reveal small population size in the ancient ancestors of homo sapiens

Quantifying and mitigating the effect of preferential sampling on phylodynamic inference

phylodyn: an R package for phylodynamic simulation and inference

Estimating effective population size changes from preferentially sampled genetic sequences

On the genealogy of large populations

The coalescent

Simulation of nonhomogeneous Poisson processes by thinning

Species-specific responses of late quaternary megafauna to climate and humans

Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics

Integrated nested Laplace approximation for Bayesian nonparametric phylodynamics

Gaussian process-based Bayesian nonparametric inference of population size trajectories from gene genealogies

Jointly inferring the dynamics of population size and sampling intensity from molecular sequences

An integrated framework for the inference of viral population history from reconstructed genealogies

The genomic and epidemiological dynamics of human influenza A virus

Gaussian Markov random fields: theory and applications

Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations

Finding the best resolution for the Kingman-Tajima coalescent: theory and applications

Rise and fall of the Beringian steppe bison

GISAID: Global initiative on sharing all influenza data-from vision to reality

Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations

Sampling-through-time in birth-death trees

Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV)

Exploring the demographic history of dna sequences using the generalized skyline plot

Rstan: the R interface to stan

Practical Bayesian model evaluation using leaveone-out cross-validation and WAIC

Sampling through time and phylodynamic inference with coalescent and birth-death models

Phylodynamics of infectious disease epidemics

Coalescent theory: an introduction

Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory

A new coronavirus associated with human respiratory disease in China

EP I ISL 4756652020−04−06), (EP I ISL 4757072020− 03−27), (EP I ISL 4756772020−03−28), (EP I ISL 4755882020−04−02), (EP I ISL 4756382020− 03−25), (EP I ISL 4756242020−03−28), (EP I ISL 4755932020−04−04), (EP I ISL 4757102020− 04−11), (EP I ISL 4756302020−03−25), (EP I ISL 4756592020−04−09), (EP I ISL 4756712020− 04−01), (EP I ISL 4756692020−04−01), (EP I ISL 4757022020−04−04), (EP I ISL 4756352020− 03−25), (EP I ISL 4757052020−03−26), (EP I ISL 4756782020−03−28), (EP I ISL 4755842020− 03−22), (EP I ISL 4756642020−04−06), (EP I ISL 4756132020−03−26), (EP I ISL 4756682020− 04−07), (EP I ISL 4755902020−04−02), (EP I ISL 4756232020−04−08), (EP I ISL 4755802020− 04−02), (EP I ISL 4756572020−03−31), (EP I ISL 4755812020−04−02), (EP I ISL 4755962020− 03−28), (EP I ISL 4756942020−03−26), (EP I ISL 4757112020−04−11), (EP I ISL 4756532020− 03−30), (EP I ISL 4756332020−03−25), (EP I ISL 4757132020−04−02), (EP I ISL 4756892020− 03−26

EP I ISL 4356702020−03−19), (EP I ISL 4767802020−03−15), (EP I ISL 4370462020− 04−04), (EP I ISL 4767682020−03−25), (EP I ISL 4366592020−04−10), (EP I ISL 4504462020− 03−28), (EP I ISL 4366422020−04−01), (EP I ISL 4356572020−03−14), (EP I ISL 4504672020− 03−23), (EP I ISL 4546862020−04−12), (EP I ISL 4504452020−03−29), (EP I ISL 4546672020− 04−12), (EP I ISL 4356362020−03−11), (EP I ISL 4546792020−04−12), (EP I ISL 4546732020− 04−12), (EP I ISL 4366442020−04−02), (EP I ISL 4355802020−02−26), (EP I ISL 4356082020− 03−06), (EP I ISL 4767822020−03−13), (EP I ISL 4370492020−04−05), (EP I ISL 4504752020− 03−19), (EP I ISL 4767932020−03−18), (EP I ISL 4370642020−04−12), (EP I ISL 4355982020− 03−04), (EP I ISL 4504642020−03−23), (EP I ISL 4366772020−04−02), (EP I ISL 4356092020− 03−06), (EP I ISL 4355812020−02−28), (EP I ISL 4504742020−03−17), (EP I ISL 4356412020− 03−12), (EP I ISL 4504492020−03−25), (EP I ISL 4504662020−03−23), (EP I ISL 4504652020− 03−23), (EP I ISL 4356472020−03−13), (EP I ISL 4370522020−04−06