Submitted 18 July 2019 Accepted 7 November 2019 Published 6 January 2020 Corresponding author Jarrett D. Phillips, jphill01@uoguelph.ca Academic editor Gang Mei Additional Information and Declarations can be found on page 31 DOI 10.7717/peerj-cs.243 Copyright 2020 Phillips et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves Jarrett D. Phillips1, Steven H. French1, Robert H. Hanner2 and Daniel J. Gillis1 1 School of Computer Science, University of Guelph, Guelph, Ontario, Canada 2 Department of Integrative Biology, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, Canada ABSTRACT Assessing levels of standing genetic variation within species requires a robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such estimates are needed because most species are currently represented by only one or a few sequences in existing databases, which can safely be assumed to be undersampled. Unfortunately, sample sizes of 5–10 specimens per species typically seen in DNA barcoding studies are often insufficient to adequately capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator) that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform haplotype and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime is heavily dependent on initial sample sizes and skewness of the corresponding haplotype frequency distribution. Subjects Bioinformatics, Computational Biology, Data Science, Optimization Theory and Computation, Scientific Computing and Simulation Keywords Algorithm, DNA barcoding, Extrapolation, Iterative method, Sampling sufficiency, Species INTRODUCTION Background Earth is in the midst of its sixth mass extinction event and global biodiversity is declining at an unprecedented rate (Ceballos et al., 2015). It is therefore important that species genetic diversity be catalogued and preserved. One solution to address this mounting crisis in a How to cite this article Phillips JD, French SH, Hanner RH, Gillis DJ. 2020. HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves. PeerJ Comput. Sci. 6:e243 http://doi.org/10.7717/peerj-cs.243 https://peerj.com/computer-science mailto:jphill01@uoguelph.ca https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.243 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://doi.org/10.7717/peerj-cs.243 systematic, yet rapid way is DNA barcoding (Hebert et al., 2003). DNA barcoding relies on variability within a small gene fragment from standardized regions of the genome to identify species, based on the fact that most species exhibit a unique array of barcode haplotypes that are more similar to each other than those of other species (e.g., a barcode ‘‘gap’’). In animals, the DNA barcode region corresponds to a 648 bp fragment of the 5′ terminus of the cytochrome c oxidase subunit I (COI) mitochondrial marker (Hebert et al., 2003; Hebert, Ratnasingham & De Waard, 2003). A critical problem since the inception of DNA barcoding involves determining appropriate sample sizes necessary to capture the majority of existing intraspecific haplotype variation for major animal taxa (Hebert et al., 2004; Meyer & Paulay, 2005; Ward et al., 2005). Taxon sample sizes currently employed in practice for rapid assignment of a species name to a specimen, have ranged anywhere from 1–15 specimens per species (Matz & Nielsen, 2005; Ross, Murugan & Li, 2008; Goodall- Copestake, Tarling & Murphy, 2012; Jin, He & Zhang, 2012; Yao et al., 2017); however, oftentimes only 1–2 individuals are actually collected. This trend is clearly reflected within the Barcode of Life Data Systems (http://www.boldsystems.org) (Ratnasingham & Hebert, 2007), where an overwhelming number of taxa have only a single record and sequence. A fitting comparison to the issue of adequacy of specimen sample sizes can be made to the challenge of determining suitable taxon distance thresholds for species separation on the basis of the DNA barcode gap (Meyer & Paulay, 2005). It has been widely demonstrated that certain taxonomic groups, such as Lepidoptera (butterflies/moths), are able to be readily separated into distinct clusters largely reflective of species boundaries derived using morphology (Čandek & Kuntner, 2015). However, adoption of a fixed limit of 2% difference between maximum intraspecific distance and minimum interspecific (i.e., nearest-neighbour) divergence is infeasible across all taxa (Hebert, Ratnasingham & De Waard, 2003; Collins & Cruickshank, 2013). Species divergence thresholds should be calculated from available sequence data obtained through deep sampling of taxa across their entire geographic ranges whenever possible (Young et al., 2017). There is a clear relationship between specimen sample sizes and observed barcoding gaps: sampling too few individuals can give the impression of taxon separation, when in fact none exists (Meyer & Paulay, 2005; Hickerson, Meyer & Moritz, 2006; Wiemers & Fiedler, 2007; Dasmahapatra et al., 2010; Čandek & Kuntner, 2015), inevitably leading to erroneous conclusions (Collins & Cruickshank, 2013). It is thus imperative that barcode gap analyses be based on adequate sample sizes to minimize the presence of false positives. Introducing greater statistical rigour into DNA barcoding appears to be the clear way forward in this respect (Nielsen & Matz, 2006; Čandek & Kuntner, 2015; Luo et al., 2015; Phillips, Gillis & Hanner, 2019). The introduction of computational approaches for automated species delimitation such as Generalized Mixed Yule Coalescent (GMYC) (Pons et al., 2006; Monaghan et al., 2009; Fujisawa & Barraclough, 2013), Automatic Barcode Gap Discovery (ABGD) (Puillandre, Lambert & Brouillet, 2011) and Poisson Tree Processes (PTP) (Zhang et al., 2013) has greatly contributed to this endeavour in the form of web servers (GMYC, ABGD, PTP) and R packages (GMYC: Species’ LImits by Threshold Statistics, splits (Ezard, Fujisawa & Barraclough, 2017)). Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 2/37 https://peerj.com http://www.boldsystems.org http://dx.doi.org/10.7717/peerj-cs.243 Various statistical resampling and population genetic methods, in particular coalescent simulations, for the estimation of sample sizes, have been applied to Lepidoptera (Costa Rican skipper butterflies (Astraptes fulgerator)) (Zhang et al., 2010) and European diving beetles (Agabus bipustulatus) (Bergsten et al., 2012). Using Wright’s equilibrium island model (Wright, 1951) and Kimura’s stepping stone model (Kimura & Weiss, 1964) under varying effective population sizes and migration rates, Zhang et al. (2010) found that between 156-1985 specimens per species were necessary to observe 95% of all estimated COI variation for simulated specimens of A. fulgerator. Conversely, real species data showed that a sample size of 250-1188 individuals is probably needed to capture the majority of COI haplotype variation existing for this species (Zhang et al., 2010). A subsequent investigation carried out by Bergsten et al. (2012) found that a random sample of 250 individuals was required to uncover 95% COI diversity in A. bipustulatus; whereas, a much smaller sample size of 70 specimens was necessary when geographic separation between two randomly selected individuals was maximized. Others have employed more general statistical approaches. Based on extensive simulation experiments, through employing the Central Limit Theorem (CLT), Luo et al. (2015) suggested that no fewer than 20 individuals per species be sampled. Conversely, using an estimator of sample size based on the Method of Moments, an approach to parameter estimation relying on the Weak Law of Large Numbers (Pearson, 1894), sample sizes ranging from 150–5,400 individuals across 18 species of ray-finned fishes (Chordata: Actinopterygii) were found by Phillips et al. (2015). Haplotype accumulation curves paint a picture of observed standing genetic variation that exists at the species level as a function of expended sampling effort (Phillips et al., 2015; Phillips, Gillis & Hanner, 2019). Haplotype sampling completeness can then be gauged through measuring the slope of the curve, which gives an indication of the number of new haplotypes likely to be uncovered with additional specimens collected. For instance, a haplotype accumulation curve for a hypothetical species having a slope of 0.01 suggests that only one previously unseen haplotype will be captured for every 100 individuals found. This is strong evidence that the haplotype diversity for this species has been adequately sampled. Thus, further recovery of specimens of such species provide limited returns on the time and money invested to sequence them. Trends observed from generated haplotype accumulation curves for the 18 actinopterygian species assessed by Phillips et al. (2015), which were far from reaching an asymptote, corroborated the finding that the majority of intraspecific haplotypes remain largely unsampled in Actinopterygii for even the best-represented species in BOLD. Estimates obtained from each of these studies stand in sharp contrast to sample sizes typically reported within DNA barcoding studies. Numerical optimization methods are required to obtain reasonable approximations to otherwise complex questions. Many such problems proceed via the iterative method, whereby an initial guess is used to produce a sequence of successively more precise (and hopefully more accurate) approximations. Such an approach is attractive, as resulting solutions can be made as precise as desired through specifying a given tolerance cutoff. However, in such cases, a closed-form expression for the function being optimized is known a priori. In many instances, the general path (behaviour) of the search space being Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 3/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 explored is the only information known, and not its underlying functional form. In this paper, we take a middle-ground approach that is an alternative to probing sampling completeness on the basis of haplotype accumulation curve slope measurement. To this end, iteration is applied to address the issue of relative sample size determination for DNA barcode haplotype sampling completeness, a technique suggested by Phillips, Gillis & Hanner (2019). Given that specimen collection and processing is quite a laborious and costly endeavour (Cameron, Rubinoff & Will, 2006; Stein et al., 2014), the next most direct solution to an otherwise blind search strategy is to employ computational simulation that approximates specimen collection in the field. The main contribution of this work is the introduction of a new, easy-to-use R package implementing a novel statistical optimization algorithm to estimate sample sizes for assessment of genetic diversity within species based on saturation observed in haplotype accumulation curves. Here, we present a novel nonparametric stochastic (Monte Carlo) iterative extrapolation algorithm for the generation of haplotype accumulation curves based on the approach of Phillips et al. (2015). Using the statistical environment R (R Core Team, 2018), we examine the effect of altering species haplotype frequencies on the shape of resulting curves to inform on likely required sample sizes needed for adequate capture of within-species haplotype variation. Proof-of-concept of our method is illustrated through both hypothetical examples and real DNA sequence data. Motivation Consider N DNA sequences that are randomly sampled for a given species of interest across its known geographic range, each of which correspond to a single specimen. Suppose further that H* of such sampled DNA sequences are unique (i.e., are distinct haplotypes). This scenario leads naturally to the following question: What is N*, the estimated total number of DNA sequence haplotypes that exist for a species θ? Put another way, what sample size (number of specimens) is needed to capture the existing haplotype variation for a species? The naïve approach (adopted by Phillips et al. (2015)) would be to ignore relative frequencies of observed haplotypes; that is, assume that species haplotypes are equally probable in a species population. Thus, in the absence of any information, the best one can do is adopt a uniform distribution for the number of sampled haplotypes. Such a path leads to obtaining gross overestimates for sufficient sampling (Phillips et al., 2015). A much better approach uses all available haplotype data to arrive at plausible estimates of required taxon sample sizes. This latter method is explored here in detail. METHODS Haplotype accumulation curve simulation algorithm Algorithm functions Our algorithm, HACSim (short for Haplotype Accumulation Curve Simulator), consisting of two user-defined R functions, HAC.sim() and HAC.simrep(), was created to run simulations of haplotype accumulation curves based on user-supplied parameters. The simulation treats species haplotypes as distinct character labels relative to the number of individuals possessing a given haplotype. The usual convention in this regard is that Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 4/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 1 Modified haplotype network from Phillips, Gillis & Hanner (2019). Haplotypes are labelled according to their absolute frequencies such that the most frequent haplotype is labelled ‘‘1’’, the second- most frequent haplotype is labelled ‘‘2’’, etc., and is meant to illustrate that much species locus variation consists of rare haplotypes at very low frequency (typically only represented by 1 or 2 specimens). Thus, species showing such patterns in their haplotype distributions are probably grossly under-respresented in public sequence databases like BOLD and GenBank. Full-size DOI: 10.7717/peerjcs.243/fig-1 Haplotype 1 is the most frequent, Haplotype 2 is the next most frequent, etc. (Gwiazdowski et al., 2013). A haplotype network represents this scheme succinctly (Fig. 1). Such an implementation closely mimics that seen in natural species populations, as each character label functions as a unique haplotype linked to a unique DNA barcode sequence. The algorithm then randomly samples species haplotype labels in an iterative fashion with replacement until all unique haplotypes have been observed. This process continues until all species haplotypes have been sampled. The idea is that levels of species haplotypic variation that are currently catalogued in BOLD can serve as proxies for total haplotype diversity that may exist for a given species. This is a reasonable assumption given that, while estimators of expected diversity are known (e.g., Chao1 abundance) (Chao, 1984), the frequencies of unseen haplotypes are not known a priori. Further, assuming a species is sampled across its entire geographic range, haplotypes not yet encountered are presumed to occur at low frequencies (otherwise they would likely have already been sampled). Because R is an interpreted programming language (i.e., code is run line-by-line), it is slow compared to faster alternatives which use compilation to convert programs into machine-readable format; as such, to optimize performance of the present algorithm in terms of runtime, computationally-intensive parts of the simulation code were written in the C++ programming language and integrated with R via the packages Rcpp (Eddelbuettel & François, 2011) and RcppArmadillo (Eddelbuettel & Sanderson, 2014). This includes Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 5/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-1 http://dx.doi.org/10.7717/peerj-cs.243 function code to carry out haplotype accumulation (via the function accumulate(), which is not directly called by the user). A further reason for turning to C++ is because some R code (e.g., nested ‘for’ loops) is not easily vectorized, nor can parallelization be employed for speed improvement due to loop dependence. The rationale for employing R for the present work is clear: R is free, open-source software that it is gaining widespread use within the DNA barcoding community due to its ease-of-use and well-established user-contributed package repository (Comprehensive R Archive Network (CRAN)). As such, the creation and disemination of HACSim as a R framework to assess levels of standing genetic variation within species is greatly facilitated. A similar approach to the novel one proposed here to automatically generate haplotype accumulation curves from DNA sequence data is implemented in the R package spider (SPecies IDentity and Evolution in R; (Brown et al., 2012)) using the haploAccum() function. However, the approach, which formed the basis of earlier work carried out by Phillips et al. (2015), is quite restrictive in its functionality and, to our knowledge, is currently the only method available to generate haplotype accumulation curves in R because spider generates haplotype accumulation curves from DNA sequence alignments only and is not amenable to inclusion of numeric inputs for specimen and haplotype numbers. Thus, the method could not be easily extended to address our question. This was the primary reason for the proposal of a statistical model of sampling sufficiency by Phillips et al. (2015) and its extension described herein. Algorithm parameters At present, the algorithm (consisting of HAC.sim() and HAC.simrep()) takes 13 arguments as input (Table 1). A user must first specify the number of observed specimens/DNA sequences (N) and the number of observed haplotypes (i.e., unique DNA sequences) (H*) for a given species. Both N and H* must be greater than one. Clearly, N must be greater than or equal to H*. Next, the haplotype frequency distribution vector must be specified. The probs argument allows for the inclusion of both common and rare species haplotypes according to user interest (e.g., equally frequent haplotypes, or a single dominant haplotype). The resulting probs vector must have a length equal to H*. For example, if H*=4, probs must contain four elements. The total probability of all unique haplotypes must sum to one. The user can optionally input the fraction of observed haplotypes to capture p. By default, p=0.95, mirroring the approach taken by both Zhang et al. (2010) and Bergsten et al. (2012) who computed intraspecific sample sizes needed to recover 95% of all haplotype variation for a species. At this level, the generated haplotype accumulation curve reaches a slope close to zero and further sampling effort is unlikely to uncover any new haplotypes. However, a user may wish to obtain sample sizes corresponding to different haplotype recovery levels, e.g., p=0.99 (99% of all estimated haplotypes found). In the latter scenario, it can be argued that 100% of species haplotype variation is never actually achieved, since with greater sampling effort, additional haplotypes are almost surely to be found; thus, a true asymptote is never reached. In any case, simulation completion times will vary Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 6/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Table 1 Parameters inputted (first 7) and outputted (last six) by HAC.sim() and HAC.simrep(), along with their definitions. Range refers to plausible values that each parameter can assume within the hap- lotype accumulation curve simulation algorithm. [ and ] indicate that a given value is included in the range interval; whereas, ( and ) indicate that a given value is excluded from the range interval. Simulation progress can be tracked through setting progress = TRUE within HACHypothetical() or HACReal(). Users can optionally specify that a file be created containing all information outputted to the R console (via the argument filename, which can be named as the user wishes). Parameter Definition Range N total number of specimens/DNA sequences (1,∞) H* total number of unique haplotypes (1, N] probs haplotype probability distribution vector (0, 1) p proportion of haplotypes to recover (0, 1] perms total number of permutations (1,∞) input.seqs analyze FASTA file of species DNA sequences TRUE, FALSE conf.level desired confidence level for confidence interval calculation (0, 1) H cumulative mean number of haplotypes sampled [1, H∗] H∗−H cumulative mean number of haplotypes not sampled [0, H∗) R= HH∗ cumulative mean fraction of haplotypes sampled (0, 1] H∗−H H∗ cumulative mean fraction of haplotypes not sampled [0, 1) N∗ mean specimen sample size corresponding to H∗ [ N ,∞) N∗−N mean number of individuals not sampled [0, N] depending on inputted parameter values, such as probs, which controls the skewness of the observed haplotype frequency distribution. The perms argument is in place to ensure that haplotype accumulation curves ‘‘smooth out’’ and tend to H* asymptotically as the number of permutations (replications) is increased. The effect of increasing the number of permutations is an increase in statistical accuracy and consequently, a reduction in variance. The proposed simulation algorithm outputs a mean haplotype accumulation curve that is the average of perms generated haplotype accumulation curves, where the order of individuals that are sampled is randomized. Each of these perms curves is a randomized step function (a sort of random walk), generated according to the number of haplotypes found. A permutation size of 1,000 was used by Phillips et al. (2015) because smaller permutation sizes yielded non-smooth (noisy) curves. Permutation sizes larger than 1,000 typically resulted in greater computation time, with no noticeable change in accumulation curve behaviour (Phillips et al., 2015). By default, perms = 10,000 (in contrast to Phillips et al. (2015)), which is comparable to the large number of replicates typically employed in statistical bootstrapping procedures needed to ensure accuracy of computed estimates (Efron, 1979). Sometimes it will be necessary for users to sacrifice accuracy for speed in the presence of time constraints. This can be accomplished through decreasing perms. Doing so however will result in only near-optimal solutions for specimen sample sizes. In some cases, it may be necessary to increase perms to further smooth out the curves (to ensure monotonicity), but this will increase algorithm runtime substantially. Should a user wish to analyze their own intraspecific COI DNA barcode sequence data (or sequence data from any single locus for that matter), setting input.seqs = Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 7/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 TRUE allows this (via the read.dna() function in ape). In such a case, a pop-up file window will prompt the user to select the formatted FASTA file of aligned/trimmed sequences to be read into R. When this occurs, arguments for N , H* and probs are set automatically by the algorithm via functions available in the R packages ape (Analysis of Phylogenetics and Evolution) (Paradis, Claude & Strimmer, 2004) and pegas (Population and Evolutionary Genetics Analysis System) (Paradis, 2010). Users must be aware however that the number of observed haplotypes treated by pegas (via the haplotype() function) may be overestimated if missing/ambiguous nucleotide data are present within the final alignment input file. Missing data are explicitly handled by the base.freq() function in the ape package. When this occurs, R will output a warning that such data are present within the alignment. Users should therefore consider removing sequences or sites comprising missing/ambiguous nucleotides. This step can be accomplished using external software such as MEGA (Molecular Evolutionary Genetics Analysis; (Kumar, Stecher & Tamura, 2016)). The BARCODE standard (Hanner, 2009) was developed to help identify high quality sequences and can be used as a quality filter if desired. Exclusion of low-quality sequences also has the advantage of speeding up compution time of the algorithm significantly. Options for confidence interval (CI) estimation and graphical display of haplotype accumulation is also available via the argument conf.level, which allows the user to specify the desired level of statistical confidence. CIs are computed from the sample α 2 100% and (1− α 2 )100% quantiles of the haplotype accumulation curve distribution. The default is conf.level = 0.95, corresponding to a confidence level of 95%. High levels of statistical confidence (e.g., 99%) will result in wider confidence intervals; whereas low confidence leads to narrower interval estimates. How does HACSim work? Haplotype labels are first randomly placed on a two-dimensional spatial grid of size perms × N (read perms rows by N columns) according to their overall frequency of occurrence (Fig. 2). The cumulative mean number of haplotypes is then computed along each column (i.e., for every specimen). If all H* haplotypes are not observed, then the grid is expanded to a size of perms×N* and the observed haplotypes enumerated. Estimation of specimen sample sizes proceeds iteratively, in which the current value of N* is used as a starting value to the next iteration (Fig. 2). An analogy here can be made to a game of golf: as one aims towards the hole and hits the ball, it gets closer and closer to the hole; however, one does not know the number of times to hit the ball before it lands in the hole. It is important to note that since sample sizes must be whole values, estimates of N* found at each iteration are rounded up to the next whole number. Even though this approach is quite conservative, it ensures that estimates are adequately reflective of the population from which they were drawn. HAC.sim(), which is called internally from HAC.simrep(), performs a single iteration of haplotype accumulation for a given species. In the case of real species, resulting output reflects current levels of sampling effort found within BOLD (or another similar sequence repository such as GenBank) for a given species. If the desired level of haplotype recovery is not reached, then HAC.simrep() is called to perform successive iterations until Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 8/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 2 Schematic of the HACSim optimization algorithm (setup, initialization and iteration). Shown is a hypothetical example for a species mined from a biological sequence database like BOLD or GenBank with N =5 sampled specimens (DNA sequences) possessing H*=5 unique haplotypes. Each haplotype has an associated numeric ID from 1-H* (here, 1-5). Haplotype labels are randomly assigned to cells on a two-dimensional spatial array (ARRAY) with perms rows and N columns. All haplotypes occur with a frequency of 20%, (i.e., probs= (1/5, 1/5, 1/5, 1/5, 1/5)). Specimen and haplotype information is then fed into a black box to iteratively optimize the likely required sample size (N*) needed to capture a propor- tion of at least p haplotypes observed in the species sample. Full-size DOI: 10.7717/peerjcs.243/fig-2 the observed fraction of haplotypes captured (R) is at least p. This stopping criterion is the termination condition necessary to halt the algorithm as soon as a ‘‘good enough’’ solution has been found. Such criteria are widely employed within numerical analysis. At each step of the algorithm, a dataset, in the form of a dataframe (called ‘‘d’’) consisting of the mean number of haplotypes recovered (called means), along with the estimated standard deviation (sds) and the number of specimens sampled (specs) is generated. The estimated required sample size (N*) to recover a given proportion of observed species haplotypes corresponds to the endpoint of the accumulation curve. An indicator message is additionally outputted informing a user as to whether or not the desired level of haplotype recovery has been reached. The algorithm is depicted in Fig. 3. In Fig. 3, all input parameters are known a priori except Hi, which is the number of haplotypes found at each iteration of the algorithm, and Ri = Hi H∗ , which is the observed fraction of haplotype recovery at iteration i. The equation to compute N* N∗i+1=Ni+ Ni Hi ( H∗−Hi ) = NiH∗ Hi = Ni Ri (1) is quite intuitive since as Hi approaches H*, H∗−Hi approaches zero, Ri= Hi H∗ approaches one, and consequently, Ni approaches N*. In the first part of the above equation, the quantity NiHi (H ∗ −Hi) is the amount by which the haplotype accumulation curve is extrapolated, which incorporates random error and uncertainty regarding the true value of θ in the search space being explored. Nonparametric estimates formed from the above Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 9/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-2 http://dx.doi.org/10.7717/peerj-cs.243 Figure 3 Iterative extrapolation algorithm pseudocode for the computation of taxon sampling suffi- ciency employed within HACSim. A user must input N , H* and probs to run simulations. Other function arguments required by the algorithm have default values and are not necessary to be inputted unless the user wishes to alter set parameters. Full-size DOI: 10.7717/peerjcs.243/fig-3 iterative method produce a convergent monotonically-increasing sequence, which becomes closer and closer to N* as the number of iterations increase; that is, N∗1 ≤N ∗ 2 ≤ ...≤N ∗ i ≤N ∗ i+1→N ∗ (2) which is clearly a desirable property. Since haplotype accumulation curves are bounded below by one and bounded above by H*, then the above sequence has a lower bound equal to the initial guess for specimen sampling sufficiency (N) and an upper bound of N*. Along with the iterated haplotype accumulation curves and haplotype frequency barplots, simulation output consists of the five initially proposed ‘‘measures of sampling closeness’’, the estimate of θ (N*) based on Phillips et al. (2015)’s sampling model, in addition to the number of additional samples needed to recover all estimated total haplotype variation for a given species (N∗−N ; Fig. 4) (Table 1). These five quantities are given as follows: (1) Mean number of haplotypes sampled: Hi, (2) Mean number of haplotypes not sampled: H∗−Hi, (3) Proportion of haplotypes sampled: HiH∗ , (4) Proportion of haplotypes not sampled: H∗−Hi H∗ , (5) Mean number of individuals not sampled: N∗−Ni = Ni Hi (H∗−Hi) and are analogous to absolute and relative approximation error metrics seen in numerical analysis. It should be noted that the mean number of haplotypes captured at each iteration, Hi, will not necessary be increasing, even though estimates of the cumulative mean value of N* are. It is easily seen above that Hi approaches H* with increasing number of iterations. Similarly, as the simulation progresses, H∗−Hi, H∗−Hi H∗ and N ∗ −Ni = Ni Hi (H∗−Hi) all approach zero, while HiH∗ approaches one. The rate at which curves approach H* depends on inputs to both HAC.sim() and HAC.simrep(). Once the algorithm has converged to the desired level of haplotype recovery, a summary of findings is outputted consisting of (1) the initial guess (N) for sampling sufficiency; (2) the total number of iterations until convergence and simulation runtime (in seconds); (3) the final estimate (N*) of sampling sufficiency, along with an approximate (1 −α)100% confidence interval (see next paragraph); and, (4) the Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 10/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-3 http://dx.doi.org/10.7717/peerj-cs.243 Figure 4 Graphical depiction of the iterative extrapolation sampling model as described in detail herein. The figure is modified from Phillips, Gillis & Hanner (2019). The x-axis is meant to depict the number of specimens sampled, whereas the y-axis is meant to convey the cumulative number of unique haplotypes uncovered for every additional individual that is randomly sampled. Ni and Hi refer respectively to specimen and haplotype numbers that are observed at each iteration ( i) of HACSim for a given species. N* is the total sample size that is needed to capture all H* haplotypes that exist for a species. Full-size DOI: 10.7717/peerjcs.243/fig-4 number of additional specimens required to be sampled (N∗−N) from the initial starting value. Iterations are automatically separated by a progress meter for easy visualization. An approximate symmetric (1 − α)100% CI for θ is derived using the (first order) Delta Method (Casella & Berger, 2002). This approach relies on the asymptotic normality result of the CLT and employs a first-order Taylor series expansion around θ to arrive at an approximation of the variance (and corresponding standard error) of N*. Such an approach is convenient since the sampling distribution of N* would likely be difficult to compute exactly due to specimen sample sizes being highly taxon-dependent. An approximate (large sample) (1 −α)100% CI for θ is given by N∗±z1−α2 ( σ̂H H √ N∗ ) (3) where z1−α2 denotes the appropriate critical value from the standard Normal distribution and σ̂H is the estimated standard deviation of the mean number of haplotypes recovered at N*. The interval produced by this approach is quite tight, shrinking as Hi tends to H*. By default, HACSim computes 95% confidence intervals for the abovementioned quantities. It is important to consider how a confidence interval for θ should be interpreted. For instance, a 95% CI for θ of (L, U), where L and U are the lower and upper endpoints of the confidence interval respectively, does not mean that the true sampling sufficiency lies between (L, U) with 95% probability. Instead, resulting confidence intervals for θ are themselves random and should be interpreted in the following way: with repeated sampling, one can be (1 − α)100% confident that the true sampling sufficency for p% haplotype recovery for a given species lies in the range (L, U) (1 − α)100% of the time. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 11/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-4 http://dx.doi.org/10.7717/peerj-cs.243 That is, on average, (1 − α)100% of constructed confidence intervals will contain θ (1 − α)100% of the time. It should be noted however that as given computed confidence intervals are only approximate in the limit, desired nominal probability coverage may not be achieved. In other words, the proportion of times calculated (1 − α)100% intervals actually contain θ may not be met. HACSim has been implemented as an object-oriented framework to improve modularity and overall user-friendliness. Scenarios of hypothetical and real species are contained within helper functions which comprise all information necessary to run simulations successfully without having to specify certain function arguments beforehand. To carry out simulations of sampling haplotypes from hypothetical species, the function HACHypothetical() must first be called. Similarly, haplotype sampling for real species is handled by the function HACReal(). In addition to all input parameters rquired by HAC.sim() and HAC.simrep() outlined in Table 1, both HACHypothetical() and HACReal() take further arguments. Both functions take the optional argument filename which is used to save results outputted to the R console to a CSV file. When either HACHypothetical() or HACReal() is invoked (i.e., assigned to a variable), an object herein called HACSObj is created containing the 13 arguments employed by HACSim in running simulations. Note the generated object can have any name the user desires. Further, all simulation variables are contained in an environment called ‘envr’ that is hidden from the user. RESULTS Here, we outline some simple examples that highlight the overall functionality of HACSim. When the code below is run, outputted results will likely differ from those depicted here since our method is inherently stochastic. Hence, it should be stressed that there is not one single solution for the problem at hand, but rather multiple solutions (Spall, 2012). This is in contrast to a completely deterministic model, where a given input always leads to the same unique output. To ensure reproducibility, the user can set a random seed value using the base R function set.seed() prior to running HAC.simrep(). It is important that a user set a working directory in R prior to running HACSim, which will ensure all created files (‘seqs.fas’ and ‘output.csv’) are stored in a single location for easy access and reference at a later time. In all scenarios, default parameters were unchanged (perms=10,000, p=0.95). Application of HACSim to hypothetical species Equal haplotype frequencies Figure 5 shows sample graphical output of the proposed haplotype accumulation curve simulation algorithm for a hypothetical species with N =100 and H*=10. All haplotypes are assumed to occur with equal frequency (i.e., probs=0.10). Algorithm output is shown below. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 12/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 ## Set parameters for hypothetical species ## > N<-100 # total number of sampled individuals > Hstar<-10 # total number of haplotypes > probs<-rep(1/Hstar, Hstar) # equal haplotype frequency ### Run simulations ### > HACSObj<-HACHypothetical(N = N, Hstar = Hstar, probs = probs) # call helper function # set seed here if desired, e.g., set.seed(12345) > HAC.simrep(HACSObj) Simulating haplotype accumulation... |===============================================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 10 Mean number of haplotypes not sampled: 0 Proportion of haplotypes sampled: 1 Proportion of haplotypes not sampled: 0 Mean value of N*: 100 Mean number of specimens not sampled: 0 Desired level of haplotype recovery has been reached ---------- Finished. ---------- The initial guess for sampling sufficiency was N = 100 individuals The algorithm converged after 1 iterations and took 3.637 s The estimate of sampling sufficiency for p = 95% haplotype recovery is N* = 100 individuals ( 95% CI: 100-100 ) The number of additional specimens required to be sampled for p = 95% haplotype recovery is N* - N = 0 individuals Algorithm output shows that R = 100% of the H* = 10 haplotypes are recovered from the random sampling of N = 100 individuals, with lower and upper 95% confidence limits of 100–100. No additional specimens need to be collected (N∗−N =0). Simulation results, consisting of the six ‘‘measures of sampling closeness’’ computed at each iteration, can be optionally saved in a comma-separated value (CSV) file called ‘output.csv’ (or another filename of the user’s choosing). Figure 5 shows that when haplotypes are equally frequent in species populations, corresponding haplotype accumulation curves reach an asymptote very quickly. As sampling effort is increased, the confidence interval becomes narrower, thereby reflecting one’s increased confidence in having likely sampled the majority of haplotype variation existing for a given species. Expected counts of the number of specimens possessing a given haplotype can be found from running max(envr$d$specs) * envr$probs in the R console once a simulation has converged. However, real data suggest that haplotype frequencies are not equal. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 13/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 5 Graphical output of HAC.sim() for a hypothetical species with equal haplotype frequen- cies. (A) Iterated haplotype accumulation curve. (B) Corresponding haplotype frequency barplot. For the generated haplotype accumulation curve, the 95% confidence interval for the number of unique haplo- types accumulated is depicted by gray error bars. Dashed lines depict the observed number of haplotypes (i.e., RH*) and corresponding number of individuals sampled found at each iteration of the algorithm. The dotted line depicts the expected number of haplotypes for a given haplotype recovery level (here, p= 95%) (i.e., pH*). In this example, R=100% of the H*=10 estimated haplotypes have been recovered for this species based on a sample size of only N =100 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-5 Unequal Haplotype Frequencies Figures 6 and 7 show sample graphical output of the proposed haplotype accumulation curve simulation algorithm for a hypothetical species with N = 100 and H* = 10. All haplotypes occur with unequal frequency. Haplotypes 1–3 each have a frequency of 30%, while the remaining seven haplotypes each occur with a frequency of c. 1.4%. ## Set parameters for hypothetical species ## > N<-100 > Hstar<-10 > probs<-c(rep(0.30, 3), rep(0.10/7, 7)) # three dominant haplotypes each with 30% frequency ### Run simulations ### > HACSObj<-HACHypothetical(N = N, Hstar = Hstar, probs = probs) > HAC.simrep(HACSObj) Simulating haplotype accumulation... |========================================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 8.3291 Mean number of haplotypes not sampled: 1.6709 Proportion of haplotypes sampled: 0.83291 Proportion of haplotypes not sampled: 0.16709 Mean value of N*: 120.061 Mean number of specimens not sampled: 20.06099 Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 14/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-5 http://dx.doi.org/10.7717/peerj-cs.243 Desired level of haplotype recovery has not yet been reached |================================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 9.2999 Mean number of haplotypes not sampled: 0.7001 Proportion of haplotypes sampled: 0.92999 Proportion of haplotypes not sampled: 0.07001 Mean value of N*: 179.5718 Mean number of specimens not sampled: 12.57182 Desired level of haplotype recovery has not yet been reached |==================================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 9.5358 Mean number of haplotypes not sampled: 0.4642 Proportion of haplotypes sampled: 0.95358 Proportion of haplotypes not sampled: 0.04642 Mean value of N*: 188.7623 Mean number of specimens not sampled: 8.762348 Desired level of haplotype recovery has been reached ---------- Finished. ---------- The initial guess for sampling sufficiency was N = 100 individuals The algorithm converged after 6 iterations and took 33.215 s The estimate of sampling sufficiency for p = 95% haplotype recovery is N∗ = 180 individuals ( 95% CI: 178.278-181.722 ) The number of additional specimens required to be sampled for p = 95% haplotype recovery is N* - N = 80 individuals Note that not all iterations are displayed above for the sake of brevity; only the first and last two iterations are given. With an initial guess of N = 100, only R=83.3% of all H* = 10 observed haplotypes are recovered. The value of N* = 121 in the first iteration above serves as an improved initial guess of the true sampling sufficiency, which is an unknown quantity that is being estimated. This value is then fed back into the algorithm and the process is repeated until convergence is reached. UsingEq. (1),theimprovedsamplesizewascalculatedasN∗=100+ 1008.3291 (10−8.3291)= 120.061. After one iteration, the curve has been extrapolated by an additional N∗−Ni=20.06099 individuals. Upon convergence, R=95.4% of all observed haplotypes are captured with a sample size of N*=180 specimens, with a 95% CI of 178.278–181.722. Given that N =100 individuals have already been sampled, the number of additional specimens required is N∗−N =80 individuals. The user can verify that sample sizes close to that found by HACSim are needed to capture 95% of existing haplotype variation. Simply set N = N* = 180 and rerun the algorithm. The last iteration serves as a check to verify that the desired level of haplotype recovery has been achieved. The value of N*=188.7623 that is outputted at this step can be used as a good starting guess to extrapolate the curve Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 15/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 6 Initial graphical output of HAC.sim() for a hypothetical species having three dominant hap- lotypes. (A) Specimens sampled; (B) Unique haplotypes. In this example, initially, only R = 83.3% of the H*=10 estimated haplotypes have been recovered for this species based on a sample size of N =100 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-6 Figure 7 Final graphical output of HAC.sim() for a hypothetical species having three dominant haplo- types. (A) Specimens sampled; (B) Unique haplotypes. In this example, upon convergence, R=95.4% of the H*=10 estimated haplotypes have been recovered for this species based on a sample size of N =180 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-7 to higher levels of haplotype recovery to save on the number of iterations required to reach convergence. To do this, one simply runs HACHypothetical() with N = 189. Application of HACSim to real species Because the proposed iterative haplotype accumulation curve simulation algorithm simply treats haplotypes as numeric labels, it is easily generalized to any biological taxa and genetic loci for which a large number of high-quality DNA sequence data records is available in public databases such as BOLD. In the following examples, HACSim is employed to examine levels of standing genetic variation within animal species using 5′-COI. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 16/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-6 https://doi.org/10.7717/peerjcs.243/fig-7 http://dx.doi.org/10.7717/peerj-cs.243 Lake Whitefish (Coregonus clupeaformis) An interesting case study on which to focus is that of Lake whitefish (Coregonus clupeaformis). Lake whitefish are a commercially, culturally, ecologically and economically important group of salmonid fishes found throughout the Laurentian Great Lakes in Canada and the United States, particularly to the Saugeen Ojibway First Nation (SON) of Bruce Peninsula in Ontario, Canada, as well as non-indigenous fisheries (Ryan & Crawford, 2014). The colonization of refugia during Pleistocene glaciation is thought to have resulted in high levels of cryptic species diversity in North American freshwater fishes (Hubert et al., 2008; April et al., 2011; April et al., 2013a; April et al., 2013b). Overdyk et al. (2015) wished to investigate this hypothesis in larval Lake Huron lake whitefish. Despite limited levels of gene flow and likely formation of novel divergent haplotypes in this species, surprisingly, no evidence of deep evolutionary lineages was observed across the three major basins of Lake Huron despite marked differences in larval phenotype and adult fish spawning behaviour (Overdyk et al., 2015). This may be the result of limited sampling of intraspecific genetic variation, in addition to presumed panmixia (Overdyk et al., 2015). While lake whitefish represent one of the most well-studied fishes within BOLD, sampling effort for this species has nevertheless remained relatively static over the past few years. Thus, lake whitefish represent an ideal species for further exploration using HACSim. In applying the developed algorithm to real species, sequence data preparation methodology followed that which is outlined in Phillips et al. (2015). Curation included the exclusion of specimens linked to GenBank entries, since those records without the BARCODE keyword designation lack appropriate metadata central to reference sequence library construction and management (Hanner, 2009). Our approach here was solely to assess comprehensiveness of single genomic sequence databases rather than incorporating sequence data from multiple repositories; thus, all DNA barcode sequences either originating from, or submitted to GenBank were not considered further. As well, the presence of base ambiguities and gaps/indels within sequence alignments can lead to bias in estimate haplotype diversity for a given species. Currently (as of November 28, 2018), BOLD contains public (both barcode and non-barcode) records for 262 C. clupeaformis specimens collected from Lake Huron in northern parts of Ontario, Canada and Michigan, USA. Of the barcode sequences, N = 235 are of high quality (full-length (652 bp) and comprise no missing and/or ambiguous nucleotide bases). Haplotype analysis reveals that this species currently comprises H*=15 unique COI haplotypes. Further, this species shows a highly-skewed haplotype frequency distribution, with a single dominant haplotype accounting for c. 91.5% (215/235) of all individuals (Fig. 8). The output of HACSim is displayed below. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 17/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 ### Run simulations ### > HACSObj<-HACReal() > HAC.simrep(HACSObj) Simulating haplotype accumulation... | =====================================================| 100% --- Measures of Sampling Closeness --- Mean number of haplotypes sampled: 11.0705 Mean number of haplotypes not sampled: 3.9295 Proportion of haplotypes sampled: 0.7380333 Proportion of haplotypes not sampled: 0.2619667 Mean value of N*: 318.4138 Mean number of specimens not sampled: 83.4138 Desired level of haplotype recovery has not yet been reached | =======================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 13.8705 Mean number of haplotypes not sampled: 1.1295 Proportion of haplotypes sampled: 0.9247 Proportion of haplotypes not sampled: 0.0753 Mean value of N*: 603.439 Mean number of specimens not sampled: 45.43895 Desired level of haplotype recovery has not yet been reached | ==========================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 14.3708 Mean number of haplotypes not sampled: 0.6292 Proportion of haplotypes sampled: 0.9580533 Proportion of haplotypes not sampled: 0.04194667 Mean value of N*: 630.4451 Mean number of specimens not sampled: 26.44507 Desired level of haplotype recovery has been reached ---------- Finished. ---------- The initial guess for sampling sufficiency was N = 235 individuals The algorithm converged after 8 iterations and took 241.235 s The estimate of sampling sufficiency for p = 95% haplotype recovery is N∗= 604 individuals ( 95% CI: 601.504-606.496 ) The number of additional specimens required to be sampled for p = 95% haplotype recovery is N* - N = 369 individuals From the above output, it is clear that current specimen sample sizes found within BOLD for C. clupeaformis are probably not sufficient to capture the majority of within-species COI haplotype variation. An initial sample size of N = 235 specimens corresponds to recovering only 73.8% of all H* = 15 unique haplotypes for this species (Fig. 9). Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 18/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 8 Initial haplotype frequency distribution for N = 235 high-quality lake whitefish (Coregonus clupeaformis) COI barcode sequences obtained from BOLD. This species displays a highly-skewed pat- tern of observed haplotype variation, with Haplotype 1 accounting for c. 91.5% (215/235) of all sampled records. Full-size DOI: 10.7717/peerjcs.243/fig-8 Figure 9 Initial graphical output of HAC.sim() for a real species (Lake whitefish, C. clupeaformis) hav- ing a single dominant haplotype. (A) Specimens sampled; (B) Unique haplotypes. In this example, ini- tially, only R=73.8% of the H*=15 estimated haplotypes for this species have been recovered based on a sample size of N =235 specimens. The haplotype frequency barplot is identical to that of Fig. 8. Full-size DOI: 10.7717/peerjcs.243/fig-9 A sample size of N* = 604 individuals (95% CI [601.504–606.496]) would likely be needed to observe 95% of all existing genetic diversity for lake whitefish (Fig. 10). Since N =235 individuals have been sampled previously, only N∗−N =369 specimens remain to be collected. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 19/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-8 https://doi.org/10.7717/peerjcs.243/fig-9 http://dx.doi.org/10.7717/peerj-cs.243 Figure 10 Final graphical output of HAC.sim() for Lake whitefish (C. clupeaformis) having a single dominant haplotype. (A) Specimens sampled; (B) Unique haplotypes. Upon convergence, R = 95.8% of the H*=15 estimated haplotypes for this species have been uncovered with a sample size of N =604 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-10 Deer tick (Ixodes scapularis) Ticks, particularly the hard-bodied ticks (Arachnida: Acari: Ixodida: Ixodidae), are well- known as vectors of various zoonotic diseases including Lyme disease (Ondrejicka et al., 2014). Apart from this defining characteristic, the morphological identification of ticks at any lifestage, by even expert taxonomists, is notoriously difficult or sometimes even impossible (Ondrejicka, Morey & Hanner, 2017). Further, the presence of likely high cryptic species diversity in this group means that turning to molecular techniques such as DNA barcoding is often the only feasible option for reliable species diagnosis. Lyme-competent specimens can be accurately detected through employing a sensitive quantitative PCR (qPCR) procedure (Ondrejicka, Morey & Hanner, 2017). However, for such a workflow to be successful, wide coverage of within-species haplotype variation from across broad geographic ranges is paramount to better aid design of primer and probe sets for rapid species discrimination. Furthermore, the availability of large specimen sample sizes for tick species of medical and epidemiological relevance is necessary for accurately assessing the presence of the barcode gap. Notably, the deer tick (Ixodes scapularis), native to Canada and the United States, is the primary carrier of Borrelia burgdorferi, the bacterium responsible for causing Lyme disease in humans in these regions. Because of this, I. scapularis has been the subject of intensive taxonomic study in recent years. For instance, in a recent DNA barcoding study of medically-relevant Canadian ticks, Ondrejicka, Morey & Hanner (2017) found that out of eight specimens assessed for the presence of B. burgdorferi, 50% tested positive. However, as only exoskeletons and a single leg were examined for systemic infection, the reported infection rate may be a lower bound due to the fact that examined specimens may still harbour B. burgdorferi in their gut. As such, this species is well-represented within BOLD and thus warrants further examination within the present study. As of August 27, 2019, 531 5′-COI DNA barcode sequences are accessble from BOLD’s Public Data Portal for this species. Of these, N = 349 met criteria for high quality Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 20/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-10 http://dx.doi.org/10.7717/peerj-cs.243 outlined in Phillips et al. (2015). A 658 bp MUSCLE alignment comprised H*=83 unique haplotypes. Haplotype analysis revealed that Haplotypes 1–8 were represented by more than 10 specimens (range: 12–46; Fig. 11). Simulation output of HACSim is depicted below. ### Run simulations ### > HACSObj<-HACReal() > HAC.simrep(HACSObj) Simulating haplotype accumulation... | ======================================================| 100% --- Measures of Sampling Closeness --- Mean number of haplotypes sampled: 65.3514 Mean number of haplotypes not sampled: 17.6486 Proportion of haplotypes sampled: 0.7873663 Proportion of haplotypes not sampled: 0.2126337 Mean value of N*: 443.2499 Mean number of specimens not sampled: 94.24988 Desired level of haplotype recovery has not yet been reached | ============================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 78.3713 Mean number of haplotypes not sampled: 4.6287 Proportion of haplotypes sampled: 0.9442325 Proportion of haplotypes not sampled: 0.05576747 Mean value of N*: 802.7684 Mean number of specimens not sampled: 44.76836 Desired level of haplotype recovery has not yet been reached | =========================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 79.2147 Mean number of haplotypes not sampled: 3.7853 Proportion of haplotypes sampled: 0.954394 Proportion of haplotypes not sampled: 0.04560602 Mean value of N*: 841.3716 Mean number of specimens not sampled: 38.37161 Desired level of haplotype recovery has been reached ---------- Finished. ---------- The initial guess for sampling sufficiency was N = 349 individuals The algorithm converged after 8 iterations and took 1116.468 s The estimate of sampling sufficiency for p = 95% haplotype recovery is N∗= 803 individuals ( 95% CI: 801.551-804.449 ) The number of additional specimens required to be sampled for p = 95% haplotype recovery is N* - N = 454 individuals Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 21/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 11 Initial haplotype frequency distribution for N = 349 high-quality deer tick (Ixodes scapu- laris) COI barcode sequences obtained from BOLD. In this species, Haplotypes 1-8 account for c. 51.3% (179/349) of all sampled records. Full-size DOI: 10.7717/peerjcs.243/fig-11 Figure 12 Initial graphical output of HAC.sim() for a real species (Deer tick, I. scapularis) having eight dominant haplotypes. In this example, initially, only R=78.7% of the H*=83 estimated haplotypes for this species have been recovered based on a sample size of N =349 specimens. The haplotype frequency barplot is identical to that of Fig. 11. Full-size DOI: 10.7717/peerjcs.243/fig-12 The above results clearly demonstrate the need for increased specimen sample sizes in deer ticks. With an initial sample size of N = 349 individuals, only 78.7% of all observed haplotypes are recovered for this species (Fig. 12). N*=803 specimens (95% CI: 801.551-804.449) is necessary to capture at least 95% of standing haplotype variation for I. scapularis (Fig. 13) . Thus, a further N*−N =454 specimens are required to be collected. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 22/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-11 https://doi.org/10.7717/peerjcs.243/fig-12 http://dx.doi.org/10.7717/peerj-cs.243 Figure 13 Final graphical output of HAC.sim() for deer tick (I scapularis) having eight dominant hap- lotypes. Upon convergence, R = 95.4% of the H*=83 estimated haplotypes for this species have been uncovered with a sample size of N =803 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-13 Scalloped hammerhead (Sphyrna lewini) Sharks (Chondrichthyes: Elasmobranchii: Selachimorpha) represent one of the most ancient extant lineages of fishes. Despite this, many shark species face immediate extinction as a result of overexploitation, together with a unique life history (e.g., K-selected, predominant viviparity, long gestation period, lengthy time to maturation) and migration behaviour (Hanner, Naaum & Shivji, 2016). A large part of the problem stems from the increasing consumer demand for, and illegal trade of, shark fins, meat and bycatch on the Asian market. The widespread, albeit lucrative practice of ‘‘finning’’, whereby live sharks are definned and immediately released, has led to the rapid decline of once stable populations (Steinke et al., 2017). As a result, numerous shark species are currently listed by the International Union for the Conservation of Nature (IUCN) and the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). Interest in the molecular identification of sharks through DNA barcoding is multifold. The COI reference sequence library for this group remains largely incomplete. Further, many shark species exhibit high intraspecific distances within their barcodes, suggesting the possibility of cryptic species diversity. Instances of hybridization between sympatric species has also been documented. As establishing species-level matches to partial specimens through morphology alone is difficult, and such a task becomes impossible once fins are processed and sold for consumption or use in traditional medicine, DNA barcoding has paved a clear path forward for unequivocal diagnosis in most cases. The endangered hammerheads (Family: Sphyrnidae) represent one of the most well- sampled groups of sharks within BOLD to date. Fins of the scalloped hammerhead (Sphyrna lewini) are especially highly prized within IUU (Illegal, Unregulated, Unreported) fisheries due to their inclusion as the main ingredient in shark fin soup. As of August 27, 2019, 327 S. lewini specimens (sequenced at both barcode and non- barcode markers), collected from several Food and Agriculture Organization (FAO) regions, including the United States, are available through BOLD’s Public Data Portal. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 23/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-13 http://dx.doi.org/10.7717/peerj-cs.243 Of these, all high-quality records (N = 171) were selected for alignment in MEGA7 and assessment via HACSim. The final alignment was found to comprise H* = 12 unique haplotypes, of which three were represented by 20 or more specimens (range: 28–70; Fig. 14). HACSim results are displayed below. ### Run simulations ### > HACSObj<-HACReal() > HAC.simrep(HACSObj) Simulating haplotype accumulation... | ======================================================| 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 9.9099 Mean number of haplotypes not sampled: 2.0901 Proportion of haplotypes sampled: 0.825825 Proportion of haplotypes not sampled: 0.174175 Mean value of N*: 207.0657 Mean number of specimens not sampled: 36.06566 Desired level of haplotype recovery has not yet been reached ===================================================== 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 11.3231 Mean number of haplotypes not sampled: 0.6769 Proportion of haplotypes sampled: 0.9435917 Proportion of haplotypes not sampled: 0.05640833 Mean value of N*: 413.3144 Mean number of specimens not sampled: 23.31438 Desired level of haplotype recovery has not yet been reached ===================================================== 100% ---Measures of Sampling Closeness --- Mean number of haplotypes sampled: 11.4769 Mean number of haplotypes not sampled: 0.5231 Proportion of haplotypes sampled: 0.9564083 Proportion of haplotypes not sampled: 0.04359167 Mean value of N*: 432.8695 Mean number of specimens not sampled: 18.8695 Desired level of haplotype recovery has been reached ---------- Finished. ---------- The initial guess for sampling sufficiency was N = 171 individuals The algorithm converged after 9 iterations and took 174.215 s The estimate of sampling sufficiency for p = 95% haplotype recovery is N* = 414 individuals ( 95% CI: 411.937-416.063 ) The number of additional specimens required to be sampled for p = 95% haplotype recovery is N* - N = 243 individuals Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 24/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Figure 14 Initial haplotype frequency distribution for N = 171 high-quality scalloped hammerhead (Sphyrna lewini) COI barcode sequences obtained from BOLD. In this species, Haplotypes 1–3 account for c. 87.7% (150/171) of all sampled records. Full-size DOI: 10.7717/peerjcs.243/fig-14 Figure 15 Initial graphical output of HAC.sim() for a real species (Scalloped hammerhead, S. lewini) having three dominant haplotypes. In this example, initially, only R = 82.6% of the H*=12 estimated haplotypes for this species have been recovered based on a sample size of N =171 specimens. The haplo- type frequency barplot is identical to that of Fig. 14. Full-size DOI: 10.7717/peerjcs.243/fig-15 Simulation output suggests that only 82.6% of all unique haplotypes for the scalloped hammerhead have likely been recovered (Fig. 15) with a sample size of N = 171. Further, HACSim predicts that N* = 414 individuals (95% CI [411.937–416.063]) probably need to be randomly sampled to capture the majority of intraspecific genetic diversity within 5′-COI (Fig. 16). Since 171 specimens have already been collected, this leaves an additional N* − N = 243 individuals which await sampling. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 25/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-14 https://doi.org/10.7717/peerjcs.243/fig-15 http://dx.doi.org/10.7717/peerj-cs.243 Figure 16 Final graphical output of HAC.sim() for scalloped hammerhead (S. lewini) having three dominant haplotypes. Upon convergence, R = 95.6% of the H*=12 estimated haplotypes for this species have been uncovered with a sample size of N =414 specimens. Full-size DOI: 10.7717/peerjcs.243/fig-16 DISCUSSION Initializing HACSim and overall algorithm behaviour The overall stochastic behaviour of HACSim is highly dependent on the number of permutations used upon algorithm initialization. Provided that the value assigned to the perms argument is set high enough, numerical results ouputted by HACSim will be found to be quite consistent between consecutive runs whenever all remaining parameter values remain unchanged. It is crucial that perms not be set to too low a value to prevent the algorithm from getting stuck at local maxima and returning suboptimal solutions. This is a common situation with popular optimization algorithms such as hill-climbing. Attention therfore must be paid to avoid making generalizations based on algorithm performance and obtained simulation results (Spall, 2012). In applying the present method to simulated species data, it is important that selected simulation parameters are adequately reflective of those observed for real species. Thus, initial sample sizes should be chosen to cover a wide range of values based on those currently observed within BOLD. Such information can be gauged through examining species lists associated with BOLD records, which are readily accessible through Linnean search queries within the Taxonomy browser. As with any iterative numerical algorithm, selecting good starting guesses for initialization is key. While HACSim is globally convergent (i.e., convergence is guaranteed for any value of N ≥ H*), a good strategy when simulating hypothetical species is to start the algorithm by setting N = H*. In this way, the observed fraction of haplotypes found, R, will not exceed the desired level of haplotype recovery p, and therefore lead to overestimation of likely required specimen sample sizes. Setting N high enough will almost surely result in R exceeding p. Thus, arbitrarily large values of N may not be biologically meaningful or practical. However, in the case of hypothetical species simulation, should initial sample sizes be set too high, such that R > p, a straightforward workaround is to observe where the dashed horizontal line intersects the final haplotype accumulation curve Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 26/37 https://peerj.com https://doi.org/10.7717/peerjcs.243/fig-16 http://dx.doi.org/10.7717/peerj-cs.243 (i.e., not the line the touches the curve endpoint). The resulting value of N at this point will correspond with p quite closely. This can be seen in Fig. 5, where an eyeball guess just over N* = 20 individuals is necessary to recover p = 95% haplotype variation. A more reliable estimate can be obtained through examining the dataframe ‘‘d’’ outputted once the algorithm has halted (via envr$d). In this situation, simply look in the row corresponding to pH* ≥ 0.95(10) ≥ 9.5. The required sample size is the value given in the first column (specs). This is accomplished via the R code envr$d[which(envr$d$means >= envr$p * envr$Hstar), ][1, 1]. The novelty of HACSim is that it offers a systematic means of estimating likely specimen sample sizes required to assess intraspecific haplotype diversity for taxa within large-scale genomic databases like GenBank and BOLD. Estimates of sufficient sampling suggested by our algorithm can be employed to assess barcode coverage within existing reference sequence libraries and campaign projects found in BOLD. While comparison of our method to already-established ones is not yet possible, we anticipate that HACSim will nevertheless provide regulatory applications with an unprecedented view and greater understanding of the state of standing genetic diversity (or lack thereof) within species. Additional capabilities and extending functionality of HACSim In this paper, we illustrate the application of haplotype accumulation curves to the broad assessment of species-level genetic variation. However, HACSim is quite flexible in that one can easily explore likely required sample sizes at higher taxonomic levels (e.g., order, family, genus) or specific geographic regions (e.g., salmonids of the Great Lakes) with ease. Such applicability will undoubtedly be of interest at larger scales (i.e.. entire genomic sequence libraries). For instance, due to evidence of sampling bias in otherwise densely-sampled taxa housed in BOLD (e.g., Lepidoptera), D’Ercole et al. (J. D’Ercole, 2019, unpublished data) wished to assess whether or not intraspecific haplotype variation within butterfly species remains unsampled. To test this, the authors employed HACSim to examine sampling comprehensiveness for species comprising a large barcode reference library for North American butterflies spanning 814 species and 14,623 specimens. We foresee use of HACSim being widespread within the DNA barcoding community. As such, improvements to existing code in terms of further optimization and algorithm runtime, as well as implementation of new methods by experienced R programmers in the space of DNA-based taxonomic identification, seems bright. Potential extensions of our algorithm include support for the exploration of genetic variation at the Barcode Index Number (BIN) level (Ratnasingham & Hebert, 2013), as well as high-throughput sequencing (HTS) data for metabarcoding and environmental DNA (eDNA) applications. Such capabilities are likely to be challenging to implement at this stage until robust operational taxonomic unit (OTU) clustering algorithms are developed (preferably in R). One promising tool in this regard for barcoding of bulk samples of real species and mock communities of known species composition is JAMP (Just Another Metabarcoding Pipeline) devised for use in R by Elbrecht and colleagues (Elbrecht et al., 2018). JAMP includes a sequence read denoising tool that can be used to obtain haplotype numbers and frequency information (H* and probs). However, because JAMP relies Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 27/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 on third-party software (particularly USEARCH (Edgar, 2010) and VSEARCH (Rognes et al., 2016)), it cannot be integrated within HACSim itself and will thus have to be used externally. In extending HACSim to next-generation space, two issues arise. First, it is not immediately clear how the argument N , is to be handled since multiple reads could be associated with single individuals. That is, unlike in traditional Sanger-based sequencing, there is not a one-to-one correspondence between specimen and sequence (Wares & Pappalardo, 2015; Adams et al., 2019). Second, obtaining reliable haplotype information from noisy HTS datasets is challenging without first having strict quality filtering criteria in place to minimize the occurrence of rare, low-copy sequence variants which may reflect artifacts stemming from the Polymerase Chain Reaction (PCR) amplification step or sequencing process (Elbrecht et al., 2018; Braukmann et al., 2019; Turon et al., 2019). Turning to molecular population genetics theory might be the answer (Adams et al., 2019). Wares & Pappalardo (2015) suggest three different approaches to estimating the number of specimens of a species that may have contributed to a metabarcoding sample: (1) use of prior estimates of haplotype diversity, together with observed number of haplotypes; (2) usage of Ewens’ sampling formula (Ewens, 1972) along with estimates of Watterson’s θ (not to be confused with the θ denoting true sampling sufficency herein) (Watterson, 1975), as well as total number of sampled haplotypes; and (3) employment of an estimate of θ and the number of observed variable sites (S) within a multiple sequence alignment. A direct solution we propose might be to use sequencing coverage/depth (i.e., the number of sequence reads) as a proxy for number of individuals. The outcome of this would be an estimate of the mean/total number of sequece reads required for maximal haplotype recovery. However, the use of read count as a stand-in for number of specimens sampled would require the unrealistic assumption that all individuals (i.e., both alive and dead) shed DNA into their environment at equal rates. The obvious issue with extending HACSim to handle HTS data is computing power, as such data typically consists of millions of reads spanning multiple gigabytes of computer memory. Summary Here, we introduced a new statistical approach to assess specimen sampling depth within species based on existing gene marker variation found in public sequence databanks such as BOLD and GenBank. HACSim is both computationally efficient and easy to use. We show utility of our proposed algorithm through both hypothetical and real species genomic sequence data. For real species (here, lake whitefish, deer tick and scalloped hammerhead), results from HACSim suggest that comprehensive sampling for species comprising large barcode libraries within BOLD, such as Actinopterygii, Arachnida and Elasmobranchii is far from complete. With the availability of HACSim, appropriate sampling guidelines based on the amount of potential error one is willing to tolerate can now be established. For the purpose of addressing basic questions in biodiversity science, the employment of small taxon sample sizes may be adequate; however, this is not the case for regulatory applications, where greater than 95% coverage of intraspecific haplotype variation is needed to provide high confidence in sequence matches defensible in a court of law. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 28/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Of immediate interest is the application of our method to other ray-finned fishes, as well as other species from deeply inventoried taxonomic groups such as Elasmobranchii (e.g., sharks), Insecta (e.g., Lepidoptera, Culicidae (mosquitoes)), Arachnida (e.g., ticks) and Chiroptera (bats) that are of high conservation, medical and/or socioeconomic importance. Although we explicitly demonstrate the use of HACSim through employing COI, it would be interesting to extend usage to other barcode markers such as the ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) and maturase K (matK) chloroplast genes for land plants, as well as the nuclear internal transcribed spacer (ITS) marker regions for fungi. The application of our method to non-barcode genes routinely employed in specimen identification like mitochondrial cytochrome b (cytb) in birds for instance (Baker, Sendra Tavares & Elbourne, 2009; Lavinia et al., 2016), nuclear rhodopsin (rho) for marine fishes (Hanner et al., 2011) or the phosphoenolpyruvate carboxykinase (PEPCK) nuclear gene for bumblebees (Williams et al., 2015) is also likely to yield interesting results since sequencing numerous individuals at several different genomic markers can often reveal evolutionary patterns not otherwise seen from employing a single-gene approach (e.g., resolution of cryptic species or confirmation/revision of established taxonomic placements) (Williams et al., 2015). While it is reasonable that HACSim can be applied to genomic regions besides 5′-COI, careful consideration of varying rates of molecular evolution within rapidly-evolving gene markers and the effect on downsteam inferences is paramount, as is sequence quality. Previous work in plants (Genus: Taxus) by Liu et al. (2012) has found evidence of a correlation between mutation rate and required specimen sampling depth: genes evolving at faster rates will likely require larger sample sizes to estimate haplotype diversity compared to slowly-evolving genomic loci. We simply focused on 5′-COI because it is by far the most widely sequenced mitochondrial locus for specimen identification, owing to its desirable biological properties as a DNA barcode for animal taxa and because it has an associated data standard to help filter out poor-quality data. (Phillips, Gillis & Hanner, 2019). However, it should be noted that species diagnosis using COI and other barcode markers is not without its challenges. While COI accumulates variation at an appreciable rate, certain taxonomic groups are not readily distinguished on the basis of their DNA barcodes (e.g., the so-called ‘‘problem children’’, such as Cnidaria, which tend to lack adequate sequence divergence (Bucklin, Steinke & Blanco-Bercial, 2011)). Other taxa, like Mollusca, are known to harbour indel mutations (Layton, Martel & Hebert, 2014). Introns within Fungi greatly complicate sequence alignment (Min & Hickey, 2007). Thus, users of HACSim must exercise caution in interpreting end results with other markers, particularly those which are not protein-coding. It is necessary to consider the importance of sampling sufficiency as it pertains to the myriad regulatory applications of specimen identification established using DNA barcoding (e.g., combatting food fraud) in recent years. It since has become apparent that the success of such endeavours is complicated by the ever-evolving state of public reference sequence libraries such as those found within BOLD, in addition to the the inclusion of questionable sequences and lack of sufficent metadata for validation purposes in other genomic databases like GenBank (e.g., Harris (2003)). Dynamic DNA-based identification Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 29/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 systems may produce multiple conflicting hits to otherwise corresponding submissions over time. This unwanted behaviour has led to a number of regulatory agencies creating their own static repositories populated with expertly-identified sequence records tied to known voucher specimens deemed fit-for-purpose for molecular species diagnosis and forensic compliance (e.g., the United States Food and Drug Administration (USFDA)’s Reference Standard Sequence Library (RSSL) employed to identify unknown seafood samples from species of high socioeconomic value). While such a move has partially solved the problem of dynamism inherent in global sequence databases, there still remains the issue of low sample sizes that can greatly inflate the perception of barcode gaps between species. Obtaining adequate representation of standing genetic variation, both within and between species, is therefore essential to mitigating false assignments using DNA barcodes. To this end, we propose the use of HACSim to assess the degree of saturation of haplotype accumulation curves to aid regulatory scientists in rapidly and reliably projecting likely sufficient specimen sample sizes required for accurate matching of unknown queries to known Linnean names. A defining characteristic of HACSim is its convergence behaviour: the method converges to the desired level of haplotype recovery p for any initial guess N specified by the user. Based on examples explored herein, it appears likely that already-sampled species within repositories like BOLD are far from being fully characterized on the basis of existing haplotype variation. In addition to this, it is important to consider the current limitations of our algorithm. We can think of only one: it must be stressed that appropriate sample size trajectories are not possible for species with only single representatives within public DNA sequence databases because haplotype accumulation is unachievable with only one DNA sequence and/or a single sampled haplotype. Hence, HACSim can only be applied to species with at least two sampled specimens. Thus, application of our method to assess necessary sample sizes for full capture of extant haplotype variation in exceedingly rare or highly elusive taxa is not feasible. Despite this, we feel that HACSim can greatly aid in accurate and rapid barcode library construction necessary to thoroughly appreciate the diversity of life on Earth. CONCLUSIONS Herein, a new, easy-to-use R package was presented that can be employed to estimate intraspecific sample sizes for studies of genetic diversity assessment, with a particular focus on animal DNA barcoding using the COI gene. HACSim employs a novel nonparametric stochastic iterative extrapolation algorithm with good convergence properties to generate haplotype accumulation curves. Because our approach treats species’ haplotypes as numeric labels, any genomic locus can be targeted to probe levels of standing genetic variation within multicellular taxa. However, we stress that users must exercise care when dealing with sequence data from non-coding regions of the genome, since these are likely to comprise sequence artifacts such as indels and introns, which can both hinder successful sequence alignment and lead to overestimation of existing haplotype variation within species. The application of our method to assess likely required sample sizes for both Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 30/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 hypothetical and real species produced promising results. We argue the use of HACSim will be of broad interest in both academic and industry settings, most notably, regulatory agencies such as the Canadian Food Inspection Agency (CFIA), Agriculture and Agri-Food Canada (AAFC), United States Department of Agriculture (USDA), Public Health Agency of Canada (PHAC) and the USFDA. While HACSim is an ideal tool for the analysis of Sanger sequencing reads, an obvious next step is to extend usability to Next-Generation Sequencing (NGS), especially HTS applications. With these elements in place, even the full integration of HACSim to assess comprehensiveness of taxon sampling within large sequence databases such as BOLD seems like a reality in the near future. ACKNOWLEDGEMENTS The authors wish to extend thanks to Robert (Rob) Young for helpful discussions and providing critical feedback on an earlier draft of this manuscript that greatly improved its flow and overall readability. We acknowledge that the University of Guelph resides on the ancestral lands of the Attawandaron people and the treaty lands and territory of the Mississaugas of the Credit. We recognize the significance of the Dish with One Spoon Covenant to this land and offer our respect to our Anishinaabe, Haudenosaunee and Métis neighbours as we strive to strengthen our relationships with them. ADDITIONAL INFORMATION AND DECLARATIONS Funding This work was supported by the College of Physical and Engineering Science (CPES) Graduate Excellence Entrance Scholarship to Jarrett D. Phillips. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: College of Physical and Engineering Science (CPES) Graduate Excellence Entrance Scholarship. Competing Interests The authors declare there are no competing interests. Author Contributions • Jarrett D. Phillips conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. • Steven H. French conceived and designed the experiments, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. • Robert H. Hanner and Daniel J. Gillis conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 31/37 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.243 Data Availability The following information was supplied regarding data availability: The aligned and trimmed DNA barcodes are available at FigShare: Phillips, Jarrett (2019): Coregonus clupeaformis 5′-COI DNA barcode sequences. figshare. Dataset. 10.6084/m9.figshare.8870804.v1. Phillips, Jarrett (2019): Ixodes scapularis 5′-COI DNA barcode sequences. figshare. Dataset. 10.6084/m9.figshare.10006595.v1. Phillips, Jarrett (2019): Sphyrna lewini 5′-COI DNA barcode sequences. figshare. Dataset. 10.6084/m9.figshare.10006598.v1. A stable version of the algorithm is available on GitHub: https://github.com/jphill01/ HACSim.R, along with a detailed README on how to run code. Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj-cs.243#supplemental-information. REFERENCES Adams C, Knapp M, Gemmell N, Jeunen G-J, Bunce M, Lamare M, Taylor H. 2019. Beyond biodiversity: can environmental DNA (eDNA) cut it as a population genetics tool Genes 10(192):1–20 DOI 10.3390/genes10030192. April J, Hanner RH, Dion-Côté A-M, Bernatchez L. 2013a. Glacial cycles as an allopatric speciation pump in north-eastern American freshwater fishes. Molecular Ecology 22(2):409–422 DOI 10.1111/mec.12116. April J, Hanner RH, Mayden RL, Bernatchez L. 2013b. Metabolic rate and climatic fluctuations shape continental wide pattern of genetic divergence and biodiversity in fishes. PLOS ONE 8(7):e70296 DOI 10.1371/journal.pone.0070296. April J, Mayden RL, Hanner RH, Bernatchez L. 2011. Genetic calibration of species diversity among North America’s freshwater fishes. Proceedings of the National Academy of Sciences of the United States of America 108(26):10602–10607 DOI 10.1073/pnas.1016437108. Baker A, Sendra Tavares E, Elbourne R. 2009. Countering criticisms of single mito- chondrial DNA gene barcoding in birds. Molecular Ecology Resources 9(S1):257–268 DOI 10.1111/j.1755-0998.2009.02650.x. Bergsten J, Bilton DT, Fujisawa T, Elliott M, Monaghan MT, Balke M, Hendrich L, Geijer J, Herrmann J, Foster GN, I R, Nilsson A, Barraclough T, Vogler A. 2012. The effect of geographical scale of sampling on DNA barcoding. Systematic Biology 61(5):851–869. Braukmann T, Ivanova N, Prosser S, Elbrecht V, Steinke D, Ratnasingham S, De Waard J, Sones J, Zakharov E, Hebert P. 2019. Metabarcoding a diverse arthropod mock community. Molecular Ecology Resources 19(3):711–727. Brown SD, Collins RA, Boyer S, Lefort M-C, Malumbres-Olarte J, Vink CJ, Cruick- shank RH. 2012. Spider: an R package for the analysis of species identity and Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 32/37 https://peerj.com https://doi.org/10.6084/m9.figshare.8870804.v1 https://doi.org/10.6084/m9.figshare.10006595.v1 https://doi.org/10.6084/m9.figshare.10006598.v1 https://github.com/jphill01/HACSim.R https://github.com/jphill01/HACSim.R http://dx.doi.org/10.7717/peerj-cs.243#supplemental-information http://dx.doi.org/10.7717/peerj-cs.243#supplemental-information http://dx.doi.org/10.3390/genes10030192 http://dx.doi.org/10.1111/mec.12116 http://dx.doi.org/10.1371/journal.pone.0070296 http://dx.doi.org/10.1073/pnas.1016437108 http://dx.doi.org/10.1111/j.1755-0998.2009.02650.x http://dx.doi.org/10.7717/peerj-cs.243 evolution, with particular reference to DNA barcoding. Molecular Ecology Resources 12(3):562–565 DOI 10.1111/j.1755-0998.2011.03108.x. Bucklin A, Steinke D, Blanco-Bercial L. 2011. DNA barcoding of marine metazoa. Annual Review of Marine Science 3:471–508 DOI 10.1146/annurev-marine-120308-080950. Cameron S, Rubinoff D, Will K. 2006. Who will actually use DNA barcoding and what will it cost Systematic Biology 55(5):844–847 DOI 10.1080/10635150600960079. Čandek K, Kuntner M. 2015. DNA barcoding gap: reliable species identification over morphological and geographical scales. Molecular Ecology Resources 15(2):268–277 DOI 10.1111/1755-0998.12304. Casella G, Berger R. 2002. Statistical inference. California: Duxbury Thomson Learning. Ceballos G, Ehrlich PR, Barnosky AD, García A, Pringle RM, Palmer TM. 2015. Accel- erated modern human–induced species losses: entering the sixth mass extinction. Science Advances 1(5):e1400253 DOI 10.1126/sciadv.1400253. Chao A. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11:265–270. Collins R, Cruickshank R. 2013. The seven deadly sins of DNA barcoding. Molecular Ecology Resources 13(6):969–975 DOI 10.1111/1755-0998.12046. Dasmahapatra KK, Elias M, Hill RI, Hoffman JI, Mallet J. 2010. Mitochondrial DNA barcoding detects some species that are real, and some that are not. Molecular Ecology Resources 10(2):264–273 DOI 10.1111/j.1755-0998.2009.02763.x. Eddelbuettel D, François R. 2011. Rcpp: seamless R and C++ integration. Journal of Statistical Software 40(8):1–18 DOI 10.18637/jss.v040.i08. Eddelbuettel D, Sanderson C. 2014. RcppArmadillo: accelerating R with high- performance C++ linear algebra. Computational Statistics and Data Analysis 71:1054–1063 DOI 10.1016/j.csda.2013.02.005. Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinfor- matics 26(19):2460–2461 DOI 10.1093/bioinformatics/btq461. Efron B. 1979. Bootstrap methods: another look at the jackknife. The Annals of Statistics 7(1):1–26. Elbrecht V, Vamos EE, Steinke D, Leese F. 2018. Estimating intraspecific ge- netic diversity from community DNA metabarcoding data. PeerJ 6:e4644 DOI 10.7717/peerj.4644. Ewens WJ. 1972. The sampling theory of selectively neutral alleles. Theoretical Population Biology 3(1):87–112 DOI 10.1016/0040-5809(72)90035-4. Ezard T, Fujisawa T, Barraclough T. 2017. splits: SPecies’ limits by threshold statistics. R package version 1.0-19/r52. Available at https://R-Forge.R-project.org/projects/splits/ . Fujisawa T, Barraclough T. 2013. Delimiting species using single-locus data and the Generalized Mixed Yule Coalescent approach: a revised method and evaluation on simulated data sets. Systematic Biology 62(5):707–724 DOI 10.1093/sysbio/syt033. Goodall-Copestake W, Tarling G, Murphy E. 2012. On the comparison of population- level estimates of haplotype and nucleotide diversity: a case study using the gene cox1 in animals. Heredity 109(1):50–56 DOI 10.1038/hdy.2012.12. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 33/37 https://peerj.com http://dx.doi.org/10.1111/j.1755-0998.2011.03108.x http://dx.doi.org/10.1146/annurev-marine-120308-080950 http://dx.doi.org/10.1080/10635150600960079 http://dx.doi.org/10.1111/1755-0998.12304 http://dx.doi.org/10.1126/sciadv.1400253 http://dx.doi.org/10.1111/1755-0998.12046 http://dx.doi.org/10.1111/j.1755-0998.2009.02763.x http://dx.doi.org/10.18637/jss.v040.i08 http://dx.doi.org/10.1016/j.csda.2013.02.005 http://dx.doi.org/10.1093/bioinformatics/btq461 http://dx.doi.org/10.7717/peerj.4644 http://dx.doi.org/10.1016/0040-5809(72)90035-4 https://R-Forge.R-project.org/projects/splits/ http://dx.doi.org/10.1093/sysbio/syt033 http://dx.doi.org/10.1038/hdy.2012.12 http://dx.doi.org/10.7717/peerj-cs.243 Gwiazdowski RA, Elkinton JS, Dewaard JR, Sremac M. 2013. Phylogeographic diversity of the winter moths Operophtera brumata and O. bruceata (Lepidoptera: Geometri- dae) in Europe and North America. Annals of the Entomological Society of America 106(2):143–151 DOI 10.1603/AN12033. Hanner R. 2009. Data standards for BARCODE records in INSDC (BRIs). University of Guelph. Hanner R, Floyd R, Bernard A, Collette BB, Shivji M. 2011. DNA barcoding of billfishes. Mitochondrial DNA 22(sup1):27–36 DOI 10.3109/19401736.2011.596833. Hanner RH, Naaum AM, Shivji MS. 2016. Conclusion: DNA-based authentication of shark products and implications for conservation and management. In: Naaum AM, Hanner RH, eds. Seafood authenticity and traceability: a DNA-based perspective. 1st edition. California, Massachusetts, Oxford: Academic Press. Harris DJ. 2003. Can you bank on GenBank Trends in Ecology & Evolution 18(7):317–319 DOI 10.1016/S0169-5347(03)00150-2. Hebert PD, Cywinska A, Ball SL, DeWaard J. 2003. Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences 270(1512):313–321 DOI 10.1098/rspb.2002.2218. Hebert PD, Ratnasingham S, De Waard JR. 2003. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proceedings of the Royal Society of London B: Biological Sciences 270(Suppl 1):S96–S99. Hebert PD, Stoeckle MY, Zemlak TS, Francis CM. 2004. Identification of birds through DNA barcodes. PLOS Biology 2(10):e312 DOI 10.1371/journal.pbio.0020312. Hickerson MJ, Meyer CP, Moritz C. 2006. DNA barcoding will often fail to discover new animal species over broad parameter space. Systematic Biology 55(5):729–739 DOI 10.1080/10635150600969898. Hubert N, Hanner R, Holm E, Mandrak NE, Taylor E, Burridge M, Watkinson D, Dumont P, Curry A, Bentzen P, Zhang J, April J, Bernatchez L. 2008. Identifying Canadian freshwater fishes through DNA barcodes. PLOS ONE 3(6):e2490 DOI 10.1371/journal.pone.0002490. Jin Q, He L-J, Zhang A-B. 2012. A simple 2D non-parametric resampling statistical approach to assess confidence in species identification in DNA barcoding—an alternative to Likelihood and Bayesian approaches. PLOS ONE 7(12):e50831 DOI 10.1371/journal.pone.0050831. Kimura M, Weiss G. 1964. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics 49(4):561–576. Kumar S, Stecher G, Tamura K. 2016. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular Biology and Evolution 33(7):1870–1874. Lavinia P, Kerr K, Tubaro P, Hebert P, Lijtmaer D. 2016. Calibrating the molecular clock beyond cytochrome b: assessing the evolutionary rate of COI in birds. Journal of Avian Biology 47(1):86–91 DOI 10.1111/jav.00766. Layton K, Martel A, Hebert P. 2014. Patterns of DNA barcode variation in Canadian marine molluscs. PLOS ONE 9(4):e95003 DOI 10.1371/journal.pone.0095003. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 34/37 https://peerj.com http://dx.doi.org/10.1603/AN12033 http://dx.doi.org/10.3109/19401736.2011.596833 http://dx.doi.org/10.1016/S0169-5347(03)00150-2 http://dx.doi.org/10.1098/rspb.2002.2218 http://dx.doi.org/10.1371/journal.pbio.0020312 http://dx.doi.org/10.1080/10635150600969898 http://dx.doi.org/10.1371/journal.pone.0002490 http://dx.doi.org/10.1371/journal.pone.0050831 http://dx.doi.org/10.1111/jav.00766 http://dx.doi.org/10.1371/journal.pone.0095003 http://dx.doi.org/10.7717/peerj-cs.243 Liu J, Provan J, Gao L-M, Li D-Z. 2012. Sampling strategy and potential utility of indels for DNA barcoding of closely related plant species: a case study in Taxus. Interna- tional Journal of Molecular Sciences 13(7):8740–8751 DOI 10.3390/ijms13078740. Luo A, Lan H, Ling C, Zhang A-B, Shi L, Ho SY, Zhu C. 2015. A simulation study of sample size for DNA barcoding. Ecology and Evolution 5(24):5869–5879 DOI 10.1002/ece3.1846. Matz MV, Nielsen R. 2005. A likelihood ratio test for species membership based on DNA sequence data. Philosophical Transactions of the Royal Society of London B: Biological Sciences 360(1462):1969–1974 DOI 10.1098/rstb.2005.1728. Meyer CP, Paulay G. 2005. DNA barcoding: error rates based on comprehensive sampling. PLOS Biology 3(12):e422 DOI 10.1371/journal.pbio.0030422. Min X, Hickey D. 2007. Assessing the effect of varying sequence length on DNA barcod- ing of fungi. Molecular Ecology Notes 7:365–373 DOI 10.1111/j.1471-8286.2007.01698.x. Monaghan M, Wild R, Elliot M, Fujisawa T, Balke M, Inward D, Lees D, Ranaivosolo R, Eggleton P, Barraclough T, Vogler A. 2009. Accelerated species inventory on Madagascar using coalescent-based models of species delineation. Systematic Biology 58(3):298–311 DOI 10.1093/sysbio/syp027. Nielsen R, Matz M. 2006. Statistical approaches for DNA barcoding. Systematic Biology 55(1):162–169 DOI 10.1080/10635150500431239. Ondrejicka DA, Locke SA, Morey K, Borisenko AV, Hanner RH. 2014. Status and prospects of DNA barcoding in medically important parasites and vectors. Trends in Parasitology 30(12):582–591 DOI 10.1016/j.pt.2014.09.003. Ondrejicka DA, Morey K, Hanner RH. 2017. DNA barcodes identify medically impor- tant tick species in Canada. Genome 60(1):74–84 DOI 10.1139/gen-2015-0179. Overdyk LM, Braid HE, Crawford SS, Hanner RH. 2015. Extending DNA barcoding coverage for Lake Whitefish (Coregonus clupeaformis) across the three major basins of Lake Huron. DNA Barcodes 3(1):59–65. Paradis E. 2010. pegas: an R package for population genetics with an integrated— modular approach. Bioinformatics 26(3):419–420 DOI 10.1093/bioinformatics/btp696. Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20(2):289–290 DOI 10.1093/bioinformatics/btg412. Pearson K. 1894. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A 185:71–110 DOI 10.1098/rsta.1894.0003. Phillips JD, Gillis DJ, Hanner RH. 2019. Incomplete estimates of genetic diversity within species: Implications for DNA barcoding. Ecology and Evolution 9(5):2996–3010 DOI 10.1002/ece3.4757. Phillips JD, Gwiazdowski RA, Ashlock D, Hanner R. 2015. An exploration of sufficient sampling effort to describe intraspecific DNA barcode haplotype diversity: examples from the ray–finned fishes (Chordata: Actinopterygii). DNA Barcodes 3(1):66–73. Pons J, Barraclough T, Gomez-Zurita J, Cardoso A, Duran D, Hazell S, Kamoun S, Sumlin W, Vogler A. 2006. Sequence-based species delimitation for the Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 35/37 https://peerj.com http://dx.doi.org/10.3390/ijms13078740 http://dx.doi.org/10.1002/ece3.1846 http://dx.doi.org/10.1098/rstb.2005.1728 http://dx.doi.org/10.1371/journal.pbio.0030422 http://dx.doi.org/10.1111/j.1471-8286.2007.01698.x http://dx.doi.org/10.1093/sysbio/syp027 http://dx.doi.org/10.1080/10635150500431239 http://dx.doi.org/10.1016/j.pt.2014.09.003 http://dx.doi.org/10.1139/gen-2015-0179 http://dx.doi.org/10.1093/bioinformatics/btp696 http://dx.doi.org/10.1093/bioinformatics/btg412 http://dx.doi.org/10.1098/rsta.1894.0003 http://dx.doi.org/10.1002/ece3.4757 http://dx.doi.org/10.7717/peerj-cs.243 DNA taxonomy of undescribed insects. Systematic Biology 55(4):595–609 DOI 10.1080/10635150600852011. Puillandre N, Lambert A, Brouillet S. 2011. ABGD, automatic barcode gap discovery for primary species delimitation. Molecular Ecology 21:1864–1877. R Core Team. 2018. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available at https://www.R-project.org/ . Ratnasingham S, Hebert PD. 2007. BOLD: the barcode of life data system (http://www.barcodinglife.org). Molecular Ecology Notes 7(3):355–364 DOI 10.1111/j.1471-8286.2007.01678.x. Ratnasingham S, Hebert PD. 2013. A DNA-based registry for all animal species: the Barcode Index Number (BIN) system. PLOS ONE 8(7):e66213 DOI 10.1371/journal.pone.0066213. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584 DOI 10.7717/peerj.2584. Ross HA, Murugan S, Li WLS. 2008. Testing the reliability of genetic methods of species identification via simulation. Systematic Biology 57(2):216–230 DOI 10.1080/10635150802032990. Ryan K, Crawford S. 2014. Distribution and abundance of larval lake whitefish (Core- gonus clupeaformis) in Stokes Bay, Lake Huron. Journal of Great Lakes Research 40:755–762 DOI 10.1016/j.jglr.2014.05.008. Spall JC. 2012. Stochastic optimization. In: Gentle JE, Härdle WK, Mori Y, eds. Hand- book of computational statistics: concepts and methods. 2nd edition. Heidelberg: Springer. Stein ED, Martinez MC, Stiles S, Miller PE, Zakharov EV. 2014. Is DNA barcoding actually cheaper and faster than traditional morphological methods: results from a survey of freshwater bioassessment efforts in the United States PLOS ONE 9(4):e95525 DOI 10.1371/journal.pone.0095525. Steinke D, Bernard AM, Horn RL, Hilton P, Hanner R, Shivji MS. 2017. DNA analysis of traded shark fins and mobulid gill plates reveals a high proportion of species of conservation concern. Scientific Reports 7(9505):1–6 DOI 10.1038/s41598-016-0028-x. Turon X, Antich A, Palacin C, Præbel K, Wangersteen O. 2019. From metabarcoding to metaphylogeography: separating the wheat from the chaff. bioRxiv. Ward RD, Zemlak TS, Innes BH, Last PR, Hebert PD. 2005. DNA barcoding Australia’s fish species. Philosophical Transactions of the Royal Society of London B: Biological Sciences 360(1462):1847–1857 DOI 10.1098/rstb.2005.1716. Wares JP, Pappalardo P. 2015. Can theory improve the scope of quantitative metazoan metabarcoding? Diversity 8(1):1–15 DOI 10.3390/d8010001. Watterson G. 1975. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7(2):256–276 DOI 10.1016/0040-5809(75)90020-9. Wiemers M, Fiedler K. 2007. Does the DNA barcoding gap exist?—a case study in blue butterflies (Lepidoptera: Lycaenidae). Frontiers in Zoology 4(8):1–16. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 36/37 https://peerj.com http://dx.doi.org/10.1080/10635150600852011 https://www.R-project.org/ http://dx.doi.org/10.1111/j.1471-8286.2007.01678.x http://dx.doi.org/10.1371/journal.pone.0066213 http://dx.doi.org/10.7717/peerj.2584 http://dx.doi.org/10.1080/10635150802032990 http://dx.doi.org/10.1016/j.jglr.2014.05.008 http://dx.doi.org/10.1371/journal.pone.0095525 http://dx.doi.org/10.1038/s41598-016-0028-x http://dx.doi.org/10.1098/rstb.2005.1716 http://dx.doi.org/10.3390/d8010001 http://dx.doi.org/10.1016/0040-5809(75)90020-9 http://dx.doi.org/10.7717/peerj-cs.243 Williams P, Byvaltsev A, Cederberg B, Berezin M, Ødegaard F, Rasmussen C, Richardson L, Huang J, Sheffield C, Williams S. 2015. Genes suggest ancestral colour polymorphisms are shared across morphologically cryptic species in arctic bumblebees. PLOS ONE 10(12):e0144544 DOI 10.1371/journal.pone.0144544. Wright S. 1951. The genetical structure of populations. Annals of Eugenics 15(1):323–354. Yao P-C, Gao H-Y, Wei Y-N, Zhang J-H, Chen X-Y, Li H-Q. 2017. Evaluating sampling strategy for DNA barcoding study of coastal and inland halo-tolerant Poaceae and Chenopodiaceae: a case study for increased sample size. PLOS ONE 12(9):e0185311 DOI 10.1371/journal.pone.0185311. Young R, Abott C, Therriault T, Adamowicz S. 2017. Barcode-based species delimitation in the marine realm: a test using Hexanauplia (Multicrustacea: Thecostraca and Copepoda). Genome 60(2):169–182 DOI 10.1139/gen-2015-0209. Zhang A-B, He L-J, Crozier RH, Muster C, Zhu C-D. 2010. Estimating sample sizes for DNA barcoding. Molecular Phylogenetics and Evolution 54(3):1035–1039 DOI 10.1016/j.ympev.2009.09.014. Zhang J, Kapli P, Pavlidis P, Stamatakis P. 2013. A general species delimitation method with applications to phylogenetic placements. Bioinformatics 29(22):2869–2876 DOI 10.1093/bioinformatics/btt499. Phillips et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.243 37/37 https://peerj.com http://dx.doi.org/10.1371/journal.pone.0144544 http://dx.doi.org/10.1371/journal.pone.0185311 http://dx.doi.org/10.1139/gen-2015-0209 http://dx.doi.org/10.1016/j.ympev.2009.09.014 http://dx.doi.org/10.1093/bioinformatics/btt499 http://dx.doi.org/10.7717/peerj-cs.243