debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data


1 

 
Title: 1 

debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data 2 

 3 

Authors 4 

Cameron M. Nugent1,2,* 5 

Tyler A. Elliott2 6 

Sujeevan Ratnasingham2  7 

Paul D. N. Hebert2 8 

Sarah J. Adamowicz1 9 

 10 
1 Department of Integrative Biology, University of Guelph. Guelph, Ontario, Canada  11 
2 Centre for Biodiversity Genomics, University of Guelph. Guelph, Ontario, Canada 12 
*Corresponding author: nugentc@uoguelph.ca 13 

 14 

  15 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

mailto:nugentc@uoguelph.ca
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


2 

 
Abstract 16 

 17 

DNA barcoding and metabarcoding are now widely used to advance species discovery and 18 

biodiversity assessments. High-throughput sequencing (HTS) has expanded the volume and 19 

scope of these analyses, but elevated error rates introduce noise into sequence records that can 20 

inflate estimates of biodiversity. Denoising —the separation of biological signal from instrument 21 

(technical) noise—of barcode and metabarcode data currently employs abundance-based 22 

methods which do not capitalize on the highly conserved structure of the cytochrome c oxidase 23 

subunit I (COI) region employed as the animal barcode. This manuscript introduces debar, an R 24 

package that utilizes a profile hidden Markov model to denoise indel errors in COI sequences 25 

introduced by instrument error. In silico studies demonstrated that debar recognized 95% of 26 

artificially introduced indels in COI sequences. When applied to real-world data, debar reduced 27 

indel errors in circular consensus sequences obtained with the Sequel platform by 75%, and 28 

those generated on the Ion Torrent S5 by 94%. The false correction rate was less than 0.1%, 29 

indicating that debar is receptive to the majority of true COI variation in the animal kingdom. In 30 

conclusion, the debar package improves DNA barcode and metabarcode workflows by aiding the 31 

generation of more accurate sequences aiding the characterization of species diversity. 32 

 33 

Keywords: COI, DNA barcode, metabarcode, denoising, Markov model, biodiversity  34 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


3 

 
Introduction 35 

 36 

Motivated by global biodiversity decline, conservation policies and strategies are being 37 

implemented to mitigate extinction rates (Driscoll et al. 2018; Baynham-Herd et al. 2018). 38 

Accurate assessments of biodiversity and its change over time are critical to support conservation 39 

strategies, to remediate environmental damage, and to manage natural resources, but this 40 

information is lacking for most ecosystems (Sogin et al. 2006; Hajibabaei et al. 2016; Hebert et 41 

al. 2016; D’Souza & Hebert 2018). 42 

DNA barcoding provides a technological solution to the problem of identifying 43 

organisms and characterizing biodiversity (Hebert et al. 2003; Hubert & Hanner 2015). Instead 44 

of identifying specimens through morphological study, standardized DNA regions—termed 45 

DNA barcodes—are used to identify specimens belonging to known species and to recognize 46 

new taxa. Reflecting advances in sequencing technology, DNA barcode studies are expanding in 47 

scale from analyzing single specimens to characterizing bulk samples, an approach termed 48 

metabarcoding, as well as multi-marker and metagenomics approaches (Taberlet et al. 2012; 49 

Cristescu 2014; Hajibabaei et al. 2016; Wilson et al. 2019). These advances are providing newly 50 

detailed information on species diversity in different geographic regions and habitats (Hajibabaei 51 

et al. 2012; Hebert et al. 2016; Delabye et al. 2019; Lopez-Vaamonde et al. 2019) while also 52 

aiding the identification of invasive species (Brown et al. 2016; Xu et al. 2017), food web 53 

analysis (Wirta et al. 2014; Kanuisto et al. 2017), and environmental monitoring (Hajibabaei et 54 

al. 2016; Stat et al. 2017; Cordier et al. 2019). 55 

Despite the broad adoption of DNA barcoding and metabarcoding, a fundamental 56 

problem persists. Efforts to quantify biodiversity from barcode and metabarcode data can be 57 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


4 

 
strongly affected by analytical methodology (Clare et al. 2016; Braukmann et al. 2019). For 58 

example, if high-throughput sequence (HTS) data are cleaned suboptimally, the estimated 59 

number of taxa may be grossly inflated as variation introduced by sequencing (technical) errors 60 

are interpreted as biological variation (Hardge et al. 2018). 61 

To reduce the impact of technical errors, sequence reads are often clustered into 62 

operational taxonomic units (OTUs) at specific identity thresholds (Elbrecht et al. 2018). Several 63 

software packages have attempted to increase the accuracy of this OTU method by separating 64 

biological signal from technical noise (Rosen et al. 2012; Callahan et al. 2016; Edgar 2016; 65 

Amir et al. 2017; Elbrecht et al. 2018; Kumar et al. 2018; Nearing et al. 2018). Many standard 66 

denoisers, such as DADA2 (Callahan et al. 2016), Deblur (Amir et al. 2017), and UNOISE 67 

(Edgar 2016), utilize cluster-based approaches, custom error models, or pre-clustering algorithms 68 

to account for and correct technical errors. Comparative studies have shown that all three of 69 

these methods outperform threshold-based OTU-clustering approaches (Nearing et al. 2018). It 70 

has also been shown that they produce similar estimates of species richness and relative 71 

abundance, but significantly different values for alpha diversity (intra-habitat diversity) and the 72 

number of unique exact sequence variants (ESVs) (Nearing et al. 2018). When a highly 73 

conserved protein-coding region, such as cytochrome c oxidase subunit I (COI), is employed as 74 

the barcode, structural information can be leveraged to improve denoising. The adoption of this 75 

approach can improve the accuracy of alpha-diversity estimates and the quality of identified 76 

barcode sequences by ensuring barcodes conform to biological reality. Additionally, rare 77 

sequences or important intra-species variants need not be discarded based solely on their 78 

abundance and can be retained with higher confidence if they conform to the expected gene 79 

structure. This latter benefit will be particularly valuable for work on hyper-diverse communities, 80 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


5 

 
(e.g. tropical insects) and for analyses of metabarcode data, where uneven sampling is often the 81 

norm and the resolution of intra-species variation is challenging (Elbrecht et al. 2018; Nearing et 82 

al. 2018; Braukmann et al. 2019; Zizka et al. 2020). 83 

Hidden Markov models (HMMs) are probabilistic representations of sequences that allow 84 

unobserved (hidden) states to be inferred through the observation of a series of non-hidden states 85 

(Durbin et al. 1998; Wilkinson 2019). HMMs have been applied widely in the analysis of 86 

biological sequences, in areas such as sequence alignment and annotation (Durbin et al. 1998; 87 

Eddy 1998). Profile Hidden Markov models (PHMMs) are a variant well suited for the 88 

representation of biological sequences with a shared evolutionary origin (Durbin et al. 1998; 89 

Eddy 1998, 2009). They are probabilistic models that contain position-specific information about 90 

the likelihood of potential characters (base pairs or amino acid residues) at the given position in 91 

the sequence (emission probabilities) and the likelihood of the observed character given the 92 

previously observed character in the sequence (transition probabilities). Once a PHMM is trained 93 

on a set of sequences, the Viterbi algorithm can be used to obtain the path of hidden states that 94 

align the novel sequences to the PHMM (Durbin et al. 1998). The Viterbi path is comprised of 95 

hidden match states (indicating the observed character matches to a position in the PHMM) and 96 

non-match states: either inserts or deletions. In the context of error correction, hidden non-match 97 

states identify the most likely positions at which novel sequences deviate from the PHMM’s 98 

statistical profile. In this manner, individual sequences can be queried for evidence of insertion 99 

or deletion (indel) errors and adjusted in a statistically informed manner. The conserved protein-100 

coding structure of the most common animal barcode gene, COI, and the wealth of available 101 

training sequences (Ratnasingham & Hebert 2007) for this region have allowed PHMMs to be 102 

successfully applied in the detection of technical errors in novel barcode sequences (Nugent et 103 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


6 

 
al. 2020). Correction of technical indel errors in data from protein-coding barcode sequences is 104 

an important development as it maximizes the likelihood that both the nucleotide and amino acid 105 

sequences correspond to the true biological sequence. Mitigation of indels arising due to 106 

technical errors also makes sequence reads from a given specimen more directly comparable, 107 

allowing low-frequency point mutations to be eliminated when multiple reads are available for a 108 

given biological sequence. Here, we aim to extend the use of PHMMs in COI data processing to 109 

allow for the sequence-by-sequence correction (denoising) of technical errors. 110 

This study had four primary goals: (1) design a denoising tool for COI barcode data that 111 

utilizes PHMMs to identify and correct insertion and deletion errors resulting from technical 112 

error; (2) test the tool’s performance and optimize its default parameters by denoising a set of 113 

10,000 barcode sequences with artificially introduced indel errors; (3) develop, implement, and 114 

evaluate a workflow for denoising DNA barcode data produced through single-molecule, real 115 

time (SMRT) sequencing of 29,525 specimens on the Sequel platform (Pacific Biosciences); and 116 

(4) denoise a DNA metabarcode mock community data set using debar and evaluate the 117 

improvement in quality of consensus sequences and the ability to resolve intra-OTU haplotype 118 

variation. The denoiser resulting from this work, debar (DEnoising BARcodes), is a free, 119 

publicly available package written in R that is available through CRAN (https://CRAN.R-120 

project.org/package=debar) and GitHub (https://github.com/CNuge/debar).  121 

 122 

Materials and Methods 123 

 124 

 Implementation 125 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://cran.r-project.org/package=debar
https://cran.r-project.org/package=debar
https://github.com/CNuge/debar
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


7 

 
The debar utility includes several customizable steps which denoise DNA barcode and 126 

metabarcode data (Figure 1; Supplementary File 1). Corrections with debar are based upon the 127 

comparison of input sequences with a nucleotide-based profile hidden Markov model (PHMM) 128 

(model training detailed in Nugent et al. 2020) using the Viterbi algorithm (Durbin et al. 1998). 129 

Briefly, debar’s PHMM was trained using a curated set of 11,387 COI-5P barcode sequences 130 

obtained from the Barcode of Life Data Systems (BOLD: www.boldsystems.org) public database 131 

that were checked to ensure: (i) the sequence was >600 bp in length, (ii) taxonomy was known to 132 

a genus level, (iii) there were no missing base pairs, (iv) the amino acid sequence did not contain 133 

stop codons, and (v) BOLD’s internal check for contaminants was negative (Nugent et al. 2020). 134 

The Viterbi path produced through alignment of the sequence to the PHMMs is used to match 135 

the input sequence to the PHMM (by finding the first set of 10 consecutive match states which 136 

indicate the absence of indels for the given 10 base pairs). The read is then adjusted to account 137 

for detected insertions or deletions (Figure 1). Three consecutive nucleotide insertions or 138 

deletions are permitted (not adjusted) as sequences of this kind are more likely to reflect true 139 

biological variants than technical errors (they do not result in reading frame shifts and may 140 

reflect an insertion or deletion of an amino acid in a functional protein-coding gene). The 141 

probability of such changes through sequencing error is relatively low (i.e. for the Pacific 142 

Biosciences Sequel platform the baseline probability of three consecutive deletions would be 143 

0.05% (baseline delete probability) cubed, or 0.000125%). 144 

 The denoising of sequences with debar is controlled using a suite of parameters (Figure 145 

1). The censorship parameter is most important as it controls the size of the masks (substitution 146 

of nucleotides for placeholder N characters) applied around sequence adjustments. This option is 147 

designed to prevent the introduction of errors that would be caused if the denoising process 148 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


8 

 
deleted the wrong base pair or inserted a placeholder in the incorrect position. Derivation of the 149 

default value for the censorship parameter is detailed in the Methods and Results sections. The 150 

package also enables the translation of denoised sequences to amino acids to confirm that 151 

denoised outputs conform to the expected properties of the protein-coding gene region. Because 152 

debar can interface directly with fasta and fastq files, it enables file-to-file denoising in addition 153 

to denoising within an R programming environment. The default PHMM used for denoising by 154 

debar represents the complete 657bp barcode region of COI. The package also permits the use of 155 

customized PHMMs provided by a user, which allows the denosiser to be applied to data from 156 

other gene regions or for the denoiser to be targeted to a specific user-defined subsection of the 157 

COI barcode. Training of a PHMM for a new barcode or gene is supported by the R package 158 

aphid (Wilkinson 2019), while sub-setting of debar’s default PHMM is enabled by the R package 159 

coil (Nugent et al. 2020). Details of the package’s components together with a demonstration of 160 

its implementation is available in the package’s vignette (Supplementary File 1). 161 

 162 

 Quantification of package performance  163 

Simulated error data 164 

The debar package was tested using a phylogenetically stratified random sample of publicly 165 

available COI-5P sequences with artificially introduced indels. This test was designed to assess 166 

the accuracy of sequence corrections and to obtain a quantitatively informed set of default 167 

parameters for the denoising process. A random sample of 10,000 animal COI-5P sequences 168 

(excluding those used in PHMM model training) were obtained from BOLD and cleaned using 169 

the steps described in Nugent et al. 2020 (methods section – BOLD data acquisition). Errors 170 

were introduced into each sequence in accordance with the statistical error profile of the Pacific 171 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


9 

 
Biosciences Sequel based upon the error profile for COI barcode region in Hebert et al. (2018). 172 

This profile indicated a baseline indel rate of 0.1% (insertions and deletions equally likely), a 173 

baseline substitution rate of 0.5%, and an elevated indel rate for long homopolymers (repeat 174 

length of 6,7, and 8+ with indel probabilities of 0.75%, 1.2%, and 3.8%, respectively) (Hebert et 175 

al. 2018). The location of all errors was recorded so that accuracy of subsequent corrections 176 

could be evaluated. Sequences were iteratively processed, and errors were limited to a single 177 

insertion or deletion error of one base pair in length (with the error introduction process being 178 

repeated for the original sequence when more than one indel occurred), which allowed for the 179 

accuracy of corrections to be assessed without the need to consider interaction effects. 180 

 The resultant sequences, each with one indel, were then denoised with debar (‘denoise’ 181 

function, using the parameter censor_length = 0). The outputs of the denoise function were 182 

queried to determine the number and location of indel corrections applied by debar. This 183 

information was compared to the recorded ground truth error locations to quantify the following: 184 

1) the frequency with which debar located and exactly corrected indels, 2) the miss distance 185 

(number of nucleotide positions) between introduced errors and corrections applied in instances 186 

where debar did not correct the indel errors in exactly the correct position, and 3) the frequency 187 

at which debar applied an incorrect number of sequence corrections (i.e. 0 correction or 2+ 188 

corrections). If one correction was made and the distance between the correction and true indel 189 

position was 0, then the correction was considered accurate. Corrections were also considered 190 

accurate if all base pairs between the correction location and the true indel position were the 191 

same (i.e. if base pair 2 in the homopolymer "TTTTT" was an insertion, but the 5th T in the 192 

sequence was removed by debar, this is functionally an exact correction as the true sequence is 193 

restored). All other corrections at inexact positions were considered inaccurate, and the distance 194 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


10 

 
(number of positions) between the correction and true indel location was recorded. The mean and 195 

standard deviation of the miss distance were determined and used to select the default 196 

censor_length parameter for the debar package, equal to the mean miss distance plus 2 standard 197 

deviations (censor_length = ceiling( μmiss_distance + (2 x σmiss_distance)) ). This value was selected as 198 

it would be expected to avoid the introduction of an error for > 95% of inexact corrections. 199 

Sequences where no corrections or multiple corrections were made had their outputs inspected 200 

further to determine if other parts of the denoising pipeline (e.g. the check for stop codons in the 201 

translated amino acid sequence or trimming of sequence edges in the framing process) removed 202 

the error or led to the complete rejection of the sequence. 203 

  204 

False correction rate 205 

The performance of debar on sequences with no indel errors was also quantified to determine the 206 

frequency and cause of erroneous corrections applied to cleaned, publicly available COI-5P 207 

barcode sequences with no known technical errors. A random sample of 10,000 sequences from 208 

all the animal COI-5P barcode sequences available on BOLD was obtained (Supplementary File 209 

2) meeting the following criteria was obtained: 1) the barcode was publicly available on the 210 

BOLD database, 2) the barcode was > 600bp in length, 3) the barcode did not contain missing 211 

characters (“N”) in the Folmer region, 4) the corresponding amino sequence did not contain stop 212 

codons, 5) the result of BOLD’s internal check for contaminants was negative, and 6) the 213 

sequence was not used in PHMM training and the simulated error dataset. Sequences were 214 

processed using debar’s denoise function (censor_length = 0). All sequences that had corrections 215 

applied, or that were flagged for rejection, were counted and examined in detail to search for 216 

evidence of the proximal cause of the false correction. To search for evidence of taxonomic bias, 217 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


11 

 
the taxonomy associated with all falsely corrected sequences were tallied at the order level, and 218 

manually examined for evidence of bias. 219 

 220 

 Denoising PacBio Sequel data 221 

We quantified the performance of debar on raw DNA barcode sequence data by interfacing with 222 

the existing mBRAVE workflow (http://www.mbrave.net) used to process DNA barcode circular 223 

consensus sequences (CCS) obtained with the Sequel platform. A custom analysis pipeline 224 

(Supplementary File 3) was constructed to analyze and denoise the final set of CCS barcodes 225 

produced by the mBRAVE workflow (one CCS per OTU) (Figure 2). The pipeline was designed 226 

to search the final barcodes produced by mBRAVE for evidence of indel errors (by considering 227 

the translated amino acid sequence with the R package coil (Nugent et al. 2020)), denoise all the 228 

associated CCS with detected errors using the debar package, and then regenerate a consensus 229 

barcode sequence using the denoised data to produce a final, denoised barcode sequence for each 230 

specimen (Figure 2).  231 

 The outputs of this analysis were examined to determine if the debar pipeline decreased 232 

the number of technical errors in the barcode sequences and that those barcode sequences 233 

resulted in likely amino acid sequences when translated. Initial quantification of the 234 

improvement was conducted by comparing the number of barcode sequences whose amino acid 235 

sequences were flagged by the R package coil (Nugent et al. 2020, default parameters) before 236 

and after denoising. Barcodes are flagged by coil when they possess a stop codon when 237 

translated to amino acids or when the resultant amino acid sequence is improbable, both 238 

indicating that the sequence likely possesses an indel error. 239 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

http://www.mbrave.net)/
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


12 

 
 Since the coil and debar packages both employ the same nucleotide profile hidden 240 

Markov model (coil also utilizes an amino acid PHMM), an independent test of pipeline 241 

effectiveness was also conducted. The effectiveness of the denoising pipeline was quantified by 242 

submitting both the original and denoised barcode sequences to BOLD. It was used to determine 243 

the number of original barcodes and denoised barcodes with evidence of stop codons after 244 

aligning the sequences using the BOLD’s hidden Markov model (a model developed 245 

independently of the debar PHMM) and translating the sequence using the appropriate 246 

translation table corresponding to the taxonomic information accompanying the sequence record. 247 

Comparison of these numbers made it possible to quantify the increase in barcode-compliant 248 

sequences (i.e. those with no stop codon) produced by debar. Additionally, the Sequence Quality 249 

Report on BOLD was examined to determine the number of unknown nucleotides (“N”) in the 250 

barcode sequences after denoising. The report categorizes barcode quality as: high (<1% Ns), 251 

medium (<2% Ns), low (<4% Ns), or unreliable (>4% Ns), and the number of barcodes in these 252 

different categories was recorded.  253 

 254 

 Denoising metabarcode data  255 

To characterize debar’s performance on metabarcode data, we analyzed a metabarcode dataset 256 

for a mock arthropod community (Braukmann et al. 2019). These data derived from a single 257 

sequencing run on an Ion Torrent S5 on COI amplicons generated by pooled DNA extracts from 258 

abdomens from single specimens of 369 arthropod species (methods described in detail in 259 

Braukmann et al. 2019). Sequences were from a 407bp fragment of the COI barcode region 260 

targeted using the primers MlepF1 and LepR1 (Hebert et al. 2004; Braukmann et al. 2019). 261 

Following amplification and sequencing on the Ion S5, quality control, sequence dereplication, 262 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


13 

 
chimeric read filtering, matching to reference sequences, and clustering were performed on 263 

mBRAVE (Braukman et al. 2019). Two sets of data resulted from this process, a set of 123,926 264 

unique sequences that were assigned to 398 different Barcode Index Numbers (BINs) 265 

(Ratnasingham and Hebert 2013) through the comparison to reference sequences (matched at 266 

>98% similarity), and a set of 2,199 unique sequences not matching to available references that 267 

were clustered into an additional 1,255 OTUs at a 97% similarity threshold (using clustering 268 

algorithm described in Braukmann et al. 2019).  269 

All sequences were denoised using debar’s denoise_list function and a custom nucleotide 270 

PHMM. The custom PHMM was a 398bp subset of the complete COI PHMM (PHMM profile 271 

positions 250 – 648), corresponding to a segment of the Folmer (Folmer et al. 1994) region 272 

targeted by the metabarcoding primers. The PHMM was created using coil’s ‘subsetPHMM’ 273 

function (Nugent et al. 2020). After denoising, two tests were conducted to determine if 274 

denoising improved the quality of the metabarcode pipeline’s output data. 275 

First, for each BIN and OTU consensus sequences were generated using denoised 276 

sequences and the debar function ‘consensus_sequence’. These consensus sequences were 277 

assessed for evidence of stop codons using coil and the same custom PHMMs used in denoising 278 

(function coi5p_pipe with the additional parameter: trans_table = 5). This test revealed the 279 

number of denoised consensus sequences which contained a stop codon when translated to 280 

amino acids, indicating an indel error persisted in the nucleotide sequence. The centroid 281 

sequences for the BINs and OTUs were used as a baseline metric for the number of barcode-282 

compliant sequences. For each BIN, centroid sequences were obtained by clustering the 283 

sequences in the group using the R package kmer’s ‘otu’ function (parameters: k = 4, threshold = 284 

0.95) (Wilkinson 2018, Version 1.0.0). For the OTUs, centroids were obtained from data 285 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


14 

 
generated by mBRAVE. All centroids were assessed with coil (Nugent et al. 2020, Version 1.0), 286 

and the number of barcode-compliant representative sequences for the original centroids and the 287 

final consensus sequences was compared. 288 

Secondly, the individual sequences within each BIN and OTU were analyzed with coil to 289 

determine the number that were likely error free, as evidenced by the absence of stop codons 290 

after translation. This assessment was repeated on the denoised reads to determine the 291 

effectiveness of debar in correcting errors in individual sequences and to reveal if the denoising 292 

process improved the resolution of ESVs for subsequent analysis of intra-species genetic 293 

variation by placing the ESVs in reading frame and reducing the frequency of identified indel 294 

errors.  295 

 296 

 297 

Results 298 

Quantification of package performance 299 

Simulated error data 300 

Debar was used to correct 10,000 barcodes, each with a single indel error (Supplementary File 301 

1). The denoised sequences and associated data were compared to the ground truth error 302 

locations to determine the accuracy of corrections applied by debar (Figure 3). For 9,459 303 

sequences (95.59%), a single correction was applied by debar, indicating that the package 304 

correctly identified the type of error in these sequences. However, debar either failed to 305 

recognize an indel or made too many corrections (2+) in the other 541 sequences. No correction 306 

was made for most (426) of these sequences, meaning that debar’s PHMM did not identify the 307 

indel error. The overlooked indels were largely restricted to the terminal regions of the sequence; 308 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


15 

 
75% (329/426) of them were positioned within 20 base pairs of the read termini (Figure 4), 309 

regions that only comprised 5% (40bp/650bp) of the sequences. The cause of this is that the 310 

debar denoising algorithm uses the first observation of 10 consecutive bp matching to the 311 

PHMM to establish the corrective window. Errors on the periphery of sequences therefore lead 312 

to trimming of the sequence (via the keep_flanks function) instead of indel correction. A 313 

substantial fraction of the remaining uncorrected indel errors (43) occurred between positions 314 

452 to 465 (Figure 4), a region associated with a 3bp indel present in some animal groups and 315 

absent in others. Its presence reduced the PHMM’s indel detection ability in this region due to 316 

greater true variability. Not all unidentified indels were retained in the final output sequences as 317 

double checks of debar (employing the keep_flanks and aa_check parameters) identified many 318 

(266/426 – 62%) of the uncorrected sequences and either omit the problem region or flag the 319 

sequence as likely to contain an error. Therefore, debar’s double checks allow many false 320 

negatives to be trimmed or flagged as problematic. 321 

 For 119 sequences (1.2%), two or more corrections were applied by debar when only a 322 

single indel existed (Figure 3). In contrast to the false negatives, debar’s double checks only 323 

captured three of the false positives. Many of the false corrections appeared to be the presence of 324 

indels near codons that are not present in all animals. Due to true biological variation in the 325 

training data, these regions of the PHMM have higher probabilities of transitioning from a match 326 

state to an insert or delete state, and therefore indels in these locations are sometimes handled 327 

incorrectly (i.e. the sequence is characterized as having two deleted base pairs, when there was a 328 

1bp insertion). Because false corrections of this type result in sequences that conform to the 329 

structure of the protein-coding gene region (i.e. a lack of stop codons in the amino acid 330 

sequence), they are not identified by debar’s aa_check function. 331 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


16 

 
 The 9,459 sequences for which the presence of a single indel was correctly identified 332 

were further analyzed to determine how accurately they were located (Figure 3). The analysis 333 

showed that debar was able to exactly locate and correct 5,847 (61.81% of sequences in single 334 

correction category) of the indel errors in the dataset. For the other 3,612 sequences (38.19% of 335 

the single corrections category), the indel corrections were not placed in exactly the correct 336 

position (Figure 5). For these sequences, the average distance between the true indel location and 337 

the applied correction was 2.31 base pairs (standard deviation = 1.9767).  338 

These results were used to select a default censorship value for debar to ensure that 339 

inexactly identified indel errors are masked in most sequences (Figure 1). A default censorship 340 

length of 7 (the average miss distance plus two times the standard deviation, rounded up) was 341 

selected in order to mask the true error in >95% of instances where inexact corrections were 342 

applied, thereby successfully denoising sequences, albeit with some associated loss of 343 

information in the sequences, which can be overcome by building a consensus sequence when 344 

multiple reads are available for an individual. 345 

Overall, denoising of the 10,000 barcodes with the default censorship parameter 346 

(censor_length = 7) resulted in 9,309/10,000 (93.09%) of sequences with errors being 347 

successfully denoised. The additional double check parameters (aa_check = True, keep_flanks = 348 

False) captured, but did not correct, 269 (2.69%) errors. The debar package thereby corrected or 349 

removed 95.74% of sequences with indel errors (Figure 3). 350 

 351 

False correction rate 352 

A set of 10,000 barcode sequences with no known indel errors was analyzed with debar to 353 

determine the incidence of erroneous corrections. Nearly all sequences (99.91%) were not altered 354 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


17 

 
nor flagged as erroneous. Nine sequences were erroneously corrected, and none were flagged for 355 

rejection. These sequences included a single sequence from each of five orders and four 356 

sequences from the order Diptera (flies). Interestingly, the four Diptera sequences that were 357 

incorrectly altered all belonged to the same genus: Culicoides. They represented 4/58 of all 358 

sequences from the family Ceratopogonidae that were in dataset, indicating that the performance 359 

issue was isolated to this single genus. 360 

 These results indicate that debar deals well with variation in COI sequences across most 361 

of the animal kingdom, but that it displays some taxonomic bias in performance. This is a 362 

limitation of debar, as any genus with a COI profile that systematically deviates from the COI 363 

PHMM used in debar will be erroneously denoised. The benefit of the conservative censorship 364 

approach used in the package is that although these reads are erroneously adjusted, the 365 

corrections made are masked by Ns, and the entire sequence is not rejected. Rather, only a small 366 

section of the sequences is lost, as if it were to contain an indel error. Most of any falsely 367 

corrected sequences can thereby be recovered, and in most instances, this would be sufficient to 368 

identify associated taxonomy and inform biological conclusions. 369 

 370 

Denoising PacBio Sequel data  371 

We applied debar in the analysis of real DNA barcode data by developing a processing pipeline 372 

(Figure 2 – hereafter ‘the debar pipeline’) and compared the amount of technical noise in the 373 

barcodes before and after processing. A set of 29,525 consensus barcode sequences derived from 374 

processing data from four Sequel runs were obtained from mBRAVE and were re-processed with 375 

the debar pipeline (Table 1). 376 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


18 

 
 Analysis of the consensus barcodes with coil (step ii. of the debar pipeline) flagged 3,495 377 

(11.8% of total) of consensus sequences due to the detection of a stop codon in the translated 378 

sequence or due to the presence of an unexpected amino acid (log likelihood score below the 379 

default threshold). The large number of flagged sequences is likely reflective of false positives 380 

(sequences flagged by coil that lack indel errors due to the incorrect establishment of reading 381 

frame). In fact, 2,418 sequences (8.1% of total, 69.2% of flagged sequences) were flagged due to 382 

the presence of a stop codon, and 1,282 of them (4.3% of total, 36.7% of flagged sequences) 383 

contained a stop codon in all three forward reading frames, providing extremely strong evidence 384 

of an indel error (i.e. a low likelihood of being a false positive). 385 

 After denoising, the output sequences were again assessed with coil (step viii. of the 386 

debar pipeline) and this analysis revealed that debar had corrected many indel errors (Table 1, 387 

Table 2). Only 1,123 (3.8%) of the final barcode sequences were flagged by coil’s coi5p_pipe 388 

function, suggesting that 66.8% (2,335) of the flagged sequences were successfully denoised. 389 

When comparison was restricted to the 2,418 sequences with stop codons, only 176 were still 390 

flagged as containing stop codons, indicating that 92.7% (2,242/2,418) of the sequences in this 391 

subcategory were effectively denoised. A more conservative estimate of correction success was 392 

provided by the subset of flagged sequences with stop codons in all reading frames. Of these 393 

sequences, 1106/1282 (86.27%) passed the coil check following denoising, suggesting the 394 

successful correction of an indel error and improved representation of the true sequence.  395 

External quantification of the debar pipeline’s denoising ability was obtained by the submission 396 

of pre- and post- pipeline barcode sequences to BOLD (http://www.boldsystems.org). The 397 

sample size for this test was smaller as BOLD requires taxonomic designations and this 398 

information was only provided by mBRAVE for 27,041 sequences. The total number of original 399 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


19 

 
sequences flagged by BOLD due to its detection of a stop codon was 1,515 (6.3%), a 400 

considerably lower frequency than reported by coil on the initial pipeline inputs. Of the 1,515 401 

sequences with initial evidence of stop codons, 14 were rejected outright by the debar pipeline, 402 

223 were flagged but not successfully corrected, 147 were unflagged and not corrected, and 403 

1,131 had no evidence of errors following denoising (Table 3). Based on this assessment with 404 

BOLD, the debar pipeline produced a 75% reduction in the number of errors in the dataset from 405 

6.3% (1,515) to 1.6% (384). Of the remaining 384 errors, the majority (223) were detected as 406 

problematic and flagged as erroneous by debar. As a consequence, the debar pipeline reduced the 407 

number of unidentified errors by >90% (from 1,515 to 147) in the barcode dataset (Table 3). 408 

 The denoising of the barcodes with the debar pipeline did not result in sequences with 409 

large amounts of missing information. Of the 29,525 output barcodes, 28,802 were high quality 410 

(<1% Ns), 11 were medium quality (<2% Ns), 498 were low quality (<4% Ns), and 214 were 411 

unreliable (>4% Ns). There was a strong negative relationship between the number of CCS 412 

available for a sample and the amount of missing information in the final barcode sequence 413 

(Figure 6).  414 

 415 

Denoising metabarcode data 416 

Consensus sequence quality 417 

Metabarcode data from a mock arthropod community were also denoised followed by 418 

comparison of original sequences to the denoised consensus sequences to determine if the debar 419 

improved sequence quality (Table 4). Of the original centroid sequences for the 398 BINs, 420 

125/398 (31.4%) contained evidence of indel errors when analyzed with coil. Following 421 

denoising and consensus sequence generation via debar, the number of barcode-compliant 422 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


20 

 
outputs was considerably higher with only 7/394 (1.8%) displaying evidence of indel errors. 423 

Four BINs had all their component sequences rejected by debar so no consensus sequences were 424 

generated. The rate of apparent indel errors was higher in the centroids of the 1255 OTUs; 681 425 

(54%) displayed evidence of a stop codon when analyzed with coil, suggesting the presence of 426 

indels in more than half of the sequences representing each OTU. The consensus sequences 427 

produced through denoising and consensus sequence generation with debar were of apparent 428 

higher quality as only 134 (10.6%) displayed evidence of a stop codon when analyzed with coil. 429 

An additional 31 OTUs (2.5%) failed to produce a valid consensus sequence after denoising 430 

because all their component sequences were rejected by debar. 431 

 The corrections did cause some loss of information; 46/394 (11.7%) of the consensus 432 

sequences for the BIN groups contained at least one ‘N’ due to ambiguous or censored base pairs 433 

in their component reads, and 861/1255 (68.6%) of the OTU consensus sequences contained at 434 

least one ‘N’. The number of ‘Ns’ per sequence was generally low for the BINs (median = 0; 12 435 

sequences with 14 or more ‘Ns’) but was higher for the OTUs (median number of ‘Ns’ = 15), 436 

indicating there was on average one correction per OTU (correction of an indel, plus the seven 437 

bp mask in either direction result in 14 (insertion) or 15 (deletion) consecutive ‘Ns’). There was 438 

a positive relationship between the number of sequences within an OTU and the completeness of 439 

information in the final consensus sequence.  440 

 441 

ESV data quality 442 

Data analysis on mBRAVE revealed 398 BINs represented by 123,926 unique 443 

dereplicated reads as well as 1255 OTUs lacking taxonomic assignment that were represented by 444 

2199 unique sequence reads. When original sequences were checked with coil, it indicated that 445 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


21 

 
61,351/123,926 (49.5%) of BIN sequences and 1310/2199 (59.97%) of the OTU sequences 446 

displayed strong evidence of an indel error as they contained a stop codon when translated. By 447 

contrast, following denoising with debar the incidence of stop codons was far lower as just 448 

2858/122,349 (2.3%) of the BIN sequences and 418/2,145 (19.49%) of the OTU sequences had 449 

evidence of indels. This result indicated that denoising of individual sequences reduced the 450 

incidence of apparent indel errors by over 95% for the BINs (58,593 fewer indel errors) and by 451 

68% for the OTUs (892 fewer indel errors). Most sequences were subjected to at least one indel 452 

correction by debar, with 85,298/122,349 (69.7%) of the final BIN sequences and 1387/2145 453 

(64.7%) of final OTU sequences containing at least one ‘N’ character. Low abundance OTUs in 454 

the data set represented by biologically valid sequences need not be discarded solely due to their 455 

low abundance and could be further inspected for putative evidence of rare community members. 456 

 457 

 458 

  459 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


22 

 
Discussion 460 

 461 

This manuscript introduces debar, a PHMM-based denoiser, and demonstrates how it can 462 

improve the quality of sequence data used for both DNA barcode library construction and for 463 

metabarcode studies by correcting indels introduced by sequencing error. We first evaluated its 464 

effectiveness through an in silico study that tested its capacity to recognize and repair reference 465 

barcodes with artificially introduced indels. Debar was shown to be effective, as it corrected 466 

>95.7% of the errors and applied erroneous adjustments to less than 0.01% of correct sequences. 467 

This strong performance extended to real-world data sets. Debar reduced the rate of frameshift 468 

indels by 75% in sequence records generated by the long-read Sequel platform, generating more 469 

barcode-compliant sequences, most with little or no missing information. Debar also improved 470 

the quality of metabarcode data generated by the ION S5 allowing for ESVs to be considered 471 

with higher confidence and for the recovery of higher-quality representative sequences for 472 

OTUs. 473 

 Denoising sequences with artificial errors and known ground truths showed that the 474 

corrections performed by debar were imperfect, with the exact indel location being identified 475 

only 61.8% of the time. The application of a default 7bp censorship on both sides of putative 476 

indel corrections proved to be an effective means of masking most errors, improving the 477 

denoiser’s error removal rate to >95.75%. This high error removal rate involves a tradeoff, as 478 

sequence adjustments are accompanied with a loss of 14 base pairs of information. This 479 

information loss is an acceptable cost, as it ensures that all remaining base pairs can be 480 

considered with high confidence. The nature of high-throughput sequence data, namely that there 481 

are usually multiple sequencing reads for a given specimen available, can help mitigate the loss 482 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


23 

 
of information. Corrected sequences from a specimen or OTU can be used in conjunction with 483 

one another, filling in the different censored locations and overcoming the loss of information. 484 

The censorship of bases adjacent to indel corrections is an optional parameter that users may 485 

alter to suit their needs. Smaller censorship values, or no censorship at all, would result in less 486 

loss of information per sequence, but would come at the cost of more errors remaining in the 487 

final data. 488 

 Denoising of real DNA barcode data obtained from sequencing of specimens on the 489 

Pacific Biosciences Sequel platform resulted in higher-quality output sequences. An exact metric 490 

quantifying the improvement is, however, difficult to state with certainty, as the ground truth of 491 

the sequences is not known. The independent tests of the sequences through submission of 492 

consensus sequences to BOLD before and after denoising provided a conservative estimate of 493 

the debar package’s effectiveness. Conservatively, this test showed a 75% reduction in the 494 

number of barcode sequences with technical indel errors after application of the debar pipeline 495 

and a low false negative rate (147 unidentified errors out of 1,515 total putative errors). This is 496 

an important improvement because the Pacific Biosciences Sequel platform is used at the Centre 497 

for Biodiversity Genomics to produce high-quality reference barcodes for the barcoding research 498 

community (Hebert et al. 2018). Accuracy of these sequences is therefore important; the debar 499 

package is shown to improve sequence quality, yielding more biologically likely and therefore 500 

reliable outputs. The generation of barcode sequences is also made more efficient. By increasing 501 

the rate of barcode-compliant outputs from 93.7% to 98%, fewer samples require reprocessing or 502 

resequencing.  503 

 Understanding within-species patterns of genetic diversity is an essential metric for 504 

characterizing community health. High intra-species genetic diversity is assumed to indicate 505 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


24 

 
healthy ecosystems, comprised of large and stable populations with the standing genetic 506 

variation needed to survive environmental stressors (Zizka et al. 2020). The characterization of 507 

ESVs within OTUs can provide intra-species diversity measures for member species of a 508 

community (Frøslev et al. 2017). The initial check of the sub-OTU sequence data from the mock 509 

community sequenced with IonTorrent revealed a high rate of putative indel errors (54% of 510 

sequences), which would lead to a gross over estimation of the number of ESVs within the 511 

OTUs. The reduction of the error rate after denoising with debar allows for a more accurate 512 

examination of intra-OTU ESVs and therefore allows for more accurate assessments of intra-513 

species diversity and community health, despite the fact that debar is not capable of eliminating 514 

non-indel errors from sequences. Even with the improvements to ESV quality by debar, intra-515 

species diversity estimates will likely remain inflated to some extent, as the sequence-by-516 

sequence corrections applied by debar exclusively account for indel errors while substitution 517 

errors could persist within the data.  518 

 We have demonstrated that debar is an effective means of reducing technical errors in 519 

DNA barcode and metabarcode data, but the package is not without limitations. The package is 520 

designed to correct insertion and deletion errors, but these are not the only technical issues that 521 

can lead to inflated biodiversity estimates. The program is not an effective means of identifying 522 

or correcting chimeric sequences or non-animal COI biological contaminants and should these 523 

exist within an input data set they are likely to go uncorrected. Additionally, debar does not have 524 

the ability to correct substitution errors on a sequence-by-sequence basis. Because of indel 525 

correction, denoised sequences are aligned, and nucleotide positions become directly comparable 526 

across different sequences from a given specimen or OTU. Random point substitution errors can 527 

thereby be corrected in consensus sequence generation, through the ‘majority rule’ approach 528 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


25 

 
debar uses in base calling. However, if systematic errors exist (i.e. most sequences possess the 529 

same substitution), few sequences are available for consensus sequence generation, or ESVs are 530 

being examined, then substitution errors may persist in the data. An additional source of error 531 

unaccounted for by debar is contaminant sequences. It has been demonstrated previously that the 532 

PHMM utilized in debar is not an effective means of separating animal barcode sequences from 533 

off-target barcodes derived from bacteria, plant, fungi, or other origins (Nugent et al. 2020). 534 

Taken together, these limitations show that debar cannot single handedly address the technical 535 

challenges associated with DNA barcoding. The tool is likely most effective when applied in 536 

conjunction with existing barcode and metabarcode workflows and improves the quality of final 537 

sequences if the inputs have been filtered based on quality, had primers removed, and been 538 

cleaned of chimeric and contaminant sequences. The sequence-by-sequence denoising approach 539 

of debar means that it is a flexible tool capable of integrating into analysis pipelines for 540 

sequencing data from various sources. Application of debar in tandem with conventional, 541 

clustering-based denoising tools would likely lead to the highest quality assessment of 542 

biodiversity. Following OTU generation with other tools, using debar to denoise all reads within 543 

a given OTU prior to consensus sequence generation would maximize accuracy of the  consensus 544 

sequence while conforming to the conserved structure of the COI barcode region. The removal 545 

of intra-OTU noise can also improve the accuracy of alpha-diversity estimates. Additionally, 546 

application of debar in the denoising of rare, low-abundance sequences not present in the OTUs 547 

would allow these data to be further examined with higher confidence, revealing biological 548 

insights that would be overlooked in conventional workflows. 549 

 The PHMM denoising technique used by debar is an effective barcode-focused 550 

framework that can be extended to fit a variety of needs. Data from only two sequencing 551 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


26 

 
platforms were tested in this study: the Pacific Biosciences Sequel and Thermo IonTorrent S5. 552 

Since the PHMM used in debar is barcode specific and not sequencer specific, debar can be 553 

effectively applied in denoising of barcode data obtained from any sequencing platform. 554 

However, the effectiveness of the denoiser will depend on the types and rates of technical errors 555 

associated with a given platform. When applied to data from sequencers such as the Illumina 556 

MiSeq, the rate of technical errors corrected by debar will be lower, as this platform is more 557 

prone to introduction of substitution, as opposed to indel, errors (Schirmer et al. 2015). Although 558 

the debar package contains a PHMM for only the common animal barcode COI, the denoising 559 

algorithm can in the future be extended and applied in the correction of data for other DNA 560 

barcodes with conserved structures. 561 

 562 

Conclusion 563 

This study has described debar, an R package for denoising DNA barcode data, and 564 

demonstrated its ability to correct indels in both barcode and metabarcode sequences due to 565 

instrument error. In each dataset, debar improved sequence quality. It reduced the apparent 566 

number of indels by 75% in data generated by Sequel, increasing the proportion of sequences 567 

that met the quality standards required to qualify as a reference barcode. The merits of debar for 568 

metabarcode analysis were twofold, allowing more likely consensus sequences to be obtained for 569 

OTUs, and for intra-OTU variation to be quantified with higher confidence. Overall, debar is a 570 

robust utility for identifying deviations from the highly conserved protein-coding sequence of the 571 

COI barcode region. Corrections informed by its use improve the separation of true biological 572 

variation from technical noise, with low frequencies of false corrections. Integration of debar 573 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


27 

 
into the workflows for processing barcode and metabarcode data will allow biological variation 574 

to be characterized with higher accuracy.  575 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


28 

 
Acknowledgements 576 

 577 

This research was supported by grants from Genome Canada through Ontario Genomics and 578 

from the Ontario Ministry of Economic Development, Job Creation and Trade. The funders 579 

played no role in study design or decision to publish. This research was enabled in part by 580 

resources provided by Compute Canada (www.computecanada.ca). We thank Tony Kuo and 581 

Thomas Braukmann for aid with data acquisition and interpretation and Tony for helpful 582 

comments on the manuscript. 583 

  584 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

http://www.computecanada.ca/
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


29 

 
References 585 

 586 

Amir, A., McDonald, D., Navas-Molina, J. A., Kopylova, E., Morton, J. T., Xu, Z. Z., ... & 587 

Knight, R. (2017). Deblur rapidly resolves single-nucleotide community sequence 588 

patterns. MSystems, 2(2). 589 

 590 

Baynham-Herd, Z., Amano, T., Sutherland, W. J., & Donald, P. F. (2018). Governance explains 591 

variation in national responses to the biodiversity crisis. Environmental 592 

Conservation, 45(4), 407-418. 593 

 594 

Braukmann, T. W., Ivanova, N. V., Prosser, S. W., Elbrecht, V., Steinke, D., Ratnasingham, S., ... 595 

& Hebert, P. D. N. (2019). Metabarcoding a diverse arthropod mock 596 

community. Molecular Ecology Resources, 19(3), 711-727. 597 

 598 

Brown E.A., Chain, F. J., Zhan, A., MacIsaac, H. J., & Cristescu, M. E. (2016). Early detection 599 

of aquatic invaders using metabarcoding reveals a high number of non‐indigenous 600 

species in Canadian ports. Diversity and Distributions, 22(10), 1045-1059. 601 

 602 

Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. 603 

(2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature 604 

Methods, 13(7), 581. 605 

 606 

Clare, E. L., Chain, F. J., Littlefair, J. E., & Cristescu, M. E. (2016). The effects of parameter 607 

choice on defining molecular operational taxonomic units and resulting ecological 608 

analyses of metabarcoding data. Genome, 59(11), 981-990. 609 

 610 

Cordier, T., Lanzén, A., Apothéloz-Perret-Gentil, L., Stoeck, T., & Pawlowski, J. (2019). 611 

Embracing environmental genomics and machine learning for routine 612 

biomonitoring. Trends in Microbiology, 27(5), 387-397. 613 

 614 

Cristescu, M. E. (2014). From barcoding single individuals to metabarcoding biological 615 

communities: towards an integrative approach to the study of global biodiversity. Trends 616 

in Ecology & Evolution, 29(10), 566-571. 617 

 618 

Delabye, S., Rougerie, R., Bayendi, S., Andeime-Eyene, M., Zakharov, E. V., deWaard, J. R., ... 619 

& Mavoungou, J. F. (2019). Characterization and comparison of poorly known moth 620 

communities through DNA barcoding in two Afrotropical environments in 621 

Gabon. Genome, 62(3), 96-107. 622 

 623 

Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: 624 

probabilistic models of proteins and nucleic acids. Cambridge University Press. 625 

 626 

Driscoll, D. A., Bland, L. M., Bryan, B. A., Newsome, T. M., Nicholson, E., Ritchie, E. G., & 627 

Doherty, T. S. (2018). A biodiversity-crisis hierarchy to evaluate and refine conservation 628 

indicators. Nature Ecology & Evolution, 2(5), 775-781. 629 

 630 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


30 

 
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics (Oxford, England), 14(9), 631 

755-763. 632 

 633 

Eddy, S. R. (2009). A new generation of homology search tools based on probabilistic inference. 634 

In Genome Informatics 2009: Genome Informatics Series Vol. 23 (pp. 205-211). 635 

 636 

Edgar, R. C. (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon 637 

sequencing. BioRxiv, 081257 638 

 639 

Elbrecht, V., Vamos, E. E., Steinke, D., & Leese, F. (2018). Estimating intraspecific genetic 640 

diversity from community DNA Metabarcoding Data. PeerJ, 6, e4644. 641 

 642 

Folmer, O., Black M., Hoeh W., Lutz R, Vrijenhoek, R. (1994). DNA primers for amplification 643 

of mitochondrial cytochrome c oxidase subunit I from diverse metazoan 644 

invertebrates. Mol Mar Biol Biotechnol, 3(5), 294-9. 645 

 646 

Frøslev, T. G., Kjøller, R., Bruun, H. H., Ejrnæs, R., Brunbjerg, A. K., Pietroni, C., & Hansen, A. 647 

J. (2017). Algorithm for post-clustering curation of DNA amplicon data yields reliable 648 

biodiversity estimates. Nature Communications, 8(1), 1-11. 649 

 650 

Hajibabaei, M., Spall, J. L., Shokralla, S., & van Konynenburg, S. (2012). Assessing biodiversity 651 

of a freshwater benthic macroinvertebrate community through non-destructive 652 

environmental barcoding of DNA from preservative ethanol. BMC Ecology, 12(1), 28. 653 

 654 

Hajibabaei, M., Baird, D. J., Fahner, N. A., Beiko, R., & Golding, G. B. (2016). A new way to 655 

contemplate Darwin’s tangled bank: how DNA barcodes are reconnecting biodiversity 656 

science and biomonitoring. Philosophical Transactions of the Royal Society B: Biological 657 

Sciences, 371(1702), 20150330. 658 

 659 

Hebert, P. D. N., Cywinska, A., Ball, S. L., & Dewaard, J. R. (2003). Biological identifications 660 

through DNA barcodes. Proceedings of the Royal Society of London. Series B: 661 

Biological Sciences, 270(1512), 313-321. 662 

 663 

Hebert, P. D. N., Ratnasingham, S., Zakharov, E. V., Telfer, A. C., Levesque-Beaudin, V., Milton, 664 

M. A., ... & DeWaard, J. R. (2016). Counting animal species with DNA barcodes: 665 

Canadian insects. Philosophical Transactions of the Royal Society B: Biological 666 

Sciences, 371(1702), 20150333. 667 

 668 

Hebert, P. D. N., Braukmann, T. W., Prosser, S. W., Ratnasingham, S., DeWaard, J. R., Ivanova, 669 

N. V., ... & Zakharov, E. V. (2018). A Sequel to Sanger: amplicon sequencing that 670 

scales. BMC Genomics, 19(1), 219. 671 

 672 

Hubert, N., & Hanner, R. (2015). DNA barcoding, species delineation and taxonomy: a historical 673 

perspective. DNA Barcodes, 3(1), 44-58. 674 

 675 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


31 

 
Kaunisto, K. M., Roslin, T., Sääksjärvi, I. E., & Vesterinen, E. J. (2017). Pellets of proof: First 676 

glimpse of the dietary composition of adult odonates as revealed by metabarcoding of 677 

feces. Ecology and Evolution, 7(20), 8588-8598. 678 

 679 

Kumar, V., Vollbrecht, T., Chernyshev, M., Mohan, S., Hanst, B., Bavafa, N., ... & Golden, M. 680 

(2019). Long-read amplicon denoising. Nucleic Acids Research, 47(18), e104-e104. 681 

 682 

Lopez-Vaamonde, C., Sire, L., Rasmussen, B., Rougerie, R., Wieser, C., Allaoui, A. A., ... & 683 

Lees, D. C. (2019). DNA barcodes reveal deeply neglected diversity and numerous 684 

invasions of micromoths in Madagascar. Genome, 62(3), 108-121. 685 

 686 

Nearing, J. T., Douglas, G. M., Comeau, A. M., & Langille, M. G. (2018). Denoising the 687 

Denoisers: an independent evaluation of microbiome sequence error-correction 688 

approaches. PeerJ, 6, e5364. 689 

 690 

Nugent, C. M., Elliott, T. A., Ratnasingham, S., & Adamowicz, S. J. (2020). Coil: an R package 691 

for cytochrome C oxidase I (COI) DNA barcode data cleaning, translation, and error 692 

evaluation. Genome. 63(6):291-305. 693 

 694 

Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-based registry for all animal species: the 695 

Barcode Index Number (BIN) system. PloS One, 8(7). 696 

 697 

Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., & Sokhansanj, B. (2008). Metagenome 698 

Fragment Classification Using 𝑁-Mer Frequency Profiles. Advances in 699 
Bioinformatics, 2008. 700 

 701 

Schirmer, M., Ijaz, U. Z., D’Amore, R., Hall, N., Sloan, W. T., & Quince, C. (2015). Insight into 702 

biases and sequencing errors for amplicon sequencing with the Illumina MiSeq 703 

platform. Nucleic Acids Research, 43(6), e37-e37. 704 

 705 

Sogin, M. L., Morrison, H. G., Huber, J. A., Welch, D. M., Huse, S. M., Neal, P. R., … & Herndl, 706 

G. J. (2006). Microbial diversity in the deep sea and the underexplored “rare 707 

biosphere”. Proceedings of the National Academy of Sciences, 103(32), 12115-12120. 708 

 709 

Stat, M., Huggett, M. J., Bernasconi, R., DiBattista, J. D., Berry, T. E., Newman, S. J., ... & 710 

Bunce, M. (2017). Ecosystem biomonitoring with eDNA: metabarcoding across the tree 711 

of life in a tropical marine environment. Scientific Reports, 7(1), 1-11. 712 

 713 

Taberlet, P., Coissac, E., Hajibabaei, M., & Rieseberg, L. H. (2012). Environmental 714 

DNA. Molecular Ecology, 21(8), 1789-1793. 715 

 716 

Wilkinson SP. (2018) kmer: an R package for fast alignment-free clustering of biological 717 

sequences. R package version 1.0.0. https://cran.r-project.org/package=kmer 718 

 719 

Wilkinson, S. P. (2019). aphid: an R package for analysis with profile hidden Markov 720 

models. Bioinformatics, 35(19), 3829-3830. 721 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://cran.r-project.org/package=kmer
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


32 

 
 722 

Wilson, J. J., Brandon-Mong, G. J., Gan, H. M., & Sing, K. W. (2019). High-throughput 723 

terrestrial biodiversity assessments: mitochondrial metabarcoding, metagenomics or 724 

metatranscriptomics?. Mitochondrial DNA Part A, 30(1), 60-67. 725 

 726 

Wirta, H. K., Hebert, P. D. N., Kaartinen, R., Prosser, S. W., Várkonyi, G., & Roslin, T. (2014). 727 

Complementary molecular information changes our perception of food web 728 

structure. Proceedings of the National Academy of Sciences, 111(5), 1885-1890. 729 

 730 

Zizka, V. M., Weiss, M., & Leese, F. (2020). Can metabarcoding resolve intraspecific genetic 731 

diversity changes to environmental stressors? A test case using river 732 

macrozoobenthos. BioRxiv. 733 

 734 

  735 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


33 

 
 736 

Data Accessibility Statement 737 

 738 

DNA barcode sequences used in training of the Profile Hidden Markov Models are available in 739 

the Supplementary data of the following paper: https://doi.org/10.1139/gen-2019-0206. DNA 740 

barcode sequences used in model testing are available in this manuscript’s Supplementary files. 741 

The R source code for the debar package is available on GitHub: 742 

https://github.com/CNuge/debar. Additional data and code available on request from the authors. 743 

 744 

 745 

Author Contributions 746 

 747 

The study was conceived and designed by SJA, PDNH, SR, and CMN. The programming of the 748 

debar package was performed by CMN. Analyses of package performance were performed by 749 

CMN with resources, design, and other assistance provided by TAE, SR, and SJA. The initial 750 

draft of the manuscript was written by CMN and SJA. All authors contributed to the editing of 751 

the manuscript. 752 

  753 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1139/gen-2019-0206
https://github.com/CNuge/debar
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


34 

 
Tables and Figures 754 

 755 

Table 1. Summary of the results for the 29,525 barcode sequences (produced from PacBio 756 

Sequel data analyzed using the mBRAVE platform) after processing with the debar pipeline.  757 

 758 

PacBio 

Sequel run 
Run 1 Run 2 Run 3 Run 4 Total 

Consensus 

sequences 

generated 

7,518 7,373 7,235 7,399 29,525 

Consensus 

sequences 

flagged by coil 

for indel error 

869 817 900 909 3,495 (11.8%) 

Rejected by 

debar denoising 
8 4 16 9 37 (0.1%) 

Sequences 

flagged by coil 

post-denoising 

256 285 305 277 1,123 (3.8%) 

Sequences 

corrected 
605 528 579 623 

2,335 (66.8% of 

flagged 

sequences) 

 759 

  760 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


35 

 
Table 2. Assessment of the correction ability of the debar pipeline for the subset of sequences in 761 

the high-confidence error set. This set of sequences was flagged by coil and produced a stop 762 

codon when translated within all reading frames. The top half of the table indicates the number 763 

of sequences flagged by coil as likely to be erroneous, based on the log likelihood values of the 764 

sequences. Results are shown for sequences both before and after the denoising process. The 765 

bottom half of the table contains the number of sequences flagged by coil as likely to be 766 

erroneous, based on the presence of a stop codon in the amino acid sequence resulting from the 767 

censored translation of the framed nucleotide sequence. This high success for the stop-codon 768 

metric (86.3% of errors removed) indicates that the pipeline is an effective means of correcting 769 

frameshift-causing insertion or deletion errors. The relatively lower success at correcting 770 

sequences with low log likelihood values suggests that frameshift-causing errors are not the only 771 

set of errors being flagged by coil, and that non-frameshift errors are not effectively corrected by 772 

the debar pipeline. 773 

 774 

PacBio 

Sequel run 
Run 1 Run 2 Run 3 Run 4 Total 

Original 

flagged 
551 547 609 610 2,317 

Flagged post-

denoising 
254 280 300 271 1,105 

Corrected 53.9% 48.8% 50.7% 55.6% 52.3% 

PacBio 

Sequel run 
Run 1 Run 2 Run 3 Run 4 Total 

Original stop 

codon 
319 295 318 350 1,212 

Stop codon 

post-

denoising 

43 42 36 55 176 

Corrected 

stop codons  
86.5% 85.7% 88.7% 84.2% 86.3% 

 775 

 776 

 777 

 778 

 779 

 780 

  781 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


36 

 
Table 3. Result of the BOLD Data System evaluation of debar denoising workflow’s 782 

effectiveness. The number of sequences identified by BOLD as containing stop codons, before 783 

and after processing with the denoising pipeline (Figure 2). Only the 27,041 specimens with 784 

barcodes and taxonomic information produced through the processing of PacBio Sequel data on 785 

the MBRAVE platform were considered, as BOLD requires taxonomic information for assessing 786 

the presence of stop codons. The rows break the sequences down into categories, which indicate 787 

the source of the post-denoising sequence that was submitted to BOLD for assessment. 788 

  789 

Sequence 

category 

Total Sequence 

count 

Stop codon count Percent error 

reduction Original Post-denoising 

Unaltered 23,992 88 88† - 

Denoised, 

altered 
2,265 1,190 59† 95% 

Flagged for 

potential error, 

unaltered 

701 223 223* - 

Flagged and 

rejected 
16 14 14 - 

Labelled as 

Wolbachia by 

MBRAVE 

67 0 0 - 

Total 27,041 1,515 (6.3%) 384 (1.6%) 74.66% 

Total, non-

flagged only 
26,257 1,278 (4.8%) 147 (0.6%) 88.5% 

† The sum of these categories (shown in the final row of the column) represents the false 790 

negative rate for the denoising pipeline. These are the 0.6% (147/27,041) of sequences that 791 

appear to contain true stop codons that were not flagged for denoising, or that were denoised 792 

unsuccessfully and not flagged as potential errors.  793 

* The false positive rate of the denoising pipeline is the number of sequences in this category 794 

that do not in fact contain a stop codon. There is a total of 478 (701-223) false positives and an 795 

overall false positive rate of 1.8% (478/27,041). Since this set of sequences are flagged for 796 

potential errors, as opposed to being outright rejected, additional inspection of sequences in this 797 

category can separate the unsuccessfully denoised sequences with true errors from those that do 798 

not contain an error. 799 

  800 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


37 

 
Table 4. Assessment of the sequence quality for data from a mock community of arthropods 801 

sequenced in bulk using a Thermo Fisher Ion Torrent and processed on the mBRAVE platform. 802 

Sequencing and processing results in two sets of data, groups of sequences assigned to BINs and 803 

groups of sequences clustered into OTUs. The representative sequences (centroids before 804 

denoising, consensus after denoising) and all individual sequences were checked with the R 805 

package coil for evidence of frameshifts (stop codons in amino acid sequence) before and after 806 

denoising to see if processing the data with the debar package resulted in higher quality barcode 807 

sequences. 808 

 809 

  Original After debar denoising 

Sequences 

analyzed 

Sequence 

data source 
Total count 

Stop codon 

count 

Total 

count 

Stop codon 

count 

Representative 

sequences 

Assigned to 

BINs 
398 125 (31.4%) 394 7 (1.8%) 

OTUs 1,255 681 (54%) 1,224 134 (10.6%) 

ESVs 

Assigned to 

BINs 
123,926 61351 (49.5%) 122,349 2858 (2.3%) 

OTUs 2,199 1310 (59.57%) 2,145 418 (19.49%) 

 810 

 811 

 812 

 813 

 814 

 815 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


38 

 
 816 

 817 
Figure 1. Diagram demonstrating the debar package’s denoising workflow. Blue indicates 818 

nucleotides that are part of the barcode region and orange nucleotides in bold font indicate 819 

technical errors or sequence from outside of the barcode region.  820 

A. The debar package operates on a sequence-by-sequence basis, taking each input and 821 

constructing a custom DNAseq object. A DNAseq object can receive a DNA sequence, an 822 

identifier, and optionally a sequence of corresponding PHRED quality scores. Although not 823 

utilized in the denoising, indel-correcting adjustments to the sequence are applied to the PHRED 824 

scores as well, so that quality information can be carried from input to output.  825 

B. Following DNAseq object construction, the sequence is compared to the PHMM using the 826 

Viterbi algorithm. By default, the full length (657bp) COI-5P PHMM contained in debar is used 827 

to evaluate the sequence. When required, a user may pass a custom PHMM corresponding to a 828 

subsection of the COI-5P region (specified using the coil package’s subsetPHMM function) or a 829 

custom PHMM trained on user-defined data (Wilkinson 2019). The frame function isolates the 830 

correction window, which is the section of the sequence matching the PHMM (the first 10 831 

consecutive base pairs matching to the PHMM on the leading and trailing edges of the sequence 832 

establish the section of the input on which subsequent corrections are applied). 833 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


39 

 
C. The adjust function traverses the section of the sequence and Viterbi path defined by the 834 

frame function. When evidence of an inserted base pair (‘2’ label in the Viterbi path) is 835 

encountered, the corresponding base pair is removed. When evidence of a deleted base pair is 836 

encountered (a ‘0’ label in the Viterbi path) a placeholder ‘N’ nucleotide is inserted. Exceptions 837 

are made for triple inserts or triple deletes (three consecutive ‘0’ or ‘2’ labels), which are skipped 838 

by the adjustment algorithm, as they are indicative of mutations that would not have a large 839 

impact on the structure of the protein-coding gene region and could reflect biological amino acid 840 

indels. The total number of adjustments made by debar is limited by the parameter ‘adjust_limit’ 841 

(default = 5), sequences requiring adjustments in excess of this number are flagged for rejection, 842 

as this high frequency of indels is likely not the result of technical error, but rather other sources 843 

of noise such as pseudogenes. Following adjustment, a mask of placeholder ‘N’ nucleotides is 844 

applied to base pairs flanking the corrected indel (default is 7bp in each direction, see Figure 3. 845 

For derivation of default). Masking of 7bp flanks adjacent to each correction allows imprecise 846 

corrections to effectively correct sequence length and also mask true indel locations in the 847 

majority of instances. 848 

D. Following adjustment, the denoised sequences are output by debar. By default, the outputs 849 

will include trailing sequence outside of the correction window. Leading information outside of 850 

the correction window is dropped, so that sequences are aligned with a common starting position. 851 

A user can choose to keep only the correction window, or have both flanking regions appended 852 

back on to the sequence output. 853 

E. If multiple denoised sequences are available (for either a given specimen in the case of 854 

barcoding or a given OTU in metabarcoding) then the consensus of the denoised sequences can 855 

be taken. The consensus function assumes the sequences have been denoised and their left flanks 856 

removed; as a result, they are aligned to one another. The modal base pair for each position is 857 

then taken to generate a consensus sequence, and in the case of ties, a placeholder “N” character 858 

is added to the consensus. 859 

 860 

 861 

 862 

 863 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


40 

 
 864 
Figure 2. Diagram of the denoising workflow used to improve the quality of barcodes produced 865 

by processing Pacific Biosciences Sequel circular consensus data on the mBRAVE platform. (i) 866 

Pacific Biosciences Sequel data are processed on the mBRAVE platform, and an initial set of 867 

barcode sequences is produced. (ii) The set of consensus barcode sequences produced by the 868 

mBRAVE platform are obtained and analyzed with the coil package, using the ‘coi5p_pipe’ 869 

function (default parameters). Sequences displaying evidence of an indel (either the presence of a 870 

stop codon when translated to amino acids or an amino acid sequence with a low likelihood 871 

score) are retained for further denoising. (iii) For each barcode with evidence of an error, all 872 

component CCS reads (and associated metadata) derived from the given specimen are obtained 873 

from mBRAVE. (iv) Based on the mBRAVE metadata, sequences are trimmed to remove 874 

primers, MID tags, and adapter sequence. The reverse complement of reads are taken when 875 

required. (v) The ‘denoise_list’ function of debar is used to denoise all CCS reads (options: 876 

dir_check = FALSE, keep_flanks = ‘right’, censor_length = 7). Rejected reads (those flagged by 877 

the denoise_list function) are removed from the dataset. (vi) For each specimen, the reads are 878 

clustered into OTUs using the R package kmer (clustering threshold = 0.975). This is done to 879 

mitigate the influence of outlier CCS or contaminant sequences. (vii) For each OTU, a consensus 880 

sequence is generated using debar’s ‘consensus’ function. For each specimen, OTUs are ranked 881 

based on the number of component CCS reads they contain. (vii) The consensus sequences are 882 

reassessed with coil. If the top-ranked consensus sequence now passes the coil check, it is 883 

deemed to have been successfully denoised, and it is selected as the output barcode. If not, the 884 

check is repeated for the second-ranked consensus sequence (when available), and this output is 885 

retained if it is barcode compliant. If neither the first nor second highest ranked consensus 886 

sequence passes the coil check, then the original (pre-denoising process) barcode is retained, as 887 

no meaningful improvement was made. In this situation the barcode is flagged as likely to 888 

contain an error. 889 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


41 

 
890 
Figure 3. The debar package’s denoising of 10,000 COI sequences containing single 891 

insertion or deletion errors. So that exact error positions were known, errors were artificially 892 

introduced in accordance with known probabilities for COI DNA barcode data from the 893 

PacBio Sequel platform (Hebert et al. 2018). Denoising was accomplished through altering 894 

sequences in accordance with the Viterbi path yielded by comparison to the PHMM. The 895 

correct number of adjustments was made for 9,455 sequences, and 61.8% of these corrections 896 

located the indel exactly. Masking of 7bp flanks adjacent to each correction allowed 897 

imprecise corrections to correct sequence length and mask the true indel location 96% of the 898 

time. For the 545 instances where an incorrect number of adjustments were made, 269 were 899 

caught through query of the amino acid sequence for stop codons and the trimming of 900 

spurious matches at the edge of sequences. Overall, 95.74% of errors were effectively 901 

corrected or identified as erroneous. 902 

  903 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


42 

 
 904 

 905 
Figure 4. Histogram indicating the position in the COI-5P region of the 426 uncorrected indel 906 

errors from the 10,000-sequence artificial error dataset. The x axis indicates the base pair 907 

position in the COI-5P profile, and the y axis displays the number of sequences that contained an 908 

uncorrected error at the given range of positions (bins of 10 base pair positions). 909 

  910 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


43 

 
911 
  912 

Figure 5. Histogram showing number of base pairs between inexact corrections applied by debar 913 

and the ground truth error location for the given sequence. In total 3,612 sequences (36.12%) had 914 

errors that were denoised inexactly, and corrections were an average of 2.31 bp (sd = 1.9767) 915 

away from the exact ground truth error location. 916 

  917 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


44 

 
 918 
 919 

Figure 6. Relationship between the amount of missing data in the final denoised barcode 920 

sequences (number of Ns divided by the total length of the sequence) and the number of CCS 921 

reads that contributed to the generation of the barcode. The figure displays only the 1,008 922 

denoised barcode sequences submitted to BOLD that contained at least one “N” (the remaining 923 

28,517 barcode sequences in the BOLD submission did not contain an “N”). 924 

 925 

 926 

  927 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/


45 

 
Supplementary Information  928 

 929 

Supplementary File 1 ('S1-single_errors_in_10k_sequences.csv') The 10,000 COI barcode 930 

sequences with single introduced indel errors that were used to test debar and calibrate the 931 

default parameters. 932 

  933 

Supplementary File 2 ('S2-control_denoising_no_errors.csv') The 10,000 COI barcode 934 

sequences with no known indel errors used to assess the false correction rate of debar 935 

 936 

Supplementary File 3 ('S3-single_file_pipeline') Scripts and example data for the denoising 937 

pipeline developed to process COI DNA barcode sequence data produced using the Pacific 938 

BioSciences Sequel sequencer and mBRAVE platform 939 

  940 

Supplementary File 4 Scripts and example data for the denoising pipeline developed to process 941 

COI DNA metabarcode sequence data produced using the IonTorrent S5 sequencer and the 942 

mBRAVE platform 943 

 944 

Supplementary File 5 Vignette demonstrating the functionality of the debar package. The 945 

vignette is also available as part of the R package 946 

(https://github.com/CNuge/debar/tree/master/vignettes) 947 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint 

https://github.com/CNuge/debar/tree/master/vignettes
https://doi.org/10.1101/2021.01.04.425285
http://creativecommons.org/licenses/by-nc/4.0/