Improving variant calling using population data and deep learning Improving variant calling using population data and deep learning Nae-Chyun Chen1, ‡,∗, Alexey Kolesnikov2, Sidharth Goel2, Taedong Yun2, Pi-Chuan Chang2, †, and Andrew Carroll2, †,∗ 1Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA 2Google Health, Palo Alto, CA 94304 and Cambridge, MA 02142, USA corresponding author: cnaechy1@jhu.edu; awcarroll@google.com †These authors contributed equally to this work. ‡Work performed while an intern at Google Health. January 6, 2021 Abstract Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filter- ing which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with differ- ent ancestry from the training data even when the ancestry is also excluded from the reference panel. 1 Background Variant calling [1–3] identifies the positions in an individual genome which differ from a reference or population, and is used to characterize a single sample or build large research cohorts [4, 5]. Variant calling is non-trivial, because of sequencing errors, systematic errors in mapping to repetitive and variable regions [6], and imbalanced sampling of alleles needed to identify a heterozygous variant from a homozygous one. 1 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 Variant calling can be improved by jointly genotyping multiple samples together [7– 9], but the raw sequence data for a cohort is not always available, and this process is computationally expensive. Instead, large-scale reference panels from a wide range of populations can provide similar information [4, 5]. Recent studies use such information to improve alignment accuracy and reduce biases in alignment [10–12], but there has been little work to incorporate population data with variant calling. Because far more variants are transmitted than arise de novo, real variants in a pop- ulation tend to recur at various frequencies [13], while false positives are often either not seen elsewhere in a population, or are seen with a consistent signature [14]. Researchers use this knowledge to filter variant calls, often with rules which lose recall for a gain in precision [15]. More sophisticated machine-learning methods to filter are used in larger cohorts, such as gnomAD, but these also trade recall for precision and also only operate on variant calls and summary information [4]. We reason that including population-level information at an earlier stage in variant calling, when the full read-level data is available, might allow for more effective use of population data. To do this, we adapted DeepVariant [2], which represents BAM infor- mation as a multi-dimensional pileup and uses a Convolutional Neural Network (CNN) to call variants. Because DeepVariant learns the features important for variant classifica- tion directly from the data, it allows us to feed in the population allele information as an additional channel. We trained population-aware models and compared them with the default DeepVari- ant v1.1 models which are agnostic of population information. The population-aware approach reduces the number of errors for all tested datasets, including WGS and WES reads, when using the allele frequencies from 1000Genomes. It also shows stronger error reduction efficacy for lower-coverage read sets. While traditional filtering approaches will increase precision at the expense of recall, we observe improvements to both precision and recall with this method. When incorporating population data, it is also important for fairness and equity to understand how it changes the accuracy of methods for individuals with ancestries out- side of those used in the development of the population resources. It is known that many genomic databases have collected more data for the European population than others [16–18]. We demonstrate that even using frequencies from a genetically distinct popula- tion, the population-aware model still performs similarly as the baseline. We find that a reference panel consisting of all ancestries in the 1000 Genomes Project (1000Genomes) outperforms a reference panel with only one of the 1000Genomes population groups, even when that population matches the sample being called. This implies that maximizing the diversity of ancestries in population resources has the potential to improve variant calling for all populations. The Genome in a Bottle (GIAB) truth sets used to train DeepVariant are from Eu- ropean, Ashkenazi, and Asian ancestry. To assess whether the addition of the refer- ence panel information improves variant calling for populations outside of the popula- 2 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 tions represented in training, we use high quality PacBio HiFi [19] data from the Human Genome Structural Variation Consortium for an individual of Puerto Rican ancestry as an evaluation set. We show that an Illumina model using the reference panel has superior concordance with the highly accurate PacBio HiFi variant calls compared to an Illumina model without the reference panel. 2 Results 2.1 Population information improves DeepVariant performance DeepVariant converts input from a BAM file into a pileup image with 6 channels, repre- senting 1) bases, 2) base qualities, 3) mapping quality, 4) strand, 5) supports variant, and 6) base differs from reference. We modified DeepVariant v1.1 to take an additional input channel, the allele-frequency (AF) of the variant [20]. We trained DeepVariant models with and without the AF channel with the testing samples held out. We first compared the whole-genome sequencing (WGS) variant calling accuracy for sample HG003, sequenced with 35x coverage from the PrecisionFDA v2 Truth Challenge [21], using the latest GIAB v4.2.1 truth set [22] (Figure 1). HG003 is not used in the training of these DeepVariant models, and so acts as an independent holdout to evaluate their quality. The population-aware model has superior accuracy than default DeepVariant v1.1 in both precision and recall for both types of variants. It has an overall error reduction of 1514 (4.8%). For SNPs, the error rate (defined as 1-F1 score) decreases from 0.0041 to 0.0038; for indels, the error rate decreases from 0.0044 to 0.0043. Notably, the population- aware model improves SNP false discovery rate (FDR, defined as 1-precision) from 0.0019 to 0.0015, equivalent to an error reduction of 1,096 (17.7%) variants. We then down-sampled the HG003 reads from 35x to 21x to evaluate the performance of the models with lower-coverage datasets. The population-aware method demonstrates a larger improvement in accuracy over default DeepVariant v1.1 by reducing 5,119 (9.5%) overall errors. The error rate decreases from 0.0062 to 0.0056 for SNPS, and 0.0124 to 0.0113 for indels. Similar to using the 35x read set, the population-aware model shows the strongest improvement to reduce false-positive SNPs, reducing FDR from 0.0040 to 0.0031, equivalent to 3,015 (22.5%) errors. We further evaluated the performance of the models using two whole-exome sequenc- ing (WES) datasets from a recently released set of genome and exome data [23] (Figure 2). For both WES datasets, the population-aware model outperforms DeepVariant v1.1 in overall number of errors. It has an overall error reduction of 53 (9.9%) for the IDT dataset, and 13 (6.5%) for the Oslo dataset. It has a slightly higher rate for SNPs for the Oslo dataset, from 0.00087 to 0.00092, but the difference is smaller than the gain for indels for that dataset. The population-aware model tends to have a larger lead on precision for both types of variants compared to the baseline, but still has similar or better recall. 3 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 0.000 0.005 0.010 0.015 v1.1-35x AF-35x v1.1-21x AF-21x v1.1-35x AF-35x v1.1-21x AF-21x v1.1-35x AF-35x v1.1-21x AF-21x 1-F1 1-Precision (FDR) 1-Recall (FNR) INDEL 0.000 0.002 0.004 0.006 0.008 v1.1-35x AF-35x v1.1-21x AF-21x v1.1-35x AF-35x v1.1-21x AF-21x v1.1-35x AF-35x v1.1-21x AF-21x 1-F1 1-Precision (FDR) 1-Recall (FNR) SNP Figure 1: WGS variant calling error rates for HG003. All results are evaluated using the GIAB v4.2.1 truth set in the high-confidence regions. v1.1: DeepVariant v1.1; AF: the population-aware model that uses the allele-frequency channel. The column label suffixes show the average coverage of the read sets. Lower values correspond to better accuracy. 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo 1-F1 1-Precision (FDR) 1-Recall (FNR) INDEL 0.000 0.002 0.004 0.006 0.008 0.010 0.012 v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo v1.1-IDT AF-IDT v1.1-Oslo AF-Oslo 1-F1 1-Precision (FDR) 1-Recall (FNR) SNP Figure 2: WES variant calling error rate for HG003. The IDT results (“*-IDT”) are GRCh38-based and evaluated using the GIAB v4.2.1 truth set; the Oslo datasets (“*-Oslo”) are GRCh37-based and evaluated using the GIAB v3.3.2 truth set. v1.1: DeepVariant v1.1; AF: the population-aware model that uses the allele-frequency channel. Lower values correspond to better accuracy. 4 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 2.2 Model-specific errors for population-aware models Intuitively, population information helps DeepVariant decide whether to make a call based on the commonness of a variant, especially for cases where the variant calling confidence levels are low. With a population-aware model, a variant caller should be more likely to make a positive variant call for a candidate with high allele frequency, and is less likely to make a call when seeing a rare candidate variant. To understand the influence of allele frequencies in the model, we design an analy- sis framework to compare a population-agnostic model with a population-aware model. We call this a model-specific error analysis. We stratify the errors into three groups: population-resolved, population-induced and common. The population-resolved vari- ants are called correctly with the allele frequency model, but called incorrectly when us- ing the baseline model. We say such errors are “rescued” by population information. The population-induced errors are specific to the population-aware model, i.e. they are in- duced by the extra features. The common group contains errors called by both models. The common errors are viewed as ones more difficult to solve without major changes in the data processing pipeline, such as variant caller, upstream computational methods, or sequencing technology. Thus, in this analysis we focus on the first two groups. For sim- plicity, we only considered bi-allelic calls in this analysis, which are the majority of overall errors. We used the 35x HG003 WGS dataset to perform the model-specific error analysis. Af- ter extracting model-specific erroneous calls, we matched the calls with the 1000Genomes variants to obtain associated allele frequencies. We first examined the relationship be- tween allele frequency (AF) and variant allele fraction (VAF), which is the fraction of reads supporting an alternate allele in a given sample, of each false-positive call. There is an ob- servable distinction between the population-induced group and the population-resolved group in the VAF-AF plots (Figure 3, left and middle panels). Among the population- resolved false-positive errors, more than two third (71.0%) are uncommon (allele fre- quency ≤ 5%) among the 1000Genomes samples, whereas there are only 11.4% uncom- mon variants for population-induced false positives. This observation supports the hy- pothesis that the population-aware model uses allele frequency to adjust its variant calls. We then investigated bi-allelic false-negative errors, as shown in the right panel in Fig- ure 3. Variant allele fraction for false negatives are not always available because many false negatives are not identified as a variant candidate due to reasons including low read coverage, incorrect mapping or insufficient sensitivity in variant candidate discovery. Thus, we only evaluated the allele frequency distribution for false negatives. We noticed a significant difference in the number of common variants (with greater than 5% allele frequency). Among all population-resolved false negatives, 94.6% (1,683 out of 1,780) are common variants. For population-induced false negatives, 59.2% (607 out of 883) are un- common. The model-specific analysis highlights the difference of the DeepVariant models with or without the AF channel. With the additional population information, DeepVari- 5 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 Figure 3: Errors specific to a population-agnostic model (in blue) and a population-aware model (in red) using 35x HG003 WGS data. ant is capable of adjusting the calls according to the commonness of a variant and shows improvements in both precision and recall. 2.3 Performance on zero-frequency variants A potential concern for population-aware variant calling models is increasing false neg- ative rate for novel alleles. Since it is not trivial to define a set of truly novel variants in the 1000 Genomes Project, we extracted variants with zero allele frequency to investigate the impact when population information is included in a variant calling model. Using the GIAB v4.2.1 truth set, there are 32,256 (1.0%) SNPs and 3,193 (0.6%) indels that have zero allele frequency for sample HG003. We then use the zero-frequency variant set to evaluate recall of actual variant calls using hap.py [3]. We observed that the recall on zero-frequency variants underperforms the rest using all DeepVariant models, regardless of variant types and whether to utilize population information. With 35x reads, the false-negative rate (FNR, or 1-recall) of the population- agnostic model is 0.1855 for SNPs and 0.2474 for indels (Figure 4). The FNRs further in- crease to 0.1945 for SNPs and 0.2643 for indels when using the population-aware model. When using 21x reads, the drop in accuracy gets larger for both types of variants. This is consistent with our analysis that the population-aware DeepVariant model requires stronger evidence (higher-quality pileup images) to call zero-frequency variants, thus re- ducing recall. Further, the population information has a stronger influence in variant call- ing for low-coverage datasets. Despite the disadvantages, the negative impact on zero- frequency variants is small compared to overall error reduction. To better understand the zero-frequency variants, we called variants using the Deep- Variant PacBio model with the PrecisionFDA v2 35x HG003 reads set sequenced with the PacBio HiFi technology [21]. The FNRs for the zero-frequency variants improve to 0.0481 6 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 0.00 0.05 0.10 0.15 0.20 0.25 0.30 1 -R e c a ll ( F N R ) v1.1 AF v1.1-21x AF-21x v1.1 AF v1.1-21x AF-21x INDEL SNP Figure 4: The false negative rate (FNR) of zero-frequency variants for HG003 with differ- ent models. Lower values correspond to better accuracy. for SNPs and 0.0868 for indels. The large difference in recall/FNR indicates that many of the zero-frequency variants are hard to genotype using Illumina reads, and may not be novel mutations relative to samples in reference panels. In the future, reference panels utilizing high-quality long reads will likely provide better allele frequency estimates and improve the population-aware model performance. 2.4 Assessing biases using different 1000Genomes populations It is important to understand if the inclusion of population information reduces Deep- Variant’s performance for populations that are not well represented, especially when they have a large genomic difference with the reference panel. We first note that Ashke- nazi Jewish, the ethnicity of the HG003, is not among the 26 ethnicities collected by 1000Genomes. Using a testing sample not in the reference panel reduces the risk of bias. Second, we ran inference on the population-aware model using reference panels of alleles frequencies. We split the 1000Genomes sample into five groups based on the superpopu- lation labels (African, AFR; Admixed American, AMR; East Asian, EAS; European, EUR; South Asian, SAS) and calculated allele frequencies for each super-population. We show that all population-aware approaches outperform for SNPs but underperform for indels when evaluated using HG003 (Figure 5). When considering the overall number of errors, only the model inferred with EAS frequencies calls more errors than the baseline, but the deficit (494, or 1.6%) is small. We also compared the performance of using different superpopulation frequencies and observed a correlation between variant calling accuracy and the distance between the tested sample and ethnicity groups. According to the principal component (PC) analysis performed by gnomAD v3 [4], Ashkenazi Jewish is closer to the European populations and is farther from East Asian and African in the PC1-PC2 space. We observed that using frequencies from a genetically closer population usually resulted in higher variant calling 7 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS 1-F1 1-Precision (FDR) 1-Recall (FNR) INDEL 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS 1-F1 1-Precision (FDR) 1-Recall (FNR) SNP Figure 5: Variant calling accuracy when inferring 35x Illumina reads from HG003 using default DeepVariant v1.1 (v1.1), allele frequencies in the entire 1000Genomes (All) and five 1000Genomes superpopulations (EUR, AMR, SAS, AFR and EAS). Lower values cor- respond to better accuracy. accuracy. Using EUR frequencies outperforms using other population frequencies, only falling behind using the entire 1000Genomes. On the other hand, using EAS frequencies results in the highest numbers of errors among all population-aware methods. We point out that using 1000Genomes frequencies from all populations results in the lowest number of errors among all population-aware results, suggesting an advantage to using a diverse population than finding a genetically similar group. This finding echoes our previous statement that we anticipate the population-aware variant calling model to improve further with larger-scaled and more diverse population callsets. 2.5 Silver-standard truth set for HG00733 Genome-in-a-bottle (GIAB) truth variant sets provide gold standards to benchmark vari- ant callers, but until now there are only three samples (HG002-HG003-HG004, the Ashke- nazi trio) with curated calls in difficult-to-map regions added in the v4.2.1 release [22]. Further, the samples are from the same ancestry, making it challenging to perform a generalized benchmarking considering the genetic diversity of the human population. To deal with this difficulty, it is desirable to have other high-quality variant sets from non-GIAB samples, preferably from ancestries not covered by GIAB. Thus, we called variants using the DeepVariant PacBio model with 32x high-coverage PacBio HiFi reads [24] for HG00733, a Puerto Rican (labelled as PUR under the AMR superpopulation in 1000Genomes) sample. The DeepVariant PacBio model has a SNP F1 score higher than 8 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 0.000 0.002 0.004 0.006 0.008 v1.1-30x AF-30x v1.1-18x AF-18x v1.1-30x AF-30x v1.1-18x AF-18x v1.1-30x AF-30x v1.1-18x AF-18x 1-F1 1-Precision (FDR) 1-Recall (FNR) SNP Figure 6: Variant calling results when evaluated using HG00733 data, compared to the PacBio-DeepVariant silver-standard truth set. Lower values correspond to better accu- racy. 99.9% and is one of the most accurate models using PacBio HiFi data [22]. We used the DeepVariant HG00733 PacBio SNP calls as a “silver-standard” truth set and benchmarked the performance for models using Illumina reads. We excluded the Puerto Rican popula- tion when calculating allele frequencies to avoid biases in favor of the population-aware models. We used 30x Illumina WGS reads sequenced by the New York Genome Center to test all HG00733 models. Because the 1000Genomes has a collection of PUR samples, we excluded all PUR samples and re-calculated allele frequencies for both 1000Genomes and the AMR superpopulation. The population-aware model has a lower SNP error rate (0.0041 vs. 0.0043), FDR (0.0022 vs. 0.0023) and FNR (0.0059 vs. 0.0062) than the baseline for HG00733 (Figure 6). The number of SNP errors is reduced by 1,353 (4.82%). Similar to the finding using HG003, the population-aware model performs strongly with a down-sampled (18x) read set. The error rate for the 18x read set is reduced from 0.0056 to 0.0051, and the SNP error reduction is 3,145 (8.5%). We also tested the model using different superpopulation fre- quencies (Figure 7). All but the EAS population-aware model has lower SNP error rates than the baseline. When inferred using the EAS allele frequencies, the SNP error rate in- creased from 0.0043 to 0.0044, equivalent to 878 (3.1%) more errors. All population-aware models, including EAS, outperform the baseline on FDR and only EAS has a higher FNR than the baseline (0.0066 vs. 0.0062). 3 Discussion We designed a new population-aware DeepVariant model which can incorporate both base- and read-level information with the population information. We find that population- aware models reduce error rates by 4.9% for WGS and 6.5-9.9% for WES compared to population-agnostic baselines (default DeepVariant v1.1) The relative advantage of the population-aware model increases at lower coverage (4.9% reduction at 35x and 9.5% at 21x). The increased accuracy at lower coverage suggests that population information is 9 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS v1.1 All EUR AMR SAS AFR EAS 1-F1 1-Precision (FDR) 1-Recall (FNR) SNP Figure 7: Number of SNP errors when evaluated using 30x WGS reads from a Puerto Ri- can sample HG00733. All models other than v1.1 are population-aware, inferred using alleles frequencies from different populations. Lower values correspond to better accu- racy. most valuable in difficult examples, where read-level information alone may not be suffi- cient for confident calling. In population sequencing projects, this finding could be rele- vant to the question of whether to sequence more individuals at lower coverage, or fewer at a high coverage. When sequencing for a species without a reference panel, it is possible that sequencing more, diverse individuals at lower coverage could still retain compara- ble accuracy to traditional methods which do not incorporate population information in calling. We evaluate potential biases introduced by population information in variant call- ing by comparing population-aware models that use allele frequencies from different 1000Genomes superpopulation. This experiment simulates a scenario where the tested sample is genetically distinct from the reference panel. Only one population-aware method (inferred with EAS frequencies) underperforms the baseline in total number of errors, but with a small deficit. Furthermore, using allele frequencies calculated from the entire 1000Genomes outperforms population-specific methods. This finding implies that a di- verse population can provide more benefits than using a homogeneous one, even when the homogeneous population is more genetically similar with the tested sample. This finding may inform efforts to build population or country-specific resources. Increasing the number of samples for a given population will improve accuracy for that population, but the inclusion of samples from diverse populations will also improve the resource. We believe that the accuracy of the population-aware model can further improve with a larger and more diverse population callset in the future, reinforcing the benefit of collaboration between nation-scale efforts. We provide an additional “silver-standard” SNP set for a Purto Rican sample, HG00733, a population not present in the labeled training data. We used high-coverage PacBio HiFi reads and an accurate DeepVariant PacBio model to generate this high-quality call set. This method can provide high-confidence SNP calls for non-GIAB samples and increase population diversity when assessing variant calling results. Similar to the results using HG003 data, we show that the proposed model has strong performance compared to the 10 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 baseline, and only suffers slight loss of accuracy when inferred using a distinct popu- lation. When more high-coverage PacBio HiFi data become available in the future, the high-quality calls generated by DeepVariant can provide a more diversified dataset for variant calling benchmarking and downstream analysis. Despite greater overall accuracy, we note that the population-aware model under- performs on variants with zero allele frequencies in 1000Genomes. Although the dis- advantage is small compared to the overall gain, this results suggests that the decision of whether to use population-aware models should consider the end goal. If reducing po- tential false positives is a larger concern, the use of a population-aware method could be recommended, but if the goal is to maximize recall of rare or novel variants, traditional methods could be preferred. We also notice that all tested Illumina models performed poorly on the zero-frequency variants, regardless of using population information or not. By analyzing the variants with PacBio reads, we point out many zero-frequency variants in 1000Genomes located in difficult-to-map regions, but likely not genetically novel in the population. This suggests that the power of population-aware methods should increase as large panels of long-read population data become available. 4 Methods 4.1 Training the model We trained the model following the procedure described in [2], with additional Illumina WGS datasets included [23]. Variants in chromosomes 1 to 19 are used as the training ex- amples, and those in chromosome 21 and 22 are used for tuning. Variants in chromosome 20 are never used in the training process. 4.2 Datasets The model is evaluated using the GIAB v4.2.1 truth set for HG003 across whole genomes [22]. We also generated another high-quality SNP set using DeepVariant v0.10 and HG00733 PacBio HiFi data [24] across the whole genome. We used the intersection of high-confidence regions of HG002, HG003, and HG004 (GIAB v4.2.1) as the high-confidence regions for the HG00733 SNP set. The read sets used for experiments are listed in Table 1 and the read sets for supporting experiments are provided in Table 2. 4.3 Allele matching algorithm When incorporating population information in DeepVariant, we need to match a variant candidate with a cohort variant. However, this is not a straightforward task since a vari- ant can be represented in multiple formats [3, 26]. A common approach is to normalize variants, such as using bcftools norm [27], but that’s not sufficient for complicated 11 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 Table 1: Testing datasets. Sample Ethnicity Truth variant Dataset HG003 Ashkenazi Jewish v4.2.1 (GRCh38) 35x Illumina WGS [22] 100x Illumina WES [23] HG003 Ashkenazi Jewish v3.3.2 (GRCh37) 300x Illumina WES [25] HG00733 Puerto Rican DeepVariant v0.10 PacBio SNP calls (GRCh38) 30x Illumina WGS (NYGC) Table 2: Other datasets used in this study. Sample Ethnicity Dataset HG003 Ashkenazi Jewish 35x PacBio HiFi [22] HG00733 Puerto Rican 32x PacBio HiFi [24] cases. We designed an algorithm that constructed local haplotypes and performed pre- cise allele matching (Figure 8). The algorithm starts with querying all cohort variants VC overlapped with a window [startv, endv), where startv and endv are the starting and ending positions of a variant candidate v respectively. The queried cohort variants and the candidate variant form set V ≡ v ∪ V C. Then the window is extended to the small- est starting position and the largest ending position within V , as [startV , endV ), where startV ≡ min(startu)∀u ∈ V and endV ≡ max(endw)∀w ∈ V . Local reference haplotype is queried from the reference genome in window [startV , endV ]. For each variant allele in V , its allele haplotype is updated in this window. If there’s a perfect match between a cohort allele haplotype and a candidate allele haplotype, the allele frequency of the cohort allele is added to an allele frequency dictionary, using the alternate allele of the candidate variant as key. Afterwards, DeepVariant looks up the dictionary when processing reads overlapped with the candidate variant. 4.4 Allele frequency channel for DeepVariant To make full advantages of the CNN-based classifier of DeepVariant, allele frequencies need to be encoded in pileup images. We apply a logarithmic transformation to gain resolution for low-frequency signals. For each variant candidate, an additional allele fre- quency channel is added to the pileup image. In this channel, a read is colored by the transformed frequency of its allele at the variant candidate position. A read can carry multiple alternate alleles with different frequencies, so its color intensity may vary across pileup images, where the variant candidates differ. An alternative method to encode al- lele frequencies is to include the information as features in the fully-connected layers [28], but this approach sacrifices the capability to incorporate allele frequencies with base- and read-level information and thus is not adopted. 12 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 Cohort variants position=80 REF=TTTCCA ALT=T,TTTCCATTCCA AF=3.99E-4,2E-4 position=86 REF=TTCCAG ALT=T AF=1.198E-3 Variant candidate position=85 REF=ATTCCAG ALT=AT Reference:80-92 TTTCCATTCCAG a b b Updated haplotypes TTTCCA-----T----- T----------TTCCAG TTTCCATTCCATTCCAG TTTCCA-----T----- c TTTCCAT TTTCCAG TTTCCATTCCATTCCAG TTTCCAT d dict(AT=0.001198, ATTCCAG=0.998802) Variant candidate position=85 REF=ATTCCAG ALT=AT Cohort variants Cohort variant 1 position=80 REF=TTTCCA ALT=T,TTTCCATTCCA AF=0.0004, 0.0002 Cohort variant 2 position=86 REF=TTCCAG ALT=T AF=0.0012 Reference:80-92 TTTCCATTCCAG Updated haplotypes TTTCCA-----T----- T----------TTCCAG TTTCCATTCCATTCCAG TTTCCA-----T----- TTTCCAT TTTCCAG TTTCCATTCCATTCCAG TTTCCAT Candidate frequency AT:0.0012 Figure 8: An example for the allele matching algorithm. This algorithm first queries cohort variants overlapped with the variant candidate. These cohort variants and the candidate determine the window where haplotypes are updated. The frequencies of matched allele haplotypes are then updated for the variant candidate as a dictionary. In this diagram, haplotypes are updated with dashes to keep sequenced aligned for better visualization. In practice, dash-free haplotypes are generated by the allele matching algorithm. To enable the allele frequency channel, users need to enable flag --use allele frequency and provide DeepVariant cohort variants in VCF format with flag --population vcfs. 4.5 Model-specific error analysis We compared actual variant calls with GIAB v4.2.1 truth variants using bcftools isec. Variants specific to actual calls are regarded as false positives, and those specific to the truth set are regarded as false negatives. We generated the false-positive and false-negative sets for two models, and then applied bcftools isec again to obtain model-specific false positives and false negatives. For both sets, we applied the allele matching algo- 13 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 rithm to obtain allele frequencies for the variants. For the false-positive sets, we extracted variant allele fractions from the VCF files generated by DeepVariant. 4.6 1000Genomes frequencies from the DeepVariant-GLnexus pipeline We used the 1000Genomes reference panel generated with the DeepVariant-GLnexus pipeline (v3) [8] for all population-aware experiments, including training and inferring the models. We fill the missing genotypes with the reference genotypes with bcftools +missing2ref to make sure all variants have the same denominator. 5 Availability of data and materials The DeepVariant source code is available at https://github.com/google/deepvariant under the BSD-3-Clause License. The PacBio-based HG00733 SNP set is available at https://console.cloud.google.com/storage/browser/brain-genomics-public/ research/allele_frequency/HG00733_SNP_set. The pre-trained population-aware DeepVariant models are available at https://console.cloud.google.com/storage/ browser/brain-genomics-public/research/allele_frequency/pretrained_ model_WGS (WGS) and https://console.cloud.google.com/storage/browser/ brain-genomics-public/research/allele_frequency/pretrained_model_WES (WES). The VCF files used in this study are available at https://console.cloud. google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/ cohort_dv_glnexus_opt/v3_missing2ref (GRCh38) and https://console.cloud. google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/ cohort_dv_glnexus_opt/v3_GRCh37_missing2ref (GRCh37). 6 Ethics approval and consent to participate Not applicable. 7 Consent for publication Not applicable. 8 Competing interests AK, SG, TY, PC and AC are employees of Google LLC and own Alphabet stock as part of the standard compensation package. This study was funded by Google LLC. 14 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://github.com/google/deepvariant https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/HG00733_SNP_set https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/HG00733_SNP_set https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WES https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WES https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_GRCh37_missing2ref https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_GRCh37_missing2ref https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_GRCh37_missing2ref https://doi.org/10.1101/2021.01.06.425550 9 Funding All compute resources used in this work were provided by Google, LLC. AK, SG, TY, PC and AC are full-time, salaried employees of Google, LLC. NC con- tributed to this work as a salaried intern of Google, LLC. 10 Acknowledgments We thank Babak Alipanahi, Gunjan Baid, Daniel Cook, Alexander D’Amour, Hojae Lee, Cory McLean, Maria Nattestad and other colleagues at Google for their feedback on this manuscript and the project in general. The HG00733 Illumina data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. 11 Authors’ contributions NC, AK, PC and AC designed the method. NC, AK and PC implemented the software. NC and PC performed the experiment. NC, AK, SG, TY, PC and AC analyzed the re- sults. NC, PC and AC wrote the manuscript. All authors read and approved the final manuscript. References 1. DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., Del Angel, G., Rivas, M. A., Hanna, M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491 (2011). 2. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., New- burger, D., Dijamco, J., Nguyen, N., Afshar, P. T., et al. A universal SNP and small- indel variant caller using deep neural networks. Nature biotechnology 36, 983–987 (2018). 3. Krusche, P., Trigg, L., Boutros, P. C., Mason, C. E., Francisco, M., Moore, B. L., Gonzalez- Porta, M., Eberle, M. A., Tezak, Z., Lababidi, S., et al. Best practices for benchmark- ing germline small-variant calls in human genomes. Nature biotechnology 37, 555–560 (2019). 4. Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434– 443 (2020). 15 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 5. 1000 Genomes Project Consortium et al. A global reference for human genetic varia- tion. Nature 526, 68–74 (2015). 6. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014). 7. Lin, M. F., Rodeh, O., Penn, J., Bai, X., Reid, J. G., Krasheninina, O. & Salerno, W. J. GLnexus: joint variant calling for large cohort sequencing. BioRxiv, 343970 (2018). 8. Yun, T., Li, H., Chang, P.-C., Lin, M. F., Carroll, A. & McLean, C. Y. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. bioRxiv (2020). 9. Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., Kling, D. E., Gauthier, L. D., Levy-Moonshine, A., Roazen, D., et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178 (2017). 10. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reducing reference bias using multiple population reference genomes. BioRxiv (2020). 11. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome biology 21, 1–28 (2020). 12. Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology 36, 875–879 (2018). 13. Witherspoon, D. J., Wooding, S., Rogers, A. R., Marchani, E. E., Watkins, W. S., Batzer, M. A. & Jorde, L. B. Genetic similarities within and between human populations. Genetics 176, 351–359 (2007). 14. Abramovs, N., Brass, A. & Tassabehji, M. Hardy-Weinberg Equilibrium in the Large Scale Genomic Sequencing Era. Frontiers in Genetics 11, 210 (2020). 15. Pedersen, B. S., Brown, J. M., Dashnow, H., Wallace, A. D., Velinder, M., Tvrdik, T., Mao, R., Best, H. D., Bayrak-Toydemir, P. & Quinlan, A. R. Effective variant filter- ing and expected candidate variant yield in studies of rare human disease. BioRxiv (2020). 16. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019). 17. Martin, A. R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B. M. & Daly, M. J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature genetics 51, 584–591 (2019). 18. McGuire, A. L., Gabriel, S., Tishkoff, S. A., Wonkam, A., Chakravarti, A., Furlong, E. E., Treutlein, B., Meissner, A., Chang, H. Y., López-Bigas, N., et al. The road ahead in genetics and genomics. Nature Reviews Genetics 21, 581–596 (2020). 16 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 19. Wenger, A. M., Peluso, P., Rowell, W. J., Chang, P.-C., Hall, R. J., Concepcion, G. T., Ebler, J., Fungtammasan, A., Kolesnikov, A., Olson, N. D., et al. Accurate circular con- sensus long-read sequencing improves variant detection and assembly of a human genome. Nature biotechnology 37, 1155–1162 (2019). 20. Carroll, A. & Chang, P.-C. Improving the Accuracy of Genomic Analysis with DeepVariant 1.0 https://ai.googleblog.com/2020/09/improving-accuracy-of- genomic-analysis.html. 2020. (accessed: 2020-12-11). 21. Olson, N. D., Wagner, J., McDaniel, J., Stephens, S. H., Westreich, S. T., Prasanna, A. G., Johanson, E., Boja, E., Maier, E. J., Serang, O., et al. precisionFDA Truth Chal- lenge V2: Calling variants from short-and long-reads in difficult-to-map regions. bioRxiv (2020). 22. Wagner, J., Olson, N. D., Harris, L., Khan, Z., Farek, J., Mahmoud, M., Stankovic, A., Kovacevic, V., Wenger, A. M., Rowell, W. J., et al. Benchmarking challenging small variants with linked and long reads. BioRxiv (2020). 23. Baid, G., Nattestad, M., Kolesnikov, A., Goel, S., Yang, H., Chang, P.-C. & Carroll, A. An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development. bioRxiv. eprint: https://www.biorxiv.org/content/early/ 2020/12/11/2020.12.11.422022.full.pdf. https://www.biorxiv. org/content/early/2020/12/11/2020.12.11.422022 (2020). 24. Porubsky, D., Ebert, P., Audano, P. A., Vollger, M. R., Harvey, W. T., Marijon, P., Ebler, J., Munson, K. M., Sorensen, M., Sulovari, A., et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nature Biotechnology. ISSN: 1546-1696. https://doi.org/10.1038/s41587- 020-0719-5 (Dec. 2020). 25. Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., Weng, Z., Liu, Y., Mason, C. E., Alexander, N., et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific data 3, 1–26 (2016). 26. Sun, C. & Medvedev, P. VarMatch: robust matching of small variant datasets using flexible scoring schemes. Bioinformatics 33, 1301–1308 (2017). 27. Li, H. A statistical framework for SNP calling, mutation discovery, association map- ping and population genetical parameter estimation from sequencing data. Bioinfor- matics 27, 2987–2993 (2011). 28. Yi, R., Chang, P.-C., Baid, G. & Carroll, A. Learning from Data-Rich Problems: A Case Study on Genetic Variant Calling. arXiv preprint arXiv:1911.05151 (2019). 17 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://ai.googleblog.com/2020/09/improving-accuracy-of-genomic-analysis.html https://ai.googleblog.com/2020/09/improving-accuracy-of-genomic-analysis.html https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422022.full.pdf https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422022.full.pdf https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422022 https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422022 https://doi.org/10.1038/s41587-020-0719-5 https://doi.org/10.1038/s41587-020-0719-5 https://doi.org/10.1101/2021.01.06.425550 Background Results Population information improves DeepVariant performance Model-specific errors for population-aware models Performance on zero-frequency variants Assessing biases using different 1000Genomes populations Silver-standard truth set for HG00733 Discussion Methods Training the model Datasets Allele matching algorithm Allele frequency channel for DeepVariant Model-specific error analysis 1000Genomes frequencies from the DeepVariant-GLnexus pipeline Availability of data and materials Ethics approval and consent to participate Consent for publication Competing interests Funding Acknowledgments Authors' contributions