key: cord-0310445-0im9v8qn authors: Atkinson, Elizabeth G.; Artomov, Mykyta; Karczewski, Konrad J.; Loboda, Alexander A.; Rehm, Heidi L.; MacArthur, Daniel G.; Neale, Benjamin M.; Daly, Mark J. title: Discordant genotype calls across technology platforms elucidate variants with systematic errors in next-generation sequencing date: 2022-03-29 journal: bioRxiv DOI: 10.1101/2022.03.24.485707 sha: 565fc9bae4a26c3d7388ffce75fb4a469d40439c doc_id: 310445 cord_uid: 0im9v8qn Large-scale next-generation sequencing (NGS) datasets have been transformative for informing clinical variant interpretation and as reference panels for statistical and population genetic efforts. While such resources are often treated as ground truth, we find that in widely used reference datasets such as the Genome Aggregation Database (gnomAD), some variants pass gold standard filters yet are systematically different in their genotype calls across sequencing technologies. The inclusion of such discordant sites in study designs involving multiple sequencing platforms (e.g. whole genome and/or different whole-exome captures) could bias results and lead to false-positive hits in association studies due to technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across sequencing technologies, characterize the error mode of wrong calls, provide a blacklist of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other datasets. We find that different NGS technologies have different sets of variants at which this problem occurs but that there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, though different populations are powered for discovery of different variants. We find that the most common error mode is that of a variant being heterozygous for one platform and homozygous for the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls. 5 Lam et al. 2019 ). However, even with gold-standard filtering, spurious genotype calls can infiltrate datasets 6 and potentially skew results. This is of particular importance with datasets that aggregate calls generated by 7 multiple sequencing platforms, as different platforms have distinct error modes. Identifying variants that have 8 technical artifacts affecting genotype calls is of major importance, as such loci give misleading information 9 regarding population allele frequencies, and could be incorrectly identified as being phenotypically 10 meaningful in gene discovery. In this article, we comprehensively characterize the observation of discordant genotyping depending on 21 technology using a large set of diverse individuals from the gnomAD database, including a subset of 22 participants who underwent both whole-exome and whole-genome sequencing (WES/WGS). Correcting for 23 this technical error, whether by removing the gnomAD blacklist variants provided here or by identifying user-24 identified spurious calls with our freely distributed machine learning predictor, should be incorporated as a 25 step in QC pipelines to avoid spurious associations, particularly in large-scale studies aggregating data from 26 multiple sources. Discordance in genotype calls across NGS technologies replicates across ancestries 2 We sought to compare allele frequencies across variants found within exome and genome sequencing 3 datasets in gnomAD to test whether there are regions with significant bias associated with sequencing 4 technology. We first focused on the largest population represented in gnomAD 0.2 -the Non-Finnish 5 Europeans (NFE). Using the full release of gnomAD version 2.1.1, we filtered the data to include only sites 6 that were present and had a quality determination of PASS in both the genomes and exomes (Karczewski et 7 al. 2020 ). To ensure sufficient power, we filtered for sites with allele count (AC) >10 and ran a Fisher's Exact 8 Test (FET) on the difference in the number of alternate AC to total alleles (allele number, AN) between these 9 two datasets. A non-negligible fraction of sites were significantly discordant in their calls ( Figure 1A ). We 10 also tested other less stringent AC thresholds (AC>1 and AC>5) and observed that trends in discordance 11 between the platforms were consistent across AC cutoffs ( Figure S1 ). When comparing allele frequencies between sequencing platforms it is critical to control for ancestry, as Table S1 ). Replication of the same variants 19 across multiple ancestries strengthens the argument of a shared technical artifact and suggests ancestral 20 bias between exome and genome datasets is unlikely to be a confounding factor. Error mode of discordant calls 23 Next, we aimed to classify the error mode resulting in discordant calls using the individuals with both 24 WES and WGS data. As 946 individuals underwent both whole exome and whole genome sequencing, we 25 were able to examine the rates and error modes of discordant genotype calls without concern over 26 population structure or differing sample composition; since these are the same individuals, any difference in 27 5 AF and/or genotype calls can be conclusively determined to be due to technical artifacts. We tallied the 1 number of sites in each pairwise exomes-genomes genotype category (6,333 sites; Figure S3 ) to classify 2 miscall error mode. Of the 6 possible error modes (homozygous reference/heterozygous, homozygous 3 reference/homozygous variant, or homozygous variant/heterozygous for both dataset directions), we find that 4 the majority of calls, 57.7%, are a heterozygous genotype call in the genomes but a homozygous reference 5 genotype call in the exomes (Figure 1D, S4) . We also note that different sequencing platforms have different 6 rates of discordant calls, though due to sharing restrictions we can not identify platform names with certainty. Overall, ~16% of the variants that were present and PASS in both exomes and genomes in the overlapping 8 individuals had at least one discordant call. Next, we examined the discordance of allele frequencies, again with a FET. Usually in cohort-based 10 comparisons the expected distribution of FET p values is represented by a uniform distribution. Since we are 11 looking at the same individuals, it is expected that allele frequencies should be identical (i.e. p=1). We Figure S6 ). They also highlight a difference in 6 discordance patterns of indels vs SNVs. SNVs show a pattern of higher AF in genomes as compared to 7 exomes, while indels do not have this trend. Indels are also generally less stable in allele frequency 8 estimates than SNVs. It, therefore, appears that two distinct technical error modes might be affecting miscalls 9 in indels vs SNVs, rather than one shared mechanism. As many of these indels fell in the low complexity 10 regions of the genome, it is likely that a mapping issue is responsible for their miscalls. To correct this, we Having confirmed that there was a systematic and significant AF discordance between sequencing 18 platforms, we used our FET tests to generate a list of bad sites that may be excluded from analyses. Again, 19 these blacklisted sites represent variants that were a PASS in gnomAD QC in both the exomes and 20 genomes, but are unreliably genotyped depending on the sequencing technology used. Given a situation 21 where, for example, a case cohort has been exome sequenced while the control cohort has been genome 22 sequenced, such sites could give false-positive associations due to the resulting AF differences. We, 23 therefore, recommend for them to be treated with caution or broadly excluded (in addition to standard cohort 24 QC) unless thorough confirmation of their validity in a particular dataset has been done. which we consider to be reliable. The exomes represent the vast majority of sequences in gnomAD, which 10 may make their results more stable. For analyses that require less stringent QC, we provide these sites 11 which can be optionally retained, given that they pass cohort QC in the individual dataset. . Table S3 ). Model training and validation was performed using several approaches. First, we used a leave-one-out cross validation using the exome dataset from gnomAD. In 22 trials (one for 18 each of the 22 autosomes) we set aside one chromosome and used the other 21 for model training. Then, 19 the model was tested on the variants from the chromosome that was not used for training. Using the FET p 20 value threshold for the discordance analysis we classified the variants into 'bad' (with p < threshold) and 21 'good' (p > = threshold) groups. By varying the FET p value threshold discriminating the groups we 22 performed ROC analysis. Since representation of the classes varies depending on the selected threshold we 23 used the class weights for balancing the classes (in this and the following tests the weight for the 'bad' class 24 was set as the fraction of 'bad' variants in the training data and the weight for the 'good' class was set to 1). The area under the ROC curve (ROC AUC) for such a model was estimated as 0.841 (Figure 2A) . Next, we used variant annotations from the gnomAD genomes dataset for training and variant 1 annotations from the gnomAD exomes as a test sample. These are two separate datasets that are well-2 powered to detect biases in allele frequencies and ensure full independence between test and training 3 samples. In this setting our model again reliably predicted discordant variants, with ROC AUC=0.803 4 (Figure 2A) . Feature importance analysis of the model suggests that variant quality, inbreeding coefficient, 5 and quality by depth are the key parameters discriminating variants with and without evidence for technical 6 bias. Therefore, current protocols for alignment and variant calling are leaving a notable footprint which can 7 be used to detect platform biases ( Figure 2B ). . Table S4 ) were used to create the variant call set following GATK best practices. Variant annotations 11 were used to classify variants using the gnomAD genomes data as a training sample. Due to sample size, 12 the 1000 Genomes data is significantly less powered to detect technical biases compared to gnomAD. Therefore, it is harder to confidently identify ground truth for discordant variants. Identical 1000 Genomes We first observed the phenomenon of variant discordance through the investigation of GWAS variants in 2 the COVID host genetics initiative and the UK Biobank(COVID-19 Host Genetics Initiative 2021). When 3 inspecting top associated variants that did not have strong LD friends, we noticed that many had discordant 4 frequencies between GWAS arrays and gnomAD, and this often corresponded to variants that had 5 discrepant frequencies between the gnomAD exomes and genomes. We thus suggest that the blacklist 6 provided herein can be used for quality control of GWAS variants. To investigate whether discordant variants may be spuriously attributed phenotypic relevance, we 8 annotated all variants with their predicted functional consequence using the Ensembl Variant Effect Predictor Seventeen bad variants were found in the GWAS catalog, underscoring the importance of controlling for this 16 artifact, as it may impact downstream interpretation of association findings (Table S2 ). Variants in this list 17 have been associated with multiple health-related phenotypes including schizophrenia, telomere length, and 18 blood protein levels. Of these 17, half are multi-allelic and approximately a third are indels, echoing our 19 earlier results of a higher discordance rate for these variant types than bi-allelic SNVs. Additionally, more 20 than half of the 17 are present in the first megabase of the chromosome, suggesting that areas flanking the 21 telomeres should be treated with caution. The need for extremely large sample sizes to obtain sufficient statistical power in genetic studies 25 requires the creation of datasets that may go beyond the financial capabilities of many research groups. This leads to the creation of meta-datasets that have contributions from many individual studies, thus creating 1 heterogeneity in the technologies that were used for genotyping. Therefore, identification of DNA variants 2 that are susceptible to technical bias when genotypes originate from multiple sequencing platforms is vital in 3 order to avoid false-positive associations and analyses of the artificially inflated allele frequencies. This is of 4 particular concern in instances where cases may originate from one sequencing effort and controls from 5 another. Here, we identify and describe a technical artifact arising from NGS that may affect cohort data variant 7 quality despite the following of gold standard QC procedures. We present our metric for the identification of 8 discordant sites, provide a blacklist of the discordant variants in gnomAD which should be treated with 9 caution, and release an openly-available software package containing our random forest predictor that 10 reliably classifies untrustworthy variants in user cohort data. Excluding variants with signals of discordance 11 across sequencing platforms will result in higher quality results and reduces the risk of spurious associations 12 in gene discovery. We provide a blacklist of variants failing our discordance test in the gnomaD v2 dataset for 13 immediate exclusion. We additionally provide an R-package with a predictor trained on gnomad whole 14 genome sequencing data for use in other cohort data to identify sites with features indicative of unreliable 15 genotype calls. Based on the examinations presented in this manuscript, we recommend researchers using aggregated 17 cohort data to implement the following QC procedures to ensure the elimination of discordant sites: Starting with all sites with at least 1 alternate allele, and subsequently for sites with AC>5 and 10, we 11 calculated the allele frequency (AF) in the genomes and exomes separately and ran a Fisher's Exact Test (FET) on the difference in the number of alternate AC to total alleles (allele number, AN) between these two Foundation. An integrated map of genetic variation from 1,092 human 21 genomes Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate 24 discordance. Sci Rep 9: 16156. in genetic case-control association studies The 1000 Genomes Project. Assessing Rare Variation in Complex Traits Insights into human genetic variation and population history from 929 diverse genomes A structural variation reference for medical and population genetics Mapping the human genetic architecture of COVID-19 Demographic history and rare allele sharing among human populations An analytical framework 15 for optimizing variant discovery from personal genomes Hands-On Data Visualization with Bokeh: Interactive web plotting for Python using Bokeh The genome Aggregation Database (gnomAD) The mutational constraint spectrum quantified from variation in 141 Variation across 141,456 human exomes and genomes reveals the spectrum of 2 loss-of-function intolerance across human protein-coding genes RICOPILI: Rapid Imputation for COnsortias PIpeLIne Analysis of protein-coding genetic variation in 60,706 humans Classification and regression by randomForest ForestQC: Quality control on genetic 12 variants from next-generation sequencing data using random forest Analysis of error profiles in deep next-generation sequencing data The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation 17 DNA sequencing data Ensembl Variant Effect Predictor pROC: an open-source 21 package for R and S+ to analyze and compare ROC curves ROCR: visualizing classifier performance in R We thank the members of the gnomAD and Hail teams for their assistance with this project. This project was