Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes DR AF T Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes Jianbo Zhanga,� and Dilip R. Pantheea,� a Department of Horticultural Science, North Carolina State University, Mountain Horticultural Crops Research and Extension Center, 455 Research Drive, Mills River, NC 28759, USA This manuscript was compiled on November 24, 2020 The genomic region(s) that controls a trait of interest can be rapidly identified using BSA-Seq, a technology in which next-generation se- quencing (NGS) is applied to bulked segregant analysis (BSA). We recently developed the significant structural variant method for BSA- Seq data analysis that exhibits higher detection power than standard BSA-Seq analysis methods. Our original algorithm was developed to analyze BSA-Seq data in which genome sequences of one par- ent served as the reference sequences in genotype calling, and thus required the availability of high-quality assembled parental genome sequences. Here we modified the original script to allow for the ef- fective detection of the genomic region-trait associations using only bulk genome sequences. We analyzed a public BSA-Seq dataset us- ing our modified method and the standard allele frequency and G- statistic methods with and without the aid of the parental genome sequences. Our results demonstrate that the genomic region(s) as- sociated with the trait of interest could be reliably identified only via the significant structural variant method without using the parental genome sequences. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 BSA-Seq | PyBSASeq | QTL | genomic region-trait association Bulked segregant analysis (BSA) was developed for the1 quick identification of genetic markers associated with a2 trait of interest (1, 2). For a particular trait, two groups of3 individuals with contrasting phenotypes are selected from a4 segregating population. Equal amounts of DNA are pooled5 from each individual within a group. The pooled DNA samples6 are then subjected to analysis, such as restriction fragment7 length polymorphism (RFLP) or random amplification of poly-8 morphic DNA (RAPD). Fragments unique to either group are9 potential genetic markers that may link to the gene(s) that10 control phenotypic expression for the trait of interest. Can-11 didate markers are further tested against the population to12 verify the marker-trait associations. With the recent dramatic13 reductions in cost, next-generation sequencing (NGS) has been14 applied to more and more BSA studies (3–7). This new tech-15 nology is referred to as BSA-Seq. In BSA-Seq, pooled DNA16 samples are not subjected to RFLP/RAPD analysis, but are17 directly sequenced instead. Genome-wide structural variants18 between bulks, such as single nucleotide polymorphisms (SNP)19 and small insertions/deletions (InDel), are identified based20 on the sequencing data. Genomic regions linked to the trait-21 controlling gene(s) are then identified based on the enrichment22 of the SNP/InDel alleles in those regions in each bulk. The23 time-consuming and labor-intensive marker development and24 genetic mapping steps are eliminated in the BSA-Seq method.25 Moreover, SNPs/InDels can be detected genome-wide via NGS,26 which allows for the reliable identification of trait-associated27 genomic regions across the entire genome.28 For each SNP/InDel in a BSA-Seq dataset, the base (or29 oligo in the case of an InDel) that is the same as in the reference30 genome is termed the reference base (REF), and the other31 base is termed the alternative base (ALT). Because each bulk 32 contains many individuals, the vast majority of SNP loci in 33 the dataset have both REF and ALT bases. For each SNP, 34 the number of reads of its REF/ALT alleles is termed allele 35 depth (AD). Because of the phenotypic selection via bulking, 36 for trait-associated SNPs, the ALT allele should be enriched 37 in one bulk while the REF allele should be enriched in the 38 other. However, for SNPs not associated with the trait, both 39 ALT and REF alleles would be randomly segregated in both 40 bulks, and neither enriched in either bulk. Hence these four 41 AD values can be used to assess how likely a SNP/InDel is 42 associated with the trait. 43 We have previously developed the significant structural 44 variant method for BSA-Seq data analysis (8). In this method, 45 a SNP/InDel is assessed with Fisher’s exact test using the AD 46 values of both bulks. A SNP/InDel is considered significant 47 if the P-value of Fisher’s exact test is lower than a specific 48 cut-off value, e.g., 0.01. A genomic region normally contains 49 many SNPs/InDels. The ratio of the significant structural 50 variants to the total structural variants is used to judge if 51 this genomic region is associated with the trait of interest. 52 We tested this method using the BSA-Seq data of a rice cold- 53 tolerance study (9). One of the parents in this study was rice 54 cultivar Oryza sativa ssp. japonica cv. Nipponbare. Its high- 55 quality assembled genome sequences were used as the reference 56 sequences for SNP/InDel calling as well, which makes the 57 genotype calling and SNP/InDel filtering very straightforward: 58 any locus in any bulk that is different from the REF allele is 59 a valid SNP/InDel (8). 60 Only high-quality assembled genome sequences can serve as 61 the reference sequences in genotype calling, an essential step 62 in BSA-Seq data analysis. For most species, however, such 63 sequences are available for only a single or limited number of 64 lines. If lines without high-quality assembled genome sequences 65 are used as the parents in BSA-Seq studies, the parental 66 genomes are often sequenced via NGS for the determination 67 Significance Statement BSA-Seq can be utilized to rapidly identify structural variant- trait associations, and our modified significant structural variant method allows the detection of such associations without se- quencing the parental genomes, leading to further lower the sequencing cost and making BSA-Seq more accessible to the research community and more applicable to the species with a large genome. Author contributions: JZ and DRP conceived the study. JZ developed the algorithm, wrote the Python code, analyzed the data, and wrote and edited the manuscript. DRP edited the manuscript and supervised the project. The authors declare no conflict of interest. �To whom correspondence should be addressed. E-mail: dilip_panthee@ncsu.edu or zhang.jianbo@gmail.com https://doi.org/10.1101/654137 bioRχiv | November 24, 2020 | 1 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://doi.org/10.1101/654137 https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/ DR AF T of the parental origin of SNP alleles and the identification68 of parental heterozygous SNPs. Modification of our original69 method to allow the analysis of BSA-Seq data in the absence of70 assembled or NGS-generated parental genome sequences would71 provide greater flexibility and significantly reduce sequencing72 costs. Hence, we modified our original script to allow for73 the identification of the false-positive SNPs/InDels and part74 of the heterozygous loci in the parents without the aid of75 the parental genome sequences. Using the modified script,76 along with the scripts for the standard G-statistic and allele77 frequency methods (10, 11), we analyzed a public BSA-Seq78 dataset using either the genome sequences of both the parents79 and the bulks, or the bulk genome sequences alone. The80 results revealed that reliable detection of genomic region-trait81 associations can be achieved only via our modified script when82 using only the bulk genome sequences.83 Materials and Methods84 The sequencing data used in this study were generated by Lahari et85 al. (12). Using the allele frequency method, the authors identified86 a single locus for root-knot nematode resistance in rice. In that87 study, the parents of the F2 population were LD24 and VialoneNano,88 yielding an F2 population size of 178 (plants), and both the resistant89 bulk and the susceptible bulk contained 23 plants each. The DNA90 samples of both the parents and the bulks were sequenced using91 Illumina MiSeq Sequencing System and MiSeq v3 chemistry.92 The BSA-Seq sequencing data (ERR2696318: parent LD24;93 ERR2696319: parent VialoneNano; ERR2696321: the resistant bulk94 from the F2 population; ERR2696322: the susceptible bulk from95 the F2 population) were downloaded from the European Nucleotide96 Archive (ENA) using the Linux program wget, and the rice reference97 sequence (Release 47) was downloaded from https://plants.ensembl.98 org/Oryza_sativa/Info/Index. Sequencing data preprocessing and SNP99 calling were performed as described previously (8). When analyzing100 the BSA-Seq data with the genome sequences of both the parents101 and the bulks, bulk/parent SNP calling was performed separately.102 The common SNPs of the two SNP datasets were used for the103 downstream analysis.104 The SNP dataset generated via SNP calling was processed with105 our Python script to identify significant SNP-trait associations.106 A single script containing all the three methods is available on107 the website https://github.com/dblhlx/PyBSASeq. The workflow of the108 scripts is as follows:109 1. Read the .tsv input file generated via SNP calling into a Pandas110 DataFrame.111 2. Perform SNP filtering on the Pandas DataFrame.112 3. Identify the significant SNPs (sSNPs) via Fisher’s exact test113 (the significant structural variant method), calculate the ΔAF114 (allele frequency difference between bulks) values (the allele115 frequency method), or calculate the G-statistic values (the116 G-statistic method) using the four AD values (ADref1 and117 ADalt1 of bulk 1 and ADref2 and ADalt2 of bulk 2) of each118 SNP in the filtered Pandas DataFrame.119 4. Use the sliding window algorithm to plot the sSNP/totalSNP120 ratios, the ΔAF values, or the G-statistic values against their121 genomic positions.122 5. Estimate the threshold of the sSNP/totalSNP ratio, the ΔAF,123 or the G-statistic via simulation. The thresholds are used to124 identify the significant peaks/valleys in the plots generated in125 step 4.126 Identification of the sSNPs, calculation of the sSNP/totalSNP127 ratios, the G-statistic values, or the ΔAF values, and estimation128 of their thresholds were carried out as described previously (8).129 The 99.5th percentile of 10 000 simulated sSNP/totalSNP ratios130 or G-statistic values was used as the threshold for the significant131 structural variant method or the G-statistic method, and the 99%132 confidence interval of 10 000 simulated ΔAF values was used as133 the threshold for the allele frequency method. For all methods,134 the size of the sliding windows is 2 Mb and the incremental step is135 10 kb. In our previous work, a parent was the japonica rice cultivar136 nipponbare, and its genome sequences were used as the reference 137 sequences for SNP/InDel calling. In the current dataset, the parents 138 were LD24 and VialoneNano; many false-positive SNPs/InDels and 139 heterozygous loci in the parents would be included in the dataset if 140 analyzing the BSA-Seq data using the original script. Hence, SNP 141 filtering is carried out a little differently from previously described 142 (8), and its details are below (see Table S1 for examples): 143 • Unmapped SNPs or SNPs mapped to the mitochondrial or 144 chloroplast genome 145 • SNPs with an ‘NA’ value in any column of the DataFrame 146 • SNPs with zero REF read and a single ALT allele in both 147 bulks/parents 148 • SNPs with three or more ALT alleles in any bulk/parent 149 • SNPs with two ALT alleles and its REF read is not zero in 150 any bulk/parent 151 • SNPs in which the bulk/parent genotypes do not agree with 152 the REF/ALT bases 153 • SNPs in which the bulk/parent genotypes are not consistent 154 with the AD values 155 • SNPs with a genotype quality (GQ) score less than 20 in any 156 bulk 157 • SNPs with very high reads 158 • SNPs heterozygous in any parent when parental genome se- 159 quences are available 160 Additionally, for SNPs with two ALT alleles and zero REF read 161 in both bulks/parents, the REF allele is replaced with the first allele 162 in the ‘ALT’ field, its ALT allele is replaced with the second allele 163 in the original ‘ALT’ field. The REF read, and a comma after it, 164 are removed from both the allele depth (AD) fields (one for each 165 bulk/parent). This step is carried out before checking the genotype 166 agreement between bulks and the REF/ALT fields. When parental 167 genome sequences are involved, the common SNP set is identified 168 before filtering out the SNPs with a low GQ score in the parental 169 SNP dataset. 170 The tightly linked SNP alleles from the same parent tend to 171 segregate together and should have a similar extent of allele enrich- 172 ment, and thus similar AD values. In a SNP dataset, the genotypes 173 of each bulk/parent are represented as ‘GTref/GTalt’ when a SNP 174 contains both the REF base and the ALT base in the genotype 175 (GT) field, and the AD values in each bulk/parent is represented as 176 ‘ADref,ADalt’. The genotype and the AD value of the REF allele are 177 always placed first in both fields. For a SNP locus in the .tsv input 178 file, the allele having the same genotype as that in the reference 179 genome is defined as the REF allele. However, it is highly unlikely 180 that all of the SNP alleles in a parent are the same as those in 181 the reference genome, except in instances where reference genome 182 sequences used in SNP calling are from one of the parents as in 183 the case of the cold-tolerance study as mentioned above (9). It is 184 necessary to place the genotypes and AD values of all SNP alleles 185 from one parent (e.g., LD24) in the REF position, and those from 186 the other parent (e.g., VialoneNano) to the ALT position in the 187 GT and AD fields to make the bulk dataset consistent. Thus, for 188 a particular SNP, if the REF base in the .tsv file is different from 189 the genotype of LD24 (either parent will work), its GT/AD values 190 would be swapped, e.g., ‘G/A’ to ‘A/G’ and ‘19,9’ to ‘9,19’. AD/GT 191 swapping is performed following SNP filtering and is performed only 192 when the parental genome sequences are used to aid BSA-Seq data 193 analysis. Equation 1 is used for ΔAF calculation. AD swapping 194 ensures that adjacent SNPs have similar ΔAF values. 195 ∆AF = ADalt2 ADref 2 + ADalt2 − ADalt1 ADref 1 + ADalt1 [1] 196 197 Results 198 The original sequence reads were 3.9G, 3.8G, 3.4G, and 3.5G; 199 they became 3.8G, 3.6G, 3.3G, and 3.4G after quality con- 200 trol, respectively, in ERR2696318 (parent LD24), ERR2696319 201 (parent VialoneNano), ERR2696321 (the resistant bulk), and 202 ERR2696322 (the susceptible bulk), which correspond to 8.8×, 203 8.5×, 7.6×, and 7.9× coverage, respectively (12). The prepro- 204 cessed sequences were used for SNP calling to generate a SNP 205 2 | https://doi.org/10.1101/654137 Zhang et al. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://plants.ensembl.org/Oryza_sativa/Info/Index https://plants.ensembl.org/Oryza_sativa/Info/Index https://plants.ensembl.org/Oryza_sativa/Info/Index https://github.com/dblhlx/PyBSASeq https://doi.org/10.1101/654137 https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/ DR AF T dataset, which was analyzed using the modified significant206 structural variant method, the G-statistic method, and the207 allele frequency method with or without the aid of the parental208 genome sequences.209 BSA-Seq data analysis using the genome sequences of210 both the parents and the bulks. The SNP calling-generated211 parent/bulk SNP dataset was processed with the Python212 script PyBSASeq_WP.py. SNP filtering was performed as213 described in the Materials and Methods section. The parental214 SNP dataset was processed first, and the SNPs heterozygous215 in any parent were eliminated because all algorithms assume216 all SNP loci are homozygous in the parental lines. Threshold217 estimation is based on this assumption. Although most rice218 breeding lines should be homozygous in most loci, more219 than 7% heterozygous SNP loci (2 011 062 homozygous and220 153 000 heterozygous) were identified in the parental SNP221 dataset. However, the GATK’s variant calling tools are222 designed to be very lenient in order to achieve a high degree223 of sensitivity (https://gatk.broadinstitute.org/hc/en-us/articles/224 360035535932-Germline-short-variant-discovery-SNPs-Indels-),225 we cannot rule out the possibility that some of the heterozy-226 gous loci were caused by sequencing artifacts. The bulk SNP227 dataset was processed second. The SNPs with the same228 chromosome ID and the same genomic coordinate in both229 datasets were considered common SNPs. Common SNPs in230 the bulk dataset were used to detect SNP-trait associations231 for all three methods.232 Table 1. Chromosomal distribution of SNPs - using the genome sequences of both the parents and the bulks Chromosome sSNPs TotalSNPs sSNP/totalSNP 1 1170 139 910 0.0084 2 310 125 129 0.0025 3 459 102 331 0.0045 4 330 89 577 0.0037 5 372 84 706 0.0044 6 1581 83 605 0.0189 7 378 94 371 0.0040 8 258 80 617 0.0032 9 1292 67 157 0.0192 10 363 56 681 0.0064 11 2765 88 287 0.0313 12 241 87 145 0.0028 Genome-wide 9519 1 099 516 0.0087 The significant structural variant method: Each SNP in the233 dataset was tested via Fisher’s exact test using its four AD234 values, and SNPs with P-values less than 0.01 were defined235 as sSNPs. The chromosomal distributions of the sSNPs and236 the total SNPs are summarized in Table 1. Using the sliding237 window algorithm, the genomic distribution of the sSNPs, the238 total SNPs, and the sSNP/totalSNP ratios of sliding windows239 were plotted against their genomic position (Figure 1a and240 Figure 1b). A genome-wide threshold was estimated as 0.0538241 via simulation as described previously (8). Two peaks above242 the threshold were identified: a minor one on chromosome243 9 and a major one on chromosome 11. The position of the244 peak on chromosome 9 was at 1.11 Mb, the sliding window245 contained 230 sSNPs and 3738 total SNPs, corresponding246 to an sSNP/totalSNP ratio of 0.0615; the position of the247 peak on chromosome 11 was at 26.44 Mb, the sliding window248 contained 675 sSNPs and 1139 total SNPs, corresponding to an249 sSNP/totalSNP ratio of 0.5926. The sliding window-specific 250 threshold was estimated for each peak via simulation, and 251 the values were 0.0551 and 0.0623, respectively, indicating 252 both peaks were significant. Both values are higher than the 253 genome-wide threshold, probably due to the lower amounts of 254 total SNPs in these sliding windows. The average SNPs per 255 sliding window was 5893. 256 The G-statistic method: The G-statistic value of each SNP 257 in the dataset was calculated, and its threshold was estimated 258 via simulation as described previously (8). Using the sliding 259 window algorithm, the G-statistic value of each sliding win- 260 dow, the average G-statistic values of all SNPs in that sliding 261 window, was plotted against its genomic position (Figure 1c), 262 and the curve pattern was very similar to that in Figure 1b. A 263 significant peak was identified on chromosome 11; its position 264 was at 26.49 Mb, its G-statistic value was 12.8120, well above 265 the threshold 9.0224 (99.5th percentile). 266 The allele frequency method: The ΔAF value of each SNP in 267 the dataset was calculated, and the ΔAF threshold of the SNP 268 was estimated via simulation as described previously (8). Using 269 the sliding window algorithm, the ΔAF value of each sliding 270 window, the average ΔAF values of all SNPs in that sliding 271 window, was plotted against its genomic position (Figure 1d). 272 A significant peak on chromosome 11 was identified, the peak 273 position was located at 26.45 Mb, its ΔAF value was 0.7173, 274 and the 99% confidence interval was −0.6508 to 0.6497. 275 BSA-Seq data analysis using only the bulk genome se- 276 quences. The SNP calling-generated bulk SNP dataset was 277 processed with the Python script PyBSASeq.py. All the meth- 278 ods and parameters were the same as above; the only difference 279 was that the parental SNP dataset was not used. 280 The significant structural variant method: The chromoso- 281 mal distribution of the sSNPs and total SNPs are summarized 282 in Table 2. The total number of SNPs was 1 346 185 here, 283 much higher than the above, which was 1 099 516. The ge- 284 nomic distribution of the sSNPs, the total SNPs, and the 285 sSNP/totalSNP ratios of the sliding windows are presented 286 in Figure 2a and Figure 2b. The patterns of the curves were 287 very similar to those in Figure 1a and Figure 1b. One of the 288 obvious differences was that sSNP/totalSNP ratios of the slid- 289 ing windows were much lower than those in Figure 1b, leading 290 to missing the minor locus on chromosome 9. Only the peak 291 on chromosome 11 was significant; it was located at 26.96 Mb, 292 a 520 kb shift compared to Figure 1b. The sliding window 293 contained 1122 sSNPs and 2945 total SNPs, corresponding to 294 a 0.3810 sSNP/totalSNP ratio, well above the genome-wide 295 threshold (0.0535) and the sliding window specific threshold 296 (0.0601). The average SNPs per sliding window was 7215. 297 The G-statistic method: The patterns of the G-statistic 298 value plot (Figure 2c) were very similar to that in Figure 1c, 299 but the G-statistic values were significantly lower than those 300 in Figure 1c, and the threshold did not change much. Only 301 a single sliding window was above the threshold (8.8953), its 302 position was at 29.96 Mb, and its G-statistic value was 8.9060. 303 The allele frequency method: Without the aid of the 304 parental genome sequences, the pattern of the ΔAF curve 305 of chromosome 11 (Figure 2d), especially the genomic region 306 associated with the trait, was drastically different from that in 307 Figure 1d. Differences in the curve patterns were observed in 308 other chromosomes as well, but they were relatively minor. All 309 ΔAF values were within the 99% confidence interval, although 310 Zhang et al. bioRχiv | November 24, 2020 | 3 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq_WP.py https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq.py https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/ DR AF T 0 2000 4000 6000 8000 10000 N um be r o f S N P s Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 Chr11 Chr12 0.0 0.2 0.4 0.6 sS N P /to ta lS N P 0 5 10 G -s ta tis tic 0 1 2 3 4 0.5 0.0 0.5 A F 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 Genomic position (×10 Mb) A B C D Figure 1. BSA-Seq data analysis using the genome sequences of both the parents and the bulks. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs (blue) and totalSNPs (black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values. Table 2. Chromosomal distribution of SNPs - using only the bulk genome sequences Chromosome sSNPs TotalSNPs sSNP/totalSNP 1 1335 163 260 0.0082 2 391 146 877 0.0027 3 578 120 319 0.0048 4 442 110 952 0.0040 5 481 103 362 0.0047 6 1724 103 416 0.0167 7 459 114 564 0.0040 8 373 103 385 0.0036 9 1410 82 744 0.0170 10 572 78 206 0.0073 11 3120 112 719 0.0277 12 281 106 381 0.0026 Genome-wide 11 166 1 346 185 0.0083 AD swapping was performed on only 67 396 SNPs, 6.1% of 311 total SNPs. 312 Discussion 313 We tested how parental genome sequences affected the detec- 314 tion of SNP-trait associations via BSA-Seq using a dataset 315 of the rice root-knot nematode resistance. Using the genome 316 sequences of both the parents and bulks, a major locus on 317 chromosome 11 and a minor locus on chromosome 9 were de- 318 tected via the significant structural variant method. However, 319 only the major locus was detected via the G-statistic method 320 and the allele frequency method. The positions of the peaks 321 detected via different methods were not the same, but they 322 were very close to each other. Using only the bulk genome 323 sequences, the major locus can be detected via only the signif- 324 icant structural variant and G-statistic methods. The allele 325 frequency method uses the ΔAF value of a SNP to measure 326 allele (REF/ALT) enrichment in the SNP locus, and the G- 327 statistic method uses the G-statistic value of a SNP to measure 328 the allele enrichment; ΔAF and G-statistic are parameters 329 at the SNP level, therefore, both methods use a SNP level 330 parameter to identify significant sliding windows for the detec- 331 tion of the genomic region-trait associations. The significant 332 structural variant method, however, uses the sSNP/totalSNP 333 ratio, a parameter at the sliding window level, to measure the 334 sSNP enrichment in a sliding window for the identification of 335 the trait-associated genomic regions. A SNP normally has less 336 than 100 reads because of the cost concern, while a sliding 337 4 | https://doi.org/10.1101/654137 Zhang et al. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://doi.org/10.1101/654137 https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/ DR AF T 0 2500 5000 7500 10000 12500 N um be r o f S N P s Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 Chr11 Chr12 0.0 0.2 0.4 sS N P /to ta lS N P 2.5 5.0 7.5 G -s ta tis tic 0 1 2 3 4 0.5 0.0 0.5 A F 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 Genomic position (×10 Mb) A B C D Figure 2. BSA-Seq data analysis using only the bulk genome sequences. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs (blue) and totalSNPs (black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values. window normally contains thousands of SNPs. Thus, the sig-338 nificant structural variant method has much higher statistical339 power, which is consistent with our observation. Our results340 revealed that the parental genome sequences did not much341 affect the plot patterns of the sSNP/totalSNP ratios and the342 G-statistic values. However, the plot patterns of the ΔAF343 value of chromosome 11 were altered dramatically when the344 parental genome sequences were not used.345 The significant structural variant method assesses if a SNP346 is likely associated with the trait via Fisher’s exact test. The347 greater the ALT proportion differences between the bulks, the348 less the P-value of the Fisher’s exact test, and the more likely349 the SNP is associated with the trait. Fisher’s exact test takes a350 numpy array or a Python list as its input, the same P-value will351 be obtained with either [[ADref1, ADalt1], [ADref2, ADalt2]] or352 [[ADalt1, ADref1], [ADalt2, ADref2]] as its input. The G-statistic353 method assesses if a SNP is likely associated with the trait354 via the G-test; the greater the G-statistic value of a SNP, the355 more likely it contributes to the trait phenotype (11). The G-356 statistic values are the same with either input [[ADref1, ADalt1],357 [ADref2, ADalt2]] or [[ADalt1, ADref1], [ADalt2, ADref2]]. The358 order of the AD values (REF/ALT reads) in bulks does not359 affect the P-value of Fisher’s exact test or the G-statistic value360 of G-test, which is why the parental genome sequences-guided361 AD swapping does not alter the curve patterns of both methods.362 Therefore, theoretically, parental genome sequences are not363 required to identify genomic region-trait associations in either364 the significant structural variant method or the G-statistic365 method. 366 When the parental genome sequences were used, AD value 367 swapping was performed for the SNPs in which the genotype of 368 LD24 was different from the REF base, and the ΔAF values of 369 these SNPs were calculated based on the swapped AD values 370 using equation 1. AD swapping makes the adjacent SNP alleles 371 from the same parent have similar AD values and similar ΔAF 372 values. The ΔAF values of such SNPs were calculated using 373 equation 2 if not performing AD swapping. Equation 2 can 374 be converted to equation 3, which produces an opposite value 375 relative to that produced by equation 1. For two adjacent 376 SNPs in LD24, where one SNP has the same genotype as the 377 REF base while the other has the same genotype as the ALT 378 base, they would have opposite ΔAF values if AD swapping 379 is not performed. For the SNPs that do not contribute to 380 the trait phenotype and are not linked to any trait-associated 381 genomic regions, their ΔAF value should fluctuate around 382 zero. The parental genome sequences will have less effect 383 on the ΔAF value of the sliding windows containing such 384 SNPs. However, for trait-associated SNPs, adjacent SNPs 385 with opposite ΔAF values would cancel each other out and 386 lower the ΔAF value of the sliding window significantly, which 387 is the case observed on chromosome 11 in Figure 2d. 388 ∆AF = ADref 2 ADref 2 + ADalt2 − ADref 1 ADref 1 + ADalt1 [2] 389 Zhang et al. bioRχiv | November 24, 2020 | 5 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/ DR AF T ∆AF = ADalt1 ADref 1 + ADalt1 − ADalt2 ADref 2 + ADalt2 [3]390 When the parental genome sequences were not used, the391 sSNP/totalSNP ratios and the G-statistic values were signifi-392 cantly lower. The peak sSNP/totalSNP ratio on chromosome393 11 was 0.5926 in Figure 1b, while it was 0.3810 in Figure 2b; it394 was similar for the peak G-statistic values. The decreasing of395 sSNP/totalSNP ratio and the G-statistic value is likely caused396 by sequencing artifacts and heterozygosity in the parental397 lines. There were 1 345 185 SNPs in the bulk dataset when398 not using the parental genome sequences, while there were399 1 099 516 SNPs in the dataset with the aid of the parental400 genome sequences. Comparison of the two SNP dataset re-401 vealed that 109 445 SNPs were unique to the bulks. Because402 all the SNPs in the bulks are derived from the parental lines,403 crossing should not generate new SNPs; thus this category404 of SNPs was most likely caused by sequencing artifacts. The405 sequencing coverage in the bulk was less than eight, which is406 very low. Higher sequencing coverage would help decrease the407 number of SNPs derived from sequence artifacts. Additionally,408 137 224 SNP were heterozygous in the parental lines. Without409 the parental genome sequences, this category of SNPs could410 not be filtered out from the bulk SNP dataset. However, these411 SNPs can be decreased via selfing the parental line more gener-412 ations: five-generations selfing can decrease the heterozygosity413 of both parental lines to a maximum of 6.25%.414 To determine how parental heterozygosity and sequenc-415 ing artifacts affected the detection of genomic region-trait416 associations, we removed the heterozygous SNPs or the bulk-417 specific SNPs from the bulk SNP dataset, and analyzed the418 data separately. By removing the heterozygous SNPs, the419 peak on chromosome 11 was shifted to 26.28 Mb for both420 the sSNP/totalSNP ratio and the G-statistic value, and the421 sSNP/totalSNP ratio of the peak was increased to 0.4835,422 well above the sliding window-specific threshold 0.0603. The423 G-statistic value of the peak was 10.8411, significantly higher424 than the threshold 8.9532 as well. By removing bulk-specific425 SNPs, the peak on chromosome 11 shifted to 26.49 Mb for426 both the sSNP/totalSNP ratio and the G-statistic value. The427 sSNP/totalSNP ratio of the peak and the sliding window-428 specific threshold were 0.4302 and 0.0637, respectively, and429 the G-statistic value of the peak and the threshold were 9.7591430 and 8.9092, respectively. Although both the sSNP/totalSNP431 ratio and the G-statistic value were lower than above, they432 were still higher than their corresponding thresholds. While433 seemed the heterozygous SNPs affected the sSNP/totalSNP434 ratio and the G-statistic value a little more than the bulk-435 specific SNPs, it is more likely that both produced similar436 levels of noise for the sSNP/totalSNP ratio and the G-statistic437 value considering that the former was 27 779 greater than438 the latter. When using only the bulk genome sequences, the439 sSNP/totalSNP peak position on chromosome 11 was shifted440 0.52 Mb (26.44 Mb to 26.96 Mb) due to the presence of the441 bulk-specific SNPs and the heterozygous SNPs in the dataset,442 but this is a very short distance for genetic mapping. Although443 only a single dataset was examined here, the genome-wide444 similarity of the sSNP/totalSNP curve patterns in Figure 1b445 and Figure 2b suggests that the significant structural method446 is highly reproducible using only the bulk genome sequences.447 Conclusions 448 The plotting pattern of the ΔAF values in the trait-associated 449 genomic region was very different when using only the bulk 450 genome sequences. Without the aid of the parental genome 451 sequences, the ΔAF values of the sliding windows could not 452 be correctly calculated; thus, the allele frequency method 453 cannot be used to identify SNP-trait association. In contrast, 454 the parental genome sequence does not affect the plotting 455 patterns of both the significant structural variant method and 456 the G-statistic method, but the sSNP/totalSNP ratios and the 457 G-statistic values decreased significantly due to sequencing 458 artifacts and/or heterozygosity of the parental lines. Because 459 of its high detection power, major SNP-trait associations can 460 still be reliably detected via the significant structural variant 461 method even the sequence coverage was very low. 462 Acknowledgments. JZ was supported by the National Science 463 Foundation grant [IOS-1546625 to DRP]. We are grateful to Lahari 464 et al. for generating the sequencing data and making it available 465 to the public. We thank Irene E. Palmer for critical review and 466 thank Nathan Lynch for valuable comments. The manuscript was 467 prepared using a modified version of the PNAS LATEX template. 468 Bibliography 469 1. RW Michelmore, I Paran, RV Kesseli, Identification of markers linked to disease-resistance 470 genes by bulked segregant analysis: a rapid method to detect markers in specific genomic re- 471 gions by using segregating populations. Proc. Natl. Acad. Sci. U.S.A. 88, 9828–9832 (1991). 472 2. JJ Giovannoni, RA Wing, MW Ganal, SD Tanksley, Isolation of molecular markers from spe- 473 cific chromosomal intervals using DNA pools from existing mapping populations. Nucleic 474 Acids Res 19, 6553–6568 (1991). 475 3. I Imerovski, et al., BSA-seq mapping reveals major QTL for broomrape resistance in four 476 sunflower lines. Mol Breed. 39, 41 (2019). 477 4. S Arikit, et al., QTL-seq identifies cooked grain elongation QTLs near soluble starch synthase 478 and starch branching enzymes in rice ( Oryza sativa L.). Sci Rep 9, 1–10 (2019). 479 5. Q Chen, et al., Identification and genetic mapping for rht-DM, a dominant dwarfing gene in 480 mutant semi-dwarf maize using QTL-seq approach. Genes Genomics 40, 1091–1099 (2018). 481 6. J Clevenger, et al., Mapping Late Leaf Spot Resistance in Peanut (Arachis hypogaea) Using 482 QTL-seq Reveals Markers for Marker-Assisted Selection. Front Plant Sci 9, 83 (2018). 483 7. F Duveau, et al., Mapping small effect mutations in Saccharomyces cerevisiae: impacts of 484 experimental design and mutational properties. G3 (Bethesda) 4, 1205–1216 (2014). 485 8. J Zhang, DR Panthee, PyBSASeq: a simple and effective algorithm for bulked segregant 486 analysis with whole-genome sequencing data. BMC Bioinforma. 21, 99 (2020). 487 9. Z Yang, et al., Mapping of Quantitative Trait Loci Underlying Cold Tolerance in Rice Seedlings 488 via High-Throughput Sequencing of Pooled Extremes. PLOS ONE 8, e68433 (2013). 489 10. H Takagi, et al., QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome 490 resequencing of DNA from two bulked populations. Plant J. 74, 174–183 (2013). 491 11. PM Magwene, JH Willis, JK Kelly, The Statistics of Bulk Segregant Analysis Using Next Gen- 492 eration Sequencing. PLOS Comput. Biol. 7, e1002255 (2011). 493 12. Z Lahari, et al., QTL-seq reveals a major root-knot nematode resistance locus on chromo- 494 some 11 in rice (Oryza sativa L.). Euphytica 215, 117 (2019). 495 6 | https://doi.org/10.1101/654137 Zhang et al. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint https://doi.org/10.1101/654137 https://doi.org/10.1101/2021.02.08.430275 http://creativecommons.org/licenses/by-nc-nd/4.0/