Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes


DR
AF

T

Next-generation sequencing-based bulked segregant analysis without
sequencing the parental genomes
Jianbo Zhanga,� and Dilip R. Pantheea,�

a Department of Horticultural Science, North Carolina State University, Mountain Horticultural Crops Research and Extension Center, 455 Research Drive, Mills River, NC
28759, USA

This manuscript was compiled on November 24, 2020

The genomic region(s) that controls a trait of interest can be rapidly
identified using BSA-Seq, a technology in which next-generation se-
quencing (NGS) is applied to bulked segregant analysis (BSA). We
recently developed the significant structural variant method for BSA-
Seq data analysis that exhibits higher detection power than standard
BSA-Seq analysis methods. Our original algorithm was developed
to analyze BSA-Seq data in which genome sequences of one par-
ent served as the reference sequences in genotype calling, and thus
required the availability of high-quality assembled parental genome
sequences. Here we modified the original script to allow for the ef-
fective detection of the genomic region-trait associations using only
bulk genome sequences. We analyzed a public BSA-Seq dataset us-
ing our modified method and the standard allele frequency and G-
statistic methods with and without the aid of the parental genome
sequences. Our results demonstrate that the genomic region(s) as-
sociated with the trait of interest could be reliably identified only via
the significant structural variant method without using the parental
genome sequences.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

BSA-Seq | PyBSASeq | QTL | genomic region-trait association

Bulked segregant analysis (BSA) was developed for the1 quick identification of genetic markers associated with a2
trait of interest (1, 2). For a particular trait, two groups of3
individuals with contrasting phenotypes are selected from a4
segregating population. Equal amounts of DNA are pooled5
from each individual within a group. The pooled DNA samples6
are then subjected to analysis, such as restriction fragment7
length polymorphism (RFLP) or random amplification of poly-8
morphic DNA (RAPD). Fragments unique to either group are9
potential genetic markers that may link to the gene(s) that10
control phenotypic expression for the trait of interest. Can-11
didate markers are further tested against the population to12
verify the marker-trait associations. With the recent dramatic13
reductions in cost, next-generation sequencing (NGS) has been14
applied to more and more BSA studies (3–7). This new tech-15
nology is referred to as BSA-Seq. In BSA-Seq, pooled DNA16
samples are not subjected to RFLP/RAPD analysis, but are17
directly sequenced instead. Genome-wide structural variants18
between bulks, such as single nucleotide polymorphisms (SNP)19
and small insertions/deletions (InDel), are identified based20
on the sequencing data. Genomic regions linked to the trait-21
controlling gene(s) are then identified based on the enrichment22
of the SNP/InDel alleles in those regions in each bulk. The23
time-consuming and labor-intensive marker development and24
genetic mapping steps are eliminated in the BSA-Seq method.25
Moreover, SNPs/InDels can be detected genome-wide via NGS,26
which allows for the reliable identification of trait-associated27
genomic regions across the entire genome.28

For each SNP/InDel in a BSA-Seq dataset, the base (or29
oligo in the case of an InDel) that is the same as in the reference30
genome is termed the reference base (REF), and the other31

base is termed the alternative base (ALT). Because each bulk 32
contains many individuals, the vast majority of SNP loci in 33
the dataset have both REF and ALT bases. For each SNP, 34
the number of reads of its REF/ALT alleles is termed allele 35
depth (AD). Because of the phenotypic selection via bulking, 36
for trait-associated SNPs, the ALT allele should be enriched 37
in one bulk while the REF allele should be enriched in the 38
other. However, for SNPs not associated with the trait, both 39
ALT and REF alleles would be randomly segregated in both 40
bulks, and neither enriched in either bulk. Hence these four 41
AD values can be used to assess how likely a SNP/InDel is 42
associated with the trait. 43

We have previously developed the significant structural 44
variant method for BSA-Seq data analysis (8). In this method, 45
a SNP/InDel is assessed with Fisher’s exact test using the AD 46
values of both bulks. A SNP/InDel is considered significant 47
if the P-value of Fisher’s exact test is lower than a specific 48
cut-off value, e.g., 0.01. A genomic region normally contains 49
many SNPs/InDels. The ratio of the significant structural 50
variants to the total structural variants is used to judge if 51
this genomic region is associated with the trait of interest. 52
We tested this method using the BSA-Seq data of a rice cold- 53
tolerance study (9). One of the parents in this study was rice 54
cultivar Oryza sativa ssp. japonica cv. Nipponbare. Its high- 55
quality assembled genome sequences were used as the reference 56
sequences for SNP/InDel calling as well, which makes the 57
genotype calling and SNP/InDel filtering very straightforward: 58
any locus in any bulk that is different from the REF allele is 59
a valid SNP/InDel (8). 60

Only high-quality assembled genome sequences can serve as 61
the reference sequences in genotype calling, an essential step 62
in BSA-Seq data analysis. For most species, however, such 63
sequences are available for only a single or limited number of 64
lines. If lines without high-quality assembled genome sequences 65
are used as the parents in BSA-Seq studies, the parental 66
genomes are often sequenced via NGS for the determination 67

Significance Statement

BSA-Seq can be utilized to rapidly identify structural variant-
trait associations, and our modified significant structural variant
method allows the detection of such associations without se-
quencing the parental genomes, leading to further lower the
sequencing cost and making BSA-Seq more accessible to the
research community and more applicable to the species with a
large genome.

Author contributions: JZ and DRP conceived the study. JZ developed the algorithm, wrote the
Python code, analyzed the data, and wrote and edited the manuscript. DRP edited the manuscript
and supervised the project.

The authors declare no conflict of interest.

�To whom correspondence should be addressed. E-mail: dilip_panthee@ncsu.edu or
zhang.jianbo@gmail.com

https://doi.org/10.1101/654137 bioRχiv | November 24, 2020 | 1

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://doi.org/10.1101/654137
https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/


DR
AF

T

of the parental origin of SNP alleles and the identification68
of parental heterozygous SNPs. Modification of our original69
method to allow the analysis of BSA-Seq data in the absence of70
assembled or NGS-generated parental genome sequences would71
provide greater flexibility and significantly reduce sequencing72
costs. Hence, we modified our original script to allow for73
the identification of the false-positive SNPs/InDels and part74
of the heterozygous loci in the parents without the aid of75
the parental genome sequences. Using the modified script,76
along with the scripts for the standard G-statistic and allele77
frequency methods (10, 11), we analyzed a public BSA-Seq78
dataset using either the genome sequences of both the parents79
and the bulks, or the bulk genome sequences alone. The80
results revealed that reliable detection of genomic region-trait81
associations can be achieved only via our modified script when82
using only the bulk genome sequences.83

Materials and Methods84

The sequencing data used in this study were generated by Lahari et85
al. (12). Using the allele frequency method, the authors identified86
a single locus for root-knot nematode resistance in rice. In that87
study, the parents of the F2 population were LD24 and VialoneNano,88
yielding an F2 population size of 178 (plants), and both the resistant89
bulk and the susceptible bulk contained 23 plants each. The DNA90
samples of both the parents and the bulks were sequenced using91
Illumina MiSeq Sequencing System and MiSeq v3 chemistry.92

The BSA-Seq sequencing data (ERR2696318: parent LD24;93
ERR2696319: parent VialoneNano; ERR2696321: the resistant bulk94
from the F2 population; ERR2696322: the susceptible bulk from95
the F2 population) were downloaded from the European Nucleotide96
Archive (ENA) using the Linux program wget, and the rice reference97
sequence (Release 47) was downloaded from https://plants.ensembl.98
org/Oryza_sativa/Info/Index. Sequencing data preprocessing and SNP99
calling were performed as described previously (8). When analyzing100
the BSA-Seq data with the genome sequences of both the parents101
and the bulks, bulk/parent SNP calling was performed separately.102
The common SNPs of the two SNP datasets were used for the103
downstream analysis.104

The SNP dataset generated via SNP calling was processed with105
our Python script to identify significant SNP-trait associations.106
A single script containing all the three methods is available on107
the website https://github.com/dblhlx/PyBSASeq. The workflow of the108
scripts is as follows:109
1. Read the .tsv input file generated via SNP calling into a Pandas110

DataFrame.111
2. Perform SNP filtering on the Pandas DataFrame.112
3. Identify the significant SNPs (sSNPs) via Fisher’s exact test113

(the significant structural variant method), calculate the ΔAF114
(allele frequency difference between bulks) values (the allele115
frequency method), or calculate the G-statistic values (the116
G-statistic method) using the four AD values (ADref1 and117
ADalt1 of bulk 1 and ADref2 and ADalt2 of bulk 2) of each118
SNP in the filtered Pandas DataFrame.119

4. Use the sliding window algorithm to plot the sSNP/totalSNP120
ratios, the ΔAF values, or the G-statistic values against their121
genomic positions.122

5. Estimate the threshold of the sSNP/totalSNP ratio, the ΔAF,123
or the G-statistic via simulation. The thresholds are used to124
identify the significant peaks/valleys in the plots generated in125
step 4.126

Identification of the sSNPs, calculation of the sSNP/totalSNP127
ratios, the G-statistic values, or the ΔAF values, and estimation128
of their thresholds were carried out as described previously (8).129
The 99.5th percentile of 10 000 simulated sSNP/totalSNP ratios130
or G-statistic values was used as the threshold for the significant131
structural variant method or the G-statistic method, and the 99%132
confidence interval of 10 000 simulated ΔAF values was used as133
the threshold for the allele frequency method. For all methods,134
the size of the sliding windows is 2 Mb and the incremental step is135
10 kb. In our previous work, a parent was the japonica rice cultivar136

nipponbare, and its genome sequences were used as the reference 137
sequences for SNP/InDel calling. In the current dataset, the parents 138
were LD24 and VialoneNano; many false-positive SNPs/InDels and 139
heterozygous loci in the parents would be included in the dataset if 140
analyzing the BSA-Seq data using the original script. Hence, SNP 141
filtering is carried out a little differently from previously described 142
(8), and its details are below (see Table S1 for examples): 143

• Unmapped SNPs or SNPs mapped to the mitochondrial or 144
chloroplast genome 145

• SNPs with an ‘NA’ value in any column of the DataFrame 146
• SNPs with zero REF read and a single ALT allele in both 147

bulks/parents 148
• SNPs with three or more ALT alleles in any bulk/parent 149
• SNPs with two ALT alleles and its REF read is not zero in 150

any bulk/parent 151
• SNPs in which the bulk/parent genotypes do not agree with 152

the REF/ALT bases 153
• SNPs in which the bulk/parent genotypes are not consistent 154

with the AD values 155
• SNPs with a genotype quality (GQ) score less than 20 in any 156

bulk 157
• SNPs with very high reads 158
• SNPs heterozygous in any parent when parental genome se- 159

quences are available 160
Additionally, for SNPs with two ALT alleles and zero REF read 161

in both bulks/parents, the REF allele is replaced with the first allele 162
in the ‘ALT’ field, its ALT allele is replaced with the second allele 163
in the original ‘ALT’ field. The REF read, and a comma after it, 164
are removed from both the allele depth (AD) fields (one for each 165
bulk/parent). This step is carried out before checking the genotype 166
agreement between bulks and the REF/ALT fields. When parental 167
genome sequences are involved, the common SNP set is identified 168
before filtering out the SNPs with a low GQ score in the parental 169
SNP dataset. 170

The tightly linked SNP alleles from the same parent tend to 171
segregate together and should have a similar extent of allele enrich- 172
ment, and thus similar AD values. In a SNP dataset, the genotypes 173
of each bulk/parent are represented as ‘GTref/GTalt’ when a SNP 174
contains both the REF base and the ALT base in the genotype 175
(GT) field, and the AD values in each bulk/parent is represented as 176
‘ADref,ADalt’. The genotype and the AD value of the REF allele are 177
always placed first in both fields. For a SNP locus in the .tsv input 178
file, the allele having the same genotype as that in the reference 179
genome is defined as the REF allele. However, it is highly unlikely 180
that all of the SNP alleles in a parent are the same as those in 181
the reference genome, except in instances where reference genome 182
sequences used in SNP calling are from one of the parents as in 183
the case of the cold-tolerance study as mentioned above (9). It is 184
necessary to place the genotypes and AD values of all SNP alleles 185
from one parent (e.g., LD24) in the REF position, and those from 186
the other parent (e.g., VialoneNano) to the ALT position in the 187
GT and AD fields to make the bulk dataset consistent. Thus, for 188
a particular SNP, if the REF base in the .tsv file is different from 189
the genotype of LD24 (either parent will work), its GT/AD values 190
would be swapped, e.g., ‘G/A’ to ‘A/G’ and ‘19,9’ to ‘9,19’. AD/GT 191
swapping is performed following SNP filtering and is performed only 192
when the parental genome sequences are used to aid BSA-Seq data 193
analysis. Equation 1 is used for ΔAF calculation. AD swapping 194
ensures that adjacent SNPs have similar ΔAF values. 195

∆AF =
ADalt2

ADref 2 + ADalt2
−

ADalt1
ADref 1 + ADalt1

[1] 196

197

Results 198

The original sequence reads were 3.9G, 3.8G, 3.4G, and 3.5G; 199
they became 3.8G, 3.6G, 3.3G, and 3.4G after quality con- 200
trol, respectively, in ERR2696318 (parent LD24), ERR2696319 201
(parent VialoneNano), ERR2696321 (the resistant bulk), and 202
ERR2696322 (the susceptible bulk), which correspond to 8.8×, 203
8.5×, 7.6×, and 7.9× coverage, respectively (12). The prepro- 204
cessed sequences were used for SNP calling to generate a SNP 205

2 | https://doi.org/10.1101/654137 Zhang et al.

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://plants.ensembl.org/Oryza_sativa/Info/Index
https://plants.ensembl.org/Oryza_sativa/Info/Index
https://plants.ensembl.org/Oryza_sativa/Info/Index
https://github.com/dblhlx/PyBSASeq
https://doi.org/10.1101/654137
https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/


DR
AF

T

dataset, which was analyzed using the modified significant206
structural variant method, the G-statistic method, and the207
allele frequency method with or without the aid of the parental208
genome sequences.209

BSA-Seq data analysis using the genome sequences of210
both the parents and the bulks. The SNP calling-generated211
parent/bulk SNP dataset was processed with the Python212
script PyBSASeq_WP.py. SNP filtering was performed as213
described in the Materials and Methods section. The parental214
SNP dataset was processed first, and the SNPs heterozygous215
in any parent were eliminated because all algorithms assume216
all SNP loci are homozygous in the parental lines. Threshold217
estimation is based on this assumption. Although most rice218
breeding lines should be homozygous in most loci, more219
than 7% heterozygous SNP loci (2 011 062 homozygous and220
153 000 heterozygous) were identified in the parental SNP221
dataset. However, the GATK’s variant calling tools are222
designed to be very lenient in order to achieve a high degree223
of sensitivity (https://gatk.broadinstitute.org/hc/en-us/articles/224
360035535932-Germline-short-variant-discovery-SNPs-Indels-),225
we cannot rule out the possibility that some of the heterozy-226
gous loci were caused by sequencing artifacts. The bulk SNP227
dataset was processed second. The SNPs with the same228
chromosome ID and the same genomic coordinate in both229
datasets were considered common SNPs. Common SNPs in230
the bulk dataset were used to detect SNP-trait associations231
for all three methods.232
Table 1. Chromosomal distribution of SNPs - using the genome
sequences of both the parents and the bulks

Chromosome sSNPs TotalSNPs sSNP/totalSNP

1 1170 139 910 0.0084
2 310 125 129 0.0025
3 459 102 331 0.0045
4 330 89 577 0.0037
5 372 84 706 0.0044
6 1581 83 605 0.0189
7 378 94 371 0.0040
8 258 80 617 0.0032
9 1292 67 157 0.0192
10 363 56 681 0.0064
11 2765 88 287 0.0313
12 241 87 145 0.0028

Genome-wide 9519 1 099 516 0.0087

The significant structural variant method: Each SNP in the233
dataset was tested via Fisher’s exact test using its four AD234
values, and SNPs with P-values less than 0.01 were defined235
as sSNPs. The chromosomal distributions of the sSNPs and236
the total SNPs are summarized in Table 1. Using the sliding237
window algorithm, the genomic distribution of the sSNPs, the238
total SNPs, and the sSNP/totalSNP ratios of sliding windows239
were plotted against their genomic position (Figure 1a and240
Figure 1b). A genome-wide threshold was estimated as 0.0538241
via simulation as described previously (8). Two peaks above242
the threshold were identified: a minor one on chromosome243
9 and a major one on chromosome 11. The position of the244
peak on chromosome 9 was at 1.11 Mb, the sliding window245
contained 230 sSNPs and 3738 total SNPs, corresponding246
to an sSNP/totalSNP ratio of 0.0615; the position of the247
peak on chromosome 11 was at 26.44 Mb, the sliding window248
contained 675 sSNPs and 1139 total SNPs, corresponding to an249

sSNP/totalSNP ratio of 0.5926. The sliding window-specific 250
threshold was estimated for each peak via simulation, and 251
the values were 0.0551 and 0.0623, respectively, indicating 252
both peaks were significant. Both values are higher than the 253
genome-wide threshold, probably due to the lower amounts of 254
total SNPs in these sliding windows. The average SNPs per 255
sliding window was 5893. 256

The G-statistic method: The G-statistic value of each SNP 257
in the dataset was calculated, and its threshold was estimated 258
via simulation as described previously (8). Using the sliding 259
window algorithm, the G-statistic value of each sliding win- 260
dow, the average G-statistic values of all SNPs in that sliding 261
window, was plotted against its genomic position (Figure 1c), 262
and the curve pattern was very similar to that in Figure 1b. A 263
significant peak was identified on chromosome 11; its position 264
was at 26.49 Mb, its G-statistic value was 12.8120, well above 265
the threshold 9.0224 (99.5th percentile). 266

The allele frequency method: The ΔAF value of each SNP in 267
the dataset was calculated, and the ΔAF threshold of the SNP 268
was estimated via simulation as described previously (8). Using 269
the sliding window algorithm, the ΔAF value of each sliding 270
window, the average ΔAF values of all SNPs in that sliding 271
window, was plotted against its genomic position (Figure 1d). 272
A significant peak on chromosome 11 was identified, the peak 273
position was located at 26.45 Mb, its ΔAF value was 0.7173, 274
and the 99% confidence interval was −0.6508 to 0.6497. 275

BSA-Seq data analysis using only the bulk genome se- 276
quences. The SNP calling-generated bulk SNP dataset was 277
processed with the Python script PyBSASeq.py. All the meth- 278
ods and parameters were the same as above; the only difference 279
was that the parental SNP dataset was not used. 280

The significant structural variant method: The chromoso- 281
mal distribution of the sSNPs and total SNPs are summarized 282
in Table 2. The total number of SNPs was 1 346 185 here, 283
much higher than the above, which was 1 099 516. The ge- 284
nomic distribution of the sSNPs, the total SNPs, and the 285
sSNP/totalSNP ratios of the sliding windows are presented 286
in Figure 2a and Figure 2b. The patterns of the curves were 287
very similar to those in Figure 1a and Figure 1b. One of the 288
obvious differences was that sSNP/totalSNP ratios of the slid- 289
ing windows were much lower than those in Figure 1b, leading 290
to missing the minor locus on chromosome 9. Only the peak 291
on chromosome 11 was significant; it was located at 26.96 Mb, 292
a 520 kb shift compared to Figure 1b. The sliding window 293
contained 1122 sSNPs and 2945 total SNPs, corresponding to 294
a 0.3810 sSNP/totalSNP ratio, well above the genome-wide 295
threshold (0.0535) and the sliding window specific threshold 296
(0.0601). The average SNPs per sliding window was 7215. 297

The G-statistic method: The patterns of the G-statistic 298
value plot (Figure 2c) were very similar to that in Figure 1c, 299
but the G-statistic values were significantly lower than those 300
in Figure 1c, and the threshold did not change much. Only 301
a single sliding window was above the threshold (8.8953), its 302
position was at 29.96 Mb, and its G-statistic value was 8.9060. 303

The allele frequency method: Without the aid of the 304
parental genome sequences, the pattern of the ΔAF curve 305
of chromosome 11 (Figure 2d), especially the genomic region 306
associated with the trait, was drastically different from that in 307
Figure 1d. Differences in the curve patterns were observed in 308
other chromosomes as well, but they were relatively minor. All 309
ΔAF values were within the 99% confidence interval, although 310

Zhang et al. bioRχiv | November 24, 2020 | 3

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq_WP.py
https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
https://github.com/dblhlx/PyBSASeq/blob/master/PyBSASeq.py
https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/


DR
AF

T

0

2000

4000

6000

8000

10000
N

um
be

r o
f S

N
P

s
Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 Chr11 Chr12

0.0

0.2

0.4

0.6

sS
N

P
/to

ta
lS

N
P

0

5

10

G
-s

ta
tis

tic

0 1 2 3 4

0.5

0.0

0.5

A
F

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
Genomic position (×10 Mb)

A

B

C

D

Figure 1. BSA-Seq data analysis using the genome sequences of both the parents and the bulks. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs
(blue) and totalSNPs (black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values.

Table 2. Chromosomal distribution of SNPs - using only the bulk
genome sequences

Chromosome sSNPs TotalSNPs sSNP/totalSNP

1 1335 163 260 0.0082
2 391 146 877 0.0027
3 578 120 319 0.0048
4 442 110 952 0.0040
5 481 103 362 0.0047
6 1724 103 416 0.0167
7 459 114 564 0.0040
8 373 103 385 0.0036
9 1410 82 744 0.0170
10 572 78 206 0.0073
11 3120 112 719 0.0277
12 281 106 381 0.0026

Genome-wide 11 166 1 346 185 0.0083

AD swapping was performed on only 67 396 SNPs, 6.1% of 311
total SNPs. 312

Discussion 313

We tested how parental genome sequences affected the detec- 314
tion of SNP-trait associations via BSA-Seq using a dataset 315
of the rice root-knot nematode resistance. Using the genome 316
sequences of both the parents and bulks, a major locus on 317
chromosome 11 and a minor locus on chromosome 9 were de- 318
tected via the significant structural variant method. However, 319
only the major locus was detected via the G-statistic method 320
and the allele frequency method. The positions of the peaks 321
detected via different methods were not the same, but they 322
were very close to each other. Using only the bulk genome 323
sequences, the major locus can be detected via only the signif- 324
icant structural variant and G-statistic methods. The allele 325
frequency method uses the ΔAF value of a SNP to measure 326
allele (REF/ALT) enrichment in the SNP locus, and the G- 327
statistic method uses the G-statistic value of a SNP to measure 328
the allele enrichment; ΔAF and G-statistic are parameters 329
at the SNP level, therefore, both methods use a SNP level 330
parameter to identify significant sliding windows for the detec- 331
tion of the genomic region-trait associations. The significant 332
structural variant method, however, uses the sSNP/totalSNP 333
ratio, a parameter at the sliding window level, to measure the 334
sSNP enrichment in a sliding window for the identification of 335
the trait-associated genomic regions. A SNP normally has less 336
than 100 reads because of the cost concern, while a sliding 337

4 | https://doi.org/10.1101/654137 Zhang et al.

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://doi.org/10.1101/654137
https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/


DR
AF

T

0

2500

5000

7500

10000

12500
N

um
be

r o
f S

N
P

s
Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 Chr11 Chr12

0.0

0.2

0.4

sS
N

P
/to

ta
lS

N
P

2.5

5.0

7.5

G
-s

ta
tis

tic

0 1 2 3 4

0.5

0.0

0.5

A
F

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
Genomic position (×10 Mb)

A

B

C

D

Figure 2. BSA-Seq data analysis using only the bulk genome sequences. The red lines/curves are the thresholds. (A) Genomic distributions of sSNPs (blue) and totalSNPs
(black). (B) Genomic distributions of sSNP/totalSNP ratios. (C) Genomic distributions of G-statistic values. (D) Genomic distributions of ΔAF values.

window normally contains thousands of SNPs. Thus, the sig-338
nificant structural variant method has much higher statistical339
power, which is consistent with our observation. Our results340
revealed that the parental genome sequences did not much341
affect the plot patterns of the sSNP/totalSNP ratios and the342
G-statistic values. However, the plot patterns of the ΔAF343
value of chromosome 11 were altered dramatically when the344
parental genome sequences were not used.345

The significant structural variant method assesses if a SNP346
is likely associated with the trait via Fisher’s exact test. The347
greater the ALT proportion differences between the bulks, the348
less the P-value of the Fisher’s exact test, and the more likely349
the SNP is associated with the trait. Fisher’s exact test takes a350
numpy array or a Python list as its input, the same P-value will351
be obtained with either [[ADref1, ADalt1], [ADref2, ADalt2]] or352
[[ADalt1, ADref1], [ADalt2, ADref2]] as its input. The G-statistic353
method assesses if a SNP is likely associated with the trait354
via the G-test; the greater the G-statistic value of a SNP, the355
more likely it contributes to the trait phenotype (11). The G-356
statistic values are the same with either input [[ADref1, ADalt1],357
[ADref2, ADalt2]] or [[ADalt1, ADref1], [ADalt2, ADref2]]. The358
order of the AD values (REF/ALT reads) in bulks does not359
affect the P-value of Fisher’s exact test or the G-statistic value360
of G-test, which is why the parental genome sequences-guided361
AD swapping does not alter the curve patterns of both methods.362
Therefore, theoretically, parental genome sequences are not363
required to identify genomic region-trait associations in either364
the significant structural variant method or the G-statistic365

method. 366
When the parental genome sequences were used, AD value 367

swapping was performed for the SNPs in which the genotype of 368
LD24 was different from the REF base, and the ΔAF values of 369
these SNPs were calculated based on the swapped AD values 370
using equation 1. AD swapping makes the adjacent SNP alleles 371
from the same parent have similar AD values and similar ΔAF 372
values. The ΔAF values of such SNPs were calculated using 373
equation 2 if not performing AD swapping. Equation 2 can 374
be converted to equation 3, which produces an opposite value 375
relative to that produced by equation 1. For two adjacent 376
SNPs in LD24, where one SNP has the same genotype as the 377
REF base while the other has the same genotype as the ALT 378
base, they would have opposite ΔAF values if AD swapping 379
is not performed. For the SNPs that do not contribute to 380
the trait phenotype and are not linked to any trait-associated 381
genomic regions, their ΔAF value should fluctuate around 382
zero. The parental genome sequences will have less effect 383
on the ΔAF value of the sliding windows containing such 384
SNPs. However, for trait-associated SNPs, adjacent SNPs 385
with opposite ΔAF values would cancel each other out and 386
lower the ΔAF value of the sliding window significantly, which 387
is the case observed on chromosome 11 in Figure 2d. 388

∆AF =
ADref 2

ADref 2 + ADalt2
−

ADref 1
ADref 1 + ADalt1

[2] 389

Zhang et al. bioRχiv | November 24, 2020 | 5

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/


DR
AF

T

∆AF =
ADalt1

ADref 1 + ADalt1
−

ADalt2
ADref 2 + ADalt2

[3]390

When the parental genome sequences were not used, the391
sSNP/totalSNP ratios and the G-statistic values were signifi-392
cantly lower. The peak sSNP/totalSNP ratio on chromosome393
11 was 0.5926 in Figure 1b, while it was 0.3810 in Figure 2b; it394
was similar for the peak G-statistic values. The decreasing of395
sSNP/totalSNP ratio and the G-statistic value is likely caused396
by sequencing artifacts and heterozygosity in the parental397
lines. There were 1 345 185 SNPs in the bulk dataset when398
not using the parental genome sequences, while there were399
1 099 516 SNPs in the dataset with the aid of the parental400
genome sequences. Comparison of the two SNP dataset re-401
vealed that 109 445 SNPs were unique to the bulks. Because402
all the SNPs in the bulks are derived from the parental lines,403
crossing should not generate new SNPs; thus this category404
of SNPs was most likely caused by sequencing artifacts. The405
sequencing coverage in the bulk was less than eight, which is406
very low. Higher sequencing coverage would help decrease the407
number of SNPs derived from sequence artifacts. Additionally,408
137 224 SNP were heterozygous in the parental lines. Without409
the parental genome sequences, this category of SNPs could410
not be filtered out from the bulk SNP dataset. However, these411
SNPs can be decreased via selfing the parental line more gener-412
ations: five-generations selfing can decrease the heterozygosity413
of both parental lines to a maximum of 6.25%.414

To determine how parental heterozygosity and sequenc-415
ing artifacts affected the detection of genomic region-trait416
associations, we removed the heterozygous SNPs or the bulk-417
specific SNPs from the bulk SNP dataset, and analyzed the418
data separately. By removing the heterozygous SNPs, the419
peak on chromosome 11 was shifted to 26.28 Mb for both420
the sSNP/totalSNP ratio and the G-statistic value, and the421
sSNP/totalSNP ratio of the peak was increased to 0.4835,422
well above the sliding window-specific threshold 0.0603. The423
G-statistic value of the peak was 10.8411, significantly higher424
than the threshold 8.9532 as well. By removing bulk-specific425
SNPs, the peak on chromosome 11 shifted to 26.49 Mb for426
both the sSNP/totalSNP ratio and the G-statistic value. The427
sSNP/totalSNP ratio of the peak and the sliding window-428
specific threshold were 0.4302 and 0.0637, respectively, and429
the G-statistic value of the peak and the threshold were 9.7591430
and 8.9092, respectively. Although both the sSNP/totalSNP431
ratio and the G-statistic value were lower than above, they432
were still higher than their corresponding thresholds. While433
seemed the heterozygous SNPs affected the sSNP/totalSNP434
ratio and the G-statistic value a little more than the bulk-435
specific SNPs, it is more likely that both produced similar436
levels of noise for the sSNP/totalSNP ratio and the G-statistic437
value considering that the former was 27 779 greater than438
the latter. When using only the bulk genome sequences, the439
sSNP/totalSNP peak position on chromosome 11 was shifted440
0.52 Mb (26.44 Mb to 26.96 Mb) due to the presence of the441
bulk-specific SNPs and the heterozygous SNPs in the dataset,442
but this is a very short distance for genetic mapping. Although443
only a single dataset was examined here, the genome-wide444
similarity of the sSNP/totalSNP curve patterns in Figure 1b445
and Figure 2b suggests that the significant structural method446
is highly reproducible using only the bulk genome sequences.447

Conclusions 448

The plotting pattern of the ΔAF values in the trait-associated 449
genomic region was very different when using only the bulk 450
genome sequences. Without the aid of the parental genome 451
sequences, the ΔAF values of the sliding windows could not 452
be correctly calculated; thus, the allele frequency method 453
cannot be used to identify SNP-trait association. In contrast, 454
the parental genome sequence does not affect the plotting 455
patterns of both the significant structural variant method and 456
the G-statistic method, but the sSNP/totalSNP ratios and the 457
G-statistic values decreased significantly due to sequencing 458
artifacts and/or heterozygosity of the parental lines. Because 459
of its high detection power, major SNP-trait associations can 460
still be reliably detected via the significant structural variant 461
method even the sequence coverage was very low. 462

Acknowledgments. JZ was supported by the National Science 463
Foundation grant [IOS-1546625 to DRP]. We are grateful to Lahari 464
et al. for generating the sequencing data and making it available 465
to the public. We thank Irene E. Palmer for critical review and 466
thank Nathan Lynch for valuable comments. The manuscript was 467
prepared using a modified version of the PNAS LATEX template. 468

Bibliography 469
1. RW Michelmore, I Paran, RV Kesseli, Identification of markers linked to disease-resistance 470

genes by bulked segregant analysis: a rapid method to detect markers in specific genomic re- 471
gions by using segregating populations. Proc. Natl. Acad. Sci. U.S.A. 88, 9828–9832 (1991). 472

2. JJ Giovannoni, RA Wing, MW Ganal, SD Tanksley, Isolation of molecular markers from spe- 473
cific chromosomal intervals using DNA pools from existing mapping populations. Nucleic 474
Acids Res 19, 6553–6568 (1991). 475

3. I Imerovski, et al., BSA-seq mapping reveals major QTL for broomrape resistance in four 476
sunflower lines. Mol Breed. 39, 41 (2019). 477

4. S Arikit, et al., QTL-seq identifies cooked grain elongation QTLs near soluble starch synthase 478
and starch branching enzymes in rice ( Oryza sativa L.). Sci Rep 9, 1–10 (2019). 479

5. Q Chen, et al., Identification and genetic mapping for rht-DM, a dominant dwarfing gene in 480
mutant semi-dwarf maize using QTL-seq approach. Genes Genomics 40, 1091–1099 (2018). 481

6. J Clevenger, et al., Mapping Late Leaf Spot Resistance in Peanut (Arachis hypogaea) Using 482
QTL-seq Reveals Markers for Marker-Assisted Selection. Front Plant Sci 9, 83 (2018). 483

7. F Duveau, et al., Mapping small effect mutations in Saccharomyces cerevisiae: impacts of 484
experimental design and mutational properties. G3 (Bethesda) 4, 1205–1216 (2014). 485

8. J Zhang, DR Panthee, PyBSASeq: a simple and effective algorithm for bulked segregant 486
analysis with whole-genome sequencing data. BMC Bioinforma. 21, 99 (2020). 487

9. Z Yang, et al., Mapping of Quantitative Trait Loci Underlying Cold Tolerance in Rice Seedlings 488
via High-Throughput Sequencing of Pooled Extremes. PLOS ONE 8, e68433 (2013). 489

10. H Takagi, et al., QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome 490
resequencing of DNA from two bulked populations. Plant J. 74, 174–183 (2013). 491

11. PM Magwene, JH Willis, JK Kelly, The Statistics of Bulk Segregant Analysis Using Next Gen- 492
eration Sequencing. PLOS Comput. Biol. 7, e1002255 (2011). 493

12. Z Lahari, et al., QTL-seq reveals a major root-knot nematode resistance locus on chromo- 494
some 11 in rice (Oryza sativa L.). Euphytica 215, 117 (2019). 495

6 | https://doi.org/10.1101/654137 Zhang et al.

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2021.02.08.430275doi: bioRxiv preprint 

https://doi.org/10.1101/654137
https://doi.org/10.1101/2021.02.08.430275
http://creativecommons.org/licenses/by-nc-nd/4.0/