key: cord-0279745-geuevfgj authors: Charles, Oscar J.; Roberts, Joseph; Breuer, Judith; Goldstein, Richard A. title: WeightedLD: The Application of Sequence Weights to Linkage Disequilibrium date: 2021-06-07 journal: bioRxiv DOI: 10.1101/2021.06.04.447093 sha: 3e63ba8b0306af7033daadf812628690e0bbf2e4 doc_id: 279745 cord_uid: geuevfgj Sequence-weighting methods are commonly employed to account for biases in sequence datasets. We use a weighting scheme which considers the observed distinctiveness of sequences and apply it to calculations of linkage disequilibrium. Each sequence now contributes a weighted score to linkage disequilibrium measurements of pairwise loci. We demonstrate that this reduces the effect of uneven sampling, as underrepresented groups of sequences will each contribute more individually than redundant, similar sequences. Availability Source code for a python and rust implementation are freely available at under an MIT license at github.com/ojcharles/WeightedLD. Contact r.goldstein@ucl.ac.uk or oscar.charles.18@ucl.ac.uk Assembling a set of sequence data is a common first step in population genetics projects. The composition of such sets is affected by factors such as geography, accessibility of different populations, similarity to laboratory strains and the idiosyncratic nature of sequencing efforts. This results in sequencing data that is highly biased, complicating and compromising downstream analyses. Sequence weighting schemes have been developed to compensate for overrepresentation of populations in genetic data. The more redundant and less distinct a sequence is, the lower its weight and the less it contributes to downstream analyses. Conversely a sequence that is divergent will be assigned a relatively high weight and will more strongly influence later calculations. While there is consensus that sequence weighting is important, there are a number of different methodologies 1,2 . Linkage Disequilibrium (LD) measures the non-random association between genetic markers across different sites in the genome 3 . LD analyses are frequently applied to pairwise variable loci in a multiple sequence alignment (MSA) in order to test for the presence of recombination. Alleles presenting long-range LD can suggest functional similarities of encoding gene products 4 , the presence of population admixtures, epistatic selection, or other selection pressures 5 . Despite its widespread use in sequence analysis, sequence weighting has not yet been applied to most population genetics calculations such as LD. Consider a set of sequences representing disproportionately sampled populations, by geography or clinical presentation. It is possible that noteworthy signals represented in the minor populations, and/or interplay between the major and minor populations would be diluted by the dominant linkage of alleles in the major population. We describe how sequence weighting can alleviate much of the effect of biases in sample composition in calculations of LD. This represents a general methodology which could be applied to a wide range of population genetic studies. Consider a multiple sequence alignment derived from one or more specific populations. The composition of such an alignment has inevitable biases, such as an overrepresentation of samples from golf-playing nations (GPNs) or nearly exact copies of laboratory strains. We wish to calculate LD in a manner that is robust to these compositional biases. We define to be the frequency of allele at a locus to be the frequency of allele at a second locus, and as the joint frequency of sequences with both alleles at these loci. if there is complete random association between these two loci, = . This will not be true in the presence of linkage disequilibrium. The amount of the discrepancy is quantified by , the linkage disequilibrium coefficient, given by 6,7 . As will be affected by the amount of variation at these two sites, it is common for this quantity to be normalised Next, we introduce sequence weights into this framework. If the weight of sequence is and we are considering two sites, one where sequence has allele , the other where it has allele , then: For calculating these weights, we use the approach developed by Henikoff and Henikoff 8 . We implemented a procedure that calculates LD using sequence weighting and the LD scores above in both Python and Rust 9,10 . The software accepts a multiple sequence alignment or variant call file as input, identifies all variable positions and then returns a matrix of values for [position , position +, , ′, 2 ] per pairwise comparison. To account for ambiguous bases there is an option to set --min-acgt fraction, which will throw out variable positions which contain below this fraction of A, C, G, or T. Also, there is a -min-minor argument, which ignores positions with a (dominant) minor fraction less than the specified value. We benchmarked the rust implementation compute time with default settings on an i7-8650U laptop with 32GB RAM. Sequence data were obtained from NCBI or the 1000 genomes project and aligned with MAFFT [11] [12] [13] . The Hepatitis C weighting and LD calculations shown below completed in 0.35 seconds, 100000 Sars-cov2 global whole genomes from 2021 in 1.02s and Human GRch38 chromosome 19 positions 44000000-45000000 from 200 diploid individuals, evenly split among GBR and PJL populations required 21.53s. Sequence weighted LD calculations should be beneficial in reconstructing a more accurate population LD when only biased data is available. To demonstrate this we generated an unbiased MSA containing two sets of 137 whole genome sequences of Hepatitis C subtype 1 of Japanese and USA origin. This alignment was unevenly sampled to create a second biased alignment of 25 Japanese and 137 USA sequences. With these two alignments we computed three sets of results using WeightedLD, 1) Unbiased-unweighted 2) Biased-unweighted and 3) Biased-Weighted. Figures 1 and 2 demonstrate that assigning weights to sequences results in higher contribution from the minor Japanese population, shifting LD pairwise calculations towards the values obtained with an unbiased set. The following results were calculated using parameters --min-acgt 0.9 --min-minor 0.02 --r2threshold 0.02. Within the figures the Wilcoxon test, Pearson correlation coefficients, and graphics were generated in R 14 . CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Weighting in sequence space: a comparison of methods in terms of generalized sequences Linkage disequilibrium -understanding the evolutionary past and mapping the medical future Population-specific long-range linkage disequilibrium in the human genome and its influence on identifying common disease variants Long Range Linkage Disequilibrium across the Human Genome The Evolutionary Dynamics of Complex Polymorphisms SOME APPLICATIONS OF MATHEMATICS TO BREEDING PROBLEMS I1 Position-based sequence weights The rust language Array programming with NumPy Database resources of the National Center for Biotechnology Information MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability A global reference for human genetic variation ggpubr: 'ggplot2' Based Publication Ready Plots