key: cord-0011647-18xriumr
authors: Swart, Yolandi; van Eeden, Gerald; Sparks, Anel; Uren, Caitlin; Möller, Marlo
title: Prospective avenues for human population genomics and disease mapping in southern Africa
date: 2020-05-21
journal: Mol Genet Genomics
DOI: 10.1007/s00438-020-01684-8
sha: d0055698ce5cdaede1c6e091dba097a113874c9e
doc_id: 11647
cord_uid: 18xriumr

Population substructure within human populations is globally evident and a well-known confounding factor in many genetic studies. In contrast, admixture mapping exploits population stratification to detect genotype–phenotype correlations in admixed populations. Southern Africa has untapped potential for disease mapping of ancestry-specific disease risk alleles due to the distinct genetic diversity in its populations compared to other populations worldwide. This diversity contributes to a number of phenotypes, including ancestry-specific disease risk and response to pathogens. Although the 1000 Genomes Project significantly improved our understanding of genetic variation globally, southern African populations are still severely underrepresented in biomedical and human genetic studies due to insufficient large-scale publicly available data. In addition to a lack of genetic data in public repositories, existing software, algorithms and resources used for imputation and phasing of genotypic data (amongst others) are largely ineffective for populations with a complex genetic architecture such as that seen in southern Africa. This review article, therefore, aims to summarise the current limitations of conducting genetic studies on populations with a complex genetic architecture to identify potential areas for further research and development.

Genetics entered an exciting era of discovery with the advent of next-generation sequencing (NGS) technology, improved bioinformatic techniques and increased international collaboration to include underrepresented diversity in genetic studies. Collaborative initiatives, such as the Human Heredity and Health in Africa (H3Africa) Consortium and the African Genome Variation Project (AGVP), are rapidly obtaining and investigating valuable genetic data that were previously unattainable (Gurdasani et al. 2015; Zheng-Bradley and Flicek 2017; Fortes-Lima et al. 2017; Mulder et al. 2018). These studies have enabled novel genetic investigations; however, large sample sizes and high-quality whole-genome data are still lacking for most populations, particularly those from southern Africa. Association studies often yield no significant single nucleotide polymorphisms (SNPs) associated with multifactorial diseases and fail to detect associations with rare genetic variants [minor allele frequency (MAF) of < 1%] in southern African populations due to a lack of statistical power. Furthermore, the vast majority of association studies continue to focus on populations of European ancestry and simple admixture scenarios (Wojcik et al. 2019).

Genetic regions associated with multifactorial diseases could be identified by investigating the allelic architecture of highly complex admixed individuals, since they received haplotypes from diverse continental populations previously exposed to various environments and pathogens (Dias-Alves et al. 2018; Mazandu et al. 2019). If such gene regions could be successfully identified, this would aid the advancement of drug therapies, the implementation of personalized medicine and vaccine development in underdeveloped countries such as South Africa. However, individuals from South Africa can be up to five-way admixed, arguably the most complex global example of admixture (Daya et al. 2013; Uren et al. 2017b). The history of South Africa contributed to this observed population substructure, which includes ancestral contributions predominantly from the indigenous hunter-gatherers of southern Africa and Bantu-speaking Africans, as well as from European-descent groups, South East Asians and East Asians (de Wit et al. 2010; Chimusa et al. 2014; Daya et al. 2014b; Uren et al. 2016).

The frequency of disease risk alleles differs between populations (Secolin et al. 2019). These disparities are exploited to map disease-causing variants of multifactorial diseases in admixed genomes, an approach better known as admixture mapping (Shriner 2013). However, additional modifications are required to conduct admixture mapping studies for individuals from southern Africa, since most computational tools are designed to infer local ancestry for two- or three-way admixed populations only (Chimusa et al. 2018; Schurz et al. 2019; Mazandu et al. 2019). In addition, statistical methods assume homogeneity and may not be applicable to Africans with more complex haplotype structures and mosaic patterns present on chromosomes generated by recent admixture events across the African continent (Fan et al. 2019). The continuous increase in non-communicable diseases in Africa and the persistent threat of emerging and re-emerging infectious diseases could in part be countered by the development of comprehensive research pipelines for disease mapping in highly admixed individuals from Africa. This review, therefore, aims to summarise the current limitations of and prospective avenues for population genomics research in relation to disease mapping in southern African populations.

The conventional method of genome-wide association studies (GWAS), which is a hypothesis-free method of detecting SNPs associated with a certain disease, is not sufficient for detecting SNPs associated with a disease in a population with admixed genomes (Visscher et al. 2017). In contrast, admixture mapping (study design summarised in Fig. 1) acknowledges biogeographic ancestry associated with specific phenotypes by including ancestry proportions (globally or locally inferred from dense genotypic data) as covariates in epidemiological studies (Thornton and Bermejo 2014; Duan et al. 2018). Instead of relying on the association between a genotype and a phenotype, as in GWAS, it considers the associations of the number of haplotypes (0, 1 or 2) of a specific ancestry with the phenotype of interest (Hoggart et al. 2004; Duan et al. 2018). The study design has recently been successfully used in a variety of complex diseases, e.g. hypertension (Zhu et al. 2005), prostate cancer in African Americans (Freedman et al. 2006), asthma-associated variants in Latinos (Gignoux et al. 2019), obstructive sleep apnea in Hispanic/Latino Americans (Wang et al. 2019), tuberculosis (TB) associated variants in South Africans (Daya et al. 2014a) and multiple sclerosis in 3692 African Americans, 3777 Hispanics and 4915 Asian Americans (Chi et al. 2019).
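The core idea of testing ancestry dosage rather than genotype can be illustrated with a minimal sketch. It is not the method of any study cited above: it assumes a single locus, a case-control design and one ancestry of interest, and it compares the local ancestry dosage (0, 1 or 2 haplotypes) with its expectation under each individual's global ancestry proportion. The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def excess_ancestry(local_dosage, global_proportion):
    """Local ancestry dosage (0, 1 or 2 haplotypes of one ancestry) minus
    its expectation given the individual's global ancestry proportion."""
    return local_dosage - 2.0 * global_proportion


def admixture_mapping_z(cases, controls):
    """Welch-style z statistic comparing mean excess local ancestry
    between cases and controls at a single locus.

    Each argument is a list of (local_dosage, global_proportion) tuples.
    """
    case_excess = [excess_ancestry(d, g) for d, g in cases]
    control_excess = [excess_ancestry(d, g) for d, g in controls]
    diff = mean(case_excess) - mean(control_excess)
    se = sqrt(variance(case_excess) / len(case_excess)
              + variance(control_excess) / len(control_excess))
    return diff / se


# Hypothetical toy data: (local dosage at the locus, global proportion).
cases = [(2, 0.5), (2, 0.6), (1, 0.4), (2, 0.5), (1, 0.3), (2, 0.6)]
controls = [(1, 0.5), (0, 0.4), (1, 0.6), (1, 0.5), (0, 0.3), (1, 0.6)]

z = admixture_mapping_z(cases, controls)
print(f"z statistic for excess local ancestry in cases: {z:.2f}")
```

A large positive statistic would suggest that cases carry more haplotypes of the tested ancestry at this locus than their genome-wide ancestry predicts. A real analysis would use a regression framework with additional covariates and a genome-wide multiple-testing correction.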

The successful implementation of admixture mapping relies on suitable proxy haplotype reference panels that represent each ancestral group. The reference panels are used to infer the number of haplotypes that originated from a specific ancestral source population at a given locus, better known as local ancestry inference (LAI) (Paşaniuc et al. 2009; Dias-Alves et al. 2018). However, limited haplotype reference panels are available for southern African populations. Furthermore, admixture between ethnic groups creates long-range linkage disequilibrium (LD) between variants from different ethnic groups with different allelic frequencies. This subsequently results in differing ancestral haplotype LD blocks, which has implications for the use of tagging SNPs when working with admixed individuals (Skotte et al. 2019). Tagging SNPs are normally genotyped whilst conducting association studies and act as a proxy for the underlying common disease-causing variants. Association signals depend on how strongly tagging SNPs correlate with the presence of the disease-causing variant located in a haplotype LD block (Hellwege et al. 2017). A tagging SNP could be in the same ancestral LD block as the causal variant but, due to admixture-induced LD, be located in a different haplotype LD block and no longer tag the original LD block containing the causal variant. In that case, no association signal would be detected (Skotte et al. 2019). This is predominantly evident when genetic heterogeneity is present in the admixed population (Duan et al. 2018). Genetic heterogeneity within a population in this context refers to the individuals of the population having different proportions of global ancestry and/or differing local ancestry at a given locus (Duan et al. 2018).
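How well a tagging SNP stands in for a causal variant is commonly summarised by the squared correlation r² between the two sites. The sketch below is illustrative only: the haplotypes are hypothetical, and simply pooling haplotypes from two source populations is a crude stand-in for a real admixed sample. It shows how a tag that is perfect within one ancestry can become a weak proxy once haplotypes of another ancestry are mixed in.

```python
def r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic sites.

    `haplotypes` is a list of (tag_allele, causal_allele) pairs coded 0/1.
    Returns 0.0 if either site is monomorphic, where r^2 is undefined.
    """
    n = len(haplotypes)
    p_tag = sum(t for t, _ in haplotypes) / n
    p_causal = sum(c for _, c in haplotypes) / n
    p_both = sum(1 for t, c in haplotypes if t == 1 and c == 1) / n
    denom = p_tag * (1 - p_tag) * p_causal * (1 - p_causal)
    if denom == 0:
        return 0.0
    d = p_both - p_tag * p_causal  # LD coefficient D
    return d * d / denom


# Hypothetical haplotypes as (tag allele, causal allele).
# Source A: the tag allele always travels with the causal allele.
source_a = [(1, 1)] * 4 + [(0, 0)] * 4
# Source B: the tag allele is common but the causal allele is absent.
source_b = [(1, 0)] * 6 + [(0, 0)] * 2

r2_a = r_squared(source_a)
r2_b = r_squared(source_b)
r2_pooled = r_squared(source_a + source_b)
print(r2_a, r2_b, round(r2_pooled, 3))
```

In this toy example the tag is a perfect proxy within source A but captures only a fraction of the signal in the pooled sample, which is one reason ancestry-aware analysis matters when selecting and interpreting tagging SNPs.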

The increasing number of admixed individuals is an evolving obstacle for epidemiological association studies. Even populations assumed to be unadmixed may harbour fine-scale admixture. A different admixture mapping approach is required for southern African populations since most individuals will have ancestral contributions from more than two different non-intermating admixed subpopulations of unknown origin with different effect sizes (Bostoen 2018). The Bantu expansion shaped the genetic composition of most populations from southern Africa, which consists of the countries Namibia, Botswana, Eswatini, Lesotho and South Africa, the latter having 11 official languages which reflect the main ethnic groups (Uren et al. 2016). Differential admixture dynamics were experienced by Bantu-speaking communities in different areas of south-eastern Africa and the indigenous populations were most affected by this event (Pickrell et al. 2012; González-Santos et al.). Several aspects, therefore, require consideration before conducting genetic studies in southern African populations. Firstly, there is extensive population heterogeneity amongst Africans caused by complex genetic population substructure and differing levels of admixture (Choudhury et al. 2018; Fan et al. 2019). These groups can therefore not be treated as one population in genomic studies (Patin et al. 2017). Secondly, differences in LD between populations of African descent cause different haplotype structures, resulting in reduced power to detect untyped causal loci (Campbell and Tishkoff 2008). Thirdly, derived alleles are more likely to be heterozygous than homozygous (Barnes et al. 2007).

Although admixture mapping increases power to detect disease-associated variants, because ancestral LD blocks are longer than haplotype blocks, differing amounts of ancestry and LD patterns from unknown ancestral populations could be present in each individual in the cohort under study (Zhu and Wang 2017). Adjusting for local ancestry, which reflects the admixture-induced LD within admixed populations, will significantly improve the detection of genetic variants with small or moderate effect when extensive genetic heterogeneity is present in the admixed population under study (Duan et al. 2018). However, it is important to also correct for global ancestry proportions whilst correcting for local ancestry, since association testing will take place in the context of the admixture-induced LD blocks in the admixed genome (Duan et al. 2018). This emphasizes the importance of characterising fine-scale genome variation among underrepresented southern African populations to facilitate complex-trait mapping amongst those who harbour a significant burden of global disease.

Currently, no universal standard operating procedures for admixture mapping exist due to the unique admixture scenarios of each project (Zhang and Stram 2014). Considering finer details of population history and ancestry locally inferred for each haplotype will become a requirement for genomic studies among admixed individuals (Duan et al. 2018). This is especially true for populations with complex histories, such as those found in southern Africa.

The current human reference genome was essential in advancing genetic analyses such as imputation of missing genotypes required for GWAS and admixture mapping studies, as well as the identification of rare variants in mostly European and Asian populations (Ikegawa 2012). It is estimated that any two humans will share approximately 99.9% of their genomes (Li and Sadler 1991). A 0.1% difference might seem insignificant, but in a human genome this translates to approximately 3 million base pairs (Suwinski et al. 2019). Since the current reference systematically underrepresents the tremendous global sequence diversity, it is necessary to focus on individual, geographically defined populations.

Reference genomes have been assembled for multiple distinct human populations and have empirically proven their importance by improving both short-read mapping and genotype calling. Such references facilitate the assessment and meta-analysis of studies done on different microarrays by imputing missing genotypes and are regularly utilized for improving imputation of low-frequency and rare variants (Vergara et al. 2018). Despite the decrease in the cost of sequencing, microarrays are still the technology of choice, although these do have certain limitations. Most notably, probes on the microarray are designed from publicly available data to be an exact complement to the desired genomic region to maximize genotyping rate. A probe that is not an exact complement will result in a lower genotyping rate, which is especially problematic when considering the higher base-pair substitution rate of intergenic versus protein-coding regions (Cargill et al. 1999; Halushka et al. 1999; Leabman et al. 2003). Another limitation arises when a probe binds within structural variants (SVs), which have been shown to be greatly population-specific (Rosenfeld et al. 2012; Sudmant et al. 2015). If sequencing is selected, the raw data still require conversion to genomic data either through de novo assembly or by alignment to a reference genome. The latter approach, which is less computationally demanding, has not always allowed the detection of SVs. Due to the read length cut-off of standard NGS, this method of variant detection has a resolution well suited to single nucleotide variants (SNVs), but the reads are too short to accurately detect larger events (Merker et al. 2018).

The failure to detect variants of any size in genomic data due to a lack of a population-specific reference genome has multiple consequences. Firstly, it has an impact on the ability to identify potentially disease-associated variants in patients with underlying genetic conditions. This is applicable not only to the discovery of the variant but extends to the precise clinical diagnosis and subsequent treatment. Detecting functionally relevant genetic variants with increased accuracy with the help of a population-specific reference genome brings precision medicine to the forefront of diagnostic and treatment options. Secondly, the failure to detect variants has consequences for the generation and maintenance of allele frequency databases, as the failure to accurately curate these findings overburdens variant prioritization pipelines. When considering genomic variants linked to a condition that is more prevalent in a certain population, it is preferable to compare the genomes to a reference genome more representative of that population.

Currently, no reference genome exists that adequately represents African genetic diversity and it is unlikely that a quality reference genome representative of all populations will be achieved. Rare genetic variants could be missed when conducting studies on participants of African origin, since the genomes of individuals of African descent contain ± 10% more DNA (± 300 million base pairs) than the presently available human reference genome (Sherman et al. 2019). Regions as large as 100,000 base pairs were identified and 387 novel contigs were located in 315 distinct protein-coding regions (Sherman et al. 2019). Furthermore, the current $2.7 billion Human Genome Reference build 38 (GRCh38) lacks genetic variation from individuals worldwide. A study conducted by Yang et al. identified certain biases in GRCh38 by sequencing three Africans, three Asians, two Europeans and three Americans with PacBio single molecule, real-time (SMRT) sequencing and comparing these to GRCh38, 174 individuals from the 1000 Genomes Project (1000GP) and 266 individuals from the Simons Genome Diversity Project (SGDP). A total of 40.8% (99,604 nonredundant SVs) were novel compared to previously published large-scale projects (Yang et al. 2019). The SGDP obtained high-quality (average coverage of 43-fold) whole genomes from 142 diverse populations and indicated that these genomes include at least 5.8 million base pairs that are not present in GRCh38 (Mallick et al. 2016). Additionally, a study in Sweden identified 61,000 novel genetic sequences from 1000 individuals that were missing in GRCh38 and nearly 40% of the genetic material could not be mapped (Eisfeldt et al. 2020).

Previous attempts to capture underrepresented southern African genetic variation have been made by the 1000GP, AGVP and SGDP to include more genome-wide genetic markers for broader groups of southern African populations (Gurdasani et al. 2015; 1000 Genomes Project Consortium et al. 2015). The 1000GP, the largest whole-genome sequencing survey, analysed 26 populations from Europe, East Asia, South Asia, the Americas and Africa. Low-coverage sequencing was used and the focus was on demographically large populations, while smaller populations were excluded, despite their respective contributions to human diversity. Although five African populations were included in the analysis, most of these populations are of recent Niger-Kordofanian ancestry (West and East African) and do not reflect the diversity present in southern Africa (1000 Genomes Project Consortium et al. 2015). Furthermore, the SGDP reported that 11%, 5% and 5% of heterozygous positions in KhoeSan, New Guineans and Australians, respectively, were not identified by the 1000GP. The same study also confirmed that populations from southern Africa contain the highest genetic diversity amongst modern humans (Mallick et al. 2016). Addressing this diversity, the AGVP included whole-genome sequencing of individuals belonging to ten language subgroups in southern Africa; however, the low-coverage sequencing (4× coverage) risks misclassifying both observed and imputed rare variants (Gurdasani et al. 2015). Although efforts by multiple consortia are currently expanding, the risk of eliminating genuine pathogenic variants that are segregating in the population will not be reduced in the absence of comprehensive knowledge of human genetic architecture, including rare variant frequencies.

Recombination maps are often used for admixture mapping (Browning and Browning 2007). A recombination map is a genetic map that illustrates the variation of the recombination rate across a region of the genome or the entire genome (Myers et al. 2005). It is dependent on the underlying distribution of recombination events that occur between successive generations within a given population (Kong et al. 2010). The presence and activity of the PRDM9 zinc finger protein in the population under study, the ratio of males to females and the population's genetic substructure are some of the known factors that have an effect on these recombination events. Population substructure is affected by the migratory history, the evolutionary history and the common ancestry of the population (Manu et al. 2018). The extent to which the population substructure impacts the utility of a recombination map is yet to be determined.

Currently, there is a lack of high-resolution population-specific recombination maps for southern African populations. This has inevitably led to inaccuracies in studies that make use of a recombination map. These inaccuracies are exacerbated when no recombination maps for closely related populations are available. Researchers working on southern African populations have thus been forced to make use of ancestral maps (such as European and West African maps) (Uren et al. 2017a) or to rely on ancestry informative markers to mitigate potential bias when genome-wide data are not available (Daya et al. 2013). There is thus a need for accurate, high-resolution recombination maps for southern African populations.

However, there are several uncertainties to be addressed before such a map can be used with confidence. Firstly, the accuracy of the map needs to be established. Software used to construct recombination maps has been developed and tested on populations with homogeneous ancestry (Auton and McVean 2007). Secondly, testing the accuracy of a recombination map of an admixed population is difficult, because recombination rates vary between ancestries. Any one segment of a recombination map would have a recombination rate that closely resembles the average of the rates of all the ancestries represented in the population. Thirdly, the method used to develop the map and the map itself would then have to be validated against currently available recombination maps (Kong et al. 2010). It should also be noted that the resolution of a given map relies strongly on the method and the number of individuals used to construct it (Halldorsson et al. 2019). The most common methods used to build recombination maps are pedigree-based methods, LD-based methods and admixture-based inference (Halldorsson et al. 2019). Of these methods, the LD-based method produces the highest resolution if only a limited number of individuals is available. However, the pedigree-based and the admixture-based methods can produce sub-kilobase resolutions when a few thousand individuals are available (Halldorsson et al. 2019). The problem with using the admixture-based method on a population for which no recombination map exists is that many methods that infer ancestry rely on a recombination map for the inference. Thus the resulting recombination map could be inaccurate because the map used for the ancestry inference might be based on a population that is distantly related to the population in question.
When dealing with admixed populations, the pedigree-based method would produce the least amount of bias due to admixture, since the algorithms employed rely on direct observations of recombination events between parent-offspring pairs (Halldorsson et al. 2019). For these reasons, the pedigree-based method should be the method of choice when a large enough sample from a population is available. The theoretical benefit of a population-specific recombination map has yet to be proven in practice, but one can expect such a map to improve the accuracy of admixture mapping, and this improved accuracy could result in the discovery of novel variants associated with numerous phenotypes.
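Two of the points above can be made concrete with a small sketch. The first function converts directly observed crossovers in parent-offspring transmissions into a rate in cM/Mb, using the standard definition that one centimorgan corresponds to an expected 0.01 crossovers per meiosis. The second treats the rate of a map segment in an admixed population as an average over its ancestries; weighting by ancestry proportion is an assumption made here for illustration. All names and numbers are hypothetical.

```python
def pedigree_rate_cm_per_mb(crossovers, meioses, interval_mb):
    """Recombination rate in cM/Mb from directly observed crossovers.

    One centimorgan corresponds to an expected 0.01 crossovers per
    meiosis, so the rate is 100 * crossovers / (meioses * length in Mb).
    """
    if meioses <= 0 or interval_mb <= 0:
        raise ValueError("meioses and interval_mb must be positive")
    return 100.0 * crossovers / (meioses * interval_mb)


def ancestry_averaged_rate(rates_cm_per_mb, ancestry_proportions):
    """Ancestry-weighted mean rate for one map segment.

    Assumption for illustration: each ancestry contributes to the
    segment's rate in proportion to its share of the population.
    """
    total = sum(ancestry_proportions.values())
    weighted = sum(rates_cm_per_mb[a] * p
                   for a, p in ancestry_proportions.items())
    return weighted / total


# Hypothetical numbers: 12 crossovers seen in 500 meioses over 2 Mb.
rate = pedigree_rate_cm_per_mb(crossovers=12, meioses=500, interval_mb=2.0)

# Hypothetical per-ancestry rates (cM/Mb) and ancestry proportions.
mixed = ancestry_averaged_rate(
    {"ancestry_A": 1.0, "ancestry_B": 2.0},
    {"ancestry_A": 0.25, "ancestry_B": 0.75},
)
print(rate, mixed)
```

The averaged value matches neither ancestry's own rate, which illustrates why validating a map built from an admixed population against single-ancestry maps is not straightforward.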

Novel genetic regions associated with multifactorial diseases could be identified by investigating the allelic architecture of highly admixed individuals from southern Africa, along with fine-mapping genomic loci previously associated with complex traits (Narang et al. 2011; Gurdasani et al. 2019). The first meta-analysis conducted in sub-Saharan Africa identified a novel locus (ZRANB3) significantly associated with type 2 diabetes (T2D) in a study investigating 5231 individuals from Nigeria, Kenya and Ghana. The study also indicated the transferability of 32 established T2D loci from previous investigations and contributed to the understanding of the disease aetiology of T2D. Furthermore, Gulsuner et al. investigated 909 schizophrenia patients and 917 healthy controls from the Xhosa population of South Africa (residing mostly in the Eastern Cape province of South Africa). Not only did they identify admixture between Bantu-speaking Africans and San individuals, but they also identified more private damaging mutations in cases than in controls. Interestingly, when the same analysis was replicated in a Swedish cohort, the Xhosa individuals generally had larger effect sizes than the Swedes (Gulsuner et al. 2020). Furthermore, a meta-analysis of cardiometabolic traits in 14,100 African individuals identified novel loci associated with lipid, blood cell and other traits that appear to be rare in populations from other parts of the world (Gurdasani et al. 2019). However, these studies mostly concern common genetic variants and are not adequate to identify rare genetic variants.

High-throughput technologies, such as whole-exome sequencing (WES) and whole-genome sequencing (WGS), are required to locate rare population-specific variants (Uren et al. 2017a; Retshabile et al. 2018; De La Vega and Bustamante 2018). Although WES is a cost-effective approach for identifying coding sequence targets in resource-restricted settings, WGS captures the complete and unbiased information carried by an individual, and high-coverage WGS can detect rare variants (Suwinski et al. 2019). The first deep sequencing experiment of southern African populations assessed the population substructure within a cohort of HIV-positive children from Botswana. WES data of 164 individuals from Botswana were analysed and compared with 150 similarly sequenced HIV-positive Ugandan children (Retshabile et al. 2018). Approximately 13-25% of genetic variation in populations from Botswana was not captured in current public databases. These missing variants were significantly enriched for coding variants with MAF between 1% and 5% and included predicted-damaging non-synonymous variants. This population also had more rare (< 1%) pathogenic and damaging variants (Retshabile et al. 2018). These studies highlight the untapped potential of these populations to contribute to the discovery of novel disease risk alleles in GWAS. Extending GWAS and sequencing studies to diverse populations is likely to yield many novel risk alleles.

Population-specific allele frequencies have been sparsely characterised for southern African populations. For rare disease genetics, reference databases are continuously used for filtering based on allele frequency with the idea that common alleles are unlikely to be responsible for rare, highly penetrant disorders (Visscher et al. 2017). Therefore, in the absence of appropriate population reference datasets, variants can be misclassified and may lead to false disease associations. For instance, a major allele for southern African populations can be identified as minor, since the current reference genome indicates it is a minor allele (Yang et al. 2019). High-coverage whole-genome reference datasets are needed to characterize and catalog population-specific variation and facilitate genetic studies in admixed southern African populations to identify causal rare variants.
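The filtering logic described above, and how it can go wrong without a representative frequency resource, can be sketched as follows. The variant identifiers, allele frequencies and the 1% threshold are hypothetical and chosen purely for illustration; a variant absent from a frequency table is treated as having frequency zero, which is exactly the assumption that causes misclassification.

```python
def is_rare(allele_frequency, threshold=0.01):
    """A variant passes the rare-disease filter if its frequency is
    below the threshold."""
    return allele_frequency < threshold


def filter_candidates(variants, frequency_table, threshold=0.01):
    """Keep variants that are rare in, or absent from, the table.

    Absent variants are treated as frequency 0.0, mirroring the common
    practice of assuming that unobserved variants are rare.
    """
    return [v for v in variants
            if is_rare(frequency_table.get(v, 0.0), threshold)]


# Hypothetical candidate variants from a patient.
candidates = ["var_1", "var_2", "var_3"]

# Hypothetical frequencies in a public database with little local
# representation (var_3 has never been observed there).
public_db = {"var_1": 0.002, "var_2": 0.004}

# Hypothetical frequencies in a population-specific reference panel.
local_panel = {"var_1": 0.001, "var_2": 0.35, "var_3": 0.12}

kept_public = filter_candidates(candidates, public_db)
kept_local = filter_candidates(candidates, local_panel)

# Variants wrongly retained as rare when only the public data are used.
misclassified = sorted(set(kept_public) - set(kept_local))
print(kept_public, kept_local, misclassified)
```

With the population-specific panel, two of the three candidates turn out to be common locally and are removed, leaving a much shorter list for follow-up.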

The clinical value contributed by the deep sequencing of whole genomes was demonstrated by the GenomeAsia 100K project (GAsP) (Wall et al. 2019). The pilot phase, which included a WGS dataset of 1739 individuals from 219 populations and 64 countries across Asia, identified a total of 194,585 novel variants with a MAF of > 1%. Overall, 23% of protein-altering variants in GAsP were not found in publicly available databases such as the Single Nucleotide Polymorphism Database (dbSNP), the Genome Aggregation Database (gnomAD), the Exome Aggregation Consortium (ExAC) and the Exome Sequencing Project (ESP) (Wall et al. 2019). Importantly, imputation accuracy using the GAsP reference panel was 93-95% compared to < 90% utilising the 1000GP reference panel. GAsP discovered thirteen unique cancer risk variants and a variant in HBB associated with beta-thalassemia. The HBB variant is found almost exclusively in South Asians and at a lower frequency in Southeast Asia. Ultimately, the GAsP reference dataset improves the ability to filter out low-probability candidates for highly penetrant disorders, to identify putatively pathogenic variants that are found at high frequency in particular populations and to infer the pathogenicity of identified variants (Wall et al. 2019). Not only did this study exceed the ability of publicly available sources to annotate protein-coding variants and capture low-frequency and rare variants unique to Asian populations, but it also improved the imputation of missing genotypes.

The Ugandan 2000 Genomes Project (UG2G) consists of 1978 individuals from rural Uganda and is the largest sequence panel from Africa (Gurdasani et al. 2019). The investigators identified 41.5 million SNPs and 4.5 million insertions and deletions. Of the SNPs discovered in the UG2G project, 29% were absent from gnomAD. Furthermore, 52 population clusters in the region of Uganda (home to 9 ethnolinguistic groups) were identified, revealing complex admixture involving ancient East African pastoralists (Gurdasani et al. 2019). A genetic study conducted by Higasa et al., which included 1208 Japanese individuals, identified 156,622 previously unreported variants. Notably, these variants had allele frequencies lower than 0.5% and were functionally deleterious. This study specifically emphasized the importance of constructing an ethnicity-specific reference genome for identifying rare variants (Higasa et al. 2016).

An existing catalogue of known variants, be it common or rare, will allow researchers to identify mutations in protein-coding regions, rare causal variants and track the small and discrete mutations at a genomic level at multiple loci.

However, population-specific variants will only be accurately collected if a reference genome exists with a representative population consensus, instead of using the existing human reference genome (GRCh38) of European ancestry (currently employed as a proxy in most genomic studies) (Ballouz et al. 2019).

Genomic resources lack southern African representation which is impacting on research in these settings. Future investigations to address this could include the following:
1. A consensus southern African reference genome, obtained from high-throughput whole genomes, for southern African populations, is required to capture the major alleles present in the region. This will serve as a genetic toolbox to improve imputation of missing genotypes to standardize cohorts genotyped on different arrays for meta-analysis and minimise the possibility of misclassifying major and minor alleles for southern African populations.
2. A southern African recombination map might improve the phasing of haplotypes to increase the accuracy of local ancestry inference in highly admixed individuals. However, there still exists some uncertainty in this regard and further investigations are warranted.
3. A southern African population-specific catalogue is required to capture allele frequencies in this region. Rare variants could be shared amongst healthy individuals, but not be present in public databases. Only high throughput sequencing technologies will be able to capture population-specific rare variants, since a reference genome, which is used as consensus in disease mapping, would not necessarily contain a specific variant.
4. An electronic catalogue of phenotypic information and the associated genotypic information enables geneticists to accurately identify genetic variants associated with disease phenotypes. However, the complexity of sample collection (due to unique ethical, cultural and socio-economic factors) in southern Africa is frequently underestimated as is reviewed elsewhere (Martin et al. 2018). The United Kingdom Biobank is a recent example of how incorporating clinical data embedded in electronic health records combined with GWAS data and registries available for research, can benefit everyone and not just individuals from a specific region.

Current GWAS and admixture mapping study designs are failing to identify disease-causing loci or rare genetic variants in southern African individuals. This is largely the result of limited reference haplotype panels and, in turn, limited genetic and computational tools available for southern African populations. The majority of SNP genotyping arrays are selected from a small sample of individuals (predominantly of European ancestry) and imputation and phasing of genotypic data usually involve a human reference of European ancestry, missing ± 10% of the genomes of individuals of African descent (De La Vega and Bustamante 2018). The genome structures of future generations might develop in a similar way to those of complex five-way admixed southern African populations, as admixture between populations originating from more than two different continents is now considered a customary feature of human populations across the globe (Busby et al. 2016; Salter-Townshend and Myers 2018). Existing methods to detect loci associated with multifactorial diseases are not optimized for southern African ancestral groups and innovative approaches are urgently needed to study lethal communicable diseases such as TB, as well as non-communicable diseases such as cardio-metabolic diseases and type 2 diabetes in Africa. This entails the systematic development of best practices for ancestry inference, imputation and association studies. The establishment of a publicly available southern African-specific consensus reference genome is required to capture novel genetic variants and to maximize imputation for southern African populations. This will benefit future genetic studies involving complex diseases and traits by capturing rare variants previously lost due to a lack of publicly available data.
Admixture mapping studies will continue to be inconclusive for populations from Southern Africa if no reference panels are available to represent proxy ancestral populations contributing to their genomes. The accuracy of the local ancestry calls for southern African individuals could also be decreased if population-specific recombination maps are not available. This will, in turn, affect the accuracy of admixture mapping studies that make use of LAI.

Conducting genetic studies on admixed southern African populations, with varying ancestral contributions, could also be beneficial for genetic studies of communicable and non-communicable diseases not mentioned in this review. Without a properly representative reference genome and methodologies to analyse complex admixed southern African genomes, genomic medicine will not benefit these individuals as it does those of European descent. For southern African countries and ethnicities to benefit from large-scale GWAS, as most European countries have, disease variants associated with southern African-specific diseases have to be identified. This will allow precision medicine and polygenic risk scores to be implemented. Although several consortia have contributed immensely to the development and training of African genetic researchers and to the inclusion of diverse populations that have traditionally been underrepresented, global collaboration is still essential to increase the genetic representation of southern African populations.

Funding: This research was partially funded by the South African government through the South African Medical Research Council. The content is solely the responsibility of the authors and does not necessarily represent the official views of the SAMRC.

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

A global reference for human genetic variation

ZRANB3 is an African-specific type 2 diabetes locus associated with beta-cell mass and insulin response

Recombination rate estimation in the presence of hotspots

Is it time to change the reference genome?

African Americans with asthma: genetic insights

Inferences of African evolutionary history from genomic data

The Bantu expansion

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering

Admixture into and within sub-Saharan Africa

African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping

Characterization of single-nucleotide polymorphisms in coding regions of human genes

Genome-wide association study of ancestry-specific TB risk in the South African Coloured population

Dating admixture events is unsolved problem in multi-way admixed populations

Admixture mapping reveals evidence of differential multiple sclerosis risk by genetic ancestry

African genetic diversity provides novel insights into evolutionary history and local adaptations

A panel of ancestry informative markers for the complex five-way admixed South African coloured population

The role of ancestry in TB susceptibility of an admixed South African population

Using multi-way admixture mapping to elucidate TB susceptibility in the South African Coloured population

Genome-wide analysis of the structure of the South African Coloured Population in the Western Cape

Loter: a software package to infer local ancestry for a wide range of species

A robust and powerful two-step testing procedure for local ancestry adjusted allelic association analysis in admixed populations

Discovery of novel sequences in 1,000 Swedish genomes

African evolutionary history inferred from whole genome sequence data of 44 indigenous African populations

Genome-wide ancestry and demographic history of African-descendant Maroon communities from French Guiana and Suriname

Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men

An admixture mapping meta-analysis implicates genetic variation at 18q21 with asthma susceptibility in Latinos

Genome-wide SNP analysis of Southern African populations provides new insights into the dispersal of Bantu-speaking groups

The African Genome Variation Project shapes medical genetics in Africa

Uganda genome resource enables insights into population history and genomic discovery in Africa

Characterizing mutagenic effects of recombination through a sequence-level genetic map

Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis

Population stratification in genetic association studies

Human genetic variation database, a reference database of genetic variations in the Japanese population

Design and analysis of admixture mapping studies

A short history of the genome-wide association study: where we were and where we are going

Fine-scale recombination rate differences between sexes, populations and individuals

Natural variation in human membrane transporter genes reveals evolutionary and functional constraints

Low nucleotide diversity in man

Systematic analyses of autosomal recombination rates from the 1000 Genomes Project uncovers the global recombination landscape in humans

The critical needs and challenges for genetic architecture studies in Africa

Orienting future trends in local ancestry deconvolution models to optimally decipher admixed individual genome variations

Long-read genome sequencing identifies causal structural variation in a Mendelian disease

H3Africa: current perspectives

A fine-scale map of recombination rates and hotspots across the human genome

Recent admixture in an Indian population of African ancestry

Imputation-based local ancestry inference in admixed populations

Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America

The genetic prehistory of southern Africa

Whole-exome sequencing reveals uncaptured variation and distinct ancestry in the southern African population of Botswana

Limitations of the human reference genome for personalized genomics

Fine-scale inference of ancestry segments without prior knowledge of admixing groups

Evaluating the accuracy of imputation methods in a five-way admixed population

Distribution of local ancestry and evidence of adaptation in admixed populations

Assembly of a pan-genome from deep sequencing of 910 humans of African descent

Overview of admixture mapping

Ancestry-specific association mapping in admixed populations

Advancing personalized medicine through the application of whole exome sequencing and big data analytics

Local and global ancestry inference, and applications to genetic association analysis for admixed populations

The long walk to African genomics

Fine-scale human population structure in Southern Africa reflects ecogeographic boundaries

Population structure and infectious disease risk in southern Africa

A post-GWAS analysis of predicted regulatory variants and tuberculosis susceptibility

Genotype imputation performance of three reference panels using African ancestry individuals

10 years of GWAS discovery: biology, function, and translation

Admixture mapping identifies novel loci for obstructive sleep apnea in Hispanic/Latino Americans

One reference genome is not enough

The role of local ancestry adjustment in association studies using admixed populations

Applications of the 1000 Genomes Project resources

The analysis of ethnic mixtures

Admixture mapping for hypertension loci with genome-scan markers
