key: cord-0777628-afagezd8
authors: Wu, Jie; Liu, Yangxiu; Zhao, Yiqiang
title: Systematic Review on Local Ancestor Inference From a Mathematical and Algorithmic Perspective
date: 2021-05-24
journal: Front Genet
DOI: 10.3389/fgene.2021.639877
sha: d0b25fede271b3dccf6d7ee4ac94bf3f9f338979
doc_id: 777628
cord_uid: afagezd8

Genotypic data provide deep insights into the population history and medical genetics. The local ancestry inference (LAI) (also termed local ancestry deconvolution) method uses the hidden Markov model (HMM) to solve the mathematical problem of ancestry reconstruction based on genomic data. HMM is combined with other statistical models and machine learning techniques for particular genetic tasks in a series of computer tools. In this article, we surveyed the mathematical structure, application characteristics, historical development, and benchmark analysis of the LAI method in detail, which will help researchers better understand and further develop LAI methods. Firstly, we extensively explore the mathematical structure of each model and its characteristic applications. Next, we use bibliometrics to show detailed model application fields and list articles to elaborate on the historical development. LAI publications had experienced a peak period during 2006–2016 and had kept on moving in the following years. The efficiency, accuracy, and stability of the existing models were evaluated by the benchmark. We find that phased data had higher accuracy in comparison with unphased data. We summarize these models with their distinct advantages and disadvantages. The Loter model uses dynamic programming to obtain a globally optimal solution with its parameter-free advantage. Aligned bases can be used directly in the Seqmix model if the genotype is hard to call. This research may help model developers to realize current challenges, develop more advanced models, and enable scholars to select appropriate models according to given populations and datasets.

Rapid advancements in computing technologies, genome sequencing, and single nucleotide polymorphism (SNP) genotyping methods have made it possible to infer the genomic structure at a fine scale (Kidd et al., 2012). These advances have also accelerated the exploration of mixed ancestry, or local ancestry inference (LAI), at the individual and population levels (Schumer et al., 2020). In LAI, each chromosome is considered a mosaic of genomic segments that originate from multiple ancestral groups (Padhukasahasram, 2014). LAI is of great importance in studying population evolution, migration history, and disease risk (Fitak et al., 2018). To date, various LAI models have been widely used; each comes with its own advantages and disadvantages for LAI in admixed populations (Geza et al., 2019).

Due to genetic recombination after interbreeding, the genome consists of a mosaic of DNA segments with different genetic ancestries (Dougherty et al., 2017). Genotypes from putative ancestral populations are mostly used to infer the local ancestry of admixed individuals (Sankararaman et al., 2008a). Currently, about 70% of LAI models are based on the hidden Markov model (HMM), where the hidden states correspond to ancestries and generate the observed haplotypes/genotypes (Baran et al., 2012). Some LAI models use ancestry informative markers (AIMs) for simplicity or account for linkage disequilibrium (LD) between variants, e.g., STRUCTURE (Falush et al., 2003), Hapmix (Price et al., 2009), Saber (Tang et al., 2006), and LAMP-LD (Baran et al., 2012). Other models consider rich haplotype information by employing window-based strategies, e.g., RFMix (Maples et al., 2013), PCAdmix (Brisbin et al., 2012), and LAMP (Sankararaman et al., 2008b). Table 1 presents more details in this regard.

The challenge of identifying ancestry along each chromosome can be addressed with different approaches. One of the most widely used models is the HMM, an extension of a Markov chain in which the state sequence is not directly observable (Wu and Zhao, 2019). The parameters of an HMM include the initial state distribution, the state transition probability matrix, and the emission probability matrix. Algorithms were developed to solve the three main problems of HMMs: evaluation (forward algorithm), decoding (Viterbi algorithm), and training (Baum-Welch algorithm, an expectation maximization procedure for maximum likelihood estimation) (Schuster-Böckler and Bateman, 2007).
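As a minimal sketch of the evaluation and decoding problems named above, the following Python snippet implements the forward and Viterbi algorithms for a toy two-ancestry HMM. The parameter values and the function names are illustrative assumptions and are not taken from any of the reviewed tools; training with Baum-Welch is omitted.

```python
def forward(obs, init, trans, emit):
    """Evaluation: return P(obs) summed over all hidden state paths."""
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


def viterbi(obs, init, trans, emit):
    """Decoding: return the most probable hidden state path."""
    n_states = len(init)
    delta = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] * trans[r][s])
            pointers.append(best)
            delta.append(prev[best] * trans[best][s] * emit[s][o])
        back.append(pointers)
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy parameters (assumed for illustration): two ancestries, biallelic marker.
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]   # trans[from][to]
emit = [[0.9, 0.1], [0.2, 0.8]]        # emit[ancestry][allele]
obs = [0, 0, 0, 1, 1, 1]

path = viterbi(obs, init, trans, emit)
likelihood = forward(obs, init, trans, emit)
```

In practice, implementations work with log probabilities or scaling to avoid numerical underflow on long chromosomes; the plain products are kept here only for readability.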

LAI models based on the original HMM algorithm include Hapmix (Price et al., 2009), Seqmix (Hu et al., 2013), PCAdmix (Brisbin et al., 2012), Supportmix, and LAMP-LD (Baran et al., 2012). These models use the Baum-Welch algorithm to iteratively update the initialized transition and emission probability matrices and use the Viterbi algorithm to estimate the hidden ancestral states. The models differ mainly in how the emission probabilities are initialized and in the subsequent calculations. Supportmix uses a support vector machine (SVM) (Haasl et al., 2013) to classify chromosome segments by ancestral group, while PCAdmix calculates Euclidean distances between the ancestral groups and admixed individuals to find the closest ancestry for each window.

The Hapmix model (Price et al., 2009) combines the HMM with haplotype information. The hidden state at position s is denoted by a triplet (i, j, k): i denotes the ancestral population from which the segment is derived, j records the population from which the haplotype was copied (allowing for miscopying), and k identifies the individual from which the chromosomal segment was copied. $p_s(i,j,k;l,m,n)$ is the transition probability from state (i, j, k) to state (l, m, n) between the adjacent sites s and s + 1. $e^1_{ijk}(s)$ denotes the probability that the offspring chromosome carries type 1 at site s, and $t_{jk}$ represents the type of parent individual k in reference population j. The initialized emission probability matrix is given in Equation (1).

Here, the offspring carries the same type as the specific parent with probability $(1 - \theta_1)$ and a different type with probability $\theta_1$; $\theta_3$ denotes the mutation rate when the offspring copies from the other population.

The Seqmix model (Hu et al., 2013) uses aligned bases directly rather than relying on genotype calls. The method implemented in Seqmix consists of three layers: the hidden ancestry state, the hidden genotype, and the observed sequence reads. The genotype is placed in the intermediate layer, connecting the sequence reads and the ancestry. In the HMM, the hidden ancestry state $q_s$ is denoted $(A_{s1}, A_{s2})$, where $A_{s1}$ represents the ancestry of the first chromosome at site s and $A_{s2}$ the ancestry of the other chromosome. $\gamma_{s,s+1}$ is the recombination rate per generation between sites s and s + 1, and T represents the number of generations since admixture. $\pi_A$ and $\pi_E$ correspond to the prior probabilities for populations 1 and 2. The initialized transition probability matrix is given in Equation (2).

The initialized emission probability is $P(O_s \mid q_s = (A_{s1}, A_{s2}))$, which is calculated as a sum over all possible genotypes, assuming Hardy-Weinberg equilibrium, weighted by ancestry-specific allele frequencies. The genotype likelihood $P(O_s \mid q_s)$ is the probability of the observed set of reads given the hidden ancestry state.
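The marginalization over genotypes described above can be sketched as follows. The snippet sums a read likelihood over the three possible genotypes, weighting each by its probability given the ancestries of the two chromosomes. The binomial read model with a fixed sequencing error rate, the function names, and the allele frequencies are assumptions made for illustration; they are not taken from Seqmix itself.

```python
from math import comb


def genotype_prior(f1, f2):
    """P(g | ancestries) for g = 0, 1, 2 copies of the alternate allele.
    f1 and f2 are the alternate-allele frequencies in the ancestral
    populations of the first and second chromosome (random union of
    gametes, as under Hardy-Weinberg equilibrium)."""
    return [(1 - f1) * (1 - f2), f1 * (1 - f2) + (1 - f1) * f2, f1 * f2]


def read_likelihood(n_alt, n_reads, g, error=0.01):
    """Assumed binomial read model: P(n_alt alternate reads | genotype g)."""
    p_alt = (g / 2) * (1 - error) + (1 - g / 2) * error
    return comb(n_reads, n_alt) * p_alt ** n_alt * (1 - p_alt) ** (n_reads - n_alt)


def emission(n_alt, n_reads, f1, f2, error=0.01):
    """P(reads | ancestry state) = sum over g of P(reads | g) * P(g | state)."""
    prior = genotype_prior(f1, f2)
    return sum(read_likelihood(n_alt, n_reads, g, error) * prior[g] for g in range(3))


# Hypothetical alternate-allele frequencies in two ancestral populations.
freq = {"A": 0.9, "E": 0.1}
e_AA = emission(5, 6, freq["A"], freq["A"])   # both chromosomes from A
e_EE = emission(5, 6, freq["E"], freq["E"])   # both chromosomes from E
```

With five alternate reads out of six, the state in which both chromosomes come from the population with the higher alternate-allele frequency receives the larger emission probability, which is the behavior the HMM relies on.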

The PCAdmix model (Brisbin et al., 2012) combines the HMM with principal component analysis (PCA). The principal components (PCs) of the ancestral populations are first calculated from the phased genotypes of the ancestral representatives, and the phased genotypes of the admixed individuals are then projected onto the component space. Each window is used as one observation in the HMM. The ancestry score for haplotype i in window w is the weighted sum $S_{iw} = L_w g_{iw}$, where $g_{iw}$ is a column vector of the haplotype's alleles in the window and $L_w$ is a matrix in which each column carries the PC loadings of one SNP in the window. The score comprises the ancestry scores across the first K − 1 PCs, where K is the total number of ancestral populations. The emission probability is $P(S_{i,w} \mid anc_{i,w} = j)$, where $anc_{i,w} = j$ denotes that haplotype i at window w has its ancestry from population j. The transition probability is $P(anc_{i,w} = j \mid anc_{i,w-1} = k)$.

A forward-backward algorithm is applied to find the posterior probability for each window in the admixed haplotype.

In the Supportmix model, SVM and HMM algorithms are combined. Independent SVM classifiers are first applied to each genomic window to retrieve putative ancestry origins. The outputs of the SVMs are then fed to the HMM to refine the ancestral assignment for each window. The emission probabilities are p for the hidden state and (1 − p)/(k − 1) for the other states, where k is the number of ancestral populations and p is derived from the SVM classification at the corresponding window. LD is considered in the HMM, where recombination is modeled as a Poisson process. The transition probability is thus defined as $(1 - e^{-gd})/(k - 1)$, where d is the genetic distance (in centimorgans) between the windows and g is the number of generations since admixture.
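The emission and transition terms just described can be written down directly. The sketch below builds the emission vector for one window and the transition matrix between two adjacent windows. The function names and numerical values are hypothetical, and setting the diagonal so that each row sums to one is an assumption of this sketch rather than a detail stated in the text.

```python
from math import exp


def emission_probs(predicted, p, k):
    """Emission vector over k ancestries for a window whose SVM label is
    `predicted`: p for that ancestry, (1 - p) / (k - 1) for each other one."""
    return [p if a == predicted else (1 - p) / (k - 1) for a in range(k)]


def transition_matrix(g, d, k):
    """k x k transition matrix between adjacent windows.
    Off-diagonal entries follow (1 - exp(-g * d)) / (k - 1); the diagonal
    is chosen so that each row sums to one (assumption of this sketch)."""
    switch = (1 - exp(-g * d)) / (k - 1)
    stay = 1 - (k - 1) * switch
    return [[stay if a == b else switch for b in range(k)] for a in range(k)]


# Arbitrary illustrative values: 3 ancestries, 8 generations since admixture.
emit = emission_probs(predicted=1, p=0.9, k=3)
trans = transition_matrix(g=8, d=0.002, k=3)
```

Under this construction the probability of staying in the same ancestry equals $e^{-gd}$, so larger genetic distances or older admixture make switches between windows more likely.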

The LAMP-LD model (Baran et al., 2012) 

The HMM family, based on extensions of the original algorithm, includes factorial-HMM (F-HMM), hierarchical-HMM (H-HMM), Markov-HMM (M-HMM), conditional random field (CRF), and two-layer HMM. Their transition and emission probabilities have been modified to strengthen the learning of the original HMM. LAI models based on the HMM family include ALLOY (Rodriguez et al., 2013), Saber, HAPAA (Sundquist et al., 2008), ELAI (Guan, 2014), and SWITCH (Sankararaman et al., 2008a). ALLOY applies an F-HMM to capture the parallel processes that give rise to the paternal and maternal admixed haplotypes. This, in turn, improves the estimation of the HMM parameters, especially the emission probabilities. Saber and SWITCH extend the traditional emission probabilities at a marker by using the joint distribution of alleles at two neighboring markers. SWITCH depends on pairwise SNP allele frequencies between consecutive markers, whereas the Saber model relies on the allele frequencies at the two consecutive markers. Unlike the M-HMM emission probability models of SWITCH and Saber, HAPAA has an emission probability given by a 5 × 5 stochastic matrix and is historically the first model of the series (Sundquist et al., 2008). Most of the transition probabilities in the extended HMMs still consider the genetic distance and the number of generations. Like Supportmix, RFMix adopts a multi-class classification model to identify chromosome segments of similar ancestry and uses a CRF to smooth the ancestral window information.

The ALLOY model (Rodriguez et al., 2013) uses an F-HMM, an extended form of the HMM that captures the parallel processes producing the maternal (m) and paternal (p) admixed haplotypes. The hidden states are denoted $H^m_l$ and $H^p_l$, the haplotype cluster memberships at position l, each drawn from $a_l \in A_l$. $G_l \in \{0, 1, 2\}$ is the observed genotype at the same marker position and represents the count of the minor allele. Across all L marker positions, the vectors of haplotype cluster memberships and genotypes are $H^{\{m,p\}} = (H^{\{m,p\}}_1, \ldots, H^{\{m,p\}}_L)$ and $G = (G_1, G_2, \ldots, G_L)$, respectively. In the model, the posterior marginal $P(H^m_l, H^p_l \mid G)$ given the sampled genotypes is first computed by applying the forward-backward algorithm. The local term is the product of the emission probability $P(G_l \mid H^m_l = a_l, H^p_l = a_l)$ and the transition probability $P(H_l \mid H_{l-1})$.

The Saber model (Tang et al., 2006) computes the posterior probability of the hidden states in the M-HMM based on forward and backward algorithms and adds the relationship between the observed genotype along each chromosome. The transition probabilities of the initial state are given in Equation (3).

where Z t represents unobserved ancestry, π represents the genome-wide average individual admixture, and τ is the time since admixing. The distribution of O t f given Z t f is described by the emission probability; O t f represents the observed genotype. The allele frequency in each ancestral population is considered as a natural choice of emission probabilities at a particular marker. In M-HMM, the model further requires the alleles' joint distribution at two neighboring markers. Equation (4) can be defined as the emission probability at marker t.

The SWITCH model (Sankararaman et al., 2008a) uses an M-HMM and presents an effective initialization procedure that yields highly accurate results at a notably reduced computational cost, with the expectation maximization (EM) algorithm used for parameter estimation. In each EM iteration, the ancestry information of each haplotype is represented by the matrix Z, and the matrix W denotes recombination events. The Z and W updates are computed with the Viterbi algorithm using emission probabilities $\Pr(X_{i,j} \mid Z_{i,j}, p_j, q_j)$, which are replaced with an integral over $p_j$ and $q_j$; $X_{i,j}$ is the entry of the observed binary SNP matrix at the j-th SNP of the i-th haplotype. The expectation step includes the calculation of the posterior probabilities of $p_j$ and $q_j$; that is,

This step can be performed via Bayes' theorem. The maximization step consists of solving m separate optimization problems in $Z_i$, $W_i$, $i \in \{1, \ldots, m\}$, where $Z_i$ is the vector of ancestries for the i-th haplotype and $W_i$ is the complementary vector of recombination events, as shown in Equation (5).

. corresponds to the log transition probabilities and I j,i Z i,j represents the expectations of the log emission probabilities. α refers to the fraction of the first population in the ancestral population.

In the HAPAA model (Sundquist et al., 2008), which is based on an H-HMM, multiple HMMs are integrated into one model. The model assumes N populations $P = \{P_1, P_2, \ldots, P_N\}$, each represented by a set of $n_p$ model individuals, $P_p = \{a_{p1}, a_{p2}, \ldots, a_{pn_p}\}$. The emission probability is given by a 5 × 5 stochastic matrix, $P(\bar{a}_i = x \mid y_i = S_{pkh})$, where $y_i$ denotes the hidden state variable and $S_{pkh}$ is the state for haplotype $h \in \{0, 1\}$ of individual k in population p. An emitting state starts with equal probability for each population, $P(y_1 = S_{pkh}) = 1/(2 N n_p)$. Every state $S_{pkh}$ has three possible transitions: back to itself with probability $(1 - w_{pki}) e^{-\tau_p R_i}$, to the other presumed haplotype of the same individual, $S_{pk(1-h)}$, with probability $w_{pki} e^{-\tau_p R_i}$, or to the exit state $Out_p$ with probability $1 - e^{-\tau_p R_i}$. Here, $\tau_p$ is the recombination rate, which is estimated from training samples, $w_{pki}$ is the probability of a phasing switch error, and $R_i$ is the genetic distance between the loci. The parameters, including the emission probabilities $P(\bar{a}_i = x \mid y_i = S_{pkh})$ and the transition probabilities $P(Out_p \to In_p)$, are updated on the training examples using an EM algorithm.
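As a quick consistency check on the three transitions described above, the probabilities of leaving an emitting state should sum to one. Using the symbols from the text, the short derivation below shows that they do.

```latex
\begin{align*}
(1 - w_{pki})\,e^{-\tau_p R_i} + w_{pki}\,e^{-\tau_p R_i} + \left(1 - e^{-\tau_p R_i}\right)
  &= e^{-\tau_p R_i} + \left(1 - e^{-\tau_p R_i}\right) \\
  &= 1 .
\end{align*}
```

The factor $e^{-\tau_p R_i}$ can therefore be read as the probability of remaining within the same model individual between adjacent loci, with $w_{pki}$ splitting that probability between the two haplotypes of the individual.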

In the ELAI model (Guan, 2014), a two-layer HMM is used: the upper-layer switch probabilities describe the switching frequency between ancestral populations, while the lower-layer switch probabilities describe the switching frequency between haplotypes within each ancestral population. For each individual i, let $X_m^{(i)}$ and $Y_m^{(i)}$ be the hidden states of the upper and lower clusters at marker m.

Here, $X_m^{(i)}$ takes values in $\{1, \ldots, S\}$ and $Y_m^{(i)}$ takes values in $\{1, \ldots, K\}$. The emission of the haplotypic marker $h_m^{(i)}$ of individual i at marker m from a lower-layer cluster is given in Equation (6).

The complete data likelihood combines with the lower-layer and upper-layer clusters, as shown in Equation (7).

where ξ is defined as the parameter correlating with the HMM. The first marker and the Markov transitions are expressed as follows because the model takes two scales of LD occurring in admixed individuals into consideration:

and

In the RFMix model (Maples et al., 2013), a CRF and the random forest (RF) algorithm (Wu and Zhao, 2019) are combined. In a CRF with a chain structure, all potential functions operate on pairs of adjacent haplotype label variables, $H_i$ and $H_{i+1}$. Firstly, the emission probability is learned: an RF is trained on the reference haplotype segments in each window and is then used to estimate the posterior probabilities of the ancestry $A_{i,*}$ given the segment of the admixed haplotype in that window. Secondly, the transition probability is learned. The joint probability of the local ancestries in adjacent windows, $P(A_{i,p} = j, A_{i,p+1} = k)$, depends primarily on the global ancestry proportions of the individual and the likelihood of recombination between the pair of windows. Thirdly, a linear-chain CRF is used to model $P(A_{i,*} \mid H_{i,*})$ independently for each admixed chromosome. The EM method is used to update the above parameters. To account for phasing errors, $P(A_{i,*}, A_{i^c,*}, H_{i,*}, H_{i^c,*} \mid O_{i,*}, O_{i^c,*})$ is modeled, where i and $i^c$ index the two copies of the chromosome under evaluation for a specific admixed subject, $O_{i,*}$ represents the phased sequence observed for chromosome i as given by the phasing algorithm, and $H_{i,*}$ indicates the set of possible haplotypes in each window.

In addition to the HMM family, there are non-HMM models that are based on basic algorithms and data mining techniques. For example, Loter is a parameter-free model that uses dynamic programming (DP) to obtain a globally optimal solution. Chromopainter adopts PCA to identify chromosome segments of similar ancestry and uses Markov chain Monte Carlo (MCMC) (Gilks, 1999) to smooth the ancestral segment information.

The Chromopainter model (Lawson et al., 2012) is based on PCA and MCMC (Gilks, 1999). Firstly, PCA is applied to the coancestry matrix, whose element $x_{ij}$ is an estimate of the number of discrete segments of individual i that are most closely related to the corresponding segments of individual j. The model assumes that chunks, each donated with probability $P_{q_i q_j}/n_{q_j}$, are independent across individuals; hence, the terms for different individuals are multiplied, which results in the complete likelihood shown in Equation (8).

where c can be considered as describing an effective number of chunks, N represents the number of individuals, and individuals i and j belong to populations $q_i$ and $q_j$, respectively. The probability that a single chunk is donated from individual j to individual i is $P_{q_i q_j}/n_{q_j}$, and chunks in different individuals are independent. Secondly, a prior $P_a \sim \mathrm{Dirichlet}(\beta_a = \{\beta_{a1}, \ldots, \beta_{aK}\})$ is selected, where the $\beta_{ab}$ values are proportional to the a priori estimate of each $P_{ab}$. Finally, F is updated within the algorithm via standard Metropolis-Hastings MCMC updates.

In the LAMP model (Sankararaman et al., 2008b), a clustering algorithm called iterated conditional modes (ICM) is used to find the most probable classification of all individuals. The ICM algorithm differs from the traditional EM algorithm. In EM, the E step computes the expected classification θ given the minor allele frequencies $f_l$, which results in a fractional class membership for each individual i. LAMP instead assumes that the initial classification provides a reasonable answer and determines the maximum a posteriori estimate of θ, as indicated here.

For populations $A_s$ and $A_t$, the model uses $G_i$, the genotype $(g_{i1}, \ldots, g_{in})$ of individual i, as shown in Equation (9).

In the M step, the maximum-likelihood estimates of $f_1, \ldots, f_k$ are obtained, as shown in Equation (10).

Loter Model

The Loter model (Dias-Alves et al., 2018) adopts DP and supposes that the ancestral populations contain n individuals, which results in 2n haplotypes denoted $(H^1, \ldots, H^{2n})$. The value (0 or 1) of the i-th haplotype at the j-th SNP is denoted $H^i_j$. A haplotype h of an admixed individual is described by a vector $(s_1, \ldots, s_p)$ of haplotype labels: for the j-th SNP in the dataset, $s_j = k$ if haplotype h is a copy of haplotype $H^k$ at that SNP. The optimization problem consists of minimizing the cost function shown in Equation (11). A variant of the cost function that accounts for phasing errors is shown in Equation (12).

Here, $(s_1, \ldots, s_p)$ is in $\{1, \ldots, 2n\}^p$. The optimization problem involves a regularization parameter, λ. A high λ strongly penalizes transitions between parental haplotypes and therefore favors long chunks of constant local ancestry. $A_1 = (0, \ldots, 0)$ and $A_2 = (1, \ldots, 1)$ represent the two possible ancestry states; the haploid local ancestries are represented by two vectors, each in $\{0, 1\}^p$.
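The copying problem above can be sketched as a small dynamic program. The snippet below assumes a cost equal to the number of allele mismatches between h and the copied reference haplotypes plus λ times the number of label switches; this form is an assumption made for illustration (Equation (11) is not reproduced here), and the function name and toy haplotypes are hypothetical.

```python
def best_copy_path(h, refs, lam):
    """Return (cost, labels) minimizing mismatches + lam * switches, where
    labels[j] is the index of the reference haplotype copied at SNP j."""
    n_refs, p = len(refs), len(h)
    cost = [float(refs[k][0] != h[0]) for k in range(n_refs)]
    back = []  # back[j - 1][k] = best predecessor label of k at SNP j
    for j in range(1, p):
        k_min = min(range(n_refs), key=lambda k: cost[k])
        new_cost, pointers = [], []
        for k in range(n_refs):
            # Either keep copying haplotype k or switch from the cheapest one.
            if cost[k] <= cost[k_min] + lam:
                prev, base = k, cost[k]
            else:
                prev, base = k_min, cost[k_min] + lam
            new_cost.append(base + (refs[k][j] != h[j]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    k = min(range(n_refs), key=lambda k: cost[k])
    total, labels = cost[k], [k]
    for pointers in reversed(back):
        k = pointers[k]
        labels.append(k)
    return total, labels[::-1]


refs = [
    [0, 0, 0, 0, 0, 0],   # toy haplotype from ancestral population 1
    [1, 1, 1, 1, 1, 1],   # toy haplotype from ancestral population 2
]
h = [0, 0, 0, 1, 1, 1]    # toy admixed haplotype

cost_low, labels_low = best_copy_path(h, refs, lam=1.0)
cost_high, labels_high = best_copy_path(h, refs, lam=5.0)
```

With the small penalty, one switch in the middle of the haplotype is cheaper than three mismatches; with the large penalty, the path stays on a single reference haplotype, which illustrates how a high λ favors long chunks of constant ancestry. Because only the cheapest previous label needs to be considered for a switch, the recursion runs in time proportional to the number of SNPs times the number of reference haplotypes.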

In the EILA model (Yang et al., 2013), fused quantile regression (FQR) and a k-means classifier are used in three steps. Firstly, EILA defines a score $e_{j,i}$ (a continuous variable ranging from 0 to 1) for the admixed genotype $g_{j,i}$ (= 0, 1, 2) as the probability that $g_{j,i}$ descends from ancestry A. This is shown in Equation (13).


Secondly, $\theta_{j,i}$ is defined as a smooth series, and the breakpoints of ancestral blocks are inferred by using FQR: $\theta_{j,i}$ is estimated as the value that minimizes $\sum_{j=1}^{m} |e_{j,i} - \theta_{j,i}| + \lambda \sum_{j=2}^{m} |\theta_{j,i} - \theta_{j-1,i}|$. A smaller λ lowers the penalty, so that the fitted value of $\theta_{j,i}$ is closer to the observed $e_{j,i}$. Thirdly, the breakpoints for all admixed individuals are identified, and the model infers the local ancestry of all segments between breakpoints via k-means to obtain a high power of inference.
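To illustrate how λ trades fidelity against smoothness in the second step, the sketch below evaluates a fused objective for two candidate series: one that follows every score exactly and one that is piecewise constant with a single breakpoint. The absolute-value form of both terms, the function name, and the toy scores are assumptions made for illustration; the snippet only evaluates the objective and does not solve the FQR problem.

```python
def fqr_objective(e, theta, lam):
    """Fit term plus lam times the total variation of theta."""
    fit = sum(abs(ej - tj) for ej, tj in zip(e, theta))
    rough = sum(abs(theta[j] - theta[j - 1]) for j in range(1, len(theta)))
    return fit + lam * rough


e = [0.9, 0.8, 0.2, 0.9, 0.1, 0.2, 0.1]       # noisy per-marker scores
raw = list(e)                                  # follows every score exactly
step = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]    # one breakpoint after marker 4

candidates = [("raw", raw), ("step", step)]
small = {name: fqr_objective(e, th, 0.1) for name, th in candidates}
large = {name: fqr_objective(e, th, 2.0) for name, th in candidates}
```

With the small penalty the series that tracks the observed scores has the lower objective, whereas with the large penalty the piecewise-constant series wins; its single jump is what would be reported as a breakpoint.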

In the LASER 2.0 model (Wang et al., 2015), PCA and projection Procrustes analysis (PPA) are combined. Firstly, PCA is conducted on the genotypes of a set of N reference individuals to construct a K-dimensional reference ancestry map. For each study sample, a further PCA is carried out on the genotypes at the markers that overlap between the N reference individuals and the sample, yielding a K′-dimensional map of N + 1 individuals (K′ greater than or equal to K). PPA is then performed to determine the optimal set of transformations on the sample-specific PCA map that maximizes its similarity to the reference ancestry map. For the same N reference individuals, the two sets of coordinates are $X_{N \times K'}$ and $Y_{N \times K}$, and PPA finds a set of transformations f that projects X from the K′-dimensional space to the K-dimensional space and minimizes the sum of squared Euclidean distances between f(X) and Y. Assuming that X and Y have both been centered at the origin, the objective is to find an isotropic scaling factor ρ and an orthonormal projection matrix $A_{K' \times K}$ that minimize $\|\rho X A - Y\|_F^2$.

Here, we performed a bibliometric analysis of LAI research. "Local ancestry inference" was selected as the search topic in the NCBI database for the period from 2000 to 2020. Each bibliographic record includes detailed information on the published article, including its title, abstract, and keywords. Figure 1A shows that the number of published articles on LAI increased significantly from 2012. From 2000, when Chapman and Thompson (2001) published Linkage Disequilibrium Mapping: The Role of Population History, Size, and Structure, until 2020, 186 articles were published. The major topics in LAI research are shown in Figure 1B. The visual representation, known as a foam tree, was generated using the clustering tool Carrot II (Cost et al., 2002) based on 40 clusters. The leading topics of research are disease association and human history. We analyzed the main contents of the cited articles for each model in Figure 1C, which illustrates that research on human history plays a leading role in LAI analysis and model development. LAI research is also widely applied to disease risk, wildlife conservation, and domestication. Figure 1D shows four original types of research and seven model designs with top citations, which may play a driving role in LAI research. During 2006-2016, LAI research attracted many research groups, and LAI publications experienced a peak period. LAI research has since spread gradually and widely into different fields of science and has continued to advance in the following years (Lao et al., 2006; Sankararaman et al., 2008b; Price et al., 2009; Bryc et al., 2010, 2015; Gravel, 2012; Lawson et al., 2012; Eaton and Ree, 2013; Loh et al., 2013; Maples et al., 2013; Moreno-Estrada et al., 2013; Jeong et al., 2014).
To benchmark the computational efficiency and accuracy of the seven most used models (Chromopainter, LAMP, LAMP-LD, Loter, RFMix, Seqmix, and Supportmix), we simulated data using SLiM 3.2 (Messer, 2013) and estimated the average running time (ART), memory footprint size (MFZ), the mean squared error ($\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\mathrm{observed}_i - \mathrm{predicted}_i)^2$) for an individual genome, standard deviation (SD), and the coefficient of variation (CV) for each model. In the SLiM simulation, we first generated two ancestral populations over 5,000 generations. The two initial populations then differentiated into five admixed subpopulations with different infiltration rates after 4,000 generations. In the next step, the differentiated individuals evolved freely for 5-1,000 generations, sampled at intervals of five generations. This step was repeated 20 times. Finally, we randomly selected 1,000 ancestral populations and 500 admixed populations to simulate LAI in the seven models. Table 2 shows further details regarding the simulation parameters and other simulation processes.
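The accuracy and stability summaries named above can be computed with a few lines of Python. The following is a minimal sketch with made-up ancestry calls and accuracy values; the function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def mse(observed, predicted):
    """Mean squared error between true and inferred ancestry labels."""
    return mean([(o - p) ** 2 for o, p in zip(observed, predicted)])


def summarize(accuracies):
    """Mean, sample SD, and coefficient of variation (SD / mean)."""
    m, sd = mean(accuracies), stdev(accuracies)
    return {"mean": m, "sd": sd, "cv": sd / m}


# Made-up example: true and inferred ancestry (0 or 1) at eight markers.
truth = [0, 0, 1, 1, 1, 0, 0, 1]
calls = [0, 0, 1, 1, 0, 0, 1, 1]
err = mse(truth, calls)          # two of eight markers are wrong
accuracy = 1 - err

# Made-up per-replicate accuracies (1 - MSE) for one model.
summary = summarize([0.80, 0.75, 0.85, 0.90])
```

For binary ancestry labels the MSE equals the fraction of miscalled markers, so 1 − MSE can be read directly as per-marker accuracy.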

As shown in Table 3 , we adopted seven models in SLiM 1-3 and six models in SLiM 4-5 because Seqmix can only handle two ancestral groups. The most efficient model is LAMP with respect to the run time (ART = 1.50 s) and memory size (MFZ = 53.74 Mb); however, its accuracy is slightly lower (1 -mean of MSE = 0.67) and the results are not stable (SD = 0.20). The primary reason is the total reliance of this model on biological parameters. Seqmix based on aligned bases turns out to be the most accurate (1 -mean of MSE = 0.86) and stable (SD = 0.08) model, while it is also efficient enough.

Loter is the only model with a parameter-free process; it has moderate accuracy (1 − mean of MSE = 0.79) and fair stability (SD = 0.10) but requires a comparatively long running time (ART = 2,506.70 s). RFMix also has moderate accuracy (1 − mean of MSE = 0.80) and fair stability (SD = 0.10), but it consumes a lot of memory (MFZ = 2,472.29 Mb). The pros and cons of the different models are weighed in Table 4.

As shown in Figure 2, the phased data had a higher accuracy than the unphased data. Moreover, the difference between the phased and unphased results (1 − mean of MSE) was significant for all simulated values in each paired comparison with Tukey's HSD test (all P < 0.05). As shown in Table 3, the CV of the phased results is lower than that of the unphased results for all simulated values, indicating the higher stability of phased data.

Various challenges confront researchers when inferring local ancestry from genome-wide data. Firstly, several models need complex parameters, such as a genetic map and the number of generations since admixture, that are difficult to supply, particularly for non-model species. Secondly, some models only use haplotype information, and unlinked markers are removed via the trimming step. With this process, many informative SNPs are lost. Thirdly, because some models exclude probable ancestral informative haplotypes, unmodeled LD could cause systematic biases in determining ancestry, which results in false-positive conclusions regarding the deviation in ancestry at specific loci. Lastly, ancestral segments are windows or blocks of varying lengths; however, existing models commonly use a window of fixed size for simplification. The length of ancestral segments is inversely proportional to the number of generations since admixture. As the number of generations is rarely known, it is difficult to determine the breakpoints or transition points of ancestral haplotypes based on the statistics of the ancestral group or even an individual's genome.

Table row: The number of generations producing true populations | 5000 | 5000 | 5000 | 5000 | 5000

We summarize these models with their distinct advantages and disadvantages as follows: (i) We recommend Seqmix if the genotype is hard to call, and aligned bases can be used directly in this model (Hu et al., 2013) . (ii) ALLOY utilizes F-HMM and the haplotype structure of the compound state to improve its accuracy. We recommend this model if ancient and complex admixtures need to be analyzed (Rodriguez et al., 2013) .

(iii) We recommend Saber if high-density SNP panels exist; however, a potential weakness of the M-HMM, compared with an HMM, is that its accuracy decreases when the genetic information on the ancestral populations is not rich (Tang et al., 2006). (iv) ELAI is appropriate when researchers need to detect further haplotype structure, because it models the two scales of LD in admixture with a two-layer HMM: upper-layer latent clusters enforce structure on the haplotypes, and lower-layer latent clusters depict ancestral haplotypes (Guan, 2014).

(v) We recommend EILA if researchers are interested in the estimation of recombination events. The model has the advantage of tolerating a lack of high-quality haplotype information for the ancestral populations; however, its reliance on k-means, an unsupervised clustering method, is a potential weakness that may reduce the stability of the calculations (Yang et al., 2013). (vi) Loter uses DP to obtain a globally optimal solution, and its advantage is that it is parameter-free (Dias-Alves et al., 2018).

FIGURE 2 | Box plots of the accuracy of local ancestry inference (LAI) using a benchmark. The red hollow arrows indicate a higher accuracy by the median comparison in this simulation. The results showed that phased data had higher accuracy in comparison with unphased data.

LAI is combined with other bioinformatics approaches and is widely used in different research fields, including the breeding of new varieties, the protection of endangered animals and plants, and the prevention and treatment of human genetic diseases. In the study of population structure, the ADMIXTURE (Alexander et al., 2009) and STRUCTURE (Pritchard et al., 2000) models estimate population allele frequencies and model the probability of the observed genotypes through ancestry proportions. Both models can be used to assign global ancestry. They are applied in fine-matched, corrected association research, and their results are relatively consistent with those of LAI. Galaverni et al. obtained better estimates of the actual admixture proportions of hybrids by combining global and local ancestry inferences (Galaverni et al., 2017). Up to about 50% of blocks from domesticated individuals were identified by PCADMIX in the hybrid genome. The results of the analysis were consistent with those estimated in ADMIXTURE at K = 2.
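The global ancestry model mentioned above can be illustrated with a short sketch. Under the binomial likelihood used by ADMIXTURE-style models, the genotype of individual i at SNP j (coded as 0, 1, or 2 copies of an allele) is drawn with allele probability equal to the ancestry-weighted mean of the population allele frequencies. The function name `admixture_loglik`, the matrix names `G`, `Q`, and `F`, and the toy values are assumptions chosen for illustration; this is not code from the ADMIXTURE or STRUCTURE software, and the constant binomial coefficient is omitted.

```python
import numpy as np

def admixture_loglik(G, Q, F):
    """Log-likelihood (up to a constant) of genotypes under an admixture model.

    G: (individuals x SNPs) allele counts in {0, 1, 2}
    Q: (individuals x K) ancestry proportions, rows sum to 1
    F: (K x SNPs) population allele frequencies
    """
    G = np.asarray(G, dtype=float)
    P = np.asarray(Q, dtype=float) @ np.asarray(F, dtype=float)  # per-genotype allele probability
    P = np.clip(P, 1e-12, 1 - 1e-12)                             # guard against log(0)
    return float(np.sum(G * np.log(P) + (2 - G) * np.log(1 - P)))

# Toy data with K = 2: each individual matches one population closely.
G = [[2, 0],
     [0, 2]]
F = [[0.9, 0.1],
     [0.1, 0.9]]
Q_unadmixed = [[1.0, 0.0], [0.0, 1.0]]
Q_even = [[0.5, 0.5], [0.5, 0.5]]
print(admixture_loglik(G, Q_unadmixed, F) > admixture_loglik(G, Q_even, F))
```

Fitting such a model means searching for the `Q` and `F` that maximize this quantity. The resulting rows of `Q` are the global ancestry proportions that can be compared with the genome-wide average of local ancestry calls.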

In the study of domestication, LAI has been used to analyze the admixture composition of selected individuals carrying the minor allele at the peak markers of quantitative trait loci (QTL). For example, in one study, QTL were located in a chromosome segment substitution line (CSSL) population derived from an interspecific cross between a wild aus-like Oryza rufipogon donor accession and cv. Curinga (an upland tropical japonica variety from Brazil). It was found that some CSSLs carried a wild aus-like introgression across the target segment, in contrast to the rest of the CSSLs, which carried the tropical japonica genotype (Wang et al., 2017). In the study of ancient DNA, LAI and masking are used to reconstruct population-specific surrogates of the ancestral components that cover the entire genome. Yelmen et al. applied this technique to reconstruct population-specific surrogates of South Asian and West Eurasian populations, which compensated for the low quantity and low coverage of the available data and provided a substantial advantage (Yelmen et al., 2019).

Wild populations contribute significantly to the adaptation of domesticated populations; therefore, their presence or absence is important for breeding and genetics-related studies. Many favorable traits exist in wild populations but were lost during domestication. Advantageous or disadvantageous alleles can be located by constructing a hybrid population and then assigned to the corresponding ancestral source. This helps in understanding the molecular mechanisms behind the traits and in characterizing the valuable pool of genetic resources found in wild populations. Domesticated rice (Oryza sativa) serves as an example. Some traits of wild rice (such as persistent seed dormancy and freely shattering seed) may be highly adaptive if introgressed into weedy rice populations. Conversely, other traits of wild rice (prostrate plant architecture and sporadic seed production) are considered unsuitable for survival in domesticated rice. Given the potential combination of advantageous and disadvantageous traits for weedy rice, introgression from wild rice into weedy rice can be expected to confer adaptive traits at specific genomic regions. For example, several regions were likely introgressed from wild accessions: PROG1, controlling prostrate versus erect growth; qSW5, controlling seed size; sh4, controlling grain shattering; Bh4, controlling hull color; An-1, controlling awn development; and Rc, controlling pericarp pigmentation (Vigueira et al., 2019). In another study, the analysis of whole genomes of wild caprids and domestic goats revealed evidence of ancient introgression from a West Caucasian tur-like population into the ancestor of domestic goats. It was further revealed that the MUC6 gene was an introgressed locus with a strong selection signature and conferred enhanced immune resistance to gastrointestinal pathogens (Zheng et al., 2020). A third case is the wild yeast Saccharomyces eubayanus. The yeasts used to brew lager-style beers are interspecies hybrids (S. eubayanus × Saccharomyces cerevisiae). It was found that the wild isolates of S. eubayanus are not the closest relatives of lager-brewing hybrids. Instead, the genetic composition of lager yeasts was contributed by S. eubayanus strains with continuous variation, thus revealing the complex ancestries of lager yeasts (David et al., 2016).

The LAI model can be a powerful tool for protecting wild species by identifying the ancestry of genomic segments in hybrids. As described by Galaverni et al., domestic dogs (Canis lupus familiaris) can reproduce with wild wolves (Canis lupus), coyotes (Canis latrans), and golden jackals (Canis aureus). The gene pools of several wild canid populations are threatened by the widespread diffusion of stray dogs in human-dominated areas. The use of the LAI model and genotype-phenotype association procedures identified putative dog-derived causal mutations associated with phenotypic variants, thereby supporting a conservation strategy. One example is the black coat color: this trait is encoded by a 3-bp deletion at the β-defensin gene CBD103 that was possibly introduced into wolves by ancient hybridization with dogs (Galaverni et al., 2017).

The LAI model can be applied to the treatment and prevention of human genetic diseases by assigning ancestry to chromosomal regions and applying admixture mapping to identify candidate genes. Dengue has become a worldwide health concern due to the increased dispersal of the virus and its vectors. LAI analysis has shown that African ancestry has a protective effect against the dengue haemorrhagic phenotype in the admixed Cuban population. This was further supported by the identification of the corresponding candidate genes (Sierra et al., 2017). A similar study indicates that Tibetans are better adapted to high altitude owing to the introgression of associated haplotypes from Denisovans or Denisovan-related populations (Huerta-Sánchez et al., 2014). In a more recent example, a study of about 3,000 coronavirus disease 2019 (COVID-19) patients and control individuals found that a gene cluster is a risk factor for severe symptoms after SARS-CoV-2 infection. This genetic risk factor is conferred by a genomic segment of about 50 kb inherited from Neanderthals (Zeberg and Pääbo, 2020). Furthermore, this genomic segment is carried by about 50% of South Asian and about 16% of European people. In conclusion, these studies not only enhance our understanding of genetic diversity and natural history but also offer valuable evidence for the source of diversity among human beings, animals, plants, and model organisms.

Fast model-based estimation of ancestry in unrelated individuals

Fast and accurate inference of local ancestry in Latino populations

PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations

The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States

Genome-wide patterns of population structure and admixture among Hispanic/Latino populations

Linkage disequilibrium mapping: the role of population history, size, and structure

Integrating distributed information sources with CARROT II

Complex ancestries of lager-brewing hybrids were shaped by standing variation in the wild yeast Saccharomyces eubayanus

Loter: a software package to infer local ancestry for a wide range of species

The birth of a human-specific neural gene by incomplete duplication and gene fusion

Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae)

Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies

Genome-wide analysis of SNPs is consistent with no domestic dog ancestry in the endangered Mexican wolf (Canis lupus baileyi)

Disentangling timing of admixture, patterns of introgression, and phenotypic indicators in a hybridizing wolf population

A comprehensive survey of models for dissecting local ancestry deconvolution in human genome

Markov Chain Monte Carlo

Population genetics models of local ancestry

Detecting structure of haplotypes and local ancestry

Genetic ancestry inference using support vector machines, and the active emergence of a unique American population

Accurate local-ancestry inference in exome-sequenced admixed individuals via off-target sequence reads

Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA

Admixture facilitates genetic adaptations to high altitude in Tibet

Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation

Proportioning whole-genome single-nucleotide-polymorphism diversity for the identification of geographic population structure and genetic ancestry

Inference of population structure using dense haplotype data

Inferring admixture histories of human populations using linkage disequilibrium

RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference

SLiM: simulating evolution with selection and linkage

Reconstructing the population genetic history of the Caribbean

Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations

Inferring ancestry from population genomic data and its applications

Inference of locus-specific ancestry in closely related populations

Sensitive detection of chromosomal segments of distinct ancestry in admixed populations

Inference of population structure using multilocus genotype data

Ancestry inference in complex admixtures via variable-length Markov chain linkage models

On the inference of ancestries in admixed populations

Estimating local ancestry in admixed populations

Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer

An introduction to hidden Markov models

OSBPL10, RXRA and lipid metabolism confer African-ancestry protection against dengue haemorrhagic fever in admixed Cubans

Effect of genetic divergence in identifying ancestral origin using HAPAA

Reconstructing genetic ancestry blocks in admixed individuals

Call of the wild rice: Oryza rufipogon shapes weedy rice evolution in Southeast Asia

Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation

The buffering capacity of stems: genetic architecture of nonstructural carbohydrates in cultivated Asian rice

Machine learning technology in the application of genome analysis: a systematic review

Efficient inference of local ancestry

Ancestry-specific analyses reveal differential demographic histories and opposite selective pressures in modern South Asian populations

The major genetic risk factor for severe COVID-19 is inherited from Neanderthals

The origin of domestication genes in goats

JW and YZ wrote the paper. YL organized and designed the benchmark. YZ supervised the study and revised the manuscript. All authors have read and commented on the manuscript and approved the final version. 

We are grateful for the support of the high-performance computing platform of the State Key Laboratory of Agrobiotechnology.