key: cord-0307437-fx6tf4p1 authors: Yang, Hui; Wu, Jinyuan; Huang, Xiaochang; Zhou, Yunyan; Zhang, Yifeng; Liu, Min; Liu, Qin; Ke, Shanlin; He, Maozhang; Fu, Hao; Fang, Shaoming; Xiong, Xinwei; Jiang, Hui; Chen, Zhe; Wu, Zhongzi; Gong, Huanfa; Tong, Xinkai; Huang, Yizhong; Ma, Junwu; Gao, Jun; Charlier, Carole; Coppieters, Wouter; Shagam, Lev; Zhang, Zhiyan; Ai, Huashui; Yang, Bin; Georges, Michel; Chen, Congying; Huang, Lusheng title: An ancient deletion in the ABO gene affects the composition of the porcine microbiome by altering intestinal N-acetyl-galactosamine concentrations date: 2020-07-16 journal: bioRxiv DOI: 10.1101/2020.07.16.206219 sha: 0a70327be3362a6b54a02c5552a570d6f80f396f doc_id: 307437 cord_uid: fx6tf4p1 We have generated a large heterogenous stock population by intercrossing eight divergent pig breeds for multiple generations. We have analyzed the composition of the intestinal microbiota at different ages and anatomical locations in > 1,000 6th- and 7th- generation animals. We show that, under conditions of exacerbated genetic yet controlled environmental variability, microbiota composition and abundance of specific taxa (including Christensenellaceae) are heritable in this monogastric omnivore. We fine-map a QTL with major effect on the abundance of Erysipelotrichaceae to chromosome 1q and show that it is caused by a common 2.3-Kb deletion inactivating the ABO acetyl-galactosaminyl-transferase gene. We show that this deletion is a trans-species polymorphism that is ≥3.5 million years old and under balancing selection. We demonstrate that it acts by decreasing the concentrations of N-acetyl-galactosamine in the cecum thereby reducing the abundance of Erysipelotrichaceae strains that have the capacity to import and catabolize N-acetyl-galactosamine. Total nr animals produced Table 2 ). Sequence tags were rarefied to ~20,000 per sample, and clustered in 32,032 OTUs 137 (97% similarity threshold). 12,054 OTUs present in at least 0.2% of the samples (with more 138 than two tags in at least two samples) and amounting to an average of 98.7% of sample reads, 139 were retained for further analysis. They were annotated to 41 phyla, 87 classes, 149 orders, 140 207 families, 360 genera and 150 species. The number of OTUs detected per sample averaged 141 1,575 (range: 54 to 3,112) (Suppl. Table 2 ). The first two Principal Coordinates (based on of 58 OTUs that were annotated to 21 taxa were identified in >95% of day 120 and 240 feces 155 and cecum content samples of both F6 and F7 generations, hence defined as core bacterial 156 taxa (Suppl. Fig. 1A ). -diversity (measured by Shannon's index) of fecal samples was lower 157 at day 25 than at days 120-240, reminiscent of the enrichment of the intestinal flora observed 158 between child-and adult-hood in humans (Yatsunenko et al., 2012; Radjabzadeh et al., 2020) . 159 It was also lower for ileal content than for cecal content and mucosa (and probably ileal 160 mucosa) (Fig. 2C ). Six percent of ileal content samples harbored less than 100 OTUs (Suppl. 161 Table 2 ). -diversity (measured by pair-wise Bray-Curtis dissimilarities) tended to be 162 which has largest n). The combined p-value for the five traits combined and computed as 234 above was 0.013, hence providing a second line of evidence for an effect of genetics on 235 microbiome composition in this population (Fig. 3B) . 236 We then evaluated the heritability (ℎ 2 ) of the abundances of individual taxa using a mixed 237 model. This was done for up to 29 phyla, 53 classes, 86 orders, 116 families, 148 genera and 238 4,240 OTUs per data series. Heritabilities were estimated using a mixed model implemented 239 with GEMMA (Zhou & Stephens, 2012) . The model included random polygenic and residual 240 error effects. Kinship coefficients (to constrain the polygenic effect) were computed from 241 whole-genome SNP data, also using GEMMA (Zhou & Stephens, 2012) . To obtain unbiased 242 ℎ 2 estimates and associated p-values, we repeated the analysis 1,000 times with abundances 243 randomly permuted within litter. The average ℎ 2 across the permutations was then 244 subtracted from the ℎ 2 obtained with the unpermuted data to yield conservative estimates 245 of ℎ 2 . Their p-values were estimated as the proportion of permutations yielding an equally 246 high or higher ℎ 2 estimate. Analyses were conducted for the 12 measured data series. P-247 values were ≤ 0.05 for 4,219 (=14%) of the 30,127 realized tests, hence above random 248 expectations (Suppl. Table 4 ). The correlation between F6 and F7 ℎ 2 estimates (or their 249 log(1/p) values) were positive and highly significant (p ≤ 1.1x10 -30 and 1.07x10 -17 , respectively) 250 for D240 fecal samples, hence supporting genuine genetic effects at least for this trait (Suppl. 251 There was no significant correlation between pig and human values when considering all taxa. 257 It is noteworthy, however, that the taxon found to be the most heritable in human, the family 258 Christensenellaceae (Goodrich et al., 2014&2016) , was also amongst the most heritable in 259 pigs. At least seven other taxa were found to be heritable in both human and porcine adult 260 feces (Fig. 3D) . Table 5 ). Both models were implemented with the GenABEL R package (Aulchenko et al., 2007 ) 300 and included sex, batch and the three first genomic principal components as fixed covariates. 301 P-values were further adjusted for residual stratification by genomic control. We obtained 302 1,527 signals encompassing at least three variants with value ≤ 5 × 10 −8 (the standard 303 genome-wide significance threshold). To evaluate whether this number exceeded 304 expectations assuming that all tests were null hypotheses, we performed two analyses. In 305 the first we chose one of the largest (hence best powered) data series (day 240 feces in the 306 F7 generation) and repeated all GWAS on a dataset with permuted (within litter) genotype 307 vectors. The number of microbiota QTL (mQTL) signals detected with the real dataset was 308 221, while the number detected with the permuted dataset was 152, hence suggesting a true 309 discovery rate of ~30%. In the second, we collected -for each of the 1,527 signals with p-310 value(s) ≤ 5 × 10 −8 described above (corresponding each to a lead SNP x taxon x trait x 311 cohort x model combination) -the p-values for the same SNP x taxon x model combination, 312 yet for all other trait x other cohort combinations. Thus, we would typically collect ~5-7 p-313 values for each such signal. We reasoned that if the initial signals included a sufficient 314 proportion of true positives, the lead SNPs would have similar effects in at least some of the 315 traits in the other cohort and the collected p-values concomitantly shifted to low values. The 316 corresponding distribution of p-values was examined by means of a QQ-plot, and was 317 compared with the distribution obtained with an equivalent number of randomly selected mQTL in our data (Fig. 4A ). Of note, the average (F6 and F7) number of genome-wide 322 significant mQTL was positively correlated with the average (F6 and F7) taxon's heritability, 323 particularly for D240 fecal samples (p = 5.2x10 -6 ) (Suppl. Fig. 3B ). 324 To identify the corresponding mQTL yet properly accounting for the large number of realized 325 tests before declaring experiment-wide significance and simultaneously provide confirmation 326 in an independent cohort, we performed meta-analyses (across traits) separately in the F6 327 and F7 generations for the 1,527 above-mentioned signals. We designed an empirical meta-328 analysis approach that accounts for phenotypic correlation across traits if it exists (cfr. 329 Methods). The discovery threshold was set at 0.05 10 6 ×1,527×2×2 = 8.2 × 10 −12 hence corrected 330 for the size of the genome, the number of tested taxa (because genome-wide significant), the 331 two used statistical models, and the two studied cohorts (F6 and F7). There was no need to 332 correct for the number of traits as the meta-analysis generated one statistic for all traits. The 333 confirmation threshold was set at 0.05/ where was the number of signals exceeding the 334 discovery threshold in at least one cohort. Thus, we searched for signals (defined as lead SNP 335 x taxon x method combinations) that would exceed the discovery threshold in either the F6 336 or F7 cohort and the confirmation threshold in the other. We identified seven signals 337 exceeding the discovery threshold in at least one cohort, hence setting . For six of those, as well as genus p-75-a5 to which OTU-476 is assigned (Suppl. standing out as being highly significant in both F6 and F7 (Fig. 4E ) All lead SNPs of the meta-analyses conducted in the F6 and F7 generation map to the 3' end 398 of the porcine acetyl-galactosaminyl transferase gene that is orthologous to the gene 399 underlying the ABO blood group in human (Fig. 4B ). This is a strong candidate gene known to 400 modulate interactions with several pathogens (Cooling et al., 2015) , although not directly 401 known to affect intestinal microbiota composition in healthy humans (Davenport et al., 2016) . 402 Four non-synonymous ABO SNPs (R37G, A48P, S60R, G66C) segregated in the F6 and F7 403 generation but none of these were in high LD with any of the lead SNPs ( 2 ≤ 0.18 Fig. 4D ). In these samples (D120, D240, CC, CM), AO genotype explained 440 on average 7.9%, 3.2% and 6.6% of the variance in abundance for OTU-476, OTU-327 and 441 genus p-75-a5 (Suppl. Fig. 4E ). Of note, the abundance of OTU-476 and OTU-327 was shown 442 to be highest in cecal content where they account on average for respectively ~0.92% and 443 ~0.02% of reads in AA/AO animals, and for 0.47% and 0.003% of reads in OO animals ( Fig. 5 444 and Suppl. Fig. 4F ). 445 The ABO locus is known in humans to be under strong balancing selection that has GalNAc ( 1-3 linkage) to a variety of glycan substrates sharing a Fuc 1-2Gal 1-4GlcNAc or 515 Fuc 1-2Gal 1-3GlcNAc (H antigen) extremity (Cooling, 2015) . In the gut, these include the 516 heavily glycosylated secreted and transmembrane mucins constituting the cecal mucus. repressors (agaR, gntR), for a total of six essential pathway constituents (Suppl. Table 7) . 548 Genes involved in the utilization of specific sugars (including GalNAc) tend to cluster and form 549 operons of potentially coregulated genes (regulons) that support all or most of the essential 550 TR/CP steps. The steps that are not encoded by the operon may be complemented in trans 551 by genes encoding enzymes that are often less substrate-specific (Lawrence, 1999; Koonin, 552 2009). We used GhostKOALA (Kanehisa et al., 2016) to search for orthologues of the 24 genes 553 in the two OTU476-like genomes and 3,111 MAGs. We generated two scores to evaluate the 554 capacity of bacterial species to utilize GalNAc. The first (pathway score) counted the number 555 of essential steps in GalNAc utilization (out of six) that could be accomplished by the set of 556 orthologues detected in the genome (cfr. Methods), irrespective of their map position. The 557 second (regulon score) counted the number of essential GalNAc utilization steps that could 558 be fulfilled by orthologues that were clustered in the genome, i.e. forming a potential operon. 559 Following Ravcheev and Tiele (2017), we used agaS as anchor gene to establish the regulon 560 score, i.e. we counted how many essential steps in GalNAc utilization (out of six) were covered 561 by genes located in the vicinity of agaS. 562 The first striking observation was that at least one orthologue of agaS was found in the two 563 (=100%) OTU476-like strains (4-15-1 and 4-8-110), in 31% of Erysipelotrichaceae MAGs 564 (n=248), yet in only 3.0% of other MAGs (n=2,863). The second, was that both scores 565 (pathway and regulon score) were very significantly higher for Erysipelotrichaceae than for for Erysipelotrichaceae and non-Erysipelotrichaceae MAGs combined (ppathway=2.2e-3 and 568 pregulon=1.2e-5) (Fig. 6B ). These comparisons accounted for variation in the MAGs' completion 569 score, number of contigs and predicted size of the corresponding genomes (see Methods) . agaR, respectively (Fig. 6C) . 581 We further showed that adding GalNAc in the culture medium indeed enhances the growth 582 of the two isolated OTU-476 like strains, indicating that these can indeed utilize GalNAc as 583 carbon source (Fig. 6D) . and F7 generation is remarkably uniform and close to expectations (i.e. 12.5%) both at 679 genome-wide and chromosome-wide level (Fig. 1C) , suggesting comparable levels of genetic 680 diversity across the entire genome. This does not preclude that more granular examination the same traits across the F6 and F7 generation, yet marked compositional differences 685 between traits (Fig. 2B ). Even at family-level, some taxa are found to be nearly trait-specific Pasteurellaceae, the firmicutes Clostridiaceae, Peptostreptococcaceae, Bacillaceae, 688 Leuconostocaceae, and the actinobacteria Microbacteriaceae are at least ten times more 689 abundant in ileal than in any other sample type. Amongst those, Leuconostocaceae are nearly 690 digesta-specific, while Pseudomonadaceae are nearly mucosa-specific. The Bacteroidetes 691 Odoribacteraceae and Rikenellaceae were found to be at least ten times more abundant in 692 day 25 feces than in any other sample type. The firmicutes Christensenellaceae were nearly 693 ten times more abundant in feces (irrespective of age) than in any other sample type. This To evaluate the importance of host genetics in determining gut microbiota composition we 698 first examined the relationship between genetic relatedness and microbiota dissimilarity. It 699 is worth re-emphasizing that food and environment was very standardized in this experiment 700 when compared to typical human studies. Genetic relatedness between individuals was 701 measured using genome-wide SNP information while microbiota dissimilarity was measured 702 using Bray-Curtis distance (f.i. Rothschild et al., 2017). We relied on two approaches to 703 mitigate confounding of genetic and environmental effects. In the first we restricted the 704 analyses to full-sibs raised in the same environment, i.e. we confronted genetic similarity and 705 microbiota dissimilarity of litter-mates. In the second we confronted genetic similarity and 706 microbiota dissimilarity across generations (F6 and F7), yet avoiding parent-offspring pairs. 707 Both approaches supported an effect of genetics on microbiota composition manifested by 708 significant negative correlations between genetic similarity and microbiota dissimilarity for 709 (some) individual traits as well as when combining information across traits (Fig. 3A&B) . 710 Regressing squared phenotypic difference on genetic distance is an established way to 711 estimate local and global heritability (Haseman & Elston, 1972; Visscher et al., 2006 ). Yet, 712 Bray-Curtis distance is peculiar in that the phenotypes between which a "difference" is 713 measured are not defined per se (Bray and Curtis, 1957) . To nevertheless evaluate to what 714 degree of heritability the observed negative correlations might correspond, we simulated quantitative traits with various degrees of heritability in the actual pedigrees and examined 716 the distribution of ensuing correlations between phenotypic distance (absolute value) and 717 genetic distance. These analyses indicated that (in the studied, genetically highly divergent, 718 population) the heritability of microbiota composition may be of the order of ~0.80 within 719 litter, and ~0.20 in the overall population (Suppl. Fig. 7) . That the heritability is higher within 720 litter than in the overall population is expected as the environment is obviously more 721 homogeneous within than across litters and generations. Strikingly the impact of genetics was 722 strongest for fecal samples at day 240 in all analyses. This may be in agreement with the 723 observation that, in human, microbiota composition stabilizes with age (Aleman & Valenzano, 724 2019). Yet, why heritability should be higher in feces than for ileal and cecal content and 725 mucosa remains unclear. Sample types with higher alpha-diversity may be more resilient and 726 hence more heritable. 727 We also measured the heritability of the abundance of individual taxa. As before, we only 728 extracted within-litter information to mitigate confounding between environment and 729 genetics. Convincing evidence for a genuine influence of host genetics on taxa abundance 730 was the observation of a significant correlation between heritability estimates in the F6 and 731 F7 generation for fecal samples at day 240 and -to a lesser extend -cecum content (Suppl. 732 Larger windows will increasingly be contaminated with recombinant A-O 766 haplotypes blurring the sought signal. Indeed, for windows ≥ 20-Kb or more, the gene tree 767 corresponds to the species tree, while for windows ≤ 15-Kb the tree sorts animals by AA vs 768 OO genotype (Suppl. Fig. 8 Fig. 8 ). This suggests that the O 772 allele may be older than the divergence of the Phacochoerus and Sus A alleles, i.e. > 10 MYA. 773 It will be interesting to study larger numbers of warthog to see whether the same 2. behavior. No significant effects were observed when accounting for multiple testing (Suppl. 792 Fig. 9 ), including those pertaining to immunity and disease resistance. 793 It is noteworthy that the old age of the "O" allele must have contributed to the remarkable 794 mapping resolution (≤ 3 Kb) that was achieved in this study. In total, 42 variants were in 795 near perfect LD ( 2 ≥ 0.9) with the 2.3 Kb deletion in the F0 generation, spanning 2,298 bp 796 (1,522 on the proximal side, and 762 on the distal side of the 2.3 Kb deletion). This 2.3 Kb 797 span is lower than genome-wide expectations (17th percentile), presumably due to the 798 numerous cross-overs that have accrued since the birth of the 2.3 Kb deletion that occurred 799 in the distant past (Fig. 1E ). Yet the number of informative variants within this small segment 800 is higher than genome-wide average of (57% percentile) also probably due at least in part to 801 the accumulation of numerous mutations since the remote time of coalescence of the A and 802 O alleles (Fig. 1D) . 803 The chromosome 1 QTL was the only signal that exceeded experiment-wide discovery and 804 conformation thresholds. QQ-plots obtained after removing chromosome 1 variants (272.8-805 273.1Mb interval) did not show convincing evidence for residual inflation of log(1/p) values 806 (Suppl. Fig. 3D ). This suggests that the residual heritability most likely has a highly polygenic 807 architecture, as becoming increasingly apparent for most complex traits. and 327) and genus p-75-a5, but affected at least some other Erysipelotrichaceae as well. As 811 mentioned above, effects of ABO genotype on host-pathogen interactions are usually 812 interpreted in the context of adhesion or immune response. Yet an alternative mechanism 813 is by altering the source of carbon upon which intestinal bacteria feed. Small and large 814 intestine are amongst the tissues in which the ABO gene is the most strongly expressed (Suppl. 815 least two predictions. The first is that the intestinal GalNAc content should be higher in AA 822 than in OO pigs and this was indeed shown to be the case (Fig. 6A) . The second is that the 823 bacteria affected by the mQTL should be able to use GalNAc. We isolated two strains with 824 16S rRNA sequences that were near-identical to those of the OTU strain (OTU476) that was 825 most affected by the mQTL, and sequenced their complete genome. We showed that it 826 contained the orthologues of eight genes known to be essential for GalNAc import (AgaPTS: 827 agaE, agaF, agaV, agaW) and catalysis of the first four GalNAc-specific degradation steps 828 (deacetylation: nagA; demanination/isomerisation: agaS; kination: fruK; aldolase: gatZ) 829 hence the five key steps in GalNAc utilization. Importantly, the eight genes clustered in a 50 830 Kb chromosome segment (Fig. 6C) . We generated 3,111 porcine intestinal MAGs from 831 metagenomic shotgun data for comparison. None of these would harbor a GalNAc gene 832 cluster encoding more than four of the five key steps. One catalytic function was always 833 provided in trans, whether GalNAc-6-P deacetylase, tagatose-6-P kinase, or tagatose-1,6-PP 834 aldolase. This finding clearly revealed the unique status of OTU476 with regards to GalNAc 835 utilization. Also consistent with the QTL findings, Erysipelotrichaceae MAGs were strongly 836 enriched in clustered GalNAc TR/CP orthologues when compared to MAGs assigned to other 837 bacterial species. Finally, the growth of the two isolated OTU476-like strains was shown to 838 increase when fed with increasing concentrations of GalNAc (Fig. 6D) . 839 Amongst the 3,111 studied MAGs, 15 harbored a gene cluster able to sustain four of the five 840 steps, while the fifth enzyme was encoded somewhere else in the genome. These included unidentified Firmicutes (Fig. 6C) . Gene order within the corresponding GalNAc clusters was 844 supported by observation in two or more independent MAGs and/or by the fact that all 845 concerned genes resided on the same contig. For all 15, the orthologue needed to fulfill the 846 fifth enzymatic reaction (2x tagatose-1-P kinase, 2x GalNAc deacetylase, 1x tagatose-1,6-PP 847 aldolase) was found somewhere else in the genome, allowing us to assume that all these 848 species are able to utilize GalNAc. Fifty additional MAGs contained the orthologues needed 849 to accomplish the five key steps in GalNAc utilization albeit without evidence for a similar 850 degree of clustering (either because the genes are indeed not clustered in the corresponding 851 genomes or because they were segregated across distinct sequence contigs). It is reasonable 852 to assume that several of those bacteria are also able to utilize GalNAc as carbon source. Why 853 then would the chromosome 1 mQTL only affect a small subset of Erysipelotrichaceae species? 854 We first reasoned that OTU476-like strains might be more dependent on GalNAc availability 855 than other species, for instance because they can't utilize alternative, common 856 monosaccharides as carbon source. To test this hypothesis, we searched for KEGG numbers 857 that would commonly occur in other genomes, yet were absent in the OTU476-like strains. 858 We performed this analysis for all MAGs, as well as separately for the MAGs that were 859 predicted to be able to use GalNAc (cfr. above). There was no convincing evidence that 860 By regulating expression of their GalNAc operon in response to ambient GalNAc availability, 894 species like E. Coli may fair equally well in the gut of AA/AO as in that of OO pigs, hence not 895 be affected by the mQTL. It is worth noting that the A allele is dominant with regards to 896 OTU476, OTU327 and p-75-a5 abundance (Fig. 5B) , suggesting that the additional increase in 897 GalNAc concentrations in AA (vs AO) animals does not further benefit these taxa. 898 We examined the effect of ABO blood group on the abundance of ~75 OTUs assigned to 899 Erysipelotrichaceae in human gut samples. Although we could not rigorously test this for all 900 OTUs (as some human V1-V2 and V5-V6 data could not directly be compared with porcine V3-901 V4 data) none of the OTUs detected in human samples were as closely related to the pig OTU-902 476, OTU-327 or p75.a5 as these were to each other. We found no evidence for an effect of 903 ABO blood group on the abundance of any of these OTUs. What underlies the difference 904 between pigs and humans is unclear. Either strains susceptible to ABO genotype are not 905 present at sufficient frequency in human feces, or the carbohydrate composition of human intestinal content makes these strains less sensitive to variations in GalNAc concentrations. 907 It is worth noting that the studied human samples were intestinal biopsies collected after a 908 standard gut cleansing procedure. The abundance of the genus p-75-a5 was recently found 909 to differ significantly between African subsistence categories and to be highest in pastoralists 910 similarity (from SNP genotype data) and microbiota dissimilarity (Bray Curtis distance 918 computed from 16S rRNA data) both within litter as well as between generations, supporting 919 an effect of host genetics and intestinal microbiota composition (Fig. 3A&B) . We took care in 920 these analyses to mitigate effects of litter on both genetic and microbiota distance metrics, 921 (as these may inflate statistical significance) by applying permutations tests. Regressing 922 squared phenotypic difference on genetic distance is a standard way to estimate local or 923 global heritability (Haseman & Elston, 1972; Visscher et al., 2006) . It can be shown that 924 −̂/(2̂2) estimates the narrow sense heritability ĥ 2 . In these, ̂ is the least square 925 regression coefficient and ̂2 an estimate of the phenotypic variance. In our analyses, and (microbiota composition) was deduced from the coincidence between the red and green lines. 947 As an example, the heritability of microbiome composition for data series F7-D120 was 948 assumed to be close to 0.8. We proceeded in the same way for the across generation analysis 949 (Suppl. Fig. 7C ). We used the actual measures of kinship across the F6 and F7 generations 950 computed with GEMMA. We standardized them (mean 0 and SD 1), and then scaled them 951 such that the values for an individual with itself would center on 1, and for full-sibs on 0.5. 952 Breeding values and environmental effects were sampled using mvrnorm and rnorm as above. Both models included sex, slaughter batch (21 for F6, 23 for F7) and the three first genomic 1155 principal components as fixed covariates. GWAS were conducted separately for each taxon x 1156 data series combination and p-values concomitantly adjusted for residual stratification by 1157 genomic control. P-values were combined across traits and/or taxa using a z-score. P-values 1158 were converted to signed z-values using the inverse of the standard normal distribution and 1159 summed to give a "z-score". Z-scores were initially calculated using METAL (Willer et al., correlation that exists between the phenotypic values of a given cohort across traits we also computed the genome-wide (i.e. across all tested SNPs) average (Z ̅ ) and standard deviation (σ Z ) specific annealing temperature for each set of primers and 72°C for 2 min), and 72°C for 10 1194 minutes on a PE 9700 thermal cycler (Applied Biosystem, USA). bins (metagenomic assembly genomes, MAGs) generated by the two binning algorithms were 1319 evaluated for quality and combined to form a MAG set using the bin_refinement module in estimate the completeness and contamination of each MAG (Parks et al., 2015) . The MAGs with completeness <50% and contamination >5% were filtered out. Non-redundant MAGs were then clustered to 97% identity level OTUs using the DNACLUST program (Ghodsi et al., from cecum samples, and contributed to ABO genotyping. YZ participated in the preparation Jing Li for their preparation of reagents and management of samples. We are also 1493 grateful to Yukihide Momozawa, Rob Mariman, Myriam Mni, Latifa Karim and Manon Dekkers 1494 for generating the CEDAR-1 16S rRNA data. Lusheng Huang is supported by The National 1495 We thank the long-term projects support from Jiangxi department for education, 1497 the Ministry of Science and Technology of P.P. China, the Ministry of Agriculture and Rural 1498 Affairs of P. R. China, and Jiangxi department of Science and Technology for the swine 1499 heterogeneous stock. Congying Chen was supported by the National Natural Science 1500 Foundation of China (31772579) Lev Shagam is supported by the FNRS IBD-GI-Seq 1502 project to Souad Rahmouni Miquant" project. Carole Charlier is Senior Research Associate 1504 from the FNRS Adaptation and possible ancient interspecies introgression in pigs identified by 1508 whole-genome sequencing Microbiome evolution during hots aging Basic alocal alignment search tool GenABEL: an R library for genome-wide 1514 association analysis Reproducible, interactive, scalable and extensible microbiome data science 1524 using QIIME2 The effect of host genetics on the gut microbiome Attachment of Helicobacter pylori to human gastric epithelium mediated by 1528 blood group antigens An expanded view of complex traits: from polygenic to 1530 omnigenic An ordination of uplnad forest communities of southern Wisconsin Pathways for the utilization of N-acetyl-1534 galactosamine and galactosamine in Escherichia coli Rapid and accurate haplotype phasing and missing-data 1536 inference for whole-genome association studies by use of localized haplotype clustering DADA2: 1539 high-resolution sample inference from Illumina amplicon data ABO blood groups and clinical forms of 1542 schistosomiasis mansoni Second-generation PLINK: rising to the challenge of larger fastp: an ultra-fast all-in-one FASTQ preprocessor ABO blood group and susceptibility to severe acute respiratory syndrome Determination of complete sequence information of the human ABO blood group 1558 orthologous gene in pigs and breed differences in blood type frequencies Blood groups in infection and host susceptibility SNP-based deconvolution of biological mixtures: 1563 application to the detection of cows with subclinical mastitis by whole genome sequencing of 1564 tank milk ABO antigen and 1566 secretor statuses are not associated with gut microbiota composition in 1,500 twins A framework for variation discovery and genotyping 1571 using next-generation DNA sequencing data Search and clustering orders of magnitude faster than BLAST The ABO blood group locus and a chromosome 3 gene cluster associate 1583 with SARS-CoV-2 respiratory failure in an Italian-Spanish genome-wide association analysis Introduction to quantitative genetics Evidence of long-term gene flow and selection during domestication from 1588 analyses of Eurasian wild and domestic pig genomes Harnessing genomic information for livestock improvement Inferring the history of speciation in house mice from autosomal Y-linked and mitochondrial genes DNACLUST: accurate and efficient clsutering of phylogenetic 1594 marker genes Human genetics shape 1597 the gut microbiome Genetic Determinants of the Gut Microbiome in UK Twins The investigation of linkage between a quantitative trait and a 1608 marker locus Genome-wide associations of human gut 1611 microbiome variation and implications for causal inference analyses Circlator: automated circularization of genome assemblies using long 1614 sequencing reads Genetic analysis of the roles of agaA, agaI, and agaS genes 1616 in the N-acetyl-D-galactosamine and D-galactosamine catabolic pathways in Escherichia coli 1617 strains O157:H7 and C Fine-mapping inflammatory bowel disease loci to single-variant resolution Prodigal: prokaryotic gene recognition and translation initiation site 1621 identification A 660-Kb deletion with antagonistic effects on fertility and milk production 1623 segregates at high frequency in Nordic red cattle: additional evidence for the common 1624 occurrence of balancing selection in livestock BlastKOALA and GhostKOALA: KEGG tools for functional 1626 characterization of genome and metagenome sequences Sequence and expression of a 1628 candidate for the human Secretor blood group (1,2) fucosyltransferase gene (FUT2) Our gut microbiome: the evolving inner self Selfish operons: the evolutionary impact of gene clustering in prokrayotes and 1640 eukaryotes N-acetylgalactosamine utilization pathway and 1642 regulon in proteobacteria The sequence alignment/map format and SAMtools Fast and accurate long-read alignment with Burrows-Wheeler transform Aligning sequence reads, clone sequences and assembly contigs with MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced 1651 methodologies and community practices Minimap2: pairwise alignment for nucleotide sequences featureCounts: an efficient general-purpose program for 1655 assigning sequence reads to genomic features Association between the ABO blood group and 1665 the human intestinal microbiota composition Taxonomic identification of commensal bacteria 1667 associated with the mucosa and digesta throughout the gastrointestinal tracts of preweaned 1668 calves IBD risk loci are enriched in multigenic regulatory modules encompassing 1670 putative causative genes Schistosomiasis infection in 1672 relation to the ABO blood groups among school children in Zimbabwe F-statistics and analysis of gene diversity in subdivided populations The role of the gut microbiome in cattle 1677 production and health: driver or passenger? dRep: a tool for fast and accurate genomic 1679 comparisons that enables improved genome recovery from metagenomes through de-1680 replication CheckM: assessing 1682 the quality of microbial genomes recovered from isolates, single cells Genetic evidence for complex speciation of humans and chimpanzees The SILVA ribosomal RNA gene database project: improved data processing and web-based 1695 tools Diversity, compositional and functional 1699 differences between gut microbiota of children and adults Comparative genomic analysis of the human gut microbiome reveals 1701 a broad distribution of metabolic pathways for the degradation of host-synthesized mucin 1702 glycans and utilization of mucin-derived monosaccharides Integrating mapping-, assembly-and haplotype-based approaches for calling 1704 variants in clinical sequencing applications Genomic encyclopedia of sugar utiulization pathways in the Shewanella genus VSEARCH: a versatile open source tool 1710 for metagenomics Metagenomic predictions: from 1712 microbiome to complex health and environmental phenotypes in human and cattle The genomic landscape of Neanderthal ancestry in present-day humans High-resolution QTL mapping with diversity outbred mice identifies Introducing mothur: open-source, platform-independent, community-1727 supported software for describing and comparing microbial communities. Applied and 1728 environmental microbiology 75 The human gut microbiome: from association to modulation PhyloPhlAn is a new method for 1732 improved phylogenetic and taxonomic placement of microbes The ABO blood group is a trans-species 1735 poilymorphism in primates Genomes of the mouse collaborative cross The 1000 Genomes project Consortium. A map of human genome variation from population-1738 scale sequencing Association of host genome with intestinal microbial composition in a large 1740 healthy cohort Pilon: an integrated tool for comprehensive microbial variant detection and 1751 genome assembly improvement Establishing or exaggerating causality for 1753 the gut microbiome: lessons from human microbiota-associated rodents Naive Bayesian classifier for rapid 1756 assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental 1757 microbiology 73 Prediction of residual feed intake from genome 1759 and metagenome profiles in first lactation Holstein-Friesian dairy cattle Design of glycosyl 1762 transferase inhibitors: serine analogues as pyrophosphate surrogates? ChemPlusChem 80 Genome-wide association analysis identifies variation in vitamin D receptor 1765 and other host factors influencing the gut microbiota GCTA: a tool for genome-wide complex 1778 trait analysis Human gut microbiome viewed across age and geography Meta-analysis of genome-wide association studies for height and body mass 1782 index in ~700,000 individuals of European ancestry Global patterns of human DNA sequence variation in a 10-1785 kb region on chromosome 1 Tow novel regulators of N-acetyl-galactosamine utilization pathway and distinct 1788 roles in bacterial infections Genome-wide efficient mixed-model analysis for association studies Beagle (v.40) (Browning & Browning, 2007) . Genomic variants with minor allele frequencies 1037 (MAF) < 0.03 were removed. 1038 1039 Computing nucleotide diversities. Nucleotide diversities between pairs of breeds were 1040 computed from variant frequencies as follows: 1041where π i is the nucleotide diversity in window i, n i is the number of variants in window i, f ij A 1043 is the frequency of variant j of window i in breed A, f ij B is the frequency of variant j of window 1044 i in breed B, and w is the size of the windows in base pairs. The overall nucleotide diversity 1045 for a pair of breeds A and B was computed as the average of π i across all windows. The 1046 numbers reported are averages of overall nucleotide diversities for multiple pairs of breeds 1047 (within European, within Chinese, between European, between Chinese, between European 1048 and Chinese), computed for a window size of 1 million base pairs. 1049 1050 Estimating the contribution of the eight founder breeds in the F6 and F7 generation at 1051 genome and chromosome level. We estimated the proportion of the genome of the eight 1052 founder breeds in the F6 and F7 generation following Coppieters et al. (2020) . Assume that 1053 the total number of variants segregating in the mosaic population is n T . Each of these variants 1054 has a frequency in each one of the founder breeds which we denote f 1 0.1 → f n T 0.1 for breed 1, 1055 f 1 0.2 → f n T 0.2 for breed 2, etc … as well as a frequency in the F6 (or F7) generation which we 1056 refer to as f 1 6 → f n T 6 . We assume that there is a total of B breeds. We denote the proportion of 1057 the genome of breed 1 in generation F6 (or F7) as P 1 , of breed 2 in generation F6 (or F7) as P 2 , 1058 etc … We estimated the values of P 1 , P 2 , etc. … using a set of linear equations: 1059We used standard least square methods (lm function in R) to find the solutions of P j that 1067 minimize the residual sum of squares. This was done for the entire genome, as well as by 1068 with the unpermuted kinships. The empirical p-value was determined as the proportion of 1117 permutations that yielded a Spearman's correlation that was as low or lower than that obtained 1118 with the real data. Spearman's correlation coefficients were then adjusted to match the 1119 empirical p-values as follows. We generated "breeding values" for all animals used in