key: cord-0725068-qkaql8rd
authors: Lado, Sara; Elbers, Jean P.; Rogers, Mark F.; Melo-Ferreira, José; Yadamsuren, Adiya; Corander, Jukka; Horin, Petr; Burger, Pamela A.
title: Nucleotide diversity of functionally different groups of immune response genes in Old World camels based on newly annotated and reference-guided assemblies
date: 2020-09-03
journal: BMC Genomics
DOI: 10.1186/s12864-020-06990-4
sha: 63d7f24706fd8d8484651333840601aee6e1a0f9
doc_id: 725068
cord_uid: qkaql8rd

BACKGROUND: Immune-response (IR) genes have an important role in the defense against highly variable pathogens, and therefore, diversity in these genomic regions is essential for species’ survival and adaptation. Although current genome assemblies from Old World camelids are very useful for investigating genome-wide diversity, demography and population structure, they have inconsistencies and gaps that limit analyses at local genomic scales. Improved and more accurate genome assemblies and annotations are needed to study complex genomic regions like adaptive and innate IR genes. RESULTS: In this work, we improved the genome assemblies of the three Old World camel species – domestic dromedary and Bactrian camel, and the two-humped wild camel – via different computational methods. The newly annotated dromedary genome assembly CamDro3 served as reference to scaffold the NCBI RefSeq genomes of domestic Bactrian and wild camels. These upgraded assemblies were then used to assess nucleotide diversity of IR genes within and between species, and to compare the diversity found in immune genes and the rest of the genes in the genome. We detected differences in the nucleotide diversity among the three Old World camelid species and between IR gene groups, i.e., innate versus adaptive. Among the three species, domestic Bactrian camels showed the highest mean nucleotide diversity. Among the functionally different IR gene groups, the highest mean nucleotide diversity was observed in the major histocompatibility complex. CONCLUSIONS: The new camel genome assemblies were greatly improved in terms of contiguity and increased size with fewer scaffolds, which is of general value for the scientific community. This allowed us to perform in-depth studies on genetic diversity in immunity-related regions of the genome. Our results suggest that differences of diversity across classes of genes appear compatible with a combined role of population history and differential exposures to pathogens, and consequent different selective pressures.

Accurate genome assemblies provide an invaluable basis to assess genetic variation throughout the genome of species, to detect structural variants and to decipher complex genomic regions such as immune-response (IR) genes. Maintaining high genetic diversity in a population is important to reduce the spread of diseases, allowing rapid adequate immune responses and limiting, e.g., parasite evolution (see [1] ). Even though demographic changes in general may cause important loss of genetic diversity, and particularly during domestication, due to intensive selection and potential inbreeding in many genomic regions [2] , in other regions such as IR genes the genetic diversity can be conserved due to selective pressures of pathogens [3] .

Old World camels (Artiodactyla, Tylopoda, Camelidae, Camelini)the domesticated one-humped dromedaries (Camelus dromedarius) and two-humped Bactrian camels (Camelus bactrianus), as well as the critically endangered two-humped wild camels (Camelus ferus)are valuable species not only for their production traits (e.g., meat, milk or wool), but for their power (e.g., riding or packing). Moreover, they are ungulate species with unique adaptations to diverse and extreme environments. Consequently, as they are in contact with different pathogenic pressures on different environments, there is great interest in understanding the general diversity in the part of the genome encoding their immune system. Previous research on immunogenome diversity in Old World camels focused mainly on the MHC genes (e.g., [4] ), as due to its critical importance for individual survival, the MHC complex is the most intensively studied part of the vertebrate immunogenome [5] . MHC genes, however, account only for part of the genetic variability underlying resistance to infectious pathogens [6, 7] . A broader approach is required to capture the overall genetic diversity of the immune system and to understand its role in response to pathogens. On these grounds, high-quality genome assemblies are needed. Previous studies [8] [9] [10] [11] [12] developed high quality genome assemblies for the three Old World camel species. Although very useful for broad inferences of genome-wide diversity or demographic histories, an improved version of these assemblies is needed to allow more detailed studies of the diversity in parts of the genome, such as IR genes. Access to different computational methods allows overcoming previous genome assembliesĺ imitations.

In this work, we describe our computational efforts to generate improved Old World camelid genome assemblies, and we present versions CamDro3, CamBac2 and CamFer2, for dromedaries, Bactrian camels and wild camels, respectively. Our goal was not only to provide novel assemblies for genomic analysis in camels, but also to take advantage of the upgraded genome assemblies to assess the genetic diversity in different groups of immune genes, and compare them among species and to the rest of the intra-genic genomic diversity.

Despite the utility of the CamDro1 and CamDro2 assemblies, inconsistencies and gaps can limit analyses at various genomic scales. By using different bioinformatic methods, we were able to upgrade the available genome assemblies to CamDro3, which is more accurate, contiguous and show fewer scaffolds of increased size when compared to the previous ones. CamDro3 consistently had higher RNA-Seq read mapping rates than CamDro2, and these two assemblies had much higher mapping rates than the other assemblies (Supplemental Fig. 1 ). After CamDro3 and CamDro2, the assembly with the third highest mapping rates varied depending on the tissue and season analyzed, but B. taurus consistently had the lowest mapping rates. We were able to assign at least one super-scaffold to each of the 37 chromosomes except the Y chromosome as the dromedary used in Cam-Dro1, CamDro2, and CamDro3 was female. Chromosomes are denoted by numbers 1-36 and X in the CamDro3 assembly. There were 113,944,958 bases in scaffolds not assigned to chromosomes (5.25% of the 2,169,346,739 base assembly).

In the CamDro3 annotation, we predicted 22,917 genes that produced 34,135 proteins, and 7.4% (1705) of genes had no assigned annotation. These numbers are slightly higher than for the CamDro2 assembly for which we had predicted 22,534 genes that produced 34,024 proteins, and 7.7% (1730) of genes had no assigned annotation [11] . We assessed if predicted proteins were truncated due to uncorrected indels introduced by Pac-Bio reads by comparing the predicted protein length hit distribution of the CamDro1 assembly (Illumina only data, Fig. 1 , red line), which should lack such PacBio specific error, to that of the CamDro2 (Fig. 1 , green line) and CamDro3 assemblies (Fig. 1, blue line) . First, predicted proteins from the CamDro1 assembly had 21,257 protein hits against the UniProt/TrEMBL database, and 11,671 (55%) hits were between 0.85 and 1.15 (query sequence length/ subject sequence length; Fig. 1 ). Second, predicted proteins from the CamDro2 assembly had 32, 297 protein hits, and 17,341 (54%) were between 0.85 and 1.15 ( Fig. 1) . Third, predicted proteins for CamDro3 assembly had 32,427 protein hits, and 17,006 (52%) were between 0.85 and 1.15 ( Fig. 1 ). This suggests that Cam-Dro3 is similar to CamDro2 with respect to proportion of uncorrected PacBio indels, but the proportions of uncorrected indels are very low when compared to Cam-Dro1. AEDs were slightly higher in CamDro3 versus CamDro2 (Fig. 2) . For example, CamDro2 had AED values ≤0.5 for 78.4% transcripts versus 79.1% transcripts for CamDro3. Lower AED values indicate a better fit to the provided evidence when annotating a genome [15] .

We predicted 22,223 genes that produced 33,153 proteins in CamDro3 using a more up to date set of proteins during annotation. These values were lower than when annotating CamDro3 using the same cDNA transcripts and proteins used for annotating CamDro2 possibly because there were fewer false genes predicted in the up-to-date annotation of CamDro3. Further, 8.46% (1879) genes produced proteins did not match UniProt/ Swiss-Prot proteins. This value was higher than before, but we used UniProt/Swiss-Prot instead of the more comprehensive UniProt-TrEMBL protein database. The CamDro3 assembly and these annotations have been submitted to GenBank (GCA_000803125.3) and Dryadsee Data Accessibility Statement.

Improved Camelus bactrianus and Camelus ferus genomes via reference-guided assembly CamBac2 increased in size by 46,927,041 bases and had 1862 fewer scaffolds than CamBac1, and CamBac2's N50 was nearly 8 times larger (Table 1 ). The longest contig in CamBac2 was more than 7 times larger than before. We have also predicted 19,491 genes that produced 25, Fig. 1 Frequency polygons of query sequence length (predicted proteins) divided by subject (UniProt/TrEMBL) sequence length for DIAMOND [13] mapped MAKER [14] predicted proteins against UniProt/TrEMBL release 2018_07 database for: (red line) the original North African dromedary genome (CamDro1), ( [8] ; GenBank accession: GCA_000803125.1); (green line) the North African dromedary genome after adding~11x PacBio sequencing reads (CamDro2); and (blue line) CamDro3 95 proteins in CamBac2. Of these genes, 4.03% (786) did not match proteins from UniProt/Swiss-Prot. Camelus bactrianus had the second lowest mapping rates, after B. taurus. The CamBac2 assembly and these annotations have been submitted to Dryad -see Data Accessibility Statement.

CamFer2 was 77,064,279 bases larger and was organized into 4176 fewer scaffolds than CamFer1. Cam-Fer2 had an N50 that was nearly 35 times larger than CamFer1's N50 (Table 1 ). CamFer2's longest contig was more than 2 times larger than CamFer1's largest contig.

We predicted 19, 192 genes that produced 19,192 proteins in CamFer2. Of these genes, 3.69% (708) did not match proteins from UniProt/Swiss-Prot. There were many structural variations (inversions and repeats) when comparing the assembled chromosomes of CamFer2 and the C. ferus genome assembly from Ming et al., [12] (Supplemental Fig. 2) . Ultimately, these latter genomes have similar scaffold N/L50 values, but CamFer2 has much smaller contig N/L50 values because of more abundant and larger gaps in assembled chromosomes (Supplemental Table 1 ). The CamFer2 assembly and these annotations have been submitted to Dryad -see Data Accessibility Statement.

Mean coverage throughout the genomes of the three Old World camel species was not different among species (F 2,22 = 0.1871, P = 0.8307; Table 2 ). The mean total number of SNPs was different among species (F 2,22 = 64.943, P < 0.0001) as was the number of synonymous (F 2,22 = 66.99, P < 0.0001) and non-synonymous SNPs (F 2,22 = 113.25, P < 0.0001; Table 2 ). Mean total, synonymous, and non-synonymous SNPs were highest in Bactrian camels, followed by wild camels, then dromedaries. The mean number of insertions was different among species (F 2,22 = 31.269, P < 0.0001) as was the mean number of deletions (F 2,22 = 16.407, P < 0.0001; Table 2 ). Bactrian camels had a higher mean number of insertions than dromedaries and wild camels, which showed similar numbers of insertions. Bactrian camels had higher mean number of deletions, followed by wild camels, then dromedaries.

We assessed the heterozygosity rates in coding (exons) and noncoding (introns) regions, across multiple individuals. Heterozygosity means for all three species and coding/noncoding regions were all significantly different at the 0.05 level of significance. The results show that 

After improving the three Old World camel genome assemblies, we assessed the nucleotide diversity in immune response and intra-genic (within gene) regions. When looking at non-synonymous and synonymous SNPs and indels altogether, mean nucleotide diversity was found not to differ significantly for adaptive, innate IR genes and the rest-of-genome genes, but to be higher in MHC class I and II genes in both dromedaries and domestic Bactrian camels ( Fig. 3a ; Table 3 for mean values and 95% bootstrap confidence limits).

On the other hand, in wild camels, mean nucleotide diversity was not significantly different across gene types. When comparing nucleotide diversity per gene class in species pairs, mean MHC nucleotide diversity did not differ significantly for domestic Bactrian camels and dromedaries, as well as for wild camels and dromedaries, but differed between wild and domestic Bactrian camels, with the latter showing higher nucleotide mean diversity (Supplemental Fig. 3a ; Table 3 for mean values and 95% bootstrap confidence limits). Innate and adaptive IR gene nucleotide diversity was statistically different between domestic Bactrian camels and the other two species, but the same between dromedaries and wild camels, while again Bactrian camels had a higher mean nucleotide diversity. Rest-of-genome gene nucleotide diversity was also higher for the Bactrian camel and different between this and the other two camel species. On the other hand, when looking at only nonsynonymous SNPs, dromedaries' mean nucleotide diversity patterns were more difficult to interpret. Mean innate gene nucleotide diversity was lower than mean rest-of-genome gene nucleotide diversity, but mean innate gene nucleotide diversity was statistically not different from mean adaptive or MHC nucleotide diversity nor was mean rest-of-genome nucleotide diversity different from mean adaptive or MHC nucleotide diversity ( Fig. 3b ; Table 3 for mean values and 95% bootstrap confidence limits). In domestic Bactrian camels, mean nucleotide diversity was the same for adaptive, innate and the rest-of-genome genes, but different in MHC genes where it was the highest. On the other hand, in wild camels, all gene groups had statistically the same mean nucleotide diversity. For both MHC and adaptive IR genes, mean nucleotide diversity was the same among the three camel species (Supplemental Fig. 3b ). For innate IR genes, Bactrian and wild camels had the same mean nucleotide diversities, whereas dromedaries had a different mean nucleotide diversity from the other camel species, but the same compared to wild camels. Finally, for the rest-of-genome genes group, all species had statistically different mean nucleotide diversities, where Table 1 Assembly statistics for the CamBac1 (GCF_000767855.1) and CamFer1 (GCF_000311805.1) and after improvement (CamBac2 and CamFer2, respectively) with reference-guided assembly with Ragout [16] using Progressive Cactus [17] alignments to CamDro3 then filling in gaps with GapFiller [18] Assembly CamBac1

CamBac2 domestic Bactrian camels showed to have the highest values. There were 46 identified single-domain heavy-chain immunoglobulin genes in the Camelus ferus assembly of Ming et al. [12] . Of those 46, annotations for 43 could be lifted over to CamDro3, 36 to CamBac2, and 39 to CamFer2, which mapped on chromosome 6 and on other scaffolds. Mean nucleotide diversity was not significantly different among dromedaries, domestic camels, or wild camels when using either alignments made with all SNPs and indels or only nonsynonymous SNPs (see Supplemental Table 2 and Supplemental Fig. 4 ).

Despite its functional importance, the immunogenome of camels has received only limited attention, with work focusing on cytogenetic mapping in alpaca [22] , the characteristics of single-domain heavy-chain antibodies [23] or specific mechanisms underlying the genetic diversity of T-cell receptors [24] [25] [26] . Dromedary and twohumped camels are important livestock species, well adapted to harsh conditions and resistant to devastating infections that threaten other livestock species in the same areas, like contagious pleuro-pneumonia [27] or foot-and-mouth disease in dromedaries [28] . Other infections have an important role in human health, such as the Middle East Respiratory Syndrome Coronavirus (MERS-CoV), for which dromedaries are potential reservoirs [29] . Variation in genetic diversity between innate and adaptive immunity genes is caused by differences in these gene groups' mechanisms. While innate immunity is less specific and more executive, its genes are subject to purifying rather than to positive/balancing selection, whereas adaptive immunity is more focused on specific recognition of highly diverse antigens and its variability is exposed to different selective pressures [30, 31] . In this study, we compared the diversity in different groups of immune response genes with those found in intra-genic regions among the three Old World camel species, aiming to better understand to which selection pressures Uppercase letters above upper 95% confidence limits indicate groups have different (non-matching letters) or not different (matching letters) means based on non-overlapping confidence intervals they might have been exposed. For this purpose, we first improved the three available Old World camelid genome assemblies.

We applied several computational techniques to improve previous assemblies. To begin with, we were able to greatly improve CamDro3 genome assembly from Cam-Dro2. Compared with the previous version, the number of predicted gene proteins in the CamDro3 were lower, possibly because there were fewer false genes predicted. After correcting mis-assemblies by re-scaffolding Cam-Dro2 and by using a different indel-polishing method, CamDro3 is now more complete, with fewer gaps and likely more accurate. Additionally, the reference-guided assembly process significantly improved the quality and contiguity of CamBac2 and CamFer2, as they are now more contiguous, and have fewer and longer scaffolds. By using a closely related genome to improve a draft assembly, it has a bigger impact on the final assembly, as well as the accuracy and completeness of a reference genome also contribute [32] . Although mean coverage throughout the genome was not different between species, mean total, synonymous, and non-synonymous SNPs, mean number of insertions and deletions were highest in domestic Bactrian camels compared to the other two species. These results might suggest that domestic Bactrian camels generally have higher genetic diversity than dromedaries and wild camels, as they might have experienced less severe demographic changes during domestication than dromedaries [33] and less recent population size reduction than the critically endangered wild camels [34] .

Old World camels are known to be resistant to serious infectious diseases that threaten other livestock species inhabiting the same geographical regions, although they may contract other poorly-studied diseases [35] . On the other hand, diseases of Camelidae are often difficult to deal with, having non-specific signs with a considerable economic impact [36] . Hence, as diversity in immune response gene regions may influence infectious disease susceptibility in populations, a better understanding of IR gene diversity will support camel breeding and sustainable management in countries of the Global South with large camel populations. As our data were not normally distributed and could not be transformed to approximate a normal distribution, we assessed differences in nucleotide diversity within species in different immune complexes of the genome by using a nonparametric bootstrapping method to estimate 95% confidence intervals of mean nucleotide diversity ( Fig. 3 and Supplemental Fig. 3 ). MHC class I and class II genes are amongst the most polymorphic genes studied in vertebrates [37] . Pathogen-mediated selection is widely held to be the major driving force in maintaining the high diversity at MHC loci [38] . In particular, the MHC diversity in populations is maintained by balancing selection [39] . According to the 95% confidence intervals derived from non-parametric bootstrap tests of mean nucleotide diversities, we observed that MHC (class I and II) genes had higher mean nucleotide diversity compared to all other gene groups, for two-humped camels, in both SNPs-indels and just non-synonymous SNPs analyses, and for dromedaries in SNP-indels analysis but not for only non-synonymous SNP analysis (Fig. 3) . Previous research by Plasil et al., [4] showed that MHC nucleotide diversity within the three Old World species was generally low. In this case, the authors looked specifically into the antigen-binding sites and not to the complete genes where, according to our results, additional diversity appears to be present. The functional importance of this variation is currently unknown. However, it is important to acknowledge how particular pathogens affect immune genetic diversity and, vice versa, how genetic variation influences adaptation to emerging zoonosis, habitat fragmentation, and climate change [40] . MHC genes play an important role in the adaptive branch of the immune system and have been used extensively to estimate levels of adaptive genetic variation [41] . While innate immunity is an efficient first protection against many pathogens but rather less specific, adaptive (or acquired) immunity is a highly specific immune response, and its variability is subject to different selective pressures [30, 31] . Overall, mean nucleotide diversity was never different when comparing innate and adaptive IR gene groups in all three species, in both SNPs-indels and non-synonymous SNPs analyses.

When comparing nucleotide diversity among both two-humped camel species, wild camels had lower mean nucleotide diversity for both SNP and indels and nonsynonymous SNP analyses, except for the MHC class I and II genes and for adaptive genes with nonsynonymous SNPs (Supplemental Fig. 3) . Moreover, in general, the domestic Bactrian camel had higher mean nucleotide diversity compared to the wild camel, except for the mean nucleotide diversity in adaptive genes with non-synonymous SNPs. One possible explanation for these results is that the wild camel suffered strong population declines leading to the current status of "critically endangered" species (by the International Union for Conservation of Nature (IUCN)). Thus, with the number of individuals decreasing, loss of genetic diversity is unfortunately real [42, 43] . Another possible explanation is that domestic Bactrian camels are under higher pathogenic pressure compared to the wild species. For example, Bactrian camels can be raised and herded with other domestic species (e.g., sheep or goat and sometimes cattle) and due to this fact, the animals are in contact with different pathogens that would not be present in the wild camels' natural habitat [44] . This pathogenic pressure might have selected for higher diversity in domestic Bactrian camels, explaining the higher diversity in the immunogenome as well as in the rest of the genome. Nevertheless, we cannot discard the possibility that the demographic dynamics influenced the mean nucleotide diversity levels compared among species. Patterns of demographic changes across all three species demonstrated widespread population declines during the Pleistocene [21] . Principally in dromedaries, according to Lado et al. [45] and Fitak et al. [21] , long-term population bottlenecks were detected, which probably reduced the nucleotide diversity even more in this species. Furthermore, there is the assumption that dromedaries have been domesticated from a relatively small population of wild one-humped camels, which already have been declining in numbers in a limited geographical area at the Southeast coast of the Arabian Peninsula [33] . However, the domestication of Bactrian camels might have occurred over a much larger geographic region, involving (genetically) more distant and diverse wild two-humped camels [12] . Our results suggest that the IR genes follow the same pattern of rest-of-the-genome genes where domestic Bactrian camels are more diverse throughout all classes of genes when compared to the endangered wild camel.

We also assessed the nucleotide diversity of singledomain heavy-chain immunoglobulin genes in our data. For that, we lifted the 46 heavy-chain immunoglobulin gene annotations from the Ming et al. 2020 [12] Camelus ferus genome assembly over to CamFer2, CamDro3, and CamBac2. However, we could not detect all 46 gene annotations on chromosome 6 and on other scaffolds as compared to Camelus ferus [12] . We were only able to recover 39 genes for CamFer2, 43 for CamDro3, and 36 for CamBac2. These lower numbers might be due to assembly quality as the contig (not scaffold) lengths are much longer in the Ming et al. [12] Camelus ferus assembly than in CamDro3, CamBac2, or CamFer2. Moreover, mean nucleotide diversity among dromedaries, domestic camels, and wild camels were not significantly different when using either alignments made with all SNPs and indels or only non-synonymous SNPs. In Ming et al. [12] , the authors also compared the heavychain locus on chromosome 6 between the wild camel and alpaca (Vicugna pacos), and found that the gene content and order were very similar between the species. Interestingly, the alpaca, one of the four New World camel species, is evolutionarily the most closely related species to the Old World camels. Only recently, the most up-to-date chromosome-level reference genome assembly was released as VicPac3.1 [46] . Latest research shows that the genomic sequences of Natural Killer cell Receptor (NKR) genes were highly similar in both dromedary and domestic camel to alpaca sequences, as well as the organization of this genomic region [25] . Furthermore, high sequence similarity was observed for genes in the three different classes of MHC as well as MHC genes organization [46, 47] .

In this study, using different computational methods, we were able to improve genomic resources for Camelus dromedarius, C. bactrianus and C. ferus. Our data provides high-quality genome assemblies, which are now more contiguous and have fewer and longer scaffolds than the previous version, and are promising resources for the scientific community. Moreover, our results give new insights into the differences in mean nucleotide diversity in immune response genes within and among the three Old World camel species. From the three species, domestic Bactrian camels had the highest mean nucleotide diversity, and from the different functional gene groups, MHC genes had the highest mean diversity. Examining genetic variation in diverse immune genes in camels should be a priority, not only because camels are well adapted to extreme environments even in contact with different pathogens, but also because both domestic species are economically very important, and the wild two-humped camel is critically endangered. The data also showed that studies focused on functionally important parts of the genes, combined with analyses of selection at the molecular and population level, will be helpful to improve the understanding of the biology and evolution of these species. Altogether, this work not only opens doors for future immunogenome studies, but also serves as a reference to further genome assembly improvements using computational methods.

The original North African dromedary genome assembly (CamDro1) was created from a female dromedary "Waris" ( [8] ; GenBank accession: GCA_000803125.1). Briefly, two types of Illumina libraries were generated and sequenced: 500 bp (short-insert, 100 bp paired-end reads) and 5 Kbp (long-insert/mate-pair, 50 bp pairedend reads) libraries. Short-and long-insert reads were trimmed and, after short-insert reads error-correction, de novo assembled with ABYSS [48] with a k-mer value of 64.

Dovetail Genomics (Santa Cruz, California, USA) created and sequenced Chicago and Dovetail Hi-C libraries derived from the same dromedary "Waris" used in Cam-Dro1. First, the CamDro1 assembly was scaffolded using Dovetail Chicago data run through the HiRise pipeline [49] . Next, the Chicago assembly was scaffolded with Hi-C data. Using a PacBio Sequel sequencer, 11x longread coverage were generated ( [11] ; Sequence Read Archive (SRA) accession: SRP050586) and PBJelly [50] was used to fill in gaps in the Hi-C assembly. PBJelly assembly was polished with Pilon [51] employing the same trimmed and error-corrected Illumina short-insert sequences used for the de novo assembly of CamDro1 by Fitak et al. ([8] ; SRA accession: SRR2002493). Gaps present in the Pilon assembly were then filled with ABYSS Sealer [52] . Finally, the ABYSS assembly was polished with Pilon once again. This assembly is referred to as CamDro2 ( [11] ; GCA_000803125.2).

The CamDro2 assembly was re-scaffolded using the original Dovetail Chicago and Hi-C reads with the HiRise pipeline. We then filled in gaps using our PacBio longreads ( [11] ; SRA accession: SRP050586), running PBJelly v. 15.8.24 twice. Instead of polishing the assembly with Pilon, we used a standard variant calling workflow, which increased RNA-Seq reads mapping rates relative to the Pilon-polished assembly (Table 4 ). Briefly, we first mapped trimmed and error-corrected Illumina shortinsert sequences ( [8] ; Sequence Read Archive accession: SRR2002493) using BBMap v. 38.12 (https://sourceforge. net/projects/bbmap/) with the vslow and usejni settings to the PBJelly assembly. We then sorted and indexed the resulting BAM file with Sambamba v. 0.6.7 [55] and called variants with CallVariants v. 38.12 (https://sourceforge.net/projects/bbmap/). We finally used BCFtools v. 1.2 (http://samtools.github.io/bcftools/) to generate a consensus sequence for which we filled in gaps using ABySS Sealer v. 2.1.0 [52] using default settings except for a bloom filter size of 40 GB and multiple K values from 90 to 20 in increments of 10. We refer to this as the CamDro3 assembly (GCA_000803125.3).

To assess the quality of the new assembly, we aligned 10 sets of paired-end RNA-Seq reads (Alim et al., 2019) to the original assembly (CamDro1), to CamDro2, the new assembly (CamDro3), and to several controls: C. dromedarius (RefSeq version -GCA_000767585.1), C. bactrianus (GCA_ 000767855.1), C. ferus (GCA_000311805.2) and Bos taurus (cattle) (GCA_000003055.3). The 10 RNA-Seq datasets were part of a 2 × 2 factorial experiment: summer vs. winter seasons and supraoptic nucleus (SON) vs. neurointermediate lobe (NIL) brain tissues, with n = 3 replicates in each class. Tissue was homogenized and extracted usingTrizol/ chloroform (ThermoFisher), and purified with the RNeasy MiniKit (Qiagen). The library template was prepared using a ribosome depletion protocol (Ribo-Zero Gold; Illumina) and libraries prepared using TruSeq Stranded protocol (Illumina). Samples were multiplexed into lane pools with an 8pM concentration and sequenced (100 bp paired-end reads with an average 134 bp insert size) to a depth of > 35 million reads using an Illumina HiSeq 2500. Two of the 12 replicates were rejected for insufficient quality. We used Tophat v. 2.0.9 [56] with default settings to align reads to each genome and report overall alignment rate (default output of Tophat) within each class. For chromosome mapping we then used blastn v. 2.2.31+ [57] to map 4981 probe sequences assigned to Vicuna (Lama) pacos chromosomes [11, 22] to CamDro3 assembly scaffolds. We followed the same procedure as Elbers et al., [11] .

To compare CamDro2 and CamDro3 assemblies, we annotated CamDro3 following the same steps used to [63] . We only retained genes, transcripts, and proteins with AED ≤ 0.50. Next, we predicted putative gene function with DIAMOND v. 0.9.19 [13] searches against the UniProt/TrEMBL release 2018_07 database using an e-value cutoff of 1e − 6. For the CamDro1, CamDro2, and CamDro3 assemblies, we also mapped proteins predicted by MAKER against the same UniProt/TrEMBL database using DIAMOND and generated a frequency polygon of the query sequence length (predicted proteins) divided by the subject sequence length (UniProt/TrEMBL proteins) to assess if predicted proteins were truncated (query sequence length divided by the subject sequence length < 1.0) due to uncorrected insertions/deletions (indels) introduced by PacBio reads that might interrupt reading frames affecting protein translation [64] .

Reference-guided assembly of the domestic Bactrian and wild camel genomes

We used CamDro3 in a reference-guided assembly strategy implemented with Ragout v. 2.0 [16] to upgrade the C. bactrianus (CamBac1, GCF_000767855.1, [10] ) and C.

ferus (CamFer1, GCF_000311805.1, [9] ) genome assemblies to chromosome-level scale. Briefly, we used default settings in Progressive Cactus v. Github commit c4bed56c0cd48d23411038acb9c19bcae054837e [17, 65] to generate HAL (hierarchical alignment format) alignments between CamDro3 and CamBac1 or CamDro3 and CamFer1, and then used Ragout with the "refine" and "small synteny block" settings to convert the alignments to FASTA, upgrading the CamBac1 and CamFer1 assemblies to CamBac2 and CamFer2, respectively. [12] using a genome size of 2.1 Gbp. To assess the level of disagreement between CamFer2 and C. ferus genome assembly from Ming et al. [12] , we made a whole genome alignment with Minimap2 v. 2.17 [68] using the "asm5" preset. We then used D-GENIES [69] to generate a dot plot for the alignment by using the contig sorting function and filtering alignments for strong precision. Chromosomal synteny between the wild camel and dromedary was analyzed by Ming et al. [12] after whole-genome alignment between C. ferus genome assembly (new-CamFer) and CamDro3, where assignment of the chromosome nomenclature between these species was similar, with only few structural differences at the megabase (Mbp) scale. Synteny is likely highly conserved between wild camel and dromedary, and domestic Bactrian and dromedary.

Most up to date annotation for CamBac2, CamFer2, CamDro3

To get the most up to date annotation for CamBac2, CamFer2, and CamDro3, we annotated scaffolds greater than 10 [54] ). We performed adapter and quality trimming on raw demultiplexed paired-end reads using BBDuk v. 37.25, using the following settings: ktrim = r, k = 23, mink = 11, hdist = 1, tpe, tbo, qtrim = rl, trimq = 15. We then mapped quality and adapter trimmed reads to the CamDro3 assembly using HiSat v. 2.1.0 using a maximum intron length of 100,000 and the "dta" option. Reads were assembled into transcripts using StringTie v. 1.3.3b and extracted using Gffread v. 0.9.9. For CamBac2 ESTs but CamDro3 and CamFer2 alternative ESTs, we processed transcriptome reads from C. bactrianus (SRA accessions: SRP014573 and SRP148535) with HiSat, StringTie, and Gffread as before but mapped quality controlled reads to the CamBac2 assembly. For proteins, we combined predicted proteins from B. taurus, C. bactrianus, C. dromedarius, C. ferus, and V. pacos (GenBank accessions (NCBI annotation release): GCF_002263795.1 (106), GCF_000767585.1 (100), GCF_000767855.1 (100), GCF_000311805.1 (101), and GCF_000164845.2 (101), respectively). We trained Augustus v. 3.3.2 using BUSCO v. 3.0.2 searching for Eukaroyota OrthoDB v. 9.1 genes in CamDro3, Cam-Bac2, and CamFer2. We used a C. dromedarius, C. bactrianus, or C. ferus specific repeat library created with RepeatModeler open-1.0.10 with the CamDro3, Cam-Bac2, or CamFer2 assemblies as input, respectively. We filtered each repeat library from RepeatModeler to remove known UniProt/Swiss-Prot release 2018_11 proteins using ProtExcluder v. 1.1. We only retained genes, transcripts, and proteins with AED ≤ 0.50. Next, we predicted putative gene function with blastp v. 2.2.31+ [57] searches against the UniProt/Swiss-Prot release 2018_11 database using an e-value cutoff of 1e − 6.

From whole-genome sequencing reads (100-bp Illumina paired end reads) of 25 Old World camels [21] , we removed adapter sequences and reads with > 10% uncalled bases and/or > 50% of bases with a Phred-scaled quality score < 4. We also trimmed reads with PoPoolation v. 1.2.2 [70] , where low-quality bases with a Phred score below 20 at the ends of the reads were removed. We converted base quality scores from Phred 64 to Phred 33 encoding and performed quality trimming with Repair v. 38.39 (https://sourceforge.net/projects/bbmap/) using the qtrim = rl and trimq = 15 settings. We mapped quality and adapter trimmed paired-end reads for C. bactrianus, C. dromedarius, and C. ferus individuals to the CamBac2, CamDro3, and CamFer2 references, respectively with BWA-MEM v. 0.7.17 [71, 72] . We converted SAM files to BAM files with SAMtools v. 1.9 [73] , then cleaned, sorted, added read groups, and marked duplicates with Picard v. 2.18.10 (http://broadinstitute.github. io/picard). We called variants for each species separately with CallVariants v. 38.39 (https://sourceforge.net/projects/bbmap/), keeping only SNPs and indels with quality scores greater than or equal to 27. We predicted what SNP alleles might be synonymous or nonsynonymous using snpEff v 4.0e [74] .

We calculated coverage metrics with mosdepth v. 0.2.6 [75] with the settings "-n --fast-mode and --by 500". We used R v. 3.6.0 to test for differences in mean coverage, total number of SNPs, number of synonymous SNPs, number of non-synonymous SNPs, number of insertions, and number of deletions within species with the "lm" and "anova" base functions. For all models, we used a Benajimini-Hochberg post-hoc test [76] implemented in glht and summary functions in the R package multcomp v. 1.4-10 [77] .

We predicted intron regions for gene annotations of CamDro3, CamBac2, and CamFer2 using Genome Tools v. 1.5.8 [78] with the gff3 function and -addintrons -retainids options. We then generated gene annotation files of only exons or introns for each camel species. We filtered the VCF files for each individual to retain only heterozygous SNPs. We then used BEDTools intersect v. 2.29.0 [79] to count the number of heterozygous SNPs for each individual (n = 25) in the exons or introns across the genome. We estimated heterozygosity as the number of heterozygous SNPs in the exons or introns of a given gene for a given individual divided by gene length.

We used the lm function in R 3.6.3 using heterozygosity as the dependent variable and the interaction of species and whether heterozygosity was estimated from exons or introns (hereafter exons or introns) as the independent variable. Residuals needed to be log10 transformed to be normally distributed. We used a generalized least squares variation of ANOVA (hereafter ANOVA [80] ) as our transformed data did not have homogeneous variance. To control for heterogeneous variance, we used weights as "varIdent=(1|interaction of species, and exons or introns)" implemented with the gls function in the R package nlme v. 3.1-147 [81] . We used a Benajimini-Hochberg post-hoc test as before implemented with the glht and summary functions in the R package multcomp v. 1.4-13 and the cld function in multcomp with the options level = 0.05 and decreasing = T to determine if means for all species for exons and introns were significantly different at the 0.05 level.

Two comparisons of nucleotide diversity were made, (i) between functionally different gene groups within each species: innate immune response genes, adaptive immune response genes, MHC class I and II genes, and rest-of-genome genes, and (ii) between Old World camel species: domesticated dromedaries and Bactrian camels, and wild camels among gene groups.

To test for differences in genetic variation among functionally different gene groups, we performed blastp searches of CamBac2, CamFer2, and CamDro3 predicted proteins against UniProt/Swiss-Prot release_2018_11 proteins to assign gene ontology terms, and filtered these gene/GO term lists by the gene ontology terms "innate immune response" and "adaptive immune response" using the rGO2TR package [82] . For MHC class I and class II genes, we filtered the GFF3 (General Feature Format) files of gene annotations manually. For the restof-genome gene group, we examined genes that were not assigned to either the innate or adaptive immune response gene groups. We used BCFtools v. 1.9 to generate a consensus sequence with IUPAC codes for each individual against its respective reference genome for each gene being analyzed and made a multiple sequence alignment for each gene and species with FSA v. 1.15.9 [83] with MuMmer v. 4.0.0beta2 [84] for long alignments. Finally, we calculated nucleotide diversity for entire gene sequence multiple sequence alignments (each species separately) using the R package Pegas's "nuc.div" function [85] . We used R v. 3.6.3 to test for differences in mean nucleotide diversity within species among gene groups. For this we compared the 95% confidence intervals of the mean estimated with the boot.ci function's "basic" confidence interval method based on 1000 "ordinary" simulations (i.e., non-parametric bootstraps) implemented with the boot function from the R package boot v. 1.3-24 [86] . We chose to use non-parametric inference as the residuals could not be transformed to approximate a normal distribution, precluding the use of traditional ANOVA/linear model testing for differences in means.

For analyzing differences in mean nucleotide diversities within gene groups but among species, we used the same procedures as before but with the explanatory variable "species" (dromedary, domestic Bactrian camel, or wild camel) and response variable "nucleotide diversity" (adaptive, innate, MHC, or rest-of-genome genes). In addition to nucleotide diversity, estimated with gene consensus sequences made with non-synonymous and synonymous SNPs and indels, we also repeated all steps above using only non-synonymous SNPs (indels and synonymous SNPs were not included).

Interestingly, camelids (New World and Old World camels) produce homodimeric heavy-chain immunoglobulins (hcIGs [87] ;) without a light chain and with the antigen-binding fragment reduced to a single heavy-chain variable domain (VHH), in addition to the conventional antibodies [88] . To assess the nucleotide diversity of single-domain heavy-chain IG genes in our data, we first downloaded the scaffold.fasta.gz (Ming et al.'s [12] Camelus ferus genome assembly) and IGH.gff (heavy-chain immunoglobulin gene annotations) from https://figshare.com/articles/Data_from_Chromosome-level_assembly_of_wild_Bac-trian_camel_genome_reveals_organization_of_immune_ gene_loci/11297489. We then lifted over the Camelus ferus IGH.gff gene annotations assembly [12] to CamDro3, Cam-Bac2, and CamFer2 using Liftoff Github commit #77b7c4c91b294737d18d7a76e3611d279bebea6e [89] . We repeated previous nucleotide diversity assessment steps as described above (see Nucleotide diversity) using the new lifted over annotations. As we could not transform data to have residuals with a normal distribution, we followed analysis steps as before, except that we used R v. 3.6.3 along with the R package boot v. 1.3-25 [86] , and compared mean nucleotide diversity in heavy-chain immunoglobulin genes among dromedaries, domestic camels, and wild camels.

The diversity-generating benefits of a prokaryotic adaptive immune system

Detection of selective sweeps in cattle using genome-wide SNP data

Pathogen pressure puts immune defense into perspective

The major histocompatibility complex in Old World camelids and low polymorphism of its class II genes

Major histocompatibility complex genomics and human disease

Quantification of the relative contribution of major histocompatibility complex (MHC) and non-MHC genes to human immune responses to foreign antigens

Is MHC enough for understanding wildlife immunogenetics?

The de novo genome assembly and annotation of a female domestic dromedary of north African origin

Genome sequences of wild and domestic bactrian camels

Camelid genomes reveal evolution and adaptation to desert environments

Improving Illumina assemblies with hi-C and long reads: an example with the north African dromedary

Chromosome-level assembly of wild Bactrian camel genome reveals organization of immune gene loci

Fast and sensitive protein alignment using DIAMOND

MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

A beginner's guide to eukaryotic genome annotation

Ragout -a reference-assisted assembly tool for bacterial genomes

Cactus graphs for genome comparisons

Toward almost closed genomes with GapFiller

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs

Genomic signatures of domestication in Old World

A comprehensive whole-genome integrated cytogenetic map for the alpaca (Lama pacos)

Camelid immunoglobulins and nanobody technology

Comprehensive genomic analysis of the dromedary T cell receptor gamma (TRG) locus and identification of a functional TRGC5 cassette

Natural killer cell receptor genes in camels: Another mammalian model

Generation of diversity by somatic mutation in the Camelus dromedarius T-cell receptor gamma variable domains

Review of infectious diseases of the camel

Foot and mouth disease and similar virus infections in camelids: a review

MERS coronavirus in dromedary camel herd, Saudi Arabia

Major histocompatibility complex diversity influences parasite resistance and innate immunity in sticklebacks

Toll-like receptors (TLRs) and their ligands

Assisted assembly: how to improve a de novo genome assembly by using related species

Ancient and modern DNA reveal dynamics of domestication and cross-continental dispersal of the dromedary

Camels in Asia and North-Africa-interdisciplinary perspectives on their past and present significance

Observations on little known diseases of camels (Camelus dromedarius) in the horn of

Major histocompatibility complex class II polymorphisms in primates

MHC studies in nonmodel vertebrates: what have we learned about natural selection in 15 years?

In: Immunobiology: The Immune System in Health and Disease

Major histocompatibility complex (MHC) markers in conservation biology

Neutral genetic processes influence MHC evolution in threatened gopher tortoises (Gopherus polyphemus)

Genetic diversity and phylogeographic structure of Bactrian camels shown by mitochondrial sequence variations

Whole-genome sequencing of 128 camels across Asia reveals origin and migration of domestic Bactrian camels

Global spread of helminth parasites at the human-domestic animal-wildlife interface

Genome-wide diversity and global migration patterns in dromedaries follow ancient caravan routes

Chromosome-level alpaca reference genome VicPac3.1 improves genomic insight into the biology of new world camelids

The major histocompatibility complex of Old World camelids: class I and class I-related genes

ABySS: a parallel assembler for short read sequence data

Chromosome-scale shotgun assembly using an in vitro method for longrange linkage

Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology

Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement

ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter effect of bloom filter false positive rate

HISAT: a fast spliced aligner with low memory requirements Daehwan HHS public access

Seasonal adaptations of the hypothalamo-neurohypophyseal system of the dromedary camel

Sambamba: fast processing of NGS alignment formats

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Basic local alignment search tool

MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes

Gene identification in novel eukaryotic genomes by self-training algorithm

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

AUGUSTUS: ab initio prediction of alternative transcripts

UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view

MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations

Errors in long-read assemblies can critically affect protein prediction

Cactus: algorithms for genome multiple sequence alignment

Repbase update, a database of eukaryotic repetitive elements

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Minimap2: pairwise alignment for nucleotide sequences

Dot plot large genomes in an interactive, efficient and simple way

Popoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Fast and accurate short read alignment with burrowswheeler transform

The sequence alignment/map format and SAMtools

A program for annotating and predicting the effects of single nucleotide polymorphisms

Mosdepth: quick coverage calculation for genomes and exomes

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Simultaneous inference in general parametric models

Genome tools: a comprehensive software library for efficient processing of structured genome annotations

BEDTools: a flexible suite of utilities for comparing genomic features

Modern applied statistics with S

nlme: linear and nonlinear mixed effects models. R package version 3.1-111

GO2TR: a gene ontology-based workflow to generate target regions for target enrichment experiments

Fast statistical alignment

MUMmer4: A fast and versatile genome alignment system

Pegas: an R package for population genetics with an integratedmodular approach

Bootstrap R (S-Plus) Functions. R package version

The camel adaptive immune receptors repertoire as a singular example of structural and functional genomics

Naturally occurring antibodies devoid of light chains

Liftoff: an accurate gene annotation mapping tool

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

We thank the CSC -IT Center for Science, Finland, for generous computational resources. We are also very grateful to all camel owners for their agreements to use the collected data for scientific purposes and to the Wild Camel Protection Foundation for continuous support.

Supplementary information accompanies this paper at https://doi.org/10. 1186/s12864-020-06990-4.Additional file 1: Supplemental Table 1 Table 2 . Means with 95% bootstrap confidence limits (CL, see Methods) of nucleotide diversity for alignments made with non-synonymous and synonymous SNPs and indels and only non-synonymous SNPs in HC (heavy-chain) immunoglobulin genes in DC (domestic camel), DROM (dromedary), and WC (wild camel). [68] whole-genome alignment between CamFer2 and the Camelus ferus genome (new-CamFer) assembly from Ming et al., [12] . Contigs are sorted and matches are filtered using the strong precision setting in D-GENIES. 

CamDro3 is available from NCBI GenBank (GCA_000803125.3) and NCBI RefSeq (GCF_000803125.2). Our CamDro3/CamBac2/CamFer2 gene annotations, predicted mRNA and proteins, and assemblies for gene annotations are available from Dryad (https://doi.org/10.5061/dryad. qv9s4mwb3). Raw VCF files (snp and indel variants) for each camel are also available in the Dryad repository. Example scripts and code for analyses are available from the Dryad repository.Ethics approval and consent to participate Not applicable.

Not applicable.

The authors declare that they have no competing interests.