key: cord-0296461-m53wfdca authors: Li, Ming; Sun, Congjiao; Xu, Naiyi; Bian, Peipei; Tian, Xiaomeng; Wang, Xihong; Wang, Yuzhe; Jia, Xinzheng; Heller, Rasmus; Wang, Mingshan; Wang, Fei; Dai, Xuelei; Luo, Rongsong; Guo, Yingwei; Wang, Xiangnan; Yang, Peng; Zhang, Shunjin; Li, Xiaochang; Wen, Chaoliang; Lan, Fangren; Zonaed Siddiki, AMAM; Suwannapoom, Chatmongkon; Zhao, Xin; Nie, Qinghua; Hu, Xiaoxiang; Jiang, Yu; Yang, Ning title: De novo assembly of 20 chickens reveals the undetectable phenomenon for thousands of core genes on sub-telomeric regions date: 2021-11-05 journal: bioRxiv DOI: 10.1101/2021.11.05.467060 sha: ae054479d99917e4a28ff1898a1162e1be369fcb doc_id: 296461 cord_uid: m53wfdca The gene numbers and evolutionary rates of birds were assumed to be much lower than that of mammals, which in sharp contrast to the huge species number and morphological diversity of birds. It is very necessary to construct a complete avian genome and analyze its evolution.We constructed a chicken pan-genome from 20 de novo genome assemblies with high sequencing depth, newly identified 1,335 protein-coding genes and 3,011 long noncoding RNAs. The majority of these novel genes were detected across most individuals of the examined transcriptomes but were accidentally measured in each of the DNA sequencing data regardless of Illumina or PacBio technology. Furthermore, different from previous pan-genome models, most of these novel genes were overrepresented on chromosomal sub-telomeric regions, surrounded with extremely high proportions of tandem repeats, and strongly blocked DNA sequencing. These hidden genes were proved to be shared by all chicken genomes, included many housekeeping genes, and enriched in immune pathways. Comparative genomics revealed the novel genes had three-fold elevated substitution rates than known ones, updating the evolutionary rates of birds. Our study provides a framework for constructing a better chicken genome, which will contribute towards the understanding of avian evolution and improvement of poultry breeding. 1c). We merged the novel sequences from all 20 assemblies and built a pan-genome of chicken. 122 The pan-genome contains 158.98 Mb of non-redundant novel sequence,which obtained from 123 45,715 sequences with an average length of 3,478 bp. (Table 1, Supplementary Fig. 7 , and 124 Supplementary Table 5 ). The chicken pan-genome expanded the size of GRCg6a by 14.92% 125 which is the highest percentage among the published vertebrate pan-genomes. We found that the distribution of the novel sequences is obviously inconsistent in different 139 verified sources. The detection rate of novel sequences in one genome is extremely low, the 140 median is only 0.43% among the 922 resequencing data and 5% among the 20 assemblies (Fig. 141 2a) . Among all 159Mb novel sequences, the 10 Illumina assemblies independently detected 142 about 60Mb, containing only 3.44 Mb intersection with PacBio assemblies. Due to the higher 143 detection rate of RNA-Seq, we picked up the transcribed novel sequences according to the 263 144 transcriptomes for further validation. RNA-Seq confirmed that 60.51% of the transcribed 145 regions of the cryptic novel sequences were shared among more than half of the chicken 146 genomes (Supplementary Fig. 11 ). In the six individuals with both PacBio genome assembly 147 and transcriptome data, the transcriptomes supported a total of 16,169 novel sequences, 9,200 148 (56.90%) of which were detected in the transcriptomes of all six individuals. However, 5,711 149 (62.08%) of the 9,200 sequences were absent in all six PacBio genome assemblies (Fig. 2b) . 150 By mapping the PacBio reads to the novel sequences, 76.35% and 52.81% of the novel 151 sequences were covered by at least one read across more than half or all PacBio-sequenced 152 individuals, respectively. Moreover, although the GRCg6a assembly did not contain our novel 153 sequences, 6.30% (2,879) of the sequences were covered by the Illumina sequencing reads of 154 the GRCg6a individual with at least 7× coverage (corresponding to 25% of the genome-wide 155 depth) (Supplementary Fig. 12 and Supplementary Dataset). To explain the prevalence of 156 ubiquitously transcribed yet missing novel sequences in the assemblies, we compared the 157 median sequencing depth of the novel sequences with the whole-genome depth in the 158 individuals. We found that the median sequencing depth of the novel sequences was only one-159 third of the whole-genome depth in the individuals in which the novel sequences were 160 successfully assembled. Furthermore, in the individuals in which a given novel sequence was 161 missing from the assembly, the median sequencing depth of the novel sequences was only one-162 twentieth of the whole-genome depth, which is insufficient for successful assembly (Fig. 2c) . 163 Collectively, the results indicated that the novel sequences were most likely present in most or 164 all the chicken genomes but were prone to be missing in the assemblies due to their extremely 165 low DNA sequencing depth. 166 167 We observed a higher GC content in the novel sequences than in the reference genome (57.2% 169 vs. 42.30%). However, the GC content around 60% could not significantly reduce the depth of 170 sequencing. Another influence of sequencing is repeat. The content of tandem repeats (TRs) in 171 the novel sequences was 79.13%, which is extremely high and significantly higher than in 172 GRCg6a (2.2%; chi-squared test, P-value = 0) ( Fig. 2d and Table 1 ). Other interspersed repeats 173 such as LTR and LINE were low (0.09% in novel sequence vs 9.6% in GRCg6a, 174 Supplementary Fig. 13) . We predicted the relative importance of TR and GC content in 175 detection rate in assembly using random forest classifier and found the TR content had a greater 176 influence than GC (Fig. 2d, Supplementary Fig. 14) . The TR can form noncanonical DNA 177 structures, such as G-quadruplexes (four-stranded noncanonical DNA/RNA topologies, 178 hereafter referred to as G4 motifs), Z-DNA, A-phased repeats and inverted repeats, which can 179 form cruciforms, triplexes and slipped structures, leading to genomic instability (Zhao et al. 180 2010) and incapable DNA sequencing (Guiblet et al. 2018 ). We found these noncanonical 181 structure are highly intersected with TR regions (Supplementary Fig. 15 ). Among these 182 structures, the content of direct repeats (DR) (37.96%) and G4 motifs (37.08%), are the highest 183 in novel sequences, while their occurrence in GRCg6a is only 1.47% and 0.77%. DR and G4 184 also showed the largest negative correlation with read depth, the novel sequence with DR and 185 G4 motif had only 1/3 and 1/2 read depth of all novel sequences. (Fig. 2e) . It is worth noting 186 that as particularly stable noncanonical DNA structures, G4 motifs typically form in guanine-187 rich regions of genomes, which may be one of the reasons why GC-rich sequences are difficult 188 to sequence. We also found that the transcribed regions of novel sequences showed a lower TR 189 content (Supplementary Fig. 16) , which might be the reason why RNA-Seq resulted in a 190 higher observed frequency than DNA sequencing. Within the novel sequences, the expressible regions are of the most interest for potentially 194 discovering novel candidate genes. To identify novel chicken genes, we performed gene 195 annotation for all 20 assemblies by de novo and reference-guided methods using the multi-196 tissue transcriptomes (see Methods, Supplementary Table 8) . The median expression level 197 of these putative novel genes was significantly higher than the median expression of annotated genes (P-value = 2.84 × 10 -7 ) ( Fig. 3a and Supplementary Fig. 17) . The 199 chromosome conformation analysis showed that the regions containing novel genes were 200 significantly enriched in the A-compartment (P-value = 2.2× 10 -16 ) ( Fig. 3b and 201 Supplementary Fig. 18) , which is associated with open, expression-active chromatin. 202 Furthermore, the orthologues of the novel genes showed expression levels that were higher 203 than the median levels observed in other species, such as human and mouse (Supplementary 204 Fig. 19 ), suggesting plausible functions and active expression of these genes. 205 We identified 1,335 novel coding genes with FPKM > 1, and completely missing from 206 GRCg6a (see Methods, Supplementary Fig. 20 , and Supplementary Table 9 ). The novel 207 coding genes were distributed across 1,100 novel sequences, with an average length of 1,047 208 bp. By searching against the non-redundant (nr) protein database of NCBI (E-value ≤ 1×10 -5 ), 209 969 of the novel coding genes were found to show Chordata protein orthologues, 738 of which 210 belonged to Aves (Supplementary Table 9 ). In addition to novel coding genes, we also 211 of all the coding genes present in the reference genome. The lower detection rate of novel genes 218 in proteomics may be affected by the differences in protein length and the quality of the protein 219 database used for searching. Notably, after removing novel coding genes less than 1 Kb in 220 length, the proteomic verification ratio of the remaining novel coding genes increased to 221 29.11%. 222 We found that most of the novel coding genes were present and expressed in most 223 chicken breeds. According to the DNA data, 92.47% of the novel sequences containing novel 224 coding genes were supported by at least one PacBio read in each sample (Supplementary Fig. 225 22) . According to the comparison of multi-tissue transcriptomes of six individuals, 55.13% and 226 80.97% of the novel coding genes were detected in all six or at least three individuals, 227 respectively (Supplementary Table 9 ). Based on our sequencing platform, assembly strategy 228 and annotation pipeline, the modelling of the saturation curve by iteratively randomly sampling 229 individuals suggested that the number of novel genes detected by genome assembly did not 230 significantly increase beyond a sample size of ten (Fig. 3c) . A previous study (Yin et al. 2019 ) 231 based on the de novo assembly of massive chicken transcriptomes increased the number of 232 known chicken coding genes from 17,477 to 17,967 (Fig. 3d, Supplementary Fig. 23) . 233 According to our chicken pan-genome, we found that the total number of chicken coding genes 234 reached at least 19,223 (Fig. 3d, Supplementary Fig. 23 , and Supplementary Table 11) . 235 In addition to coding genes, we identified 3,011 long noncoding RNAs (lncRNAs) (see 236 Table 12) . Among these novel lncRNAs, 87.85% were supported 237 by at least one PacBio read in each sample (Supplementary Fig. 22 ). In our multi-tissue 238 transcriptomes of six individuals, 47.72% and 75.09% of novel lncRNA genes were detected 239 in all six or at least three individuals, respectively (Supplementary Table 12 ). The increasing 240 saturation curve of the observed novel lncRNA genes was similar to that of novel coding genes 241 (Fig. 3c) . Using the same pipeline as in a previous study (Sarropoulos et al. 2019 ), we showed 242 that the total number of chicken lncRNAs was at least 19,795 (Fig. 3e) . Therefore, our study 243 revealed that the numbers of both the protein-coding and lncRNA genes of chicken are 244 comparable to those of other tetrapod vertebrates ( Fig. 3d and 3e) . GRCg6a, filling 72 of 946 gaps in GRCg6a (Supplementary Fig. 24a, e) . 256 The fully anchored novel sequences and genes were overrepresented on micro-257 chromosomes (GGA11-38) (< 10 Mb) or the terminal 5 Mb ends of macro-chromosomes ( Fig. 258 4a, Supplementary Fig. 25 ), which are termed as sub-telomeric regions. By comparison with 259 the random distribution, we estimated 2.5-(P-value < 1 × 10 -6 , permutation) and 5-fold (P-260 value < 1 × 10 -6 , permutation) increases in fully anchored novel sequences and gene density 261 by comparing chicken genes with those of human and mouse. The synonymous substitution 269 rate (dS) and nonsynonymous substitution rate (dN) of these novel genes were 3.3-and 2.5-270 fold higher than that anchored GRCg6a genes, respectively. And the dN/dS ratio of these novel 271 genes was lower than that of the reference genes. Interestingly, the unlocalized genes of 272 GRCg6a, which may also be located in sub-telomeric regions, showed the similar mutation 273 pattern as the novel genes (Fig. 4b) . This suggested that the novel coding genes in sub-274 telomeric regions showed a higher mutation rate. Table 9 ). Through the 291 annotation and enrichment analyses, we also found that a large number of them were involved 292 in essential biological reactions and pathways, such as metabolism, signal transduction, basic 293 biological functions, the immune system and disease (Fig. 5a, Supplementary Tables 14, 15 294 and 16). 295 In the novel regions, we dissected chromosome 16 and the sub-telomeric part of 296 chromosome 1 as two examples to reveal their plausible gene arrangement and functions. 297 Chromosome 16 is a micro-chromosome that contains many immune system-related genes (Fig. 298 4d) and spans only 2.84 Mb of GRCg6a. We assembled 3.76 Mb of novel sequences and 299 identified 61 novel coding genes and 80 lncRNA genes on chromosome 16. The novel gene 300 clusters showed good syntenic relationships with other tetrapods (Fig. 4d) . One of the novel 301 gene clusters showed that birds had experienced regional complications in the cluster and 302 lacked a large number of coding genes (Fig. 4d) . One novel coding gene, the complement 303 factor B (CFB) gene, which is an important immune gene involved in the alternative 304 complement pathway of the immune system ( Supplementary Fig. 27 ) and is regulated by the 305 nuclear factor kappa B (NF-κB) pathway, was de novo identified on chromosome 16. This gene 306 is highly and uniquely expressed in the liver of chickens and confirmed based on our MS/MS 307 data (Supplementary Fig. 27 ). In addition, we identified two novel ribosomal genes, 308 mitochondrial ribosomal protein S18B (MRPS18B) and ribosomal protein S18 (RPS18) on 309 chromosome 16 (Fig. 4d, Supplementary Table 13 ). 310 The other novel gene cluster, including the leptin gene, located on chromosome 1 311 ( Supplementary Fig. 25 ). Based on RNA-Seq, previous research has shown that the leptin 312 gene does exist in the chicken genome, yet it was absent from the chicken reference genome 313 (Seroussi et al. 2017 ). Interestingly, we found that two divergent haplotypes of the leptin gene 314 were assembled from two individuals. The entire gene region and its flanking regions had 315 extremely high TR and G4 motif contents (Supplementary Fig. 28) . Based on chromosome 316 interaction data, leptin was assigned to the distal tip of chromosome 1p, showing collinearity 317 with SND1 and LRRC4 (Supplementary Fig. 25, Supplementary Fig. 28) . We found that 318 leptin exon 2 was conserved, while exon 1 was variable in chicken. The length of its intron also 319 varied among different chicken individuals (Supplementary Fig. 28) . Neither of the two exons 320 showed good homology with other species. In this region, we found another novel gene, 321 ovocleidin-17 (OC-17), which plays a key role in avian eggshell biomineralization and is not 322 contained in the reference genome (Supplementary Table 9 ). 323 324 The chicken pan-genome identify more crucial novel genes related to avian diseases resistance 326 that had not been discovered previously. Chickens are susceptible to several diseases that have KAT8, and KHSRP) belonged to or were regulated by the NF-κB signaling pathway, which is 337 the master regulator of the immune response to infection due to its role in regulating cytokine 338 and antimicrobial peptide expression (Fig. 5b) Supplementary Fig. 29) . Another novel gene, IKKγ (Supplementary Table 9 A. In total, there were 21 novel coding genes and 7 partially missing coding genes that belonged 347 to or were regulated by the NF-κB signaling pathway (Fig. 5b) . The NF-κB pathway is essential 348 in defense against viral infections, such as those caused by influenza viruses. 349 The chicken is the modern descendant of the dinosaurs with its genome firstly sequenced in waiting to be discovered. Our study not only revealed the gene number in birds is comparable 367 to that in other tetrapods but also presented a novel closed pattern of avian pan-genome. The 368 complete avian genomes will greatly improve the studies on comparative genomics and 369 functional genomics research in birds. 370 It has been believed that both the evolutionary substitution rate and the rate of 371 chromosomal rearrangement in the avian lineage are lower than that of mammals (Burt et al. 372 1999; Zhang et al. 2014 ). However, we found that a large number of novel genes that have 373 three times the substitution rate than the known ones, which can greatly increase the average 374 substitution rate of the chicken genome. We find that the novel sequences and genes were 375 Eight micrograms of genomic DNA were sheared using g-Tubes (Covaris), and concentrated 408 with AMPure PB magnetic beads. Each SMRT bell library was constructed using the Pacific 409 Biosciences SMRTbell template prep kit 2.1. The constructed libraries were size-selected on a 410 BluePippin™ system for molecules ≥ 20 kb, followed by primer annealing and the binding of 411 SMRT bell templates to polymerases with the DNA/Polymerase Binding Kit. Sequencing was 412 carried out using P6-C4 chemistry on the Pacific Bioscience Sequel II platform by Annoroad 413 For short-read DNA sequencing, the genomic DNA of ten samples used for next-415 generation sequencing (NGS) assembly was extracted from ethylenediaminetetraacetic acid 416 (EDTA)-anticoagulated blood randomly fragmented. Two paired-end libraries and two mate-417 pair libraries with insert sizes of 500 bp, 800 bp, 3 Kb and 5 Kb were constructed. All libraries 418 were sequenced on the Illumina HiSeq 2000 platform according to the manufacturer's protocol. 419 After filtering out adapter sequences and low-quality reads, a total of 1.61 Tb (average 134 × 420 coverage of chicken genome) of data were retained for assembly. In addition, the libraries of 421 ten samples used for PacBio sequencing were also constructed using an amplification-free For transcriptome analysis, total RNA was extracted using TRIzol extraction reagent (Thermo 427 Fisher). The RNA quality analysis method was the same as DNA quality analysis method 428 The samples were ground into a cell powder in liquid nitrogen and then sonicated in lysis buffer 442 (8 M urea, 1% protease inhibitor cocktail) three times on ice using a high-intensity ultrasonic 443 processor (Scientz). The remaining debris was removed by centrifugation at 12,000 g at 4 °C 444 for 10 min. Thereafter, the supernatant was collected, and the protein concentration was 445 determined with a BCA kit according to the manufacturer's instructions. Then, the protein 446 solution was subjected to trypsin digestion. Next, the tryptic peptides were fractionated by 447 high-pH reverse-phase HPLC using a Thermo Betasil C18 column (5 μm particles, 10 mm ID, 448 and 250 mm length). 449 The tryptic peptides were dissolved in 0.1% formic acid (solvent A) and directly loaded 450 onto a homemade reversed-phase analytical column (15-cm length, 75 μm i.d.). The gradient 451 consisted of an increase from 6% to 23% solvent B (0.1% formic acid in 98% acetonitrile) over 452 26 min, an increase from 23% to 35% over 8 min and then to 80% over 3 min, with holding at 453 80% for the last 3 min, all at a constant flow rate of 400 nL/min in an EASY-nLC 1000 UPLC 454 system. The peptides were introduced to a nanospray ionisation (NSI) source, followed by 455 MS/MS in a Q ExactiveTM Plus system (Thermo) coupled online to the UPLC system. 456 The MS/MS data were processed using the MaxQuant search engine (v.1.5.2.8) (Cox 457 and Mann 2008). Tandem mass spectra were searched against the human UniProt database 458 concatenated with the reverse decoy database. Trypsin/P was specified as the cleavage enzyme, Then, we used minimap2 v2.14-r883 (Li 2018) to map the corrected reads to the consensus, 474 and they were subsequently polished by using WTPOA-CNS v1.2. This process was repeated 475 three times. Next, the consensus sequence obtained in the previous step was mapped by using The de novo assemblies were aligned to the chicken reference genome (GRCg6a; 506 GCF_000002315.6) using minimap2 (Li 2018) (-cx asm10). Based on the pairwise alignment, 507 unaligned or low-identity sequences (showing more than 10% sequence divergence relative to 508 GRCg6a) were extracted. Then, the adjacent sequences within 200 bp were merged. BLASTN 509 2.6.0+ (Camacho et al. 2009 ) (with the parameters -word_size 20 -max_hsps 1 -510 max_target_seqs 1 -dust no -soft_masking false -evalue 0.00001) was further used to align the 511 unaligned sequences from the previous step to GRCg6a, and the sequences showing identity 512 greater than 90% to GRCg6a sequences were removed. The remaining sequences were merged 513 according to the adjacent regions within 200 bp, and sequences of less than 500 bp in length 514 were removed. Subsequently, the unaligned and low-identity sequences obtained from all of 515 the assemblies were combined, redundancies were removed with CD-HIT v4.7 (Fu et al. 2012 ) 516 (parameter: -c 0.9 -aS 0.8 -d 0 -sf 1), and the longest sequence in the cluster was selected as 517 the representative sequence. To further exclude potential contaminants in the dataset, we used 518 BLASTN to compare the non-redundant set with the nr database of NCBI (v20181220). The 519 sequences that were aligned to non-Chordata species were removed from the final novel 520 sequence set (Supplementary Table 5) . which were previously used for single-base error correction) were also included in this analysis. 528 The presence and absence of each novel sequence was then determined according to the 529 sequence coverage and depth. First, to obtain high-quality reads and minimize false genotyping 530 results due to low-quality reads supplied by Illumina, we implemented the following quality 531 control procedures to filter the reads before read mapping using Trimmomatic v0.36 (Bolger 532 et al. 2014) , and leading or trailing stretches of Ns and bases with a quality score below 3 were 533 trimmed. Then, the reads were scanned using a 4-base wide sliding window and clipped when 534 the average quality per base was below 15, and only reads of 40 nucleotides or longer were 535 finally retained. Second, high-quality paired reads were aligned to GRCg6a using BWA-MEM 536 Poorly aligned or unaligned reads were extracted as follows: Samblaster v0.1.24 (Faust 543 and Hall 2014) was used to extract clipped reads and unaligned reads, while sambamba v0.6.8 544 (Tarasov et al. 2015) and SAMTools v1.9 (Li et al. 2009 ) were used to collect other poorly 545 aligned reads. The paired reads with unaligned/poorly aligned read pairs were extracted using 546 seqtk v1.3-r106 (https://github.com/lh3/seqtk) and were then aligned to the novel sequence set 547 using a previously described process. Novel sequences with a coverage above 0.8 and a depth 548 greater than one-quartered of the whole-genome depth were identified as present. Then, StringTie (--merge) was used to merge all the transcript GTFs obtained from the samples 566 mapped to this assembly to obtain a reference annotation. Finally, all samples were 567 reassembled and quantified using StringTie with the reference annotation to obtain the 568 expression level of each transcript. Notably, the transcripts with fragments per kilobase per 569 million mapped reads (FPKM) ≥ 1 were considered robustly expressed. 570 Redundancy among genes that were annotated based on the de novo and reference-571 guided methods and intersected with novel sequences was removed with CD-HIT (parameter: 572 -c 0.9 -aS 0.8 -d 0 -sf 1). Then, the remaining genes were searched against the nr database and 573 the genes of GRCg6a using BLASTN 2.6.0+. Genes with no hits to either non-Chordata species 574 or GRCg6a were retained as "novel genes" that were completely absent in the chicken reference 575 genome. Genes showing hits to GRCg6a genes with more than 95% identity were classified as 576 partially missing in the chicken reference genome. 577 Next, the coding potential of these novel genes was assessed by using CPAT v1.2.3 578 (Wang et al. 2013 ) with the default parameters. CPAT uses an alignment-independent logistic 579 regression model to detect coding potential based on sequence features. To select a cut-off for 580 classification, we built hexamer tables and logit models for chicken using chicken CDSs and 581 ncRNA sequences downloaded from Ensembl (release 98) as training data. Then, a two-graph 582 receiver operating characteristic curve was used to determine the optimum cut-off value 583 through 10 random sample validations (Supplementary Figure 20) . A cut-off of 0.69 was 584 selected to classify the novel genes as potential protein-coding or noncoding genes. Then, the 585 ORFs were searched by using TransDecoder v5.5.0 (http://transdecoder.github.io) and 586 ORFfinder v0.4.3 (https://www.ncbi.nlm.nih.gov/orffinder/) with the default parameters. 587 Genes showing values above the cut-off of the CPAT with a minimum ORF of at least 100 588 amino acids were classified as novel coding genes. For the remaining novel genes, RNAcode 589 (Washietl et al. 2011 ) was used to further estimate the coding potential. To prevent the 590 divergent homologous haplotypes that can caused false gene duplications ), we 591 merged novel coding genes that have high similarity with each other or can be annotated to the 592 same gene, and then performed manual check. We generated customized whole-genome 593 alignments for each de novo assembly against Japanese quail (GCF_001577835.1), turkey 594 (GCF_000146605.3), and helmeted guineafowl (GCF_002078875.1), which we used to 595 estimate coding potential. We used BLASTX 2.6.0+ (with the parameters '-evalue 0.00001') 596 to translate each novel genes form all six possible reading frames, and the results were 597 compared to known proteins in the nr database. Genes with an E value ≤ 10 −5 , alignment length 598 of ≥ 10 amino acids and identity ≥ 95% were removed from the final potential long noncoding 599 gene set. Only multiple exon genes with more than 200 nucleotides and without any detectable 600 protein-coding potential were classified as novel long noncoding genes. The expression levels of each gene obtained from the previous step were used for differential 616 expression analysis. The R language was used to identify differentially expressed genes with 617 the edgeR package (v3.28.1) (Robinson et al. 2010) . The fold changes between the two groups 618 were calculated as logFC = log2 (experimental/control group). Benjamini-Hochberg correction 619 was used to correct for multiple comparisons (with a false discovery cut-off of < 0.05). Genes 620 in the two groups a with |logFC| > 2 and q-value < 0.05 were defined as differentially expressed 621 genes. 622 623 Anchoring novel sequences onto the reference genome 624 The novel sequences were anchored to GRCg6a based on alignment information between all 626 de novo assemblies and GRCg6a. First, the scaffolds of the de novo assemblies that contained 627 novel sequences were extracted and anchored on the chromosome/scaffold of GRCg6a which 628 showed the most alignment hits with them. Then, the adjacent flanking sequences (more than 629 100 bp) of the novel sequences aligned to the same chromosome/scaffold were retained for 630 further positioning. If the flanking sequences were perfectly aligned to GRCg6a with no gaps, 631 an identity ≥ 90% and a breakpoint shift of ≤5bp, we recorded the sequences as "placed". The 632 other alignments were recorded as "ambiguously placed". The novel sequences with two placed 633 flanking sequences were reported as "localized". The novel sequences with one or two 634 ambiguously placed flanking sequences were reported as "unlocalized". The final remaining 635 sequences were reported as "unplaced". Based on the genome placement information, the 636 localized sequences could be further classified as insertions, alternate alleles, or ambiguous 637 sequences. The insertions introduced only one sequence fragment to the reference genome and 638 were no more than 10 bp in length. For alternate alleles, the novel sequences had to share less 639 than 90% (or 0%) identity with their counterparts in the reference. Furthermore, the novel 640 sequences and their counterparts had to have comparable lengths, with a length ratio between 641 1/3 and 3. The remaining sequences that did not meet the above criteria for insertions and 642 alternate alleles were classified as ambiguous sequences. 643 The preprocessing of paired-end sequencing data, mapping of reads and filtering of mapped di-645 tags were performed using the Juicer pipeline (version 1.5) (Durand et al. 2016 ). Briefly, short 646 reads were mapped to the chicken pan-genome using BWA-MEM (version 0.7.17-r1188) (Li 647 and Durbin 2010). Reads with low mapping quality were filtered using Juicer with the default 648 parameters, discarding invalid self-ligated and un-ligated fragments as well as PCR artefacts. 649 Filtered di-tags were further processed with Juicer command line tools to bin ditags (10 kb bins) 650 and to normalize matrices with KR normalization (Knight and Ruiz 2012) . We normalized all 651 Hi-C matrices on the same scale by KR normalization, ensuring that any differences between 652 Hi-C data were not attributable to variation in sequence length. The maximum 100-kb bin of 653 each novel sequence interaction (interaction intensity ≥5) was collected as a potential location 654 of novel sequences. Novel sequences that were validated in at least two individuals with Hi-C 655 data and anchored to the same location were remained. Western terrestrial garter snake (GCF_009769535.1), Common lizard 669 (GCF_011800845.1), Red-eared slider turtle (GCF_013100865.1), Goodes thornscrub tortoise 670 (GCF_007399415.2), Green sea turtle (GCF_015237465.1) Japanese quail 677 (GCF_001577835.2), and Chicken (GCF_000002315.6 ) Insights into 687 variation in meiosis from 31,228 human sperm genomes Tandem repeats finder: a program to analyze DNA sequences Scaffolding pre-assembled 691 contigs using SSPACE Trimmomatic: a flexible trimmer for Illumina 693 sequence data Gap5--editing the billion fragment sequence assembly Correspondence on Lovell et al.: identification of chicken genes 698 previously assumed to be evolutionarily lost Avian Genomes Revisited: 700 Hidden Genes Uncovered and the Rates versus Traits Paradox in Birds Near-optimal probabilistic RNA-seq 703 quantification The chicken as a model for large-scale 705 analysis of vertebrate gene function Chicken genome: current status and future opportunities The dynamics of chromosome evolution in birds and mammals rnaSPAdes: a de novo 712 transcriptome assembler and its application to RNA-Seq data 714 BLAST+: architecture and applications Gene expression across mammalian organ 717 development TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data Nasopharyngeal shedding of severe acute respiratory syndrome-associated 723 coronavirus is associated with genetic polymorphisms The functions of the 725 thymus system and the bursa system in the chicken MaxQuant enables high peptide identification rates, individualized 727 p.p.b.-range mass accuracies and proteome-wide protein quantification HUPAN: 730 a pan-genome analysis pipeline for human genomes Machol 732 Hi-C yields chromosome-length scaffolds Juicer 735 Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments OrthoFinder: phylogenetic orthology inference for comparative 738 genomics SAMBLASTER: fast duplicate marking and structural variant read 740 extraction CD-HIT: accelerated for clustering the next-742 generation sequencing data IOC World Bird List (v10.1) Pangenomics Comes of Age: 745 From Bacteria to Plant and Animal Applications Long-read sequencing technology indicates genome-748 wide effects of non-B DNA on polymerization speed and error rate HRT Atlas v1.0 database: 751 redefining human and mouse housekeeping genes and candidate reference transcripts by 752 mining massive RNA-seq datasets Sequence and comparative analysis 754 of the chicken genome provide unique perspectives on vertebrate evolution InterProScan 5: genome-scale protein function classification Graph-based genome alignment 760 and genotyping with HISAT2 and HISAT-genotype False gene and chromosome losses affected by assembly and sequence errors A fast algorithm for matrix balancing 767 Widespread false gene gains caused by duplication errors in genome assemblies Canu: scalable 769 and accurate long-read assembly via adaptive k-mer weighting and repeat separation Genomic 772 data for 78 chickens from 14 populations Exploring single-sample SNP and INDEL calling with whole-genome de novo 774 assembly Minimap2: pairwise alignment for nucleotide sequences Fast and accurate long-read alignment with Burrows-Wheeler 778 transform The Sequence Alignment/Map format and 781 SAMtools Towards 783 the Complete Goat Pan-Genome by Recovering Missing Genomic Segments From the 784 Reference Genome De 786 novo assembly of human genomes with massively parallel short read sequencing Comprehensive mapping of long-range 790 interactions reveals folding principles of the human genome Human 792 subtelomeres are hot spots of interchromosomal recombination and segmental 793 duplication Comparative 795 analysis of selected innate immune-related genes following infection of immortal DF-1 796 cells with highly pathogenic (H5N1) and low pathogenic (H9N2) avian influenza viruses Conserved syntenic clusters of protein coding genes are missing in birds Biased gene conversion: implications for genome and sex evolution CK2beta gene silencing increases cell 804 susceptibility to influenza A virus infection resulting in accelerated virus entry and higher 805 viral protein content The phusion assembler Qualimap 2: advanced multi-sample 808 quality control for high-throughput sequencing data StringTie 810 enables improved reconstruction of a transcriptome from RNA-seq reads Role of recombination and replication 813 fork restart in repeat instability Towards complete and error-free genome assemblies 816 of all vertebrate species Mini-and microsatellite expansions: the recombination 818 connection edgeR: a Bioconductor package for 820 differential expression analysis of digital gene expression data Fast and accurate long-read assembly with wtdbg2 The NS1 protein 824 of influenza A virus blocks RIG-I-mediated activation of the noncanonical NF-kappaB 825 pathway and p52/RelB-dependent gene expression in lung epithelial cells 828 Machine learning model for sequence-driven DNA G-quadruplex formation Developmental 831 dynamics of lncRNAs across mammalian organs and species Assembly of large genomes using second-833 generation sequencing Rothlin 835 CV. 2016. AXL receptor tyrosine kinase is required for T cell priming and antiviral 836 immunity Mapping of leptin and its syntenic genes to chicken 839 chromosome 1p Assembly of a pan-genome from deep sequencing of 910 842 humans of African descent Pan-genomics in the human genome era BUSCO: 846 assessing genome assembly and annotation completeness with single-copy orthologs A comparative analysis of host responses to avian influenza 850 infection in ducks and chickens highlights a role for the interferon-induced 851 transmembrane proteins in viral resistance DNA related to the transforming gene(s) 853 of avian sarcoma viruses is present in normal avian DNA Sambamba: fast processing of 855 NGS alignment formats Building a 857 sequence map of the pig pan-genome from multiple de novo assemblies and Hi-C data Historical introduction to the general properties of retroviruses Pilon: an integrated tool for comprehensive microbial 862 variant detection and genome assembly improvement Influenza A Virus Facilitates Its 864 Infectivity by Activating p53 to Inhibit the Expression of Interferon The 867 chicken pan-genome reveals gene content variation and a promoter region deletion in 868 IGF2BP1 affecting body size Assessment Tool using an alignment-free logistic regression model Otecko 873 NO, et al. 2020. 863 genomes reveal the origin and domestication of chicken 876 MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and 877 collinearity A New Chicken Genome Assembly Provides 880 Insight into RNAcode: robust discrimination of coding and noncoding regions in 883 comparative sequence data The cellular RNA 885 helicase UAP56 is required for prevention of double-stranded RNA formation during 886 influenza A virus infection De novo human genome assemblies reveal 888 spectrum of alternative haplotypes in diverse populations Towards a reference genome that captures global genetic diversity KOBAS 2.0: 893 a web server for annotation and identification of enriched pathways and diseases PAML 4: phylogenetic analysis by maximum likelihood Revisiting avian 'missing' genes from de novo assembled transcripts Comparative genomics reveals insights into avian genome evolution and 901 adaptation Non-B DNA structure-induced genetic 903 instability and evolution Three chromosome-level duck genome assemblies provide insights into 906 genomic variation during domestication