key: cord-015527-ph576eji authors: Mostajo, Nelly F; Lataretu, Marie; Krautwurst, Sebastian; Mock, Florian; Desirò, Daniel; Lamkiewicz, Kevin; Collatz, Maximilian; Schoen, Andreas; Weber, Friedemann; Marz, Manja; Hölzer, Martin title: A comprehensive annotation and differential expression analysis of short and long non-coding RNAs in 16 bat genomes date: 2019-09-30 journal: NAR Genom Bioinform DOI: 10.1093/nargab/lqz006 sha: doc_id: 15527 cord_uid: ph576eji Although bats are increasingly becoming the focus of scientific studies due to their unique properties, these exceptional animals are still among the least studied mammals. Assembly quality and completeness of bat genomes vary a lot and especially non-coding RNA (ncRNA) annotations are incomplete or simply missing. Accordingly, standard bioinformatics pipelines for gene expression analysis often ignore ncRNAs such as microRNAs or long antisense RNAs. The main cause of this problem is the use of incomplete genome annotations. We present a complete screening for ncRNAs within 16 bat genomes. NcRNAs affect a remarkable variety of vital biological functions, including gene expression regulation, RNA processing, RNA interference and, as recently described, regulatory processes in viral infections. Within all investigated bat assemblies, we annotated 667 ncRNA families including 162 snoRNAs and 193 miRNAs as well as rRNAs, tRNAs, several snRNAs and lncRNAs, and other structural ncRNA elements. We validated our ncRNA candidates by six RNA-Seq data sets and show significant expression patterns that have never been described before in a bat species on such a large scale. Our annotations will be usable as a resource (rna.uni-jena.de/supplements/bats) for deeper studying of bat evolution, ncRNAs repertoire, gene expression and regulation, ecology and important host–virus interactions. Bats (Chiroptera) are the most abundant, ecologically diverse and globally distributed animals within all vertebrates (1), but representative genome arrangements and corresponding coding and non-coding gene annotations are still incomplete (2) . Except for the extreme polar regions, bats can be found throughout the globe, feeding on diverse sources such as insects, blood, nectar, fruits and pollen (2) . Their origin has been dated in the Cretaceous period, with a diversification explosion process dating back to the Eocene (3). The 21 bat families known to date are classified into the suborders Yinpterochiroptera and Yangochiroptera (2,4) ( Figure 1 ). Although they account for >20 % of the total living mammalian diversity (5) , the genomes of only 16 bat species of the estimated >1300 species (2) have been sequenced with adequate coverage to date and are publicly available ( Figure 1) . Bats have developed a variety of unique biological features that are the rarest among all mammalian, including laryngeal echolocation (4, 6) , vocal learning (7) and the ability to fly (3) . They occupy a broad range of different ecological niches (2) , have an exceptional longevity (8-10) and a natural and unique resilience against various pathogenic viruses (1, 11) . For example, bats are the suspected reservoirs for some of the deadliest viral diseases such as Ebola and SARS (12) (13) (14) , but appear to be asymptomatic and survive the infection. Possibly, the solution to better understand and fight these pathogens lies in the uniquely developed immune system of bats (15, 16) . Studying bats and their genomes is likely to have high impacts on various sci- Figure 1 . We used available genomes of 16 bat species from eight out of 21 families for non-coding RNA annotation in this study. The tree shows their phylogenetic relationship and is based on a molecular consensus on family relationships of bats (4), further adapted and extended from (2) . Bat families and species with published genomes currently available in the NCBI are shown (details see Table 1 ). Bat families still lacking a published genome assembly are written in gray color. RNA-Seq data sets were selected from species marked with an asterisk and additionally obtained from a Myotis daubentonii cell line (see Table 2 ). Bat silhouettes were adapted from artworks created by Fiona Reid. entific fields, including healthy ageing, immune and ecosystem functioning, the evolution of sensory perception and human communication, and mammalian genome architecture (see the recent Bat1K review for further details (2) ). Despite the unique biological characteristics of these flying mammals and their important role as natural reservoirs for viruses, bats are one of the least studied taxa of all mammalian (17) . Accordingly, there is little knowledge about the non-protein-coding transcriptome of bats, which plays a crucial role in an extensive number of cellular and regular functions and comprises a very diverse family of untranslated RNA molecules (18, 19) . In addition, it is believed that due to the early evolutionary radiation of bats (compared to other mammals) their innate and acquired immune responses have a different set of molecules (1) . Genome assemblies and annotations are essential starting points for many molecular-driven and comparative studies (20) . Especially, studies of non-model organisms play important roles in many investigations (21) . In most cases, however, these organisms lack well-annotated genomes (22) , which severely limit our ability to gain a deeper understanding of these species and may further impede biomedical research (23) . In this study, we comprehensively annotated non-coding RNAs in 16 available bat genome assemblies (Table 1) . For each bat species, we provide final annotations that are compatible with current Ensembl and NCBI (National Center for Biotechnology Information) standards (GTF format) and that can be directly used in other studies, for example for differential gene expression analysis. We compare our new annotations with the currently available annotations for bats and show that a large number of non-coding genes are simply not annotated and are therefore overlooked by other studies. We used six RNA-Seq data sets comprising different conditions (Table 2 ) to validate our annotations and to determine the expression levels of our newly annotated ncRNAs. Exemplarily, we show that our novel annotations can be used to identify ncRNAs that are significantly differential expressed during viral infections and were missed by previous studies. We downloaded the last recent genome versions for 16 bat species in September 2018 from the NCBI genome database (Table 1) . Within the order of Yinpterochiroptera, nine genomic sequences were obtained covering the bat families Pteropodidae, Hipposideridae, Rhinolophidae and Megadermatidae whereas for the order of Yangochiroptera another seven genome assemblies were available for the bat families Phyllostomidae, Mormoopidae, Vespertilionidae and Miniopteridae ( Figure 1 ). We introduced a unique three-letter abbreviation code (Table 1) to easily distinguish between the 16 bat species in the manuscript and intermediate annotation files provided in the Electronic Supplement. We used QUAST (v5.0.2) (24) to calculate several assembly statistics for all genomes, shown in Supplementary Table S1 . At the end of 2018, two new bat genomes were presented by the Bat1K project (http://bat1k.ucd.ie) (2), comprising a newer version of the greater horseshoe bat genome (Rhinolophus ferrumequinum; Rhinolophidae) and the genome of the pale spear-nose bat (Phyllostomus discolor; Phyllostomidae). However, these two bat genomes were not included in our present study due to the data use policy of the Bat1K consortium and to support a fair and productive use of these data. To validate our novel ncRNA predictions, we selected six RNA-Seq data sets (16, (25) (26) (27) (28) comprising all together 98 samples gathered from four different bat species. We have labeled each published RNA-Seq data set based on the first authors last name and the year of data set publication ( Table 2 and Supplementary Table S2 ). All samples were quality trimmed using Trimmomatic (29) (v0.36) with a 4 nt-step sliding-window approach (Q20) and a minimum remaining read length of 20 nt. For the Field-2018 data set (28) , we additionally removed the three leading 5' nucleotides from the reads of each sample because of a generally low quality observed by FastQC (www.bioinformatics.babraham.ac.uk/projects/ Table 1 . We have annotated ncRNAs within 16 bat genomes of different assembly quality. We introduced three-letter abbreviations for each bat species used throughout the manuscript and in supplemental files and annotations. Genome sizes were estimated (Est.) by using C-values (DNA content per pg) from the animal genome size database (http://genomesize.com) and by applying the following formula: Genome size = (0.978 · 10 9 ) · C. If multiple entries for one species were available, an average over all C-values was calculated and used to estimate the genome size. If one species could not be found, an average C-value for the corresponding genus was used. Supplementary Table S1 provides additional assembly statistics calculated by QUAST (v5.0.2) (24) . NCBI acc. -GenBank assembly accession without the prefix GCA fastqc/) (v0.11.7). The remaining reads of all processed samples were individually mapped to all 16 bat assemblies using HISAT2 (30) (v2.1.0) and transcript abundances were subsequently calculated from all resulting 1568 mappings by featureCounts (31) (v1.6.3). If suitable, the appropriate strand-specific counting mode was applied for each data set (see Table 2 for information about the strand specificity). For each bat genome assembly, the merged annotation of already known (NCBI) and newly identified (this study) ncR-NAs was used (Supplementary Files S1). Due to the size of the annotations and the huge amount of overlaps, long ncR-NAs were counted and analyzed for differential expression separately. To enable a better investigation of small RNAs (sR-NAs), we included a data set of the targeted sequencing of sRNAs (especially miRNAs) from M. daubentonii cells (Weber-2019). To obtain this sRNA sequencing data, total RNA of 18 samples, which was obtained using the same procedure like explained for the rRNA-depleted M. daubentonii data (26) , was preprocessed using the Illumina TruSeq smallRNA protocol, sequenced on one HiSeq 2500 lane, and finally uploaded in the course of this study under GEO accession GSE132336. The reads of these 18 sRNA samples were additionally preprocessed by removing potential adapter sequences with cutadapt (32) (v1.8.3) followed by a quality (Q20) trimming using again a window-size of 4 and a minimum length of 15 nt by PrinSeq (33) (v0.20.3). The processed sRNA samples were either individually mapped to the 16 bat genomes for differential expression analysis or combined and mapped on each bat genome to predict known and novel miRNAs with miRDeep2 (34). Only uniquely mapped reads were counted and used for the differential gene expression analyses with DESeq2 (35) (v1.16.1). Annotated rRNA genes were removed prior DE-Seq2 and TPM (transcripts per million) analysis. All raw read counts from samples of one data set were combined and normalized together using the built-in functionality of DESeq2, followed by pairwise comparisons to detect significant (adjusted p-value < 0.05; absolute log 2 fold change > 2) differential expressed ncRNAs. Besides the DESeq2 normalization, we calculated TPM values for each ncRNA in each sample as previously described (36) : where c i is the raw read count of ncRNA i, l i is the length of ncRNA i (and the cumulative exon length in the case of lncRNAs) and N is the number of all ncRNAs in the given annotation. To this end, we obtained for each RNA-Seq sample, each bat annotation, and each ncRNA one TPM value representing the normalized expression level of this ncRNA. If available, we calculated all TPM values in relation to the expression of all already known coding and non-coding genes and not only based on our novel ncRNA annotation. Although we performed mappings, read countings, and normalization for all samples, bat genome assemblies and all six data sets ( Table 2 ; overall 1568 mappings), we only selected one comparison per data set to exemplarily show novel and significantly differential expressed ncRNAs (Supplementary Files S2.1-S2.15; divided by data set and input annotation). For each data set, we chose the bat species that was also used in the corresponding study. For the Hölzer-2019 and Weber-2019 data sets, we used the closely related M. lucifugus genome assembly and annotation as a refer- Table 2 . Six RNA-Seq data sets comprising all together 98 samples derived from four different bat species were used to evaluate our novel ncRNA annotations. All samples were quality trimmed and individually mapped to all 16 bat assemblies using HISAT2 (30) and transcript abundances were subsequently calculated from all 1568 mappings by featureCounts (31) . We labeled each RNA-Seq data set based on the first authors last name and the year of data set publication. Raw read data of the enriched sequencing of small RNAs (especially miRNAs) of a M. daubentonii cell line (accompanying GSE121301 (26) ) have been uploaded in the course of this publication under GEO accession GSE132336 (Weber-2019). polyA+ -library preparation with mRNA selection; rRNA--library preparation with rRNA depletion and size selection (>200 nt); sRNA -library preparation with size selection (<200 nt); se/pe -single-/paired-end sequencing; ss/not-ss -strand-specific/unstranded sequencing; ss s /ss a -strand-specific in sense orientation/in antisense orientation Figure 4 . Please note that for M. daubentonii currently no genome assembly is available, so the genome assembly of M. lucifugus was used as a close relative. All of our annotations follow the General Transfer Format (GTF) as described and defined in the Ensembl (37) database (https://ensembl.org/info/website/upload/gff. html). Therefore, each row of each annotation file is either defined as a gene, transcript, or exon (by the feature column) and strictly following a hierarchical structure, even if only one exon (as for most ncRNAs) is reported. By adhering to this annotation format, our novel annotations can be easily merged with existing ones as derived from Ensembl or NCBI and are directly usable as input for computational tools such as HISAT2 for mapping or featureCounts for transcript abundance estimation. We defined gene, transcript, and exon IDs following the Ensembl pattern: <11-digit-number>. In general, we used specialized computational tools for the annotation of specific ncRNA classes (Supplementary Tables S3-S10). If not otherwise stated, the main ncRNA-discovery is based on homology searches against the Rfam database (38-40) (v14.0). We used the Gorap pipeline (https://github.com/koriege/gorap), a specially developed software suite for genome-wide ncRNA screening. Gorap screens genomic sequences for all ncRNAs present in the Rfam database using a generalized strategy by applying multiple filters and specialized software tools. To this end, Gorap takes huge advantage of Infernal (41,42) (v1.1.2) to annotate ncRNAs based on input alignment files conserved in sequence and secondary structure (so-called Stockholm alignment files; stk). All resulting alignment files were automatically pre-filtered by Gorap based on different ncRNA class-specific parameters including taxonomy, secondary structure and primary sequence comparisons. Due to repeats, pseudogenes, undiscriminable un-/functional genes and overlapping results from the different assembly methods, we defined a ncRNA set per species for annotation that includes filtered sequences, but allows for variants and multiple copies. This final annotation set is defined by hand-curating the resulting stk alignments of Gorap with the help of Emacs RALEE mode (43) . Due to the removal of sequences in the stockholm alignments, the remaining sequences were extracted and again aligned into stockholm format using cmalign --noprob from Infernal. The Rfam-derived ncRNA alignments were further split into snoRNAs (Supplementary Table S5 ), miRNAs (Supplementary Table S6 ) and other ncRNAs including snRNAs, lncRNAs and other structural RNAs (Supplementary Table S8 ). In general, our annotation results give an overview about the amount of different ncRNAs in bat species and intentionally can include false positive hits and duplicates. All ncRNA hits are placed as STK (if available), GTF and FASTA-files in the Electronic Supplement and OSF (doi. org/10.17605/OSF.IO/4CMDN). rRNAs. We used the prediction tool RNAmmer (44) (v1.2) to identify 5.8S, 18S and 28S rRNA genes using hidden markov models. The tools output was transformed into regular GTF file format. All output files can be found in Supplementary Table S3 . tRNAs. For the annotation of tRNAs, we applied tRNAscan-SE (45) (v1.3.1) to the bat contigs using default parameters. The results were filtered by removing any tRNAs of type 'Undet' or 'Pseudo' and the tabular output was transformed into the GTF file format. Additional information about the anticodon and the coverage score were added to the description column. We provide the raw tRNAscan-SE files in Supplementary Table S4. snoRNAs. We annotated snoRNAs based on available alignments from the Rfam database using Gorap and additionally marked and classified them into box C/D and box H/ACA when intersecting with the set of snoRNAs from http://www.bioinf.uni-leipzig.de/publications/supplements/ 12-022 (46) (Supplementary Table S5 ). miRNAs. Additionally to the Rfam-screening (Supplementary Table S6 ), miRNAs were predicted by the miRD-eep2 pipeline (34) (v2.0.0.8) using default parameters and the combined smallRNA-Seq data set (Weber-2019; 18 samples) mapped to each individual bat assembly as an input for miRDeep2 (Supplementary Table S7 ). To validate the accuracy of our approach, we compared our miRDeep2 annotations of M. lucifugus/P. alecto (based on the transcriptomic data derived from M. daubentonii; Weber-2019) with annotations of miRNAs for transcriptomic data of Myotis myotis (8) and P. alecto (47) . For reference mapping, Huang et al. also used the M. lucifugus genome, so we were theoretically able to directly compare our annotations with the annotations of both studies. Unfortunately, no positional information (annotation file) of the identified miRNAs derived from the transcriptomic data of M. myotis were given in the manuscript or supplement (8) . Therefore, we blasted the precursor miRNA sequences identified with the help of the M. myotis transcriptome against the M. lucifugus genome and retained only hits with a sequence identity of 100%. The so obtained positional information was further used to calculate the overlap between our predicted miRNAs in M. lucifugus. We used the same approach for the P. alecto comparison. If the annotated location of an miRNA and one of our identified miRNA locations in M. lucifugus/P.alecto were overlapping by at least 85 %, we counted this location as a common prediction. lncRNAs. Long ncRNAs were annotated using a high confidence data set H from the LNCipedia (48) (v5.2) database comprising 107 039 transcripts of potential human lncRNAs. The transcripts were used as input for a BLASTn (2.7.1+, 1e −10 ) search against each of the 16 bat assemblies (compiled as BLAST databases). The BLASTn result for each bat assembly was further processed to group single hits into potential transcripts as follows: first, for each query sequence q ∈ H, hits of q found on the same contig c and strand s were selected (hits c, s, q ) and the longest one, q 1 , was chosen as a starting point so that trscp c, s, q = (q 1 ). Second, all hits q i ∈ hits c, s, q with q i ∈trscp c, s, q , which overlap neither in the query q nor in the target sequence and do not exceed a maximum range of 500 000 nt from the most upstream to the most down-stream target sequence position of all q j ∈ trscp c, s, q ∪q i , were added iteratively to trscp c, s, q . To this end, we introduced a simple model of exon-intron structures, naturally occurring when using transcript sequences as queries against a target genome assembly. We defined the 500 000 nt search range based on an estimation of lncRNA gene sizes derived from the human Ensembl (37) annotation. If the sum of the lengths of all q i ∈ trscp c, s, q covers the query transcript length length q at least for 70 %, trscp c, s, q was considered as a transcript and its elements q i as exons, otherwise all q i ∈ trscp c, s, q were withdrawn. This procedure is repeated until all hits ∈ hits c, s, q were used or withdrawn. Therefore, each so-defined group of non-overlapping hits derived from the same query sequence and found on the same contig and strand should represent a lncRNA transcript with its (rough) exon structure. The defined transcripts were saved as BLAST-like output and transformed into GTF file format. To follow the GTF annotation format and to harmonize our lncRNA annotations with the other ncRNA annotations, each lncRNA transcript was also saved as a gene annotation and consists of at least one exon. As we observed a lot of different sequences from LNCipedia aligning to the same positions in the genomes, we decided to condense exons at the same sequence positions, considering transcripts with one or multiple exons separately. For each contig and strand, starting from the 5' end, exons with a minimum overlap of 10 nt were grouped together. In the case of multiple exons, groups of exons were merged, if they shared any transcript origin. If all exons in the group originated from the same LNCipedia gene, the group was considered as one gene with several transcripts and its associated exon(s). Otherwise, we defined a lncRNA hot spot on gene level with several transcripts and their associated exon(s). The LNCipedia names of the gathered transcripts of a lncRNA hot spot, as well as start and end positions of all exons, are listed in the GTF gene attribute field (Supplementary Table S9 ). The scripts used for the identification of lncRNAs can be found at https://github.com/rnajena/bats ncrna. Table 3) . For the other six species we used BLASTn (2.7.1+, 1e −10 ) with the MLU and PVA mitochondrial genomes as queries against the remaining bat genomes. For EHE, we found a possible mtDNA contig in full-length (16 141 nt; AWHC01200796) in the genome assembly. Due to the circularization of mtDNA, we rearranged the sequence of this contig to start with the gene coding for the phenylalanin tRNA and to match the gene order of the other mitochondrial genomes. Only for ESP, RSI, MLY, EFU and MNA, we were not able to detect any possible contigs of mtDNA (Table 3 ). All 11 mitochondrial genomes were annotated with MITOS (49). The ncRNA results were filtered by e-value (threshold 0.001), thus one of two small rRNAs in MBR and RFE and one of two large rRNAs in EHE were discarded as false positive hits (Supplementary Table S10 ). For the five bat assemblies directly including mtDNAs (Table 3) , the MITOS annotations were added to the final merged ncRNA annotation. All other mtDNAs and annotations can be found in Supplementary Table S10 . As we annotated all bat assemblies by using different tools, we needed to merge the resulting GTF files to resolve overlapping annotations and to receive a final annotation of ncRNAs for each bat species. Furthermore, we extended the available NCBI annotations (including protein-and noncoding genes) by integrating our novel ncRNA annotations. The scripts used to merge the different annotation files and to calculate overlaps between annotations can be found at https://github.com/rnajena/bats ncrna. Due to their size, we have not included the lncRNA annotations based on LNCipedia. These can be downloaded and used separately (Supplementary Table S9 ). of novel non-coding annotations. For each bat species, we merged the ncRNA annotations (except for lncRNAs) using a custom script (merge gtf global ids.py). After reading in all features and asserting correct file structure, overlaps were resolved in the following manner: (i) Exons are considered overlapping if >50 % of the shorter one is covered by Table 3 . Mitochondrial bat genomes (mtDNA) publicly available and used for annotation with MITOS (49). For 10 out of the 16 bat species investigated in this study, mtDNA could be found in the NCBI. For four bat species, the mtDNA is also part of the genome assembly as determined using BLASTn. For E. helvum, no mtDNA could be found in the NCBI, but we were able to identify a single contig that is part of the genome assembly as mtDNA using BLASTn and the mitochondrial genomes of the other bats as query. The contig was rearranged to match the gene order of the other mtDNAs. R -found via BLASTn and rearranged Merge of NCBI and novel annotations. We first converted and filtered the NCBI annotations to a compatible format with a custom script (format ncbi.py) and then combined the results with our merged novel ncRNA annotations using the same strategy to resolve overlaps as above, but imposing less strict format rules (merge gtf ncbi. py). At best, a genome assembly represents the full genetic content of a species at chromosome level. Whereas the first complex eukaryotic genomes were generated using Sanger chemistry, today's technologies such as Illumina short-read sequencing and PacBio or Oxford Nanopore long-read approaches are increasingly used (50) . The currently available bat genomes vary widely regarding their assembly quality and completeness (Table 1 and Figure 2 ; Supplementary Table S1 ) and were predominantly assembled by using short Illumina-derived reads and low (∼18 X) (51) up to moderate/higher coverage (77-218 X) approaches (52) (53) (54) (55) (56) 25) . A new assembly of the cave nectar bat (Eonycteris spelaea) (57) was exclusively generated from long-read data derived from the PacBio platform, and the genome of the Egyptian fruit bat (Rousettus aegyptiacus) (58) was assembled by using a hybrid-approach of Illumina and PacBio Table 1 and Supplementary Table S1) Supplementary Table S9. data. These two genomes from the Pteropodidae family are of a generally higher quality (Figure 2 and Supplementary Table S1 ). Regardless of their assembly quality, these genomes need to be annotated to identify regions of interest, for example, encoding for protein-and non-coding genes or other regulatory elements. Current genome annotations, mostly generated by automatic annotation pipelines provided by databases such as the NCBI (59) or Ensembl (37), are predominantly focusing on protein-coding genes and well-studied ncRNAs such as tRNAs and rRNAs. Accordingly, the available bat genome annotations vary a lot regarding their quality, ranging from more comprehensive annotations for longstanding bat genomes such as M. lucifugus or P. vampyrus to annotations on region level, completely missing any coding or non-coding gene annotations at the current NCBI version ( Figure 3 and Table 4 ). Furthermore, by using strandspecific RNA-Seq data, we could show that some genes (e.g. IFNA5/IFNW2 in the Ensembl annotation of M. lucifugus (26) ) are annotated on the wrong strand and are therefore entirely missed by differential expression studies when relying on a strand-specific read quantification. For all publicly available bat genomes, ncRNAs are generally annotated on low levels and are highly incomplete, mostly only comprising some tRNAs, rRNAs, snRNAs, snoRNAs and lncRNAs ( Figure 3 and Table 4 ). Therefore, many ncRNAs, especially miRNAs, are simply overlooked by current molecular studies, for example from RNA-Seq studies that aim to call differential expressed genes based on such in-complete genome annotation files. Studies that have made additional effort on annotating ncRNAs in bats (8, 25, 47, (60) (61) are not reporting their results on a level that can be directly used for further computational assessment (e.g. as a direct input for RNA-Seq abundance estimation). Currently, in the NCBI database, five bat assemblies are entirely lacking any coding/non-coding annotations and miRNAs are not annotated at all ( Table 4 ). The Rfam database (39) contains mainly for M. lucifugus and P. vampyrus 336 ncRNA families. Other ncRNAs are currently unknown from bat genomes or not well documented. The genome assembly status of different bat species varies widely: ranging for example from 29 801 contigs and an N50 of 26 869 735 nt (D. rotundus) to 192 872 contigs and an N50 of only 16 881 nt (M. lyra), see Table 1 and Supplementary Table S1 . Accordingly, the annotation status also varies a lot (Table 4) . Within this work, we give an overview of potential ncRNA annotations in bats. However, the precise number of ncRNAs remains unclear, because of ncRNAs being present several times in the assemblies, and others still remaining undiscovered. To give a better estimation of transcribed and potentially functional ncRNAs, we used six Illumina short-read RNA-Seq data sets derived from four bat species (Table 2) to estimate the expression levels of our novel annotations. Note that we refer throughout this paper to an RNA-Seq data set by the first author's name and the year of the respective data set publication. The only included data set derived from a species of the Yinpterochiroptera suborder (R. aegyptiacus) was obtained from a study dealing with the differential transcriptional responses of Ebola and Marburg virus infections in human and bat cells (16) (data set: Hölzer-2016) . In this study, total RNA of nine samples of R06E-J cells, either challenged by the Ebola or Marburg virus or left un-infected, were harvested at 3, 7 or 23 h post infection (poi) and sequenced. Unfortunately, no biological replicates could be generated for this study. Therefore, we did not use this data set for the differential ex-pression analysis, but also calculated normalized expression values (TPM; transcripts per million) as done for all RNA-Seq data sets. The other five data sets comprise Yangochiroptera species of the Vespertilionidae (M. lucifugus, M. daubentonii) and Miniopteridae (M. natalensis) families (Table 2 ). Field et al. conducted two transcriptomic studies (27, 28) (Field-2015 , Field-2018 using wing tissue of the hibernating little brown myotis bat. They were especially interested in transcriptional changes between un-infected wing tissue and adjacent tissue infected with Pseudogymnoascus destructans, the fungal pathogen that causes the white-nose syndrome. Two other data sets were obtained from virus-(RVFV Clone 13) and interferon (IFN) alphainduced transcriptomes of a Myotis daubentonii kidney cell line (MyDauNi/2c). RNA of mock, IFN and Clone 13 samples were gathered at two time points, 6 and 24 h poi (26) . From the same samples, rRNA-depleted (Hölzer-2019) and smallRNA-concentrated (Weber-2019; see Methods section for details) libraries were generated and sequenced. Finally, we used data of the long-fingered bat M. natalensis initially obtained to characterize the developing bat wing (25) . Here, total RNA was extracted from paired forelimbs and hindlimbs from three individuals at three developmental stage. We have mapped each sample to each bat genome, regardless of the origin of the RNA. Expectedly, the mapping rate decreases in bat species that are evolutionary more far away from the original species from which the RNA was sequenced. However, we were interested to find out which ncRNAs are consistently transcribed in all investigated bat species or only in certain bat families and sub-groups. Over all 16 bat assemblies, we annotated ncRNA families for in total 23 tRNAs, 3 rRNAs, 193 miRNAs (2680 predicted by mirDeep2), 162 snoRNAs, 22 mitochondrial (mt-)tRNAs and 2 mt-rRNAs as well as 244 other ncRNAs additionally derived from the Rfam database (39) (selected ncRNAs are shown in Table 5 ). With a broad approach, we have identified 24 316 potential lncRNAs and defined between 27 149 (M. lyra) and 158 135 (D. rotundus) lncRNA hot spots. All annotations, separated for each ncRNA class and summarized for each bat species, can be found in the Open Science Framework (doi.org/10.17605/OSF.IO/ 4CMDN) and in our Electronic Supplement (rna.uni-jena. de/supplements/bats) and are compatible with the genome assembly versions listed in Table 1 . Thus, our annotations together with the bat genome assemblies obtained from the NCBI can be directly used for subsequent analysis such as differential gene expression detection. We detected 18S and 28S rRNA for the majority of investigated bat species (Supplementary Table S3 ). The varying number of rRNAs is in line with all currently available metazoan genomes, lacking the correct composition of rRNAs due to misassemblies. However, the number of 5.8S rRNA varies a lot between 17 919 for P. vampyrus and 51 for M. natalensis (Table 5 and Supplementary Table S3 ). Interestingly, 4 of 5 Pteropodidae show a higher number of 5.8S rRNAs compared to the other species (∼50-100 fold). Only the P. alecto assembly is in line with the other bat assemblies in regard to the amount of 5.8S rRNA copies. For all bat species, we observed various numbers of tRNAs (Table 5 and Supplementary Table S4 ). We could detect full sets of tRNAs for E. fuscus and M. davidii, whereas between one and four tRNAs are missing from the assemblies of H. armiger, M. brandtii, M. lucifugus, M. lyra, M. natalensis, P. parnellii, R. ferrumequinum, R. sinicus and D. rotundus, and between nine and twelve are missing for E. helvum, P. alecto, P. vampyrus, R. aegyptiacus and E. spelaea (Supplementary Table S4 ). Interestingly, we identified a large number of tRNAs (18 235) for M. lyra in comparison to all other bat species (Supplementary Table S4 ), likely a result of the low assembly quality (Figure 2 and Supplementary Table S1 ). The lowest amount of tRNAs was annotated for P. parnellii and D. rotundus with only 1059 and 1041 copies, respectively. In all other bat genomes, we found between 1735 and 4657 tRNA genes (Supplementary Table S4 ). The tRNA encoding for valine (Val) with the anticodon structure TAC had high copy numbers (over 1000) in all genome assemblies of the Pteropodidae family (Table 5) . Similar copy numbers were achieved by the R. ferrumequinum and R. sinicus assemblies regarding the tRNA encoding for isoleucine (Ile) with the anticodon AAT (Supplementary Table S4 ). For tRNA(Ile) and the anticodon GAT, we also observed high copy numbers in H. armiger. Interestingly, all species with high tRNA(Val) and tRNA(Ile) copy numbers had rather low counts (between 0 and 28) of tRNA(SeC) with the anticodon TCA, while this tRNA was found with higher copy numbers in P. parnelli (68) and in D. rotundus (145) and with even higher counts (between 315 and 498) in all other bat species (Supplementary Table S4 ). However, high copy numbers might be also occur due to assembly quality and false positive predictions of tRNAscan-SE. In Supplementary Table S5 we list all detected snoRNAs, divided into box C/D and box H/ACA types. Overall, we found 162 snoRNA families within the investigated bat species, comprising 88 box C/D, 61 box H/ACA and 13 unclassified snoRNAs. Many snoRNAs were found with exactly one copy present in each bat genome assembly (e.g. SCARNA7), whereas others were found in multiple copies for each bat species (e.g. SNORD112) or completely absent from certain bat families (e.g. ACA64), see Table 5 . Exactly one copy of the small nucleolar RNA ACA64 was found within the genomes of the Pteropodidae family and multiple copies for D. rotundus, P. parnellii, M. natalensis and members of the Vespertilionidae family; however, this snoRNA seems to be completely absent from bat species of the Megadermatidae and Rhinolophidae families ( Table 5 ). The H/ACA box snoRNA ACA64 is predicted to guide the pseudouridylation of 28S rRNA U4331 (62) . Interestingly and as another example, SNORA48 was found in higher copy numbers in the genomes of the Vespertilionidae family. Among others, this H/ACA box snoRNA was described to be commonly altered in human disease (63) . Over all bat assemblies, we detected 193 miRNA families based on Rfam alignments (Supplementary Table S6 ) and predicted between 349 (E. helvum) and 2464 (M. davidii) miRNAs based on the 18 combined small RNA-Seq data sets (Weber-2019) using miRDeep2 (34) (Supplementary Table S7 ). The higher number of miRNAs predicted for Myotis species can be explained because the small RNA-Seq data set is derived from a Myotis daubentonii cell line. Similar to other ncRNA classes, we observe various differences in miRNA copy numbers between the bat families. For example, mir-454 and mir-563 are absent in all Vespertilionidae and M. natalensis (Table 5) , whereas mir-1912 is present in all Yangochiroptera (except D. rotundus) but absent from all Pteropodidae (Supplementary Table S6 ). The miRNA 541 is absent from all Yangochiroptera except D. rotundus. There are many other examples of absent/present miRNAs in certain bat species/families such as mir-662 (absent from Pteropodidae), mir-767 (absent from Yinpterochiroptera) and mir-626 (absent from Rhinolophidae and Megaderma lyra), see Supplementary Table S6 . With miRDeep2, we detected hundreds of potential miRNAs for all investigated bat species (Supplementary Table S7 ). Generally, all 16 bat species can be divided into two groups. For the majority of the bat assemblies (12 out of 16) about 400 miRNAs were predicted. For the other 4 species of the Vespertilionidae family about 5 times as many miR-NAs could be found. This is in concordance with the small RNA-Seq data used for the prediction, that was obtained from Myotis daubentonii kidney cells (see Methods). Nevertheless, ∼400 conserved miRNAs can be predicted in the evolutionary more distant bat species. Table 5 . General genome information for each of the 16 investigated bat assemblies and selected ncRNA examples annotated in this study. We selected ncRNAs with interesting copy number distributions among the investigated bat species. For SNORA38, PVT1 5 and HOTTIP 2 and 3 we additionally found sophisticated differential expression patterns in at least one of the used RNA-Seq data sets (absolute log2 fold-change greater 1, TPM >10). Full tables and detailed information for each ncRNA class (FASTA, STK, GTF files) can be found in the Electronic Supplement online (Supplementary Tables S3-S10) lncRNAs For the annotation of lncRNAs, we have deliberately chosen a broad Blast-based approach, using 107,039 transcripts of potential human lncRNAs obtained from the LNCipedia database (48) . We have consciously chosen this approach, because lncR-NAs have diverse genomic contexts, reveal various functions and act in different biological mechanisms (64) (65) (66) . From 107 039 LNCipedia transcripts, we annotated 24 316 genes and 182 451 lncRNA hot spots. We defined regions in a genome as a lncRNA hot spot, if different LNCipedia transcripts derived from different genes map to the same region (see Methods section for detailed description). Overall, we found between 58 425 and 137 161 potential lncRNA genes in M. lucifugus and R. sinicus, respectively, and between 27 149 and 158 135 lncRNA hot spots in M. lucifugus and D. rotundus, respectively. We annotated the previously described lncRNAs Tbx5-as1 and Hottip (25) in all bat genomes, except Tbx5-as1 in M. lucifugus, presumably due to the lower genome assembly quality (Table 1) . Besides evolutionarily explainable differences in the presence of lncRNAs, we have observed that the number of lncRNAs and lncRNA hot spots increases with increasing assembly quality (e.g. with a higher N50; see Supplementary Figure S1). We have not observed such a clear correlation between assembly quality and annotation results for short ncRNAs. Based on the Rfam alignments, we were able to detect 244 other ncRNA families in addition to the rRNAs, tRNAs, snoRNAs and miRNAs described before (Supplementary Table S8 ). Overlaps with annotated lncRNAs (Supplementary Table S9 ) are intentional, because the Rfam includes only highly structured parts of long ncRNAs. The highest number of ncRNA copies (951) was detected for the U6 spliceosomal RNA in M. brandtii, already known to have a lot of pseudogenes (67) . For 50 ncRNAs such as CAESAR (RF00172), G-CSF SLIDE (RF00183), NRON (RF00636), TUSC7 (RF01879), Xist exon4 (RF01881), LINC00901 (RF01884), and Hammerhead HH9 and HH10 (RF02275, RF02277), we found exactly one copy in each investigated bat genome assembly (Supplementary Table S8 ). Again, we also observed ncRNA families that are lost for some species or entire families, for example the ribozyme CoTC (RF00621) that seems to be absent from all Vespertilionidae members and M. natalensis. For each investigated bat species except E. spelaea, R. sinicus, M. lyra, E.fuscus and M. natalensis (where no mitochondrial contigs could be identified; Table 3 ) mitochondrial protein-coding genes and ncRNAs were annotated (see Methods). In total, 37 mitochondrial genes comprising 22 tRNAs, two rRNAs (12S and 16S) and 13 protein-coding genes were detected for each bat species as known for other metazoans (68) (Supplementary Table S10 ). The mitochondrial genome lengths range from 16 343 nt in E. helvum to 17 783 nt in M. davidii. For the five bat species, where the mitochondrial genome could be identified as a part of the NCBI genome assembly, we appended the mtDNA annotation to the final annotation of ncRNAs. Exemplarily, we investigated known and novel differentially expressed (DE) ncRNAs found in the genome of M. lucifugus in more detail. To this end, we used the RNA-Seq data sets Field-2015 , Field-2018 , Hölzer-2019 and Weber-2019 (Table 2 ) as a basis to identify DE ncRNAs that were newly discovered in this study and were not part of the current NCBI or Ensembl genome annotations for this bat species. More detailed DE results can be found in Supplementary File S2. We filtered for novel M. lucifugus DE ncRNAs by (i) an absolute log 2 fold change (fc) > 1, (ii) an adjusted pvalue < 0.05, and (iii) a TPM > 10. We further manually investigated the expression patterns with the IGV (69) and discarded DE ncRNAs overlapping with the current NCBI (Myoluc2.0 annotation release 102) or Ensembl (Myoluc2.0.96) annotations. Based on the small RNA-Seq comparison of mock and virus-infected (Clone 13) samples 24 h post infection (Weber-2019), we found several miRNAs (Rfam-and mirDeep2-based) and snoRNAs to be differentially expressed ( Figure 4 and details in Supplementary Files S2. 13). In general, replicates of virus-infected and IFN-treated samples 24 h post infection tend to cluster together only based on the expression profiles of small ncRNAs (mainly miRNAs) ( Figure 4A and B) . Most differences can be observed between the 24 h virus-infected and all other samples, which seem to show no clearly distinguishable expression pattern. Interestingly, at 6 h post infection, we see replicates clustering together regardless of their treatment (mock, IFN, Clone 13). Thus, after only 6 h, few miRNAs are differentially expressed and therefore the samples of each replicate (mock, IFN, Clone 13) cluster together, because they have the same passaging history but the passaging history in between the replicates differ (26) . After 24 h, more and more miRNAs are significantly differentially expressed and the samples can be better distinguished based on their treatment ( Figure 4A and B). We observed that, in general, miRNAs tend to be down-regulated ( Figure 4B ; upper half), while snoRNAs tend to be up-regulated (lower half) after 24 h of Clone 13 infection compared to mock. For example, we found a novel miRNA (MLUGD00000002094 in our annotation; predicted by mirDeep2; Supplementary Table S7 ) located in an intron of the protein-coding gene SEMA3G, significantly down-regulated (log 2 fc = -2.56) during Clone 13 infection ( Figure 4C and D) . Based on Rfam alignments we further found a histone 3'-UTR stemloop (RF00032), an RNA element involved in nucleocytoplasmic transport of the histone mRNAs, significantly down-regulated during infection. For the same comparison of mock and virus-infected samples at 24 h, the rRNA-depleted data set (Hölzer-2019) revealed several DE lncRNAs. For example, we found a lncRNA potentially transcribed in an intron of MX1 (MLUGL00000039178) up-regulated (log 2 fc = 3.36) during infection. Another lncRNA (MLUGL00000087396), we found potentially transcribed as a part of two exons of the PLAT protein-coding gene and down-regulated (log 2 fc = −1.67) during viral infection (see details in Supplementary Files S2.11). Interestingly, based on the Field-2015 RNA-Seq data, we found two internal ribosomal entry site (IRES) in the genes VEGFA and ODC with Rfam IDs RF00461 and RF02535, respectively, to be 2-fold up-regulated during P. destructans infection. In this study, we comprehensively annotated ncRNAs in 16 readily available bat genomes obtained from the NCBI database (Table 1) . We provide novel annotations in the common GTF format, following a hierarchical structure of gene, transcript and exon features to allow direct integration of our annotations into already available ones (Supplementary Tables S3-S10 ). Finally, we provide for each bat genome assembly an extended annotation file merged with the protein-and non-coding gene annotations that were already available by the NCBI database (Supplementary Files S1; leaving out potential lncRNAs that can be downloaded separately, see Supplementary Table S9) . We used six RNA-Seq data sets derived from the transcriptomic sequencing of four bat species (Table 2) to calculate normalized expression values for our newly annotated ncRNAs and exemplarily show significantly differential expressed ncRNAs (Figure 4) , which were never before described on such a large scale for any bat species. In addition to the evolutionarily explainable differences in the pure existence and the amount of annotated ncRNAs in bats, we have observed that the assembly quality can also influence the annotation results. While the effects on the annotation of short ncRNAs seem to be small (with some exceptions), the number of identified lncRNAs and lncRNA hot spots increases with increasing assembly quality (e.g. with a higher N50; see Supplementary Figure S1 ). However, this observation may be true for our data and analyses, but it also depends strongly on the annotation method used. Recently, the Bat1K project (http://bat1k.ucd.ie/) was announced as a global effort to sequence, assemble and annotate high-quality genomes of all living bat species (2) . We aim to extend our annotation of ncRNAs regularly and whenever new bat genomes become publicly available. In mid January 2019, 16 new low-coverage bat genomes of nine families were submitted by the Broad Institute to the NCBI genome database. Unfortunately, our timeconsuming and computationally extensive analyses were already completed at this time point. We want to further automate our ncRNA annotation workflow, to easily include these and any new bat (or other mammalian) genomes that will be sequenced and assembled in the future. Our current identification of ncRNAs in bat species will be usable as a resource (Electronic Supplement) for deeper studying of bat evolution, ncRNAs repertoire, gene expression and regulation, ecology, and important host-virus interactions. Detailed information about the bat genomes used in this study, their assembly quality and all ncRNA candidates (in FASTA, STK and GTF format) can be found in the Electronic Supplement (rna.uni-jena.de/supplements/ bats). The final extended annotations for each investigated bat species can be found in Supplementary Files S1 and the lncRNA annotations in Supplementary Table S9 . To allow full reproducibility of our study, all final and intermediate data files (such as used genome files and mapping files in BAM format) were uploaded to the Open Science Framework under accession doi.org/10.17605/OSF.IO/4CMDN. Python scripts used to filter and merge our annotations were deposited at GitHub (github.com/rnajena/ bats ncrna). The virus-infected and IFN-stimulated small RNA-Seq data of the M. daubentonii kidney cell line was uploaded to GEO (GSE132336). Bats: important reservoir hosts of emerging viruses Bat biology, genomes, and the Bat1K Project: to generate chromosome-level genomes for all living bat species An Eocene big bang for bats Phylogeny, genes, and hearing: implications for the evolution of echolocation in bats Mammal species of the world. A taxonomic and geographic reference A molecular phylogeny for bats illuminates biogeography and the fossil record Crowd vocal learning induces vocal dialects in bats: Playback of conspecifics shapes fundamental frequency usage by pups Blood miRNomes and transcriptomes reveal novel longevity mechanisms in the long-lived bat Growing old, yet staying young: The role of telomeres in bats' exceptional longevity Longitudinal comparative transcriptomics reveals unique mechanisms underlying extended healthspan in bats Bats and their virome: an important source of emerging viruses capable of infecting humans Mass extinctions, biodiversity and mitochondrial function: are bats 'special' as reservoirs for emerging viruses? Bats as 'special' reservoirs for emerging zoonotic pathogens Global patterns in coronavirus diversity Bat flight and zoonotic viruses Differential transcriptional responses to Ebola and Marburg virus infection in bat and human cells The immune gene repertoire of an important viral reservoir, the Australian black flying fox Non-coding RNAs: the architects of eukaryotic complexity Novel insight into the non-coding repertoire through deep sequencing analysis Genome annotation: from sequence to biology Applications of next generation sequencing in molecular ecology of non-model organisms Next-generation genome annotation: we still struggle to get it right QUAST: quality assessment tool for genome assemblies Transcriptomic and epigenomic characterization of the developing bat wing Virus-and interferon alpha-induced transcriptomes of cells from the microbat Myotis daubentonii. iScience The white-nose syndrome transcriptome: activation of anti-fungal host responses in wing tissue of hibernating little brown myotis Effect of torpor on host transcriptomic responses to a fungal pathogen in hibernating bats Trimmomatic: a flexible trimmer for Illumina Sequence Data HISAT: a fast spliced aligner with low memory requirements featureCounts: an efficient general purpose program for assigning sequence reads to genomic features Cutadapt removes adapter sequences from high-throughput sequencing reads Quality control and preprocessing of metagenomic datasets 2012) miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 RNA-Seq gene expression estimation with read mapping uncertainty Rfam: an RNA family database Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families Non-coding RNA analysis using the Rfam database Infernal 1.0: inference of RNA alignments Infernal 1.1: 100-fold faster RNA homology searches RALEE--RNA Alignment editor in Emacs RNAmmer: consistent and rapid annotation of ribosomal RNA genes tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence Matching of Soulmates: coevolution of snoRNAs and their targets Characterisation of novel microRNAs in the Black flying fox (Pteropus alecto) by deep sequencing LNCipedia 5: towards a reference set of human long non-coding RNAs MITOS: Improved de novo metazoan mitochondrial genome annotation A comprehensive study of de novo genome assemblers: current challenges and future prospective Genome-wide signatures of convergent evolution in echolocating mammals A high-resolution map of human evolutionary constraint using 29 mammals Comparative analysis of bat genomes provides insight into the evolution of flight and immunity The genomes of two bat species with long constant frequency echolocation calls Hologenomic adaptations underlying the evolution of sanguivory in the common vampire bat Genome analysis reveals insights into physiology and longevity of the Brandt's bat Myotis brandtii Exploring the genome and transcriptome of the cave nectar bat Eonycteris spelaea with PacBio long-read sequencing The Egyptian rousette genome reveals unexpected features of bat antiviral immunity Reference sequence (RefSeq) database at [NCBI: current status, taxonomic expansion, and functional annotation sequencing and annotation for the Jamaican fruit bat (Artibeus jamaicensis) Down but not out: the role of microRNAs in hibernating bats A computational screen for mammalian pseudouridylation guide H/ACA RNAs Small RNAs with big implications: new insights into H/ACA snoRNA function and their role in human disease Long noncoding RNAs: past, present, and future Incredible RNA: dual functions of coding and noncoding cncRNAs: Bi-functional RNAs with protein coding and non-coding functions Evolution of spliceosomal snRNA genes in metazoan animals Animal mitochondrial genomes Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration Icarus: visualizer for de novo assembly evaluation We thank Ivonne Görlich