key: cord-015850-ef6svn8f authors: Saitou, Naruya title: Eukaryote Genomes date: 2013-08-22 journal: Introduction to Evolutionary Genomics DOI: 10.1007/978-1-4471-5304-7_8 sha: doc_id: 15850 cord_uid: ef6svn8f
General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, the C-value paradox, and the relationship between genome size and mutation rates. Genomes of multicellular organisms (plants, fungi, and animals) are then briefly discussed.
Genome duplications sometimes occur in eukaryotes, especially in plants and in vertebrates, but genome duplication is so far not known for prokaryotic genomes. Because the gene number of typical eukaryotic genomes is much larger than that of prokaryotes, there are many genes shared among most eukaryote genomes but absent from prokaryote genomes. Some examples are listed in Table 8.2. For example, myosin is located in animal muscle tissues, and its homologous protein exists in the cytoskeleton of all eukaryotes but is not found in prokaryotes.
Recently, Kryukov et al. (2012; [1]) constructed a new database on oligonucleotide sequence frequencies and conducted a series of statistical analyses. Frequencies of all possible oligonucleotides of length 1-10 were counted for each genome, and these observed values were compared with expected values computed from the observed frequencies of oligonucleotides of length 1-4. Deviations from expected values were much larger for eukaryotes than for prokaryotes, except for fungal genomes. Figure 8.1 shows the distribution of the deviation for various organismal groups. The biological reason for this difference is not known.
There are two major types of organelles in eukaryotes: mitochondria and plastids. Figure 8.2 shows schematic views of mitochondria and chloroplasts. These two organelles have their own independent genomes.
This suggests that they were initially independent organisms which started intracellular symbiosis with primordial eukaryotic cells. Because most eukaryotes have mitochondria, the ancestral eukaryotes, a lineage that emerged from Archaea, most probably started intracellular symbiosis with the mitochondrial ancestor. The parasitic bacterium Rickettsia prowazekii is so far phylogenetically closest to mitochondria [2], and a rickettsia-like bacterium is the best candidate for the mitochondrial ancestor. However, there is an alternative "hydrogen hypothesis" [3]. Plastids include chloroplasts, leucoplasts, and chromoplasts and exist in land plants, green algae, red algae, glaucophyte algae, and some protists such as euglenoids.
Mitochondrial genome sizes of some representative eukaryotes are listed in Table 8.3. Most animal mitochondrial genomes are smaller than 20 kb, and those of protists and fungi are somewhat larger. Mitochondrial genomes of plants are much larger than those of other eukaryotic lineages, yet most are smaller than 500 kb. An ancestral eukaryotic cell, probably of an archaeal lineage, hosted a bacterial cell, and intracellular symbiosis started. Initially, Archaea and Bacteria shared genes responsible for basic metabolism, and the situation was a sort of gene duplication for many genes, though the homologous genes were not identical, having already diverged a long time before. In any case, division of labor followed, and only limited metabolic pathways were left in the bacterial system, which eventually became mitochondria.
Animal mitochondrial genomes contain a very small number of genes: 13 for peptide subunits, 22 for tRNA, and 2 for rRNA [4].
[Table 8.3, fragment] Genome size (kb). Animals: Homo sapiens (human) 16.5; Takifugu rubripes (torafugu fish) 16
representative animal species mitochondrial DNA genomes. Although most vertebrate mitochondrial DNA genomes have the same gene order as in human (Fig. 8.3a), gene order may vary from phylum to phylum.
Yet the gene content and the genome size are more or less constant among animals. It is not clear why animal mitochondrial genomes are so small. One possibility is that animal individuals are highly integrated compared to fungi and plants, and this might have influenced a drastic reduction of the mitochondrial genome size. Another interesting feature of animal mitochondrial DNA genomes is the heterogeneous rate of gene order change. For example, platyhelminthes exhibit great variability in mitochondrial gene order (Sakai and Sakaizumi, 2012; [5]).
In contrast, plant mitochondrial genomes are much larger (see Table 8.3). Figure 8.4 shows the structure of the tobacco mitochondrial genome (from Sugiyama et al. 2005; [6]). Horizontal gene transfers are also known to occur in plant mitochondrial DNAs, even between remotely related species [7]. The melon (Cucumis melo) mitochondrial genome, ca. 2.9 Mb, is exceptionally large, and its draft sequence was recently determined [8]. Interestingly, the melon mitochondrial genome resembles a vertebrate nuclear genome in its contents, although its size is similar to that of a bacterial genome. The protein coding regions account for only 1.7 % of the genome, and about half of the genome is composed of repeats. The remaining part is mostly homologous to melon nuclear DNA, and 1.4 % is homologous to melon chloroplast DNA. Most of the protein coding genes of melon mitochondrial DNA are highly similar to those of the related species watermelon and squash, whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. This indicates that the huge expansion of the melon mitochondrial genome occurred only recently. Interestingly, cucumber (Cucumis sativus), a congeneric species, also has a ~1.8-Mb mitochondrial genome with many repeat sequences [9]. It will be interesting to study whether the mitochondrial genome expansions of melon and cucumber occurred independently.
Chloroplasts exist only in plants, algae, and some protists. They may change into leucoplasts and chromoplasts. Because of this, the generic name "plastids" may also be used. The origin of the chloroplast seems to be a cyanobacterium that started intracellular symbiosis, as in the case of mitochondria. A unique but common feature of chloroplast genomes is the existence of inverted repeats [10], which mainly contain rRNA genes. Chloroplast DNA contents may [11]. Chloroplast genomes were determined for more than 340 species as of December 2013 [106]. Their genome sizes range from 59 kb (Rhizanthella gardneri) to 521 kb (Floydiella terrestris). Although the largest chloroplast genome is still much smaller than a typical bacterial genome, its average intergenic length is 4 kb, much longer than that of bacterial genomes.
Fragments of mitochondrial DNA are sometimes inserted into nuclear genomes, and they are called "numts." An extensive analysis of the human genome found over 600 numts [12]. Their sequences are drawn at random with respect to location in the mitochondrial genome. This suggests that mitochondrial DNAs themselves were inserted, rather than cDNAs reverse-transcribed from mitochondrial mRNA. A possible source is sperm mitochondrial DNA that was fragmented after fertilization [12]. Transfer in the reverse direction, from nucleus to mitochondria, was observed in melon, as discussed in subsection 8.2.1.
An intron is a DNA region of a gene that is eliminated during splicing after transcription of a long premature mRNA molecule. Introns were discovered by Phillip A. Sharp and Richard J. Roberts in 1977 as "intervening sequences" [13], but the name "intron," coined by Walter Gilbert in 1978 [14], is now widely used. It should be noted that some of the description of introns by Kenmochi [15] was used in writing this section. There are various types of introns, but they can be classified into two: those requiring spliceosomes (spliceosome type) and the self-splicing type.
Figure 8.5 shows the splicing mechanisms of these two major types. Most introns in nuclear genomes of eukaryotes are of the spliceosome type, and there are the common GU-AG type and the rare AU-AC type, depending on the nucleotide sequences of the intron-exon boundaries [16]. The spliceosomes involved in these two types differ [17]. Self-splicing introns are divided into three groups: groups I, II, and III. Group I introns exist in organellar and nuclear rRNA genes of eukaryotes and in prokaryotic tRNA genes. Group II introns are found in organellar and some eubacterial genomes. Cavalier-Smith [18] suggested that spliceosome-type introns originated from group II introns because of the similarity in splicing mechanism and the structural similarity between group II introns and spliceosomal RNA. Group III introns exist in organellar genomes, and their splicing system is similar to that of group II introns, though they are smaller and have a unique secondary structure. There is yet another type of intron, which exists only in tRNAs of single-cell eukaryotes and Archaea [19]. These introns do not have self-splicing functions; instead, an endonuclease and an RNA ligase are involved in splicing. This type of intron is often located at a certain position of the tRNA anticodon loop.
After the discovery of introns, their probable functions and evolutionary origin have long been argued (e.g., [20, 21]). Because self-splicing introns could have arisen at any time, even in the very early stage of the origin of life, we consider only spliceosome-type introns. For brevity, we hereafter call this type simply "intron." There are two major hypotheses: introns early and introns late. The former claims that exons existed as functional units from the common ancestor of prokaryotes and eukaryotes, and "exon shuffling" was proposed as a way of creating new protein functions [14]. Introns, which separate exons, should then also be of quite ancient origin [14, 22].
In contrast, introns are considered to have emerged only in the eukaryotic lineage according to the introns-late hypothesis [23, 24]. The protein "module" hypothesis proposed by Go [25] is related to the introns-early hypothesis. Patterns of intron appearance and loss have been estimated by various methods (e.g., [21, 26]). Kenmochi and his colleagues analyzed in detail the introns of ribosomal protein genes of mitochondrial genomes and eukaryotic nuclear genomes [27-29]. These studies supported the introns-late hypothesis, because introns in mitochondrial and cytosolic ribosomal protein genes seem to have independent origins, and introns seem to have emerged in many ribosomal protein genes after eukaryotes appeared. Introns do not code for amino acid sequences by definition. In this sense, most introns may be classified as junk DNAs (see the next section). There are, however, evolutionarily conserved regions in introns, suggesting that some parts of introns have functional roles.
Ohno (1972; [30]) proclaimed that most of the mammalian genome is nonfunctional and coined the term "junk DNA." With the advent of eukaryotic genome sequence data, it is now clear that he was right. There is in fact a great deal of junk DNA in eukaryotic genomes. Junk DNAs, or nonfunctional DNAs, can be divided into repeat sequences and unique sequences. Repeat sequences are of either the dispersed type or the tandem type. Unique sequences include pseudogenes that retain homology with functional genes.
Prokaryote genomes sometimes contain insertion sequences; in many eukaryotic genomes, however, this kind of dispersed repeat constitutes the major portion of the genome. These interspersed elements are divided into two major categories according to their lengths: short ones (SINEs) and long ones (LINEs). One well-known example of a SINE is the Alu element of primate genomes. It is about 300 bp long and originated from the 7SL RNA gene. Let us look at a real Alu element sequence from the human genome.
If we retrieve the DDBJ/EMBL/GenBank International Sequence Database entry with accession number AP001720 (a part of chromosome 21), we find 128 Alu elements in the 340-kb sequence. The density is 0.38 Alu elements per 1 kb. If we extrapolate to the whole human genome of ~3 billion bp, Alu repeats are expected to exist in ~1.13 million copies. One example of an Alu sequence from this entry, at coordinates 133600 to 133906, is shown below:
ggcgggagcg atggctcacg cctgtaatgc cagcactttg ggaggccgag gtgggtggat cacaaggtca ggagatagag accatcctgg ctaacacggt gaaacactgt ctctactaaa aacacaaaaa actagccagg cgtggtggcg ggtgcctgta atcccagcta ctcgggaggc tgaggcagga gaatggtgtg aacccaggaa gtggagcttg cagtgagctc agattgcgcc actgcactcc agcctgggtg acagagtgag actccatctc aaaaaaaata aaataaataa aaaaaa
If we do a BLAST homology search (see Chap. 14) using the DDBJ system (http://blast.ddbj.nig.ac.jp/blast/blastn) targeted to nonhuman primate sequences (the PRI division of the DDBJ database), the best hit is obtained from chimpanzee chromosome 22, which is orthologous to human chromosome 21. Interested readers are encouraged to try this homology search for themselves.
Alu elements were first classified into J and S subfamilies [31]. The reason for choosing these two letters is not clear, but the two authors (Jurka and Smith) probably used the initials of their surnames. In any case, this division was based on the distance from the Alu consensus sequence; Alu elements closer to the consensus were classified as S and the others as J. Later, a subset of the S subfamily was found to be highly similar to each other, and these were named Y after "young," for they appeared relatively recently. Rough estimates of the divergence times of Alu elements are as follows: the J subfamily appeared about 60 million years ago, the S subfamily separated from J 44 million years ago, and Y separated further 32 million years ago [32].
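The Alu copy-number estimate quoted earlier in this passage is simple proportional scaling, and it can be checked with a few lines of Python. The following is only a minimal sketch: the three constants are the figures quoted in the text, the function names are illustrative, and the extrapolation assumes (as the text implicitly does) that the Alu density of this one chromosome 21 entry holds across the whole genome.

```python
# Figures quoted in the text for accession AP001720 (part of chromosome 21).
ALU_COUNT = 128                 # Alu elements found in the entry
ENTRY_LENGTH_BP = 340_000       # entry length, about 340 kb
GENOME_SIZE_BP = 3_000_000_000  # rough human genome size, ~3 billion bp


def density_per_kb(count: int, length_bp: int) -> float:
    """Number of elements per 1 kb of sequence."""
    return count / (length_bp / 1000)


def extrapolated_copies(count: int, length_bp: int, genome_bp: int) -> float:
    """Scale the count in one region up to the whole genome,
    assuming the regional density holds genome-wide."""
    return count * genome_bp / length_bp


density = density_per_kb(ALU_COUNT, ENTRY_LENGTH_BP)
copies = extrapolated_copies(ALU_COUNT, ENTRY_LENGTH_BP, GENOME_SIZE_BP)

print(f"Alu density: {density:.2f} per kb")
print(f"Extrapolated copies: {copies / 1e6:.2f} million")
```

Because Alu density is known to vary among chromosomal regions, a figure obtained this way should be read as an order-of-magnitude estimate rather than a count.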
Figure 8.6 shows the overall pattern of Alu element evolution (based on [32]).
Tandemly repeated sequences are also abundant in eukaryotic genomes, and the representative ones are heterochromatin regions. Heterochromatins are highly condensed nonfunctional regions in nuclear DNA, in contrast to euchromatins, in which many genes are actively transcribed. Heterochromatins usually reside at telomeres, the terminal parts of chromosomes, and at centromeres, the internal parts of chromosomes to which spindle fibers attach during cell division. Centromeric regions of Arabidopsis thaliana spanning more than 1 Mb were found to consist of tandem repeats of a ca. 180-bp repeat unit [33, 34]. The nucleotide sequence below is the Arabidopsis thaliana tandemly repeated sequence AR12 (International Sequence Database accession number X06467):
aagcttcttc ttgcttctca atgctttgtt ggtttagccg aagtccatat gagtctttgt ctttgtatct tctaacaagg aaacactact taggctttta ggataagatt gcggtttaag ttcttatact taatcataca catgccatca agtcatattc gtactccaaa acaataacc
The human genome also has a similar but nonhomologous sequence in centromeres, called "alphoid DNA," with a 171-bp repeat unit [35]. The following is the sequence (International Sequence Database accession number M21746):
catcctcaga aacttctttg tgatgtgtgc attcaagtca cagagttgaa cattcccttt cgtacagcag tttttaaaca ctctttctgt agtatctgga agtgaacatt aggacagctt tcaggtctat ggtgagaaag gaaatatctt caaataaaaa ctagacagaa g
If we do a BLAST homology search (see Chap. 13) targeted to the human genome sequences of the NCBI database, there is no hit with this alphoid sequence. This clearly shows that the human genome sequences currently available are far from complete, for they do not include most of these tandem repeat sequences. Telomeres of the human genome are composed of hundreds of copies of the 6-bp repeat ttaggg. If we search the human genome using NCBI BLAST with a 36-bp query consisting of six tandem copies of this 6-bp unit, many hits are obtained.
As we already discussed in Chap.
4, authentic pseudogenes have no function, and they are genuine members of junk DNAs. When a gene duplication occurs, one of the two copies often becomes a pseudogene. Because gene duplication is prevalent in eukaryote genomes, pseudogenes are also abundant. Pseudogenes are, by definition, homologous to functional genes. However, over a long evolutionary time, many selectively neutral mutations accumulate in pseudogenes, and eventually they lose detectable sequence homology with their functional counterparts. There are many unique sequences in eukaryote genomes, and the majority of them may be this kind of homology-lost pseudogene.
A long RNA is initially transcribed from a genomic region having an exon-intron structure, and then the RNAs corresponding to introns are spliced out. These leftover RNAs may be called "junk" RNAs, for they will soon be degraded by RNase. Only a limited set of genes is transcribed in each tissue of multicellular organisms, but leaky expression of some genes may happen in tissues in which these genes should not be expressed. Again these are "junk" RNAs, and they are swiftly decomposed. A series of studies (e.g., [36, 37]) claimed that many noncoding DNA regions are transcribed. However, van Bakel et al. [38] showed that most of these transcripts were artifacts of the chip-chip technologies used in these studies. If a nonsense or frameshift mutation occurs in a protein coding sequence, that gene cannot make a functional protein. Yet its mRNA may be produced continuously until the promoter or its enhancer becomes nonfunctional. In this case, this sort of mutated gene produces junk RNAs. If only a small quantity of an RNA is found in cells and it is not evolutionarily conserved, it is probably some kind of junk RNA.
As junk DNAs and junk RNAs exist, cells may also have "junk" proteins.
If mature mRNAs are not produced in the expected way, various aberrant mRNA molecules will be produced, and ribosomes will try to translate them into peptides based on this wrong mRNA information. Proteins produced in this way may be called "junk" proteins, for they often have little or no function. Even a protein that is correctly translated and moved to its expected cellular location can still be considered a "junk" protein. One good example is the ABCC11 transporter protein of dry-type cerumen (earwax), for one nonsynonymous substitution in this gene caused the protein to be essentially nonfunctional [39].
There are various genomic features specific to eukaryotes other than the existence of introns and junk DNAs, such as genome duplication, RNA editing, the C-value paradox, and the relationship between genome size and mutation rates. We will briefly discuss them in this section.
The most dramatic and influential change of the genome structure is genome duplication. Genome duplications are also called polyploidization, but this term is tightly linked to karyotypes or chromosome constellation. Prokaryotes are so far not known to have experienced genome duplications, which are restricted to eukaryotes. Interestingly, genome duplications are quite frequent in plants, while they are relatively rare in the other two multicellular eukaryotic lineages. An ancient genome duplication was found from the genome analysis of baker's yeast [40], and Rhizopus oryzae, a basal-lineage fungus, was also found to have experienced a genome duplication [41]. Among protists, Paramecium tetraurelia is known to have experienced at least three genome duplications [42]. Because we humans belong to the vertebrates and two rounds of genome duplication occurred in the common ancestor of vertebrates (see Chap. 9), we may be inclined to think that genome duplications often happen in many animal species. This is not the case.
So far, only vertebrates and some insects are known to have experienced genome duplications. The reason for this scattered distribution of genome duplication occurrences is not known. If we plot the number of synonymous substitutions between duplogs in one genome, it is possible to detect a relatively recent genome duplication. This is because all genes duplicate at once when a genome duplication occurs, producing a peak of duplog pairs with similar divergence, while only a small number of genes duplicate at a time in other modes of gene duplication (see Chap. 3). Figure 8.7 shows a schematic view of the two cases: with and without genome duplication. Lynch and Conery (2000; [44]) applied this method to various genome sequences and found that the Arabidopsis thaliana genome showed a clear peak indicative of a relatively recent genome duplication, while the genome sequences of the nematode Caenorhabditis elegans and the yeast Saccharomyces cerevisiae showed curves of exponential decay. It is interesting to note that genome duplication was not known for Arabidopsis thaliana before its genome sequence was determined, while the genome of Saccharomyces cerevisiae was later shown to have been duplicated a long time ago [40]. When a genome duplication occurred in ancient times, the number of synonymous substitutions may be saturated and cannot give an appropriate result. In this case, the number of amino acid substitutions may be used, even though rates of amino acid substitution vary among proteins. In any case, accumulation of mutations will eventually cause two homologous genes to lose their similarity. Therefore, although the possibility of genome duplications in prokaryotes has so far been rejected [45], it is not possible to infer events of the remote past simply by searching for sequence similarity. We should be careful before reaching a final conclusion.
Modification of particular RNA molecules after they are produced via transcription is called RNA editing.
All three major RNA molecules (mRNA, tRNA, and rRNA) may experience editing [46]. There are various patterns of RNA editing; substitutions, in particular between C and U, and insertions and deletions, particularly of U, are mainly found in eukaryote genomes. Guide RNA molecules are used in one of the main RNA editing mechanisms, where they specify the location of editing, but there are some other mechanisms [47].
It is not clear how the RNA editing mechanism evolved. Tillich et al. [47] studied chloroplast RNA editing and concluded that many nucleotide sites of the chloroplast DNA genome suddenly started to undergo RNA editing, but that the number of edited sites later decreased steadily via mutational changes. They claimed that RNA editing was not involved in gene expression. This result does not give RNA editing a positive significance. Because there are many types of RNA molecules inside a cell, there also exist many sorts of enzymes that modify RNAs. It may be possible that some of them suddenly started to edit RNAs via a particular mutation. RNA editing which did not cause deleterious effects to the genome may have survived by chance at the initial phase. This view suggests the involvement of a neutral evolutionary process in the evolution of RNA editing.
Organisms with complex metabolic pathways have many genes. Multicellular organisms are such examples. Generally speaking, their genome sizes are expected to be large. In contrast, viruses, whose genomes contain only a handful of genes, have small genome sizes. Therefore, their possibility of genome evolution is rather limited. Even if amino acid sequences change rapidly because of high mutation rates, the protein function may not change. Unless the gene number and genome size increase, viruses cannot evolve their genome structures. It is thus clear that an increase of the genome size is crucial for producing the diversity of organisms.
However, genomes often contain DNA regions which are not indispensable. Organisms with large genome sizes have many such junk DNA regions. Because of their existence, the genome size and the gene number are not necessarily highly correlated. This phenomenon was historically called the C-value paradox (e.g., [48]): around 1950, the haploid DNA amount was found to be constant within one species, yet to vary considerably among species (e.g., [49-51]). "C-value" is the amount of haploid DNA, and C probably stands for "constant" or "chromosomes." We now know that the majority of eukaryote genome DNA is junk, and there is no longer a paradox in C-values among species.
56 ]) found conserved noncoding DNA sequences from insects, nematodes, and yeasts by comparing closely related species. We will discuss conserved noncoding sequences (CNSs) of vertebrates further in Chap. 9. As for plants, Kaplinsky [62] ) compared genome sequences of Arabidopsis, grape, rice, and Brachypodium and found >100 times more abundant CNSs from monocots than dicots. Hettiarachchi and Saitou [63] compared genome sequences of 15 plant species and searched for lineage-specific CNSs. They found 2 and 22 CNSs shared by all vascular plants and by angiosperms, respectively, and also confirmed that monocot CNSs are much more abundant than those of dicots.
What kind of relationship exists between genome size and mutation rates? If all the genetic information contained in the genome of an organism is necessary for its survival, the individual will die even if only one gene of its genome loses its function by a mutation. An organism with a small genome size and hence a small number of genes, such as a virus, can survive even if the mutation rate is high. In contrast, organisms with many genes may not be able to survive if highly deleterious mutations happen often. Therefore, such organisms must reduce the mutation rate.
However, when the rate of nucleotide substitution type mutations per generation was compared with the whole-genome size, Lynch (2006; [65]) found a positive correlation. More recently, Lynch (2010; [66]) admitted that for organisms with small genomes, these two values were in fact negatively correlated. However, when eukaryotes with large genomes were compared, a positive correlation was again observed. We have to be careful when we discuss these two contradictory reports. One considered the rate per physical year, while the other used one generation as the unit. Another difference is whether only the size of protein coding regions or the whole-genome size was used. The relationship between the mutation rate and genome size is not simple. Drake et al. (1998; [67]) examined this problem and found that the mutation rate per genome per replication was approximately 1/300 for bacteria, while mutation rates of multicellular eukaryotes vary between 0.1 and 100 per genome per sexual or individual generation. Table 8.4 lists the mutation rate and the genome size for various organisms. Apparently there is no clear tendency.
In this section we discuss genomes of three multicellular lineages of eukaryotes: plants, fungi, and animals. Unfortunately, there seems to be no feature common to the genomes of multicellular organisms, so each lineage is discussed independently.
Arabidopsis thaliana was the first plant species whose genome (125 Mb) was determined, in 2000 [68]. A. thaliana is a model organism for flowering plants (angiosperms), with a generation time of only 2 months. In spite of its small genome size, only 4 % of the human genome, it has 32,500 protein coding genes. The genome sequence of its closely related species, A. lyrata, was also recently determined [69]. Angiosperms are divided into monocots and dicots. A. thaliana is a dicot, and genome sequences of six more species were determined as of December 2013 (see Table 8.5).
Rice, Oryza sativa, is a monocot, and its genome size, 370 ~ 410 Mb, is much smaller than that of the wheat genome. The genomes of its japonica and indica subspecies were determined [70, 71], and the origin of rice domestication is currently in great controversy, particularly regarding single or multiple domestication events (e.g., [72, 73]). The number of protein coding genes in the rice genome is 37,000 ~ 40,000 [74].
Wheat corresponds to the genus Triticum, and there are many species in this genus. The typical bread wheat is Triticum aestivum, and it is a hexaploid with 42 (7 × 6) chromosomes. Its genome arrangement is conventionally written as AABBDD [75]. Because it now behaves as a diploid, genomic sequencing of 21 chromosomes (A1-A7, B1-B7, and D1-D7) is under way (see http://www.wheatgenome.org/ for the current status). The hexaploid genome structure emerged by hybridization of the tetraploid (AABB) cultivated species T. durum and the diploid (DD) wild species Aegilops tauschii [75]. A genome duplication followed hybridization.
Non-seed land plants are ferns, lycophytes, and bryophytes, in the order of closeness to seed plants (e.g., [76]). A draft genome sequence of a moss, Physcomitrella patens, was reported in 2008 [77], followed by genome sequencing of a lycophyte, Selaginella moellendorffii, in 2011 [78]. These genome sequences of different plant lineages are helping to decipher the stepwise evolution of land plants.
The genome sequence of baker's yeast (Saccharomyces cerevisiae) was determined in 1996, as the first for a eukaryotic organism [79]. There are 16 chromosomes in S. cerevisiae, and its genome size is about 12 Mb. There are a total of 8,000 genes in its genome: 6,600 ORFs and 1,400 other genes. The genome-wide GC content is 38 %, slightly lower than that of the human genome.
The proportion of introns is very small compared to that of the human genome, and the average length of one intron is only 20 bp, in contrast to the 1,440-bp average length of exons [80]. As we already discussed, the ancestral genome of baker's yeast experienced a genome-wide duplication [40]. Pseudogenes, which are common in vertebrate genomes, are rather rare in the genome of baker's yeast; they constitute only 3 % of the protein coding genes [80]. Baker's yeast is often considered the model organism for all eukaryotes; however, its genome may not be a typical eukaryote genome.
As of December 2013, genome sequences of more than 400 fungal species are available (see the NCBI genome list at http://www.ncbi.nlm.nih.gov/genome/browse/ for the present situation). Figure 8.9 shows the relationship between genome size and gene number for 88 genomes. There is a clear positive correlation between them. However, there are some outliers. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungal species whose genome sequences have so far been determined, yet its number of genes is only ~7,500 [81]. Three other outlier species are Postia placenta, Ajellomyces dermatitidis, and Melampsora larici-populina, shown as B, C, and D in Fig. 8.9, respectively. Interestingly, these four outlier species are not phylogenetically clustered; two belong to Pezizomycotina of Ascomycota, and the other two belong to Agaricomycotina and Pucciniomycotina of Basidiomycota. If we exclude these four outlier species, a good linear regression is obtained, as shown in Fig. 8.9. This straight line indicates that, on average, one gene corresponds to 2.9 kb of genome in a typical fungal genome. If we apply this average to the truffle genome, its genome size should be ~22 Mb, but the real size is 103 Mb larger. This suggests that there is an unusually large amount of junk DNA in this genome.
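The truffle calculation above can be written out as a short Python sketch. It is only an illustration: the constants are the rounded figures quoted in the text (~7,500 genes, 2.9 kb per gene from the regression in Fig. 8.9, ~125 Mb observed), and the function name is a hypothetical helper, not something taken from the cited study.

```python
# Rounded figures quoted in the text for Tuber melanosporum.
GENE_COUNT = 7_500         # approximate number of genes
KB_PER_GENE = 2.9          # genome length per gene from the fungal regression
OBSERVED_GENOME_MB = 125   # approximate observed genome size


def expected_genome_mb(gene_count: int, kb_per_gene: float) -> float:
    """Genome size predicted from gene number, in Mb (1 Mb = 1,000 kb)."""
    return gene_count * kb_per_gene / 1000


expected = expected_genome_mb(GENE_COUNT, KB_PER_GENE)
excess = OBSERVED_GENOME_MB - expected
excess_fraction = excess / OBSERVED_GENOME_MB

print(f"Expected size: ~{expected:.0f} Mb")
print(f"Excess over expectation: ~{excess:.0f} Mb")
print(f"Excess as a fraction of the genome: {excess_fraction:.1%}")
```

With these rounded inputs the excess comes to a little over 80 % of the genome, which is the portion that the regression does not account for by gene content.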
In fact, 58 % of its genome consists of transposable elements [81]. Since the excess of 103 Mb is about 82 % of the 125-Mb genome, a further 24 % of the truffle genome must be junk DNA of other kinds. Gain and loss of genes in each branch of the phylogenetic tree for fungal species are shown in Fig. 8.10 (based on [81]). It will be interesting to examine genome sizes of species related to the Perigord black truffle, so as to infer the evolutionary period when the genome size expansion occurred.
Fig. 8.9 The relationship between the genome size and gene numbers among 88 fungi genomes
system that is responsible for this is Hox genes. We thus first discuss this gene system in this subsection. The genome of C. elegans, the first animal genome to be determined, will be discussed next, followed by genomes of insects and those of deuterostomes. Because genomes of many vertebrate species have been determined, we discuss them in Chap. 9 and, in particular, the human genome in Chap. 10.
Hox genes were initially found by Edward B. Lewis through studies of homeotic mutations that dramatically change the segmental structure of Drosophila [82]. They code for transcription factors, and a DNA-binding peptide, now called the homeobox domain, was later found in almost all animal phyla [83]. Figure 8.11 shows the Hox gene clusters found in 12 animal groups. There are four Hox clusters in mammalian and avian genomes, and they were most probably generated by the two-round genome duplication in the common ancestor of vertebrates (see Chap. 9). Interestingly, the physical order of Hox genes on chromosomes corresponds to the order of their expression during development, a property called "collinearity" [84]. This suggests that some sort of cis-regulation is operating in Hox gene clusters; in fact, many long transcripts are found, and some of their transcription start sites are highly conserved among vertebrates [85].
Figure 8.12 shows highly conserved [...]

The Hox genes control the expression of different groups of downstream genes, such as transcription factors, elements in signaling pathways, or genes with basic cellular functions. Hox gene products interact with other proteins, in particular those in signaling pathways, and contribute to the modification of homologous structures and the creation of new morphological structures [87].

Other gene families are also thought to be involved in the diversity of animal body plans. One of them is the Zic gene family [88]. The Zic gene family exists in many animal phyla with high amino acid sequence homology in a zinc-finger domain called ZF, and members of this gene family are involved in neural and neural crest development, skeletal patterning, and left-right axis establishment. This gene family has two additional domains, ZOC and ZF-BC. Interestingly, Cnidaria, Platyhelminthes, and Urochordata lack the ZOC domain, and their ZF-BC domain sequences are quite diverged compared to those of Arthropoda, Mollusca, Annelida, Echinodermata, and Chordata. This distribution suggests that Zic family genes with the entire set of the three conserved domains already existed in the common ancestor of bilaterian animals, and some of them may have been lost in parallel in platyhelminthes, nematodes, and urochordates [88]. Interestingly, phyla that lost the ZOC domain have quite distinct body plans, although they are bilaterians.

Caenorhabditis elegans was the first animal species whose genome was sequenced; its 97-Mb draft genome sequence was determined in 1998 [89]. This organism belongs to the phylum Nematoda, which includes a vast number of species [90]. Brenner (1974; [91]) chose this species as a model organism for studying the nervous system because of its short generation time (~4 days) and small size (~1 mm). The following description is based on the information given in the online "WormBook" [86]. There are 22,227 protein coding genes in C.
elegans, including 2,575 alternatively spliced forms, with 79 % confirmed to be transcribed at least partially. The number of tRNA genes is 608, of which 274 are located on the X chromosome. The three kinds of rRNA genes (18S, 5.8S, and 26S) are located on chromosome I in 100-150 tandem repeats, while ~100 5S rRNA genes are also in tandem form but located on chromosome V. The average protein coding gene length is 3 kb, with an average of 6.4 coding exons per gene. In total, protein coding exons constitute 25.6 % of the whole genome. Figure 8.13 shows the distribution of the protein coding genes, and Fig. 8.14 the distribution of exon numbers per gene. Both distributions have long tails. The median sizes of exons and introns are 123 bp and 65 bp, respectively. Introns of C. elegans are quite short compared to those of vertebrate genes (see Chap. 9). The density of protein coding genes varies among chromosomes: it is slightly higher on the five autosomes than on the X chromosome, and higher in the central region of a chromosome than near its ends. Processed, i.e., intronless, pseudogenes are rare, and a total of 561 pseudogenes were reported in WormBase version WS133. About half of them are homologous to functional chemoreceptor genes. Genome sequences of four congeneric species of C. elegans (C. brenneri, C. briggsae, C. japonica, and C. remanei) have been determined (http://www.ncbi.nlm.nih.gov/genome/browse/).

The fruit fly Drosophila melanogaster was used by Thomas Hunt Morgan's group in the early twentieth century and has been used for many genetic studies. Because of this importance, its genome sequence was the first to be determined among arthropods, in 2000 [92]. Heterochromatin regions of ~50 Mb were excluded from sequencing, [...] [93]. Genome sizes of the 12 Drosophila species vary from 145 to 258 Mb, and the number of genes is 15,000-18,000. Interestingly, D. melanogaster has the largest genome size and the smallest number of genes.
Genomes of a total of 12 insect species other than the 12 Drosophila species were sequenced by the end of 2011 [1]. As of December 2013, their genome sizes range from 108 Mb to 540 Mb, a fivefold difference, and the gene numbers range from 9,000 to 23,000.

Deuterostomes contain five phyla: Echinodermata, Hemichordata, Chaetognatha, Xenoturbellida, and Chordata. The genome of the sea urchin Strongylocentrotus purpuratus [94] was determined in 2006. Its genome size is 814 Mb with 23,300 genes. Genomes of two other echinoderms, the sea urchin Lytechinus variegatus and the sea star Patiria miniata, are also being sequenced, as is that of the hemichordate Saccoglossus kowalevskii.

Chordata is classified into Urochordata (ascidians), Cephalochordata (lancelets or amphioxus), and Vertebrata (vertebrates). Because we will discuss genomes of vertebrates in Chap. 9, let us discuss genomes of ascidians and lancelets only. The genome of the ascidian Ciona intestinalis was determined in 2002 [95], and the genome sequence of its congeneric species, C. savignyi, was determined three years later [96]. The genome size of C. intestinalis is ~155 Mb with ~16,000 genes. Interestingly, it contains a group of cellulose-synthesizing enzyme genes, which were probably introduced from some bacterial genomes via horizontal gene transfer [8, 97]. The C. intestinalis genome also contains several genes that are considered to be important for heart development [95], and this suggests that the hearts of ascidians and vertebrates may be homologous. Through the superimposition of phylogenetic trees (see Chapter A2) for five genes coding for muscle proteins, OOta and Saitou [98] estimated that vertebrate heart muscle was phylogenetically closer to vertebrate skeletal muscles. If both results are true, the muscles used in the heart might have been substituted in the vertebrate lineage. The genome sequences of an amphioxus (the cephalochordate Branchiostoma floridae) were determined by Holland et al.
(2008; [104]), and they provide good outgroup sequence data for vertebrates.

Eukaryotic viruses rely on their eukaryotic host species for most metabolic pathways. Therefore, the number of genes in virus genomes is usually very small. For example, influenza A virus has 8 RNA segments coding for 11 proteins, and the total genome size is ~13.6 kb. As in bacteriophages, there are both DNA-type and RNA-type genomes in eukaryotic viruses. Table 8.6 shows one example of classification of eukaryotic viruses based on their genome structure [99]. Genomes of double-strand DNA viruses have four types: circular, simple linear, linear with proteins covalently attached to both ends, and linear with both ends closed. Genomes of single-strand DNA viruses are either circular or linear. Genomes of RNA viruses are all linear, in both single- and double-strand types. Single-strand RNA genomes are classified into two types: plus strand and minus strand. A subset of the single-plus strand RNA genome type is experiencing [...] [100]. Megavirus is phylogenetically close to mimivirus [101], a member of the nucleocytoplasmic large DNA viruses, which include pox virus. Recently, a virus with a still larger genome, Pandoravirus, with a genome of up to ~2.5 Mb, was discovered [105]. The phylogenetic status of these large-genome DNA viruses is unknown at this moment.
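The genome-structure classes enumerated in the text for eukaryotic viruses can be written down as a small data structure. The following Python sketch is only an illustrative encoding of the classes named in the prose (it is not a reproduction of Table 8.6); the class names `NucleicAcid`, `Strand`, and `GenomeType` are assumptions made for this example, and the incompletely described subset of plus-strand RNA genomes is deliberately left out.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class NucleicAcid(Enum):
    DNA = "DNA"
    RNA = "RNA"


class Strand(Enum):
    SINGLE = "single-strand"
    DOUBLE = "double-strand"


@dataclass(frozen=True)
class GenomeType:
    """One genome-structure class (hypothetical encoding for illustration)."""
    nucleic_acid: NucleicAcid
    strand: Strand
    form: str  # topology, plus polarity for single-strand RNA, as worded in the text


# Classes as enumerated in the prose description of eukaryotic virus genomes.
GENOME_TYPES = (
    GenomeType(NucleicAcid.DNA, Strand.DOUBLE, "circular"),
    GenomeType(NucleicAcid.DNA, Strand.DOUBLE, "simple linear"),
    GenomeType(NucleicAcid.DNA, Strand.DOUBLE, "linear, proteins covalently attached to both ends"),
    GenomeType(NucleicAcid.DNA, Strand.DOUBLE, "linear, both ends closed"),
    GenomeType(NucleicAcid.DNA, Strand.SINGLE, "circular"),
    GenomeType(NucleicAcid.DNA, Strand.SINGLE, "linear"),
    GenomeType(NucleicAcid.RNA, Strand.DOUBLE, "linear"),
    GenomeType(NucleicAcid.RNA, Strand.SINGLE, "linear, plus strand"),
    GenomeType(NucleicAcid.RNA, Strand.SINGLE, "linear, minus strand"),
)

# Count how many structural types fall under each (nucleic acid, strand) pair.
counts = Counter((g.nucleic_acid, g.strand) for g in GENOME_TYPES)
for (acid, strand), n in counts.items():
    print(f"{strand.value} {acid.value}: {n} type(s)")
```

Keeping the topology as free text rather than a further enumeration is a design choice for brevity; a stricter model could split topology and polarity into separate fields.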
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
The genome of the cucumber, Cucumis sativus L
Draft genome sequence of the oilseed species Ricinus communis
The Genome of black cottonwood, Populus trichocarpa
The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla
Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential
A new database (GCD) on genome composition for eukaryote and prokaryote genome sequences and their initial analyses
The genome sequence of Rickettsia prowazekii and the origin of mitochondria
The hydrogen hypothesis for the first eukaryote
Mitochondrial genome
The complete mitochondrial genome of Dugesia japonica (Platyhelminthes; Order Tricladida)
The complete nucleotide sequence of the tobacco mitochondrial genome: Comparative analysis of mitochondrial genomes in higher plants and multipartite organization
Widespread horizontal transfer of mitochondrial genes in flowering plants
Determination of the melon chloroplast and mitochondrial genome sequences reveals that the largest reported mitochondrial genome in plants contains a significant amount of DNA having a nuclear origin
Small, repetitive DNAs contribute significantly to the expanded mitochondrial genome of cucumber
The complete nucleotide sequence of the tobacco chloroplast genome: Its gene organization and expression
Changes in the structure of DNA molecules and the amount of DNA per plastid during chloroplast development in maize
Pattern of organization of human mitochondrial pseudogenes in the nuclear genome
Why genes in pieces?
Introns. In Encyclopedia of evolution. Tokyo: Kyoritsu Shuppan
Comprehensive splice-site analysis using comparative genomics
The ever-growing world of small nuclear ribonucleoproteins
Intron phylogeny: A new hypothesis
tRNomics: Analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and Bacteria reveals anticodon-sparing strategies and domain-specific features
The origin of introns and their role in eukaryogenesis: A compromise solution to the introns-early versus introns-late debate?
The evolution of spliceosomal introns: Patterns, puzzles and progress
Genes in pieces: Were they ever together?
Nuclear volume control by nucleoskeletal DNA, selection for cell volume and cell growth rate, and the solution of the DNA C-value paradox
The recent origins of spliceosomal introns revisited
Correlation of DNA exonic regions with protein structural units in haemoglobin
Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution
New maximum likelihood estimators for eukaryotic intron evolution
Analysis of ribosomal protein gene structures: Implications for intron evolution
Intron dynamics in ribosomal protein genes
So much "junk" DNA in our genome
A fundamental division in the Alu family of repeated sequences
Whole-genome analysis of Alu repeat elements reveals complex evolutionary history
Characterization of highly repetitive sequences of Arabidopsis thaliana
Centromeric repetitive sequences in Arabidopsis thaliana
Sequence definition and organization of a human repeated DNA
Empirical analysis of transcriptional activity in the Arabidopsis genome
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project
Most "dark matter" transcripts are associated with known genes
A SNP in the ABCC11 gene is the determinant of human earwax type
Molecular evidence for an ancient duplication of the entire yeast genome
Genomic analysis of the basal lineage fungus Rhizopus oryzae reveals a whole-genome duplication
Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia
Size of the protein-coding genome and rate of molecular evolution
The evolutionary fate and consequences of duplicated genes
Comparative genomics in prokaryotes
Functions and mechanisms of RNA editing
The evolution of chloroplast RNA editing
Chromosome structure and the C-value paradox
La teneur du noyau cellulaire en acide désoxyribonucléique à travers les organes, les individus et les espèces animales (in French)
Nucleoprotein determination in cytological preparations
The constancy of deoxyribose nucleic acid in plant nuclei
Conserved linkage between the puffer fish (Fugu rubripes) and human genes for platelet-derived growth factor receptor and macrophage colony-stimulating factor receptor
Conserved noncoding sequences are reliable guides to regulatory elements
Enrichment of regulatory signals in conserved non-coding genomic sequence
Evolution at two levels: On genes and form
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
Utility and distribution of conserved noncoding sequences in the grasses
Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution
Conserved noncoding sequences in the grasses
Arabidopsis intragenomic conserved noncoding sequence
The banana (Musa acuminata) genome and the evolution of monocotyledonous plants
Computational analysis and characterization of UCE-like elements (ULEs) in plant genomes
Identification and analysis of conserved noncoding sequences in plants
Viral mutation rates
The origins of eukaryotic gene structure
Evolution of the mutation rate
Rates of spontaneous mutation
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
The Arabidopsis lyrata genome sequence and the basis of rapid genome size change
A draft sequence of the rice genome (Oryza sativa L. ssp. japonica)
A draft sequence of the rice genome
Phylogeography of Asian wild rice, Oryza rufipogon, reveals multiple independent domestications of cultivated rice, Oryza sativa
Independent domestication of Asian rice followed by gene flow from japonica to indica
Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana
Multigene phylogeny of land plants with special reference to bryophytes and the earliest land plants
The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants
The Selaginella genome identifies genetic changes associated with the evolution of vascular plants
Overview of the yeast genome
Origin of genome architecture
Perigord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis
Master control genes in development and evolution: The homeobox story
From DNA to diversity
Evolution of conserved non-coding sequences within the vertebrate Hox clusters through the two-round whole genome duplications revealed by phylogenetic footprinting analysis
WormBook - The online review of C. elegans biology
Function and specificity of Hox genes
A wide-range phylogenetic analysis of Zic proteins: Implications for correlations between protein structure conservation and body plan complexity
Genome sequence of the nematode C. elegans: A platform for investigating biology
An improved molecular phylogeny of the Nematoda with special emphasis on marine taxa
The genetics of Caenorhabditis elegans
The genome sequence of Drosophila melanogaster
Evolution of genes and genomes on the Drosophila phylogeny
The genome of the sea urchin Strongylocentrotus purpuratus
The draft genome of Ciona intestinalis: Insights into chordate and vertebrate origins
Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi
A functional cellulose synthase from ascidian epidermis
Phylogenetic relationship of muscle tissues deduced from superimposition of gene trees
Genome science and microorganismal molecular genetics
Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae
The 1.2-megabase sequence of mimivirus
Ultraconserved elements in the human genome
Genomu Shinkagaku Nyumon (written in Japanese, meaning 'Introduction to evolutionary genomics')
The amphioxus genome illuminates vertebrate origins and cephalochordate biology
Pandoraviruses: Amoeba viruses with genomes up to 2.5 Mb reaching that of parasitic eukaryotes