Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested and cached your content into a collection (corpus). It then applied sets of natural language processing and text mining techniques to the collection. The results of this process were reduced to a database file, a 'study carrel'. The study carrel can then be queried, bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

This report is a terse narrative report, and when processing is complete you will be linked to a more complete narrative report.

Eric Lease Morgan

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
49

Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
6457

Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
45

Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
46 genome
16 dna
16 RNA
14 virus
12 sequence
8 gene
5 human
4 viral
4 protein
4 figure
4 SARS
3 mutation
3 Genome
2 sequencing
2 recombination
2 pathogen
2 disease
2 HGP
1 trait
1 tool
1 technology
1 subsp
1 stability
1 ssr
1 product
1 probe
1 poliovirus
1 plant
1 pig
1 pestis
1 patent
1 pan
1 pallidum
1 natural
1 model
1 malaria
1 isolate
1 insert
1 host
1 genomic
1 genetic
1 fragment
1 datum
1 crop
1 clinical
1 chapter
1 cell
1 cat
1 ascaris
1 array

Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
4053 genome
2756 virus
2155 sequence
1850 gene
1148 protein
817 analysis
776 dna
760 cell
699 mutation
675 host
632 datum
604 disease
537 strain
516 study
516 recombination
507 sequencing
490 region
475 replication
474 type
445 number
436 specie
435 infection
432 population
410 poliovirus
402 %
399 pathogen
388 rate
386 example
371 evolution
355 structure
349 time
349 size
328 approach
319 site
312 information
310 tool
307 diversity
296 method
296 expression
294 organism
293 system
286 result
281 mechanism
281 level
277 sample
275 model
274 function
270 database
264 figure
261 polymerase

Top 50 proper nouns; "What are the names of persons or places?"
--------------------------------------------------------------
1357 RNA
1326 al
1108 et
995 .
356 Genome
244 SARS
173 DNA
129 C
119 Human
110 Fig
105 Virus
98 China
97 GenBank
96 NCBI
86 CoV-2
85 PCR
83 SNP
81 kb
81 B
81 A
80 Complete
79 CoV
77 Yersinia
74 Coronavirus
72 WGS
72 HIV-1
67 T
67 Strain
66 C.
65 Y.
65 T.
65 Project
63 bp
63 Wimmer
63 National
63 Institute
63 ExoN
63 E.
61 S.
60 HIV
59 Figure
58 Table
55 picornavirus
55 IRES
53 Treponema
53 S
52 SNPs
52 NIAID
52 Europe
51 HGP

Top 50 personal pronouns; "To whom are things referred?"
-------------------------------------------------------------
1044 it
654 we
422 they
97 them
73 i
65 he
49 us
32 one
30 itself
22 themselves
12 you
9 him
7 p~
3 she
3 himself
2 u
2 ourselves
2 https://github.com/ababaian/serratus
1 mine
1 https://serratus.io
1 her
1 hadv-4
1 coronaspades

Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
10696 be
2292 have
1138 use
463 identify
449 include
415 base
396 provide
360 show
341 find
321 do
286 contain
280 associate
273 know
267 develop
259 sequence
251 suggest
246 cause
239 make
234 require
230 reveal
229 generate
221 lead
220 encode
213 produce
210 allow
207 occur
207 determine
206 result
202 increase
197 follow
194 predict
191 relate
190 give
190 express
189 involve
183 see
177 describe
169 isolate
168 consider
158 compare
148 detect
148 code
144 represent
140 infect
137 target
136 emerge
135 indicate
133 remain
133 become
129 perform

Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
1112 viral
926 not
878 human
724 also
716 genetic
650 other
639 high
582 such
571 more
507 -
496 only
474 genomic
469 new
445 different
422 large
397 most
375 single
374 well
370 however
322 many
312 first
293 molecular
290 specific
269 small
267 important
258 complete
256 whole
244 nucleotide
243 as
235 low
233 same
230 non
224 evolutionary
215 long
214 multiple
213 infectious
212 several
211 possible
209 similar
207 clinical
206 available
193 bacterial
189 highly
180 cellular
176 novel
174 functional
172 microbial
172 biological
163 major
161 early

Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------
136 most
53 least
49 good
29 Most
27 large
26 high
18 close
13 small
11 strong
11 great
10 early
9 low
9 late
8 simple
4 near
4 long
2 weak
2 short
2 old
2 hot
2 fit
2 bad
1 ~15
1 wide
1 northernmost
1 little
1 innermost
1 flat
1 fast
1 clever
1 buildt
1 big

Top 50 lemmatized superlative adverbs; "How do things do to the extreme?"
------------------------------------------------------------------------
261 most
42 least
29 well
2 shortest
1 long
1 close

Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
17 github.com
8 s3.amazonaws.com
7 www.ncbi.nlm.nih.gov
7 serratus.io
4 www.niaid.nih.gov
4 www.ebi.ac.uk
4 www
4 doi.org
3 www.ncbi.nlm.nih
2 www.who.int
2 www.broadinstitute.org
2 submit.ncbi.nlm.nih.gov
2 nextstrain.org
2 gmod.org
2 bioconductor.org
1 xmtb
1 www3.niaid.nih.gov
1 www3
1 www.wheatgenome.org
1 www.wdcm.org
1 www.secondarymetabolites
1 www.rostlab.org
1 www.ridom.de
1 www.ridom.com
1 www.predictprotein.org
1 www.paintmychromosomes.com
1 www.oxfordjournals.org
1 www.ostp
1 www.istm.org
1 www.inforsense.com
1 www.iedb.org
1 www.healthmap.org
1 www.hackseq.com
1 www.fruitfly.org
1 www.fludb.org
1 www.epicov.org
1 www.ensembl.org
1 www.doe-mbi.ucla.edu
1 www.dnastar.com
1 www.csgid.org
1 www.cogconsortium.uk
1 www.cdc.gov
1 www.broad.mit.edu
1 www.brccentral.org
1 www.boldsystems
1 www.angis.org.au
1 www.r-project.org
1 woldlab.caltech.edu
1 wishart.biology
1 virological.org

Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
4 http://www
3 http://www.ncbi.nlm.nih
3 http://serratus.io/access
3 http://serratus.io
3 http://github.com/rcs333/VAPiD
2 http://www.niaid.nih.gov/dmid/genomes/
2 http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html
2 http://nextstrain.org
2 http://gmod.org
2 http://github.com/serratus-bio/tantalus
2 http://github.com/ababaian/serratus
1 http://xmtb
1 http://www3.niaid.nih.gov/research/topics/
1 http://www3
1 http://www.who.int/tdr
1 http://www.who.int/csr/disease/plague/Plague-map-2016.pdf
1 http://www.wheatgenome.org/
1 http://www.wdcm.org
1 http://www.secondarymetabolites
1 http://www.rostlab.org/
1 http://www.ridom.de/traceedit/
1 http://www.ridom.com/seqsphere/
1 http://www.predictprotein.org/
1 http://www.paintmychromosomes.com
1 http://www.oxfordjournals.org/nar/database/
1 http://www.ostp
1 http://www.niaid.nih.gov/dmid/genomes/mscs/
1 http://www.niaid.nih.gov/dmid/
1 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394299/
1 http://www.ncbi.nlm.nih.gov/genome/browse/
1 http://www.ncbi.nlm.nih.gov/COG
1 http://www.ncbi.nlm.nih.gov/BLAST
1 http://www.ncbi.nlm.nih.gov
1 http://www.istm.org/geosentinel/main.html
1 http://www.inforsense.com
1 http://www.iedb.org
1 http://www.healthmap.org/en
1 http://www.hackseq.com
1 http://www.fruitfly.org/seq_tools/promoter.html
1 http://www.fludb.org/
1 http://www.epicov.org
1 http://www.ensembl.org
1 http://www.ebi.ac.uk/interpro/
1 http://www.ebi.ac.uk/Bzerbino/velvet
1 http://www.ebi.ac.uk/Bzerbino/oases
1 http://www.ebi.ac.uk
1 http://www.doe-mbi.ucla.edu/TB
1 http://www.dnastar.com/products/lasergene.php
1 http://www.csgid.org
1 http://www.cogconsortium.uk/data/

Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
1 ytliu@ucsd.edu
1 journals.permissions@oup.com
1 gb-admin@ncbi.nlm.nih.gov
1 christine.burkard@roslin.ed.ac.uk
1 celniker@fruitfly.org

Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
-------------------------------------------------------------------------------
11 genome sequence data
4 % sequence identity
3 data are available
3 gene finding hmm
3 genes have also
3 genome is often
3 genome sequence analysis
3 genome sequences available
3 genomes are not
3 proteins are also
3 recombination does not
3 recombination is also
3 sequence is present
3 sequences are important
3 sequences are similar
3 viruses are also
3 viruses are not
3 viruses are often
3 viruses have not
3 viruses is not
2 cells are capable
2 data were recently
2 disease is endemic
2 dna sequence data
2 gene finding algorithms
2 gene was stably
2 genes are also
2 genes are not
2 genes are often
2 genes are well
2 genes using conditional
2 genome does not
2 genome have already
2 genome is still
2 genome reveals features
2 genome sequence information
2 genome sequence length
2 genome was completely
2 genomes are extremely
2 genomes are highly
2 genomes are much
2 genomes are present
2 genomes are routinely
2 genomes contain tetra
2 genomes containing penta
2 genomes do not
2 genomes has not
2 host is not
2 hosts are also
2 infections are self

Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
2 recombination does not necessarily
1 % was not similar
1 genes are not identical
1 genomes are not naked
1 genomes are not robust
1 genomes has not only
1 genomes is not only
1 genomes were not only
1 host is not able
1 host is not that
1 infections do not typically
1 infections is not well
1 mutations are not easily
1 mutations are not necessarily
1 mutations were not present
1 number are not necessarily
1 proteins have no clear
1 recombination shows no appreciable
1 recombination was not essential
1 replication was not significantly
1 sequence has no nucleotides
1 sequence has no protein
1 sequences are not contiguous
1 sequences have not yet
1 sequences were not public
1 virus does not solely
1 virus is not well
1 viruses are not homogeneous
1 viruses are not pathogenic
1 viruses did not readily
1 viruses have no autonomous
1 viruses have no mechanisms
1 viruses is not well

A rudimentary bibliography
--------------------------

id = cord-310406-5pvln91x
author = Asbury, Thomas M
title = Genome3D: A viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome
date = 2010-09-02
keywords = datum; genome; model
summary = RESULTS: We have applied object-oriented technology to develop a downloadable visualization tool, Genome3D, for integrating and displaying epigenomic data within a prescribed three-dimensional physical model of the human genome. In addition, in spite of the many recent efforts to measure and model the genome structure at various resolutions and detail [3] [4] [5] [6] [7] [8] [9] [10], little work has focused on combining these models into a plausible aggregate, or has taken advantage of the large amount of genomic and epigenomic data available from new high-throughput approaches.
The viewer is designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the complete physical genome. An integrated physical genome model can show the interplay between histone modifications and other genomic data, such as SNPs, DNA methylation, the structure of gene, promoter and transcription machinery, etc. In addition to epigenomic data, the physical genome model also provides a platform to visualize high-throughput gene expression data and its interplay with global binding information of transcription factors.
doi = 10.1186/1471-2105-11-444

id = cord-301709-kvyes2lz
author = Baker, Susan C.
title = Developing Bioinformatic Resources for Coronaviruses
date = 2006
keywords = genome
summary = The database will contain high-quality curated data: sequence annotations from published whole and partial genomes; relevant experimental data; metabolic pathway data; taxonomic data; literature citations; and a suite of visualization and analysis tools. The results of these programs and searches assembled by the annotation pipeline are used to propose biological features that are also stored in the curation database that uses the Genomics Unified Schema (GUS). For the purposes of defining a minimal, non-redundant set of genes characteristic of the category, one genome (usually the best-known or best-characterized) is identified as the "reference genome"; the remaining members of the class are called "associated genomes." For example, the Tor2 and Urbani isolates were the first two SARS coronavirus genomes to be sequenced and therefore were named as reference genomes. This allows high-value, manually curated information from the corresponding reference genes to be automatically linked to the associated genes, provided minimal similarity criteria based on automated sequence analysis are satisfied.
doi = 10.1007/978-0-387-33012-9_70

id = cord-003316-r5te5xob
author = Balloux, Francois
title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
date = 2018-12-17
keywords = AMR; WGS; clinical; genome; sequence; sequencing
summary = WGS-based strain identification gives a far superior resolution. In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
doi = 10.1016/j.tim.2018.08.004

id = cord-340423-f8ab7413
author = Barr, J.N.
title = Genetic Instability of RNA Viruses
date = 2016-09-09
keywords = RNA; genome; mutation; viral; virus
summary = We then discuss evidence that at least some RNA viruses have a replication fidelity that is poised to maximize genome sequence space without incurring catastrophic lethal mutations and describe how this can be exploited to control viral infections. The error-prone nature of polymerase activity, coupled with the absence of a proofreading mechanism, is the key reason why RNA virus genomes acquire mutations and exist as a swarm of genetic variants.
The mutation rate of the viral polymerase, coupled with the replication mode that the virus employs (and extrinsic factors, described in the following text) will determine the extent of genetic variability of viruses released from an infected cell. Thus, it is possible that the high mutation rates of RNA viruses are simply a consequence of polymerases that are under selective pressure to replicate genomes very rapidly to ensure efficient viral infection [79] [80] [81].
doi = 10.1016/b978-0-12-803309-8.00002-1

id = cord-000012-p56v8wi1
author = Bigot, Yves
title = Molecular evidence for the evolution of ichnoviruses from ascoviruses by symbiogenesis
date = 2008-09-18
keywords = dna; gene; genome; protein; virus
summary = CONCLUSION: Our results provide molecular evidence supporting the origin of ichnoviruses from ascoviruses by lateral transfer of ascoviral genes into ichneumonid wasp genomes, perhaps the first example of symbiogenesis between large DNA viruses and eukaryotic organisms. With respect to both species number and mechanisms that lead to successful parasitism, endoparasitic wasps are known to inject secretions at oviposition, but only a few lineages use viruses or virus-like particles (VLPs) to evade or to suppress host defences. Extending our investigations to proteins encoded by open reading frames of certain ascoviruses and bracoviruses, hosts and bacteria, in the light of recent analyses about the involvement of the replication machinery of virus groups related to ascoviruses in lateral gene transfer [29], we discuss the robustness and the limits of the molecular evidence supporting an ascovirus origin for ichnovirus lineages.
doi = 10.1186/1471-2148-8-253

id = cord-005281-wy0zk9p8
author = Blinov, V. M.
title = Viral component of the human genome
date = 2017-05-09
keywords = RNA; dna; genome; host; virus
summary = In the human genome, this capacity is determined by the portion of chromosomal DNA, which does not contain species-specific protein-encoding sequences and, thus, can basically make a place for novel information that will be modified to reach a new balance. In fact, the scope of the described phenomena is not limited to retroviruses as such, since the ubiquity of retroviral elements in animal genomes, their activity in germline cells [31], along with the fact that viral replication depends significantly on RNA expression, allow retroviruses to contribute in different ways to the insertion of nonretroviral genes into animal germline cells. Finally, the ability to incorporate parts of the viral genome into the chromosomal DNA of host germline cells can vary strongly among different taxonomic groups of viruses, i.e., orders, families, genera, and even species. If insertions of viral sequences remain functionally active in the host cell genome, they can give rise to either proteins that function in a new environment or untranslated RNAs of different sizes.
doi = 10.1134/s0026893317020066

id = cord-012473-p66of6kq
author = Celniker, Susan E.
title = Unlocking the secrets of the genome
date = 2009-06-17
keywords = dna; genome
summary = The primary objective of the Human Genome Project was to produce high-quality sequences not just for the human genome but also for those of the chief model organisms: Escherichia coli, yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans), fly (Drosophila melanogaster) and mouse (Mus musculus). Free access to the resultant data has prompted much biological research, including development of a map of common human genetic variants (the International HapMap Project) [1], expression profiling of healthy and diseased cells [2] and in-depth studies of many individual genes.
On the basis of this experience, the NHGRI launched two complementary programmes in 2007: an expansion of the human ENCODE project to the whole genome (www.genome.gov/ENCODE) and the model organism ENCODE (modENCODE) project to generate a comprehensive annotation of the functional elements in the C. The research communities that study these two organisms will rapidly make use of the modENCODE results, deploying powerful experimental approaches that are often not possible or practical in mammals, including genetic, genomic, transgenic, biochemical and RNAi assays.
doi = 10.1038/459927a

id = cord-304498-ty41xob0
author = Denison, Mark R
title = Coronaviruses: An RNA proofreading machine regulates replication fidelity and diversity
date = 2011-03-01
keywords = ExoN; RNA; SARS; genome; virus
summary = Genetic inactivation of ExoN activity in engineered SARS-CoV and MHV genomes by alanine substitution at conserved DE-D-D active site residues results in viable mutants that demonstrate 15- to 20-fold increases in mutation rates, up to 18 times greater than those tolerated for fidelity mutants of other RNA viruses. The high mutation rates of RNA viruses also render them particularly susceptible to repeated genetic bottleneck events during replication, transmission between hosts or spread within a host, resulting in progressive deviation from the consensus sequence associated with decreased viral fitness and sometimes extinction.
doi = 10.4161/rna.8.2.15013

id = cord-022128-r8el8nqm
author = Domingo, Esteban
title = Molecular basis of genetic variation of viruses: error-prone replication
date = 2019-11-08
keywords = HIV-1; RNA; chapter; dna; genome; mutation; recombination; virus
summary =
doi = 10.1016/b978-0-12-816331-3.00002-7

id = cord-316033-xg8eb2nm
author = Easton, Alice
title = Molecular evidence of hybridization between pig and human Ascaris indicates an interbred species complex infecting humans
date = 2020-11-06
keywords = SNP; ascaris; dna; figure; genome
summary = suum transcripts (Jex et al., 2011; Wang et al., 2017) to the human Ascaris germline assembly to annotate the genome, identifying and classifying 17,902 protein-coding genes (Table 1, Supplementary file 1). As this reference-based assembly exhibits the best assembly attributes, including high continuity with a large N50, low gaps and unplaced sequences, and high-quality protein-coding genes (see Table 1), we suggest that this version should be used as a reference germline genome for a human Ascaris spp. We next took advantage of the abundant reads from the mitochondrial genome in our sequencing data (on average 7690X coverage, see Supplementary file 1) to perform de novo assembly of 68 complete human Ascaris spp. Furthermore, there were no significant associations between mitochondrial sequence variations and other factors (e.g. village, household, time of worm collection, host) based on PERMANOVA (see methods and Table 2) after translating the phylogenetic tree into a distance matrix, suggesting not only a lack of differentiation into distinct species but also a potentially large interbreeding population of worms being transmitted between individuals and across villages.
doi = 10.7554/elife.61562

id = cord-334394-qgyzk7th
author = Edgar, Robert C.
title = Petabase-scale sequence alignment catalyses viral discovery
date = 2020-08-10
keywords = Extended; Figure; RNA; SRA; Serratus; genome; sequence
summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for the Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5]) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5].
doi = 10.1101/2020.08.07.241729

id = cord-016798-tv2ntug6
author = Gautam, Ablesh
title = Bioinformatics Applications in Advancing Animal Virus Research
date = 2019-06-06
keywords = genome; sequence; tool; viral; virus
summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition, and on tools for analysing host-viral interactions, gene prediction in the viral genome, etc.
This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that differ from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
doi = 10.1007/978-981-13-9073-9_23

id = cord-017932-vmtjc8ct
author = Georgiev, Vassil St.
title = Genomic and Postgenomic Research
date = 2009
keywords = NIAID; gene; genome; sequence
summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens.
The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
doi = 10.1007/978-1-60327-297-1_25

id = cord-348059-wa1gjbck
author = Gibbs, Richard A.
title = The Human Genome Project changed everything
date = 2020-08-07
keywords = Genome; HGP
summary = Thirty years on from the launch of the Human Genome Project, Richard Gibbs reflects on the promises that this voyage of discovery bore. He developed basic methods for DNA and mutation analysis and was an early contributor to the Human Genome Project (HGP), leading one of five sites that generated the majority of the sequence. The power of advances in genomics and computers was revealed in the spectacular series of post-HGP projects that were of comparable scale. Some still tally the success of the HGP from lists of new drugs or therapies and argue that world-changing examples in biology, such as the spectacular advances of gene editing tools or the expansion of cancer therapeutics through targeted immunotherapy, are largely based on microbial, cellular and animal studies rather than genomics.
doi = 10.1038/s41576-020-0275-3

id = cord-350747-5t5xthk6
author = Gmyl, A. P.
title = Diverse Mechanisms of RNA Recombination
date = 2005
keywords = RNA; fragment; genome; recombination; virus
summary = It was believed until recently that the only possible mechanism of RNA recombination is replicative template switching, with synthesis of a complementary strand starting on one viral RNA molecule and being completed on another.
An illustrative example of deletions is provided by defective interfering (DI) genomes, which accumulate in a virus population upon high-multiplicity infections and lack a fragment of the sequence coding for viral proteins [5] [6] [7]. A special role in the variation of RNA viruses is played by recombination, the generation of new genomes from two or more parental RNAs. Recombination between viral RNA molecules was observed for the first time as early as in the 1960s in the poliovirus [14, 15]. In other words, it is possible to assume that some of the mechanisms of nonreplicative RNA recombination play an important role in the evolution of not only viral, but also cell genomes [51, 90].
doi = 10.1007/s11008-005-0069-x

id = cord-022262-ck2lhojz
author = Gromeier, Matthias
title = Genetics, Pathogenesis and Evolution of Picornaviruses
date = 2007-09-02
keywords = IRES; RNA; Wimmer; figure; genome; poliovirus; protein; virus
summary = The following viruses have been recognized as picornaviruses on the basis of their genome sequences and physico-chemical properties as well as the result of comparative sequence analyses (see the section on Evolution): equine rhinovirus types 1 and 2, Aichi virus, porcine enterovirus, avian encephalomyelitis virus, infectious flacherie virus of silkworm. Clusters of enteroviruses refer to groups of enteroviruses arranged predominantly according to genotypic kinship (Hyypia et al., 1997). Briefly, when expression vectors (Figure 12.6E) consisting of a gag gene (encoding p17-p24; 1161 nt) of human immunodeficiency virus that was fused to the N-terminus of the poliovirus polyprotein (Andino et al., 1994; Mueller and Wimmer, 1998) were analysed after transfection into HeLa cells, the genomes were not only found to be severely impaired in viral replication but they were also genetically unstable (Mueller and Wimmer, 1997).
doi = 10.1016/b978-012220360-2/50013-1

id = cord-267714-ji88tvsl
author = Jakupciak, John P.
title = Biological agent detection technologies
date = 2009-04-21
keywords = dna; genome; sequencing
summary = PCR-based methods have critical limitations, since they depend on a priori knowledge of what sequence to detect in a sample, further complicated by recent demonstrations of greater variability in genomic sequence than expected. A platform for genome identification of a specimen from any source must not only be sensitive and specific, but must also detect a variety of pathogens with high accuracy, including modified or previously uncharacterized agents, and this challenge is daunting when identification must be achieved using nucleic acids in a complex sample matrix. The build-out of genome identification DNA sequencing technology in the form of practical instrumentation will be achieved by incorporating the critical requirements for accurate long reads, without dependency for template amplification, capable of manipulating terabytes of data to provide reliable and useful identification of genetic sequences within any unknown sample, whether clinical, environmental, or other type of specimen.
doi = 10.1111/j.1755-0998.2009.02632.x

id = cord-004123-1s8kuno2
author = Jaiswal, Arun Kumar
title = The pan-genome of Treponema pallidum reveals differences in genome plasticity between subspecies related to venereal and non-venereal syphilis
date = 2020-01-10
keywords = Treponema; genome; pallidum; subsp
summary = pallidum strains isolated from different parts of the world and a diverse range of hosts were comparatively analysed using pan-genomic strategy. pertenue, we found differences in the presence/absence of pathogenicity islands (PAIs) and genomic islands (GIs) on subsp.-based study.
In this work, we perform a pan-genome approach to better understand the differences of Treponema pallidum infections in the broad spectrum and how genome plasticity is related to the symptom patterns. Finally, we provide insights into the specific subsets (singletons and the pan- and core genomes) of 53 genomes of T. pallidum strains and correlate these subsets with the plasticity of pathogenicity islands and virulence genes. The subspecies responsible for non-venereal syphilis is Treponema pallidum subsp. Genes which are present in pallidum subspecies pathogenicity islands (PAIs) or genomic islands (GIs) are absent in the subspecies endemicum and pertenue.
doi = 10.1186/s12864-019-6430-6

id = cord-324811-yjwavea5
author = Kidgell, Claire
title = Elucidating genetic diversity with oligonucleotide arrays
date = 2005
keywords = dna; genome
summary = Oligonucleotide microarrays, predominantly high-density oligonucleotide arrays, have emerged as the principal platforms for performing genome-wide diversity analysis. Since a number of complex issues still remain with high-throughput microarray-based SNP genotyping in humans, in the remainder of this review, we will discuss the application of high-density oligonucleotide arrays to elucidate genetic diversity, with particular focus on studies undertaken with Saccharomyces cerevisiae (Winzeler et al. falciparum (Clark 2002), the genome-wide analysis facilitated by hybridization of genomic DNA to the Affymetrix microarray identified significant differences in potential selection pressure across different gene families and locations within the chromosome (Volkman et al. Although SNPs and deletions can be readily identified using Affymetrix high-density arrays, more complex types of genetic diversity may also be determined using this platform.
doi = 10.1007/s10577-005-1503-6 id = cord-000556-uu1oz2ei author = Kumar, Ranjit title = RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336” date = 2012-01-20 keywords = RNA; Seq; genome summary = Whole genome transcriptome analysis is a complementary method to identify "novel" genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. Therefore, genome structural annotation or the identification and demarcation of boundaries of functional elements in a genome (e.g., genes, non-coding RNAs, proteins, and regulatory elements) are critical elements in infectious disease systems biology. Whole genome transcriptome studies (such as whole genome tiling arrays [13, 14, 15] and high throughput sequencing [16, 17]) are complementary experimental approaches for bacterial genome annotation and can identify "novel" genes, gene boundaries, regulatory regions, intergenic regions, and operon structures. We compared the RNA-Seq based transcriptome map with the available genome annotation to identify expressed, novel, and intergenic regions in the genome. The single nucleotide resolution map helped uncover the structure and complexity of this pathogen's transcriptome and led to the identification of novel, small RNAs and protein coding genes as well as gene co-expression. doi = 10.1371/journal.pone.0029435 id = cord-001340-kqcx7lrq author = Ladner, Jason T. title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing date = 2014-06-17 keywords = genome; sequence; viral summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility.
Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. doi = 10.1128/mbio.01360-14 id = cord-330312-1pjolkql author = Liu, Y.-T. title = Infectious Disease Genomics date = 2017-01-20 keywords = HGP; genome; human; malaria; sequence summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30-32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
Genome sequence of the human malaria parasite Plasmodium falciparum doi = 10.1016/b978-0-12-799942-5.00010-x id = cord-265857-fs6dj3dp author = Liu, Yu-Tsueng title = Infectious Disease Genomics date = 2010-12-24 keywords = genome; human; sequence summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002). Genome sequencing projects for other important human disease vectors are in progress (Megy et al., 2009). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. doi = 10.1016/b978-0-12-384890-1.00010-8 id = cord-018804-wj35q88f author = Lázaro, Ester title = Genetic Variability in RNA Viruses: Consequences in Epidemiology and in the Development of New Strategies for the Extinction of Infectivity date = 2007 keywords = RNA; genome; mutation; virus summary = Highly error-prone replication, together with the short replication times and large population sizes typical of RNA viruses, instead of being a handicap for survival, provides an extraordinary evolutionary advantage by permitting the generation of a wide reservoir of mutants with different phenotypic properties [7].
However, the fact that DNA organisms, which usually live in constant environments, have evolved corrector activities, whereas RNA viruses have not, suggests that replication with high error rates is a selected character that strongly favours viral adaptation to fast changing conditions. Quasi-species replicating for a long time in a near-constant environment in the absence of large population size fluctuations can present a low rate of fixation of mutations in the consensus sequence, despite the continuous occurrence of mutants that is characteristic of the underlying dynamics of the population. The infection of a new host constitutes a sudden change in the environment in which viral replication takes place, usually with the consequence of a drastic decrease in the average fitness of the virus population, which prevents further transmission. doi = 10.1007/978-3-540-35306-5_15 id = cord-018437-yjvwa1ot author = Mitchell, Michael title = Taxonomy date = 2013-08-26 keywords = RNA; dna; genome; human; protein; virus summary = Classification is based on the genomic nucleic acid used by the virus (DNA or RNA), strandedness (single or double stranded), and method of replication. The nucleocapsids of some viruses are surrounded by envelopes composed of lipid bilayers and host- or viral-encoded proteins. The sequence of negative-sense ssRNA is complementary to the coding sequence for translation, so mRNA must be synthesized by RNA polymerase, typically carried within the virion, before translation into viral proteins. Among the families of viruses able to infect humans and other vertebrate hosts, there are many species that target and cause disease in the lung. The nucleocapsid is surrounded by an envelope derived from host-cell membrane and viral envelope proteins, including hepatitis B surface antigen.
The genome of human parainfluenza viruses is ~15 kb in length with an organization and six reading frames (N, P, M, F, HN, L) typical of the Paramyxoviridae (Karron and Collins 2007). doi = 10.1007/978-3-642-40605-8_3 id = cord-264746-gfn312aa author = Muse, Spencer title = GENOMICS AND BIOINFORMATICS date = 2012-03-29 keywords = RNA; dna; figure; gene; genome; sequence summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. doi = 10.1016/b978-0-12-238662-6.50015-x id = cord-014461-2ubh9u8r author = Nelson, Oranmiyan W.
title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012 date = 2012-10-10 keywords = Complete; Draft; Genome; Strain; isolate; sequence summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle; Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog; Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk; Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp.; Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America; Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011; Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China; Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean; Complete Genome Sequence of a Polyomavirus Isolated from Horses; Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea; Draft Genome Sequence of Aspergillus oryzae Strain 3.042 doi = 10.4056/sigs.3416907 id = cord-016293-pyb00pt5 author = Newell-McGloughlin, Martina title = The flowering of the age of Biotechnology 1990–2000 date = 2006 keywords = FDA; Genome; NIH; RNA; U.S.; University; Venter; cell; disease; dna; gene; human; plant; sequence; technology summary = doi = 10.1007/1-4020-5149-2_4 id = cord-007923-j3jpqd7k author = O'Brien, Stephen J.
title = Cats date = 2004-12-14 keywords = cat; genome summary = Wild cats dominate their habitat but require vast expanses to survive, which explains the tragic depredation such that every species of Felidae, except the domestic cat, is considered either endangered or threatened in the wild today by CITES, IUCN Red Book and other monitors of the world's most endangered species. Domestic cats and dogs enjoy more medical scrutiny than any species except humans. The cat offers the promise of a second carnivore species (in addition to the dog, which shares a common ancestor with cats dating back to approximately 60 million years ago) to improve human genome annotation, as well as to complement the biomedical and genomic discoveries that make the feline genome attractive. The conserved genome of the cat is retained in the other 36 Felidae species, as well as most of the 246 species of the Carnivora order, the only reshuffled exceptions occurring in the dog and bear families. doi = 10.1016/j.cub.2004.11.017 id = cord-298136-mel9fxw8 author = O'Malley, Maureen A. title = Whole-genome patenting date = 2005-05-10 keywords = dna; genome; patent summary = Gene patenting is now a familiar commercial practice, but there is little awareness that several patents claim ownership of the complete genome sequence of a prokaryote or virus. However, further analysis reveals that patent specifications describing whole-genome inventions use arguments that imply that genomes are qualitatively different from individual genes. This standard allows several sub-inventions to be linked together by a common "general inventive concept", but prevents unrelated inventions from succeeding as a single
If there are any qualitative differences between patents for whole genomes and those for DNA fragments, it seems likely that they will be found in the utility arguments, the most contested feature of recent gene patenting. doi = 10.1038/nrg1613 id = cord-320005-i30t7cvr author = Pardo, A. title = The Human Genome and Advances in Medicine: Limits and Future Prospects date = 2004-03-31 keywords = dna; gene; genome; human summary = The HGP's initial objectives were fulfilled 2 years ahead of schedule, and, in addition to compiling a highly accurate sequence of the human genome which has been made freely available and accessible to everyone, the Consortium has developed a set of new technologies and has constructed genetic maps of the genomes of various organisms. Around the same time, the public consortium known as the Human Genome Project was formed, and this organization announced a 15-year plan (from 1990 to 2005) with the following objectives: a) to determine the complete nucleotide sequence of human DNA and identify all the genes in human DNA (estimated to number between 50 000 and 100 000); b) to build physical and genetic maps; c) to analyze the genomes of selected organisms used in research as model systems (e.g., the mouse); d) to develop new technologies; and e) to analyze and debate the ethical and legal implications for individuals and for society as a whole. doi = 10.1016/s1579-2129(06)70078-7 id = cord-304607-td0776wj author = Paszkiewicz, Konrad H. title = Omics, Bioinformatics, and Infectious Disease Research date = 2010-12-24 keywords = gene; genome; protein; sequence summary = This chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics, and briefly covers metagenomics, which finds use in infectious disease research as well as in the random sequencing of genomes from a variety of organisms.
Bioinformatics plays a key role at several steps in genomics, comparative genomics, and functional genomics: sequence alignment, assembly, identification of single nucleotide polymorphisms (SNP), gene prediction, quantitative analysis of transcription data, etc. The term "metagenomics" was originally used to describe the sequencing of genomes of uncultured microorganisms in order to explore their abilities to produce natural products (Handelsman et al., 1998; Rondon et al., 2000) and subsequently resulted in novel insights into the ecology and evolution of microorganisms on a scale not imagined possible before (see Cardenas and Tiedje, 2008; Hugenholtz and Tyson, 2008 for an overview). However, metagenomics now finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms from, for example, patient material that could lead to the identification of the cause of disease. doi = 10.1016/b978-0-12-384890-1.00018-2 id = cord-352619-s2x53grh author = Payne, Natalie title = Novel Circoviruses Detected in Feces of Sonoran Felids date = 2020-09-15 keywords = Rep; dna; genome; virus summary = Genomes from several families of circular Rep-encoding single-stranded DNA viruses (CRESS-DNA viruses) are part of the phylum Cressdnaviricota [22] and have been identified in fecal samples of other mammals, including domestic cats [23, 24], bobcats, African lions [25], capybaras [26], and Tasmanian devils [27]. Here we used a metagenomic approach to identify novel circoviruses in the feces of two species of Sonoran felids, the puma and bobcat; although not endangered, knowledge of viral threats facing these species could help prevent future population decline, as well as indicate potential threats to the endangered ocelot and jaguar.
Based on the species-demarcation threshold for circoviruses, which is 80% genome-wide identity [28], both of these belong to a new species which we refer to as Sonfela (derived from Sonoran felid associated) circovirus 1. As the viral genomes were derived from scat samples, the circoviruses could have infected the bobcat prey species or the felids themselves or be environmentally derived. doi = 10.3390/v12091027 id = cord-281959-g4sjyytr author = Phillippy, Adam M title = Efficient oligonucleotide probe selection for pan-genomic tiling arrays date = 2009-09-16 keywords = array; genome; pan; probe summary = The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. In order to both characterize new strains based on genetic content, and detect polymorphism at a higher resolution in small RNAs (sRNAs) and intergenic sequences, the array was required to cover all pan-genomic sequences with a high density of probes. To see the similarities between the Pan-Tiling and Minimum Hitting Set problems, let the sequence G be a concatenation of all the genomes from a species, and let W = {w_1, w_2, ..., w_m} be the set of m intervals that results from segmenting G into non-overlapping, end-to-end, length l windows. doi = 10.1186/1471-2105-10-293 id = cord-297669-22fctxk4 author = Proudfoot, Chris title = Genome editing for disease resistance in pigs and chickens date = 2019-06-25 keywords = CD163; disease; genome; pig summary = The virus was thought to attach to CD169 to be taken up into the cells; however, genome-edited pigs lacking CD169 were not resistant to PRRSV infection (Prather et al., 2013).
Chicken somatic cell lines have been edited to introduce changes to this gene, conferring resistance to avian leucosis virus in vitro (Lee et al., 2017). However, as the example for avian influenza shows, host genes play an important role in other steps of the pathogen replication cycle and also provide editing targets for disease resilience or resistance. Genome editing allows integration of the disease-resistance trait into a wider selection of pigs, ensuring genetic variability and maintenance of desirable traits. (D) Resistance genes may be identified in laboratory research but not in highly bred lines, making integration into those productive animals only possible using genome editing. She employs genome editing and genetic selection to generate animals genetically resistant to viral disease. doi = 10.1093/af/vfz013 id = cord-275683-1qj9ri18 author = Roux, Simon title = Metagenomics in Virology date = 2019-06-12 keywords = RNA; genome; viral; virus summary = Against the background of an extensive viral diversity revealed by metagenomics across many environments, new sequence assembly approaches that reconstruct complete genome sequences from metagenomes have recently revealed surprisingly cosmopolitan viruses in specific ecological niches. However, these techniques can only detect previously known viruses, and often require Box 1 Use of complementary methods to target different types of viruses A number of approaches have been developed to specifically select and survey the genetic material contained by virus particles in a given sample. Virus sequences obtained from "bulk" metagenomes will typically reflect viruses infecting their host cell at the time of sampling, either actively replicating or not, while viromes enable a deeper and more focused exploration of the virus diversity in a specific site or sample.
With viral metagenomics being applied to a larger set of samples and environments, and with bioinformatic analyses including genome assembly and interpretation constantly improving, novel groups of dominant and widespread viruses may thus be progressively revealed across many environments. doi = 10.1016/b978-0-12-809633-8.20957-6 id = cord-015850-ef6svn8f author = Saitou, Naruya title = Eukaryote Genomes date = 2013-08-22 keywords = RNA; dna; gene; genome; sequence summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81].
doi = 10.1007/978-1-4471-5304-7_8 id = cord-268795-tjmx6msm author = Sardar, Rahila title = Comparative analyses of SAR-CoV2 genomes from different geographical locations and other coronavirus family genomes reveals unique features potentially consequential to host-virus interaction and pathogenesis date = 2020-03-21 keywords = SARS; genome summary = We have performed an integrated sequence-based analysis of SARS-CoV2 genomes from different geographical locations in order to identify its unique features absent in SARS-CoV and other related coronavirus family genomes, conferring unique infection, facilitation of transmission, virulence and immunogenic features to the virus. Our analysis reveals nine host miRNAs which can potentially target SARS-CoV2 genes. Our analysis shows unique host-miRNAs targeting SARS-CoV2 virus genes. The CELLO2GO (7) server was used to infer biological function for each protein of the SARS-CoV2 genome with their localization prediction. Assembled SARS-CoV2 genome sequences in FASTA format from India, USA, China, Italy and Nepal were used for coronavirus typing tool analysis. For the phylogenetic analysis, we compared the sequences of 6 SARS-CoV2 isolates from different countries namely, Wuhan, India, Italy, USA and Nepal along with other corona virus species (Figure 1). doi = 10.1101/2020.03.21.001586 id = cord-277687-u3q36o3e author = Shean, Ryan C.
title = VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank date = 2019-01-23 keywords = NCBI; RNA; genome summary = In order to accept submitted viral genomic data, NCBI GenBank requires 1) viral sequence complete with at least one protein annotation, 2) author/depositor metadata, and 3) viral sequence metadata, such as strain, collection date, collection location, and coverage. VAPiD handles batch submissions of multiple viruses of different types without prior knowledge of the viral species, correctly annotates RNA editing and ribosomal slippage, performs spellchecking on annotations, handles batch or individual submission of metadata, runs with a simple one-line command, and creates annotated viral sequence files for GenBank submission. This first example is the task that the authors originally wrote VAPiD for: annotating large numbers of genomes from different viral species, which mirrors the type of data that many clinical and public health laboratories may encounter. doi = 10.1186/s12859-019-2606-y id = cord-314594-xvc8hvpq author = Singh, Roshan Kumar title = Breeding and biotechnological interventions for trait improvement: status and prospects date = 2020-09-18 keywords = QTL; crop; genetic; genome; trait summary = Advances in high-throughput genomics strategies at a whole-genome level, including genetic association mapping, map-based cloning, genomic selection, and speed breeding, have also proven useful in improving genetic gains for expediting the crop improvement processes. Through genome-wide association study (GWAS), 60 loci significantly associated with agronomic traits such as oil content, seed quality, stress tolerance were identified, which may prove to be a valuable resource for genetic improvement (Lu et al.
Marker-assisted backcrossing (MABC) is the introgression of a genomic region (QTL or locus or gene) contributing the desired trait from a donor genotype into a breeding line or elite cultivar without linkage drag through backcrossing after multiple generations. As the name suggests, CRISPR/Cas9 consists of two components: a single-guide Application of functional and comparative genomics in marker-assisted breeding and biotechnological approaches for crop improvement. The candidate gene(s) identified from functional genomic studies can be introduced through genetic engineering or modified in a targeted manner through genome editing technology in crop species for improved agronomic traits. doi = 10.1007/s00425-020-03465-4 id = cord-016588-f8uvhstb author = Sintchenko, Vitali title = Informatics for Infectious Disease Research and Control date = 2009-10-03 keywords = dna; gene; genome; genomic; pathogen summary = The goal of infectious disease informatics is to optimize the clinical and public health management of infectious diseases through improvements in the development and use of antimicrobials, the design of more effective vaccines, the identification of biomarkers for life-threatening infections, a better understanding of host-pathogen interactions, and biosurveillance and clinical decision support. "New Age" infectious disease informatics rests on advances in microbial genomics, the sequencing and comparative study of the genomes of pathogens, and proteomics or the identification and characterization of their protein related properties and reconstruction of metabolic and regulatory pathways (Bansal 2005). The figure was produced using Artemis software (The Wellcome Trust Sanger Institute, UK) evidence-based gene calling or translating alignments of the DNA sequence to known proteins; and (3) aligning cDNAs from the same or related species.
doi = 10.1007/978-1-4419-1327-2_1 id = cord-269124-oreg7rnj author = Spyrou, Maria A. title = Ancient pathogen genomics as an emerging tool for infectious disease research date = 2019-04-05 keywords = Europe; Fig; Yersinia; ancient; dna; genome; pathogen; pestis summary = Examples of tools that have shown their effectiveness with ancient metagenomic DNA include the widely used Basic Local Alignment Search Tool (BLAST) 68; the MEGAN Alignment Tool (MALT) 41, which involves a taxonomic binning algorithm that can use whole genome databases (such as the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database 69); Metagenomic Phylogenetic Analysis (MetaPhlAn) 70, which is also integrated into the metagenomic pipeline MetaBIT 71 and uses thousands (or millions) of marker genes for the distinction of specific microbial clades; or Kraken 72, an alignment free sequence classifier that is based on k-mer matching of a query to a constructed database. Similar limitations can arise when the evolutionary history of a microorganism is vastly affected by recombination, as observed for HBV 44, 53, although HBV molecular dating was recently attempted using a different genomic data set and suggested that the currently explored diversity of Old and New World primate lineages (including all human genotypes) may have emerged within the last 20,000 years 43. doi = 10.1038/s41576-019-0119-1 id = cord-346335-el45v0a5 author = Tan, H.S. title = Fourier spectral density of the coronavirus genome date = 2020-08-11 keywords = SARS; Spike; genome summary = We uncover an interesting, new scaling law for the coronavirus genome: the complexity of the genome scales linearly with the power-law exponent that characterizes the enveloping curve of the low-frequency domain of the spectral density.
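The power-law exponent mentioned in the sentence above can be illustrated with a minimal sketch. This is not the authors' pipeline: the four-indicator-sequence mapping of nucleotides, the function names `indicator_power_spectrum` and `fit_power_law_exponent`, and the plain least-squares fit over the lowest frequencies are all assumptions chosen for illustration, and the naive O(n^2) transform is only practical for short test sequences.

```python
import cmath
import math


def indicator_power_spectrum(seq):
    """Power spectrum of a nucleotide sequence (assumed indicator mapping).

    Each base in "ACGT" gets a 0/1 indicator sequence; the squared DFT
    magnitudes of the four indicators are summed and divided by the length.
    The returned list holds frequencies k = 1 .. n // 2 (index i is k = i + 1).
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            coeff = sum(
                cmath.exp(-2j * math.pi * k * pos / n)
                for pos, nt in enumerate(seq)
                if nt == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total / n)
    return spectrum


def fit_power_law_exponent(spectrum, k_max):
    """Estimate beta in S(k) ~ 1 / k**beta over k = 1 .. k_max.

    Uses an ordinary least-squares line through (log k, log S(k)) and
    skips non-positive values, whose logarithm is undefined.
    """
    points = [
        (math.log(k), math.log(spectrum[k - 1]))
        for k in range(1, k_max + 1)
        if spectrum[k - 1] > 0
    ]
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


if __name__ == "__main__":
    # A strictly periodic toy sequence concentrates power at multiples of n / 4.
    toy = indicator_power_spectrum("ACGT" * 8)
    print(max(range(len(toy)), key=toy.__getitem__) + 1)
    # A synthetic spectrum with a known exponent recovers that exponent.
    print(fit_power_law_exponent([k ** -1.5 for k in range(1, 51)], 50))
```

Fitting the raw spectrum rather than an enveloping curve is a simplification; an envelope fit would first reduce the spectrum to its local maxima before the regression.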
An example of a seminal paper in this subject is that of Voss in [2] where the author found that the spectral density of the genome of many different species follows a power law of the form 1/k^β in the low-frequency domain, with the exponent β potentially related to the organism's evolutionary category. We develop a few models to characterize the typical spectrum, and in the process stumble upon a linear scaling law between a measure of the complexity of each genome and the power-law exponent that describes the enveloping curve of the low-frequency domain. doi = 10.1101/2020.06.30.180034 id = cord-265581-pbv8mjfc author = Tong, Yaojun title = An aurora of natural products-based drug discovery is coming date = 2020-06-06 keywords = genome; natural; product summary = With recent scientific advances combining metabolic sciences and technology, multi-omics, big data, combinatorial biosynthesis, synthetic biology, genome editing technology (such as CRISPR), artificial intelligence (AI), and 3D printing, the "high-hanging fruit" is becoming more and more accessible with reduced costs. The incredible rate of development in genome sequencing, modern metabolic engineering, synthetic biology, advanced genome editing, big data, artificial intelligence (AI), and 3D printing together with the growing microbial strain collections enable us to access the previously inaccessible natural products. It starts with genome mining (the analysis of high quality whole genome information), which requires bioinformatics, big data, and even AI; to pathway cloning (refactoring), expression and fermentation, which needs design-build-test-learn (DBTL) cycle-based metabolic engineering; to the target natural product identification, which requires modern chemical analysis; and to later compound modification and clinical studies, which needs biochemistry and cell biology.
doi = 10.1016/j.synbio.2020.05.003 id = cord-302047-vv5gpldi author = Willemsen, Anouk title = On the stability of sequences inserted into viral genomes date = 2019-11-14 keywords = Gene; RNA; genome; insert; stability; virus summary = Viruses are widely used as vectors for heterologous gene expression in cultured cells or natural hosts, and therefore a large number of viruses with exogenous sequences inserted into their genomes have been engineered. Conclusions of this review, all viruses: • Inserted sequences are often unstable and rapidly lost upon passaging of an engineered virus • The position at which a sequence is integrated in the genome can be important for stability • Sequence stability is not an intrinsic property of genomes because demographic parameters, such as population size and bottleneck size, can have important effects on sequence stability • The multiplicity of cellular infection affects sequence stability, and can in some cases directly affect whether there is selection for deletion variants • Deletions are not the only class of mutations that can reduce the cost of inserted sequences, although they are the most common doi = 10.1093/ve/vez045 id = cord-318392-r9bbomvk author = Woo, Patrick CY title = Coronavirus HKU15 in respiratory tract of pigs and first discovery of coronavirus quasispecies in 5′-untranslated region date = 2017-06-21 keywords = Coronavirus; HKU15; PCR; genome summary = The genomes of two Coronavirus HKU15 strains detected in the nasopharyngeal samples of two different pigs were sequenced following our previous publications 26, 27 with modifications.
Divergence times for the Coronavirus HKU15 strains were calculated based on the complete genome sequence data, utilizing the Bayesian Markov chain Monte Carlo method using BEAST 1.8.0 [33] with the substitution model GTR (general time-reversible model)+G (gamma-distributed rate variation)+I (estimated proportion of invariable sites), a strict molecular clock, and a constant coalescent. In one (S579N) of the two Coronavirus HKU15 genomes that we sequenced in this study, variant sites were observed at four positions; two of them were due to nucleotide substitutions, and the other two were results of indels at mononucleotide polymeric regions (189th and 376th bases). doi = 10.1038/emi.2017.37 id = cord-348515-bqqyly23 author = Zhao, Suhui title = Re-emergent Human Adenovirus Genome Type 7d Caused an Acute Respiratory Disease Outbreak in Southern China After a Twenty-one Year Absence date = 2014-12-08 keywords = ARD; China; DG01_2011; REA; genome summary = Recombination analysis reveals this genome differs from the 1950s-era prototype and vaccine strains by a lateral gene transfer, substituting the coding region for the L1 52/55 kDa DNA packaging protein from HAdV-16. Thorough characterization of these pathogens is evidenced by the availability of two genome sequences (JF800905 and JX625134), both of which are further identified as the HAdV-7d genome type in this report, and shown to be nearly identical to this report of an isolate from a 2011 ARD outbreak in Guangdong Province (strain DG01_2011) by comparative genomics and, in particular, in silico REA pattern analysis, as presented in Figure 2. 
doi = 10.1038/srep07365 id = cord-000902-ew8orn0z author = Zhao, Xiangyan title = Coevolution between simple sequence repeats (SSRs) and virus genome size date = 2012-08-30 keywords = additional; genome; ssr; virus summary = The results showed that simple sequence repeats (SSRs) are strongly, positively and significantly correlated with genome size. Relative abundance and relative density were examined to make SSR comparisons parallel among genomes of differently sized species, while principal component analysis (PCA) was designed to investigate which repeat class(es) made a greater contribution to the variance among virus species as well as the relationships between repeat classes. Therefore, the 257 genome sequences were selected as samples for analyzing the relationship between SSR distribution and genome size at the level of the whole virus. We surveyed the distribution of different SSR classes in virus genomes to investigate the relationship between repeat classes (mono-, di-, tri-, tetra-, penta- and hexa-) and genome sequence length. doi = 10.1186/1471-2164-13-435 id = cord-265329-bsypo08l author = van Dorp, Lucy title = Emergence of genomic diversity and recurrent mutations in SARS-CoV-2 date = 2020-05-05 keywords = CoV-2; SARS; figure; genome summary = Three sites in Orf1ab in the regions encoding Nsp6, Nsp11, Nsp13, and one in the Spike protein are characterised by a particularly large number of recurrent mutations (>15 events) which may signpost convergent evolution and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host. 
The extraordinary availability of genomic data during the COVID-19 pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing SARS-CoV-2 assemblies (Table S1) and the proliferation of close to real-time data visualisation and analysis tools including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk). In this work we use this data to analyse the genomic diversity that has emerged in the global population of SARS-CoV-2 since the beginning of the COVID-19 pandemic, based on a download of 7710 assemblies. The genomic diversity of the global SARS-CoV-2 population being recapitulated in multiple countries points to extensive worldwide transmission of COVID-19, likely from extremely early on in the pandemic. doi = 10.1016/j.meegid.2020.104351