Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'.

The Distant Reader harvested & cached your content into a
collection/corpus. It then applied sets of natural language
processing and text mining techniques to the collection. The results
of this process were reduced to a database file -- a 'study carrel'.
The study carrel can then be queried, thus bringing to light
specific characteristics of your collection. These
characteristics can help you summarize the collection as well as
enumerate things you might want to investigate more closely.

This is a terse narrative report; when processing is complete,
you will be linked to a more complete narrative report.

                               Eric Lease Morgan <emorgan@nd.edu>


Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
49


Average length of all items measured in words; "More or less, how big is each item?"
------------------------------------------------------------------------------------
6457


Average readability score of all items (0 = difficult; 100 = easy)
------------------------------------------------------------------
45


Top 50 statistically significant keywords; "What is my collection about?"
-------------------------------------------------------------------------
46	genome
16	dna
16	RNA
14	virus
12	sequence
8	gene
5	human
4	viral
4	protein
4	figure
4	SARS
3	mutation
3	Genome
2	sequencing
2	recombination
2	pathogen
2	disease
2	HGP
1	trait
1	tool
1	technology
1	subsp
1	stability
1	ssr
1	product
1	probe
1	poliovirus
1	plant
1	pig
1	pestis
1	patent
1	pan
1	pallidum
1	natural
1	model
1	malaria
1	isolate
1	insert
1	host
1	genomic
1	genetic
1	fragment
1	datum
1	crop
1	clinical
1	chapter
1	cell
1	cat
1	ascaris
1	array


Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
4053	genome
2756	virus
2155	sequence
1850	gene
1148	protein
817	analysis
776	dna
760	cell
699	mutation
675	host
632	datum
604	disease
537	strain
516	study
516	recombination
507	sequencing
490	region
475	replication
474	type
445	number
436	specie
435	infection
432	population
410	poliovirus
402	%
399	pathogen
388	rate
386	example
371	evolution
355	structure
349	time
349	size
328	approach
319	site
312	information
310	tool
307	diversity
296	method
296	expression
294	organism
293	system
286	result
281	mechanism
281	level
277	sample
275	model
274	function
270	database
264	figure
261	polymerase


Top 50 proper nouns; "What are the names of persons or places?"
--------------------------------------------------------------
1357	RNA
1326	al
1108	et
995	.
356	Genome
244	SARS
173	DNA
129	C
119	Human
110	Fig
105	Virus
98	China
97	GenBank
96	NCBI
86	CoV-2
85	PCR
83	SNP
81	kb
81	B
81	A
80	Complete
79	CoV
77	Yersinia
74	Coronavirus
72	WGS
72	HIV-1
67	T
67	Strain
66	C.
65	Y.
65	T.
65	Project
63	bp
63	Wimmer
63	National
63	Institute
63	ExoN
63	E.
61	S.
60	HIV
59	Figure
58	Table
55	picornavirus
55	IRES
53	Treponema
53	S
52	SNPs
52	NIAID
52	Europe
51	HGP


Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
1044	it
654	we
422	they
97	them
73	i
65	he
49	us
32	one
30	itself
22	themselves
12	you
9	him
7	p~
3	she
3	himself
2	u
2	ourselves
2	https://github.com/ababaian/serratus
1	mine
1	https://serratus.io
1	her
1	hadv-4
1	coronaspades


Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
10696	be
2292	have
1138	use
463	identify
449	include
415	base
396	provide
360	show
341	find
321	do
286	contain
280	associate
273	know
267	develop
259	sequence
251	suggest
246	cause
239	make
234	require
230	reveal
229	generate
221	lead
220	encode
213	produce
210	allow
207	occur
207	determine
206	result
202	increase
197	follow
194	predict
191	relate
190	give
190	express
189	involve
183	see
177	describe
169	isolate
168	consider
158	compare
148	detect
148	code
144	represent
140	infect
137	target
136	emerge
135	indicate
133	remain
133	become
129	perform


Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
1112	viral
926	not
878	human
724	also
716	genetic
650	other
639	high
582	such
571	more
507	-
496	only
474	genomic
469	new
445	different
422	large
397	most
375	single
374	well
370	however
322	many
312	first
293	molecular
290	specific
269	small
267	important
258	complete
256	whole
244	nucleotide
243	as
235	low
233	same
230	non
224	evolutionary
215	long
214	multiple
213	infectious
212	several
211	possible
209	similar
207	clinical
206	available
193	bacterial
189	highly
180	cellular
176	novel
174	functional
172	microbial
172	biological
163	major
161	early


Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------
136	most
53	least
49	good
29	Most
27	large
26	high
18	close
13	small
11	strong
11	great
10	early
9	low
9	late
8	simple
4	near
4	long
2	weak
2	short
2	old
2	hot
2	fit
2	bad
1	~15
1	wide
1	northernmost
1	little
1	innermost
1	flat
1	fast
1	clever
1	buildt
1	big


Top 50 lemmatized superlative adverbs; "How do things do to the extreme?"
------------------------------------------------------------------------
261	most
42	least
29	well
2	shortest
1	long
1	close


Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
17	github.com
8	s3.amazonaws.com
7	www.ncbi.nlm.nih.gov
7	serratus.io
4	www.niaid.nih.gov
4	www.ebi.ac.uk
4	www
4	doi.org
3	www.ncbi.nlm.nih
2	www.who.int
2	www.broadinstitute.org
2	submit.ncbi.nlm.nih.gov
2	nextstrain.org
2	gmod.org
2	bioconductor.org
1	xmtb
1	www3.niaid.nih.gov
1	www3
1	www.wheatgenome.org
1	www.wdcm.org
1	www.secondarymetabolites
1	www.rostlab.org
1	www.ridom.de
1	www.ridom.com
1	www.predictprotein.org
1	www.paintmychromosomes.com
1	www.oxfordjournals.org
1	www.ostp
1	www.istm.org
1	www.inforsense.com
1	www.iedb.org
1	www.healthmap.org
1	www.hackseq.com
1	www.fruitfly.org
1	www.fludb.org
1	www.epicov.org
1	www.ensembl.org
1	www.doe-mbi.ucla.edu
1	www.dnastar.com
1	www.csgid.org
1	www.cogconsortium.uk
1	www.cdc.gov
1	www.broad.mit.edu
1	www.brccentral.org
1	www.boldsystems
1	www.angis.org.au
1	www.r-project.org
1	woldlab.caltech.edu
1	wishart.biology
1	virological.org


Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
4	http://www
3	http://www.ncbi.nlm.nih
3	http://serratus.io/access
3	http://serratus.io
3	http://github.com/rcs333/VAPiD
2	http://www.niaid.nih.gov/dmid/genomes/
2	http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html
2	http://nextstrain.org
2	http://gmod.org
2	http://github.com/serratus-bio/tantalus
2	http://github.com/ababaian/serratus
1	http://xmtb
1	http://www3.niaid.nih.gov/research/topics/
1	http://www3
1	http://www.who.int/tdr
1	http://www.who.int/csr/disease/plague/Plague-map-2016.pdf
1	http://www.wheatgenome.org/
1	http://www.wdcm.org
1	http://www.secondarymetabolites
1	http://www.rostlab.org/
1	http://www.ridom.de/traceedit/
1	http://www.ridom.com/seqsphere/
1	http://www.predictprotein.org/
1	http://www.paintmychromosomes.com
1	http://www.oxfordjournals.org/nar/database/
1	http://www.ostp
1	http://www.niaid.nih.gov/dmid/genomes/mscs/
1	http://www.niaid.nih.gov/dmid/
1	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394299/
1	http://www.ncbi.nlm.nih.gov/genome/browse/
1	http://www.ncbi.nlm.nih.gov/COG
1	http://www.ncbi.nlm.nih.gov/BLAST
1	http://www.ncbi.nlm.nih.gov
1	http://www.istm.org/geosentinel/main.html
1	http://www.inforsense.com
1	http://www.iedb.org
1	http://www.healthmap.org/en
1	http://www.hackseq.com
1	http://www.fruitfly.org/seq_tools/promoter.html
1	http://www.fludb.org/
1	http://www.epicov.org
1	http://www.ensembl.org
1	http://www.ebi.ac.uk/interpro/
1	http://www.ebi.ac.uk/Bzerbino/velvet
1	http://www.ebi.ac.uk/Bzerbino/oases
1	http://www.ebi.ac.uk
1	http://www.doe-mbi.ucla.edu/TB
1	http://www.dnastar.com/products/lasergene.php
1	http://www.csgid.org
1	http://www.cogconsortium.uk/data/


Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
1	ytliu@ucsd.edu
1	journals.permissions@oup.com
1	gb-admin@ncbi.nlm.nih.gov
1	christine.burkard@roslin.ed.ac.uk
1	celniker@fruitfly.org


Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
-------------------------------------------------------------------------------
11	genome sequence data
4	% sequence identity
3	data are available
3	gene finding hmm
3	genes have also
3	genome is often
3	genome sequence analysis
3	genome sequences available
3	genomes are not
3	proteins are also
3	recombination does not
3	recombination is also
3	sequence is present
3	sequences are important
3	sequences are similar
3	viruses are also
3	viruses are not
3	viruses are often
3	viruses have not
3	viruses is not
2	cells are capable
2	data were recently
2	disease is endemic
2	dna sequence data
2	gene finding algorithms
2	gene was stably
2	genes are also
2	genes are not
2	genes are often
2	genes are well
2	genes using conditional
2	genome does not
2	genome have already
2	genome is still
2	genome reveals features
2	genome sequence information
2	genome sequence length
2	genome was completely
2	genomes are extremely
2	genomes are highly
2	genomes are much
2	genomes are present
2	genomes are routinely
2	genomes contain tetra
2	genomes containing penta
2	genomes do not
2	genomes has not
2	host is not
2	hosts are also
2	infections are self


Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
---------------------------------------------------------------------------------------
2	recombination does not necessarily
1	% was not similar
1	genes are not identical
1	genomes are not naked
1	genomes are not robust
1	genomes has not only
1	genomes is not only
1	genomes were not only
1	host is not able
1	host is not that
1	infections do not typically
1	infections is not well
1	mutations are not easily
1	mutations are not necessarily
1	mutations were not present
1	number are not necessarily
1	proteins have no clear
1	recombination shows no appreciable
1	recombination was not essential
1	replication was not significantly
1	sequence has no nucleotides
1	sequence has no protein
1	sequences are not contiguous
1	sequences have not yet
1	sequences were not public
1	virus does not solely
1	virus is not well
1	viruses are not homogeneous
1	viruses are not pathogenic
1	viruses did not readily
1	viruses have no autonomous
1	viruses have no mechanisms
1	viruses is not well


A rudimentary bibliography
--------------------------
      id = cord-310406-5pvln91x
  author = Asbury, Thomas M
   title = Genome3D: A viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome
    date = 2010-09-02
keywords = datum; genome; model
 summary = RESULTS: We have applied object-oriented technology to develop a downloadable visualization tool, Genome3D, for integrating and displaying epigenomic data within a prescribed three-dimensional physical model of the human genome. In addition, in spite of the many recent efforts to measure and model the genome structure at various resolutions and detail [3] [4] [5] [6] [7] [8] [9] [10], little work has focused on combining these models into a plausible aggregate, or has taken advantage of the large amount of genomic and epigenomic data available from new high-throughput approaches. The viewer is designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the complete physical genome. An integrated physical genome model can show the interplay between histone modifications and other genomic data, such as SNPs, DNA methylation, the structure of gene, promoter and transcription machinery, etc. In addition to epigenomic data, the physical genome model also provides a platform to visualize high-throughput gene expression data and its interplay with global binding information of transcription factors.
     doi = 10.1186/1471-2105-11-444

      id = cord-301709-kvyes2lz
  author = Baker, Susan C.
   title = Developing Bioinformatic Resources for Coronaviruses
    date = 2006
keywords = genome
 summary = The database will contain high-quality curated data: sequence annotations from published whole and partial genomes; relevant experimental data; metabolic pathway data; taxonomic data; literature citations; and a suite of visualization and analysis tools. The results of these programs and searches assembled by the annotation pipeline are used to propose biological features that are also stored in the curation database that uses the Genomics Unified Schema (GUS). For the purposes of defining minimal, non-redundant set of genes characteristic of the category, one genome (usually the best-known or best-characterized) is identified as the "reference genome"; the remaining members of the class are called "associated genomes." For example, the Tor2 and Urbani isolates were the first two SARS coronavirus genomes to be sequenced and therefore were named as reference genomes. This allows high-value, manually curated information from the corresponding reference genes to be automatically linked to the associated genes, provided minimal similarity criteria based on automated sequence analysis are satisfied.
     doi = 10.1007/978-0-387-33012-9_70

      id = cord-003316-r5te5xob
  author = Balloux, Francois
   title = From Theory to Practice: Translating Whole-Genome Sequencing (WGS) into the Clinic
    date = 2018-12-17
keywords = AMR; WGS; clinical; genome; sequence; sequencing
 summary = WGS-based strain identification gives a far superior resolution. In principle, WGS can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols.
     doi = 10.1016/j.tim.2018.08.004

      id = cord-340423-f8ab7413
  author = Barr, J.N.
   title = Genetic Instability of RNA Viruses
    date = 2016-09-09
keywords = RNA; genome; mutation; viral; virus
 summary = We then discuss evidence that at least some RNA viruses have a replication fidelity that is poised to maximize genome sequence space without incurring catastrophic lethal mutations and describe how this can be exploited to control viral infections. The error-prone nature of polymerase activity, coupled with the absence of a proofreading mechanism, is the key reason why RNA virus genomes acquire mutations and exist as a swarm of genetic variants. The mutation rate of the viral polymerase, coupled with the replication mode that the virus employs (and extrinsic factors, described in the following text) will determine the extent of genetic variability of viruses released from an infected cell. Thus, it is possible that the high mutation rates of RNA viruses are simply a consequence of polymerases that are under selective pressure to replicate genomes very rapidly to ensure efficient viral infection [79] [80] [81].
     doi = 10.1016/b978-0-12-803309-8.00002-1

      id = cord-000012-p56v8wi1
  author = Bigot, Yves
   title = Molecular evidence for the evolution of ichnoviruses from ascoviruses by symbiogenesis
    date = 2008-09-18
keywords = dna; gene; genome; protein; virus
 summary = CONCLUSION: Our results provide molecular evidence supporting the origin of ichnoviruses from ascoviruses by lateral transfer of ascoviral genes into ichneumonid wasp genomes, perhaps the first example of symbiogenesis between large DNA viruses and eukaryotic organisms. With respect to both species number and mechanisms that lead to successful parasitism, endoparasitic wasps are known to inject secretions at oviposition, but only a few lineages use viruses or virus-like particles (VLPs) to evade or to suppress host defences. Extending our investigations to proteins encoded by open reading frames of certain ascoviruses and bracoviruses, hosts and bacteria, in the light of recent analyses about the involvement of the replication machinery of virus groups related to ascoviruses in lateral gene transfer [29] , we discuss the robustness and the limits of the molecular evidence supporting an ascovirus origin for ichnovirus lineages.
     doi = 10.1186/1471-2148-8-253

      id = cord-005281-wy0zk9p8
  author = Blinov, V. M.
   title = Viral component of the human genome
    date = 2017-05-09
keywords = RNA; dna; genome; host; virus
 summary = In the human genome, this capacity is determined by the portion of chromosomal DNA, which does not contain species-specific protein-encoding sequences and, thus, can basically make a place for novel information that will be modified to reach a new balance. In fact, the scope of the described phenomena is not limited to retroviruses as such, since the ubiquity of retroviral elements in animal genomes, their activity in germline cells [31], along with the fact that viral replication depends significantly on RNA expression, allow retroviruses to contribute in different ways to the insertion of nonretroviral genes into animal germline cells. Finally, the ability to incorporate parts of the viral genome into the chromosomal DNA of host germline cells can vary strongly among different taxonomic groups of viruses, i.e., orders, families, genera, and even species. If insertions of viral sequences remain functionally active in the host cell genome, they can give rise to either proteins that function in a new environment or untranslated RNAs of different sizes.
     doi = 10.1134/s0026893317020066

      id = cord-012473-p66of6kq
  author = Celniker, Susan E.
   title = Unlocking the secrets of the genome
    date = 2009-06-17
keywords = dna; genome
 summary = The primary objective of the Human Genome Project was to produce high-quality sequences not just for the human genome but also for those of the chief model organisms: Escherichia coli, yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans), fly (Drosophila melanogaster) and mouse (Mus musculus). Free access to the resultant data has prompted much biological research, including development of a map of common human genetic variants (the International HapMap Project) 1, expression profiling of healthy and diseased cells 2 and in-depth studies of many individual genes. On the basis of this experience, the NHGRI launched two complementary programmes in 2007: an expansion of the human ENCODE project to the whole genome (www.genome.gov/ENCODE) and the model organism ENCODE (modENCODE) project to generate a comprehensive annotation of the functional elements in the C. The research communities that study these two organisms will rapidly make use of the modENCODE results, deploying powerful experimental approaches that are often not possible or practical in mammals, including genetic, genomic, transgenic, biochemical and RNAi assays.
     doi = 10.1038/459927a

      id = cord-304498-ty41xob0
  author = Denison, Mark R
   title = Coronaviruses: An RNA proofreading machine regulates replication fidelity and diversity
    date = 2011-03-01
keywords = ExoN; RNA; SARS; genome; virus
 summary = Genetic inactivation of ExoN activity in engineered SARS-CoV and MHV genomes by alanine substitution at conserved DE-D-D active site residues results in viable mutants that demonstrate 15- to 20-fold increases in mutation rates, up to 18 times greater than those tolerated for fidelity mutants of other RNA viruses. The high mutation rates of RNA viruses also render them particularly susceptible to repeated genetic bottleneck events during replication, transmission between hosts or spread within a host, resulting in progressive deviation from the consensus sequence associated with decreased viral fitness and sometimes extinction.
     doi = 10.4161/rna.8.2.15013

      id = cord-022128-r8el8nqm
  author = Domingo, Esteban
   title = Molecular basis of genetic variation of viruses: error-prone replication
    date = 2019-11-08
keywords = HIV-1; RNA; chapter; dna; genome; mutation; recombination; virus
 summary = 
     doi = 10.1016/b978-0-12-816331-3.00002-7

      id = cord-316033-xg8eb2nm
  author = Easton, Alice
   title = Molecular evidence of hybridization between pig and human Ascaris indicates an interbred species complex infecting humans
    date = 2020-11-06
keywords = SNP; ascaris; dna; figure; genome
 summary = suum transcripts (Jex et al., 2011; Wang et al., 2017) to the human Ascaris germline assembly to annotate the genome, identifying and classifying 17,902 protein-coding genes (Table 1, Supplementary file 1). As this reference-based assembly exhibits the best assembly attributes, including high continuity with a large N50, low gaps and unplaced sequences, and high-quality protein-coding genes (see Table 1), we suggest that this version should be used as a reference germline genome for a human Ascaris spp. We next took advantage of the abundant reads from the mitochondrial genome in our sequencing data (on average 7690X coverage, see Supplementary file 1) to perform de novo assembly of 68 complete human Ascaris spp. Furthermore, there were no significant associations between mitochondrial sequence variations and other factors (e.g. village, household, time of worm collection, host) based on PERMANOVA (see methods and Table 2) after translating the phylogenetic tree into a distance matrix, suggesting not only a lack of differentiation into distinct species but also a potentially large interbreeding population of worms being transmitted between individuals and across villages.
     doi = 10.7554/elife.61562

      id = cord-334394-qgyzk7th
  author = Edgar, Robert C.
   title = Petabase-scale sequence alignment catalyses viral discovery
    date = 2020-08-10
keywords = Extended; Figure; RNA; SRA; Serratus; genome; sequence
 summary = To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To expand the known repertoire of viruses and catalyse global virus discovery, in particular for Coronaviridae (CoV) family, we developed the Serratus cloud computing architecture for ultra-high throughput sequence alignment. We aligned 3,837,755 public RNA-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5]) against a collection of viral family pangenomes comprising all GenBank CoV records clustered at 99% identity plus all non-retroviral RefSeq records for vertebrate viruses (see Methods and Extended Table 1). We performed de novo assembly on 52,772 runs potentially containing CoV sequencing reads by combining 37,131 SRA accessions identified by the Serratus search with 18,584 identified by an ongoing cataloguing initiative of the SRA called STAT [5].
     doi = 10.1101/2020.08.07.241729

      id = cord-016798-tv2ntug6
  author = Gautam, Ablesh
   title = Bioinformatics Applications in Advancing Animal Virus Research
    date = 2019-06-06
keywords = genome; sequence; tool; viral; virus
 summary = The chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (ORF) recognition and tools that enable analysis of host-viral interactions, gene prediction in the viral genome, etc. This chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate the viral dynamics, evolution and preventive therapeutics. Novel virus types comprise new CDSs that are different from previously known CDSs. There are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. VIRsiRNAdb is an online curated repository that stores experimentally validated research data of siRNA and short hairpin RNA (shRNA) targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al.
     doi = 10.1007/978-981-13-9073-9_23

      id = cord-017932-vmtjc8ct
  author = Georgiev, Vassil St.
   title = Genomic and Postgenomic Research
    date = 2009
keywords = NIAID; gene; genome; sequence
 summary = The family Enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (Salmonella, Yersinia, Klebsiella, Shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic Escherichia coli K12. To this end, NIAID has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. The availability of microbial and human DNA sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. The PFGRC was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases.
     doi = 10.1007/978-1-60327-297-1_25

      id = cord-348059-wa1gjbck
  author = Gibbs, Richard A.
   title = The Human Genome Project changed everything
    date = 2020-08-07
keywords = Genome; HGP
 summary = Thirty years on from the launch of the Human Genome Project, Richard Gibbs reflects on the promises that this voyage of discovery bore. He developed basic methods for DNA and mutation analysis and was an early contributor to the Human Genome Project (HGP), leading one of five sites that generated the majority of the sequence. The power of advances in genomics and computers was revealed in the spectacular series of post-HGP projects that were of comparable scale. Some still tally the success of the HGP from lists of new drugs or therapies and argue that world-changing examples in biology, such as the spectacular advances of gene editing tools or the expansion of cancer therapeutics through targeted immunotherapy, are largely based on microbial, cellular and animal studies rather than genomics.
     doi = 10.1038/s41576-020-0275-3

      id = cord-350747-5t5xthk6
  author = Gmyl, A. P.
   title = Diverse Mechanisms of RNA Recombination
    date = 2005
keywords = RNA; fragment; genome; recombination; virus
 summary = It was believed until recently that the only possible mechanism of RNA recombination is replicative template switching, with synthesis of a complementary strand starting on one viral RNA molecule and being completed on another. An illustrative example of deletions is provided by defective interfering (DI) genomes, which accumulate in a virus population upon high-multiplicity infections and lack a fragment of the sequence coding for viral proteins [5] [6] [7]. A special role in the variation of RNA viruses is played by recombination, the generation of new genomes from two or more parental RNAs. Recombination between viral RNA molecules was observed for the first time as early as in the 1960s in the poliovirus [14, 15]. In other words, it is possible to assume that some of the mechanisms of nonreplicative RNA recombination play an important role in the evolution of not only viral, but also cell genomes [51, 90].
     doi = 10.1007/s11008-005-0069-x

      id = cord-022262-ck2lhojz
  author = Gromeier, Matthias
   title = Genetics, Pathogenesis and Evolution of Picornaviruses
    date = 2007-09-02
keywords = IRES; RNA; Wimmer; figure; genome; poliovirus; protein; virus
 summary = The following viruses have been recognized as picornaviruses on the basis of their genome sequences and physico-chemical properties as well as the result of comparative sequence analyses (see the section on Evolution): equine rhinovirus types 1 and 2, Aichi virus, porcine enterovirus, avian encephalomyelitis virus, infectious flacherie virus of silkworm. Clusters of enteroviruses refer to groups of enteroviruses arranged predominantly according to genotypic kinship (Hyypia et al., 1997). Briefly, when expression vectors (Figure 12.6E) consisting of a gag gene (encoding p17-p24; 1161 nt) of human immunodeficiency virus that was fused to the N-terminus of the poliovirus polyprotein (Andino et al., 1994; Mueller and Wimmer, 1998) were analysed after transfection into HeLa cells, the genomes were not only found to be severely impaired in viral replication but they were also genetically unstable (Mueller and Wimmer, 1997).
     doi = 10.1016/b978-012220360-2/50013-1

      id = cord-267714-ji88tvsl
  author = JAKUPCIAK, JOHN P.
   title = Biological agent detection technologies
    date = 2009-04-21
keywords = dna; genome; sequencing
 summary = PCR-based methods have critical limitations, since they depend on a priori knowledge of what sequence to detect in a sample further complicated by recent demonstrations of greater variability in genomic sequence than expected. A platform for genome identification of a specimen from any source must not only be sensitive and specific, but must also detect a variety of pathogens with high accuracy, including modified or previously uncharacterized agents, and this challenge is daunting when identification must be achieved using nucleic acids in a complex sample matrix. The build-out of genome identification DNA sequencing technology in the form of practical instrumentation will be achieved by incorporating the critical requirements for accurate long reads, without dependency for template amplification, capable of manipulating terabytes of data to provide reliable and useful identification of genetic sequences within any unknown sample, whether clinical, environmental, or other type of specimen.
     doi = 10.1111/j.1755-0998.2009.02632.x

      id = cord-004123-1s8kuno2
  author = Jaiswal, Arun Kumar
   title = The pan-genome of Treponema pallidum reveals differences in genome plasticity between subspecies related to venereal and non-venereal syphilis
    date = 2020-01-10
keywords = Treponema; genome; pallidum; subsp
 summary = pallidum strains isolated from different parts of the world and a diverse range of hosts were comparatively analysed using pan-genomic strategy. pertenue, we found differences in the presence/absence of pathogenicity islands (PAIs) and genomic islands (GIs) on subsp.-based study. In this work, we perform a pan-genome approach to better understand the differences of Treponema pallidum infections in the broad spectrum and how genome plasticity is related to the symptom patterns. Finally, we provide insights into the specific subsets (singletons and the pan- and core genomes) of 53 genomes of T. pallidum strains and correlate these subsets with the plasticity of pathogenicity islands and virulence genes. The subspecies responsible for non-venereal syphilis is Treponema pallidum subsp. Genes which are present in pallidum subspecies pathogenicity islands (PAIs) or genomic islands (GIs) are absent in the subspecies endemicum and pertenue.
     doi = 10.1186/s12864-019-6430-6

      id = cord-324811-yjwavea5
  author = Kidgell, Claire
   title = Elucidating genetic diversity with oligonucleotide arrays
    date = 2005
keywords = dna; genome
 summary = Oligonucleotide microarrays, predominantly high-density oligonucleotide arrays, have emerged as the principal platforms for performing genome-wide diversity analysis. Since a number of complex issues still remain with high-throughput microarray-based SNP genotyping in humans, in the remainder of this review, we will discuss the application of high-density oligonucleotide arrays to elucidate genetic diversity, with particular focus on studies undertaken with Saccharomyces cerevisiae (Winzeler et al. falciparum (Clark 2002), the genome-wide analysis facilitated by hybridization of genomic DNA to the Affymetrix microarray identified significant differences in potential selection pressure across different gene families and locations within the chromosome (Volkman et al. Although SNPs and deletions can be readily identified using Affymetrix high-density arrays, more complex types of genetic diversity may also be determined using this platform.
     doi = 10.1007/s10577-005-1503-6

      id = cord-000556-uu1oz2ei
  author = Kumar, Ranjit
   title = RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”
    date = 2012-01-20
keywords = RNA; Seq; genome
 summary = Whole genome transcriptome analysis is a complementary method to identify "novel" genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. Therefore, genome structural annotation or the identification and demarcation of boundaries of functional elements in a genome (e.g., genes, non-coding RNAs, proteins, and regulatory elements) are critical elements in infectious disease systems biology. Whole genome transcriptome studies (such as whole genome tiling arrays [13, 14, 15] and high throughput sequencing [16, 17]) are complementary experimental approaches for bacterial genome annotation and can identify "novel" genes, gene boundaries, regulatory regions, intergenic regions, and operon structures. We compared the RNA-Seq based transcriptome map with the available genome annotation to identify expressed, novel, and intergenic regions in the genome. The single nucleotide resolution map helped uncover the structure and complexity of this pathogen's transcriptome and led to the identification of novel, small RNAs and protein coding genes as well as gene co-expression.
     doi = 10.1371/journal.pone.0029435

      id = cord-001340-kqcx7lrq
  author = Ladner, Jason T.
   title = Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing
    date = 2014-06-17
keywords = genome; sequence; viral
 summary = Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization.
     doi = 10.1128/mbio.01360-14

      id = cord-330312-1pjolkql
  author = Liu, Y.-T.
   title = Infectious Disease Genomics
    date = 2017-01-20
keywords = HGP; genome; human; malaria; sequence
 summary = One of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. 16, 17 The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002. 30-32 Genome-sequencing projects for other important human disease vectors are in progress. 38 One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. 48 The completed or ongoing genome projects (Table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. Genome sequence of the human malaria parasite Plasmodium falciparum
     doi = 10.1016/b978-0-12-799942-5.00010-x

      id = cord-265857-fs6dj3dp
  author = Liu, Yu-Tsueng
   title = Infectious Disease Genomics
    date = 2010-12-24
keywords = genome; human; sequence
 summary = The completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. The genomes of human malaria parasite Plasmodium falciparum and its major mosquito vector Anopheles gambiae were published in 2002 (Gardner et al., 2002; Holt et al., 2002). Genome sequencing projects for other important human disease vectors are in progress (Megy et al., 2009). One of the similar efforts for human pathogens is the NIH Influenza Genome Sequencing Project. The completed or ongoing genome projects (Table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control.
     doi = 10.1016/b978-0-12-384890-1.00010-8

      id = cord-018804-wj35q88f
  author = Lázaro, Ester
   title = Genetic Variability in RNA Viruses: Consequences in Epidemiology and in the Development of New Strategies for the Extinction of Infectivity
    date = 2007
keywords = RNA; genome; mutation; virus
 summary = Highly error-prone replication, together with the short replication times and large population sizes typical of RNA viruses, instead of being a handicap for survival provides an extraordinary evolutionary advantage by permitting the generation of a wide reservoir of mutants with different phenotypic properties [7]. However, the fact that DNA organisms, which usually live in constant environments, have evolved corrector activities, whereas RNA viruses have not, suggests that replication with high error rates is a selected character that strongly favours viral adaptation to fast changing conditions. Quasi-species replicating for a long time in a near-constant environment in the absence of large population size fluctuations can present a low rate of fixation of mutations in the consensus sequence, despite the continuous occurrence of mutants that is characteristic of the underlying dynamics of the population. The infection of a new host constitutes a sudden change in the environment in which viral replication takes place, usually with the consequence of a drastic decrease in the average fitness of the virus population, which prevents further transmission.
     doi = 10.1007/978-3-540-35306-5_15

      id = cord-018437-yjvwa1ot
  author = Mitchell, Michael
   title = Taxonomy
    date = 2013-08-26
keywords = RNA; dna; genome; human; protein; virus
 summary = Classification is based on the genomic nucleic acid used by the virus (DNA or RNA), strandedness (single or double stranded), and method of replication. The nucleocapsids of some viruses are surrounded by envelopes composed of lipid bilayers and host- or viral-encoded proteins. The sequence of negative-sense ssRNA is complementary to the coding sequence for translation, so mRNA must be synthesized by RNA polymerase, typically carried within the virion, before translation into viral proteins. Among the families of viruses able to infect humans and other vertebrate hosts, there are many species that target and cause disease in the lung. The nucleocapsid is surrounded by an envelope derived from host-cell membrane and viral envelope proteins, including hepatitis B surface antigen. The genome of human parainfluenza viruses is ~15 kb in length with an organization and six reading frames (N, P, M, F, HN, L) typical of the Paramyxoviridae (Karron and Collins 2007).
     doi = 10.1007/978-3-642-40605-8_3

      id = cord-264746-gfn312aa
  author = Muse, Spencer
   title = GENOMICS AND BIOINFORMATICS
    date = 2012-03-29
keywords = RNA; dna; figure; gene; genome; sequence
 summary = The success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of DNA and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of DNA; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. Although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research.
     doi = 10.1016/b978-0-12-238662-6.50015-x

      id = cord-014461-2ubh9u8r
  author = Nelson, Oranmiyan W.
   title = Genome sequences published outside of Standards in Genomic Sciences, July - October 2012
    date = 2012-10-10
keywords = Complete; Draft; Genome; Strain; isolate; sequence
 summary = Complete Genome Sequence of Brucella abortus A13334, a New Strain Isolated from the Fetal Gastric Fluid of Dairy Cattle; Complete Genome Sequence of Brucella canis Strain HSK A52141, Isolated from the Blood of an Infected Dog; Complete Genome Sequence of Streptococcus salivarius PS4, a Strain Isolated from Human Milk; Complete Genome Sequences of Probiotic Strains Bifidobacterium animalis subsp.; Complete Genome Sequence of Corynebacterium pseudotuberculosis Strain 1/06-A, Isolated from a Horse in North America; Complete Genome Sequence of Bacteriophage BC-611 Specifically Infecting Enterococcus faecalis Strain NP-10011; Characterization and Complete Genome Sequence of Human Coronavirus NL63 Isolated in China; Complete Genome Sequence of a Novel Pararetrovirus Isolated from Soybean; Complete Genome Sequence of a Polyomavirus Isolated from Horses; Complete Genome Sequence of a Novel Porcine Sapelovirus Strain YC2011 Isolated from Piglets with Diarrhea; Draft Genome Sequence of Aspergillus oryzae Strain 3.042
     doi = 10.4056/sigs.3416907

      id = cord-016293-pyb00pt5
  author = Newell-McGloughlin, Martina
   title = The flowering of the age of Biotechnology 1990–2000
    date = 2006
keywords = FDA; Genome; NIH; RNA; U.S.; University; Venter; cell; disease; dna; gene; human; plant; sequence; technology
 summary = 
     doi = 10.1007/1-4020-5149-2_4

      id = cord-007923-j3jpqd7k
  author = O'Brien, Stephen J.
   title = Cats
    date = 2004-12-14
keywords = cat; genome
 summary = Wild cats dominate their habitat but require vast expanses to survive, which explains the tragic depredation such that every species of Felidae, except the domestic cat, is considered either endangered or threatened in the wild today by CITES, IUCN Red Book and other monitors of the world's most endangered species. Domestic cats and dogs enjoy more medical scrutiny than any species except humans. The cat offers the promise of a second carnivore species (in addition to the dog, which shares a common ancestor with cats dating back to approximately 60 million years ago) to improve human genome annotation, as well as to complement the biomedical and genomic discoveries that make the feline genome attractive. The conserved genome of the cat is retained in the other 36 Felidae species, as well as most of the 246 species of the Carnivora order, the only reshuffled exceptions occurring in the dog and bear families.
     doi = 10.1016/j.cub.2004.11.017

      id = cord-298136-mel9fxw8
  author = O'Malley, Maureen A.
   title = Whole-genome patenting
    date = 2005-05-10
keywords = dna; genome; patent
 summary = Gene patenting is now a familiar commercial practice, but there is little awareness that several patents claim ownership of the complete genome sequence of a prokaryote or virus. However, further analysis reveals that patent specifications describing whole-genome inventions use arguments that imply that genomes are qualitatively different from individual genes. This standard allows several sub-inventions to be linked together by a common "general inventive concept", but prevents unrelated inventions from succeeding as a single [...] If there are any qualitative differences between patents for whole genomes and those for DNA fragments, it seems likely that they will be found in the utility arguments, the most contested feature of recent gene patenting.
     doi = 10.1038/nrg1613

      id = cord-320005-i30t7cvr
  author = Pardo, A.
   title = The Human Genome and Advances in Medicine: Limits and Future Prospects
    date = 2004-03-31
keywords = dna; gene; genome; human
 summary = The HGP's initial objectives were fulfilled 2 years ahead of schedule, and, in addition to compiling a highly accurate sequence of the human genome which has been made freely available and accessible to everyone, the Consortium has developed a set of new technologies and has constructed genetic maps of the genomes of various organisms. Around the same time, the public consortium known as the Human Genome Project was formed, and this organization announced a 15-year plan (from 1990 to 2005) with the following objectives: a) to determine the complete nucleotide sequence of human DNA and identify all the genes in human DNA (estimated to number between 50 000 and 100 000); b) to build physical and genetic maps; c) to analyze the genomes of selected organisms used in research as model systems (e.g., the mouse); d) to develop new technologies; and e) to analyze and debate the ethical and legal implications for individuals and for society as a whole.
     doi = 10.1016/s1579-2129(06)70078-7

      id = cord-304607-td0776wj
  author = Paszkiewicz, Konrad H.
   title = Omics, Bioinformatics, and Infectious Disease Research
    date = 2010-12-24
keywords = gene; genome; protein; sequence
 summary = This chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics, and briefly covers metagenomics, which finds use in infectious disease research, as well as the random sequencing of genomes from a variety of organisms. Bioinformatics plays a key role at several steps in genomics, comparative genomics, and functional genomics: sequence alignment, assembly, identification of single nucleotide polymorphisms (SNP), gene prediction, quantitative analysis of transcription data, etc. The term "metagenomics" was originally used to describe the sequencing of genomes of uncultured microorganisms in order to explore their abilities to produce natural products (Handelsman et al., 1998; Rondon et al., 2000) and subsequently resulted in novel insights into the ecology and evolution of microorganisms on a scale not imagined possible before (see Cardenas and Tiedje, 2008; Hugenholtz and Tyson, 2008 for an overview). However, metagenomics now finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms from, for example, patient material that could lead to the identification of the cause of disease.
     doi = 10.1016/b978-0-12-384890-1.00018-2

      id = cord-352619-s2x53grh
  author = Payne, Natalie
   title = Novel Circoviruses Detected in Feces of Sonoran Felids
    date = 2020-09-15
keywords = Rep; dna; genome; virus
 summary = Genomes from several families of circular Rep-encoding single-stranded DNA viruses (CRESS-DNA viruses) are part of the phylum Cressdnaviricota [22] and have been identified in fecal samples of other mammals, including domestic cats [23, 24], bobcats, African lions [25], capybaras [26], and Tasmanian devils [27]. Here we used a metagenomic approach to identify novel circoviruses in the feces of two species of Sonoran felids, the puma and bobcat; although not endangered, knowledge of viral threats facing these species could help prevent future population decline, as well as indicate potential threats to the endangered ocelot and jaguar. Based on the species-demarcation threshold for circoviruses, which is 80% genome-wide identity [28], both of these belong to a new species which we refer to as Sonfela (derived from Sonoran felid associated) circovirus 1. As the viral genomes were derived from scat samples, the circoviruses could have infected the bobcat prey species or the felids themselves or be environmentally derived.
     doi = 10.3390/v12091027

      id = cord-281959-g4sjyytr
  author = Phillippy, Adam M
   title = Efficient oligonucleotide probe selection for pan-genomic tiling arrays
    date = 2009-09-16
keywords = array; genome; pan; probe
 summary = The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. In order to both characterize new strains based on genetic content, and detect polymorphism at a higher resolution in small RNAs (sRNAs) and intergenic sequences, the array was required to cover all pan-genomic sequences with a high density of probes. To see the similarities between the Pan-Tiling and Minimum Hitting Set problems, let the sequence G be a concatenation of all the genomes from a species, and let W = {w_1, w_2, ..., w_m} be the set of m intervals that results from segmenting G into non-overlapping, end-to-end, length-l windows.
     doi = 10.1186/1471-2105-10-293
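
The window segmentation described in the summary above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `segment_windows`, the half-open interval representation, and the decision to keep a shorter final window are all assumptions made here.

```python
def segment_windows(genomes, l):
    """Concatenate genomes into one sequence G and cut it into
    non-overlapping, end-to-end windows of length l.

    Returns (G, W), where W is a list of half-open (start, end) intervals.
    Assumption: a shorter final window is kept when len(G) is not a
    multiple of l; the summary does not say how the tail is handled.
    """
    if l <= 0:
        raise ValueError("window length l must be positive")
    G = "".join(genomes)
    W = [(start, min(start + l, len(G))) for start in range(0, len(G), l)]
    return G, W


# Two toy "genomes" (hypothetical data) segmented with l = 4.
G, W = segment_windows(["ACGTAC", "GGTT"], 4)
print(G)  # ACGTACGGTT
print(W)  # [(0, 4), (4, 8), (8, 10)]
```

Each interval in W plays the role of one set that a selected probe must "hit" in the Minimum Hitting Set analogy.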

      id = cord-297669-22fctxk4
  author = Proudfoot, Chris
   title = Genome editing for disease resistance in pigs and chickens
    date = 2019-06-25
keywords = CD163; disease; genome; pig
 summary = The virus was thought to attach to CD169 to be taken up into the cells; however, genome-edited pigs lacking CD169 were not resistant to PRRSV infection (Prather et al., 2013). Chicken somatic cell lines have been edited to introduce changes to this gene, conferring resistance to avian leucosis virus in vitro (Lee et al., 2017). However, as the example for avian influenza shows, host genes play an important role in other steps of the pathogen replication cycle and also provide editing targets for disease resilience or resistance. Genome editing allows integration of the disease-resistance trait into a wider selection of pigs, ensuring genetic variability and maintenance of desirable traits. (D) Resistance genes may be identified in laboratory research but not in highly bred lines, making integration into those productive animals only possible using genome editing. She employs genome editing and genetic selection to generate animals genetically resistant to viral disease.
     doi = 10.1093/af/vfz013

      id = cord-275683-1qj9ri18
  author = Roux, Simon
   title = Metagenomics in Virology
    date = 2019-06-12
keywords = RNA; genome; viral; virus
 summary = Against the background of an extensive viral diversity revealed by metagenomics across many environments, new sequence assembly approaches that reconstruct complete genome sequences from metagenomes have recently revealed surprisingly cosmopolitan viruses in specific ecological niches. However, these techniques can only detect previously known viruses, and often require [...] Box 1: Use of complementary methods to target different types of viruses. A number of approaches have been developed to specifically select and survey the genetic material contained by virus particles in a given sample. Virus sequences obtained from "bulk" metagenomes will typically reflect viruses infecting their host cell at the time of sampling, either actively replicating or not, while viromes enable a deeper and more focused exploration of the virus diversity in a specific site or sample. With viral metagenomics being applied to a larger set of samples and environments, and with bioinformatic analyses including genome assembly and interpretation constantly improving, novel groups of dominant and widespread viruses may thus be progressively revealed across many environments.
     doi = 10.1016/b978-0-12-809633-8.20957-6

      id = cord-015850-ef6svn8f
  author = Saitou, Naruya
   title = Eukaryote Genomes
    date = 2013-08-22
keywords = RNA; dna; gene; genome; sequence
 summary = General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Most of the protein coding genes of melon mitochondrial DNAs are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81].
     doi = 10.1007/978-1-4471-5304-7_8

      id = cord-268795-tjmx6msm
  author = Sardar, Rahila
   title = Comparative analyses of SAR-CoV2 genomes from different geographical locations and other coronavirus family genomes reveals unique features potentially consequential to host-virus interaction and pathogenesis
    date = 2020-03-21
keywords = SARS; genome
 summary = We have performed an integrated sequence-based analysis of SARS-CoV2 genomes from different geographical locations in order to identify its unique features absent in SARS-CoV and other related coronavirus family genomes, conferring unique infection, facilitation of transmission, virulence and immunogenic features to the virus. Our analysis reveals nine host miRNAs which can potentially target SARS-CoV2 genes. Our analysis shows unique host-miRNAs targeting SARS-CoV2 virus genes. The CELLO2GO (7) server was used to infer biological function for each protein of SARS-CoV2 genome with their localization prediction. Assembled SARS-CoV2 genome sequences in FASTA format from India, USA, China, Italy and Nepal were used for coronavirus typing tool analysis. For the phylogenetic analysis, we compared the sequences of 6 SARS-CoV2 isolates from different countries, namely Wuhan, India, Italy, USA and Nepal, along with other coronavirus species (Figure 1).
     doi = 10.1101/2020.03.21.001586

      id = cord-277687-u3q36o3e
  author = Shean, Ryan C.
   title = VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank
    date = 2019-01-23
keywords = NCBI; RNA; genome
 summary = In order to accept submitted viral genomic data, NCBI GenBank requires 1) viral sequence complete with at least one protein annotation, 2) author/depositor metadata, and 3) viral sequence metadata, such as strain, collection date, collection location, and coverage. VAPiD handles batch submissions of multiple viruses of different types without prior knowledge of the viral species, correctly annotates RNA editing and ribosomal slippage, performs spellchecking on annotations, handles batch or individual submission of metadata, runs with a simple one-line command, and creates annotated viral sequence files for GenBank submission. This first example is the task that the authors originally wrote VAPiD for: annotating large numbers of genomes from different viral species, which mirrors the type of data that many clinical and public health laboratories may encounter.
     doi = 10.1186/s12859-019-2606-y

      id = cord-314594-xvc8hvpq
  author = Singh, Roshan Kumar
   title = Breeding and biotechnological interventions for trait improvement: status and prospects
    date = 2020-09-18
keywords = QTL; crop; genetic; genome; trait
 summary = Advances in high-throughput genomics strategies at a whole-genome level, including genetic association mapping, map-based cloning, genomic selection, and speed breeding, have also proven useful in improving genetic gains for expediting the crop improvement processes. Through genome-wide association study (GWAS), 60 loci significantly associated with agronomic traits such as oil content, seed quality, stress tolerance were identified, which may prove to be a valuable resource for genetic improvement (Lu et al. Marker-assisted backcrossing (MABC) is the introgression of a genomic region (QTL or locus or gene) contributing the desired trait from a donor genotype into a breeding line or elite cultivar without linkage drag through backcrossing after multiple generations. As the name suggests, CRISPR/Cas9 consists of two components: a single-guide [...] Application of functional and comparative genomics in marker-assisted breeding and biotechnological approaches for crop improvement. The candidate gene(s) identified from functional genomic studies can be introduced through genetic engineering or targeted for modification through genome editing technology in crop species for improved agronomic traits.
     doi = 10.1007/s00425-020-03465-4

      id = cord-016588-f8uvhstb
  author = Sintchenko, Vitali
   title = Informatics for Infectious Disease Research and Control
    date = 2009-10-03
keywords = dna; gene; genome; genomic; pathogen
 summary = The goal of infectious disease informatics is to optimize the clinical and public health management of infectious diseases through improvements in the development and use of antimicrobials, the design of more effective vaccines, the identification of biomarkers for life-threatening infections, a better understanding of host-pathogen interactions, and biosurveillance and clinical decision support. "New Age" infectious disease informatics rests on advances in microbial genomics, the sequencing and comparative study of the genomes of pathogens, and proteomics or the identification and characterization of their protein-related properties and reconstruction of metabolic and regulatory pathways (Bansal 2005). The figure was produced using Artemis software (The Wellcome Trust Sanger Institute, UK). [...] evidence-based gene calling or translating alignments of the DNA sequence to known proteins; and (3) aligning cDNAs from the same or related species.
     doi = 10.1007/978-1-4419-1327-2_1

      id = cord-269124-oreg7rnj
  author = Spyrou, Maria A.
   title = Ancient pathogen genomics as an emerging tool for infectious disease research
    date = 2019-04-05
keywords = Europe; Fig; Yersinia; ancient; dna; genome; pathogen; pestis
 summary = Examples of tools that have shown their effectiveness with ancient metagenomic DNA include the widely used Basic Local Alignment Search Tool (BLAST) [68]; the MEGAN Alignment Tool (MALT) [41], which involves a taxonomic binning algorithm that can use whole genome databases (such as the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database [69]); Metagenomic Phylogenetic Analysis (MetaPhlAn) [70], which is also integrated into the metagenomic pipeline MetaBIT [71] and uses thousands (or millions) of marker genes for the distinction of specific microbial clades; or Kraken [72], an alignment-free sequence classifier that is based on k-mer matching of a query to a constructed database. Similar limitations can arise when the evolutionary history of a microorganism is vastly affected by recombination, as observed for HBV [44, 53], although HBV molecular dating was recently attempted using a different genomic data set and suggested that the currently explored diversity of Old and New World primate lineages (including all human genotypes) may have emerged within the last 20,000 years [43].
     doi = 10.1038/s41576-019-0119-1

      id = cord-346335-el45v0a5
  author = Tan, H.S.
   title = Fourier spectral density of the coronavirus genome
    date = 2020-08-11
keywords = SARS; Spike; genome
 summary = We uncover an interesting, new scaling law for the coronavirus genome: the complexity of the genome scales linearly with the power-law exponent that characterizes the enveloping curve of the low-frequency domain of the spectral density. An example of a seminal paper in this subject is that of Voss in [2] where the author found that the spectral density of the genome of many different species follows a power law of the form 1/k^β in the low-frequency domain, with the exponent β potentially related to the organism's evolutionary category. We develop a few models to characterize the typical spectrum, and in the process stumble upon a linear scaling law between a measure of the complexity of each genome and the power-law exponent that describes the enveloping curve of the low-frequency domain.
     doi = 10.1101/2020.06.30.180034
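
As a rough illustration of the 1/k^β form described in the summary above, the sketch below computes a power spectrum from per-base indicator sequences and estimates β by a straight-line fit in log-log space. It is a minimal sketch under assumptions: the indicator-sequence construction is the commonly used approach for DNA spectra, the fit is a plain least-squares regression on the raw spectrum (the paper fits an enveloping curve, which is not reproduced here), and the function names are hypothetical.

```python
import cmath
import math

BASES = "ACGT"


def spectral_density(seq):
    """Sum over the four bases of |DFT of the base's 0/1 indicator|^2.

    Returns values for frequency indices k = 0 .. len(seq) // 2.
    Uses a direct O(N^2) transform, adequate only for short sequences.
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(n // 2 + 1):
        total = 0.0
        for base in BASES:
            coeff = sum(
                cmath.exp(-2j * math.pi * k * pos / n)
                for pos, nt in enumerate(seq)
                if nt == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def fit_beta(spectrum, k_max):
    """Estimate beta in S(k) ~ 1/k^beta from k = 1 .. k_max.

    Ordinary least squares on (log k, log S); zero-power bins are skipped.
    """
    points = [
        (math.log(k), math.log(spectrum[k]))
        for k in range(1, k_max + 1)
        if spectrum[k] > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-zero spectral values")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx  # slope is -beta
```

For a genome-length sequence, an FFT routine would replace the direct transform; the choice of `k_max` defines what counts as the low-frequency domain and is left to the reader.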

      id = cord-265581-pbv8mjfc
  author = Tong, Yaojun
   title = An aurora of natural products-based drug discovery is coming
    date = 2020-06-06
keywords = genome; natural; product
 summary = With recent scientific advances combining metabolic sciences and technology, multi-omics, big data, combinatorial biosynthesis, synthetic biology, genome editing technology (such as CRISPR), artificial intelligence (AI), and 3D printing, the "high-hanging fruit" is becoming more and more accessible with reduced costs. The incredible rate of development in genome sequencing, modern metabolic engineering, synthetic biology, advanced genome editing, big data, artificial intelligence (AI), and 3D printing together with the growing microbial strain collections enable us to access the previously inaccessible natural products. It starts with genome mining (the analysis of high quality whole genome information), which requires bioinformatics, big data, and even AI; to pathway cloning (refactoring), expression and fermentation, which needs design-build-test-learn (DBTL) cycle-based metabolic engineering; to the target natural product identification, which requires modern chemical analysis; and to later compound modification and clinical studies, which needs biochemistry and cell biology.
     doi = 10.1016/j.synbio.2020.05.003

      id = cord-302047-vv5gpldi
  author = Willemsen, Anouk
   title = On the stability of sequences inserted into viral genomes
    date = 2019-11-14
keywords = Gene; RNA; genome; insert; stability; virus
 summary = Viruses are widely used as vectors for heterologous gene expression in cultured cells or natural hosts, and therefore a large number of viruses with exogenous sequences inserted into their genomes have been engineered. Conclusions of this review, for all viruses: (1) inserted sequences are often unstable and rapidly lost upon passaging of an engineered virus; (2) the position at which a sequence is integrated in the genome can be important for stability; (3) sequence stability is not an intrinsic property of genomes because demographic parameters, such as population size and bottleneck size, can have important effects on sequence stability; (4) the multiplicity of cellular infection affects sequence stability, and can in some cases directly affect whether there is selection for deletion variants; (5) deletions are not the only class of mutations that can reduce the cost of inserted sequences, although they are the most common.
     doi = 10.1093/ve/vez045

      id = cord-318392-r9bbomvk
  author = Woo, Patrick CY
   title = Coronavirus HKU15 in respiratory tract of pigs and first discovery of coronavirus quasispecies in 5′-untranslated region
    date = 2017-06-21
keywords = Coronavirus; HKU15; PCR; genome
 summary = The genomes of two Coronavirus HKU15 strains detected in the nasopharyngeal samples of two different pigs were sequenced following our previous publications 26, 27 with modifications. Divergence times for the Coronavirus HKU15 strains were calculated based on the complete genome sequence data, utilizing the Bayesian Markov chain Monte Carlo method using BEAST 1.8.0 33 with the substitution model GTR (general time-reversible model)+G (gamma-distributed rate variation)+I (estimated proportion of invariable sites), a strict molecular clock, and a constant coalescent. In one (S579N) of the two Coronavirus HKU15 genomes that we sequenced in this study, variant sites were observed at four positions; two of them were due to nucleotide substitutions, and the other two were results of indels at mononucleotide polymeric regions (189th and 376th bases).
     doi = 10.1038/emi.2017.37

      id = cord-348515-bqqyly23
  author = Zhao, Suhui
   title = Re-emergent Human Adenovirus Genome Type 7d Caused an Acute Respiratory Disease Outbreak in Southern China After a Twenty-one Year Absence
    date = 2014-12-08
keywords = ARD; China; DG01_2011; REA; genome
 summary = Recombination analysis reveals this genome differs from the 1950s-era prototype and vaccine strains by a lateral gene transfer, substituting the coding region for the L1 52/55 kDa DNA packaging protein from HAdV-16. Thorough characterization of these pathogens is evidenced by the availability of two genome sequences (JF800905 and JX625134), both of which are further identified as the HAdV-7d genome type in this report, and shown to be nearly identical to this report of an isolate from a 2011 ARD outbreak in Guangdong Province (strain DG01_2011) by comparative genomics and, in particular, in silico REA pattern analysis, as presented in Figure 2.
     doi = 10.1038/srep07365

      id = cord-000902-ew8orn0z
  author = Zhao, Xiangyan
   title = Coevolution between simple sequence repeats (SSRs) and virus genome size
    date = 2012-08-30
keywords = additional; genome; ssr; virus
 summary = The results showed that simple sequence repeats (SSRs) are strongly, positively and significantly correlated with genome size. Relative abundance and relative density were examined to make SSR comparisons parallel among genomes of differently sized species; principal component analysis (PCA) was designed to investigate which repeat class(es) made a greater contribution to the variance among virus species as well as the relationships between repeat classes. Therefore, the 257 genome sequences were selected as samples for the analysis of the relationship between SSR distribution and genome size at the level of the whole virus. We surveyed the distribution of different SSR classes in virus genomes to investigate the relationship between repeat classes (mono-, di-, tri-, tetra-, penta- and hexa-) and genome sequence length.
     doi = 10.1186/1471-2164-13-435
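
The per-class SSR survey described in the summary above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the minimum repeat counts in `MIN_REPEATS` are assumed values, and relative abundance and relative density are taken here to mean SSR count per kb and SSR base pairs per kb, which is a common convention but is not stated in the summary.

```python
import re

# Assumed minimum number of repeat units per motif length (mono- to hexa-).
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif == motif[:d] * (n // d):
            return False
    return True


def find_ssrs(seq):
    """Return (motif_length, motif, start, total_length) for each SSR."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        pos = 0
        while pos < len(seq):
            match = pattern.match(seq, pos)
            if match and is_primitive(match.group(1)):
                hits.append((size, match.group(1), pos, len(match.group(0))))
                pos = match.end()  # continue after this repeat tract
            else:
                pos += 1  # try the next start position
    return hits


def summarize(seq):
    """Per motif length: (count, count per kb, SSR base pairs per kb)."""
    kb = len(seq) / 1000
    summary = {}
    for size in MIN_REPEATS:
        lengths = [h[3] for h in find_ssrs(seq) if h[0] == size]
        summary[size] = (len(lengths), len(lengths) / kb, sum(lengths) / kb)
    return summary


# Toy sequence containing one mono-, one di- and one trinucleotide repeat.
example = "AAAAAAACGATATATGCACGACGACG"
for size, (count, abundance, density) in summarize(example).items():
    print(size, count, round(abundance, 2), round(density, 2))
```

Normalizing by sequence length is what makes the counts comparable across genomes of different sizes, which is the purpose the summary gives for the two relative measures.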

      id = cord-265329-bsypo08l
  author = van Dorp, Lucy
   title = Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
    date = 2020-05-05
keywords = CoV-2; SARS; figure; genome
 summary = Three sites in Orf1ab in the regions encoding Nsp6, Nsp11, Nsp13, and one in the Spike protein are characterised by a particularly large number of recurrent mutations (>15 events) which may signpost convergent evolution and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host. The extraordinary availability of genomic data during the COVID-19 pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing SARS-CoV-2 assemblies (Table S1) and the proliferation of close to real time data visualisation and analysis tools including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk). In this work we use this data to analyse the genomic diversity that has emerged in the global population of SARS-CoV-2 since the beginning of the COVID-19 pandemic, based on a download of 7710 assemblies. The genomic diversity of the global SARS-CoV-2 population being recapitulated in multiple countries points to extensive worldwide transmission of COVID-19, likely from extremely early on in the pandemic.
     doi = 10.1016/j.meegid.2020.104351